[langev] Gerhard Jager

Global-scale phylogenetic linguistic inference from lexical resources

arXiv, 2018

Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference.

arXiv:1802.06079v1 [cs.CL] 17 Feb 2018

Background & Summary

The cultural transmission of natural languages, with its patterns of near-faithful replication from generation to generation and the diversification resulting from population splits, is known to display striking similarities to biological evolution [1, 2]. The mathematical tools developed in computational biology to recover evolutionary history (phylogenetic inference) play an increasingly important role in the study of the diversity and history of human languages [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].

The main bottleneck for this research program is the still limited availability of suitable data. Most extant studies rely on manually curated collections of expert judgments pertaining to the cognacy of core vocabulary items or the grammatical classification of languages. Collecting such data is highly labor intensive. Therefore sizeable collections currently exist only for a relatively small number of well-studied language families [8, 11, 15, 16, 17, 18].

Basing phylogenetic inference on expert judgments, especially judgments regarding the cognacy between words, also raises methodological concerns. The experts making those judgments are necessarily historical linguists with some prior information about the genetic relationships between the languages involved. In fact, it is virtually impossible to pass a judgment about cognacy without forming a hypothesis about such relations. In this way, data are enriched with prior assumptions of human experts in a way that is hard to control or to precisely replicate.

Modern machine learning techniques provide a way to greatly expand the empirical base of phylogenetic linguistics while avoiding the above-mentioned methodological problem. The Automated Similarity Judgment Program (ASJP) [19] database contains 40-item core vocabulary lists from more than 7,000 languages and dialects across the globe, covering about 75% of the extant linguistic diversity. All data are in phonetic transcription with little additional annotation.¹ It is, at the current time, the most comprehensive collection of word lists available.

Phylogenetic inference techniques come in two flavors: distance-based and character-based methods. Distance-based methods require as input a matrix of pairwise distances between taxa. Character-based methods operate on a character matrix, i.e. a classification of the taxa under consideration according to a list of discrete, finite-valued characters. While some distance-based methods are computationally highly efficient, character-based methods usually provide more precise results and afford more fine-grained analyses.

The literature contains proposals to extract both pairwise distance matrices and character data from phonetically transcribed word lists [20, 21, 22]. In this paper we apply those methods to the ASJP data and make both a distance matrix and a character matrix for 6,892 languages and dialects² derived this way available to the community. We also demonstrate the suitability of the results for phylogenetic inference. While both the raw data and the algorithmic methods used in this study are freely and publicly available, the computational effort required was considerable (about ten days of computing time on a 160-core parallel server). Therefore the resulting resource is worth publishing in its own right.

¹ The only expert judgments contained in the ASJP data are rather unsystematic manual identifications of loan words. This information is ignored in the present study.

² These are all languages in ASJP v. 17 except reconstructed, artificial, pidgin and creole languages.
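
The abstract above mentions Pointwise Mutual Information (PMI) scores between sound classes estimated from aligned word pairs. As a rough illustration only, the following Python sketch computes PMI from a list of already-aligned symbol pairs. The function name `pmi_scores`, the toy input, and the simple relative-frequency estimator are assumptions made for this example; the paper's actual procedure (weighted sequence alignment combined with general-purpose optimization) is more involved and is not reproduced here.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Estimate PMI for unordered pairs of sound-class symbols.

    aligned_pairs: iterable of (a, b) tuples, one per aligned position.
    Returns a dict mapping sorted (a, b) tuples to PMI values.
    """
    pair_counts = Counter()
    symbol_counts = Counter()
    for a, b in aligned_pairs:
        # Alignments are treated as unordered: (a, b) equals (b, a).
        pair_counts[tuple(sorted((a, b)))] += 1
        symbol_counts[a] += 1
        symbol_counts[b] += 1

    n_pairs = sum(pair_counts.values())
    n_symbols = sum(symbol_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / n_pairs
        q_a = symbol_counts[a] / n_symbols
        q_b = symbol_counts[b] / n_symbols
        # Chance probability of an unordered pair under independence:
        # q_a * q_b for identical symbols, 2 * q_a * q_b otherwise.
        expected = q_a * q_b if a == b else 2 * q_a * q_b
        scores[(a, b)] = math.log(observed / expected)
    return scores


# Toy data (hypothetical): matches are frequent, mismatches are rare.
toy_pairs = [("p", "p"), ("p", "p"), ("t", "t"), ("t", "t"), ("p", "t")]
toy_scores = pmi_scores(toy_pairs)
```

In this toy input, identical symbols are aligned more often than chance predicts and so receive positive scores, while the rare `p`/`t` alignment receives a negative score.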

Support for linguistic macrofamilies from weighted sequence alignment

G Jager

PNAS 112(41):12752-7, 2015

Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily.

Cited by 10 in Semantic Scholar

Studying Language Change Using Price Equation and Polya-urn Dynamics

T Gong, L Shuai, M Tamariz, G Jager

PLoS ONE 7(3):e33171, 2012

Language change takes place primarily via diffusion of linguistic variants in a population of individuals. Identifying selective pressures on this process is important not only to construe and predict changes, but also to inform theories of evolutionary dynamics of socio-cultural factors. In this paper, we advocate the Price equation from evolutionary biology and the Polya-urn dynamics from contagion studies as efficient ways to discover selective pressures. Using the Price equation to process the simulation results of a computer model that follows the Polya-urn dynamics, we analyze theoretically a variety of factors that could affect language change, including variant prestige, transmission error, individual influence and preference, and social structure. Among these factors, variant prestige is identified as the sole selective pressure, whereas others help modulate the degree of diffusion only if variant prestige is involved. This multidisciplinary study discerns the primary and complementary roles of linguistic, individual learning, and socio-cultural factors in language change, and offers insight into empirical studies of language change.
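
For readers unfamiliar with the Price equation mentioned in this abstract, the sketch below checks its standard form numerically: the change in the mean of a trait z equals a selection term Cov(w, z)/mean(w) plus a transmission term E[w(z' - z)]/mean(w). The function names and the toy population are assumptions made for illustration; this is not the simulation model used in the paper.

```python
def price_decomposition(z, w, z_offspring):
    """Return (selection, transmission) terms of the Price equation.

    z: trait value of each parent (e.g. 1 if it uses variant A, else 0)
    w: fitness of each parent (number of copies it leaves)
    z_offspring: mean trait value among the copies of each parent
    """
    n = len(z)
    w_bar = sum(w) / n
    z_bar = sum(z) / n
    # Population covariance between fitness and trait value.
    cov_wz = sum((wi - w_bar) * (zi - z_bar) for wi, zi in zip(w, z)) / n
    selection = cov_wz / w_bar
    # Fitness-weighted average of the parent-to-offspring trait change.
    transmission = (
        sum(wi * (zo - zi) for wi, zi, zo in zip(w, z, z_offspring)) / n / w_bar
    )
    return selection, transmission


def direct_change(z, w, z_offspring):
    """Change in mean trait value computed directly from the offspring."""
    z_bar = sum(z) / len(z)
    z_bar_next = sum(wi * zo for wi, zo in zip(w, z_offspring)) / sum(w)
    return z_bar_next - z_bar


# Hypothetical population: two users of variant A (z = 1), two of variant B.
z = [1, 1, 0, 0]
w = [2, 1, 1, 0]          # variant A users are copied more often
z_offspring = [1, 1, 0, 0]  # faithful transmission, no copying error

selection, transmission = price_decomposition(z, w, z_offspring)
total = direct_change(z, w, z_offspring)
```

With faithful transmission the second term vanishes, so the whole change in the frequency of variant A is attributed to the covariance between variant and fitness.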

Power laws and other heavy-tailed distributions in linguistic typology

G Jager

Advances in Complex Systems 15(03n04):1150019, 2012

The paper investigates the quantitative distribution of language types across languages of the world. The studies are based on three large-scale typological data bases: The World Color Survey, the Automated Similarity Judgment Project data base, and the World Atlas of Language Structures. The main finding is that a surprisingly large and varied collection of linguistic typologies show power law behavior. The bulk of the paper deals with the statistical validation of these findings.
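
As a small illustration of what fitting a power law involves, the sketch below implements the standard maximum-likelihood estimator for the exponent of a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)). The abstract does not say which validation procedure the paper uses, so this estimator, the function names, and the synthetic data are assumptions chosen for the example.

```python
import math
import random


def power_law_alpha(samples, x_min):
    """Maximum-likelihood exponent of a continuous power law p(x) ~ x**(-alpha).

    Only samples with x >= x_min are used.
    """
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_power_law(alpha, x_min, n, rng):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic check: recover a known exponent from simulated data.
rng = random.Random(0)
data = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = power_law_alpha(data, x_min=1.0)
```

A point estimate like this is only the first step; deciding whether a power law is a plausible model at all requires a separate goodness-of-fit analysis.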

Cited by 5 in Semantic Scholar

Applications of the price equation to language evolution

G Jager

Proceedings of the 8th International Conference on the Evolution of Language, pages 192-197, 2010

In the early seventies, the bio-mathematician George Price developed a simple and concise mathematical description of evolutionary processes that abstracts away from the specific properties of biological evolution. In the talk I will argue that Price's framework is well-suited to model various aspects of the cultural evolution of language. The first part of the talk describes Price's approach in some detail. In the second part, case studies about its application to language evolution are presented.

Evolutionary stability conditions for signaling games with costly signals

G Jager

Journal of Theoretical Biology 253(1):131-141, 2008

The paper investigates the class of signaling games with the following properties: (a) the interests of sender and receiver coincide, (b) different signals incur differential costs, and (c) different events (meanings/types) have different probabilities. Necessary and sufficient conditions are presented for a profile to be evolutionarily stable and neutrally stable, and for a set of profiles to be an evolutionarily stable set.

The main finding is that a profile belongs to some evolutionarily stable set if and only if a maximal number of events can be reliably communicated. Furthermore, it is shown that under the replicator dynamics, a set of states with a positive measure is attracted to "sub-optimal" equilibria that do not belong to any asymptotically stable set.
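
To give a feel for how the replicator dynamics can carry a population to a sub-optimal equilibrium, here is a minimal Python sketch using Euler steps on a symmetric 2x2 coordination game. This is a deliberately simplified stand-in: the game, the payoff values, the step size, and the function names are assumptions for illustration, and the costly signaling games analyzed in the paper are asymmetric and considerably richer.

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of dx_i/dt = x_i * (f_i(x) - mean fitness)."""
    n = len(x)
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(xi * fi for xi, fi in zip(x, fitness))
    return [xi + dt * xi * (fi - mean_fitness) for xi, fi in zip(x, fitness)]


def simulate(x0, payoff, steps=5000, dt=0.01):
    """Iterate the replicator dynamics from the initial state x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Hypothetical coordination game: both pure strategies are equilibria,
# but coordinating on strategy 0 pays 2 while strategy 1 pays only 1.
payoff = [[2.0, 0.0],
          [0.0, 1.0]]

from_low_share = simulate([0.2, 0.8], payoff)   # strategy 0 starts rare
from_even_split = simulate([0.5, 0.5], payoff)  # strategy 0 starts common
```

Starting with a 20% share of the better strategy, the population settles on the worse convention, while an even split leads to the better one. The basin of the inferior outcome has positive measure, which is the qualitative phenomenon the abstract refers to.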

Cited by 13 in Semantic Scholar

Applications of game theory in linguistics

G Jager

Language and Linguistics Compass 2(3):406-421, 2008

The article gives a brief overview of the budding field of game-theoretic linguistics, focusing on game-theoretic pragmatics on the one hand and on the use of evolutionary game theory to model cultural language evolution on the other. Two ...

Cited by 15 in Semantic Scholar

Evolutionary game theory and typology: A case study

G Jager

Language 83(1):74-109, 2007

This article deals with the typology of the case marking of semantic core roles. The competing economy considerations of hearer (disambiguation) and speaker (minimal effort) are formalized in terms of evolutionary game theory. It is shown that the case-marking patterns that are attested in the languages of the world are those that are evolutionarily stable for different relative weightings of speaker economy and hearer economy, given the statistical patterns of language use that were extracted from corpora of naturally occurring conversations.

Language structure: psychological and social constraints

G Jager, R van Rooij

Synthese 159(1):99-130, 2007

In this article we discuss the notion of a linguistic universal, and possible sources of such invariant properties of natural languages. In the first part, we explore the conceptual issues that arise. In the second part of the paper, we focus on the explanatory potential of horizontal evolution. We particularly focus on two case studies, concerning Zipf's Law and universal properties of color terms, respectively. We show how computer simulations can be employed to study the large-scale, emergent consequences of psychologically and sociologically motivated assumptions about the working of horizontal language transmission.

Cited by 16 in Semantic Scholar

Convex meanings and evolutionary stability

G Jager

Proceedings of the 6th International Conference on the Evolution of Language, pages 139-144, 2006

Gardenfors (2000) argues that natural denotations of natural language predicates are convex regions in a conceptual space. Using techniques from evolutionary game theory, the paper shows that this convexity criterion is a consequence of the evolutionary dynamics of language use.

Simulating language change with Functional OT

G Jager

Proceedings of Language Evolution and Computation Workshop/Course at ESSLLI, pages 52-61, 2003

The research reported here is a reaction to recent work by Judith Aissen on the typology of case marking systems within Optimality Theory (OT). Aissen (2000) explains certain linguistic universals by assuming universal sub-hierarchies of OT constraints. I found this ...

Evolutionary Game Theory and Linguistic Typology: A Case Study

G Jager

Proceedings of the 14th Amsterdam Colloquium, 2003

The paper deals with the typology of the case marking of semantic core roles. The competing economy considerations of hearer (disambiguation) and speaker (minimal effort) are formalized in terms of evolutionary game theory. It will be shown that the case marking ...

Cited by 29 in Semantic Scholar

Language Evolution and Computation Bibliography