Language Evolution and Computation Bibliography

Gerhard Jäger
2018
Global-scale phylogenetic linguistic inference from lexical resources
arXiv, 2018
Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference.

Background & Summary

The cultural transmission of natural languages, with its patterns of near-faithful replication from generation to generation and the diversification resulting from population splits, is known to display striking similarities to biological evolution [1, 2]. The mathematical tools for recovering evolutionary history developed in computational biology (phylogenetic inference) play an increasingly important role in the study of the diversity and history of human languages [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].

The main bottleneck for this research program is the still limited availability of suitable data. Most extant studies rely on manually curated collections of expert judgments pertaining to the cognacy of core vocabulary items or the grammatical classification of languages. Collecting such data is highly labor intensive. Therefore sizeable collections currently exist only for a relatively small number of well-studied language families [8, 11, 15, 16, 17, 18].

Basing phylogenetic inference on expert judgments, especially judgments regarding the cognacy between words, also raises methodological concerns. The experts making those judgments are necessarily historical linguists with some prior information about the genetic relationships between the languages involved. In fact, it is virtually impossible to pass a judgment about cognacy without forming a hypothesis about such relations. In this way, data are enriched with prior assumptions of human experts in a way that is hard to control or to replicate precisely.

Modern machine learning techniques provide a way to greatly expand the empirical base of phylogenetic linguistics while avoiding the above-mentioned methodological problem. The Automated Similarity Judgment Program (ASJP) [19] database contains 40-item core vocabulary lists from more than 7,000 languages and dialects across the globe, covering about 75% of the extant linguistic diversity. All data are in phonetic transcription with little additional annotation.¹ It is, at the current time, the most comprehensive collection of word lists available.

Phylogenetic inference techniques come in two flavors: distance-based and character-based methods. Distance-based methods require as input a matrix of pairwise distances between taxa. Character-based methods operate on a character matrix, i.e. a classification of the taxa under consideration according to a list of discrete, finite-valued characters. While some distance-based methods are computationally highly efficient, character-based methods usually provide more precise results and afford more fine-grained analyses.

The literature contains proposals to extract both pairwise distance matrices and character data from phonetically transcribed word lists [20, 21, 22]. In this paper we apply those methods to the ASJP data and make both a distance matrix and a character matrix for 6,892 languages and dialects² derived this way available to the community. We also demonstrate the suitability of the results for phylogenetic inference. While both the raw data and the algorithmic methods used in this study are freely and publicly available, the computational effort required was considerable (about ten days of computing time on a 160-core parallel server). Therefore the resulting resource is worth publishing in its own right.

¹ The only expert judgments contained in the ASJP data are rather unsystematic manual identifications of loan words. This information is ignored in the present study.
² These are all languages in ASJP v. 17 except reconstructed, artificial, pidgin and creole languages.
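
The difference between the two input types named above (a pairwise distance matrix versus a matrix of binary characters) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three languages, the cognate-class labels, and the helper names `binary_characters` and `pairwise_distances` are hypothetical, and the distance used here is a plain share of mismatching cognate classes, not the PMI-based dissimilarity described in the abstract.

```python
from itertools import combinations

# Hypothetical toy data: one cognate-class label per (language, concept).
# These are NOT ASJP entries; they only illustrate the two data shapes.
COGNATES = {
    "lang_a": {"hand": "h1", "water": "w1", "stone": "s1"},
    "lang_b": {"hand": "h1", "water": "w2", "stone": "s1"},
    "lang_c": {"hand": "h2", "water": "w2", "stone": "s2"},
}


def binary_characters(data):
    """Build one presence/absence character per (concept, cognate class)."""
    chars = sorted(
        {(concept, cls) for row in data.values() for concept, cls in row.items()}
    )
    matrix = {
        lang: [1 if row.get(concept) == cls else 0 for concept, cls in chars]
        for lang, row in data.items()
    }
    return chars, matrix


def pairwise_distances(data):
    """Distance = share of shared concepts whose cognate classes differ."""
    langs = sorted(data)
    dist = {(lang, lang): 0.0 for lang in langs}
    for a, b in combinations(langs, 2):
        shared = sorted(set(data[a]) & set(data[b]))
        differing = sum(data[a][c] != data[b][c] for c in shared)
        # Assumption: languages with no shared concepts get maximal distance.
        dist[(a, b)] = dist[(b, a)] = differing / len(shared) if shared else 1.0
    return dist


chars, char_matrix = binary_characters(COGNATES)
distances = pairwise_distances(COGNATES)
```

A distance-based method would consume `distances`, while a character-based method would consume the rows of `char_matrix`, one column per inferred cognate class.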
2015
PNAS 112(41):12752-7, 2015
Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily.
2012
PLoS ONE 7(3):e33171, 2012
Language change takes place primarily via diffusion of linguistic variants in a population of individuals. Identifying selective pressures on this process is important not only to construe and predict changes, but also to inform theories of evolutionary dynamics of socio-cultural factors. In this paper, we advocate the Price equation from evolutionary biology and the Polya-urn dynamics from contagion studies as efficient ways to discover selective pressures. Using the Price equation to process the simulation results of a computer model that follows the Polya-urn dynamics, we analyze theoretically a variety of factors that could affect language change, including variant prestige, transmission error, individual influence and preference, and social structure. Among these factors, variant prestige is identified as the sole selective pressure, whereas others help modulate the degree of diffusion only if variant prestige is involved. This multidisciplinary study discerns the primary and complementary roles of linguistic, individual learning, and socio-cultural factors in language change, and offers insight into empirical studies of language change.
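
As a rough illustration of how the Price equation separates a selection term from a transmission term, the sketch below computes both on a toy population. It is not the simulation model from the paper: the trait values, the "fitness" weights (read here as the number of learners who copy each speaker), and the function name `price_decomposition` are assumptions chosen for the example. The standard form used is Δz̄ = Cov(w, z)/w̄ + E[w·Δz]/w̄.

```python
def price_decomposition(z, w, z_child):
    """Return (selection, transmission) terms of the Price equation.

    z        -- trait value of each parent (e.g. 1.0 = uses the new variant)
    w        -- fitness of each parent (here: number of learners copying them)
    z_child  -- mean trait value among each parent's learners
    """
    n = len(z)
    w_bar = sum(w) / n
    z_bar = sum(z) / n
    # Population covariance between fitness and trait value.
    cov_wz = sum(wi * zi for wi, zi in zip(w, z)) / n - w_bar * z_bar
    selection = cov_wz / w_bar
    # Fitness-weighted average change between parent and learners.
    transmission = (
        sum(wi * (zc - zi) for wi, zi, zc in zip(w, z, z_child)) / n / w_bar
    )
    return selection, transmission


# Hypothetical toy population of four speakers.
z = [1.0, 1.0, 0.0, 0.0]        # first two speakers use the new variant
w = [2.0, 2.0, 1.0, 1.0]        # variant users are copied twice as often
z_child = [1.0, 0.9, 0.0, 0.0]  # slight transmission error for speaker 2

selection, transmission = price_decomposition(z, w, z_child)

# Direct computation of the change in the mean trait, for comparison.
new_mean = sum(wi * zc for wi, zc in zip(w, z_child)) / sum(w)
delta_direct = new_mean - sum(z) / len(z)
```

In this toy case the positive covariance between being copied and using the variant produces the selection term, while the imperfect copying by the second speaker's learners produces a small negative transmission term; their sum equals `delta_direct`.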
Advances in Complex Systems 15(03n04):1150019, 2012
The paper investigates the quantitative distribution of language types across languages of the world. The studies are based on three large-scale typological data bases: The World Color Survey, the Automated Similarity Judgment Project data base, and the World Atlas of Language Structures. The main finding is that a surprisingly large and varied collection of linguistic typologies show power law behavior. The bulk of the paper deals with the statistical validation of these findings.
2010
Proceedings of the 8th International Conference on the Evolution of Language, pages 192-197, 2010
In the early seventies, the bio-mathematician George Price developed a simple and concise mathematical description of evolutionary processes that abstracts away from the specific properties of biological evolution. In the talk I will argue that Price's framework is well-suited to model various aspects of the cultural evolution of language. The first part of the talk describes Price's approach in some detail. In the second part, case studies about its application to language evolution are presented.
2008
Journal of Theoretical Biology 253(1):131-141, 2008
The paper investigates the class of signaling games with the following properties: (a) the interests of sender and receiver coincide, (b) different signals incur differential costs, and (c) different events (meanings/types) have different probabilities. Necessary and sufficient conditions are presented for a profile to be evolutionarily stable and neutrally stable, and for a set of profiles to be an evolutionarily stable set.

The main finding is that a profile belongs to some evolutionarily stable set if and only if a maximal number of events can be reliably communicated. Furthermore, it is shown that under the replicator dynamics, a set of states with a positive measure is attracted to "sub-optimal" equilibria that do not belong to any asymptotically stable set.

Language and Linguistics Compass 2(3):406-421, 2008
The article gives a brief overview of the budding field of game-theoretic linguistics, focusing on game-theoretic pragmatics on the one hand, and the use of evolutionary game theory to model cultural language evolution on the other. Two ...
2007
Evolutionary game theory and typology: A case study
Language 83(1):74-109, 2007
This article deals with the typology of the case marking of semantic core roles. The competing economy considerations of hearer (disambiguation) and speaker (minimal effort) are formalized in terms of EVOLUTIONARY GAME THEORY. It is shown that the case-marking patterns that are attested in the languages of the world are those that are evolutionarily stable for different relative weightings of speaker economy and hearer economy, given the statistical patterns of language use that were extracted from corpora of naturally occurring conversations.
Synthese 159(1):99-130, 2007
In this article we discuss the notion of a linguistic universal, and possible sources of such invariant properties of natural languages. In the first part, we explore the conceptual issues that arise. In the second part of the paper, we focus on the explanatory potential of horizontal evolution. We particularly focus on two case studies, concerning Zipf's Law and universal properties of color terms, respectively. We show how computer simulations can be employed to study the large-scale, emergent consequences of psychologically and sociologically motivated assumptions about the working of horizontal language transmission.
2006
Convex meanings and evolutionary stability
Proceedings of the 6th International Conference on the Evolution of Language, pages 139-144, 2006
Gärdenfors (2000) argues that natural denotations of natural language predicates are convex regions in a conceptual space. Using techniques from evolutionary game theory, the paper shows that this convexity criterion is a consequence of the evolutionary dynamics of language use.
2003
Simulating language change with Functional OT
Proceedings of Language Evolution and Computation Workshop/Course at ESSLLI, pages 52-61, 2003
The research reported here is a reaction to recent work by Judith Aissen on the typology of case marking systems within Optimality Theory (OT). Aissen (2000) explains certain linguistic universals by assuming universal sub-hierarchies of OT constraints. I found this ...
Evolutionary Game Theory and Linguistic Typology: A Case Study
Proceedings of the 14th Amsterdam Colloquium, 2003
The paper deals with the typology of the case marking of semantic core roles. The competing economy considerations of hearer (disambiguation) and speaker (minimal effort) are formalized in terms of evolutionary game theory. It will be shown that the case marking ...