Language Evolution and Computation Bibliography

Andrew Meade
Current Biology 25:1-9, 2015
BACKGROUND Concerted evolution is normally used to describe parallel changes at different sites in a genome, but it is also observed in languages where a specific phoneme changes to the same other phoneme in many words in the lexicon—a phenomenon known as regular sound change. We develop a general statistical model that can detect concerted changes in aligned sequence data and apply it to study regular sound changes in the Turkic language family. RESULTS Linguistic evolution, unlike the genetic substitutional process, is dominated by events of concerted evolutionary change. Our model identified more than 70 historical events of regular sound change that occurred throughout the evolution of the Turkic language family, while simultaneously inferring a dated phylogenetic tree. Including regular sound changes yielded an approximately 4-fold improvement in the characterization of linguistic change over a simpler model of sporadic change, improved phylogenetic inference, and returned more reliable and plausible dates for events on the phylogenies. The historical timings of the concerted changes closely follow a Poisson process model, and the sound transition networks derived from our model mirror linguistic expectations. CONCLUSIONS We demonstrate that a model with no prior knowledge of complex concerted or regular changes can nevertheless infer the historical timings and genealogical placements of events of concerted change from the signals left in contemporary data. Our model can be applied wherever discrete elements—such as genes, words, cultural trends, technologies, or morphological traits—can change in parallel within an organism or other evolving group.
Proceedings of the Royal Society B: Biological Sciences 280(1762), 2013
There is disagreement about the routes taken by populations speaking Bantu languages as they expanded to cover much of sub-Saharan Africa. Here, we build phylogenetic trees of Bantu languages and map them onto geographical space in order to assess the likely pathway of expansion and test between dispersal scenarios. The results clearly support a scenario in which groups first moved south through the rainforest from a homeland somewhere near the Nigeria–Cameroon border. Emerging on the south side of the rainforest, one branch moved south and west. Another branch moved towards the Great Lakes, eventually giving rise to the monophyletic clade of East Bantu languages that inhabit East and Southeastern Africa. These phylogenies also reveal information about more general processes involved in the diversification of human populations into distinct ethnolinguistic groups. Our study reveals that Bantu languages show a latitudinal gradient in covering greater areas with increasing distance from the equator. Analyses suggest that this pattern reflects a true ecological relationship rather than merely being an artefact of shared history. The study shows how a phylogeographic approach can address questions relating to the specific histories of certain groups, as well as general cultural evolutionary processes.
PNAS 110(21):8471-8476, 2013
The search for ever deeper relationships among the world’s languages is bedeviled by the fact that most words evolve too rapidly to preserve evidence of their ancestry beyond 5,000 to 9,000 y. On the other hand, quantitative modeling indicates that some “ultraconserved” words exist that might be used to find evidence for deep linguistic relationships beyond that time barrier. Here we use a statistical model, which takes into account the frequency with which words are used in common everyday speech, to predict the existence of a set of such highly conserved words among seven language families of Eurasia postulated to form a linguistic superfamily that evolved from a common ancestor around 15,000 y ago. We derive a dated phylogenetic tree of this proposed superfamily with a time-depth of ∼14,450 y, implying that some frequently used words have been retained in related forms since the end of the last ice age. Words used more than once per 1,000 in everyday speech were 7- to 10-times more likely to show deep ancestry on this tree. Our results suggest a remarkable fidelity in the transmission of some words and give theoretical justification to the search for features of language that might be preserved across wide spans of time and geography.
BioEssays, 2013
The Homeric epics are among the greatest masterpieces of literature, but when they were produced is not known with certainty. Here we apply evolutionary-linguistic phylogenetic statistical methods to differences in Homeric, Modern Greek and ancient Hittite vocabulary items to estimate a date of approximately 710–760 BCE for these great works. Our analysis compared a common set of vocabulary items among the three pairs of languages, recording for each item whether the words in the two languages were cognate – derived from a shared ancestral word – or not. We then used a likelihood-based Markov chain Monte Carlo procedure to estimate the most probable times in years separating these languages given the percentage of words they shared, combined with knowledge of the rates at which different words change. Our date for the epics is in close agreement with historians' and classicists' beliefs derived from historical and archaeological sources. The Iliad's story of the Trojan Wars tells us that the epics were almost certainly produced sometime after the 12th century BCE – if indeed the wars were ever fought – but the question is how much later? Herodotus thought considerably later: Writing in the Histories Book II.53 around 450 BCE, he stated that Homer ‘lived, as I believe, not more than 400 years ago’. The most commonly accepted date among modern classicists, drawing on historical, literary and archaeological analyses, is around the mid-8th century BCE (1, 2), although some authors propose a more recent 7th century BCE date (3). Here, we investigate whether formal statistical modelling of languages can help to inform this historical question. In particular, we investigate whether evolutionary-linguistic statistical methods can be usefully applied to differences in Homeric, Modern Greek and ancient Hittite vocabulary items to provide a date for these great works.
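The idea of estimating a separation time from cognate judgements and per-word replacement rates can be illustrated with a small sketch. This is not the authors' procedure: the abstract describes a likelihood-based Markov chain Monte Carlo analysis, whereas the code below swaps in a plain grid-search maximum-likelihood estimate. It assumes that each meaning stays cognate between two languages with probability exp(-rate * t), and the function names, rates, and cognate judgements are hypothetical values chosen only for illustration.

```python
import math


def log_likelihood(t, rates, cognate):
    """Log-likelihood of a separation time t (in years).

    Assumption (not from the source): meaning i is still cognate after
    t years with probability exp(-rates[i] * t), independently of the
    other meanings.
    """
    ll = 0.0
    for rate, is_cognate in zip(rates, cognate):
        p_retained = math.exp(-rate * t)
        ll += math.log(p_retained) if is_cognate else math.log(1.0 - p_retained)
    return ll


def estimate_separation_time(rates, cognate, t_min=1, t_max=20000, step=1):
    """Return the grid point (in years) with the highest log-likelihood."""
    best_t, best_ll = t_min, float("-inf")
    for t in range(t_min, t_max + 1, step):
        ll = log_likelihood(t, rates, cognate)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t


if __name__ == "__main__":
    # Hypothetical data: ten meanings with the same replacement rate,
    # half of which are judged cognate between the two languages.
    rates = [1e-4] * 10
    cognate = [True] * 5 + [False] * 5
    print(estimate_separation_time(rates, cognate))
```

With equal rates the estimate reduces to -ln(shared fraction) / rate, which gives a quick sanity check on the grid search. Allowing a different rate per meaning is what lets slowly changing words carry more weight when the overall percentage of shared vocabulary is low.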
Proceedings of the Royal Society B: Biological Sciences 277(1693):2443-2450, 2010
There are approximately 7000 languages spoken in the world today. This diversity reflects the legacy of thousands of years of cultural evolution. How far back we can trace this history depends largely on the rate at which the different components of language evolve. Rates of lexical evolution are widely thought to impose an upper limit of 6000-10 000 years on reliably identifying language relationships. In contrast, it has been argued that certain structural elements of language are much more stable. Just as biologists use highly conserved genes to uncover the deepest branches in the tree of life, highly stable linguistic features hold the promise of identifying deep relationships between the world's languages. Here, we present the first global network of languages based on this typological information. We evaluate the relative evolutionary rates of both typological and lexical features in the Austronesian and Indo-European language families. The first indications are that typological features evolve at similar rates to basic vocabulary but their evolution is substantially less tree-like. Our results suggest that, while rates of vocabulary change are correlated between the two language families, the rates of evolution of typological features and structural subtypes show no consistent relationship across families.
Science 319(5863):588, 2008
Linguists speculate that human languages often evolve in rapid or punctuational bursts, sometimes associated with their emergence from other languages, but this phenomenon has never been demonstrated. We used vocabulary data from three of the world's major language groups (Bantu, Indo-European, and Austronesian) to show that 10 to 33% of the overall vocabulary differences among these languages arose from rapid bursts of change associated with language-splitting events. Our findings identify a general tendency for increased rates of linguistic evolution in fledgling languages, perhaps arising from a linguistic founder effect or a desire to establish a distinct social identity.
Science 320(5875):446, 2008
While Noah Webster may have produced the earliest compendium on American English, the divergence from British English dates from much earlier. Long before the publication of Webster's Dictionary in 1806, pronunciation in America and in Britain had begun to differ (1, 2). The Dictionary thus does not mark a fixed point when all Americans shifted abruptly from British to American English. The speciation, rather, was gradual, because individual speakers change gradually, by increments, in their lifetimes; individual changes also spread gradually from speaker to speaker.
Nature 449(7163):717-720, 2007
Greek speakers say 'ουρά', Germans 'schwanz' and the French 'queue' to describe what English speakers call a 'tail', but all of these languages use a related form of 'two' to describe the number after one. Among more than 100 Indo-European languages and dialects, the words for some meanings (such as 'tail') evolve rapidly, being expressed across languages by dozens of unrelated words, while others evolve much more slowly, such as the number 'two', for which all Indo-European language speakers use the same related word-form. No general linguistic mechanism has been advanced to explain this striking variation in rates of lexical replacement among meanings. Here we use four large and divergent language corpora (English, Spanish, Russian and Greek) and a comparative database of 200 fundamental vocabulary meanings in 87 Indo-European languages to show that the frequency with which these words are used in modern language predicts their rate of replacement over thousands of years of Indo-European language evolution. Across all 200 meanings, frequently used words evolve at slower rates and infrequently used words evolve more rapidly. This relationship holds separately and identically across parts of speech for each of the four language corpora, and accounts for approximately 50% of the variation in historical rates of lexical replacement. We propose that the frequency with which specific words are used in everyday language exerts a general and law-like influence on their rates of evolution. Our findings are consistent with social models of word change that emphasize the role of selection, and suggest that owing to the ways that humans use language, some words will evolve slowly and others rapidly across all languages.