Geoff K. Nicholls
2011
Journal of the Royal Statistical Society. Series C: Applied Statistics 60(1):71-92, 2011
Nicholls and Gray have described a phylogenetic model for trait data. They used their model to estimate branching times on Indo-European language trees from lexical data. Alekseyenko and co-workers extended the model and gave applications in genetics. We extend the inference to handle data missing at random. When trait data are gathered, traits are thinned in a way that depends on both the trait and the missing data content. Nicholls and Gray treated missing records as absent traits. Hittite has 12% missing trait records. Its age is poorly predicted in their cross-validation. Our prediction is consistent with the historical record. Nicholls and Gray dropped seven languages with too much missing data. We fit all 24 languages in the lexical data of Ringe and co-workers. To model spatiotemporal rate heterogeneity we add a catastrophe process to the model. When a language passes through a catastrophe, many traits change at the same time. We fit the full model in a Bayesian setting, via Markov chain Monte Carlo sampling. We validate our fit by using Bayes factors to test known age constraints. We reject three of 30 historically attested constraints. Our main result is a unimodal posterior distribution for the age of Proto-Indo-European centred at 8400 years before Present with 95% highest posterior density interval equal to 7100-9800 years before Present.
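The 95% highest posterior density interval quoted above is the kind of summary that can be read off MCMC output directly. A minimal sketch, assuming a unimodal posterior represented by a list of sampled root ages; the function name `hpd_interval` and its arguments are illustrative and not taken from the paper:

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval covering `mass` of the samples.

    For a unimodal posterior this approximates the highest posterior
    density (HPD) interval. `samples` is any iterable of numbers.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of consecutive order statistics the interval must contain.
    k = max(1, math.ceil(mass * n))
    # Slide a window of k sorted samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]
```

Applied to sampled ages of the root node, this would give an interval of the same type as the one reported in the abstract; for a multimodal posterior the shortest single interval is no longer the HPD region.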
2008
Journal of the Royal Statistical Society. Series B: Statistical Methodology 70(3):545-566, 2008
Binary trait data record the presence or absence of distinguishing traits in individuals. We treat the problem of estimating ancestral trees with time depth from binary trait data. Simple analysis of such data is problematic. Each homology class of traits has a unique birth event on the tree, and the birth event of a trait that is visible at the leaves is biased towards the leaves. We propose a model-based analysis of such data and present a Markov chain Monte Carlo algorithm that can sample from the resulting posterior distribution. Our model is based on using a birth-death process for the evolution of the elements of sets of traits. Our analysis correctly accounts for the removal of singleton traits, which are commonly discarded in real data sets. We illustrate Bayesian inference for two binary trait data sets which arise in historical linguistics. The Bayesian approach allows for the incorporation of information from ancestral languages. The marginal prior distribution of the root time is uniform. We present a thorough analysis of the robustness of our results to model misspecification, through analysis of predictive distributions for external data, and fitting data that are simulated under alternative observation models. The reconstructed ages of tree nodes are relatively robust, whereas posterior probabilities for topology are not reliable.
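The birth-death process for sets of traits can be illustrated on a single branch of the tree. The sketch below assumes new traits are born at a constant rate and each existing trait dies independently at a constant rate, so that every trait has a unique birth event. The function `evolve_branch`, its parameters, and the integer trait ids are hypothetical names for illustration, not the authors' implementation:

```python
import math
import random


def evolve_branch(traits, length, birth_rate, death_rate, rng, next_id):
    """Evolve a set of trait ids along one branch of duration `length`.

    Assumes birth_rate > 0 and death_rate >= 0. Returns the set of traits
    present at the end of the branch and the next unused trait id.
    """
    # Each inherited trait survives the whole branch with prob exp(-mu * t).
    survive = math.exp(-death_rate * length)
    out = {t for t in sorted(traits) if rng.random() < survive}
    # Births form a Poisson process: exponential gaps between birth times.
    s = rng.expovariate(birth_rate)
    while s < length:
        # A trait born at time s must survive the remaining length - s.
        if rng.random() < math.exp(-death_rate * (length - s)):
            out.add(next_id)
        next_id += 1  # ids are never reused, so each trait is born once
        s += rng.expovariate(birth_rate)
    return out, next_id
```

Calling this once per branch from the root towards the leaves, and passing each parent's trait set to both children, would generate binary presence/absence data of the kind the abstract describes.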
2006
Quantifying Uncertainty in a Stochastic Model of Vocabulary Evolution
Phylogenetic Methods and the Prehistory of Languages 14.0:161-, 2006
2. Background In this section we introduce the data, discuss a recent attempt to reconstruct the ancestry of languages in the data, and introduce some basic assumptions which will be important in our analysis. What is the data, and how was it gathered? Dyen et al.(1997) ...
2005
Transactions of the Philological Society 103(2):193-219, 2005
Gray & Atkinson's (2003) application of quantitative phylogenetic methods to Dyen, Kruskal & Black's (1992) Indo-European database produced controversial divergence time estimates. Here we test the robustness of these results using an alternative data set of ancient Indo-European languages. We employ two very different stochastic models of lexical evolution - Gray & Atkinson's (2003) finite-sites model and a stochastic-Dollo model of word evolution introduced by Nicholls & Gray (in press). Results of this analysis support the findings of Gray & Atkinson (2003). We also tested the ability of both methods to reconstruct phylogeny and divergence times accurately from synthetic data. The methods performed well under a range of scenarios, including widespread and localized borrowing.