Empirical evaluation of a prior for Bayesian phylogenetic inference

Bibliographic Details
Published in: Philosophical Transactions of the Royal Society B: Biological Sciences
Main Author: Yang, Ziheng
Format: Article in Journal/Newspaper
Language: English
Published: The Royal Society 2008
Online Access: http://dx.doi.org/10.1098/rstb.2008.0164
https://royalsocietypublishing.org/doi/pdf/10.1098/rstb.2008.0164
https://royalsocietypublishing.org/doi/full-xml/10.1098/rstb.2008.0164
Description
Summary: The Bayesian method of phylogenetic inference often produces high posterior probabilities (PPs) for trees or clades, even when the trees are clearly incorrect. The problem appears to be mainly due to the large sizes of molecular datasets and to the large-sample properties of Bayesian model selection, in particular its sensitivity to the prior when several of the models under comparison are nearly equally correct (or nearly equally wrong) and are of the same dimension. A previous suggestion to alleviate the problem is to let the internal branch lengths in the tree become increasingly small in the prior as the data size increases, so that the bifurcating trees are increasingly star-like. In particular, if the internal branch lengths are assigned an exponential prior, the prior mean $\mu_0$ should approach zero faster than $1/\sqrt{n}$ but more slowly than $1/n$, where $n$ is the sequence length. This paper examines the usefulness of this data size-dependent prior using a dataset of the mitochondrial protein-coding genes from the baleen whales, with the prior mean fixed at $\mu_0 = 0.1\,n^{-2/3}$. In this dataset, phylogeny reconstruction is sensitive to the assumed evolutionary model, species sampling and the type of data (DNA or protein sequences), but Bayesian inference using the default prior attaches high PPs to conflicting phylogenetic relationships. The data size-dependent prior alleviates the problem to some extent, giving weaker support for unstable relationships. This prior may be useful in reducing apparent conflicts in the results of Bayesian analysis or in making the method less sensitive to model violations.
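
The scaling of the prior mean described in the summary can be illustrated with a minimal Python sketch. It only evaluates the stated formula (prior mean equal to 0.1 times n to the power −2/3) and the log density of an exponential prior with that mean; it is not the software used in the paper. The function names `prior_mean` and `exp_prior_logpdf` and the sequence lengths in the loop are assumptions chosen for illustration.

```python
import math


def prior_mean(n, c=0.1, power=-2.0 / 3.0):
    """Data size-dependent prior mean mu0 = c * n**power.

    The defaults c = 0.1 and power = -2/3 are the values quoted in the
    summary; n is the sequence length.
    """
    if n <= 0:
        raise ValueError("sequence length n must be positive")
    return c * n ** power


def exp_prior_logpdf(t, mu0):
    """Log density at branch length t of an exponential prior with mean mu0."""
    if t < 0:
        return float("-inf")
    return -math.log(mu0) - t / mu0


# Illustrative sequence lengths (hypothetical, not taken from the paper).
lengths = (100, 1_000, 10_000, 100_000)

for n in lengths:
    mu0 = prior_mean(n)
    # mu0 * sqrt(n) shrinks with n (mu0 goes to zero faster than 1/sqrt(n)),
    # while mu0 * n grows with n (mu0 goes to zero more slowly than 1/n).
    print(n, mu0, mu0 * math.sqrt(n), mu0 * n)
```

The rate condition is asymptotic: the two products printed in the last columns move in opposite directions as n grows, but for short sequences the value of the prior mean itself need not lie between 1/n and 1/√n.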