What is the difference between a species tree, a gene tree and a phylogenetic tree?

I have found websites and text books disagreeing over this. For example,

Phylogenies are also known as “species trees”…


It is well known that a phylogenetic tree (gene tree) constructed from DNA sequences for a genetic locus does not necessarily agree with the tree that represents the actual evolutionary pathway of the species involved (species tree)…

Although the second source, being from a journal, would seem to be more reliable, my original thought was that phylogenetic trees do show the evolutionary relationships between species and thus are species trees as opposed to gene trees.

Any clarification on the issue would be much appreciated.

Phylogenetic tree

A phylogenetic tree is a tree showing the relationships between lineages. The tree might be computed from genome-wide DNA or from only a single gene. As such, the term phylogenetic tree is general.

Gene tree vs species tree

If you compute a phylogenetic tree from genome-wide DNA, you are computing a species tree (although some sister lineages might not perfectly fit the definition of species). If the phylogenetic tree is computed from data coming from a single gene, we are talking about a gene tree.

Why would a gene tree not match a species tree?

But a gene tree is not just a species tree with fewer data. There are reasons why a gene tree may not match a species tree. Genes may duplicate within a given genome. The two duplicated genes are free to evolve independently after the split. All descendant lineages will then inherit these two genes, but if you compare the two different copies of the gene (red and green below) in two sister clades, they will be much more different than the two same copies (say the green copy) in two distantly related species. Here is a picture to show that

Also, copies of genes can be deleted later on, and there might be horizontal gene transfer. For all these reasons the phylogenetic tree of a gene might well be very different from the phylogenetic tree of the species.
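The duplication-and-loss scenario above can be sketched in a few lines of Python. This is only an illustration: the species names, the copy labels "a" and "b", the nested-tuple tree encoding and the helper functions `clades`, `prune` and `relabel` are all assumptions made for the example, not part of any phylogenetics library.

```python
def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of tip labels."""
    found = set()

    def tips(node):
        if isinstance(node, str):                      # a tip
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)                             # an internal node defines a clade
        return members

    tips(tree)
    return found


def prune(node, keep):
    """Restrict a tree to the tips in `keep`, collapsing nodes left with one child."""
    if isinstance(node, str):
        return node if node in keep else None
    kids = [k for k in (prune(child, keep) for child in node) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def relabel(node):
    """Replace each gene label such as 'human_a' by its species name 'human'."""
    if isinstance(node, str):
        return node.split("_")[0]
    return tuple(relabel(child) for child in node)


# Hypothetical species tree: human and chimp are sisters.
species_tree = (("human", "chimp"), "mouse")

# Hypothetical gene tree: an ancient duplication produced copies a and b,
# and each copy then followed the species tree.
gene_tree = ((("human_a", "chimp_a"), "mouse_a"),
             (("human_b", "chimp_b"), "mouse_b"))

# Assume later losses left only these three copies to be sampled.
surviving = {"human_a", "mouse_a", "chimp_b"}
observed = relabel(prune(gene_tree, surviving))

print(observed)
print(clades(observed) == clades(species_tree))
```

With these particular losses the surviving gene copies group human with mouse, even though every copy evolved along the species tree.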

Below is a gene tree and the associated species at the tips. Take time to make sense of what is going on in the picture.

What is a species tree exactly?

One could think of a species tree as some kind of average tree among all gene trees.
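One crude way to make "average tree" concrete is a majority-rule summary: count how often each clade appears across the gene trees and keep the clades found in more than half of them. The sketch below assumes nested-tuple trees over made-up taxa A to D; `clades` and `majority_clades` are hypothetical helper names. It is not a multispecies coalescent method, and such simple summaries can be misled when gene trees conflict heavily.

```python
from collections import Counter


def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of tip labels."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


def majority_clades(gene_trees, threshold=0.5):
    """Keep the clades present in more than `threshold` of the gene trees."""
    counts = Counter()
    for tree in gene_trees:
        counts.update(clades(tree))
    n = len(gene_trees)
    return {clade for clade, c in counts.items() if c / n > threshold}


# Three hypothetical gene trees; two of them group A with B.
gene_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
]

consensus = majority_clades(gene_trees)
for clade in sorted(consensus, key=len):
    print(sorted(clade))
```

Here the clade {A, B} survives because two of three gene trees contain it, while {A, C} is dropped.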

The field of phylogenetics is entering a new era in which trees of historical relationships between species are increasingly inferred from multilocus and genomic data. A major challenge for incorporating such large amounts of data into inference of species trees is that conflicting genealogical histories often exist in different genes throughout the genome. Recent advances in genealogical modeling suggest that resolving close species relationships is not quite as simple as applying more data to the problem. Here we discuss the complexities of genealogical discordance and review the issues that new methods for multilocus species tree inference will need to address to account successfully for naturally occurring genomic variability in evolutionary histories.

What is a Rooted Phylogenetic Tree

A rooted phylogenetic tree is a type of phylogenetic tree that describes the ancestry of a group of organisms. Importantly, it is a directed tree, starting from a unique node known as the most recent common ancestor. The root of the phylogenetic tree represents this most recent common ancestor.

Figure 1: A Rooted Phylogenetic Tree

In practice, the root is often placed by including an outgroup: an extra, distantly related organism outside the group of organisms used to build the phylogenetic tree. The root itself serves as the parent of all organisms in the group.

Freshwater Gastropods of North America

Editor's Note. This essay was subsequently published as: Dillon, R.T., Jr. (2019b) What is a species tree? Pp 199-206 in The Freshwater Gastropods of North America Volume 2, Essays on the Pulmonates. FWGNA Press, Charleston.

Back in July of 2008 (1) we reviewed the relationship between gene trees and species trees, a subject that has become fashionable at the highest levels of evolutionary science. The phenomenon that leads to differences between the two types of evolutionary tree is usually called “lineage sorting” by phylogenetic systematists, who otherwise ignore it, hoping it will go away. But population geneticists, who tend to run their gene trees backward and call the phenomenon "coalescence," have pointed out that the differences in a set of gene trees can be used to date divergence events in a species tree. This may be the one thing that gene trees are actually good for.

But what is a "species tree?" Wayne Maddison, in his seminal 1997 paper on the subject (2), had in mind a phylogeny of bona fide biological species, "when reproductive communities are split." But reproductive isolation need not evolve between a pair of populations for lineage sorting to commence in their gene pools. The clock starts ticking on the gene trees as gene flow becomes disrupted for any reason between any pair of populations, reproductively isolated or not. So although the scientific community most actively involved in this area of research has always used the term "species tree" to describe the phylogeny they are comparing to their gene trees (3), the term "population tree" would clearly be much more accurate.

Genuine species trees turn out to be surprisingly difficult even to visualize, much less to work out. I did not appreciate the difficulty myself until I tried to draw one, or actually a set of them, for the paper Amy Wethington, Chuck Lydeard, and I recently published in BMC Evolutionary Biology (4).
Our new paper summarizes over ten years of research we've conducted on the evolution of reproductive isolation in Physa, including papers we've published comparing Charleston P. acuta, P. gyrina, P. pomilia, and P. carolinae (5). Brand new for 2011 we've added a second population of P. acuta, sampled from Philadelphia.

Here's an obvious attribute of the word "species" that I think we may all take too much for granted. The word "species" is relational, like the word "brother." It is not a point-character that can be measured on an OTU and cast simply on a phylogenetic tree. So when Amy, Chuck, and I undertook to add a second population of P. acuta to the species tree we had been building for years, we added a row to our triangular matrix. Between acuta, carolinae, pomilia and gyrina there are 3+2+1=6 sets of pairwise mate choice tests to be completed, and 6 corresponding sets of no-choice hybridization experiments. Adding the Philadelphia acuta population added 4 more sets of both, almost doubling our effort. A pain in the butt.
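The bookkeeping behind that triangular matrix is just the number of unordered pairs. A short Python check, using the population labels from this essay (acuta-c for Charleston, acuta-p for Philadelphia), reproduces the counts:

```python
from itertools import combinations

# Four populations, one per species, before the Philadelphia sample was added.
before = ["acuta-c", "carolinae", "pomilia", "gyrina"]
# Five populations once Philadelphia acuta is included.
after = ["acuta-c", "acuta-p", "carolinae", "pomilia", "gyrina"]

pairs_before = list(combinations(before, 2))   # every unordered pair, once
pairs_after = list(combinations(after, 2))

print(len(pairs_before))                       # 3 + 2 + 1
print(len(pairs_after) - len(pairs_before))    # pairs added by the new population
```

Each pair stands for one set of mate choice tests and one set of no-choice hybridization experiments.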

And here’s another bullet point we might profitably highlight before drawing our first species tree. The single, original population at the base of our hypothetical tree was (of course) reproductively compatible with itself. What evolves, as a species tree splits, is reproductive incompatibility. So (to borrow a term from cladism, which I hate) retention of reproductive compatibility is a "symplesiomorphy." I can still remember the day back in the early 1980s when an ANSP ichthyologist first called my attention to a paper by Donn Rosen rejecting the biological species concept because an ideologically pure classification cannot be based on symplesiomorphy (6). I'm still pissed off about that.

In any case, I should think that most of us, if we were to visualize a bona fide species tree, would begin with something like tree (c) at left for F1 fertility in Physa. No hybrids were recovered from any of the no-choice experiments testing gyrina against any of the other four populations, so gyrina is not shown in tree (c). At least some hybrids were born in the other 3+2+1 experiments, the ones between Charleston acuta-c and Philadelphia acuta-p proving perfectly fertile, the other 3+2 categories of hybrids not. So starting with hybrid fertility at the base of the tree, the situation may have been as simple as two separate mutations evolving at a single locus, from fertile to sterile. The actual phenomenon in Physa is almost certainly controlled by multiple loci, but our observations can be modeled very simply, as just the one.

The situation cannot be so simple for hybridization, however. Of course the Charleston acuta-c and the Philadelphia acuta-p will hybridize, and as we just noted, gyrina does not hybridize in our experimental conditions with any of the other species at all (7). Our Physa carolinae population freely hybridizes with either population of acuta, while P. pomilia partially hybridizes with either population of acuta, yielding mixtures of selfed and outcrossed F1 progeny. Our carolinae x pomilia no-choice experiments have not yielded any hybrids.

My best effort to depict this messy set of relationships is shown in tree (b) at right above. Our model suggests two loci, a "complete" locus J and a "partial" locus K, at which unique compatibility alleles segregate, locus J epistatic over locus K. See the text of our paper for the details. The bottom line is, however, that the simplest model I can conceive of for the evolution of barriers to hybridization in Physa is not especially simple.

Nor is the situation on the evolution of sexual incompatibility simple, at all. Although our mate choice tests did show prezygotic reproductive isolation between P. carolinae and Charleston acuta-c, no behavioral barriers seem to be in place to lower the frequency of copulation between carolinae and Philadelphia acuta-p. Those observations seem to require another two-locus model (tree d), again see our paper for the gory details.

Our paper has a very mild conclusion, given the years of sturm und drang through which we passed to arrive upon it. The three species trees shown above, with their minimum of five genes for reproductive isolation as postulated, do indeed match the CO1+16s mtDNA gene trees previously published by Wethington & Lydeard and Wethington et al (8). I was kind-of hoping that they would not, because our conclusions would have been cooler. But (try as I might) I cannot rearrange the three species trees shown above to make them any simpler, to not match the gene tree.

I'm sure this isn't the first time in the history of science that the relationship between gene trees and bona fide species trees has ever been tested, but I don't know of any other. And the match seems to be a good one, darn it. But I am not issuing a warrant to all you gene tree jocks out there to get cocky.

(1) Gene Trees and Species Trees [15July08]

(2) Maddison, W. 1997. Gene trees in species trees. Systematic Biology, 46, 523-536.

(3) For example see Hudson, R. R. (1992) Gene trees, species trees, and the segregation of ancestral alleles. Genetics 131: 509-512. Wakeley, J. (2008) Coalescent Theory, An Introduction. Roberts & Co., Greenwood Village, CO 326 pp. Degnan et al. (2009) Properties of consensus methods for inferring species trees from gene trees. Syst. Biol. 58: 35-54.

(4) Dillon, R. T., Jr., A. R. Wethington & C. Lydeard (2011) The evolution of reproductive isolation in a simultaneous hermaphrodite, the freshwater snail Physa. BMC Evolutionary Biology 11:114.

(5) Dillon, R.T., Jr., Robinson, J. & Wethington, A. 2007. Empirical estimates of reproductive isolation among the freshwater pulmonate snails Physa acuta, P. pomilia, and P. hendersoni. Malacologia, 49, 283-292. [pdf] Dillon, R. T., Jr., Earnhardt, C. & Smith, T. 2004. Reproductive isolation between Physa acuta and Physa gyrina in joint culture. American Malacological Bulletin, 19, 63-68. [pdf] Dillon, R. T., Jr. 2009. Empirical estimates of reproductive isolation among the Physa species of South Carolina (Pulmonata: Basommatophora). The Nautilus, 123, 276-281. [pdf] That last paper was featured in my blog post entitled "True Confessions: I described a new species" [7Apr10]

(6) Rosen, D. E. (1979) Fishes from the uplands and intermontane basins of Guatemala: revisionary studies and comparative geography. Bull. Amer. Mus. Nat. Hist. 162: 270-375.

(7) Although we recovered no hybrids from our no-choice experiments published in 2004, Tom Smith and I did discover a couple acuta x gyrina hybrids naturally occurring on the margins of the Delaware River at Washington Crossing. They showed up quite unexpectedly on allozyme gels. That's all I know, until God grants me another life, or a few more good students.

(8) Wethington AR, Lydeard C (2007) A molecular phylogeny of Physidae (Gastropoda: Basommatophora) based on mitochondrial DNA sequences. J Moll Stud 73: 241-257. [12Oct07] Wethington AR, Wise J, Dillon RT Jr (2009) Genetic and morphological characterization of the Physidae of South Carolina (Pulmonata: Basommatophora), with description of a new species. Nautilus 123: 282-292. [pdf]

Future Challenges

Methodological issues are still numerous and leave open wide research avenues, while at the same time the potential of already available methods can be exploited on an increasingly large scale.

Bypassing the Gene Tree in the Multispecies Coalescent

The multispecies coalescent model describes the evolution of polymorphisms along a species phylogeny. Computing the likelihood of a gene alignment using this model requires summing over a large space of gene trees, given a species tree. This computational difficulty is a major hurdle to using this approach on large data sets, containing large numbers of species, and large numbers of gene families. Very recently, Bryant et al. (2012) and De Maio et al. (2013) came up with two elegant approaches to computing the likelihood of an alignment under the multispecies coalescent, by bypassing entirely the gene tree level, and instead analytically integrating over the space of possible allele histories. These models present the first methods to explicitly carry out the integral in the Felsenstein equation (Felsenstein 1988; Hey and Nielsen 2007). Bryant et al. (2012) consider biallelic data, and provide a model and an algorithm, called SNAPP, that can be used to reconstruct a species tree given an alignment of single nucleotide polymorphisms for instance. They develop a specific algorithm to address the fact that the coalescent process fundamentally functions from the tips of the species tree to its root, whereas the mutation process works forwards. They use this algorithm to reconstruct species trees with 69 individuals in 6 species of Digitalis plants. De Maio et al. (2013) instead propose a model for sequence data with A, C, G, T data by using a substitution matrix over a larger state space than the usual 4 × 4 substitution matrices: it contains all 6 biallelic states <A, C>, <A, G>, … with a range of frequencies. They focus on a specific model, where they consider a range of 10 possible frequencies per biallelic state: for the state <A, C>, we therefore have the states <A 10%, C 90%>, <A 20%, C 80%>, …, <A 90%, C 10%>.
Two assumptions are made: first, no more than 2 alleles at a given site can be found at any time in a population, and second their frequencies are well approximated by the limited range included in the model. They construct transitions between states of this matrix from a population size parameter, selection coefficients, and mutation rates. The resulting instantaneous rate matrix is then exponentiated to provide a matrix of substitution probabilities. Overall, the matrix obtained with a range of 10 possible frequencies per biallelic state contains 58 states, that is about the same number of states as a codon substitution model. De Maio et al. (2013) use this model, with some further refinements to account for context-dependent mutations and strand-specificity on a large alignment of four species of primates and find evidence for a smaller ancestral population size in orangutans, and selection on splicing enhancers in exons.
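The figure of 58 states can be checked by enumeration. The sketch below assumes one reading of the description: four fixed states (one per nucleotide) plus, for each of the 6 nucleotide pairs, the 9 intermediate frequencies from 10% to 90% listed in the text. The tuple encoding of states is purely illustrative.

```python
from itertools import combinations

nucleotides = "ACGT"
# Assumed frequency grid: 10%, 20%, ..., 90% for the first allele of a pair.
frequencies = range(10, 100, 10)

# Fixed states: a single allele present in the population.
fixed = [(n,) for n in nucleotides]

# Polymorphic states: two alleles at complementary frequencies,
# e.g. ("A", 10, "C", 90) stands for <A 10%, C 90%>.
polymorphic = [
    (a, f, b, 100 - f)
    for a, b in combinations(nucleotides, 2)
    for f in frequencies
]

states = fixed + polymorphic
print(len(fixed), len(polymorphic), len(states))
```

For comparison, a codon substitution model over the standard genetic code has 61 sense codons, which is why the two state spaces are described as similar in size.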

Such analytical approaches seem very promising for combining coalescent models with duplication, loss and transfer models, as they bypass the problem of sampling allele histories. How they improve upon multispecies coalescent gene tree-species tree models is still an open question.

More Integrative Models

The integrative program of Goodman et al. (1979) is being progressively implemented. The probabilistic framework makes it possible to integrate sequence mutations with gene duplications and losses through the coalescent (Rasmussen and Kellis 2012), or to integrate duplications, losses, and transfers with substitutions (Szöllősi et al. 2012; Boussau et al. 2013; Szöllősi et al. 2013a, b). Rearrangements can be handled using parsimony if ILS is ignored (Bérard et al. 2012; Patterson et al. 2013).

A model and method to handle a union of all of these processes is currently missing. However, there are very good reasons for the integration of different levels of data analysis to continue. For instance, below the gene tree/species tree problem is the inference of gene alignments. Only recently has the problem of joint inference of alignments and gene trees been considered seriously, with attempts to model the process of insertion/deletion in the evolution of sequences. Such approaches show dramatic improvements over phylogenetically unaware alignment methods (Redelings and Suchard 2005; Satija et al. 2009; Warnow 2013). However, they obviously need all the information necessary to have the best possible gene tree, for example a link to the species tree. Hence, it is probable that the integration of gene tree–species tree models and alignment methods should benefit the inference of alignments, gene trees and perhaps species trees.

Although a global model seems difficult to imagine presently, the entire pipeline of sequence data analysis, from sequencing error corrections to gene annotation and genome assembly, is likely to benefit from probabilistic evolutionary models. The recognition of homologous sequences, the prediction of gene functions based on information from other organisms, and the proximity of genes on chromosomes all depend ultimately on the structure of the species tree and the possible events of substitution, duplication, loss, and lateral transfer that may have occurred in the history of genomes. There is currently no proposition of an integration of these processes on all levels of the pipeline described in Figure 5, but phylogenetically aware methods have proved very promising at many different steps of the process (Boussau and Daubin 2010), including on genome assembly (Husemann and Stoye 2010; Rajaraman et al. 2013).

Algorithmics and Computing Time

The score of a gene tree, especially if it is the combination of scores from several models, can be fairly costly to compute. Therefore, the exploration of trees is always time consuming. Already the inference of a gene tree that maximizes the probability of the alignment given the gene tree is provably hard. The joint inference, estimation of parameters, and exploration of dated or ordered species trees combine intractable problems. In practice, optimizing a gene tree can necessitate up to a few hours for very large families. As there can be thousands of gene families in a typical dataset, the computations even for a fixed species tree can take a long time. However, models of gene family evolution as well as sequence-based models all make the assumption that genes evolve independently from each other. This assumption can be questioned (see below) and is also broken by evolutionary parameters shared among gene families. But it allows a trivial parallelization by the data. All gene trees can be computed independently, given a common species tree. Hence, a species tree exploration is mainly constrained by the largest multigene families. A simple way to increase computational efficiency is to ignore these large families in a first step of species tree exploration. Large multigene families can be considered later, when a good species tree is found based on smaller gene families, or, in a sampling context, using importance sampling. However, such tricks can only help as long as the number of genomes under study is relatively small. For studying larger datasets, we will need to devise more efficient algorithms.
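The two ideas in this paragraph, scoring gene families independently and postponing the largest families, can be sketched as follows. Everything here is a placeholder: `score_family` stands in for a real gene tree optimization, the family names and sizes are invented, and the size cutoff is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def score_family(family, species_tree):
    """Placeholder for an expensive gene tree optimization given a species tree."""
    name, n_genes = family
    # Toy cost that grows with family size; a real score would come from a model.
    return name, n_genes ** 2


def score_all(families, species_tree, max_size=None, workers=4):
    """Score families independently, optionally skipping those above max_size."""
    selected = [f for f in families if max_size is None or f[1] <= max_size]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda f: score_family(f, species_tree), selected)
        return dict(results)


species_tree = (("A", "B"), "C")                      # hypothetical fixed species tree
families = [("fam1", 12), ("fam2", 250), ("fam3", 8)]

first_pass = score_all(families, species_tree, max_size=100)   # large family deferred
full_pass = score_all(families, species_tree)                  # all families included
print(sorted(first_pass), sorted(full_pass))
```

A thread pool keeps the sketch simple; for CPU-bound scoring in Python a process pool or a cluster scheduler would be the more realistic choice, since each family needs nothing beyond the shared species tree.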

Reconstructing and Dating the Tree of Life

A confusion between gene trees and species trees is arguably at the origin of the claim that Darwin was wrong when he evoked the image of a tree of life, because he failed to foresee the role of lateral gene transfer in microbial evolution ( Doolittle 1999). The models and methods described above actually show that the plurality of gene histories can not only be overcome but more importantly provides additional information on the processes and patterns of species evolution. The phylogenies for a diversity of clades have been reconstructed with coalescent, DL or DTL models. In each case, the degree of conflict among gene trees can be interpreted in biological terms, such as divergence time and ancestral population size with the coalescent, or relative timing of speciation with LGT. There is a great hope that the development and use of these models will help resolve many issues that were left pending by traditional methods.

Beyond the Gene as an Evolutionary Unit

Although we have adopted a liberal sense for “gene”, in many of the studies we reported, a gene is a sequence coding for a protein or a functional RNA, and is considered as an evolutionary unit. However, within such genes, different parts may have different histories (Didelot et al. 2010; Wu et al. 2012). Alternatively, some genes may be associated throughout evolutionary times because their functions are interdependent or simply because they are close to each other in the genome. As such, they may be duplicated or transferred together (Bansal et al. 2013; Patterson et al. 2013). Hence, the definition of evolutionary units is difficult, and fluctuates in time (Fig. 8). As we have shown, almost all existing models describe the reconciliation of one gene tree with one species tree, supposing its evolution is coherent and independent from other genes. Some genomic studies, however, allow genome-wide parameters like the rates of duplications and losses to vary across branches of the species tree (Boussau et al. 2013). This can be seen as a trick to model large-scale events like genome duplications without doing away with the independence of genes, which is computationally advantageous. But it fails to model more local rearrangements such as duplications of parts of a chromosome. These events could be informative for phylogeny, but models of genome rearrangements are often combinatorially so complex (Fertin et al. 2009) that they do not scale up well with the size and number of genomes (York et al. 2002; Darling et al. 2008; Miklós and Tannier 2010). Until now, their complexity has precluded a coupling with other models such as gene tree–species tree reconciliation. However, assuming neighborhoods between genes are independent, meaning that for any 3 genes A, B, C the neighborhood between genes A and B is independent of whether genes A and C are neighbors or not, it is possible to integrate rearrangements into DL (Bérard et al. 2012) or DTL (Patterson et al. 2013) models. Such approaches describe the evolution of neighborhoods (or any other relationship between genes, including functional ones) along pairs of reconciled gene trees, allowing one to reconstruct adjacencies in ancestral genomes and evolutionary events of duplication, loss, and transfer that have affected genomic fragments comprising several genes. Because such multiple events are frequent, it is likely that the parameters of duplication, transfer, and loss that are estimated in DL and DTL models are biased, and it seems necessary to integrate models of neighborhood evolution with phylogenetic reconstruction into the reconstruction of genome histories.

Evolutionary units below or above genes. Individual units (red and blue online) can be inside genes or genes that are neighbors along a chromosome or genes involved in a protein complex. Adjacencies are binary relations between genes, and evolve along a species phylogeny. Adjacencies can be gained or lost regardless of the birth and death of the units. When two units together undergo speciation, duplication, or transfer, adjacencies undergo the same events.

There are also models for detecting breakpoints inside gene sequences, using HMMs for instance (McGuire et al. 2000; Suchard et al. 2002; Martins et al. 2008; Boussau et al. 2009), or detecting breakpoints of phylogenetic discordance at a whole genome scale (Ané 2011), but so far these models have not been included in models of gene family evolution.

Keeping Up with the Pace of Data Acquisition

Currently, genome sequencing is no longer a limiting step for comparative genomics. Instead, assembling gene families, gene alignments, gene trees, and a species tree are becoming increasingly problematic. In this context, methods using models of gene family evolution may offer an advantage because they effectively reduce the space of possible solutions to explore: given a species tree, the space of possible gene trees is limited compared with species tree unaware methods, and consequently, so is the space of possible alignments. Devising smart algorithms that make use of these reductions of complexity may provide fast yet accurate inferences for large-scale comparative genomics projects.

Another area where progress is needed is in the reuse of prior information. Currently, every time a new comparative genomics project is undertaken, or every time a database of homologous sequences is updated, many inference tasks need to be redone from scratch. The computations of gene families, alignments, trees, and species tree are usually done as if there was no prior information obtained from previous analyses. This is obviously a huge waste of useful information, as these computations are often very demanding. Future approaches to comparative genomics will need to be not only integrative, but also incremental. There is a clear need for new developments, and the Systematic Biology community is well equipped to undertake them.

How Are Organisms Classified?

    Taxonomy is the field of biology that classifies living and extinct organisms according to a set of rules.

    Taxonomy produces a hierarchy of groups of organisms; the organisms are assigned to groups based on similarities and dissimilarities of their characteristics.

    A phylogenetic tree is a hypothesis that depicts the evolutionary relationships among groups of organisms. In detailed phylogenetic trees, branch points indicate when new species diverged from a common ancestor.

How Are Phylogenetic Trees Constructed?

    Phylogenetic trees are usually based on morphological or genetic homology.

    A comparison of anatomical traits can reveal an evolutionary relationship among species.

    Shared, derived characteristics are used to construct a tree called a cladogram.

How is Molecular Systematics Changing Our View of Taxonomy?

    Taxonomy is a work in progress.

    As new species are found, taxonomic groups may no longer be monophyletic.

Topological variation

One of the phylogenetic problems analyzed by Huerta-Cepas et al. [3] is the relationship between primates, rodents, and laurasiatherians (the latter comprising the Cetartiodactyla, which include whales and artiodactyls, as well as the Carnivora, and certain other mammalian orders). By means of an algorithm that scans topologies in the trees of the human phylome, the authors quantified the number of trees supporting different relationships. They found, after eliminating unstable trees, 4,806 phylogenetic trees supporting the grouping of primates and laurasiatherians into a clade with the exclusion of rodents, 3,459 trees supporting a primates and rodents grouping (a clade known as Euarchontoglires or Supraprimates, and supported by recent molecular phylogenies [5]; this is the arrangement depicted in Figure 1), and 2,258 trees grouping rodents and laurasiatherians in a single clade. Thus, the topological variation found was extreme, not far from the maximum possible, and represents a serious methodological challenge, especially as all these trees are statistically well supported, with a Bayesian posterior probability higher than 0.9 in the node of interest. Given the large numbers of genes supporting each of the three possible arrangements of these mammalian lineages, it is not surprising that recent phylogenomic studies have produced different trees relating human, mouse and dog [11, 12]. Huerta-Cepas et al. [3] did not calculate a consensus tree (this was not the purpose of this study), and thus it is not straightforward to determine the 'true' tree topology relating these mammals. Just getting the best-supported topology is not enough, and even using all genes in a genome may not help you come to an unambiguous solution. This is because different genes produce different biases, and rigorous criteria for selecting the genes to be used to build a species tree are necessary to get less ambiguous results, as has been done in other work (see [13] for a review).
The important message from this part of the study is that, whatever the true tree may be, trees derived from single genes are more likely than not to point to a wrong topology.

Huerta-Cepas et al. [3] also looked at the relationship among chordates, arthropods and nematodes, a tree that has been the subject of much recent work (see references in [3]). In this case, 2,431 trees support a grouping of chordates and arthropods (Coelomata), 1,759 trees support a nematode-arthropod clade (the Ecdysozoa in Figure 1, this group is included in the protostomes) and 1,040 trees support a grouping of chordates and nematodes. A great diversity of topologies was also found and we can see again that, even without knowing the true tree, most trees must be wrong. A third problem studied by Huerta-Cepas et al. [3] regarding the position of several basal eukaryotic lineages is more difficult to interpret, as there are more than three possible topologies, but the results also point to a high variability among topologies.
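The claim that most single-gene trees must be wrong, whichever topology is true, follows directly from the counts quoted above. A short Python check (the dictionary keys are just descriptive labels chosen for this example) computes the best case for each problem, namely the share of trees supporting the most frequent topology:

```python
# Tree counts as reported in the text for the two three-way problems.
counts = {
    "mammals": {
        "primates+laurasiatherians": 4806,
        "primates+rodents": 3459,
        "rodents+laurasiatherians": 2258,
    },
    "animals": {
        "chordates+arthropods": 2431,
        "nematodes+arthropods": 1759,
        "chordates+nematodes": 1040,
    },
}


def best_case_fraction(topology_counts):
    """Largest share of trees that could be right if the top topology were true."""
    total = sum(topology_counts.values())
    return max(topology_counts.values()) / total


for problem, topology_counts in counts.items():
    total = sum(topology_counts.values())
    print(problem, total, round(best_case_fraction(topology_counts), 3))
```

In both cases the most frequent topology is supported by fewer than half of the trees, so more than half of the single-gene trees disagree with the true topology no matter which of the three it is.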

It is true that the three examples discussed above are inherently difficult phylogenies, but the authors indicate that they found considerable levels of topological diversity in trees of other, undisputed, phylogenies. These very instructive results should make us realize that not all single-gene trees, even those with high support, must necessarily be coincident with the real species tree. Thus, the methodological approach of the pioneering work of Penny et al. [1], which implied a certain degree of topological variation among different genes without denying the existence of a unique tree, is largely supported from this much larger analysis using the most up-to-date methods of statistical analysis.

Forensic speciation: Splicing genetic and phylogenic trees of life

Evolutionary relationships of eutherian mammals. The phylogeny was estimated using the maximum-pseudolikelihood coalescent method MP-EST with multilocus bootstrapping. The numbers on the tree indicate bootstrap support values, and nodes with bootstrap support >90% are not shown. (Inset) The eutherian phylogeny estimated using the Bayesian concatenation method implemented in MrBayes. The ML (maximum likelihood) concatenation tree built by RAxML (search algorithm for maximum likelihood) is identical to the Bayesian concatenation tree in topology. Branches of the concatenation tree are coded by the same colors as in the MP-EST tree. The blue asterisks indicate the position of Scandentia (tree shrews), Chiroptera (bats), Perissodactyla (odd-toed ungulates), and Carnivora (carnivores), whose placement differs from the coalescent tree. The Bayesian concatenation tree received a posterior probability support of 1.0 for all nodes. Copyright © PNAS, doi:10.1073/pnas.1211733109

The Tree of Life is a beautiful and elegant metaphor that has proven deceptively difficult to reconstruct. The main culprit may be the overwhelming reliance on so-called concatenation methods, which combine different genes into a single matrix and so force all genes to conform to the same topology. Since these methods do not take into account differences between alternative gene trees, they have been thought to lead to uncertainty or incongruence in the phylogenic tree of the eutherian (placental) mammals. While historically this incongruence had not previously been confirmed by empirical studies, scientists at Shenyang Normal University, Tsinghua University, University of Georgia and Harvard University have recently demonstrated that this is indeed the case – and that concatenation-derived uncertainty may be found in other clades (biological groups derived from a common ancestor) as well. Moreover, the authors suggest that such uncertainty can be resolved by augmenting phylogenomic data with coalescent methods – that is, techniques for dealing with differences in genomic ancestral trees.

The research team – Prof. Shaoyuan Wu, Prof. Sen Song, Asst. Prof. Liang Liu, and Prof. Scott V. Edwards – faced a number of complex issues in conducting their study. "To demonstrate that concatenation methods are actually underlying the controversies in the phylogeny of eutherian mammals, we need to find out what is wrong with concatenation methods," Wu says. "This is a challenging topic since concatenation methods are to date the most dominant approach in the field of phylogenetics." Wu points out that it would be difficult for people to admit that these well-established methods are the cause of controversies in phylogenetic relationships, since for a long time people have believed that controversial relationships among eutherian mammals and other clades in the Tree of Life would be resolved as more taxa – groups of one or more populations of organisms – and/or genetic data become available. "However," he notes, "the persistence of these controversies in recent concatenation studies despite the increasing sampling of taxa and genes lead us to believe that something must be wrong with concatenation methods."

Concatenation methods are based on the assumption that all genes have the same or similar phylogenies. However, in the team's mammalian data set, gene tree heterogeneity can be found everywhere. While computational simulations have predicted that ignoring gene tree heterogeneity may result in misleading phylogenies, the challenge has been how to empirically test the effect of gene tree heterogeneity on estimating phylogenies.

To address this challenge, Wu explains, the researchers designed their experiment with the innovative approach of using subsampling analysis of loci and taxa – because if gene tree heterogeneity is indeed a confounding factor, the results of the concatenation method are expected to vary according to the histories of the genes represented in a particular subsample. "The subsampling portion of our analysis confirms the prediction that concatenation methods using different subsamples of our data set often conflict with each other, even though metrics such as the bootstrap indicate strong support for each topology – but trees generated from subsamples using the coalescent method are much more topologically consistent."

In addition, he adds, they developed two techniques in this study: estimating the scale of genetic data for accurately resolving a phylogeny based on taxon sampling, and testing if the multispecies coalescent model can explain the observed gene tree data set heterogeneity.

Beyond controversies in eutherian mammal phylogeny, similar phylogenetic controversies also exist in other clades – for example, the relationships among nemerteans, annelids, and molluscs with regard to arthropods. "Because the phylogenic reconstruction in the Tree of Life has so far been mostly based on concatenation methods," Wu adds, "it's likely that concatenation methods are the major cause of phylogenetic incongruence across the Tree of Life." Wu also describes the insights gleaned from the study. Firstly, the researchers showed that using coalescent methods to deal explicitly with gene tree heterogeneity is preferable to applying concatenation methods to data sets with high gene tree heterogeneity. A second insight was that it is also critical to gather a sufficient number of loci to obtain an accurate phylogeny for mammals and other clades despite the importance of taxon sampling for phylogenetic analysis. "For example," Wu illustrates, "the intensive taxon sampling employed in recent research 1 cannot compensate for the effect of insufficient genetic sampling in their data set."

Finally, Wu notes, incomplete lineage sorting (ILS), a major source of gene tree heterogeneity, is relevant to deep-level phylogenies. "This is in contrast to the conventional assumption that ILS is only relevant to recent radiations," he stresses. "ILS is prevalent in coding sequences, which is in contrast to recent suggestion that coding sequences may be less subject to ILS than noncoding sequences due to frequent selective sweeps, which tend to remove ILS."

Wu expands on the paper's key conclusion – namely, that such incongruence can be resolved using phylogenomic data and coalescent methods that deal explicitly with gene tree heterogeneity. "The prevalence of gene tree heterogeneity in genomic data indicates that a good phylogenetic method should take this complexity into account when inferring species phylogenies," he points out. "It's clear that concatenation methods, which assume gene tree homogeneity, do not fit the complexity of phylogenetic reality – that is, that gene tree heterogeneity is common among all genes and taxa. In contrast, the multispecies coalescent model can explain 77% of gene tree heterogeneity observed in the mammal data set, indicating that the coalescent approach indeed gives a better picture of complex phylogenetic reality when gene tree heterogeneity is prevalent in the data sets."

Delving deeper, Wu notes that the erratic behavior of concatenation methods confirms that concatenation methods are not suitable for genomic data, which possess substantial levels of gene tree heterogeneity. "The robustness of coalescent methods to variable gene and taxon sampling demonstrates that coalescent methods are superior to concatenation methods in building species phylogenies based on phylogenomic data by accommodating gene tree heterogeneity – and the data suggests controversial relationships in the Tree of Life can be resolved as more data are collected. In other words, resolving the phylogeny of eutherian mammals and other clades in the Tree of Life will require a large amount of data at genomic scale."

To extend the current study, the scientists' next research step is to assess the suitability of tree-building models for different types of genomic data, and to examine how different characteristics of genomic data would affect the performance of tree-building methods. Moreover, the paper has implications for other areas of research as well. "Besides the field of evolutionary biology," Wu concludes, "a well-resolved phylogeny has important applications in the studies of comparative genomics and biomedical sciences. The major contribution of this study is to provide an example and a roadmap to help researchers to build accurate phylogenies using genomic data, which will certainly benefit studies in these areas."

1 Related: Impacts of the Cretaceous Terrestrial Revolution and KPg Extinction on Mammal Diversification, Science 28 October 2011: Vol. 334 no. 6055 pp. 521-524, doi:10.1126/science.1211028


A species phylogeny is a graphical model of the common evolutionary history of a group of species, and is most often represented as a phylogenetic tree or phylogenetic network [1]. A species phylogeny gives valuable information about protein functions [2–4], host-parasite relationships [5], etc.

However, species tree estimation is difficult, due to multiple biological processes, including recombination [6], duplication and loss [7], hybridization [8], incomplete lineage sorting (ILS) [9], and horizontal gene transfer (HGT) [10], that can cause a given genomic locus to have a tree that is different from the species tree. As a result, multiple loci are needed to estimate a species phylogeny with high accuracy.

Of the many sources of gene tree discord, the one that has received the greatest attention is ILS, which is modeled by the multi-species coalescent (MSC) model [11]. An MSC model tree has a rooted tree T, leaf-labelled by a set of species, and is given with branch lengths in coalescent units. Gene trees evolve within the species tree, in a backwards process described by the MSC; thus, lineages "coalesce" on the branches of the tree, as they move from the leaves of the species tree towards the root. When two lineages fail to coalesce on the earliest branch in which they can coalesce, this can result in a gene tree having a different topology than the species tree.
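To make the failure-to-coalesce mechanism concrete, here is a minimal Python sketch for the simplest case: a rooted three-species tree ((A,B),C) with one gene copy sampled per species and an internal branch of length t in coalescent units. It relies on the standard textbook result for this case, namely that the gene tree matches the species tree with probability 1 - (2/3)e^(-t). The function names are illustrative and are not taken from any of the cited software.

```python
import math
import random


def concordance_probability(t):
    """Analytical probability that a gene tree matches a 3-species tree
    whose internal branch has length t (coalescent units)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_concordance(t, n_loci, seed=0):
    """Monte Carlo estimate of the same probability, one gene copy per species."""
    rng = random.Random(seed)
    concordant = 0
    for _ in range(n_loci):
        # Lineages from A and B enter the internal branch; with two lineages
        # the waiting time to coalescence is exponential with rate 1.
        if rng.expovariate(1.0) < t:
            concordant += 1  # A and B coalesce first: gene tree matches
        else:
            # Incomplete lineage sorting: all three lineages reach the root
            # population, where each of the three pairs is equally likely
            # to coalesce first. Only the (A,B) pair gives a matching tree.
            if rng.randrange(3) == 0:
                concordant += 1
    return concordant / n_loci


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 3.0):
        print(t, round(concordance_probability(t), 3), round(simulate_concordance(t, 50000), 3))
```

Short internal branches push the concordance probability towards 1/3, which is the situation in which single-locus trees are least informative about the species tree.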

Under the MSC model, each species tree defines a probability distribution on gene trees, and the species tree can be identified uniquely from this distribution. Hence, one type of technique (called a "summary method") for estimating species trees under the MSC operates by first estimating gene trees for a set of different loci, and then uses this estimated distribution on gene trees to estimate the species tree. A summary method is said to be statistically consistent under the MSC model if, as the number of loci and sites per locus go to infinity, the estimated species tree returned by the method will converge in probability to the true species tree [12]. Many statistically consistent summary methods have been developed for estimating species trees when gene discordance is due to ILS [13–19].

Despite advances in developing statistically consistent methods for species tree estimation that are robust to ILS, by far the most common technique for estimating a species tree is concatenation analysis, in which the sequence alignments for the different loci are combined into one large supermatrix, and then a phylogeny is estimated on the alignment using maximum likelihood [20, 21]. This type of approach, however, is sometimes not statistically consistent under the multi-species coalescent model [22, 12] in the presence of ILS. Hence, even though concatenation often has good accuracy (even under conditions with moderately high ILS levels) [23–25], a large effort has been made to develop alternative methods that are provably robust to ILS and have good accuracy on realistic conditions.
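The supermatrix step of a concatenation analysis can be sketched in a few lines of Python. This is only an illustration of the data structure, not the pipeline used in any cited study: the per-locus alignments are assumed to be dictionaries mapping taxon name to aligned sequence, and padding a missing taxon with gap characters is one common convention that is assumed here.

```python
def concatenate_alignments(alignments):
    """Join per-locus alignments (dict: taxon -> aligned sequence) into one supermatrix.

    Taxa absent from a locus are padded with '-' for that locus (an assumption
    of this sketch).
    """
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    supermatrix = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        locus_length = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon].append(aln.get(taxon, "-" * locus_length))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}


if __name__ == "__main__":
    locus1 = {"human": "ACGT", "mouse": "ACGA"}
    locus2 = {"human": "GG", "dog": "GT"}
    for taxon, seq in concatenate_alignments([locus1, locus2]).items():
        print(taxon, seq)
```

A single tree estimated from the resulting matrix necessarily assigns one topology to all loci, which is the assumption that ILS can violate.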

For very small datasets, Bayesian methods such as BEST [26], *BEAST [27] or BUCKy-pop [28] (the population tree from BUCKy) can provide excellent accuracy; however, these methods are too computationally intensive to use on even moderate sized datasets with hundreds to thousands of loci and 30 or more species [29, 30].

Of the currently available coalescent-based methods, ASTRAL-2 [19], MP-EST [13], and NJst [17] have emerged as the most accurate of the methods that can run on datasets with 50 or more species and hundreds to thousands of loci. However, the comparison among these methods shows that MP-EST is typically not as accurate as NJst and ASTRAL-2 and is also much slower than both [19]. Some newer statistically consistent methods have also been developed (e.g., SVDquartets [31]), but have not yet been sufficiently evaluated in terms of their accuracy and scalability in comparison to other coalescent-based methods.

Some of the most commonly used coalescent-based methods estimate species trees by encoding each gene tree as a set of quartet trees (i.e., unrooted 4-leaf trees), and then estimate the species tree from the quartet tree frequencies. The mathematical basis of this approach is the following theorem, originally proved in [32]:

Theorem 1 Under the multi-species coalescent model, for every model species tree (T, θ) (where θ denotes the branch lengths of T in coalescent units) and for every set X of four leaves from T, the most probable unrooted gene tree topology on X is identical to the species tree T restricted to the leafset X.

Interestingly, nearly the same theorem was proven under two phylogenomic models that addressed horizontal gene transfer (HGT)! When HGT is present, the evolutionary history of the species is not really treelike, but rather requires a phylogenetic network [1]. Under HGT models, a phylogenetic network consists of an underlying species tree T with horizontal gene transfer edges (represented by directed edges) between branches in the tree, and each locus evolves down a tree (though not necessarily the species tree) within this network. Hence, while the species evolution is not purely treelike, the gene tree evolution is treelike. Furthermore, for this type of reticulate phylogeny, it is reasonable to ask whether the underlying species tree T can be reconstructed from gene trees estimated on the different loci.

This question has been partially answered for two models of HGT. The first models HGT events between lineages using a continuous-time Poisson process [33], and is called the stochastic HGT model. In a stochastic HGT model, the HGT events happen between contemporaneous lineages, either uniformly at random or with probability that depends on the distance between the lineages (so that events are less likely if the lineages are more distantly related). The second type of model assumes that there are HGT edges between specific pairs of branches in a species tree, commonly referred to as highways, along which HGT events are far more likely to occur than elsewhere in the tree; this is called the highways HGT model [34].

The theoretical framework for estimating the underlying species tree under these two HGT models was established in [35] (for estimating rooted species trees from rooted gene trees) and in [36] (for estimating unrooted species trees from unrooted gene trees). Specifically, [36] proved theorems that under both the stochastic HGT model and highways model, but with bounded amounts of HGT per gene, the most probable quartet tree would be topologically identical to the species tree. Note that these theorems are the equivalents of Theorem 1 under the two bounded HGT models.

Some species tree estimation methods operate by computing gene trees, encoding each computed gene tree as a set of quartet trees, and determining the dominant quartet tree for every four species (i.e., the quartet tree that appears the most frequently of the three possible unrooted quartet trees). Then, these dominant quartet trees are combined using a quartet amalgamation method (e.g., Quartets Max Cut [37] or QFM [38]). This type of species tree estimation method can be statistically consistent under the MSC model, and also under these bounded HGT models - depending on the quartet amalgamation method, as we now show.
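The first two steps of such a pipeline (encoding gene trees as quartets and picking the dominant quartet for each set of four species) can be sketched as follows. This is a minimal illustration, not the code of ASTRAL, Quartets Max Cut or QFM; it assumes gene trees are given as rooted binary nested tuples over the same taxon set, and all function names are hypothetical. The amalgamation step itself is not shown.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of internal-node clades) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_topology(tree_clades, four):
    """Unrooted topology induced on four taxa, as a pair of pairs (e.g. AB|CD)."""
    four = frozenset(four)
    for clade in tree_clades:
        inside = clade & four
        if len(inside) == 2:  # this clade separates two of the taxa from the other two
            return frozenset([inside, four - inside])
    return None  # unresolved for these four taxa


def dominant_quartets(gene_trees):
    """Most frequent quartet topology for every set of four taxa."""
    per_tree = [clades(tree)[1] for tree in gene_trees]
    taxa = sorted(clades(gene_trees[0])[0])  # assumes all trees share these taxa
    dominant = {}
    for four in combinations(taxa, 4):
        counts = Counter()
        for tree_clades in per_tree:
            topology = quartet_topology(tree_clades, four)
            if topology is not None:
                counts[topology] += 1
        if counts:
            dominant[four] = counts.most_common(1)[0][0]
    return dominant
```

For example, given two gene trees with topology ((A,B),(C,D)) and one with ((A,C),(B,D)), the dominant quartet on {A, B, C, D} is AB|CD. Ties are broken arbitrarily in this sketch.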

Theorem 2 Let M be a summary method (i.e., a method that constructs a species tree from an input set of gene trees). Suppose that M has the property that it is guaranteed to return the unique tree compatible with the dominant quartet trees defined by its input set of gene trees, whenever the dominant quartet trees are compatible. Then M is statistically consistent under the MSC model, and also under the bounded HGT models given in [36].

Proof To establish statistical consistency, we only need to prove that as the number of sites per locus and the number of loci both increase, the tree returned by the method converges in probability to the species tree. As the number of sites per locus and the number of loci both increase, the dominant quartet tree converges to the most probable quartet tree on every set X of four species. Under the MSC model and also under the bounded HGT models in [36], the most probable quartet tree on any set X is topologically identical to the species tree. Hence, for a large enough number of loci and large enough number of sites per locus, with probability converging to 1, the input to the quartet-based methods will be a set of gene trees such that the dominant quartet trees are all compatible with the species tree. Furthermore, the species tree will be the unique such compatibility tree, and so the method will return the true species tree.

Similarly, we can prove the following:

Theorem 3 ASTRAL and ASTRAL-2 are statistically consistent under the bounded HGT models of [36].

This proof uses Theorem 1, but is essentially identical to the proofs of statistical consistency for ASTRAL and ASTRAL-2 under the MSC model [19]; see Methods for the proof of this theorem.

Very little is known about the theoretical guarantees of any species tree estimation methods under models in which both HGT and ILS can occur. In fact, to the best of our knowledge, no methods have yet been proven statistically consistent under these conditions. We also do not know much about the empirical performance of any species tree estimation methods under these conditions. As far as we know, the only simulation study to date of the impact of both ILS and HGT on the performance of species tree estimation methods is [39], which explored the performance of two coalescent-based methods, BUCKy and BEST, on data that evolved under both processes. However, both of these methods are computationally intensive, and cannot run on even moderately large datasets (e.g., BEST is slower than *BEAST, and *BEAST is too computationally intensive to use on datasets with more than about 100 loci) [30, 29].

We report on a study evaluating the accuracy of ASTRAL-2, NJst, and weighted Quartets Max Cut (wQMC) [40], as well as unpartitioned maximum likelihood concatenation analysis (CA-ML), on simulated datasets in which gene tree discord is due to both HGT and ILS. The simulation protocol evolved gene trees down 50-taxon species trees under the MSC model with a moderately high level of ILS, and allowed gene trees to then evolve with six different HGT rates (see Figure 1). HGT rate (1) has no HGT events, and HGT rates (2)-(6) have 0.08, 0.2, 0.8, 8.0, and expected HGT events per gene, respectively. Finally, sequences evolved down each gene tree under the GTR+Gamma model.

Properties of the simulated datasets. (Top) The histogram of the number of transfer events per gene across all 50,000 gene trees (50 replicates, each with 1000 genes) for all six model conditions. Note that the tree has only 51 species (50 ingroup species and one outgroup species), and therefore, model conditions (5) and (6) constitute high numbers of transfers per gene. (Bottom) The normalized Robinson-Foulds (bipartition) distance between the true gene trees and the species tree for all six model conditions. Note that the gene tree discordance generally increases as the transfer rate increases, but also that model condition (3) has less discordance than model condition (2) despite having a slightly higher number of transfers.
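For readers unfamiliar with the discordance measure used in the bottom panel, the following Python sketch computes a normalized Robinson-Foulds distance between two trees. It is an illustration under stated assumptions, not the code used in the study: trees are binary, share the same taxa, and are written as nested tuples; the distance is the number of bipartitions found in exactly one tree, divided by 2(n - 3) for n taxa.

```python
def bipartitions(tree):
    """Return (taxon set, set of non-trivial bipartitions) for a nested-tuple tree.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so rooting does not affect the result.
    """
    def collect(node, out):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset()
        for child in node:
            leaves |= collect(child, out)
        out.append(leaves)
        return leaves

    found = []
    taxa = collect(tree, found)
    reference = min(taxa)
    splits = set()
    for clade in found:
        side = taxa - clade if reference in clade else clade
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits
            splits.add(side)
    return taxa, splits


def normalized_rf(tree_a, tree_b):
    """Normalized Robinson-Foulds distance for two binary trees on the same taxa."""
    taxa_a, splits_a = bipartitions(tree_a)
    taxa_b, splits_b = bipartitions(tree_b)
    if taxa_a != taxa_b:
        raise ValueError("trees must have the same taxon set")
    if len(taxa_a) < 4:
        raise ValueError("at least four taxa are needed")
    return len(splits_a ^ splits_b) / (2 * (len(taxa_a) - 3))


if __name__ == "__main__":
    species_tree = ((("A", "B"), "C"), ("D", "E"))
    gene_tree = ((("A", "C"), "B"), ("D", "E"))
    print(normalized_rf(species_tree, gene_tree))
```

A value of 0 means the gene tree and species tree have identical topologies, and a value of 1 means they share no non-trivial bipartition.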

We estimated gene trees on each locus using the FastTree-2 maximum likelihood software [41], and then used the summary methods on these estimated gene trees to estimate the species tree. We also concatenated the sequence alignments and ran unpartitioned FastTree-2 maximum likelihood on the concatenated super-alignment. Finally, we analyzed a Cyanobacteria dataset with 11 species and 1128 genes [42], which is believed to have evolved under high levels of HGT and has been used to evaluate methods for inferring species trees in the presence of HGT [43, 40]. See Methods for additional details.

Minimum species removal inference and reconciliation

By linking the species tree inference problem to a supertree problem we have been able to prove that deciding whether a gene tree T is an MD-tree can be done in polynomial time [8]. We used a constructive proof based on a min-cut strategy, which has been largely considered in the context of supertrees [25–27]. In this section, we develop a greedy heuristic for MINSRI based on a minimum vertex cut strategy.

Let $F = \langle T_1, T_2, \ldots, T_f \rangle$ be a forest of gene trees on a genome set $G$. Define $\mathit{level}_0(F)$ to be the set of highest (i.e., closest to the root) vertices of all the $T_i$ that are not AD-vertices. $\mathit{level}_j(F)$ is then the set of vertices of all the $T_i$ that are the closest non-AD descendants of the vertices of $\mathit{level}_{j-1}(F)$. For a given level $j$, forest $F$, and vertex $x \in \mathit{level}_j(F)$, consider the bipartition $B(x) = (L(x_l), L(x_r))$. Then $G_j = (V, E)$ is the corresponding hypergraph [28], where $V = G$ and $L(x_l), L(x_r) \in E$ for $x \in \mathit{level}_j(F)$.

In order for $F$ to be an MD-forest, all the vertices of $\mathit{level}_j(F)$, for any $j$, should represent speciation vertices with respect to some species tree $S$ (as otherwise they would represent additional non-apparent duplication vertices, preventing the forest from being an MD-forest). In other words, the bipartitions $B(x)$ for all $x \in \mathit{level}_0(F)$ should reveal a first speciation event, which is possible if and only if the graph $G_0$ contains at least two connected components. Indeed, in this case, for any species tree $S$ with a root $r$ splitting $G$ into two disconnected subsets, all the vertices of $\mathit{level}_0(F)$ would be speciation vertices. Conversely, if $G_0$ contains a single connected component, then for any species tree $S$, at least one node of $\mathit{level}_0(F)$ would be a NAD node. The same reasoning applies to any $\mathit{level}_j(F)$ and $G_j$.
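The connectivity test at the heart of this argument can be sketched in Python. The snippet is an illustration only: it assumes the hyperedges of a level (the leaf sets L(x_l) and L(x_r) of each vertex x) have already been extracted as sets of genome names, and it uses a simple union-find structure; the function name is hypothetical and does not come from the paper.

```python
def connected_components(genomes, hyperedges):
    """Connected components of a hypergraph whose vertices are genomes.

    Each hyperedge is a set of genomes; all genomes in one hyperedge end up
    in the same component.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent pointers to the representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for edge in hyperedges:
        members = list(edge)
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), set()).add(g)
    return list(groups.values())


if __name__ == "__main__":
    genomes = {"a", "b", "c", "d"}
    # One level-0 vertex x with L(x_l) = {a, b} and L(x_r) = {c, d}.
    level0_edges = [{"a", "b"}, {"c", "d"}]
    parts = connected_components(genomes, level0_edges)
    print(len(parts) >= 2)  # True: a first speciation separating {a, b} from {c, d} is possible
```

If a further vertex contributed the hyperedge {b, c}, the hypergraph would collapse into a single component and no species tree could make every vertex of that level a speciation.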

On the other hand, if $G_j$ is connected for some $\mathit{level}_j(F)$, there exists no species tree such that all $x \in \mathit{level}_j(F)$ represent speciation events. In this case, some number of species must be removed to make $G_j$ disconnected. This corresponds exactly to a vertex cut in $G_j$. These observations lead to the following heuristic for the MINSRI problem.