

Title: Estimating ultra-large phylogenies and alignments
Abstract:
Biomolecular sequences evolve under processes that include substitutions, insertions and deletions (indels), as well as other events, such as duplications. The estimation of evolutionary history from sequences is then used to answer fundamental questions about biology, and also has applications in a wide range of biomedical research.
From a computational perspective, however, phylogenetic (evolutionary) tree estimation is enormously hard: all favored approaches are NP-hard, and even the best heuristics can take months or years on only moderately large datasets. Furthermore, while there are very good heuristics for estimating trees from sequences that are already placed in a multiple alignment (a step that is used when sequences evolve with indels), errors in alignment estimation produce errors in tree estimation, and the standard alignment estimation methods fail to produce highly accurate alignments on large highly divergent datasets. Thus, the estimation of highly accurate phylogenetic trees from large datasets of unaligned sequences is beyond the scope of standard methods.
In this talk, I will describe new algorithmic tools that my group has developed, and which make it possible, for the first time, to obtain highly accurate estimates of trees from very large datasets, even when the sequences have evolved under high rates of substitution and indels. In particular, I will describe SAT´e (Liu et al. 2009, Science Vol 324, no. 5934). SAT´e simultaneously estimates a tree and alignment; our study shows that SAT´e is shows that SAT´e is very fast, and produces dramatically more accurate trees and alignments than competing methods, even on datasets with 1000 taxa and high rates of indels and substitutions. I will also describe our new method, DACTAL (not yet submitted). DACTAL stands for “Divide-and-Conquer Trees without Alignments”, and uses an iterative procedure combined with a novel divide-and-conquer strategy to estimate trees from unaligned sequences. Our study, using both real and simulated data, shows that DACTAL produces trees of higher accuracy than SAT´e, and does so without ever constructing an alignment on the entire set of sequences. Furthermore, DACTAL is extremely fast, producing highly accurate estimates of datasets in a few days that take many other methods years. Time permitting, I will show how DACTAL can be used to improve the speed and accuracy of other phylogeny reconstruction methods, and in particular in the context of phylogenetic analyses of whole genomes.