Bayesian phylogenetics in the densely sampled regime

Statistical phylogenetic (evolutionary tree) methods have been essential for understanding the SARS-CoV-2 epidemic, whether the question concerns the origins, global spread, or lineage dynamics of the virus. These methods are extremely mature, with optimized software packages implementing complex models. However, they were developed with the “classical” sampling regime in mind: a relatively small number of sequences with relatively large divergences between them.

Methods for the classical sampling regime work to integrate out the uncertainty we have in ancestral sequences. Although the Felsenstein algorithm does allow for efficient calculation and updating of phylogenetic likelihoods, even this is not enough to handle the massive trees we would like to use for SARS-CoV-2. Furthermore, the Felsenstein algorithm only applies to models that treat sites as independent and identically distributed (IID), which rules out models with dependence between sites.
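To make the Felsenstein algorithm concrete, here is a minimal sketch in Python. It assumes the Jukes-Cantor (JC69) substitution model and a toy nested-list tree encoding, both chosen only for illustration; the names `jc69_prob`, `partial_likelihoods`, and `log_likelihood` are hypothetical and do not come from any phylogenetics package. The per-site loop with a sum of logs is exactly where the IID-across-sites assumption enters.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of base i becoming base j over a branch of
    length t (expected substitutions per site). Illustrative model choice."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node, site):
    """Return P(observed data below node | node has base b) for each b in BASES.

    Toy encoding (an assumption of this sketch): a leaf is a sequence string,
    an internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # A leaf is observed, so only its actual base has nonzero likelihood.
        return [1.0 if node[site] == b else 0.0 for b in BASES]
    partials = [1.0] * len(BASES)
    for child, t in node:
        child_partials = partial_likelihoods(child, site)
        for a_idx, a in enumerate(BASES):
            # Sum over the unobserved state at the child end of the branch.
            partials[a_idx] *= sum(
                jc69_prob(a, b, t) * child_partials[b_idx]
                for b_idx, b in enumerate(BASES)
            )
    return partials

def log_likelihood(tree, n_sites):
    """Sum per-site log likelihoods; the sum relies on sites being IID."""
    total = 0.0
    for site in range(n_sites):
        root = partial_likelihoods(tree, site)
        # JC69 has a uniform stationary distribution over the four bases.
        total += math.log(sum(0.25 * p for p in root))
    return total

# Usage: a three-tip tree with made-up sequences and branch lengths.
tree = [([("ACGT", 0.01), ("ACGA", 0.02)], 0.05), ("ACGT", 0.03)]
print(log_likelihood(tree, n_sites=4))
```

Each call visits every node once per site, so the cost grows with (number of nodes) × (number of sites), and it must be repeated whenever the tree changes during sampling. That is manageable for hundreds of sequences and becomes the bottleneck for trees with millions of tips.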

With SARS-CoV-2 we are in a completely different sampling regime, with over 2 million genomes for a virus with little evolutionary divergence. This means that we frequently sample identical viruses, and we often sequence the direct ancestor of a given virus. This greatly limits the uncertainty that we have in the ancestral states of the genome. However, the transmission history is quite uncertain, motivating a Bayesian approach.

There are some interesting opportunities in this new regime. For example, du Plessis, McCrone, Zarebski, Hill, Ruis, et al. (2021) replace the classical phylogenetic likelihood with a proxy that counts the number of substitutions that could have happened along a branch. This reduces computation time by orders of magnitude, and allows the model to focus on the important aspects of uncertainty: how the virus spread between individuals.
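As a rough illustration of why a count-based proxy is so much cheaper, the sketch below scores a branch with a Poisson probability on its substitution count. This is not the likelihood used by du Plessis et al.; it is a simple stand-in, and the function names, the default rate of 8e-4 substitutions per site per year, and the genome length of 29,903 sites are illustrative assumptions.

```python
import math

def poisson_log_pmf(k, mean):
    """Log probability of observing k events under a Poisson with the given mean."""
    if mean == 0.0:
        return 0.0 if k == 0 else -math.inf
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def branch_log_proxy(n_subs, duration_years,
                     rate_per_site_per_year=8e-4,  # illustrative assumption
                     genome_length=29903):         # illustrative assumption
    """Score a branch by its substitution count alone (hypothetical proxy).

    The expected count is rate x genome length x branch duration, so the
    cost per branch is constant instead of scaling with genome length.
    """
    mean = rate_per_site_per_year * genome_length * duration_years
    return poisson_log_pmf(n_subs, mean)

# Usage: compare two candidate durations for a branch carrying 2 substitutions.
for weeks in (1, 6):
    print(weeks, branch_log_proxy(2, duration_years=weeks / 52.0))
```

A score like this needs only one integer per branch, so no sum over ancestral states is required. That shortcut is only reasonable when ancestral sequences are nearly certain, which is the situation the dense sampling regime provides.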

I think that there are many more opportunities in this new regime, including richer substitution models (think whole-sequence modeling), online inference (i.e., incremental updating as new sequences arrive), integration with epidemiological models, and new inference algorithms (it’s not going to be MCMC).

There are other settings that we care about for which we have dense sampling, and for which complex sequence substitution models are quite important. Specifically, I’m thinking about the evolution of antibodies that happens inside of our lymph nodes when we are vaccinated or infected. Our collaborator Gabriel Victora and his lab sample these evolutionary histories in great depth. We are also very interested in the interplay between sequence and fitness.

It’s still early stages, but thus far it looks like this will become a collaborative project with:

  • Trevor Bedford, an evolutionary biologist and genomic epidemiologist known for his co-development of the Nextstrain platform
  • JT McCrone, a postdoc in the Rambaut group working on scaling Bayesian phylogenetics for SARS-CoV-2
  • Vladimir Minin, a leading statistician especially known for his work in “phylodynamics”: the intersection of phylogenetics, immunology, and epidemiology
  • Marc Suchard, another leading statistician working on phylodynamics, who has developed much of the statistical framework for complex data integration in BEAST, as well as high performance algorithms

and hopefully many others in the phylogenetics community.
Fred Hutchinson Cancer Research Center
Closing date: October 23rd, 2021
Posted on: August 27th, 2021 23:47
Last updated: August 27th, 2021 23:47