The Tree of Life: Story behind the paper: Backbones of evolutionary history test biodiversity theory for microbes

This is a guest post in my series "The Story Behind the Paper". The post is by James O'Dwyer about his paper (coauthored with Steven Kembel and Tom Sharpton) in PNAS entitled "Backbones of evolutionary history test biodiversity theory for microbes".

Backbones of evolutionary history test biodiversity theory for microbes

Prehistory

This paper has its roots going back a few years, and it all started off fairly innocuously. A previous collaboration with Steve Kembel and Jessica Green resulted in this earlier paper, where we had the lofty goal of encouraging microbial ecologists to throw out slightly less data, and which also attracted Jonathan’s attention for our microbiome figures. One of the central questions in ecology is to explain and understand patterns of biodiversity: for example, by quantifying the diversity of a local community (“alpha” diversity), or similarity between multiple local communities (“beta” diversity). In microbial ecology it is common to use evolutionary history to quantify these measures. But both phylogenetic alpha and beta diversity tend to change systematically with increasing sample size, making it difficult to compare results for samples of different sizes.

Our idea in the earlier paper was to generate a fast way to compute a null prediction for these phylogenetic alpha and beta diversity metrics. This would provide a way to standardize the results for sample size, so that we could use full samples rather than smaller, rarefied samples. The solution is relatively simple, and involves a phylogenetic analogue of the Species Abundance Distribution (SAD), which we called the Edge-length Abundance Distribution (EAD). In comparison with the SAD, this distribution replaces species units with subclades of a phylogenetic tree, replaces species abundances with subclade size, and inserts branch length weightings in a specific way.
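To make the EAD idea concrete, here is a minimal sketch rather than the code used in the paper. It assumes one plausible reading of the definition: for each edge of the tree, count the sampled individuals descending from it (the subclade size k), and add that edge's branch length to the total for k. The tree format (nested `(name, branch_length, children)` tuples) and the function name `edge_length_abundance` are assumptions made for illustration.

```python
from collections import defaultdict

def edge_length_abundance(tree, counts):
    """Return {k: total branch length of edges with exactly k sampled descendants}.

    tree   -- nested tuples (name, branch_length, children); leaves have children == []
    counts -- dict mapping leaf name to its abundance in the sample
    (Both formats are assumptions made for this sketch.)
    """
    ead = defaultdict(float)

    def visit(node):
        name, length, children = node
        if children:
            k = sum(visit(child) for child in children)  # subclade size
        else:
            k = counts.get(name, 0)                      # leaf abundance
        if k > 0:                                        # skip unsampled edges
            ead[k] += length
        return k

    visit(tree)
    return dict(ead)

# Toy tree, purely illustrative: ((a:0.5, b:0.5):1.0, c:1.5)
toy_tree = ("root", 0.0, [
    ("A", 1.0, [("a", 0.5, []), ("b", 0.5, [])]),
    ("c", 1.5, []),
])
toy_ead = edge_length_abundance(toy_tree, {"a": 3, "b": 1, "c": 2})
```

Under this reading, summing the EAD over all k gives the total branch length spanned by the sample, which is how it connects to phylogenetic alpha diversity. The recursion is fine for a toy tree; a tree with many thousands of tips would likely need an iterative traversal.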

The present day

Job done. So how did this lead to a new paper? Well, this first study generated something slightly mysterious to us. In theory, the EADs we computed from empirical data could have taken any form they wanted to—and yet for various microbiome habitats, they all seemed to display a very distinct power law scaling. Translated into a more concrete consequence, the form of the EAD was such that phylogenetic diversity typically increased as a power law function of sample size. There’s a history in ecology of looking for (and sometimes finding) behavior that both takes on a power law scaling, and is also universal across multiple systems, fitting with a general sense that some patterns may be emergent and independent of much of the underlying variation between communities. There’s also a history of looking for (and sometimes finding) power law scaling in evolutionary trees, for example in the number of species per genus, which has often been claimed as a power law. Here we had found a link with these older ideas, with a nice combination of new factors. First, we weren’t relying on human definitions of species, which could certainly be biased towards generating power law scaling artificially (e.g., the principle of balance). Second, we had large numbers, so that these scaling behaviors spread over multiple orders of magnitude. Third, there was an untapped world of microbial sequence data to look at to see whether these patterns extended into microbiology.
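As a rough illustration of what "power law scaling with sample size" means in practice, the sketch below estimates an exponent from the slope of a straight-line fit on log-log axes. The numbers are synthetic and chosen only to show the mechanics; they are not data from the paper, and the function name `loglog_slope` is hypothetical.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic example only: diversity generated as exactly 2 * n**0.6
sample_sizes = [10, 100, 1000, 10000]
diversity = [2.0 * n ** 0.6 for n in sample_sizes]
exponent, log_prefactor = loglog_slope(sample_sizes, diversity)
```

A straight-line fit like this is a quick diagnostic rather than a rigorous test: many non-power-law curves also look roughly linear on log-log axes over a narrow range, which is one reason scaling over multiple orders of magnitude matters.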

With Tom and Steve, we combined these ideas to set up the empirical side of this new paper: expand the original study across a broader range of habitats, test whether the patterns are robust to different alignment and inference methods, and see whether the same scaling behavior holds up for this new range of samples. Which indeed it did: Figures 1 and 3 in the new paper show that this power law scaling is present across multiple microbial habitats.

Just knowing that this distribution takes a power law form is already useful on its own, because (again) it defines the null expectations for the way phylogenetic alpha and beta diversity change with sample size. But these results still left a number of open questions, centering around whether this could also give us some insight into which models of biodiversity could be consistent with what we were seeing. Could these scaling patterns provide evidence for whether a given ecological or evolutionary scenario had strongly influenced a community?

Coarse-graining: reducing the resolution of phylogenetic trees

The first modeling approach we considered is neutral theory. Neutral models have provided the basic null models in fields stretching from population genetics and ecology to cultural evolution and the social sciences. What they have in common is the key assumption that selective differences are irrelevant for predicting large-scale patterns. If the power law scaling is just an inevitability, an ecological version of Benford's law, it seemed likely that it might simply be a consequence of neutrality, with all of the variation and mechanism somehow washing out. Is it possible that these observed phylogenetic patterns are driven by this most basic, neutral model of biodiversity? The answer turns out to be no: at least using the vanilla version of the neutral theory, we don't reproduce these scaling behaviors.

Next, we got a little creative. When working with trees generated by neutral processes, we were thinking of the Kingman coalescent, i.e. a model of tree structure that works backwards in time, coalescing pairs of lineages at each node. There's a one-parameter family of coalescent models generalizing the Kingman coalescent, with the unifying feature that more than two lineages can coalesce at each node. Viewed forward in time, one lineage can burst into many. This generalized family, the Lambda-coalescent, produces precisely the power law EAD (known in that context as a site-frequency spectrum) we were looking for.
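For readers who have not met the coalescent before, here is a minimal sketch of the Kingman (pairwise-merging) case only, not of the Lambda-coalescent and not the simulation code from the paper. It works backwards in time from n sampled lineages, merges a random pair after an exponentially distributed wait, and records how much branch length subtends exactly i samples, which is the branch-length version of the site-frequency spectrum. The function names and the time units (each pair of lineages merges at rate 1) are assumptions for this illustration.

```python
import random
from collections import defaultdict

def kingman_branch_spectrum(n, rng):
    """Simulate one Kingman coalescent tree for n sampled lineages.

    Returns {i: total branch length subtending exactly i samples}.
    Time is measured so that each pair of lineages merges at rate 1.
    """
    sizes = [1] * n                    # samples below each active lineage
    spectrum = defaultdict(float)
    while len(sizes) > 1:
        k = len(sizes)
        wait = rng.expovariate(k * (k - 1) / 2)   # time to the next merger
        for s in sizes:                # every active lineage grows by `wait`
            spectrum[s] += wait
        i, j = rng.sample(range(k), 2)            # the pair that coalesces
        merged = sizes[i] + sizes[j]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (i, j)]
        sizes.append(merged)
    return dict(spectrum)

def mean_spectrum(n, replicates, seed=0):
    """Average the branch-length spectrum over independent simulated trees."""
    rng = random.Random(seed)
    total = defaultdict(float)
    for _ in range(replicates):
        for i, length in kingman_branch_spectrum(n, rng).items():
            total[i] += length
    return {i: total[i] / replicates for i in sorted(total)}

baseline = mean_spectrum(n=10, replicates=2000, seed=1)
```

In these time units the standard expectation for the Kingman coalescent is 2/i for the branch length subtending i samples, so the averaged output should sit close to that curve. This gives a binary-tree baseline against which the scaling of a multiple-merger model, or of an empirical EAD, can be compared.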

These generalized coalescent trees have previously been used to understand population processes with a skewed offspring distribution, where there is a significant probability that an organism has a large number of offspring, and this matches the idea of multiple lineages coalescing. But for our evolutionary trees that idea of instantaneous, multiple branching seemed unlikely. At a fine-grained level, branches in our evolutionary trees ought to split into two, driven by cell division and subsequent diversification. This is also what our tree inference algorithms are designed to find, even when our sequence data likely isn't sufficient to resolve all polytomies. So how could these generalized coalescent trees possibly be consistent with our empirical trees?

Instead of trying to resolve as many polytomies as possible, we decided to go in the other direction. We imagined reducing the resolution at which we could distinguish the order of branching events. Applying this "coarse-graining", we would certainly generate polytomies, as fast bursts of branching and multiple nodes collapse. Still, much like the EAD, there was no guarantee for what the distribution of polytomy sizes would be after this coarse-graining, or whether it would match these theoretical models. So our second surprise is that the distribution of burst sizes is also a power law, qualitatively consistent with the same distribution in the Lambda-coalescent.
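One simple way to picture coarse-graining is sketched below. It assumes a specific rule that may differ from the procedure in the paper: any internal edge shorter than a resolution `epsilon` is collapsed, so that its children attach directly to its parent and form a polytomy. The tree format (nested `(name, branch_length, children)` tuples) and the function names are hypothetical.

```python
from collections import Counter

def coarse_grain(node, epsilon):
    """Collapse internal edges shorter than epsilon (an assumed rule).

    node is (name, branch_length, children); leaves have children == [].
    Root-to-tip distances are preserved by adding the collapsed edge's
    length onto the edges of the promoted children.
    """
    name, length, children = node
    new_children = []
    for child in children:
        c_name, c_length, c_children = coarse_grain(child, epsilon)
        if c_children and c_length < epsilon:
            # Short internal edge: attach its children to the current node.
            new_children.extend(
                (g_name, g_length + c_length, g_children)
                for g_name, g_length, g_children in c_children
            )
        else:
            new_children.append((c_name, c_length, c_children))
    return (name, length, new_children)

def burst_sizes(node, counter=None):
    """Count internal nodes by number of children (the 'burst size')."""
    counter = Counter() if counter is None else counter
    _, _, children = node
    if children:
        counter[len(children)] += 1
        for child in children:
            burst_sizes(child, counter)
    return counter

# Toy binary tree with two short internal edges near the root (illustrative only).
toy = ("root", 0.0, [
    ("X", 0.125, [
        ("Y", 0.25, [("a", 1.0, []), ("b", 1.0, [])]),
        ("c", 1.25, []),
    ]),
    ("d", 1.375, []),
])
fine = burst_sizes(toy)                       # three binary splits
coarse = burst_sizes(coarse_grain(toy, 0.5))  # one four-way burst
```

On the toy tree, three closely spaced binary splits merge into a single four-way burst once the resolution is coarser than the edges separating them. Tallying `burst_sizes` across a large tree at a fixed `epsilon` gives the kind of burst-size distribution discussed above.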

Outlook

So this seems to be the beginning of a very nice story, with a lot of open questions. Empirical trees display bursts of branching, which quickly collapse to polytomies under coarse-graining, and the distribution of sizes of these bursts is a power law. The Lambda-coalescent is likely not the end of the story, but at least suggests that this distribution is tied together with the scaling behavior of phylogenetic diversity.

What's next? Certainly lots of empirical questions. Does this behavior extend over an even broader range of samples? And will it still hold if we have better, longer sequence data? There are also theoretical questions, mostly centering around whether we can relate parsimonious but mechanistic models to the bursty tree structures, and how best to evaluate and compare these models. One take-home message stands out for me. Simplified models of biodiversity, like neutral models and their generalizations, likely won't ever capture the fine-grained dynamical behavior of an ecological community. But they might just tell us something about coarse-grained dynamical behavior, and coarse-grained phylogenies could be a nice part of this story. Let's see if coarse-grained patterns can be matched with coarse-grained process.