Blogging my Qualifying Exam
Because this seems to be my default mode of organizing my thoughts when it comes to research, I've decided to write my dissertation proposal as a blog post. This way, when I'm standing in front of my committee on Thursday, I can simply fall back on one my more more annoying habits; talking at length about something I wrote on my blog. Or, since he has graciously lent me his megaphone for the occasion, I can talk at length about something I wrote on Jonathan's blog.
Last summer, I had a lucky chance to travel to Kamchatka with Frank Robb and Albert Colman. It was a learning experience of epic proportions. Nevertheless, I came home with a puzzling question. As I continued to ponder it, the question went from puzzling to vexing to maddening, and eventually became an unhealthy obsession. In other words, a dissertation project. In the following paragraphs, I'm going to try to explain why this question is so interesting, and what I'm going to do to try answer it.
About a million years ago (the mid-Pleistocene), one of Kamchatka's many volcanoes erupted and collapsed into its magma chamber to form Uzon Caldera. The caldera floor is now a spectacular thermal field, and one of the most beautiful spots on the planet. I regularly read through Igor Shpilenok's Livejournal, where he posts incredible photographs of Uzon and the nature reserve that encompasses it. It's well worth bookmarking, even if you can't read Russian.
The thermal fields are covered in hot springs of many different sizes. Here's one of my favorites :
The experienced microbiologists on the expedition set about the business of pursuing questions like Who is there? and What are they doing? I was there to collect a few samples for metagenomic sequencing, and so my own work was completed on the first day. I spent the rest of my time there thinking about the microbes that live in these beautiful hotsprings, and wondering How did they get there?
Extremophiles are practically made-to-order for this question. The study of extremophile biology has been a bonanza for both applied and basic science. Extremophiles live differently, and their adaptations have taught us a lot about how evolution works, about the history of life on earth, about biochemistry, and all sorts of interesting things. However, their very peculiarity poses an interesting problem. Imagine you would freeze to death at 80° Celsius. How does the world look to you? Pretty inhospitable; a few little ponds of warmth dotted across vast deserts of freezing death.
This model has been a powerful tool for microbiology, and much of what we know about cellular metabolism has been learned by the careful tinkering with selective growth media it exhorts one to conduct. Nevertheless, the Baas Becking model just doesn't seem reasonable. Microbes do not disperse among the continents by quantum teleportation; they must face barriers and obstacles, some perhaps insurmountable, as well as conduits and highways. Even with their rapid growth and vast numbers, this landscape of barriers and conduits must influence their spread around the world.
As any sports fan knows, the structure of a tournament can be more important than the outcome of any particular game, or even the rules of the game. This is true for life, too. From one generation to the next, genes are shuffled and reshuffled through the population, and the way the population is compartmentalized sets the broad outlines of this process.
A monolithic population -- one in which all players are in the same compartment -- evolves differently than a fragmented population, even if mutation, recombination and selection pressures are identical. And so, if we want to understand the evolution of microbes, we need to know something about this structure. Bass Becking's hypothesis is a statement about the nature of this structure, specifically, that the structure is monolithic. If true, it means that the only difference between an Erlenmeyer flask and the entire planet is the number of unique niches. The difference in size would be irrelevant.
This is a pretty strange thing to claim. And yet, the Baas Becking model has proved surprisingly difficult to knock down. For as long as microbiologists have been systematically classifying microbes, whenever they've found similar environments, they've found basically the same microbes. Baas Becking proposed his hypothesis in an environment of overwhelming evidence.
However, as molecular techniques have allowed researchers to probe deeper into the life and times of microbes (and every other living thing), some cracks have started to show. Rachel Whitaker and Thane Papke have challenged the Bass Becking model by looking at the biogeography of thermophilic microbes (such as Sulfolobus islandicus and Oscillatoria amphigranulata), first by 16S rRNA phylogenetics and later using high resolution, multi-locus methods. Both Rachel's work and Papke's work, as well as many studies of disease evolution, very clearly show that when you look within a microbial species, the populations do not appear quite so cosmopolitan. While Sulfolobus islandicus is found in hot springs all over the world, the evolutionary distance between each pair of its isolates is strongly correlated with the geographic distance between their sources. So, these microbes are indeed getting around the planet, but if we look at their DNA, we see that they are not getting around so quickly.
However, Baas Becking has an answer for this; "...but the environment selects." What if the variation is due to selection acting at a finer scale? It's well established that species sorting effects play a major role in determining the composition of microbial communities at the species level. There is no particular reason to believe that this effect does not apply at smaller phylogenetic scales. The work with Sulfolobus islandicus attempts to control for this by choosing isolates from hot springs with similar physical and chemical properties, but unfortunately there is no such thing as a pair of identical hot springs. Just walk the boardwalks in Yellowstone, and you'll see what I mean. The differences among the sites from which these microbes were isolated can always be offered as an alternative explanation to dispersal. Even if you crank those differences down to nearly zero, one can always suggest that perhaps there is a difference that we don't know about that happened to be important.
This is why the Baas Becking hypothesis is so hard to refute: One must simultaneously establish that there is a non-uniform phylogeographic distribution, and that this non-uniformity is not due to selection-driven effects such as species sorting or local adaptive selection. To do this, we need a methodology that allows us to simultaneously measure phylogeography and selection.
There are a variety of ways of measuring selection. Jonathan's Evolution textbook has a whole chapter about it. I'll go into a bit more detail in Aim 3, but for now, I'd just like to draw attention to the fact that the effect of selection does not typically fall uniformly across a genome. This non-uniformity tends to leave a characteristic signature in the nucleotide composition of a population. Selective sweeps and bottlenecks, for example, are usually identified by examining how a population's nucleotide diversity varies over its genome.
For certain measures of selection (e.g., linkage disequilibrium) one can design a set of marker genes that could be used to assay the relative effect of selection among populations. This could then extend the single species, multi-locus phylogenetic methods that have already been used to measure the biogeography of microbes to include information about selection. This could, in principle, allow one to simultaneously refute "everything is everywhere..." and "...but the environment selects." However, designing and testing all those markers, ordering all those primers and doing all those PCR reactions would be a drag. If selection turned out to work a little differently than initially imagined, the data would be useless.
But, these are microbes, after all. If I've learned anything from Jonathan, it's that there is very little to be gained by avoiding sequencing.
We're getting better and better at sequencing new genomes, but it is not a trivial undertaking. However, re-sequencing genomes is becoming routine enough it's replacing microarray analysis for many applications. The most difficult part of re-sequencing an isolate is growing the isolate. Fortunately, re-sequencing is particularly well suited for culture-independent approaches. As long as we have complete genomes for the organisms we're interested in, we can build metagenomes from environmental samples using our favorite second-generation sequencing platform. Then we simply map the reads to the reference genomes. The workflow is a bit like ChIP-seq, except without culturing anything and without the ChIP. We go directly from the environmental sample to sequencing to read-mapping. Maybe we can call it Eco-seq? That sounds catchy.
Not only is the whole-genome approach better, but with the right tools, it is easier and cheaper that multi-locus methods, and allows one to include many species simultaneously. The data will do beautifully for phylogeography, and have the added benefit that we can recapitulate the multi-locus methodology by throwing away data, rather collecting more.
To implement this, I have divided my project into three main steps :
- Aim 1 : Develop a biogeographical sampling strategy to optimize representation of a natural microbial community
- Aim 2 : Develop an apply techniques for broad matagenomic sampling, metadata collection and data processing
- Aim 3 : Test the dispersal hypothesis using a phylogeographic model with controls for local selection
There are four technical challenges :
- Quickly collect a large number of samples and transport them to the laboratory without degradation.
- Build several hundred sequencing libraries.
- Collect high-quality metadata describing the sites.
- Assemble thousands of re-sequenced genomes.
It's called the Smash-o-Tron 3000, and you can download it on Thingiverse.
Sequencing library construction
The next technical problem is actually building the sequencing libraries. Potentially, there could be a lot of them, especially if I do replicates. If I were to collect three biological replicates from every site on the map, I would have to create about six thousand metagenomes. I will not be collecting anywhere close to six thousand samples, but I thought it was an interesting technical problem. So I solved it.
I'm pretty excited about this idea, and there are a lot of different directions one could take it. The College of Engineering here at UC Davis is letting me teach a class this quarter that I've decided to call "Robotics for Laboratory Applications," where we'll be exploring ways to apply this technology to molecular biology, genomics and ecology. Eight really bright UC Davis undergraduates have signed up (along with the director of the Genome Center's Bioinformatics Core), and I'm very excited to see what they'll do!
Environmental metadata collection
To help me sanity check the selection measurement, I decided that I wanted to have detailed measurements of environmental differences among sample sites. Water chemistry, temperature, weather, and variability of these are known to select for or against various species of microbes. The USGS database has extremely detailed measurements of all of these things, all the way down to the isotopic level. However, I still need to take my own measurements to confirm that the site hasn't changed since it was visited by the USGS team, and to get some idea of what the variability of these parameters might be. It would also be nice if I could retrieve the data remotely, and not have to make return trips to every site.
Unfortunately, these products are are extraordinarily expensive. The ones that can be left in the field for a few months to log data cost even more. The ones that can transmit the data wirelessly are so expensive that I'd only be able to afford a handful if I blew an entire R01 grant on them.
So, I ordered a couple of Arduino boards and made my own.
Assembling the genomes
In order to test my hypothesis, I need to model the dispersal of organisms among the sites. However, in order to do a proper job of this, I need to make sure I'm not conflating dispersal and selective effects in the data used to initialize the model. There are three steps :
- Identify genomic regions that have recently been under selection
- Build genome trees with those regions masked out
- Model dispersal among the sites
I will delete the regions of the genomes that deviate significantly (say, by more than one standard deviation) from neutral. Then I'll make whole genome alignments, and build a phylogenetic trees for each organism. This tree would contain only characters that (at least insofar as you believe Tajima's D and Wu and Fey's FST) are evolving neutrally, and are not under selection.
This brings us, at last, back to the hypothesis and Baas Becking. Here we have a phylogeographic model of dispersal among sites within a metacommunity, with the effects of selection removed. If the model predicts well-supported finite rates of dispersal within the metacommunity, my hypothesis is sustained. If not, then Baas Becking's 78 year reign continues.