Today is a fun day for me. A paper on which I am the senior author is being published in Nature (yes, the Academic Editor in Chief of PLoS Biology is publishing a paper in Nature, more on that below ...). This paper, entitled "A phylogeny driven genomic encyclopedia of bacteria and archaea", represents the culmination of years of work by many people from multiple institutions. Today in this blog I am going to do my best to tell the story behind the paper - about the people and the process and a little bit about the science.
First, a brief bit about the science in the paper. In this paper, we (mostly people at the Joint Genome Institute, where I have an Adjunct Appointment -- but also people in my lab at UC Davis and at the DSMZ culture collection) did a relatively simple thing - we started with the rRNA tree of life as a guide. Then we identified branches in the bacterial and archaeal portions of this tree where there were no genome sequences available or in progress (this was done mostly by Phil Hugenholtz, Dongying Wu and Nikos Kyrpides). Next we searched for representatives of these "unsequenced" branches in the DSMZ culture collection (a collection of bacteria and archaea that can be grown in the lab), and we identified some 200 of these in total. Then the DSMZ (under the direction of Hans-Peter Klenk) grew these organisms and sent the DNA to the Joint Genome Institute, and JGI turned on their genome sequencing muscle and sequenced the genomes of the organisms in the DNA samples. Finally, we spent a good deal of time analyzing the data, asking a pretty simple question - are there any general benefits that come from this "phylogeny driven" approach to sequencing genomes compared to what one might find with sequencing just any random genome (after all, any genome sequence could have some value)? The paper describes what we found, which is that there are in fact many benefits that come from sequencing genomes from branches in the tree for which genomes are not available.
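For those who like to see an idea as code, here is a minimal sketch of what "phylogeny driven" selection could look like. To be clear, this is not the procedure we actually used. It assumes a simple greedy rule (pick the candidate that adds the most branch length not already covered by a sequenced genome, then repeat), and the toy tree, the tip names and the function names are all made up for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy tree: child -> (parent, branch length). The root has no entry.
Tree = Dict[str, Tuple[str, float]]

TOY_TREE: Tree = {
    "A": ("root", 0.5),
    "B": ("root", 0.5),
    "t1": ("A", 0.25),
    "t2": ("A", 0.25),
    "t3": ("B", 0.375),
    "C": ("B", 0.25),
    "t4": ("C", 0.125),
    "t5": ("C", 0.75),
}


def covered_nodes(tree: Tree, tips: Set[str]) -> Set[str]:
    """Nodes whose branch to the parent lies on a path from one of `tips` to the root."""
    covered: Set[str] = set()
    for tip in tips:
        node = tip
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def novel_length(tree: Tree, tip: str, covered: Set[str]) -> float:
    """Branch length that `tip` would add beyond what is already covered."""
    total = 0.0
    node = tip
    while node in tree and node not in covered:
        total += tree[node][1]
        node = tree[node][0]
    return total


def pick_targets(
    tree: Tree, sequenced: Set[str], candidates: Set[str], k: int
) -> List[Tuple[str, float]]:
    """Greedily choose up to k candidate tips, each adding the most uncovered branch length."""
    covered = covered_nodes(tree, set(sequenced))
    remaining = set(candidates) - set(sequenced)
    picks: List[Tuple[str, float]] = []
    for _ in range(k):
        if not remaining:
            break
        # Sorting first makes ties resolve alphabetically, so results are repeatable.
        best = max(sorted(remaining), key=lambda t: novel_length(tree, t, covered))
        gain = novel_length(tree, best, covered)
        if gain == 0:
            break
        picks.append((best, gain))
        covered |= covered_nodes(tree, {best})
        remaining.discard(best)
    return picks


if __name__ == "__main__":
    # t1 already has a genome; t2..t5 stand in for strains available in a culture collection.
    print(pick_targets(TOY_TREE, {"t1"}, {"t2", "t3", "t4", "t5"}, 2))
```

In the toy tree, the first pick is the tip on the longest unsampled lineage, and the second pick is scored only on what that first pick left uncovered. That is the basic intuition behind going after unsequenced branches rather than picking genomes at random.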
More on the details of the science below. But first, I want to note that this paper was truly an amazing team effort, with all sorts of people from the JGI in particular, going above and beyond the call of duty to make sure it happened and worked well. And the Department of Energy has been truly phenomenal in my opinion in supporting this project which in the end is not explicitly about "energy" per se but is really about providing a reference set of genomes that should improve the value of all microbial genome data.
Anyway, now for the story behind the story. And be prepared, because this is a bit long. But I think it is important to place this work in a bigger context both in terms of my background as well as some of the background of other people in the project. If you can't wait for more on the GEBA project then perhaps you should go to some of these links:
- Videos of talks I have given on the project:
- "Genomic Encyclopedia of Bacteria and Archaea (GEBA)"- Jonathan ...
- Recent talk I gave at the Sackler NAS "Microbes and Health" meeting
- Podcast of interview of me for ASM's Meet the scientist
- Stories about GEBA
- Nature News from 11.17.2009
- Stories about our paper
- Nature News
- GenomeWeb "GEBA Researchers Publish Results from Dozens of Bacterial, Archaeal Genomes"
- Ars Technica article "Presenting a genomic encyclopedia of bacteria (and archaea)" by John Timmer
- Iddo Friedberg blogged about it
- The OpenHelix Blog on it
- Leonardo Martins blogs about it here and helps translate a Spanish story about the project
- R&D magazine has a post based on the press releases here
- NY Times story by Carl Zimmer here.
- FriendFeed Discussions here (includes a thread about Nature using a Creative Commons license)
My personal interest in applied uses of phylogenetics stage 1: undergraduate preparation at Harvard
As this paper is primarily about an applied use of phylogenetics (in selecting genomes for sequencing), I thought it would be worth going into some of how I personally became a bit obsessed with applied uses of phylogenetics. For me, my obsession began as an undergraduate at Harvard where I got exposed to the value of phylogeny as a tool from many many angles including but not limited to:
- Freshman year taking a course taught by Stephen Jay Gould where Wayne and David Maddison were Teaching Assistants and where they were demoing their new phylogenetics software called MacClade
- Sophomore year taking a conservation biology class with Eric Fajer and Scott Melvin where I was exposed to the concept of "phylogenetic diversity" as a tool in assessing conservation plans
- Junior year working in the lab of Fakhri Bazzaz with people like David Ackerly and Peter Wayne who made use of phylogeny as a key tool in their research projects
- Senior year and the year after graduating where I worked in the lab of Colleen Cavanaugh using rRNA based phylogenetic analysis to characterize uncultured chemosynthetic symbionts. I note it was in Colleen's lab that I also became obsessed, you could say, with microbes and why they rock.
Most of my work at TIGR focused on a different side of phylogenomics than represented in the GEBA project. At TIGR I focused on the uses of evolutionary analysis as a component of analyzing genomes - from predicting gene function to finding duplications (e.g., see the V. cholerae genome paper) to identifying genes under unusual patterns of mutation or selection to finding organelle derived genes in nuclear genomes (e.g., see this) to studying the occurrence of lateral gene transfer or the lack of occurrence of it to studying genome rearrangement processes. And sure, every once in a while I worked on a project where the organism was the first in its major branch to have a genome sequenced (e.g., Chlorobi). And I had noted, along with others, that there was a big phylogenetic bias in genome sequencing projects (e.g., see my 2000 review paper discussing this here).
But that did not really drive my thinking about what genome to actually sequence until TIGR hired a brilliant microbial systematics expert, Naomi Ward, as a new faculty member. And it was Naomi who kept emphasizing that someone should go about targeting the "undersequenced" groups in the Tree of Life.
NSF Assembling the Tree of Life grant
And so Naomi and I (w/ Karen Nelson and Frank Robb) put together a grant for the NSF's "Assembling the Tree of Life" program to do just this - to sequence the first genomes from eight phyla of bacteria for which no genomes were available but for which there were cultured organisms. Amazingly we got the grant. And we did some pretty cool things on that project, including sequencing some interesting genomes, and developing some useful new tools for analyzing genomes (e.g., STAP, AMPHORA, APIS). And I was able to hire some amazing scientists to work in my lab on the project including Dongying Wu (the lead author on the GEBA paper) and Martin Wu (who also worked on the GEBA project and is now a Prof. at U. Virginia) and Jonathan Badger. Alas, we did not publish any earth shattering papers as part of this NSF Tree of Life project on analyzing the genomes of these eight organisms, not because there was not some interesting stuff there but for some other reasons. First, I moved to UC Davis and there was a complicated administrative nightmare in transferring money and getting things up and running at Davis on this project so my lab was not really able to work on it for two years (in retrospect, what a f*ING nightmare dealing with the UC Davis grants system was ...).
Then, just as things were ready to get restarted, TIGR kind of imploded and many of the people, including Naomi, my CoPI, left (though I note, my moving to Davis was unrelated to the dissolution of TIGR). But perhaps most importantly, there were some actual technical and scientific problems with our dreams of changing the world of microbiology from our phyla sampling project - the science was not quite there. In particular, having a single genome from each of these phyla was simply not enough to get (and show) the benefits that can come from improved sampling of the tree of life. And thus though we have published some cool papers from this project (e.g., see this PLoS One paper on one of the genomes), we all ended up, in one way or another, disappointed with the final results.
Davis and JGI: the return of phylogeny to genomic sampling
And then, like magic, something came up. I went to one of the monthly senior staff meetings at JGI and while we were listening to someone on the speaker phone, Eddy Rubin handed me a note asking me what I thought about the proposal someone was making to sequence all the species in the Bergey's Manual. And the light bulb of phylogeny went back on in my head. I told him (I think I wrote it down, but I may have said it out loud) something like "well, sequencing all 6000 or so species would be great, but it would be better to focus on the most phylogenetically novel ones first." And in a way, GEBA was born. Eddy organized some meetings at JGI to discuss the Bergey's proposal and I argued for a more phylogeny driven approach. And this is where having Phil Hugenholtz and Nikos Kyrpides at JGI was like a perfect storm. You see, both had been lamenting the limited phylogenetic coverage of genomes for years, just like I had. Phil had even written a paper about it in 2002 which we used as part of our NSF Tree of Life proposal. And Nikos too had been diligently working for years to make sure novel organisms were sequenced. So though we went to a meeting to discuss the Bergey's Manual idea, we instead proposed an alternative - GEBA.
And for some months, we pitched this notion to various people including at JGI, DOE, and various advisory boards. And the response was basically - "OK - sounds like it COULD be worth doing - why don't you do a pilot and TEST if it is worth doing." And so, with support from Eddy Rubin and DOE, that is what we did.
One key limitation - getting DNA
So Phil, Nikos and I and a variety of others started working on the general plan behind GEBA. But there was one key limitation. How were we going to get DNA from all these organisms? One possibility was to seek out diverse people in the community and have them somehow help us. This had some serious problems associated with it, not the least of which was the worry that we might end up sequencing varieties of organisms that people had in their lab but which nobody else had access to (something Naomi Ward and I had written about as a problem a few years before).
And here came the second perfect storm - none other than Hans-Peter Klenk (yes, the same one who had led some of the early genome sequencing efforts when he was at TIGR) was visiting JGI. And he had a relatively new job - at the German Culture Collection DSMZ. (In fact, I should note, I had tried to get a job at TIGR even before I met Venter, since they had a position advertised for a "microbial evolutionary biologist" --- but that job went to Klenk.) Phil Hugenholtz had asked the Head of DSMZ, Erko Stackebrandt, if they might be interested in helping us grow strains and get DNA but we did not yet have a full collaboration with them. And Erko had suggested we contact Hans-Peter. And in his visit to JGI it became apparent that he would do whatever he could to help us build a collaboration with DSMZ. And thus we had a source of DNA. Even more amazingly to me, they did it all for free.
And I became the chief cheerleader and administrator of the project, in part since Phil and Nikos were so busy with their other things at JGI. And though I was not always on the ball, the project moved forward and we started to get genomes sequenced using the full strength of the JGI as a genome center. The finishing teams at JGI worked diligently on finishing as many of the genomes as possible. And Nikos' team at JGI made sure that the genomes were annotated. And I helped make sure that the data release policies were broadly open (which everyone at JGI supported). And after many false starts with papers on the project that were way way way too cumbersome and big, with some kicks in the pants from the director of JGI, Eddy Rubin, who was getting anxious about the project, we turned out the GEBA paper that was published today in Nature.
You might ask why I, as a PLoS official and PLoS cheerleader, ended up having a paper in Nature. Well, in the end, though I am senior author on the paper, the total contribution to the work mostly came from people at JGI who did not work for me but instead worked with me on this great project. And we took some votes and had some discussions and in the end, despite my lobbying to send it to PLoS Biology, submitting it to Nature was the group decision. I supported this decision in part due to the fact that Nature uses a Creative Commons license for genome papers. But I also supported it because in the end, this was a collaboration involving many many many people and in such projects everyone has to compromise here and there. Now mind you, I am not sad to have a paper in Nature. But I would personally have preferred to have it in a journal that was fully open access, not just occasionally open like Nature.
Now I note, there were a million other things that went on associated with the GEBA project, some of which I was not even involved in at all. I will try to write about some of these another time, but this post is already way way way too long. So I am going to just stop here and add that I have been honored and lucky to work with people like Phil, Nikos, Hans-Peter, and others on this project and to have the people at the JGI work so hard on the background parts of this project. Thanks to all of them and to the people at DSMZ and in my lab who helped out and to the DOE for funding this work (as well as the Gordon and Betty Moore Foundation, who funded some of the work from my lab on analysis of these genomes). And last but not least, thanks to the Director of JGI, Eddy Rubin, for supporting this project and for being patient with it and for kicking us in the pants when we needed to get moving on getting a paper out.
Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462 (7276), 1056-1060. DOI: 10.1038/nature08656