Tuesday, January 26, 2010

Wanted: Feedback on Importance of Finishing (Microbial) Genomes

To all

I am writing because I am working on a project to evaluate the importance of finishing microbial genomes. I know there has been lots of talk about this out there on the web, in papers, etc., but I think a fresh discussion is useful. To get people up to speed, below is a summary of the issue as I see it.
  1. Shotgun sequencing: Genome sequencing relies generally on the shotgun method at the beginning of a project where DNA fragments from an organism of interest are sequenced in a highly random manner.
  2. Assembly: After shotgun sequencing, the genome is assembled as best as possible into larger pieces (called contigs) and ordered sets of contigs (called scaffolds). All of this put together can be called an "assembly"
  3. Gaps: After the assembly phase, there are almost always gaps in the assembly. These generally come in two forms:
    • sequencing gaps (where we know two contigs go together in some orientation but where we do not know the sequence of the DNA in between the contigs)
    • physical gaps (where we have sets of scaffolds but do not know how they connect to each other).
  4. Quality: After the assembly phase, different components of the assembly can have different "qualities" where, for example, some sections are somewhat ambiguous and others are highly reliable
  5. Finishing: Using any combination of laboratory, computational and other analyses one can both fill in gaps in the assembly and improve the quality of the assembly. This can generally be called "finishing"
  6. Quality of final product: Depending on the end quality of the assembly we could assign it to one of a few categories of "completeness" as outlined in a paper by Patrick Chain et al. In essence, you can consider this post to be a follow-up to their paper and their work.
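
The terms in the list above can be made concrete with a small sketch. This is only an illustration, not anyone's actual assembly format: the `Contig` and `Scaffold` classes and the `physical_gaps` helper are hypothetical names, and the physical-gap count assumes a single replicon and ignores circularity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contig:
    name: str
    length: int  # length in bases


@dataclass
class Scaffold:
    # Ordered, oriented contigs; the sequence between neighbours is unknown.
    contigs: List[Contig] = field(default_factory=list)

    def sequencing_gaps(self) -> int:
        # One sequencing gap between each adjacent pair of contigs.
        return max(len(self.contigs) - 1, 0)


def physical_gaps(scaffolds: List[Scaffold]) -> int:
    # Assumption: one replicon, circularity ignored, so n unlinked
    # scaffolds leave n - 1 joins that are still unknown.
    return max(len(scaffolds) - 1, 0)


# A toy assembly: three scaffolds holding 3, 2 and 1 contigs.
assembly = [
    Scaffold([Contig("c1", 120_000), Contig("c2", 80_000), Contig("c3", 45_000)]),
    Scaffold([Contig("c4", 300_000), Contig("c5", 15_000)]),
    Scaffold([Contig("c6", 9_000)]),
]

total_sequencing_gaps = sum(s.sequencing_gaps() for s in assembly)
total_physical_gaps = physical_gaps(assembly)
print(total_sequencing_gaps, total_physical_gaps)
```

In this picture, finishing is the work of driving both counts toward zero while also raising the quality of the bases already in the contigs.
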
We plan to try to measure what one gains by the finishing steps. We need to know this because we would like to make intelligent decisions about how to allocate resources. If one gains a lot from finishing then it would make sense to allocate significant resources to it. I note, I and some colleagues wrote a paper about this issue "The value of complete microbial genome sequencing (You get what you pay for)" that was published in 2002. This is without a doubt not the only discussion of the topic but I just wanted to point out I have been involved in this debate before. Despite that, I think we simply do not know right now what the benefits might be in the new sequencing landscape.

So the question I am asking here is:
What do people think are the potential benefits that could come from finishing?

Here are some possible answers to get the discussion going:
  1. Gene discovery (e.g., there may be interesting/important genes in missing/low quality data)
  2. Esthetics of completeness (as in, it just feels better to have a finished genome)
  3. Improved analysis of genome organization (in particular from having contigs oriented correctly)
Also - I note there has been some discussion of this for animals, plants, etc. (e.g., see the recent paper by Eric Green and others on vertebrates). Many of the issues are similar but they are different enough that I think a microbe-focused discussion is useful.

Other links of interest:

Blakesley, R., Hansen, N., Gupta, J., McDowell, J., Maskeri, B., Barnabas, B., Brooks, S., Coleman, H., Haghighi, P., Ho, S., Schandler, K., Stantripop, S., Vogt, J., Thomas, P., NISC Comparative Sequencing Program, Bouffard, G., & Green, E. (2010). Effort required to finish shotgun-generated genome sequences differs significantly among vertebrates. BMC Genomics, 11 (1). DOI: 10.1186/1471-2164-11-21

Fraser, C., Eisen, J., Nelson, K., Paulsen, I., & Salzberg, S. (2002). The Value of Complete Microbial Genome Sequencing (You Get What You Pay For). Journal of Bacteriology, 184 (23), 6403-6405. DOI: 10.1128/JB.184.23.6403-6405.2002

Chain, P., et al. (2009). Genome Project Standards in a New Era of Sequencing. Science, 326 (5950), 236-237. DOI: 10.1126/science.1180614

Friendfeed discussion of this post:


  1. It really depends on where the gaps are...if they are in the middle of known genes you want to be able to know whether these are disrupted by transposons which are prone to leading to gaps with the shorter read sequencers (we've seen this with virulence factors) or whether it's simply poor assembly.

    Genome organization-wise, it is fairly difficult to see what contigs constitute plasmids or chromosomes without following up the assemblies because there are a lot of repetitive regions on plasmids that don't assemble well.

  2. I think the biggest issue with incomplete genomes is that you cannot be 100% sure that some function or gene is absent. This affects the interpretation of every pathway -- does this bug have an unusual pathway for X or is a key gene just missing? It also affects many evolutionary and comparative analyses, such as phylogenetic profiling, or identifying gene gains and losses. However, if the genome is >99% complete, then I'm not sure if this all is so bad in practice. I bet our coverage of bacterial genome space is still poor enough that two good drafts are worth more than one finished genome.

  3. Matthew Kane, 1/26/2010 6:14 PM

    Yes - absence of evidence is not evidence of absence.

  4. I don’t think you can make a blanket judgment that it would or would not always be beneficial to finish every microbial genome sequence that you begin. It might be good for your sanity, your human need for a sense of accomplishment, or perhaps if there are contractual agreements pertaining to your grant. Since the technology is getting better, the more routinized finishing work might be delegated to another group (a troupe of baboons, perhaps) if you have a consortium going or have several on-site groups working together… it just might make the finishing work more efficient, and allow you to get to the next organism of interest, or start genomic studies for the benefit of phylogenetics or whatever the reason was for your role in sequencing in the first place. For the individual it might depend upon what jollies you really get. Do you like to fix cars or drive them? If you are really into sequence technology, you would eventually want to see completion on most of the sequencing projects you start (fixing the car). On the other hand, you might want to start the genomic studies and get some use out of the sequence (driving the car).

  5. There is a need to finish if you are interested in chromosome structural dynamics, and gene gain/loss and pathway holes are difficult to reconcile without a finished genome. The degree to which we lose information by not finishing is speculative, though, which is why I suppose you are planning on quantifying this, huh? It would be nice to define the current relationship between information gain and the costs of closing and polishing the circle.

    What about costs in terms of avoiding future screw-ups by finishing? There’s that cautionary tale of the student losing a year to crystallizing a protein derived from mis-annotated sequence. What happens now with the potential increase in misassemblies as more groups are generating and depositing drafts? Perhaps it would be more difficult to quantify this sort of deleterious outcome and that’s why the genome standards are helpful.

  6. @ Morgan/Matthew: Most of the gene absence issues can be dealt with by fishing through the raw reads (from now on there is almost always going to be enough coverage to find these bits even if they don't assemble). It's definitely not like the shotgun days where un-cloneable genes are underrepresented in the assemblies if you weren't careful.

    @David Sela: I think screwup problems concerning gaps are easily fixed with a couple of PCR steps by the grad student but annotation is a completely different beast than assembly...

  8. This is great everyone - some things here we/I had not thought of. I completely agree with the sentiment that it depends on your interest (or, what your jollies are) - both scientific and otherwise. What I am hoping to do is build metrics for different categories of potential benefits and then we can assess in a cost-benefit manner, what we get for the money. Obviously someone still will have to decide whether it is going to be worthwhile to spend X$ on y benefit, but we really want to do this more objectively than it has been done in the past.

  9. I agree with David that structural elements of the chromosome are important to consider. What about synteny, for example, and trying to see how that is conserved (or not) amongst organisms? With incomplete genomes, this may be something that becomes more difficult to analyze, particularly in genomes where closely related strains have dramatic levels of rearrangement (e.g. SAR11). It is much less comfortable to try and understand this kind of phenomenon with unfinished genomes.

  10. We're doing genome-wise analyses of recombination, and these can only be done with finished genomes. The identities of the genes and other functional components aren't important.

    In our organism (Haemophilus influenzae), most of the ~20 sequenced genomes are of no use to us, because only 4 have been finished.

  11. Arash Komeili, 1/27/2010 10:33 AM

    I can speak from personal experience. The genomes of two magnetotactic bacteria were "finished" in 2001 but were never completed. Once complete genomes of other magnetos were published we realized that the first two genomes were missing large chunks of the chromosomal regions that made these organisms magnetotactic bacteria. The original genomes were useful but the finished genomes were needed for the breakthrough in our field. I should point out that the original projects were plagued by contamination issues so the coverage was very poor. The bigger problem (perhaps unrelated to this discussion) is the poor quality of genome annotations.

  12. As Rosie mentions, one major advantage to finishing is that it makes downstream analysis easy because the sequence can be assumed to be correct. If we don't finish genomes, it would be a huge help if the folks who generate the data also deposit the raw reads in a public database. With the raw reads it should be possible for clever informaticians to design their downstream analysis (e.g. of recombination) to account for uncertainties in the genome assembly. It could also provide a starting point for someone else motivated to finish the genome.

  13. No one has discussed cost. What is the current cost of getting a typical microbial genome to fewer than 10 contigs versus getting it finished using next- (or next-next-) gen sequencing?

    The numbers I heard not so long ago were order $5,000 for <10 contigs and order $60,000 for finishing. Then the argument is one finished genome versus 12 good quality complete ones.

    Disclaimer: I was against finishing in 2002 and still am.

  14. Rob - I think the numbers are probably "worse" for the finishing crowd than that. Right now, I think you can get good shotgun coverage of, say, a 5 Mb genome, with a mix of 454 and Illumina, for significantly less than 5K. You probably will not end up with 10 contigs, and the quality of those contigs that you get may not be great, but the cost will be low, more like 1-2,000.

    Alas, the cost of finishing probably has not gone down as much over the years as shotgun sequencing. So it might still cost on average 50K to finish a 5 Mb genome. I just do not know, but it would be worth figuring out. My guess is that the ratio of the relative cost of finished vs. shotgun keeps going up.

    So certainly cost should be figured into the equation. However, unlike our anecdotal papers from 2002, I think it would be good to more completely assess the benefits that come from having more incomplete genomes vs. those that come from having fewer complete ones.

  16. While it doesn't completely resolve the issue of having a closed, and hence "certain" genome sequence (using the term loosely), one way to simulate finishing while keeping costs low is via a comparative assembly tool (e.g. MERCATOR). Here, closely related genome sequences are used to create a synteny map and, through the combined signal across genomes, linkage between previously unlinked contigs within a genome is inferred. Of course, you still don't know what specific sequence is necessarily sitting in the newly linked region, but it is frequently cheaper to sequence many closely related genomes than it is to fully finish a single taxon's genome. This approach was used to finally finish the Neurospora crassa genome sequence, which had been plagued with a ridiculously low contig size N50 value until the N. tetrasperma and N. discreta sequences came online.

    (weirdness with blogger.com...)
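
For readers unfamiliar with the contig N50 statistic mentioned in the comment above, here is a minimal sketch of the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly. The function name `n50` and the example lengths are made up for illustration and are not taken from any Neurospora assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Same total size (200 kb) split into few large vs. many small contigs.
few_large = [100_000, 50_000, 25_000, 25_000]
many_small = [10_000] * 20

print(n50(few_large), n50(many_small))
```

Linking contigs by synteny raises the N50 without adding any new sequence, which is why it only simulates finishing.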

  17. I agree with David's follow-up that you can, in principle, root about in the raw reads to verify that a key gene of interest is absent. If we do large-scale sequencing to draft quality only -- which seems like a win to me -- then we should make sure this is fast and easy to do. For example, perhaps we need centralized read archives that allow you to search only the unassembled reads.

    I also wonder if this will all be moot in a few years as read lengths improve. If yes, then should that change our attitude about getting drafts now that we will think are lame in 3-5 years?
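
To give a rough idea of what "rooting about in the raw reads" could look like, here is a toy sketch that flags reads sharing exact k-mers with a gene of interest on either strand. Everything here is an assumption for illustration: the function names, the sequences, and the `k` and `min_shared` thresholds are invented, and a real search would more likely use an alignment tool that tolerates mismatches.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    # Reverse the sequence and swap each base for its complement.
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    # All overlapping substrings of length k.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reads_hitting_gene(reads, gene, k=8, min_shared=2):
    """Return reads sharing at least min_shared exact k-mers with the gene."""
    gene_kmers = kmers(gene, k) | kmers(reverse_complement(gene), k)
    hits = []
    for read in reads:
        shared = len(kmers(read, k) & gene_kmers)
        if shared >= min_shared:
            hits.append(read)
    return hits


# Hypothetical gene fragment and three unassembled reads.
gene = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
forward_read = gene[5:25]                       # taken from the gene
reverse_read = reverse_complement(gene[10:30])  # from the opposite strand
unrelated_read = "TTTTTTTTTTTTTTTTTTTT"

hits = reads_hitting_gene([forward_read, reverse_read, unrelated_read], gene)
print(len(hits))
```

An empty result from a search like this is only as convincing as the read coverage behind it.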

  18. I think that people here are forgetting to look ahead. Right now there is a whole new population genetics layer just waiting to be added to microbial genomes (thanks to next-gen seq, of course). And to address questions about genetic architecture (including some aspects of gene expression) and genome architecture (synteny, recomb, genome colonization, etc.) a finished genome is essential.

    The reason is quite simple: with finished genomes, designing experiments to address such questions is much more straightforward, and it will certainly reduce some costs. Just imagine someone repeating a whole bunch of analysis just because he/she wasn't using a finished genome as reference.

    Besides all that, most genomes go public at some point in time, so the effort of finishing a genome will capitalize itself very quickly, as one can see with E. coli genomes. We just need to help develop better approaches to finish without so much manual curation.

    As a final comment, genomes aren't a bunch of genes and regulatory elements. They are a research subject per se and deserve an appropriate treatment.

  19. I'm not quite sure if a finished genome is essential for these population genetics analyses...Much easier, yes, but essential no. Many of the complete eukaryotic genomes are unfinished and yet we still learn a heck of a lot from them.

    It is going to be much easier to re-think the pop gen analyses (money wise and time wise) than to actually finish every genome.

  20. 1) I think the finishing is important because it makes the data useful to a greater number of researchers. Most can't handle this "unfinished" stuff (me included).
    2) I'm assuming finished also = more accurate, i.e. better quality, fewer errors. And I think there's already way too much crap in the databases. So do the hard work and finish the damn genomes.

  21. my question is the flip side - What do you gain by doing a half-hearted job? Oh, a new sequence variation to recA (/sarc).

    Seriously though, what's your purpose in sequencing lots of genomes? To find a bunch of new genes that we don't know anything about? Or to learn about pathways, comparative genomics, genome evolution, etc.? You can't get that with draft quality. Rather, you could get as much with a draft as you could with metagenomics. So why not just throw a bunch of crap into a tube and do the metagenomics approach?

    I just know all the steps involved in annotation come with inherent error rates. When you start out with a 90% genome, you get a worse annotation.


  22. Sam et al

    I think (as I have said many times before) that what quality and quantity you need should be driven by scientific goals and questions. This is in part why I am trying to find out what people have noticed in their analyses has been possibly hurt by not having finished genomes. But there is a slight complication here - some genomic data gathering is bigger than individual projects. That is, some projects are done with the nebulous "community" in mind. In such cases it is important to try and figure out what the different possible questions are and how quality affects the ability to answer them. This is hard to do b/c there are so many diverse uses of genome sequence data (this is in part why genome papers have so many citations). So I do not think it is as simple as

    "To find a bunch of new genes that we don't know anything about?"


    "Or to learn about pathways, comparative genomics, genome evolution"

    For a regular grant proposal, the person proposing the work should be able to predict in some way what type of data will be useful. But for projects like my Genomic Encyclopedia project at the JGI, we need to try and factor in 100s-1000s of uses -- and there it is beyond my brain to list what all of those currently are or might be in the future. Which is why I posed these questions ...

  23. Jon;
    Unfinished genomes are a unique limbo and disaster for a large number of reasons. First, as other people have said, you can't know 'what you don't know.' Second, you can't track rearrangements. Third, not all microbes are single chromosomes - plasmids, megaplasmids, linear plasmids, and second chromosomes are all important. So, unfinished genomes are simply not a jumble of pieces from a single chromosome, but may be a mix of entities. For infection control, for instance, some of these may move at different rates or mutate at different rates and the clinicians and drug discovery types need to know this.

    I'm currently fielding the OpGen technology for relatively high-utilization compared to their expensive 'service.' While it adds to the cost of a genome, the cost of OpGen physical mapping of the chromosomes can probably be reduced to something like $60 per genome in consumables (purified DNA and minimal labor, plus a $250,000 instrument required, throughput 24 organisms per day) over the next year; not a ton of detail in the map, but enough to know 1) when a genome is misassembled and 2) reduce the coverage needed for near closure - and order most of the contigs.

    I think that from my perspective, not only is it worth it to know that my genomes are correctly assembled, but rearrangements, pathogenicity islands and other large mobile genomic elements/indels are the signal I really care about. SNPs are fun and fine, but they don't give me the kind of information I would care to report to the infection control people when they are trying to figure out what is causing a new kind of antibiotic resistance profile.

    Does this help?

  24. All helpful, Ben, and I agree in principle with most of the statements. I think in the end the issue is going to come down to cost-benefit analysis and asking the question - if you could get 100s of unfinished genomes for the cost of a few finished ones - is that a big benefit? And I think for some purposes (e.g., binning metagenomic data) 100s of unfinished genomes may be very useful. And for other purposes, 100s of unfinished is not much better than a few unfinished.

  25. So, I agree; which is why I mentioned concrete dollar values. Let's suppose that unfinished microbial genomes in the ~1000 contig range cost $1. I might accept that rearrangements are an unknown hazard and simply collect 1000 genomes for $1000 to do a functional gene analysis, even if closing each one is a mere $50. However, if draft genomes are $100 at the 1000 contig level and $150 for 20 contigs, and going from 20 contigs to finished costs $50, I may want the 5 genomes I get for my $1000 finished, rather than getting 6 unfinished ones or 10 pretty raw ones.

    Of course, the mission and the resourcing dictates the most effective relative effort, but I don't see draft genomes coming down below $100 too soon, and so I tend to think that assembly - particularly overcoming the challenges of metagenomics - is well worth the investment.
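
The arithmetic in the comment above can be checked with a few lines. This sketch uses only the hypothetical prices quoted there ($100 for a ~1000-contig draft, $150 for ~20 contigs, plus $50 to go from 20 contigs to finished) and a $1000 budget. The tier labels and the function name are made up for illustration.

```python
def genomes_for_budget(budget, cost_per_genome):
    # Whole genomes only: leftover money does not buy a partial genome.
    return budget // cost_per_genome


BUDGET = 1000  # dollars, as in the comment

# Per-genome cost of each quality tier, in dollars (hypothetical prices).
tier_costs = {
    "raw (~1000 contigs)": 100,
    "draft (~20 contigs)": 150,
    "finished": 150 + 50,  # 20-contig draft plus the closing step
}

counts = {tier: genomes_for_budget(BUDGET, cost) for tier, cost in tier_costs.items()}
for tier, n in counts.items():
    print(f"{tier}: {n} genomes")
```

With these prices the finished tier gives up only one genome relative to the 20-contig tier, which is the point of the comment. At the $1 vs. $50 prices in its first scenario the trade-off looks very different.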

  26. I don't think finishing will necessarily be a cost problem 2 years from now (except for highly repetitive regions), given that read lengths are going to be greater than 1 kb.

  27. Interesting discussion. Coming from a vertebrate genome person. The finished genomes give information about synteny and regulatory elements.

    For microbial I am not too sure.

    I am curious about the costing though. For me, the cost of finishing a genome is always hard to predict.

    It seems that there is only a certain amount you can get by short read sequencing and shotgun methods.
    In the end you need to tackle the gaps by directed Sanger seq.

    So to me the biggest cost is usually time rather than $.
    And if it's microbial I would assume it's much easier and can be done in a shorter time.

  28. Kevin - it is always hard to predict the cost (time and money) for finishing, which is another reason some places would like to not do it -- hard to budget into grant proposals for example

    Overall, the main cost in finishing is labor -- and thus the more one automates the better. JGI et al organize a meeting every year on the future of finishing where a decent amount of the discussion is on the automation of finishing. Some companies, like Raindance, are developing technologies that in theory could be used to aid in the automation, but I am not sure how much they are used right now. I think they probably work best when you are sequencing the same organism multiple times.

  29. It's amusing that people really think that they can understand the complete arrangement of a genome and that it will always be like that.

    Bacterial genomes rearrange in real time. Grow a culture in a test-tube and you will have a population. There is no such thing as the "complete" genome, it is the average of your sample. Even when we finally get the long promised single molecule long reads, you will only get a representative sequence (and you will have destroyed the organism by sequencing it, by the way).

    Next gen sequencing is good enough to get to very few contigs, and those gaps are probably real biological events. For example, rearrangements where your sample has multiple versions in the population, CRISPRs where the sequence has changed, phages, transposons, or IS elements that are hopping in, out, and around your genome.

    There is no right answer, there is no finished genome. There is no spoon.

    Finishing is an expensive way to fool yourself into thinking you know something you don't!

  30. Not buying parts of your pitch Rob. I agree that a "complete" genome is frequently a bit misleading due to the mutations that you describe that can occur either in growth in the lab or in the ancestors of a population of organisms in the field. However, I do not know of much evidence that indicates how often the "gaps" (and don't forget errors) in data sets are due to mutation/variation and how often these gaps are due to crappy data/assemblies. Do you have data on that?

    If you are arguing that what we need to do is better resolve the organization of all the pieces of DNA in a data set, I am in agreement. But just for clarification, since some might interpret your statements in other ways - I would like to express that point. And say that it can be very useful to understand the structure of the different variants in a sample. For example, multiple studies (e.g., some from Julian Parkhill) have shown that determining where the mutations are that arise during the growth of a culture can be very informative as to the mechanisms of mutation processes.

    In addition, I also want to make sure that people out there do not equate per se missing knowledge which is usually the state of "incomplete" genomes, with variation. Certainly in some cases fractured assemblies could be due to variation. But in other cases they are likely to be due to bad data, assembly methods that are not ideal for the data at hand, or, to put it bluntly, just crappy data.

    Until we can tell variation from missing knowledge, we cannot necessarily use the data perfectly. So if you are arguing that "finishing" should be in part about trying to determine where variants are in the data - I agree - and we have on occasion tried to do this. But I think the story is still out on what the benefits are from finishing -- and I think there are potentially many more than you imply. Mind you, I am not saying it is per se worth the expense. But I think we should objectively assess what the potential benefits are ...

  31. I think an unfinished genome is of limited usefulness because of all the reasons that have been mentioned. I'd like to add that we shouldn't discount non-coding regions, given recent studies showing the importance of sRNA.

    Having said that, if we're trying to allocate resources from a limited pool of money, then higher quality annotation is in my opinion far more important. I'd also like to see people coupling genomic studies with characterization of "novel" ORFs. I realize that this is a bit impractical in the current funding environment (2-3 year projects), and will require collaboration with biochemists and cell biologists, but this is much more of an issue to me than generating new, partial, and poorly annotated genomic sequences. Given the somewhat disproportional cost of closing a genome, I'd be extra cautious with soaking up a big chunk of funding just to do this.

    So my answer is dependent on where you plan to get the money from I guess. Yes if it's coming from places like JCVI and JGI, and no if it will cut into other non-genomic science funding.

  32. JGI and JCVI money does not, shall we say, grow on trees. Much of the work at JCVI is grant funded and much of the work at JGI is done with $$$ that could be reallocated to other science if needed. So in both places, better to not waste money - as that money could be used for other things.

  33. Of course I didn't mean money is not an issue at those institutions. What I meant to say is that they have traditionally focused on generating and analyzing genomic sequences, and as far as the sequences themselves are concerned, finishing the genomes should be high up on the priority list at those institutions (versus not leaving any budget for closing genomes). Overall, however, I think functional characterization is possibly more important at this point. We need to encourage people to incorporate functional studies when sequencing a new genome as well as utilizing existing genomes to study function. We should discourage this notion that we can't do high quality science unless we generate new sequences.

  34. OK I get your point now. I don't disagree per se, and definitely agree with the statement that we can do good science without new sequences. However, as sequencing costs go down, and as sequencing is still beneficial, it will continue to have a key place. We are certainly at a spot where functional studies, or lack thereof, are severely limiting for many purposes. In particular I think we need a concerted effort to do functional studies from across the tree of life ... an extension of my/JGI's recent "Genomic Encyclopedia" concept ...

