Comments on The Tree of Life: Wanted:Feedback on Importance of Finishing (Microbial) Genomes

OK I get your point now. I don't disagree per...

2010-02-13T17:59:20.879-08:00

OK I get your point now. I don't disagree per se, and definitely agree with the statement that we can do good science without new sequences. However, as sequencing costs go down, and as sequencing is still beneficial, it will continue to have a key place. We are certainly at a spot where functional studies, or the lack thereof, are severely limiting for many purposes. In particular I think we need a concerted effort to do functional studies from across the tree of life ... an extension of my/JGI's recent "Genomic Encyclopedia" concept ...

Of course I didn't mean money is not an issue ...

2010-02-13T16:41:02.692-08:00

Of course I didn't mean money is not an issue at those institutions. What I meant to say is that they have traditionally focused on generating and analyzing genomic sequences, and as far as the sequences themselves are concerned, finishing the genomes should be high up on the priority list at those institutions (versus not leaving any budget for closing genomes). Overall, however, I think functional characterization is possibly more important at this point. We need to encourage people to incorporate functional studies when sequencing a new genome as well as utilizing existing genomes to study function. We should discourage this notion that we can't do high quality science unless we generate new sequences.

JGI and JCVI money does not, shall we say, grow on...

2010-02-13T16:33:18.589-08:00

JGI and JCVI money does not, shall we say, grow on trees. Much of the work at JCVI is grant funded and much of the work at JGI is done with $$$ that could be reallocated to other science if needed. So in both places, better to not waste money - as that money could be used for other things.

I think an unfinished genome is of limited usefuln...

2010-02-13T16:31:04.713-08:00

I think an unfinished genome is of limited usefulness because of all the reasons that have been mentioned. I'd like to add that we shouldn't discount non-coding regions, given recent studies showing the importance of sRNA.

Having said that, if we're trying to allocate resources from a limited pool of money, then higher quality annotation is in my opinion far more important. I'd also like to see people coupling genomic studies with characterization of "novel" ORFs. I realize that this is a bit impractical in the current funding environment (2-3 year projects), and will require collaboration with biochemists and cell biologists, but this is much more of an issue to me than generating new, partial, and poorly annotated genomic sequences. Given the somewhat disproportionate cost of closing a genome, I'd be extra cautious about soaking up a big chunk of funding just to do this.

So my answer is dependent on where you plan to get the money from I guess. Yes if it's coming from places like JCVI and JGI, and no if it will cut into other non-genomic science funding.

Not buying parts of your pitch Rob. I agree that ...

2010-02-13T11:17:18.581-08:00

Not buying parts of your pitch Rob. I agree that a "complete" genome is frequently a bit misleading due to the mutations that you describe that can occur either in growth in the lab or in the ancestors of a population of organisms in the field. However, I do not know of much evidence that indicates how often the "gaps" (and don't forget errors) in data sets are due to mutation/variation and how often these gaps are due to crappy data/assemblies. Do you have data on that?

If you are arguing that what we need to do is better resolve the organization of all the pieces of DNA in a data set, I am in agreement. But just for clarification, since some might interpret your statements in other ways - I would like to express that point. And say that it can be very useful to understand the structure of the different variants in a sample. For example, multiple studies (e.g., some from Julian Parkhill) have shown that determining where the mutations are that arise during the growth of a culture can be very informative as to the mechanisms of mutation processes.

In addition, I also want to make sure that people out there do not equate missing knowledge, which is usually the state of "incomplete" genomes, with variation per se. Certainly in some cases fractured assemblies could be due to variation. But in other cases they are likely to be due to bad data, assembly methods that are not ideal for the data at hand, or, to put it bluntly, just crappy data.

Until we can tell variation from missing knowledge, we cannot necessarily use the data perfectly. So if you are arguing that "finishing" should be in part about trying to determine where variants are in the data - I agree - and we have on occasion tried to do this. But I think the story is still out on what the benefits are from finishing -- and I think there are potentially many more than you imply. Mind you, I am not saying it is per se worth the expense. But I think we should objectively assess what the potential benefits are ...

It's amusing that people really think that they can...

2010-02-13T10:23:25.086-08:00

It's amusing that people really think that they can understand the complete arrangement of a genome and that it will always be like that.

Bacterial genomes rearrange in real time. Grow a culture in a test-tube and you will have a population. There is no such thing as the "complete" genome, it is the average of your sample. Even when we finally get the long promised single molecule long reads, you will only get a representative sequence (and you will have destroyed the organism by sequencing it, by the way).

Next gen sequencing is good enough to get to very few contigs, and those gaps are probably real biological events. For example, rearrangements where your sample has multiple versions in the population, CRISPRs where the sequence has changed, phages, transposons, or IS elements that are hopping in, out, and around your genome.

There is no right answer, there is no finished genome. There is no spoon.

Finishing is an expensive way to fool yourself into thinking you know something you don't!

Kevin - it is always hard to predict the cost (tim...

2010-02-03T07:11:21.644-08:00

Kevin - it is always hard to predict the cost (time and money) for finishing, which is another reason some places would like to not do it -- it is hard to budget into grant proposals, for example.

Overall, the main cost in finishing is labor -- and thus the more one automates the better. JGI et al. organize a meeting every year on the future of finishing where a decent amount of the discussion is on the automation of finishing. Some companies, like Raindance, are developing technologies that in theory could be used to aid in the automation, but I am not sure how much they are used right now. I think they probably work best when you are sequencing the same organism multiple times.

Interesting discussion. Coming from a vertebrate g...

2010-02-02T21:39:26.575-08:00

Interesting discussion. Coming from a vertebrate genome person. The finished genomes give information about synteny and regulatory elements.

For microbial I am not too sure.

I am curious about the costing though. For me, the cost of finishing a genome is always hard to predict.

It seems that there is only a certain amount you can get by short read sequencing and shotgun methods.
In the end you need to tackle the gaps by directed Sanger seq.

So to me the biggest cost is usually time rather than $.
And if it's microbial I would assume it's much easier and can be done in a shorter time.

I don't think finishing will necessarily be a ...

2010-01-29T11:19:14.173-08:00

I don't think finishing will necessarily be a cost problem 2 years from now (except for highly repetitive regions), given that read lengths are going to be greater than 1 kb.

So, I agree; which is why I mentioned concrete dol...

2010-01-29T11:09:44.060-08:00

So, I agree; which is why I mentioned concrete dollar values. Let's suppose that unfinished microbial genomes in the ~1000 contig range cost $1. I might accept that rearrangements are an unknown hazard and simply collect 1000 genomes to do a functional gene analysis, even if closing each one is a mere $50. However, if draft genomes are $100 at the 1000 contig level and $150 for 20 contigs, and going from 20 contigs to finished costs $50, I may want the 5 genomes I get for my $1000 finished, rather than getting 6 unfinished ones or 10 pretty raw ones.
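The arithmetic in that second scenario can be spelled out as a rough sketch. The prices are the hypothetical ones from the comment above, not real quotes, and the variable and function names are made up for illustration.

```python
# Hypothetical per-genome prices from the comment (not real quotes).
BUDGET = 1000
RAW_DRAFT = 100        # draft at the ~1000 contig level
GOOD_DRAFT = 150       # draft at ~20 contigs
FINISHING_STEP = 50    # extra cost to go from ~20 contigs to finished
FINISHED = GOOD_DRAFT + FINISHING_STEP


def genomes_for_budget(budget, cost_per_genome):
    """Number of whole genomes a budget buys at a given per-genome cost."""
    return budget // cost_per_genome


counts = {
    "raw draft (~1000 contigs)": genomes_for_budget(BUDGET, RAW_DRAFT),
    "good draft (~20 contigs)": genomes_for_budget(BUDGET, GOOD_DRAFT),
    "finished": genomes_for_budget(BUDGET, FINISHED),
}

for quality, n in counts.items():
    print(f"{quality}: {n} genomes")
```

With these numbers the same $1000 buys 10 raw drafts, 6 good drafts, or 5 finished genomes, which is the trade-off being weighed.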

Of course, the mission and the resourcing dictates the most effective relative effort, but I don't see draft genomes coming down below $100 too soon, and so I tend to think that assembly - particularly overcoming the challenges of metagenomics - is well worth the investment.

All helpful Ben and I agree in principle with most...

2010-01-29T07:23:58.481-08:00

All helpful Ben and I agree in principle with most of the statements. I think in the end the issue is going to come down to cost-benefit analysis and asking the question - if you could get 100s of unfinished genomes for the cost of a few finished ones - is that a big benefit? And I think for some purposes (e.g., binning metagenomic data) 100s of unfinished genomes may be very useful. And for other purposes, 100s of unfinished is not much better than a few unfinished.

Jon; Unfinished genomes are a unique limbo and di...

2010-01-29T06:06:18.314-08:00

Jon;
Unfinished genomes are a unique limbo and disaster for a large number of reasons. First, as other people have said, you can't know 'what you don't know.' Second, you can't track rearrangements. Third, not all microbes are single chromosomes - plasmids, megaplasmids, linear plasmids, and second chromosomes are all important. So, unfinished genomes are not simply a jumble of pieces from a single chromosome, but may be a mix of entities. For infection control, for instance, some of these may move at different rates or mutate at different rates and the clinicians and drug discovery types need to know this.

I'm currently fielding the OpGen technology for relatively high-utilization compared to their expensive 'service.' While it adds to the cost of a genome, the cost of OpGen physical mapping of the chromosomes can probably be reduced to something like $60 per genome in consumables (purified DNA and minimal labor, plus a $250,000 instrument required, throughput 24 organisms per day) over the next year; not a ton of detail in the map, but enough to know 1) when a genome is misassembled and 2) reduce the coverage needed for near closure - and order most of the contigs.

I think that from my perspective, not only is it worth it to know that my genomes are correctly assembled, but rearrangements, pathogenicity islands and other large mobile genomic elements/indels are the signal I really care about. SNPs are fun and fine, but they don't give me the kind of information I would care to report to the infection control people when they are trying to figure out what is causing a new kind of antibiotic resistance profile.

Does this help?

Sam et al I think (as I have said many times befo...

2010-01-28T19:46:08.233-08:00

Sam et al

I think (as I have said many times before) that what quality and quantity you need should be driven by scientific goals and questions. This is in part why I am trying to find out what people have noticed in their analyses has been possibly hurt by not having finished genomes. But there is a slight complication here - some genomic data gathering is bigger than individual projects. That is, some projects are done with the nebulous "community" in mind. In such cases it is important to try and figure out what the different possible questions are and how quality affects the ability to answer them. This is hard to do b/c there are so many diverse uses of genome sequence data (this is in part why genome papers have so many citations). So I do not think it is as simple as

"To find a bunch of new genes that we don't know anything about?"

vs.

"Or to learn about pathways comparative genomics, genome evolution"

For a regular grant proposal, the person proposing the work should be able to predict in some way what type of data will be useful. But for projects like my Genomic Encyclopedia project at the JGI, we need to try and factor in 100s-1000s of uses -- and there it is beyond my brain to list what all of those currently are or might be in the future. Which is why I posed these questions ...

my question is the flip side - What do you gain by...

2010-01-28T17:41:31.440-08:00

My question is the flip side - What do you gain by doing a half-hearted job? Oh, a new sequence variation to recA (/sarc).

Seriously though, what's your purpose in sequencing lots of genomes? To find a bunch of new genes that we don't know anything about? Or to learn about pathways, comparative genomics, genome evolution, etc.? You can't get that with draft quality. Rather, you could get as much with a draft as you could with metagenomics. So why not just throw a bunch of crap into a tube and do the metagenomics approach?

I just know all the steps involved in annotation come with inherent error rates. When you start out with a 90% genome, you get a worse annotation.

Sam

1) I think the finishing is important because it ...

2010-01-28T09:08:57.540-08:00

1) I think the finishing is important because it makes the data useful to a greater number of researchers. Most can't handle this "unfinished" stuff (me included).
2) I'm assuming finished also = more accurate, i.e. better quality, fewer errors. And I think there's already way too much crap in the databases. So do the hard work and finish the damn genomes.

I'm not quite sure if a finished genome is ess...

2010-01-28T09:02:40.609-08:00

I'm not quite sure if a finished genome is essential for these population genetics analyses... Much easier, yes, but essential, no. Many of the complete eukaryotic genomes are unfinished and yet we still learn a heck of a lot from them.

It is going to be much easier to re-think the pop gen analyses (money wise and time wise) than to actually finish every genome.

I think that people here are forgetting to look ahe...

2010-01-28T03:14:39.152-08:00

I think that people here are forgetting to look ahead. Right now there is a whole new population genetics layer just waiting to be added to microbial genomes (thanks to next-gen seq, of course). And to address questions about genetic architecture (including some aspects of gene expression) and genome architecture (synteny, recomb, genome colonization, etc.) a finished genome is essential.

The reason is quite simple: with finished genomes, designing experiments to address such questions is much more straightforward and it will certainly reduce some costs. Just imagine someone repeating a whole bunch of analyses just because he/she wasn't using a finished genome as reference.

Besides all that, most genomes go public at some point in time, so the effort of finishing a genome will pay for itself very quickly, as one can see with E. coli genomes. We just need to help develop better approaches to finish without so much manual curation.

As a final comment, genomes aren't a bunch of genes and regulatory elements. They are a research subject per se and deserve an appropriate treatment.

I agree with David's follow-up that you can, i...

2010-01-28T00:32:45.917-08:00

I agree with David's follow-up that you can, in principle, root about in the raw reads to verify that a key gene of interest is absent. If we do large-scale sequencing to draft quality only -- which seems like a win to me -- then we should make sure this is fast and easy to do. For example, perhaps we need centralized read archives that allow you to search only the unassembled reads.

I also wonder if this will all be moot in a few years as read lengths improve. If yes, then should that change our attitude about getting drafts now that we will think are lame in 3-5 years?

While it doesn't completely resolve the issue ...

2010-01-27T20:42:59.045-08:00

While it doesn't completely resolve the issue of having a closed, and hence "certain" genome sequence (using the term loosely), one way to simulate finishing while keeping costs low is via a comparative assembly tool (e.g. MERCATOR). Here, closely related genome sequences are used to create a synteny map and, through the combined signal across genomes, linkage between previously unlinked contigs within a genome is inferred. Of course, you still don't know what specific sequence is necessarily sitting in the newly linked region, but it is frequently cheaper to sequence many closely related genomes than it is to fully finish a single taxon's genome. This approach was used to finally finish the Neurospora crassa genome sequence, which had been plagued with a ridiculously low contig size N50 value until the N. tetrasperma and N. discreta sequences came online.
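For anyone unfamiliar with the N50 statistic mentioned above, here is a minimal sketch of the usual definition: the length of the contig at which the running total, counting from the longest contig down, first reaches half of the total assembly length. The contig lengths are invented for illustration and have nothing to do with the actual Neurospora assemblies.

```python
def n50(contig_lengths):
    """Return the contig N50 for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("need at least one contig")


# Made-up example: the same 1000 bases as ten small contigs,
# then after some of them have been linked into larger pieces.
fragmented = [100] * 10
linked = [500, 300, 100, 100]

print(n50(fragmented))
print(n50(linked))
```

Linking contigs leaves the total length unchanged but raises the N50, which is why the statistic is a common shorthand for how fragmented an assembly is.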

Rob - I think the numbers are probably "worse...

2010-01-27T19:53:08.983-08:00

Rob - I think the numbers are probably "worse" for the finishing crowd than that. Right now, I think you can get good shotgun coverage of, say, a 5 Mb genome, with a mix of 454 and Illumina, for significantly less than $5K. You probably will not end up with 10 contigs, and the quality of the contigs that you get may not be great, but the cost will be low, more like $1,000-2,000.

Alas, the cost of finishing probably has not gone down as much over the years as shotgun sequencing. So it might still cost on average $50K to finish a 5 Mb genome. I just do not know, but it would be worth figuring out. My guess is that the ratio of the cost of finished vs. shotgun keeps going up.

So certainly cost should be figured into the equation. However, unlike our anecdotal papers from 2002, I think it would be good to more completely assess the benefits that come from having more incomplete genomes vs. those that come from having fewer complete ones.

No one has discussed cost. What is the current cos...

2010-01-27T18:58:20.861-08:00

No one has discussed cost. What is the current cost of getting a typical microbial genome to less than 10 contigs versus getting it finished using next- (or next-next-) gen sequencing?

The numbers I heard not so long ago were order $5,000 for <10 contigs and order $60,000 for finishing. Then the argument is one finished genome versus 12 good quality complete ones.

Disclaimer: I was against finishing in 2002 and still am.

As Rosie mentions, one major advantage to finishin...

2010-01-27T10:34:03.521-08:00

As Rosie mentions, one major advantage to finishing is that it makes downstream analysis easy because the sequence can be assumed to be correct. If we don't finish genomes, it would be a huge help if the folks who generate the data also deposit the raw reads in a public database. With the raw reads it should be possible for clever informaticians to design their downstream analysis (e.g. of recombination) to account for uncertainties in the genome assembly. It could also provide a starting point for someone else motivated to finish the genome.

I can speak from personal experience. The genomes...

2010-01-27T10:33:48.780-08:00

I can speak from personal experience. The genomes of two magnetotactic bacteria were "finished" in 2001 but were never completed. Once complete genomes of other magnetos were published we realized that the first two genomes were missing large chunks of the chromosomal regions that made these organisms magnetotactic bacteria. The original genomes were useful but the finished genomes were needed for the breakthrough in our field. I should point out that the original projects were plagued by contamination issues so the coverage was very poor. The bigger problem (perhaps unrelated to this discussion) is the poor quality of genome annotations.

We're doing genome-wise analyses of recombinat...

2010-01-27T10:19:24.344-08:00

We're doing genome-wise analyses of recombination, and these can only be done with finished genomes. The identities of the genes and other functional components aren't important.

In our organism (Haemophilus influenzae), most of the ~20 sequenced genomes are of no use to us, because only 4 have been finished.