Friday, February 03, 2012

Interesting new metagenomics paper w/ one big big big caveat - critical software not available

Very very strange.  There is an interesting new metagenomics paper that has come out in Science this week.  It is titled "Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota" and it is from the Armbrust lab at U. Washington.

One of the main points of this paper is that the lab has developed software that apparently can help assemble the complete genomes of organisms that are present in low abundance in a metagenomic sample.  At some point I will comment on the science in the paper (which seems very interesting), though as the paper is not Open Access I feel uncomfortable doing so since many of the readers of this blog will not be able to read it.

But something else relating to this paper is worth noting and it is disturbing to me.  In a Nature News story on the paper by Virginia Gewin there is some detail about the computational method used in the paper:
"He developed a computational method to break the stitched metagenome into chunks that could be separated into different types of organisms. He was then able to assemble the complete genome of Euryarchaeota, even though it was rare within the sample. He plans to release the software over the next six months."
What?  It is imperative that software that is so critical to a publication be released in association with the paper.  It is really unacceptable for the authors to say "we developed a novel computational method" and then to say "we will make it available in six months".  I am hoping the authors change their mind on this but I find it disturbing that Science would allow publication of a paper highlighting a new method and then not have the method be available.  If the methods and results in a paper are not usable how can one test/reproduce the work?


  1. I already posted this on Twitter but I really think that peer reviewers have a responsibility to insist that software developed for a manuscript is both available and open-source before publication. Ideally this would be in some trusted location like Github, Sourceforge or Google Code. This also means reviewers can access it without giving away their identity (if this is an issue for them, I don't usually care and have taken to signing my reviews).

  2. Our software implementation to do a similar thing (we don't split the graph heuristically) is, in fact, on github. And hey, look, the submitted paper is available, too! It's still in review, though.

  3. And Jonathan, I'm happy to comment on the science for you, since we've been pursuing this approach for about 2 years, although I would need to run some tests on their data set first. From skimming, the only real weakness is that they run an assembly first, and then partition the assembled data. Since many assemblers perform poorly on raw metagenomic data, this is unlikely to be as comprehensive as it could be. Also note that similar-in-style (although more heuristic) approaches were used in the rumen paper (Hess et al.) and the Arctic permafrost paper (Mackelprang, 2011). Good stuff, all in all.

    1. I since have had the opportunity to read through the paper more thoroughly, before talking to a journalist about it. My initial impression was not quite right -- they address the *scaffolding* part of assembly, in which contigs resulting from an initial round of assembly are connected together into longer ordered-and-oriented scaffolds that (in their case) appear to recover the majority of a genome. If this bears out on other data sets, this will be a very important contribution to metagenomics; previous efforts used hand-curation of contigs that looked similar based on various metrics, and something automated would be a significant advance in the field. However, they do rely on the (rather poor quality) initial assemblies coming out of Velvet prior to doing their scaffolding. I'll blog more about this after we submit our next paper addressing some of these issues. (I just don't have the time right now.)

      On the flip side, if their scaffolding approach doesn't bear out on other samples, then I will be very sad...

    2. Thanks for the update. Still hard to test their method if, well, they don't make the method available.

  4. Last comment: sea Right now it says "will be updated week of Feb 6th."

  5. Is this really a "rare" bug given it made up 7.5% of the sample? I would also note, from that news story, that many Euryarchaeota have been cultured and sequenced, just not this one! "One of those genomes came from the Euryarchaeota, a widespread group of marine microorganisms, none of which have been grown in culture or sequenced."

    1. To follow up on Julie's comment, the sample containing MG-II had 14.5 Gbp of filtered reads, so 7.5% of that gives ~1 Gb of sequence corresponding to MG-II. At a genome size of 2 Mbp this gives ~500x coverage. Baker et al. assembled genomes of the tiny euryarc ARMAN-2 from 100 Mbp of Sanger metagenomic data (15x coverage). I'd be interested how well the new assembly method works for genome reconstruction with less than 500x coverage.

  6. Yes - it says "information will be updated" - it does not say software will be made available

  7. I note - I have written to the software developer to encourage him to make it available ASAP ...

    1. Good, I was about to post asking if anyone had written to the authors prior to jumping on them and executing them.

      There could be a valid reason, though I imagine from the blog post, no reason would be entirely acceptable.

  8. Note - Virginia Gewin did contact me about commenting on the paper but we did not connect so I did not talk to her about her story. I was contacted by Biotechniques and they wrote an article.

    1. Here are my full answers to the reporter's questions about the paper:

      1. What has made it challenging for scientists to study microorganisms such as marine archaea?

      There are multiple challenges to studying microbes such as marine
      archaea. These include

      1) They are small. This may seem to be an obvious issue, but their
      small size makes it difficult to study the activities of microbes in
      the field. Whereas one can observe multicellular organisms (e.g.,
      plants, animals, macrofungi, kelp, etc) directly in the field and
      record certain actions (e.g., what mammals eat), studying the activity
      of microbes in the field is more challenging.

      2) Even when one can do field observations, one major problem is that
      the appearance of a microbe is not a good indicator of what kind of
      organism it is. Thus you might see something but not know what it is.

      3) One way around these issues is to grow microbes in the lab -
      something known as culturing. Culturing allows one to determine many
      of the characteristics of individual kinds of microbes.

      4) Unfortunately, many microbes cannot be grown in the lab.

      2. How does this new technique for isolating individual genomes improve on past methods?

      In general, analysis of the DNA (as well as RNA and proteins) of
      microbes in the field allows one to learn a great deal about microbes
      in a sample without growing them in the lab. Genome sequencing of
      microbes in the field (something generally known as metagenomics)
      allows one to make many inferences and predictions about the biology
      and evolution of microbes. Unfortunately, piecing together entire
      genomes of microbes in the field can be hard.

      I note - it is hard to tell from the main text of the paper just what,
      if anything, has been done new here in terms of techniques. There is
      a lot of supplemental information I do not have access to.

      I note - we have assembled complete / nearly complete genomes via
      metagenomics previously - as have others. In most cases this prior
      work has been done in very low diversity samples (e.g.,
      endosymbionts). This new work appears to be somewhat unique in
      assembling nearly complete genomes from complex communities.

      3. What do you think the implications of this technique will be?

      Hard to know - this depends on whether and how they make the methods
      they used broadly available. Is the software going to be available
      for all to use? Did they have to use specialized lab methods and if
      so, will anyone be able to make them work or are they complex?

      4. If scientists can now begin to study the genomes of microorganisms like marine archaea, how could that help with our overall understanding of our environment?

      Microbes run the planet. We desperately need to better understand
      their contributions to all ecosystems and to the functions occurring
      therein. What we need is a field guide for microbes - akin to what we
      have for birds - that would give us a picture of the current and past
      details of microbial life on the planet. Only then will we be able to
      have a full understanding of the planet as well as make predictions
      about the future ...

    2. Interesting how they took your words out of context in their "objective report"

  9. There is always the possibility that the authors will make the "as is, used in the publication" code available to anyone on request and what is being released within the next 6 months is the user friendly full on useful program. Fairly normal in my experience to be ready to publish before you really have a nice user-friendly implementation of your software ready for release.

    But of course the code/scripts you used in the publication need to be available right at the time of publication, even if they would, at that point, be less than useful for most researchers. It at least allows inspection for bugs and verification of results.

  10. This is truly annoying. In my opinion, the reviewers aren't doing their jobs if they haven't run the software in a paper like this (even if just on demo data, and yes, all computational biologists should provide an executable demo with their software).

    I can point you to a Nature paper from a few years ago where software that was crucial to the findings was just described as "manuscript in preparation". Guess what -- the manuscript never appeared!

  11. While I agree with Nick & Shaun that reviewers should help in the policing especially when journal guidelines are lax/ambiguous, in this case the authors (and editorial staff) are not even abiding by Science's own guidelines set out by Hanson, Sugden, Alberts in their editorial "Making Data Maximally Available":

    "To address the growing complexity of data and analyses, Science is extending our data access requirement listed above to include computer codes involved in the creation or analysis of data. "

  12. I blogged about "missing software" in papers recently. It drives me nuts. I agree that this constitutes improper reviewing and editorial practice. It's bad for science.

  13. This paper clearly doesn't abide by the Science Code Manifesto:

    I suggest everyone reads and endorses those sound principles (if they haven't already!).

    Much like the Panton Principles, they're a simple, clear set of guidelines on the use of software in academic publications.

  14. You think that's bad? What about Stevens, Kent A., and J. Michael Parrish. 1999. Neck Posture and Feeding Habits of Two Jurassic Sauropod Dinosaurs. Science 284:798-800 --

    That came out thirteen years ago, and describes a then-new program for manipulating in 3d virtual models of bones -- in particular, dinosaur neck bones. The software's never been released, so no-one's ever been able to even attempt replicating their results. (Disclosure: I think their results are flawed, and have published on the subject.)

  15. This paper is mentioned in the NY Times today....alas, no mention of the software not being made available.


  16. This article is worth a follow-up. We're now edging into fall and the website was supposed to have released all code by now. They have released only the first part/phase of three (and the initial code release on github hasn't seen activity since the creation of the repo), which is yet another example why we need more adherence to standards like the Science Code Manifesto and the practices outlined for the Bioinformatics Testing Consortium.

  17. Sometimes it seems that people from biological areas still don't see software and code as true scientific results or as part of the scientific method. Wet-lab results are mandatory and held to some level of quality, while those based on computational methods can be messy, poorly described, and uncontrolled. A paper describing results based on new software, without the software itself, doesn't make sense; we can only take on faith what is said.
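
As a quick check on the coverage arithmetic in the MG-II reply above (14.5 Gbp of filtered reads, 7.5% of the sample, a genome of about 2 Mbp), here is a minimal Python sketch. The function name `fold_coverage` and the variable names are made up for illustration, and it assumes that the 7.5% figure is a fraction of sequenced bases and that coverage is simply bases divided by genome length. It has nothing to do with the authors' unreleased software.

```python
def fold_coverage(total_bases, fraction, genome_size):
    """Rough fold coverage: bases attributed to one genome / genome length.

    Assumes `fraction` is the share of sequenced bases belonging to the
    organism, which is a simplification of how abundance is reported.
    """
    return total_bases * fraction / genome_size


# Figures quoted in the comment thread (not independently verified here).
total_reads_bp = 14.5e9   # 14.5 Gbp of filtered reads
mgii_fraction = 0.075     # MG-II at 7.5% of the sample
genome_size_bp = 2e6      # assumed genome size of ~2 Mbp

mgii_bases = total_reads_bp * mgii_fraction
coverage = fold_coverage(total_reads_bp, mgii_fraction, genome_size_bp)

print(f"MG-II bases: {mgii_bases / 1e9:.1f} Gbp")
print(f"Estimated coverage: {coverage:.0f}x")
```

This comes out at a little over 1 Gbp and roughly 540x, in line with the "~1 Gb" and "~500x" quoted in the comment.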

