Tuesday, March 26, 2013

Question - anyone having issues w/ delays/difficulty in the process of getting genomes / metagenomes into GenBank?

DNA sequencing continues to go crazy in terms of lower cost, higher speed, and spread of technology. Alas, some aspects of doing a genome project are not necessarily keeping up. So I am posting here to ask a simple question about one of these steps. What do people out there think about the steps of getting genome / metagenome data into GenBank? Without wanting to bias answers too much - we are having some challenges in this area. Storify of Twitter responses below the fold.


  1. As much as it saddens me to say this (since a colleague refers to me as the dumpster diver of genomic data), I think that scientists need to have a frank conversation about the costs and benefits of saving every piece of genomic data and curating it.

    1. Absolutely - but given that everyone is wasting months to years on just getting data into GenBank - we need to either agree that people won't do that or find another way to share.

  2. You don't have to submit to GenBank ... the European Nucleotide Archive is much easier to work with than GenBank or the SRA. There are a few publications where people use SEED, RAST, MicrobesOnline, IMG, etc. to announce bacterial genomes.

    If GenBank is an archive of your annotations, why do they make you use PGAAP to annotate? If you don't need to use PGAAP, why don't they accept annotations directly from third-party tools (RAST, MicrobesOnline, etc.)?

    The days of monolithic databases holding all sequence data known are nearing an end. The question is, how can scientists still get access to all the sequence data they are interested in?

    The problem with GitHub, figshare, etc. is generating a common dataset that holds all (or most) of the sequence data for new comparisons.

    There is no technical reason we have to submit to GenBank; we should be able to use whatever database has the best access. Provided they are listed in a common aggregator of web services (e.g. http://www.biocatalogue.org/) and they provide an API for programmatic access, everyone can access the data from anywhere. [In principle we could use RDF but that does not work in practice.]

    Free the data, make smaller, open databases, but make sure they are linked and accessible to all.

  3. I can see the point(s) about ease of submission, but the major issue with lots of smaller DBs is sustainability: who will look after a database in the longer term?
    There is also the point that being in a single format allows for direct comparisons to be made.
    And finally, from an end user's point of view, wouldn't you rather be able to go to one place (or just a few) to find all the genomes of interest?
    Now, I've not tried to submit data to NCBI, so I don't know your pain, but having worked in the ENA at EBI, I know it really isn't so hard to submit data there.
    Regarding @caseybergman's comment about @GigaScience, we can provide an option for those submitting papers to the GigaScience journal, but we're still encouraging submission of raw data to the SRA.

