Tuesday, February 08, 2011

Though I generally love NCBI, the Sequence/Short Read Archive (SRA) seems to have issues; what do others think?

Well, here goes. Hope to not get people from NCBI too pissed off here. Overall, I think NCBI is invaluable: GenBank. PubMed. PubMed Central (PMC) (well, I have some complaints about that but let's not get into those here -- I still like it), BLAST (Basic Local Alignment Search Tool) and a plethora of other tools, databases and resources. Generally, money well spent.

However, one database from NCBI is driving me a bit wacky these days. This is the Sequence Read Archive (SRA). Known to some as the "Short Read Archive" this database is supposedly for storing "sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®."

It certainly seems to be used for that function. But alas, storing sequence is not the only need here. Recovering sequence and making use of it is really the key. And this is the area I have been having trouble with (especially related to environmental studies like rRNA PCR and metagenomics). Rather than go on about my particular issues here (and thus possibly biasing the discussion too much), I am wondering what others think of the SRA? Usability? Ease of deposition? Ease of extraction? Missing features? Things it does or does not do well? Do we need a new system for environmental projects?

Any and all comments welcome here or on twitter or on Friendfeed or wherever.

Here are some comments so far from twitter
  • digitalbio Sandra Porter I agree. RT @phylogenomics: Though I generally love NCBI, the Sequence/Short Read Archive (SRA) seems to hav… (cont) http://deck.ly/~XM75A
  • lswenson Luke Swenson @phylogenomics I was JUST trying to navigate the SRA! There's no help section to be found, and forget about depositing sequences!
  • audyyy Davis-Richardson @phylogenomics I can never tell if my submission went through without emailing support. Also, no FASTQ support?
  • cabbageRed Rich C .@phylogenomics I agree, the SRA doesn't seem to be the easiest repository to search with what I believe to be "typical" NGS queries

31 comments:

  1. SRA is in my opinion the single biggest obstacle to progress in microbial ecology and microbiome research today. Rumors of its timely demise that were circulating at AGBT last week are certainly welcome! The future is open access (the _real_ sort where you can actually get the data out in a useful way). It is user-centric. It is cloud-enabled. And it is imminent.

  2. @Rob: SRA and open access to data are two different things. The policies surrounding data from the microbiome project (and other sequencing projects) are not SRA policy but NIH policy, which will affect folks funded by NIH regardless of storage mechanisms.

    @Jonathan: What particular problems are you having? Have you directly told anyone at NCBI associated with the SRA?

  3. Here is an example: Some of the search features lead to unconnected results. For example if you search under "studies" you can get to ERP000108 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=ERP000108 ... this is supposedly for the BGI human metagenome study

    but this is not connected to the data in any obvious way.

    So then you can search in other areas and eventually find

    http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=search_obj&m=search&s=obj&term=ERP000108&go=Search

    from there you can go to http://www.ncbi.nlm.nih.gov/sra?term=(ERP000108)%20NOT%20cluster_dbgap%5BPROP%5D

    but I still cannot find the data.

    Is this because they did not load the data? I can't tell.

    Also - I have had some serious problems tracking down metadata in the files -- is this b/c people are not asked to submit metadata or b/c there are no tables for some of the key metadata that would be useful?

  4. Every time I have tried to get data out of the SRA I found it nearly impossible. I think there are several reasons. First: There are two different search/browse pages for the SRA and I'm not sure which is better or what the difference is:
    http://www.ncbi.nlm.nih.gov/sra
    http://trace.ncbi.nlm.nih.gov/Traces/sra/

    I also find the connection from projects or publications to samples very hard to follow. Usually I am looking to get all of the data associated with a project, which is often many samples. This seems like an obvious and simple thing, but I have found it to be tough.

    The data seems to be organized in a very complex fashion. I understand that NCBI has many roles to fulfill, but currently there isn't one great place to obtain metagenomic datasets. I often look at CAMERA, IMG, MG-RAST, and lastly SRA. This of course is just for the sequences....don't get me started about trying to organize metadata on the samples.

  5. The SRA has been a frequent source of headaches for me. Some of this stems from the problems in downloading data. FTP access is notoriously unreliable, and so I installed the suggested Aspera software, which worked for a while but is still a little flaky.

    Last week there was a major connectivity problem with the FTP site meaning that I couldn't download any SRA data (using FTP clients or Aspera). NCBI acknowledged this in a reply to my query. However, for resources like this I feel strongly that you should be able to get 'status information' from their website if there are known problems.

    Even now that I have downloaded some data, I find that the software that you need to use to unpack the data to extract fastq files does not always work (I've downloaded 3 sra-lite files so far, and 1 won't unpack).

    I'm presuming that there is an automated way of being able to download data for a species of interest. However, because the FTP site for SRA data is not organized by species name you have to know which experiment/run accession you are after and it is not clear how to go from species name to accession in an automated fashion.

    I echo Morgan's calls that the NCBI should make their data downloadable as torrents. Alternatively, a west-coast mirror would really help with access for a lot of us.

  6. I was involved with getting some of the first very large pyrosequencing datasets uploaded into the SRA. ok this was 2 years ago. But at that time, I couldn't figure out how to submit at all. The developers had to do it themselves as it was so difficult. While I am told now it has improved, I also understand that it is still definitely not as easy as it should be. I am sure that there will be opportunities for it to improve, but I am not sure that it will help.

    The biggest problem is getting data back from SRA. In the last few years every single person I have pointed to my data in SRA has come back saying they cannot get hold of it! Instead I ended up pointing them at the MG-RAST database as they could download everything they wanted from there.

    In my opinion it needs to be completely overhauled, if only because the community has no faith in it.

    Replies
    1. I just finished submitting sequences. Progress seems to have been made in the submission process since 2011, but not as much as could be expected. I had to email them about 5 times with detailed descriptions of the problems before I could finish the process successfully. Even then NCBI staff had to set some things right before we got the desired results, namely 14 runs from 11 experiments uploaded and published.

      To their credit, my emails received prompt replies with very helpful tips.

      But the point with public databases like SRA is that they should be usable with NO intervention from the staff, except in rare atypical cases. That is where SRA still fails quite miserably - The instructions for submission are patchy and if you are doing it for the first time, you have no idea of the steps you need to follow or the specific pieces of information you need to have at hand.

      Now they seem to be integrating the BioProject and BioSample submission processes with the SRA sequence submissions and linking up the metadata for all three. However, I am not sure if the three database developers ever really work together. I simply can't see why they can't have one script that leads you from the first step to the last step in a submission.

      Secondly, when you start the process of submission (to all the three databases), you have no idea in advance how many different factoids you need to provide. They simply tell you to gather all the necessary information before you start the submission, without telling you exactly what you need to have at hand. I personally had to stop and start the process at least five times, in order to gather one or another piece of information from my lab mate, supervisor, or the sequencing facility.

      It definitely could work better than that !!

  7. Have you tried any of the tools in the R Bioconductor package for programmatic access to SRA? I don't know how well they work, but here's their overview page.

    I think a large part of the problem at NCBI is that they have a lot of old, "legacy" software in the backend so newer types of data have to be shoe-horned into the existing system. Perhaps the entire site needs an update.

  8. Open source access is the best way forward to accomplish better science. With the emphasis on global microbiome growing everyday a coherent open source effort is required rather than a monopoly. SRA does not support that clause.

  9. In my opinion, the only reason the deposition of sequences to the SRA works, is that curators manually load the sequences into the database, which is what was necessary for me to get my studies in the database. Of course when I tried to use the same templates containing another study, the business logic changed about every other week, which was a continual problem for me. Throughout my trials and errors, I tried contacting multiple people from the SRA, but I rarely got a response in less than a week.

    After spending well over 80 hours trying to get 2 studies into the SRA, I decided there was no point submitting my data there, since most people could not get the data back out and if they do, the metadata is so sparse and cut-down that it is not really beneficial. I have a feeling that if users could submit the metadata to make their studies more relevant to the user community, the business logic could never validate the data properly.

  10. This has to be my favorite response from the SRA auto-response server....let me know if Hint 1 is as useful to you guys as it was for me:

    -----------------------------------

    Hello,

    Some of the fields provided were not valid. Please review the following 1 hints for future reference. As the issue TR-2316 was partially created/updated you may want to manually correct it.

    Hint #1:
    --------
    Message : Can't add watchers, the reporter does not have the MANAGE_WATCHER_LIST permission for the issue: TR-2316

    Hint Detail : com.dolby.atlassian.jira.service.util.handler.emh.EMHIssueUtils.setupWatchers():1229

  11. I've had serious issues getting data into and out of the SRA. In particular, it doesn't seem like the SRA is designed to store data for amplicon-based (e.g. 16S) microbiome data containing many barcoded samples, which is what I'm generally working with. As of several months ago (last time I tried), I could not find a way to extract data and metadata for a set of samples at one time. This is inconvenient for a few samples, but ends up being prohibitive for 100s or 1000s of samples (i.e., modern microbiome studies).

    I also had issues with getting Illumina amplicon data into the SRA (the one submission I've worked on). This would have been the first Illumina barcoded amplicon data set to be deposited in the SRA, so there was a need to communicate with the staff to sort out format issues, etc. Responses to e-mails and phone calls were frequently delayed from several days to a month in one case, so it became frustrating to make progress as I'd get time to work on the submission, run into a question, and then get delayed. I still haven't put this data in to SRA, and currently have it posted as a tgz on a web server so readers can access it upon request, pending addition to MG-RAST or another database when they're ready for it.

    I look forward to a database that facilitates the sharing of amplicon-based community surveys as I think there are a lot of incredible analyses just waiting to happen once all of the data is accessible in a centralized location that is convenient to import to and export from.

  12. With the explosive growth of sequence data that has to be deposited in a DB, it becomes almost impossible to imagine this being a centralized DB, such as NCBI, EBI or DDBJ. Supposedly these three centers are synchronizing their DBs, but the Neanderthal and Denisovan genomes are only found in the EBI DB. So even today, the three main centers can't keep up with the growing sequence data. It is not that the will to keep up to date is lacking, but the sequences are being produced at too high a rate.

    I would advocate introducing a method similar to BioTorrent to share all sequence data. Let the sequence centers themselves make their sequences available through their own websites and provide a torrent-link to a central genomic portal website. This portal website can be designed to make searches easy and as specific as the user needs them to be, allowing searches by species, or even by clade, kingdom or domain, combined with the type of sequencing technique used, the type of experiment (WGS, ChIP-seq, RNA-seq, etc.) or the sequencing center. Similar to the internet, no centralized storage, just the cloud/community working together, sharing together.

  13. I agree that the SRA is in need of a major overhaul. On several occasions I have been unable to download data directly from the download link. In each case I have been advised to navigate the directories via ftp to track down the sequences I require.
    Why don't they fix broken links?!

  14. Once upon a time you could download the original SFF files from the SRA for Roche 454 data - very useful for repeating an assembly with Newbler etc. Not any more :(

  15. It did not seem that SRA was designed for studies involving more than a very small number of experiments and samples. In the era of next-gen sequencing, when a single sequencing run may generate data for thousands of individual samples, this is just not feasible.

    I never found an understandable explanation for the Experiment/Run/Sample pragma, or at least not one that could tolerate anything but the most linear of experimental designs. When I did manage to upload data, it took months for the maintainers to identify problems with the deposition, and I'm not sure that they ever could explain how I should have done it instead.

    Without question, the field needs a centralized site for sequence storage and retrieval, but SRA is not the platform that will accomplish those goals.

  16. Ditto Jesse, Greg and Brian. Difficult-to-impossible to submit sequences/metadata for large, highly multiplexed marker gene surveys. Frustrating, inconvenient and unintuitive for smaller studies. Major overhaul or alternative database needed.

  17. @naptime: I am very familiar with the distinction between SRA and NIH policy as I wasted 6 months of the scarce time I have available to do software development myself writing the initial code in QIIME that we used to deposit the HMP data (I am part of the HMP DACC) -- Kyle Bittinger took on the heroic task of continuing this struggle after that. In any case, these comments don't apply to the many datasets that are not even NIH-funded that are still in our queue -- and this queue is an ongoing embarrassment to us as the whole point of generating the data is to make it available. We are currently fixing this by arranging for deposition in MG-RAST, which as noted above actually works and provides end users with useful information. There will be a series of errata telling users where they can actually get the data very soon.

  18. If anyone is sitting on data and wants to make it freely accessible right now, let me suggest using BioTorrents. This will at least make the data available until you find a suitable traditional database for hosting it (new version of SRA, MG-RAST, etc.). If you would like to see additional categories or different fields for searching on BioTorrents please let me know as I can update the website quickly.

    By the way, the poll "What do you think of NCBI's SRA?" indicates the same message as these comments: SRA has serious problems.

  19. Comments from Frederic Bushman
    Professor of Microbiology
    University of Pennsylvania School of Medicine

    2/10/2011

    Our laboratory is funded by the HMP to carry out metagenomic studies of bacteria ("Diet, Genetic Factors, and the Gut Microbiome in Crohn’s Disease", co-PIs Gary Wu and James Lewis). We have uploaded about 22 "experiments" to the SRA as required to demonstrate progress in our HMP project. We have been working with sequence data for years, and have considerable experience dealing with public sequence archives.

    The difficulties in dealing with SRA have been unique in our experience. We have devoted weeks of programmer time to developing tools to put the data in the proper format for submission and distributed these tools to others. Despite this, it still took weeks of work to upload the data. This is really excessively difficult. The one time we tried to download data we were unsuccessful and gave up. Kyle Bittinger has devoted a substantial effort to this (see his comments also). All this is despite the fact that the contact people at SRA were actually rather helpful. The problem is the excessively complicated design.

    We feel strongly that the system for archiving short sequence read data needs to be redesigned and greatly simplified. I would suggest that if this is undertaken, funding should be allocated in the context of an open competition and peer review, to ensure that the needs of the community are heard and are met by new products.

  20. I have to agree with the general consensus here. I've spent several weeks of programmer time improving the SRA submission tools for the QIIME package. Another several weeks have been devoted to actual submissions. I made one half-hearted attempt to recover data from the SRA, found it to be hard, and gave up.

    I think many of the problems surrounding the SRA stem from a complex, cross-referenced, deeply nested data model. The top-level entities of Experiment, Study, Sample, and Run do not carry the expected meanings. They are rarely adequate to represent the research in practice -- one often has the feeling of fitting a square peg in a round hole. In addition to this, an enormous array of sub-entities and attributes are required to create each Experiment/Sample/Run. For example, the XML schema for an SRA Experiment is 1600 lines, including common fields. Furthermore, if the attributes and sub-entities do not cross-reference each other correctly, the submission fails after upload to the SRA.

    A simpler, less all-encompassing data model is needed. After all, the information required to demultiplex a set of pooled reads is relatively minimal. Auxiliary info could be provided in a more flexible, less rigidly structured format.

    On the whole, I do have to give credit to the staff at SRA, especially Martin Shumway, for helping me through the submission process. My heart goes out to them -- I can't imagine the pain of maintaining the current system.
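
To illustrate the point above that demultiplexing needs very little information, here is a minimal Python sketch. It assumes each read starts with a fixed-length barcode that must match exactly (no error correction); the function name `demultiplex`, the example reads and the barcode-to-sample mapping are all hypothetical and are not taken from the SRA or QIIME tooling.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign pooled reads to samples using the barcode at the start of each read.

    reads: iterable of (read_id, sequence) pairs.
    barcode_to_sample: dict mapping barcode -> sample name; all barcodes are
        assumed to have the same length (an assumption of this sketch).
    Returns (assigned, unassigned), where assigned maps each sample name to a
    list of (read_id, sequence_without_barcode) and unassigned lists read ids
    whose barcode was not recognised.
    """
    barcode_length = len(next(iter(barcode_to_sample)))
    assigned = defaultdict(list)
    unassigned = []
    for read_id, sequence in reads:
        barcode = sequence[:barcode_length]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read_id)
        else:
            # Strip the barcode so only the biological sequence is kept.
            assigned[sample].append((read_id, sequence[barcode_length:]))
    return dict(assigned), unassigned


# Hypothetical pooled reads and barcode mapping, for illustration only.
pooled_reads = [
    ("read1", "ACGTTTGACCA"),
    ("read2", "TGCAGGATCCA"),
    ("read3", "NNNNAAAAAAA"),
]
mapping = {"ACGT": "sample_A", "TGCA": "sample_B"}
assigned, unassigned = demultiplex(pooled_reads, mapping)
```

The only inputs are the reads and a barcode-to-sample table. Anything beyond that (library details, environmental metadata) could travel separately in a looser format.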

  21. Just got this forwarded to me .....

    Dear Staff Members of NCBI,
     
    As you are aware, the federal government as well as NIH is facing a period of budgetary uncertainty that is resulting in ongoing program reviews throughout the government.   At NCBI our senior staff have been giving serious consideration to our own projects and staffing levels in order to prepare for and adjust to new fiscal constraints. 
     
    NCBI had received a significant adjustment in its appropriated funding in the proposed FY2011 President’s Budget.  The President’s Budget, however, has not been enacted and we are being required to operate at last year’s (FY2010) level under a Continuing Resolution (CR) from Congress.  Upon the CR’s expiration on March 4, 2011, there is little likelihood the budget picture will improve.  The NIH Office of the Director has provided us with stop-gap funding to alleviate some of our FY2011 and FY2012 funding needs.
     
    Therefore, to ensure that we can provide stability and some degree of reasonable growth for core activities, we have had to identify projects for downsizing or elimination.  In order to meet budget objectives we have had to come to a very difficult decision to downsize the Conserved Domain Database and to eliminate the OSIRIS and Peptidome projects.  The Sequence Read Archive (SRA) will also be phased out over the next 12 months.  Temporary funding from NIH for SRA is expected for at least four, and possibly eight, months.  Beyond that period, staff at the NIH ICs will be examining alternative approaches for SRA-type data.
     
    Our expectation is that we can accomplish most of the restructuring through normal attrition but unfortunately some positions will have to be eliminated.    It is regrettable that we have had to take this drastic action, but it has become unavoidable.
     
    These are difficult times but I am confident that NCBI is positioning itself to offer a stable employment environment and that all of you will continue your outstanding contributions to its success.  I thank you for all your efforts and am grateful for your continuing dedication.
     
    David Lipman
     

  22. The ease of deposition, and subsequent retrieval are both dismal. This is also true, or perhaps even more true, for the ERA.
    The XML is a moving target, and the specification bad enough to be unusable.
    ERA/SRA are also not in sync as is supposed (?) to be the case.

    The post indicating that the SRA is likely to be phased out is immensely concerning as I know a number of groups that use/intend to use the {S}{E}/RA as their primary archive for raw data.

  23. I feel that the DDBJ SRA is better organized than the NCBI one... Browsing NCBI's SRA is a pain and confusing. I found DDBJ a little better.
    http://trace.ddbj.nig.ac.jp/dra/index_e.shtml

  24. It seems that this discussion has unfortunately become obsolete.
    NCBI announced that it will discontinue SRA and other databases.
    http://www.ncbi.nlm.nih.gov/About/news/16feb2011

  25. Just got this but accidentally deleted it:
    attractor (http://attractor.myopenid.com/) has left a new comment on your post "Though I generally love NCBI, the Sequence/Short R...":

    SRA has put too much focus on advanced technologies (e.g. compression and name indexing), but forgotten the end users. Some of its fancy features, such as retrieving reads by names and displaying individual read and intensity, are barely used by anyone, whereas fundamental functionalities such as search, submission and data retrieval are complicated, inconvenient or even broken. Nearly everyone complains that generating the XMLs is of a great pain, but this has not been changed for three years. In fact, all does not need to be that complicated. Describing meta information of NGS reads should not be more complicated than creating a GenBank/EMBL record, but I never see my friends spend weeks to generate a proper GenBank record even given more complex data.

    If I were asked to redesign SRA, I would probably take SAM or fastq as the central format and add submitter, date, sample, species, barcode length, library and platform in the SAM header or a separate text file. These are pretty sufficient for most of our work.

    In general, the design of biological databases should be led by biologists rather than programmers. Programmers tend to think of the latest fancy technologies, but leave most biologists in a mess.
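
As a rough sketch of the "separate text file" idea above, the following Python writes and reads back a small tab-separated metadata file to sit alongside the reads. The field names come from the comment; the tab-separated layout, the function names `write_metadata` and `read_metadata`, and the example values are assumptions made for illustration, not an existing SRA, SAM or FASTQ convention.

```python
import csv
import io

# Fields listed in the comment above; storing them as TSV columns is an assumption.
METADATA_FIELDS = [
    "submitter", "date", "sample", "species",
    "barcode_length", "library", "platform",
]


def write_metadata(records, handle):
    """Write one row per read file, refusing records with missing fields."""
    writer = csv.DictWriter(
        handle, fieldnames=METADATA_FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        missing = [field for field in METADATA_FIELDS if field not in record]
        if missing:
            raise ValueError("missing metadata fields: " + ", ".join(missing))
        writer.writerow(record)


def read_metadata(handle):
    """Read the metadata back as a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(handle, delimiter="\t")]


# Hypothetical record, for illustration only.
example_record = {
    "submitter": "Example Lab",
    "date": "2011-02-08",
    "sample": "soil_01",
    "species": "metagenome",
    "barcode_length": "8",
    "library": "16S_amplicon",
    "platform": "Roche 454",
}

buffer = io.StringIO()
write_metadata([example_record], buffer)
buffer.seek(0)
round_trip = read_metadata(buffer)
```

A flat file like this is easy to validate and to edit by hand, which is the trade-off the comment argues for over deeply nested XML.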

  26. > Nearly everyone complains that generating the XMLs is of a great pain...

    Metadefine helps relieve the pain a little bit in the metadata submission. You can test the tool at https://trace.ddbj.nig.ac.jp/tools/contents/metaDefine. Please be aware that you can create XML files there but cannot submit them to DRA in DDBJ, ERA in EMBL-EBI or SRA in NCBI.

    Hideaki Sugawara
    National Institute of Genetics
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808927/

  27. Hey! For all those who have uploaded data to SRA: how long does it take for the data to get uploaded? I have 3 runs in my project and it's been 3 days now, and I still see the "wait" status.
    For those who are well versed with this, please help me! I am worried that I have gone wrong somewhere. The problem is that I didn't use the FTP server for data transfer because, after downloading the FTP server from the net, I did not know how to use it!

  28. Hi,
    I agree. The database could be improved. For searching purposes, instead of SRA archive, I often use the http://sra.dnanexus.com/ to browse the database. It is easier.
    Best regards.
