The Tree of Life: Experiments in scientific sharing contd: Biotorrents

Thursday, April 15, 2010

Experiments in scientific sharing contd: Biotorrents

Yesterday a paper from my lab (by Morgan Langille, with me as co-author) was published in PLoS One: BioTorrents: A File Sharing Service for Scientific Data

In it we describe a new website dedicated to the sharing of biology-related files via BitTorrent, the popular distributed file sharing system. The abstract sums things up pretty well:

The transfer of scientific data has emerged as a significant challenge, as datasets continue to grow in size and demand for open access sharing increases. Current methods for file transfer do not scale well for large files and can cause long transfer times. In this study we present BioTorrents, a website that allows open access sharing of scientific data and uses the popular BitTorrent peer-to-peer file sharing technology. BioTorrents allows files to be transferred rapidly due to the sharing of bandwidth across multiple institutions and provides more reliable file transfers due to the built-in error checking of the file sharing technology. BioTorrents contains multiple features, including keyword searching, category browsing, RSS feeds, torrent comments, and a discussion forum. BioTorrents is available at http://www.biotorrents.net.

Personally, I am not sure if Biotorrents is going to end up being used extensively. I hope so. I think it is a great idea of Morgan's. But more importantly, I believe it represents something we need more and more of in the "Open Science" movement. We need experimentation with all sorts of methods for improving sharing. The sharing of large electronic files, such as datasets of some kind (e.g., sequences, pictures, mass spec results, etc.), is rapidly becoming a major complication in scientific research. If one publishes a paper on whatever, or even before one publishes a paper, sharing the data associated with the work is not always simple. Biotorrents could help with this, since sharing files via BitTorrent is very simple. And if some data sets are of great interest, and if a lot of people start using Biotorrents, then the download and distribution of those data sets will get faster as more and more people serve as hosts and contribute to the distributed file sharing.
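
To make the "built-in error checking" from the abstract a bit more concrete: BitTorrent splits a file into fixed-size pieces and records a SHA-1 hash of each piece in the .torrent file, so a downloader can check every piece it receives and re-fetch only the bad ones. Here is a minimal sketch of that idea in Python. It is not BioTorrents code or a real BitTorrent client; the function names and the piece size are assumptions chosen purely for illustration.

```python
import hashlib
from typing import List

# Assumed piece size for illustration; real torrents choose their own value.
PIECE_LENGTH = 256 * 1024


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> List[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def corrupt_pieces(
    data: bytes, expected: List[bytes], piece_length: int = PIECE_LENGTH
) -> List[int]:
    """Return the indices of pieces whose hash differs from the expected one.

    Assumes data and expected cover the same number of pieces.
    """
    actual = piece_hashes(data, piece_length)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

Because each piece is checked on its own, a single damaged byte only invalidates the piece that contains it, and the rest of a large dataset does not need to be downloaded again.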

If you want to learn more about Biotorrents, the best place to go is Morgan's blog "Beta Science". In particular you should read "An interview with the creator of Biotorrents" where he interviews himself.

Also, Janet Fang of Nature News has just written a brief post on Biotorrents: "Biotorrent aims to open data sharing floodgates" where they quote me and Morgan. I particularly like the ending:

“Someone could download all the Nature papers and post them there, but we’re not encouraging that,” Eisen jokes. All PLoS papers are already on BioTorrents.

More is coming out on the web regarding Biotorrents and I will try to post some links here, including to some slightly older stuff.

Some links:

The Scientist Blog from Bob Grant
Amazing News post
PLoS One press release
GenomeWeb article by Matthew Dublin
John Timmer has written an article for ArsTechnica
FileNetworks
Tim O'Reilly on twitter
Egon Willighagen at chem-bla-ics
IsoHunt on Facebook

FriendFeed Search for Biotorrents

Older discussion on FriendFeed by Morgan et al.

9 comments:

Morgan Langille, 4/15/2010 10:56 AM
Completely agree that whether or not BioTorrents becomes the best data sharing site is not really the point (that would be nice though), but that scientists start thinking about sharing their data and results more openly. We can develop the best file sharing tools in the world, but without the willingness of researchers to share their data they are not of much use.
I would really like to see more result-based types of data on BioTorrents, since there isn't an existing repository for this type of data.
Ross Mounce, 4/16/2010 6:28 AM
I really hope initiatives like this take off. There's so much data out there that probably only gets used once, when it could be re-used and re-analysed multiple times in future analyses.
On a related note, I'd like to see more policing and enforcement by editors and journals on commitments to data publishing. For example, in a recent Science paper, Nesbitt et al. (2009) write that they will make their cladistic data publicly available on Morphobank (Supporting Online Materials, p. 18). Months after publication, and despite a 'reminder' email, their data is still not publicly available.

What can be done to stop 'empty promises' of Open data availability?
Jonathan Eisen, 4/16/2010 8:20 AM
Ross - I agree this is a huge problem. One thing that can be done is to require that data associated with a paper be put into some location/repository that is NOT run by the authors, and that it be made available at the time of publishing. This is what is done with DNA sequence data (most of the time), with authors being required to deposit data in Genbank, EMBL, DDBJ or something similar. In fact most journals will not allow one to say "will be deposited" in these places, but require accession IDs. Perhaps for phylogenetics one could require accession IDs from Morphobank, etc. In general, we need to do much better in making data available.
Ross Mounce, 4/17/2010 6:15 AM
Yep. Couldn't agree more really. I've spent far too much of my time lately finding data, extracting it from pdfs and re-formatting it, rather than doing actual science.

So, the only further point I'd like to add is that data needs to not only be made available when published [the bare-minimum] but also to be made available in a useable, machine-readable/searchable, appropriate format.

'Human-readable' tables of data locked inside pdf's [which seem to be the standard atm for cladistic data in some journals; an atavism from days before the Internet age] only fulfill the bare-minimum requirements of availability - it's published but it's not useable without further re-formatting; at a needless expense of time, effort and possible introduction of error. This is my experience; a lot of published data is only 'pseudo-available' - it's technically there but barely useable.

Thus I think it's important to stress that free 'availability' is a great thing, but care and thought MUST also be taken with regard to the useability of the 'available-data'.

BioTorrents, Morphobank, Treebase, Genbank etc... might have imperfections but they make data available AND useable. Long may they continue :D
Jonathan Eisen, 4/19/2010 9:24 AM
Completely agreed, Ross - usability is critical.
Jonathan Eisen, 4/21/2010 8:34 AM
I found a 2008 paper that discusses how BitTorrent could be of use to share biological data in developing countries with low bandwidth:

http://bioinformatics.oxfordjournals.org/cgi/content/full/24/2/299
Jonathan Eisen, 4/21/2010 8:35 AM
Another little web story about Biotorrents: http://www.computeach.co.uk/IT-news/IT-Computer-Technology-News/Cheap-option-for-file-transfers-launched/19733193
dalloliogm, 4/21/2010 9:00 AM
I was skeptical at the beginning, but I have been convinced that it is a very good idea after answering a question on biostar (http://biostar.stackexchange.com/questions/391/how-do-i-import-data-from-a-torrent-into-a-bioperl-r-bioclipse-or-taverna-appl)

The only drawback I see is that many databases update frequently, so they will need to maintain a separate torrent for each release.