Tuesday, April 20, 2010

Yuck - rRNA database restrictions on sharing/reuse create major complications

Well, this is annoying. I started to look at the sharing policies for data from various ribosomal RNA databases. And boy was I surprised. One database, the RDP, run out of Michigan State University, has a page with information about its policies. The page says the following:

By downloading data ("Data") from the Ribosomal Database Project ("RDP"), you agree as follows:
Data are copyrighted by the Michigan State University Board of Trustees. 
You may use the Data for your own non-commercial research purposes, and may make derivatives from the Data for your own non-commercial research purposes. All other rights are reserved by Michigan State University. 
You may not sell the Data or any derivatives you prepare from the Data, nor may you provide the Data or derivatives you prepare to any third party for commercial purposes. 
MSU makes no warranty, express or implied, to you or to any other person or entity, including without limitation the implied warranties of merchantability or fitness for a particular purpose of the data. MSU will not be liable for special, incidental, consequential, indirect or other similar damages, even if MSU or its employees have been advised of the possibility of such damages. 
You will only distribute the Data or derivatives with copyright notices. 
If you publish from the Data, you will acknowledge the contribution of Michigan State University and the Ribosomal Database Project 

This cannot be right, can it? Most of the data came from GenBank, so certainly they cannot copyright it. Now it may be that they are referring to sequence alignments and other derivatives of the raw data, but the wording implies that all the data in the RDP is copyrighted.

Mind you, I do not like the policy even if it is just for the derivatives of the data (e.g., alignments), since this will certainly make some things very difficult in terms of publishing. For example, if I use an alignment from RDP, how do I provide the alignment when I publish a paper? If I provide it, do I only provide it to non-commercial entities? Does that mean, in essence, commercial entities would not be allowed to see alignment figures?

I get that people do not want others to download and then redisplay all of their content, thereby possibly killing the original database. But copyrighting all the data in the database? Even data that is not theirs? Is this just a scare tactic of some sort? A mistake? I cannot tell. There must be better ways to prevent someone from redisplaying the entire database structure and content without such severe tactics.

So - is this an issue just with RDP? Turns out - no. The SILVA database in Europe has some restrictions too:

The SILVA databases and services/tools offered at www.arb-silva.de are FREE FOR ACADEMIC USE. All downloads can be used, modified and redistributed within the academic environment without any limitations.

Users from NON-ACADEMIC/COMMERCIAL ENVIRONMENTS can also directly access all downloads including the results of the SILVA Webaligner (SINA) but only for limited/temporary use (only for test purposes).

If you are interested in unlimited usage of the SILVA databases/services or parts of them within a non-academic/commercial environment, please send an e-mail to ....

Thankfully, though, they do not seem to be trying to copyright or reserve rights for other people's data. They simply refer to downloads from their database and what one can do with such downloads. They never say they own the data itself in any way.

Fortunately greengenes seems to have no restrictions on the use of data or anything downloaded from there.  Though I am still looking into this.

I think it is time for rRNA researchers to think carefully about using data, alignments, etc. from databases like SILVA and RDP. If one uses an alignment from one of these databases, it is possible one would be violating the database policies if one released the alignment as part of a paper. Yet, if one uses the alignment in the paper, one should release it. So it seems better to seek out and use fully open datasets, alignments, and other results.

Below are some discussions relating to some tweets I posted about this issue yesterday:


  1. Very yuck. I'm pretty sure that greengenes has copyright issues because they're the DOE. They were unable to provide source code for anything because it's considered DOE IP. greengenes does a poor job in the variable regions, which causes problems in some analyses. It gets more complicated by the fact that the greengenes alignment is a version of the SILVA alignment from a few years back. The RDP alignment isn't so bad since the aligner uses Infernal (open) and models which you can make yourself. But in variable regions targeted by 454 the bases don't get aligned. SILVA is the best alignment as far as I can tell, but they do tend to have a number of restrictions. Until we made our aligner in mothur it wasn't possible to get SILVA-quality alignments, but you need the SILVA-quality reference alignment. All of this can affect the quality of the science, at least for the for-profit folks.

    I know we briefly emailed about this before, but a question I have for the broader community is how different do things have to be before the copyright doesn't apply? If I modify a set of sequences from SILVA are they still SILVA's?

  2. Actually, Pat, you were the inspiration behind this but I wrote this at 2 am and was not sure if it was OK to reveal the person who had pointed some of this out to me before.

    I do not know the answer to your question of how much you have to modify it. I think it is better to start and end with stuff that is simply open so that these issues do not come up. I definitely think the RDP, SILVA, and Greengenes web sites provide useful services, and I don't want to somehow kill them by pushing for open alignments, but there is no doubt that their policies regarding sharing create serious complications.

  3. It sort-of sounds to me like these are modeled after the GPL or similar open-source software licenses. It seems like the goal is to allow users to access, use and distribute the data freely, but not to make money off of the data.

    It doesn't seem to me like these "copyrights" restrict the ability of users to publish derivative alignments in their papers or distribute such data to other users, so long as the authors acknowledge the original database, which seems fair, and the data are not used to make money (which also seems fair, given the original data were provided for free).

    Anyhow, I haven't read the notices as carefully as you, so maybe there is some inherent limitation that is being imposed. But I don't see it; at least it's not obvious.

    How do you think these "copyrights" would restrict what I could do with the data, as an academic researcher?

    I guess I'm less concerned with how companies could use the data to produce products. I don't think it makes sense to allow a company to take publicly-generated data and build a commercial product off of it.

  4. Bryan

    I disagree with your interpretation of the wording here.

    1. What is a commercial vs. non-commercial use of the data? If someone uses the alignment to design primers to then characterize samples in a for-profit company, is that OK? What about UK universities that are sometimes viewed as commercial entities? This issue has come up over and over again with various licenses and it gets quite complex.

    2. It says you may not sell the data or any derivatives. Well how does that affect someone who runs a core facility charging for rRNA services where they make use of an alignment from RDP as part of their work? Even if the core facility is in an academic center, this seems to violate the wording.

    3. The statement "You will only distribute the Data or derivatives with copyright notices." is also complex. If I make new alignments from their alignments and want to distribute them do I have to include their copyright? If I build a model of rRNA secondary structure from their alignment, publish a paper on this, and then want to deposit or share the secondary structure model, do I need to include their copyright statement? At what point does this end?

    4. As for companies using the data, I am not dying on this issue one way or another. But I note that, to me, the issue is more about sharing being made complicated by the imposition of restrictions. It seems these restrictions are enough to make many people not want to share the derivatives they make from something downloaded from RDP. And that is not good.

  5. That first one from Michigan State seems like it was written by IP people who are not attorneys and don't know that you can't copyright data that isn't your own, and that even for data that is your own, if it was produced using $$ from the NIH, there is a data sharing policy to which they should adhere.

    I was just at study section and found the data sharing plans that I saw in proposals a little shocking. It is like many people don't have any idea at all what the NIH data sharing policy is. Maybe I should just have written a blog post???

  6. Legally, I'm pretty sure they are allowed to copyright this however they like, even if they got the data from Genbank.

    There are some threshold requirements, but I'm pretty sure they've met them.


  7. Russell - that article actually confirms what I was saying. I get that people sometimes can copyright and/or protect a database itself. I am not a huge fan of this here but I get that this may be allowed.
    They claim here to copyright the underlying data, not just the database itself. The article you link to makes it abundantly clear they cannot copyright the data if they obtained it from elsewhere. In fact, they may not be able to copyright these types of data at all, no matter where they obtained them. That is what I had the biggest problem with.

  8. I'm not sure if they are claiming copyright on the underlying data or on the compilation. The language in the notice is somewhat ambiguous on the matter. I suppose it comes down to what you suppose they mean by "the Data," which is not at all clear.

    Then again, if they are claiming ownership of "the Data" as a compilation, then they contradict themselves by forbidding "derivatives you prepare from the Data." If "the Data" is a compilation of stuff they found from elsewhere, how can they tell if your derivative relied on the original source, or on "the Data?"

    This is yet another problem with the way copyright and science interact; the time you spend trying to wrap your brain around sloppy, autocontradictory legal language could have been spent trying to understand something that is actually interesting.

  9. This sort of reminds me of a kerfuffle my friend Ian got himself into when he released a little ANSI drawing program he wrote in high school. In his words, he "didn't know anything about licensing," and "cut and pasted bits and pieces of free software licenses" into a Frankenstein mess.

    Then he forgot about it, until years later he discovered that his project was the subject of an argument on the Debian maintainer lists :


    Licensing is Not Easy and it Has Consequences.

  10. Jonathan,

    what did the RDP say when you e-mailed them?

    (I'm actually at MSU and collaborating with RDPish folk, so I can go whack them with a clue stick as necessary, or at least find out what's going on.)


  11. I have not emailed them recently about this. When we wrote our STAP paper a few years ago we had originally used RDP data as part of the package. But when we finalized the pipeline and the paper we switched over to GreenGenes because we were not allowed to redistribute the RDP alignment as part of our package. So this was some three years ago or so. Until this week I had forgotten about this and how annoyed we were then about having to switch to using a different DB ...

  12. Wow -- don't read blogs for a day or two, and miss all sorts of relevant discussions. This makes me doubly glad that for a recent Enterobacteriaceae 16S tree a student did for our AToL project (thanks, Bing! I'll get it posted to the web site soon) I went directly to GenBank for a bulk download of accession numbers I'd gathered. My initial impetus was to capture the type strain sequences that RDP didn't have, and to clean up Genus and species assignments to reflect current taxonomy, but in the end it was easier to build the data set anew than to edit one from RDP. Looks like I dodged a bullet ;-)