Friday, June 13, 2008

Open Metagenomics: Selenium in the Oceans

Well, I previously started an "Open Evolution" series here and now I am starting an "Open Metagenomics" series. I know, I have gotten grief from some out there (yes, you Rob Edwards - see comments here) about my support for somewhat non-open things in metagenomics, so I am going to try to make up for that as much as possible.

In the first installment, I am pointing people to a new paper in PLoS Genetics, "Trends in Selenium Utilization in Marine Microbial World Revealed through the Analysis of the Global Ocean Sampling (GOS) Project" by Yan Zhang and Vadim N. Gladyshev (hat tip to Katie Pollard for pointing out this paper).

In this paper the authors study selenium utilization using data from the first part of the Venter Global Ocean Sampling (GOS) expedition, which involved metagenomic sequencing of multiple samples - mostly surface ocean water. The GOS data they use come from the Rusch et al. paper in PLoS Biology (note for full disclosure ... I was a co-author on this paper).

There have been challenges in the past with getting and using metagenomic data from other people's publications, and I note that the authors here obtained the data sets from CAMERA, a metagenomics database supported by the Moore Foundation. It is my support of this database that Rob Edwards gave me grief about, since the database is not currently completely open (e.g., you need to register to use it and the software that runs it is not currently all open source).

Anyway, they got the data from CAMERA, and then did a pretty comprehensive analysis to search for genes and features in the data that would be indicative of selenium utilization. Selenium is of great interest to many biologists for many reasons, including that it is required for the synthesis and function of selenocysteine (Sec), which, if you do not know, occasionally goes by the nickname "the 21st amino acid".

Without going into all the details of the paper, the last paragraph sums up the major features:
In this study, we report a comprehensive analysis of Sec utilization in marine microbial samples of the GOS expedition by characterizing the GOS selenoproteome. This is the first time that the microbial selenoprotein population is described in a global biogeographical context. Our analysis yielded the largest selenoprotein dataset to date, provided a variety of new insights into Sec utilization and revealed environmental factors that influence Sec utilization in the marine microbial world.
My favorite part of the paper is that they map some of the selenium-related features onto the globe. For example, in one figure they show the inferred selenoprotein "richness" in the different samples. (Selenoproteins are proteins that have selenocysteine in them.) Now I am sure they made many assumptions in reaching their inferences about selenium utilization, and I am also sure some of these will turn out to be wrong. But to me, this paper is a good example of what researchers will be able to do with metagenomic data in the future. Sitting at their computers anywhere in the world, researchers can now ask questions about the distribution patterns of functions in microbes around the world. Pretty cool. And the more open we are with the papers, the tools, and the data, the more likely this type of work is to spread.
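For anyone wondering what this kind of mapping takes in practice, here is a minimal sketch in Python. It is not the authors' method. The sample IDs, coordinates, and counts are made-up placeholders (not GOS values), the function names are my own, and I am assuming "richness" can be approximated as selenoprotein hits per 100,000 reads, which may well differ from the definition used in the paper. The output is GeoJSON, a format most web mapping tools can read.

```python
import json

# Hypothetical per-sample records. The IDs, coordinates, and counts are
# made-up placeholders for illustration, NOT values from the GOS data.
SAMPLES = [
    {"id": "sample_A", "lat": 32.0, "lon": -64.0, "selenoprotein_hits": 12, "reads": 60000},
    {"id": "sample_B", "lat": -1.0, "lon": -90.0, "selenoprotein_hits": 30, "reads": 120000},
]


def richness_per_100k_reads(hits, reads):
    """Normalize hit counts by sequencing depth (an assumed richness measure)."""
    if reads <= 0:
        raise ValueError("reads must be positive")
    return hits * 100000 / reads


def to_geojson(samples):
    """Turn sample records into a GeoJSON FeatureCollection of points."""
    features = []
    for s in samples:
        richness = richness_per_100k_reads(s["selenoprotein_hits"], s["reads"])
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [s["lon"], s["lat"]]},
            "properties": {"id": s["id"], "richness": round(richness, 2)},
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    print(json.dumps(to_geojson(SAMPLES), indent=2))
```

Normalizing by read count matters because samples sequenced more deeply will turn up more hits regardless of the underlying biology.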

The figures are from the paper and I am permitted to use them here because they were published under a Creative Commons license that allows anyone to use them as long as the source is cited. The source is Zhang Y, Gladyshev VN (2008) Trends in Selenium Utilization in Marine Microbial World Revealed through the Analysis of the Global Ocean Sampling (GOS) Project. PLoS Genet 4(6): e1000095. doi:10.1371/journal.pgen.1000095 


  1. I like the mapping too. What's more, since the GOS data in GenBank are all tagged with latitude/longitude/depth coordinates, someone could create a cool mashup using something like the Google Maps API. Might be a nice student project.

  2. I am surprised this has not been done yet (well, maybe it has --- anyone?). I am going to teach a metagenomics course, probably in the Winter next year. Maybe I will try and get someone to do this.

  3. It is true, one of the most important steps in interpreting massive amounts of sequence reads is the ability to interpret the data in its environmental context: map it! But the benefit of mapping is only as great as the environmental parameters describing the x,y coordinates.

    A note on the mapping: this is indeed work in progress by the team at MPI for Marine Microbiology. Its strength comes from additional environmental data layers provided by the World Ocean Atlas Extractor (temperature, salinity, dissolved oxygen, Apparent Oxygen Utilization, % oxygen saturation, phosphate, silicate, and nitrate at standard depths, for annual, seasonal, and monthly points).

    And, drum roll...the next update this fall (also watch the NAR database issue 2009) will include mapping the GOS sites on the Mapserver with these additional parameters.

