I have decided to post a question here on my blog, requesting help from phylogeneticists everywhere with the phylogenetic analysis of data from metagenomic projects. Here I will try to describe the problem, and then hopefully people out there can chime in on what they think we/others should do to handle this type of data.
So here is the deal. We would like to perform a variety of phylogenetic analyses of data from "environmental shotgun sequencing (ESS)" projects, in which one isolates DNA from an environmental sample (e.g., soil, water) and then randomly sequences fragments of that DNA. ESS is in essence a subset of "metagenomics", which is basically the study of the genomes of organisms from environmental samples. (I wrote a brief piece on ESS in PLoS Biology last year which can be found here.)
Though there are lots of things we would like to do with phylogenetic analysis of this type of data, I am going to focus here on one specific thing. We would like to take sequence reads that contain matches to a specific gene or gene family (e.g., RecA, my favorite gene), build a multiple sequence alignment that includes these reads as well as all members of this gene family from known organisms, and then build phylogenetic trees from these alignments. (And by we here I mean like totally lots of people, including in particular a Gordon and Betty Moore Foundation funded project called iSEEM that I am working on with the labs of Katie Pollard and Jessica Green.)
The challenge with this is really two things. First, we want to analyze just the reads themselves (i.e., we do not want to use assemblies you can make from this type of data). Second, and more importantly, we want to include in our analysis sequence reads that only cover small, not necessarily overlapping regions of the "full length" sequence alignments for the family.
The alignment would look something like
sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 1 XXXXXXXXX-------------------------
sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 2 ---------XXXXXXXXXXXX-------------
fragment 3 ---------------------XXXXXXXXXXXXX
fragment 4 ----XXXXXXXXXXXXXXXXXX------------
sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 5 -----------------------XXXXXXXXXXX-
where Xs are the regions covered by the sequences/fragments (could be DNA or amino acids)
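To make the "not necessarily overlapping" part concrete, here is a minimal Python sketch that counts how many alignment columns each pair of rows shares. The toy alignment, the row names, and the helper functions `covered_columns` and `shared_columns` are all made up for illustration; they are not from any existing tool or from our pipeline.

```python
from itertools import combinations

# Toy alignment (hypothetical data): 'X' marks a covered column, '-' a gap.
alignment = {
    "sequence 1": "XXXXXXXXXXXX",
    "fragment 1": "XXXX--------",
    "fragment 2": "----XXXX----",
    "fragment 3": "--XXXX------",
}

def covered_columns(row, gap="-"):
    """Return the set of column indices where the row has a residue."""
    return {i for i, ch in enumerate(row) if ch != gap}

def shared_columns(aln):
    """Count the columns covered by both rows, for every pair of rows."""
    cols = {name: covered_columns(row) for name, row in aln.items()}
    return {(a, b): len(cols[a] & cols[b]) for a, b in combinations(aln, 2)}

overlap = shared_columns(alignment)
for pair, n in overlap.items():
    print(pair, n)
```

In this toy case fragment 1 and fragment 2 share zero columns, so any method that compares them directly has nothing to work with; they can only be related through the full length sequences.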
We want to build trees from these alignments with the hope of using them to learn lots of cool things about the evolution of the fragments and the species from which they come. I can provide more information but really the key part for the phylogenetics here is the nature of the alignment.
In the past, I have decided to constrain my analyses to NOT deal with this type of alignment. I have either analyzed each fragment on its own, or we have built a multiple alignment but only included fragments that cover more than 3/4 of the full length sequence, so that the matrix is much more filled out. Such an alignment would look like this:
sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX-------
sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 2 --XXXXXXXXXXXXXXXXXXXXXXXX--------
fragment 3 -----XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 4 ----XXXXXXXXXXXXXXXXXXXXXXXXXXXX--
sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 5 --XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-
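For what it is worth, that filtering step is simple to sketch in Python. This assumes coverage is measured as the fraction of alignment columns in which a row is not a gap; the toy rows and the function names `coverage` and `filter_by_coverage` are hypothetical, not the code we actually used.

```python
# Toy alignment (hypothetical data): 'X' marks a covered column, '-' a gap.
alignment = {
    "sequence 1": "XXXXXXXXXXXX",
    "fragment 1": "XXXXXXXXXX--",
    "fragment 2": "---XXXX-----",
    "fragment 3": "-XXXXXXXXX--",
}

def coverage(row, gap="-"):
    """Fraction of alignment columns in which this row has a residue."""
    return sum(ch != gap for ch in row) / len(row)

def filter_by_coverage(aln, min_cov=0.75):
    """Keep only rows covering MORE than min_cov of the alignment columns."""
    return {name: row for name, row in aln.items() if coverage(row) > min_cov}

kept = filter_by_coverage(alignment)
print(sorted(kept))
```

Note that the comparison is strict, so a fragment covering exactly 3/4 of the columns (fragment 3 here) gets dropped along with the short ones.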
But we really want to include the smaller fragments in our analysis, and we are just not certain how best to do this. We know LOTS of people out there think about similar problems in terms of sparse matrices, supermatrices, supertrees, EST data, etc. We have ideas about how to do this and are asking some phylogenetics gurus we know by email. But I thought it might be fun to have the discussion on a blog rather than by email.
So again, how might one best build phylogenetic trees from data that looks like this?
sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 1 XXXXXXXXX-------------------------
sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 2 ---------XXXXXXXXXXXX-------------
fragment 3 ---------------------XXXXXXXXXXXXX
fragment 4 ----XXXXXXXXXXXXXXXXXX------------
sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment 5 -----------------------XXXXXXXXXXX-
And from these trees we want to place each fragment relative to (1) the full length sequences and (2) each other, if possible. We also, of course, want branch lengths to reflect some measure of the amount of evolution, and thus do not just want a cladogram.
Any suggestions would be appreciated. Fire away with questions too ...