Tuesday, January 13, 2009

Calling all phylogeneticists - we need your help with metagenomic data

I have decided to post a question here to my blog requesting help from phylogeneticists everywhere in doing phylogenetic analysis of data from metagenomic projects. Here I will try to describe the problem and then hopefully people out there can chime in on what they think we/others should do to handle this type of data.

So here is the deal. We would like to perform a variety of phylogenetic analyses of data from "environmental shotgun sequencing (ESS)" projects in which one isolates DNA from an environmental sample (e.g., soil, water) and then randomly sequences fragments of that DNA. ESS is in essence a subset of "metagenomics" which is basically the study of the genomes of organisms from environmental samples. (I wrote a brief piece on ESS in PLoS Biology last year which can be found here).

Though there are lots of things we would like to do with phylogenetic analysis of this type of data, I am going to focus here on one specific thing. We would like to take sequence reads that contain matches to specific gene/gene family (e.g., RecA, my favorite gene), build a multple sequence alignment that includes these reads as well as all members of this gene family from known organisms, and then build phylogenetic trees from these alignments. (And by we here I mean like totally lots of people, incliding in particular a Gordon and Betty Moore Foundation funded project called iSEEM I am working on with the labs of Katie Pollard and Jessica Green)

The challenge with this is really two things. First, we want to analyze just the reads themselves (i.e., we do not want to use assemblies you can make from this type of data). Second, and more importantly, we want to include in our analysis sequence reads that only cover small, not necessarily overlapping regions of the "full length" sequence alignments for the family.
The alignment would look something like
    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX- 
    where Xs are the regions covered by the sequences/fragments (could be DNA or amino acids)

We want to build trees from these alignments with the hope of using them to learn lots of cool things about the evolution of the fragments and the species from which they come. I can provide more information but really the key part for the phylogenetics here is the nature of the alignment.

In the past, I have decided to constrain my analyses to NOT deal with this type of alignments. I have either analyzed each fragment on its own or we have built a multiple alignment but only inlcuded fragments that cover more than 3/4 of the full length sequence and thus the matrix is much more filled out. Such an alignment would look like this


    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX-------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 --XXXXXXXXXXXXXXXXXXXXXXXX--------
    fragment 3 -----XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXXXXXXXXXXXX--
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 --XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX- 
But we really want to include the smaller fragments in our analysis. And we are just not certain how to best do this. We know LOTs of people out there think of similar problems in terms of sparse matrices, supermatrices, supertrees, EST data, etc. And we have ideas about how to do this and are asking around by email some phylogenetics gurus we know. But I thought it might be fun to have the discussion on a blog rather than by email.

So again, how might one best build phylogenetic trees from data that looks like this?

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX- 

And from these trees we want to place each fragment relative to (1) the full length sequences and (2) to each other if possible. We also, of course, want branch lengths to reflect some sort of amount of evolution and thus do not just want a cladogram.

Any suggestions would be appreciated. Fire away with questions too ...

19 comments:

  1. FastTree uses profiles instead of a distance matrix, and so it should do a reasonable job with highly fragmented sequences. In my (very limited) testing with fragmented alignments, FastTree gives better topologies than distance matrix methods, although not as good as PhyML. See the FastTree web site.

    ReplyDelete
  2. Something that got built as part of the sequencing methodology project at Southampton might help here. It was built to try and benchmark the quality of assemblies built from short error prone reads by various extant sequence assemblers but it started by building an alignment of the short reads onto a known template sequence (I think - it was a while ago - I'll post a link to the guy who wrote it and see what he thinks)

    ReplyDelete
  3. I'm almost completely ignorant of this - but isn't there a way to build a tree at each individual position from only those samples which exist at that position, and then combine them statistically?

    So for your example, you would make a tree using only s1, f1, s2, s3, s4, and s5 for positions 1-4, then a tree including fragment 4 for positions 5-8, another excluding f1 for positions 9-13, etc. There ought to be a way to give individuals in the sequence neutral weight in the final tree when they don't have any information input to a particular tree. Then somehow, since you have multiple full-length sequences you can at least tell the relative distance or calculate the "time" from divergence of your fragments, even if you can't calculate exact values.

    I think I might be restating your question without answering it... because I'm really, really not a phylogeneticist.

    ReplyDelete
  4. Morgan - that is great. Do you still have the simulated alignments you used for that test?

    ReplyDelete
  5. here's one attempt [NCBI] "Phylogenetic classification of short environmental DNA fragments."

    ReplyDelete
  6. I don't think you can do it. You don't have enough sequence information to build a reliable tree.

    I'm sure you can find programs that will produce some output but you have to ask yourself whether it will mean anything.

    My rule of thumb is that you need at least 100 aligned residues of a moderately conserved gene (RecA) in order to learn anything about evolution.

    Of URFs and ORFs

    ReplyDelete
  7. Rob

    In your study - did you focus on analyzing each fragment separately or did you also test analyzing them together including non overlapping fragments in one tree?

    From what I remember most of what you did was using each fragment separately.

    ReplyDelete
  8. Another attempt. We developped a "phylogenetic mapping" method to get a global picture of the taxonomic distribution of DNA viruses in the GOS data, by using DNA polB as a phylogenetic marker.

    ReplyDelete
  9. Larry

    You are correct that really small pieces do not do well in phylogenetic analysis (see Barcoding?), but my drawing may have not given the right impression. With new 454 sequencing and/or standard Sanger sequencing, fragments can be 400-1000 bp long. And if you do paired end that have small spacers, then you can get up to 2 kb from single clones/fragments. So we could set some size cutoff that is at least not completely in the twilight zone (from the figure in your link) but we still might have overlap issues (e.g., some genes are > 1500 bp long, and we might have 750 bp from the left end and 750 bp from the right end that we wnat to put into one tree).

    ReplyDelete
  10. I'm fairly sure that there is no legitimate way to make phylogenetic comparisons of completely non-overlapping sequence fragments.

    I think your best bet here is to use a nearest neighbor approach similar to what has been done by the Relman group for metagenomic 16S rRNA fragments. They call the approach "Global Alignment for Sequence Taxonomy".

    The basis of the method is to build a strong set of full-length reference sequences for each of your genes of interest with good taxonomy for each sequence, and build phylogeentic trees for the REFERENCE SEQUENCES. Then your short sequence tags are matched to their nearest neighbor in the reference set. This matching can be done by BLAST or perhaps better by some kind of global alignment method. The phylogeny of the tags is inferred from the phylogeny of the references - so your resolution is limited by the sequences in your reference collection (thus 16S works well). So your final product will be clusters of tags that all map to the same reference sequence (as nearest neighbor).
    Your final tree is a pruned subset of the reference tree that includes only the nodes that are populated by your tags. The members of each cluster can then be multiply aligned.

    —Stuart.Brown (@nyumc.org)

    The Pervasive Effects of an Antibiotic on the Human Gut Microbiota, as Revealed by Deep 16S rRNA Sequencing
    Dethlefsen L, Huse S, Sogin ML, Relman DA
    PLoS Biology Vol. 6, No. 11

    ReplyDelete
  11. Just an idea: how about simultaneous inference of alignment and phylogeny? These methods provide also metrics (scores) of the alignment+tree so that you can tweak your parameters and cover better your alignment and tree space, thus incorporating uncertainty. See POY, StatAlign, BAli-Phy, and BEAST. Of course "reference" sequences would facilitate the establishment of homology.

    ReplyDelete
  12. Stuart

    Thanks for reminding me about the GAST method. I think that is appropriate with really small pieces but I think it is in fact possible to do phylogenetic analysis even when fragments are not overlapping. There is some literature on this in terms of supermatrices and supertrees and it is similar in concept in a way to what GAST does but rather than relying on the full length sequences for ALL phylgoenetic information we want to use some information from the fragments too.

    ReplyDelete
  13. Sergios --- I had not really considered the simultaneous alignment-tree methods but that is a great point. I will look into this a bit more and post again.

    ReplyDelete
  14. Jonathan,

    Just off the top of my head, have you looked at CD-HIT EST?

    I will think a bit more about it, as this problem has started to vex me as well recently.

    ReplyDelete
  15. CD-HIT seems like it could be useful for alignments but that is not our problem. We need to make trees and we can do the alignments using HMMs for key genes of interest.

    ReplyDelete
  16. Katie Pollard has just pointed me to a possibly VERY relevant paper by Todd Vision and others. It is focused on EST alignments but they are basically very similar. It is called "A hierarchical model for incomplete alignments in
    phylogenetic inference"

    ReplyDelete
  17. Raymond Wan points to the following paper

    "Ji Qi, Bin Wang, Bailin Hao, Whole genome prokaryote phylogeny without sequence alignment: a K-string composition approach: Journal of Molecular Evolution 58(1): 1-11 2004."


    Bailin Hao showed that you can do tree analysis using q-gram. I had a fair amount of success with classification of HCV genome using a variant of the technique. The technique works best when you have long sequences and with a lot of variations in the sequences. It is relatively insensitive to gaps and errors in sequence.

    ReplyDelete
  18. I presume that k-mer approach is the same method that is used in their CVTree paper?

    If so, I have found it extremely useful in determining genomic distances in bacteria and have used it in an automated comparative genomics pipeline that identifies genomic islands.

    I would be quite skeptical about using it for your problem since it depends on having lots of data (i.e a whole proteome) and not just a single gene or gene fragment for each representative. However, it wouldn't require too much effort to run a quick test to see if the results make sense.

    ReplyDelete
  19. I have been thinking about the same problem, for slightly different reasons.

    It strikes me that some of the methods might suffer from the same limitations as supertrees (i.e. computing some sort of matrix from the original data, then using the matrix to infer trees), and that a partitioned supermatrix approach would (at least in theory) be more appropriate.

    As far as I understand it, a full maximum likelihood approach should be appropriate as long as the gaps really are missing data, not indels. That is, ML should have an underlying tree, and sum over all possibilities when there's missing data. However there's obviously an additional problem becuase it would be impossible to know the relationships of e.g. 3 non-overlapping taxa.

    Maybe there's some benefit to first figuring out which trees should be equally likely (given sequence overlaps) and then using ML to sort out the remaining parts of treespace. You could then generate an ML set of trees for further analysis.

    ReplyDelete