Friday, May 11, 2012

'Danger and Evolution in the Twilight Zone': Guest post by Randen Patterson and Gaurav Bhardwaj


Figure 1. PHYRN concept and work flow.

I have been communicating with Randen Patterson on and off over the last five years or so about his efforts to study the evolution of gene families when the sequence similarity in the gene family is so low that making multiple sequence alignments is very difficult.  Recently, Randen moved to UC Davis so I have been talking / emailing with him more and more about this issue.  Of note, Randen has a new paper in PLoS One about this topic: Bhardwaj G, Ko KD, Hong Y, Zhang Z, Ho NL, et al. (2012) PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences. PLoS ONE 7(4): e34261. doi:10.1371/journal.pone.0034261.


Figure 8. Model for the Evolution of the DANGER Superfamily.

I invited Randen and the first author, Gaurav Bhardwaj, to do a guest post here providing some of the story behind their paper for my ongoing series on this topic.  If you have published an open access paper on some topic related to this blog, I would love to have a guest post from you too.  I also note that I personally love the fact that they used the "DANGER" family as an example to test their method.


Here is their guest post:

A fundamental problem to phylogenetic inference in the “twilight zone” (<25% pairwise identity), let alone the “midnight zone” (<12% pairwise identity), is the inability to accurately assign evolutionary relationships at these levels of divergence with statistical confidence. This lack of resolution arises from difficulties in separating the phylogenetic signal from the random noise at these levels of divergence. This obviously and ultimately stymies all attempts to truly resolve the Tree of Life. Since most attempts at phylogenetic inference in the twilight/midnight zone have relied on multiple sequence alignment (MSA), and with no clear answer on the best phylogenetic methods to resolve protein families in the twilight/midnight zone, we have presented the rest of this blog post as two questions representative of these problems.
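To make the two thresholds above concrete, here is a minimal Python sketch that computes percent identity for an already aligned pair of sequences and labels the result. The function names are hypothetical, and the choice of denominator (columns where neither sequence has a gap) is an assumption; other definitions of percent identity are in use and give somewhat different numbers.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences.

    Assumption: only columns where neither sequence has a gap ('-')
    are counted. Other denominators are also used in practice.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def zone(identity: float) -> str:
    """Label a percent identity using the thresholds quoted in the post."""
    if identity < 12.0:
        return "midnight zone"
    if identity < 25.0:
        return "twilight zone"
    return "above twilight zone"


if __name__ == "__main__":
    pid = percent_identity("MKV-LAT", "MRVQLGT")
    print(f"{pid:.1f}% identity: {zone(pid)}")
```

The toy pair in the usage lines is far more similar than anything in the twilight zone; it is only there to show the call pattern.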
Question 1: Is MSA required for accurate phylogenetic inference? 
Our Opinion: MSA is an excellent tool for inference from conserved data sets, but it has been shown by others and by us that the quality of MSA degrades rapidly in the twilight zone. Further, the quest for an optimal MSA becomes increasingly difficult as the number of taxa under study increases. Although the quality of MSA methods has improved over the last two decades, we have not made significant progress toward overcoming these problems. Multiple groups have also designed alignment-free methods (see Hohl and Ragan, Syst. Biol. 2007), but so far none of these methods has been able to provide better phylogenetic accuracy than MSA+ML methods. We recently published a manuscript in PLoS One entitled “PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences” introducing a hybrid profile-based method. Our approach focuses on measuring phylogenetic signal from homologous biological patterns (functional domains, structural folds, etc.), and on their subsequent amplification and encoding as phylogenetic profiles. Further, we adopt a distance estimation algorithm that is alignment-free, and thus bypasses the need for an optimal MSA. Our benchmarking studies with synthetic (from ROSE and Seq-Gen) and biological data sets show that PHYRN outperforms traditional methods (distance, parsimony and Maximum Likelihood), and provides accurate phylogenies even in data sets exhibiting ~8% average pairwise identity. While this still needs to be evaluated in other simulations (varying tree shapes, rates, models), we are convinced that these types of methods do work and deserve further exploration. 
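For readers unfamiliar with the general idea of alignment-free distances, the sketch below shows one generic variant: each sequence is reduced to a k-mer count profile, and pairs of profiles are compared with a cosine distance. This is not the PHYRN algorithm, which builds its profiles from homologous patterns as described in the paper; the function names, the choice of k = 2, and the cosine measure are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from typing import Dict, Tuple


def kmer_profile(seq: str, k: int = 2) -> Counter:
    """Count overlapping k-mers in a sequence (no alignment involved)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(p: Counter, q: Counter) -> float:
    """1 minus cosine similarity of two k-mer count profiles."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # a sequence shorter than k has no profile to compare
    return 1.0 - dot / (norm_p * norm_q)


def distance_matrix(seqs: Dict[str, str], k: int = 2) -> Dict[Tuple[str, str], float]:
    """Pairwise distances keyed by (name_a, name_b), one entry per unordered pair."""
    profiles = {name: kmer_profile(s, k) for name, s in seqs.items()}
    return {
        (a, b): cosine_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    toy = {"s1": "MKVLATGMKVLA", "s2": "MKVLGTGMRVLA", "s3": "PPQWERTYPPQW"}
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 3))
```

A matrix like this could then be passed to any distance-based tree builder such as neighbor joining. Whether a composition-based distance of this kind retains useful signal at twilight-zone divergence is exactly the open question discussed above.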
Question 2: How can we as a field critically and fairly evaluate phylogenetic methods? 
Our Opinion: A similar problem plagued the field of structural biology, whereby there were multiple methods for structural prediction but no clear way of standardizing or evaluating their performance.  An additional problem that applies to phylogenetic inference is that, unlike crystal structures of proteins, phylogenies do not have a corresponding “answer” that can be obtained.  Synthetic data sets have tried to answer this question to a certain extent by simulating protein evolution and providing true evolutionary histories that can be used for benchmarking.  However, these simulations cannot truly replicate biological evolution (e.g. indel distribution, translocations, biologically relevant birth-death models, etc.). In our opinion, we need a CASP-like model (the solution adopted by our friends in computational structural biology), where the same data sets (with the true evolutionary history known only to the organizers) are analyzed by all the research groups, and the inferred phylogenies are then submitted to the organizers for critical evaluation. To convert this thought to reality, we hereby announce CAPE (Critical Assessment of Protein Evolution) for Summer 2013. We are still in pre-production stages, and we welcome any suggestions, comments and inputs about data sets, scoring and evaluating methods.   
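As one illustration of what scoring against a known true history could look like, the sketch below computes a Robinson-Foulds-style distance between a true tree and an inferred tree: the number of clades found in one tree but not the other. This is only an assumption about one possible scoring scheme, not something announced for CAPE. Trees are written as nested Python tuples with strings as leaves, and the rooted, clade-based variant is used for simplicity; the function names are hypothetical.

```python
def clades(tree) -> set:
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    A tree is a nested tuple whose leaves are strings, e.g. (("A", "B"), "C").
    """
    found = set()

    def leaves(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)  # every internal node defines one clade
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(true_tree, inferred_tree) -> int:
    """Count clades present in exactly one of the two trees."""
    return len(clades(true_tree) ^ clades(inferred_tree))


if __name__ == "__main__":
    true_tree = ((("A", "B"), "C"), ("D", "E"))
    inferred = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(true_tree, inferred))
```

A distance of 0 means every clade was recovered. A measure like this ignores branch lengths, so a real assessment would probably combine several scores.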

Bhardwaj, G., Ko, K., Hong, Y., Zhang, Z., Ho, N., Chintapalli, S., Kline, L., Gotlin, M., Hartranft, D., Patterson, M., Dave, F., Smith, E., Holmes, E., Patterson, R., & van Rossum, D. (2012). PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences. PLoS ONE, 7 (4). DOI: 10.1371/journal.pone.0034261

4 comments:

  1. Summer 2013 not 2012 ;0)

  2. Randen Patterson and Gaurav Bhardwaj said,

    A fundamental problem to phylogenetic inference in the “twilight zone” (<25% pairwise identity), let alone the “midnight zone” (<12% pairwise identity), is the inability to accurately assign evolutionary relationships at these levels of divergence with statistical confidence. This lack of resolution arises from difficulties in separating the phylogenetic signal from the random noise at these levels of divergence.

    I don't think this is quite right. The most fundamental problem is whether similarities in the twilight zone arise by chance or common ancestry. It is not good science to assume, without evidence, that the similarities are due to common descent and not accident or convergence (Doolittle 1981). The statement implies that false assumption because it assumes there's a "phylogenetic signal" that needs finding.

    As Doolittle (1987) points out, two completely unrelated amino acid sequences can be aligned to give 10-20% sequence identity. This does not mean they evolved from a common ancestor.

    The goal should be to prove that common descent is the correct explanation. You don't do this by ASSUMING common descent then looking for algorithms that confirm your assumption.

    Incidentally, the same problem exists when comparing protein domains that are structurally similar. Are they actually homologous or is that just an unproven conclusion that's automatically applied to any two similar structures?

    Doolittle, R. (1981) Similar Amino Acid Sequences: Chance or Common Ancestry? Science 214:149-159.

    Doolittle, R. (1987) "Of URFS and ORFS" University Science Books, Mill Valley California, USA

    Replies
    1. I am with you on this Larry - my biggest concern in many evolutionary studies of proteins with low levels of sequence similarity is convergent evolution ...

    2. Which is why we need methods for separating convergence. It is possible. Proteins that arise convergently should be able to be resolved through statistical evaluation. Our efforts in this area have led us to conclude that even folds that have evolved many times (e.g. HAD domains) can be separated and correctly annotated by family using the phylogenetic signals we can derive. This also appears to be true for the 4206 RNA viruses we have studied. If you can separate convergent families with statistical reliability, what would the limits be? :o)
