I am very excited about today's post. It is the first in what I hope will be many - posts from authors of interesting papers describing the "Story behind the paper". I write extensive detailed posts about my papers and also have tried to interview others about their papers if they are relevant to this blog. But Matthew Hahn approached me recently about the possibility of him writing up some details on his recent paper on the functions of orthologs vs. paralogs. So I said "sure" and set up a guest account for him to write up his comments and details of the paper.
For those of you who do not know, Matt is on the faculty at U. Indiana. He was a post doc at UC Davis so I have a particular bias in favor of him. But his recent paper has generated some controversy (I posted some links about it here). So it is great to get some more detail from him. In addition, I note, I am also using this approach to try and teach people how easy it is to write a blog post by getting them guest accounts on Blogger and letting them write up something with links, pictures, etc. So hopefully we can get more scientists blogging too.
Anyway - without any further ado - here is Matt's post:
Following Jonathan’s excellent example of how explaining the history of a project helps to illuminate how the process of science actually happens, I thought I’d start by giving a bit of history behind our study, and the paper that we recently published in PLoS Computational Biology (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002073). And then I’ll address the critics…
Note - Toni Gabaldón was trying to post a detailed response but Blogger kept cutting him off with a character limit. So I have posted his response below.
I appreciate the effort by Matthew Hahn in explaining the story behind his paper on the so-called "ortholog conjecture" and in facing some of the criticism. This paper attracted my interest, as it did that of many others who work on or just use orthology. For instance, it was chosen by one of my postdocs for our "Journal Club" meeting, and it was discussed during our last "Quest for Orthologs" meeting in Cambridge. I think it is raising a necessary discussion and therefore I think it is a good paper. This does not mean that I fully agree with the interpretation and conclusions ;-). I hope to modestly contribute to this debate with the following post.
I think one of the reasons this paper has generated so much debate is that the conclusions seem to challenge common practice (inferring function from orthologs), and could be interpreted as a need to change the strategies of genome annotation. I think, however, that one should interpret these results carefully before starting to annotate based on paralogous proteins. As I will discuss below, one of the problems is that we need to agree on what the conjecture is in order to agree on how to test it. I see three main points that can be a source of confusion: i) the issue of what is actually stated by this conjecture, ii) the issue of annotation, and iii) the issue of time.
1) What is the "ortholog conjecture"?
Or, in other terms, when should we expect orthologs to be more likely to share function than paralogs? Always? Of course not. All of us would agree that two recently duplicated paralogs are likely to be more similar in function than two distant orthologs, so it is obvious that the conjecture is not simply "orthologs are more similar in function than paralogs". In reality, the expectation that orthologs are more likely to be similar in function than paralogs, at least as I interpret it, is directly related to the effect that duplication has on functional divergence. If gene duplication has some effect on functional divergence (even if not in 100% of the cases), then, all other things being equal (divergence time, history of speciation/duplication events, except for the duplication defining the orthologs), one would expect orthologs to be more likely to conserve function.
I think this complexity is not well considered (by many authors, in general). Hahn refers to the famous review of orthology by Koonin (2005) as the source for the term "ortholog conjecture". However, in that paper this conjecture is always discussed within the context of genes across two particular species, whereas in Hahn's paper it is extended to other contexts as well. Thus, the proper context in which to test this conjecture is only between orthologs and between-species paralogs. As we can see, the red and purple lines in Figure 2 of Hahn's paper do not show any clear difference.
Secondly, Koonin was very cautious in his paper, stating that he was referring to "equivalent functions" and not exactly the same "function", correctly implying that the functional contexts would be different in the two different species. This brings me to the next point.
If the expectation of functional conservation of orthologs refers to a given pair of species, then it makes no sense to test that expectation between paralogs within the same species and orthologs in different species. We were interested in this issue and it took us some effort to control for this "species" influence on the comparison. If you are interested, you can read our paper on divergence of expression profiles between orthologs and paralogs (http://www.ncbi.nlm.nih.gov/pubmed/21515902).
As Hahn finds, and as Koonin anticipated in that review, there is a huge influence of the "species context", which strongly constrains what fraction of the function is shared. Indeed, I think it is the dominant signal in Hahn's paper. Why is that? One possibility is that the functional context determines the function, and I agree. However, we should not discard biases in how the different communities working on each model species define processes and functions, or in the types of experiments that are usually done. For instance, experimental inference from KO mutants might be common in mouse, but I guess this is not the case in humans (!!). I think these biases may be having a big influence and might even be the dominant signal in Hahn's paper.
Finally, function has many levels and I expect subfunctionalization to mostly affect the lower (i.e. more specific) levels. Biases may also exist in the level of annotation between species or between families of different size (contributing more or less to the orthologs/paralogs class).
Microarray data are less likely to be subject to biases (although some may exist); at least they should be expected to be free of "human interpretation biases", and so Hahn and colleagues did well, in my opinion, in testing that dataset. It is important to note that for microarrays, and for orthologs and between-species paralogs (which I think is the right frame for testing the conjecture), orthologs are more likely to share an expression context. This is compatible with what we found in the paper mentioned above, and compatible with the ortholog conjecture as stated by Koonin (across species).
Finally, one aspect which I think is fundamental is the notion of "divergence time". Since paralogs can emerge at different time-scales, they make up a heterogeneous set of protein pairs. Most comparisons of orthologs and paralogs (Hahn's as well) use sequence divergence as a proxy for time. However, this is only a poor estimate, especially when duplications (as here) are involved (we explored this issue in the past: http://www.ncbi.nlm.nih.gov/pubmed/21075746). This means that, for a given divergence time, paralogs may have larger sequence divergence than orthologs, or the other way around (if gene conversion is playing a role). Is the conjecture based on sequence divergence or on divergence time? I think the initial sense of using orthology to annotate across species is based on the notion of comparing things at the same evolutionary distance. Thus basing our conclusions on divergence times might not be the proper way of doing it.
CONCLUSIONS AND PROPOSAL FOR RE-STATEMENT
To conclude, and with the intention of going beyond this particular paper, I would finish by saying that the key to the problem lies in how we interpret the so-called "ortholog conjecture", or in what our expectations are about how function evolves. What I get from re-reading Eugene Koonin's paper, and how I am using that "assumption" in my day-to-day work, is the following:
"Orthologs in two given species are more likely to share equivalent functions than paralogs between these two species"
Therefore the notion of "across the same pair of species" is important, and thus only part of the comparisons made by Hahn and colleagues could directly test this. Looking at the microarray and between-species comparison data, the conjecture may even hold true!!
I do think, however, that the conjecture as stated above is limited and does not capture the complexity of orthology relationships. Indeed we, and many other researchers, tune the confidence of orthology-based annotation based on whether the orthologs are one-to-one, one-to-many or many-to-many, even when orthologs are "super-orthologs" (with no duplication event in the lineages separating the two orthologs).
Since the underlying assumption of the ortholog conjecture is that duplication may (not necessarily always) promote functional shifts, many-to-many orthology relationships will tend to include orthologous pairs with different functions.
Thus I would re-state the conjecture (or expectation) as follows:
"In the absence of additional duplication events in the lineages separating them, two orthologous genes from two given species are more likely to share equivalent functions than two paralogs between these two species"
This would be a more conservative expectation, which is closer to the current use of orthology-based annotation that tends to identify one-to-one orthologs, rather than any type.
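To make the one-to-one, one-to-many and many-to-many distinction concrete, here is a minimal Python sketch that labels each orthologous pair between two species by counting how many partners each gene has in the other species. The function name, the input format (a list of gene-name pairs) and the toy gene names are all assumptions made for illustration; this is not the procedure used in any of the papers discussed here.

```python
from collections import defaultdict


def classify_orthology(pairs):
    """Label each (gene_a, gene_b) orthologous pair between species A and B.

    `pairs` is a hypothetical input: a list of predicted orthologous pairs,
    with the species A gene first and the species B gene second.
    """
    partners_of_a = defaultdict(set)  # species A gene -> its orthologs in B
    partners_of_b = defaultdict(set)  # species B gene -> its orthologs in A
    for gene_a, gene_b in pairs:
        partners_of_a[gene_a].add(gene_b)
        partners_of_b[gene_b].add(gene_a)

    labels = {}
    for gene_a, gene_b in pairs:
        several_in_b = len(partners_of_a[gene_a]) > 1  # gene_a has co-orthologs in B
        several_in_a = len(partners_of_b[gene_b]) > 1  # gene_b has co-orthologs in A
        if several_in_a and several_in_b:
            labels[(gene_a, gene_b)] = "many-to-many"
        elif several_in_a or several_in_b:
            labels[(gene_a, gene_b)] = "one-to-many"
        else:
            labels[(gene_a, gene_b)] = "one-to-one"
    return labels


# Toy example with made-up gene names.
toy_pairs = [
    ("a1", "b1"),                # no duplication on either side
    ("a2", "b2"), ("a2", "b3"),  # duplication in the species B lineage
    ("a3", "b4"), ("a3", "b5"),  # duplications in both lineages
    ("a4", "b4"), ("a4", "b5"),
]
labels = classify_orthology(toy_pairs)
```

Under the restated expectation, only the pairs labelled "one-to-one" would be the ones compared against between-species paralogs.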
When duplications start appearing in subsequent lineages, thus creating one-to-many or many-to-many orthology relationships, the situation is less clear. Following the assumption that duplications may promote functional divergence, one could expand the conjecture with "the more duplications in the evolutionary history separating two genes, the lower the expectation that these two genes would share equivalent functions".
I wrote this contribution on the fly, and surely there are ways of expressing this in more appropriate terms. In any case, I hope I made clear the idea that the conjecture emerges from the notion of duplications causing functional shifts, and that our expectations will be clearer if expressed in those terms. This goes along the lines of what Jonathan Eisen mentioned about considering the whole phylogenetic story to annotate genes.
Under this perspective, the really important hypothesis is that "duplications tend to promote functional shifts". I think this is based on solid grounds and has been tested intensively in the past.
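One way to read "duplications in the evolutionary history separating two genes" is as the number of duplication nodes on the path that joins the two genes in a gene tree. The sketch below counts them, assuming the tree is given as a child-to-parent dictionary and that every internal node has already been labelled as "speciation" or "duplication". Both the representation and the toy tree are hypothetical, and the two genes are assumed to be distinct leaves of the same tree.

```python
def ancestors(parent, node):
    """Return the ancestors of `node`, nearest first, ending at the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def duplications_between(parent, events, gene_x, gene_y):
    """Count duplication nodes on the tree path joining two leaf genes.

    The count includes the last common ancestor, so two paralogs always
    score at least 1.
    """
    anc_x = ancestors(parent, gene_x)
    anc_y = ancestors(parent, gene_y)
    shared = set(anc_y)
    # The first ancestor of gene_x that is also an ancestor of gene_y.
    lca = next(node for node in anc_x if node in shared)
    path = anc_x[:anc_x.index(lca)] + anc_y[:anc_y.index(lca)] + [lca]
    return sum(1 for node in path if events[node] == "duplication")


# Toy gene tree (made-up names): C1 is an outgroup gene, B1 a species B gene,
# and A1/A2 arose by a duplication (D1) in the species A lineage.
parent = {"A1": "D1", "A2": "D1", "D1": "S1", "B1": "S1", "S1": "S0", "C1": "S0"}
events = {"S0": "speciation", "S1": "speciation", "D1": "duplication"}

n_b1_c1 = duplications_between(parent, events, "B1", "C1")  # one-to-one orthologs
n_a1_b1 = duplications_between(parent, events, "A1", "B1")  # orthologs, one duplication since
n_a1_a2 = duplications_between(parent, events, "A1", "A2")  # paralogs
```

In this toy tree the B1/C1 pair has no duplication on its path, while A1/B1 has one, so under the expanded conjecture the first pair would carry the higher expectation of sharing equivalent functions.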