Comments on The Tree of Life: Useful comparative analysis of sequence classification systems w/ a few questionable bits
Jonathan Eisen (14 comments; feed updated 2024-03-28)

2012-09-26 13:20:53 -07:00
For anyone who stumbles upon this thread (as I just did), I thought I would mention that the final (non-provisional) version of this paper is now online: http://www.biomedcentral.com/1471-2105/13/92

It contains some corrections to the previous version, most notably to some PhymmBL values. Thanks to Arthur Brady and Steven for helping resolve those issues.

To the concern Steven mentions in the last post: this was simply a design decision, primarily in the interest of time. We chose to evaluate the case where the query read is represented in the database, which reflects the use case where people are interested in previously described organisms within their sample. We realize that this experimental design surely yields different relative performance among programs compared to a design based on clade-level exclusion of sequences (in which PhymmBL may very well out-compete other programs). However, we consider our design valid for the use case described.
In the updated version of the manuscript, there is a note about the clade-level exclusion technique thanks to Steven's feedback.
Posted by pknut777

2012-05-17 11:15:56 -07:00
Someone else on Twitter (Rob Beiko) also pointed out this same issue -- big flaw ---
Posted by Jonathan Eisen

2012-05-17 08:18:02 -07:00
Jonathan, thanks for pointing this out - I didn't know about it even though I was Director of the Center where the two authors work until a year ago. (They never discussed it with me, even though PhymmBL from my lab is included in the study.) I think the paper may have seriously erred in the way it evaluated the content-based aligners (including PhymmBL). The problem is that they trained all the aligners on everything in RefSeq, and then proceeded to test them on their 3 data sets. But those data sets are largely drawn from RefSeq genomes! So this introduces a very serious and fundamental bias - the training data include the test data. Therefore a method like Naive Bayes is expected to do well, since it has many parameters and can basically memorize (sort of) what it's seen before.

This should not have gotten through review without them fixing this design flaw. The training sets should have been carefully designed for each test to exclude the test data.
Oh well.
Posted by Steven Salzberg

2012-05-17 08:05:05 -07:00
There is also the conflating issue that hard-line cladists of the Willi Hennig school tend to use "phenetic" to just mean "bad; not the one true holy method of maximum parsimony".
Posted by Jonathan Badger

2012-05-15 16:55:52 -07:00
Ross - I commented over on G+. For others - here is what I wrote.

I think there is some diversity in the use of "phylogenetic" out there but here are my thoughts on some of the terminology. Phylogenetics is really the study of the relationships among organisms (or genes, genomes, or other entities). And phylogenetic methods are methods for inferring phylogeny.

Phylogenetic methods come in many flavors. Some people divide them into two classes - as Joseph Brown did above - into distance based methods and discrete data methods (see for example http://research.cs.queensu.ca/home/cisc875/faint.pdf). Other people divide up methods into distance, parsimony and likelihood or distance, parsimony, likelihood and Bayesian categories - in essence treating parsimony based methods as distinct from likelihood/Bayesian methods even though they both deal with analyzing discrete data/characters.

In regard to phenetics - phenetics as far as I am aware has been used to describe methods that grouped organisms by their similarity to each other and generally ignored evolutionary history. To group organisms by their similarity one generally uses distance matrix methods so there is some overlap between distance matrix phylogenetic methods and phenetic methods.

There is some disagreement as to what distance based methods should be called phenetic and just what neighbor-joining actually is doing (e.g., see http://mbe.oxfordjournals.org/content/23/11/1997.full). In essence NJ attempts to infer a phylogenetic tree from a distance matrix by minimizing the total branch length in the tree and in essence assuming that the distances are additive. It is certainly true that one could feed ANY distance matrix into NJ. But given its methodology, I think it is only really suitable for distances that are the result of a bifurcating evolutionary history. I have yet to see an example of a case where other types of data can reasonably be analyzed using NJ. In addition, due to the method of NJ it is not the standard phenetic approach of grouping organisms by similarity per se - because NJ allows rates of evolution to vary between taxa and thus organisms could be monophyletic yet be more similar to things outside of their clade.

Just as distance methods can be used for non-evolutionary purposes (e.g., standard clustering) so too can discrete character methods. For example, parsimony analysis can be applied to any data matrix. And one can infer "changes" between states even for objects that are not homologous and share no common ancestry. This does not mean parsimony methods SHOULD be used in such cases, but they can. And similarly, just because one can use a distance based phylogenetic method to analyze data that does not have a phylogenetic history, this does not mean one should. The issue in both cases is whether the model/algorithm is appropriate for the type of data. Since NJ in essence assumes additive distances it does not seem valid for most cases except phylogenetic history (note - I am not saying it is ideal for phylogenetic history and in fact I do not use it anymore) but that is not the point.

It does not matter what we call the methods - phenetic or phylogenetic.
What matters is the nuts and bolts of how they work. And NJ seems like a bad idea for clustering most objects.
Posted by Jonathan Eisen

2012-05-15 15:45:05 -07:00
If anyone is further interested in continuing the 'is NJ phenetic or phylogenetic?' debate, we're having an open discussion of it here over on Google+ https://plus.google.com/109536929126322188570/posts/9V3QryFudFp

There's some diversity of opinion already.

I probably won't engage with the matter further here. But thanks all for the comments.
Posted by Ross Mounce

2012-05-15 09:55:50 -07:00
Yes - I am with you. Clustering is a very nice tool for visualization but it can lead to misleading results if the distances are not right for the method.
Posted by Jonathan Eisen

2012-05-15 09:18:02 -07:00
Well, I dunno, maybe that's a little too dogmatic; clustering (either by NJ or UPGMA) is a useful visualization and you don't have to take branch lengths all that seriously when you're just visualizing. If we took your view then we'd have to reject pretty much every use of hierarchical clustering in the literature, where the underlying distances are unlikely to be ultrametric. Hrmmm. Well, OK, maybe I could be up for that, now that I think about it.
Let's start with that gene expression clustering paper from that other Eisen guy...
Posted by Sean Eddy

2012-05-15 09:02:08 -07:00
Thanks Sean ... you are right here (in part) ... I was overly focusing on phylogenetics. But I was not trying to defend UPGMA as a method to analyze non-biological data either. I agree with you that the distances here are unlikely to be additive or ultrametric and thus neither method should be used for this type of data. What I (generally) object to is throwing a method at something without justifying why that particular model/method is being used. NJ or UPGMA seem unwise here.
Posted by Jonathan Eisen

2012-05-15 08:22:31 -07:00
NJ just assumes that the distances are additive (i.e. the distance between two leaves summed along the path between them on the tree's branches is the same as their pairwise distance), which is a weaker constraint than UPGMA, which assumes that distances are not only additive but also ultrametric (tree is rooted, with every leaf equidistant from the root). So NJ includes the UPGMA tree as a special case -- if the data were ultrametric, NJ and UPGMA give you the same tree. It's nonsensical to object to using NJ on nonphylogenetic data, while demanding that standard and even more formally restrictive UPGMA (i.e. the usual hierarchical clustering algorithm) be used instead. There's nothing about NJ that's necessarily "phylogenetics-driven" other than the fact that we believe evolutionary distances are roughly additive, thus NJ can be reasonably applied.
I doubt that the "distances" between software implementations are additive, but I'm even more sure they're not ultrametric.
Posted by Sean Eddy

2012-05-15 07:32:30 -07:00
Phenetics is fine. I have no objection to phenetics. It is also for classification of organisms or genes or other objects with history. What concept suggests that such methods should be used for classifying computational methods? It makes no sense to me. As for benefits / drawbacks of NJ - I was not defending it as the best phylogenetic method. I was saying it should not be used for NON phylogenetics.

Show me ONE reasonable paper that has used NJ for clustering that is not phylogenetics-driven (e.g., as used here).
Posted by Jonathan Eisen

2012-05-15 01:59:25 -07:00
Plenty of more modern papers explicitly state when using neighbor-joining that it's a phenetic method: http://scholar.google.co.uk/scholar?q=neighbor+joining+phenetic&hl=en&btnG=Search&as_sdt=1%2C5&as_sdtp=on

(too many to list them individually)

Of course I freely admit you'll also find many papers referring to NJ as a phylogenetic method. Seems like there's no consensus view to me. Of course one can use phenetic methods to try and infer phylogeny - whether you *should* is another matter, and perhaps irrelevant in this case.

This paper may however summarise some of NJ's problems.

Queiroz & Good 1997 Phenetic clustering in biology: a critique.
Quarterly Review of Biology 72(1)

Anyway - this might perhaps explain why Bazinet & Cummings used an NJ-gram to cluster the similarity between the different methods. It makes sense to me, I see nothing wrong with it if NJ is explicitly used as a phenetic classification method.
Posted by Ross Mounce

2012-05-15 00:53:17 -07:00
Umm ... it is a distance method. And it was designed explicitly for phylogenetic analysis. See http://mbe.oxfordjournals.org/content/4/4/406.long. The key feature is that branch lengths are allowed to vary and are in essence optimized under the minimum evolution concept. As far as I know this is in essence unique to phylogenetic reconstruction and not something that would make sense to use to cluster objects that do not have a history.
Posted by Jonathan Eisen

2012-05-15 00:47:09 -07:00
"I don't like it because they use an explicitly phylogenetic method (neighbor joining, which is designed to infer phylogenetic trees and not to simply cluster entities by their similarity) to cluster entities that do not have a phylogenetic history."

Erm... I was under the impression, and was always taught, that neighbor joining was a phenetic (distance) similarity-based approach. So its usage here seems fair play to me. It does seem though that a lot of authors in the literature, now and in the past, are using NJ-phenograms to represent phylogeny, so perhaps there isn't a consensus agreement on this.
Posted by Ross Mounce
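
A note on the two conditions debated in this thread: whether a given distance matrix is additive (the assumption attributed to NJ) or ultrametric (the stronger assumption attributed to UPGMA) can be checked directly. The sketch below is only an illustration, not anything used in the paper under discussion. It assumes the standard four-point test for additivity and three-point test for ultrametricity, and the function names, tolerance, and three toy matrices are made up for this example.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple of items, the two largest
    of the three pairwise distances must be equal."""
    for i, j, k in combinations(range(len(d)), 3):
        a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(c - b) > tol:
            return False
    return True


def is_additive(d, tol=1e-9):
    """Four-point condition: in every quadruple of items, the two largest
    of the three pairwise-distance sums must be equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical toy matrices over four items A, B, C, D (in that order).

# Path lengths on a tree with unequal leaf branches (A=1, B=3, C=2, D=4,
# internal branch 2): additive, but not ultrametric.
additive_only = [
    [0, 4, 5, 7],
    [4, 0, 7, 9],
    [5, 7, 0, 6],
    [7, 9, 6, 0],
]

# Every leaf equidistant from the root: ultrametric, hence also additive.
ultrametric = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]

# An arbitrary dissimilarity matrix that fits no tree exactly.
neither = [
    [0, 2, 3, 3],
    [2, 0, 3, 4],
    [3, 3, 0, 2],
    [3, 4, 2, 0],
]

if __name__ == "__main__":
    for name, m in [("additive_only", additive_only),
                    ("ultrametric", ultrametric),
                    ("neither", neither)]:
        print(name, "additive:", is_additive(m), "ultrametric:", is_ultrametric(m))
```

Real dissimilarities, such as those between classification programs, will rarely satisfy either test exactly, so in practice one would look at how far the matrix departs from each condition rather than at a strict pass or fail.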