Tuesday, November 08, 2011

I am phylogeny obsessed but this is too much to me: phylogeny of cancer subtypes

Just because you have data that could be plugged into a phylogenetic analysis does not mean it makes sense to do so. Case in point - the following paper:

A Differentiation-Based Phylogeny of Cancer Subtypes by Riester M, Stephan-Otto Attolini C, Downey RJ, Singer S, Michor F.

In this paper the authors take gene expression data from various cancer samples/cell lines and then they build phylogenetic trees from the data.  See example below:

Figure 2. A phylogeny of acute myeloid leukemia (AML) subtypes. According to the French-American-British (FAB) classification, AML samples are classified into seven different types according to their level of differentiation (see Table 1). Expression data from 362 AML patients and 7 Myelodysplastic Syndrome (MDS-AML) patients is used to construct a phylogeny of these leukemias. We include expression data of human embryonic stem cells (hESCs), CD34+ cells from bone marrow (CD34 BM) and peripheral blood (CD34 PB), and mononuclear cells from bone marrow (BM) and peripheral blood (PB). The differentiation pathway from hESCs to mononuclear cells from peripheral blood is represented in purple, and the common ancestors of subtypes are shown as pink dots. The bootstrap values of branches are indicated by boxed numbers, representing the percentage of bootstrapping trees containing this branch. The ranking of AML subtypes identified by the phylogenetic algorithm corresponds with the differentiation status indicated by the FAB classification. The M6 subtype, represented by only 10 samples in our dataset, has the least stable branch, leading to lower bootstrap values for those branches where it can alternatively be located.

The pictures are pretty.  They make some sense biologically.  The paper has some very interesting parts and I do not want to suggest that the paper is not useful.  But it makes no sense to me to use a phylogenetic approach to analyze this data.  Phylogenetic methods are about reconstructing the history of evolutionary lineages.  They should not be doing that here, as far as I can tell, since the cancers are from different people with different histories, and what they may be looking at is convergent / developmental similarities in the cancer samples.  But they are not looking at history per se.  And thus it is not appropriate to use algorithms that use phylogenetic methods.

It just makes no sense to me to use a phylogenetic method instead of some sort of clustering method in the step where it says "construct tree" in their flow diagram.  Sure, phylogenetic methods can make nice pictures.  But they should only be used when the underlying data has a history that is reflected in the model/assumptions of the phylogenetic method.  I could, for example, build a phylogeny of cities based on various metrics.  But would that make sense?  Most likely not.  Don't let the fact that similar things group together in the same part of a phylogenetic tree fool you into thinking that a phylogenetic model is right for your data.
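
To make the "similar things group together" point concrete, here is a minimal sketch of a plain clustering method (average linkage, the same idea as UPGMA) written in pure Python. The sample names and numbers are made up for illustration and are not data from the paper; the function names are my own as well. All the algorithm does is repeatedly merge the two closest groups.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy "expression profiles": made-up numbers, not data from the paper.
profiles = {
    "sample_A": [1.0, 1.1, 0.9],
    "sample_B": [1.1, 1.0, 1.0],
    "sample_C": [5.0, 5.2, 4.9],
    "sample_D": [5.1, 5.0, 5.1],
}


def euclidean(u, v):
    """Straight-line distance between two profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(profiles):
    """Agglomerative average-linkage clustering; returns the merges in order."""
    clusters = {name: (name,) for name in profiles}  # cluster label -> members
    merges = []
    while len(clusters) > 1:

        def avg_dist(pair):
            # Mean distance over all cross-cluster pairs of members.
            x, y = pair
            dists = [
                euclidean(profiles[a], profiles[b])
                for a in clusters[x]
                for b in clusters[y]
            ]
            return sum(dists) / len(dists)

        # Pick the two clusters that are closest on average and merge them.
        x, y = min(combinations(sorted(clusters), 2), key=avg_dist)
        merged = clusters.pop(x) + clusters.pop(y)
        clusters["(" + x + "," + y + ")"] = merged
        merges.append((x, y))
    return merges


merges = average_linkage(profiles)
for left, right in merges:
    print("merge", left, "+", right)
```

The output is a perfectly good tree-shaped grouping, and nothing in the procedure assumes or tests for common ancestry. It would group cities by population and rainfall just as happily.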

I may be obsessed with phylogeny but that obsession applies to applying phylogenetic methods to data with histories that are approximated by the methods being used ...  and this paper seems to not be doing that ...

Hat tip to Eric Lowe, an undergrad in my lab, for showing me this paper.

I note - this does not mean that phylogenetic methods cannot be applied to cancer studies.  Case in point - this paper:

Estimation of rearrangement phylogeny for cancer genomes by Greenman CD, Pleasance ED, Newman S, Yang F, Fu B, Nik-Zainal S, Jones D, Lau KW, Carter N, Edwards PA, Futreal PA, Stratton MR, Campbell PJ.

In this paper the authors focus on mutations in cancer cells and they use phylogenetic methods to determine the order in which genomic changes happen in these cancer cells.  This seems to be an excellent use of phylogenetic / phylogenomic methods.

So - lesson of the day - phylogenetic methods should be used on data with a phylogenetic history.  Not so complicated.  But pretty important.


  1. Exactly. Hierarchical clustering is interesting in interpreting gene expression, but such clustering is not the same as phylogeny. There are different programs for performing it, some of which were written by your brother (Cluster, FuzzyK).

  2. yes, which is why it is weird that my brother is giving me shit on twitter about criticizing this paper

  3. From Twitter:

    phylogenomics: The Tree of Life: I am phylogeny obsessed but this is too much to me: phylogeny of cancer subtypes

    mbeisen: @phylogenomics why?

    phylogenomics: @mbeisen sorry - my autopost seems to skip the links hang on

    phylogenomics: @mbeisen see http://t.co/1wdHtmDH

    mbeisen: @phylogenomics but many "phylogenetic" methods have origins in numerical taxonomy, not in historical reconstruction

    phylogenomics: @mbeisen sure but reads the paper - they talk about distance from the "root" of the tree and make other non sensical statements

    phylogenomics: @mbeisen & say "In the case of phenotypic info such as gene exp. data, this assumption does not hold" admitting PG methods kind of wrong

    ianholmes: @mbeisen @phylogenomics Franziska does great work but "phylogeny" is wrong word here, there is tumor phylogeny(eg using FISH),this ain't it

    phylogenomics: @ianholmes @mbeisen also see link at end of my post to a new paper on tumor rearrangement phylogeny that looks good

    mbeisen: @phylogenomics @ianholmes yes, the phylogenetic metaphor is overused and taken too literally

    mbeisen: @ianholmes @phylogenomics but they ARE trying to build a phylogeny in the sense that it is based on sequential differentiation from hESCs

    phylogenomics: @mbeisen @ianholmes mostly they seem to be talking about ranking/classifying not anything sequential

    phylogenomics: @mbeisen @ianholmes uggh - now you just sound like a mindless defender - would you have called this a phylogeny of cancer subtypes?

    phylogenomics: @mbeisen @ianholmes maybe you should have called your array tools "phylogeny" instead of "cluster"

    mbeisen: @phylogenomics @ianholmes i'll remind you that most of the assumptions made in what you would call "true phylogenetics" also don't hold

    phylogenomics: @mbeisen @ianholmes that's not the point - discussing this as a phylogeny is misleading - regardless of what others do

    mbeisen: @phylogenomics @ianholmes not defending paper (haven't read yet) but i think phylogenetic methods could be appropriate for this kind of work

    ianholmes: @mbeisen @phylogenomics surely the "gen" in "phylogeny" has to imply some shared origin or development - not just functional similarity

    mbeisen: @ianholmes @phylogenomics but the point of the paper is that the similarities in expression reflect common developmental history

    mbeisen: @ianholmes @phylogenomics i'm not saying they're RIGHT - just that the methods don't seem to be inherently inappropriate

    ianholmes: @mbeisen @phylogenomics is a cell lineage differentiation tree a "phylogeny"? If so, then sure, tumor cells too. Paper's a bit vague on this

    phylogenomics: @mbeisen @ianholmes I don't buy it - methods seem inherently inappropriate

  4. I think what you're getting at is that the use of a tree suggests that all the examined cancers "arise" in some sense from pluripotent stem cells. While the cancer stem cell hypothesis is certainly being examined by many researchers, it is a stretch to just assume that it is true for these cancers. I think Ian Holmes in your twitter feed buys into the validity of that assumption when he compares this tree to a cell lineage tree. Trees are valid for representing developmental lineages because the hierarchical structure of the tree properly represents temporal relationships.

  5. Shaun

    Yes, that is part of my complaint. But in addition, I also question in general the concept of taking cancer cells from different people, examining their expression patterns, and then using a method that tries to represent the "history" of these cells. The cells from the same type of cancer in different people do not have the same history. They may have similar developmental paths that they go through to get to their final state, but they started from different points. So what I am having trouble with is wrapping my brain around just what the root and nodes and branches in an evolutionary tree would mean in this case. I am all for drawing hierarchical "trees" to represent temporal relationships when there is some sort of bifurcation going on - but I am very uncomfortable with using phylogenetic methods and models to generate the tree in this case.

  6. So if they'd still constructed a distance matrix and built a tree out of that using phylip (neighbor) and presented it as an unrooted tree without calling it a phylogeny, would that have been ok? I think by your line of reasoning it would be, neighbor joining being simply a clustering algorithm with no implied notion of shared ancestry or anything.

    (...I haven't read the paper, by the way...)

  7. no Rutger -

    1. NJ has an implied evolutionary model in that it is jiggering branch lengths to reflect amount of change over time -- UPGMA or some algorithm that is more of a straight clustering algorithm would likely be fine --- and they did use this for some things

    2. In the end it is not what they called the diagram that matters - it is what they did with it --- they talk about phylogeny and nodes and temporal order and such throughout and use all sorts of phylogeny speak --- but in the end I don't buy the way they connected the distance calculation to ANY type of diagram where time/history is a part of it

  8. I clearly should have read the article before running my mouth off, because they shouldn't be doing what you're describing. But I'm not sure I agree that the jiggering has to do with change over time, NJ trees being unrooted.

    NJ minimizes the sum of branch lengths that are computed from a distance matrix; I guess I don't see why you can't use that to create a tree even if the tips don't share ancestry (as they don't in this article).

    I am aware that NJ has been developed, and is used, with some notion of minimum evolution in the back of everyone's mind - but to me it looks a lot like a clustering algorithm.
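
For readers following the NJ versus clustering exchange above, here is a small sketch of how the pair-selection step in neighbor joining differs from simply joining the closest pair. It assumes the standard NJ criterion Q(i,j) = (n-2)·d(i,j) − Σ d(i,k) − Σ d(j,k); the distance matrix is a made-up toy example, not data from the paper, and the function names are hypothetical.

```python
from itertools import combinations

# Toy symmetric distance matrix for five tips (made-up values, not from the paper).
labels = ["a", "b", "c", "d", "e"]
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]


def closest_pair(D):
    """Pair with the smallest raw distance (what simple clustering joins first)."""
    n = len(D)
    return min(combinations(range(n), 2), key=lambda p: D[p[0]][p[1]])


def nj_pair(D):
    """Pair minimizing the NJ criterion Q(i,j) = (n-2)*d(i,j) - r_i - r_j."""
    n = len(D)
    row_sums = [sum(row) for row in D]

    def q(p):
        i, j = p
        return (n - 2) * D[i][j] - row_sums[i] - row_sums[j]

    return min(combinations(range(n), 2), key=q)


def nj_branch_lengths(D, i, j):
    """Branch lengths from tips i and j to the new internal node joining them."""
    n = len(D)
    r_i, r_j = sum(D[i]), sum(D[j])
    length_i = 0.5 * D[i][j] + (r_i - r_j) / (2 * (n - 2))
    return length_i, D[i][j] - length_i


ni, nj = nj_pair(D)
ci, cj = closest_pair(D)
print("NJ joins first:     ", labels[ni], labels[nj])
print("closest pair is:    ", labels[ci], labels[cj])
print("NJ branch lengths:  ", nj_branch_lengths(D, ni, nj))
```

On this toy matrix the closest pair is d and e, yet NJ joins a and b first and assigns them unequal branch lengths. That adjustment for each tip's overall distance to everything else is the part that goes beyond plain closest-pair clustering; whether it amounts to an implied evolutionary model is the question being argued above.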