Thursday, September 13, 2012

Some quick comments on "Giant viruses coexisted w/ the cellular ancestors & represent a distinct supergroup"

Got asked on Twitter about this paper:

BMC Evolutionary Biology | Abstract | Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya

I answered briefly
Don't have time for a detailed blog post but here are some quick comments:

1. Giant viruses are fascinating and cool

2. I have done work connected to the topic of this paper and thus might not be considered fully objective.  For example see

3. I see no evidence that the type of analysis that they do on protein folds is a robust phylogenetic method.  Methods for inferring phylogeny from sequence alignments (which is what we focus on in my lab) have been tested and tweaked for some 50 years.  There are 100s to maybe 1000s of papers on methods alone - not to mention the 1000s of papers using alignments for phylogenetics.  I am not convinced that the analysis being done here of FFs and FSFs (fold families and fold superfamilies) is particularly robust.  It seems interesting, certainly.  But is it sound?  I mean, I could build phylogenetic trees from cell size, from shape, from eye color, and from all sorts of other features.  Those would all suck for certain.  Protein folds - not sure about them.  They almost certainly are prone to convergent evolution and I do not see any attempt in this analysis to deal with that issue.

4. The authors of the current paper do not show any taxa names on their trees - just colors for large groups of taxa (bacteria, archaea, eukaryotes and viruses).  It is really not good practice to remove the taxon names.  If they were there the first thing I would do is to look at the patterns within the groups they highlight.  Do all the major phyla / kingdoms of eukaryotes, for example, come out looking as one would expect based upon other studies?  Or are they all over the place?  Same for bacteria and archaea.  Not including taxa makes it nearly impossible to judge this paper positively.  I could not find this information in supplemental data either.

5. They really should have released the data tables they used for the phylogenetic analysis.  Don't know why they did not.

6. In Figure 3 with the rooting they have, either viruses are a subgroup of archaea or archaea are not monophyletic.  Not a good thing in a paper trying to claim viruses represent a fourth grouping on the tree of life.
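The monophyly point in comment 6 can be checked mechanically once a tree with taxon names is available. Below is a minimal Python sketch, assuming a rooted tree written as nested tuples. The tree and its taxon labels are made up for illustration and are not taken from the paper's Figure 3; the function names are hypothetical too.

```python
def leaves(node):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the leaf set of every clade (subtree) in the rooted tree."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group):
    """A group is monophyletic if some clade contains exactly its members."""
    return frozenset(group) in set(clades(tree))


# Hypothetical rooted topology: the viruses sit next to one archaeon,
# while the other archaeon groups with bacteria and eukaryotes.
toy_tree = (
    (("virus_1", "virus_2"), "archaeon_A"),
    ("archaeon_B", ("bacterium_1", "eukaryote_1")),
)

viruses_ok = is_monophyletic(toy_tree, {"virus_1", "virus_2"})
archaea_ok = is_monophyletic(toy_tree, {"archaeon_A", "archaeon_B"})
print("viruses monophyletic:", viruses_ok)
print("archaea monophyletic:", archaea_ok)
```

In this toy topology the viruses form a clade but the two archaea do not, which is the kind of pattern described in comment 6. Running the same check on the real tree would need the taxon labels that were not published.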

Anyway - got to do some other things but just wanted to get some comments out there.

UPDATE 9/19 - some prior stories about the "fourth domain" and ancient viruses - to counter the notion in the press release for this paper that their findings "shake up the tree of life".  Even if their specific inferences about viral evolution are correct, such inferences / conclusions have been made before.


  1. Hey. Thank you for reading the paper and responding. Don't understand why you'd publish something so quickly without even interviewing the researcher, but I'll ask him to respond.

    D. Yates.

    1. Thanks for the comment. I read the paper. I am very familiar with the field and with papers on this topic. I wrote up my initial thoughts on this paper. The suggestion that I should interview the authors of a publicly available paper before commenting on it is not something I support.

    2. I must say I am disappointed in some aspects of the press release and the coverage that came from it ... for example the headline "Giant Viruses Are Ancient Living Organisms" which I found in a reposting of the press release here. Did the paper really show that giant viruses ARE ancient living organisms? Not even remotely. First of all - nothing living today is ancient. Second, the paper was reasonably cautious in some conclusions. Not this headline which is clearly based on the text of the PR. What about the headline in the original PR "Study of giant viruses shakes up tree of life"? Really? How did it shake up the tree of life? I note - it is this type of wording that drives me and many other scientists crazy. Why overstate the work? Is that needed?

    3. To be clear, here is the headline and the first paragraph of the press release. Readers can decide for themselves if it is misleading or overstated.


      A new study of giant viruses supports the idea that viruses are ancient living organisms and not inanimate molecular remnants run amok, as some scientists have argued. The study may reshape the universal family tree, adding a fourth major branch to the three that most scientists agree represent the fundamental domains of life.

      The rest of the press release is here:

    4. Sorry to go on about this but the possibility that viruses represent a fourth domain has been out there for many years. How does this new study "shake up the tree of life"? The paper concludes that their analysis supports the existence of a fourth domain. It is the prior hypothesis by people like Raoult and Claverie and related papers that tried to shake up the tree of life. This paper - if it is correct - ends up lending support to prior work. The use of "as some scientists have argued" and "that most scientists agree" tries to make it seem like this is an incredibly novel "shake up" in the tree of life. It is not. And therefore I conclude, I think accurately, that the title and some of the text of the PR are misleading and overhyped and do not do enough justice to prior work on the topic.

    5. As I said, I think readers can come to their own conclusions when they see what was actually in the news release.

    6. I think the paper and the press release would have benefited from outlining the study in context of the theories about the origin and evolution of viruses (please see my comment below).

    7. Yes, the press release should have included more background on other studies that suggest viruses be included in the tree of life. Thanks to both of you for pointing that out. I agree.

      Perhaps I'm wrong, but I do think this study is substantially different from a lot of previous work in that it identified a reliable molecular marker of deep evolutionary change and used that marker to open a new window on events near the base of the tree of life.

    8. Thanks for the response Diana. The work in this paper is certainly interesting. As I stated above, I am skeptical of the robustness of some of the methods but if we assume the methods work, then yes, the approach did reveal some new insights. But without getting too philosophical, the reason I persisted with criticism of the PR relates to the "we stand on the shoulders of giants" concept. It does not take away anything from a new study to reference the prior work. And when that work is not referenced, then many scientists (e.g., me) bristle, get snarky, and are unable to look deeper. So - I confess - I do not know how substantially different this work is because I focused on the ways in which it was building on prior work not the ways in which it was different.

  2. I have gone through all of the FSF, FF papers of Caetano-Anollés and would suggest that his method is interesting, but there are a few cases where the clock assumption falls short. For example disordered conserved ribosomal proteins like L15 cannot duplicate for re-use as can a folded cold-shock domain. Not without a duplication in its binding partner. So his dating of ribosomal proteins seems too wide, and gets to a "proteins before ribosome" answer. I agree, no data for any of his papers is available, so it is impossible to replicate without a rebuild of his infrastructure. I think it's an interesting method but lacks "error bars".

    1. Christopher, Thank you for adding to the discussion. The methodology that we use does not force a clock assumption. We do however find a posteriori that trees of domains are globally clock-like. This does not mean that the clock always ticks at a constant pace. The case you present is indeed interesting and worthy of detailed study. In fact, we find that a small subset of F and FSF structures is refractory to domain combinations (see our Structure paper published in 2009) and that these appear around the time of the appearance of ribosomal proteins. I bet the domain(s) of L15 may be in that group. Unfortunately, we have no marker that would link ribosomal F and FSFs to the geological record so we cannot test their clock-like behavior. However, I note that the appearance of an FSF may take hundreds of thousands of years, if not more. It is therefore important to consider timeframes when connecting processes of gene duplication and processes of structural innovation.

  3. Jonathan, Thank you for your candid assessment. Diana prompted a response so I am now finding my way to your blog, something I never do. We have been working on these kinds of approaches since 2000 and publishing our results since 2003. A focus on structure for phylogenetic analysis is indeed incipient but our methods are cladistic, very traditional, and their inception predates modern analysis of sequence. I agree that there is much to learn from molecular structure but disagree that folds are like cells or eye color. Cells and eye color express a multitude of traits at many levels of organization, and it would be premature to use those traits in global analyses of these kinds. However, folds have been studied and catalogued since Kendrew and bioinformatics suites that are used to make structural inferences are robust and advanced. Structural biology remains a strong field and structural genomics has expanded our horizons of the molecular world. I must admit however that our understanding of structure and disorder is limited but the field is advancing.

    So back to your comments about how sound the methodology is. It has a number of advantages over sequence. For starters, it does not violate character independence as much as sequence does. Sites in sequences by definition interact with each other as they establish secondary, supersecondary, tertiary, and quaternary molecular structure. The incorporation of this fact into models of sequence evolution is a grand challenge. In contrast, domain definitions can be in some cases quite precise and their cataloguing robust. So the structural census provides characters and taxa directly. This obviates the need for alignment, which is a second grand limitation of sequence analysis (as rightly pointed out by Morrison). If interested, you may find many other benefits of the methodology (phylogenetic inapplicables, taxon sampling, tree imbalance, domain rearrangements, etc) at . I think that the most important feature of domain structures is their high evolutionary conservation. This makes them unique for the deep exploration of phylogenetic relationships. In contrast, sequences change at a very dynamic pace and that is exactly their power. The trade-off is that they are limited in their ability to provide big pictures. Sober and Steel had very sound arguments about the informational (and philosophical) limitations of sequences that are worthy of careful analysis. (Continues in separate comment because of html character limitations) -Gustavo

    1. Thanks for the response. Some comments

      1. I was not saying protein folds were analogous to cells and eye color. I was using those as examples of things I would not use for phylogeny (or at least, would use very cautiously I suppose). I said "Protein folds - not sure about them." and I meant this. Could be good. Could be not so good. Your paper does not convince me that they are good but certainly, as I also said, they are interesting.

      2. I am certainly with you on the limitations of sequence analysis. Lots of issues there. Many not solved.

      3. The high evolutionary conservation of folds certainly means they have more potential for inferring deep branches than for inferring recent ones. But that does not mean that trees based on parsimony analysis of such folds will be accurate. I am as of yet not convinced that the trees being made do a decent job of inferring evolutionary relationships in those deep branches. I am concerned with issues like convergent evolution (in terms of presence of folds), LGT, LBA, and other issues.

      4. I have not seen any evidence that the folds "do not violate character independence" as much as sequence does.

      5. I am not saying this cannot work. I am not saying the inferred trees are wrong. I am however saying that I do not think you have shown strong evidence that the methods you are using are robust for inferring relationships - especially for ancient events.


    2. Thank you. An interesting point about the use of domains in phylogenetic analysis is that they do not suffer much from convergence and lateral gene transfer (LGT) processes. This has been shown by the groups of Gough, Bourne, Sonnhammer and Kim. We also have evidence that LGT is limited. Why the discrepancy with sequence? We think that perhaps this is related to the effect of domain rearrangements in proteins and its effects on phylogeny when domains are not taken into consideration. Of course, there could be many other explanations.

      In terms of character independence, this issue is very difficult to test. We are ultimately talking about coevolution of sites in molecules, a subject that is somewhat elaborated in RNA but incipient in proteins and very important. If you are interested, have a look at the work of Haussler, Fares and Pande for example. David Penny proposes the simple thought-experiment in which you exchange rows of characters and ask what you have lost in the process. If you do this with sequence, structure is gone. If you do this with domain abundance and occurrence, nothing is lost, unless you are explicitly interested in domain rearrangement (a higher level of complexity). -Gustavo
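The thought-experiment attributed to David Penny above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the occurrence matrix, the peptide string, and the function names are invented, and characters are laid out as columns (one row per taxon), so "exchanging characters" means reordering columns. Per-character comparisons between taxa survive the reordering, while the residue order of a sequence does not.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def permute_columns(rows, order):
    """Reorder the characters (columns) of every row in the same way."""
    return [[row[i] for i in order] for row in rows]


def pairwise(rows):
    """Hamming distance for every pair of rows, in a fixed pair order."""
    return [hamming(a, b) for a, b in combinations(rows, 2)]


# Hypothetical domain occurrence matrix: one row per proteome,
# one column per domain superfamily (1 = present, 0 = absent).
occurrence = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
]
order = [4, 2, 0, 3, 1]  # an arbitrary fixed reordering of the columns

shuffled = permute_columns(occurrence, order)
print(pairwise(occurrence), pairwise(shuffled))  # same distances both ways

# A made-up peptide: the same reordering scrambles which residues are
# neighbours, which is the information a folded structure depends on.
peptide = "MKVLA"
scrambled = "".join(peptide[i] for i in order)
print(peptide, "->", scrambled)
```

Note that column-wise distances between aligned sequences would also be unchanged by the reordering. What the scrambling destroys is the neighbour context along the chain, which is the dependence between sites that the comment is pointing at.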

  4. (continued) You also complain about not showing taxa names or making public the data. Trees are too big to make labels explicit and our intention in this paper is to show the global placement of viruses rather than distract with details. Trees of proteomes provide acceptable groupings that are not far away from those of other phylogenomic methods, as we have explicitly described in several papers in the past (since 2006). Russell Doolittle in 2005 also showed that folds provide acceptable trees of life, so it is not only my laboratory that is producing trees of proteomes, though we are one of the few that are providing trees of domains. In terms of our data, which is massive, we are in the process of updating the MANET database. We will provide data matrices, trees, search functions, functional annotations and much more. Unfortunately we do not have funding for the endeavor so we are bootstrapping with what we can and the effort is progressing slowly.

    Finally, the rooting of the cellular world in Archaea that you note and that follows the early rise of giant viruses is for us remarkable. It has been consistently recovered in all our trees, regardless of genomic dataset, structural classification (SCOP, CATH), phylogenetic character (from folds to families, from high levels of gene ontology definitions to the lowest that are possible to handle), a focus on abundance or occurrence, and many other twists. It appears at odds with the canonical rooting but I caution that the issue of the rooting of the tree of life is complex and far from resolved. We simply provide the structural view, which now needs to be reconciled with sequence views.

    I close by saying that a focus on the phylogenomic analysis of molecular features other than sequence should be a welcome addition to the bioinformatics toolkit, even if the concept may be unfamiliar. I think structure can positively complement evolutionary inferences derived from sequence, especially since the cladistics methods we use to analyze multistate taxa have also been repeatedly tweaked and discussed for decades. -Gustavo

    1. Addressing one part of your comment first. I am very disappointed that you are not sharing the details of the trees (e.g., you could release nexus files with the taxa without any trouble). Such information should be released with papers. I understand the challenges of releasing a full/large dataset for the work done here but even that is relatively easy these days. With places like Data Dryad and Figshare releasing such details is relatively painless. Again, I am a bit disappointed that such information is not available for this paper.

    2. As for "phylogenomic analysis of molecular features other than sequence" I have no inherent antagonism to data other than sequences. I worked with Sam Karlin for many years as a graduate student helping him with his attempts to use nucleotide composition to infer phylogeny (I was and still am skeptical of much of his work, but I still worked with him on it ...). I have worked with many attempting to use various other features and I welcome their addition to the toolkit. However, adding to the toolkit should be done with care. New approaches need to be assessed. They need to be tested by others (see data release comments above) and they need to be viewed with some skepticism until they are proven worthy.

    3. Thank you. We will be depositing trees and matrices of this paper in TreeBase. However, only our MANET database will be able to properly showcase phylogenomic data.

      I agree with you on your second point and I am glad that you are not antagonistic to the analysis of other phylogenetic characters in molecules. Being skeptical and open minded is part of being a good scientist. I appreciate your concerns.


  5. Dear Jonathan,

    Thank you for engaging with the material. It takes time and effort to understand new approaches and I appreciate that you are doing so! I sincerely hope the conversation (between you and GCA) will continue. I think you will find it most fruitful.

    Best regards

  6. It is great to see that Gustavo and his colleagues (Nasir et al.) have extended their approach of inferring phylogenies (based on protein domain structures) from the cellular kingdoms, Archaea, Bacteria, and Eukarya, to viruses.

    There are only two broad ways of thinking about the origin and evolution of viruses: they evolved from simple to complex, by increasing the size of their genome and the complexity of their proteome, or from complex to simple (reviewed here:

    The current prevalent view is that the viral lineages originated from simple genetic elements, before the origin of cells. According to this hypothesis, the mysterious ancestral viral elements evolved by acquiring new genetic material (including genes for components of translation machinery) into complex viruses whose genomes are several times larger than the genome of many symbiotic or parasitic cellular species.

    On the contrary, the fusion model proposes that the viral lineages originated from parasitic or symbiotic cellular species that, in order to have full access to the host resources, including translation machinery, fused with their host cell, by a process in which their cellular membrane fused with that of their host. After synthesizing their specific molecules and replicating their genome within the host cytoplasm, these organisms regained a cellular organization and continued their development. This novel type of life cycle opened unique evolutionary opportunities for both viruses and their host cells.

    Many extant viruses, including poxviruses and mimiviruses, start their life cycle by fusing with their host cells, which provides compelling evidence for the fusion model.

    One of the most remarkable implications of the fusion model is that new viral lineages originated from diverse Archaea, Bacteria and Eukarya species throughout their life history, and that this process might still be active. Surprisingly, it appears that several parasitic cellular species are indeed evolving into new viral lineages.

    The data from the Nasir et al. paper indicate that viruses have evolved by reductive evolution. These data represent a strong additional line of evidence supporting the fusion model and the hypothesis that the ancestors of viruses were cellular species.

  7. To start with I am not going to comment on the phylogenetics algorithms, or the way they are employed or the conclusions that Gustavo draws, however I will say something about the underlying data and use of domains as molecular characters; on balance they are both more robust than others (1), but also potentially less informative (2).

    1. This comment has been removed by the author.

    2. This comment has been removed by the author.

    3. 1) SCOP domain superfamilies are defined as having a common evolutionary ancestor, therefore they de facto cannot be subject to convergent evolution. To understand this one must understand protein structure. For domains of known structure the close-packing of the sidechains of the buried amino acids in the conserved core have a distinctive fingerprint that is retained over evolution. In short, the sidechains are like interlocking pieces of a multi-dimensional jigsaw puzzle (3 spatial dimensions plus additional dimensions which describe the chemistry, e.g. charge); while every piece of the puzzle (amino acid) can be substituted for another over evolution, it is not possible to substantially change the way the puzzle is packed, thus a unique evolutionary fingerprint is retained. The reason for this is that substitutions that erase the complex packing of the amino acids also erase the protein, i.e. they would lead to an unfolded protein.
      There are three caveats to this and two important consequences. The first caveat is that there are a small number of documented exceptions to this rule in the SCOP classification (for historical reasons), such as the classification of TIM barrels and Rossmann domains (please contact me for a complete list). The second caveat is that in theory a structural domain could evolve into a disordered state (losing the restriction of selection on the packing of the core), then continue to evolve into a subsequent structured state that is different from the starting structure. Since there is no functional continuity for this kind of evolution, it has no more biological meaning than there would be attached to a single DNA base in an intronic region that evolves to become an exon, giving the base a novel and independent context. Also, once a protein no longer folds and loses its structure (and conserved function), the slate is wiped clean and it is equally (un)likely to evolve into any other possible novel fold, thus there is no structural or functional continuity of information between the two points in evolution. The third caveat is that one might argue that the same puzzle packing could arise independently twice in evolution; from a theoretical point of view the combinatorial complexity of the multi-dimensional fingerprint (as for the mere 2D human fingerprint) is astronomical, and from an observational point of view the fingerprints of the close-packing sidechains of the conserved core are clearly discrete, even if there are many cases where the fold (arrangement of backbone and secondary structure) has arisen more than once in evolution. While it may be possible to observe more of a continuum of backbone positions between some protein structures, the conserved packing of sidechains in the cores bears one of a small and finite number of distinct evolutionary fingerprints.

    4. A) The first consequence of (1) is that domain superfamilies, by definition (provable by comparative structural analysis) are not subject to convergent evolution, or even divergent evolution that connects two superfamilies to a common ancestor; in SCOP the family level denotes sub-classes within a superfamily which have resulted from significant sequence divergence from the common structural ancestor. While I support the use of domains as molecular characters in Gustavo's work, the evidence does not support the model of evolution where superfamilies have evolved from each other, or even in some papers where all domains have evolved from a single ancestral protein. Superfamilies are much more robust as molecular characters than positions in a multiple sequence alignment, which despite what Jonathan insinuates are no less characters than domains are, either from a theoretical or from an algorithmic point of view. Sequence alignments suffer from, what I believe phylogeneticists call homoplasy, whereas superfamilies do not; there are only 20 amino acids and substitutions between all of them have been observed, whereas there are thousands of domain superfamilies and no substitution from one to another has been observed. The argument that sequence alignment for phylogeny is what has been done for the last 50 years and therefore must be right, has no place in science and, not meaning to take Jonathan's comment out of context, it is sadly an illustration of a similar attitude routinely adopted by other sectors of the phylogenetics community that come across to an outsider as territorial and inbred.
      B) When using domain databases to study evolution, one should restrict to domains of known structure, such as with the SUPERFAMILY and SCOP databases. To use functional domain databases such as Pfam is to include domains for which the evolutionary relationships are unknown and for which the domain definition is not that of an evolutionary unit; since more distant evolutionary relationships are detected, there is little or sometimes no loss in genome coverage by focussing on domains with a known structural representative.

    5. 2) Although superfamilies are very robust characters, they are not always discriminative. For example the Last Universal Common Ancestor (LUCA) of cellular organisms probably already contained (very) roughly half of all superfamilies seen in nature, and there is very little variation in the repertoire of domains between species (Chothia, C. and Gough, J. 2009 Genomic and Structural Aspects of Protein Evolution. Biochem. J. 419(1), 15-28. ). There are also currently a similar number of characters (superfamilies) as taxa (completely sequenced genomes). With respect to the paper in this discussion, there are relatively few superfamilies unique to any superkingdom or to viruses (Abroi, A. and Gough, J. 2011 Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution. Bioessays 33(8), 626-635.). Thus, using the repertoire of superfamilies in genomes for phylogenetics can give very robust characters for the deep branches of the tree of life, a big weakness of sequence alignments, but may not contain enough information. The good news is that there are other molecular characters which can also be included, such as the sub-families within the superfamilies (Pethica, R.B. and Gough, J. 2012 Evolutionarily Consistent Families in SCOP: Sequence, Structure and Function. BMC Struct. Biol.
      We also have a paper under review which evaluates the different molecular characters for species tree reconstruction, including superfamilies, families, supra-domains and domain architectures. I look forward to sharing this with you when it is published, as it is very enlightening to this aspect of the discussion and contains a great deal of evidence to support the points made therein. Readers of this may also be interested in a live and regularly-updated species tree of all completely sequenced genomes derived from molecular characters, which is available here (please excuse the clunky interface): .

    6. Thanks for all the responses and comments Julian. One quick comment as I will try to make comments on different topics separated.

      I vigorously dispute the statement "SCOP domain superfamilies are defined as having a common evolutionary ancestor, therefore they de facto cannot be subject to convergent evolution". Are you saying that two unrelated proteins cannot ever converge so that they have similar folds/domains and end up being classified as in the same SCOP superfamily? How on earth can you make such a claim? If we can have sphinx moths and butterflies converge on flight mechanics, sharks and dolphins converge on swimming form, flowers and fungi converge on certain colour patterns, and so on how can it be that somehow protein domains are "resistant" to the forces of natural selection. I confess I am completely lost by this argument and hope I have misinterpreted what you wrote.

    7. To use your analogy sharks and dolphins both swim, but any decent scientist would look inside and see that the organs are different; e.g. one gets its oxygen from the water and one from the air. Similarly one can take a molecular example in the form of the cupredoxin and immunoglobulin superfamily domains which both have a topology with a similar appearance consisting of a sandwich of beta sheets, but the pattern of hydrogen bonds in the conserved core as illustrated in fig. 3 of Gough and Chothia (2004 is completely different from that illustrated in fig. 3 of Chothia and Lesk (1987

      If you wish to vigorously dispute this then I suggest that you do so by presenting the structural evidence against it. The evidence supporting it is overwhelming and it was in fact this observation that led to the creation of SCOP in the first place and is the most fundamental principle on which it is still generally based. The concept that despite the vast number and variation in protein sequences there are a finite and comparatively small number of 3D structures in nature has been widely accepted since Chothia (1992

      Returning to your analogy, yes you can find single features such as your flight mechanics or colour patterns in flowers and fungi which occur regularly in evolution. A molecular example of convergence could be a coiled coil structure, e.g. Rackham et al. (2010 I am sorry you are lost with the argument, but hopefully you can now see how the single characteristic of the external appearance of a protein domain fold does not outweigh the evidence of the multitude of complex internal features in determining the evolutionary relationships used to classify superfamilies in SCOP any more than a similarity in colour patterns of plants and fungi would cause them to be classified together in a biological taxonomy a century ago.

      To put this back into context of the discussion in this thread: using sequence alignments, a molecular character that is an equivalent position in the same sequence from two different organisms, can mutate from valine to leucine in one organism and from isoleucine to leucine in the other organism. The resulting leucines are indistinguishable in the two organisms, despite having different ancestry and this can confound the phylogenetics, particularly over large evolutionary distances. On the other hand with the molecular character that is the presence of a superfamily in the repertoire of two organisms' genomes, whatever the evolutionary distance, you know that the two domains share a common evolutionary ancestor. Thus superfamilies have a more robust equivalence than positions in a sequence alignment, notwithstanding the other limitations mentioned in my previous posts.
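The valine/isoleucine-to-leucine scenario above can be made concrete with a small parsimony calculation. The sketch below uses Fitch's small-parsimony algorithm on a made-up four-taxon tree. The taxon names, the observed residues, and the assumed "true" history (one lineage going V to L, the other I to L, with the close relatives keeping V and I) are all hypothetical and only illustrate the comment's point.

```python
def fitch(node, states):
    """Fitch small parsimony on a rooted binary tree of nested tuples.

    Returns (candidate ancestral states, minimum number of changes).
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: keep the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical alignment column. org1 and org2 both show leucine, but in the
# assumed history org1's lineage started from V and org2's from I.
tree = (("org1", "org1_relative"), ("org2", "org2_relative"))
observed = {
    "org1": "L",
    "org1_relative": "V",
    "org2": "L",
    "org2_relative": "I",
}

root_states, changes = fitch(tree, observed)
print("inferred root state(s):", root_states)
print("minimum changes:", changes)
```

Under these assumptions the reconstruction places leucine at the root with two changes, even though the assumed history never had an ancestral leucine and involved three substitutions. The two leucines look identical by state, so the column alone cannot reveal their different ancestry.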

    8. Actually I think it is upon the structural biologists to prove that somehow, magically, structural features - even the insides of structures - are somehow magically not prone to convergence. I have yet to see any proof of this. I accept readily that sequence is prone to convergence. It is a big problem. Every character trait I have ever studied is prone to homoplasy. That SCOP is based on the notion that convergence does not happen for the SCOP traits is no selling point to me. It is not convincing.

    9. I note I went to your tree server and selected the first few organisms on the list and asked for the tree. The tree is available here. The tree shows chimps and gorillas as sister taxa to the exclusion of humans. A result that is inconsistent with 100s of papers and enormous data. Why does this happen? I don't know. But when I see results like that in my first try, I am not convinced your approach works well. What the cause of this is I do not know. It is probably not convergent evolution in this case but something is amiss. And my feeling is there has not been enough work on building trees from this type of data to convince me that it works well.

    10. I note - I am NOT saying that this approach cannot work well or even that it does not work well. But I have yet to see convincing evidence that it (1) is robust to issues like homoplasy, variation in genome size, etc. and (2) is preferable to inferring phylogenetic trees from alignments. I will continue to play around with your trees at the web site and continue to read papers on this topic, but claims like "there is no possibility of convergent evolution" make me skeptical about what I will find.

    11. Your argument is a straw man, i.e. by looking for something that you believe to be wrong in a tree I publish on my website, you argue that one cannot determine evolutionary relationships of proteins by analysing their 3D structure.

      Please do not take the very specific claim about convergent evolution of protein structure out of context. I restate for clarity: convergent evolution of individual features of protein structure is possible, however atomic resolution 3D structures of proteins have sufficient conservation of features to recognise common ancestry of whole domains. This is not true for sequence. If you are yet to see convincing evidence then I refer you to two decades of CASP (

      With regard to your completely separate question about the grouping of chimps and gorillas as a sister group to humans on our species tree of completely sequenced organisms, this tree is not built on superfamilies because of some of the limitations I mention above, so your point does not follow. People with a large audience have a responsibility to apply scientific reasoning rather than debating tactics.

      At any rate, since you ask: for whole-genome phylogeny to work well (even if we are not using superfamilies, since chimp, gorilla and human all have the same repertoire of superfamilies and are thus indistinguishable by that measure), we require high quality and independently determined whole genomes. We have found that, particularly amongst mammals (including gorilla), the genomes are not of sufficiently high quality or sufficiently independent from other genome assemblies and gene models (based on species-specific transcript data). I am not saying that our tree is necessarily right or wrong in this case, but we have lower confidence in the mammalian part of the tree than in other parts. Look for strengths in e.g. fungi and in the deeper, more ancient branches that sequence-based analyses struggle to reach.

      It is frustrating that the top journals continue to reward 'firsts' in genome sequencing over quality. Genome sequencing groups are encouraged and rewarded for releasing poor quality assemblies (usually built on some other already-sequenced organism) as quickly as possible, published with cursory bean-counting analyses and promises of what subsequent research on the organism in question will deliver. Subsequently instead of spending the following years delivering the promised science, they jump to the next 'new' organism and turn the handle on the next pointless Nature/Science/etc. paper judged by the choice of organism or scale of the study rather than the quality of the science. The protein sets for these organisms are usually missing large numbers of essential housekeeping genes and their gene models are clearly just an intersection with those of the organism on which the assembly was based. The field is in desperate need of a measure of genome quality (such as the R factor for protein structure).

      I digress ... the issue of genome size is easily avoidable by considering presence/absence of molecular characters and ignoring copy number which in e.g. human is more a function of arbitrary cutoffs for isoforms and pipeline parameters than anything else anyway.
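
      To make the presence/absence idea above concrete, here is a minimal sketch. The genome names, superfamily labels, and copy numbers are invented purely for illustration and are not taken from any real dataset or from the tree server discussed here; the Hamming distance is likewise just one simple, assumed way to compare the resulting profiles.

```python
# Hypothetical copy-number table: genome -> {superfamily: copies found}.
# All names and counts below are invented for illustration only.
copy_numbers = {
    "genome_A": {"sf1": 12, "sf2": 1, "sf3": 0},
    "genome_B": {"sf1": 3, "sf2": 0, "sf3": 5},
    "genome_C": {"sf1": 40, "sf2": 2, "sf3": 0},
}


def to_presence_absence(table):
    """Collapse copy numbers to 1 (present) or 0 (absent) per superfamily."""
    # Use one fixed, sorted list of superfamilies so every profile is aligned.
    families = sorted({sf for counts in table.values() for sf in counts})
    return {
        genome: tuple(1 if counts.get(sf, 0) > 0 else 0 for sf in families)
        for genome, counts in table.items()
    }


def hamming(a, b):
    """Number of superfamilies present in one profile but absent in the other."""
    return sum(x != y for x, y in zip(a, b))


profiles = to_presence_absence(copy_numbers)
```

      In this toy table, genome_A and genome_C differ greatly in copy number but get identical presence/absence profiles, which is the sense in which the character ignores copy number (and, with it, much of the variation in genome size).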

    12. I am at a loss as to what to say. Now you are attacking the publishing of genome papers that have a data quality you don't like. And the journals too.

      I looked at the tree you said I should look at "readers of this may also be interested in a live and regularly-updated species tree of all completely sequenced genomes derived from molecular characters is available here". I was sincerely trying to see how your approach might be working.

      I found something in the first part of the tree I looked at that seemed off. I commented on it. And your response is to attack me for using "debating tactics" not scientific reasoning (I don't even know what debating tactics are).

      At some level, you continue to lose me with many of your lines of argument here. But I think I get the gist. You are angry and frustrated that I cannot see the wisdom and quality of your method and your approach. I am sorry that is the case. But I still am not sure I like the general approach. You have done nothing in your presentation to convince me that my critiques (which were originally about a specific paper) were wrong.

      Maybe we are not even talking about the same thing here (seems like we are not). But, regardless, when you start criticizing me for actually looking in more depth at your results as you suggested and then commenting on what I see, I think it is time to move on to other topics and other interactions.

  8. This comment has been removed by the author.

  9. I am not angry or frustrated; I am trying to shed light where there is darkness. My approach, or the wisdom of it, has not been described here and was never under discussion. You have confused my approach with that used in the Caetano-Anolles paper in this discussion; I am sorry if I confused things by also mentioning my own work in passing at the end. I have said several times that our tree is not based on superfamily characters.

    My initial comments were all regarding aspects of superfamily domains, within the context of the Caetano-Anolles paper; there was clearly a lack of understanding about superfamily domains and I was trying to be informative as I have a great deal of expertise on the subject to share. Neither of us has done a very good job of sticking to that point.

  10. I must say that I followed the exchanges between Jonathan, our host, and Julian with much interest. The reason is that unlike conventional publications, here in the Blogosphere it is difficult to hide from inconvenient questions or issues.

    The focus of the discussion was on the merit of using protein domains (e.g. superfamily) vs sequences as molecular characters for determining phylogenetic relationships. Although this discussion developed in a post about the evolution of viruses, specifically about a paper by Nasir et al. entitled “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” unfortunately, it did not crystallize around this subject. Possibly, that’s because without clarifying the robustness of the methods used for inferring phylogenies, both Julian and Jonathan felt that it was not worth pursuing a productive discussion of the paper and its subject; or, possibly, they inadvertently got distracted and distanced themselves from such a discussion.

    My goal here is to resurrect this discussion, and hopefully make progress in understanding the origin and evolution of viruses. And I’ll try to do that by bringing forward some biological and evolutionary principles that might direct, or at least help with, the interpretation of phylogenetic data, whether this data is based on protein domains or sequence characters.

    As I mentioned in a previous comment (please see above), fortunately, there are only 2 broad ways of thinking about the origin and evolution of viruses: they evolved from simple to complex by increasing the size of their genome and the complexity of their proteome, or from complex to simple (reviewed here:

    Although it is clear that the extant viruses and their recent viral ancestors have occasionally acquired new genetic material, most of the data, including that produced by Nasir et al., supports the paradigm that, overall, the current viral lineages are evolving by reductive evolution. Probably the most convincing argument for the reductive evolution of viral lineages is the evolutionary pattern of the thousands of extant intracellular parasitic or symbiotic cellular lineages, which to my knowledge have been evolving, without exception, by reductive evolution.

    If that is the case, then the question is: why would parasitic or symbiotic viral lineages evolve any other way?

    1. I ended my previous comment with the question: why would the viral lineages and the parasitic cellular lineages, which both occupy an intracellular environment, evolve in opposite directions? Indeed, currently, it is fully accepted that all parasitic or symbiotic cellular lineages have been evolving by reductive evolution (see ref. 1), whereas, according to the current prevalent views (see 2-5), viruses originated from simple genetic elements that have evolved, by acquiring new genetic material, into complex viruses, such as poxviruses, chloroviruses, and megaviruses, whose genomes are several times larger than the genomes of many cellular species.

      In the last few years, I have asked this question of many, if not most, researchers working in the field of deep viral evolution, and I have received no reasonable answers. So, if indeed there is no rationale for the evolution of viral lineages towards complexity in an intracellular environment, then it would make sense to abandon the current prevalent view about the evolution of viruses from simple to complex, and bring forward the theory that viral lineages originated from parasitic cellular lineages by reductive evolution.

      As many readers probably know, the reductive origin of viruses from parasitic cellular species was proposed more than a century ago. However, the reductive theory was later abandoned because, as eloquently described by Salvador Luria and James Darnell: "The strongest argument against the regressive origin of viruses from cellular parasites is the non-cellular organization of viruses. The viral capsids are morphogenetically analogous to cellular organelles made up of protein subunits, such as bacterial flagella, actin filaments, and the like, and not to cellular membranes…. This theory today has little to recommend it, at least in its original form." (6).

      In a paper published thirty years ago (7), I presented a novel reductive model for the evolutionary origin and nature of viruses, which challenged the dogma of viruses as virus particles and provided a solution to the problems associated with the reductive theory. In a recently updated version of this model (8), it is proposed that a parasitic cellular lineage evolves into a viral lineage when it acquires the ability to enter the host cell by fusing its cellular membrane with that of the host cell. Unlike parasites that maintain their cellular membrane within the host cell, this novel parasitic lineage, by losing its cellular membrane, gains full access to the host cell’s resources, such as nucleotides, amino acids, and lipids. More significantly, however, the parasite gains access to the host cell’s informational machineries, particularly to the translational apparatus, which creates unique parasitic and evolutionary opportunities. After the parasite synthesizes its specific molecules and replicates its genome using the resources found in its environment (i.e. the host cell), it directs the assembly or morphogenesis of new cellular membranes and cell-like progenies that further differentiate into transmissible, infectious forms – the viral particles.

      Among extant viruses, the life cycle of poxviruses and other complex viruses that start their intracellular life cycle by fusing with their host cells provides compelling evidence for the fusion model (see Fig. 1 in ref. 8). One of the most remarkable implications of this model is that new viral lineages originated from parasitic cellular species throughout the history of life, and that this process might still be active.

      References: (because of space limitations, references are presented in a separate comment)

    2. References (this is a reference list for my previous comment):

      (1) McCutcheon JP, Moran NA. Extreme genome reduction in symbiotic bacteria. Nat Rev Microbiol. 2011; 10:13-26.

      (2) Koonin EV, Dolja VV. Evolution of complexity in the viral world: the dawn of a new vision. Virus Res 2006; 117:1-4.

      (3) Forterre P. The origin of viruses and their possible roles in major evolutionary transitions. Virus Res 2006; 117:5-16.

      (4) Krupovic M, Bamford DH. Virus evolution: how far does the double beta-barrel viral lineage extend? Nat Rev Microbiol 2008; 6:941-8.

      (5) Moreira D, Lopez-Garcia P. Ten reasons to exclude viruses from the tree of life. Nat Rev Microbiol 2009; 7:306-11.

      (6) Luria SE, Darnell JE. General Virology. Wiley, New York. 1967.

      (7) Bandea CI. A new theory on the origin and the nature of viruses. J Theor Biol 1983; 105:591-602.

      (8) Bandea CI. The origin and evolution of viruses as molecular organisms. 2009.

    3. This comment has been removed by the author.

    4. According to the fusion model on the origin of viruses (1), thousands of viral lineages originated from parasitic cellular species, both before and after the origin of the three cellular superkingdoms, Bacteria, Archaea and Eukarya. And, remarkably, this process might still be active; this means that we might find extant cellular lineages that are on their way to evolving into viral lineages, which would provide direct evidence for the model.

      Another radical tenet associated with this model is that only parasitic lineages that have a molecular biology compatible with that of their host would be able to enter their host cell by fusion and take full advantage of the host resources. This means, for example, that the numerous parasitic bacteria infecting eukaryal host cells would not be able to evolve into viral lineages.

      It also means that some of the highly complex viruses infecting eukaryal cells might have evolved from parasitic eukaryal species. Although this would seem totally far-fetched, we know of many examples of eukaryal organisms that have a genome smaller than that of large viruses (3, 4), so the potential for profound reductive evolution of eukaryal cells is real.

      In the fusion model for the reductive evolution of parasitic eukaryal species into viral lineages, the parasites would lose their organelles, such as mitochondria, plastids, and much of the cyto-membrane system, which are found in their unique environment - the eukaryal host cell. The nucleus, however, which is tightly coupled with gene expression, might be maintained evolutionarily for long periods before it could be lost through reductive evolution. It might be expected, therefore, that some complex viruses infecting eukaryal host cells have remnants of a nuclear membrane. Interestingly, using this model as a working hypothesis, I found data that supports this evolutionary model. As shown by electron tomography (2), the poxviral cell-like infectious particles apparently contain a genuine ‘nuclear’ membrane.

      However, the most remarkable finding revealed by using this model as a working hypothesis is that several parasitic lineages currently classified as species of algae (5) and fungi (6) might be evolving into viral lineages (1). As predicted by the fusion model, similar to some of the extant large viruses, these parasitic lineages: (i) fuse with their host cells, (ii) develop a ‘dispersed’ or ‘molecular structure’ (see ref. 1), (iii) replicate their genome and synthesize their other specific molecules using host-cell resources, and (iv) generate cell-like reproductive forms that differentiate into transmissible or infectious particles, which start a new life cycle by fusing with other host cells.


      (1) Bandea, CI. The Origin and Evolution of Viruses as Molecular Organisms. 2009.

      (2) Cyrklaff M, Risco C, Fernandez JJ, et al. Cryo-electron tomography of vaccinia virus. Proc. Natl. Acad. Sci. 2005, 102:2772-2777.

      (3) Archibald JM. Nucleomorph genomes: structure, function, origin and evolution. Bioessays, 2007. 29:392-402.

      (4) Keeling PJ and Slamovits CH. Causes and effects of nuclear genome reduction. Curr. Opin. Genet. Dev. 2005, 15:601-608.

      (5) Goff LJ and Coleman AW. Transfer of nuclei from a parasite to its host. Proc. Natl. Acad. Sci. 1984, 81:5420-4.

      (6) Bauer R, Lutz M and Oberwinkler F. Tuberculina-rusts: a unique basidiomycetous interfungal cellular interaction with horizontal nuclear transfer. Mycologia 2004, 96:960-7.

