Monday, September 03, 2012

A blast from the past: Plasmodium, plastids, phylogeny, and reproducibility

A few days ago I got an email from a colleague who I had not seen in many years.  It was from Malcolm Gardner who worked at TIGR when I was there and is now at Seattle Biomed.

His email was related to the 2002 publication of the complete genome sequence of Plasmodium falciparum - the causative agent of most human malaria cases -  for which he was the lead author.   Someone had emailed Malcolm asking if he could provide details about the settings used in the blast searches that were part of the evolutionary analyses of the paper.   The paper is freely available at Nature - at least for now - every once in a while the Nature Publishing Group seems to put it behind a paywall despite their promises not to.

Malcolm was contacting me because I had run / coordinated much of the evolutionary analysis reported in that paper.  I note - as one of the only evolution focused people at TIGR it was pretty common for people to come to me and ask if I could help them with their genome.  I pretty much always said yes since, well, I loved doing that kind of thing and it was really exciting in the early days of genome sequencing to be the first person to ask some evolution related question about the data.

Malcolm included the email he had received (which did not have a lot of detail) and he and I wrote back and forth trying to figure out exactly what this person wanted.  And then I said, well, maybe the person should get in touch with me directly so I can figure out what they really want/need.  It seemed unusual that someone was asking about something like that from a 10 year old paper, but, whatever.  

As I was communicating with this person, I started digging through my files and my brain trying to remember exactly what had been done for this paper more than 10 years ago.  I remember Malcolm and others from the Plasmodium community organizing some "jamborees" looking at the annotation of the genome. At one of those jamborees Malcolm and I met with some of the folks from the Sanger Center (which was one of the big players in the P. falciparum genome sequencing) and, after some discussion, I ended up doing four main things relating to the paper, which I describe below.

Thing 1: Conserved eukaryote genes

One of my analyses was to use the genome to look for genes conserved in eukaryotes but not present in bacteria or archaea.  I did this to try and find genes that could be considered likely to have been invented on the evolutionary branch leading up to the common ancestor of eukaryotes.

As an aside, at about the same time I was asked to write a News and Views for Nature about the publication of the Schizosaccharomyces pombe genome.  In the N&V I wrote, "Genome sequencing: Brouhaha over the other yeast", I noted how the authors had used the genome to do some interesting analysis of conserved eukaryotic genes.  With the help of the Nature staff I also made a figure which demonstrated (sort of) what they were trying to do in their analysis - which was to find genes that originated on the branch leading up to the common ancestor of the eukaryotes for which genomes were available at the time.  As another aside - the S. pombe genome paper and my News and Views article are freely available ...

Figure 1: The tree of life, with the branches labelled according to Wood et al.'s analysis of genes that might be specific to eukaryotes versus prokaryotes, and to multicellular versus single-celled organisms. Bacteria and archaea are prokaryotes (they do not have nuclei). From Nature 415, 845-848 (21 February 2002) | doi:10.1038/nature725. The eukaryotic portion of the tree is based on Baldauf et al. 2000

Anyway, I did a similar analysis to what was in the S. pombe genome paper, found a reasonable number of such eukaryote-specific genes, and helped write a section for the paper on this.
Comparative genome analysis with other eukaryotes for which the complete genome is available (excluding the parasite E. cuniculi) revealed that, in terms of overall genome content, P. falciparum is slightly more similar to Arabidopsis thaliana than to other taxa. Although this is consistent with phylogenetic studies (64), it could also be due to the presence in the P. falciparum nuclear genome of genes derived from plastids or from the nuclear genome of the secondary endosymbiont. Thus the apparent affinity of Plasmodium and Arabidopsis might not reflect the true phylogenetic history of the P. falciparum lineage. Comparative genomic analysis was also used to identify genes apparently duplicated in the P. falciparum lineage since it split from the lineages represented by the other completed genomes (Supplementary Table B). 
There are 237 P. falciparum proteins with strong matches to proteins in all completed eukaryotic genomes but no matches to proteins, even at low stringency, in any complete prokaryotic proteome (Supplementary Table C). These proteins help to define the differences between eukaryotes and prokaryotes. Proteins in this list include those with roles in cytoskeleton construction and maintenance, chromatin packaging and modification, cell cycle regulation, intracellular signalling, transcription, translation, replication, and many proteins of unknown function. This list overlaps with, but is somewhat larger than, the list generated by an analysis of the S. pombe genome (65). The differences are probably due in part to the different stringencies used to identify the presence or absence of homologues in the two studies.
The list of genes is available as supplemental material on the Nature web site.  Alas it is in MS Word format which is not the most useful thing.  But more on that issue at the end of this post.
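For what it is worth, the presence/absence filter behind that list can be sketched in a few lines of Python. This is only an illustration, not the code I ran at TIGR: the function name, the input layout (a dict of best E values per genome) and the toy identifiers are all made up for the example. The two cutoffs (10^-30 for eukaryotes, 10^-15 for prokaryotes) are the ones given in the paper's methods.

```python
def eukaryote_specific(best_evalue, eukaryotes, prokaryotes,
                       euk_cutoff=1e-30, prok_cutoff=1e-15):
    """Return proteins matched in every eukaryote and in no prokaryote.

    best_evalue: dict mapping protein -> {genome name -> best BLASTP E value}.
    A genome missing from the inner dict is treated as "no hit at all".
    (Hypothetical data layout, chosen only for this sketch.)
    """
    no_hit = float("inf")
    keep = []
    for protein, per_genome in best_evalue.items():
        in_all_euks = all(per_genome.get(g, no_hit) <= euk_cutoff
                          for g in eukaryotes)
        in_any_prok = any(per_genome.get(g, no_hit) <= prok_cutoff
                          for g in prokaryotes)
        if in_all_euks and not in_any_prok:
            keep.append(protein)
    return sorted(keep)


# Toy example with invented identifiers and E values.
toy = {
    "pf_actin_like": {"yeast": 1e-90, "human": 1e-95, "plant": 1e-88},
    "pf_ribosomal": {"yeast": 1e-60, "human": 1e-62, "plant": 1e-61,
                     "e_coli": 1e-40},
    "pf_orphan": {"yeast": 1e-5},
}
print(eukaryote_specific(toy, ["yeast", "human", "plant"], ["e_coli"]))
```

Note that the answer depends heavily on the two cutoffs, which is exactly why the list in the paper differs from the S. pombe one.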

Thing 2: Searching for lineage specific duplications

Another aspect of comparative genomic analysis that I used to do for most genomes at TIGR was to look for lineage specific duplications (i.e., genes that have undergone duplications in the lineage of the species being studied to the exclusion of the lineages for which other genomes are available).  The quick and dirty way we used to do this was to simply look for genes that had a better blast match to another gene from their own genome than to genes in any other genome.  The list of genes we identified this way is also provided as a Word document in Supplemental materials.
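A minimal sketch of that quick and dirty screen, assuming the BLASTP hits have already been parsed into (query, subject, subject species, E value) tuples. The tuple layout, the function name and the gene identifiers are hypothetical and are not taken from the original pipeline; the 10^-15 cutoff is the one reported in the paper's methods.

```python
def lineage_specific_duplicates(hits, own_species, max_evalue=1e-15):
    """Return queries whose best non-self hit is to their own genome.

    hits: iterable of (query, subject, subject_species, evalue) tuples.
    """
    best_own = {}    # query -> lowest E value to another gene in its own genome
    best_other = {}  # query -> lowest E value to a gene in any other genome
    for query, subject, species, evalue in hits:
        if query == subject or evalue > max_evalue:
            continue  # ignore self hits and matches weaker than the cutoff
        table = best_own if species == own_species else best_other
        if query not in table or evalue < table[query]:
            table[query] = evalue
    return sorted(
        query for query, evalue in best_own.items()
        if evalue < best_other.get(query, float("inf"))
    )


# Toy example with invented identifiers.
toy_hits = [
    ("pfA", "pfA", "Pf", 0.0),     # self hit, ignored
    ("pfA", "pfB", "Pf", 1e-80),   # paralog beats the plant hit
    ("pfA", "at1", "At", 1e-40),
    ("pfC", "pfD", "Pf", 1e-20),   # paralog is worse than the yeast hit
    ("pfC", "sc1", "Sc", 1e-60),
    ("pfE", "pfF", "Pf", 1e-5),    # too weak to count
]
print(lineage_specific_duplicates(toy_hits, "Pf"))
```

Comparing E values like this is crude (it ignores alignment length and ties), which is part of why this was only ever the quick first pass.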

Thing 3: Searching for organelle derived genes in the nuclear genome of P. falciparum

The third thing I did for the paper was to search for organelle derived genes in the nuclear genome of Plasmodium.  Specifically I was looking for genes derived from the mitochondrial genome and plastid genome.  For those who do not know, Plasmodium is a member of the Apicomplexa - nearly all organisms in this group have an unusual organelle called the apicoplast.  Though the exact nature of this organelle had been debated, its evolutionary origins were determined by none other than Malcolm Gardner many years earlier (Gardner et al. 1994). They had shown that this organelle was in fact derived from chloroplasts (which themselves are derived from cyanobacteria).  I am ashamed to say that before hanging out with Malcolm and talking about Plasmodium I did not know this.  This finding of a chloroplast in an evolutionary group of eukaryotes that are not particularly closely related to plants is one of the key pieces of evidence in the "secondary endosymbiosis" hypothesis, which proposes that some eukaryotes have brought into themselves as an endosymbiont a single-celled photosynthetic alga which had a chloroplast.  

Anyway - here we were - with the first full genome of a member of the Apicomplexa.  And we could use it to discover some new details on plastid evolution and secondary endosymbioses.  So I adapted some methods I had used in analyzing the Arabidopsis genome (see Lin et al. 1999 and AGI 2000), and searched for plastid derived genes in the nuclear genome of Plasmodium.  Why look in the nuclear genome for plastid genes?  Or mitochondrial genes for that matter.  Well, it turns out that genes that were once in the organelle genomes frequently move to the nuclear genome of their "host".  In fact, a lot of genes move.  So - if you want to study the evolution of an organism's organelles, it is sometimes more fruitful to look in the nuclear genome than in the actual organelle's genome.  OK - now back to the Plasmodium genome.  What I was doing was trying to find genes in the nuclear genome that had once been in the plastid genome.  How would you look for these?  

To find mitochondrial-derived genes I did blast searches against the same database of genomes used to study the evolution of eukaryotes, but for this I looked for genes in Plasmodium that have decent matches to genes in alpha proteobacteria.  For those I then built phylogenetic trees of each gene and its homologs, and screened through all the trees to look for any in which the gene from Plasmodium grouped inside a clade with sequences from alpha proteobacteria (allowing for mitochondrial genes from other eukaryotes to be in this clade).  
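The first, blast-based step can be sketched roughly as follows. The rule used here (a candidate is any protein for which one of the top six prokaryotic matches is to an organelle genome or to a relative of the organelle's ancestor) is the one stated in the paper's methods, quoted later in this post. Everything else is an assumption made for the sketch: the group labels, the function name, and the choice to count organelle-encoded proteins among the "prokaryotic" matches.

```python
from collections import defaultdict

EUKARYOTE_NUCLEAR = "eukaryote_nuclear"  # hypothetical label for nuclear-encoded hits


def organelle_candidates(hits, target_groups, top_n=6):
    """Return queries with a target-group hit among their top prokaryotic matches.

    hits: iterable of (query, subject_group, evalue) tuples, where
    subject_group is a coarse taxonomic label assigned beforehand.
    """
    by_query = defaultdict(list)
    for query, group, evalue in hits:
        if group != EUKARYOTE_NUCLEAR:  # keep prokaryotic and organelle hits only
            by_query[query].append((evalue, group))
    candidates = []
    for query, matches in by_query.items():
        top = sorted(matches)[:top_n]  # lowest E values first
        if any(group in target_groups for _, group in top):
            candidates.append(query)
    return sorted(candidates)


# Toy example: pf1 has an alpha-proteobacterial hit, but only in seventh place.
toy_hits = [("pf1", "other_bacteria", 10.0 ** -(50 - i)) for i in range(6)]
toy_hits += [
    ("pf1", "alpha_proteobacteria", 1e-10),
    ("pf2", "other_bacteria", 1e-40),
    ("pf2", "alpha_proteobacteria", 1e-30),
    ("pf3", "eukaryote_nuclear", 1e-90),
]
print(organelle_candidates(toy_hits, {"alpha_proteobacteria", "mitochondrion"}))
```

Swapping the target set for cyanobacteria and plastid-encoded proteins gives the corresponding plastid screen. Either way, this only produces candidates; the trees are what decide.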

To find plastid derived genes I did a similar screen, except that I instead searched for genes that grouped in evolutionary trees with genes from cyanobacteria (or eukaryotic genes that were from plastids).  The section of the paper that I helped write is below:
A large number of nuclear-encoded genes in most eukaryotic species trace their evolutionary origins to genes from organelles that have been transferred to the nucleus during the course of eukaryotic evolution. Similarity searches against other complete genomes were used to identify P. falciparum nuclear-encoded genes that may be derived from organellar genomes. Because similarity searches are not an ideal method for inferring evolutionary relatedness (66), phylogenetic analysis was used to gain a more accurate picture of the evolutionary history of these genes. Out of 200 candidates examined, 60 genes were identified as being of probable mitochondrial origin. The proteins encoded by these genes include many with known or expected mitochondrial functions (for example, the tricarboxylic acid (TCA) cycle, protein translation, oxidative damage protection, the synthesis of haem, ubiquinone and pyrimidines), as well as proteins of unknown function. Out of 300 candidates examined, 30 were identified as being of probable plastid origin, including genes with predicted roles in transcription and translation, protein cleavage and degradation, the synthesis of isoprenoids and fatty acids, and those encoding four subunits of the pyruvate dehydrogenase complex. The origin of many candidate organelle-derived genes could not be conclusively determined, in part due to the problems inherent in analysing genes of very high (A + T) content. Nevertheless, it appears likely that the total number of plastid-derived genes in P. falciparum will be significantly lower than that in the plant A. thaliana (estimated to be over 1,000). Phylogenetic analysis reveals that, as with the A. thaliana plastid, many of the genes predicted to be targeted to the apicoplast are apparently not of plastid origin. Of 333 putative apicoplast-targeted genes for which trees were constructed, only 26 could be assigned a probable plastid origin. 
In contrast, 35 were assigned a probable mitochondrial origin and another 85 might be of mitochondrial origin but are probably not of plastid origin (they group with eukaryotes that have not had plastids in their history, such as humans and fungi, but the relationship to mitochondrial ancestors is not clear). The apparent non-plastid origin of these genes could either be due to inaccuracies in the targeting predictions or to the co-option of genes derived from the mitochondria or the nucleus to function in the plastid, as has been shown to occur in some plant species (67).
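The tree-screening step (does the Plasmodium gene group with sequences from the organelle's ancestors?) can be approximated in code, with heavy caveats. The sketch below assumes a rooted tree written as nested Python tuples of leaf labels and only asks whether everything in the sister clade of the target gene belongs to an allowed set. That is a much cruder test than looking at each tree, it ignores branch support and rooting problems, and the labels and function names are invented for the example.

```python
def leaves(node):
    """Collect leaf labels under a node (a label string or a tuple of children)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def sister_leaves(node, target):
    """Return the leaves of the clades sharing a parent with `target`.

    Returns None when `target` is not found below `node`.
    """
    if isinstance(node, str):
        return None
    for i, child in enumerate(node):
        if child == target:
            siblings = [sib for j, sib in enumerate(node) if j != i]
            return set().union(*(leaves(sib) for sib in siblings))
    for child in node:
        result = sister_leaves(child, target)
        if result is not None:
            return result
    return None


def groups_with(tree, target, allowed):
    """True if every leaf in the sister clade of `target` is in `allowed`."""
    sisters = sister_leaves(tree, target)
    return bool(sisters) and sisters <= allowed


# Toy trees with invented labels.
mito_like = ((("pf_gene", "rickettsia_1"), "rickettsia_2"),
             ("e_coli", ("human", "yeast")))
not_mito = (("pf_gene", "human"), ("rickettsia_1", "rickettsia_2"))
allowed = {"rickettsia_1", "rickettsia_2", "human_mito"}
print(groups_with(mito_like, "pf_gene", allowed))
print(groups_with(not_mito, "pf_gene", allowed))
```

Including mitochondrion-encoded sequences from other eukaryotes in the allowed set mirrors the allowance described above for mitochondrial genes sitting in the same clade.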

Thing 4: Analysis of DNA repair genes

Arnab Pain from the Sanger Center and I analyzed genes predicted to be involved in DNA repair and recombination processes and wrote a section for the paper:
DNA repair processes are involved in maintenance of genomic integrity in response to DNA damaging agents such as irradiation, chemicals and oxygen radicals, as well as errors in DNA metabolism such as misincorporation during DNA replication. The P. falciparum genome encodes at least some components of the major DNA repair processes that have been found in other eukaryotes (111, 112). The core of eukaryotic nucleotide excision repair is present (XPB/Rad25, XPG/Rad2, XPF/Rad1, XPD/Rad3, ERCC1) although some highly conserved proteins with more accessory roles could not be found (for example, XPA/Rad4, XPC). The same is true for homologous recombinational repair with core proteins such as MRE11, DMC1, Rad50 and Rad51 present but accessory proteins such as NBS1 and XRS2 not yet found. These accessory proteins tend to be poorly conserved and have not been found outside of animals or yeast, respectively, and thus may be either absent or difficult to identify in P. falciparum. However, it is interesting that Archaea possess many of the core proteins but not the accessory proteins for these repair processes, suggesting that many of the accessory eukaryotic repair proteins evolved after P. falciparum diverged from other eukaryotes. 
The presence of MutL and MutS homologues including possible orthologues of MSH2, MSH6, MLH1 and PMS1 suggests that P. falciparum can perform post-replication mismatch repair. Orthologues of MSH4 and MSH5, which are involved in meiotic crossing over in other eukaryotes, are apparently absent in P. falciparum. The repair of at least some damaged bases may be performed by the combined action of the four base excision repair glycosylase homologues and one of the apurinic/apyrimidinic (AP) endonucleases (homologues of Xth and Nfo are present). Experimental evidence suggests that this is done by the long-patch pathway (113). 
The presence of a class II photolyase homologue is intriguing, because it is not clear whether P. falciparum is exposed to significant amounts of ultraviolet irradiation during its life cycle. It is possible that this protein functions as a blue-light receptor instead of a photolyase, as do members of this gene family in some organisms such as humans. Perhaps most interesting is the apparent absence of homologues of any of the genes encoding enzymes known to be involved in non-homologous end joining (NHEJ) in eukaryotes (for example, Ku70, Ku86, Ligase IV and XRCC1)(112). NHEJ is involved in the repair of double strand breaks induced by irradiation and chemicals in other eukaryotes (such as yeast and humans), and is also involved in a few cellular processes that create double strand breaks (for example, VDJ recombination in the immune system in humans). The role of NHEJ in repairing radiation-induced double strand breaks varies between species (114). For example, in humans, cells with defects in NHEJ are highly sensitive to γ-irradiation while yeast mutants are not. Double strand breaks in yeast are repaired primarily by homologous recombination. As NHEJ is involved in regulating telomere stability in other organisms, its apparent absence in P. falciparum may explain some of the unusual properties of the telomeres in this species (115).

Back to the story
Anyway ... back to the story.  I do not have current access to all of TIGR's old computer systems which is where my searches for the genome paper reside.  But I figured I might have some notes somewhere on my computer about what blast parameters I used for these searches.  And amazingly I did.  As I was getting ready to write back to Malcolm and to the person who had asked for the information I decided to double check to see what was in the paper.  And amazingly, much of the detail was right there all along.   
Plasmodium falciparum proteins were searched against a database of proteins from all complete genomes as well as from a set of organelle, plasmid and viral genomes. Putative recently duplicated genes were identified as those encoding proteins with better BLASTP matches (based on E value with a 10^-15 cutoff) to other proteins in P. falciparum than to proteins in any other species. Proteins of possible organellar descent were identified as those for which one of the top six prokaryotic matches (based on E value) was to either a protein encoded by an organelle genome or by a species related to the organelle ancestors (members of the Rickettsia subgroup of the α-Proteobacteria or cyanobacteria). Because BLAST matches are not an ideal method of inferring evolutionary history, phylogenetic analysis was conducted for all these proteins. For phylogenetic analysis, all homologues of each protein were identified by BLASTP searches of complete genomes and of a non-redundant protein database. Sequences were aligned using CLUSTALW, and phylogenetic trees were inferred using the neighbour-joining algorithms of CLUSTALW and PHYLIP. For comparative analysis of eukaryotes, the proteomes of all eukaryotes for which complete genomes are available (except the highly reduced E. cuniculi) were searched against each other. The proportion of proteins in each eukaryotic species that had a BLASTP match in each of the other eukaryotic species was determined, and used to infer a 'whole-genome tree' using the neighbour-joining algorithm. Possible eukaryotic conserved and specific proteins were identified as those with matches to all the complete eukaryotic genomes (10^-30 E-value cutoff) but without matches to any complete prokaryotic genome (10^-15 cutoff).
Alas, I cannot for the life of me find what other parameters I used for the blastp searches.  I am 99.9999% sure I used default settings but alas, I don't know what default settings for blast were in that era.  And I am not even sure which version of blastp was installed on the TIGR computer systems then.  I certainly need to do a better job of making sure everything I do is truly reproducible.
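A minimal sketch of one low-tech way to avoid this situation: write a small record of the tool, its version, every parameter (defaults included) and the database next to each set of results. The field names and the values in the example are placeholders made up for illustration, not a recovered record of what was actually run for the paper.

```python
import json
import platform
from datetime import datetime, timezone


def provenance_record(tool, version, parameters, database):
    """Build a plain dict describing how one analysis step was run."""
    return {
        "tool": tool,
        "tool_version": version,
        "parameters": dict(parameters),  # copy, so later edits do not leak in
        "database": database,
        "python_version": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def save_record(record, path):
    """Write the record as sorted, indented JSON so it is easy to diff."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


# Placeholder values only; the real version and defaults are exactly
# what could not be recovered ten years later.
record = provenance_record(
    tool="blastp",
    version="unknown",
    parameters={"evalue_cutoff": 1e-15, "other_settings": "defaults"},
    database="all complete genomes plus organelle, plasmid and viral genomes",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Calling save_record(record, "some_results_directory/provenance.json") (a hypothetical path) would keep the record alongside the output files it describes.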


This all brings me to the actual real part of this story.  Reproducibility.  It is a big deal.  Anyone should be able to reproduce what was done in a study.  And alas, it is difficult to do that when not all the methods are fully described.  And one should also provide intermediate results so that people do not have to redo everything you did in a study but can just reproduce part of it.   It would be good to have, for example, released all the phylogenetic trees from the analysis of organellar genes in Plasmodium.  Alas, I do not seem to have all of these files as they were stored in a directory at TIGR dedicated to this genome project and as I am no longer at TIGR I do not have ready access to that material.  It is probably still lounging around somewhere on the JCVI computer systems (TIGR alas, no longer officially exists ... it was swallowed by the J. Craig Venter Institute ...).  But I will keep digging and I will post them to some place like FigShare if/when I find them.

Perhaps more importantly, I will be working with my lab to make sure that in the future we store/record/make available EVERYTHING that would allow people to reproduce, re-analyze, re-jigger, re-whatever anything from our papers.

The key lesson - plan in advance for how you are going to share results, methods, data, etc ...


  1. We are clearly on the same wavelength today :).

    1. ha ... we should start a club

    2. Let's call ourselves "scientists"!

  2. This comment has been removed by a blog administrator.

  3. The good folks over at ENCODE opted to make a virtual machine image available for download to coincide with the release of their publications today.

    Available at

    How's that for transparency and reproducibility!

  4. On a related note, I just read this paper tonight: "Next-generation sequencing data interpretation: enhancing reproducibility and accessibility" (not open access - frustrating because it's a good review). Take home points were 1) Life scientists can't escape computational biology now given the NGS revolution, so they're going to have to get used to it, 2) Always provide access to primary data, 3) Record versions of ALL auxiliary datasets used during analysis (e.g. genome assembly releases), 4) Note the EXACT versions of software used, 5) Record all parameters, even if defaults are used, 6) Provide all custom scripts as supplementary data.

    The authors also made the point that it is very painful right now for researchers to even adhere to these basic guidelines, because most bioinformatics software is just not designed to automatically log versions/parameters/commands and blatantly deliver them to users as output files. So this paper was really an eye opener for me about just how *important* it really is to record and publish EVERYTHING related to original data analysis, for reproducibility purposes. I'm all for adopting that kind of policy!