I have been interested in such activities since, well, since 1989 when I started working in Colleen Cavanaugh's lab at Harvard sequencing rRNA genes to do classification. And I have known one of the authors, Michael Cummings, for almost as long.
Their abstract does a decent job of summing up what they did:
A fundamental problem in modern genomics is to taxonomically or functionally classify DNA sequence fragments derived from environmental sampling (i.e., metagenomics). Several different methods have been proposed for doing this effectively and efficiently, and many have been implemented in software. In addition to varying their basic algorithmic approach to classification, some methods screen sequence reads for ‘barcoding genes’ like 16S rRNA, or various types of protein-coding genes. Due to the sheer number and complexity of methods, it can be difficult for a researcher to choose one that is well-suited for a particular analysis.
We divided the very large number of programs that have been released in recent years for solving the sequence classification problem into three main categories based on the general algorithm they use to compare a query sequence against a database of sequences. We also evaluated the performance of the leading programs in each category on data sets whose taxonomic and functional composition is known.
We found significant variability in classification accuracy, precision, and resource consumption of sequence classification programs when used to analyze various metagenomics data sets. However, we observe some general trends and patterns that will be useful to researchers who use sequence classification programs.
The three main categories of methods they identified are:
- Programs that primarily utilize sequence similarity search
- Programs that primarily utilize sequence composition models (like CompostBin from my lab)
- Programs that primarily utilize phylogenetic methods (like AMPHORA & STAP from my lab)
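
To give a feel for what the second category means in practice, here is a toy sketch of composition-based classification. It is only an illustration and not how CompostBin or any of the programs in the paper actually work: the 3-mer length, the cosine similarity measure, the function names, and the two made-up "reference genomes" are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product
import math

K = 3  # k-mer length; an arbitrary choice for this illustration
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return normalized k-mer frequencies; k-mers with non-ACGT letters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[m] / total for m in KMERS]


def cosine(p, q):
    """Cosine similarity between two profiles (0.0 if either one is empty)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def classify(read, references):
    """Assign a read to the reference whose k-mer profile is most similar."""
    read_profile = kmer_profile(read)
    return max(
        references,
        key=lambda label: cosine(read_profile, kmer_profile(references[label])),
    )


# Made-up sequences standing in for reference genomes (hypothetical data).
references = {
    "AT_rich_genome": "ATATTAAATTTATATAATTAAATATTTAATATAT" * 3,
    "GC_rich_genome": "GCGGCCGCGCCGGGCGCGCCCGGCGGCGCGCCGG" * 3,
}

print(classify("TTATATAAATTTAATA", references))  # AT_rich_genome
print(classify("CGCGGGCCGCGCGGCC", references))  # GC_rich_genome
```

Real composition-based tools use longer k-mers and proper statistical models, and real reads are nowhere near this cleanly separated, but the basic idea of comparing a read's composition against known profiles is the same.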
Figure 1. Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes. From here.
It is important to note that some supervised learning methods will only classify sequences that contain “marker genes”. Marker genes are ideally present in all organisms, and have a relatively high mutation rate that produces significant variation between species. The use of marker genes to classify organisms is commonly known as DNA barcoding. The 16S rRNA gene has been used to greatest effect for this purpose in the microbial world (Greengenes, RDP). For animals, the mitochondrial COI gene is popular, and for plants the chloroplast genes rbcL and matK have been used. Other strategies have been proposed, such as the use of protein-coding genes that are universal, occur only once per genome (as opposed to 16S rRNA genes that can vary in copy number), and are rarely horizontally transferred. Marker gene databases and their constitutive multiple alignments and phylogenies are usually carefully curated, so taxonomic and functional assignments based on marker genes are likely to show gains in both accuracy and speed over methods that analyze input sequences less discriminately. However, if the sequencing was not specially targeted, reads that contain marker genes may only account for a small percentage of a metagenomic sample.

I think I will just leave these highlighted sections uncommented upon and leave it to people to imagine what I don't like about them ... for now.
Anyway - again - the paper is worth checking out. And if you want to know more about methods used for classifying sequences, see this Mendeley collection, which focuses on metagenomic analysis but has many additional papers on top of the ones discussed in this paper.