More than 85 million protein records with sequence information are available in public databases and online tools. Despite years of annotation efforts and a growing set of tools available to researchers, a large proportion of these proteins are still classified as “hypothetical proteins” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265122/); that is, no published annotation explains their function. This lack of knowledge, which affects almost 40% of sequenced proteins, presents a challenge to both bench researchers and computational scientists. The Critical Assessment of Functional Annotation (CAFA) challenge \cite{Jiang2016} was conceived in 2010 to expand protein function annotation and to provide researchers with innovative tools for protein function prediction. The second CAFA challenge took place in 2013; 126 methods from 56 international research groups were submitted and evaluated. This analysis evaluates a small number of these tools from the perspective of a researcher, not just a computational scientist. The survey uses proteins provided for the CAFA-2 challenge along with one specific protein of interest, a heat shock protein 70 (HSP70) from rice. As in the CAFA-2 assessment, BLAST from the National Center for Biotechnology Information (NCBI) will serve as the baseline for both accuracy and ease of use.
CAFA2
Overview
CAFA, the Critical Assessment of Functional Annotation, was developed to address the bottleneck of assigning functional annotations to biological macromolecules, specifically proteins. As more tools for automating the discovery of protein function became available, with varying degrees of accuracy and performance, an objective comparison was needed to evaluate them. The CAFA challenge provides this objective overview of the current tools for automated protein function prediction \cite{Radivojac2013}. The organizers release a large set of protein sequences, which participants analyze over several months to predict protein function and submit the associated Gene Ontology or Human Phenotype Ontology annotations. Each predictor's results are then compared against a set of benchmark proteins with experimentally validated annotations to measure performance. The 2013-2014 CAFA challenge evaluated 126 methods from 56 research groups \cite{Jiang2016}.
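To make this comparison concrete, the minimal Python sketch below scores one hypothetical prediction against a benchmark annotation set at a single confidence threshold. The GO terms and scores are illustrative only, and the real CAFA pipeline first propagates predicted and true terms to all of their ancestors in the ontology graph before scoring:

\begin{verbatim}
# Toy protein-centric evaluation for a single protein at one
# decision threshold tau. Hypothetical data; the real CAFA
# pipeline propagates terms up the ontology graph first.

def precision_recall(predicted_scores, true_terms, tau):
    """predicted_scores: {GO term: confidence in [0, 1]}
    true_terms: set of experimentally validated GO terms
    Returns (precision, recall) at threshold tau."""
    called = {t for t, s in predicted_scores.items() if s >= tau}
    if not called:
        return None, 0.0  # precision is undefined with no calls
    tp = len(called & true_terms)
    return tp / len(called), tp / len(true_terms)

# Illustrative prediction for one protein.
predicted = {"GO:0005524": 0.9, "GO:0016887": 0.6, "GO:0005634": 0.2}
benchmark = {"GO:0005524", "GO:0016887"}
print(precision_recall(predicted, benchmark, tau=0.5))  # (1.0, 1.0)
\end{verbatim}

Sweeping the threshold from 0 to 1 and averaging over all benchmark proteins yields the precision-recall curves summarized by the metrics described in the next section.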
Methodology
The CAFA-2 challenge was evaluated using two overarching methodologies: protein-centric evaluation and term-centric evaluation. Protein-centric evaluation ranks ontology terms for a given protein and asks how well the predicted terms match that protein's true annotations; term-centric evaluation ranks proteins for a given ontology term. Four ontologies served as the vocabulary for biologically relevant predictions: the Gene Ontology branches Molecular Function (MFO), Biological Process (BPO), and Cellular Component (CCO), plus the Human Phenotype Ontology (HPO). Submitted methods were compared to two baseline predictors: BLAST and a naive method that assigns each term a score equal to its frequency among existing annotations. Unbiased assessment rested on two primary metrics: a precision-recall-based F-max score and a semantic distance score based on remaining uncertainty and misinformation (weighted false negatives and false positives). CAFA-2 followed a strict timeline: more than 100,000 target sequences were released to developers, submissions were accepted for four months, and the annotation databases were then allowed to grow while the tools were assessed, with results released roughly a year and a half after the sequence release. Among the provided sequences, the subset that gained experimental annotation during the evaluation window served as the benchmark proteins against which the tools were scored. \cite{Jiang2016}
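For reference, the two headline metrics can be written explicitly. With $pr(\tau)$ and $rc(\tau)$ denoting precision and recall at decision threshold $\tau$, and $ru(\tau)$ and $mi(\tau)$ denoting remaining uncertainty and misinformation, the CAFA assessments report

\[
F_{\max} = \max_{\tau}\left\{\frac{2\,pr(\tau)\,rc(\tau)}{pr(\tau)+rc(\tau)}\right\},
\qquad
S_{\min} = \min_{\tau}\sqrt{ru(\tau)^{2} + mi(\tau)^{2}}.
\]

A higher F-max indicates better threshold-tuned precision and recall, while a lower S-min indicates that, at the best threshold, a prediction leaves little true annotation unexplained ($ru$) and asserts few incorrect terms ($mi$).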
Assessment
Four tools that ranked highly in CAFA-2 will be compared within the scope of this evaluation: ARGOT2, EVEX, INGA, and SIFTER. To standardize the analysis, the results of each tool will be compared to BLAST. An average researcher would likely run BLAST and trust its output, so using a BLAST search as the baseline for comparison reflects a realistic workflow. Researchers are familiar with this tool and have a good understanding of how to use it, what the output means, and how to input protein sequences. These factors will be taken into account when examining the CAFA-2 tools.
Predictors
BLAST
BLAST, the Basic Local Alignment Search Tool, is a sequence similarity algorithm used to query large databases of nucleotide or protein sequences. It is a heuristic algorithm that returns results quickly but, unlike exhaustive methods such as the Smith-Waterman algorithm, does not guarantee an optimal alignment. Its speed and practicality make it one of the most widely used sequence search tools, and many sequence databases and organizations (e.g., UniProt, NCBI, EMBL-EBI) provide a BLAST search. The BLAST algorithm was originally developed and published by Altschul et al. in 1990 \cite{Altschul_1990}. BLAST is based on the assumption that good alignments contain short stretches of exact or near-exact matches; it was used in the CAFA-2 challenge as a baseline measure. A search requires a query sequence, a database of target sequences, and a minimal word-score threshold. The algorithm proceeds in five steps:

1. Filter out regions of low complexity or repeated sequence, which can produce misleadingly high matching scores; these regions are masked (marked with an X) and ignored by the algorithm.

2. Build the list of all k-letter words in the query sequence (k = 3 for proteins).

3. Using a scoring matrix such as BLOSUM62, expand each query word into the set of words that match it with a score above the minimal threshold; the database is then scanned for exact occurrences of these words.

4. Extend each database hit from the previous step in both directions, one letter at a time, for as long as the alignment score continues to improve.

5. Keep the extended matches, the high-scoring segment pairs (HSPs), that exceed a significance cutoff; these form the final set of matching sequences.
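The seeding idea in steps 2-3 is easy to demonstrate. The toy Python sketch below is an illustration, not the NCBI implementation: the word length, threshold, and simple identity scoring are stand-ins for the BLOSUM-based scoring described above. It indexes the k-letter words of one subject sequence and reports every query word that scores at or above the threshold; real BLAST would then extend these seeds into high-scoring segment pairs:

\begin{verbatim}
# Toy illustration of BLAST-style word seeding (steps 2-3 above);
# not the real NCBI implementation. K, T, and the identity-based
# scoring are simplified stand-ins for BLAST's BLOSUM62 scoring.

K = 3  # word length; BLAST uses k = 3 for protein sequences
T = 3  # minimal word score for a seed to be kept

def words(seq, k=K):
    """Yield (offset, word) for every k-letter word in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def word_score(a, b):
    """Score two k-letter words; real BLAST scores word pairs
    with a substitution matrix such as BLOSUM62."""
    return sum(1 if x == y else -1 for x, y in zip(a, b))

def find_seeds(query, subject, k=K, t=T):
    """Return (query_offset, subject_offset) pairs whose words
    score >= t; BLAST would extend each seed in both directions
    into a high-scoring segment pair (HSP)."""
    index = {}
    for j, w in words(subject, k):
        index.setdefault(w, []).append(j)
    seeds = []
    for i, qw in words(query, k):
        for sw, offsets in index.items():
            if word_score(qw, sw) >= t:
                seeds.extend((i, j) for j in offsets)
    return seeds

# Two short, made-up peptide fragments for demonstration.
print(find_seeds("MKTAYIAKQR", "GKTAYIARNQ"))
# -> [(1, 1), (2, 2), (3, 3), (4, 4)]
\end{verbatim}

With these example sequences, only the exactly shared words KTA, TAY, AYI, and YIA survive the threshold, which illustrates why BLAST's assumption that good alignments contain short exact matches makes the database scan tractable.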