Roderic Page edited Genomics.md  over 8 years ago

Commit id: 21452b8378a03e37ca20c53f7d08bb98f2645789


In contrast with taxonomic knowledge, which is widely scattered, most genomic information is highly centralised, being stored in the three member databases of the International Nucleotide Sequence Database Collaboration (INSDC): GenBank, EMBL, and DDBJ \cite{Benson_2012}. Taxonomic name "databases" more closely resemble digitised library catalogues, whereas sequence databases contain the actual sequences, which means we can compute over them. For example, a researcher with a new sequence can discover a lot about that sequence with a simple BLAST search \cite{Altschul_1990}, whereas a taxonomist armed only with a name will struggle to get computable data from the name alone. Although most sequence data is centralised, this is not the case for DNA barcodes, most of which reside in the Barcode of Life Data System (BOLD) \cite{RATNASINGHAM_2007,Ratnasingham_2013}. BOLD has released nearly 2.5 million DNA barcodes since 2009, with updates every few months. However, many of these are not currently available in GenBank. To document this, I searched for barcodes in GenBank using two criteria. The first searched for sequences listed in the BioProject database \cite{Barrett_2011} under accession PRJNA37833 [http://www.ncbi.nlm.nih.gov/bioproject/37833](http://www.ncbi.nlm.nih.gov/bioproject/37833). The second searched for sequences with the keyword "barcode". Plotting counts of these sequences over the same intervals as the data releases on the BOLD site highlights the limited data sharing between BOLD and INSDC. [figure here]
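
The two GenBank searches described above could be reproduced along the following lines. This is a minimal sketch, not the procedure actually used for the figure: it assumes the counts are obtained from the NCBI E-utilities `esearch` endpoint against the `nuccore` database, and that the field tags `[BioProject]`, `[Keyword]` and `[PDAT]` (publication date) select the intended records. The helper names (`build_term`, `esearch_url`, `parse_count`, `count_sequences`) and the `CRITERIA` dictionary are hypothetical and introduced only for illustration.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The two search criteria from the text. The field tags are assumptions
# about how each criterion maps onto Entrez query syntax.
CRITERIA = {
    "bioproject": "PRJNA37833[BioProject]",
    "keyword": "barcode[Keyword]",
}


def build_term(criterion, start, end):
    """Restrict a criterion to a publication-date window (YYYY/MM/DD)."""
    return f"({criterion}) AND {start}:{end}[PDAT]"


def esearch_url(term, db="nuccore"):
    """Build an esearch URL that asks only for the hit count (retmax=0)."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": 0}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_count(payload):
    """Extract the hit count from an esearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


def count_sequences(criterion, start, end):
    """Query NCBI (network access required) and return the number of hits."""
    url = esearch_url(build_term(criterion, start, end))
    with urllib.request.urlopen(url) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling `count_sequences(CRITERIA["keyword"], start, end)` once for each interval between BOLD data releases, and again with `CRITERIA["bioproject"]`, would give the two series of counts to plot. The interval boundaries would have to be taken from the BOLD release dates, which are not listed here.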