Dark taxa

As desirable as data sharing is, it is not without complications. In 2011 I coined the phrase "dark taxa" \citep[http://iphylo.blogspot.co.uk/2011/04/dark-taxa-genbank-in-post-taxonomic.html; see also][]{Parr_2012} to refer to species in GenBank that lack formal scientific names. Typically such a taxon has a name comprising a genus name followed by some combination of letters and numbers that makes the name unique within GenBank (e.g., a specimen code or the first letters of the surnames of the researchers who deposited the sequence). For this paper I have updated the analysis to include sequences published up to the time of writing (Fig. 5).

The pattern shown in Fig. 5 likely reflects a combination of processes. If most of the taxa being added to GenBank represent species that have already been described, then the pace of sequencing is outstripping the rate at which taxa can be identified (either by taxonomists or by researchers using their outputs, such as keys). Alternatively, dark taxa may represent unknown species, but we lack taxonomists capable of recognising the taxa as new and formally describing them. If taxonomic capacity is a limiting factor, then we would expect a gradual decline in the percentage of named taxa, which is the background pattern in Fig. 5. The growth of dark taxa might also reflect changing practices among molecular workers, for example in DNA barcoding, where large numbers of specimens are sequenced and deposited in GenBank labelled with specimen codes rather than taxonomic names. Indeed, the dramatic increase in the number of dark taxa in 2010 is mostly due to the addition of sequences from the BOLD project (recognised by the prefix "BOLD"). Even if we treat the import of unidentified BOLD sequences as a one-off event, at present fewer than half of the newly sequenced invertebrate taxa being added to GenBank have been identified to species level.