DNA barcoding and taxonomy: dark taxa and dark texts


Both classical taxonomy and DNA barcoding are engaged in the task of digitising the living world. Much of the taxonomic literature remains undigitised. The rise of open access publishing this century and the freeing of older literature from the shackles of copyright have greatly increased the online availability of taxonomic descriptions, but much of the literature of the mid- to late twentieth century remains offline ("dark texts"). DNA barcoding is generating a wealth of computable data that in many ways is much easier to work with than classical taxonomic descriptions, but many of the sequences are not identified to species level. These "dark taxa" hamper the classical method of integrating biodiversity data using shared taxonomic names. Voucher specimens are a potential common currency of both the taxonomic literature and sequence databases, and could be used to help link names, literature, and sequences. An obstacle to this approach is the lack of stable, resolvable specimen identifiers. The paper concludes with an appeal for a global "digital dashboard" to assess the extent to which biodiversity data is available online.

Keywords: DNA barcoding, taxonomy, dark taxa, dark texts, digitisation


As with many fields, digitisation is having a huge impact on the study of biodiversity. Museums and herbaria are engaged in turning physical, analogue specimens into digital objects, whether these are strings of A's, G's, C's and T's from DNA sequencing machines, or pixels obtained from a digital camera. Libraries and commercial publishers are converting physical books and articles into images, which are then converted into strings of letters using optical character recognition (OCR). Despite the sometimes contentious relationship between morphological and molecular taxonomy, there are striking parallels between the formation of DNA sequence databases in the twentieth century and the rise of natural history museums in the preceding centuries (Strasser 2008, Strasser 2011).

Viewed in this way, both classical taxonomy and genomics are in the business of digitising life. Some of the challenges they face are similar; for example, algorithms developed for pairwise sequence alignment have applications in extracting articles from OCR text (Page 2011). But in other respects the two fields are very different. Sequence data is approximately doubling every 18 months (Lathe 2008), whereas the number of new taxa described each year has remained essentially constant since the 1980s (see below). A challenge for sequence databases is how to handle exponential growth of data; for taxonomy the challenge is often how to make a dent in the vast number of objects that don't have a digital representation (Ariño 2010). This paper explores some of these issues, focusing on taxonomy and DNA barcoding.
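To give a sense of what an 18-month doubling time implies, the arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, the starting size is normalised to 1, and nothing here uses real database figures.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months, given its doubling time in months."""
    return 2 ** (12 / doubling_months)


def relative_size(years: float, doubling_months: float = 18.0) -> float:
    """Size after `years`, relative to a starting size of 1.0 (assumed, for illustration)."""
    return 2 ** (years * 12 / doubling_months)


if __name__ == "__main__":
    # An 18-month doubling time corresponds to roughly 59% growth per year.
    print(f"annual growth factor: {annual_growth_factor(18):.3f}")
    # Two doublings fit into three years, so the quantity quadruples.
    print(f"relative size after 3 years: {relative_size(3):.1f}")
```

By contrast, a constant annual output accumulates only linearly: under that assumption, thirty years of descriptions at a fixed rate add thirty times one year's output, whatever that rate is.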


Among the many challenges faced by taxonomy is the difficulty of determining the size of the task it faces. Estimates of the number of species on Earth are uncertain and inconsistent, and show no signs of converging (Caley 2014). Some estimates based on models of taxonomic effort suggest that two-thirds of all species have already been described (Costello 2011). Analyses that use the number of authors per species description as a proxy for effort (Joppa 2011) ignore the global trend for an increasing number of authors per paper (Aboukhalil 2014), and assume that the effort required per species description has remained constant over time. An alternative interpretation is that the quality of taxonomic description is increasing over time (Sangster 2014), reflecting both increased thoroughness and the availability of new technologies (Stoev 2013, Akkari 2015).

Numbers of new animal names published each year, for animals as a whole and for various taxonomic groups. Data from ION and BioNames.