DNA barcoding and taxonomy: dark taxa and dark texts
Both classical taxonomy and DNA barcoding are engaged in the task of digitising the living world. Much of the taxonomic literature remains undigitised. The rise of open access publishing this century and the freeing of older literature from the shackles of copyright have greatly increased the online availability of taxonomic descriptions, but much of the literature of the mid- to late twentieth century remains offline ("dark texts"). DNA barcoding is generating a wealth of computable data that in many ways is much easier to work with than classical taxonomic descriptions, but many of the sequences are not identified to species level. These "dark taxa" hamper the classical method of integrating biodiversity data using shared taxonomic names. Voucher specimens are a potential common currency of both the taxonomic literature and sequence databases, and could be used to help link names, literature, and sequences. An obstacle to this approach is the lack of stable, resolvable specimen identifiers. The paper concludes with an appeal for a global "digital dashboard" to assess the extent to which biodiversity data is available online.
Keywords: DNA barcoding, taxonomy, dark taxa, dark texts, digitisation
As with many fields, digitisation is having a huge impact on the study of biodiversity. Museums and herbaria are engaged in turning physical, analogue specimens into digital objects, whether these are strings of A's, G's, C's and T's from DNA sequencing machines, or pixels obtained from a digital camera. Libraries and commercial publishers are converting physical books and articles into images, which are then converted into strings of letters using optical character recognition (OCR). Despite the sometimes contentious relationship between morphological and molecular taxonomy, there are striking parallels between the formation of DNA sequence databases in the twentieth century and the rise of natural history museums in the preceding centuries (Strasser 2008, Strasser 2011).
Viewed in this way, both classical taxonomy and genomics are in the business of digitising life. Some of the challenges faced are similar, for example algorithms developed for pairwise sequence alignment have applications in extracting articles from OCR text (Page 2011). But in other respects the two fields are very different. Sequence data is approximately doubling every 18 months (Lathe 2008), whereas the number of new taxa described each year has remained essentially constant since the 1980s (see below). A challenge for sequence databases is how to handle exponential growth of data; for taxonomy the challenge is often how to make a dent in the vast number of objects that don't have a digital representation (Ariño 2010). This paper explores some of these issues, focusing on taxonomy and DNA barcoding.
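The contrast between the two growth regimes can be made concrete with a little arithmetic. The sketch below assumes idealised exponential growth with the 18-month doubling time quoted above, and a flat yearly output for taxonomy. The function names and the yearly figure passed in are illustrative assumptions, not values taken from the analyses in this paper.

```python
def growth_factor(years: float, doubling_time_years: float = 1.5) -> float:
    """Factor by which a quantity grows over `years` if it doubles
    every `doubling_time_years` (idealised exponential growth)."""
    return 2.0 ** (years / doubling_time_years)


def cumulative_constant(years: int, per_year: int) -> int:
    """Total accumulated over `years` at a constant yearly rate."""
    return years * per_year


if __name__ == "__main__":
    for years in (3, 6, 15):
        print(f"after {years:>2} years: x{growth_factor(years):,.0f} sequence data")
    # Hypothetical constant rate, for illustration only.
    print(cumulative_constant(15, per_year=1000), "names at a flat 1000 per year")
```

Under these assumptions a database that doubles every 18 months grows roughly a thousandfold in 15 years, whereas a constant yearly output grows only linearly.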
Among the many challenges faced by taxonomy is the difficulty of determining the size of the task it faces. Estimates of the number of species on Earth are uncertain and inconsistent, and show no signs of converging (Caley 2014). Some estimates based on models of taxonomic effort suggest that two-thirds of all species have already been described (Costello 2011). Analyses that use the number of authors per species description as a proxy for effort (Joppa 2011) ignore the global trend for an increasing number of authors per paper (Aboukhalil 2014), and assume that the effort required per species description has remained constant over time. An alternative interpretation is that the quality of taxonomic description is increasing over time (Sangster 2014), reflecting both increased thoroughness and the availability of new technologies (Stoev 2013, Akkari 2015).
Rather than try to estimate an unknown (the number of species remaining to be described), here I focus on the current state of taxonomic knowledge. Given that we lack a comprehensive, global index of all species descriptions, discovering what we know about what we know is not entirely straightforward. For zoology the nearest we have is the Index of Organism Names (ION, http://www.organismnames.com), which is based on Zoological Record. Figure 1 shows the numbers of new taxonomic names covered by the International Code of Zoological Nomenclature (animals plus some protozoan groups) that have been described each year based on data from ION, cleaned and augmented in BioNames (Page 2013). These data show an increase in overall numbers over time, with dips around the times of the two World Wars, followed by an essentially constant number each year since the mid-twentieth century. The pattern varies across taxa: some show increasing numbers per year, but others are essentially static or in decline, even in groups thought to be hyperdiverse such as nematodes (Blaxter 2003).
The rate of progress in biodiversity research is controlled by two factors: the speed with which we can discover and describe biodiversity, and the speed with which we can communicate that information (Pentcheff 2010). Unlike most biological disciplines, the entire corpus of taxonomic literature since the mid 18th century remains a vital resource for present-day research. In this way taxonomy is similar to the digital humanities, which has not just "big data" but "long data" (Aiden 2013). This is not only because the rules of nomenclature dictate (with some exceptions) that the name to use for a species is the oldest one published; it also reflects the uneven effort devoted to the study of different taxonomic groups (May 1988). For poorly known groups the bulk of our knowledge of their biology may reside in the primary taxonomic literature.
Digitisation is one step towards making taxonomic information available. Many commercial publishers have, on the face of it, done the taxonomic community a great service by digitising whole back catalogues of relatively obscure journals. However, digitisation is not the same as access, and many commercial publishers keep this scanned literature behind paywalls. In some fields legal issues around access have been side-stepped by constructing a "shadow" dataset that summarises key features of the data while still restricting access to the data itself. For example, by extracting phrases comprising a set of n words (n-grams) from Google Books it is possible to create a data set that contains valuable information without exposing the full text (Michel 2010). However for taxonomic work, there does not seem to be an obvious way to extract a shadow. Agosti and colleagues (Agosti 2009, Patterson 2014) have explored ways to extract core facts from the literature and re-purpose these without violating copyright, though how much of their conclusions can be generalised across different national and international legal systems remains untested.
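To illustrate what an n-gram "shadow" of a text looks like, here is a minimal sketch that counts word n-grams across a set of documents. It is not the pipeline used for Google Books; the tokenisation rule (lower-cased runs of word characters) and the function names are simplifying assumptions made for this example.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> list:
    """Return the word n-grams of `text` as tuples of lower-cased tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(texts, n: int) -> Counter:
    """Aggregate n-gram counts over many texts.

    Only the counts are kept, so the original word order of each full
    text cannot be read back from the result.
    """
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts


if __name__ == "__main__":
    docs = ["A new species is described", "The new species differs"]
    print(ngram_counts(docs, 2).most_common(3))
```

A table of such counts can be shared and analysed without exposing the running text, which is why it is hard to see an equivalent for taxonomic descriptions, where the full treatment is usually what is needed.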
Apart from commercial digitisation of the scientific literature, two other developments are accelerating access to taxonomic information. The first is the rise of open access publishing, notably journals such as ZooKeys that support sophisticated markup of the text (Penev 2010). This is increasing the number of recently described species that are published in a machine-readable form that can then be subject to further processing (Miller 2015). At the same time, the Biodiversity Heritage Library (BHL) (Gwinn 2009) has embarked on large-scale digitisation of legacy taxonomic literature. Although initially focussing on out of copyright literature (i.e., pre-1923 in the United States), BHL is increasingly getting permission from copyright holders to scan more recent literature as well. Coupled with tools such as BioStor (Page 2011) to locate and extract articles within the scanned volumes, BHL is fast becoming the largest available open access archive of biodiversity literature.
To quantify the extent to which the taxonomic literature has been digitised, for each decade I counted the number of publications of new names in animals both with and without a digital identifier (such as a DOI, a PDF, a Handle, or a URL to BioStor). The recent taxonomic literature is mostly digital: for the years 2010-15, 60% of publications have a digital identifier, the bulk of these having a DOI. However, prior to the 21st century more publications lack identifiers than have them, with the 1970s being the least digitised decade (Fig. 2).
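The per-decade count described above can be sketched as follows. This is an illustration rather than the script used to produce Fig. 2: the record layout (a dict with a `year` key and optional `doi`, `pdf`, `handle` and `biostor` keys) is a hypothetical schema chosen for this example.

```python
from collections import defaultdict

# Hypothetical field names standing in for the identifier types in the text.
ID_FIELDS = ("doi", "pdf", "handle", "biostor")


def decade(year: int) -> int:
    """Map a year to the first year of its decade, e.g. 1978 -> 1970."""
    return (year // 10) * 10


def digitisation_by_decade(publications) -> dict:
    """Count publications with and without any digital identifier, per decade."""
    totals = defaultdict(lambda: {"with_id": 0, "without_id": 0})
    for pub in publications:
        has_id = any(pub.get(field) for field in ID_FIELDS)
        totals[decade(pub["year"])]["with_id" if has_id else "without_id"] += 1
    return dict(totals)


def fraction_digitised(counts: dict) -> float:
    """Share of a decade's publications that carry an identifier."""
    total = counts["with_id"] + counts["without_id"]
    return counts["with_id"] / total if total else 0.0


if __name__ == "__main__":
    # Made-up records for illustration only.
    pubs = [
        {"year": 1974},
        {"year": 2012, "doi": "10.0000/example"},
        {"year": 2011, "biostor": "0000"},
        {"year": 2015},
    ]
    for dec, counts in sorted(digitisation_by_decade(pubs).items()):
        print(dec, counts, round(fraction_digitised(counts), 2))
```

Empty or missing identifier fields are treated as "no identifier", so a record with `"doi": None` counts as undigitised.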
## The long tail of taxonomic literature
Another challenge presented by the taxonomic literature is that it is highly decentralised, being spread across numerous journals (Fig. 3). What is striking is the dominance of animal taxonomy by the "megajournal" Zootaxa, and yet this journal has published only 15% of the new names that have been minted since 2000. The taxonomic literature has a very "long tail" of small, often obscure journals that contain a few taxonomic publications. Long tails require significant effort to index (Edwards 1993). Although the Zoological Record claims 90% coverage of the taxonomic literature (Thorne 2003), in some taxa there may be significantly greater gaps (Bouchet 1992). Conversely, if we set our sights lower, then long tail distributions mean that we can get a substantial fraction of the names from a small number of journals (the "low hanging fruit"). Indeed, the first 20% of the journals in Fig. 3 contain 80% of the names in BioNames that are linked to a publication. Unfortunately, many of these journals are not currently available digitally (Fig. 2).
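The "low hanging fruit" calculation amounts to ranking journals by the number of names they contain and asking what share of all names falls in the top-ranked fraction. A minimal sketch, with invented journal labels and counts (the real figures are those behind Fig. 3):

```python
def share_in_top_journals(names_per_journal: dict, top_fraction: float) -> float:
    """Fraction of all names published in the top `top_fraction` of journals,
    with journals ranked by the number of names they contain."""
    counts = sorted(names_per_journal.values(), reverse=True)
    # Always include at least one journal, even for tiny fractions.
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)


if __name__ == "__main__":
    # Made-up long-tailed distribution for illustration only.
    toy = {"A": 50, "B": 30, "C": 5, "D": 4, "E": 3,
           "F": 2, "G": 2, "H": 2, "I": 1, "J": 1}
    print(share_in_top_journals(toy, 0.2))
```

With a distribution this skewed, digitising the top-ranked journals first yields most of the names for a small part of the effort, while the remaining tail accounts for most of the indexing work.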
The picture that emerges from our knowledge of the taxonomic literature is that the recent literature is mostly digital, identified with DOIs, and some of it is open access. But much of our fundamental knowledge of the world's biodiversity, particularly that published in the mid to late 20th century, remains digitally inaccessible (Fig. 2). Between the 21st century trend towards digitisation and open access, and the removal of restrictions pre-1923 as copyright expires, lies a great body of 20th century work that will require considerable effort to make available.
In contrast with taxonomic knowledge, which is widely scattered, most genomic information is highly centralised, being stored in the three components of the International Nucleotide Sequence Database Collaboration (INSDC), namely GenBank, EMBL, and the DDBJ (Benson 2012). Taxonomic name "databases" more closely resemble digitised library catalogues, whereas sequence databases contain the actual sequences, which means we can compute over them. For example, a researcher with a new sequence can discover a lot about that sequence by a simple BLAST search (Altschul 1990), whereas a taxonomist armed only with a name will struggle to get computable data from the name alone.
Although the bulk of the world's sequence data is available in the INSDC, this is not the case for DNA barcodes, most of which reside in the Barcode of Life Data System (BOLD) (Ratnasingham 2007, Ratnasingham 2013). Since 2009 BOLD has released nearly 2.5 million DNA barcodes, with updates every few months. However, many of these sequences are not currently available in GenBank. To document this I searched for barcodes in GenBank using two criteria. The first searched for sequences that were listed in the BioProject database (Barrett 2011) under accession PRJNA37833. The second searched for sequences with the keyword "barcode". For both searches the sequences were grouped by their date of publication (the "pdat" query parameter) into the intervals between successive BOLD data releases. Plotting counts of these sequences over the same intervals as the BOLD data releases highlights the limited data sharing between BOLD and INSDC (Fig. 4).
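A search of this kind can be expressed as a series of NCBI E-utilities `esearch` requests, one per release interval. The sketch below only builds the request URLs and makes no network calls. The exact queries used for Fig. 4 are not given here, so the search field tags (`[BioProject]`, `[Keyword]`), the choice of the `nuccore` database, and the release dates are assumptions for illustration.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def release_intervals(release_dates: list) -> list:
    """Pair consecutive release dates into (start, end) intervals."""
    return list(zip(release_dates, release_dates[1:]))


def esearch_count_url(term: str, mindate: str, maxdate: str,
                      db: str = "nuccore") -> str:
    """Build an esearch URL asking only for the number of matching records
    whose publication date (datetype=pdat) lies in [mindate, maxdate]."""
    params = {
        "db": db,
        "term": term,          # field tags below are assumed, not from the paper
        "datetype": "pdat",
        "mindate": mindate,    # dates use YYYY/MM/DD
        "maxdate": maxdate,
        "rettype": "count",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical release dates; the real ones come from the BOLD releases.
    dates = ["2010/01/01", "2010/07/01", "2011/01/01"]
    for start, end in release_intervals(dates):
        print(esearch_count_url("PRJNA37833[BioProject]", start, end))
        print(esearch_count_url("barcode[Keyword]", start, end))
```

Fetching each URL and reading the returned count would give one data point per interval and per search criterion, which can then be plotted against the number of barcodes in the corresponding BOLD release.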
As desirable as data sharing is, it is not without complications. In 2011 I coined the phrase "dark taxa" (http://iphylo.blogspot.co.uk/2011/04/dark-taxa-genbank-in-post-taxonomic.html, see also Parr 2012) to refer to species in GenBank that lacked formal scientific names. Typically they will have a name that comprises a genus name and some combination of letters and numbers to make the name unique within GenBank (e.g., a specimen code or the first letter of the last names of the researchers that deposited the sequence). For this paper I have updated the analysis to include sequences published up to the time of writing (Fig. 5).
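One simple way to flag candidate dark taxa is to test whether a name looks like a formal binomial or trinomial. The heuristic below is an assumption made for illustration, not the rule used to produce Fig. 5: it treats a name as "dark" if it does not consist of a capitalised genus followed by one or two lower-case epithets, or if it contains an open-nomenclature marker such as "sp" or "cf". The example names use the placeholder genus "Aus" and are invented.

```python
import re

# Genus plus one or two lower-case epithets (hyphenated epithets allowed).
FORMAL_NAME = re.compile(
    r"^[A-Z][a-z]+ [a-z]+(?:-[a-z]+)?(?: [a-z]+(?:-[a-z]+)?)?$"
)

# Open-nomenclature markers that signal an unidentified or uncertain taxon.
OPEN_NOMENCLATURE = {"sp", "cf", "aff", "nr"}


def is_dark_taxon(name: str) -> bool:
    """Heuristically decide whether `name` lacks a formal species name."""
    name = name.strip()
    if not FORMAL_NAME.match(name):
        return True
    return any(token in OPEN_NOMENCLATURE for token in name.split()[1:])


if __name__ == "__main__":
    for example in ("Aus bus", "Aus sp. ABC-123", "Aus sp. BOLD:AAA0001"):
        print(example, "->", "dark" if is_dark_taxon(example) else "named")
```

A pattern like this will misclassify some names (for example, subgenera in parentheses or hybrid formulae), so any counts derived from it should be read as approximate.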
The pattern shown in Fig. 5 likely reflects a combination of processes. If most of the taxa being added to GenBank represent species that have already been described, then the rate at which taxa can be identified (either by taxonomists or by researchers using their outputs, such as keys) is being outstripped by the pace of sequencing. Alternatively, dark taxa may represent unknown species, but we lack taxonomists capable of recognising the taxa as new (and formally describing them). If taxonomic capacity is a limiting factor then we would expect a gradual decline in the percentage of named taxa, which is the background pattern in Fig. 5. The growth of dark taxa might also reflect changing practices of molecular workers, for example in DNA barcoding where large numbers of specimens are sequenced and deposited into GenBank labelled with specimen codes rather than taxonomic names. Indeed, the dramatic increase in the numbers of dark taxa in 2010 is mostly due to sequences from the BOLD project (recognised by the prefix "BOLD") being added. Even if we allow for the import of unidentified BOLD sequences as a one-off event, at present less than half the newly sequenced invertebrate taxa being added to GenBank have been identified to species level.
Typically integration across biodiversity databases is achieved using taxonomic names (Patterson 2010), but the rise of dark taxa makes this problematic for an increasing fraction of sequence-based data. Even if we have names, these need not always mean the same thing (Kennedy 2003). As an example, Fig. 6a shows the distribution of the lizard Morethia obscura from the Global Biodiversity Information Facility (GBIF). For comparison, Fig. 6b shows a geophylogeny (Page 2015) for DNA barcodes from BOLD for Morethia obscura, which reveals considerable phylogenetic structure within "Morethia obscura". Specimens of this species are assigned several distinct Barcode Index Numbers (BINs) (Ratnasingham 2013), implying that "Morethia obscura" comprises more than one species.
Although GBIF and BOLD present rather different views of the "same" species, there is considerable overlap in the specimens used to construct Figs 6a and 6b. For example, DNA barcode WAMMS012-10 was obtained from specimen WAMR127637, which also occurs in GBIF (as occurrence 691832260). Because the taxonomic concepts in GBIF and BOLD are explicitly defined with respect to sets of specimens we can directly compare them, rather than rely on the possibly erroneous assumption that a given taxonomic name means the same thing in the two databases. Furthermore, as increasing numbers of type specimens are sequenced (Federhen 2014) we can more firmly associate names with sets of specimens, leading to a computable nomenclature where the name we assign to a set of specimens can be determined automatically (Pullan 2000).
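If each taxonomic concept is treated as a set of specimen identifiers, comparing concepts reduces to set operations. The sketch below shows the idea under that assumption. Apart from WAMR127637, which is mentioned above, the specimen codes and BIN labels are invented placeholders, and the function names are hypothetical.

```python
def compare_concepts(a: set, b: set) -> dict:
    """Compare two taxon concepts, each defined as a set of specimen ids."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


def split_by_bin(specimens: set, specimen_to_bin: dict) -> dict:
    """Group the specimens of one named concept by their assigned BIN.

    More than one group suggests that the name covers several clusters.
    """
    groups = {}
    for specimen in specimens:
        bin_id = specimen_to_bin.get(specimen, "unassigned")
        groups.setdefault(bin_id, set()).add(specimen)
    return groups


if __name__ == "__main__":
    # Placeholder data: only WAMR127637 comes from the text.
    gbif_concept = {"WAMR127637", "SPEC-0002", "SPEC-0003"}
    bold_concept = {"WAMR127637", "SPEC-0003", "SPEC-0004"}
    bins = {"WAMR127637": "BIN-A", "SPEC-0003": "BIN-B", "SPEC-0004": "BIN-B"}

    print(compare_concepts(gbif_concept, bold_concept))
    print(split_by_bin(bold_concept, bins))
```

This only works if the same specimen carries the same identifier in both databases, which is the practical obstacle to integrating databases through specimens.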
Integrating databases using specimens is attractive, but not without its own set of problems. The biodiversity informatics community has yet to standardise identifiers for specimens, despite numerous efforts (Guralnick 2015). Consequently there may be little apparent overlap between specimen identifiers in different databases (Guralnick 2014). As an example, despite the limited sharing of data between BOLD and GBIF, there are already barcoded specimens in GBIF. To illustrate, consider the DNA barcode GWORH520-09 from sample "BC ZSM Lep 10234". GBIF doesn't have this record from BOLD, but it does have the specimen BC ZSM Lep 10234 (provided by the host institution, Zoologische Staatssammlung München 2015). The DNA barcode from this specimen is also in GenBank, and because that record is georeferenced it has been ingested by GBIF as part of the Geographically tagged INSDC sequences dataset (Geographically tagged INSDC sequences 2014). Hence, GBIF has duplicate records for a barcoded moth, neither provided directly by BOLD (Fig. 7). Merging and de-duplicating specimen-based records is going to be a significant challenge for global aggregators such as GBIF.
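A first pass at finding such duplicates is to normalise specimen codes before comparing them. The sketch below is one hypothetical approach, not a description of how GBIF handles duplicates: it upper-cases each code and strips everything except letters and digits, then groups records that collapse to the same key. The record layout, the dataset labels and the second spelling of the specimen code are assumptions for illustration.

```python
import re
from collections import defaultdict


def normalise_code(code: str) -> str:
    """Reduce a specimen code to upper-case letters and digits only."""
    return re.sub(r"[^A-Z0-9]", "", code.upper())


def find_duplicates(records) -> dict:
    """Group records by normalised specimen code and keep groups that
    contain more than one record (candidate duplicates)."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise_code(record["specimen"])].append(record["dataset"])
    return {key: datasets for key, datasets in groups.items() if len(datasets) > 1}


if __name__ == "__main__":
    records = [
        {"dataset": "institution", "specimen": "BC ZSM Lep 10234"},
        # Hypothetical variant spelling of the same code from another source.
        {"dataset": "insdc", "specimen": "BC-ZSM-Lep-10234"},
        {"dataset": "insdc", "specimen": "XYZ 1"},
    ]
    print(find_duplicates(records))
```

Aggressive normalisation can also merge codes that are genuinely different, so matches found this way are candidates to be checked against other fields (locality, date, collector) rather than confirmed duplicates.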
Both taxonomy and barcoding are actively digitising the living world. The description of new animal taxa is essentially proceeding at a constant rate, generating a steadily growing legacy of taxonomic literature into which digitisation has made modest inroads. In contrast, sequence databases as a whole are growing exponentially, although barcode growth is more modest. Nucleotide sequences are "born digital" and readily computable, for example they can be clustered into BINs of similar sequences, or phylogenies of the type shown in Fig. 6. Given the obvious overlap between the goals of classical taxonomy and barcodes, the lack of digital overlap between these two endeavours is disconcerting. Many barcodes lack taxonomic names ("dark taxa"), and much of the primary taxonomic literature has not been digitised ("dark texts"). Integrating barcodes and taxonomy at scale is going to be a significant challenge, as indeed will be integrating barcodes into mainstream sequence databases. Mapping between databases using taxonomic names seems the obvious approach, but the abundance of dark taxa shows this has not been entirely successful. Alternatives such as integration via specimens show promise, but are hampered by the lack of stable identifiers. If we are to make progress, the stubborn problem of the lack of unique, persistent identifiers, and of cross links between those identifiers, needs to be tackled in earnest (Page 2008).
As a postscript, in writing this opinion piece, I have had to write custom scripts to query various databases in an ad hoc manner, trying to extract and assemble information that gives insight into the current state of biodiversity digitisation. For these analyses and visualisations to have broader utility it would be desirable to have some way of consistently and automatically doing these analyses, in effect creating a dashboard of digitisation that would enable us to not only see where we are as a field, but also suggest directions in which we could be heading.
I thank Paul Hebert for the invitation to speak at the 6th International Barcode of Life Conference. Some of the ideas discussed here were first developed in posts on my blog iPhylo and benefited from feedback from people who left comments on those posts.
I have no competing interests.
Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data. Zoologische Staatssammlung München/Staatliche Naturwissenschaftliche Sammlungen Bayerns, 2015.
Geographically tagged INSDC sequences. European Molecular Biology Laboratory (EMBL), 2014.
Robert Aboukhalil. The rising trend in authorship. The Winnower. The Winnower LLC, 2014.
Donat Agosti, Willi Egloff. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2, 53. Springer Science + Business Media, 2009.
Erez Aiden, Jean-Baptiste Michel. Uncharted: Big Data as a Lens on Human Culture. Riverhead Books, 2013.
Nesrine Akkari, Henrik Enghoff, Brian D. Metscher. A New Dimension in Documenting New Species: High-Detail Imaging for Myriapod Taxonomy and First 3D Cybertype of a New Millipede Species (Diplopoda, Julida, Julidae). PLOS ONE 10, e0135243. Public Library of Science (PLoS), 2015.
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410. Elsevier BV, 1990.
Arturo H. Ariño. Approaches to estimating the universe of natural history collections data. Biodiv. Inf. 7. The University of Kansas, 2010.
T. Barrett, K. Clark, R. Gevorgyan, V. Gorelenkov, E. Gribov, I. Karsch-Mizrachi, M. Kimelman, K. D. Pruitt, S. Resenchuk, T. Tatusova, E. Yaschenko, J. Ostell. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research 40, D57–D63. Oxford University Press (OUP), 2011.
D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, E. W. Sayers. GenBank. Nucleic Acids Research 41, D36–D42. Oxford University Press (OUP), 2012.
Mark Blaxter. Molecular systematics: Counting angels with DNA. Nature 421, 122–124. Nature Publishing Group, 2003.
Philippe Bouchet, Jean-Pierre Rocroi. Supraspecific names of molluscs: a quantitative review. Malacologia 34, 75–86, 1992.
M. Julian Caley, Rebecca Fisher, Kerrie Mengersen. Global species richness estimates have not converged. Trends in Ecology & Evolution 29, 187–188. Elsevier BV, 2014.
M. J. Costello, S. Wilson, B. Houlding. Predicting Total Global Species Richness Using Rates of Species Description and Estimates of Taxonomic Effort. Systematic Biology 61, 871–883. Oxford University Press (OUP), 2011.
M. A. Edwards, M. J. Thorne. Reply to 'Supraspecific names of molluscs: a quantitative review'. Malacologia 35, 153–154, 1993.
S. Federhen. Type material in the NCBI Taxonomy Database. Nucleic Acids Research 43, D1086–D1098. Oxford University Press (OUP), 2014.
Robert P. Guralnick, Nico Cellinese, John Deck, Richard L. Pyle, John Kunze, Lyubomir Penev, Ramona Walls, Gregor Hagedorn, Donat Agosti, John Wieczorek, Terry Catapano, Roderic Page. Community Next Steps for Making Globally Unique Identifiers Work for Biocollections Data. ZooKeys 494, 133–154. Pensoft Publishers, 2015.
Robert Guralnick, Tom Conlin, John Deck, Brian J. Stucky, Nico Cellinese. The Trouble with Triplets in Biodiversity Informatics: A Data-Driven Case against Current Identifier Practices. PLoS ONE 9, e114069. Public Library of Science (PLoS), 2014.
N. E. Gwinn, C. Rinaldo. The Biodiversity Heritage Library: sharing biodiversity literature with the world. IFLA Journal 35, 25–34. SAGE Publications, 2009.
Lucas N. Joppa, David L. Roberts, Stuart L. Pimm. The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution 26, 551–553. Elsevier BV, 2011.
Jessie Kennedy. Supporting Taxonomic Names in Cell and Molecular Biology Databases. OMICS: A Journal of Integrative Biology 7, 13–16. Mary Ann Liebert Inc, 2003.
W. Lathe, J. Williams, M. Mangan, D. Karolchik. Genomic data resources: challenges and promises. Nature Education 1, 2008.
R. M. May. How Many Species Are There on Earth?. Science 241, 1441–1449. American Association for the Advancement of Science (AAAS), 1988.
J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, E. L. Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331, 176–182. American Association for the Advancement of Science (AAAS), 2010.
Jeremy Miller, Donat Agosti, Lyubomir Penev, Guido Sautter, Teodor Georgiev, Terry Catapano, David Patterson, David King, Serrano Pereira, Rutger Vos, Soraya Sierra. Integrating and visualizing primary data from prospective and legacy taxonomic literature. BDJ 3, e5063. Pensoft Publishers, 2015.
Roderic D. M. Page. Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics 12, 187. Springer Science + Business Media, 2011.
Roderic D. M. Page. BioNames: linking taxonomy, texts, and trees. PeerJ 1, e190. PeerJ, 2013.
Roderic Page. Visualising Geophylogenies in Web Maps Using GeoJSON. PLoS Curr. Public Library of Science (PLoS), 2015.
R. D. M. Page. Biodiversity informatics: the challenge of linking data and the role of shared identifiers. Briefings in Bioinformatics 9, 345–354. Oxford University Press (OUP), 2008.
Cynthia S. Parr, Robert Guralnick, Nico Cellinese, Roderic D. M. Page. Evolutionary informatics: unifying knowledge about the diversity of life. Trends in Ecology & Evolution 27, 94–103. Elsevier BV, 2012.
David J. Patterson, Willi Egloff, Donat Agosti, David Eades, Nico Franz, Gregor Hagedorn, Jonathan A. Rees, David P. Remsen. Scientific names of organisms: attribution, rights, and licensing. BMC Research Notes 7, 79. Springer Science + Business Media, 2014.
D. J. Patterson, J. Cooper, P. M. Kirk, R. L. Pyle, D. P. Remsen. Names are key to the big new biology. Trends in Ecology & Evolution 25, 686–691. Elsevier BV, 2010.
Lyubomir Penev, Donat Agosti, Teodor Georgiev, Terry Catapano, Jeremy Miller, Vladimir Blagoderov, David Roberts, Vincent Smith, Irina Brake, Simon Rycroft, Ben Scott, Norman Johnson, Robert Morris, Guido Sautter, Vishwas Chavan, Tim Robertson, David Remsen, Pavel Stoev, Cynthia Parr, Sandra Knapp, W. John Kress, Frederic Thompson, Terry Erwin. Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples. ZooKeys 50, 1–16. Pensoft Publishers, 2010.
N. Dean Pentcheff. Copyrights and digitizing the systematic literature: the horror... the horror.... Nature Precedings. Nature Publishing Group, 2010.
Martin R. Pullan, Mark F. Watson, Jessie B. Kennedy, Cédric Raguenaud, Roger Hyam. The Prometheus Taxonomic Model: A Practical Approach to Representing Multiple Classifications. Taxon 49, 55. JSTOR, 2000.
Sujeevan Ratnasingham, Paul D. N. Hebert. BARCODING: bold: The Barcode of Life Data System (http://www.barcodinglife.org). Molecular Ecology Notes 7, 355–364. Wiley-Blackwell, 2007.
Sujeevan Ratnasingham, Paul D. N. Hebert. A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLoS ONE 8, e66213. Public Library of Science (PLoS), 2013.
G. Sangster, J. A. Luksenburg. Declining Rates of Species Described per Taxonomist: Slowdown of Progress or a Side-effect of Improved Quality in Taxonomy?. Systematic Biology 64, 144–151. Oxford University Press (OUP), 2014.
Pavel Stoev, Ana Komerički, Nesrine Akkari, Shanlin Liu, Xin Zhou, Alexander Weigand, Jeroen Hostens, Christopher Hunter, Scott Edmunds, David Porco, Marzio Zapparoli, Teodor Georgiev, Daniel Mietchen, David Roberts, Sarah Faulwetter, Vincent Smith, Lyubomir Penev. Eupolybothrus cavernicolus Komerički & Stoev sp. n. (Chilopoda: Lithobiomorpha: Lithobiidae): the first eukaryotic species description combining transcriptomic, DNA barcoding and micro-CT imaging data. BDJ 1, e1013. Pensoft Publishers, 2013.
B. J. Strasser. GENETICS: GenBank–Natural History in the 21st Century?. Science 322, 537–538. American Association for the Advancement of Science (AAAS), 2008.
Bruno J. Strasser. The Experimenter's Museum: GenBank, Natural History, and the Moral Economies of Biomedicine. Isis 102, 60–96. University of Chicago Press, 2011.
Joan Thorne. Zoological Record and registration of new names in zoology. Bulletin of Zoological Nomenclature 60, 7–11, 2003.