Roderic Page edited Taxonomy.md  over 8 years ago

Commit id: bfd857735230928c7d4cfcaa6d72ad968c5ae009


Rather than try to estimate an unknown (the number of species remaining to be described), here I focus on the current state of taxonomic knowledge. However, given that we lack a comprehensive, global index of all species descriptions, discovering what we know about what we know is not entirely straightforward. For zoology the nearest we have is the Index of Organism Names (ION, [http://www.organismnames.com](http://www.organismnames.com)), which is based on Zoological Record. Figure 1 shows the number of new taxonomic names covered by the International Code of Zoological Nomenclature (animals plus some protozoan groups) that have been described each year, based on data from ION, cleaned and augmented in BioNames \cite{Page_2013}. These data show an increase in overall numbers over time, with dips around the times of the two World Wars, followed by an essentially constant number each year since the mid-twentieth century. The pattern varies across taxa: some show increasing numbers per year, but others are essentially static or in decline, even in groups thought to be hyperdiverse such as nematodes \cite{Blaxter_2003}.

## Digitising the taxonomic literature

The rate of progress in biodiversity research is controlled by two factors: the speed with which we can discover and describe biodiversity, and the speed with which we can communicate that information \cite{Pentcheff_2010}. Unlike in most biological disciplines, the entire corpus of taxonomic literature since the mid-18th century remains a vital resource for current-day research. In this way taxonomy is similar to the digital humanities, where we have not just "big data" but "long data" \cite{Aiden2013}. This is not only because the rules of nomenclature dictate (with some exceptions) that the name to use for a species is the oldest one published; it also reflects the uneven effort devoted to the study of different taxonomic groups \cite{MAY_1988}.
For poorly known groups the bulk of our knowledge of their biology may reside in the primary taxonomic literature.

Digitisation is one step towards making that biodiversity information available. Many commercial publishers have, on the face of it, done the taxonomic community a great service by digitising whole back catalogues of relatively obscure journals. However, digitisation is not the same as access, and many commercial publishers keep this scanned literature behind a paywall. In some fields, legal issues around access have been side-stepped by constructing a "shadow" dataset that summarises key features of the data while still restricting access to the data itself. For example, by extracting _n_-grams (phrases comprising _n_ words) from Google Books it is possible to create a dataset that still contains valuable information without exposing the full text \cite{Michel_2010}. But for taxonomic work, there does not seem to be an obvious way to extract a shadow. Agosti and colleagues \cite{Agosti_2009, Patterson_2014} have explored ways to extract core facts from the literature and re-purpose these without violating copyright, though how far their conclusions can be generalised across different national and international legal systems is unclear.
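
To make the _n_-gram "shadow" idea mentioned above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the pipeline used for Google Books: the function names `ngram_counts` and `shadow`, the crude ASCII tokeniser, and the `min_count` threshold are all hypothetical choices made for this example.

```python
from collections import Counter
import re


def ngram_counts(text, n):
    """Count word n-grams in one text.

    Tokenisation is deliberately crude (lowercase ASCII letters only);
    this is an assumption for illustration, not a recommended tokeniser.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    # Slide a window of width n over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def shadow(texts, n=2, min_count=2):
    """Build a 'shadow' dataset: n-gram frequencies pooled across texts.

    Only n-grams seen at least `min_count` times are kept, so rare
    phrases are dropped. The threshold is a hypothetical parameter.
    """
    total = Counter()
    for text in texts:
        total.update(ngram_counts(text, n))
    return {gram: count for gram, count in total.items() if count >= min_count}


if __name__ == "__main__":
    pages = [
        "Aus bus is a new species",
        "Aus bus occurs in Peru",
    ]
    print(shadow(pages, n=2, min_count=2))
```

The resulting table of phrase counts retains some signal (for example, how often a name is used) while discarding word order beyond _n_ words. It also illustrates why this approach fits taxonomy poorly: a species description is useful as a connected whole, which pooled phrase counts do not preserve.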