Roderic Page edited Taxonomy.md  over 8 years ago

Commit id: 21a0474fc9616a54fd2f6613833574d1098db8b0


Among the many challenges faced by taxonomy is the difficulty of estimating the size of the task it faces. Estimates of the number of species on Earth are uncertain and inconsistent, and show no signs of converging \cite{Caley_2014}. Some estimates based on models of taxonomic effort suggest that two-thirds of all species have already been described \cite{Costello_2011}. Analyses that use the number of authors per species description as a proxy for effort \cite{Joppa_2011} ignore the global trend towards an increasing number of authors per paper \cite{Aboukhalil_2014}, and assume that the effort required per description has remained constant over time. An alternative interpretation is that the quality of taxonomic description is increasing over time \cite{Sangster_2014}, reflecting both increased thoroughness and new technologies \cite{Stoev_2013} \cite{Akkari_2015}.

Rather than try to estimate an unknown (the number of species remaining to be described), we can instead focus on the current state of taxonomic knowledge, which is less than ideal. For example, we lack a comprehensive, global index of species descriptions. For zoology the nearest we have is the Index of Organism Names (ION), which is based on Zoological Record. Fig x shows the numbers of new taxa covered by the ICZN (animals plus some protozoan groups) that have been described each year, based on data from ION. These data show an increase in numbers, with dips around the times of the two World Wars, followed by an essentially constant number each year since the mid-twentieth century. The pattern in individual groups may vary considerably: for most of the taxa analysed by \cite{Joppa_2011} the numbers of new species described per year are increasing, but other taxonomic groups are essentially static or in decline.
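The kind of tally behind Fig x can be sketched in a few lines, assuming name records exported from a source such as ION as (name, year) pairs; the records below are invented placeholders, not real ION data.

```python
from collections import Counter

# Hypothetical (name, year) records standing in for an export from ION;
# the names and dates are invented for illustration only.
records = [
    ("Aus bus", 1914), ("Aus cus", 1914), ("Dus eus", 1918),
    ("Fus gus", 1955), ("Hus ius", 1955), ("Hus jus", 1955),
]

# Tally of new names per year, the quantity plotted in Fig x.
names_per_year = Counter(year for _name, year in records)

print(sorted(names_per_year.items()))  # → [(1914, 2), (1918, 1), (1955, 3)]
```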
What we know is embodied in the taxonomic literature, the output of generations of taxonomists. The rate of progress in biodiversity research is controlled by two factors: the speed with which we can discover and describe biodiversity, and the speed with which we can communicate that information \cite{Pentcheff_2010}. Unlike most biological disciplines, in taxonomy the entire corpus of literature since the mid-18th century remains a vital resource for present-day research. In this way taxonomy resembles the digital humanities, where we have not just "big data" but "long data" [978-1594632907]. This is partly because the rules of nomenclature dictate (with some exceptions) that the name to use for a species is the oldest one published, and partly because of the "long tail" effect: for a few species we know a great deal, but for most species the entire sum of our knowledge may reside in the primary taxonomic literature.

Digitisation is one step towards making that information available. Many commercial publishers have, on the face of it, done the taxonomic community a great service by digitising the complete back catalogues of relatively obscure journals. However, digitisation is not the same as access, and many commercial publishers keep this scanned literature behind a paywall. In some fields, legal issues around access have been side-stepped by constructing a "shadow" dataset. For example, by extracting n-grams (phrases comprising n words) from Google Books it is possible to create a dataset that still contains valuable information without exposing the full text \cite{Michel_2010}. For taxonomic work, however, there does not seem to be an obvious way to extract such a shadow.
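To make the n-gram idea concrete, here is a minimal sketch of how a "shadow" dataset of n-gram counts could be derived from scanned text; the sample sentence is invented, and a real pipeline of the kind used for Google Books is of course far more elaborate.

```python
from collections import Counter

def ngrams(text, n):
    """Yield the n-grams (tuples of n consecutive words) of a text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

# Invented sentence standing in for a page of scanned, copyrighted text.
page = "Aus bus n. sp. differs from Aus cus in the shape of the pronotum"

# The "shadow": bigram frequencies can be shared, the full text cannot.
shadow = Counter(ngrams(page, 2))
```

The point of the shadow is that questions about word and phrase frequency can still be answered from the counts alone, while the copyrighted page itself is never redistributed.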
Agosti and colleagues \cite{Agosti_2009} have explored ways to extract core facts from the literature and repurpose them without violating copyright, though how far their conclusions generalise across different national and international legal systems is unclear.

The taxonomic literature is highly decentralised, being spread across numerous journals [fig]. What is striking is the rise of the "megajournal" Zootaxa to dominance in animal taxonomy, and yet even this journal has published only 15% of the new names minted since 2000. The taxonomic literature has a very "long tail" of small, often obscure journals, each containing a few taxonomic publications. Individually, each journal in the tail contains little taxonomic information; collectively, they contain the bulk [need to quantify this]. Long tails require significant effort to index: although the Zoological Record claims 90% coverage \cite{thorne_2003}, in some taxa there may be significantly greater gaps [Bouchot debate, etc.]. The long tail also presents a considerable challenge to digitisation efforts such as BHL, which will have to scan a large number of journals to capture even a reasonable fraction of taxonomic information.

The picture that emerges from our knowledge of the taxonomic literature is that the recent literature is mostly digital, identified by DOIs, and some of it Open Access. But much of our fundamental knowledge of the world's biodiversity, particularly that published in the mid to late 20th century, remains digitally inaccessible.
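The claim that the tail collectively holds the bulk of new names can be illustrated with a toy calculation; the per-journal counts below are invented, chosen only to be consistent with the 15% megajournal share mentioned above, and the real figures would have to come from an index such as ION or Zoological Record.

```python
# Invented per-journal counts of new names: one megajournal, a modest
# middle, and thousands of tiny journals (illustrative numbers only).
counts = [15000] + [500] * 20 + [10] * 7500

total = sum(counts)
mega_share = counts[0] / total                          # the megajournal
tail_share = sum(c for c in counts if c <= 10) / total  # the long tail

print(f"megajournal: {mega_share:.0%}, long tail: {tail_share:.0%}")
# → megajournal: 15%, long tail: 75%
```

Even with a single journal publishing far more names than any other, the thousands of tiny journals together dominate, which is why indexing and scanning the tail matters so much.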