Rather than try to estimate an unknown (the number of species remaining to be described), here I focus on the current state of taxonomic knowledge. Given that we lack a comprehensive, global index of all species descriptions, discovering what we know about what we know is not entirely straightforward. For zoology the nearest we have is the Index of Organism Names (ION, http://www.organismnames.com), which is based on Zoological Record. Figure 1 shows the number of new taxonomic names covered by the International Code of Zoological Nomenclature (animals plus some protozoan groups) published each year, based on data from ION that were cleaned and augmented in BioNames \cite{Page_2013}. These data show an increase in overall numbers over time, with dips around the times of the two World Wars, followed by an essentially constant number each year since the mid-twentieth century. The pattern varies across taxa: some show increasing numbers per year, but others are essentially static or in decline, even in groups thought to be hyperdiverse such as nematodes \cite{Blaxter_2003}.

Digitising the taxonomic literature

The rate of progress in biodiversity research is controlled by two factors: the speed with which we can discover and describe biodiversity, and the speed with which we can communicate that information \cite{Pentcheff_2010}. Unlike in most biological disciplines, the entire corpus of taxonomic literature since the mid-18th century remains a vital resource for present-day research. In this way taxonomy is similar to the digital humanities, which have not just "big data" but "long data" \cite{Aiden2013}. This is not only because the rules of nomenclature dictate (with some exceptions) that the name to use for a species is the oldest one published; it also reflects the uneven effort devoted to the study of different taxonomic groups \cite{MAY_1988}. For poorly known groups the bulk of our knowledge of their biology may reside in the primary taxonomic literature.

Digitisation is one step towards making taxonomic information available. Many commercial publishers have, on the face of it, done the taxonomic community a great service by digitising whole back catalogues of relatively obscure journals. However, digitisation is not the same as access, and many commercial publishers keep this scanned literature behind paywalls. In some fields legal issues around access have been side-stepped by constructing a "shadow" dataset that summarises key features of the data while still restricting access to the data itself. For example, by extracting phrases comprising a set of n words (n-grams) from Google Books it is possible to create a data set that contains valuable information without exposing the full text \cite{Michel_2010}. For taxonomic work, however, there does not seem to be an obvious way to extract such a shadow. Agosti and colleagues \cite{Agosti_2009, Patterson_2014} have explored ways to extract core facts from the literature and re-purpose these without violating copyright, though how far their conclusions can be generalised across different national and international legal systems remains untested.
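
To make the idea of an n-gram "shadow" concrete, the following is a minimal sketch of how n-gram counts could be derived from a passage of text. It is not the procedure used for Google Books; the function name \texttt{ngram\_counts} and the sample sentence are hypothetical, and tokenisation is reduced to lower-casing and splitting on whitespace.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams (runs of n consecutive words) in a text.

    Tokenisation is deliberately naive (lower-case, split on whitespace);
    this is an assumption made for illustration only.
    """
    words = text.lower().split()
    # Align n shifted copies of the word list to form consecutive n-word tuples.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sentence standing in for a page of scanned text.
sample = "a new species of frog and a new species of toad"
counts = ngram_counts(sample, 3)
```

The resulting table of counts records which phrases occur and how often, but the original word order beyond each n-word window is not retained, which is what allows such a dataset to be shared without exposing the full text.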

Apart from commercial digitisation of the scientific literature, two other developments are accelerating access to taxonomic information. The first is the rise of open access publishing, notably journals such as ZooKeys that support sophisticated markup of the text \cite{Penev_2010}. This is increasing the number of recently described species that are published in a machine-readable form that can then be subject to further processing \cite{Miller_2015}. At the same time, the Biodiversity Heritage Library (BHL) \cite{Gwinn_2009} has embarked on large-scale digitisation of legacy taxonomic literature. Although initially focussing on out-of-copyright literature (i.e., pre-1923 in the United States), BHL is increasingly getting permission from copyright holders to scan more recent literature as well. Coupled with tools such as BioStor \cite{Page_2011} to locate and extract articles within the scanned volumes, BHL is fast becoming the largest available open access archive of biodiversity literature.

To quantify the extent to which the taxonomic literature has been digitised, for each decade I counted the number of publications of new animal names with and without a digital identifier (such as a DOI, a PDF, a Handle, or a URL to BioStor). The recent taxonomic literature is mostly digital: for the years 2010-15, 60\% of publications have a digital identifier, the bulk of these having a DOI. However, prior to the 21st century more publications lack identifiers than have them, with the 1970s being the least digitised decade (Fig. 2).
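
The per-decade tally described above can be sketched as follows. This is an illustrative outline only, not the code used to produce Fig. 2: the record layout (a publication year paired with a list of identifier types), the function names, and the six toy records are all assumptions, and the toy values are not drawn from ION or BioNames.

```python
from collections import defaultdict

# Hypothetical records: (publication year, identifier types found for that
# publication). An empty list means no digital identifier is known.
publications = [
    (1975, []),
    (1978, ["biostor"]),
    (1979, []),
    (2011, ["doi"]),
    (2012, ["doi", "pdf"]),
    (2014, []),
]


def tally_by_decade(records):
    """Return {decade: [count_with_identifier, count_without_identifier]}."""
    tally = defaultdict(lambda: [0, 0])
    for year, identifiers in records:
        decade = (year // 10) * 10
        # Index 0 counts publications with at least one identifier, index 1 those with none.
        tally[decade][0 if identifiers else 1] += 1
    return dict(tally)


def fraction_with_identifier(tally):
    """Return, for each decade, the fraction of publications with an identifier."""
    return {decade: w / (w + wo) for decade, (w, wo) in tally.items()}


tally = tally_by_decade(publications)
fractions = fraction_with_identifier(tally)
```

A publication carrying several identifiers (for example both a DOI and a PDF) is counted once, so the two counts for a decade always sum to the number of publications in that decade.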