Authorea

Roderic Page edited Typically_integration_across_biodiversity_databases__.md over 8 years ago

Commit id: be9cdc964153408415efef974e3b104650b61a7b


Typically, integration across biodiversity databases is achieved using taxonomic names \cite{Patterson_2010}, but the rise of dark taxa makes this problematic for an increasing fraction of sequence-based data. Even when we have names, they need not always mean the same thing \cite{Kennedy_2003}. As an example, Fig. x shows the distribution of the lizard _Morethia obscura_ from GBIF. For comparison, Fig. y shows a geophylogeny \cite{Page_2015} for DNA barcodes for _Morethia obscura_ from BOLD. The barcodes reveal considerable phylogenetic structure within "Morethia obscura", which is reflected in specimens of this species being assigned to several distinct BINs, implying that "Morethia obscura" comprises more than one species. Although GBIF and BOLD present rather different views of the "same" species, Figs. x and y are to some extent based on the same specimens. For example, DNA barcode WAMMS012-10 was obtained from specimen WAMR127637, which also occurs in GBIF (as occurrence http://gbif.org/occurrence/691832260). Because the taxonomic concepts in GBIF and BOLD are explicitly defined with respect to sets of specimens, we can compare them directly, rather than rely on the possibly erroneous assumption that a given taxonomic name means the same thing in the two databases. Furthermore, as more and more type specimens are sequenced \cite{Federhen_2014}, we can more firmly associate names with sets of specimens, leading to a more computable nomenclature in which the name we assign to a set of specimens can be determined automatically \cite{Pullan_2000}. Integrating databases using specimens is attractive, but not without its own set of issues. The biodiversity informatics community has yet to standardise identifiers for specimens, despite numerous efforts \cite{Guralnick_2015}; consequently, there may be little apparent overlap between specimen identifiers in different databases \cite{Guralnick_2014}.
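
The specimen-based comparison of taxonomic concepts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition_by_bin`, the BIN labels, and every specimen code other than WAMR127637 are hypothetical placeholders, and the specimen identifiers are assumed to have already been reconciled between the two databases.

```python
def partition_by_bin(name_specimens, bins):
    """Return, for each BIN, the specimens it shares with a named concept.

    name_specimens: set of specimen codes filed under one taxonomic name
    bins: dict mapping a BIN label to its set of specimen codes
    BINs that share no specimens with the name are omitted.
    """
    shared = {}
    for bin_id, members in bins.items():
        common = name_specimens & members
        if common:
            shared[bin_id] = common
    return shared


# Hypothetical data: only WAMR127637 is taken from the text.
gbif_name = {"WAMR127637", "SPEC-0002", "SPEC-0003"}
bold_bins = {
    "BIN-A": {"WAMR127637", "SPEC-0004"},
    "BIN-B": {"SPEC-0003"},
    "BIN-C": {"SPEC-0005"},
}

shared = partition_by_bin(gbif_name, bold_bins)

# If specimens under one name fall into more than one BIN, the name
# may cover more than one species-level cluster.
spans_multiple_bins = len(shared) > 1
```

Treating each concept as a plain set keeps the comparison independent of the names attached to it, which is the point of comparing via specimens rather than via names.
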
Despite the limited sharing of data between BOLD and GBIF, there are already barcoded specimens in GBIF. To illustrate, consider the DNA barcode GWORH520-09 from sample "BC ZSM Lep 10234". GBIF does not have this record from BOLD, but it does have the specimen BC ZSM Lep 10234 (provided by the host institution \cite{9915051b-04a1-4a45-8c40-6bed0885c5bd}). Furthermore, the DNA barcode from this specimen is also in GenBank, and because that record is georeferenced it has been ingested by GBIF as part of the Geographically tagged INSDC sequences dataset \cite{dc7b81db-6e39-484a-8868-d85b386d2fee}. So, GBIF has duplicate records for a barcoded moth, neither provided directly by BOLD. Merging and de-duplicating specimen-based records is going to be a significant challenge for global aggregators such as GBIF. [figure of this]
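
One simple way to surface candidate duplicates of this kind is to group records by a normalised specimen code. The following Python sketch is a minimal illustration, not a description of how GBIF processes records: the record structure, the field name `catalog_number`, the function names, the hyphenated variant of the code, and the third record are all assumptions made for the example.

```python
import re
from collections import defaultdict


def specimen_key(catalog_number):
    """Lower-case the code and drop spaces and punctuation."""
    return re.sub(r"[^a-z0-9]", "", catalog_number.lower())


def group_duplicates(records):
    """Group records by normalised code; keep groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[specimen_key(record["catalog_number"])].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


records = [
    {"source": "host institution dataset", "catalog_number": "BC ZSM Lep 10234"},
    # Assumed formatting variant of the same code, for illustration only.
    {"source": "INSDC sequences dataset", "catalog_number": "BC-ZSM-Lep-10234"},
    # Hypothetical unrelated record.
    {"source": "host institution dataset", "catalog_number": "BC ZSM Lep 99999"},
]

duplicates = group_duplicates(records)
```

Matching on a normalised string is deliberately crude. It will miss duplicates whose codes differ by more than formatting, and it can wrongly merge records from different collections that happen to reuse the same code, so in practice the key would probably need to include the institution or collection as well.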