Discussion
A lack of high-quality, location-specific reference sequence data
appears to limit the potential for DNA-based monitoring of terrestrial
biodiversity, despite the great promise of these techniques. A scarcity
of invertebrate taxonomic expertise and an associated lack of
authoritatively identified specimens, and the high costs of Sanger
sequencing, pose significant barriers to reference database generation.
We present an efficient strategy that helps to overcome these barriers,
with the potential to reduce reliance on publicly available databases
that have inadequate local relevance for biodiversity monitoring.
By leveraging a set of taxonomically identified specimens from a
national collection coupled with a sensitive high-throughput sequencing
approach, we rapidly generated a reference COI sequence database
consisting of full-length barcodes for 334 specimens and partial (FC or
BR) barcodes for a further 105 specimens, representing a wide range of
invertebrate taxa from a diverse range of locations and with varied
storage conditions. We observed no obvious effects of sampling or DNA
extraction methods on barcoding success, indicating that a variety of
protocols and specimen types can provide acceptable outcomes using this
process. Furthermore, BR sequences were recovered from nearly all
specimens, highlighting the potential for this process to achieve
exceptionally high rates of DNA barcoding success, apparent deficiencies
with the FC PCR primers notwithstanding.
While this analysis represents only a small portion of the source
collection, the number of specimens included was arbitrarily limited,
and there is considerable scope to greatly increase the throughput of
this process. We recovered an average of over 10,000 sequences per
amplicon per specimen (albeit with considerable variance). Given that
the theoretical capacity of the MiSeq system exceeds 20 million sequence
reads, and that only one correct FC and one correct BR sequence is
required to form a complete barcode, we conservatively estimate that at
least one thousand, and perhaps several thousand, specimens could be
sequenced in one MiSeq run using this process. However, as the number of
samples pooled together increases, the read depth per sample typically
decreases, which may hinder the detection of sequences from specimens
that were difficult to amplify.
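The arithmetic behind this throughput estimate can be sketched as
follows. This is a minimal illustration, not part of our analysis
pipeline: it assumes that reads are spread evenly across specimens and
amplicons and that every read is usable, neither of which holds in
practice given the variance noted above. The function names and the
reduced depth of 2,000 reads are hypothetical; the run capacity, the
observed depth, and the two amplicons (FC and BR) are the figures quoted
in the text.

```python
def specimens_per_run(run_reads, reads_per_amplicon, amplicons_per_specimen=2):
    """Number of specimens that fit in one run at a target read depth.

    Assumes reads are divided evenly among all amplicons (an idealisation).
    """
    if reads_per_amplicon <= 0 or amplicons_per_specimen <= 0:
        raise ValueError("read depth and amplicon count must be positive")
    return run_reads // (reads_per_amplicon * amplicons_per_specimen)


def depth_per_amplicon(run_reads, n_specimens, amplicons_per_specimen=2):
    """Mean reads per amplicon when n_specimens share one run evenly."""
    if n_specimens <= 0 or amplicons_per_specimen <= 0:
        raise ValueError("specimen and amplicon counts must be positive")
    return run_reads // (n_specimens * amplicons_per_specimen)


RUN_READS = 20_000_000   # theoretical MiSeq capacity quoted in the text
OBSERVED_DEPTH = 10_000  # mean reads per amplicon per specimen in this study

# At the depth observed here, one run holds about one thousand specimens.
print(specimens_per_run(RUN_READS, OBSERVED_DEPTH))  # 1000

# Hypothetical lower depth: accepting 2,000 reads per amplicon raises capacity.
print(specimens_per_run(RUN_READS, 2_000))  # 5000

# Conversely, pooling 3,000 specimens lowers the mean depth per amplicon.
print(depth_per_amplicon(RUN_READS, 3_000))  # 3333
```

The second function makes the trade-off explicit: mean depth falls in
inverse proportion to the number of pooled specimens, so specimens that
amplify poorly are the first to drop below a usable read count.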
Previous attempts to obtain DNA barcodes from multiple invertebrate
specimens have used a variety of sequencing approaches. In one example,
Sanger sequencing was used to obtain DNA barcodes from 86 % of over
40,000 museum-held Lepidoptera specimens, demonstrating a profound
effect of specimen age on barcoding success using this method [15].
However, this effort required six months of molecular work by five
people, illustrating the inefficiency and impracticality of Sanger
sequencing when applied to large numbers of specimens. Invertebrate DNA
barcoding efforts utilizing high-throughput sequencing technologies
typically report greater efficiency, lower costs, and higher barcoding
success rates than equivalent Sanger sequencing-based efforts [12, 22,
35]. The two-amplicon PCR approach used in this study was previously
used to obtain barcodes from 97 % of more than 1000 freshly trapped
arthropod specimens [23]. We achieved comparable success
rates from older specimens, from a wide range of locations, including a
diverse selection of earthworms, confirming the utility of this MiSeq
approach for efficiently barcoding numerous specimens from diverse
lineages and sources. On the other hand, the same approach applied to
barcoding of dried saproxylic beetle specimens achieved a lower success
rate of 55 %, perhaps due to specimen collection methods being
suboptimal for DNA preservation [36]. The MiSeq system has also been
used in a multi-locus metabarcoding approach for detecting insect pests
in bulk trap catches, which highlighted the importance of taxonomic
information for confirming metabarcoding outcomes [19].
Single-molecule real-time (SMRT) sequencing on the PacBio Sequel
platform has recently been used to recover DNA barcodes from 20,000
insect specimens [35], and to recover barcodes from hundreds of
~50-year-old butterfly specimens [12]. This system is argued to be the
most economical high-throughput barcoding option [35], but this
advantage relies upon very large numbers (tens of thousands) of input
specimens. The MiSeq system is arguably more accessible (in terms
of platform availability and sequencing run costs) than the PacBio
Sequel system, and is suited to more modest numbers of specimens
(hundreds to low thousands), which may be more compatible with
biomonitoring requirements. Below we discuss some of the salient
features and limitations of our approach.