Discussion
A lack of high-quality, location-specific reference sequence data appears to limit the potential for DNA-based monitoring of terrestrial biodiversity, despite the great promise of these techniques. A scarcity of invertebrate taxonomic expertise, an associated lack of authoritatively identified specimens, and the high cost of Sanger sequencing pose significant barriers to reference database generation. We present an efficient strategy that helps to overcome these barriers and that could reduce reliance on publicly available databases with inadequate local relevance for biodiversity monitoring. By combining a set of taxonomically identified specimens from a national collection with a sensitive high-throughput sequencing approach, we rapidly generated a reference COI sequence database consisting of full-length barcodes for 334 specimens and partial (FC or BR) barcodes for a further 105 specimens. These specimens represent a wide range of invertebrate taxa, collected from diverse locations and stored under varied conditions. We observed no obvious effects of sampling or DNA extraction methods on barcoding success, indicating that a variety of protocols and specimen types can provide acceptable outcomes using this process. Furthermore, BR sequences were recovered from nearly all specimens, highlighting the potential for this process to achieve exceptionally high rates of DNA barcoding success, apparent deficiencies with the FC PCR primers notwithstanding.
This analysis represents only a small portion of the source collection, but the number of specimens included was limited arbitrarily, and there is considerable scope to increase the throughput of this process. We recovered an average of over 10,000 sequences per amplicon and specimen (albeit with considerable variance). Given that the theoretical capacity of the MiSeq system exceeds 20 million sequence reads, and that only one correct FC sequence and one correct BR sequence are required to form a complete barcode, a conservative estimate is that at least one thousand, and perhaps several thousand, specimens could be sequenced in one MiSeq run using this process. However, as the number of samples pooled together increases, the read depth per sample typically decreases, which may hinder the detection of sequences from specimens that were difficult to amplify.
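The capacity estimate above can be reproduced with simple arithmetic. The following is a minimal sketch, not part of the analysis pipeline: it assumes reads are shared evenly across specimens and amplicons, and it ignores reads lost to quality filtering and the considerable variance noted above. The function name and the reduced depth of 2,000 reads per amplicon are hypothetical, chosen only for illustration.

```python
def specimens_per_run(run_capacity_reads, reads_per_amplicon, amplicons_per_specimen=2):
    """Estimate how many specimens fit in one run at a given read depth.

    Assumes perfectly even pooling; real runs will deviate from this.
    """
    reads_per_specimen = reads_per_amplicon * amplicons_per_specimen
    return run_capacity_reads // reads_per_specimen


# Figures taken from the text (both are stated as lower bounds).
RUN_CAPACITY = 20_000_000   # theoretical MiSeq capacity, in reads
OBSERVED_DEPTH = 10_000     # average reads per amplicon and specimen

# Two amplicons (FC and BR) are needed per specimen.
at_observed_depth = specimens_per_run(RUN_CAPACITY, OBSERVED_DEPTH)

# Hypothetical lower depth, for illustration only.
at_reduced_depth = specimens_per_run(RUN_CAPACITY, 2_000)

print(at_observed_depth)  # 1000
print(at_reduced_depth)   # 5000
```

At the observed depth this gives roughly one thousand specimens per run. Accepting a lower depth per amplicon raises the figure into the several thousands, at the cost of reduced sensitivity for specimens that amplify poorly.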
Previous attempts to obtain DNA barcodes from large numbers of invertebrate specimens have used a variety of sequencing approaches. In one example, Sanger sequencing was used to obtain DNA barcodes from 86 % of over 40,000 museum-held Lepidoptera specimens, demonstrating a profound effect of specimen age on barcoding success using this method [15]. However, this effort required six months of molecular work by five people, illustrating the inefficiency and impracticality of Sanger sequencing when applied to large numbers of specimens. Invertebrate DNA barcoding efforts utilizing high-throughput sequencing technologies typically report greater efficiency, lower costs, and higher barcoding success rates than equivalent Sanger sequencing-based efforts [12, 22, 35]. The two-amplicon PCR approach used in this study was previously used to obtain barcodes from 97 % of > 1000 freshly trapped arthropod specimens [23]. We achieved comparable success rates with older specimens from a wide range of locations, including a diverse selection of earthworms, confirming the utility of this MiSeq approach for efficiently barcoding numerous specimens from diverse lineages and sources. On the other hand, the same approach applied to dried saproxylic beetle specimens achieved a lower success rate of 55 %, perhaps because the specimen collection methods were suboptimal for DNA preservation [36]. The MiSeq system has also been used in a multi-locus metabarcoding approach for detecting insect pests in bulk trap catches, which underlined the importance of taxonomic information for validating metabarcoding outcomes [19].
Single molecule real-time (SMRT) sequencing on the PacBio Sequel platform has recently been used to recover DNA barcodes from 20,000 insect specimens [35], and from hundreds of ~50-year-old butterfly specimens [12]. This platform has been argued to offer the most economical high-throughput barcoding approach [35], but this depends on very large numbers (tens of thousands) of input specimens. The MiSeq system is arguably more accessible (in terms of platform availability and sequencing run costs) than the PacBio Sequel system, and is suited to more modest numbers of specimens (hundreds to low thousands), which may be more compatible with biomonitoring requirements. Below we discuss some of the salient features and limitations of our approach.