Introduction
Recent advances in high-throughput sequencing technology are enabling a
shift towards environmental DNA (eDNA)-based methods for biodiversity
assessment and biosecurity monitoring [1]. While still in their
infancy, these tools offer great promise for rapid and accessible
biodiversity monitoring applications in terrestrial, aquatic, and marine
ecosystems [2, 3]. However, a scarcity of accurately identified
reference DNA sequence data from local biota [4] remains a
significant obstacle to the application of these eDNA tools to
biomonitoring, preventing the confident identification and
interpretation of detected organisms.
DNA-based identification methods, regardless of application, rely on the
determination of similarity between newly detected sequences and
existing reference sequence data [4, 5]. Depending on target taxa,
this requires representative taxonomically validated data on established
marker genes such as the ~650 bp region of cytochrome c oxidase subunit I
(COI) for metazoans [6], or combinations
of plastid regions (rbcL, matK, trnH–psbA) and the
ribosomal internal transcribed spacer (ITS) region for plants [7].
Large open data sources, such as the GenBank nr database or BOLD
[8], are typically employed as reference databases, but rely on data
submitters for fidelity of sequence to organism and suffer from
geographic sampling biases [9]. Further, this reliance on large
pre-existing databases also limits the emergence of new or
taxon-specific markers, with individual studies utilising previously
established markers even if they may be suboptimal for certain taxa
[10]. Curated databases containing only sequences from a targeted
ecosystem may result in improved accuracy of sequence identifications
compared to a global database [11]. However, the current sparse
database coverage of biodiversity from most ecosystems means that
targeted reference databases typically must be populated with newly
generated and locally relevant reference sequences.
Taxonomically validated reference sequences are difficult to generate.
Not only do they require high levels of sequence accuracy (traditionally
achieved via Sanger sequencing, more recently possible via PacBio HiFi
technology [12]), but also accurate taxonomic identification of
specimens. The former may be time-consuming, contingent on sample
quality, and expensive, especially when applied to large numbers of
specimens, while the latter requires specialist taxonomic expertise
across taxa. For example, generating a reliable reference database for a
previously uncharacterized insect fauna may require taxonomic skills
spanning 24 distinct insect orders. Natural history museums and national
biological collections, however, are unparalleled repositories of both
invaluable taxonomic knowledge [13] and authoritatively identified
genetic source material [14], with the potential to allow the
efficient generation of taxonomically comprehensive and locally relevant
reference DNA sequence databases [15]. Generating full-length DNA
barcodes via Sanger sequencing from dried or historical specimens stored
over long periods may be difficult due to DNA degradation and low
sensitivity of the sequencing approach, often resulting in only partial
barcodes [15-18]. Furthermore, museum samples are often
indispensable permanent records, and therefore unavailable for
destructive DNA extraction. However, non-destructive extraction [19, 20] and
PCR [21] approaches can be effective, depending on the taxa
being analysed. Multiplex PCR coupled with high-throughput DNA
sequencing technologies has allowed the efficient recovery of barcodes
from 50- to 100-year-old museum samples [12, 22], as well as
recently collected specimens [23].
There is a pressing need to leverage museum collections for rapid and
cost-effective generation of reference databases, in order to aid
eDNA-based biodiversity monitoring [24]. Here, we present a fast,
cost-effective, and efficient method for developing a reference COI
database from a diverse selection of terrestrial invertebrates sourced
from the New Zealand Arthropod Collection (NZAC). These taxonomically
validated specimens exhibit a variety of field collection methods,
specimen treatment and storage conditions, as well as variable
accessibility for destructive sampling. We demonstrate the use of a dual
indexing approach, in combination with a pair of overlapping short PCR
amplicons suitable for sequencing on the Illumina MiSeq platform, for
generating full-length barcodes from hundreds of invertebrate specimens
simultaneously. We provide a taxonomy-informed bioinformatics pipeline
for processing and filtering the sequence data and the rapid assembly of
successful barcodes. Together, these steps constitute a highly
sensitive, accurate, and efficient method for targeted reference
database generation, providing a foundation for DNA-based assessments
and monitoring of biodiversity.
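As a minimal illustration of the overlapping-amplicon idea (not the pipeline used in this study), two short amplicons can be joined into a single barcode when the end of the first matches the start of the second. The Python sketch below assumes exact matching across the overlap; the function name merge_overlapping, the min_overlap parameter, and the toy sequences are hypothetical. Practical read-merging tools typically also tolerate mismatches and weigh base qualities.

```python
from typing import Optional


def merge_overlapping(left: str, right: str, min_overlap: int = 10) -> Optional[str]:
    """Join two sequences if a suffix of `left` exactly equals a prefix of `right`.

    Tries the longest candidate overlap first and returns None when no
    exact overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            # Keep `left` whole and append only the non-overlapping tail of `right`.
            return left + right[k:]
    return None


# Toy sequences (illustrative only): the two fragments share "TTGACCA".
fragment_a = "ACGTACGTTTGACCA"
fragment_b = "TTGACCAGGCTA"
barcode = merge_overlapping(fragment_a, fragment_b, min_overlap=5)
print(barcode)
```

Searching from the longest overlap downwards favours the most complete agreement between the two fragments; a minimum overlap length guards against joining fragments on a short chance match.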