Environmental DNA (eDNA) metabarcoding has gained growing attention as a strategy for monitoring biodiversity in ecology. However, taxa identifications produced through metabarcoding require sophisticated processing of high-throughput sequencing data from taxonomically informative DNA barcodes. Various sets of universal and taxon-specific primers have been developed, extending the usability of metabarcoding across archaea, bacteria, and eukaryotes. Accordingly, a multitude of metabarcoding data analysis tools and pipelines have also been developed. Often, several developed workflows are designed to process the same amplicon sequencing data, making it somewhat puzzling to choose one amongst the plethora of existing pipelines. However, each pipeline has its own specific philosophy, strengths, and limitations, which should be considered depending on the aims of any specific study, as well as the bioinformatics expertise of the user. In this review, we outline the input data requirements, supported operating systems, and particular attributes of thirty-one amplicon processing pipelines with the goal of helping users to select a pipeline for their metabarcoding projects.
Here I describe the novel R package SNPfiltR and demonstrate its functionalities as the backbone of a customizable, reproducible SNP filtering pipeline implemented exclusively via the widely adopted R programming language. SNPfiltR extends existing SNP filtering functionalities by automating the visualization of key parameters such as depth, quality, and missing data, then allowing users to set filters based on optimized thresholds, all within a single, cohesive working environment. All SNPfiltR functions require a vcfR object as input, which can be easily generated by reading a SNP dataset stored as a standard VCF file into an R working environment using the function read.vcfR() from the R package vcfR. Performance benchmarking reveals that for moderately sized SNP datasets (up to 50M genotypes with associated quality information), SNPfiltR performs filtering with comparable efficiency to current state-of-the-art command-line-based programs. These benchmarking results indicate that for most reduced-representation genomic datasets, SNPfiltR is an ideal choice for investigating, visualizing, and filtering SNPs as part of a cohesive and easily documentable bioinformatic pipeline. The SNPfiltR package can be downloaded from CRAN with the command install.packages("SNPfiltR"), and a development version is available from GitHub at github.com/DevonDeRaad/SNPfiltR. Additionally, thorough documentation for SNPfiltR, including multiple comprehensive vignettes, is available at devonderaad.github.io/SNPfiltR/.
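The threshold-based filtering described above can be illustrated with a minimal Python sketch. This is not SNPfiltR's API (the package is written in R and works on vcfR objects). The data layout, the function names `hard_filter` and `filter_missing_by_snp`, and the thresholds are all hypothetical and chosen only to show the logic of masking low-depth or low-quality calls and then dropping sites with too much missing data.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: one list per SNP, one (genotype, depth, quality) tuple per sample.
# A genotype of None marks a missing call.
Call = Tuple[Optional[str], int, int]


def hard_filter(sites: List[List[Call]], min_depth: int, min_gq: int) -> List[List[Call]]:
    """Set calls below the depth or genotype-quality threshold to missing."""
    filtered = []
    for site in sites:
        new_site = []
        for genotype, depth, quality in site:
            if genotype is not None and (depth < min_depth or quality < min_gq):
                genotype = None  # mask the call, keep its depth and quality for reference
            new_site.append((genotype, depth, quality))
        filtered.append(new_site)
    return filtered


def filter_missing_by_snp(sites: List[List[Call]], max_missing: float) -> List[List[Call]]:
    """Keep only SNPs whose proportion of missing calls is at most max_missing."""
    kept = []
    for site in sites:
        missing = sum(1 for genotype, _, _ in site if genotype is None) / len(site)
        if missing <= max_missing:
            kept.append(site)
    return kept


# Toy data set: two SNPs scored in three samples (illustrative values only).
sites = [
    [("0/1", 12, 40), ("0/0", 3, 20), ("1/1", 15, 50)],
    [("0/0", 2, 10), ("0/1", 4, 15), ("0/0", 20, 60)],
]
masked = hard_filter(sites, min_depth=5, min_gq=30)
kept = filter_missing_by_snp(masked, max_missing=0.5)
print(len(kept), "of", len(sites), "SNPs retained")
```

Applying the per-genotype mask before the per-SNP missingness cutoff matters, because masking low-confidence calls increases the amount of missing data that the second filter then evaluates.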
Metabarcoding of DNA extracted from environmental or bulk specimen samples is increasingly used to detect plant and animal taxa in basic and applied biodiversity research because of its targeted nature that allows sequencing of genetic markers from many samples in parallel. To achieve this, PCR amplification is carried out with primers designed to target a taxonomically informative marker within a taxonomic group, and sample-specific nucleotide identifiers are added to the amplicons prior to sequencing. This enables assignment of the sequences back to the samples they originated from. Nucleotide identifiers can be added during the metabarcoding PCR and/or during ‘library preparation’, i.e. when amplicons are prepared for sequencing. Different strategies to achieve this labelling exist. All have advantages, challenges and limitations, some of which can lead to misleading results, and in the worst case compromise the fidelity of the metabarcoding data. Given the range of questions addressed using metabarcoding, the importance of ensuring that data generation is robust and fit for purpose should be at the forefront of practitioners seeking to employ metabarcoding for biodiversity assessments. Here, we present an overview of the three main workflows for sample-specific labelling and library preparation in metabarcoding studies on Illumina sequencing platforms. Further, we distil the key considerations for researchers seeking to select an appropriate metabarcoding strategy for their specific study. Ultimately, by gaining insights into the consequences of different metabarcoding workflows, we hope to further consolidate the power of metabarcoding as a tool to assess biodiversity across a range of applications.
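As a minimal illustration of how sample-specific nucleotide identifiers allow sequences to be assigned back to their samples, the following Python sketch demultiplexes reads by an exact match on a leading tag. The tag sequences, sample names, and the function `demultiplex` are hypothetical. The sketch assumes a single forward tag of fixed length, whereas real workflows often use tags on both ends of the amplicon and may handle mismatches differently.

```python
from collections import defaultdict
from typing import Dict, List


def demultiplex(reads: List[str], tag_to_sample: Dict[str, str], tag_length: int) -> Dict[str, List[str]]:
    """Assign each read to a sample based on its leading tag (exact match only)."""
    assigned = defaultdict(list)
    for read in reads:
        tag = read[:tag_length]
        sample = tag_to_sample.get(tag)
        if sample is None:
            # Unknown tag: keep the full read so it can be inspected later.
            assigned["unassigned"].append(read)
        else:
            # Known tag: strip the tag and keep the remaining sequence.
            assigned[sample].append(read[tag_length:])
    return dict(assigned)


# Hypothetical 4-nucleotide tags and toy reads.
tags = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTTTGACC", "TGCAGGCATA", "AAAACCCGGG"]
by_sample = demultiplex(reads, tags, tag_length=4)
for sample, sequences in sorted(by_sample.items()):
    print(sample, sequences)
```

Keeping unmatched reads in a separate bin, rather than discarding them silently, makes it easier to spot problems with the labelling step.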
Biodiversity studies greatly benefit from molecular tools, such as DNA metabarcoding, which provides an effective identification tool in biomonitoring and conservation programmes. The accuracy of species-level assignment, and consequent taxonomic coverage, relies on comprehensive DNA barcode reference libraries. The role of these libraries is to support species identification, but accidental errors in the generation of the barcodes may compromise their accuracy. Here we present an R-based application, BAGS (Barcode, Audit & Grade System), that performs automated auditing and annotation of cytochrome c oxidase subunit I (COI) sequence libraries, for a given taxonomic group of animals, available in the Barcode of Life Data System (BOLD). This is followed by implementing a qualitative ranking system that assigns one of five grades (A to E) to each species in the reference library, according to the attributes of the data and congruency of species names with sequences clustered in Barcode Index Numbers (BINs). Our ultimate goal is to allow researchers to obtain the most useful and reliable data, highlighting and segregating records according to their congruency. Different tests were performed to assess its usefulness and limitations. BAGS fills a significant gap in the current landscape of DNA barcoding research tools by quickly screening reference libraries to gauge the congruence status of data and facilitate the triage of ambiguous data for subsequent review. BAGS thereby has the potential to become a valuable addition to forthcoming DNA metabarcoding studies, contributing in the long term to improving the quality and reliability of public reference libraries globally.
Until recently many historical museum specimens were largely inaccessible to genomic inquiry, but high-throughput sequencing (HTS) approaches have allowed researchers to successfully sequence genomic DNA from dried and fluid-preserved museum specimens. In addition to preserved specimens, many museums contain large series of allozyme supernatant samples but the amenability of these samples to HTS has not yet been assessed. Here, we compared the performance of a target-capture approach using alternative sources of genomic DNA from ten specimens of spring salamanders (Plethodontidae: Gyrinophilus porphyriticus) collected 1985–1990: allozyme supernatants, allozyme homogenate pellets, and formalin-fixed tissues. We designed capture probes based on double-digest restriction-site associated (RADseq) sequencing derived loci from seven of the specimens and assessed the success and consistency of capture and RADseq technical replicates. This study design enabled direct comparisons of data quality and potential biases among the different datasets for phylogenomic and population genomic analyses. We found that in phylogenetic analyses, all replicates for a given specimen clustered together, but in principal component space, RADseq replicates did not cluster with corresponding capture-based replicates. SNP calls were on average 18.3% different between technical replicates, but these discrepancies were primarily due to differences in heterozygous/homozygous SNP calls. We demonstrate that both allozyme supernatant and formalin-fixed samples can be successfully used for population genomic analyses and we discuss ways to identify and reduce biases associated with combining capture and RADseq data.
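The comparison of SNP calls between technical replicates can be sketched in Python as follows. This is an illustrative calculation, not the authors' pipeline. It assumes genotypes are coded as alternate-allele counts (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, `None` = missing), and the function name `replicate_discordance` and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def replicate_discordance(rep_a: List[Optional[int]], rep_b: List[Optional[int]]) -> Tuple[float, float]:
    """Return (proportion of differing calls, share of differences that are het/hom).

    Only sites called in both replicates are compared.
    """
    compared = differing = het_hom = 0
    for a, b in zip(rep_a, rep_b):
        if a is None or b is None:
            continue
        compared += 1
        if a != b:
            differing += 1
            # Exactly one of the two calls is heterozygous.
            if (a == 1) != (b == 1):
                het_hom += 1
    if compared == 0:
        raise ValueError("no sites were called in both replicates")
    het_hom_share = het_hom / differing if differing else 0.0
    return differing / compared, het_hom_share


# Toy genotype vectors for two replicates of the same specimen.
replicate_1 = [0, 1, 2, 1, None, 0, 2, 1, 0, 2]
replicate_2 = [0, 0, 2, 1, 1, 0, 0, 1, 0, 2]
rate, het_share = replicate_discordance(replicate_1, replicate_2)
print(f"discordance: {rate:.3f}, het/hom share of differences: {het_share:.2f}")
```

Separating heterozygous/homozygous disagreements from homozygous/homozygous ones helps distinguish allelic dropout or low coverage from more serious genotyping conflicts.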
Genome sequencing methods and assembly tools have improved dramatically since the 2013 publication of draft genome assemblies for the mountain pine beetle, Dendroctonus ponderosae Hopkins (Coleoptera: Curculionidae). We conducted proximity ligation library sequencing and scaffolding to improve contiguity, and then used linkage mapping and recent bioinformatic tools for correction and further improvement. The new assemblies have dramatically improved contiguity and fewer gaps than the originals: N50 values increased 26- to 36-fold, and the number of gaps was reduced by half. Ninety percent of the content of the assemblies is now contained in 12 and 11 scaffolds for the female and male assemblies, respectively. Based on linkage mapping information, the 12 largest scaffolds in both assemblies represent all 11 autosomal chromosomes and the neo-X chromosome. These assemblies now have nearly chromosome-sized scaffolds and will be instrumental for studying genomic architecture, chromosome evolution, population genomics, functional genomics, and adaptation in this and other pest insects. We also identified regions in two chromosomes, including the ancestral-X portion of the neo-X chromosome, with elevated differentiation between northern and southern Canadian populations.
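The contiguity statistics quoted above (N50, and the number of scaffolds holding 90% of the assembly) can be computed as in the following minimal Python sketch. The scaffold lengths are toy values, not data from the study, and the function names are illustrative.

```python
from typing import List


def n50(lengths: List[int]) -> int:
    """Length of the scaffold at which the sorted cumulative sum reaches half the assembly."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def scaffolds_for_fraction(lengths: List[int], fraction: float) -> int:
    """Number of largest scaffolds needed to contain the given fraction of the assembly."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    target = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return count
    return len(lengths)


# Toy scaffold lengths (arbitrary units).
toy_lengths = [80, 70, 50, 30, 20]
print("N50:", n50(toy_lengths))  # 70
print("scaffolds holding 90%:", scaffolds_for_fraction(toy_lengths, 0.9))  # 4
```

A fold change in N50 between two assemblies is then the ratio of the two `n50` values.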
To associate specimens identified by molecular characters to other biological knowledge, we need reference sequences annotated by Linnaean taxonomy. In this paper, we 1) report the creation of a comprehensive reference library of DNA barcodes for the arthropods of an entire country (Finland), 2) publish this library, and 3) deliver a new identification tool based on this resource. The reference library contains mtDNA COI barcodes for 11,275 (43%) of 26,437 arthropod species known from Finland, including 10,811 (45%) of 23,956 insect species. To quantify the improvement in identification accuracy enabled by the current reference library, we ran 1,000 Finnish insect and spider species through the Barcode of Life Data system (BOLD) identification engine. Of these, 91% were correctly assigned to a unique species when compared to the new reference library alone, 85% were correctly identified when compared to BOLD with the new material included, and 75% with the new material excluded. To capitalize on this resource, we used the new reference material to train a probabilistic taxonomic assignment tool, FinPROTAX, scoring high success. For the full-length barcode region, the accuracy of taxonomic assignments at the level of classes, orders, families, subfamilies, tribes, genera, and species reached 99.9%, 99.9%, 99.8%, 99.7%, 99.4%, 96.8%, and 88.5%, respectively. The FinBOL arthropod reference library and FinPROTAX are available through the Finnish Biodiversity Information Facility (www.laji.fi). Overall, the FinBOL investment represents a massive capacity-transfer from the taxonomic community of Finland to all sectors of society.
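The coverage percentages reported above follow directly from the species counts, as this short Python check shows. The counts are taken from the text. The figure for non-insect arthropods is not stated in the abstract and is derived here by subtraction.

```python
def coverage_percent(barcoded: int, known: int) -> float:
    """Percentage of known species that have at least one barcode."""
    return 100.0 * barcoded / known


arthropods = coverage_percent(11_275, 26_437)
insects = coverage_percent(10_811, 23_956)
# Derived by subtraction: arthropod species that are not insects.
non_insects = coverage_percent(11_275 - 10_811, 26_437 - 23_956)

print(f"all arthropods: {arthropods:.1f}%")
print(f"insects: {insects:.1f}%")
print(f"non-insect arthropods: {non_insects:.1f}%")
```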
Metabarcoding of DNA extracted from community samples of whole organisms (whole organism community DNA, wocDNA) is increasingly being applied to terrestrial, marine and freshwater metazoan communities to provide rapid, accurate and high resolution data for novel molecular ecology research. The growth of this field has been accompanied by considerable development that builds on microbial metabarcoding methods to develop appropriate and efficient sampling and laboratory protocols for whole organism metazoan communities. However, considerably less attention has focused on ensuring bioinformatic methods are adapted and applied comprehensively in wocDNA metabarcoding. In this study we examined over 600 papers and identified 111 studies that performed COI metabarcoding of wocDNA. We then systematically reviewed the bioinformatic methods employed by these papers to identify the state-of-the-art. Our results show that the increasing use of wocDNA COI metabarcoding for metazoan diversity is characterised by a clear absence of bioinformatic harmonisation, and the temporal trends show little change in this situation. The reviewed literature showed (i) high heterogeneity across pipelines, tasks and tools used, (ii) limited or no adaptation of bioinformatic procedures to the nature of the COI fragment, and (iii) a worrying underreporting of tasks, software and parameters. Based upon these findings we propose a set of recommendations that we think the wocDNA metabarcoding community should consider to ensure that bioinformatic methods are appropriate, comprehensive and comparable. We believe that adhering to these recommendations will improve the long-term integrative potential of wocDNA COI metabarcoding for biodiversity science.
Prevailing 16S rRNA gene-amplicon methods for characterizing the bacterial microbiome are economical, but result in coarse taxonomic classifications, are subject to primer and 16S copy number biases, and do not allow for direct estimation of microbiome functional potential. While deep shotgun metagenomic sequencing can overcome many of these limitations, it is prohibitively expensive for large sample sets. We evaluated the ability of shallow shotgun metagenomic sequencing to characterize taxonomic and functional patterns in the fecal microbiome of a model population of feral horses (Sable Island, Canada). Since 2007, this unmanaged population has been the subject of an individual-based, long-term ecological study. Using deep shotgun metagenomic sequencing, we determined the sequencing depth required to accurately characterize the horse microbiome. In comparing conventional versus high-throughput shotgun metagenomic library preparation techniques, we validate the use of more cost-effective lab methods. Finally, we characterize similarities between 16S amplicon and shallow shotgun characterization of the microbiome, and demonstrate that the latter recapitulates biological patterns first described in a published amplicon dataset. Unlike amplicon data, we demonstrate how shallow shotgun metagenomic data also provided useful insights about microbiome functional potential which support previously hypothesized diet effects in this study system.
The use of NGS datasets has increased dramatically over the last decade; however, there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single Pinus contorta parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the SNP genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded one to two orders of magnitude larger numbers of SNPs and error rates, whereas FreeBayes, SAMtools and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers we standardized each SNP set to the same number of SNPs using additional filtering, where UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
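The logic of inferring error from parent-offspring mismatches can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: each haploid offspring allele is expected to match one of the two alleles called in the diploid parent, any other allele is counted as a mismatch, and missing calls (`None`) are skipped. The function name and the toy genotypes are hypothetical.

```python
from typing import List, Optional, Tuple

ParentCall = Optional[Tuple[str, str]]   # two alleles of the diploid parent, or None if missing
OffspringCall = Optional[str]            # single allele of a haploid offspring, or None if missing


def mismatch_rate(parent: List[ParentCall], offspring: List[List[OffspringCall]]) -> float:
    """Proportion of offspring calls carrying an allele absent from the parental genotype."""
    compared = mismatches = 0
    for haplotype in offspring:
        for parent_call, allele in zip(parent, haplotype):
            if parent_call is None or allele is None:
                continue
            compared += 1
            if allele not in parent_call:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable parent-offspring calls")
    return mismatches / compared


# Toy data: four SNPs in one parent and three haploid offspring.
parent_calls = [("A", "G"), ("C", "C"), ("T", "T"), None]
offspring_calls = [
    ["A", "C", "T", "G"],
    ["G", "C", "A", None],
    ["G", "T", "T", "A"],
]
rate = mismatch_rate(parent_calls, offspring_calls)
print(f"mismatch rate: {rate:.3f}")
```

A mismatch flags an error in either the parental or the offspring call, so this rate measures joint inconsistency rather than attributing the error to one sample.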
Gradient Forests is a machine learning algorithm that is gaining in popularity for studying the environmental drivers of genomic variation and for incorporating genomic information into climate change impact assessments. Here we provide the first experimental evaluation of the ability of ‘genomic offsets’ - a metric of climate maladaptation derived from Gradient Forests - to predict organismal responses to environmental change. We used high-throughput sequencing, genome scans, and several methods (including Gradient Forests) to identify candidate loci associated with climate adaptation in balsam poplar (Populus balsamifera L.). Individuals collected throughout balsam poplar’s range also were planted in two common garden experiments. We used Gradient Forests to relate candidate loci to environmental gradients and to predict the expected magnitude of response (i.e., the genetic offset) of populations when transplanted from their “home” environment to the new environments in the common gardens. We then compared the predicted genetic offsets to measurements of population performance in the common gardens. We found the expected inverse relationship between genetic offset and performance in the common gardens: populations with larger predicted genetic offsets performed worse in the common gardens than populations with smaller offsets. Also, genetic offset better predicted performance in the common gardens than did ‘naive’ climate distances. Our results provide preliminary evidence that genomic offsets may provide a first order estimate of the degree of expected maladaptation of populations exposed to rapid environmental change.
Populus has a wide ecogeographical range spanning the Northern Hemisphere, and exhibits abundant distinct species and hybrids globally. Populus tomentosa Carr. is widely distributed and cultivated in the eastern region of Asia, where it plays multiple important roles in forestry, agriculture, conservation, and urban horticulture. Reference genomes are available for several Populus species; however, our goal was to produce a very high quality de novo, chromosome-level genome assembly of P. tomentosa that could serve as a reference for evolutionary and ecological studies of hybrid speciation. Here, combining long-read sequencing and Hi-C scaffolding, we present a high-quality, haplotype-resolved genome assembly. The genome size was 740.2 Mb, with a contig N50 size of 5.47 Mb and a scaffold N50 size of 46.68 Mb, consisting of 38 chromosomes, as expected with the known diploid chromosome number (2n=2x=38). A total of 59,124 protein-coding genes were identified. Phylogenomic analyses revealed that P. tomentosa comprises two distinct subgenomes, which we demonstrate are likely to have resulted from hybridization between Populus adenopoda as the female parent and Populus alba var. pyramidalis as the male parent, approximately 3.93 Mya. Although the two subgenomes are highly colinear, significant structural variation was also found between them. Our study provides a valuable resource for ecological genetics and forest biotechnology.
We present the chromosome-level genome assembly of Dysdera silvatica Schmidt, 1981, a nocturnal ground-dwelling spider endemic to the Canary Islands. The genus Dysdera has undergone a remarkable diversification in this archipelago, mostly associated with shifts in the level of trophic specialization, making it an excellent model to study the genomic drivers of adaptive radiations. The new assembly (1.37 Gb; scaffold N50 of 174.2 Mb), generated using the chromosome conformation capture scaffolding technique, represents a continuity improvement of more than 4,500 times with respect to the previous version. The seven largest scaffolds or pseudochromosomes cover 87% of the total assembly size and match consistently with the seven chromosomes of the karyotype of this species, including the characteristic large X chromosome. To illustrate the value of this new resource we performed a comprehensive analysis of the two major arthropod chemoreceptor gene families (i.e., gustatory and ionotropic receptors). We identified 545 chemoreceptor sequences distributed across all pseudochromosomes, with a notable underrepresentation in the X chromosome. At least 54% of them localize in 83 genomic clusters with significantly lower evolutionary distances between them than the average of the family, suggesting a recent origin of many of them. This chromosome-level assembly is the first high-quality genome representative of the Synspermiata clade, and just the third among spiders, representing a valuable new resource to gain insights into the structure and organization of chelicerate genomes, including the role that structural variants, repetitive elements and large gene families played in the extraordinary biology of spiders.
Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.
Version of record in Molecular Ecology Resources: https://doi.org/10.1111/1755-0998.13402 The Maroni is one of the most speciose basins of the Guianas and hosts a megadiverse freshwater fish community. Although taxonomical references exist for both the Surinamese and Guyanese parts of the basin, these lists were mainly based on morphological identification and there are still taxonomical uncertainties concerning the status of several fish species. Here we present a barcode dataset of 1,284 COI sequences from 199 freshwater fish species (68.86% of the total number of strictly freshwater fishes from the basin) from 124 genera, 36 families, and 8 orders. DNA barcoding allowed for fast and efficient identification of all specimens studied and unveiled substantial cryptic diversity, with the detection of 20 putative cryptic species and 5 species flagged for re-identification. In order to explore global genetic patterns across the basin, genetic divergence landscapes were computed for 128 species, showing a global trend of high genetic divergence between the Surinamese south-west (Tapanahony and Paloemeu), the Guianese south-east (Marouini, Litany, Tampok, Lawa…), and the river mouth in the north. This could be explained either by lower levels of connectivity between these three main parts or by the exchange of individuals with the surrounding basins. A new method of ordination of genetic landscapes successfully assigned species into cluster groups based on their respective pattern of genetic divergence across the Maroni Basin: genetically homogenous species across the basin were effectively discriminated from species showing high spatial genetic fragmentation and possible lower capacity for dispersal.
Targeted sequencing is an increasingly popular Next Generation Sequencing (NGS) approach for studying populations, through focusing sequencing efforts on specific parts of the genome of a species of interest. Methodologies and tools for designing targeted baits are scarce but in high demand. Here, we present specific guidelines and considerations for designing capture sequencing experiments for population genetics for both neutral genomic regions and regions subject to selection. We describe the bait design process for three diverse fish species: Atlantic salmon, Atlantic cod and tiger shark, which was carried out in our research group, and provide an evaluation of the performance of our approach across both historical and modern samples. The workflow used for designing these three bait sets has been implemented in the R package supeRbaits, which encompasses our considerations and guidelines for bait design to benefit researchers and practitioners. The supeRbaits R package is user-friendly and versatile. It is written in C++ and implemented in R. The package and its manual are available from GitHub: https://github.com/BelenJM/supeRbaits
The Harbour porpoise (Phocoena phocoena) is a highly mobile cetacean species which primarily occurs in coastal and shelf waters across the Northern hemisphere. It inhabits heterogeneous seascapes that vary broadly in salinity and temperature. Here we produced 74 whole genomes at intermediate coverage to study the Harbour porpoise’s evolutionary history and investigate the role of local adaptation in the diversification into subspecies and populations. We identified ~6 million high quality SNPs sampled at 8 localities across the North Atlantic and adjacent waters, which we used for population structure, demographic, and genotype-environment association analyses. Our results support a genetic differentiation between three subspecies, and three distinct populations within the subspecies P.p. phocoena: Atlantic, Belt Sea and Proper Baltic Sea. Effective population size and Tajima’s D levels suggest a population contraction in both Black Sea and Iberian porpoises and a population expansion in the P.p. phocoena populations. Phylogenetic trees indicate a post-glacial colonization of Harbour porpoises from a southern refugium. Genotype-environment association analysis identified salinity as a major driver of genomic variation and we identified candidate genes putatively underlying adaptation to different salinity levels. Our study highlights the value of whole genome resequencing to unravel subtle population structure in highly mobile species and shows how strong environmental gradients and local adaptation may lead to population differentiation. The results have important conservation implications as we found high levels of inbreeding and low genetic diversity in the endangered Black Sea subspecies and identified the critically endangered Proper Baltic Sea porpoises as a separate population.
The bean bug (Riptortus pedestris) causes great economic losses of soybeans by piercing and sucking pods and seeds. Although R. pedestris has become the focus of numerous studies associated with insect–microbe interactions, plant–insect interactions, and pesticide resistance, a lack of genomic resources has limited deeper insights. In this study, we report the first R. pedestris genome at the chromosomal level using PacBio, Illumina, and Hi-C technologies. The assembled genome was 1.193 Gb in size with a contig N50 of 13.97 Mb. More than 95.7% of the total genome bases were successfully anchored to 6 unique chromosomes, with the scaffold N50 reaching 181.34 Mb. Genome resequencing of male and female individuals and chromosomic staining demonstrated that the sex chromosome system of R. pedestris is XO, and the shortest chromosome is the X chromosome. In total, 21,562 protein-coding genes were predicted, 21,320 of which were validated as being expressed in different tissues or different developmental stages. Evolutionary analysis demonstrated that R. pedestris and Oncopeltus fasciatus formed a sister group and split ∼35 million years ago. Additionally, a 5.04 Mb complete genome of symbiotic Serratia marcescens Rip1 was assembled, and the virulence factors that account for successful colonization in the host midgut were identified. The high-quality R. pedestris genome provides a valuable resource for further research, as well as for the pest management of bug pests.