DNA metabarcoding is an important tool for molecular ecology. However, its effectiveness hinges on the quality of reference sequence databases and classification parameters employed. Here we evaluate the performance of MiFish 12S taxonomic assignments using a case study of California Current Large Marine Ecosystem fishes to determine best practices for metabarcoding. Specifically, we use a taxonomy cross-validation by identity framework to compare classification performance between a global database comprised of all available sequences and a curated database that only includes sequences of fishes from the California Current Large Marine Ecosystem. We demonstrate that the curated, regional database provides higher assignment accuracy than the comprehensive global database. We also document a tradeoff between accuracy and misclassification across a range of taxonomic cutoff scores, highlighting the importance of parameter selection for taxonomic classification. Furthermore, we compared assignment accuracy with and without the inclusion of additionally generated reference sequences. To this end, we sequenced tissue from 605 species using the MiFish 12S primers, adding 253 species to GenBank’s existing 550 California Current Large Marine Ecosystem fish sequences. We then compared species and reads identified from seawater environmental DNA samples using global databases with and without our generated references, and the regional database. The addition of new references allowed for the identification of 16 native taxa and 17.0% of total reads from eDNA samples, including species with vast ecological and economic value. Together these results demonstrate the importance of comprehensive and curated reference databases for effective metabarcoding and the need for locus-specific validation efforts.
Non-random mating among individuals can lead to spatial clustering of genetically similar individuals and population stratification. This deviation from panmixia is commonly observed in natural populations. Consequently, individuals can have parentage in single populations or involving hybridization between differentiated populations. Accounting for this mixture and structure is important when mapping the genetics of traits and learning about the formative evolutionary processes that shape genetic variation among individuals and populations. Stratified genetic relatedness among individuals is commonly quantified using estimates of ancestry that are derived from a statistical model. Development of these models for polyploid and mixed-ploidy individuals and populations has lagged behind those for diploids. Here, we extend and test a hierarchical Bayesian model, called entropy, which can utilize low-depth sequence data to estimate genotype and ancestry parameters in autopolyploid and mixed-ploidy individuals (including sex chromosomes and autosomes within individuals). Our analysis of simulated data illustrated the trade-off between sequencing depth and genome coverage and found lower error associated with low depth sequencing across a larger fraction of the genome than with high depth sequencing across a smaller fraction of the genome. The model has high accuracy and sensitivity as verified with simulated data and through analysis of admixture among populations of diploid and tetraploid Arabidopsis arenosa.
Plant interactions are as important belowground as aboveground. Belowground plant interactions are, however, inherently difficult to quantify, as roots of different species are difficult to disentangle. Although molecular techniques have been successfully applied to quantify root abundance for a couple of decades, root identification and quantification in multi-species plant communities remain particularly challenging. Here we present a novel methodology, multi-species Genotyping By Sequencing (msGBS), as a next step to tackle this challenge. First, a multi-species meta-reference database containing thousands of gDNA clusters per species is created from GBS-derived High Throughput Sequencing (HTS) reads. Second, GBS-derived HTS reads from multi-species root samples are mapped to this meta-reference which, after a filtering procedure to increase the taxonomic resolution, allows the parallel quantification of multiple species. The msGBS signal of 111 mock-mixture root samples, with up to 8 plant species per sample, was used to calculate the within-species abundance. Optional subsequent calibration yielded the across-species abundance. The within- and across-species abundances correlated strongly with the biomass-based species abundance (R² range 0.72-0.94 and 0.85-0.98, respectively). Compared to a qPCR-based method previously used to analyze the same set of samples, msGBS provided similar results. Additional data on 11 congener species groups within 105 natural field root samples showed the high taxonomic resolution of the method. msGBS is highly scalable in terms of sensitivity and species numbers within samples, which is a major advantage over the qPCR method and advances our tools to reveal the hidden belowground interactions.
Microsporidia are obligate intracellular eukaryotic parasites that infect nearly all animal groups, including humans. The most common molecular methods for Microsporidia detection rely on species-targeting qPCR or end-point PCR using group-specific primers. However, these methods may not be specific enough or may fail in cases of mixed infection. We developed a method for parallel detection of both the microsporidian infection and the host species. We designed two new primer sets: one specific for the classical Microsporidia (targeting the hypervariable V5 region of ssu rDNA), and a second one targeting a shortened fragment of the COI gene (the standard metazoan DNA barcode); both markers are well suited for an NGS approach. Analysis of an ssu rDNA dataset representing 607 microsporidian species (120 genera) indicated that the V5 region enables identification of >98% of the species in the dataset (596/607). To test the method, we used microsporidians that infect mosquitoes in natural populations. Using mini-COI data, all field-collected mosquitoes were unambiguously assigned to seven species; among them, almost 60% of specimens (127/212) were positive for at least 11 different microsporidian species, including a new microsporidian ssu rDNA sequence (Microsporidium sp. PL01). Phylogenetic analysis of Microsporidium sp. PL01 ssu rDNA showed that this species belongs to one of the two main clades in the Terresporidia. In addition, the level of microsporidian mixed infections was relatively high (9.4%). The numbers of sequence reads for the OTUs suggest that Nosema spp. could benefit from occurring in co-infections; however, this observation should be re-tested using more intensive host sampling. The proposed method for detection of Microsporidia can be applied to all types of DNA extracts, including medical and environmental samples.
Dispersal abilities play a crucial role in shaping the extent of population genetic structure, with more mobile species being panmictic over large geographic ranges and less mobile ones organized in meta-populations exchanging migrants to different degrees. In turn, population structure directly influences the coalescence pattern of the sampled lineages, but its consequences for the variation in effective population size (Ne) over time estimated with unstructured demographic models remain poorly understood. This knowledge is nevertheless crucial for biologically interpreting the observed Ne trajectory and for devising conservation strategies for endangered species. Here we investigated the demographic history of four shark species (Carcharhinus melanopterus, Carcharhinus limbatus, Carcharhinus amblyrhynchos, Galeocerdo cuvier) distributed in the Indo-Pacific and sampled off New Caledonia, which differ in their degree of endangerment and in life history traits related to dispersal. We compared several evolutionary scenarios representing both structured (meta-population) and unstructured models and then inferred the Ne variation through time. By performing extensive coalescent simulations, we provided a general framework relating the underlying population structure and the observed Ne dynamics. On this basis, we concluded that the recent decline observed in three out of the four considered species when assuming unstructured demographic models can be explained by the presence of population structure. Furthermore, we demonstrated the limits of inferences based solely on the site frequency spectrum and warn that statistics based on linkage disequilibrium will be needed to exclude recent demographic events affecting meta-populations.
The bean bug (Riptortus pedestris) causes great economic losses in soybean by piercing and sucking pods and seeds. Although R. pedestris has become the focus of numerous studies associated with insect–microbe interactions, plant–insect interactions, and pesticide resistance, a lack of genomic resources has limited deeper insights. In this study, we report the first chromosome-level R. pedestris genome, assembled using PacBio, Illumina, and Hi-C technologies. The assembled genome was 1.193 Gb in size with a contig N50 of 13.97 Mb. More than 95.7% of the total genome bases were successfully anchored to 6 unique chromosomes, with the scaffold N50 reaching 181.34 Mb. Genome resequencing of male and female individuals and chromosome staining demonstrated that the sex chromosome system of R. pedestris is XO, and that the shortest chromosome is the X chromosome. In total, 21,562 protein-coding genes were predicted, 21,320 of which were validated as being expressed in different tissues or different developmental stages. Evolutionary analysis demonstrated that R. pedestris and Oncopeltus fasciatus formed a sister group and split ∼35 million years ago. Additionally, a 5.04 Mb complete genome of the symbiont Serratia marcescens Rip1 was assembled, and the virulence factors that account for successful colonization of the host midgut were identified. The high-quality R. pedestris genome provides a valuable resource for further research, as well as for the management of bug pests.
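The contig and scaffold N50 values quoted above are summary statistics of assembly contiguity: N50 is the length of the sequence at which, going from longest to shortest, the running total first reaches at least half of the assembly size. The sketch below illustrates that standard definition only. The function name and the toy contig lengths are hypothetical, and it is not the software used by the authors.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units, not real data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: 80 + 70 = 150, half of the 300 total
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness, which is why abstracts such as this one report it alongside anchoring rates and gene-level validation.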
Partial clonality is widespread across the tree of life, but most population genetics models are designed for exclusively clonal or sexual organisms. This gap hampers our understanding of the influence of clonality on evolutionary trajectories and the interpretation of population genetics data. We performed forward simulations of diploid populations at increasing rates of clonality (c), analysed their relationships with genotypic (clonal richness, R, and distribution of clonal sizes, Pareto β) and genetic (FIS and linkage disequilibrium) indices, and tested predictions of c from population genetics data through supervised machine learning. Two complementary behaviours emerged from the probability distributions of genotypic and genetic indices with increasing c. While the impact of c on R and Pareto β was easily described by simple mathematical equations, its effects on genetic indices were noticeable only at the highest levels (c>0.95). Consequently, genotypic indices allowed reliable estimates of c, while genetic descriptors led to poorer performances when c<0.95. These results provide clear baseline expectations for genotypic and genetic diversity and dynamics under partial clonality. Worryingly, however, the use of realistic sample sizes to acquire empirical data systematically led to gross underestimates (often of one to two orders of magnitude) of c, suggesting that many interpretations hitherto proposed in the literature, mostly based on genotypic richness, should be reappraised. We propose future avenues to derive realistic confidence intervals for c and show that, although still approximate, a supervised learning method would greatly improve the estimation of c from population genetics data.
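For readers unfamiliar with the genotypic indices named above, clonal richness is commonly defined as R = (G - 1) / (N - 1), where G is the number of distinct multilocus genotypes and N is the number of sampled units. The sketch below computes R under that common definition. The function name and the toy genotype labels are hypothetical, and the simulations and machine-learning estimator described in the abstract are not reproduced here.

```python
def clonal_richness(genotypes):
    """Clonal richness R = (G - 1) / (N - 1).

    `genotypes` holds one hashable multilocus genotype label per sampled
    unit; G is the number of distinct labels and N the sample size.
    """
    n = len(genotypes)
    if n < 2:
        raise ValueError("clonal richness needs at least two sampled units")
    g = len(set(genotypes))
    return (g - 1) / (n - 1)


# Hypothetical sample: six ramets belonging to three distinct genotypes.
sample = ["A", "A", "A", "B", "B", "C"]
print(clonal_richness(sample))  # (3 - 1) / (6 - 1) = 0.4
```

R is 0 when every sampled unit shares one genotype and 1 when all are distinct, so it depends on how many units are sampled, which is the sample-size issue the abstract raises.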
Biodiversity studies greatly benefit from molecular tools, such as DNA metabarcoding, which provides an effective identification tool in biomonitoring and conservation programmes. The accuracy of species-level assignment, and consequent taxonomic coverage, relies on comprehensive DNA barcode reference libraries. The role of these libraries is to support species identification, but accidental errors in the generation of the barcodes may compromise their accuracy. Here we present an R-based application, BAGS (Barcode, Audit & Grade System), that performs automated auditing and annotation of cytochrome c oxidase subunit I (COI) sequence libraries available in the Barcode of Life Data System (BOLD) for a given taxonomic group of animals. It then applies a qualitative ranking system that assigns one of five grades (A to E) to each species in the reference library, according to the attributes of the data and the congruency of species names with sequences clustered in Barcode Index Numbers (BINs). Our ultimate goal is to allow researchers to obtain the most useful and reliable data, highlighting and segregating records according to their congruency. Different tests were performed to assess its usefulness and limitations. BAGS fills a significant gap in the current landscape of DNA barcoding research tools by quickly screening reference libraries to gauge the congruence status of data and facilitate the triage of ambiguous data for subsequent review. BAGS thereby has the potential to become a valuable addition to forthcoming DNA metabarcoding studies, contributing in the long term to improving the quality and reliability of public reference libraries globally.
Although the use and development of molecular biomonitoring tools based on eNAs (environmental nucleic acids; eDNA and eRNA) have gained broad interest for the quantification of biodiversity in natural ecosystems, studies investigating the impact of site-specific physicochemical parameters on eNA-based detection methods (particularly eRNA) remain scarce. Here, we used a controlled laboratory microcosm experiment to comparatively assess the environmental degradation of eDNA and eRNA across an acid-base gradient following complete removal of the progenitor organism (Daphnia pulex). Using water samples collected over a 30-day period, eDNA and eRNA copy numbers were quantified using a droplet digital PCR (ddPCR) assay targeting the mitochondrial cytochrome c oxidase subunit I (COI) gene of D. pulex. We found that eRNA decayed more rapidly than eDNA at all pH conditions tested, with detectability—predicted by an exponential decay model—for up to 57 hours (eRNA; neutral pH) and 143 days (eDNA; acidic pH) post organismal removal. Decay rates for eDNA were significantly higher in neutral and alkaline conditions than in acidic conditions, while decay rates for eRNA did not differ significantly among pH levels. Collectively, our findings provide the basis for a predictive framework assessing the persistence and degradation dynamics of eRNA and eDNA across a range of ecologically relevant pH conditions, establish the potential for eRNA to be used in spatially and temporally sensitive biomonitoring studies (as it is detectable across a range of pH levels), and may be used to inform future sampling strategies in aquatic habitats.
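The detectability horizons reported above are described as predictions of an exponential decay model. A minimal sketch of that reasoning follows: if copy number follows N(t) = N0 · exp(-k·t), then the time at which it falls to a detection limit is t = ln(N0 / limit) / k. The abstract does not state how the model was fitted, so the log-linear least-squares fit, the function names, and all numbers below are assumptions for illustration only.

```python
import math


def fit_decay(times, copies):
    """Fit ln(copies) = ln(n0) - k * t by ordinary least squares.

    Returns (n0, k). Assumes all copy numbers are strictly positive.
    """
    logs = [math.log(c) for c in copies]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope


def time_to_limit(n0, k, limit):
    """Time at which n0 * exp(-k * t) falls to `limit`."""
    return math.log(n0 / limit) / k


# Synthetic, noise-free data (hypothetical values, arbitrary time units).
times = [0, 10, 20, 30]
copies = [1000 * math.exp(-0.1 * t) for t in times]
n0, k = fit_decay(times, copies)
print(round(n0, 3), round(k, 3))
print(round(time_to_limit(n0, k, limit=1.0), 2))
```

Under this model the predicted persistence scales inversely with the decay rate k, so a modest difference in k between pH treatments can translate into a large difference in how long a signal remains detectable.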
DNA metabarcoding is routinely used for biodiversity assessment, especially targeting highly diverse groups for which limited taxonomic expertise is available. Various protocols are currently in use, although standardization is key to its application in large-scale monitoring. DNA metabarcoding of arthropod bulk samples can be either conducted destructively from sample tissue, or non-destructively from sample fixative or lysis buffer. Non-destructive methods are highly desirable for the preservation of sample integrity but have yet to be experimentally evaluated in detail. Here, we compare diversity estimates from 14 size sorted Malaise trap samples processed consecutively with three non-destructive approaches (one using fixative ethanol and two using lysis buffers) and one destructive approach (using homogenized tissue). Extraction from commercial lysis buffer yielded comparable species richness and high overlap in species composition to the ground tissue extracts. A significantly divergent community was detected from preservative ethanol-based DNA extraction. No consistent trend in species richness was found with increasing incubation time in lysis buffer. These results indicate that non-destructive DNA extraction from incubation in lysis buffer could provide a comparable alternative to destructive approaches with the added advantage of preserving the specimens for post-metabarcoding taxonomic work.
Here we present an annotated, chromosome-anchored genome assembly for Lake Trout (Salvelinus namaycush), a highly diverse salmonid species of notable conservation concern and an excellent model for research on adaptation and speciation. We leveraged Pacific Biosciences long-read sequencing, paired-end Illumina sequencing, proximity ligation (Hi-C), and a previously published linkage map to produce a highly contiguous assembly composed of 7,378 contigs (contig N50 = 1.8 Mb) assigned to 4,120 scaffolds (scaffold N50 = 44.975 Mb). In total, 84.7% of the genome was assigned to 42 chromosome-sized scaffolds and 93.2% of Benchmarking Universal Single Copy Orthologs were recovered, putting this assembly on par with the best currently available salmonid genomes. Estimates of genome size based on k-mer frequency analysis were highly similar to the total size of the finished genome, suggesting that the entirety of the genome was recovered. A mitochondrial genome assembly was also produced. Self-vs-self synteny analysis allowed us to identify homeologs resulting from the salmonid-specific autotetraploid event (Ss4R), and alignment with three other salmonid species allowed us to identify homologous chromosomes in those species. We also generated multiple resources useful for future genomic research on Lake Trout, including a repeat library and a sex-averaged recombination map. A novel RNA sequencing dataset was also used to produce a publicly available set of gene annotations using the National Center for Biotechnology Information Eukaryotic Genome Annotation Pipeline. Potential applications of these resources to population genetics and the conservation of native populations are discussed.
Here I describe the novel R package SNPfiltR and demonstrate its functionalities as the backbone of a customizable, reproducible SNP filtering pipeline implemented exclusively via the widely adopted R programming language. SNPfiltR extends existing SNP filtering functionalities by automating the visualization of key parameters such as depth, quality, and missing data, then allowing users to set filters based on optimized thresholds, all within a single, cohesive working environment. All SNPfiltR functions require a vcfR object as input, which can be easily generated by reading a SNP dataset stored as a standard vcf file into an R working environment using the function read.vcfR() from the R package vcfR. Performance benchmarking reveals that for moderately sized SNP datasets (up to 50M genotypes with associated quality information), SNPfiltR performs filtering with efficiency comparable to that of current state-of-the-art command-line programs. These benchmarking results indicate that for most reduced-representation genomic datasets, SNPfiltR is an ideal choice for investigating, visualizing, and filtering SNPs as part of a cohesive and easily documentable bioinformatic pipeline. The SNPfiltR package can be downloaded from CRAN with the command install.packages("SNPfiltR"), and a development version is available from GitHub at github.com/DevonDeRaad/SNPfiltR. Additionally, thorough documentation for SNPfiltR, including multiple comprehensive vignettes, is available at devonderaad.github.io/SNPfiltR/.
Populus has a wide ecogeographical range spanning the Northern Hemisphere, and comprises many distinct species and hybrids globally. Populus tomentosa Carr. is widely distributed and cultivated in the eastern region of Asia, where it plays multiple important roles in forestry, agriculture, conservation, and urban horticulture. Although reference genomes are available for several Populus species, our goal was to produce a very high-quality, de novo, chromosome-level genome assembly for P. tomentosa that could serve as a reference for evolutionary and ecological studies of hybrid speciation. Here, combining long-read sequencing and Hi-C scaffolding, we present a high-quality, haplotype-resolved genome assembly. The genome size was 740.2 Mb, with a contig N50 size of 5.47 Mb and a scaffold N50 size of 46.68 Mb, consisting of 38 chromosomes, as expected from the known diploid chromosome number (2n=2x=38). A total of 59,124 protein-coding genes were identified. Phylogenomic analyses revealed that P. tomentosa comprises two distinct subgenomes, which we demonstrate are likely to have resulted from hybridization between Populus adenopoda as the female parent and Populus alba var. pyramidalis as the male parent, approximately 3.93 Mya. Although the two subgenomes are highly colinear, significant structural variation was also found between them. Our study provides a valuable resource for ecological genetics and forest biotechnology.
The analysis of genomic data can be an intimidating process, particularly for researchers who are not experienced programmers. Commonly used analyses are spread out across programs, each of which requires its own input format, and data must often be wrangled and re-wrangled into new formats to split the data according to categorical metadata variables, such as population or family. Here, we introduce snpR, an R package that allows for user-friendly processing of SNP genomic data by automating data sub-setting and processing across categorical metadata, integrating approaches contained in many different packages under a single ecosystem, and allowing for iterative, efficient analysis focused on a single R object across an entire analysis pipeline.
Genetic monitoring using non-invasive samples provides a complement or alternative to traditional population monitoring methods. However, Next Generation Sequencing approaches to monitoring typically require high-quality DNA, and the use of non-invasive samples (e.g. scat) is often challenged by poor DNA quality and contamination by non-target species. One promising solution is a highly multiplexed sequencing approach called Genotyping-in-thousands by sequencing (GT-seq), which can enable cost-efficient genomics-based monitoring of populations based on non-invasively collected samples. Here, we develop and validate a GT-seq panel of 324 single nucleotide polymorphisms (SNPs) optimized for genotyping polar bears based on DNA from non-invasively collected fecal samples. We demonstrate 1) successful GT-seq genotyping of DNA from a range of sample sources, including successful genotyping of 85.7% of non-invasively collected fecal samples determined to contain polar bear DNA, and 2) that we can reliably differentiate individuals, ascertain sex, assess relatedness, and resolve population structure of Canadian polar bear subpopulations based on this panel. Our GT-seq data reveal spatial-genetic patterns similar to those of previous polar bear studies, but at lower cost per sample and using non-invasively collected samples, indicating the potential of this approach for population monitoring. This GT-seq panel provides the foundation for a non-invasive toolkit for polar bear monitoring and contributes to community-based programs, a framework which may serve as a model for wildlife management and contribute to conservation and policy for species worldwide.
Metabarcoding of DNA extracted from environmental or bulk specimen samples is increasingly used to detect plant and animal taxa in basic and applied biodiversity research because of its targeted nature that allows sequencing of genetic markers from many samples in parallel. To achieve this, PCR amplification is carried out with primers designed to target a taxonomically informative marker within a taxonomic group, and sample-specific nucleotide identifiers are added to the amplicons prior to sequencing. This enables assignment of the sequences back to the samples they originated from. Nucleotide identifiers can be added during the metabarcoding PCR and/or during ‘library preparation’, i.e. when amplicons are prepared for sequencing. Different strategies to achieve this labelling exist. All have advantages, challenges and limitations, some of which can lead to misleading results, and in the worst case compromise the fidelity of the metabarcoding data. Given the range of questions addressed using metabarcoding, the importance of ensuring that data generation is robust and fit for purpose should be at the forefront of practitioners seeking to employ metabarcoding for biodiversity assessments. Here, we present an overview of the three main workflows for sample-specific labelling and library preparation in metabarcoding studies on Illumina sequencing platforms. Further, we distil the key considerations for researchers seeking to select an appropriate metabarcoding strategy for their specific study. Ultimately, by gaining insights into the consequences of different metabarcoding workflows, we hope to further consolidate the power of metabarcoding as a tool to assess biodiversity across a range of applications.
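The role of sample-specific nucleotide identifiers described above can be illustrated with a toy demultiplexing step, in which each read is assigned back to its sample by the tag at its start. The sketch below assumes a single forward tag of fixed length and exact matching only. The function name, tag sequences, and sample names are hypothetical, and real workflows typically use dual tags or indexes and tolerate sequencing errors.

```python
def demultiplex(reads, tag_to_sample, tag_length):
    """Assign reads to samples by an exact-match tag at the read start.

    Returns (by_sample, unassigned): `by_sample` maps each sample name to
    its reads with the tag trimmed off; `unassigned` keeps reads whose
    leading bases match no known tag.
    """
    by_sample = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return by_sample, unassigned


# Hypothetical tags and reads, for illustration only.
tags = {"ACGT": "site_1", "TTAG": "site_2"}
reads = ["ACGTTTGACC", "TTAGGGCATA", "GGGGAAAA"]
assigned, leftover = demultiplex(reads, tags, tag_length=4)
print(assigned)
print(leftover)
```

Reads that land in the unassigned list are simply lost, whereas a read carrying the wrong but valid tag would be silently credited to another sample, which is the kind of labelling failure that can compromise the fidelity of metabarcoding data.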
Interactions of organisms with their environment are complex, and environmental regulation at different levels of biological organization is often non-linear. Therefore, the genotype to phenotype continuum requires study at multiple levels of organization. While studies of transcriptome regulation are now common for many species, quantitative studies of environmental effects on proteomes are needed. Here we report the generation of a data-independent acquisition (DIA) assay library that enables simultaneous targeted proteomics of thousands of Oreochromis niloticus kidney proteins using a label- and gel-free workflow that is well suited for ecologically relevant field samples. We demonstrate the usefulness of this DIA assay library by discerning environmental effects on the kidney proteome of O. niloticus. Moreover, we demonstrate that the DIA assay library approach generates data that are complementary rather than redundant to transcriptomics data. Transcript and protein abundance differences in kidneys of tilapia acclimated to freshwater and brackish water (25 g/kg) were correlated for 2114 unique genes. A high degree of non-linearity in salinity-dependent regulation of transcriptomes and proteomes was revealed, suggesting that the regulation of O. niloticus renal function by environmental salinity relies heavily on post-transcriptional mechanisms. The application of functional enrichment analyses using STRING and KEGG to DIA assay datasets is demonstrated by identifying myo-inositol metabolism, antioxidant and xenobiotic functions, and signaling mechanisms as key elements controlled by salinity in tilapia kidneys. The DIA assay library resource presented here can be adopted for other tissues and other organisms to study proteome dynamics during changing ecological contexts.
DNA metabarcoding has become a powerful approach for analyzing complex communities from environmental samples, but there are still methodological challenges limiting its full potential. While conserved DNA markers, like 16S and 18S, often cannot discriminate among closely related species, other more variable markers, like the fungal ITS region, may include considerable intraspecific variation, which can lead to over-splitting of species during DNA metabarcoding analyses. Here we assess the effects of intraspecific sequence variation in DNA metabarcoding by analyzing local populations of eleven fungal species. We investigated the allelic diversity of ITS2 haplotypes using both Sanger sequencing and high throughput sequencing (HTS), coupled with error correction with the software DADA2. All focal species except one included some level of intraspecific variation in the ITS2 region. Overall, we observed a high correspondence between haplotypes generated by Sanger sequencing and HTS, with the exception of a few additional haplotypes detected by only one of the two approaches. These extra haplotypes, often occurring at low frequencies, were likely due to PCR and sequencing errors or intragenomic variation in the rDNA region. The presence of intraspecific (and possibly intragenomic) variation in ITS2 suggests that haplotypes (or ASVs) should not be used as basic units in ITS-based fungal community analyses, and that an extra clustering step is needed to approach species-level resolution.