Over the last two decades, there has been a huge increase in our understanding of microbial diversity, structure and composition enabled by high throughput sequencing (HTS) technologies. Yet, it is unclear how the number of sequences translates to the number of cells or species within the community. Additional observational data may be required to ensure relative abundance patterns from sequence reads are biologically meaningful, or presence-absence data may be used instead of abundance. The goal is to obtain robust community abundance data, simultaneously, from environmental samples. In this issue of Molecular Ecology Resources, Karlusich et al. (2022) describe a new method for quantifying phytoplankton cell abundance. Using Tara Oceans datasets, the authors propose the photosynthetic gene psbO for reporting accurate relative abundance of the entire phytoplankton community from metagenomic data. The authors demonstrate improved correlations with traditional optical methods including microscopy and flow cytometry, improving upon current molecular identification, which typically uses rRNA marker genes. Furthermore, to facilitate application of their approach, the authors curated a psbO gene database for accessible taxonomic queries. This is an important step towards improving species abundance estimates from molecular data and eventually reporting absolute species abundance, enhancing our understanding of community dynamics.
The analysis of genomic data can be an intimidating process, particularly for researchers who are not experienced programmers. Commonly used analyses are spread out across programs, each of which requires its own input format, and data must often be wrangled and re-wrangled into new formats to split the data according to categorical metadata variables, such as population or family. Here, we introduce snpR, an R package that allows for user-friendly processing of SNP genomic data by automating data sub-setting and processing across categorical metadata, integrating approaches contained in many different packages under a single ecosystem, and allowing for iterative, efficient analysis focused on a single R object across an entire analysis pipeline.
Background: Microbiome studies are often limited by a lack of statistical power due to small sample sizes and a large number of features. This problem is exacerbated in correlative studies of multi-omic datasets. Statistical power can be increased by finding and summarizing modules of correlated observations, which is one dimensionality reduction method. Additionally, modules provide biological insight as correlated groups of microbes can have relationships among themselves. Results: To address these challenges, we developed SCNIC: Sparse Cooccurrence Network Investigation for compositional data. SCNIC is open-source software that can generate correlation networks and detect and summarize modules of highly correlated features. Modules can be formed using either the Louvain Modularity Maximization (LMM) algorithm or a Shared Minimum Distance algorithm (SMD) that we newly describe here and relate to LMM using simulated data. We applied SCNIC to two published datasets and we achieved increased statistical power and identified microbes that not only differed across groups, but also correlated strongly with each other, suggesting shared environmental drivers or cooperative relationships among them. Conclusions: SCNIC provides an easy way to generate correlation networks, identify modules of correlated features and summarize them for downstream statistical analysis. Although SCNIC was designed considering properties of microbiome data, such as compositionality and sparsity, it can be applied to a variety of data types including metabolomics data and used to integrate multiple data types. SCNIC allows for the identification of functional microbial relationships at scale while increasing statistical power through feature reduction.
Conducting large-scale phylogeographic studies to understand processes affecting population structure and genetic diversity across multiple species is difficult because the key genetic (NCBI) and spatial (GBIF) repositories are disconnected. In this issue of Molecular Ecology Resources, Pelletier et al. (2022) demonstrate the power of connecting these in the program phylogatR. This program assembled 87,852 species and 102,268 sequence alignments in a taxonomic hierarchy, yielding multiple sequence alignments per species, mainly for animals (88%), composed mostly of mtDNA data. The authors discuss several caveats with these alignments and provide flags identifying particular problems associating locality and genetic data with certain taxa (e.g., multiple localities per individual). They test the expectation that nucleotide diversity should increase with area, but find a significant relationship in only 32% of taxa, with no clear taxonomic or ecological factors accounting for this. To examine the potential of this program, I tested the idea that the degree of population expansion should increase with latitude given potential environmental stability in the tropics and instability in temperate regions. In under two hours, I downloaded all squamates (lizards and snakes) and regressed Tajima’s D on latitude and found a weak but significant negative relationship, indicating a potential association between latitude and population expansion. The phylogatR database is a powerful resource for researchers wanting to test the relationship between genetic diversity and some aspect of space or environment.
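The Tajima’s D regression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow used in the perspective: the function names `tajimas_d` and `ols_slope` are hypothetical, alignments are assumed to be gap-free and of equal length, and the example values are invented. D is computed with the standard Tajima (1989) constants; negative values are commonly read as a signal of population expansion.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (gap-free input assumed)."""
    n = len(seqs)
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined there.
        raise ValueError("need at least 4 sequences")
    length = len(seqs[0])
    # S: number of segregating (variable) sites in the alignment.
    S = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    if S == 0:
        return float("nan")
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Invented toy alignment: three singleton sites, which gives a negative D.
    print(tajimas_d(["AAAA", "TAAA", "ATAA", "AATA"]))
    # Invented per-species (latitude, D) values, purely for illustration.
    latitudes = [5.0, 20.0, 35.0, 50.0]
    d_values = [0.4, 0.1, -0.3, -0.6]
    print(ols_slope(latitudes, d_values))
```

A negative slope in such a regression would correspond to more negative D at higher latitudes; a significance test for the slope is omitted from this sketch.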
Dietary metabarcoding has vastly improved our ability to analyse the diets of animals, but it is hampered by a plethora of technical limitations including potentially reduced data output due to the disproportionate amplification of the DNA of the focal predator, here termed ‘the predator problem’. We review the various methods commonly used to overcome this problem, from deeper sequencing to exclusion of predator DNA during PCR, and how they may interfere with increasingly common multi-predator-taxon studies. We suggest that multi-primer approaches with an emphasis on achieving both depth and breadth of prey detections may overcome the issue to some extent, although multi-taxon studies require further consideration, as highlighted by an empirical example. We also review several alternative methods for reducing the prevalence of predator DNA that are conceptually promising but require additional empirical examination. The predator problem is a key constraint on molecular dietary analyses but, through this synthesis, we hope to guide researchers in overcoming this in an effective and pragmatic way.
Prevailing 16S rRNA gene-amplicon methods for characterizing the bacterial microbiome are economical, but result in coarse taxonomic classifications, are subject to primer and 16S copy number biases, and do not allow for direct estimation of microbiome functional potential. While deep shotgun metagenomic sequencing can overcome many of these limitations, it is prohibitively expensive for large sample sets. We evaluated the ability of shallow shotgun metagenomic sequencing to characterize taxonomic and functional patterns in the fecal microbiome of a model population of feral horses (Sable Island, Canada). Since 2007, this unmanaged population has been the subject of an individual-based, long-term ecological study. Using deep shotgun metagenomic sequencing, we determined the sequencing depth required to accurately characterize the horse microbiome. In comparing conventional versus high-throughput shotgun metagenomic library preparation techniques, we validate the use of more cost-effective lab methods. Finally, we characterize similarities between 16S amplicon and shallow shotgun characterization of the microbiome, and demonstrate that the latter recapitulates biological patterns first described in a published amplicon dataset. Unlike amplicon data, we demonstrate how shallow shotgun metagenomic data also provided useful insights about microbiome functional potential which support previously hypothesized diet effects in this study system.
DNA metabarcoding is routinely used for biodiversity assessment, especially targeting highly diverse groups for which limited taxonomic expertise is available. Various protocols are currently in use, although standardization is key to its application in large-scale monitoring. DNA metabarcoding of arthropod bulk samples can be either conducted destructively from sample tissue, or non-destructively from sample fixative or lysis buffer. Non-destructive methods are highly desirable for the preservation of sample integrity but have yet to be experimentally evaluated in detail. Here, we compare diversity estimates from 14 size sorted Malaise trap samples processed consecutively with three non-destructive approaches (one using fixative ethanol and two using lysis buffers) and one destructive approach (using homogenized tissue). Extraction from commercial lysis buffer yielded comparable species richness and high overlap in species composition to the ground tissue extracts. A significantly divergent community was detected from preservative ethanol-based DNA extraction. No consistent trend in species richness was found with increasing incubation time in lysis buffer. These results indicate that non-destructive DNA extraction from incubation in lysis buffer could provide a comparable alternative to destructive approaches with the added advantage of preserving the specimens for post-metabarcoding taxonomic work.
Environmental DNA (eDNA) analyses are powerful for describing marine biodiversity but must be optimized for their effective use in routine monitoring. To maximize eDNA detection probabilities of sparsely distributed populations, water samples are usually concentrated from larger volumes and filtered using fine-pore membranes, often a significant cost-time bottleneck in the workflow. This study aimed to streamline eDNA sampling by investigating plankton net versus bucket sampling and direct versus sequential filtration, including self-preserving filters. Biodiversity was assessed using metabarcoding of the small ribosomal subunit (18S rRNA) and mitochondrial cytochrome c oxidase I (COI) genes. Multi-species detection probabilities were estimated for each workflow using a probabilistic occupancy modelling approach. Significant workflow-related differences in biodiversity metrics were found. The highest amplicon sequence variant (ASV) richness was attained by bucket sampling combined with self-preserving filters, comprising a large portion of micro-plankton. Less diversity but more metazoan taxa were captured in the net samples combined with 5 µm pore size filters. Pre-filtered 1.2 µm samples yielded few or no unique ASVs. The highest average (~32%) metazoan detection probabilities in the 5 µm pore size net samples confirmed the effectiveness of pre-concentrating plankton for biodiversity screening. These results contribute to streamlining eDNA sampling protocols for uptake and implementation in marine biodiversity research and surveillance.
Although the use and development of molecular biomonitoring tools based on eNAs (environmental nucleic acids; eDNA and eRNA) have gained broad interest for the quantification of biodiversity in natural ecosystems, studies investigating the impact of site-specific physicochemical parameters on eNA-based detection methods (particularly eRNA) remain scarce. Here, we used a controlled laboratory microcosm experiment to comparatively assess the environmental degradation of eDNA and eRNA across an acid-base gradient following complete removal of the progenitor organism (Daphnia pulex). Using water samples collected over a 30-day period, eDNA and eRNA copy numbers were quantified using a droplet digital PCR (ddPCR) assay targeting the mitochondrial cytochrome c oxidase subunit I (COI) gene of D. pulex. We found that eRNA decayed more rapidly than eDNA at all pH conditions tested, with detectability—predicted by an exponential decay model—for up to 57 hours (eRNA; neutral pH) and 143 days (eDNA; acidic pH) post organismal removal. Decay rates for eDNA were significantly higher in neutral and alkaline conditions than in acidic conditions, while decay rates for eRNA did not differ significantly among pH levels. Collectively, our findings provide the basis for a predictive framework assessing the persistence and degradation dynamics of eRNA and eDNA across a range of ecologically relevant pH conditions, establish the potential for eRNA to be used in spatially and temporally sensitive biomonitoring studies (as it is detectable across a range of pH levels), and may be used to inform future sampling strategies in aquatic habitats.
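As a rough illustration of how an exponential decay model yields a predicted detectability window, the sketch below fits N(t) = N0 · exp(−k·t) by log-linear least squares and solves for the time at which the copy number reaches a detection limit, t = ln(N0 / limit) / k. The fitting procedure, the function names (`fit_decay`, `time_to_limit`) and all numbers are assumptions made for this example; the abstract does not specify how the authors fitted their model.

```python
from math import exp, log


def fit_decay(times, copies):
    """Fit N(t) = N0 * exp(-k * t) by least squares on log-transformed copies.

    Returns (N0, k). All copy numbers must be positive.
    """
    ys = [log(c) for c in copies]
    n = len(times)
    mt = sum(times) / n
    my = sum(ys) / n
    slope = sum((t - mt) * (y - my) for t, y in zip(times, ys)) / sum(
        (t - mt) ** 2 for t in times
    )
    intercept = my - slope * mt
    return exp(intercept), -slope


def time_to_limit(n0, k, limit):
    """Time at which the fitted curve falls to the detection limit."""
    return log(n0 / limit) / k


if __name__ == "__main__":
    # Invented time series (hours, copies per reaction), for illustration only.
    hours = [0, 6, 12, 24, 48]
    copies = [5000.0, 2600.0, 1400.0, 380.0, 30.0]
    n0, k = fit_decay(hours, copies)
    # Hypothetical detection limit of 1 copy per reaction.
    print(n0, k, time_to_limit(n0, k, limit=1.0))
```

Comparing the fitted k between eDNA and eRNA, or between pH treatments, is the kind of contrast the abstract reports; a formal comparison would need replicate-level models that this sketch leaves out.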
Spatially explicit population genetic models have long been developed, yet have rarely been used to test hypotheses about the spatial distribution of genetic diversity or the expected neutral levels of genetic divergence between populations. Here, we use spatially explicit coalescence simulations to explore the properties of the island model and the two-dimensional stepping stone model under a wide range of scenarios with spatio-temporal variation in deme size. We avoid the simulation of genetic data, using the fact that under the studied models, summary statistics of genetic diversity and divergence between demes can be approximated from coalescence times. We perform the simulations using gridCoal, a flexible spatial wrapper for the software msprime developed herein. In gridCoal, deme sizes can change arbitrarily across space and time, and migration rates between individual demes can be specified. We identify the different factors that can cause a deviation from the theoretical expectations, such as the simulation time in comparison to the effective deme size and the spatio-temporal autocorrelation across the grid. Our results highlight that Fst, a measure of the strength of population structure, principally depends on recent demography, which makes it robust to temporal variation in deme size. We also warn that predicting genetic diversity from coalescence times requires a much longer run time than needed for the estimation of Fst. Finally, we illustrate the use of gridCoal on a real-world example, the range expansion of silver fir (Abies alba Mill.) since the Last Glacial Maximum, using different degrees of spatio-temporal variation in deme size.
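To make the link between coalescence times and summary statistics concrete, here is a minimal sketch using one standard approximation (due to Slatkin): Fst ≈ (T_total − T_within) / T_total, where T_within is the mean coalescence time of two lineages sampled from the same deme and T_total that of two lineages sampled from the pooled pair of demes. Expected pairwise diversity is taken as 2·μ·T. Whether gridCoal uses exactly these formulas is not stated in the abstract, so treat the function names, the equal-sampling assumption and the example numbers as illustrative only.

```python
def pairwise_fst(t_within_i, t_within_j, t_between):
    """Fst between two demes from mean pairwise coalescence times.

    Assumes lineages are sampled equally from both demes, so the total
    coalescence time is the average of within- and between-deme times.
    """
    t_within = (t_within_i + t_within_j) / 2
    t_total = (t_within + t_between) / 2
    return (t_total - t_within) / t_total


def expected_diversity(t_coal, mu):
    """Expected pairwise nucleotide diversity for mean coalescence time t_coal
    (in generations) and per-site, per-generation mutation rate mu."""
    return 2 * mu * t_coal


if __name__ == "__main__":
    # Invented coalescence times in generations, for illustration only.
    print(pairwise_fst(t_within_i=100.0, t_within_j=100.0, t_between=300.0))
    print(expected_diversity(t_coal=100.0, mu=1e-8))
```

Because both statistics depend only on mean coalescence times, no mutations need to be simulated, which is the shortcut the abstract describes.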
eDNA metabarcoding is an effective method for studying fish communities but allows only an estimation of relative species abundance (density / biomass). Here, we combine metabarcoding with an estimation of the total abundance of eDNA amplified by our universal marker (teleo) using a qPCR approach to infer the absolute abundance of fish species. We carried out a 2850 km eDNA survey within the Danube catchment using a spatial integrative sampling protocol coupled with traditional electrofishing for fish biomass and density estimation. Total fish eDNA concentrations and total fish abundance were highly correlated. The correlation between eDNA concentrations per taxon and absolute specific abundance was of comparable strength when all sites were pooled and remained significant when the sites were considered separately. Furthermore, a non-linear mixed model showed that species richness was underestimated when the amount of teleo-DNA extracted from a sample was below a threshold of 0.65 × 10^6 copies of eDNA. This result, combined with the decrease in teleo-DNA concentration by several orders of magnitude with river size, highlights the need to increase sampling effort in large rivers. Our results show a comprehensive description of longitudinal changes in fish communities and underline our combined metabarcoding/qPCR approach for biomonitoring and bioassessment surveys when a rough estimate of absolute species abundance is sufficient.
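The core arithmetic of combining metabarcoding with qPCR can be sketched as follows: each taxon's share of the metabarcoding reads is multiplied by the total marker copy number measured by qPCR. This is a simplified reading of the approach, not the authors' pipeline; the function names, the taxon labels and the numbers are hypothetical, and the default threshold assumes the abstract's value is to be read as 0.65 × 10^6 copies per sample.

```python
def absolute_edna_per_taxon(read_counts, total_copies):
    """Scale relative read proportions by the qPCR total marker copy number.

    read_counts: dict mapping taxon -> metabarcoding read count for one sample.
    total_copies: total marker copies in that sample, as estimated by qPCR.
    """
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        raise ValueError("sample has no reads")
    return {
        taxon: total_copies * reads / total_reads
        for taxon, reads in read_counts.items()
    }


def richness_may_be_underestimated(total_copies, threshold=0.65e6):
    """Flag samples below the copy-number threshold reported in the abstract
    (assumed here to mean 0.65 x 10^6 copies per sample)."""
    return total_copies < threshold


if __name__ == "__main__":
    # Invented read counts and qPCR total, for illustration only.
    reads = {"Taxon A": 12000, "Taxon B": 6000, "Taxon C": 2000}
    print(absolute_edna_per_taxon(reads, total_copies=2.0e6))
    print(richness_may_be_underestimated(2.0e6))
```

Note that this scaling inherits any amplification bias in the read proportions, which is one reason the abstract frames the result as a rough estimate of absolute abundance.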
The collembolan Folsomia candida Willem, 1902, is an important representative soil arthropod that is widely distributed throughout the world and has been frequently used as a test organism in soil ecology and ecotoxicology studies. However, its suitability as an ideal “standard” has been questioned because of differences in reproductive modes and cryptic genetic diversity between strains from various geographical origins. In this study, we present two high-quality chromosome-level genomes of F. candida, for the parthenogenetic Danish strain (FCDK, 219.08 Mb, N50 of 38.47 Mb, 25,139 protein-coding genes) and the sexual Shanghai strain (FCSH, 153.09 Mb, N50 of 25.75 Mb, 21,609 protein-coding genes). The seven chromosomes of FCDK are each 25–54% larger than the corresponding chromosomes of FCSH, showing obvious repetitive element expansions and large-scale inversions and translocations but no whole-genome duplication. The strain-specific genes, expanded gene families and genes in nonsyntenic chromosomal regions identified in FCDK are highly related to its broader environmental adaptation. In addition, the overall sequence identity of the two mitogenomes is only 78.2%, and FCDK has fewer strain-specific microRNAs than FCSH. In conclusion, FCDK and FCSH have accumulated independent genetic changes and evolved into distinct species since diverging 10 Mya. Our work shows that F. candida represents a good model of rapid cryptic speciation. Moreover, it provides important genomic resources for studying the mechanisms of species differentiation, soil arthropod adaptation to soil ecosystems, and Wolbachia-induced parthenogenesis as well as the evolution of Collembola, a pivotal phylogenetic clade between Crustacea and Insecta.
Here I describe the novel R package SNPfiltR and demonstrate its functionalities as the backbone of a customizable, reproducible SNP filtering pipeline implemented exclusively via the widely adopted R programming language. SNPfiltR extends existing SNP filtering functionalities by automating the visualization of key parameters such as depth, quality, and missing data, then allowing users to set filters based on optimized thresholds, all within a single, cohesive working environment. All SNPfiltR functions require a vcfR object as input, which can be easily generated by reading a SNP dataset stored as a standard vcf file into an R working environment using the function read.vcfR() from the R package vcfR. Performance benchmarking reveals that for moderately sized SNP datasets (up to 50M genotypes with associated quality information), SNPfiltR performs filtering with comparable efficiency to current state-of-the-art command-line-based programs. These benchmarking results indicate that for most reduced-representation genomic datasets, SNPfiltR is an ideal choice for investigating, visualizing, and filtering SNPs as part of a cohesive and easily documentable bioinformatic pipeline. The SNPfiltR package can be downloaded from CRAN with the command [install.packages("SNPfiltR")], and a development version is available from GitHub at: (github.com/DevonDeRaad/SNPfiltR). Additionally, thorough documentation for SNPfiltR, including multiple comprehensive vignettes, is available at the website: (devonderaad.github.io/SNPfiltR/).
Understanding the genetic changes associated with the evolution of biological diversity is of fundamental interest to molecular ecologists. The assessment of genetic variation at hundreds or thousands of unlinked genetic loci forms a sound basis to address questions ranging from micro- to macro-evolutionary timescales, and is now possible thanks to advances in sequencing technology. Major difficulties are associated with i) the lack of genomic resources for many taxa, especially from tropical biodiversity hotspots, ii) scaling the numbers of individuals analyzed and loci sequenced, and iii) building tools for reproducible bioinformatic analyses of such datasets. To address these challenges, we developed a set of target capture probes for phylogenomic studies of the highly diverse, pantropically distributed and economically significant rosewoods (Dalbergia spp.), explored the performance of an overlapping probe set for target capture across the legume family (Fabaceae), and built a general-purpose bioinformatics pipeline. Phylogenomic analyses of Dalbergia species from Madagascar yielded highly resolved and well supported hypotheses of evolutionary relationships. Population genomic analyses identified differences between closely related species and revealed the existence of a potentially new species, suggesting that the diversity of Malagasy Dalbergia species has been underestimated. Analyses at the family level corroborated previous findings by the recovery of monophyletic subfamilies and many well-known clades, as well as high levels of gene tree discordance, especially near the root of the family. The new genomic and bioinformatics resources will hopefully advance systematics and ecological genetics research in legumes, and promote conservation of the highly diverse and endangered Dalbergia rosewoods.
Schistosomiasis is a neglected tropical disease of humans caused by blood flukes of the genus Schistosoma – the only dioecious parasitic flatworms. Although aspects of sex determination, differentiation and reproduction have been studied in some Schistosoma species, almost nothing is understood for Schistosoma japonicum - the causative agent of schistosomiasis japonica. This relates mainly to a lack of high-quality genomic and transcriptomic resources for this species. As current draft genomes for S. japonicum are highly fragmented, we assembled here a chromosome-level reference genome (seven autosomes, the Z-chromosome and partial W-chromosome), achieving a substantially enhanced gene annotation. Utilising this genome, we discovered that the sex chromosomes of S. japonicum and its congener S. mansoni independently suppressed recombination during evolution, forming four and two ‘strata’, respectively. By exploring the W-chromosome and sex-specific transcriptomes, we identified 35 W-linked genes and 257 female-preferentially transcribed genes (FTGs) and identified a signature for sex determination and differentiation in S. japonicum. These FTGs cluster within autosomes or the Z-chromosome and exhibit a highly dynamic transcription profile during the pairing of female and male schistosomules (advanced juveniles), representing a critical phase for the maturation of the female worms, suggesting distinct layers of regulatory control of gene transcription at this stage of development. Collectively, these data provide a valuable resource for further functional genomic characterisation of S. japonicum, shed light on the evolution of sex chromosomes in this highly virulent human blood fluke and provide a pathway to identify novel targets for development of intervention tools against schistosomiasis.
High-throughput sequencing for analysis of environmental microbial diversity has evolved vastly over the last decade. Currently the go-to method for microbial eukaryotes is short-read metabarcoding of variable regions of the 18S rRNA gene with <500 bp amplicons. However, there is a growing interest in long-read sequencing of amplicons covering the rRNA operon for improving taxonomic resolution. For both methods, the choice of primers is crucial. It determines if community members are covered, if they can be identified at a satisfactory taxonomic level, and if the obtained community profile is representative. Here, we designed new primers targeting 18S and 28S rRNA based on 177,934 and 21,072 database sequences, respectively. The primers were evaluated in silico along with published primers on reference sequence databases and marine metagenomics datasets. We further evaluated a subset of the primers for short- and long-read sequencing on environmental samples in vitro and compared the obtained community profile with primer-unbiased metagenomic sequencing. Of the short-read pairs, a new V6-V8 pair and the V4_Balzano pair used with a simplified PCR protocol provided good results in silico and in vitro. Fewer differences were observed between the long-read primer pairs. The long-read amplicons and ITS1 alone provided higher taxonomic resolution than V4. Together, our results represent a reference and guide for selection of robust primers for research on and environmental monitoring of microbial eukaryotes.
The use of NGS datasets has increased dramatically over the last decade; however, there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single Pinus contorta parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the SNP genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded one to two orders of magnitude larger numbers of SNPs and error rates, whereas FreeBayes, SAMtools and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering; under this standardization, UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
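The logic of the familial design can be illustrated with a short sketch: each haploid offspring allele must match one of the two alleles called in the diploid parent, so any call outside the parental genotype counts as a mismatch. The data layout, the function name `mismatch_rate` and the example calls are assumptions for illustration; the study's exact mismatch definition and filtering steps are not given in the abstract.

```python
def mismatch_rate(parent_genotypes, offspring_calls):
    """Proportion of offspring calls incompatible with the parent genotype.

    parent_genotypes: list with one (allele, allele) tuple per SNP, or None
        where the parent has no call.
    offspring_calls: list of per-offspring lists, one haploid allele per SNP,
        with None for missing calls.
    """
    mismatches = 0
    compared = 0
    for calls in offspring_calls:
        for genotype, allele in zip(parent_genotypes, calls):
            if genotype is None or allele is None:
                continue  # skip SNPs without a call in parent or offspring
            compared += 1
            if allele not in genotype:
                mismatches += 1
    return mismatches / compared if compared else float("nan")


if __name__ == "__main__":
    # Invented calls at three SNPs for one parent and three haploid offspring.
    parent = [("A", "G"), ("C", "C"), ("T", "A")]
    offspring = [
        ["A", "C", "T"],
        ["G", "T", "A"],   # "T" at a C/C parent site is a mismatch
        ["A", None, "C"],  # missing call is skipped; "C" at T/A is a mismatch
    ]
    print(mismatch_rate(parent, offspring))
```

Running the same function on the call sets of different variant callers would give a like-for-like comparison of this kind, with the caveat that the rate only captures errors that produce an allele absent from the parent.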
The molecular characterisation of complex behaviours is a challenging task, as a range of different factors are often involved in producing the observed phenotype. An established approach is to look at the overall levels of expression of brain genes – known as ‘neurogenomics’ – to select the best candidates that associate with patterns of interest. This approach has relied so far on a set of powerful statistical tools capable of providing a snapshot of the expression of many thousands of genes that are present in an organism’s genome. However, traditional neurogenomic analyses have some well-known limitations; above all, the limited number of biological replicates compared to the number of genes tested – often referred to as the “curse of dimensionality”. Here we implemented a new Machine Learning (ML) approach that can be used as a complement to established methods of transcriptomic analyses. We tested three types of ML models for their performance in the identification of genes associated with honeybee waggle dance. We then intersected the results of these analyses with traditional outputs of differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: the G-protein coupled receptor boss and hnRNP A1, a gene involved in alternative splicing. Overall, our study demonstrates the application of Machine Learning to analyse transcriptomics data and identify genes underlying social behaviour. This approach has great potential for application to a wide range of different scenarios in evolutionary ecology, when investigating the genomic basis for complex phenotypic traits.