Polyploids are cells or organisms with a genome consisting of more than two sets of homologous chromosomes. Polyploid plants have important traits that facilitate speciation and are thus often model systems for evolutionary, molecular ecology and agricultural studies. However, due to their unusual mode of inheritance and double-reduction, diploid models of population genetic analysis cannot properly be applied to polyploids. To overcome this problem, we developed a software package entitled VCFPOP to perform a variety of population genetic analyses for autopolyploids, such as parentage analysis, analysis of molecular variance, principal coordinates analysis, hierarchical clustering analysis and Bayesian clustering. We make this software freely available, downloadable from http://github.com/huangkang1987/vcfpop.
Prevailing 16S rRNA gene-amplicon methods for characterizing the bacterial microbiome are economical, but result in coarse taxonomic classifications, are subject to primer and 16S copy number biases, and do not allow for direct estimation of microbiome functional potential. While deep shotgun metagenomic sequencing can overcome many of these limitations, it is prohibitively expensive for large sample sets. We evaluated the ability of shallow shotgun metagenomic sequencing to characterize taxonomic and functional patterns in the fecal microbiome of a model population of feral horses (Sable Island, Canada). Since 2007, this unmanaged population has been the subject of an individual-based, long-term ecological study. Using deep shotgun metagenomic sequencing, we determined the sequencing depth required to accurately characterize the horse microbiome. In comparing conventional versus high-throughput shotgun metagenomic library preparation techniques, we validate the use of more cost-effective lab methods. Finally, we characterize similarities between 16S amplicon and shallow shotgun characterization of the microbiome, and demonstrate that the latter recapitulates biological patterns first described in a published amplicon dataset. Unlike amplicon data, we demonstrate how shallow shotgun metagenomic data also provided useful insights about microbiome functional potential which support previously hypothesized diet effects in this study system.
DNA metabarcoding is routinely used for biodiversity assessment, especially targeting highly diverse groups for which limited taxonomic expertise is available. Various protocols are currently in use, although standardization is key to its application in large-scale monitoring. DNA metabarcoding of arthropod bulk samples can be either conducted destructively from sample tissue, or non-destructively from sample fixative or lysis buffer. Non-destructive methods are highly desirable for the preservation of sample integrity but have yet to be experimentally evaluated in detail. Here, we compare diversity estimates from 14 size sorted Malaise trap samples processed consecutively with three non-destructive approaches (one using fixative ethanol and two using lysis buffers) and one destructive approach (using homogenized tissue). Extraction from commercial lysis buffer yielded comparable species richness and high overlap in species composition to the ground tissue extracts. A significantly divergent community was detected from preservative ethanol-based DNA extraction. No consistent trend in species richness was found with increasing incubation time in lysis buffer. These results indicate that non-destructive DNA extraction from incubation in lysis buffer could provide a comparable alternative to destructive approaches with the added advantage of preserving the specimens for post-metabarcoding taxonomic work.
Environmental DNA (eDNA) analyses are powerful for describing marine biodiversity but must be optimized for their effective use in routine monitoring. To maximize eDNA detection probabilities of sparsely distributed populations, water samples are usually concentrated from larger volumes and filtered using fine-pore membranes, often a significant cost-time bottleneck in the workflow. This study aimed to streamline eDNA sampling by comparing plankton net versus bucket sampling and direct versus sequential filtration, including self-preserving filters. Biodiversity was assessed using metabarcoding of the small ribosomal subunit (18S rRNA) and mitochondrial cytochrome c oxidase I (COI) genes. Multi-species detection probabilities were estimated for each workflow using a probabilistic occupancy modelling approach. Significant workflow-related differences in biodiversity metrics were observed. Highest amplicon sequence variant (ASV) richness was attained by the bucket sampling combined with self-preserving filters, comprising a large portion of micro-plankton. Less diversity but more metazoan taxa were captured in the net samples combined with 5 µm pore size filters. Pre-filtered 1.2 µm samples yielded few or no unique ASVs. The highest average (~32%) metazoan detection probabilities in the 5 µm pore size net samples confirmed the effectiveness of pre-concentrating plankton for biodiversity screening. These results contribute to streamlining eDNA sampling protocols for uptake and implementation in marine biodiversity research and surveillance.
Although the use and development of molecular biomonitoring tools based on eNAs (environmental nucleic acids; eDNA and eRNA) have gained broad interest for the quantification of biodiversity in natural ecosystems, studies investigating the impact of site-specific physicochemical parameters on eNA-based detection methods (particularly eRNA) remain scarce. Here, we used a controlled laboratory microcosm experiment to comparatively assess the environmental degradation of eDNA and eRNA across an acid-base gradient following complete removal of the progenitor organism (Daphnia pulex). Using water samples collected over a 30-day period, eDNA and eRNA copy numbers were quantified using a droplet digital PCR (ddPCR) assay targeting the mitochondrial cytochrome c oxidase subunit I (COI) gene of D. pulex. We found that eRNA decayed more rapidly than eDNA at all pH conditions tested, with detectability—predicted by an exponential decay model—for up to 57 hours (eRNA; neutral pH) and 143 days (eDNA; acidic pH) post organismal removal. Decay rates for eDNA were significantly higher in neutral and alkaline conditions than in acidic conditions, while decay rates for eRNA did not differ significantly among pH levels. Collectively, our findings provide the basis for a predictive framework assessing the persistence and degradation dynamics of eRNA and eDNA across a range of ecologically relevant pH conditions, establish the potential for eRNA to be used in spatially and temporally sensitive biomonitoring studies (as it is detectable across a range of pH levels), and may be used to inform future sampling strategies in aquatic habitats.
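The exponential decay model mentioned in the abstract above can be illustrated with a minimal sketch. It assumes first-order decay, N(t) = N0 * exp(-k * t), fitted by ordinary least squares on log-transformed copy numbers, and a fixed detection threshold. The function names, the toy data and the threshold of one copy are hypothetical and are not taken from the study, whose actual fitting procedure may differ.

```python
import math


def fit_exponential_decay(times, copies):
    """Fit ln(copies) = ln(n0) - k * t by least squares; return (n0, k)."""
    logs = [math.log(c) for c in copies]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # The decay rate k is the negative of the fitted slope.
    return math.exp(intercept), -slope


def time_to_threshold(n0, k, threshold):
    """Time at which n0 * exp(-k * t) falls to the detection threshold."""
    return math.log(n0 / threshold) / k


# Hypothetical, noise-free data: 1000 starting copies, k = 0.1 per hour.
times = [0, 6, 12, 24, 48]
copies = [1000 * math.exp(-0.1 * t) for t in times]

n0, k = fit_exponential_decay(times, copies)
hours_detectable = time_to_threshold(n0, k, threshold=1.0)
print(f"n0={n0:.1f}, k={k:.3f} per hour, detectable for {hours_detectable:.1f} h")
```

Under this sketch, a larger k shortens the predicted detection window, which is the sense in which a faster-decaying molecule such as eRNA gives a more time-sensitive signal.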
Spatially explicit population genetic models have long been developed, yet have rarely been used to test hypotheses about the spatial distribution of genetic diversity or the expected neutral levels of genetic divergence between populations. Here, we use spatially explicit coalescence simulations to explore the properties of the island model and the two-dimensional stepping stone model under a wide range of scenarios with spatio-temporal variation in deme size. We avoid the simulation of genetic data, using the fact that under the studied models, summary statistics of genetic diversity and divergence between demes can be approximated from coalescence times. We perform the simulations using gridCoal, a flexible spatial wrapper for the software msprime developed herein. In gridCoal, deme sizes can change arbitrarily across space and time, and migration rates between individual demes can be specified. We identify the different factors that can cause a deviation from the theoretical expectations, such as the simulation time in comparison to the effective deme size and the spatio-temporal autocorrelation across the grid. Our results highlight that Fst, a measure of the strength of population structure, principally depends on recent demography, which makes it robust to temporal variation in deme size. We also warn that predicting genetic diversity from coalescence times requires a much longer run time than needed for the estimation of Fst. Finally, we illustrate the use of gridCoal on a real-world example, the range expansion of silver fir (Abies alba Mill.) since the Last Glacial Maximum, using different degrees of spatio-temporal variation in deme size.
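The idea of approximating summary statistics from coalescence times, as described in the abstract above, can be sketched as follows. The sketch assumes Slatkin's classical low-mutation approximation, Fst ≈ (T_total − T_within) / T_total, and the standard expectation that pairwise nucleotide diversity is 2 * mu * T_within. The function names and numbers are hypothetical; this is not gridCoal's interface, and the exact estimators used by the authors may differ.

```python
def fst_from_coalescence_times(t_within, t_total):
    """Slatkin-style approximation: Fst = (T_total - T_within) / T_total."""
    if t_total <= 0:
        raise ValueError("t_total must be positive")
    return (t_total - t_within) / t_total


def pairwise_fst(t_within_a, t_within_b, t_between):
    """Fst between two demes from mean pairwise coalescence times.

    Two lineages drawn at random from the pooled pair of demes come from
    the same deme half of the time, so T_total is the average of the
    within-deme and between-deme mean coalescence times.
    """
    t_within = (t_within_a + t_within_b) / 2
    t_total = (t_within + t_between) / 2
    return fst_from_coalescence_times(t_within, t_total)


def expected_diversity(t_within, mutation_rate):
    """Expected pairwise diversity per site: 2 * mu * T_within."""
    return 2 * mutation_rate * t_within


# Hypothetical mean coalescence times, in generations.
print(pairwise_fst(100.0, 100.0, 300.0))   # 0.5
print(expected_diversity(1000.0, 1e-8))    # 2e-05
```

Because Fst here is a ratio of coalescence times while diversity scales with their absolute value, truncating a simulation early biases the two statistics differently, which is consistent with the run-time warning in the abstract.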
eDNA metabarcoding is an effective method for studying fish communities but allows only an estimation of relative species abundance (density / biomass). Here, we combine metabarcoding with an estimation of the total abundance of eDNA amplified by our universal marker (teleo) using a qPCR approach to infer the absolute abundance of fish species. We carried out a 2850 km eDNA survey within the Danube catchment using a spatial integrative sampling protocol coupled with traditional electrofishing for fish biomass and density estimation. Total fish eDNA concentrations and total fish abundance were highly correlated. The correlation between eDNA concentrations per taxon and absolute specific abundance was of comparable strength when all sites were pooled and remained significant when the sites were considered separately. Furthermore, a non-linear mixed model showed that species richness was underestimated when the amount of teleo-DNA extracted from a sample was below a threshold of 0.65 × 10^6 copies of eDNA. This result, combined with the decrease in teleo-DNA concentration by several orders of magnitude with river size, highlights the need to increase sampling effort in large rivers. Our results show a comprehensive description of longitudinal changes in fish communities and underline our combined metabarcoding/qPCR approach for biomonitoring and bioassessment surveys when a rough estimate of absolute species abundance is sufficient.
The collembolan Folsomia candida Willem, 1902, is an important representative soil arthropod that is widely distributed throughout the world and has been frequently used as a test organism in soil ecology and ecotoxicology studies. However, its suitability as an ideal “standard” has been questioned because of differences in reproductive modes and cryptic genetic diversity between strains from various geographical origins. In this study, we present two high-quality chromosome-level genomes of F. candida, for the parthenogenetic Danish strain (FCDK, 219.08 Mb, N50 of 38.47 Mb, 25,139 protein-coding genes) and the sexual Shanghai strain (FCSH, 153.09 Mb, N50 of 25.75 Mb, 21,609 protein-coding genes). The seven chromosomes of FCDK are each 25–54% larger than the corresponding chromosomes of FCSH, showing obvious repetitive element expansions and large-scale inversions and translocations but no whole-genome duplication. The strain-specific genes, expanded gene families and genes in nonsyntenic chromosomal regions identified in FCDK are highly related to its broader environmental adaptation. In addition, the overall sequence identity of the two mitogenomes is only 78.2%, and FCDK has fewer strain-specific microRNAs than FCSH. In conclusion, FCDK and FCSH have accumulated independent genetic changes and evolved into distinct species since diverging 10 Mya. Our work shows that F. candida represents a good model of rapid cryptic speciation. Moreover, it provides important genomic resources for studying the mechanisms of species differentiation, soil arthropod adaptation to soil ecosystems, and Wolbachia-induced parthenogenesis as well as the evolution of Collembola, a pivotal phylogenetic clade between Crustacea and Insecta.
Here I describe the novel R package SNPfiltR and demonstrate its functionalities as the backbone of a customizable, reproducible SNP filtering pipeline implemented exclusively via the widely adopted R programming language. SNPfiltR extends existing SNP filtering functionalities by automating the visualization of key parameters such as depth, quality, and missing data, then allowing users to set filters based on optimized thresholds, all within a single, cohesive working environment. All SNPfiltR functions require a vcfR object as input, which can be easily generated by reading a SNP dataset stored as a standard vcf file into an R working environment using the function read.vcfR() from the R package vcfR. Performance benchmarking reveals that for moderately sized SNP datasets (up to 50M genotypes with associated quality information), SNPfiltR performs filtering with comparable efficiency to current state of the art command-line-based programs. These benchmarking results indicate that for most reduced-representation genomic datasets, SNPfiltR is an ideal choice for investigating, visualizing, and filtering SNPs as part of a cohesive and easily documentable bioinformatic pipeline. The SNPfiltR package can be downloaded from CRAN with the command [install.packages("SNPfiltR")], and a development version is available from GitHub at: (github.com/DevonDeRaad/SNPfiltR). Additionally, thorough documentation for SNPfiltR, including multiple comprehensive vignettes, is available at the website: (devonderaad.github.io/SNPfiltR/).
Understanding the genetic changes associated with the evolution of biological diversity is of fundamental interest to molecular ecologists. The assessment of genetic variation at hundreds or thousands of unlinked genetic loci forms a sound basis to address questions ranging from micro- to macro-evolutionary timescales, and is now possible thanks to advances in sequencing technology. Major difficulties are associated with i) the lack of genomic resources for many taxa, especially from tropical biodiversity hotspots, ii) scaling the numbers of individuals analyzed and loci sequenced, and iii) building tools for reproducible bioinformatic analyses of such datasets. To address these challenges, we developed a set of target capture probes for phylogenomic studies of the highly diverse, pantropically distributed and economically significant rosewoods (Dalbergia spp.), explored the performance of an overlapping probe set for target capture across the legume family (Fabaceae), and built a general-purpose bioinformatics pipeline. Phylogenomic analyses of Dalbergia species from Madagascar yielded highly resolved and well supported hypotheses of evolutionary relationships. Population genomic analyses identified differences between closely related species and revealed the existence of a potentially new species, suggesting that the diversity of Malagasy Dalbergia species has been underestimated. Analyses at the family level corroborated previous findings by the recovery of monophyletic subfamilies and many well-known clades, as well as high levels of gene tree discordance, especially near the root of the family. The new genomic and bioinformatics resources will hopefully advance systematics and ecological genetics research in legumes, and promote conservation of the highly diverse and endangered Dalbergia rosewoods.
Schistosomiasis is a neglected tropical disease of humans caused by blood flukes of the genus Schistosoma – the only dioecious parasitic flatworms. Although aspects of sex determination, differentiation and reproduction have been studied in some Schistosoma species, almost nothing is understood for Schistosoma japonicum - the causative agent of schistosomiasis japonica. This relates mainly to a lack of high-quality genomic and transcriptomic resources for this species. As current draft genomes for S. japonicum are highly fragmented, we assembled here a chromosome-level reference genome (seven autosomes, the Z-chromosome and partial W-chromosome), achieving a substantially enhanced gene annotation. Utilising this genome, we discovered that the sex chromosomes of S. japonicum and its congener S. mansoni independently suppressed recombination during evolution, forming four and two ‘strata’, respectively. By exploring the W-chromosome and sex-specific transcriptomes, we identified 35 W-linked genes and 257 female-preferentially transcribed genes (FTGs) and identified a signature for sex determination and differentiation in S. japonicum. These FTGs cluster within autosomes or the Z-chromosome and exhibit a highly dynamic transcription profile during the pairing of female and male schistosomules (advanced juveniles), representing a critical phase for the maturation of the female worms, suggesting distinct layers of regulatory control of gene transcription at this stage of development. Collectively, these data provide a valuable resource for further functional genomic characterisation of S. japonicum, shed light on the evolution of sex chromosomes in this highly virulent human blood fluke and provide a pathway to identify novel targets for development of intervention tools against schistosomiasis.
High-throughput sequencing for analysis of environmental microbial diversity has evolved vastly over the last decade. Currently the go-to method for microbial eukaryotes is short-read metabarcoding of variable regions of the 18S rRNA gene with <500 bp amplicons. However, there is a growing interest in long-read sequencing of amplicons covering the rRNA operon for improving taxonomic resolution. For both methods, the choice of primers is crucial. It determines if community members are covered, if they can be identified at a satisfactory taxonomic level, and if the obtained community profile is representative. Here, we designed new primers targeting 18S and 28S rRNA based on 177,934 and 21,072 database sequences, respectively. The primers were evaluated in silico along with published primers on reference sequence databases and marine metagenomics datasets. We further evaluated a subset of the primers for short- and long-read sequencing on environmental samples in vitro and compared the obtained community profile with primer-unbiased metagenomic sequencing. Of the short-read pairs, a new V6-V8 pair and the V4_Balzano pair used with a simplified PCR protocol provided good results in silico and in vitro. Fewer differences were observed between the long-read primer pairs. The long-read amplicons and ITS1 alone provided higher taxonomic resolution than V4. Together, our results represent a reference and guide for selection of robust primers for research on and environmental monitoring of microbial eukaryotes.
The use of NGS datasets has increased dramatically over the last decade; however, there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single Pinus contorta parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the SNP genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded one to two orders of magnitude larger numbers of SNPs and error rates, whereas FreeBayes, SAMtools and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering, where UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
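The mismatch logic behind the familial design above can be sketched in a few lines. The sketch assumes that a haploid offspring call is consistent whenever its allele appears in the diploid parent's genotype, and that missing calls are skipped. The data structures, function names and toy genotypes are hypothetical and do not reproduce the authors' pipeline; this simple count also cannot detect errors that happen to yield a parental allele.

```python
def count_mismatches(parent_calls, offspring_calls):
    """Return (mismatches, comparisons) over all offspring and sites.

    parent_calls: dict mapping site -> tuple of two alleles (diploid parent).
    offspring_calls: list of dicts mapping site -> single allele (haploid
        offspring tissue), with None marking a missing call.
    """
    mismatches = 0
    comparisons = 0
    for calls in offspring_calls:
        for site, allele in calls.items():
            parent = parent_calls.get(site)
            if allele is None or parent is None:
                continue  # skip sites without a call in either individual
            comparisons += 1
            if allele not in parent:
                mismatches += 1
    return mismatches, comparisons


def mismatch_rate(parent_calls, offspring_calls):
    """Proportion of parent-offspring comparisons that are inconsistent."""
    mismatches, comparisons = count_mismatches(parent_calls, offspring_calls)
    return mismatches / comparisons if comparisons else float("nan")


# Hypothetical toy genotypes for three sites and three offspring.
parent = {"s1": ("A", "G"), "s2": ("C", "C"), "s3": ("T", "T")}
offspring = [
    {"s1": "A", "s2": "C", "s3": "T"},
    {"s1": "G", "s2": "T", "s3": None},  # "T" at s2 cannot come from a C/C parent
    {"s1": "A", "s2": "C", "s3": "T"},
]

print(mismatch_rate(parent, offspring))  # 0.125
```

Running the same count separately on the call set of each variant caller would give comparable per-caller rates, provided the SNP sets are first standardized as the abstract describes.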
The molecular characterisation of complex behaviours is a challenging task, as a range of different factors often combine to produce the observed phenotype. An established approach is to look at the overall levels of expression of brain genes – known as ‘neurogenomics’ – to select the best candidates that associate with patterns of interest. This approach has relied so far on a set of powerful statistical tools capable of providing a snapshot of the expression of many thousands of genes that are present in an organism’s genome. However, traditional neurogenomic analyses have some well-known limitations; above all, the limited number of biological replicates compared to the number of genes tested – often referred to as the “curse of dimensionality”. Here we implemented a new Machine Learning (ML) approach that can be used as a complement to established methods of transcriptomic analyses. We tested three types of ML models for their performance in the identification of genes associated with honeybee waggle dance. We then intersected the results of these analyses with traditional outputs of differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: the G-protein coupled receptor boss and hnRNP A1, a gene involved in alternative splicing. Overall, our study demonstrates the application of Machine Learning to analyse transcriptomics data and identify genes underlying social behaviour. This approach has great potential for application to a wide range of different scenarios in evolutionary ecology, when investigating the genomic basis for complex phenotypic traits.
Because of their challenging taxonomy, arthropods are traditionally underrepresented in biological inventories and monitoring programs. However, arthropods are the largest component of biodiversity, and no assessment can be considered informative without including them. Arthropod immature stages are often discarded during sorting, despite frequently representing more than half of the collected individuals. To date, little effort has been devoted to characterising the impact of discarding non-adult specimens on our diversity estimates. Here, we use a metabarcoding approach to analyse spiders from white oak communities in the Iberian Peninsula collected with standardised protocols, to assess (1) the contribution of juvenile stages to local diversity estimates, and (2) their effect on the diversity patterns inferred across communities. We further investigate the ability of metabarcoding to inform on abundance. We obtained 363 and 331 species as adults and juveniles, respectively. Species represented only by juveniles accounted for an increase of 35% with respect to those identified from adults in the whole sampling. Differences in composition between communities were greatly reduced when immature stages were considered, especially across latitudes. Moreover, our results revealed that metabarcoding data are to a certain extent quantitative, but some sort of taxonomic conversion factor may be necessary to provide accurate informative estimates. Although our findings do not question the relevance of the information provided by adult-based inventories, they also reveal that juveniles provide a novel and relevant layer of knowledge that, especially in areas with marked seasonality, may influence our interpretations, providing more accurate information from standardised biological inventories.
Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.
The Heteroduplex mobility assay (HMA) has proven to be a robust tool for the detection of genetic variation. Here, we describe a simple and rapid application of the HMA by microfluidic capillary electrophoresis, for phylogenetics and population genetic analyses (pgHMA). We show how commonly applied techniques in phylogenetics and population genetics have equivalents with pgHMA: phylogenetic reconstruction with bootstrapping, skyline plots, and mismatch distribution analysis. We assess the performance and accuracy of pgHMA by comparing the results obtained against those obtained using standard methods of analyses applied to sequencing data. The resulting comparisons demonstrate that: (1) there is a significant linear relationship (R = 0.992) between heteroduplex mobility and genetic distance; (2) phylogenetic trees obtained by HMA and nucleotide sequences present nearly identical topologies; (3) clades with high pgHMA parametric bootstrap support also have high bootstrap support on nucleotide phylogenies; (4) skyline plots estimated from the UPGMA trees of HMA and Bayesian trees of nucleotide data reveal similar trends, especially for the median trend estimate of effective population size; and (5) optimized mismatch distributions of HMA are closely fitted to the mismatch distributions of nucleotide sequences. In summary, pgHMA is an easily-applied method for approximating phylogenetic diversity and population trends. KEYWORDS: bootstrap, heteroduplex mobility assay, mismatch distribution, phylogenetics, skyline plot
Sex-specific ecology has management implications, but rapid sex-chromosome turnover in fishes hinders development of markers to sex monomorphic species. Here, we use annotated genomes and reduced-representation sequencing data for two Australian percichthyids, the Macquarie perch Macquaria australasica and the golden perch M. ambigua, and whole genome resequencing data for 50 Macquarie perch of each sex, to detect sex-linked loci, identify a candidate sex-determining gene and develop an affordable sexing assay. In-silico pool-seq tests of 1,492,004 Macquarie perch SNP loci revealed that a 275-Kb scaffold, containing the transcription factor SOX1b gene, was enriched for gametologous loci. Within this scaffold, 22 loci were sex-linked in a predominantly XY system, with females being homozygous at all 22, and males being heterozygous at two or more. Seven XY-gametologous loci were within a 146-bp region. Being ~38 Kb upstream of SOX1b, it might act as an enhancer controlling SOX1b transcription in the bipotential gonad that drives gonad differentiation. A PCR-RFLP sexing assay, targeting one of the Y-linked SNPs, tested in 66 known-sex Macquarie perch and two individuals of each sex of three confamilial species, and amplicon sequencing of 400 bp encompassing the 146-bp region, revealed that the few sex-linked positions differ between species and between Macquarie perch populations. This indicates sex-chromosome lability in Percichthyidae, also supported by non-homologous scaffolds containing sex-linked loci for Macquarie- and golden perches. The resources developed here will facilitate genomic research in Percichthyidae. Sex-linked markers will be useful for determining genetic sex in some populations and studying sex chromosome turnover.