The advancement of next-generation sequencing (NGS) technologies has been revolutionary for the field of evolutionary biology. This technology has led to an abundance of available genomes and transcriptomes for researchers to mine. Specifically, researchers can mine for various types of molecular markers that are vital for phylogenetic, evolutionary, and ecological studies. Numerous tools have been developed to extract these molecular markers from NGS data. However, due to an insufficient number of well-annotated reference genomes for non-model organisms, it remains challenging to obtain these markers accurately and efficiently. Here, we present GeneMiner, an improved and expanded version of our previous tool, Easy353. GeneMiner combines reference-guided de Bruijn graph assembly with seed self-discovery and greedy extension. Additionally, it includes a verification step using a parameter-bootstrap method to reduce the pitfalls associated with using a relatively distant reference. Our results using both experimental and simulated data showed that GeneMiner can accurately acquire phylogenetic molecular markers for plants using transcriptomic, genomic, and other NGS data. GeneMiner is designed to be user-friendly, fast, and memory efficient. Further, it is compatible with Linux, Windows, and macOS. All source code is publicly available on GitHub for easy accessibility and transparency (https://github.com/yyscu/GeneMiner).
Diapause, a form of dormancy to delay or halt reproductive development during unfavourable seasons, has evolved in many insect species. One example is aestivation, a summer adult-stage diapause that enhances malaria vectors’ survival during the unfavourable dry season (DS) and their re-establishment in the next rainy season (RS). This work develops a novel genetic approach to estimate the number or proportion of individuals undergoing diapause, as well as the breeding sizes of the two seasons, using signals from temporal allele frequency dynamics. Using Anopheles coluzzii as an example, our modelling shows that the magnitude of drift is dampened in the early RS, when previously aestivating individuals reappear. Aestivation severely biases the temporal effective population size (N_e), leading to overestimation of the DS breeding size by a factor of 1/(1-α)^2 across one year, where α is the aestivating proportion. We find that sampling breeding individuals in three consecutive seasons starting from an RS is sufficient for parameter estimation, and we perform extensive simulations to verify our derivations. This method does not require sampling individuals in the dormant state, the biggest challenge in most studies. We apply the method to a published An. coluzzii dataset from Thierola, Mali, and the estimated aestivating proportions were 39%-79%. These results will inform the development of genetic approaches to vector control. Beyond mosquitoes, our method and the expected evolutionary implications are applicable to any species in which a fraction of the population diapauses for more than one generation and is difficult or impossible to sample during that stage.
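As a worked illustration of the bias factor stated in the abstract above, the following minimal Python sketch evaluates 1/(1-α)^2 at the two ends of the reported range of aestivating proportions (39% and 79%). The function name is hypothetical and does not come from any published software; the code only restates the formula given in the text.

```python
def ds_breeding_size_inflation(alpha: float) -> float:
    """Factor by which the dry-season breeding size is overestimated
    when a proportion `alpha` of individuals aestivates: 1 / (1 - alpha)^2.

    Hypothetical helper; it simply evaluates the formula from the abstract.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return 1.0 / (1.0 - alpha) ** 2


if __name__ == "__main__":
    # The two ends of the range of aestivating proportions reported above.
    for alpha in (0.39, 0.79):
        factor = ds_breeding_size_inflation(alpha)
        print(f"alpha = {alpha:.2f} -> overestimation factor = {factor:.2f}")
```

The factor grows quickly with α: with no aestivation it is 1 (no bias), and at α = 0.5 it is already 4.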
More than 30% of extant shark species are classified as threatened with extinction, yet reliable population-level data is often rare, frequently due to a lack of genomic resources. Here we present a new genome-wide marker set for endangered shark species. We developed a target gene capture bait set based on the transcriptome and genome of a Lamnid shark, and tested it on 36 shark specimens, representing seven species from three orders. Illumina read mapping and calling of Single Nucleotide Polymorphisms (SNPs) showed high target recovery rates, especially in the order Lamniformes, providing several thousand detected biallelic SNPs in each species tested. Our results show this marker set can be used for SNP-calling in a broad range of shark species, enabling detailed population assessments and other ecological and evolutionary studies.
Genome visualization tools are important for exploring genomic features and their interactions. Currently, visualization of plant mitochondrial genomes (mitogenomes) depends on tools originally designed for animal mitogenomes and plant plastomes. These tools cannot faithfully present features unique to plant mitogenomes, such as non-linear exon arrangement for genes, prevalence of functional non-coding features, and complex chromosomal architectures. To address these challenges, a software package, plant mitochondrial genome map (PMGmap), was developed in Python. PMGmap can draw genes at the exon level, draw cis- and trans-splicing gene maps, draw non-coding features, draw repetitive sequences, scale the genic regions using the "scaling the genic regions on the genome" (SGM) algorithm, and draw multiple chromosomes simultaneously. We compared PMGmap with other leading tools on 405 plant mitogenomes and found that PMGmap visualized the above-mentioned features better than those tools. We believe PMGmap will become an invaluable tool for plant mitogenome research. The web and container versions and the source code of PMGmap can be accessed at http://www.1kmpg.cn/pmgmap.
Metazoa-level Universal Single-Copy Orthologs (mzl-USCOs) are universally applicable markers for DNA taxonomy in animals which can replace or supplement single-gene barcodes. While previously mzl-USCOs from target enrichment data were shown to reliably distinguish species, here we tested whether USCOs are an evenly distributed, representative sample of a given metazoan genome and therefore able to cope with past hybridization events and incomplete lineage sorting. This is relevant for coalescent-based species delimitation approaches, which critically depend on the assumption that the investigated loci do not exhibit autocorrelation due to physical linkage. Based on 239 assessed chromosome-level assembled genomes, we confirmed that mzl-USCOs are genetically unlinked for practical purposes and a representative sample of a genome in terms of reciprocal distances between USCOs on a chromosome and of distribution across chromosomes. We tested the suitability of mzl-USCOs extracted from genomes for species delimitation and phylogeny in four case studies: Anopheles mosquitos, Drosophila fruit flies, Heliconius butterflies, and Darwin’s finches. In almost all instances, USCOs allowed delineating species and yielded phylogenies that correspond to those generated from whole genome data. Our phylogenetic analyses demonstrate that USCOs may complement single-gene DNA barcodes and provide more accurate taxonomic inferences. Combining USCOs from sources that used different versions of ortholog reference libraries to infer marker orthology may be challenging and at times impact taxonomic conclusions. However, we expect this problem to become less severe as the rapidly growing number of reference genomes provides a better representation of the number and diversity of organismic lineages.
As sequencing technology continues to rapidly improve, studies investigating the microbial communities of host organisms (i.e., microbiomes) are becoming not only more popular but also more financially accessible. Across many taxa, microbiomes can have important impacts on organismal health and fitness. To evaluate the microbial community composition of a particular microbiome, microbial DNA must be successfully extracted. Fecal samples are often easy to collect and are a good source of gut microbial DNA. However, in birds and reptiles, microbial DNA extractions from fecal matter have proven to be difficult due to high concentrations of uric acid, an inhibitor of DNA extractions. Here, we present a new microbial DNA extraction method that is highly effective for avian species and displays higher efficiency and consistency than other commonly used methodologies. Further, our method is also effective in extracting microbial DNA from oils collected from the avian preen gland. Preen oil chemicals are important for many aspects of avian life, and the biosynthesis of these chemicals is dependent on the preen gland microbial community. We expect our method will facilitate microbial DNA extractions from multiple avian microbiome reservoirs, which have previously proved difficult and expensive. Our method therefore increases the feasibility of future studies of avian host microbiomes.
Accurate and efficient genotyping of microsatellite loci is essential for their application in population genetics and various demographic analyses. Protocols for next-generation sequencing of microsatellite loci enable high-throughput and cross-compatible allele scoring, addressing common issues associated with size separation on conventional capillary-based protocols. To support these protocols, we have developed a novel, ultra-fast, all-in-one software tool, Seq2Sat, written in C++, for accurate automated microsatellite genotyping. It directly takes raw reads of microsatellite amplicons and performs read quality control before inferring genotypes based on read depth, sequence composition and length. It does not produce any intermediate files, making I/O very efficient. Additionally, we developed a module in Seq2Sat for sex identification based on sex locus amplicons. We further developed a user-friendly web-based platform, SatAnalyzer, to conduct reads-to-report analyses by calling Seq2Sat to generate genotype tables and interactive genotype graphs for manual editing. SatAnalyzer also allows visualization of read quality and distribution across loci and samples to troubleshoot multiplex optimization and high-quality library preparation. To evaluate its performance, we benchmarked SatAnalyzer against conventional capillary gel electrophoresis and an existing microsatellite genotyping software, MEGASAT. Results show that SatAnalyzer can achieve > 0.993 genotyping accuracy and that Seq2Sat is ~ 5 times faster than MEGASAT despite generating many more informative tables and figures. Seq2Sat and SatAnalyzer are freely available on GitHub (https://github.com/ecogenomicscanada/Seq2Sat) and Docker Hub (https://hub.docker.com/r/rocpengliu/satanalyzer).
Metabarcoding is an increasingly popular and accessible method for assessing bacterial communities across a wide range of environments, and as the sequence data archives grow, sequence data reuse will likely become an important source of novel insights into the ecology of microbes. While literature exists on the benefits of longer read lengths for the study of microbial communities, little is known about the (re)usability of shorter (<200 bp) read lengths, yet this information is essential to improve the reuse and comparability of metabarcoding data across studies. This study reanalyzed three 16S rRNA datasets targeting aquatic, animal-associated, and soil microbiomes, and evaluated how processing the sequence data across a range of read lengths affected the resulting taxonomic assignments, biodiversity metrics, and differential (i.e., before-after treatment) analyses. Short read lengths successfully recovered ecological patterns, and limited increases in resolution were observed beyond 100 bp reads across environments. Furthermore, abundance-weighted diversity metrics (e.g., Inverse Simpson index or Bray-Curtis dissimilarities) were more robust to variation in read lengths. Importantly, the total number of ASVs detected increased with read length, highlighting the need to consider metabarcoding-derived diversity estimates within the context of the bioinformatics parameters selected. This study provides evidence-based guidelines for the processing of short reads.
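The two abundance-weighted metrics named in the abstract above have simple closed forms. The following Python sketch computes the Inverse Simpson index and the Bray-Curtis dissimilarity from raw ASV counts using their standard textbook definitions. The count vectors are invented for illustration and are not data from the study. The comparison of richness against the Inverse Simpson index shows one plausible intuition, offered here as an assumption rather than the study's finding, for why an abundance-weighted metric changes little when a few rare ASVs are added.

```python
from typing import Sequence


def inverse_simpson(counts: Sequence[float]) -> float:
    """Inverse Simpson index: 1 / sum(p_i^2), with p_i the relative abundances."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return 1.0 / sum((c / total) ** 2 for c in counts)


def bray_curtis(a: Sequence[float], b: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    if len(a) != len(b):
        raise ValueError("both samples must list the same ASVs in the same order")
    denominator = sum(a) + sum(b)
    if denominator <= 0:
        raise ValueError("at least one sample must contain reads")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


if __name__ == "__main__":
    # Invented counts: the second vector adds three singleton ASVs.
    without_rare = [500, 300, 200, 0, 0, 0]
    with_rare = [500, 300, 200, 1, 1, 1]

    for label, sample in (("without rare ASVs", without_rare),
                          ("with rare ASVs", with_rare)):
        richness = sum(1 for c in sample if c > 0)
        print(f"{label}: richness = {richness}, "
              f"inverse Simpson = {inverse_simpson(sample):.3f}")

    print(f"Bray-Curtis between the two: {bray_curtis(without_rare, with_rare):.4f}")
```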
Environmental DNA is an effective tool for describing fish biodiversity in lotic environments, but the downstream transport of eDNA released by organisms makes it difficult to interpret species detection at the local scale. In addition to biophysical degradation and exchanges at the water-sediment interface, hydrological conditions control the transport distance. We have developed an eDNA transport model that considers downstream retention and degradation processes in combination with hydraulic conditions and assumes that the sedimentation rate of very fine particles is a correct estimate of the eDNA deposition rate. Based on meta-analyses of available studies, we successively modelled the particle size distribution of fish eDNA (PSD), the relationship between the sedimentation rate and the size of very fine particles in suspension, and the influence of temperature on the degradation rate of fish eDNA. After combining the results in a mechanistic-based model, we correctly simulated the eDNA uptake distances observed in a compilation of previous experimental studies. eDNA degradation is negligible at low flow and temperature but has a comparable influence to background transfer when hydraulic conditions allow a long uptake distance. The wide prediction intervals associated with the simulations reflect the complexity of the processes acting on eDNA after shedding. This model can be useful for estimating eDNA detection distance downstream from a source point and discussing the possibility of false positive detection in eDNA samples, as shown in an example.
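The abstract above does not give the model equations. As a rough illustration only, the following Python sketch assumes a simple first-order loss model in which eDNA travelling downstream at mean velocity `u` is removed by deposition (settling velocity `v_dep` over water depth `depth`) and by degradation (rate `k_deg`). The functional form, all function and parameter names, and the numerical values are assumptions made for this example; they are not the authors' model or parameter estimates.

```python
import math


def uptake_distance(u: float, depth: float, v_dep: float, k_deg: float) -> float:
    """Mean downstream travel distance (m) under assumed first-order losses.

    u      mean flow velocity (m/s)
    depth  mean water depth (m)
    v_dep  assumed deposition (settling) velocity of eDNA particles (m/s)
    k_deg  assumed first-order degradation rate (1/s)
    """
    loss_rate = v_dep / depth + k_deg  # total loss rate per unit time (1/s)
    if loss_rate <= 0:
        raise ValueError("total loss rate must be positive")
    return u / loss_rate


def fraction_remaining(x: float, u: float, depth: float,
                       v_dep: float, k_deg: float) -> float:
    """Fraction of the released eDNA still in suspension x metres downstream."""
    return math.exp(-x / uptake_distance(u, depth, v_dep, k_deg))


if __name__ == "__main__":
    # Purely hypothetical values chosen to show how the terms combine.
    params = dict(u=0.5, depth=1.0, v_dep=1e-4)
    for k_deg in (0.0, 1e-5, 1e-4):
        s_w = uptake_distance(k_deg=k_deg, **params)
        left = fraction_remaining(1000.0, k_deg=k_deg, **params)
        print(f"k_deg = {k_deg:.0e} 1/s: uptake distance = {s_w:.0f} m, "
              f"fraction left at 1 km = {left:.3f}")
```

In this simplified form, degradation matters only when `k_deg` is comparable to the deposition term `v_dep / depth`.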
Biomonitoring of marine life has been enhanced in recent years by the integration of innovative DNA-based approaches, which offer advantages over more laborious conventional techniques (e.g. direct capture) and greater taxonomic resolution, especially for complex life cycles and early life stages. However, tradeoffs between throughput, sensitivity and quantitative measurements must be made when choosing between the prevailing molecular methodologies (i.e. metabarcoding or qPCR/dPCR). Thus, the aim of the present study was to demonstrate the utility of a microfluidic-enabled High Throughput quantitative PCR platform (HT-qPCR) for the rapid and cost-effective development and validation of a DNA-based multi-species biomonitoring toolkit, using larvae of 24 commercially targeted bivalve and crustacean species as a case study. The workflow was divided into three main phases: definition of target taxa and establishment of reference databases (PHASE 1); in silico selection/development and in vitro assessment of molecular assays (PHASE 2); and protocol optimization and field validation (PHASE 3). Of a total of 85 assays considered in silico, 42 were eventually chosen and validated in vitro. The genetic signal not only correlated well with direct visual counts by microscopy but also provided quantitative data at the highest taxonomic resolution (species level) in a time- and cost-effective fashion. This study developed a biomonitoring toolkit, demonstrating the considerable advantages of this state-of-the-art technology in boosting the development and application of panels of molecular assays for the monitoring and management of natural resources that can be applied to a range of monitoring programmes. Keywords: DNA, High Throughput, qPCR, biomonitoring, shellfish
Phylogenetic studies now routinely require manipulating and summarizing thousands of data files. For most of these tasks, currently available software requires considerable computing resources and substantial knowledge of command-line applications. We develop ultrafast and memory-efficient software that performs over a dozen common phylogenomic manipulations and calculates statistics summarizing essential data features. Our software is available as standalone command-line (CLI) and graphical user interface (GUI) applications, and as a programming language library for Rust, R, and Python, with possible support for other languages. The CLI and library versions, SEGUL, run natively on Windows, Linux, and macOS, including Apple ARM Macs. The GUI version extends support to include the mobile iOS and Android operating systems. SEGUL offers fast execution times and low memory footprints regardless of dataset size and platform choice. The inclusion of a GUI minimizes bioinformatics barriers to phylogenomics, while SEGUL’s efficiency reduces economic barriers by enabling analysis on inexpensive hardware. Our support for mobile operating systems further enables teaching phylogenomics where access to computing power is limited.
Phylogenetic generalized least squares (PGLS) regression is widely used to detect evolutionary correlations. In contrast to the equal treatment of analyzed traits in conventional correlation methods such as Pearson and Spearman’s rank tests, we must designate one trait as the independent variable and the other as the dependent variable. However, in our PGLS regression analyses (using Pagel’s λ model) of both empirical and simulated datasets, switching independent and dependent variables yielded many conflicting results. A serious problem with PGLS regression that has not been noticed before is that selecting an inappropriate trait as the dependent variable will often result in an error. To assess correlations in simulated data, we established a gold standard by analyzing changes in traits along phylogenetic branches. Next, we tested seven potential criteria for dependent variable selection: log-likelihood, Akaike information criterion, R², p-value, Pagel’s λ, Blomberg et al.’s K, and the estimated λ in Pagel’s λ model. We determined that the last three criteria performed equally well in selecting the dependent variable and were superior to the other four. For practicality, we suggest using the trait with a higher λ or K value as the dependent variable in future PGLS regressions. In analyzing the evolutionary relationship between two traits, we should designate the trait with a stronger phylogenetic signal as the dependent variable even if it could logically assume the cause in the relationship.
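The selection rule suggested in the abstract above (use the trait with the higher λ or K as the dependent variable) can be written as a small helper. The following Python sketch assumes the phylogenetic signal of each trait has already been estimated elsewhere. The function name, the trait names, and the signal values are hypothetical, and the handling of ties is an assumption because the abstract does not address it.

```python
from typing import Dict, Tuple


def choose_dependent(signal_by_trait: Dict[str, float]) -> Tuple[str, str]:
    """Return (dependent, independent) for a pair of traits.

    `signal_by_trait` maps each of exactly two trait names to its estimated
    phylogenetic signal (Pagel's lambda or Blomberg et al.'s K, using the
    same statistic for both traits). The trait with the stronger signal is
    designated as the dependent variable.
    """
    if len(signal_by_trait) != 2:
        raise ValueError("exactly two traits are required")
    (trait_a, signal_a), (trait_b, signal_b) = signal_by_trait.items()
    if signal_a == signal_b:
        # Assumption: the rule gives no guidance for ties, so refuse to guess.
        raise ValueError("equal signals: the rule does not pick a trait")
    return (trait_a, trait_b) if signal_a > signal_b else (trait_b, trait_a)


if __name__ == "__main__":
    # Hypothetical lambda estimates for two hypothetical traits.
    dependent, independent = choose_dependent({"body_mass": 0.95, "range_size": 0.40})
    print(f"PGLS model: {dependent} ~ {independent}")
```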
Current methodologies of genome-wide Single Nucleotide Polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact on downstream analyses, and several algorithms are available to fill in missing genotypes. The lack of reference haplotype panels precludes the use of these methods in genomic studies on non-model organisms. As an alternative, machine learning algorithms are employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on Self-Organizing Maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs into close neurons. We follow a classical approach that explores genotype datasets to select SNP loci for each query missing SNP genotype to build training sets, then initializes and trains the neural networks, and finally uses the SOM-derived clustering to impute the best genotype. To automate the imputation process, we have implemented GTIMPUTATION, an open source application programmed in Python3 with a user-friendly GUI to facilitate the whole process. The method's performance was validated by comparing its accuracy, precision and sensitivity on several benchmark genotype datasets with those of other available imputation algorithms. Our approach produced highly accurate and precise genotype imputations and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.
Using high-throughput sequencing for precise genotyping of multi-locus gene families, such as the Major Histocompatibility Complex (MHC), remains challenging, due to the complexity of the data and difficulties in distinguishing genuine from erroneous variants. Several dedicated genotyping pipelines for data from high-throughput sequencing, such as next-generation sequencing (NGS), have been developed to tackle the ensuing risk of artificially inflated diversity. Here, we thoroughly assess three such multi-locus genotyping pipelines for NGS data, using MHC class IIβ datasets of three-spined stickleback gDNA, cDNA, and “artificial” plasmid samples with known allelic diversity. We show that genotyping of gDNA and plasmid samples at optimal pipeline parameters was highly accurate and reproducible across methods. However, for cDNA data, the same configuration yielded decreased overall genotyping precision and consistency between pipelines. Further adjustments of key clustering parameters were required to account for higher error rates and larger variation in sequencing depth per allele, highlighting the importance of template-specific pipeline optimization for reliable genotyping of multi-locus gene families. Through accurate paired gDNA-cDNA genotyping and MHC-II haplotype inference, we show that MHC-II allele-specific expression levels correlate negatively with allele number across haplotypes. Lastly, sibship-assisted cDNA genotyping of MHC-I revealed novel variants and haplotype-based allelic segregation with a higher-than-previously-reported individual allelic diversity for MHC-I in sticklebacks. In conclusion, we here provide novel genotyping protocols for MHC-I and -II genes of the three-spined stickleback, but also evaluate the performance of popular NGS-genotyping pipelines and highlight the need for template-specific optimization for reliable multi-locus genotyping.
Continued advancements in environmental DNA (eDNA) research have made it possible to access intraspecific variation from eDNA samples, opening new opportunities to expand non-invasive genetic studies of wild animal populations. However, the use of eDNA samples for individual genotyping, as typically performed in non-invasive genetics, has so far remained unachieved. We present the first successful individual genotyping of eDNA obtained from snow tracks of three large carnivores: brown bear (Ursus arctos), European lynx (Lynx lynx) and wolf (Canis lupus). DNA was extracted using a protocol for isolating water eDNA and genotyped using amplicon sequencing of short tandem repeats (STR) and, for brown bear, a sex marker, on a high-throughput sequencing platform. Individual genotypes were obtained for all species, but genotyping performance differed among samples and species. Multilocus genotyping success for individual identification was higher for brown bear samples (6 of 7) than for wolf (7 of 10) and lynx (4 of 9) samples. The sex marker was genotyped in 5 of 7 brown bear samples. Results for the three species show that reliable individual genotyping, including sex identification, is now possible from eDNA in snow tracks, underlining its vast potential to complement the non-invasive genetic methods used for wildlife. To fully leverage the application of snow track eDNA, an improved understanding of the ideal species- and site-specific sampling conditions, as well as of laboratory methods promoting genotyping success, is needed. This will also inform efforts to retrieve and type nuclear DNA from other eDNA samples, thereby advancing eDNA-based individual and population level studies.
Population genomic studies have increased over the last decade, showing great potential to reveal evolutionary patterns in a great variety of organisms, and mostly rely on RAD sequencing techniques to obtain reduced representations of genomes. Among these techniques, 2b-RAD can provide further secondary reduction to adjust study costs by using base-selective adaptors, although its impact on genotyping is unknown. Here we provide empirical comparisons of genotyping and genetic differentiation when using fully degenerate and base-selective adaptors, and assess the impact of missing data. We built libraries with the two types of adaptors for the same individuals and generated independent and combined datasets with different missingness filters according to locus presence across samples (100%, 75% and 50%). Exploring locus by locus, we found 92% identical genotypes between both libraries of the same individual when using loci present in 100% of the samples, which decreased to 35% when working with loci present in at least 50% of them. We show that missing data is a major source of individual genetic differentiation. Loci with discordant genotypes were at low frequency (7.67%) in all filtered files. Only 0.96% were directly attributable to base-selective adaptors, and 6.44% underestimated heterozygosity in NN libraries, of which ca. 70% had <10 reads per locus, indicating that sufficient read depth should be ensured for correct genotyping. Our work confirms that 2b-RAD libraries using base-selective adaptors are a robust tool to use in population genomics of species with large genome sizes.
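The locus-by-locus comparison described in the abstract above can be sketched in a few lines. The following Python example computes the share of identical genotypes between two libraries of the same individual and applies a presence filter across samples. The data structures, function names, and toy genotypes are assumptions made for illustration; they do not reproduce the authors' pipeline or data.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None marks a missing genotype


def loci_passing_presence(samples: List[Dict[str, Genotype]],
                          min_presence: float) -> List[str]:
    """Loci genotyped in at least `min_presence` (0-1) of the samples."""
    all_loci = sorted({locus for sample in samples for locus in sample})
    kept = []
    for locus in all_loci:
        called = sum(1 for sample in samples if sample.get(locus) is not None)
        if called / len(samples) >= min_presence:
            kept.append(locus)
    return kept


def genotype_concordance(lib_a: Dict[str, Genotype],
                         lib_b: Dict[str, Genotype]) -> Optional[float]:
    """Fraction of loci called in both libraries that have identical genotypes.

    Allele order is ignored, so ("A", "G") equals ("G", "A").
    Returns None when no locus is called in both libraries.
    """
    shared = [locus for locus in lib_a
              if lib_a[locus] is not None and lib_b.get(locus) is not None]
    if not shared:
        return None
    identical = sum(1 for locus in shared
                    if sorted(lib_a[locus]) == sorted(lib_b[locus]))
    return identical / len(shared)


if __name__ == "__main__":
    # Toy genotypes for one individual sequenced with two adaptor types.
    degenerate = {"loc1": ("A", "G"), "loc2": ("C", "C"), "loc3": None}
    selective = {"loc1": ("G", "A"), "loc2": ("C", "T"), "loc3": ("A", "A")}
    print("concordance:", genotype_concordance(degenerate, selective))
    print("loci present in 100% of samples:",
          loci_passing_presence([degenerate, selective], 1.0))
```

In the toy data, `loc2` is heterozygous in one library and homozygous in the other, which is the kind of discordance the abstract links to low read depth.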
DNA methylation is one of the most relevant epigenetic modifications. It is present in eukaryotes and prokaryotes and is related to several biological phenomena, including gene flow and adaptation to environmental conditions. The widespread use of third-generation sequencing technologies allows direct and easy detection of genome-wide methylation profiles, offering increasing opportunities to understand and exploit the epigenomic landscape of individuals and populations. Here, we present MeStudio, a pipeline that allows users to analyse genome-wide methylation profiles and combine them with genomic features. Outputs report the presence of DNA methylation in coding sequences (CDS) and noncoding sequences, including both intergenic sequences and sequences upstream of CDS. We show the usage and performance of MeStudio on a set of single-molecule real-time sequencing outputs from strains of the bacterial species Sinorhizobium meliloti. MeStudio is freely available under an open source GPLv3 license at https://github.com/combogenomics/MeStudio