Molecular Ecology Resources - Authorea

Evaluating kinship estimation methods for reduced-representation SNP data in non-mode...

Eilish McMaster

and 4 more

April 20, 2024

Kinship estimation is widely used in ecological and evolutionary research, particularly in studies of human genealogy and genome-wide associations. In conservation, restoration, agriculture, and forestry, identifying relationships between individuals can be crucial for successful population management and can provide insight into inheritance patterns. Kinship estimation methods are typically designed for large datasets with hundreds of thousands of single-nucleotide polymorphisms. However, studies of non-model species often use much smaller datasets obtained using reduced-representation sequencing. To evaluate the performance of kinship estimation methods under these circumstances, we applied six algorithms to datasets from six non-model Australian flowering plant species (_Acacia terminalis_, _Acacia suaveolens_, _Banksia serrata_, _Banksia aemula_, _Hakea sericea_, and _Hakea teretifolia_), encompassing 3,390 individuals and 369 families. Our results show different performances of kinship methods on reduced-representation sequence data compared with prior evaluations. PC-Relate, RelateAdmix, and Goudet’s beta dosage exhibited limited precision, KING Homo and KING Robust demonstrated high precision with limited sensitivity, while PLINK displayed variable sensitivity and precision. The sensitivity and precision of the methods were affected in various ways by filtering parameters; each method showed its best performance under different thresholds for minor allele frequency and locus missingness. We also present a case study that illustrates a practical application of the methods, demonstrating how estimates of kinship can inform management of seed production areas of the broadleaf hopbush (_Dodonaea viscosa_). Based on our findings, we offer specific recommendations for utilizing kinship estimation methods in studies of reduced-representation sequence data from non-model species.
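The two filtering parameters named in the abstract, minor allele frequency (MAF) and locus missingness, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0/1/2 genotype coding, and the threshold values are assumptions chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def locus_stats(genotypes: Sequence[Genotype]) -> Tuple[float, float]:
    """Return (missingness, minor allele frequency) for one diploid biallelic locus."""
    called = [g for g in genotypes if g is not None]
    missingness = (len(genotypes) - len(called)) / len(genotypes)
    if not called:
        return missingness, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missingness, min(alt_freq, 1 - alt_freq)


def filter_loci(matrix: Sequence[Sequence[Genotype]],
                min_maf: float = 0.05,
                max_missing: float = 0.2) -> List[int]:
    """Return indices of loci that pass both thresholds (matrix[i] = locus i)."""
    kept = []
    for i, locus in enumerate(matrix):
        missingness, maf = locus_stats(locus)
        if maf >= min_maf and missingness <= max_missing:
            kept.append(i)
    return kept


# Hypothetical toy data: three loci scored in five individuals.
loci = [
    [0, 1, 2, 1, None],        # polymorphic, one missing call
    [0, 0, 0, 0, 0],           # monomorphic, fails the MAF threshold
    [1, None, None, None, 0],  # mostly missing, fails the missingness threshold
]
kept = filter_loci(loci, min_maf=0.05, max_missing=0.2)
print(kept)
```

Because each kinship method reportedly performed best under different thresholds, a filter like this would typically be rerun over a grid of `min_maf` and `max_missing` values rather than fixed once.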

MADaM, an accurate and fast unsupervised algorithm for genotyping of short sequencing...

Thomas Goeury

and 2 more

March 31, 2024

We present MADaM (Multiplexed Amplicon Data Miner), an original algorithm for de novo genotyping of short sequencing reads that does not require an assembly step. It classifies reads based on an original set of features using t-SNE and clustering with the DBSCAN algorithm. We apply the algorithm to three different approaches and datasets, showing that the software is well suited to rapid genotyping of highly variable regions such as MHC-HLA exon 2 without any priors such as SNP positions or known alleles.

Stability of environmental DNA methylation and its utility in tracing reproductive ac...

Itsuki Hirayama

and 2 more

March 10, 2024

1. Environmental DNA (eDNA) is now widely applied as a method of ecological monitoring. Although eDNA can provide important information on the distribution and biomass of particular taxa, an organism’s DNA sequence remains unaltered throughout its life, which makes it difficult to identify crucial events, including reproduction, with high accuracy. We thus examined DNA methylation as a novel source of information from eDNA, considering that methylation patterns of eggs and sperm released during reproduction differ from those of somatic tissues. 2. Despite its potential applications, little is known about eDNA methylation, including its stability and methods for detection and quantification. Therefore, we conducted tank experiments and performed methylation analysis targeting 18S rDNA through bisulfite amplicon sequencing. 3. Methylation of eDNA was not affected by degradation, and its rate was equivalent to that of genomic DNA from somatic tissues. Unmethylated DNA, which is abundant in the ovary, was detected in eDNA during reproductive activity of fish. 4. These results indicate that eDNA methylation is a stable signal reflecting genomic methylation and demonstrate that germ cell-specific methylation patterns can be used as markers for detecting reproductive activity.

GeneMiner: a tool for extracting phylogenetic markers from next-generation sequencing...

Pulin Xie

and 4 more

April 17, 2023

The advancement of next-generation sequencing (NGS) technologies has been revolutionary for the field of evolutionary biology. This technology has led to an abundance of available genomes and transcriptomes for researchers to mine. Specifically, researchers can mine for various types of molecular markers that are vital for phylogenetic, evolutionary, and ecological studies. Numerous tools have been developed to extract these molecular markers from NGS data. However, due to an insufficient number of well-annotated reference genomes for non-model organisms, it remains challenging to obtain these markers accurately and efficiently. Here, we present GeneMiner, an improved and expanded version of our previous tool, Easy353. GeneMiner combines reference-guided de Bruijn graph assembly with seed self-discovery and greedy extension. Additionally, it includes a verification step using a parameter-bootstrap method to reduce the pitfalls associated with using a relatively distant reference. Our results using both experimental and simulated data showed that GeneMiner can accurately acquire phylogenetic molecular markers for plants using transcriptomic, genomic, and other NGS data. GeneMiner is designed to be user-friendly, fast, and memory efficient. Further, it is compatible with Linux, Windows, and macOS. All source code is publicly available on GitHub for easy accessibility and transparency (https://github.com/yyscu/GeneMiner).

Genome-wide target capture baits for endangered shark species

Clara Isabel Wagner

and 5 more

October 24, 2023

More than 30% of extant shark species are classified as threatened with extinction, yet reliable population-level data is often rare, frequently due to a lack of genomic resources. Here we present a new genome-wide marker set for endangered shark species. We developed a target gene capture bait set based on the transcriptome and genome of a Lamnid shark, and tested it on 36 shark specimens, representing seven species from three orders. Illumina read mapping and calling of Single Nucleotide Polymorphisms (SNPs) showed high target recovery rates, especially in the order Lamniformes, providing several thousand detected biallelic SNPs in each species tested. Our results show this marker set can be used for SNP-calling in a broad range of shark species, enabling detailed population assessments and other ecological and evolutionary studies.

Plant Mitochondrial Genome Map (PMGmap): A Software Tool for Comprehensive Visualizat...

Xinyi Zhang

and 6 more

October 19, 2023

Genome visualization tools are important for exploring genomic features and their interactions. Currently, visualization of plant mitochondrial genomes (mitogenomes) depends on tools originally designed for animal mitogenomes and plant plastomes. These tools cannot faithfully present features unique to plant mitogenomes, such as non-linear exon arrangement for genes, prevalence of functional non-coding features, and complex chromosomal architectures. To address these challenges, a software package, Plant Mitochondrial Genome Map (PMGmap), was developed in the Python programming language. PMGmap can draw genes at the exon level, draw cis- and trans-splicing gene maps, draw non-coding features, draw repetitive sequences, scale the genic regions using the "scaling the genic regions on the genome" (SGM) algorithm, and draw multiple chromosomes simultaneously. We compared PMGmap with other leading tools on 405 plant mitogenomes and found that PMGmap visualized the above-mentioned features better than those tools. We believe PMGmap will become an invaluable tool for plant mitogenome research. The web and container versions and the source code of PMGmap can be accessed at http://www.1kmpg.cn/pmgmap.

Metazoa-level USCOs as markers in species delimitation and classification

Lars Dietz

and 7 more

October 16, 2023

Metazoa-level Universal Single-Copy Orthologs (mzl-USCOs) are universally applicable markers for DNA taxonomy in animals which can replace or supplement single-gene barcodes. While previously mzl-USCOs from target enrichment data were shown to reliably distinguish species, here we tested whether USCOs are an evenly distributed, representative sample of a given metazoan genome and therefore able to cope with past hybridization events and incomplete lineage sorting. This is relevant for coalescent-based species delimitation approaches, which critically depend on the assumption that the investigated loci do not exhibit autocorrelation due to physical linkage. Based on 239 assessed chromosome-level assembled genomes, we confirmed that mzl-USCOs are genetically unlinked for practical purposes and a representative sample of a genome in terms of reciprocal distances between USCOs on a chromosome and of distribution across chromosomes. We tested the suitability of mzl-USCOs extracted from genomes for species delimitation and phylogeny in four case studies: Anopheles mosquitos, Drosophila fruit flies, Heliconius butterflies, and Darwin’s finches. In almost all instances, USCOs allowed delineating species and yielded phylogenies that correspond to those generated from whole genome data. Our phylogenetic analyses demonstrate that USCOs may complement single-gene DNA barcodes and provide more accurate taxonomic inferences. Combining USCOs from sources that used different versions of ortholog reference libraries to infer marker orthology may be challenging and at times impact taxonomic conclusions. However, we expect this problem to become less severe as the rapidly growing number of reference genomes provides a better representation of the number and diversity of organismic lineages.

High Efficiency Microbial DNA Extraction Method for Avian Feces and Preen Oil

Austin Russell

and 3 more

October 03, 2023

As sequencing technology continues to rapidly improve, studies investigating the microbial communities of host organisms (i.e., microbiomes) are becoming not only more popular but also more financially accessible. Across many taxa, microbiomes can have important impacts on organismal health and fitness. To evaluate the microbial community composition of a particular microbiome, microbial DNA must be successfully extracted. Fecal samples are often easy to collect and are a good source of gut microbial DNA. However, in birds and reptiles, microbial DNA extractions from fecal matter have proven to be difficult due to high concentrations of uric acid, an inhibitor of DNA extractions. Here, we present a new microbial DNA extraction method that is highly effective for avian species and displays higher efficiency and consistency than other commonly used methodologies. Further, our method is also effective in extracting microbial DNA from oils collected from the avian preen gland. Preen oil chemicals are important for many aspects of avian life, and the biosynthesis of these chemicals is dependent on the preen gland microbial community. We expect our method will facilitate microbial DNA extractions from multiple avian microbiome reservoirs, which have previously proved difficult and expensive. Our method therefore increases the feasibility of future studies of avian host microbiomes.

Genomics-informed captive breeding can reduce inbreeding depression and the genetic l...

Samuel Speak

and 6 more

September 25, 2023


Seq2Sat & SatAnalyzer toolkit: towards comprehensive microsatellite genotyping fr...

Peng Liu

and 4 more

September 12, 2023

Accurate and efficient genotyping of microsatellite loci is essential for their application in population genetics and various demographic analyses. Protocols for next-generation sequencing of microsatellite loci offer high-throughput and cross-compatible allele scoring, characteristics that address common issues associated with size separation in conventional capillary-based protocols. We have therefore developed Seq2Sat, a novel, ultra-fast, all-in-one software tool written in C++ to support accurate automated microsatellite genotyping. It takes raw reads of microsatellite amplicons directly and performs read quality control before inferring genotypes based on read depth, sequence composition, and length. It does not produce any intermediate files, making I/O very efficient. Additionally, we developed a module in Seq2Sat for sex identification based on sex locus amplicons. We further developed a user-friendly web-based platform, SatAnalyzer, to conduct reads-to-report analyses by calling Seq2Sat to generate genotype tables and interactive genotype graphs for manual editing. SatAnalyzer also allows visualization of read quality and distribution across loci and samples to troubleshoot multiplex optimization and high-quality library preparation. To evaluate its performance, we benchmarked SatAnalyzer against conventional capillary gel electrophoresis and an existing microsatellite genotyping software, MEGASAT. Results show that SatAnalyzer can achieve > 0.993 genotyping accuracy and that Seq2Sat is ~ 5 times faster than MEGASAT despite generating many more informative tables and figures. Seq2Sat and SatAnalyzer are freely available on GitHub (https://github.com/ecogenomicscanada/Seq2Sat) and Docker Hub (https://hub.docker.com/r/rocpengliu/satanalyzer).

Short read lengths recover ecological patterns in 16S rRNA gene amplicon data

Stephanie Jurburg

August 29, 2023

Metabarcoding is an increasingly popular and accessible method for assessing bacterial communities across a wide range of environments, and as the sequence data archives grow, sequence data reuse will likely become an important source of novel insights into the ecology of microbes. While literature exists on the benefits of longer read lengths for the study of microbial communities, little is known about the (re)usability of shorter (<200 bp) read lengths, yet this information is essential to improve the reuse and comparability of metabarcoding data across studies. This study reanalyzed three 16S rRNA datasets targeting aquatic, animal-associated, and soil microbiomes, and evaluated how processing the sequence data across a range of read lengths affected the resulting taxonomic assignments, biodiversity metrics, and differential (i.e., before-after treatment) analyses. Short read lengths successfully recovered ecological patterns, and limited increases in resolution were observed beyond 100 bp reads across environments. Furthermore, abundance-weighted diversity metrics (e.g., Inverse Simpson index or Bray-Curtis dissimilarities) were more robust to variation in read lengths. Importantly, the total number of ASVs detected increased with read length, highlighting the need to consider metabarcoding-derived diversity estimates within the context of the bioinformatics parameters selected. This study provides evidence-based guidelines for the processing of short reads.
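The two abundance-weighted metrics named in the abstract have standard textbook definitions, sketched below. The ASV counts are hypothetical and were chosen only to show why such metrics can be less sensitive than raw ASV richness when longer reads split off a few rare variants; they are not data from the study.

```python
def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i^2) over relative abundances p_i."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator


def richness(counts):
    """Number of ASVs with a non-zero count."""
    return sum(1 for c in counts if c > 0)


# Hypothetical ASV tables for one sample processed at two read lengths:
# the longer reads resolve two extra rare ASVs out of the third one.
short_reads = [500, 300, 200]
long_reads = [500, 300, 198, 1, 1]

print(richness(short_reads), richness(long_reads))                 # richness changes from 3 to 5
print(inverse_simpson(short_reads), inverse_simpson(long_reads))   # nearly identical values

# Hypothetical pair of samples sharing the same ASV order.
bc = bray_curtis([10, 0, 5], [5, 5, 5])
print(bc)
```

In this toy case richness rises from 3 to 5 while the Inverse Simpson index barely moves, because rare ASVs contribute almost nothing to the sum of squared relative abundances.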

SEGUL: An ultrafast, memory-efficient alignment manipulation and summary tool for phy...

Heru Handika

and 1 more

May 04, 2022

Phylogenetic studies now routinely require manipulating and summarizing thousands of data files. For most of these tasks, currently available software requires considerable computing resources and substantial knowledge of command-line applications. We develop ultrafast and memory-efficient software that performs over a dozen common phylogenomic manipulations and calculates statistics summarizing essential data features. Our software is available as standalone command-line (CLI) and graphical user interface (GUI) applications, and as a programming language library for Rust, R, and Python, with possible support for other languages. The CLI and library versions, SEGUL, run natively on Windows, Linux, and macOS, including Apple ARM Macs. The GUI version extends support to include mobile iOS and Android operating systems. SEGUL offers fast execution times and low memory footprints regardless of dataset size and platform choice. The inclusion of a GUI minimizes bioinformatics barriers to phylogenomics, while SEGUL’s efficiency reduces economic barriers by enabling analysis on inexpensive hardware. Our support for mobile operating systems further enables teaching phylogenomics where access to computing power is limited.

Dependent variable selection in phylogenetic generalized least squares regression ana...

Zheng-Lin Chen

and 2 more

June 16, 2023

Phylogenetic generalized least squares (PGLS) regression is widely used to detect evolutionary correlations. In contrast to the equal treatment of analyzed traits in conventional correlation methods such as Pearson and Spearman’s rank tests, we must designate one trait as the independent variable and the other as the dependent variable. However, in our PGLS regression analyses (using Pagel’s λ model) of both empirical and simulated datasets, switching independent and dependent variables yielded many conflicting results. A serious problem with PGLS regression that has not been noticed before is that selecting an inappropriate trait as the dependent variable will often result in an error. To assess correlations in simulated data, we established a gold standard by analyzing changes in traits along phylogenetic branches. Next, we tested seven potential criteria for dependent variable selection: log-likelihood, Akaike information criterion, R², p-value, Pagel’s λ, Blomberg et al.’s K, and the estimated λ in Pagel’s λ model. We determined that the last three criteria performed equally well in selecting the dependent variable and were superior to the other four. For practicality, we suggest using the trait with a higher λ or K value as the dependent variable in future PGLS regressions. In analyzing the evolutionary relationship between two traits, we should designate the trait with a stronger phylogenetic signal as the dependent variable even if it could logically be the cause in the relationship.
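For context, the standard textbook form of PGLS under Pagel's λ is written out below. The notation is an assumption for illustration and is not taken from the paper: y is the dependent trait, X the design matrix containing the independent trait, and C the phylogenetic covariance matrix implied by the tree under Brownian motion.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}\!\left(0,\ \sigma^{2} V(\lambda)\right),\\
V(\lambda) &= \lambda\, C + (1-\lambda)\,\operatorname{diag}(C), \qquad 0 \le \lambda \le 1,\\
\hat{\beta} &= \left(X^{\top} V(\lambda)^{-1} X\right)^{-1} X^{\top} V(\lambda)^{-1}\, y .
\end{aligned}
```

In common implementations λ is estimated from the residual structure of whichever trait is placed on the left-hand side, so swapping the two traits generally changes the estimated V(λ) and, with it, the test of the slope. This is one plausible reading of why the two directions can disagree, offered as a sketch rather than as the authors' explanation.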

Missing genotype imputation in non-model species using Self-Organizing Maps

Fernando Mora-Márquez

and 3 more

June 05, 2023

Current methodologies of genome-wide Single Nucleotide Polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact on downstream analysis, and several algorithms are available to fill in missing genotypes. The lack of reference haplotype panels precludes the use of these methods in genomic studies of non-model organisms. As an alternative, machine learning algorithms are employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on Self-Organizing Maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs into nearby neurons. We follow a classical approach that explores genotype datasets to select SNP loci for each query missing SNP genotype to build training sets, then initializes and trains the neural networks, and finally uses the SOM-derived clustering to impute the best genotype. To automate the imputation process, we have implemented GTIMPUTATION, an open source application programmed in Python 3 with a user-friendly GUI to facilitate the whole process. The performance of the method was validated by comparing its accuracy, precision, and sensitivity on several benchmark genotype datasets with those of other available imputation algorithms. Our approach produced highly accurate and precise genotype imputations and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.

Individual genotypes from environmental DNA: fingerprinting snow tracks of three larg...

Marta De Barba

and 10 more

May 29, 2023

Continued advancements in environmental DNA (eDNA) research have made it possible to access intraspecific variation from eDNA samples, opening new opportunities to expand non-invasive genetic studies of wild animal populations. However, the use of eDNA samples for individual genotyping, as typically performed in non-invasive genetics, has remained unachieved. We present the first successful individual genotyping of eDNA obtained from snow tracks of three large carnivores: brown bear (Ursus arctos), European lynx (Lynx lynx) and wolf (Canis lupus). DNA was extracted using a protocol for isolating water eDNA and genotyped using amplicon sequencing of short tandem repeats (STR) and, for brown bear, a sex marker, on a high-throughput sequencing platform. Individual genotypes were obtained for all species, but genotyping performance differed among samples and species. Multilocus genotyping success for individual identification was higher for brown bear samples (6 of 7) than for wolf (7 of 10) and lynx (4 of 9) samples. The sex marker was genotyped in 5 of 7 brown bear samples. Results for the three species show that reliable individual genotyping, including sex identification, is now possible from eDNA in snow tracks, underlining its vast potential to complement the non-invasive genetic methods used for wildlife. To fully leverage the application of snow track eDNA, an improved understanding of the ideal species- and site-specific sampling conditions, as well as of laboratory methods promoting genotyping success, is needed. This will also inform efforts to retrieve and type nuclear DNA from other eDNA samples, thereby advancing eDNA-based individual and population level studies.

Genotyping discordances? Empirical comparison of base-selective adaptors impact in 2b...

Carles Galià-Camps

and 3 more

July 01, 2022

Population genomic studies have increased over the last decade, showing great potential to reveal evolutionary patterns in a wide variety of organisms, and mostly rely on RAD sequencing techniques to obtain reduced representations of genomes. Among these techniques, 2b-RAD can provide a further secondary reduction to adjust study costs by using base-selective adaptors, although the impact of this reduction on genotyping is unknown. Here we provide empirical comparisons of genotyping and genetic differentiation when using fully degenerate and base-selective adaptors and assess the impact of missing data. We built libraries with the two types of adaptors for the same individuals and generated independent and combined datasets with different missingness filters according to locus presence (100%, 75% and 50%). Exploring locus by locus, we found 92% identical genotypes between both libraries of the same individual when using loci present in 100% of the samples, which decreased to 35% when working with loci present in at least 50% of them. We show that missing data is a major source of individual genetic differentiation. The loci with discordant genotypes were at low frequency (7.67%) in all filtered files. Only 0.96% were directly attributable to base-selective adaptors, and 6.44% underestimated heterozygosity in NN libraries, of which ca. 70% had <10 reads per locus, indicating that sufficient read depth should be ensured for correct genotyping. Our work confirms that 2b-RAD libraries using base-selective adaptors are a robust tool for population genomics of species with large genome sizes.
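A locus-by-locus comparison of two libraries built from the same individuals, under different locus-presence filters, might be sketched as follows. This is an illustrative assumption rather than the authors' procedure: the data layout, the function names, and the choice to skip pairs with a missing call (instead of counting them as mismatches) are all hypothetical.

```python
def locus_presence(calls):
    """Fraction of individuals with a non-missing genotype at one locus."""
    return sum(g is not None for g in calls) / len(calls)


def concordance(lib_a, lib_b, min_presence):
    """Share of identical genotypes between two libraries.

    lib_a, lib_b: dict mapping locus name -> list of genotypes, one per
    individual in the same order; None marks a missing call.
    Only loci present in at least `min_presence` of samples in both
    libraries are compared; pairs with a missing call are skipped.
    """
    same = compared = 0
    for locus in lib_a.keys() & lib_b.keys():
        a, b = lib_a[locus], lib_b[locus]
        if min(locus_presence(a), locus_presence(b)) < min_presence:
            continue
        for ga, gb in zip(a, b):
            if ga is None or gb is None:
                continue
            compared += 1
            same += ga == gb
    return same / compared if compared else float("nan")


# Hypothetical genotypes for four individuals in two library types.
lib_a = {
    "loc1": ["AA", "AG", "GG", "AG"],
    "loc2": ["CC", None, "CT", "CT"],
    "loc3": ["TT", None, None, "TT"],
}
lib_b = {
    "loc1": ["AA", "AG", "GG", "AG"],
    "loc2": ["CC", "CT", "CC", "CT"],
    "loc3": ["TT", "TA", None, None],
}

for threshold in (1.0, 0.75, 0.5):
    print(threshold, concordance(lib_a, lib_b, threshold))
```

How missing calls are treated changes the result substantially, so any comparison with the percentages reported in the abstract would need to follow the paper's own definition.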

MeStudio: crossing methylation and genomic features for comparative epigenomic analys...

Christopher Riccardi

and 5 more

June 10, 2022

DNA methylation is one of the most relevant epigenetic modifications. It is present in eukaryotes and prokaryotes and is related to several biological phenomena, including gene flow and adaptation to environmental conditions. The widespread use of third-generation sequencing technologies allows direct and easy detection of genome-wide methylation profiles, offering increasing opportunities to understand and exploit the epigenomic landscape of individuals and populations. Here, we present MeStudio, a pipeline for analysing genome-wide methylation profiles and combining them with genomic features. Outputs report the presence of DNA methylation in coding sequences (CDS) and noncoding sequences, including both intergenic sequences and sequences upstream of CDS. We show the usage and performance of MeStudio on a set of single-molecule real-time sequencing outputs from strains of the bacterial species Sinorhizobium meliloti. MeStudio is freely available under an open source GPLv3 license at https://github.com/combogenomics/MeStudio

Methods for detection of recent population subdivisions

Gabe O'Reilly

and 4 more

June 10, 2022

Potential subdivision events in populations can have a wide range of causes, from natural disasters like bushfires that isolate communities to anthropogenic disturbances like infrastructure projects cutting through a population’s habitat. Due to the unpredictability inherent in events like bushfires, and even for predictable events such as property development, populations affected by these potential subdivisions are often not studied until after the event, making it extremely hard to assess negative conservation impacts without the benefit of prior data. This paper applies population genetics methods to test whether the impact of a potential subdivision event on the genetic makeup of a population can be assessed accurately, especially when no data from before the event are available. Differentiation measures, such as Fst, might be used for detecting whether a population has been subdivided. However, these measures often take dozens of generations to show a significant change from zero (i.e., no differentiation), especially in larger populations. In this paper we present a more sensitive method, which is suitable for detecting subdivision effects within a few generations of the event and which can be applied without prior data. We test this method using both simulated data and genetic data from a population of koalas impacted by a railroad infrastructure development.
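The differentiation measure that the abstract uses as its baseline can be illustrated with the textbook heterozygosity-based Fst for one biallelic locus. This sketch shows only that baseline, not the more sensitive method the paper proposes (which the abstract does not describe); equal subpopulation sizes and the example allele frequencies are assumptions.

```python
def fst(subpop_freqs):
    """Fst = (HT - HS) / HT for one biallelic locus.

    subpop_freqs: allele frequency p in each subpopulation (equal sizes assumed).
    HS is the mean expected heterozygosity within subpopulations;
    HT is the expected heterozygosity of the pooled population.
    """
    n = len(subpop_freqs)
    hs = sum(2 * p * (1 - p) for p in subpop_freqs) / n
    p_bar = sum(subpop_freqs) / n
    ht = 2 * p_bar * (1 - p_bar)
    if ht == 0:
        return 0.0  # locus is fixed everywhere, so there is no differentiation to measure
    return (ht - hs) / ht


# Hypothetical frequencies shortly after a split: drift has moved them only slightly.
recent_split = fst([0.45, 0.55])
# Hypothetical frequencies after long isolation: alternate alleles fixed.
long_isolated = fst([0.0, 1.0])
print(recent_split, long_isolated)
```

With frequencies of 0.45 and 0.55 the value is only about 0.01, which illustrates the abstract's point that Fst stays close to zero for some time after a subdivision.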