Soil protists are increasingly studied due to a release from previous methodological constraints and the acknowledgement of their immense diversity and functional importance in ecosystems. However, these studies often lack sufficient depth of knowledge, which is visible in the form of misused terms and misinterpreted or over-interpreted data, with conclusions that cannot be drawn from the data obtained. As we welcome that non-experts also include protists in their still mostly bacteria- and/or fungi-focused studies, our aim here is to help avoid some common errors. We provide an overview of current terms to be used when working on soil protists, such as protist instead of protozoa, predator instead of grazer, and microorganisms rather than microflora, as well as terms to describe the prey spectrum of protists. We then highlight some do’s and don’ts in soil protist ecology, including challenges related to interpreting 18S rRNA gene amplicon sequencing data. We caution against the use of standard bioinformatic settings optimized for bacteria and the uncritical reliance on incomplete and partly erroneous reference databases. We also show why causal inferences cannot be drawn from sequence-based correlation analyses or any sampling/monitoring study in the field without thorough experimental confirmation and a sound understanding of the biology of taxa. Together, we envision this work helping non-experts to more easily include protists in their soil ecology analyses and to obtain more reliable interpretations from their protist data and other biodiversity data, which, in the end, will help to better understand soil ecology.
The rise of sedimentary ancient DNA (sedaDNA) studies has opened up new possibilities for studying pre-historic ecology. The use of sediments to identify organisms even where macroscopic remains are limited or no longer exist is an exciting and potentially ground-breaking area of genomics. There are special considerations however when managing this substrate in Indigenous Australian contexts. Sediments and soils are often considered as waste by-products during archaeological and paleontological excavations, and as such are not typically considered of high value in ethical considerations in traditional western research. Nevertheless, the product of sedaDNA work – genetic information from past fauna, flora, microbial communities, and human ancestors – is likely to be of cultural value for Indigenous peoples. We argue that the integration of Traditional Knowledges into sedaDNA research would a) allow identification of sensitive, secret, or sacred genomic data, and b) improve research outcomes by providing ecological context for species through multi-millennia oral histories.
We analyzed robustness of species identification based on proteomic composition to data processing and intraspecific variability, specificity and sensitivity of species-markers as well as discriminatory power of proteomic fingerprinting and its sensitivity to phylogenetic distance. Our analysis is based on MALDI-TOF MS data from 32 marine copepod species coming from 13 regions (North and Central Atlantic and adjacent seas). A random forest (RF) model correctly classified all specimens to species level with only small sensitivity to data processing, demonstrating the strong robustness of the method. Compounds with high specificity showed low sensitivity i.e., identification was rather based on complex pattern-differences than on presence of single markers. Proteomic distance was not consistently related to phylogenetic distance. A species-gap in proteome composition appeared at 0.8 Euclidean distance when using only specimens from the same sample. When other regions or seasons were included, intra-specific variability increased, resulting in overlaps of intra- and inter-specific distance. Highest intra-specific distances (> 0.8) were observed between specimens from brackish and marine habitats i.e., salinity likely affects proteomic patterns. When testing library sensitivity of the RF model to regionality, strong misidentification was only detected between two congener pairs. Still, choice of reference library may have an impact on identification of closely related species and should be tested before routine application. We envision high relevance of this time- and cost-efficient method for future zooplankton monitoring as it provides not only in-depth taxonomic resolution for counted specimens but also add-on information e.g., on developmental stage or environmental conditions.
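The "species-gap" idea above can be illustrated with a minimal sketch: compute all pairwise Euclidean distances between specimen profiles, split them into intra- and inter-specific pairs, and check whether the largest intra-specific distance stays below the smallest inter-specific one. The function names (`euclidean`, `distance_gap`) and the toy two-peak profiles are hypothetical and are not taken from the study; real MALDI-TOF profiles would have many more peaks and would be preprocessed first.

```python
from itertools import combinations
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_gap(profiles, labels):
    """Return (largest intra-specific, smallest inter-specific) distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(profiles)), 2):
        d = euclidean(profiles[i], profiles[j])
        # Same species label -> intra-specific pair, otherwise inter-specific.
        (intra if labels[i] == labels[j] else inter).append(d)
    if not intra or not inter:
        raise ValueError("need at least one within- and one between-species pair")
    return max(intra), min(inter)


# Toy, made-up profiles (two peaks per specimen); not data from the study.
profiles = [(0.0, 0.0), (0.3, 0.4), (3.0, 4.0), (3.3, 4.4)]
labels = ["sp_A", "sp_A", "sp_B", "sp_B"]

max_intra, min_inter = distance_gap(profiles, labels)
has_gap = max_intra < min_inter  # True when the two distributions do not overlap
print(f"max intra = {max_intra:.2f}, min inter = {min_inter:.2f}, gap: {has_gap}")
```

When specimens from other regions or seasons are added, the same check would report an overlap as soon as any intra-specific distance exceeds the smallest inter-specific one.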
Published literature suggests that indigenous cultural practices, specifically traditional medicine, are commonplace among urban communities, contrary to the general conception that such practices are associated with rural societies. We reviewed literature for records of herptiles sold by traditional health practitioners in urban South Africa, then used visual confirmation surveys, DNA barcoding, and folk taxonomy to identify the herptile species that were on sale. Additionally, interviews with 11 SePedi- and IsiZulu-speaking traditional health practitioners were used to document details of the collection and pricing of herptile specimens along with the practitioners’ views of current conservation measures aimed at traditional medicine markets. The herptile specimens sold by traditional health practitioners included endangered and non-native species. The absorbance ratios of DNA extracted from the tissue of herptiles used in traditional medicine were found to be unreliable predictors of whether those extractions would be suitable for downstream applications. From an initial set of 111 tissue samples, 81 sequencing reactions were successful and 55 of the obtained sequences had species-level matches to COI reference sequences on the NCBI GenBank and/or BOLD databases. Molecular identification revealed that traditional health practitioners sometimes mislabel the species they use. The mixed methodology employed here is useful for conservation planning as it updates knowledge of animal use in indigenous remedies and can accurately identify species of high conservation priority. Furthermore, the study highlights the possibility of collaborative conservation planning with traditional health practitioners.
Despite the increasing accessibility of high-throughput sequencing, obtaining high-quality genomic data on non-model organisms without proximate well-assembled and annotated genomes remains challenging. Here we describe a workflow that takes advantage of distant genomic resources and ingroup transcriptomes to select and jointly enrich long open reading frames (ORFs) and ultraconserved elements (UCEs) from genomic samples for integrative studies of microevolutionary and macroevolutionary dynamics. This workflow is applied to samples of the African unionid bivalve tribe Coelaturini (Parreysiinae) at basin and continent-wide scales. Our results indicate that ORFs are efficiently captured without prior identification of intron-exon boundaries. The enrichment of UCEs was less successful, but nevertheless produced substantial datasets. Exploratory continent-wide phylogenetic analyses with ORF supercontigs (>515,000 parsimony informative sites) resulted in a fully resolved phylogeny, the backbone of which was also retrieved with UCEs (>11,000 informative sites). Variant calling on ORFs and UCEs of Coelaturini from the Malawi Basin produced ~2,000 SNPs per population pair. Estimates of nucleotide diversity and population differentiation were similar for ORFs and UCEs. They were low compared to previous estimates in mollusks, but comparable to those in recently diversifying Malawi cichlids and other taxa at an early stage of speciation. Skimming off-target sequence data from the same enriched libraries of Coelaturini from the Malawi Basin, we reconstructed the maternally-inherited mitogenome, which displays the gene order inferred for the most recent common ancestor of Unionidae. Overall, our workflow and results provide exciting perspectives for integrative genomic studies of microevolutionary and macroevolutionary dynamics in non-model organisms.
Revegetation projects face the major challenge of sourcing the optimal plant material. This is often done with limited information about plant performance and increasingly requires factoring in resilience to climate change. Functional traits can be used as quantitative indices of plant performance and guide provenancing, but trait values expected under novel conditions are often unknown. To support climate-resilient provenancing efforts, we develop a trait prediction model that integrates the effect of genetic variation with fine-scale temperature variation. We train our model on multiple field plantings of Arabidopsis thaliana and predict two relevant fitness traits (days-to-bolting and fecundity) across the species' European range. Prediction accuracies were high for days-to-bolting and moderate for fecundity, with the majority of trait variation explained by temperature differences between plantings. Projection under future climate predicted a decline in fecundity, although this response was heterogeneous across the range. In response, we identified novel genotypes that could be introduced to genetically offset the fitness decay. Our study highlights the value of predictive models to aid seed provenancing and improve the success of revegetation projects.
Innovations in ancient DNA (aDNA) preparation and sequencing technologies have exponentially increased the quality and quantity of aDNA data extracted from ancient biological materials. The additional temporal component from the incoming aDNA data can provide improved power to address fundamental evolutionary questions like characterising selection processes that shape the phenotypes and genotypes of contemporary populations or species. However, utilising aDNA to study past selection processes still involves considerable hurdles such as how to eliminate the confounding effect of genetic interactions in the inference of selection. To circumvent this challenge, in this work we extend the method introduced by He et al. (2022) to infer temporally variable selection from the data on aDNA sequences with the flexibility of modelling linkage and epistasis. Our posterior computation is carried out through a robust adaptive version of the particle marginal Metropolis-Hastings algorithm with a coerced acceptance rate. Moreover, our extension inherits their desirable features like modelling sample uncertainties resulting from the damage and fragmentation of aDNA molecules and reconstructing underlying gamete frequency trajectories of the population. We assess the performance and show the utility of our procedure with an application to ancient horse samples genotyped at the loci encoding base coat colours and pinto coat patterns.
Over the last two decades, there has been a huge increase in our understanding of microbial diversity, structure and composition enabled by high-throughput sequencing (HTS) technologies. Yet, it is unclear how the number of sequences translates to the number of cells or species within the community. Additional observational data may be required to ensure relative abundance patterns from sequence reads are biologically meaningful, or presence-absence data may be used instead of abundance. The goal is to obtain robust community abundance data, simultaneously, from environmental samples. In this issue of Molecular Ecology Resources, Karlusich et al. (2022) describe a new method for quantifying phytoplankton cell abundance. Using Tara Oceans datasets, the authors propose the photosynthetic gene psbO for reporting accurate relative abundance of the entire phytoplankton community from metagenomic data. The authors demonstrate improved correlations with traditional optical methods, including microscopy and flow cytometry, improving upon current molecular identification, which typically uses rRNA marker genes. Furthermore, to facilitate application of their approach, the authors curated a psbO gene database for accessible taxonomic queries. This is an important step towards improving species abundance estimates from molecular data and eventually reporting absolute species abundance, enhancing our understanding of community dynamics.
The analysis of genomic data can be an intimidating process, particularly for researchers who are not experienced programmers. Commonly used analyses are spread out across programs, each of which requires its own input format, and data must often be wrangled and re-wrangled into new formats to split the data according to categorical metadata variables, such as population or family. Here, we introduce snpR, an R package that allows for user-friendly processing of SNP genomic data by automating data sub-setting and processing across categorical metadata, integrating approaches contained in many different packages under a single ecosystem, and allowing for iterative, efficient analysis focused on a single R object across an entire analysis pipeline.
RNA sequencing (RNA-Seq) is a popular method for measuring gene expression in non-model organisms, including wild populations. While RNA-Seq can measure gene expression variation among wild-caught individuals and can yield important biological insights into organism function, sampling methods may also influence gene expression estimates. We examined the influence of multiple technical variables on estimated gene expression in a non-model fish, the westslope cutthroat trout (Oncorhynchus clarkii lewisi), using two RNA-Seq library types: 3’ RNA-Seq and whole mRNA-Seq. We evaluated effects of dip netting versus electrofishing, and of harvesting tissue immediately versus 5 minutes after euthanasia on estimated gene expression in blood, gill, and muscle. We detected 30% more genes with whole mRNA-Seq than with 3’ RNA-Seq and found that 58% of genes were significantly differently expressed between 3’ RNA-Seq and whole mRNA-Seq. Our findings indicate that 3’ RNA-Seq and whole mRNA-Seq are robust to the technical variables related to the field sampling approaches tested here with a lack of differential gene expression among sampling methods and tissue collection time after euthanasia. However, we found that gene expression varied based on which RNA-Seq library type was used on the same set of samples. Our study suggests researchers could safely rely on different fish sampling strategies in the field and save money and analyze more individuals using 3’ RNA-Seq, but should use whole mRNA-Seq when working with a species without good genomic resources, and when maximizing the number of genes identified and detecting alternative splicing are important.
Background Microbiome studies are often limited by a lack of statistical power due to small sample sizes and a large number of features. This problem is exacerbated in correlative studies of multi-omic datasets. Statistical power can be increased by finding and summarizing modules of correlated observations, which is one dimensionality reduction method. Additionally, modules provide biological insight as correlated groups of microbes can have relationships among themselves. Results To address these challenges, we developed SCNIC: Sparse Cooccurrence Network Investigation for compositional data. SCNIC is open-source software that can generate correlation networks and detect and summarize modules of highly correlated features. Modules can be formed using either the Louvain Modularity Maximization (LMM) algorithm or a Shared Minimum Distance algorithm (SMD) that we newly describe here and relate to LMM using simulated data. We applied SCNIC to two published datasets and we achieved increased statistical power and identified microbes that not only differed across groups, but also correlated strongly with each other, suggesting shared environmental drivers or cooperative relationships among them. Conclusions SCNIC provides an easy way to generate correlation networks, identify modules of correlated features and summarize them for downstream statistical analysis. Although SCNIC was designed considering properties of microbiome data, such as compositionality and sparsity, it can be applied to a variety of data types including metabolomics data and used to integrate multiple data types. SCNIC allows for the identification of functional microbial relationships at scale while increasing statistical power through feature reduction.
The ithomiine butterflies (Nymphalidae: Danainae) represent the largest known radiation of Müllerian mimetic butterflies. They numerically dominate mimetic butterfly communities, which include species such as those of the iconic neotropical genus Heliconius. Despite recent studies carried out on ithomiine ecology and genetic structure, no reference genome was available for the tribe. Here, we generated high-quality, chromosome-scale genome assemblies of two Melinaea species, Melinaea marsaeus and Melinaea menophilus, and a draft genome of Ithomia salapia. We obtained genomes with a size ranging from 396 Mb to 503 Mb across the three species and scaffold N50 of 40.5 Mb and 23.2 Mb for the two chromosome-scale assemblies. Using collinearity analyses we identified massive rearrangements between the two closely related Melinaea species. A detailed annotation of transposable elements and genes was performed, resulting in the identification of 24,341, 31,081 and 31,976 genes in I. salapia, M. marsaeus and M. menophilus, respectively. We used a specialist annotation to target chemosensory genes, which are crucial for host plant detection and mate recognition in mimetic species. A comparative genomic approach revealed independent gene expansions in ithomiines, particularly in gustatory receptor genes. These first three genomes of ithomiine mimetic butterflies constitute a valuable addition and a welcome comparison to existing biological models of mimicry, such as Heliconius, and will enable further understanding of the mechanisms of adaptation and the genetic bases underpinning mimicry.
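For readers unfamiliar with the scaffold N50 statistic reported above, the following minimal sketch shows the usual definition: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative assumptions and do not come from the assemblies described.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to >= 50% is the N50.
        if running >= half_total:
            return length


# Toy, made-up scaffold lengths in Mb (total 150 Mb).
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 40: 50 + 40 = 90 is the first running sum >= 75
```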
Conducting large-scale phylogeographic studies to understand processes affecting population structure and genetic diversity across multiple species is difficult because the key genetic (NCBI) and spatial (GBIF) repositories are disconnected. In this issue of Molecular Ecology Resources, Pelletier et al. (2022) demonstrate the power of connecting these in the program phylogatR. This program assembled 87,852 species and 102,268 sequence alignments in a taxonomic hierarchy, yielding multiple sequence alignments per species, mainly for animals (88%), composed mostly of mtDNA data. The authors discuss several caveats with these alignments and provide flags identifying particular problems associating locality and genetic data with certain taxa (e.g., multiple localities per individuals). They provide a test that nucleotide diversity should increase with area, but find a significant relationship in only 32% of taxa with no clear taxonomic or ecological factors accounting for this. To examine the potential of this program, I tested the idea that the degree of population expansion should increase with latitude given potential environmental stability in the tropics and instability in temperate regions. In under two hours, I downloaded all squamates (lizards and snakes) and regressed Tajima’s D on latitude and found a weak but significant negative relationship, indicating a potential association between latitude and population expansion. The phylogatR database is a powerful resource for researchers wanting to test the relationship between genetic diversity and some aspect of space or environment.
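The statistic regressed on latitude above, Tajima's D, can be computed from a single alignment roughly as follows. This is a minimal sketch of the standard formula (Tajima 1989), assuming equal-length sequences without gaps or ambiguity codes; the function name `tajimas_d` and the toy alignments are hypothetical, and this is not the code used by phylogatR or in the analysis described.

```python
from itertools import combinations
import math


def tajimas_d(seqs):
    """Tajima's D for aligned, equal-length sequences (no gaps assumed)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    # S: number of segregating (polymorphic) alignment columns.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy alignments (made up): an excess of singletons gives a negative D,
# as expected after population expansion; two balanced haplotypes give a positive D.
singletons = ["AAAA", "TAAA", "ATAA", "AATA"]
balanced = ["AA", "AA", "TT", "TT"]
print(tajimas_d(singletons) < 0, tajimas_d(balanced) > 0)
```

One D value per species, paired with a representative latitude, would then be the input to an ordinary least-squares regression.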
Rust fungi are characterized by large genomes with high repeat content, and have two haploid nuclei in most life stages, which makes achieving high-quality genome assemblies challenging. Here, we describe a pipeline using HiFi reads and Hi-C data to assemble the gigabase-sized genome of a fungal pathogen, Puccinia polysora f.sp. zeae, to a haplotype-phased, chromosome-scale level. The final assembled genome is 1.71 Gbp, with ~850 Mbp and 18 chromosomes in each haplotype, making it currently the largest fungal genome assembled to chromosome scale. Transcript-based annotation identified 47,512 genes with a similar number for each haplotype. A high level of interhaplotype variation was found, with 10% haplotype-specific BUSCO genes, 5.8 SNPs/kbp, and structural variation accounting for 3% of the genome size. The P. polysora genome displayed over 85% repeat content, with genome-size expansion, gene losses and gene family expansions suggested by multiple copies of species-specific orthogroups. Interestingly, these features did not affect overall synteny with other Puccinia species with smaller genomes. Fine-time-point transcriptomics revealed seven clusters of co-expressed secreted proteins that are conserved between the two haplotypes. Candidate effectors were interspersed with all other genes, indicating the absence of “two-speed genome” evolution in P. polysora. Genome resequencing of 79 additional isolates revealed a clonal population structure of P. polysora in China with low geographic differentiation. Nevertheless, a minor population drifted from the major population by having mutations in secreted proteins including AvrRppC, indicating ongoing evolution and population differentiation. The high-quality assembly provides valuable genomic resources for future studies on the evolution of P. polysora.
Biomonitoring surveys from environmental DNA make use of metabarcoding tools to describe community composition. These studies match their sequencing results against public genomic databases to identify the species. However, mitochondrial genomic reference data are still incomplete: only a few genes may be available, or the existing sequence data may be suboptimal for species-level resolution. Here we present a dedicated and cost-effective workflow with no DNA amplification for generating complete fish mitogenomes for the purpose of strengthening fish mitochondrial databases. Two different long-fragment sequencing approaches using Oxford Nanopore sequencing coupled with mitochondrial DNA enrichment were used. In the first, enrichment is achieved by preferential isolation of mitochondria followed by DNA extraction and nuclear DNA depletion (‘mitoenrichment’). The second takes advantage of CRISPR-Cas9 targeted scission on previously dephosphorylated DNA (‘targeted mitosequencing’). The sequencing results varied between tissue, species, and integrity of the DNA. The mitoenrichment method yielded 0.17-12.33 % of sequences on target and a mean coverage ranging from 74.9- to 805-fold. The targeted mitosequencing experiment from native genomic DNA yielded 1.83-55 % of sequences on target and a 38- to 2123-fold mean coverage. This helped complete the mitogenome of species with homopolymeric regions, tandem repeats and gene rearrangements. We demonstrate that deep sequencing of long fragments of native fish DNA is possible and can be achieved with low computational resources in a cost-effective manner, exceeding the widespread genome skimming approach and making the discovery of mitogenomes of non-model or understudied fish taxa accessible to a broad range of laboratories worldwide.
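The two summary figures quoted above, percentage of sequences on target and mean fold coverage, can be derived from read-level information as in this minimal sketch. It assumes each read has already been classified as mitochondrial or not (for example by mapping to a reference) and approximates aligned bases by read length; the function name `on_target_stats`, the toy reads and the 17,000 bp mitogenome length are all illustrative assumptions rather than values from the study.

```python
def on_target_stats(reads, target_length):
    """reads: list of (read_length_bp, is_on_target) tuples.

    Returns (percent of reads on target, mean fold coverage of the target).
    """
    if not reads or target_length <= 0:
        raise ValueError("need at least one read and a positive target length")
    on_target = [length for length, hit in reads if hit]
    percent = 100 * len(on_target) / len(reads)
    # Mean coverage = on-target bases divided by the length of the target.
    coverage = sum(on_target) / target_length
    return percent, coverage


# Toy, made-up reads: two of four map to a hypothetical 17,000 bp mitogenome.
toy_reads = [(5000, True), (8000, False), (12000, True), (3000, False)]
percent, coverage = on_target_stats(toy_reads, target_length=17000)
print(f"{percent:.1f} % on target, {coverage:.1f}-fold mean coverage")
```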
The measurement of biodiversity is an integral aspect of life science research. With the establishment of second- and third-generation sequencing technologies, an increasing amount of metabarcoding data is being generated as we seek to describe the extent and patterns of biodiversity in multiple contexts. The reliability and accuracy of taxonomically assigning metabarcoding sequencing data has been shown to be critically influenced by the quality and completeness of reference databases. Custom, curated, eukaryotic reference databases, however, are scarce, as are the software programs for generating them. Here, we present CRABS (Creating Reference databases for Amplicon-Based Sequencing), a software package to create custom reference databases for metabarcoding studies. CRABS includes tools to download sequences from multiple online repositories (i.e., NCBI, BOLD, EMBL, MitoFish), retrieve amplicon regions through in silico PCR analysis and pairwise global alignments, curate the database through multiple filtering parameters (e.g., dereplication, sequence length, sequence quality, unresolved taxonomy), export the reference database in multiple formats for the immediate use in taxonomy assignment software, and investigate the reference database through implemented visualizations for diversity, primer efficiency, reference sequence length, and taxonomic resolution. CRABS is a versatile tool for generating curated reference databases of user-specified genetic markers to aid taxonomy assignment from metabarcoding sequencing data. CRABS is available for download as a conda package and via GitHub (https://github.com/gjeunen/reference_database_creator).
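The curation step described above (filtering on length, sequence quality and unresolved taxonomy, then dereplicating) can be sketched generically as follows. This is not CRABS code and does not use its interface; the function `curate`, its parameters, the "unclassified" keyword and the toy records are hypothetical choices made only to illustrate the kind of filters involved.

```python
def curate(records, min_len=4, max_len=10, max_n=0):
    """Filter and dereplicate (seq_id, taxonomy, sequence) records."""
    seen = set()
    kept = []
    for seq_id, taxonomy, seq in records:
        seq = seq.upper()
        if not min_len <= len(seq) <= max_len:
            continue  # outside the expected amplicon length range
        if seq.count("N") > max_n:
            continue  # too many ambiguous bases
        if not taxonomy or "unclassified" in taxonomy.lower():
            continue  # unresolved taxonomy
        key = (taxonomy, seq)
        if key in seen:
            continue  # duplicate sequence for the same taxon
        seen.add(key)
        kept.append((seq_id, taxonomy, seq))
    return kept


# Toy, made-up records with very short sequences for readability.
toy_records = [
    ("r1", "Salmo trutta", "ACGTAC"),
    ("r2", "Salmo trutta", "acgtac"),        # duplicate of r1 after upper-casing
    ("r3", "Salmo salar", "ACGTAC"),         # same sequence, different taxon: kept
    ("r4", "unclassified fish", "ACGTTT"),   # unresolved taxonomy
    ("r5", "Esox lucius", "ACNNAC"),         # ambiguous bases
    ("r6", "Esox lucius", "AC"),             # too short
]
curated = curate(toy_records)
print([seq_id for seq_id, _, _ in curated])
```

Dereplicating on the (taxonomy, sequence) pair rather than on the sequence alone keeps identical amplicons shared by different species, which matters when assessing taxonomic resolution.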
Despite recent advances in high-throughput DNA sequencing technologies, a lack of locally relevant DNA reference databases may limit the potential for DNA-based monitoring of biodiversity for conservation and biosecurity applications. Museums and national collections represent a compelling source of authoritatively identified genetic material for DNA database development, yet obtaining DNA barcodes from long-stored specimens may be difficult due to sample degradation. We demonstrate a sensitive and efficient laboratory and bioinformatic process for generating DNA barcodes from hundreds of invertebrate specimens simultaneously via the Illumina MiSeq system. Using this process, we recovered full-length (334) or partial (105) COI barcodes from 439 of 450 (98 %) national collection-held invertebrate specimens. This included full-length barcodes from 146 specimens which produced low-yield DNA and no visible PCR bands, and which produced as little as a single sequence per specimen, demonstrating high sensitivity of the process. In many cases, the most abundant sequences per specimen were not the correct barcodes, necessitating the development of a taxonomy-informed process for identifying correct sequences among the sequencing output. The recovery of only partial barcodes for some taxa indicates a need to refine certain PCR primers. Nonetheless, our approach represents a highly sensitive, accurate, and efficient method for targeted reference database generation, providing a foundation for DNA-based assessments and monitoring of biodiversity.
Dietary metabarcoding has vastly improved our ability to analyse the diets of animals, but it is hampered by a plethora of technical limitations including potentially reduced data output due to the disproportionate amplification of the DNA of the focal predator, here termed ‘the predator problem’. We review the various methods commonly used to overcome this problem, from deeper sequencing to exclusion of predator DNA during PCR, and how they may interfere with increasingly common multi-predator-taxon studies. We suggest that multi-primer approaches with an emphasis on achieving both depth and breadth of prey detections may overcome the issue to some extent, although multi-taxon studies require further consideration, as highlighted by an empirical example. We also review several alternative methods for reducing the prevalence of predator DNA that are conceptually promising but require additional empirical examination. The predator problem is a key constraint on molecular dietary analyses but, through this synthesis, we hope to guide researchers in overcoming this in an effective and pragmatic way.