Non-random mating among individuals can lead to spatial clustering of genetically similar individuals and population stratification. This deviation from panmixia is commonly observed in natural populations. Consequently, individuals can have parentage in single populations or involving hybridization between differentiated populations. Accounting for this mixture and structure is important when mapping the genetics of traits and learning about the formative evolutionary processes that shape genetic variation among individuals and populations. Stratified genetic relatedness among individuals is commonly quantified using estimates of ancestry that are derived from a statistical model. Development of these models for polyploid and mixed-ploidy individuals and populations has lagged behind those for diploids. Here, we extend and test a hierarchical Bayesian model, called entropy, which can utilize low-depth sequence data to estimate genotype and ancestry parameters in autopolyploid and mixed-ploidy individuals (including sex chromosomes and autosomes within individuals). Our analysis of simulated data illustrated the trade-off between sequencing depth and genome coverage and found lower error associated with low depth sequencing across a larger fraction of the genome than with high depth sequencing across a smaller fraction of the genome. The model has high accuracy and sensitivity as verified with simulated data and through analysis of admixture among populations of diploid and tetraploid Arabidopsis arenosa.
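The link between sequencing depth and genotype uncertainty in the abstract above can be illustrated with a small sketch. It is not the entropy model itself: it assumes simple binomial sampling of reads at a biallelic site, a flat prior over genotypes, and a fixed per-read error rate (the function name `genotype_posterior` and the default `error=0.01` are hypothetical choices made for illustration).

```python
from math import comb

def genotype_posterior(alt_reads, total_reads, ploidy, error=0.01):
    """Posterior over the number of alternate-allele copies (0..ploidy).

    Assumptions (illustrative only): reads are sampled binomially, the
    prior over genotypes is flat, and each read is miscalled with
    probability `error`.
    """
    likelihoods = []
    for copies in range(ploidy + 1):
        frac = copies / ploidy
        # Probability that a single read shows the alternate allele.
        p_alt = frac * (1 - error) + (1 - frac) * error
        lik = (comb(total_reads, alt_reads)
               * p_alt ** alt_reads
               * (1 - p_alt) ** (total_reads - alt_reads))
        likelihoods.append(lik)
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

# Same allele ratio at low and higher depth in a tetraploid.
low_depth = genotype_posterior(alt_reads=2, total_reads=4, ploidy=4)
high_depth = genotype_posterior(alt_reads=20, total_reads=40, ploidy=4)
print([round(p, 3) for p in low_depth])
print([round(p, 3) for p in high_depth])
```

With the same allele ratio, the posterior is more concentrated on the two-copy genotype at the higher depth, which is the per-site cost that low-depth designs trade for broader genome coverage.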
Functional annotation aims to assess the biochemical and biological functions of sets of genomic or transcriptomic sequences yielded by next-generation sequencing experiments. One common way to functionally annotate such a set of sequences is to search for homologous sequences and access the related functional information deposited in genomic databases. Functional annotation is especially challenging in de novo assemblies of transcriptomes of non-model organisms, like many plant species. In such cases, existing commercial and open-access general-purpose applications may not offer complete and accurate results. We present TOA (Taxonomy-oriented annotation), a user-friendly open-access application designed to establish functional annotation pipelines geared towards non-model plant species. TOA performs homology searches against proteins stored in the PLAZA platform databases, NCBI RefSeq Plant, Nucleotide Database and Non-Redundant Protein Sequence Database, and retrieves functional information for several gene ontology systems. The software performance was validated by comparing the runtimes, total number of annotated sequences and accuracy of the functional information obtained for several plant benchmark datasets with TOA and other open-access functional annotation solutions. TOA outperformed the other software in terms of number of annotated sequences and accuracy of the annotation, and constitutes a good alternative to improve functional annotation in plants. TOA is especially recommended for gymnosperms or for low-quality sequence datasets of non-model plants.
Karyotypic changes in chromosome number and structure are drivers in the divergent evolution of diverse plant species and lineages. This study aimed to reveal the origins of the unique karyotype (2n = 12) and phylogenetic relationships of the genus Megadenia (Brassicaceae). A high-quality chromosome-scale genome was assembled for Megadenia pygmaea using Nanopore long reads and high-throughput chromosome conformation capture (Hi-C). The assembled genome is 215.2 Mb and is anchored on six pseudo-chromosomes. We annotated a total of 25,607 high-confidence protein-coding genes and corroborated the phylogenetic affinity of Megadenia with the expanded Lineage II, which contains numerous agricultural crops. We dated the divergence of Megadenia from its closest relatives to 27.04 (19.11-36.60) million years ago. A reconstruction of the chromosomal composition of the species was performed based on the de novo assembled genome and comparative chromosome painting analysis. The karyotype structure of M. pygmaea is very similar to the previously inferred Proto-Calepineae Karyotype (PCK; n = 7) of the Brassicaceae Lineage II. However, an end-to-end translocation between two ancestral chromosomes reduced the chromosome number from n = 7 to n = 6, comparable to Megadenia. Our reference genome provides fundamental information for use in horticulture, plant breeding and evolutionary study of this genus.
Plant interactions are as important belowground as aboveground. Belowground plant interactions are, however, inherently difficult to quantify, as roots of different species are difficult to disentangle. Although molecular techniques have been successfully applied for a couple of decades to quantify root abundance, root identification and quantification in multi-species plant communities remain particularly challenging. Here we present a novel methodology, multi-species Genotyping By Sequencing (msGBS), as a next step to tackle this challenge. First, a multi-species meta-reference database containing thousands of gDNA clusters per species is created from GBS-derived High Throughput Sequencing (HTS) reads. Second, GBS-derived HTS reads from multi-species root samples are mapped to this meta-reference which, after a filter procedure to increase the taxonomic resolution, allows the parallel quantification of multiple species. The msGBS signal of 111 mock-mixture root samples, with up to 8 plant species per sample, was used to calculate the within-species abundance. Optional subsequent calibration yielded the across-species abundance. The within- and across-species abundances were highly correlated (R² range 0.72-0.94 and 0.85-0.98, respectively) with the biomass-based species abundance. Compared to a qPCR-based method that was previously used to analyze the same set of samples, msGBS provided similar results. Additional data on 11 congener species groups within 105 natural field root samples showed high taxonomic resolution of the method. msGBS is highly scalable in terms of sensitivity and species numbers within samples, which is a major advantage compared to the qPCR method and advances our tools to reveal the hidden belowground interactions.
Biodiversity studies greatly benefit from molecular tools, such as DNA metabarcoding, which provides an effective identification tool in biomonitoring and conservation programmes. The accuracy of species-level assignment, and consequent taxonomic coverage, relies on comprehensive DNA barcode reference libraries. The role of these libraries is to support species identification, but accidental errors in the generation of the barcodes may compromise their accuracy. Here we present an R-based application, BAGS (Barcode, Audit & Grade System), that performs automated auditing and annotation of cytochrome c oxidase subunit I (COI) sequence libraries, for a given taxonomic group of animals, available in the Barcode of Life Data System (BOLD). This is followed by a qualitative ranking system that assigns one of five grades (A to E) to each species in the reference library, according to the attributes of the data and the congruency of species names with sequences clustered in Barcode Index Numbers (BINs). Our ultimate goal is to allow researchers to obtain the most useful and reliable data, highlighting and segregating records according to their congruency. Different tests were performed to assess its usefulness and limitations. BAGS fills a significant gap in the current landscape of DNA barcoding research tools by quickly screening reference libraries to gauge the congruence status of data and facilitate the triage of ambiguous data for subsequent review. BAGS thereby has the potential to become a valuable addition to forthcoming DNA metabarcoding studies, contributing in the long term to a global improvement in the quality and reliability of the public reference libraries.
Interrogation of chromatin modifications, such as DNA methylation, has potential to improve forecasting and conservation of marine ecosystems. The standard method for assaying DNA methylation (Whole Genome Bisulfite Sequencing), however, is too costly to apply at the scales required for ecological research. Here we evaluate different methods for measuring DNA methylation for ecological epigenetics. We compare Whole Genome Bisulfite Sequencing (WGBS) with Methylated CpG Binding Domain Sequencing (MBD-seq), and a modified version of MethylRAD we term methylation-dependent Restriction site-Associated DNA sequencing (mdRAD). We evaluate these three assays in measuring variation in methylation across the genome, between genotypes, and between polyp types in the reef-building coral Acropora millepora. We find that all three assays measure absolute methylation levels similarly, with tight correlations for methylation of gene bodies (gbM), as well as exons and 1Kb windows. Correlations for differential gbM between genotypes were weaker, but still concurrent across assays. We detected little to no reproducible differences in gbM between polyp types. We conclude that MBD-seq and mdRAD are reliable cost-effective alternatives to WGBS. Moreover, the considerably lower sequencing effort required for mdRAD to produce comparable methylation estimates makes it particularly useful for ecological epigenetics.
Microsporidia are obligate intracellular eukaryotic parasites that infect nearly all animal groups, including humans. The most common molecular methods for Microsporidia detection rely on species-targeting qPCR or end-point PCR using group-specific primers. However, these methods may not be specific enough or may fail in cases of mixed infections. We developed a method for parallel detection of both microsporidian infection and the host species. We designed new primer sets: one specific for the classical Microsporidia (targeting the hypervariable V5 region of ssu rDNA), and a second one targeting a shortened fragment of the COI gene (standard metazoan DNA-barcode); both markers are well suited for an NGS approach. The analysis of an ssu rDNA dataset representing 607 microsporidian species (120 genera) indicated that the V5 region enables identification of >98% of the species in the dataset (596/607). To test the method, we used microsporidians that infect mosquitoes in natural populations. Using mini-COI data, all field-collected mosquitoes were unambiguously assigned to seven species; among them almost 60% of specimens (127/212) were positive for at least 11 different microsporidian species, including a new microsporidian ssu rDNA sequence (Microsporidium sp. PL01). Phylogenetic analysis of Microsporidium sp. PL01 ssu rDNA showed that this species belongs to one of the two main clades in the Terresporidia. In addition, the level of microsporidian mixed infections was relatively high (9.4%). The numbers of sequence reads for the OTUs suggest that the occurrence of Nosema spp. in co-infections could benefit them; however, this observation should be re-tested using more intensive host sampling. The proposed method for detection of Microsporidia can be applied to all types of DNA extracts, including medical and environmental samples.
Partial clonality is widespread across the tree of life, but most population genetics models are designed for exclusively clonal or sexual organisms. This gap hampers our understanding of the influence of clonality on evolutionary trajectories and the interpretation of population genetics data. We performed forward simulations of diploid populations at increasing rates of clonality (c), analysed their relationships with genotypic (clonal richness, R, and distribution of clonal sizes, Pareto β) and genetic (FIS and linkage disequilibrium) indices, and tested predictions of c from population genetics data through supervised machine learning. Two complementary behaviours emerged from the probability distributions of genotypic and genetic indices with increasing c. While the impact of c on R and Pareto β was easily described by simple mathematical equations, its effects on genetic indices were noticeable only at the highest levels (c>0.95). Consequently, genotypic indices allowed reliable estimates of c, while genetic descriptors led to poorer performances when c<0.95. These results provide clear baseline expectations for genotypic and genetic diversity and dynamics under partial clonality. Worryingly, however, the use of realistic sample sizes to acquire empirical data systematically led to gross underestimates (often of one to two orders of magnitude) of c, suggesting that many interpretations hitherto proposed in the literature, mostly based on genotypic richness, should be reappraised. We propose future avenues to derive realistic confidence intervals for c and show that, although still approximate, a supervised learning method would greatly improve the estimation of c from population genetics data.
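As a concrete reference for the genotypic indices named above, clonal richness is commonly computed as R = (G - 1) / (N - 1), where G is the number of distinct multilocus genotypes among N sampled individuals. The sketch below assumes that definition and assumes each sampled individual has already been assigned a genotype label; the function names and the toy sample are hypothetical, and Pareto β is not computed here.

```python
from collections import Counter

def clonal_richness(genotype_labels):
    """Clonal richness R = (G - 1) / (N - 1).

    `genotype_labels` holds one multilocus genotype label per sampled
    individual; G is the number of distinct labels and N the sample size.
    """
    n = len(genotype_labels)
    if n < 2:
        raise ValueError("at least two sampled individuals are required")
    g = len(set(genotype_labels))
    return (g - 1) / (n - 1)

def clone_sizes(genotype_labels):
    """Number of sampled individuals per genotype, largest first."""
    return sorted(Counter(genotype_labels).values(), reverse=True)

# Toy sample: five individuals, three distinct genotypes.
sample = ["A", "A", "A", "B", "C"]
print(clonal_richness(sample))  # 0.5
print(clone_sizes(sample))      # [3, 1, 1]
```

R equals 0 when every sampled individual shares one genotype and 1 when every individual is distinct, so it depends directly on how many individuals are sampled, consistent with the sample-size caveat raised in the abstract.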
Samia ricini, a gigantic saturniid moth, has the potential to be a novel lepidopteran model species. Since S. ricini is much tougher and more resistant to diseases than the current model species Bombyx mori, it can be reared more easily. In addition, genetic resources available for S. ricini rival or even exceed those for B. mori: at least 26 eco-races of S. ricini are reported, and S. ricini can hybridise with wild Samia species, which are distributed throughout Asian countries, and produce fertile progeny. Physiological traits such as food preference, integument colour and larval spot pattern differ between S. ricini and wild Samia species, so these traits can be targets for forward genetic analysis. In order to facilitate genetic research in S. ricini, we determined its whole genome sequence. The assembled genome of S. ricini was 458 Mb with 155 scaffolds, and the N50 length of the assembly was approximately 21 Mb. A total of 16,702 protein-coding genes were predicted in the assembly. Although the gene repertoire of S. ricini was not very different from that of B. mori, some genes, such as chorion genes and fibroin genes, seemed to have specifically evolved in S. ricini.
Measuring biological diversity is a crucial but difficult undertaking, as exemplified in oaks where complex morphological, ecological, biogeographic and genetic differentiation patterns collide with traditional taxonomy that measures biodiversity in number of species (or higher taxa). In this pilot study, we generated High-Throughput Sequencing (HTS) amplicon data of the intergenic spacer of the 5S nuclear ribosomal DNA cistron (5S-IGS) in oaks, using six mock samples that differ in geographic origin, species composition, and pool complexity. The potential of the marker for automated geno-taxonomy applications was assessed using a reference dataset of 1770 5S-IGS cloned sequences, covering the entire taxonomic breadth and distribution range of western Eurasian Quercus, and applying similarity (BLAST) and evolutionary approaches (ML trees and EPA). Both methods performed equally well, with correct identification of species in sections Ilex and Cerris in the pure and mixed samples and of the main genotypes shared by species of sect. Quercus. Application of different cut-off thresholds revealed that medium-high abundance sequences (>10 or 25) suffice for a net species identification of samples containing one or few individuals. Lower thresholds identify phylogenetic correspondence with all target species in highly mixed samples (analogous to environmental bulk samples) and include rare variants pointing towards reticulation, incomplete lineage sorting, pseudogenic 5S units, and in-situ (natural) contamination. Our pipeline is highly promising for future assessments of intra-specific and inter-population diversity, and of the genetic resources of natural ecosystems, which are fundamental to empower fast and solid biodiversity conservation programs worldwide.
The leopard coral grouper, Plectropomus leopardus, belonging to genus Plectropomus, family Epinephelinae, is a carnivorous coral reef fish widely distributed in the tropical and subtropical waters of the Indo-Pacific. Due to its appealing body appearance and delicious taste, P. leopardus has become a popular commercial fish for aquaculture in many countries. However, the lack of genomic and molecular resources for P. leopardus hinders its biological studies and genomic breeding programs. Here we report the de novo sequencing and assembly of the P. leopardus genome using 10× Genomics and high-throughput chromosome conformation capture (Hi-C) technologies. Using 127.36 Gb of 10× Genomics data, we generated a 902.90 Mb genome assembly with a contig and scaffold N50 of 31.8 Kb and 33.47 Mb, respectively. The scaffolds were clustered and oriented into 24 pseudo-chromosomes with 13.39 Gb of valid Hi-C data. BUSCO analysis showed that 95.3% of the conserved single-copy genes were retrieved, indicating high completeness of the assembly. We predicted 23,234 protein-coding genes, of which 96.5% were functionally annotated. The P. leopardus genome provides a valuable genomic resource for genetic, evolutionary and biological studies of this species. In particular, it is expected to benefit the development of genomic breeding programs in the farming industry.
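The contig and scaffold N50 values quoted in these assembly abstracts follow the usual definition: the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches at least half of the total assembly length. A minimal sketch under that definition (the function name `n50` and the toy lengths are illustrative, not taken from any of the assemblies):

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length

# Toy assembly: total length 32, so the half-way point is 16.
toy_scaffolds = [8, 8, 4, 3, 3, 2, 2, 2]
print(n50(toy_scaffolds))  # 8
```

Because N50 depends only on the length distribution, a large gap between contig and scaffold N50 (as in the grouper assembly) reflects how much scaffolding joined short contigs, not a change in sequence content.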
Gene annotation is a critical bottleneck in genomic research, especially for the comprehensive study of very large gene families in the genomes of non-model organisms. Despite the recent progress in automatic methods, state-of-the-art tools used for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, errors that require considerable extra effort to correct. Here we present BITACORA, a bioinformatics solution that integrates popular sequence similarity-based search tools and Perl scripts to facilitate both the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly in genomic DNA sequences. We tested the performance of BITACORA in annotating the members of two chemosensory gene families with different repertoire sizes in seven available genome sequences, and compared its performance with that of Augustus-PPX, a tool also designed to improve automatic annotations using a sequence similarity-based approach. Despite the relatively high fragmentation of some of these drafts, BITACORA was able to improve the annotation of many members of these families and detected thousands of new chemoreceptors encoded in genome sequences. The program creates general feature format (GFF) files, with both curated and newly identified gene models, and FASTA files with the predicted proteins. These outputs can be easily integrated into genomic annotation editors, greatly facilitating subsequent manual annotation and downstream evolutionary analyses.
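GFF output of the kind described above is tab-delimited with nine columns (seqid, source, type, start, end, score, strand, phase, attributes), so a first sanity check on curated gene models can be a simple tally of feature types. The sketch below is generic GFF handling, not part of BITACORA; the function name and the in-memory example records are hypothetical.

```python
import io
from collections import Counter

def count_feature_types(handle):
    """Tally the values of the third GFF column (feature type)."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment or directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Ignore malformed records that do not have nine columns.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts

# Hypothetical records standing in for an annotation file.
example_gff = (
    "##gff-version 3\n"
    "scaffold_1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "scaffold_1\texample\tCDS\t100\t400\t.\t+\t0\tParent=gene1\n"
    "scaffold_1\texample\tCDS\t600\t900\t.\t+\t2\tParent=gene1\n"
    "scaffold_2\texample\tgene\t50\t350\t.\t-\t.\tID=gene2\n"
    "scaffold_2\texample\tCDS\t50\t350\t.\t-\t0\tParent=gene2\n"
)
counts = count_feature_types(io.StringIO(example_gff))
print(dict(counts))
```

For a file on disk, the same function can be given an open file handle in place of the `io.StringIO` object.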
Sarcophaga peregrina is considered to be of great ecological, medical and forensic significance, and has distinctive biological characteristics such as an ovoviviparous reproductive pattern and adaptation to feeding on carrion. However, the underlying mechanisms remain unresolved owing to the lack of a high-quality genome. Here we present a de novo chromosome-scale genome assembly for S. peregrina. The final assembled genome was 560.31 Mb with a contig N50 of 3.84 Mb. Hi-C scaffolding reliably anchored six pseudochromosomes, accounting for 97.76% of the assembled genome. Moreover, repeat elements accounted for 45.70% of the genome. A total of 14,476 protein-coding genes were functionally annotated, accounting for 92.14% of all predicted genes. Phylogenetic analysis indicated that S. peregrina and S. bullata diverged ~7.14 Mya. Comparative genomic analysis revealed expanded and positively selected genes related to biological features that help explain its ovoviviparous reproduction and necrophagous habit, such as chorionic membrane formation, dorso-ventral axis formation, lipid metabolism, and olfactory receptor activity. This study provides a valuable genomic resource for S. peregrina and offers insight into the molecular mechanisms underlying its adaptive evolution.