Corresponding author:
Ping Wu, wpwupingwp@outlook.com
Abstract:
Organelle genomes serve as crucial datasets for investigating the genetics and evolution of plants and animals, genome diversity, and species identification. To enhance the collection, analysis, and visualization of such data, we have developed a novel open-source software tool named Organelle Genome Utilities (OGU). The software encompasses three modules designed to streamline the handling of organelle genome data. The data collection module is dedicated to retrieving, validating, and organizing sequence information. The evaluation module assesses sequence variance using a range of methods, including novel metrics termed stem and terminal phylogenetic diversity, as well as observed resolution. The primer module could design universal primers for downstream applications. Finally, a visualization pipeline has been developed to present comprehensive insights into organelle genomes across different lineages rather than focusing solely on individual species. The performance, compatibility, and stability of OGU have been rigorously evaluated through benchmarking with four datasets, including one million mixed GenBank records, plastid genomic data from the Lamiaceae family, mitochondrial data from rodents, and 308 plastid genomes sourced from various angiosperm families. Based on software capabilities, we have identified 30 plastid intergenic spacers that exhibit a moderate evolutionary rate and offer practical utility comparable to coding regions, which highlights the potential applications of intergenic spacers in organelle genomes. We anticipate that OGU will substantially enhance the efficient utilization of organelle genomic data and broaden the prospects for related research endeavors.
Keywords:
Organelle genome, polymorphism evaluation, plastid evolution, intergenic spacers
Availability:
Organelle Genome Utilities is an open-source cross-platform Python program freely available at https://github.com/wpwupingwp/OGU.
Introduction
Plastids and mitochondria are semi-autonomous organelles in eukaryotic organisms, housing their own distinct genetic information in the form of organelle genomes. Compared to the nuclear genome, organelle genomes are typically smaller in size, often circular in structure, and exhibit a simpler genetic background (Gray 1993; Lynch et al. 2006; Smith & Keeling 2015). Notably, such as plastids in green plants and mitochondria in animals are commonly inherited maternally (Chunget al. 2023). Due to moderate evolutionary rates, simple genetic backgrounds, and the increasingly mature methods for data acquisition, particularly in assembly (Dierckxsens et al. 2016; Jin et al. 2020; Meng et al. 2019; Song et al. 2022) and annotation (Qu et al. 2019; Tillich et al. 2017), organelle genomes, especially plant plastid genomes, have found widespread applications in studying species phylogenetic relationships, species identification, and genomic variations (Barrett et al.2016; Hollingsworth et al. 2009; Weng et al. 2014). Correspondingly, the related organelle genome data are also increasingly abundant. As of early 2024, NCBI RefSeq has released over 7,300 non-redundant complete plastid genomes and more than 10,800 animal mitochondrial genomes (Pruitt et al. 2012). These extensive datasets facilitate cross-species, large-scale comparative studies (Huet al. 2023; Li et al. 2019; Tonti-Filippini et al.2017), and drive advancements in associated methodologies.
Currently, researchers can directly use full-length sequences for genome structure and functional variation studies (Sanchez-Puerta et al.2023; H. R. Zhang et al. 2019). Software tools for these purposes include but are not limited to progressiveMauve, IRscope, and CPGView (Amiryousefi et al. 2018; Darling et al. 2010; Liuet al. 2023). For phylogenetic studies and species identification, it is common to split the annotated genomes into coding gene fragments for further analysis. Integrated tools like TBtools and PhyloSuite are available for automating multi-step analyses to generate datasets tailored for different downstream analyses (C. Chen et al. 2020; D. Zhang et al. 2020). Additionally, fragmented coding region data can be obtained directly from databases like NCBI GenBank and CGIR (Hua et al. 2022; Sayers et al. 2019). However, there are still some important issues worth addressing in the analysis and utilization of organelle genome data. Several key problems are outlined as followed.
Evaluation and selection of organelle genomic sequences are crucial in the analysis . Genes and non-coding regions in organelle genomes may evolved at varying rates, leading to differences in subsequent analytical methods and applications. Highly variable fragments like CO1, ycf1 and matK are more suitable for DNA barcoding, moderately variable coding sequences (CDS) are better suited for phylogenetic analysis, while highly conserved genes such asrrn16 , psaC hold application value in genome assembly and evolutionary studies (Hebert et al. 2003; Kress & Erickson 2008; Li et al. 2019; P. Wu et al. 2021). In phylogenetic analysis, partitioning the data and assigning different evolutionary models to each fragment using tools allows for the convenient utilization of data with varied mutation rates, yet conflicts between fragments exceeding the model’s capacity can still yield ambiguous results (Lanfear et al. 2017; Walker et al. 2019). Therefore, sequence variation assessment methods are particularly critical. There are existing indicators and software tools can evaluate sequence polymorphism. For example, nucleotide polymorphism is commonly used, with software like DnaSP utilized for calculations (Nei 1987; Rozas et al. 2017). However, excluding gap sites and ambiguous bases may underestimate sequence variance, especially in loci with numerous insertions or deletions (Xi et al. 2016). Furthermore, nucleotide diversity may not differentiate variance within and across subgroups of assessed sequences due to heterogeneity that could affect the results (Barrett et al. 2016). Another method, the selection pressure dN/dS ratios are only applicable to protein-coding sequences and cannot handle gaps well (Yang 2007).
The intergenic spacers in organelle genomes are often overlooked . Intergenic spacers (or intergenic regions) refer to the sequences between the stop codon of an upstream gene and the start codon of a downstream gene. In animal mitochondrial genomes, these regions are sparse and compact, containing only a few homologous intergenic spacers. In plant mitochondrial genomes, due to massive structure variations, these spacers are highly variable across species in terms of size and structure, behaving low homology (Gualberto & Newton 2017). For plastid genomes, due to their stable structure and good collinearity, there is relatively higher homology among intergenic spacers. While they do not encode protein products, some may contain regulatory elements such as promoters or non-coding RNAs that play a role in regulating the expression of plastid-related proteins in response to environmental changes (Anand & Pandi 2021; Belbin et al. 2017; Dietrichet al. 2015). The notable homology and potential biological functions of the plastid intergenic spacers suggest a scientific value for further exploitation, but the current state of research does not fully reflect this potential. Apart from using certain highly variable segments (e.g., atpB-rbcL , trnL-F ) as supplementary universal DNA barcodes for plant species identification, intergenic spacers are only utilized in resolving phylogenetic relationships at lower taxonomic levels in some cases (C. W. Chen et al. 2013; Escobari et al. 2021).
Another reason for this neglect is the lack of data. For intergenic spacers widely used in species identification, data is primarily acquired through amplification sequencing. For other intergenic spacers, the efficiency of data acquisition is limited due to the need for targeted primer design and amplification experiments. Despite the existence of relevant software tools (Riaz et al. 2011; Untergasser et al. 2012), challenges such as lack of support for gapped alignments and ambiguous bases, slow processing speeds, and cumbersome usage persist. Given the rapid accumulation of plastid genome data, extracting intergenic spacers from plastid genome sequences based on annotations is a viable method for data acquisition. However, there is currently a lack of specific tools for this purpose. Popular data analysis toolkits such as PhyloSuite and TBTools only offer functionality for extracting coding sequences from genomes.
The interference caused by irregular annotation information . Due to the wide time span of organelle genomic data production, researchers used different tools and criteria for generating and submitting data, and the names of homologous genes in different genomes may be inconsistently named, leading to errors or missing data in subsequent analyses. For instance, variations in letter case within gene names can mistakenly differentiate genes (e.g., “rbcL”, “rbcl,”RBCL”), while tRNA and rRNA are particularly problematic areas, including variations in naming patterns (e.g., “trnH-GUG”, “trnH”, “tRNA His gug”, “trnHgug”). An extreme example is the ribosomal RNA gene rrn16in plastid genomes, where seven different name variations can be found in public data: “16rrn”, “16S”, “16S ribosomal RNA”, “16S rRNA”, “rrn16”, “rrn16s”, and “rrn16S” (Maglott et al. 2011). Although manual correction methods can be employed such as CGIR, when dealing with a large volume of data, especially in the absence of standardized conventions, this approach may introduce additional errors and consume a significant amount of time.
Furthermore, Improvement is needed in presenting organelle genome sequence analysis results. Currently, the assembly and annotation results are typically displayed in circular plots, with various user-friendly tools available such as OGDRAW and CPGView (Greineret al. 2019; Liu et al. 2023). However, this display method is often limited to individual species or samples. For analyses targeting taxonomic groups containing multiple species, especially in sequence variation analyses, conventional graphical representations like box plots and bar graphs are still predominantly used. Therefore, the development of new visualization methods tailored for multi-species organelle genome analysis results would help elucidate the sequence evolutionary patterns within organelle genomes of specific taxa more intuitively.
Considering the aforementioned, to assess sequence variations more effectively, facilitate the acquisition and utilization of intergenic spacers, utilize standardized sequence data for analysis, and enhance result presentation, we have developed Organelle Genome Utilities (OGU) software. OGU functions as a versatile toolkit for gathering, analyzing, and visualizing organelle genomic data, thereby enhancing the efficient utilization of data resources. Users simply need to specify their interested taxa or organelles, and the software automates a comprehensive set of analyses, including but not limited to collecting, organizing, and standardizing organelle genomic sequence data, assessing sequence variations using multiple indicators, and designing universal primers for downstream applications. The output from the software includes various types of sequence files suitable for diverse downstream analyses, detailed result tables, and visual representations that incorporate characteristics of organelle genomes. The software is written in pure Python language, facilitating cross-platform operation on Microsoft Windows, macOS, and Linux operating systems. Moreover, the software’s hardware requirements are relatively modest. For most cases, a standard PC with 8GB of memory and a typical CPU are sufficient. Furthermore, to enhance user experience, OGU offers both command-line and graphical user interface along with comprehensive documentation for user guidance.
Materials and Methods
OGU contains three modules, GB2fasta, Evaluate, and Primer, plus a fully automated analysis results visualization pipeline. These components can be used individually or in combination to serve different purpose.
The GB2fasta module
The GB2fasta module primarily concentrates on the collection and organization of data. It accepts two kinds of input: local GenBank files or Entrez queries for downloading GenBank data. For the latter one, the GB2fasta module offers straightforward options to automatically generate the query string to precisely retrieve target records from GenBank including gene names, taxonomy, organelle type, date, length range, etc. Next, the program downloads the data in batches. To avoid the interference of network fluctuations, when the number of retrieved records is large, the program supports downloading in chunks and retransmission at breakpoints.
For the local GenBank files, the program firstly validates the integrity and reliability of the data to exclude abnormal records, such as truncated record due to Internet disconnection, invalid record due to annotation error, or empty record containing none of sequence data. Next, according to annotation information, validated records are divided precisely into a series of FASTA-format files with organized sequence ID, including types of sequences, accession number, and taxonomic information (kingdom, phylum, class, order, family, genus, and species). Each file contains only sub-fragments of the same locus. Notably, the intergenic spacer, which is inaccessible from GenBank directly, will be extracted according to its upstream and downstream annotations. For taxon information, although the full lineage name is already included in the record, it is tricky to use given that it only contains the name of each taxonomic rank but the rank levels are missing. Querying the NCBI Taxonomy database with Entrez APIs could solve this issue, yet it is unrealistic for large number of records. OGU follows theInternational Code of Nomenclature for algae, fungi, and plants (Shenzhen Code) to recognize plant rank levels since that commonly used ranks have uniform suffixes (Turland et al. 2018). For instance, all plant family names have the suffix “-aceae”. For other rank names or taxa without uniform naming conventions, such as animal family names, the program searches a self-contained very light-weight taxonomy database (simplified from NCBI Taxonomy) to efficiently determine the rank level (Federhen 2012).
In the process of splitting, considering that there may be repeated regions in the original sequence, especially in the case of inverted repeat regions in plastid genome, the program provides options to handle them in different ways, including allowing inverted repeat sequences or keeping only one (such as intergenic spacers atpB _rbcLand rbcL _atpB ) and whether other types of repeat sequences are allowed. Besides, the program also provides a renaming function for plastid genome, which can automatically rename the non-standard gene names into uniform gene names. For example, “rbcL”, “rbcl”, “Rbcl”, and “rbc large subunit” are all rbcL gene, and the program can automatically change them to “rbcL” to avoid splitting the same gene into multiple files due to irregular gene naming in annotations. This function is mainly implemented using regular expressions, especially optimized for plastid tRNA genes.
Finally, the GB2fasta module remove redundant sequences in extracted files. By default, the program keeps one record for each organism in each locus for ease of sequence variance analysis. Files without reduction are also available as a part of the output.
The Evaluate module
The Evaluate module handles the assessment of sequence variation levels. When the GB2fasta module finishes running, the Evaluate module is called subsequently. Users may also manually run this module by providing aligned or unaligned fasta-format files as input. The program calls MAFFT v7.475 to perform the multiple sequence alignment (Katoh & Standley 2013). Considering the different variance of input files, the option “–auto” is used to allow MAFFT to detect and utilize the best alignment strategy for each file. Subsequently, the program evaluates the variance of each aligned files via multiple indicators:
1. Nucleotide diversity (Pi). Under default options, this method behaves same as DnaSP to calculate nucleotide diversity of the alignment, a widely used method to measure the variance of the sequences. By adding options to accept gaps and ambiguous bases in the alignment, the program can utilize more data that DnaSP ignored, especially for alignment regions containing multiple gaps since one gap in one sequence results in the removal of the whole regions by DnaSP.
2. Shannon index. The Shannon index (or Shannon’s entropy) is used as a species richness index in ecology and population genetics (Peet 1974). Herein, it is used to quantify the diversity of the sequences. If the alignment includes several groups of sequences that shows highly variance between groups but has few variances in the group (for instance, a family includes two distinct genera while species in each genus have few divergence), the abnormal entropy could hint the existence of such case. For the convenience of comparison, the entropy was standardized by dividing the maximal entropy of each alignment.
3. Observed resolution. By dividing the number of kinds of different sequence by the total numbers of the sequences in the alignment, this method is the simplest and thus the fastest to calculate. Although such naive method may not accurately reflect the real situation, due to its computational speed advantage, it can be utilized as a upper bound given that an alignment with low observed resolution can hardly have sufficient variance.
4. Tree resolution. In this method, IQ-TREE v2.2.2.2 is used to infer maximum likelihood trees from the alignments (Minh et al. 2020). The option “-czb” is used to avoid the interference of invalid branch by collapsing all near zero branches in the tree. Then, the tree resolution is estimated by dividing the number of internal nodes by the number of terminal nodes in the tree. Finally, the program uses Biopython’s Phylo module to read the phylogenetic tree file and calculate the tree resolution (Talevich et al. 2012).
5. Phylogenetic diversity (PD) and its variations. With the phylogenetic trees built above, the phylogenetic diversity is estimated by summing the lengths of all those branches that are members of the corresponding minimum spanning path (Faith 1992). Phylogenetic diversity provides a comparable, evolutionary measure of biodiversity (Miller et al.2018). However, since it sums all branch length including internal and terminal branches, it mixes the variances within and across the clades that it may cause the overestimation of the ability of the DNA barcodes to classify samples. To address this issue, we proposed two variations of PD: stem phylogenetic diversity (PD-stem) and terminal phylogenetic diversity (PD-terminal). PD-stem only sums the length of internal branches that represent the synapomorphy of the sequences, while PD-terminal sums the length of terminal branches that denotes the amount of autapomorphy (Supplemental Information 1). PD-terminal focuses on variations between closely related lineages rather than dissimilarities between clades that a higher PD-terminal value indicates better species resolution, while PD-stem value merely indicates the distinction amongst the sub-groups of the samples. Nevertheless, PD-stem could measure the robustness of the phylogenetic tree that a poor PD-stem suggests an unstable taxonomic relation. For the convenience of comparison, the three PD values were standardized by dividing the number of the sequences of each alignment. To avoid interference of abnormal sequences, standard deviations were calculated to hint the existence of abnormal branch length values.
6. GC ratio and gap ratio. Apart from above evaluation methods, GC ratio and gap ratio of the alignment were also computed in case of potential usage. Particularly, an extremely high gap ratio indicates that the alignment quality should be concerned.
After the evaluation of the whole alignment with above methods, a sliding-window scanning is performed to infer variation hotspots in each alignment. The output of the module includes a table including results of all methods mentioned above and a figure showing the sliding-window scanning results for each input file.
The Primer module
After sequence variance evaluation, the Primer module is optional for designing universal primer used in downstream application. Users may also manually use this module by providing aligned fasta-format files as input. Firstly, the program generates a consensus sequence by counting the frequency of bases in each column of the alignment. If a column comprised of two or three bases with similar amount, their corresponded ambiguous base is used in the consensus. In extreme cases that all four bases (A, T, C, and G) have approximately equal proportions, the ambiguous base “N” will be used. Next, a quick scanning that only calculates Observed resolution is performed on the alignment to infer the variation hotspots. The detected hyper-variant fragments regions are marked in the consensus sequence for designing primer. Subsequently, the upstream/downstream regions of marked sequences are iteratively checked to screen suitable primer sequences, by evaluating the primer length, the number of ambiguous bases, the existence of homodimers and secondary structures, tandem repeat occurrences and other conditions. Then, anin silico validation is performed by comparing the candidate primer sequences with original sequences in the alignment (gap removed) using blastn-short of BLAST v2.13.0 (Boratyn et al. 2013). Primers with excessive mismatch ratios, insufficient positive matches, or low overall coverage are excluded. In addition, primers with abnormal melting temperature or have potential to form hairpins or heterodimers are removed according to Primer3-py package(https://github.com/libnano/primer3-py), which integrated Primer3 (Untergasser et al. 2012). During the validation, since BLAST and Primer3 have limited supporting of ambiguous bases, necessary auxiliary works are finished by the Primer module. Finally, the program picks matched primer pairs with suitable amplified length and coverage and generate a table of primers information as output.
The visualization pipeline
For displaying the variations within organellar genomes across multiple species of a target clade, we have developed an interactive plotting pipeline: draw_circle_plastid.ipynb for plastids and draw_circle_mitochondria.ipynb for mitochondria. Considering the diverse customization needs of users regarding the output imagery, the pipelines are provided as Jupyter notebooks to facilitate easy adjustments of styles such as colors and fonts (Kluyver et al.2016). The input to the pipeline comprises a GenBank file of an organellar genome from a reference species alongside a table of sequence variation assessments generated by the Evaluate module. For plastid genomes, users have the option to provide quadripartite structure information to better represent their inverted repeat structures. The pipeline starts with extracting gene names, lengths, and positions from the GenBank file and intersects these with names from the results table, which can minimize the confounding effects of additional or missing fragments. Subsequently, with pyCirclize (https://github.com/moshi4/pyCirclize), the program arranges genes and intergenic spacers in accordance with the annotation’s order from the reference genome, thereby allowing a consolidated display across multiple species. Following this setup, tracks are drawn from outermost to innermost, each corresponding to different variant indicators’ results. Within each track, segment values are displayed according to their position. Finally, the program outputs a vector format image file which can be further modified as needed.
Benchmarks
To test the compatibility of the GB2fasta module, one million records of green plants were randomly retrieved and processed with the command “python3 –m OGU.gb2fasta –taxon Viridiplantae -min_len 100 –max_len 1000000 –count 1000000 –out test_gb2fasta”. For benchmarking the function of Evaluate module and its ability to filter out the highest variant sequences, we tested records of Lamiaceae family in GenBank with the command “python3 –m OGU –taxon Lamiaceae –max_len 1000000 –out test_evaluate -quick” and the result was labeled as “default”. The “-quick” option was used to skip sliding-window analysis to reduce time consuming. In another run, the options “-ig” and “-iab” were applied to ignore gaps and ambiguous bases in the alignment to mimic the behavior of other software for comparison. To reduce the interference of redundant data and low-quality data to the analysis, several restrictions were applied: sequence number greater than 100, and the sequence type should be “gene” or “spacer” only since sequences of coding sequences (CDS), introns, RNA and most of “misc feature” were already covered by those two types. Statistical analysis and visualization were performed using seaborn and SciPy (Virtanen et al. 2020; Waskom 2021). Besides, the rbcL , a widely used gene for universal plant DNA barcode was selected for testing sliding-window analysis and Primer module (Hollingsworth et al. 2009). To further test the software’s compatibility, we evaluated mitochondrial genome data from rodents (Rodentia) using the following command: “python3 -m OGU -og mt -taxon rodents –refseq yes -out test_rodents”.
Finally, to further test the analysis function of the program on the intergenic spacers and the compatibility on various taxonomy, angiosperm families were selected according to the Angiosperm Phylogeny Website version 14 (APG 4) (Stevens 2017). Plastid genomic sequences were obtained by randomly selecting a representative species from each family using the GB2fasta module and sequences were combined to one GenBank file, and the command “python3 -m OGU -gb angiosperm_cp.gb -out test_spacers -rename” was run to divide and analyze the data. For subsequent phylogenetic analysis, IQ-TREE2 was used with the command “iqtree2 -s combine.fasta -bb 1000 -bnni -p -czb -safe” and the most suitable base substitution model is automatically selected using ModelFinder (Kalyaanamoorthy et al. 2017). The phylogenetic tree generated above was visualized using iTOL v5 (Letunic & Bork 2021).
Results
GB2fasta successfully processed one million random GenBank records
One million sequence records of green plants were successfully retrieved from NCBI server. 534 of these are complete plastid genomes, 30 are mitochondria genomes, and the rest are genes or genome fragments. 7,176 records were judged as invalid data during the analysis process, mainly because the records only include annotation information and the corresponding sequence are empty, or there is no feature annotation entry (such as GenBank ID BMBV00000000, KEPT00000000, MU180198, etc.). 194,758 features in the annotations of records failed to be extracted due to missing sequences or abnormal annotations. For instance,nad1 and nad5 genes in the Gastrodia elatamitochondrial genome referred sequences in another records (Yuanet al. 2018). After sequence extraction, validation and deduplication, 1,401,251 files containing 1,827,413 records were generated and 1,453,602 records were left (Supplemental Information 2 and 3). One of the reasons such an unexpected number of files generated is that one genomic sequence can be divided into hundreds of files. Moreover, since no organelle type filter was used during querying step, many records retrieved are nuclear genes without appropriate annotation. For instance, most of nuclear gene records only have a locus number that they must be stored in different files, such as “gene-LOC100274415”, “gene-Bca52824_090542”, and “gene-BPRA1399_LOCUS959”.
The download process cost about five hours due to speed limitation. The validation and divide process cost about half to an hour in different operating systems or hardware due to processing enormous files, which a solid-state disk (SSD) could be helpful. During the running, one CPU core and at most 1 GB memory were used that normal personal computers can easily handle.
The performance of multiple sequence polymorphism evaluation methods varied on Lamiaceae data
118,311 records of Lamiaceae family were successfully downloaded, and 30 records were found to be invalid (such as GenBank ID KEPD00000000). 35,205 files were generated including 129,181 non-redundant records and 2,253 files were successfully aligned and evaluated since most of generated files only include one records due to annotation issues such as missing annotation or empty sequences. After processing and filtration, there are 121 genes and 109 intergenic spacers left for evaluation and the dataset is labeled as “default”. By ignoring the columns containing ambiguous bases and gaps in the alignment that “-ig -iab” option did, there are 98 genes and 102 intergenic spacers and labeled as “ig_iab” dataset (Supplemental Information 4). The missing genes includes some widely applied genes such as matK andrbcL . This is due to the existence of gaps since the alignment ofmatK have 2,128 columns while all of them contain gaps just because of few records (GenBank ID: DQ640414, JQ767173).
After applying multiple sequence polymorphism evaluation methods on Lamiaceae data, the results show that the number of sequences and gap ratio has no or little effect on each method, and the alignment length only has a moderate correlation with observed resolution, Shannon index and tree resolution, but has no effect on phylogenetic diversity (PD) (Figure 1A). GC ratio showed a moderate negative correlation with each method, which may be due to the relatively high GC ratio of conserved RNA genes. Besides, there is a significant positive correlation between observed resolution, entropy, and tree resolution. The three kinds of PD also are significantly positive correlated.
The comparison of the result of two dataset show that the variance value assessed by each method significantly increased after retaining gaps and ambiguous bases, given that the indel variation information in the sequence was considered (Figure 1B). Compared with other methods, the significance of such change on PD is slightly lower, and there is no significant change in PD-stem, indicating that the gaps in these sequences have little effect on the construction of phylogenetic relationships. By sorting the results with ascending PD values, Figure 1C shows that the overall trend of the three methods is close. There are fluctuations in the opposite direction between PD-stem and PD-terminal, that is, the variance of these loci comes from different contributors. For example, the PD-terminal of rps12 and rrn5 is higher than that of PD-stem, which means that there are greater differences among closely related species; rps19 , petL and other genes are the opposite, that is, the sequence differences mainly originate between different clades rather than within.
To further compare the effectiveness of various methods, ten loci from “default” dataset with the highest degree of variation indicated by each method were selected for intersection analysis (Supplemental Information 5). Figure 1D shows that almost all highly variable fragments indicated by Pi were different from other methods which may be because gaps have a larger effect on it. Observed resolution, entropy and tree resolution have four intersections, and the PD series has three intersections, which is consistent with previous correlation analysis. Except for Pi, all other six methods considered the intergenic regionrpl32 -trnL UAG is highly variable. This result has also been supported by previous reports (Will & Claßen-Bockhoff 2017; H. Wu et al. 2021).
For rbcL , the program completed the sliding window analysis with 500 bp window size and 50 bp step size. The results show that Pi is consistent with the gap ratio which increase from the beginning to the site of about 700 bp and become flat at end of the alignment (Supplemental Information 6), which is consistent with the distribution of alignment gaps. Other evaluation methods show flat trend indicating that they are relatively insensitive to gaps. Interestingly, for phylogenetic diversity, PD-terminal remains flat, while PD-stem fluctuates synchronously with PD. The alignment show there are two or more kinds of homologous variation sites in these sudden rises and neighboring sequences are basically identical, that is, these mutations are mainly differences between sequences of different groups rather than within the group such as 176T>C and 306C>G (Supplemental Information 7). In addition, the program successfully designed and selected one pair of best performance universal primers based on the alignment. Due to the widely distribution of gaps, the average amplified product is shorter than 100 bp (Supplemental Information 8).
The download and divide process cost half an hour. The alignment process cost about eight hours and the evaluation cost two hours. During the running, one CPU core and at most 3 GB memory were used and the time consumption could be reduced by using more CPU cores theoretically.
Intergenic spacers behave similarly with CDS on 308 angiosperm plastid genomic data
A total of 308 complete plastid genomic sequences were retrieved from which 1,015 fragments were extracted, encompassing representation from 71% of APG IV families. After discarding sequences numbering fewer than 100 and those that were duplicates, a subset of 241 sequences remained for comparative analyses across different categories. Detailed evaluation results and associated tests for significance of differences can be found in Supplemental Information 9 and 10.
The two types of RNA generally exhibit similar patterns. Both tRNA and rRNA have significantly higher GC content compared to other types of sequences, while their nucleotide diversity (Pi) is markedly lower (Figure 2A). In terms of tree resolution and the three measures of phylogenetic diversity, RNA also scored lower than other types of fragments, with no significant differences observed between them (Figure 2A and 2B). Introns and spacers, both belonging to non-coding regions, had the highest GC content, and their phylogenetic diversity scores (PDs) were significantly greater than those of coding sequences (CDS). Apart from PDs, no significant differences were detected between spacers, introns, and CDS in other indicators. One explanation is that because the sampled taxa spanned different families, polymorphisms inferred from other metrics approached their maximum values for all sequence types. Such situation is particularly evident with the Shannon index and Observed resolution (Supplemental Information 11). PDs did not exhibit this saturation effect as they are solely associated with the branch lengths of phylogenetic trees constructed from the sequences. For spacers, PDs were highest (Figure 2C), consistent with the theoretical expectation of minimal negative selective pressure. Furthermore, although spacer sequences had the highest PD-terminal values, PD-stem values were also substantial. This suggests that the evaluated angiosperm samples retained a considerable number of homologous sequences within their spacer regions, which could provide phylogenetic information rather than being entirely composed of random variation.
For Gap ratio, differences are evident between different types of sequences; however, significant variation also exists within the same sequence type. For example, within coding sequences (CDS), the alignment length of ycf1 is approximately 15 kb, of which over 70% consists of gaps, whereas for petG — also a CDS — the alignment length is 114 bp with a gap ratio of 0%.
Figure 2D presents a circular map of the evaluated fragments arranged according to the plastid genome structure of Nicotiana tabacum(GenBank ID Z00044), generated using the pipeline draw_circle_plastid.ipynb. In the case of the inverted repeat regions (IRs), disregarding the saturation effects previously mentioned, the sequence polymorphism within these fragments is less than in other regions. Beyond the influence of quadripartite structures, another conspicuous observation is the intermittent dips across various segments in the circle, corresponding predominantly to genes of notably short lengths such as psbF and psbN .
To further compare coding sequences (CDS) with spacers, 77 CDS and 30 spacer sequence fragments were selected based on alignment length, gap ratio, and degree of sequence variation, subsequently compiled into two datasets: CDS-tree and spacer-tree. Phylogenetic trees were constructed using IQ-TREE2 with the parameters “iqtree2 –s sequence.nex -p -bb 1000 -bnni -czb -safe -m GTR+F+R4”. The results indicate three main differences between the two phylogenetic trees (Supplemental Information 12 and 13). Firstly, in the spacer tree, the order Austrobaileyales is basal, whereas in the CDS tree, the Amborellales takes this position. Secondly, in the CDS tree, the order Ceratophyllales is sister to the eudicots, while in the spacer tree it clusters near the base of angiosperms alongside Chloranthales. Additionally, within the core eudicots clade, Buxales and Trochodendrales occupy basal positions but exhibit reversed internal-external relationships between the two trees. Apart from these individual discrepancies—some of which have been previously acknowledged as challenging taxa—most samples’ positional relationships are consistent and align with APG IV (Hu et al. 2023). This suggests that plastid spacers hold potential for application in phylogenetic analysis among other fields. Utilizing spacers allows for a more comprehensive use and analysis of plastid genome data, providing new insights for related research.
Discussion
OGU helps better utilizing GenBank data
Till now, GenBank offers more than 1 billion sequence records and related taxonomy database including more than 600,000 formally described species. However, it is not straightforward to access and utilize such abundant resources. Although all sequence records in GenBank are accessible with the NCBI Entrez retrieval system, it is not easy to perform an accurate retrieval, especially for novices. Besides, the download process could be unsmooth, especially for large request targeting on specific taxonomy that NCBI FTP service cannot help. GB2FASTA provides a wealth of options and a fully automated process to help users solve the problem of accurate acquisition of target data.
Currently user can download records with various file format. The GFF3-format file only contains feature annotations. The FASTA-format file contains sequence and are widely compatible with other bioinformatical software but has very limit annotations. The GenBank-format files have full annotations and sequence but file structure is complex which is generally incapable to be directly utilized by downstream analytical software. The GB2fasta module of OGU offers simple yet powerful means to convert files from GenBank-format to FASTA-format with necessary information. By utilizing the annotation information, the original records were divided precisely to extract sequences of each locus, as well as intergenic spacers which are undefined in the original file. Extracted files have locus name, detailed taxonomy, feature type and original GenBank ID information which are well-formatted in each sequence ID hence are easier for organization and utilization, especially for analyzing organelle genomic data.
Broader application of intergenic spacer data demands extra efforts
The theoretical high rate of variability, the challenges associated with extraction, and potential structural reorganizations leading to loss have all impeded the utilization of intergenic spacer regions within plant plastid genomes. In this study, we employed the GB2fasta module of the OGU to extract intergenic spacer fragments from complete plastid genome sequences. Upon evaluating their sequence variation, we discovered that these intergenic spacers comprise both highly variable and extremely conserved segments, with overall polymorphism and phylogenetic resolution comparable to that of protein-coding genes. After a rigorous selection process that eliminated fragments with poor alignment quality, low homology, or aberrant lengths, we identified 30 plastid intergenic spacers from angiosperms for phylogenetic analysis. The resulting phylogenetic trees displayed credible systematic positions for various taxa, closely mirroring the results obtained from protein-coding gene exons with strong support values. This indicates that non-coding regions of the plastid genomes are valuable for phylogenetic studies. Beyond mere extraction from the plastid genomes, OGU also offers a Primer module designed to facilitate the development of universal primers for cross-species applications, assisting researchers in data acquisition and enabling the use of intergenic spacers in fields such as species identification as DNA barcoding or population genetics.
Nevertheless, several issues merit attention regarding the broader utilization of plastid genome intergenic spacers. The first pertains to the homology of these regions. Due to the relatively stable structure of plastid genomes, they are presumed to possess high collinearity and homology. This assumption holds true except when considering taxa that have undergone extensive genomic rearrangements, such as those found in families like Pinaceae, Geraniaceae, and Campanulaceae (Cosner et al. 2004). However, for groups lacking substantial data and research, as well as intergenic spacers located at the junctions of the quadripartite structure of plastid genomes, and regions with excessively high variability, it is prudent to conduct a homology check using software tools like BLAST prior to data utilization.
Another issue is the reliability of nomenclature. The naming of homologous intergenic spacers relies on the gene flanking them. Discrepancies in name criteria, such as synonyms, spelling variations, and formatting inconsistencies of the upstream and downstream genes, can result in the mislabeling of extracted intergenic spacers. OGU has partially addressed this challenge by implementing regular expressions and custom conversion rules within the “-rename” option of the GB2fasta module. Nonetheless, fully resolving this nomenclature problem necessitates collaborative efforts from data standard setters, data submitters, and software developers to ensure consistency and accuracy.
For animal mitochondrial genomes, although the program successfully retrieved 317 complete rodent mitochondrial genomes and completed their analysis (Supplemental Information 14), results show that all genes are highly conserved and intergenic spacers exhibit even lower degrees of variation (Supplemental Information 15). For plant mitochondrial genomes, due to currently limited public data availability and significant genomic structural variation, further exploration in the future is necessary.
Acknowledgements
We thank the authors of Primer3-py for building and update packages which enhance our software compatibility on Microsoft Windows operating system. We thank Yan Zhang, Dongyue Jiang, Huijie Liu, Guojin Zhang, and Haihua Hu for testing the program. We appreciate Prof. Shiliang Zhou for his advice on the Evaluate module. We also thank numerous volunteers on GitHub for participation, including but certainly not limited to minhtrung1997, beausmith94, yihenghu, weissab, BobbyWoo8, codeunsolved, sunlu123, and yjr930.
This research was supported by grants from the National Natural Science Foundation of China (32300539) and Startup Fund of Sichuan Normal University (XJ20220110).
References
Amiryousefi A, Hyvönen J, Poczai P (2018) IRscope: an online program to visualize the junction sites of chloroplast genomes.Bioinformatics, 34 (17), 3030-3031.
Anand A, Pandi G (2021) Noncoding RNA: An insight into chloroplast and mitochondrial gene expressions. Life, 11 (1), 1-20. doi:10.3390/life11010049
Barrett CF, Baker WJ, Comer JR, Conran JG, et al. (2016). Plastid genomes reveal support for deep phylogenetic relationships and extensive rate variation among palms and other commelinid monocots. New Phytologist , 209, 855-870.
Belbin FE, Noordally ZB, Wetherill SJ, Atkins KA, et al. (2017) Integration of light and circadian signals that regulate chloroplast transcription by a nuclear-encoded sigma factor. New Phytologist, 213 (2), 727-738. doi:10.1111/nph.14176
Boratyn GM, Camacho C, Cooper PS, Coulouris G, et al. (2013) BLAST: a more efficient report with usability improvements.Nucleic Acids Research, 41 (W1), W29-W33.
Chen C, Chen H, Zhang Y, Thomas HR, et al. (2020) TBtools: an integrative toolkit developed for interactive analyses of big biological data. Molecular plant, 13 (8), 1194-1202.
Chen CW, Huang YM, Kuo LY, Nguyen QD, et al. (2013) trnL-F is a powerful marker for DNA identification of field vittarioid gametophytes (Pteridaceae). Annals of Botany, 111 (4), 663-673.
Chung KP, Gonzalez-Duran E, Ruf S, Endries P, Bock R (2023) Control of plastid inheritance by environmental and genetic factors. Nature Plants, 9 (1), 68-80.
Cosner ME, Raubeson LA, Jansen RK (2004) Chloroplast DNA rearrangements in Campanulaceae: phylogenetic utility of highly rearranged genomes.BMC Evolutionary Biology, 4 , 1-17.
Darling AE, Mau B, Perna NT (2010) progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PloS one, 5 (6), e11147.
Dierckxsens N, Mardulyn P, Smits G (2016) NOVOPlasty: De novo assembly of organelle genomes from whole genome data. Nucleic Acids Research, 45 , e18. doi:10.1093/nar/gkw955
Dietrich A, Wallet C, Iqbal RK, Gualberto JM, Lotfi F (2015) Organellar non-coding RNAs: Emerging regulation mechanisms. Biochimie, 117 , 48-62. doi:https://doi.org/10.1016/j.biochi.2015.06.027
Escobari B, Borsch T, Quedensley TS, Gruenstaeudl M (2021) Plastid phylogenomics of the Gynoxoid group (Senecioneae, Asteraceae) highlights the importance of motif‐based sequence alignment amid low genetic distances. American Journal of Botany, 108 (11), 2235-2256.
Faith DP (1992) Conservation Evaluation and Phylogenetic Diversity.Biological Conservation, 61 (1), 1-10. doi:Doi 10.1016/0006-3207(92)91201-3
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Research, 40 (Database issue), D136-143. doi:10.1093/nar/gkr1178
Gray MW (1993) Origin and evolution of organelle genomes. Current Opinion in Genetics and Development, 3 (6), 884-890.
Greiner S, Lehwark P, Bock R (2019) OrganellarGenomeDRAW (OGDRAW) version 1.3. 1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Research, 47 (W1), W59-W64.
Gualberto JM, Newton KJ (2017) Plant mitochondrial genomes: dynamics and mechanisms of mutation. Annual Review of Plant Biology, 68 , 225-252.
Hebert PDN, Ratnasingham S, de Waard JR. (2003). Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proceedings of the Royal Society B: Biological Sciences , 270, S96-S99.
Hollingsworth PM, Forrest LL, Spouge JL, Hajibabaei M, et al.(2009) A DNA barcode for land plants. Proceedings of the National Academy of Sciences, USA, 106 , 12794-12797. doi:10.1073/pnas.0905845106
Hu H, Sun P, Yang Y, Ma J, Liu J (2023) Genome-scale angiosperm phylogenies based on nuclear, plastome, and mitochondrial datasets.Journal of Integrative Plant Biology, 65 (6), 1479-1489. doi:https://doi.org/10.1111/jipb.13455
Hua Z, Tian D, Jiang C, Song S, et al. (2022) Towards comprehensive integration and curation of chloroplast genomes.Plant Biotechnology Journal, 20 (12), 2239.
Jin JJ, Yu WB, Yang JB, Song Y, et al. (2020) GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biology, 21 (1), 241. doi:10.1186/s13059-020-02154-5
Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS (2017) ModelFinder: fast model selection for accurate phylogenetic estimates.Nature Methods, 14 (6), 587-589. doi:10.1038/nmeth.4285
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30 (4), 772-780. doi:10.1093/molbev/mst010
Kluyver T, Ragan-Kelley B, Pérez F, Granger BE, et al. (2016) Jupyter Notebooks-a publishing format for reproducible computational workflows. Elpub, 2016 , 87-90.
Kress WJ, Erickson DL. (2008). DNA barcodes: Genes, genomics, and bioinformatics. Proceedings of the National Academy of Sciences , 105, 2761-2762.
Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B (2017) PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses.Molecular Biology and Evolution, 34 (3), 772-773.
Letunic I, Bork P (2021) Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research, 49 (W1), W293-W296. doi:10.1093/nar/gkab301
Li HT, Yi TS, Gao LM, Ma PF, et al. (2019) Origin of angiosperms and the puzzle of the Jurassic gap. Nature Plants, 5 (5), 461-470. doi:10.1038/s41477-019-0421-0
Liu S, Ni Y, Li J, Zhang X, et al. (2023) CPGView: a package for visualizing detailed chloroplast genome structures. Molecular Ecology Resources, 23 (3), 694-704.
Lynch M, Koskella B, Schaack S (2006) Mutation pressure and the evolution of organelle genomic architecture. science, 311 (5768), 1727-1730.
Maglott D, Ostell J, Pruitt KD, Tatusova T (2011) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 39 , D52-D57. doi:10.1093/nar/gkq1237
Meng G, Li Y, Yang C, Liu S (2019) MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization.Nucleic Acids Research, 47 (11), e63–e63.
Miller JT, Jolley-Rogers G, Mishler BD, Thornhill AH (2018) Phylogenetic diversity is a better measure of biodiversity than taxon counting.Journal of Systematics and Evolution, 56 , 663-667. doi:10.1111/jse.12436
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, et al. (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37 (5), 1530-1534. doi:10.1093/molbev/msaa015
Nei M. (1987). Molecular evolutionary genetics . New York: Columbia University Press.
Peet RK (1974) The measurement of species diversity. Annual Review of Ecology and Systematics, 5 (1), 285-307.
Pruitt KD, Tatusova T, Brown GR, Maglott DR (2012) NCBI Reference Sequences (RefSeq): Current status, new features and genome annotation policy. Nucleic Acids Research, 40 , 130-135. doi:10.1093/nar/gkr1079
Qu XJ, Moore MJ, Li DZ, Yi TS (2019) PGA: a software package for rapid, accurate, and flexible batch annotation of plastomes. Plant Methods, 15 , 50. doi:10.1186/s13007-019-0435-7
Riaz T, Shehzad W, Viari A, Pompanon F, et al. (2011). EcoPrimers: Inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Research , 39, 1-11.
Rozas J, Ferrer-Mata A, Sánchez-DelBarrio JC, Guirao-Rico S, et al. (2017) DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets. Molecular Biology and Evolution, 34 , 3299-3302. doi:10.1093/molbev/msx248
Sanchez-Puerta MV, Ceriotti LF, Gatica-Soria LM, Roulet ME, et al. (2023) Invited Review Beyond parasitic convergence: unravelling the evolution of the organellar genomes in holoparasites. Annals of Botany, 132 (5), 909-928. doi:10.1093/aob/mcad108
Sayers EW, Cavanaugh M, Clark K, Ostell J, et al. (2019). GenBank. Nucleic Acids Research , 47, D94-D99.
Smith DR, Keeling PJ (2015) Mitochondrial and plastid genome architecture: reoccurring themes, but significant differences at the extremes. Proceedings of the National Academy of Sciences, 112 (33), 10177-10184.
Song M-H, Yan C, Li J-T (2022) MEANGS: an efficient seed-free tool for de novo assembling animal mitochondrial genome using whole genome NGS data. Briefings in Bioinformatics, 23 (1), bbab538-bbab538.
Stevens PF. (2017). Angiosperm phylogeny website, version 14, July 2017 [and more or less continuously updated since].
Talevich E, Invergo BM, Cock PJ, Chapman BA (2012) Bio.Phylo: a unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython. BMC Bioinformatics, 13 , 209. doi:10.1186/1471-2105-13-209
Tillich M, Lehwark P, Pellizzer T, Ulbricht-Jones ES, et al.(2017) GeSeq - versatile and accurate annotation of organelle genomes.Nucleic Acids Research, 45 (W1), W6-W11. doi:10.1093/nar/gkx391
Tonti-Filippini J, Nevill PG, Dixon K, Small I (2017) What can we do with 1000 plastid genomes? Plant Journal, 90 , 808-818. doi:10.1111/tpj.13491
Turland NJ, Wiersema JH, Barrie FR, Greuter W, et al. (2018).International code of nomenclature for algae, fungi, and plants (Shenzhen Code) : adopted by the nineteenth International Botanical Congress, Shenzhen, China, July, 2017 . Glashütten, Germany: Koeltz Botanical Books.
Untergasser A, Cutcutache I, Koressaar T, Ye J, et al. (2012) Primer3-new capabilities and interfaces. Nucleic Acids Research, 40 , 1-12. doi:10.1093/nar/gks596
Virtanen P, Gommers R, Oliphant TE, Haberland M, et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python.Nature Methods, 17 (3), 261-272.
Walker JF, Walker-Hale N, Vargas OM, Larson DA, Stull GW (2019) Characterizing gene tree conflict in plastome-inferred phylogenies.PeerJ, 7 , e7747. doi:10.7717/peerj.7747
Waskom ML (2021) Seaborn: statistical data visualization. Journal of Open Source Software, 6 (60), 3021.
Weng ML, Blazier JC, Govindu M, Jansen RK (2014) Reconstruction of the ancestral plastid genome in Geraniaceae reveals a correlation between genome rearrangements, repeats, and nucleotide substitution rates.Molecular Biology and Evolution, 31 (3), 645-659. doi:10.1093/molbev/mst257
Will M, Claßen-Bockhoff R (2017) Time to split Salvia s.l. (Lamiaceae) – New insights from Old World Salvia phylogeny. Molecular Phylogenetics and Evolution, 109 , 33-58. doi:https://doi.org/10.1016/j.ympev.2016.12.041
Wu H, Ma PF, Li HT, Hu GX, Li DZ (2021) Comparative plastomic analysis and insights into the phylogeny of Salvia (Lamiaceae). Plant Divers, 43 (1), 15-26. doi:10.1016/j.pld.2020.07.004
Wu P, Xu C, Chen H, Yang J, et al. (2021) NOVOWrap: An automated solution for plastid genome assembly and structure standardization.Molecular Ecology Resources, 21 (6), 2177-2186. doi:10.1111/1755-0998.13410
Xi Z, Liu L, Davis CC (2016) The Impact of Missing Data on Species Tree Estimation. Molecular Biology and Evolution, 33 (3), 838-860. doi:10.1093/molbev/msv266
Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood.Molecular Biology and Evolution, 24 (8), 1586-1591.
Yuan Y, Jin X, Liu J, Zhao X, et al. (2018) The Gastrodia elata genome provides insights into plant adaptation to heterotrophy.Nature Communications, 9 (1), 1615. doi:10.1038/s41467-018-03423-5
Zhang D, Gao FL, Jakovlic I, Zou H, et al. (2020) PhyloSuite: An integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies.Molecular Ecology Resources, 20 (1), 348-355. doi:10.1111/1755-0998.13096
Zhang HR, Xiang QP, Zhang XC (2019) The Unique Evolutionary Trajectory and Dynamic Conformations of DR and IR/DR-Coexisting Plastomes of the Early Vascular Plant Selaginellaceae (Lycophyte). Genome Biology and Evolution, 11 (4), 1258-1274. doi:10.1093/gbe/evz073