Corresponding author:
Ping Wu, wpwupingwp@outlook.com
Abstract:
Organelle genomes serve as crucial datasets for investigating the
genetics and evolution of plants and animals, genome diversity, and
species identification. To enhance the collection, analysis, and
visualization of such data, we have developed a novel open-source
software tool named Organelle Genome Utilities (OGU). The software
encompasses three modules designed to streamline the handling of
organelle genome data. The data collection module is dedicated to
retrieving, validating, and organizing sequence information. The
evaluation module assesses sequence variance using a range of methods,
including novel metrics termed stem and terminal phylogenetic diversity,
as well as observed resolution. The primer module could design universal
primers for downstream applications. Finally, a visualization pipeline
has been developed to present comprehensive insights into organelle
genomes across different lineages rather than focusing solely on
individual species. The performance, compatibility, and stability of OGU
have been rigorously evaluated through benchmarking with four datasets,
including one million mixed GenBank records, plastid genomic data from
the Lamiaceae family, mitochondrial data from rodents, and 308 plastid
genomes sourced from various angiosperm families. Based on software
capabilities, we have identified 30 plastid intergenic spacers that
exhibit a moderate evolutionary rate and offer practical utility
comparable to coding regions, which highlights the potential
applications of intergenic spacers in organelle genomes. We anticipate
that OGU will substantially enhance the efficient utilization of
organelle genomic data and broaden the prospects for related research
endeavors.
Keywords:
Organelle genome, polymorphism evaluation, plastid evolution, intergenic
spacers
Availability:
Organelle Genome Utilities is an open-source cross-platform Python
program freely available at https://github.com/wpwupingwp/OGU.
Introduction
Plastids and mitochondria are semi-autonomous organelles in eukaryotic
organisms, housing their own distinct genetic information in the form of
organelle genomes. Compared to the nuclear genome, organelle genomes are
typically smaller in size, often circular in structure, and exhibit a
simpler genetic background (Gray 1993; Lynch et al. 2006; Smith
& Keeling 2015). Notably, such as plastids in green plants and
mitochondria in animals are commonly inherited maternally (Chunget al. 2023). Due to moderate evolutionary rates, simple genetic
backgrounds, and the increasingly mature methods for data acquisition,
particularly in assembly (Dierckxsens et al. 2016; Jin et
al. 2020; Meng et al. 2019; Song et al. 2022) and
annotation (Qu et al. 2019; Tillich et al. 2017),
organelle genomes, especially plant plastid genomes, have found
widespread applications in studying species phylogenetic relationships,
species identification, and genomic variations (Barrett et al.2016; Hollingsworth et al. 2009; Weng et al. 2014).
Correspondingly, the related organelle genome data are also increasingly
abundant. As of early 2024, NCBI RefSeq has released over 7,300
non-redundant complete plastid genomes and more than 10,800 animal
mitochondrial genomes (Pruitt et al. 2012). These extensive
datasets facilitate cross-species, large-scale comparative studies (Huet al. 2023; Li et al. 2019; Tonti-Filippini et al.2017), and drive advancements in associated methodologies.
Currently, researchers can directly use full-length sequences for genome
structure and functional variation studies (Sanchez-Puerta et al.2023; H. R. Zhang et al. 2019). Software tools for these purposes
include but are not limited to progressiveMauve, IRscope, and CPGView
(Amiryousefi et al. 2018; Darling et al. 2010; Liuet al. 2023). For phylogenetic studies and species
identification, it is common to split the annotated genomes into coding
gene fragments for further analysis. Integrated tools like TBtools and
PhyloSuite are available for automating multi-step analyses to generate
datasets tailored for different downstream analyses (C. Chen et
al. 2020; D. Zhang et al. 2020). Additionally, fragmented coding
region data can be obtained directly from databases like NCBI GenBank
and CGIR (Hua et al. 2022; Sayers et al. 2019). However,
there are still some important issues worth addressing in the analysis
and utilization of organelle genome data. Several key problems are
outlined as followed.
Evaluation and selection of organelle genomic sequences are
crucial in the analysis . Genes and non-coding regions in organelle
genomes may evolved at varying rates, leading to differences in
subsequent analytical methods and applications. Highly variable
fragments like CO1, ycf1 and matK are more suitable for
DNA barcoding, moderately variable coding sequences (CDS) are better
suited for phylogenetic analysis, while highly conserved genes such asrrn16 , psaC hold application value in genome assembly and
evolutionary studies (Hebert et al. 2003; Kress & Erickson 2008;
Li et al. 2019; P. Wu et al. 2021). In phylogenetic analysis,
partitioning the data and assigning different evolutionary models to
each fragment using tools allows for the convenient utilization of data
with varied mutation rates, yet conflicts between fragments exceeding
the model’s capacity can still yield ambiguous results (Lanfear et
al. 2017; Walker et al. 2019). Therefore, sequence variation
assessment methods are particularly critical. There are existing
indicators and software tools can evaluate sequence polymorphism. For
example, nucleotide polymorphism is commonly used, with software like
DnaSP utilized for calculations (Nei 1987; Rozas et al. 2017).
However, excluding gap sites and ambiguous bases may underestimate
sequence variance, especially in loci with numerous insertions or
deletions (Xi et al. 2016). Furthermore, nucleotide diversity may
not differentiate variance within and across subgroups of assessed
sequences due to heterogeneity that could affect the results (Barrett et
al. 2016). Another method, the selection pressure dN/dS ratios are only
applicable to protein-coding sequences and cannot handle gaps well (Yang
2007).
The intergenic spacers in organelle genomes are often overlooked .
Intergenic spacers (or intergenic regions) refer to the sequences
between the stop codon of an upstream gene and the start codon of a
downstream gene. In animal mitochondrial genomes, these regions are
sparse and compact, containing only a few homologous intergenic spacers.
In plant mitochondrial genomes, due to massive structure variations,
these spacers are highly variable across species in terms of size and
structure, behaving low homology (Gualberto & Newton 2017). For plastid
genomes, due to their stable structure and good collinearity, there is
relatively higher homology among intergenic spacers. While they do not
encode protein products, some may contain regulatory elements such as
promoters or non-coding RNAs that play a role in regulating the
expression of plastid-related proteins in response to environmental
changes (Anand & Pandi 2021; Belbin et al. 2017; Dietrichet al. 2015). The notable homology and potential biological
functions of the plastid intergenic spacers suggest a scientific value
for further exploitation, but the current state of research does not
fully reflect this potential. Apart from using certain highly variable
segments (e.g., atpB-rbcL , trnL-F ) as supplementary
universal DNA barcodes for plant species identification, intergenic
spacers are only utilized in resolving phylogenetic relationships at
lower taxonomic levels in some cases (C. W. Chen et al. 2013;
Escobari et al. 2021).
Another reason for this neglect is the lack of data. For intergenic
spacers widely used in species identification, data is primarily
acquired through amplification sequencing. For other intergenic spacers,
the efficiency of data acquisition is limited due to the need for
targeted primer design and amplification experiments. Despite the
existence of relevant software tools (Riaz et al. 2011;
Untergasser et al. 2012), challenges such as lack of support for
gapped alignments and ambiguous bases, slow processing speeds, and
cumbersome usage persist. Given the rapid accumulation of plastid genome
data, extracting intergenic spacers from plastid genome sequences based
on annotations is a viable method for data acquisition. However, there
is currently a lack of specific tools for this purpose. Popular data
analysis toolkits such as PhyloSuite and TBTools only offer
functionality for extracting coding sequences from genomes.
The interference caused by irregular annotation information . Due
to the wide time span of organelle genomic data production, researchers
used different tools and criteria for generating and submitting data,
and the names of homologous genes in different genomes may be
inconsistently named, leading to errors or missing data in subsequent
analyses. For instance, variations in letter case within gene names can
mistakenly differentiate genes (e.g., “rbcL”, “rbcl,”RBCL”), while
tRNA and rRNA are particularly problematic areas, including variations
in naming patterns (e.g., “trnH-GUG”, “trnH”, “tRNA His gug”,
“trnHgug”). An extreme example is the ribosomal RNA gene rrn16in plastid genomes, where seven different name variations can be found
in public data: “16rrn”, “16S”, “16S ribosomal RNA”, “16S rRNA”,
“rrn16”, “rrn16s”, and “rrn16S” (Maglott et al. 2011).
Although manual correction methods can be employed such as CGIR, when
dealing with a large volume of data, especially in the absence of
standardized conventions, this approach may introduce additional errors
and consume a significant amount of time.
Furthermore, Improvement is needed in presenting organelle genome
sequence analysis results. Currently, the assembly and annotation
results are typically displayed in circular plots, with various
user-friendly tools available such as OGDRAW and CPGView (Greineret al. 2019; Liu et al. 2023). However, this display method is
often limited to individual species or samples. For analyses targeting
taxonomic groups containing multiple species, especially in sequence
variation analyses, conventional graphical representations like box
plots and bar graphs are still predominantly used. Therefore, the
development of new visualization methods tailored for multi-species
organelle genome analysis results would help elucidate the sequence
evolutionary patterns within organelle genomes of specific taxa more
intuitively.
Considering the aforementioned, to assess sequence variations more
effectively, facilitate the acquisition and utilization of intergenic
spacers, utilize standardized sequence data for analysis, and enhance
result presentation, we have developed Organelle Genome Utilities (OGU)
software. OGU functions as a versatile toolkit for gathering, analyzing,
and visualizing organelle genomic data, thereby enhancing the efficient
utilization of data resources. Users simply need to specify their
interested taxa or organelles, and the software automates a
comprehensive set of analyses, including but not limited to collecting,
organizing, and standardizing organelle genomic sequence data, assessing
sequence variations using multiple indicators, and designing universal
primers for downstream applications. The output from the software
includes various types of sequence files suitable for diverse downstream
analyses, detailed result tables, and visual representations that
incorporate characteristics of organelle genomes. The software is
written in pure Python language, facilitating cross-platform operation
on Microsoft Windows, macOS, and Linux operating systems. Moreover, the
software’s hardware requirements are relatively modest. For most cases,
a standard PC with 8GB of memory and a typical CPU are sufficient.
Furthermore, to enhance user experience, OGU offers both command-line
and graphical user interface along with comprehensive documentation for
user guidance.
Materials and Methods
OGU contains three modules, GB2fasta, Evaluate, and Primer, plus a fully
automated analysis results visualization pipeline. These components can
be used individually or in combination to serve different purpose.
The GB2fasta module
The GB2fasta module primarily concentrates on the collection and
organization of data. It accepts two kinds of input: local GenBank files
or Entrez queries for downloading GenBank data. For the latter one, the
GB2fasta module offers straightforward options to automatically generate
the query string to precisely retrieve target records from GenBank
including gene names, taxonomy, organelle type, date, length range, etc.
Next, the program downloads the data in batches. To avoid the
interference of network fluctuations, when the number of retrieved
records is large, the program supports downloading in chunks and
retransmission at breakpoints.
For the local GenBank files, the program firstly validates the integrity
and reliability of the data to exclude abnormal records, such as
truncated record due to Internet disconnection, invalid record due to
annotation error, or empty record containing none of sequence data.
Next, according to annotation information, validated records are divided
precisely into a series of FASTA-format files with organized sequence
ID, including types of sequences, accession number, and taxonomic
information (kingdom, phylum, class, order, family, genus, and species).
Each file contains only sub-fragments of the same locus. Notably, the
intergenic spacer, which is inaccessible from GenBank directly, will be
extracted according to its upstream and downstream annotations. For
taxon information, although the full lineage name is already included in
the record, it is tricky to use given that it only contains the name of
each taxonomic rank but the rank levels are missing. Querying the NCBI
Taxonomy database with Entrez APIs could solve this issue, yet it is
unrealistic for large number of records. OGU follows theInternational Code of Nomenclature for algae, fungi, and plants
(Shenzhen Code) to recognize plant rank levels since that commonly used
ranks have uniform suffixes (Turland et al. 2018). For instance,
all plant family names have the suffix “-aceae”. For other rank names
or taxa without uniform naming conventions, such as animal family names,
the program searches a self-contained very light-weight taxonomy
database (simplified from NCBI Taxonomy) to efficiently determine the
rank level (Federhen 2012).
In the process of splitting, considering that there may be repeated
regions in the original sequence, especially in the case of inverted
repeat regions in plastid genome, the program provides options to handle
them in different ways, including allowing inverted repeat sequences or
keeping only one (such as intergenic spacers atpB _rbcLand rbcL _atpB ) and whether other types of repeat
sequences are allowed. Besides, the program also provides a renaming
function for plastid genome, which can automatically rename the
non-standard gene names into uniform gene names. For example, “rbcL”,
“rbcl”, “Rbcl”, and “rbc large subunit” are all rbcL gene,
and the program can automatically change them to “rbcL” to avoid
splitting the same gene into multiple files due to irregular gene naming
in annotations. This function is mainly implemented using regular
expressions, especially optimized for plastid tRNA genes.
Finally, the GB2fasta module remove redundant sequences in extracted
files. By default, the program keeps one record for each organism in
each locus for ease of sequence variance analysis. Files without
reduction are also available as a part of the output.
The Evaluate module
The Evaluate module handles the assessment of sequence variation levels.
When the GB2fasta module finishes running, the Evaluate module is called
subsequently. Users may also manually run this module by providing
aligned or unaligned fasta-format files as input. The program calls
MAFFT v7.475 to perform the multiple sequence alignment (Katoh &
Standley 2013). Considering the different variance of input files, the
option “–auto” is used to allow MAFFT to detect and utilize the
best alignment strategy for each file. Subsequently, the program
evaluates the variance of each aligned files via multiple indicators:
1. Nucleotide diversity (Pi). Under default options, this method behaves
same as DnaSP to calculate nucleotide diversity of the alignment, a
widely used method to measure the variance of the sequences. By adding
options to accept gaps and ambiguous bases in the alignment, the program
can utilize more data that DnaSP ignored, especially for alignment
regions containing multiple gaps since one gap in one sequence results
in the removal of the whole regions by DnaSP.
2. Shannon index. The Shannon index (or Shannon’s entropy) is used as a
species richness index in ecology and population genetics (Peet 1974).
Herein, it is used to quantify the diversity of the sequences. If the
alignment includes several groups of sequences that shows highly
variance between groups but has few variances in the group (for
instance, a family includes two distinct genera while species in each
genus have few divergence), the abnormal entropy could hint the
existence of such case. For the convenience of comparison, the entropy
was standardized by dividing the maximal entropy of each alignment.
3. Observed resolution. By dividing the number of kinds of different
sequence by the total numbers of the sequences in the alignment, this
method is the simplest and thus the fastest to calculate. Although such
naive method may not accurately reflect the real situation, due to its
computational speed advantage, it can be utilized as a upper bound given
that an alignment with low observed resolution can hardly have
sufficient variance.
4. Tree resolution. In this method, IQ-TREE v2.2.2.2 is used to infer
maximum likelihood trees from the alignments (Minh et al. 2020).
The option “-czb” is used to avoid the interference of invalid branch
by collapsing all near zero branches in the tree. Then, the tree
resolution is estimated by dividing the number of internal nodes by the
number of terminal nodes in the tree. Finally, the program uses
Biopython’s Phylo module to read the phylogenetic tree file and
calculate the tree resolution (Talevich et al. 2012).
5. Phylogenetic diversity (PD) and its variations. With the phylogenetic
trees built above, the phylogenetic diversity is estimated by summing
the lengths of all those branches that are members of the corresponding
minimum spanning path (Faith 1992). Phylogenetic diversity provides a
comparable, evolutionary measure of biodiversity (Miller et al.2018). However, since it sums all branch length including internal and
terminal branches, it mixes the variances within and across the clades
that it may cause the overestimation of the ability of the DNA barcodes
to classify samples. To address this issue, we proposed two variations
of PD: stem phylogenetic diversity (PD-stem) and terminal phylogenetic
diversity (PD-terminal). PD-stem only sums the length of internal
branches that represent the synapomorphy of the sequences, while
PD-terminal sums the length of terminal branches that denotes the amount
of autapomorphy (Supplemental Information 1). PD-terminal focuses on
variations between closely related lineages rather than dissimilarities
between clades that a higher PD-terminal value indicates better species
resolution, while PD-stem value merely indicates the distinction amongst
the sub-groups of the samples. Nevertheless, PD-stem could measure the
robustness of the phylogenetic tree that a poor PD-stem suggests an
unstable taxonomic relation. For the convenience of comparison, the
three PD values were standardized by dividing the number of the
sequences of each alignment. To avoid interference of abnormal
sequences, standard deviations were calculated to hint the existence of
abnormal branch length values.
6. GC ratio and gap ratio. Apart from above evaluation methods, GC ratio
and gap ratio of the alignment were also computed in case of potential
usage. Particularly, an extremely high gap ratio indicates that the
alignment quality should be concerned.
After the evaluation of the whole alignment with above methods, a
sliding-window scanning is performed to infer variation hotspots in each
alignment. The output of the module includes a table including results
of all methods mentioned above and a figure showing the sliding-window
scanning results for each input file.
The Primer module
After sequence variance evaluation, the Primer module is optional for
designing universal primer used in downstream application. Users may
also manually use this module by providing aligned fasta-format files as
input. Firstly, the program generates a consensus sequence by counting
the frequency of bases in each column of the alignment. If a column
comprised of two or three bases with similar amount, their corresponded
ambiguous base is used in the consensus. In extreme cases that all four
bases (A, T, C, and G) have approximately equal proportions, the
ambiguous base “N” will be used. Next, a quick scanning that only
calculates Observed resolution is performed on the alignment to infer
the variation hotspots. The detected hyper-variant fragments regions are
marked in the consensus sequence for designing primer. Subsequently, the
upstream/downstream regions of marked sequences are iteratively checked
to screen suitable primer sequences, by evaluating the primer length,
the number of ambiguous bases, the existence of homodimers and secondary
structures, tandem repeat occurrences and other conditions. Then, anin silico validation is performed by comparing the candidate
primer sequences with original sequences in the alignment (gap removed)
using blastn-short of BLAST v2.13.0 (Boratyn et al. 2013).
Primers with excessive mismatch ratios, insufficient positive matches,
or low overall coverage are excluded. In addition, primers with abnormal
melting temperature or have potential to form hairpins or heterodimers
are removed according to Primer3-py
package(https://github.com/libnano/primer3-py), which integrated Primer3
(Untergasser et al. 2012). During the validation, since BLAST and
Primer3 have limited supporting of ambiguous bases, necessary auxiliary
works are finished by the Primer module. Finally, the program picks
matched primer pairs with suitable amplified length and coverage and
generate a table of primers information as output.
The visualization pipeline
For displaying the variations within organellar genomes across multiple
species of a target clade, we have developed an interactive plotting
pipeline: draw_circle_plastid.ipynb for plastids and
draw_circle_mitochondria.ipynb for mitochondria. Considering the
diverse customization needs of users regarding the output imagery, the
pipelines are provided as Jupyter notebooks to facilitate easy
adjustments of styles such as colors and fonts (Kluyver et al.2016). The input to the pipeline comprises a GenBank file of an
organellar genome from a reference species alongside a table of sequence
variation assessments generated by the Evaluate module. For plastid
genomes, users have the option to provide quadripartite structure
information to better represent their inverted repeat structures. The
pipeline starts with extracting gene names, lengths, and positions from
the GenBank file and intersects these with names from the results table,
which can minimize the confounding effects of additional or missing
fragments. Subsequently, with pyCirclize
(https://github.com/moshi4/pyCirclize), the program arranges genes and
intergenic spacers in accordance with the annotation’s order from the
reference genome, thereby allowing a consolidated display across
multiple species. Following this setup, tracks are drawn from outermost
to innermost, each corresponding to different variant indicators’
results. Within each track, segment values are displayed according to
their position. Finally, the program outputs a vector format image file
which can be further modified as needed.
Benchmarks
To test the compatibility of the GB2fasta module, one million records of
green plants were randomly retrieved and processed with the command
“python3 –m OGU.gb2fasta –taxon Viridiplantae -min_len 100
–max_len 1000000 –count 1000000 –out test_gb2fasta”. For
benchmarking the function of Evaluate module and its ability to filter
out the highest variant sequences, we tested records of Lamiaceae family
in GenBank with the command “python3 –m OGU –taxon Lamiaceae
–max_len 1000000 –out test_evaluate -quick” and the result was
labeled as “default”. The “-quick” option was used to skip
sliding-window analysis to reduce time consuming. In another run, the
options “-ig” and “-iab” were applied to ignore gaps and ambiguous
bases in the alignment to mimic the behavior of other software for
comparison. To reduce the interference of redundant data and low-quality
data to the analysis, several restrictions were applied: sequence number
greater than 100, and the sequence type should be “gene” or “spacer”
only since sequences of coding sequences (CDS), introns, RNA and most of
“misc feature” were already covered by those two types. Statistical
analysis and visualization were performed using seaborn and SciPy
(Virtanen et al. 2020; Waskom 2021). Besides, the rbcL , a
widely used gene for universal plant DNA barcode was selected for
testing sliding-window analysis and Primer module (Hollingsworth et al.
2009). To further test the software’s compatibility, we evaluated
mitochondrial genome data from rodents (Rodentia) using the following
command: “python3 -m OGU -og mt -taxon rodents –refseq yes -out
test_rodents”.
Finally, to further test the analysis function of the program on the
intergenic spacers and the compatibility on various taxonomy, angiosperm
families were selected according to the Angiosperm Phylogeny Website
version 14 (APG 4) (Stevens 2017). Plastid genomic sequences were
obtained by randomly selecting a representative species from each family
using the GB2fasta module and sequences were combined to one GenBank
file, and the command “python3 -m OGU -gb angiosperm_cp.gb -out
test_spacers -rename” was run to divide and analyze the data. For
subsequent phylogenetic analysis, IQ-TREE2 was used with the command
“iqtree2 -s combine.fasta -bb 1000 -bnni -p -czb -safe” and the most
suitable base substitution model is automatically selected using
ModelFinder (Kalyaanamoorthy et al. 2017). The phylogenetic tree
generated above was visualized using iTOL v5 (Letunic & Bork 2021).
Results
GB2fasta successfully processed one million random GenBank records
One million sequence records of green plants were successfully retrieved
from NCBI server. 534 of these are complete plastid genomes, 30 are
mitochondria genomes, and the rest are genes or genome fragments. 7,176
records were judged as invalid data during the analysis process, mainly
because the records only include annotation information and the
corresponding sequence are empty, or there is no feature annotation
entry (such as GenBank ID BMBV00000000, KEPT00000000, MU180198, etc.).
194,758 features in the annotations of records failed to be extracted
due to missing sequences or abnormal annotations. For instance,nad1 and nad5 genes in the Gastrodia elatamitochondrial genome referred sequences in another records (Yuanet al. 2018). After sequence extraction, validation and
deduplication, 1,401,251 files containing 1,827,413 records were
generated and 1,453,602 records were left (Supplemental Information 2
and 3). One of the reasons such an unexpected number of files generated
is that one genomic sequence can be divided into hundreds of files.
Moreover, since no organelle type filter was used during querying step,
many records retrieved are nuclear genes without appropriate annotation.
For instance, most of nuclear gene records only have a locus number that
they must be stored in different files, such as “gene-LOC100274415”,
“gene-Bca52824_090542”, and “gene-BPRA1399_LOCUS959”.
The download process cost about five hours due to speed limitation. The
validation and divide process cost about half to an hour in different
operating systems or hardware due to processing enormous files, which a
solid-state disk (SSD) could be helpful. During the running, one CPU
core and at most 1 GB memory were used that normal personal computers
can easily handle.
The performance of multiple sequence polymorphism evaluation methods
varied on Lamiaceae data
118,311 records of Lamiaceae family were successfully downloaded, and 30
records were found to be invalid (such as GenBank ID KEPD00000000).
35,205 files were generated including 129,181 non-redundant records and
2,253 files were successfully aligned and evaluated since most of
generated files only include one records due to annotation issues such
as missing annotation or empty sequences. After processing and
filtration, there are 121 genes and 109 intergenic spacers left for
evaluation and the dataset is labeled as “default”. By ignoring the
columns containing ambiguous bases and gaps in the alignment that “-ig
-iab” option did, there are 98 genes and 102 intergenic spacers and
labeled as “ig_iab” dataset (Supplemental Information 4). The missing
genes includes some widely applied genes such as matK andrbcL . This is due to the existence of gaps since the alignment ofmatK have 2,128 columns while all of them contain gaps just
because of few records (GenBank ID: DQ640414, JQ767173).
After applying multiple sequence polymorphism evaluation methods on
Lamiaceae data, the results show that the number of sequences and gap
ratio has no or little effect on each method, and the alignment length
only has a moderate correlation with observed resolution, Shannon index
and tree resolution, but has no effect on phylogenetic diversity (PD)
(Figure 1A). GC ratio showed a moderate negative correlation with each
method, which may be due to the relatively high GC ratio of conserved
RNA genes. Besides, there is a significant positive correlation between
observed resolution, entropy, and tree resolution. The three kinds of PD
also are significantly positive correlated.
The comparison of the result of two dataset show that the variance value
assessed by each method significantly increased after retaining gaps and
ambiguous bases, given that the indel variation information in the
sequence was considered (Figure 1B). Compared with other methods, the
significance of such change on PD is slightly lower, and there is no
significant change in PD-stem, indicating that the gaps in these
sequences have little effect on the construction of phylogenetic
relationships. By sorting the results with ascending PD values, Figure
1C shows that the overall trend of the three methods is close. There are
fluctuations in the opposite direction between PD-stem and PD-terminal,
that is, the variance of these loci comes from different contributors.
For example, the PD-terminal of rps12 and rrn5 is higher
than that of PD-stem, which means that there are greater differences
among closely related species; rps19 , petL and other genes
are the opposite, that is, the sequence differences mainly originate
between different clades rather than within.
To further compare the effectiveness of various methods, ten loci from
“default” dataset with the highest degree of variation indicated by
each method were selected for intersection analysis (Supplemental
Information 5). Figure 1D shows that almost all highly variable
fragments indicated by Pi were different from other methods which may be
because gaps have a larger effect on it. Observed resolution, entropy
and tree resolution have four intersections, and the PD series has three
intersections, which is consistent with previous correlation analysis.
Except for Pi, all other six methods considered the intergenic regionrpl32 -trnL UAG is highly variable. This
result has also been supported by previous reports (Will &
Claßen-Bockhoff 2017; H. Wu et al. 2021).
For rbcL , the program completed the sliding window analysis with
500 bp window size and 50 bp step size. The results show that Pi is
consistent with the gap ratio which increase from the beginning to the
site of about 700 bp and become flat at end of the alignment
(Supplemental Information 6), which is consistent with the distribution
of alignment gaps. Other evaluation methods show flat trend indicating
that they are relatively insensitive to gaps. Interestingly, for
phylogenetic diversity, PD-terminal remains flat, while PD-stem
fluctuates synchronously with PD. The alignment show there are two or
more kinds of homologous variation sites in these sudden rises and
neighboring sequences are basically identical, that is, these mutations
are mainly differences between sequences of different groups rather than
within the group such as 176T>C and 306C>G
(Supplemental Information 7). In addition, the program successfully
designed and selected one pair of best performance universal primers
based on the alignment. Due to the widely distribution of gaps, the
average amplified product is shorter than 100 bp (Supplemental
Information 8).
The download and divide process cost half an hour. The alignment process
cost about eight hours and the evaluation cost two hours. During the
running, one CPU core and at most 3 GB memory were used and the time
consumption could be reduced by using more CPU cores theoretically.
Intergenic spacers behave similarly with CDS on 308 angiosperm plastid
genomic data
A total of 308 complete plastid genomic sequences were retrieved from
which 1,015 fragments were extracted, encompassing representation from
71% of APG IV families. After discarding sequences numbering fewer than
100 and those that were duplicates, a subset of 241 sequences remained
for comparative analyses across different categories. Detailed
evaluation results and associated tests for significance of differences
can be found in Supplemental Information 9 and 10.
The two types of RNA generally exhibit similar patterns. Both tRNA and
rRNA have significantly higher GC content compared to other types of
sequences, while their nucleotide diversity (Pi) is markedly lower
(Figure 2A). In terms of tree resolution and the three measures of
phylogenetic diversity, RNA also scored lower than other types of
fragments, with no significant differences observed between them (Figure
2A and 2B). Introns and spacers, both belonging to non-coding regions,
had the highest GC content, and their phylogenetic diversity scores
(PDs) were significantly greater than those of coding sequences (CDS).
Apart from PDs, no significant differences were detected between
spacers, introns, and CDS in other indicators. One explanation is that
because the sampled taxa spanned different families, polymorphisms
inferred from other metrics approached their maximum values for all
sequence types. Such situation is particularly evident with the Shannon
index and Observed resolution (Supplemental Information 11). PDs did not
exhibit this saturation effect as they are solely associated with the
branch lengths of phylogenetic trees constructed from the sequences. For
spacers, PDs were highest (Figure 2C), consistent with the theoretical
expectation of minimal negative selective pressure. Furthermore,
although spacer sequences had the highest PD-terminal values, PD-stem
values were also substantial. This suggests that the evaluated
angiosperm samples retained a considerable number of homologous
sequences within their spacer regions, which could provide phylogenetic
information rather than being entirely composed of random variation.
For Gap ratio, differences are evident between different types of
sequences; however, significant variation also exists within the same
sequence type. For example, within coding sequences (CDS), the alignment
length of ycf1 is approximately 15 kb, of which over 70%
consists of gaps, whereas for petG — also a CDS — the
alignment length is 114 bp with a gap ratio of 0%.
Figure 2D presents a circular map of the evaluated fragments arranged
according to the plastid genome structure of Nicotiana tabacum(GenBank ID Z00044), generated using the pipeline
draw_circle_plastid.ipynb. In the case of the inverted repeat regions
(IRs), disregarding the saturation effects previously mentioned, the
sequence polymorphism within these fragments is less than in other
regions. Beyond the influence of quadripartite structures, another
conspicuous observation is the intermittent dips across various segments
in the circle, corresponding predominantly to genes of notably short
lengths such as psbF and psbN .
To further compare coding sequences (CDS) with spacers, 77 CDS and 30
spacer sequence fragments were selected based on alignment length, gap
ratio, and degree of sequence variation, subsequently compiled into two
datasets: CDS-tree and spacer-tree. Phylogenetic trees were constructed
using IQ-TREE2 with the parameters “iqtree2 –s sequence.nex -p -bb
1000 -bnni -czb -safe -m GTR+F+R4”. The results indicate three main
differences between the two phylogenetic trees (Supplemental Information
12 and 13). Firstly, in the spacer tree, the order Austrobaileyales is
basal, whereas in the CDS tree, the Amborellales takes this position.
Secondly, in the CDS tree, the order Ceratophyllales is sister to the
eudicots, while in the spacer tree it clusters near the base of
angiosperms alongside Chloranthales. Additionally, within the core
eudicots clade, Buxales and Trochodendrales occupy basal positions but
exhibit reversed internal-external relationships between the two trees.
Apart from these individual discrepancies—some of which have been
previously acknowledged as challenging taxa—most samples’ positional
relationships are consistent and align with APG IV (Hu et al. 2023).
This suggests that plastid spacers hold potential for application in
phylogenetic analysis among other fields. Utilizing spacers allows for a
more comprehensive use and analysis of plastid genome data, providing
new insights for related research.
Discussion
OGU helps better utilizing GenBank data
Till now, GenBank offers more than 1 billion sequence records and
related taxonomy database including more than 600,000 formally described
species. However, it is not straightforward to access and utilize such
abundant resources. Although all sequence records in GenBank are
accessible with the NCBI Entrez retrieval system, it is not easy to
perform an accurate retrieval, especially for novices. Besides, the
download process could be unsmooth, especially for large request
targeting on specific taxonomy that NCBI FTP service cannot help.
GB2FASTA provides a wealth of options and a fully automated process to
help users solve the problem of accurate acquisition of target data.
Currently user can download records with various file format. The
GFF3-format file only contains feature annotations. The FASTA-format
file contains sequence and are widely compatible with other
bioinformatical software but has very limit annotations. The
GenBank-format files have full annotations and sequence but file
structure is complex which is generally incapable to be directly
utilized by downstream analytical software. The GB2fasta module of OGU
offers simple yet powerful means to convert files from GenBank-format to
FASTA-format with necessary information. By utilizing the annotation
information, the original records were divided precisely to extract
sequences of each locus, as well as intergenic spacers which are
undefined in the original file. Extracted files have locus name,
detailed taxonomy, feature type and original GenBank ID information
which are well-formatted in each sequence ID hence are easier for
organization and utilization, especially for analyzing organelle genomic
data.
Broader application of intergenic spacer data demands extra efforts
The theoretical high rate of variability, the challenges associated with
extraction, and potential structural reorganizations leading to loss
have all impeded the utilization of intergenic spacer regions within
plant plastid genomes. In this study, we employed the GB2fasta module of
the OGU to extract intergenic spacer fragments from complete plastid
genome sequences. Upon evaluating their sequence variation, we
discovered that these intergenic spacers comprise both highly variable
and extremely conserved segments, with overall polymorphism and
phylogenetic resolution comparable to that of protein-coding genes.
After a rigorous selection process that eliminated fragments with poor
alignment quality, low homology, or aberrant lengths, we identified 30
plastid intergenic spacers from angiosperms for phylogenetic analysis.
The resulting phylogenetic trees displayed credible systematic positions
for various taxa, closely mirroring the results obtained from
protein-coding gene exons with strong support values. This indicates
that non-coding regions of the plastid genomes are valuable for
phylogenetic studies. Beyond mere extraction from the plastid genomes,
OGU also offers a Primer module designed to facilitate the development
of universal primers for cross-species applications, assisting
researchers in data acquisition and enabling the use of intergenic
spacers in fields such as species identification as DNA barcoding or
population genetics.
Nevertheless, several issues merit attention regarding the broader
utilization of plastid genome intergenic spacers. The first pertains to
the homology of these regions. Due to the relatively stable structure of
plastid genomes, they are presumed to possess high collinearity and
homology. This assumption holds true except when considering taxa that
have undergone extensive genomic rearrangements, such as those found in
families like Pinaceae, Geraniaceae, and Campanulaceae (Cosner et
al. 2004). However, for groups lacking substantial data and research,
as well as intergenic spacers located at the junctions of the
quadripartite structure of plastid genomes, and regions with excessively
high variability, it is prudent to conduct a homology check using
software tools like BLAST prior to data utilization.
Another issue is the reliability of nomenclature. The naming of
homologous intergenic spacers relies on the gene flanking them.
Discrepancies in name criteria, such as synonyms, spelling variations,
and formatting inconsistencies of the upstream and downstream genes, can
result in the mislabeling of extracted intergenic spacers. OGU has
partially addressed this challenge by implementing regular expressions
and custom conversion rules within the “-rename” option of the
GB2fasta module. Nonetheless, fully resolving this nomenclature problem
necessitates collaborative efforts from data standard setters, data
submitters, and software developers to ensure consistency and accuracy.
For animal mitochondrial genomes, although the program successfully
retrieved 317 complete rodent mitochondrial genomes and completed their
analysis (Supplemental Information 14), results show that all genes are
highly conserved and intergenic spacers exhibit even lower degrees of
variation (Supplemental Information 15). For plant mitochondrial
genomes, due to currently limited public data availability and
significant genomic structural variation, further exploration in the
future is necessary.
Acknowledgements
We thank the authors of Primer3-py for building and update packages
which enhance our software compatibility on Microsoft Windows operating
system. We thank Yan Zhang, Dongyue Jiang, Huijie Liu, Guojin Zhang, and
Haihua Hu for testing the program. We appreciate Prof. Shiliang Zhou for
his advice on the Evaluate module. We also thank numerous volunteers on
GitHub for participation, including but certainly not limited to
minhtrung1997, beausmith94, yihenghu, weissab, BobbyWoo8, codeunsolved,
sunlu123, and yjr930.
This research was supported by grants from the National Natural Science
Foundation of China (32300539) and Startup Fund of Sichuan Normal
University (XJ20220110).
References
Amiryousefi A, Hyvönen J, Poczai P (2018) IRscope: an online program to
visualize the junction sites of chloroplast genomes.Bioinformatics, 34 (17), 3030-3031.
Anand A, Pandi G (2021) Noncoding RNA: An insight into chloroplast and
mitochondrial gene expressions. Life, 11 (1), 1-20.
doi:10.3390/life11010049
Barrett CF, Baker WJ, Comer JR, Conran JG, et al. (2016). Plastid
genomes reveal support for deep phylogenetic relationships and extensive
rate variation among palms and other commelinid monocots. New
Phytologist , 209, 855-870.
Belbin FE, Noordally ZB, Wetherill SJ, Atkins KA, et al. (2017)
Integration of light and circadian signals that regulate chloroplast
transcription by a nuclear-encoded sigma factor. New Phytologist,
213 (2), 727-738. doi:10.1111/nph.14176
Boratyn GM, Camacho C, Cooper PS, Coulouris G, et al. (2013)
BLAST: a more efficient report with usability improvements.Nucleic Acids Research, 41 (W1), W29-W33.
Chen C, Chen H, Zhang Y, Thomas HR, et al. (2020) TBtools: an
integrative toolkit developed for interactive analyses of big biological
data. Molecular plant, 13 (8), 1194-1202.
Chen CW, Huang YM, Kuo LY, Nguyen QD, et al. (2013) trnL-F is a
powerful marker for DNA identification of field vittarioid gametophytes
(Pteridaceae). Annals of Botany, 111 (4), 663-673.
Chung KP, Gonzalez-Duran E, Ruf S, Endries P, Bock R (2023) Control of
plastid inheritance by environmental and genetic factors. Nature
Plants, 9 (1), 68-80.
Cosner ME, Raubeson LA, Jansen RK (2004) Chloroplast DNA rearrangements
in Campanulaceae: phylogenetic utility of highly rearranged genomes.BMC Evolutionary Biology, 4 , 1-17.
Darling AE, Mau B, Perna NT (2010) progressiveMauve: multiple genome
alignment with gene gain, loss and rearrangement. PloS one, 5 (6),
e11147.
Dierckxsens N, Mardulyn P, Smits G (2016) NOVOPlasty: De novo assembly
of organelle genomes from whole genome data. Nucleic Acids
Research, 45 , e18. doi:10.1093/nar/gkw955
Dietrich A, Wallet C, Iqbal RK, Gualberto JM, Lotfi F (2015) Organellar
non-coding RNAs: Emerging regulation mechanisms. Biochimie, 117 ,
48-62. doi:https://doi.org/10.1016/j.biochi.2015.06.027
Escobari B, Borsch T, Quedensley TS, Gruenstaeudl M (2021) Plastid
phylogenomics of the Gynoxoid group (Senecioneae, Asteraceae) highlights
the importance of motif‐based sequence alignment amid low genetic
distances. American Journal of Botany, 108 (11), 2235-2256.
Faith DP (1992) Conservation Evaluation and Phylogenetic Diversity.Biological Conservation, 61 (1), 1-10. doi:Doi
10.1016/0006-3207(92)91201-3
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids
Research, 40 (Database issue), D136-143. doi:10.1093/nar/gkr1178
Gray MW (1993) Origin and evolution of organelle genomes. Current
Opinion in Genetics and Development, 3 (6), 884-890.
Greiner S, Lehwark P, Bock R (2019) OrganellarGenomeDRAW (OGDRAW)
version 1.3. 1: expanded toolkit for the graphical visualization of
organellar genomes. Nucleic Acids Research, 47 (W1), W59-W64.
Gualberto JM, Newton KJ (2017) Plant mitochondrial genomes: dynamics and
mechanisms of mutation. Annual Review of Plant Biology, 68 ,
225-252.
Hebert PDN, Ratnasingham S, de Waard JR. (2003). Barcoding animal life:
cytochrome c oxidase subunit 1 divergences among closely related
species. Proceedings of the Royal Society B: Biological Sciences ,
270, S96-S99.
Hollingsworth PM, Forrest LL, Spouge JL, Hajibabaei M, et al.(2009) A DNA barcode for land plants. Proceedings of the National
Academy of Sciences, USA, 106 , 12794-12797. doi:10.1073/pnas.0905845106
Hu H, Sun P, Yang Y, Ma J, Liu J (2023) Genome-scale angiosperm
phylogenies based on nuclear, plastome, and mitochondrial datasets.Journal of Integrative Plant Biology, 65 (6), 1479-1489.
doi:https://doi.org/10.1111/jipb.13455
Hua Z, Tian D, Jiang C, Song S, et al. (2022) Towards
comprehensive integration and curation of chloroplast genomes.Plant Biotechnology Journal, 20 (12), 2239.
Jin JJ, Yu WB, Yang JB, Song Y, et al. (2020) GetOrganelle: a
fast and versatile toolkit for accurate de novo assembly of organelle
genomes. Genome Biology, 21 (1), 241.
doi:10.1186/s13059-020-02154-5
Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS (2017)
ModelFinder: fast model selection for accurate phylogenetic estimates.Nature Methods, 14 (6), 587-589. doi:10.1038/nmeth.4285
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. Molecular
Biology and Evolution, 30 (4), 772-780. doi:10.1093/molbev/mst010
Kluyver T, Ragan-Kelley B, Pérez F, Granger BE, et al. (2016)
Jupyter Notebooks-a publishing format for reproducible computational
workflows. Elpub, 2016 , 87-90.
Kress WJ, Erickson DL. (2008). DNA barcodes: Genes, genomics, and
bioinformatics. Proceedings of the National Academy of Sciences ,
105, 2761-2762.
Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B (2017)
PartitionFinder 2: new methods for selecting partitioned models of
evolution for molecular and morphological phylogenetic analyses.Molecular Biology and Evolution, 34 (3), 772-773.
Letunic I, Bork P (2021) Interactive Tree Of Life (iTOL) v5: an online
tool for phylogenetic tree display and annotation. Nucleic Acids
Research, 49 (W1), W293-W296. doi:10.1093/nar/gkab301
Li HT, Yi TS, Gao LM, Ma PF, et al. (2019) Origin of angiosperms
and the puzzle of the Jurassic gap. Nature Plants, 5 (5), 461-470.
doi:10.1038/s41477-019-0421-0
Liu S, Ni Y, Li J, Zhang X, et al. (2023) CPGView: a package for
visualizing detailed chloroplast genome structures. Molecular
Ecology Resources, 23 (3), 694-704.
Lynch M, Koskella B, Schaack S (2006) Mutation pressure and the
evolution of organelle genomic architecture. science, 311 (5768),
1727-1730.
Maglott D, Ostell J, Pruitt KD, Tatusova T (2011) Entrez Gene:
gene-centered information at NCBI. Nucleic Acids Research, 39 ,
D52-D57. doi:10.1093/nar/gkq1237
Meng G, Li Y, Yang C, Liu S (2019) MitoZ: a toolkit for animal
mitochondrial genome assembly, annotation and visualization.Nucleic Acids Research, 47 (11), e63–e63.
Miller JT, Jolley-Rogers G, Mishler BD, Thornhill AH (2018) Phylogenetic
diversity is a better measure of biodiversity than taxon counting.Journal of Systematics and Evolution, 56 , 663-667.
doi:10.1111/jse.12436
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, et al. (2020)
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference
in the Genomic Era. Molecular Biology and Evolution, 37 (5),
1530-1534. doi:10.1093/molbev/msaa015
Nei M. (1987). Molecular evolutionary genetics . New York:
Columbia University Press.
Peet RK (1974) The measurement of species diversity. Annual Review
of Ecology and Systematics, 5 (1), 285-307.
Pruitt KD, Tatusova T, Brown GR, Maglott DR (2012) NCBI Reference
Sequences (RefSeq): Current status, new features and genome annotation
policy. Nucleic Acids Research, 40 , 130-135.
doi:10.1093/nar/gkr1079
Qu XJ, Moore MJ, Li DZ, Yi TS (2019) PGA: a software package for rapid,
accurate, and flexible batch annotation of plastomes. Plant
Methods, 15 , 50. doi:10.1186/s13007-019-0435-7
Riaz T, Shehzad W, Viari A, Pompanon F, et al. (2011).
EcoPrimers: Inference of new DNA barcode markers from whole genome
sequence analysis. Nucleic Acids Research , 39, 1-11.
Rozas J, Ferrer-Mata A, Sánchez-DelBarrio JC, Guirao-Rico S, et
al. (2017) DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data
Sets. Molecular Biology and Evolution, 34 , 3299-3302.
doi:10.1093/molbev/msx248
Sanchez-Puerta MV, Ceriotti LF, Gatica-Soria LM, Roulet ME, et
al. (2023) Invited Review Beyond parasitic convergence: unravelling the
evolution of the organellar genomes in holoparasites. Annals of
Botany, 132 (5), 909-928. doi:10.1093/aob/mcad108
Sayers EW, Cavanaugh M, Clark K, Ostell J, et al. (2019).
GenBank. Nucleic Acids Research , 47, D94-D99.
Smith DR, Keeling PJ (2015) Mitochondrial and plastid genome
architecture: reoccurring themes, but significant differences at the
extremes. Proceedings of the National Academy of Sciences,
112 (33), 10177-10184.
Song M-H, Yan C, Li J-T (2022) MEANGS: an efficient seed-free tool for
de novo assembling animal mitochondrial genome using whole genome NGS
data. Briefings in Bioinformatics, 23 (1), bbab538-bbab538.
Stevens PF. (2017). Angiosperm phylogeny website, version 14, July 2017
[and more or less continuously updated since].
Talevich E, Invergo BM, Cock PJ, Chapman BA (2012) Bio.Phylo: a unified
toolkit for processing, analyzing and visualizing phylogenetic trees in
Biopython. BMC Bioinformatics, 13 , 209.
doi:10.1186/1471-2105-13-209
Tillich M, Lehwark P, Pellizzer T, Ulbricht-Jones ES, et al.(2017) GeSeq - versatile and accurate annotation of organelle genomes.Nucleic Acids Research, 45 (W1), W6-W11. doi:10.1093/nar/gkx391
Tonti-Filippini J, Nevill PG, Dixon K, Small I (2017) What can we do
with 1000 plastid genomes? Plant Journal, 90 , 808-818.
doi:10.1111/tpj.13491
Turland NJ, Wiersema JH, Barrie FR, Greuter W, et al. (2018).International code of nomenclature for algae, fungi, and plants
(Shenzhen Code) : adopted by the nineteenth International Botanical
Congress, Shenzhen, China, July, 2017 . Glashütten, Germany: Koeltz
Botanical Books.
Untergasser A, Cutcutache I, Koressaar T, Ye J, et al. (2012)
Primer3-new capabilities and interfaces. Nucleic Acids Research,
40 , 1-12. doi:10.1093/nar/gks596
Virtanen P, Gommers R, Oliphant TE, Haberland M, et al. (2020)
SciPy 1.0: fundamental algorithms for scientific computing in Python.Nature Methods, 17 (3), 261-272.
Walker JF, Walker-Hale N, Vargas OM, Larson DA, Stull GW (2019)
Characterizing gene tree conflict in plastome-inferred phylogenies.PeerJ, 7 , e7747. doi:10.7717/peerj.7747
Waskom ML (2021) Seaborn: statistical data visualization. Journal
of Open Source Software, 6 (60), 3021.
Weng ML, Blazier JC, Govindu M, Jansen RK (2014) Reconstruction of the
ancestral plastid genome in Geraniaceae reveals a correlation between
genome rearrangements, repeats, and nucleotide substitution rates.Molecular Biology and Evolution, 31 (3), 645-659.
doi:10.1093/molbev/mst257
Will M, Claßen-Bockhoff R (2017) Time to split Salvia s.l. (Lamiaceae)
– New insights from Old World Salvia phylogeny. Molecular
Phylogenetics and Evolution, 109 , 33-58.
doi:https://doi.org/10.1016/j.ympev.2016.12.041
Wu H, Ma PF, Li HT, Hu GX, Li DZ (2021) Comparative plastomic analysis
and insights into the phylogeny of Salvia (Lamiaceae). Plant
Divers, 43 (1), 15-26. doi:10.1016/j.pld.2020.07.004
Wu P, Xu C, Chen H, Yang J, et al. (2021) NOVOWrap: An automated
solution for plastid genome assembly and structure standardization.Molecular Ecology Resources, 21 (6), 2177-2186.
doi:10.1111/1755-0998.13410
Xi Z, Liu L, Davis CC (2016) The Impact of Missing Data on Species Tree
Estimation. Molecular Biology and Evolution, 33 (3), 838-860.
doi:10.1093/molbev/msv266
Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood.Molecular Biology and Evolution, 24 (8), 1586-1591.
Yuan Y, Jin X, Liu J, Zhao X, et al. (2018) The Gastrodia elata
genome provides insights into plant adaptation to heterotrophy.Nature Communications, 9 (1), 1615. doi:10.1038/s41467-018-03423-5
Zhang D, Gao FL, Jakovlic I, Zou H, et al. (2020) PhyloSuite: An
integrated and scalable desktop platform for streamlined molecular
sequence data management and evolutionary phylogenetics studies.Molecular Ecology Resources, 20 (1), 348-355.
doi:10.1111/1755-0998.13096
Zhang HR, Xiang QP, Zhang XC (2019) The Unique Evolutionary Trajectory
and Dynamic Conformations of DR and IR/DR-Coexisting Plastomes of the
Early Vascular Plant Selaginellaceae (Lycophyte). Genome Biology
and Evolution, 11 (4), 1258-1274. doi:10.1093/gbe/evz073