Data Accessibility and Benefit-Sharing
The source code of the program and the user manual are freely available at https://github.com/wpwupingwp/OGU under the GNU Affero General Public License (AGPL-3). Datasets used for benchmarking and the related outputs from OGU have been deposited in Zenodo (https://dx.doi.org/10.5281/zenodo.10695931).
Benefits from this research accrue from the sharing of our software, data and results on public databases and code repositories as described above.
Author Contributions
Ping Wu wrote the program and the manuscript and conducted the analysis. Ningning Xue proofread the manuscript and participated in the design and testing of the Primer module. Jie Yang tested and optimized the GB2fasta module. Qiang Zhang contributed to the implementation of phylogenetic diversity. Yuzhe Sun and Wen Zhang tested the Evaluate module and advised on it.
Figure legends
Figure 1: The performance of evaluation methods on Lamiaceae data
A) Correlation coefficient matrix of different sequence variance indicators for the “default” dataset. The numbers in the matrix are shown in black or white only to contrast with the background. B) Effects of alignment gaps and ambiguous bases on sequence polymorphism evaluation. C) Identification of the most variable fragment by different methods. D) Relationship between PD, PD-stem and PD-terminal. “Observed_Res” represents the observed resolution method and “Tree_Res” represents the tree resolution method.
Figure 2: Sequence variance of different kinds of regions in 308 plastid genomes
A) GC ratio and gap ratio of five kinds of fragments in 308 angiosperm plastid genomes. Filled boxes show GC ratio and outlined boxes show gap ratio. B) Pi and tree resolution of fragments. The left axis shows Pi and the right axis shows tree resolution. C) PD-stem and PD-terminal of fragments. D) Circular plot of sequence variance. Fragments are ordered according to the plastid genome structure of tobacco, and white regions indicate that the sequences used for analysis do not contain the fragment corresponding to this position in the tobacco plastid genome. One inverted repeat region is omitted for convenience.
Supplemental information
S1. Schematic diagram of stem and terminal phylogenetic diversity
S2. Extraction results on one million random GenBank records
S3. Extraction results of one million random GenBank records
S4. Evaluation results of Lamiaceae data
S5. Top 10 highly variable Lamiaceae loci
S6. Sliding window analysis of Lamiaceae rbcL
S7. Lamiaceae rbcL multiple sequence alignment result
S8. Universal primer design results of Lamiaceae rbcL
S9. Evaluation results of 308 angiosperm plastid genomes
S10. Significance test results of 308 angiosperm plastid genomes
S11. Variance of 30 selected plastid intergenic spacers
S12. Consensus tree of CDS data from 308 angiosperm families
S13. Consensus tree of spacer data from 308 angiosperm families
S14. Evaluation results of rodent data
S15. Visualization of rodent mitochondrial genome variance