Genome quality assessment, assortment and annotation
We performed comprehensive assessments for the chromosome-level genome assembly using datasets from PacBio, Illumina, and RNA-seq derived from roots, stems and leaves. A number of other indices showed that the genome was of high quality. By mapping the genomic sequencing data, we found that the Illumina and PacBio data covered 99.45% and 99.76% of the whole genomes, repectively. In total, 96.5% of BUSCO (Benchmarking Universal Single-Copy Orthologs) genes were represented as complete, and the proportion of transcriptome data that mapped to the genome was 97.8%. The coverage depth distribution for duplicated and single-copy BUSCO core genes was identical, showing an expected Poisson distribution (Fig. S1). This indicates that duplicated genes were not derived from assembling redundancy.
We also used transcripts from several white poplar species, includingP. alba , P. adenopoda , P. davidiana , and P. grandidentata , together with genomes derived from P. alba var.pyramidalis (J. Ma et al., 2019),P. tremula and P. tremuloides(Lin et al., 2018) and P. trichocarpa (Tuskan et al., 2006), to assess the chromosome-scale pseudomolecules (based on co-phylogenetic analysis). The P. tomentosa genome was successfully separated into two subgenomes (2×19 choromosomes), with sizes of 336.7 Mb and 344.4 Mb, respectively (Table 1, Table S5a, Table S5b). Mapping of syntenic regions within the assembly showed clear chromosome-to-chromosome correspondence and also extensive synteny among different chromosomes, as expected for the highly duplicatedPopulus genome (Fig. 3). Furthermore, we also evaluated and compared the read depth between the two subgenomes using both the PacBio and Illumina reads; the results showed the same depth distribution between two subgenomes, suggesting an accurate haplotype-resolved assembly (Fig. S2).
Compared with previous poplar genome assemblies (J. Ma et al., 2019; T. Ma et al., 2013; Tuskan et al., 2006; Yang et al., 2017), the result of comprehensive assessments showed that the P. tomentosa assembly quality in the present study was substantially improved (Table S6).
Using a combination of RepeatModeler and RepeatMasker, 1,001,718 repeats were identified and masked (de novo identification, classification, and masking were all run under default parameters). Collectively, these repeats were 307.6 Mb in size and comprised ~41.6% of the genome (Fig. 3a). Long-terminal repeats (LTR) were the most abundant, making up 17.5% of the genome. 13.3% of these were LTR/Gypsy elements, and 4.0% were LTR/Copia repeats. Second to LTR were unknown elements, making up 9.8% of the genome. This was followed by 5.6% of Helitron repeats and 5.4% of DNA elements (Fig. 3b) (details in Table S7). There were only slight differences in repeat size between two subgenomes; total repeats sizes were 133.7 Mb and 137.9 Mb in subgenomes A and D (discussed further below), respectively, and the sizes of LTRs were 54.9 Mb and 58.3 Mb, respectively (Table 1).
Transposable element (TE) abundance and distribution varied significantly between the sub-genomes (Fig. S3). The TEs are composed of Class I and Class II [Class I consists of LTR (Copia, Gypsy), LINE and SINE (tRNA); Class II consists of DNA (CMC-EnSpm, hAT-Ac, hAT-Tag1, PIF-Harbinger) and RC (Helitron)], all of which were widely distributed on the two subgenomes) (Fig. S3 ①-⑦). For example, a total 109,542 and 114,615 TEs of Class I were found in Subgenome A and Subgenome D, respectively (4.6% increase), and a total 101,190 and 97,478 TEs of Class II were found in the two subgenomes (3.6% decrease), respectively (Fig. S4). Further Pearson’s Chi-square test showed that all TE categories but the LINE’s were significant differences after Bonferonni correction (alpha = 0.05/7 = 0.007142857) (Table S8).
To annotate genes, we first annotated the unmasked P. tomentosagenome, incorporating 73,919 protein sequences, and 137,918 transcripts assembled from P. tomentosa RNA-seq data. A total of 59,124 high quality gene models were identified, with an average coding-sequence length of 1.31 kb, 6.04 exons per gene, and 430 amino acids (aa) per protein. There was 28.5% genome coverage with an total length of 210.8 Mb (Table 1, Table S9, Table S10).
The annotated genes were then associated with the three onotological classes: biological process, cellular components, and molecular functions (Fig. 3c). We predicted 662 tRNAs with a total length of 49,659 bp (average length per tRNA: 75 bp), and 436 rRNAs (106 28S rRNAs, 106 18S rRNAs, and 224 5S rRNAs) with a length of 610,293 bp. We also annotated 2,072 ncRNAs with a length of 218,117 bp using RfamScan (Kalvari et al., 2018). Finally, we performed alignments with protein databases using BLAT (Kent, 2002), with a maximum annotation ratio of 98.6% (Table S11).
Comparative genomics and evolution
We compared 19,594 gene families, cointaining 59,124 genes, in theP. tomentosa genome with those of other three sequenced poplar genomes including P. trichocarpa , P. euphratica , andP. pruinosa using OrthoMCL (L. Li, Stoeckert, & Roos, 2003). A total of 22,386 gene families (142,738 genes) were identified by homolog clustering. In addition, 14,738 gene families (119,375 genes) were shared by all four poplar species, and 1,154 gene families consisting of 2,038 genes were found to be unique toP. tomentosa based on OrthoMCL “mutual optimization.” Similarly, 646/1,349, 179/261, and 399/1,041 gene families/genes were found to be unique to P. trichocarpa , P. euphratica andP. pruinosa, respectively (Fig. 4a, Table S12).
To phase the chromosome pairs and study the parental origin of P. tomentosa , we selected 1,052 orthologous genes that appear to be allelic between each P. tomentosa chromosme pair and are single-copy genes in poplars of other sections, and then constructed gene trees to assess phylogenetic distances. The allelic gene-pairs of each P. tomentosa chromsosme pair were observed to be clearly closest to either P. alba var. pyramidalis (PA) orP. adenopoda (PD), respectively, on most gene trees. Thus, it sucessfully divided a total of 38 chromosomes of P. tomentosainto two subgenomes (2 × 19 chromosomes) based on phylogenetic distances. To confirm the results, we measured Ks distances among two subgenomes of P. tomentosa , P. alba var.pyramidalis (PA) and P. adenopoda (PD) using 5,345 single-copy orthologous genes. The results were also consistent; we refer to the genome of P. tomentosa as comprised of subgenome A (putatively derived from P. alba var. pyramidalis ) and subgenome D (putatively derived from P. adenopoda ).
To investigate potential recombination events between the two sub-genomes, the synonymous (Ks ) distance between 5,345 single copy orthologus genes of P. tomentosa , P. alba var.pyramidalis and P. adenopoda was estimated. We found that there was limited apparent recombination events within the large majority of gene loci (4,309: 80.62%), though a low level of recombination appeared to occur (38 loci, 0.87%); and 998 loci (18.7%) did not meet either of above two hypotheses and thus were uninformative (Table S13, Fig. S5). This suggests that the two parental subgenomes may be largely still intact in P. tomentosa , at least with respect to genic composition.
We re-constructed phylogenetic trees of subgenome A, subgenome D and of other poplars (Fig. S6), as well as for each of the corresponding 19 pairs of chromosomes (Fig. S7). All of these analyses supported the hypothesis that the P. tomentosa genome originated from hybridization between P. adenopoda and P. alba var. pyramidalis . Based on the fact that the P. alba var. pyramidalis is a male clone, and no female clone are found, together with our previous phylogenetic analyses of chloroplast genomes from section Populus(Gao et al., 2019) which indicated thatP. adenopoda is the maternal parent of P. tomentosa , we deduce that P. alba var. pyramidalis and P. adenopoda were the male and female parents, respectively, in the hybrid formation of P. tomentosa .
To address dates of divergence and duplication events in poplars, we conducted collinearity analysis of homologous gene pairs derived fromPopulus species vs. Salix suchowensis using MCScanX (Y. Wang et al., 2012). From theKs (synonymous substitution rate) distribution, we inferred a whole genome duplication event (WGD) (based on paralogous pairs) and a species divergence event (based on orthologous pairs). The Ksdistribution among syntenic genes of the four poplar species andS. suchowensis contained two peaks. One peak indicated that poplar and Salix species both underwent a common WGD event (Ks ≈ 0.25). Such WGD events are known to have occurred frequently in the evolution of angiosperms (Jiao et al., 2011; Myburg et al., 2014; Otto, 2007; Van de Peer, Mizrachi, & Marchal, 2017). This result is also consistent with a previous study on Salix suchowensis (Dai et al., 2014). Another peak that represents divergence between Populus and Salixis also visible, suggesting some further chromosomal duplication (Ks ≈ 0.12) (Fig. 4b). Further analysis showed that sectionPopulus and P. trichocarpa have a divergence at Ks≈ 0.035, and P. adenopoda and P. alba at Ks ≈ 0.025. Subsequently, as a variant, P. alba var.pyramidalis is separated from P. alba at Ks ≈ 0.008. The hybridization event between P. adenopoda and P. alba var. pyramidalis subsequently occurred, followed by the emergence of P. tomentosa (Ks ≈ 0.005) (Fig. 4c).
To study the parental origin of P. tomentosa , we constructed phylogenetic trees using Salix suchowensis as an outgroup. Further, referencing the fossil-based divergence time of Populusand Salix at 48 Mya (Boucher, Manchester, & Judd, 2003; Manchester, Dilcher, & Tidwell, 1986), we estimated dates for taxonomic divergence. Phylogenetic analysis indicated that the divergence event between sectionPopulus and section Tacamahaca (P. trichocarpa ) occurred at approximately 13.4 Mya. P. adenopoda , an ancestor ofP. tomentosa , was the first to separate from the Populusfamily as an independent clade approximately 9.3 Mya. Subsequently, the aspen group and white poplars group underwent a divergence event (approximately 8.4 Mya). Another ancestor of P. tomentosa ,P. alba var. pyramidalis , gave rise to an independent variant of P. alba at approximately 4.8 Mya. Approximately 3.9 Mya, P. tomentosa was created by hybridization between P. adenopoda and P. alba var. pyramidalis (Fig. 4d). Phylogenetic trees constructed using the chloroplast genomes of 15 white poplar species and P. trichocapa indicated that the most probable female parent of P. tomentosa is P. adenopoda (Fig. S8 ).
Whole-genome synteny analysis revealed pairs of P. trichocarpa -homologous regions shared between chromosomes corresponding to the two subgenomes of P. tomentosa . A dot plot (Fig. 4e) indicated that most of the common linear segments of homologous chromosomes were shared between P. trichocarpa, subgenome A and subgenome D. The diagonal distribution (“/”) indicated orthologous collinear genes in P. tomentosa and P. trichocarpa , and other dispersed distribution-blocks in the dot plot, suggested the collinearity of paralogous genes on non-homologous chromosomes between the two poplars (Fig. 4e). These findings show that both of the P. tomentosa sub-genomes are highly syntenic with P. trichocarpa .