III. RESULTS
IIIA. SEQUENCING, ASSEMBLY, AND SCAFFOLDING
Of the 13,500 embryos exposed to UV irradiation and pressure shock
treatments, two individuals survived beyond the post-embryo stage. The
individual selected for assembly was found to be homozygous at all 15
genotyped microsatellite loci, suggesting that chromosome set
manipulations were successful at inducing doubled haploidy. We proceeded
with PacBio sequencing and produced a dataset with an estimated genome
coverage of 89X, including 53X coverage from reads longer than 12 KB.
The Falcon-based assembly pipeline and polishing with Arrow and Pilon
yielded an initial assembly with 8,321 contigs, a total length of 2.3
GB, and a contig N50 of 1.3 megabases (MB) with a maximum contig length
of 19.6 MB. Our analysis comparing the correlation between the Lake
Trout linkage map and Hi-C scaffolds indicated that three iterations of
Salsa (the default setting) produced moderately large scaffolds, while
yielding a mean map versus scaffold correlation of 0.89. Thirty-three of
the 50 largest scaffolds had correlations greater than 0.95 and 42 had
correlations greater than 0.8. We opted to use these settings for
scaffolding. Salsa v2.2 split multiple contigs, resulting in 8,367
contigs with an N50 of 1.25 MB and 5,171 scaffolds with an N50 of 5.15
MB. Additional scaffolding with Chromonomer v1.13 increased scaffold N50
to 44 MB and reduced the total number of scaffolds to 4,122. Chromonomer
v1.13 also slightly reduced contig N50 because it inserted additional
gaps at likely misassemblies. Scaffolding with Hi-C and the
Lake Trout linkage map ultimately allowed us to assign 84.7% of the
genome to chromosomes. Gap filling with PBJelly increased scaffold N50
to 44.97 MB, increased the total assembly size to 2.345 GB, and
increased contig N50 to 1.8 MB. Gap filling increased the maximum contig
length to 34.78 MB and the maximum scaffold length to 98.19 MB. The
estimated consensus accuracy after three rounds of error correction with
Polca was 99.9959%. The polished assembly was submitted to GenBank for
public use (accession GCA_016432855.1).
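For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it can be computed from a list of contig or scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the assembly pipeline itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical lengths, not Lake Trout data).
toy_contigs = [8, 5, 4, 2, 1]
print(n50(toy_contigs))
```

Because N50 depends only on the sorted lengths, splitting a contig at a suspected misassembly can lower contig N50 even when scaffold N50 increases, which is consistent with the pattern described for Salsa and Chromonomer.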
IIIB. ASSEMBLY QUALITY CONTROL
We estimated the total haploid genome size for Lake Trout to be between
2.119 and 2.122 GB using k-mer analysis and GenomeScope v1.0, with 38%
of the genome composed of unique sequence and 62% composed of
repetitive sequence. Heterozygosity for the sample used for polishing
was estimated to be between 2.78 and 2.9 heterozygous sites per 1000
base pairs. Note that the individual used for polishing was a diploid
and not a gynogenetic doubled haploid. The estimated coverage for the
sample used for genome-size estimation was 16X, which should be
sufficient for k-mer-based methods (Williams et al. 2013).
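The basic relation behind k-mer based genome-size estimation can be sketched as follows: the total number of k-mer occurrences divided by the depth of the main coverage peak approximates the haploid genome size. This is a simplified illustration only; GenomeScope fits a mixture model to the full k-mer spectrum rather than using a single peak. The function name, the `min_depth` cutoff for discarding likely error k-mers, and the toy histogram are all assumptions.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers seen
               at that depth.
    min_depth: depths below this are treated as sequencing errors
               and ignored (an assumed, adjustable cutoff).
    """
    trusted = {d: c for d, c in histogram.items() if d >= min_depth}
    if not trusted:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers approximates k-mer coverage.
    peak_depth = max(trusted, key=trusted.get)
    # Total k-mer occurrences among trusted k-mers.
    total_kmers = sum(d * c for d, c in trusted.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical counts, not Lake Trout data).
toy_histogram = {1: 1000, 2: 200, 14: 50, 15: 80, 16: 100,
                 17: 80, 18: 50, 32: 10}
print(estimate_genome_size(toy_histogram))
```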
We recovered 93.2% of BUSCO genes with 60.3% and 32.9% being present
as singletons and duplicates, respectively (Figure 3). The salmonid
genomes evaluated recovered between 88.1% and 95.3% complete BUSCOs
with between 25.3% and 34.9% being duplicated and between 58.3% and
65% being singletons. The proportion of duplicated BUSCOs in the Lake
Trout genome was the second highest among salmonid genomes (32.9%) and
appears to be comparable to the Brown Trout genome (GCA_901001165.1;
River Trout), which was also assembled using Falcon (Falcon-unzip) and
polished using a method based on the Freebayes variant caller
(Garrison and Marth 2012).
Spearman’s rank order correlations between the genome assembly and the
Lake Trout linkage map ranged from 0.89 to 1.0 for the 42 Lake Trout
chromosomes. The mean correlation was 0.98 and 39 of 42 chromosomes had
correlations greater than or equal to 0.96, suggesting that the final
genome assembly provides an accurate representation of the order of loci
along Lake Trout chromosomes.
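The map versus assembly comparison above rests on Spearman's rank-order correlation between linkage map positions and physical positions. A minimal sketch of that statistic, written with the standard library only, is shown below. The function names and toy marker positions are hypothetical, and this is not the code used for the analysis.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy markers on one chromosome (hypothetical values).
map_cm = [0.0, 1.5, 3.0, 7.2, 9.9]             # linkage map positions
physical_bp = [100, 900, 250, 1200, 5000]      # one adjacent pair swapped
print(spearman(map_cm, physical_bp))
```

A value near 1.0 indicates that marker order along the assembled chromosome matches the order in the linkage map; local inversions or misplaced contigs pull the value down.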
IIIC. REPETITIVE DNA
RepeatModeler 2 identified 2,810 interspersed repeats and 462 of these
were classified by RepeatClassifier. RepeatMasker reported that 53.8%
of the Lake Trout genome is composed of sequences from this repeat
library. A total of 13.04% of the genome was composed of retroelements,
with 10.47% being LINEs and 2.57% being LTR elements, and 9.97% of
the genome was composed of DNA transposons. As has been observed in
other salmonids, TcMar-Tc1 was the most abundant superfamily and these
repeats were most abundant near centromeres (Figure 2; Lien et al. 2016;
Pearse et al. 2019). A total of 30.79% of the genome was composed of
interspersed repeats that were not classified by RepeatClassifier.
IIID. HOMEOLOG IDENTIFICATION AND SYNTENY
Self-vs-self synteny analysis conducted using Symap v5 identified 126
syntenic blocks shared between putative Lake Trout homeologs (Figure 2).
Blocks ranged in size from 477,153 bp to 57,126,662 bp. Fifty-two blocks
were longer than 10 MB and 70 were longer than 5 MB (Figure 2, inner
links). We identified 50 syntenic blocks shared between Rainbow Trout
and Lake Trout and identified homologous rainbow trout chromosomes for
all Lake Trout chromosomes. Syntenic blocks shared between these two
species ranged in size from 1.9 MB to 97.2 MB. Symap identified
homologous chromosomes in Atlantic Salmon for all chromosomes except 32
and 39. However, we expect that Lake Trout chromosome 39 is homologous
to a region of Atlantic Salmon chromosome 2 and chromosome 32 is
homologous with a region of chromosome 14 based on the size of missing
synteny blocks. Fifty-four syntenic blocks were detected between the two
species that ranged in size from 208,516 bp to 88 MB. We identified 42
syntenic blocks shared between Dolly Varden and Lake Trout and
identified homologs for all chromosomes except chromosome 41. Syntenic
blocks ranged in size from 6.8 MB to 79.9 MB (Supplemental
Material 4 – Syntenic Blocks and Between Species Circos Plots).
IIID. GENOME ANNOTATION
We generated a total of 3.45 billion RNA-seq reads that were
subsequently used as input for the NCBI Eukaryotic Genome Annotation
Pipeline v8.5 (July 9, 2020 release date). An additional 528,760 reads
were used from previous Lake Trout gene expression studies. A total of
86% of reads were aligned to the genome assembly, and 12 Lake Trout
transcripts from GenBank and 3,547 known Atlantic Salmon transcripts
from RefSeq were also used as input for the pipeline.
The pipeline produced annotations for 49,668 genes and pseudogenes. A
total of 3,307 non-transcribed pseudogenes and two transcribed
pseudogenes were identified. Gene length ranged from 53 to 1,198,409 bp,
with a median length of 8,676 bp. Gene densities for chromosomes ranged
from 15.45 to 31.39 genes/MB with an average genome-wide density of
21.07 genes/MB (Figure 2, C). A total of 422,014 exons were identified,
with between 1 and 224 exons per transcript (mean = 10.31, median = 8).
IIIE. RECOMBINATION RATES AND CENTROMERES
We were able to map between 1 and 238 centromere-associated RAD contigs
to their respective chromosomes and determine approximate centromere
locations for all chromosomes except chromosome 42. Smith et al. (2020)
did not determine the location of the centromere for chromosome 42,
which prevented us from locating it. On average, we mapped 35
centromere-associated RAD loci to each chromosome. Between 39 and 238
centromeric loci were mapped
to metacentric chromosomes (mean = 93), while between 1 and 59 loci were
mapped for acrocentric or telocentric chromosomes (mean = 21).
In all, 14,438 linkage-mapped contigs were mapped to the genome with
mapping qualities greater than 60. A total of 11,232 loci were retained
for recombination rate estimation after manual curation and filtering
using loess model residuals. We determined the mean sex-averaged
recombination rate to be 1.09 centimorgans/MB, with recombination rates
varying between 0 and 6.58 centimorgans/MB across the genome.
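To illustrate the units involved, the sketch below computes recombination rates in centimorgans/MB as the change in linkage map position divided by the change in physical position between adjacent markers. This finite-difference approach is a simplified stand-in chosen for illustration and is not the loess-based procedure described above. The function name, the choice to clamp negative map differences to zero, and the toy markers are assumptions.

```python
def interval_rates(markers):
    """Recombination rate (cM/MB) between adjacent markers.

    markers: list of (position_bp, map_cm) tuples for one chromosome.
    Returns one rate per interval between physically adjacent markers.
    """
    ordered = sorted(markers)  # sort by physical position
    rates = []
    for (bp0, cm0), (bp1, cm1) in zip(ordered, ordered[1:]):
        span_mb = (bp1 - bp0) / 1e6
        if span_mb <= 0:
            continue  # skip markers at the same physical position
        # Assumed handling: treat map-order conflicts as zero recombination.
        delta_cm = max(cm1 - cm0, 0.0)
        rates.append(delta_cm / span_mb)
    return rates


# Toy markers (hypothetical values, not Lake Trout data).
toy_markers = [(1_000_000, 0.0), (3_000_000, 1.0),
               (4_000_000, 4.0), (8_000_000, 4.0)]
print(interval_rates(toy_markers))
```

Raw interval rates of this kind are noisy when markers are closely spaced, which is one reason a smoothed fit of map position against physical position is generally preferred.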