*, Zhencheng Su d,*, Baoquan Gaoa,c, Xingbin Ti a, Deping Yana, Guangjian Liu d, Ping Liua,c, Chunlin Wang b, Jian Lia,c
a .Key Laboratory of Sustainable Development of Marine Fisheries, Ministry of Agriculture, P.R.China, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, 266071 Qingdao,China.
b. Key Laboratory of Applied Marine Biotechnology, Ministry of Education, Ningbo University, Ningbo 315211, China
c Function Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, No. 1 Wenhai Road, Aoshanwei Town, Jimo, Qingdao, China
d Novogene Bioinformatics Institute, Beijing 100016, China
* These authors contributed equally
Corresponding author,lijian@ysfri.ac.cn (Jian Li),wangchunlin@nbu.edu.cn(Chunlin Wang),liuping@ysfri.ac.cn (Ping Liu) andliuguangjian@novogene.com(Guangjian Liu)
Abstract
Portunus trituberculatus (Crustacea: Decapoda: Brachyura), commonly known as the swimming crab, is of major ecological importance, as well as being important to the fisheries industry. P. trituberculatus is also an important farmed species in China due to its rapid growth rate and high economic value. Here, we report the genome sequence of the swimming crab, which was assembled at the chromosome scale, covering ~1.2 Gb, with 79.99% of the scaffold sequences assembled into 53 chromosomes. The contig and scaffold N50 values were 108.7 kb and 15.6 Mb, respectively, with 19,981 protein-coding genes and a high proportion of simple sequence repeats (49.43%). Based on comparative genomic analyses of crabs and shrimps, the C2H2 zinc finger protein family was found to be the only gene family expanded in crab genomes, and its members were mainly expressed in early embryonic development and during the flea-like larval stage, suggested it was closely related to the evolution of crabs. Combined with transcriptome and Bulked Segregant Analysis (BSA) providing insights into the genetic basis of salinity adaptation in P. trituberculatus , strong immunity and rapid growth of the species were also observed. In addition, the specific region of the Y chromosome was located for the first time in the genome of P. trituberculatus , and Dmrt1was identified as a key sex determination gene in this region. Decoding the swimming crab genome not only provides a valuable genomic resource for further biological and evolutionary studies, but is also useful for molecular breeding of swimming crabs.
Introduction
Brachyuran are generally called crabs, which are known as one of the most typical crustaceans belonging to Decapoda. In terms of numbers of species, they are one of the largest groups of the crustaceans (Warner, 1977), comprising about 6,000 species belonging to 47 families (Bowman, 1982). These species are found worldwide, largely in marine habitats, ranging from benthic to free living and planktonic or parasitic forms. Furthermore, this group includes many commercially-important species. Swimming crabs represent the most important group of crabs in fisheries and aquaculture, and have attracted considerable research attention.
In recent years, distinct morphotypes have drawn much attention and discussion within Decapoda (Haug et al., 2016). In evolutionary terms, all crabs have evolved from an ancestral macruran or “lobster” morphotype (Haug et al., 2016). Compared with the elongate bodies of shrimps and lobsters, crabs are characterised by a compact body with a depressed, short carapace, and a ventrally-folded pleon. The evolutionary transformation from a lobster-like crustacean to a crab is called ‘carcinization’, and has been interpreted as a dramatic morphological change (Scholtz, 2014). However, how crabs evolved and their underlying genetic basis have not been elucidated (McNamara & Faria, 2012).
Salinity is an important abiotic factor that influences the distribution, abundance, physiology and wellbeing of crustaceans (Romano & Zeng, 2012). The decapods have occupied and exploited virtually all habitats available on earth, and include species restricted to the marine environment, estuarine and freshwater species, and amphibious and terrestrial travelers, including desert and arboreal adventurers (McNamara & Faria, 2012). Given the innumerous challenging habitats, the mechanism of salinity adaptation of the decapod Crustacea has long aroused scientific curiosity. Osmoregulation mediated by ion transport is an important physiological process of salinity adaptation. Ion transport-related genes, including Na+,K+-ATPase, V-type H+-ATPase, and Na+,K+, 2Cl cotransporters have been cloned and studied (Garcon et al., 2011; Lv, Zhang, Liu, & Li, 2016; Tsai & Lin, 2007). Some scholars also tried to study the salinity adaptation using comparative transcriptome analyses (Lv et al., 2013; Q. H. Xu & Liu, 2011), and hundreds of potential salt-tolerance-related genes were identified. However, the complex molecular mechanisms involved in salinity tolerance remain poorly understood.
Sex determination is a plastic biological developmental process, which has always had an intriguing aspect in evolutionary and developmental biology (Martins, 2002). The mechanisms of sex determination in crustaceans are remarkably diverse and are controlled by genetic and/or environmental factors (Ford, 2008). In crabs, some species (Eriocheir japonicus , Hemigrapsus sanguineus ,Hemigrapsus penicillatus and Plagusia dentipes ) were believed to have an XY sex determination system based on karyotype analysis (Lecher & Noel, 1995; Niiyama, 1937, 1938, 1959); however, due to the large numbers of chromosomes and complex genomes, sex determination by karyotype analysis cannot be determined in most crustaceans (Z. Torrecilla, Martínezlage, Perina, Gonzálezortegón, & Gonzáleztizón, 2017).
The swimming crab, Portunus trituberculatus (Crustacea: Decapoda: Brachyura), inhabits seafloors with sand or pebbles, and is widely distributed in the coastal waters of China, Japan, and Korea19. This species is one of the most common edible crabs in China, and is artificially propagated and stocked. As an important crustacean species in aquaculture, the swimming crabs is famous for its rapid growth rate and large body size. Its body mass can reach up to 400-500 g in 7-8 months of artificial culture. In pond culture, crabs are often mixed with shrimps because they feed on diseased shrimps and exhibit strong disease resistance, which can effectively prevent outbreaks of shrimp diseases. Therefore, P. trituberculatus are also regarded as an ecologically important species. After more than 30 years of high-intensity artificial breeding (Liu et al., 2015), there are many problems in crab farming, such as significant variability in individual sizes, environmental factor stress, and disease outbreak, among others. However, the genetic mechanisms leading to rapid growth, stress resistance, and strong immunity remain poorly understood in this economically important species.
Here, we report the chromosome-scale genome assembly of P. trituberculatus . Genome evolution and comparative genomic analyses provided insights into the genetic basis of salinity adaptation, strong immunity, rapid growth, and sex determination in this species. This genome can serve as the genetic basis for future investigations ofP. trituberculatus evolution and biology, and will be a valuable resource for conservation and breeding management of the swimming crab.
Materials and Methods

Sample collection and sequencing

The male individual used for genome sequencing originated from an F12 full sibling that was created by artificial insemination. Total genomic DNA was extracted from muscle tissue and stored at Novogene Bioinformatics Institute.
Pair-ended (PE) libraries with an insert size of 350 bp were constructed using the standard Illumina protocol (Mardis, 2008). Sequencing was performed using the Illumina HiSeq4000 platform. Low-quality reads with more than 10% of bases having quality scores lower than 20 (representing a 1% error rate) were removed, as were tag-contaminated sequences and duplications.
We also extracted RNA from eight P. trituberculatus tissues at different developmental stages, including the eyes (E), brain (Br), ganglia thoracalis (Tr), gill (G), blood (B), heart (H), liver (L), and muscle (M), for transcriptome sequencing on the Illumina HiSeq platform, followed by gene prediction. Four libraries for Bulked Segregant Analysis (BSA) analysis were sequenced using the Illumina Novaseq platform, and 150 bp paired-end reads were generated with an insert size of ~350 bp.

Genome assembly

We used a K-mer frequency distribution method (R. Li et al., 2010) to estimate the genome size by the following formula: G = k-mer number/average k-mer depth, where k-mer number = total k-mers - abnormal k-mers (with too low or too high frequency). Seventeen-mers (17 bp k-mers) were extracted from the sequencing data and the frequency of each 17-mer was calculated. Finally, the k-mer depth was 41(Fig. S1) and the genome size of P. trituberculatus was estimated to be ∼1.2 Gb.
Libraries for single molecule real-time (SMRT) PacBio genome sequencing were constructed according to standard protocols from the Pacific Biosciences company. Briefly, high molecular-weight genomic DNA was sheared into large fragments, followed by damage repair and end repair, blunt-end adaptor ligation, and size selection. Then, the libraries were sequenced on the PacBio Sequel platform. PacBio reads were used to assemble contigs of the P. trituberculatus genome using SMARTdenovo, and polishing errors with Quiver (smrtlink 5.0.1) (Chin et al., 2013). Then, PacBio contigs were scaffolded using six rounds of SSPACE-LongRead (Boetzer & Pirovano, 2014), seven rounds of PBjelly (http://sourceforge.net/projects/pb-jelly/ ), and three iterations of plantanus (Kajitani et al., 2014) with default parameters for all programs. The resulting scaffolds were further connected to super-scaffolds using 10x genomics linked-read data with fragScaff (Adey et al., 2014) software. Finally, Illumina-derived short reads was used to correct the remaining errors by pilon (Walker et al., 2014), and a high-density genetic map was then constructed by chromonomer (version 1.07) (Catchen, Amores, & Bassham, 2020) to anchor the scaffolds into the chromosome-level genome.

Assessing the completeness of the assembly

To assess the completeness of the assembled P. trituberculatusgenome, we performed Benchmarking Universal Single-Copy Orthologs (BUSCO) (http://busco.ezlab.org/) (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) analysis by searching against the arthropod BUSCO (version 3.0). We also assessed the completeness of the P. trituberculatus genome using the Core Eukaryotic Genes Mapping Approach (CEGMA) (http://korflab.ucdavis.edu/datasets/cegma/) (Genis Parra, Bradnam, & Korf, 2007). Both of the analyses were performed with the default settings.

Genome prediction and annotation

The repetitive sequences in the P. trituberculatus genome were identified after genome assembly. Repetitive sequences included transposable elements (TEs) and tandem repeats. Two methods were used to discover TEs. RepeatMasker (version 4.0.5) (N. Chen, 2004) was used to identify TEs in an integrated repeat library, which was derived from a known repeat library (Repbase 15.02) and the de novo repeat library, built by RepeatModeler (http://www.repeatmasker.org) (Vision 1.0.5), RepeatScout (Price, Jones, & Pevzner, 2005), and LTR_FINDER (Z. Xu & Wang, 2007). RepeatProteinMask (Bergman & Quesneville, 2007) was used to detect TEs in the P. trituberculatus genome by comparing it to the TE protein database. Tandem repeats in the genome were ascertained using Tandem Repeats Finder (Benson, 1999).
Based on the repeat-masked genome, we used homology-based, ab initio, and transcriptome-based prediction methods to predict protein-coding genes in the genome. Briefly, the protein sequences of homologous species were downloaded from the NCBI database, including Homo sapiens , Tetranychus urticae , Caenorhabditis elegans ,Crassostrea gigas, Drosophila melanogaster, Daphnia pulex, Ixodes scapularis, Parasteatoda tepidariorum, Penaeus vannamei, Strongylocentrotus purpuratus, and Tribolium castaneum, and were used as queries to search the genome using TBLASTN (Altschul, Gish, Miller, Myers, & Lipman, 1990) (E -value ≤ 1e-05). Homologous genome sequences were then aligned to the matching proteins using Genewise (version 2.4.0) (Birney, Clamp, & Durbin, 2004) for accurate spliced alignments. For ab initio, five tools were used for gene structures prediction, namely, Augustus (version 2.5.5) (Stanke & Morgenstern, 2005), GlimmerHMM (version 3.01) (Majoros, Pertea, & Salzberg, 2004), SNAP (Birney et al., 2004), Geneid (G. Parra, Blanco, & Guigó, 2000), and Genescan (Burge & Karlin, 1997), all with default settings. In addition, the RNA-seq data from several tissues were aligned to the genome using Tophat (version 2.0.10) (Trapnell, Pachter, & Salzberg, 2009), while gene structures were predicted using cufflinks (version 2.1.1) (Trapnell et al., 2012). Genes predicted using the above methods were then merged to a consensus gene set using EVidenceModeler (EVM) (http://evide ncemodeler.sourceforge.net/) (Haas et al., 2008).
Functional annotations of the predicted genes in the P. trituberculatus genome were then annotated using homology searching in several public gene databases, including NCBI-NR, KEGG (Kanehisa, Sato, Kawashima, Furumichi, & Tanabe, 2016), and SwissProt (Apweiler et al., 2004) using BLASTP (E -value ≤ 1e-05). We also used InterProScan (version 4.7) (Jones et al., 2014) to obtain protein domain annotations in the Interpro and Gene Ontology (GO) (The Gene Ontology Consortium, 2016) databases. Finally, functional annotations of the best alignments in each database were used as the final consensus gene annotation results.

Genome evolution

To identify gene families in the P . trituberculatusgenome, we used the nucleotide and protein sequences from 11 other species, including E. sinensis(GCA_003336515.1), P. virginalis(GCA_002838885.1), L. vannamei(GCA_003789085.1), D. pulex (GCA_000187875.1), D. rerio(GCF_000002035.6), C. semilaevis (GCF_000523025.1), O. niloticus (GCF_001858045.2), C. gigas (GCF_000297895.1), M. yessoensis (GCA_002113885.2), C. elegans(GCF_000002985.6), and N. vectensis(GCF_000209225.1). To exclude putative fragmented genes, we only retained gene models at each gene locus that encoded the longest protein sequence, and removed genes encoding protein sequences shorter than 30 amino acids. First, we performed the all-against-all BLASTP method to identify the similarities among genes in the species with an E-value cutoff of 1e-7. Then, we used OrthoMCL (L. Li, Stoeckert, & Roos, 2003) to generate orthologous and paralogous relationships among all the organisms with the parameter of “-inflation 1.5”. Genes were classified into orthologues, paralogues, and single-copy orthologues (only one gene in each species). In addition, specific genes were selected by comparing P. trituberculatus to E. sinensis ,P. virginalis , and L. vannamei, and GO and KEGG enrichment analysis were then performed.
Using the single-copy orthologues, we constructed a phylogenetic tree using the Maximum Likelihood model with RAxML (Stamatakis, 2006) software. Then the mcmctree program of PAML (http://abacus.gene.ucl.ac.uk/software/paml.html) (Yang, 2007) was used to estimate the divergence times among 12 species. Several calibration points were selected from the TimeTree (Kumar, Stecher, Suleski, & Hedges, 2017) database () to be used as normal priors to restrain the age of the nodes, such as 204-225 Mya for TMRCA of C. semilaevis - D. rerio ; 421-447 Mya for TMRCA ofM. yessoensis - C. gigas ; and 256-429 Mya for TMRCA ofL. vannamei - P. virginalis . The phylogenetic tree confirmedN. vectensis as the outgroup.

Gene family expansion and contraction analyses

To avoid extreme gene families, families with gene numbers ≥ 200 in one species and ≤ 2 in all other species were removed. The CAFÉ tool (version 4.0) (http://sourceforge.net/projects/cafehahnlab/) (De Bie, Cristianini, Demuth, & Hahn, 2006) was used to analyze the expansion and contraction of orthologous gene families between ancestors, and each of the 11 species using a stochastic birth and death model with a lambda parameter. This model was further used to examine the changes in gene families along each lineage on the phylogenetic tree. A probabilistic graphical model was introduced to calculate the probability of transitions in gene family size from parent to child nodes. The corresponding p-values were calculated in each lineage based on the conditional likelihood. Finally, we used the Fisher’s exact test to identify overrepresented GO and KEGG pathways among the expanded and contracted genes, which were adjusted by the false discovery rate (FDR < 0.05).

PacBio Iso-Seq analysis

The Iso-Seq library was prepared according to the Isoform Sequencing protocol (Iso-Seq) using the Clontech SMARTer PCR cDNA Synthesis Kit and the BluePippin Size Selection System, as described by Pacific Biosciences (PN 100-092-800-03).
Sequence data were processed using SMRTlink 5.1 software. A circular consensus sequence (CCS) was generated from subread BAM files with the following parameter settings: –minLength 50, –maxDropFraction 0.8, –minPasses 2, –minPredictedAccuracy 0.8, and –maxLength 15000. Non-full length and full-length FASTA files from the CCS were then fed into the cluster step, which performed isoform-level clustering (ICE). Finally, the Arrow polishing step was used to generate high-quality consensus FASTA files using the following parameters: –hq_quiver_min_accuracy 0.99, –bin_by_primer false, –bin_size_kb 1, –qv_trim_5p 100, and –qv_trim_3p 30.
Additional nucleotide errors in the consensus reads were corrected using the Illumina RNAseq data with LoRDEC software (Salmela & Rivals, 2014). Any redundancy in the corrected consensus reads was removed by CD-HIT (-c 0.95 -T 6 -G 0 -aL 0.00 -aS 0.99) (W. Li, Jaroszewski, & Godzik, 2002) to obtain final transcripts for subsequent analyses. The ANGEL pipeline, which is a long read implementation of ANGEL (https://github.com/PacificBiosciences/ANGEL), was used to determine the protein coding sequences from cDNAs. The following analysis mainly included structural analyses, differential enrichment analyses, and function annotation.
The structural analyses included CDS prediction, TF (transcription factors) analyses, LncRNA (long non-coding RNA) analyses, and SSR (Simple Sequence Repeat) analyses. TF analysis was performed using the animalTFDB 2.0 database (H.-M. Zhang et al., 2015). In LncRNA analyses, we used Coding-Non-Coding-Index (CNCI) (https://github.com/www-bioinfo-org/CNCI), Coding Potential Calculator (CPC) (Kong et al., 2007), Pfam-scan (Finn et al., 2015), and PLEK (A. Li, Zhang, & Zhou, 2014) tools to predict the coding potential of transcripts. CNCI profiles adjoining nucleotide triplets were performed with default parameters to distinguish protein-coding and non-coding sequences, independent of known annotations. The CPC mainly assessed the extent and quality of the ORFs, and searched the sequences in the NCBI eukaryotic protein database. For such searches, the e-value was set to 1e-10 to identify the coding and non-coding transcripts. Pfam-scan was used with default parameters to identify the occurrence of any of the known protein family domains in the Pfam database in transcripts translated in all three frames.
The PLEK SVM classifier uses an optimized K-mer approach to construct the best classifier to assess the coding potential for species that lack high-quality genomic sequences and annotations. The coding potential was predicted for all transcripts after the three tools described above were used to remove transcripts that lacked coding potential. The remaining transcripts comprised our candidate set of lncRNAs. SSR in the transcriptome were identified using MISA(http://pgrc.ipkgatersleben.de/misa/misa.html). MISA identifies and determines the location of perfect microsatellites, as well as compound microsatellites, which are interrupted by a specific number of nucleotides.
Differential enrichment analyses included the quantification of gene expression levels, differential expression analyses, and GO and KEGG enrichment analyses. The gene expression levels for each sample were estimated by RSEM (B. Li & Dewey, 2011), and clean data were mapped back to the transcript sequences and the read count for each transcript was obtained from the mapping results. Differential expression analysis of two conditions/groups was performed using the DESeq R package (1.18.0). DESeq provides statistical assessments to determine differential expression from digital gene expression data using a model based on a negative binomial distribution. The resulting p-values were adjusted using the Benjamini and Hochberg approach to control the false discovery rate. Genes with an adjusted p-value < 0.05 found by DESeq were assigned as differentially expressed. GO and KEGG enrichment analyses were then implemented using the Goseq R package and KOBAS (Mao, Cai, Olyarchuk, & Wei, 2005) software, respectively. Significant content with an FDR threshold of < 0.05 was selected as the results.

RNA-seq analysis

P. trituberculatus samples from different developmental stages (F, Z1-Z4, M, and J), different molting stages (PrM, InM, and PoM), the gills after 0–72 h of low salt stress (S0h, S12h, S24h, S48h and S72h), and haemocytes after 0–24 h of V. alginolytica infection (T0B and T24B) were used for RNA-seq analysis.
A total amount of 3 µg of RNA per sample was used as input material for the RNA sample preparations. Sequencing libraries were generated using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA), following manufacturer’s recommendations. Index codes were added to assign sequences to each sample.
Clustering of the index-coded samples was performed on a cBot Cluster Generation System using the TruSeq PE Cluster Kit v3-cBot-HS (Illumina), according to the manufacturer’s instructions. After cluster generation, the library preparations were sequenced on an Illumina Hiseq platform with 150 bp paired-end reads generated.
Raw data (raw reads) in FASTQ format were processed using in-house Perl scripts. During this step, clean data (clean reads) were obtained by removing reads containing adapters, containing ploy-N, or with low quality reads. At the same time, the Q20, Q30, and GC content of the clean data were calculated. All downstream analyses were based on the clean data.
The reference genome and gene model annotation files were used following genome assembly and annotation. Hisat2 (v2.0.4) software (Kim, Langmead, & Salzberg, 2015) was used to build the index of the reference genome and align clean paired-end reads to the genome. HTSeq v0.6.1 (Anders, Pyl, & Huber, 2015) was used to count the number of reads mapped to each gene. Then, the FPKM (expected number of fragments per kilobase of transcript sequence per millions base pairs sequenced) of each gene was calculated based on the length of the gene and the read counts mapped to the same gene. This statistic simultaneously considers the effect of sequencing depth and gene length for the read counts, and is currently the most commonly used method for estimating gene expression levels (Trapnell et al., 2010).
Differential expression analysis of two conditions/groups was performed using the DESeq R package (1.18.0). The resulting p-values were adjusted using the Benjamini and Hochberg’s approach to control the FDR. Genes identified by DESeq with an adjusted p-value < 0.05 were assigned as differentially expressed. GO and KEGG enrichment analyses of differentially expressed genes were processed following the same method as that used for PacBio Iso-Seq analysis.

BSA analysis

In this study, growth- and disease resistance-related genes and markers were found by BSA analysis. For growth traits, SG and BG groups composed of 20 small male individuals (68.24±10.5g) and 20 large male individuals (142.2±16.5g) were constructed. For disease resistance traits, 20 susceptible individuals and 20 tolerant individuals were identified byV. parahaemolyticus infection experiments, which were used to construct SuG and ToG groups, respectively.
A total amount of 1.5 µg of DNA per sample was collected from 20 individuals to build DNA pooling for the DNA sample preparations according to growth and disease trait, respectively. Sequencing libraries were generated using the Truseq Nano DNA HT Sample preparation Kit (Illumina USA), according to the manufacturer’s recommendations. Index codes were added to assign sequences to each sample. Briefly, the DNA sample was fragmented to a size of 350 bp using sonication. Fragments were then end-polished, A-tailed, and ligated to the full-length adapter for Illumina sequencing and additional PCR amplification. Finally, PCR products were purified using an AMPure XP system and the size distribution of libraries was analyzed using the Agilent2100 Bioanalyzer and quantified using real-time PCR. Four libraries constructed as described above were sequenced using the Illumina HiSeq4000 platform, and 150 bp paired-end reads were generated with an insert size of ~350 bp.
Raw data (raw reads) in FASTQ format were initially processed through a series of quality control (QC) procedures using in-house C scripts to ensure that reads were reliable, and without artificial bias. The QC procedures were as follows: a. removal of reads with ≥ 10% unidentified nucleotides (N); b. removal of reads with > 50% of bases with a phred quality < 5; c. removal of reads with > 10 nucleotides aligned to the adapter, allowing ≤ 10% mismatches; and d. removal of putative PCR duplicates generated by PCR amplification during the library construction process (i.e., read 1 and read 2 from two paired-end reads that were completely identical).
The Burrows-Wheeler Aligner (BWA) was used to align clean reads against the reference genome with the following parameters: “mem -t 4 -k 32 –M -R” and alignment files were converted to BAM files using SAMtools software with “–bS –t”. In addition, potential PCR duplications were removed using the SAMtools command, “rmdup”. If multiple read pairs had identical external coordinates, only the pair with the highest mapping quality was retained. Variant calling was performed for all samples using the Unified Genotyper function in GATK software (version 3.8) (McKenna et al., 2010). SNPs (single nucleotide polymorphism) were filtered using the Variant Filtration parameter in GATK (settings: –filterExpression ”QD < 4.0 || FS > 60.0 || MQ < 40.0 ”, -G_filter ”GQ<20”, –cluster WindowSize 4, ). InDels were filtered using the Variant Filtration parameter (settings: –filter Expression ”QD < 4.0 || FS > 200.0 || Read PosRankSum < -20.0 || Inbreeding Coeff < -0.8 ”).
ANNOVAR (K. Wang, Li, & Hakonarson, 2010) (Annotate Variation), an efficient software tool to functionally annotate genetic variants, was used to annotate SNPs and InDels based on the GFF3 files for the reference genome. The homozygous SNPs and InDels between two samples were extracted from the vcf files for SNP/InDel in three comparison groups. The read depth information for homozygous SNPs and InDels described above in one sample pool was used to calculate the SNP/InDel index. We used the genotype of one sample as the reference and to statistic reads number for this sample genotype. Then calcalate the ratio of the number of different reads in total number, which is the SNP/InDel index of the base sites. We removed those sites which the SNP/InDel index in both pools less than 0.3. The sliding window method was used to present the SNP/InDel index of the whole genome. The average of all SNP/InDel indexes in each window was taken as the SNP/InDel index for the overall window. In general, we used a window size of 1 Mb and a step size of 10 Kb as the default settings. The difference in SNP/InDel index between the two pools was calculated as the delta SNP/InDel index with the following parameter settings: -fs1 0.2 -fs2 0.8 in the in-house Perl scripts.

CQ (chromosome quotient) analysis

Forty healthy males and forty healthy females were randomly selected from the core breeding population of “Huangxuan NO.1”, which were reared at the Chang-Yi AquaFarming Company, Weifang, China. To improve the efficiency and accuracy of the sequencing data, DNA from individuals in two groups were pooled to generate two DNA bulks based on sexuality. Libraries with an insert size of ~350 bp were sequenced using the Illumina HiSeq4000 platform and 150 bp paired-end reads were generated.
To obtain more reliable reads from the raw data, quality control (QC) procedures were performed similar to those for BSA analysis. The chromosome quotient (CQ) (Hall et al., 2013) method was then used to systematically discover Y chromosome genes. It uses the number of alignments from male and female sequence data to determine whether a sequence is Y-linked. First, the scaffolds were divided into fragments by masked sequences, and fragments shorter than 250 bp were removed. Then, the BWA (H. Li & Durbin, 2009) was used to align the clean reads against the split reference genome with mem -t 4 -k 32 –M -R. Alignment files were then converted to BAM files using SAMtools software (settings: –bS –t) (H. Li et al., 2009). The CQ values of those sequences were then calculated. To classify a sequence as Y-linked, a CQ value of < 0.3, with > 30 alignments from male data, and < 30 alignments from female data were filtered out. The results of the screening with CQ values equal to zero were focused on specifically.

qPCR analysis

For the genes associated with development, immunity, and sex-determination, gene expressions were analyzed by qPCR in the materials of different developmental stages (fertilized egg stage, multicellular stage, blastula stage, gastrulation stage, egg-nauplius stage, egg-zoea stage, zoea stage, megalopal stage, and juvenile crab stage), haemocytes, and hepatopancreas after V. parahaemolyticusinfection (0–72 h). Total RNA was extracted individually using Trizol (Invitrogen, Carlsbad, CA, USA). Quantitative real-time PCR (qPCR) primers were designed using the Primer Premier 5 tool (Premier Biosoft International) (Table S1) . First strand cDNA was synthesized using the PrimeScript RT reagent kit (Takara, Dalian, China). qPCR was performed using a 7500 Fast Real-Time PCR System and SYBR®Premix Ex Taq™ Kit (Takara, Dalian, China). PCR was performed with the following parameters: 95 °C for 30 s, followed by 40 cycles of 95 °C for 15 s and 60 °C for 34 s.
Results and Discussion

Genome sequencing and assembly

A combination of three technologies, including Pacific Bioscience’s single-molecule real-time sequencing, Illumina’s paired-end sequencing and 10× Genomics link-reads, were used in this study. After sequencing with the PacBio SEQUEL platform at Novogene (Tianjin), a total of 120.79 Gb of long reads were generated and used for the subsequent genome assembly. Two paired-end Illumina sequencing libraries were constructed with an insert size of 350 bp, and sequencing was carried out on the Illumina HiSeq 4000 platform according to the manufacturer’s instructions. A total of 111.01 Gb (92.51× coverage) of sequencing data were produced. In addition, one 10× Genomics linked-read library was constructed and sequencing was performed on the Illumina HiSeq 4000 platform, which produced 85.97 Gb (71.64× coverage) of sequencing data. The genome size of P. trituberculatus was estimated to be 1.2 Gb (Fig S1 ), and therefore, the average sequencing coverage was 264.81× Raw sequence data generated by the Illumina platform were filtered based on the following criteria: filtered reads with adapters, filtered reads with N bases more than 10%, and removed reads with low-quality bases (≤ 5) more than 50%. All sequence data are summarized in Table 1 .
The genome assembly yielded a final draft P. trituberculatusgenome with a total length of 864.45 Mb, a contig N50 of 108.68 kb, and a scaffold N50 of 15.59 Mb (Table 2 ). The assembly statistics of the crab genome are comparable to, or better than, those of previously published for shrimp genomes (contig N50: 57.65 kb and scaffold N50: 605.56kb) (X. Zhang et al., 2019).
A total of 2,396 scaffolds were attached to 53 pseudochromosomes based on the constructed genetic map, and accounted for 79.99% of the total length (691.44 Mbp). Markers between linkage groups and scaffolds exhibited fine collinearity, suggesting that our assembly was of high continuity and high accuracy (Fig. 1).

Completeness of the geneset and assembly

To evaluate the accuracy of the genome at the single base level, we mapped the short sequence reads generated by the Illumina platform to the assembled genome and evaluated the accuracy using BWA. We performed variant calling using SAMtools. The assembly was evaluated by mapping reads with approximately 90.72% coverage, explaining the completeness of the genome assembly. We obtained 47,482 homozygous SNPs(Table S2) , thus, reflecting a low homozygous rate (0.0067%) and the high accuracy of the genome assembly at the single-base level.
To assess the completeness of the assembled P. trituberculatusgenome, we performed Benchmarking Universal Single-Copy Orthologs (BUSCO) by searching against the Arthropoda BUSCO datasets (version 3.0). Overall, 84.7% of the 1,066 Arthropoda BUSCOs were identified in the assembled genome (Table S3) . We also assessed the completeness of the P. trituberculatus genome using the Core Eukaryotic Genes Mapping Approach (CEGMA). According to CEGMA, 211 (85.08%) conserved genes were identified in the P. trituberculatus genome (Table S4 ). All results indicated that the genome assembly had good coverage and completeness.

Genome prediction and annotation

The P. trituberculatus genome encodes 19,981 protein-coding genes. The average gene length and coding sequence (CDS) length were 10,484.05 and 1,231.18 bp, respectively, which were consistent with the distributions of gene features in other arthropods (Table S5, Fig S2) . Among the predicted genes, 19,763 (~98.9%) were annotated based on at least one of the NR (RefSeq non-redundant proteins), SwissProt, KEGG, and GO databases (Table S6) . Furthermore, ~84.4% of complete BUSCO genes were successfully identified. The repeat content accounted for 49.46% of the assembly, which was much lower than that of Eriocheir japonica sinensis (61.42%) (Tang et al., 2020), but marginally higher thanLitopenaeus vannamei (49.38%). Among the repetitive sequences, the most abundant TEs were long interspersed elements (LINE, 27.29% of the genome), followed by long terminal repeats (LTR, 17.07%) and DNA transposons (10.95%) (Table S7, Fig S3) .

Genome Evolution

Identifying gene families between closely related species provides important insights into the evolutionary relationship of those species. We identified 32,918 gene families altogether and 246 single-copy gene families in gene family cluster analyses performed in twelve species using OrthoMCL (Table S8, Table S9, Fig S4) . To investigate the phylogenetic relationship between P. trituberculatus and other species, we constructed a phylogenetic tree. Using single-copy orthologues, phylogenetic analyses showed thatE. sinensis was most closely related to P. trituberculatus, with a divergence time around 153.0 (77.6–274.3) million years ago. The ancestor of P. trituberculatus andE. sinensis separated from the ancestor of P. virginalisand L. vannamei ~326.0 million years ago(Fig. 2) .
We identified 18 gene families that were expanded and seven gene families that were contracted in the crab genome, compared to the other species (Table S10) . KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment of the expression of the expanded genes suggested that some expanded genes played important roles in different physiological process, such as development, molting, salinity adaption, and immune stress (Fig S5). In addition, we also identified 1,471 unique gene families containing 3,199 genes in the crab genome (Table S11 ). Those lineage-specific gene families may contribute to traits that are specific to the species. KEGG enrichment analysis also revealed significantly enriched pathways (p < 0.01), which contained ribosome, protein digestion and absorption, proximal tubule bicarbonate reclamation, and metabolism of xenobiotics by cytochrome P450 (Fig S6) . Among which, the pathway of ribosome were the most significantly enriched, containing the largest number of genes (140 genes). Furthermore, we analyzed the expression of genes in the ribosome pathway in different stages of development, molting, under low salt stress, and upon pathogen infection (Fig S7) . We found that some genes were ubiquitously expressed in the above processes, but most were mainly expressed in different molting stages. The major endocrine complex of crustaceans, the X-organ/sinus gland complex (XOSG) (Hopkins), is located in the eyestalk and is known to play a role in molting (Chang & Mykles, 2011). A large number of specific ribosome-related genes are expressed in the eyestalk, and their expressions are significantly related to different molting stages, suggesting that they play an important role in the regulation of eyestalk neuroendocrine and molting.

Genetic basis of carcinization

Compared with the elongate bodies of shrimps or lobsters, crabs are characterised by a compact body organization with a depressed, short carapace and a ventrally-folded pleon (Scholtz, 2014). The evolutionary transformation from a lobster-like crustacean towards a crab is called ‘carcinization’, which has been interpreted as a dramatic morphological change (Borradaile, 2019). Comparative genomic analyses of crab and shrimp allowed us, for the first time, to infer genomic-level evolutionary processes in carbonization. Compared with the two shrimp species, one expanded and forty-seven contracted gene families were identified in two crab species using the Fisher’s exact test (Table S12 ). The only expanded gene family was the C2H2 zinc finger protein family, and previous research supports a role for the proteins in the regulation of morphogenesis (Urrutia, 2003). There are 0, 3 and 23 genes in penaeid shrimp, crayfish, and crab (Table S13) . Further analysis of its expression in embryonic development and larval metamorphosis showed that the gene family was mainly expressed in early embryonic development at the flea like larval stage (Fig. 3) ; thus, suggesting that it may have specific functions in development and metamorphosis.
Two contracted gene families deserve special attention, including the actin and WDFY families (Table S14) . There are 84, 8, and 4 actin genes, and 19, 1, and 0 WDFY genes in penaeid shrimp, crayfish and crab, respectively, as characterized by positive correlations between gene number and body proportion from shrimp, to crayfish, to crab. More interestingly, both genes are linked to a common human disease, rheumatoid arthritis (RA) (Steinz et al., 2019; Y. Q. Zhang, Bo, Zhang, Zhuang, & Liu, 2014). RA is a systemic autoimmune diseases caused by a failure of immune self-tolerance. RA shows some similar appearances to carcinization, such as skeletal muscle weakness, synovial inflammation and hyperplasia, and cartilage and bone destruction. Therefore, the contraction of the two gene families in crab provides clues to study the molecular mechanisms of ‘carcinization’.
The Hox gene family is well known for its important function in directing tissue differentiation and morphological development throughout all of the principal axes of an embryo (Yan et al., 2019). Many Hox genes are expressed during early embryogenesis; thus, indicating their roles in development (Hogan, 1987; McGinnis). EightHox genes were found in the crab genome, of which their expressions were detected at different ontogenic stages, identifying functions in development and metamorphosis (Fig 4) . The results showed that different Hox genes had different levels of expression at different stages, however, in general, Lab ,Pb and Dfd were more active before the gastrulation stage, and Scr , Antp , Ubx , Abd-A and Abd-Bwere more active after the egg-nauplius stage.
Brachyurization is one of the most important characteristics of carcinization. The development of crab from the megalopa to the first juvenile crab is the key period of brachyurization (Song et al., 2015). We found that the Ubx and Abd-A genes, which mainly control the shape of the abdominal segment of crab, were significantly down regulated from the megalopa to the first juvenile crab phase. Conversely, it should be noted that the expression of Ubx andAbd-A is maintained at a high level throughout these stages in shrimp (Xiaoqing et al., 2015). Thus, the different expression patterns of the two genes in the development of crab and shrimp suggest that they are involved in brachyurization.

Mechanism of salinity adaptation

P. trituberculatus are weak high osmoregulators (McNamara & Faria, 2012). Their internal osmotic pressure is vulnerable to external salinity changes, which indirectly affects their growth, development, and immune system. Thus, it is of great theoretical and practical importance to clarify the mechanisms relating to their osmoregulation. In the present study, we examined the complete transcriptome of P. trituberculatus in response to low salt stress (from 0 h to 72 h). Comparisons of gene expression showed that a total of 4,603 unigenes were significantly differentially expressed (DEG), accounting for 33.5% of DEGs, and thus, indicating that low salt stress has a substantial impact on the crab (Table S15) .
DEGs were clustered using the STEM Clustering Method, and were grouped into 20 profiles (Fig S8) . Among the profiles, Profile 1 contained the highest number of transcripts (670, 14.6%) that were down-regulated at 12 and 24 h post-salt stress, and returned to normal after 72 h. Profiles 16 (535, 11.6%) and 18 (436, 9.5%) exhibited the next highest number of DEGs upon salt stress, with transcripts that were up-regulated immediately following the salinity stress, which were then down regulated or restored to normal levels after 72 h.
To understand the mechanisms of salinity adaptation in P. trituberculatus , we further analyzed the expanded and unique DEGs. Eight expanded genes and 141 unique genes were identified from the DEGs(Table S16) . Among which, the V-type proton ATPase catalytic subunit A was involved in osmoregulation, and plays an important role in ion transport (Romano & Zeng, 2012). In addition, 14-3-3-like protein, Heat shock protein 70, and two caspase family genes were also identified, and are associated with stress resistance and apoptosis(Kaeodee, Pongsomboon, & Tassanakajon, 2011; Q. Wang; Zhong).The results suggest that ion transport, stress resistance, and apoptosis were involved in salinity adaptation in P. trituberculatus .
According to the cluster analysis of the expression patterns during the salinity stress, the expanded and unique DEGs were divided into two categories. One group showed down-regulation of genes after the salinity test, while the other group was up-regulated (Fig 5) . Based on the above results, we speculated that some physiological compromises or adjustments were made by P. trituberculatus to adapt to the low salt environment. Such adjustments were mainly through ”passive” and ”active” mechanisms. The passive pathway refers to the inhibition of some nonessential physiological processes upon salinity stress, and mainly include ribosome-mediated protein translation, and antigen processing and presentation (Fig S9) . The active pathway refers to the activation of some physiological processes essential for survival in low salt stress environments, including protein digestion and absorption, apoptosis, and pancreatic secretion, among others(Fig S10) .

Immune mechanism

Gene family expansion is probably one of the most important contributors to phenotypic diversity and evolutionary adaption in the environment (X. Zhang et al., 2019). P. trituberculatus has stronger resistance to some pathogens, such as WSSV and Vibrios. According to the cumulative mortality rates at 72 h, and the median lethal doses (LD50) of V. parahaemolyticus , the anti-disease ability of four kinds of crustacean was: P. trituberculatus > E. carinicauda > F. chinensis > L. vannamei (the data no shown). In addition, P. trituberculatus is not sensitive to WSSV disease compared to shrimp (Zhongfa, Zhizheng, Wenjun, Weixian, & Songye, 2008). Thus, for the first time, comparative genomic analyses were used to elucidate the immune mechanism of the crab.
Expanded gene families were significantly enriched in 27 KEGG pathways(Fig S5) . In particular, the expanded gene families were specifically involved in pathways related to pathogen invasion, such as the African trypanosomiasis (8.63E-08), Legionellosis (6.50E-05), and Salmonella infection (7.33E-05). Some expanded gene families were overrepresented in immune system pathways, such as phagosome (0.00418914), the NOD-like receptor signaling pathway (0.01011326), and the Hippo signaling pathway (0.017947197). In addition, the apoptosis pathway was also enriched, which has been proved to be closely related to the crab immunity (Ren, Yu, Gao, Liu, & Li, 2017). Thus, the significant expansion of these immune- or disease-related gene families may have shaped the crab’s ability to resist the invasion of exogenous pathogens.
Although crabs have strong resistance to disease, their individual resistance is different. Using the materials for disease-resistance differentiation, we screened 453 markers related to disease resistance and performed BSA analysis (Fig 6) . A total of 427 genes were anchored in the region near the disease-resistance related markers. According to the △index and sequencing depth, some markers were selected and verified based on whether they were related to the resistance ofV. parahaemolyticus . Finally, 12 molecular markers significantly associated with the resistance of V. parahaemolyticus were identified using the Pearson chi-square test, including nine SNPs and three InDels (Table S17 ). Based on the location information of these markers, three disease-resistance-related genes were anchored, including DNA-damage-inducible transcript 4 (evm.model.Contig81.19), WAP four-disulfide core domain protein 1 (evm.model.Contig242.11) and protein trapped in endoderm-1 (evm.model.Contig405.11). Among those genes, DNA-damage-inducible transcript 4 was involved in the mTOR signaling pathway and the Autophagy pathway, while protein trapped in endoderm-1 was involved in the circadian entrainment and neuroactive ligand-receptor interaction pathways. All such KEGG pathways were associated with immunity, disease, and environmental adaptation (Fruman, Chiu, Hopkins, Bagrodia, & Abraham, 2017; Silva & Jung, 2013; X. Xu, Ye, Araki, & Ahmed, 2012). After V. parahaemolyticus infection, the expression of the three genes in hemocytes and the hepatopancreas were significantly changed (Fig 7) . Thus, these genes may play an important role in the immune defense of P. trituberculatus,which could be related to the differentiation of disease resistance traits between individuals.

Regulatory mechanism of rapid growth

P. trituberculatus is an important aquaculture species in China and breeders are particularly interested in its growth traits due to their high commercial significance. The new variety of fast-growingP. trituberculatus, Huangxuan No. 1, was selected in 2010 after five generations of selection from a wild population. During selection, the body weight increased 20.12%, compared to the unselected population. The growth rate of P. trituberculatus is also significantly faster than that of E. sinensis , and this study sought to determine the regulatory mechanism contributing to such rapid growth.
A total of 562 growth-related SNP/InDels were found by BSA strategy(Table S18 and Fig 8) . Based on the locations of these markers in genome, 676 candidate growth-related genes were identified, which were significantly enriched (p<0.01) in Type II diabetes mellitus, pancreatic secretion, protein digestion and absorption, insulin resistance and salivary secretion KEGG pathways. (Fig S11) . The results suggest that these pathways may be associated with growth traits. Among these candidate growth-related genes, 42 genes were expanded or unique to P. trituberculatus (Table 3) . Some of the genes played key roles in pathways for carbohydrate metabolism, amino acid metabolism, and the endocrine system, which are known to be related to growth, including acidic mammalian chitinase production, and carbohydrate sulfotransferase and neural-cadherin synthesis.
Unlike E. sinensis and L. vannamei , P. trituberculatus is a typical carnivorous crab. There are often a large number of shellfishes and small crustacean shells found in their stomachs, which contain substantial amounts of chitin. Chitin is a linear polymer of β-1,4-linked N-acetyl-D-glucosamine (GlcNAc), and is the second most abundant natural polysaccharide in nature. Moreover, chitin functions as a major structural component in fungi, crustaceans, and insects (Koch, Stougaard, & Spaink, 2015). Chitinases are enzymes that hydrolyze chitin. In the mammalian digestive system, chitin has long been considered as a source of dietary fiber that is not digested. However, recent studies confirmed that acidic mammalian chitinase (AMCase) can function as a major protease-resistant glycosidase under stomach and intestinal conditions and can degrade chitin substrates to produce (GlcNAc)2, which is a source of carbon, nitrogen, and energy (Misa et al.). A specific AMCase gene was found in the P. trituberculatus genome, thus, suggesting the crab can efficiently digest food containing chitin to produce essential energy and substances that support rapid growth.
Rapid growth is inseparable from myogenesis. Cell alignment and fusion in myotubes are key steps during myogenesis. Cell alignment requires a different structural organization of the extracellular matrix, such as lumican, which is composed of keratan sulfates (Chakravarti & S.). Carbohydrate sulfotransferase 5 (Chst5 ) is involved in keratan sulfate biosynthesis, and contributes to the remodeling of the extracellular matrix during myogenic progression in skeletal murine satellite cells (MSC). Decreased expression of Chst5 delayed cell fusion in myotubes (Grassot et al.). In addition, N-cadherin plays an important role in myogenesis and its expression was highest in prefusion myoblasts, which declined thereafter (Maccalman, Bardeesy, Holland, & Blaschuk). Based on comparative genomics analyses, one specificChst5 gene and one expanding N-cadherin family were found in the crab’s genome, which indicated that there may be a specific and rapid mechanism of myogenesis in P. trituberculatus.
Insulin signalling is important for the regulation of glucose and lipid metabolism, which also stimulate cell growth and differentiation (Saltiel & Kahn, 2001). Growth retardation, obesity and type 2 diabetes is associated with insulin resistance (Kahn & Flier, 2000; Phillips, 1995). Protein-tyrosine phosphatase (LAR) plays an essential role in the regulation of reversible tyrosine phosphorylation of cellular proteins related to insulin resistance (P. M. Li, Zhang, & Goldstein, 1996), and the knock-down of LAR induces insulin resistance (Mander, Hodgkinson, & Sale). The result of BSA analyses showed that one SNP located between two tandem LAR genes was associated with growth traits, and one of which was unique to P. trituberculatus . Those data suggested that rapid growth and the differentiation of growth traits in P. trituberculatus could be regulated by insulin signaling.

Sex determination

Due to the large number of chromosomes and complex genomes, sex determination by karyotype analysis cannot be performed in most crustaceans (Zeltia Torrecilla, Martínez-Lage, Perina, González-Ortegón, & González-Tizón, 2017). Moreover, the mechanism of sex determination in P. trituberculatus has not previously been determined. In our previous studies, a highly significant QTL located on LG24 (LOD > 14) was identified for the first time using a high-density linkage map. Those studies also identified heterogametic genotypes of sex-associated markers in male support the XY sex determination mechanism(Lv et al., 2018). To further analyze the mechanism of sex determination in P. trituberculatus , we used CQ analyses to anchor the Y chromosome. We focused on results with CQ values equal to zero and found that they were all located in the two contigs (Contig153 and Contig475)( Fig 9) . Interestingly, the two previously-anchored sex markers were also located on those two contigs(Lv et al., 2018); thus, proving our accuracy in locating the Y chromosome.
A total of 10 genes were predicted in the Y-linked region (Table 4) , and their expression patterns were analyzed during different developmental stages with known sexes of Z1 to juvenile crabs, based on the sex marker (Lv et al., 2018). Finally, eight genes were successfully analyzed, and the expression patterns of three of which, including doublesex (Ptdmrt , Contig475.13), myb/SANT-like (Ptmyb , contig153.28) and an unknown gene (Contig475.8) were significantly different in males and females during different developmental stages,(Fig 10) .
The doublesex gene was involved in sex determination, which is characterized by one or more highly conserved DNA-binding (DM) domains (Meng, Moore, Tang, Yuan, & Lin, 1999). In the XX/XY sex-determining system, the Y-linked DMY/Dmrt1bY genes of the teleost, medaka, were characterized as sex-determining genes, leading to male sexual development (Masaru et al., 2002). In chicken and flatfish, which exhibit the ZZ/ZW system, DMRT1 is located on the Z chromosome, and the gene dosage may induce male development (S. Chen et al., 2014; Nanda et al.). In X. laevis , one W-linked gene, DM-W, was identified as a likely sex (ovary)-determining gene (Yoshimoto et al.). The full-length cDNA sequence (2,165 base pairs) of Ptdmrt was obtained by rapid amplification of cloned ends (RACE), and the open reading frame spans three exons and encodes a putative protein of 265 amino acids, including one DM domain (Fig 11) . Ptdmrt can only amplify in males (40 female and 40 male individuals were used in this study) (Fig 12) , and RT-PCR results also showed that the Ptdmrt gene was expressed in all male tissues tested, but not in females (Fig S12), suggesting that Ptdmrt is a Y-linked gene and plays an important role in sex determination. During the different stages of embryonic development, Ptdmrt expression was significantly upregulated at the egg-nauplius stage, reaching its peak at the egg-zoea stage (upregulated 18.54-times, compared to the multicellular stage). Expression was then decreased during the Z1 stage and reached a second peak in expression in the Z3 phase (6.20-times) (Fig 13) . Those results indicated that the egg-zoea and zoea periods may be critical periods of sex differentiation.
Similar to the Ptdmrt gene, the expression of Ptmyb in males was also significantly higher than that in females. The difference between the expression of the two genes was that, Ptmyb was mainly expressed throughout development from Z1 to juvenile crab stages. The unknown sex-related gene was different from Ptdmrt andPtmyb , of which expression in female was significantly higher than that in male during Z1 to Z4 phase. However, in the megalopal and juvenile crab stages, the expression pattern of the unknown sex-related gene changed, and was higher in males. Neither Ptmyb or the unknown sex-related genes were found to be involved in sex determination in previous studies, indicating the complexity and species specificity of the sex determination mechanisms in P. trituberculatus.
Conclusions
We present a chromosomal-scale genome assembly of the P. trituberculatus , a representative crab of the brachyuran species and an important specie of aquatic crustaceans. Genomic annotation and comparative genomic analyses provided insights into the genomic structure, evolution, and mechanisms underlying the biology of complex marine ecosystems. The data shed light on the salinity adaptation ofP. trituberculatus , as well as their strong immunity, rapid growth and sex determination. This reference genome will provide crucial resources for evolutionary and biological studies, which are also important to clarify the regulatory mechanisms of important breeding traits and MAS for this economic crab.
Acknowledgements
This research was supported by the National Key R&D Program of China (2018YFD0900303 and 2018YFD0901304), Ministry of Agriculture of China & China Agriculture Research System (CARS-48), Efficient Eco Agriculture Innovation Project of Taishan Leading Talent Project (No. LJNY2015002), National Natural Science Foundation of China (41776160).
Author contributions
Jian Li, Chunlin Wang and Ping Liu.conceived and designed the research. Guangjian Liu, and Zhencheng Su performed the genome sequencing. Jianjian Lv, Ronghua Li, Zhencheng Su, Deping Yan and Xingbin Ti performed the experiments and data analyses. Ping Liu and Baoquan Gao performed sample preparation. Jianjian Lv and Zhencheng Su wrote the manuscript. All authors reviewed the manuscript.
Data availability
All genomic sequence datasets can be found on NCBI (https://www.ncbi.nlm.nih.gov/bioproject/ PRJNA631920/)
Figure and table
Fig 1 Genomic features of the P.trituberculatus . From outer to inner circles: 1, distribution of across the genome; 2, distribution of positive strand genes across the genome; 3, distribution of negative strand genes across the genome; 4, GC content across the genome. 5, represents P.trituberculatus chromosomes. In 5, each line joins paralogous genes at different chromosomes. 2–4 are drawn in nonoverlapping 0.5-Mb sliding windows
Fig 2 Phylogenetic tree of 12 species, which was constructed using 246 single copy orthologous genes. Gene family expansion (+) and contraction (-) are shown in green and red colors above their branches, respectively. Red points at ancestral nodes represent bootstrap values > 90 in 100 experiments. In addition toP.trituberculatus , pictures of other species are from the website http://phylopic.org/about/#about –use.
Fig 3 Phylogenetic tree and expression of C2H2 family at different developmental stages in P. trituberculatus .(A) Phylogenetic tree of genes of the expanded gene family compared with the two shrimp species. (B) Hierarchical clustering of C2H2 family genes at seventeen different developmental stages in P.trituberculatusincluding fertilized egg stage (F), zoea stage (Z1–Z4), megalopal stage (M), and juvenile crab stage (J).
Fig 4 Hox Genes Clusters and the Expressions at Different Developmental Stages in P. trituberculatus. (A) Clustering ofHox genes in P. trituberculatus (Ptr ), E. sinensis (Esi ), L. vannamei (Lva ), and P. virginalis (Pvi). The relative position and orientation of the genes are indicated.(B) qPCR of Hox genes at different developmental stages in P. trituberculatus .
Fig 5 Cluster analysis of the expanded and unique DEG during salinity stress. Green ring represents up regulated gene and yellow ring represents down regulated gene.
Fig 6 Identification of disease resistance differentiation through BSA analysis. X-axis represents the posiotion of Contigs of P. trituberculatus and Y-axis represents the SNP-index or Δ(SNP-index). The color dots represent the SNP-index or Δ(SNP-index) value of every SNP locus. The red lines show the SNP-index or Δ(SNP-index) value of fitting results. a. the SNP-index graph of X group. b. the SN-index graph of XT group. c. the Δ(SNP-index) graph. The blue dashed line and purple dashed line represent the condition of screening of two groups, respectively.
Fig 7 Expression analysis of disease resistance related genes in different tissues following infection with V. parahaemolyticus
Fig 8 Identification of fast growth differentiation through BSA analysis. X-axis represents the posiotion of Contigs of P. trituberculatus and Y-axis represents the SNP-index or Δ(SNP-index). The color dots represent the SNP-index or Δ(SNP-index) value of every SNP locus. The red lines show the SNP-index or Δ(SNP-index) value of fitting results. a. the SNP-index graph of PBX group. b. the SN-index graph of PSX group. c. the Δ(SNP-index) graph. The blue dashed line and purple dashed line represent the condition of screening of two groups, respectively.
Fig 9 Reads and coverage plot of the CQ results. The alignments from male (PBX-PSX) and female (PBC-PSC) sequence data to demonstrate how Y-linked sequences can be differentiated by distinctive CQ values in the same region of the splitted genome. X-axis represrnts a region and Y-axis represents male and female individuals.
Fig 10 Expression pattern of 8 genes from Z1 to juvenile crab in male and female individuals.
Fig 11 Sequence and structural features of Ptdmrt genes. (A) Nucleotide and deduced amino acid sequence of Ptdmrt. The black box indicates the conserved DM domain and the gray labeled region is the zinc finger structure in the DM domain. (B) Alignment of cDNA sequence of Ptdmrt and DNA sequence of Contig 475 (Ptdmrt gene was located in Contig 475). Ptdmrt gene consists of three exons and two introns.
Fig 12 Ptdmrt amplify in male and female individuals
Fig 13 Expression of Ptdmrt at different developmental stages. Eleven different developmental stages from fertilized eggs to young crabs including multicellular stage (Mc), blastula stage (B), gastrulation stage (G), egg-nauplius stage (En), egg-zoea stage (Ez), zoea stage (Z1–Z4), megalopal stage (M), and juvenile crab stage (J).
Table 1 Sequencing data used for the P. trituberculatus genome assembly