Library preparation and sequencing
For genotyping, we included only individuals with the highest wasting scores from each site, which ranged from ranks 3 to 6 (Table S1). In total, 82 individuals were included in the wasting group and 112 in the apparently normal group. DNA was extracted using the E.Z.N.A Tissue DNA kit (Omega Biotek, Norcross, GA) and quantified using a fluorescence method (Quant-iT dsDNA Assay Kit, ThermoFisher, Waltham, MA). We used the 2bRAD protocol for genotyping SNPs (Wang et al. 2012), following the original published protocol, but using the enzyme AlfI. We also used adaptors with "NN" overhangs to target 100% of restriction sites. Multiplexed individuals were pooled at approximately equimolar amounts (after quantification via quantitative PCR) and sequenced across five lanes of an Illumina HiSeq 3000 as 50-bp single reads, at the Center for Genome Research and Biocomputing at Oregon State University.
Data filtering
Adaptors were trimmed and low-quality reads were filtered (<30 phred) using publicly available scripts (https://github.com/Eli-Meyer/2brad_utilities/). Because of the cleavage pattern of the AlfI enzyme, 2bRAD DNA inserts are 34-36 bp in length. Therefore, after adaptor and quality trimming, we filtered out any reads that were shorter than 34 bp to reduce chances of mismapping. Cleaned reads were mapped to the reference P. ochraceus genome (NCBI accession GCA_010994315.1) using SHRiMP (Rumble et al., 2009), reporting the top three maximum hits per read. We used Stacks v.1.35 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013; Catchen, Amores, Hohenlohe, Cresko, & Postlethwait, 2011) to call genotypes using default parameters, with samples coded by disease status group (wasting/apparently normal). Additional filtering parameters used included: selection of a single (first) SNP per stack, removal of loci that were genotyped in only one group, the minimum minor allele frequency (MAF) set to 0.025, a minimum minor allele count set to four, and only loci represented in at least 50% of samples were retained. Using VCFtools (Danecek et al., 2011), we retained only biallelic loci and only genotypes with a minimum coverage of 6 reads. Finally, we used PLINK (Purcell et al., 2007) to remove individuals that were missing more than 50% of loci. This filtering pipeline retained a dataset with 133 individuals (74 normal and 59 wasting) and 71,784 SNP loci, which we will refer to as the 'full dataset'.
Population structure
Before comparing genotypic variation between sea stars varying in disease status, we assessed whether significant population genetic structure exists among the sampled sites. We used the Bayesian clustering approach in STRUCTURE (Pritchard, Stephens, & Donnelly, 2000) with the dataset of 133 individuals, but we further filtered it to reduce computational burden. For this, we filtered loci that were genotyped in at least 90% of individuals, using VCFtools, retaining 4,626 loci. STRUCTURE runs included 100,000 burn-in and 100,000 MCMC sampling replicates, assuming admixture and with sampling locality used as prior information. We ran three values of K to assess whether clustering occurring at the specific site (K = 6) or at the subregion-level (K = 3) were more likely than a large panmictic population (K = 1). To confirm that parameters converged, each value of K was run three separate times and likelihoods and clustering patterns compared across runs. This clustering analysis suggested genetic variation was not structured among sampling sites (Figure S1). We therefore considered our samples as coming from a single population in the subsequent analyses.
Analyses of genetic variation
We used two types of approaches for estimating differentiation between disease status groups. We estimated Weir & Cockerham’s Fst (Weir & Cockerham, 1984) using GPAT++ (Shapiro et al., 2013), with significance levels adjusted using the false discovery rate (FDR) in R. To test for Fst outliers that may be indicative of selection, we used BayeScan v2.1 (Foll & Gaggiotti, 2008) with the following settings: prior odds of the neutral model of 10, burn-in of 50,000 replicates, a thinning interval of 10, 20 pilot runs for 5,000 iterations, and recorded output of 5,000 iterations. Significance for Bayescan Fst outliers was assessed at a q-value false-discovery rate (FDR) of 0.01.
We also used a multivariate approach to test for differentiation between grossly normal and wasting sea stars. A discriminant analyses of principal components (DAPC) (Jombart, Devillard, & Balloux, 2010) was used, implemented in the R package 'adegenet' (Jombart, 2015) and identified outlier loci based on their loadings associated with the discriminant function separating the two groups.
Regions linked to discriminant loci
To identify functional candidates that may be associated with disease status, we scanned for protein-coding genes linked to outlier SNPs. We first determined the appropriate genomic window size for scanning around each SNP by estimating linkage disequilibrium (LD) between pairs of SNPs. The r2 was calculated using VCFtools between pairs of SNPs within 100,000-bp windows. Average r2 was calculated in bins of 100-bp increments, and were plotted against physical distance. This plot showed that LD decays rapidly up to 15 kb, then continues to decrease but at lower rates (Figure S2). We hence scanned a 30-kb window centered at each outlier SNP (15 kb on either side) by overlaying the protein-coding genome annotation from Ruiz-Ramos et al. (2020) onto the genome assembly, using the Integrative Genomics Viewer (Robinson et al., 2011).