Genome alignment and SNP calling
Raw Illumina reads of 199 samples were processed to remove adaptor
sequences using Trimmomatic software (v0.39). We used the
‘ILLUMINACLIP:2:30:10 MINLEN:50’ settings to keep the read pairs with a
minimum length of 50bp for further processing. Adaptor removed reads
were aligned to the PDK50 reference genome (PRJNA40349) using bwa mem
software (v0.7.15-r1140) (Heng Li, 2013) with default parameters. Reads
with supplementary (split/short) alignments were marked by the bwa mem
-M option. Each sample alignment was validated with ‘ValidateSam’
(Picard) (http://broadinstitute.github.io/picard) to ensure no errors in
output BAM files. Before variant calling, duplicate reads were marked
using GATK MarkDuplicates (gatk-4.1.7.0) (DePristo et al., 2011). The
following GATK function, ‘FixMateInformation’ and ‘SetNmMdAndUqTags,’
were used to ensure consistent paired-read information and fix the NM,
MD, and UQ tags in the sorted BAM files. Variant calling was performed
using GATK HaplotypeCaller with GVCF mode. Reads with mapping quality
less than 20 and duplicate marked reads were excluded from the variant
call. Hard filter parameters were applied to filter raw SNPs and INDELS
using bcftools (v1.10.2) (H. Li, 2011). The raw SNPs were filtered using
‘QD<2.0 || FS>60.0
|| MQ<40 ||
SOR>4.0 || MQRankSum<-10
|| MQRankSum>10 ||
ReadPosRankSum<-6’ parameters, and INDELS were filtered with
‘QD < 2.0 || FS > 200.0
|| ReadPosRankSum <−20.0’ parameters.
Thereafter we utilized QC filter on SNP data. We excluded samples with
greater than 70% missingness and SNPs with depth > 11000
summed across the samples in QC analysis. Genotypes were marked as
missing if DP was below 10, genotype call rate less than 80%, or a
minor allele frequency below 0.01. Finally, multiallelic SNP
(-max-alleles 2) and SNPs with Hardy–Weinberg Equilibrium p-value less
than 1e-06 (–hwe 1e-06) were filtered out using Vcftools (V0.1.16).
The resulting samples and SNPs were used for further analysis.
Linkage Disequilibrium (LD) Analysis
Linkage Disequilibrium (LD) was calculated using PopLDdecay software (C.
Zhang, Dong, Xu, He, & Yang, 2018) with -MaxDist 200 -MAF 0.05 option.
The pairwise correlation coefficient ( r2 ) was
calculated for all SNPs in 200kb window with minor allele frequency
greater than 5% for the whole genome and 18 linkage groups separately
(LG). The decay of LD was calculated by
plotting pairwise r2 onto genetic distance between the
SNP pairs using an approach mentioned in Marroni et al. (Marroni
et al., 2011) and then calculated the distance at which pairwise
correlation coefficient ( r2 ) is half its maximum
value as half decay distance.