Characterization of neutral variation and SNPs under selection
Before performing the genetic structure analysis, we used PLINK v1.9
(Purcell et al. 2007) to prune the SNPs according to their linkage
disequilibrium, estimated as the correlation coefficient between pairs
of SNPs. This filtering step was needed because the subsequent structure
analysis does not account for linkage disequilibrium, so linked SNPs
could bias the grouping of individuals. Genetic structure
analyses were performed with the Admixture v.1.3 software (Alexander et
al. 2009), which allowed us to estimate the ancestries of each
individual through maximum likelihood. To determine the number of groups
(K) for which the genetic structure model had the greatest predictive
power, we used cross-validation errors (cv-errors), with lower errors
indicating greater predictive power.
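The choice of K from cross-validation errors can be illustrated with a minimal sketch. It assumes that ADMIXTURE was run with cross-validation enabled for several values of K and that the logs contain lines of the form "CV error (K=3): 0.51234"; the function names and the numeric values are hypothetical and are not results from this study.

```python
import re

# Assumed log line format, one line per tested K: "CV error (K=3): 0.51234"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV line found in the log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Made-up values for illustration only (not data from this study).
example_log = """\
CV error (K=1): 0.61200
CV error (K=2): 0.54800
CV error (K=3): 0.55900
"""

cv_errors = parse_cv_errors(example_log)
print(best_k(cv_errors))
```

In practice the log text would be the concatenation of the logs from each run, one per value of K.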
We applied an additional filtering step before the outlier analysis to
minimize the false positive rate. We discarded loci whose minor allele
frequency was < 0.05 in any population (thus excluding private alleles)
and loci that could not be sequenced in at least 75% of the individuals
in each population. The resulting dataset
consisted of 6,421 SNPs. To identify SNPs putatively under selection, we
performed an outlier analysis with BayeScan v.2.1 (Foll and Gaggiotti
2008), a conservative method with a low false positive rate that is well
suited to studies with few populations. This program uses a logistic
regression to split the
FST coefficients into a population-specific effect (β) and a
locus-specific effect (α). We selected loci with α > 0, which suggests
positive selection, and with q < 0.05, where q is the false discovery
rate threshold and therefore accounts for multiple testing.
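The locus filter and the outlier criteria described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with None for missing calls, and the dictionary keys "id", "alpha" and "qval" are hypothetical names for per-locus values taken from the outlier analysis output.

```python
def minor_allele_frequency(genotypes):
    """MAF within one population; genotypes are 0/1/2 allele counts, None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def call_rate(genotypes):
    """Fraction of individuals in one population with a genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def keep_locus(genotypes_by_pop, min_maf=0.05, min_call_rate=0.75):
    """Keep a locus only if it passes both thresholds in every population."""
    return all(
        minor_allele_frequency(g) >= min_maf and call_rate(g) >= min_call_rate
        for g in genotypes_by_pop.values()
    )


def select_outliers(loci, max_q=0.05):
    """Return ids of loci with a positive locus effect and q below the threshold."""
    return [
        locus["id"]
        for locus in loci
        if locus["alpha"] > 0 and locus["qval"] < max_q
    ]


# Made-up genotypes and statistics for illustration only.
polymorphic = {"pop_A": [0, 1, 1, 2], "pop_B": [1, 1, 0, 2]}
private = {"pop_A": [0, 1, 1, 2], "pop_B": [0, 0, 0, 0]}
print(keep_locus(polymorphic), keep_locus(private))
```

Because the minor allele frequency is evaluated within each population, an allele that is absent from one population has a frequency of 0 there and the locus is discarded, which is how private alleles are excluded.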
Once this was done, we attempted to annotate the SNPs with the highest
values of α. For this purpose, we ran a BLASTn search against the entire NCBI
database. In addition, a chi-square (χ²) test was performed to determine
whether any of these loci had genotype frequencies that deviated
significantly from those expected under Hardy-Weinberg (H-W) conditions.
This was done because the environmental differences described above
between the two populations could produce not only divergent selective
pressures on some SNPs, but also a selective pressure that is present in
one population and absent in the other (in which case the deviation from
H-W expectations should occur only in that population).
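A chi-square test of Hardy-Weinberg proportions for a single biallelic SNP in one population can be sketched as below. This is a minimal illustration, assuming genotype counts are available per population and using one degree of freedom (three genotype classes, one estimated allele frequency); the function name and the counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic and p-value (1 df) for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expected count (monomorphic locus).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up genotype counts for illustration only.
chi2, p_value = hwe_chi_square(25, 50, 25)
print(chi2, p_value)
```

Running the test separately in each population shows whether a deviation is restricted to one of them, which is the pattern expected if the selective pressure acts in only one population.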