Genotype-Environment association analysis
We used latent factor mixed model 2 (LFMM2) for genotype-environment
association (GEA) analysis. LFMM2 has been shown to outperform similar
approaches while computing several orders of magnitude faster
(Caye et al. 2019), and it controls for the effects of demographic
processes and population structure (Wang et al. 2017). The approach is
robust to large amounts of missing data, such as genotyping-by-sequencing
(GBS) tends to produce, when sample sizes are >100 (Xuereb et al. 2017).
LFMM2 regression models combine fixed and latent effects with the
following equation:
$$Y = XB^T + W + E.$$
$Y$ is a matrix of genetic information measured from $p$ genetic markers
for $n$ individuals, and $X$ is a matrix of $d$ environmental variables
measured for the same $n$ individuals. The fixed effect sizes are recorded
in the matrix $B$, which has dimension $p \times d$. The matrix $E$
represents residual errors and has the same dimensions as the response
matrix $Y$. The matrix $W$ has rank $K$ and is defined by $K$ latent
factors, where $K$ can be determined by model choice procedures. The $K$
factors represent unobserved confounders, usually geographical structure
in the genotypes of the samples, and are stored in an $n \times K$ matrix
$U$. $V$ is a $p \times K$ matrix of loadings. The matrix $U$ is obtained
from the singular value decomposition (SVD) of $W$:
$$W = UV^T$$
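The dimensions in the model above can be checked with a small simulation. The following is a minimal sketch in plain Python, not the LFMM2 estimation procedure itself: the helper names (`matmul`, `transpose`, `add`, `gaussian`, `simulate_lfmm`, `shape`), the sizes, and the Gaussian draws are all illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def gaussian(rows, cols, rng, sd=1.0):
    """Matrix of independent Gaussian draws (illustrative data only)."""
    return [[rng.gauss(0.0, sd) for _ in range(cols)] for _ in range(rows)]

def shape(m):
    return (len(m), len(m[0]))

def simulate_lfmm(n=6, p=8, d=2, K=3, seed=1):
    """Build Y = X B^T + W + E with W = U V^T (hypothetical toy sizes)."""
    rng = random.Random(seed)
    X = gaussian(n, d, rng)            # n x d environmental variables
    B = gaussian(p, d, rng)            # p x d fixed effect sizes
    U = gaussian(n, K, rng)            # n x K latent factors
    V = gaussian(p, K, rng)            # p x K loadings
    E = gaussian(n, p, rng, sd=0.1)    # n x p residual errors
    W = matmul(U, transpose(V))        # n x p, rank at most K
    Y = add(matmul(X, transpose(B)), W, E)
    return {"Y": Y, "X": X, "B": B, "U": U, "V": V, "W": W, "E": E}

sim = simulate_lfmm()
for name in ("Y", "X", "B", "U", "V", "W", "E"):
    print(name, shape(sim[name]))
```

The response $Y$, the latent matrix $W$ and the residuals $E$ all share the shape $n \times p$, while $B$ and $V$ each have one row per marker.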
We used the two approaches implemented in the LEA v.2.6.0 R package to
determine K: principal component analysis (PCA) and admixture analysis
(Frichot et al. 2013; Frichot & François 2015). First, we ran
the LEA function pca and selected the number of significant principal
components by computing Tracy-Widom tests with the LEA function
tracy.widom (Patterson et al. 2006). Second, we ran the LEA function snmf
for K values between 1 and 5 with ten repetitions each. The most likely K
value was identified by minimizing the error evaluated in a 10-fold
cross-validation procedure (Frichot & François 2014).
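The selection rule for K described above can be sketched as follows. This is a hedged illustration in plain Python rather than the LEA workflow: the function `select_k` is hypothetical, the error values are made-up placeholders and not results from this study, and averaging across repetitions (with ties going to the smaller K) is an assumption, since the text does not say how repetitions were summarized.

```python
from statistics import mean

def select_k(errors_by_k):
    """Return the K with the lowest mean error, plus the per-K means.

    errors_by_k maps each candidate K to a list of error values,
    one per repetition. Ties are broken in favour of the smaller K.
    """
    if not errors_by_k:
        raise ValueError("no runs supplied")
    summary = {k: mean(values) for k, values in errors_by_k.items()}
    best = min(summary, key=lambda k: (summary[k], k))
    return best, summary

# Placeholder numbers for illustration only (not data from the study).
example = {
    1: [0.71, 0.70, 0.72],
    2: [0.64, 0.65, 0.63],
    3: [0.66, 0.65, 0.67],
}
best_k, summary = select_k(example)
print(best_k)
```

With these placeholder values the rule picks K = 2, the candidate with the smallest average error.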
We then retained associations with p-values below $10^{-5}$ as significant.
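The significance filter can be illustrated with a short sketch. It assumes that each locus has a z-score that is standard normal under the null hypothesis, which is an assumption made here for illustration; the locus names, z-scores and helper functions (`two_sided_p`, `significant_loci`) are hypothetical, and no calibration step is shown.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def significant_loci(z_scores, alpha=1e-5):
    """Return (locus, p-value) pairs with p < alpha, smallest p first."""
    hits = []
    for locus, z in z_scores.items():
        p = two_sided_p(z)
        if p < alpha:
            hits.append((locus, p))
    return sorted(hits, key=lambda item: item[1])

# Hypothetical z-scores for illustration only.
z_scores = {"snp_a": 5.0, "snp_b": 4.0, "snp_c": -4.6, "snp_d": 1.2}
for locus, p in significant_loci(z_scores):
    print(locus, f"{p:.2e}")
```

Only loci whose p-value falls below the $10^{-5}$ threshold are kept, so a z-score of 4.0 is excluded here while 5.0 and -4.6 pass.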