Genotype-environment association analysis
We used latent factor mixed model 2 (LFMM2) for the genotype-environment association (GEA) analysis. LFMM2 has been shown to outperform similar approaches while computing several orders of magnitude faster (Caye et al. 2019), and it controls for the effects of demographic processes and population structure (Wang et al. 2017). This approach is robust to large amounts of missing data, such as genotyping-by-sequencing (GBS) tends to produce, when sample sizes exceed 100 (Xuereb et al. 2017).
LFMM2 regression models combine fixed and latent effects with the following equation:
$Y = XB^{T} + W + E$.
Y is an n × p matrix of genetic information measured at p genetic markers for n individuals, and X is an n × d matrix of d environmental variables measured for the same n individuals. The fixed effect sizes are recorded in the matrix B, which has dimension p × d. The matrix E represents residual errors and has the same dimensions as the response matrix Y. W is a matrix of rank K defined by K latent factors, where K can be determined by model choice procedures. The K latent factors represent unobserved confounders (usually geographical structure in the genotypes of the samples) and are stored in an n × K matrix, U; V is a p × K matrix of loadings. U and V are obtained from the singular value decomposition (SVD) of W:
$W = UV^{T}$.
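As a check on the dimensions given in the text, substituting the factorization of W into the regression model gives the following bookkeeping. This is only a restatement of the definitions above (n individuals, p markers, d environmental variables, K latent factors), not an additional modelling assumption.

```latex
\[
\underbrace{Y}_{n \times p}
  = \underbrace{X}_{n \times d}\,\underbrace{B^{T}}_{d \times p}
  + \underbrace{U}_{n \times K}\,\underbrace{V^{T}}_{K \times p}
  + \underbrace{E}_{n \times p}
\]
```

Each term on the right-hand side is n × p, matching the response matrix Y, and the product of U and the transpose of V has rank at most K.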
We used the two approaches implemented in the R package LEA v.2.6.0 to determine K: principal component analysis (PCA) and admixture analysis (Frichot et al. 2013; Frichot & François 2015). First, we ran the LEA function pca and selected the number of significant principal components with Tracy-Widom tests computed by the LEA function tracy.widom (Patterson et al. 2006). Second, we ran the LEA function snmf for K values between 1 and 5, with ten repetitions each. The most likely K was the value that minimized the cross-validation error in the 10-fold cross-validation procedure (Frichot & François 2014). We then considered associations significant at p < 10⁻⁵.
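The two selection rules described above (choose the K with the lowest cross-validation error, then keep associations below the p-value threshold) can be sketched in a few lines of Python. This is a minimal illustration, not the LEA implementation: the function names `select_k` and `significant_markers` are hypothetical, averaging the error across repetitions is an assumption (the text does not say how repetitions were combined), and the toy numbers are invented for the example rather than taken from the study.

```python
from statistics import mean


def select_k(cv_errors):
    """Return the candidate K with the lowest mean cross-validation error.

    cv_errors maps each candidate K to a list of errors, one per repetition.
    Averaging across repetitions is an assumption made for this sketch.
    """
    mean_error = {k: mean(errs) for k, errs in cv_errors.items()}
    return min(mean_error, key=mean_error.get)


def significant_markers(p_values, threshold=1e-5):
    """Return the sorted marker names whose p-value is strictly below threshold."""
    return sorted(name for name, p in p_values.items() if p < threshold)


# Toy inputs, invented purely for illustration (not results from the study).
toy_errors = {
    1: [0.52, 0.53, 0.51],
    2: [0.44, 0.45, 0.43],
    3: [0.47, 0.46, 0.48],
}
toy_p_values = {"snp_a": 3e-7, "snp_b": 0.02, "snp_c": 9e-6, "snp_d": 1e-5}

best_k = select_k(toy_errors)
hits = significant_markers(toy_p_values)
print(best_k, hits)
```

The threshold comparison is strict, so a marker with a p-value exactly equal to the threshold is not retained.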