Whole-genome sequencing and variant calling.
Genomic DNA from L. cidri colonies was prepared for whole-genome sequencing using a Qiagen Genomic-tip 20/G kit (Qiagen, Hilden, Germany) as previously described (Nespolo et al., 2020b) and sent for sequencing using DNBseq technology (BGISEQ-G400 platform) (Liu et al., 2018). Read quality was checked using FastQC 0.11.8 (Andrews, 2010). Reads were processed with fastp 0.19.4 (low-quality 3’ end trimming, 40 bp minimum read length) (Brickwedde et al., 2018; Chen et al., 2018). We also obtained publicly available sequencing reads of L. cidri CBS2950 (Agier et al., 2018), which were processed identically. Reads were aligned against the L. cidri CBS2950 reference genome (Vakirlis et al., 2016) using BWA-MEM (options: -M -R) (Li, 2013). Mapping quality and overall statistics were collected and examined with Qualimap (García-Alcalde et al., 2012). Output BAM files were sorted and indexed using SAMtools 1.9 (Li et al., 2009). An L. fermentati isolate (CBS770) was also mapped against the L. cidri CBS2950 genome for phylogenetic analysis (Bellut et al., 2020). Duplicates in the mapping files were marked using MarkDuplicates from Picard tools 2.18.14 (http://broadinstitute.github.io/picard/). Variant calling and filtering were done with GATK 4.0.10.1 (DePristo et al., 2011). Specifically, variants were called per sample and per chromosome using HaplotypeCaller (default settings), after which variant databases were built using GenomicsDBImport. Genotypes for each chromosome were called using GenotypeGVCFs (-G StandardAnnotation). The per-chromosome variant files were merged into one genome-wide file using MergeVcfs, and this file was split into SNP calls and INDEL calls using SelectVariants. We applied the recommended filters for coverage (more than 10 mapped reads; “FORMAT/DP>10”) and quality (--minQ 30) (Van der Auwera et al., 2013b). This VCF file was further filtered, depending on the requirements of each analysis, using vcftools (Van der Auwera et al., 2013a, b).
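The order of the steps above can be summarised as a short Python sketch that only assembles the command lines, without running them. It is an illustration under stated assumptions, not the scripts used in this study: the file names, the read-group string, the paired-end layout, the `picard` wrapper name, and the helper names `preprocess_and_map` and `call_variants` are hypothetical. The fastp flags `-3` and `-l 40` are our reading of "low-quality 3’ end trimming, 40 bp minimum read length", and `-ERC GVCF` is assumed because GenomicsDBImport takes GVCF input.

```python
from typing import List

Command = List[str]


def preprocess_and_map(sample: str, ref: str) -> List[Command]:
    """Trimming, mapping, sorting and duplicate marking for one sample."""
    # Assumption: paired-end reads with these (hypothetical) file names.
    r1, r2 = f"{sample}_1.fq.gz", f"{sample}_2.fq.gz"
    t1, t2 = f"{sample}_1.trim.fq.gz", f"{sample}_2.trim.fq.gz"
    # Assumption: a minimal read group; bwa expands the literal "\t".
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    return [
        ["fastp", "-i", r1, "-I", r2, "-o", t1, "-O", t2, "-3", "-l", "40"],
        # bwa mem writes SAM to stdout; capture it as <sample>.sam.
        ["bwa", "mem", "-M", "-R", read_group, ref, t1, t2],
        ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.sam"],
        ["samtools", "index", f"{sample}.sorted.bam"],
        ["picard", "MarkDuplicates", f"I={sample}.sorted.bam",
         f"O={sample}.dedup.bam", f"M={sample}.dup_metrics.txt"],
    ]


def call_variants(samples: List[str], ref: str, chroms: List[str]) -> List[Command]:
    """Per-sample, per-chromosome calling followed by joint genotyping."""
    cmds: List[Command] = []
    for chrom in chroms:
        for s in samples:
            cmds.append(["gatk", "HaplotypeCaller", "-R", ref,
                         "-I", f"{s}.dedup.bam", "-L", chrom,
                         "-ERC", "GVCF", "-O", f"{s}.{chrom}.g.vcf.gz"])
        # One GenomicsDB workspace per chromosome, holding all samples.
        db_import = ["gatk", "GenomicsDBImport",
                     "--genomicsdb-workspace-path", f"db_{chrom}", "-L", chrom]
        for s in samples:
            db_import += ["-V", f"{s}.{chrom}.g.vcf.gz"]
        cmds.append(db_import)
        cmds.append(["gatk", "GenotypeGVCFs", "-R", ref,
                     "-V", f"gendb://db_{chrom}",
                     "-G", "StandardAnnotation", "-O", f"{chrom}.vcf.gz"])
    # Merge the per-chromosome files into one genome-wide VCF.
    merge = ["gatk", "MergeVcfs"]
    for chrom in chroms:
        merge += ["-I", f"{chrom}.vcf.gz"]
    cmds.append(merge + ["-O", "all.vcf.gz"])
    # Split the merged file into SNP and INDEL call sets.
    for kind in ("SNP", "INDEL"):
        cmds.append(["gatk", "SelectVariants", "-R", ref, "-V", "all.vcf.gz",
                     "--select-type-to-include", kind,
                     "-O", f"all.{kind.lower()}.vcf.gz"])
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`; building the commands as data first makes the per-sample and per-chromosome structure of the workflow easy to inspect before anything is executed.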
For all datasets, we only considered SNPs that had no missing data using the vcftools option --max-missing 1.
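The combined effect of the coverage, quality, and missing-data filters can be illustrated with a small Python sketch that decides whether a single VCF record is retained. This is not the vcftools implementation: the function names `parse_sample` and `keep_site` are hypothetical, and the sketch assumes that the depth threshold is applied to every sample, that a site is dropped if any genotype is missing, and that a QUAL value exactly at the threshold is kept.

```python
from typing import Dict, List


def parse_sample(fmt_keys: List[str], sample_field: str) -> Dict[str, str]:
    """Pair FORMAT keys (e.g. GT, DP) with one sample's values."""
    return dict(zip(fmt_keys, sample_field.split(":")))


def keep_site(vcf_line: str, min_qual: float = 30.0, min_dp: int = 10) -> bool:
    """Return True if a tab-separated VCF data line passes all three filters."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual, fmt_keys, samples = cols[5], cols[8].split(":"), cols[9:]

    # Site quality filter (assumed equivalent of --minQ 30).
    if qual == "." or float(qual) < min_qual:
        return False

    for field in samples:
        call = parse_sample(fmt_keys, field)
        # Missing-data filter (assumed equivalent of --max-missing 1):
        # any "." allele in the genotype counts as missing.
        if "." in call.get("GT", "."):
            return False
        # Coverage filter: strictly more than min_dp reads per genotype.
        depth = call.get("DP", ".")
        if depth == "." or int(depth) <= min_dp:
            return False
    return True
```

For example, a record whose two samples carry `0/0:15` and `1/1:22` under a `GT:DP` format with QUAL 50 is retained, whereas the same record with one genotype `./.` or one depth of 8 is discarded.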