WES data analysis
SNV and indel variant annotation were conducted using VarSeq software version 1.5.0 (Golden Helix) and selected by the reading depth (>10), Phred score (>20), and alternative allele frequency (>0.35). Based on the public variant databases in ABraOM (http://abraom.ib.usp.br - [31], GnomAD (https://gnomad.broadinstitute.org - [32], and 1000 Genomes - Phase 3 (https://www.internationalgenome.org - [33]), we filtered out germline variants with frequencies above 0.5% or 1% (for dominant and recessive models of inheritance, respectively), as well as those mapped to hypervariable genes [34] or detected in an in-house dataset including data from 19 healthy controls.
Coding variants - Coding nonsynonymous missense and loss-of-function (LoF; frameshift, stop loss/gain, essential splice site, nonsense) variants were maintained for further analysis. In silico pathogenicity prediction for missense variants was based on six algorithms provided by the database dbNSFP (version 2.4); those predicted to be damaging to protein function by at least five different tools were prioritized. The final set of genes with rare damaging coding variants was annotated using Varelect [35] and HPO [36] for ranking in association with specific phenotypes. All LoF and prioritized missense variants were validated by visual inspection of the BAM files, further annotated using the Varsome tool [37], and classified according to the American College of Medical Genetics and Genomics (ACMG) guidelines [38, 39].
Noncoding variants - Intronic, intergenic, 3’ and 5’ untranslated regions (UTRs), and splice region variants were annotated using SNPnexus v4 [40], a web-based annotation tool for the analysis and interpretation of variants that includes databases of regulatory elements and regions, such as miRbase (ftp://mirbase.org/pub/mirbase/20/genomes/), Vista HMR (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/), and ENCODE (ftp://ftp.ensembl.org/pub/grch37/release-95/mysql/regulation_mart_95/); phenotype and disease association (Genetic Association of Complex Diseases and Disorders (GAD) - http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/, ClinVar, and COSMIC); and noncoding scoring (Combined Annotation Dependent Depletion (CADD) - http://cadd.gs.washington.edu/download, Fitness Consequences of Functional Annotation (fitCons) - http://compgen.cshl.edu/fitCons/0downloads/tracks/V1.01/i6/scores/; and Chromatin Effects of Sequence Alterations (DeepSEA) - http://deepsea.princeton.edu/help/).
Copy number variants (CNVs) - CNVs and region of homozygosity (ROH) events were derived from WES data using the software Nexus Copy Number 9 (Biodiscovery) with the SNP-FASST2 segmentation algorithm (threshold log2 Cy3/Cy5 ratio of |0.3| for gains and losses; minimum ROH length of 5 Mb). Common CNVs (Database of Genomic Variants, http://dgv.tcag.ca/dgv/app/home) were disregarded. For CNV validation, chromosome microarray analysis (CMA) was performed using the 180K platform (Agilent Technologies), as previously reported [41].