WES data analysis
SNV and indel variant annotation were conducted using VarSeq software
version 1.5.0 (Golden Helix) and selected by the reading depth
(>10), Phred score (>20), and alternative
allele frequency (>0.35). Based on the public variant
databases in ABraOM (http://abraom.ib.usp.br - [31], GnomAD
(https://gnomad.broadinstitute.org - [32], and 1000 Genomes - Phase
3 (https://www.internationalgenome.org - [33]), we filtered out
germline variants with frequencies above 0.5% or 1% (for dominant and
recessive models of inheritance, respectively), as well as those mapped
to hypervariable genes [34] or detected in an in-house dataset
including data from 19 healthy controls.
Coding variants - Coding nonsynonymous missense and
loss-of-function (LoF; frameshift, stop loss/gain, essential splice
site, nonsense) variants were maintained for further analysis. In
silico pathogenicity prediction for missense variants was based on six
algorithms provided by the database dbNSFP (version 2.4); those
predicted to be damaging to protein function by at least five different
tools were prioritized. The final set of genes with rare damaging coding
variants was annotated using Varelect [35] and HPO [36] for
ranking in association with specific phenotypes. All LoF and prioritized
missense variants were validated by visual inspection of the BAM files,
further annotated using the Varsome tool [37], and classified
according to the American College of Medical Genetics and Genomics
(ACMG) guidelines [38, 39].
Noncoding variants - Intronic, intergenic, 3’ and 5’ untranslated
regions (UTRs), and splice region variants were annotated using SNPnexus
v4 [40], a web-based annotation tool for the analysis and
interpretation of variants that includes databases of regulatory
elements and regions, such as miRbase
(ftp://mirbase.org/pub/mirbase/20/genomes/), Vista HMR
(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/), and ENCODE
(ftp://ftp.ensembl.org/pub/grch37/release-95/mysql/regulation_mart_95/);
phenotype and disease association (Genetic Association of Complex
Diseases and Disorders (GAD) -
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/, ClinVar, and
COSMIC); and noncoding scoring (Combined Annotation Dependent Depletion
(CADD) - http://cadd.gs.washington.edu/download, Fitness Consequences of
Functional Annotation (fitCons) -
http://compgen.cshl.edu/fitCons/0downloads/tracks/V1.01/i6/scores/; and
Chromatin Effects of Sequence Alterations (DeepSEA) -
http://deepsea.princeton.edu/help/).
Copy number variants (CNVs) - CNVs and region of homozygosity
(ROH) events were derived from WES data using the software Nexus Copy
Number 9 (Biodiscovery) with the SNP-FASST2 segmentation algorithm
(threshold log2 Cy3/Cy5 ratio of |0.3|
for gains and losses; minimum ROH length of 5 Mb). Common CNVs (Database
of Genomic Variants, http://dgv.tcag.ca/dgv/app/home) were disregarded.
For CNV validation, chromosome microarray analysis (CMA) was performed
using the 180K platform (Agilent Technologies), as previously reported
[41].