Contributions of Rare Gene Variants to Familial and Sporadic FSGS : Journal of the American Society of Nephrology


Basic Research

Contributions of Rare Gene Variants to Familial and Sporadic FSGS

Wang, Minxian1,2,3; Chun, Justin1,2,4; Genovese, Giulio5; Knob, Andrea U.1; Benjamin, Ava1; Wilkins, Maris S.1; Friedman, David J.1,2; Appel, Gerald B.6; Lifton, Richard P.7; Mane, Shrikant8; Pollak, Martin R.1,2,3

JASN 30(9):p 1625-1640, September 2019. | DOI: 10.1681/ASN.2019020152

Abstract

Focal and segmental glomerulosclerosis (FSGS) is a common histologically defined pattern of kidney injury. The clinical syndrome associated with FSGS varies in severity, age of onset, progression of disease, responsiveness to steroids, and degree of proteinuria.1 Over the past two decades, the importance of genetic factors in the development of FSGS has become increasingly clear. More than 50 single gene forms of the pathologically defined entity FSGS or the related clinically defined phenotype steroid-resistant nephrotic syndrome (SRNS) have been identified.2 As a general rule, the autosomal dominant forms of FSGS tend to be late onset, associated with progressive kidney disease, and often manifest with subnephrotic proteinuria and the absence of the full set of features that define the nephrotic syndrome. This is typical of kidney disease associated with mutations in ACTN4, INF2, and TRPC6.3–5 In contrast, recessive forms of FSGS and SRNS tend to be earlier in onset and associated with more severe proteinuria and overt nephrotic syndrome.6 Among the most severe forms of inherited SRNS is congenital nephrotic syndrome caused by mutations in the nephrin gene, NPHS1. Mutations in NPHS1 typically cause neonatal nephrosis and diffuse podocyte foot process effacement, rather than FSGS. However, at least one report has suggested increased frequency of rare NPHS1 variants in patients with FSGS.7 It is reasonable to view early onset nephrotic syndrome and later onset FSGS as part of a spectrum of phenotypes caused by a variety of mutations in genes that primarily affect podocyte function.8 In addition, it is increasingly clear that mutations in genes not typically thought of as glomerular disease genes can lead to FSGS as a secondary response to primary injury. As examples, mutations in type 4 collagen genes and nephronophthisis genes have been reported in association with FSGS.9–12

The majority of known FSGS and SRNS genes have been identified through linkage analysis (including homozygosity mapping) and subsequent sequence analysis of the candidate regions.6 Here, we report on the results of an exome-sequencing study aimed at understanding the genetic landscape of mutations and variants in patients and families carrying the diagnosis of FSGS. We examined the frequencies of rare variants in known FSGS-causing genes, other known kidney disease genes, and the rest of the human exome in a cohort of patients with familial or sporadic FSGS. We performed the same analysis in a similar number of control individuals. We analyzed the burden of such variants in cases versus controls. We identified several new genes that may be possible contributors to the risk of FSGS. The top 50 new genes in our analysis are significantly closer to known FSGS genes in a gene interaction network analysis than are randomly chosen sets of genes, suggesting that several of these genes are in fact true contributors to FSGS. Importantly, we find that for many genes previously described as FSGS or kidney disease genes, nonsynonymous and loss-of-function (LOF) variants are as common in matched controls as they are in cases. These results have important implications for understanding the landscape of genetic variation in FSGS, as well as for the interpretation of genetic testing in patients with FSGS.

Methods

Human Subjects

The study was approved by the institutional review board at Beth Israel Deaconess Medical Center. Individuals belonging to 395 families with FSGS were included in this study. Familial FSGS–affected status was defined as having either a reported history of proteinuria with urine albumin-to-creatinine ratio of 250 mg/g, nephrotic syndrome, or biopsy-proven FSGS in a family with at least one other case of documented FSGS or nephrotic syndrome. We obtained blood or saliva for DNA isolation as well as clinical information after receiving informed consent. Control exome data were obtained from the Yale Center for Mendelian Genetics. DNA for sequencing was extracted from blood or saliva using standard methods. Of 1284 exomes, 1079 (84%) were captured by Roche’s MedExome target enrichment kit. The other 205 exomes were captured by different capture technologies. This study analyzed 1284 exomes. Excluding the repeated samples, this represents 1241 unique samples, 622 controls and 619 affected individuals or their family members. The number of sequenced individuals per family ranged from one to seven (Supplemental Table 1).

Variant Calling, Quality Control, and Inbreeding Coefficient Estimation

Read quality was checked and preprocessed by FastQC13 and Trimmomatic (version 0.36). Quality-checked FASTQ files were aligned against the human reference genome (GRCh38, hg38) by BWA (version 0.7.13-r1126).14 Variants were called according to the best practices for use of the Genome Analysis Toolkit (version 3.5–0-g36282e4).15,16 Only variants located in exon regions and the 100-base flanking region of each exon were called. The gene exon annotations were from refSeq (hg38).17 CrossMap (v0.2.2)18 was used to convert the genome coordinates from hg38 to hg19 for the VCF file. Three additional filters were also applied to each individual call: read depth ≥4, genotype quality ≥20, and minor read ratio ≥0.2 for heterozygous sites. The average read depth was approximately 20–30× after removing individuals with read depth ≤10×. Three individuals with an excess number of Mendelian errors were also removed. Individuals were also removed if their genetic relationship was not consistent with the pedigree structure. Additionally, the genome-wide inbreeding coefficient was estimated by IBDLD software.19
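The three per-call filters described above can be sketched as a simple predicate. This is a minimal illustration rather than the pipeline used in the study: the `GenotypeCall` record and the function name are hypothetical, read depth is assumed to be the sum of reference and alternate reads, and the "minor read ratio" is assumed to mean the fraction of reads supporting the less-supported allele.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCall:
    """Hypothetical per-sample genotype record (not a real VCF parser type)."""
    ref_reads: int          # reads supporting the reference allele
    alt_reads: int          # reads supporting the alternate allele
    genotype_quality: int   # GQ-style genotype quality
    is_heterozygous: bool


def passes_call_filters(call, min_depth=4, min_gq=20, min_minor_ratio=0.2):
    """Apply the depth, genotype-quality, and minor-read-ratio filters."""
    # ASSUMPTION: depth is ref + alt reads; a real pipeline may use the DP field.
    depth = call.ref_reads + call.alt_reads
    if depth == 0 or depth < min_depth:
        return False
    if call.genotype_quality < min_gq:
        return False
    if call.is_heterozygous:
        # ASSUMPTION: minor read ratio = share of reads for the rarer allele.
        minor_ratio = min(call.ref_reads, call.alt_reads) / depth
        return minor_ratio >= min_minor_ratio
    return True
```

For example, a heterozygous call with 18 reference reads and 2 alternate reads has a minor read ratio of 0.1 and would be dropped, whereas a 10/5 split would be kept.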

Variant Annotation

Variants were annotated by Variant Effect Predictor (VEP) software,20 with annotated information including: minor allele frequency [MAF] of continental populations (in the GnomAD project21 and in the Exome Aggregation Consortium [ExAC] project21); the Genomic Evolutionary Rate Profiling (GERP) score22; CADD score23; and deleterious effect prediction by PolyPhen2,24 SIFT,25 M-CAP,26 LRT,27 and the MutationTaster method.28 The confidences of LOF effects of splice sites, nonsense sites, and indels were annotated using the LOFTEE plugin of the VEP software.20

Rare Variant Spectrum in Known Disease-Causing Genes

In order to catalog the spectrum of rare variants in known disease-causing genes, we performed rare variant filtering according to the suspected genetic inheritance model, the allele frequency, and the genetic prediction scores from in silico genetic score prediction software. Variant cosegregation with phenotype was checked whenever possible. Only the splicing sites, stop gain/loss sites, in-frame/frame-shift indels, and nonsynonymous sites were analyzed. Note that for the compound heterozygous model we did not have phasing information available. In order to comply with the American College of Medical Genetics (ACMG) standards for gaining at least a supporting level of pathogenic classification by computational prediction, we required that putative causal variants be predicted as damaging by at least three of the five prediction methods used for nonsynonymous rare variants, or be classified as high-confidence LOF sites by the LOFTEE plugin of VEP.20 Additional filters specifically for each inheritance model were as follows:

Dominant Model

(1) The maximum population-level allele frequency (pop_max) was assumed to be <0.0001 in the general population. (2) The identified rare variant was absent from control samples. (3) CADD score ≥10.

Recessive Model (Homozygous)

(1) Maximum population frequency <1%. (2) No homozygous alternative allele genotype in GnomAD and ExAC databases, or in our in-house control samples. (3) CADD score ≥10.

Recessive Model (Compound Heterozygous)

(1) Maximum population frequency <8%. Note: the allele frequency of R229Q in NPHS2 was 0.06893 in the European Finnish population in the GnomAD database. (2) At least two rare heterozygous genotypes in a single gene, such that at least one of these heterozygous sites did not have a homozygous genotype found in the GnomAD and ExAC databases.
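The three model-specific filter sets listed above can be written as small predicates. The sketch below is illustrative only: the `Variant` record and its field names are hypothetical, and cosegregation within families is not modeled. As in the text, the compound heterozygous check has no phasing information, so two qualifying heterozygous variants in the same gene are counted even though they could lie on the same haplotype.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical annotated variant record; field names are illustrative."""
    gene: str
    pop_max_af: float             # maximum allele frequency across populations
    cadd: float                   # CADD score
    seen_in_controls: bool        # present in in-house control samples
    homozygous_in_public: bool    # homozygous genotype present in gnomAD or ExAC
    homozygous_in_controls: bool = False


def passes_dominant(v):
    """Dominant model: pop_max < 0.0001, absent from controls, CADD >= 10."""
    return v.pop_max_af < 0.0001 and not v.seen_in_controls and v.cadd >= 10


def passes_homozygous_recessive(v):
    """Homozygous recessive model: AF < 1%, no known homozygotes, CADD >= 10."""
    return (
        v.pop_max_af < 0.01
        and not v.homozygous_in_public
        and not v.homozygous_in_controls
        and v.cadd >= 10
    )


def compound_het_genes(het_variants):
    """Genes with >= 2 heterozygous variants (AF < 8%), at least one of which
    has no homozygous genotype in the public databases."""
    by_gene = {}
    for v in het_variants:
        if v.pop_max_af < 0.08:
            by_gene.setdefault(v.gene, []).append(v)
    return sorted(
        gene
        for gene, variants in by_gene.items()
        if len(variants) >= 2
        and any(not v.homozygous_in_public for v in variants)
    )
```

The 8% ceiling keeps NPHS2 R229Q (allele frequency 0.06893 in the European Finnish population) eligible as one of the two heterozygous variants, provided the second variant in the gene has no known homozygotes.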

Ancestry Labeling and Case/Control Matching

Our samples were mixed with the samples from the 1000 Genomes Project29 for the principal component analysis (PCA). The population label was consistent with the label from the 1000 Genomes Project; in total, six ancestry labels were used (EUR, European; EAS, East Asian; AMR, admixed American [MXL, Mexican Ancestry from Los Angeles; PUR, Puerto Ricans from Puerto Rico; CLM, Colombians from Medellin, Colombia; and PEL, Peruvians from Lima, Peru]; SAS, South Asian; AA, African American; AFR, African). The ancestry of our samples was inferred by the k-nearest neighbors (k-NN) algorithm on the basis of the distance between individuals measured in the space of the top ten principal components. Specifically, we set k=1, and the distance D between two individuals (i and j) was computed from Vm, the proportion of variance explained by the mth principal component, and Pkl, the kth projection value of individual l. We matched each case to the one control with the closest distance as measured by D.
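A minimal sketch of the 1-nearest-neighbor ancestry labeling and the case/control matching follows. The exact form of the distance D is not given in this text, so the code assumes a variance-weighted Euclidean distance over the principal components; that choice, the greedy matching without replacement, and all function names are assumptions made for illustration.

```python
import math


def weighted_distance(p_i, p_j, variance_explained):
    """Distance between two individuals in principal-component space.

    ASSUMPTION: variance-weighted Euclidean distance; the source defines the
    weights V_m and projections P_kl but its exact formula is not shown here.
    """
    return math.sqrt(
        sum(v * (a - b) ** 2 for v, a, b in zip(variance_explained, p_i, p_j))
    )


def infer_ancestry(sample_pcs, reference, variance_explained):
    """1-NN label: copy the label of the closest reference individual.

    reference is a list of (pcs, label) pairs, e.g. 1000 Genomes samples.
    """
    _, label = min(
        reference,
        key=lambda ref: weighted_distance(sample_pcs, ref[0], variance_explained),
    )
    return label


def match_controls(cases, controls, variance_explained):
    """Pair each case with its closest unused control.

    ASSUMPTION: greedy matching without replacement, in case iteration order.
    cases and controls map sample IDs to principal-component tuples.
    """
    available = dict(controls)
    matches = {}
    for case_id, case_pcs in cases.items():
        if not available:
            break
        best = min(
            available,
            key=lambda cid: weighted_distance(
                case_pcs, available[cid], variance_explained
            ),
        )
        matches[case_id] = best
        del available[best]
    return matches
```

Greedy matching depends on the order in which cases are processed; an optimal assignment method would remove that dependence at the cost of extra complexity.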

Association Test for Common Variants and Gene-Based Burden Test for Moderately Rare Variants

The genomic inflation factor (λ)30,31 and the data for the QQ plot were estimated by PLINK2 software32 for the case/control data before and after case/control matching. Only common variants (MAF≥0.05) were used in this analysis. In order to deal with heterogeneous variant effects and suboptimal allele frequency cut-offs, three methods (variable threshold [VT] test, C-α test, and SKAT-O test) were applied for moderately rare variants (MAF≤5%). These tests were performed using PLINK/seq (http://atgu.mgh.harvard.edu/plinkseq/) and EPACTS software (http://genome.sph.umich.edu/wiki/EPACTS).
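For readers unfamiliar with the genomic inflation factor, the sketch below shows the standard median-based definition: association P values are converted to 1-degree-of-freedom chi-square statistics, and their median is divided by the expected median under the null (about 0.4549). The study itself used PLINK2 for this step; the function here is only an illustration, and its name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(p_values):
    """Median-based genomic inflation factor from two-sided P values in (0, 1]."""
    std_normal = NormalDist()
    # A two-sided P value p corresponds to z = inv_cdf(p / 2); z**2 is the
    # matching 1-df chi-square statistic.
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

Uniformly distributed P values give a value close to 1; values well above 1 suggest confounding such as population stratification.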

Gene-Based Burden Test for Extremely Rare Variants

For different inheritance models, the qualified variants were classified similarly to how variants in known disease-causing genes were classified, but only case/control-matched data were used. After variant filtering and classification, two tests were performed. The “single-group” burden test33,34 (Supplemental Figure 1A) compares the observed rare variant rate to the expected rate of each gene. The “case-control” burden test (Supplemental Figure 1B) compares the observed rare variant rate between case and control samples. To validate the performance of this method, we explored the model performance under different parameter combinations. The performance was assessed by the total information content I and by the ranking of these genes (Supplemental Figure 2). To test the asymmetric distribution of the top genes between case and control samples, the rank sums of genes from each group were compared using the Wilcoxon rank-sum test (U test) to check the overall rank orders between these groups.35
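The rank-sum comparison mentioned above can be illustrated with a small standard-library implementation. This is a sketch of the general Wilcoxon rank-sum (Mann-Whitney U) test using a normal approximation without continuity or tie correction; it is not the authors' code, and a statistics package would normally be used instead.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic for x, approximate P value)."""
    pooled = sorted(
        (value, group) for group, values in ((0, x), (1, y)) for value in values
    )
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2.0
        rank_sum_x += average_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == 0
        )
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u, p_value
```

With very small groups the normal approximation is rough, so exact tables or a dedicated library routine would be preferable in practice.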

Enrichment Test for Top-Ranked Genes in Renal Glomerulus–Enriched Genes

Nine glomerular expression datasets were collected by Ding et al.36 A high-confidence set of glomerulus-enriched genes was defined by those genes that have significant differential expression between glomerular and nonglomerular compartments in at least two datasets (see Table 1 of Ding, et al.36). Fisher’s exact test was used to evaluate the significance of the enrichment of renal glomerulus–enriched genes in top-ranked genes.
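The enrichment test can be sketched as a hypergeometric tail probability, which is what a one-sided Fisher's exact test computes for a 2×2 table. Whether the study used a one-sided or two-sided test is not stated, so the one-sided form here is an assumption, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_glomerular, n_top, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    n_universe:   number of genes considered overall
    n_glomerular: genes in the glomerulus-enriched set
    n_top:        number of top-ranked genes
    n_overlap:    top-ranked genes that are also glomerulus-enriched
    Returns P(overlap >= n_overlap) under random sampling without replacement.
    """
    max_overlap = min(n_glomerular, n_top)
    total = comb(n_universe, n_top)
    tail = sum(
        comb(n_glomerular, k) * comb(n_universe - n_glomerular, n_top - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total
```

As a toy check, drawing 3 genes from a universe of 10 that contains 3 enriched genes and hitting all 3 has probability 1/120.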

Table 1. Summary for familial and sporadic cases
                                              Sex           Ancestry
Group           Patients(a)  Age of onset(b)  Male  Female  EUR  EAS  SAS  AA  AMR  AFR
Familial cases  337          22.6 (15.8)      179   158     216  5    16   31  69   0
Sporadic cases  187          20.1 (14.9)      77    110     162  2    1    8   13   1
EUR, European; EAS, East Asian; SAS, South Asian; AA, African American; AMR, admixed American (MXL, Mexican Ancestry from Los Angeles; PUR, Puerto Ricans from Puerto Rico; CLM, Colombians from Medellin, Colombia; and PEL, Peruvians from Lima, Peru); AFR, African.
(b) Mean in years (SD).

Network Distance Analysis

We define the distance $d_{ij}$ of two genes (i and j) as the shortest path between gene i and j of all of the possible paths between them on a given network. This shortest path problem was solved by Dijkstra’s algorithm.37 Given two gene sets K and T, we define the distance between a gene t from T and the gene set K as $d_{tK} = \min_{k \in K} d_{tk}$, the minimal distance of t to all of the genes from K. Then, the distance $G_{TK}$ of a candidate gene set T to a known gene set K was defined as $G_{TK} = \frac{1}{|T|} \sum_{t \in T} d_{tK}$, the average over all of the candidate genes from T of their minimal distance to genes from K. Depending on the comparisons, the known gene set K and the novel gene set T were set appropriately. Statistical significance was estimated by repeating the random process 1000 times. Each time, T was replaced by 50 random genes from a gene pool that was covered by the exome-sequencing target. We also performed another analysis by fixing the top 50–ranked new genes, but replacing the known genes by randomly picked genes for the estimation of the background distribution, repeating this process 1000 times. The network analysis was implemented in Java on the basis of the Java version of Dijkstra’s algorithm from Princeton University (https://algs4.cs.princeton.edu/44sp/DijkstraUndirectedSP.java.html).

Two gene-gene interaction networks were used as backbones for checking and comparing distances: the STRING38 and inBio Map networks.39 The connection score (Ci) in these networks was converted to Cmax − Ci in order to apply the shortest path algorithm, where Cmax is the maximum edge score in a network. In order to check the robustness of the gene set distance between top-ranked new genes and known disease-associated genes, we performed “leave one out” and “leave three out” analyses (details in Supplemental Material).
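The network distance analysis can be sketched in a few functions. The original analysis was implemented in Java; the Python version below is an independent illustration, not a port of that code. It assumes that connection scores are turned into edge lengths by subtracting them from the maximum score, that the gene-set distance is the average over candidate genes of the minimal shortest-path distance to any known gene, and that the empirical P value is the fraction of random gene sets at least as close as the observed one. All function names are hypothetical.

```python
import heapq
import random


def to_distance_graph(scored_edges):
    """Build an undirected graph with edge length = max score - score.

    scored_edges is a list of (gene_a, gene_b, score) tuples.
    """
    c_max = max(score for _, _, score in scored_edges)
    graph = {}
    for a, b, score in scored_edges:
        weight = c_max - score  # ASSUMPTION: subtraction-based conversion
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def shortest_paths(graph, source):
    """Dijkstra's algorithm: shortest distance from source to reachable genes."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def gene_set_distance(graph, candidates, known):
    """Average, over candidate genes, of the minimal distance to a known gene."""
    total = 0.0
    for t in candidates:
        dist = shortest_paths(graph, t)
        total += min(dist.get(k, float("inf")) for k in known)
    return total / len(candidates)


def permutation_p_value(graph, candidates, known, gene_pool, n_perm=1000, seed=0):
    """Fraction of random candidate sets at least as close to the known genes."""
    rng = random.Random(seed)
    observed = gene_set_distance(graph, candidates, known)
    hits = sum(
        gene_set_distance(graph, rng.sample(gene_pool, len(candidates)), known)
        <= observed
        for _ in range(n_perm)
    )
    return hits / n_perm
```

Running Dijkstra once per candidate gene keeps the sketch simple; for large networks, a single multi-source search started from all known genes would give the same minimal distances with far less work.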

Results

Study Design, Samples, Sequencing, and Quality Control

To identify genes that contribute to FSGS, we analyzed a total of 1284 whole exomes including 662 exomes from individuals diagnosed with, or with family members diagnosed with, FSGS (Figure 1), either sporadic or familial, and 622 exomes from unrelated control individuals (all adults). None of the control individuals were reported to have kidney disease, although, in most cases, they were not extensively studied for kidney-related phenotypes. The number of sequenced individuals per family in our cohort ranged from one to seven (Supplemental Table 1). In total, we sequenced samples from 395 distinct family units, including 337 cases coming from 208 families with multiple affecteds and 187 from sporadic cases (Supplemental Table 1, Table 1). Variants from our cohort were called using the Genome Analysis Toolkit.15,16 To estimate individual-level ancestry and perform quality control, we performed PCA and genetic relationship estimation using the SNPRelate R package.40 Our case and control samples had a world-wide ancestry distribution, with the majority having predominantly recent European ancestry (Figure 2, Table 1). As revealed by PCA-based clustering, the breakdown for families was 73.4%, 21.7%, 4.5%, and 0.25% for European, African American and Admixed American, Asian, and African ancestry, respectively (see Methods).

Figure 1. Data analysis diagram. After quality control, 363 unrelated case families with ancestry-matched controls were included in this study. Rare variants in known disease-causing genes were profiled (genetic diagnosis); common variants association test and gene burden test for moderately rare and extremely rare variants were conducted thereafter.
Figure 2. Quality control for study samples. (A and B) PCA plot for case and control samples mixed with 1000 Genomes Project samples. EUR, European; EAS, East Asian; AMR, admixed American; SAS, South Asian; AA, African American; AFR, African. (C and D) PCA plot for case and control matched samples. (E) QQ plot for the genome-wide association study before case/control matching. (F) QQ plot for the genome-wide association study after case/control matching, where λ is the genomic inflation factor.

Next, given the diverse ancestry distribution, we applied a case/control matching approach to control for population structure in our case/control statistical testing.41,42 After the removal of low–call-rate samples, 363 family units, either unrelated FSGS families or sporadic FSGS cases, with high-quality sequence were used for downstream analysis (Supplemental Table 1). PCA plots for the matched case and control samples are shown in Figure 2, C and D. As shown in Figure 2, E and F, the genomic inflation factor (λ) for a common variant association test dropped from 1.21 to 1.06 after case/control matching, indicative of an improvement in our ability to control for population structure.

Exome Sequence Analysis for Genetic Diagnosis and Identifying Candidate Genes for FSGS

We hypothesized that in our FSGS cohort, approximately 30% of samples would be explained by mutations in previously reported genes on the basis of earlier reports, with a higher rate of mutation identification in the familial cases.43,44 We also hypothesized that mutations in genes not yet identified as FSGS-causing contributed to the development of disease in a nontrivial fraction of these subjects. We therefore performed several analyses: (1) characterizing the rare variant distribution spectrum in known disease-causing genes (i.e., genetic diagnosis); (2) common variant association test; and (3) burden test for moderately rare variants and extremely rare variants. For each class of variants, different kinds of statistical methods were applied to test the association of variants/genes with the disease status.

Rare Variants Identified in Known Disease-Causing Genes

In order to define the distribution of rare variants in known FSGS-associated genes, we performed variant filtering on the basis of each family’s compatibility with the different inheritance models. Variants were filtered on the basis of the annotation of allele frequency and the predicted effects of several in silico genetic prediction tools (see Methods).20 We checked the distribution of these variants in 165 known kidney disease–related genes (excluding kidney stone genes, Supplemental Table 2).6

Known Dominantly Acting Genes

First, dominant inheritance genes were analyzed in the 363 case family units (including a total of 483 affected individuals) and 363 matched controls. The rare variants in each individual were filtered by consistency with a dominant inheritance model (i.e., shared by all affecteds in a family if multiple affecteds were available), allele frequency ≤0.0001 in ExAC and GnomAD databases, as well as CADD≥10. Figure 3A and Supplemental Table 3 illustrate the numbers of families (cases) and controls carrying rare variant genotypes in known disease-associated genes that passed these filters. COL4A5, WT1, UMOD, and INF2 were the genes identified most frequently as harboring likely causal mutations, with only a small number of similar variants identified in controls. However, we identified several known disease genes in which a nontrivial number of rare variant genotypes passing these same filtering conditions were found in the control samples (Figure 3A). For example, genes MYH9, FN1, SALL1, PODXL, TBX18, ROBO2, COL4A6, and SRGAP1 were all found to have a higher burden of such variants in control samples than in cases. There are also seven kidney disease genes in which suspicious variants were detected only in controls, namely LMX1B, SIX2, UPK3A, BMP4, DSTYK, HNF1B, and SIX5.

Figure 3. Rare variants discovered in known genes. Variants were filtered by allele frequency, CADD score, and genotype-phenotype cosegregation. (A) Variants were filtered by dominant model/X-linked model. (B) Variants were filtered by recessive compound heterozygous model. (C) Variants were filtered by single site homozygous recessive model. (D–F) Variants filtered by the same conditions as A, B, and C, respectively, with further variant filtering following ACMG guidelines, where only variants with multiple lines of computational evidence supporting pathogenicity were included. *Variants outside of the known disease-causing domain were not included.

Given this observation, we explored whether a stricter filter using ACMG guidelines could help us better distinguish the causal rare variants from noise. We applied filters to include only those rare variants with multiple lines of computational evidence supporting pathogenicity following these ACMG guidelines. Specifically, we included only nonsynonymous rare variants predicted as damaging by at least three of five algorithms (PolyPhen2,24 SIFT,25 M-CAP,26 LRT,27 and MutationTaster28), as well as splice sites, nonsense sites, and indels that were classified as high-confidence LOF by the LOFTEE plugin of VEP.20 We refer to such variants as LOF+3D. After application of these stricter filtering guidelines, most of the known dominant disease genes in which we observed a high variant burden in controls still retained this trend (Figure 3D). Thus, applying a stricter in silico classification is not sufficient to distinguish causal variants from noise.
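The LOF+3D rule can be expressed as a short predicate. This is a sketch under simplifying assumptions: the consequence labels ("missense", "splice", "nonsense", "indel") and the function name are hypothetical, each predictor's output is reduced to a damaging/not-damaging flag, and the high-confidence LOF call is passed in as a boolean rather than read from real annotation output.

```python
PREDICTORS = ("PolyPhen2", "SIFT", "M-CAP", "LRT", "MutationTaster")


def is_lof_plus_3d(consequence, damaging_calls, loftee_high_confidence=False,
                   min_damaging=3):
    """Return True if a rare variant meets the LOF+3D criteria.

    damaging_calls maps predictor name -> True if that tool calls the variant
    damaging; predictors missing from the mapping count as not damaging.
    """
    if consequence == "missense":
        n_damaging = sum(
            bool(damaging_calls.get(name, False)) for name in PREDICTORS
        )
        return n_damaging >= min_damaging
    if consequence in {"splice", "nonsense", "indel"}:
        # Loss-of-function classes qualify only with a high-confidence call.
        return loftee_high_confidence
    return False
```

Treating a missing prediction as "not damaging" is a conservative choice made for this sketch; how missing scores were handled in the study is not described.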

We further partitioned the rare variants seen in cases according to whether they were observed in patients with familial or sporadic FSGS (Supplemental Figure 3). The rare variants in COL4A5 and WT1 were evenly distributed between familial and sporadic patients. In contrast, rare variants in UMOD and COL4A3 were significantly enriched in those patients with familial inheritance (P value <0.05, Fisher’s exact test). For INF2, TRPC6, and PAX2, we observed trends toward detection in familial cases, but the statistical significance was limited by the smaller numbers.

Known Recessive Disease Genes

For genes following recessive inheritance, 266 affected cases from 236 unrelated family units (one case per family plus sporadic cases) with an inheritance pattern compatible with a recessive model and 236 matched controls were analyzed. For the compound heterozygous model, the rare variants in each family were filtered by: (1) allele frequency ≤0.08 in ExAC and GnomAD databases (a number slightly greater than the most common recessive monogenic FSGS variant NPHS2 p.R229Q, with allele frequency as high as 6.8% in some populations per the GnomAD database); (2) the presence of at least two heterozygous rare variants in the same gene; and (3) at least one of the heterozygous variants lacking any corresponding homozygous genotype found in either the ExAC or GnomAD databases (for details see the Methods section). We also required that this genotype be shared by all of the affecteds in a family. We identified FRAS1, FAT1, and PKHD1 as the three genes most likely to have a genotype consistent with disease contribution under this model (Figure 3B). However, we detected a similar burden of such genotypes in control samples for FRAS1, FAT1, PKHD1, ALMS1, DYNC2H1, CUBN, AHI1, NPHP4, WDR35, WDR19, TTC21B, and RPGRIP1L, as well as C5orf42 (Figure 3B, Supplemental Table 4 illustrated by orange bars). There are also several genes in which suspicious variants were detected only in controls, including ZNF423, PDE6D, NUP93, NPHP3, NEK1, KANK4, ITGB4, ITGA3, IFT140, GRIP1, GRHPR, EMP2, COL4A6, CEP41, CC2D2A, BBS9, and ACE genes. Even after applying stricter in silico thresholds for classification of LOF (LOF+3D sites only), we still find a high burden of variants in these genes in control individuals, including FRAS1, FAT1, FREM2, ADCK4, RPGRIP1L, PKHD1, NPHP4, NPHP3, DYNC2H1, and CUBN genes (Figure 3E).

To assess the hypothesis that these genes are likely to harbor false-positive “disease-causing” genotypes, we used the gene damage index score45 and frequently mutated genes (FLAGS) classification (rare variants in gene “Frequently Mutated in Public Exomes”).46,47 On the basis of the gene damage index prediction, FRAS1, FAT1, ALMS1, DYNC2H1, and CUBN are all significantly enriched for rare variants in the general populations.45 These five genes have also been labeled as FLAGS, as all are observed to be frequently mutated in public exomes, irrespective of the phenotypes studied.46,47 By contrast, for several other genes, the burden of disease-consistent genotypes in cases is much greater than in controls. For example, we saw no candidate disease-causing genotypes in NPHS2 or COL4A4 among controls, but a significant burden in cases. As shown in Supplemental Table 5, the distribution of variant NPHS2 genotypes is consistent with previous reports, with the disease-causing variants seen in combination with the relatively common p.Arg229Gln variant all located in exon 7 or 8.48

Under a homozygous recessive model, the rare variants were filtered by CADD≥10, allele frequency ≤0.01, and the absence of any homozygous genotype in ExAC and GnomAD databases (details in Methods section). Eleven families have rare homozygous variants in the known disease-causing genes C5orf42, COL4A3, IFT80, NPHP3, NPHS2, PLCE1, NUP93, and OFD1 (Figure 3C, Supplemental Table 6). All of these families show a high level of inbreeding (Supplemental Table 6); ten of 11 families had a genome-wide inbreeding coefficient ranked in the top 12% of all of the families tested, and four ranked in the top 1%. All of the presumed causal variants detected in these families are very rare, with the maximum population allele frequency <0.0066 as estimated from the ExAC database.21 However, attributing a causal relationship to variants in C5orf42, IFT80, and OFD1 is difficult, because these genes contained candidate variants that were not predicted to be damaging and were detected in control samples. Except for the variants in these three genes, all of the other missense variants were predicted to be either very strongly damaging or LOF and all were located in evolutionarily conserved regions22 (Supplemental Table 6). All of the rare variant genotypes identified in NPHS2, COL4A3, NPHP3, NUP93, and PLCE1 show compelling genetic evidence for being disease causing (Figure 3, Supplemental Table 6).

Fraction of Families Explained by Rare Variants in Known Genes

By grouping together all of the rare variants detected under the above models, rare variants or rare variant genotypes in known FSGS and FSGS-like disease-causing genes were detected in 49.6% (180 of 363) of unrelated families and sporadics, although 35.5% (129 of 363) of control individuals showed a similar rare variant genotype. This observation in cases is clearly an overestimation of the disease explanation rate, because a nontrivial proportion of genes harbor a similar or higher level of burden in control samples (Figure 3). In order to give a better estimate of the identification of true causal genotypes, we removed those genes that were among the top 100 FLAGS genes46,47 or with a similar or higher level of rare variant burden in controls (with enrichment of rare variants in cases versus controls ≤1.5 in our data set). We refer to these genes as “high control burden genes” hereafter. When we do this, 33.6% (122 of 363) of unrelated families or sporadic cases have rare variants in known kidney disease genes (Figure 4A). By comparison, only 8.3% (30 of 363) of the control individuals have rare variant genotypes that pass the same filtering procedure as used for case samples.

Figure 4. Pie chart of rare variants in known kidney disease–associated genes. Genes with a similar or higher level of burden in control samples were not included here. (A) The proportion of families with rare variants in known genes (blue section) versus without (orange section). (B) A pie chart showing variants in FSGS cases with disease explained by rare variants in known genes.

We observed a higher rate of rare variants in familial cases (37.2%, 70 of 188) than in sporadic cases (29.7%, 52 of 175). These numbers are similar to previous findings in SRNS, where NPHS1, NPHS2, and WT1 are the major contributors.43,44 In our cohort, COL4A5, WT1, and COL4A4 are the top three contributors under a dominant or X-linked model and NPHS2 is the top contributor under a recessive model (Figure 3). The differences in the relative contributions of the major genes reflect differences in subject ascertainment.43,44 COL4A5, WT1, NPHS2, COL4A4, UMOD, COL4A3, INF2, ACTN4, PLCE1, CEP290, PAX2, and TRPC6 accounted for 62.8% of the families with likely causal rare variant genotypes (Figure 4B). For these 12 genes, together with known disease-causing protein domain information, we have very high confidence in assigning disease causality to the rare variants identified.3–5 However, we are not able to use in silico prediction alone to attribute disease causality for high control burden genes (Figure 3). The in silico predictions of CADD and GERP scores of the extremely rare variants (frequency<1/10,000) seen in case and control samples are indistinguishable (P value >0.05, t test; Supplemental Figure 4). Even with stricter filtering (LOF+3D sites), an equal or higher level of burden in control samples remained (Figure 3, lower panel). This highlights the difficulty in using simple sequencing and filtering for classifying variants as disease-causing even in those genes previously reported as disease-associated. Previously reported data as well as future data regarding these genes and their variants therefore need to be interpreted cautiously.

APOL1

Two specific variants in APOL1 (termed G1 and G2) greatly influence the risk of FSGS (with an odds ratio of approximately 10 [95% confidence interval 6.0 to 18.4] under a recessive model), and are common in people of recent African ancestry.49–51 We therefore looked specifically at APOL1. We counted the number of samples carrying two copies of the so-called G1 and G2 risk variants (genotypes G1G1, G2G2, and G1G2, referred to as high-risk APOL1 genotypes). Of the 363 unrelated families studied, 34 families (9.4%) carried a high-risk APOL1 genotype. All of the members tested from these families were found to be admixed individuals (labeled as AA or AMR) by PCA-based ancestry inference (see Methods section), which is consistent with the African origin of the G1 and G2 variants.49–51 None of the control samples studied were found to have a high-risk APOL1 genotype. We note that two sporadic cases and one control are compound heterozygous for rare APOL1 variants (Supplemental Tables 4 and 7). When including APOL1 high-risk genotypes, 41.3% (150 of 363) of families or sporadic cases can be explained by a genetic cause or genetic susceptibility. We note that ten families with high-risk APOL1 genotypes also had rare variants in known kidney disease genes, of which eight families qualify for LOF+3D criteria (Supplemental Table 8). The severity of the phenotype, as assessed by age of disease onset, was not significantly different in the individuals with mutations in a known disease gene together with a high-risk APOL1 genotype compared with those with a high-risk APOL1 genotype alone. In addition, we found no evidence to support a genetic interaction between APOL1 and these other genes, as the number of co-occurrences of these events is less than the value expected under independence.
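The independence argument can be checked with simple arithmetic: if two events occur independently among N families, the expected number of families carrying both is the product of the two counts divided by N. The sketch below applies this to counts reported in the text. Which count of families with rare variants in known genes the authors used is not stated, so both reported figures (122 after excluding high control burden genes, 180 before) are shown as assumptions; the function name is hypothetical.

```python
def expected_cooccurrence(n_total, n_with_a, n_with_b):
    """Expected number of units carrying both events if they are independent."""
    return n_with_a * n_with_b / n_total


# 34 of 363 families carry a high-risk APOL1 genotype.
# ASSUMPTION: either 122 or 180 families carry rare variants in known genes,
# depending on whether high control burden genes are excluded.
expected_strict = expected_cooccurrence(363, 34, 122)   # about 11.4
expected_broad = expected_cooccurrence(363, 34, 180)    # about 16.9
observed = 10  # families with both, as reported in the text
```

Under either assumption the observed count of ten does not exceed the expectation, which is consistent with the statement that there is no evidence of a genetic interaction.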

Age of Disease Onset in the Different Inheritance Models

We looked at the age of disease onset in families with rare variants detected in known disease-associated genes (excluding high control burden genes) after partitioning by inheritance model. Age of onset is significantly lower for those families with the homozygous recessive inheritance model when compared with the other two models (P value <0.05, U test35; Supplemental Figure 5). The age of disease onset in families with dominant inheritance is also slightly higher than that of families with the compound heterozygous model, but not significantly (P value=0.14, U test35).

Analysis for the Effect of Common Variants and Moderately Rare Variants

We performed an association study to try to identify coding variants associated with increased risk of FSGS, given the high-density coverage of common coding variants in the whole-exome–sequencing data. The only locus that reached genome-wide significance was APOL1, specifically the G1 (rs73885319 and rs60910145) variant (P value=1.69E-7; Supplemental Figure 6). This implies that larger sample sizes are required to detect the contribution of common coding variants other than those in APOL1 to FSGS risk.

To look for the possible contribution of moderately rare variants, we used three other burden tests: (1) variant threshold (VT) test, (2) C-α test, and (3) SKAT-O, in order to deal with the unknown optimal cut-off of allele frequency (with the VT test) and heterogeneity of variant effects (with C-α test and SKAT-O). We applied these three methods to our ancestry-matched case/control data for those rare variants with MAF<5%. APOL1, WT1, and COL4A5 were ranked in the top ten by the P values from the VT test (Supplemental Table 9), but none of them reached a genome-wide level of significance. The APOL1 signal was driven by the G1 and G2 variants, and the WT1 and COL4A5 signals were driven by the very rare variants as discussed earlier (Figure 3, Supplemental Table 3).

Gene-Based Burden Tests for Extremely Rare Variants

Next, we wanted to determine whether additional new genes could also be identified as likely contributors to the development of FSGS, given that only 33.6% of families with disease can be explained by rare variants in known Mendelian disease genes. We performed a single-group burden test and a case-control burden test. In the single-group burden test, the observed number of rare variants was compared with the expected number estimated from the whole-genome background (Supplemental Figure 1A).33,34 In the case-control burden test, the number of observed rare variants in cases was compared with the observed number in controls (Supplemental Figure 1B). The statistical power of the single-group test was tuned by testing several combinations of different parameters. As shown in Supplemental Figure 2, we found that the best combination of parameters for the single-group analysis was to use missense and LOF variants (splice-altering, nonsense, frameshift, and indels) with zero allele frequency as estimated from the GnomAD database.21 In this setting, the positive control genes WT1, COL4A5, ACTN4, INF2, TRPC6, and PAX2 had the largest total information content I (that is, the smallest overall P values for these genes) and the best overall ranking (summation of the rank of each gene; Supplemental Figure 2, A and B).
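The idea behind the single-group test, comparing an observed count of rare variants in a gene with the count expected from background, can be sketched as an upper-tail probability. The Poisson model below is an assumption made for illustration; the exact statistic used in the study is described in the Methods and may differ. The function name and the example counts are hypothetical.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected), summed directly over the upper tail."""
    if observed <= 0:
        return 1.0
    # First tail term, computed in log space to avoid overflow in the factorial.
    term = math.exp(-expected + observed * math.log(expected) - math.lgamma(observed + 1))
    total = 0.0
    k = observed
    # Add successive terms until they no longer change the sum.
    while term > 0.0 and term > total * 1e-17:
        total += term
        k += 1
        term *= expected / k
    return min(1.0, total)


# Hypothetical gene: 0.5 qualifying variants expected from background, 5 observed in cases.
p_value = poisson_upper_tail(5, 0.5)
print(f"single-group P value (illustrative): {p_value:.3g}")
```

Summing the upper tail directly, rather than computing one minus the lower tail, keeps precision for the very small P values that strong genes produce.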

To help visualize and distinguish the distribution of disease-associated genes, we compiled the information from the single-group burden test with that of the case-control burden test in a single plot. We subtracted the number of rare variants seen in each gene in controls from that of cases (x axis of Figure 5) and plotted this difference against the P value of single-group analysis (y axis of Figure 5). Using this graphing method, the known disease-causing genes PAX2, NPHS2, TRPC6, WT1, COL4A5, INF2, and ACTN4 are all located in the top-right quadrant of the graph (Figure 5), as expected.

Figure 5.
Burden test for extremely rare variants of the dominant model. Volcano plot for “single-group” analysis (comparison of observed rare variant rate with expected rate of each gene) and “case-control” analysis. The x axis shows the difference in the number of families with qualified rare variants between case and control samples. The y axis height of each dot shows the −log10(P value) of the “single-group” analysis, where on the left (x<0), P value is on the basis of the analysis of control samples, whereas in the right part of this figure (x>0), each dot shows the P value of a gene from the analysis of case samples.

In order to quantitatively determine whether the distribution of candidate genes in cases compared with controls was asymmetric, we compared the rank order of genes from the single-group analysis. We expected that the genes from the case group would have a smaller rank sum (with smaller P value) than those from the control group, because the case group should include rare variants that are in fact causally related to the disease in addition to statistical noise. As shown in Table 2, we observed a significantly smaller rank sum in the case group when either the top 50 or 75 genes were chosen for the analysis (P value=0.02 and 0.01, respectively). The same trend was also found when the top 100 genes were compared (P value=0.08). As a negative control, we analyzed rare synonymous variants (zero allele frequency in GnomAD database) in the same manner as we did for the rare missense and LOF variants. In the case of the synonymous variants, a smaller rank sum from the case group was not observed (Supplemental Table 10). In fact, there was a trend toward lower rank sum in controls (Supplemental Table 10). Therefore, despite a limited number of genes reaching genome-wide significance, we conclude that there is an enrichment of true disease-causing rare variants in the case group, where enrichment of disease-causing rare missense and LOF variants is driving genes to have smaller rank sum in the case group.

Table 2. Higher rank of genes from case samples ranked by missense and LOF variants

| Top Genes | Case Rank Sum | Control Rank Sum | P Value |
|---|---|---|---|
| 50 | 2212 | 2838 | 0.03 |
| 75 | 5044 | 6281 | 0.02 |
| 100 | 9475 | 10,625 | 0.16 |

Comparison of the gene rank sum between the case and control groups by Wilcoxon rank-sum test. Equal numbers of top genes were picked from the “single-group” test of case and control samples, respectively, then these genes were ranked together on the basis of the “single-group” test P value. The rank sums of genes from the case group were compared with those from the control group. Under the null hypothesis, the total rank sums of the case and control groups are expected to be equal.
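The rank-sum comparison in Table 2 can be checked with a few lines of arithmetic. The sketch below assumes a normal approximation to the Wilcoxon rank-sum test without tie or continuity correction, which is not necessarily the exact procedure the authors used; the function name is hypothetical. It also checks that each pair of rank sums adds up to N(N+1)/2, as it must when 2 × (top genes) genes are ranked together.

```python
import math


def rank_sum_z_and_p(case_rank_sum: float, n_case: int, n_control: int):
    """Normal approximation to the Wilcoxon rank-sum test (no tie or continuity correction).

    Returns (z, one-sided P for a smaller case rank sum, two-sided P).
    """
    n_total = n_case + n_control
    mean = n_case * (n_total + 1) / 2
    sd = math.sqrt(n_case * n_control * (n_total + 1) / 12)
    z = (case_rank_sum - mean) / sd
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))     # P(Z <= z)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # 2 * P(Z >= |z|)
    return z, p_lower, p_two_sided


# (top genes per group, case rank sum, control rank sum), from Table 2
TABLE_2 = [(50, 2212, 2838), (75, 5044, 6281), (100, 9475, 10625)]

for n_top, case_sum, control_sum in TABLE_2:
    n_total = 2 * n_top
    # Ranks 1..N always sum to N(N+1)/2.
    assert case_sum + control_sum == n_total * (n_total + 1) // 2
    z, p_lower, p_two_sided = rank_sum_z_and_p(case_sum, n_top, n_top)
    print(f"top {n_top}: z={z:.2f}, one-sided P={p_lower:.3f}, two-sided P={p_two_sided:.3f}")
```

Under these assumptions the two-sided values round to the P values listed in Table 2, and the one-sided values are half as large. That would be consistent with the smaller P values quoted in the text being one-sided, although this is an inference and is not stated in the source.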

Ranking FSGS Genes Using the Rare Variant Burden Test

We next combined the burden information from the y axis (single-group analysis) and x axis (case-control analysis) shown in Figure 5 as a single value using Fisher’s method. The combination of the P values as a single value provides a simpler gene rank list. As shown in Table 3, the top four genes in this analysis were WT1, COL4A5, TRPC6, and NPHS2. The known disease-causing genes INF2, ACTN4, and PAX2 also rank near the top of this list, with ranks of 6, 20, and 56, respectively. As expected, we found a significant enrichment of our top-ranked genes toward those genes enriched for glomerular expression (Supplemental Figure 7, Supplemental Table 11).36 When the top 20 genes were considered, the enrichment odds ratio was 8.51 (P value <0.001; 95% confidence interval, 3.21 to 22.09). The top 60 genes ranked by this method (and the associated rare variants) are listed in Supplemental Tables 12 and 13.

Table 3. Top 20 ranked genes using rare variant burden test

| Order | Gene | No. Cases | Single-Group P Value | No. Controls | Case-Control P Value | Combined P Value |
|---|---|---|---|---|---|---|
| 1 | WT1 a | 14 | 8.27E−19 | 0 | 5.54E−05 | 1.93E−19 |
| 2 | COL4A5 a | 14 | 1.93E−12 | 0 | 6.05E−05 | 1.20E−14 |
| 3 | TRPC6 a | 5 | 4.67E−05 | 0 | 3.09E−02 | 2.22E−05 |
| 4 | NPHS2 a | 5 | 1.91E−04 | 0 | 3.11E−02 | 6.38E−05 |
| 5 | SCAF1 | 5 | 3.10E−04 | 0 | 2.89E−02 | 8.42E−05 |
| 6 | INF2 a | 7 | 6.71E−05 | 2 | 8.01E−02 | 1.11E−04 |
| 7 | SETD2 | 6 | 1.11E−03 | 0 | 1.57E−02 | 1.14E−04 |
| 8 | LY9 | 5 | 9.10E−04 | 0 | 3.10E−02 | 2.12E−04 |
| 9 | ANKRD34A | 3 | 9.21E−05 | 0 | 1.22E−01 | 2.63E−04 |
| 10 | CDC16 | 4 | 8.16E−04 | 0 | 6.25E−02 | 4.63E−04 |
| 11 | KCNS1 | 4 | 9.75E−04 | 0 | 6.21E−02 | 5.23E−04 |
| 12 | SMC6 | 4 | 1.17E−03 | 0 | 6.23E−02 | 6.03E−04 |
| 13 | PRCP | 4 | 1.40E−03 | 0 | 6.25E−02 | 6.90E−04 |
| 14 | ATP13A3 | 4 | 1.64E−03 | 0 | 6.25E−02 | 7.80E−04 |
| 15 | RNF115 | 3 | 4.94E−04 | 0 | 1.24E−01 | 8.32E−04 |
| 16 | NCAPG | 4 | 1.84E−03 | 0 | 6.16E−02 | 8.34E−04 |
| 17 | MAP3K15 | 5 | 5.63E−03 | 0 | 3.12E−02 | 9.35E−04 |
| 18 | MAATS1 | 6 | 4.81E−04 | 2 | 1.44E−01 | 1.02E−03 |
| 19 | GTPBP2 | 3 | 7.08E−04 | 0 | 1.23E−01 | 1.04E−03 |
| 20 | ACTN4 a | 5 | 1.03E−03 | 1 | 1.08E−01 | 1.13E−03 |

The summary statistics of the top 20 ranked genes. Genes were ordered by the combined P value from the single-group analysis and the case-control analysis using Fisher’s method.
a Known disease-causing genes.
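For readers unfamiliar with Fisher’s method, the following is a minimal sketch of its plain, unweighted form: the statistic X = −2 Σ ln(p_i) is referred to a chi-square distribution with 2k degrees of freedom, whose survival function has a closed form for even degrees of freedom. The function name and the input P values are hypothetical. This textbook version need not reproduce the combined values in Table 3 exactly, because those depend on implementation details given in the Methods.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method (unweighted).

    Uses the closed-form chi-square survival function for 2k degrees of freedom:
    sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return min(1.0, math.exp(-half_stat) * series)


# Hypothetical single-group and case-control P values for one gene.
combined = fisher_combined_p([0.01, 0.05])
print(f"combined P value (illustrative): {combined:.5f}")
```

For two P values the expression reduces to p1·p2·(1 − ln(p1·p2)), so the combined value is always larger than the simple product.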

Network-Based Analysis of Top Candidate Genes

Next, we wanted a better sense as to whether it is likely that some of our highly ranked new genes are true contributors to kidney disease and FSGS. To test this, we compared the gene set distance between the set of our top-ranked new genes and previously identified genes associated with kidney disease or with FSGS (Supplemental Tables 2 and 14), because previous studies have shown that disease-causing genes tend to cluster together.52,53 We curated two sets of disease-causing genes, Kfsgs (Supplemental Table 14) and Kexpanded (Supplemental Table 2). The first gene set, Kfsgs, comprises genes with compelling genetic evidence as causative of FSGS or FSGS-like phenotypes when mutated; this set is called the FSGS panel.6 The other set, Kexpanded, is a mixture of genes that are associated with a broader range of kidney phenotypes (excluding kidney stones).6 We referred to this set as the EXPANDED panel. The shorter distance between disease-causing genes was validated by a shorter distance between the Kexpanded gene panel (excluding genes in Kfsgs) and the Kfsgs gene panel than the distance between disease-causing genes and the randomly picked genes (P value ≤0.002; Figure 6, Supplemental Table 15). Next, we computed the gene set distance between the 50 top-ranked new genes (T, Supplemental Table 12, with known kidney disease risk genes excluded if any) and the Kfsgs or Kexpanded gene panel, respectively. As shown in Figure 6 and Supplemental Table 15, compared with a randomly chosen set of 50 genes, the top 50 genes ranked from the control group, or the top 50 genes ranked by synonymous rare variants from the case group, our top 50 new genes had shorter distances to the known disease-causing genes. 
This held true for both the STRING and the inBio Map networks (Figure 6, Supplemental Table 15).38,39 Fixing these 50 new genes and bootstrapping the known genes 1000 times by choosing random genes, we saw that the top 50 new genes were also significantly closer to known disease-associated genes than to genes chosen by random bootstrapping (Z score ≤−1.87, P value ≤0.02).
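The bootstrap comparison described above can be sketched on a toy network. Everything in this example is an assumption made for illustration: the distance definition (mean shortest-path length from each candidate gene to its nearest known gene), the unweighted graph, the gene names, and the function names. The study's actual distance measure and normalization are defined in the Methods, and the real analysis used the STRING and inBio Map networks.

```python
import random
import statistics
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def gene_set_distance(graph, candidates, known):
    """Mean distance from each candidate to its nearest known gene (assumes a connected graph)."""
    total = 0
    for gene in candidates:
        dist = bfs_distances(graph, gene)
        total += min(dist[k] for k in known if k in dist)
    return total / len(candidates)


def bootstrap_z(graph, candidates, known, n_boot=1000, seed=0):
    """Z score of the observed distance against random replacements of the known gene set."""
    rng = random.Random(seed)
    pool = sorted(g for g in graph if g not in candidates)
    observed = gene_set_distance(graph, candidates, known)
    null = [
        gene_set_distance(graph, candidates, rng.sample(pool, len(known)))
        for _ in range(n_boot)
    ]
    return (observed - statistics.mean(null)) / statistics.pstdev(null)


# Toy network: a chain g0 - g1 - ... - g9 (hypothetical gene names).
names = [f"g{i}" for i in range(10)]
toy_graph = {n: [] for n in names}
for left, right in zip(names, names[1:]):
    toy_graph[left].append(right)
    toy_graph[right].append(left)

known_genes = ["g0", "g1"]
candidate_genes = ["g2", "g3"]
observed_distance = gene_set_distance(toy_graph, candidate_genes, known_genes)
z_score = bootstrap_z(toy_graph, candidate_genes, known_genes)
print(f"observed distance: {observed_distance}, Z score: {z_score:.2f}")
```

A negative Z score means the candidates sit closer to the known genes than to randomly chosen gene sets of the same size, which is the direction of the effect reported in the text.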

Figure 6.
Closer network distance of top ranked genes with known kidney disease–associated genes than a random set of genes. We computed the network distance between the 50 top-ranked genes (with known kidney disease–associated genes excluded) identified from the rare variant burden test under a dominant model and known disease-associated genes. “fsgs” refers to genes known to cause FSGS when mutated; “expanded” means an expanded set of genes that are associated with various kidney disease phenotypes. “expanded.fsgs” indicates the normalized network distance between the “fsgs” panel and the “expanded” gene panel (excluding overlaps with “fsgs” panel) as a positive control. “fsgs.CASE” indicates the network distance of the 50 top-ranked genes from case samples on the basis of rare missense variants and LOF variants compared with the FSGS panel. “expanded.CASE” indicates the distance from the top 50 genes to the “expanded” panel. “fsgs.CTRL” and “expanded.CTRL” are control analyses that indicate the network distance between the 50 top-ranked genes from the control samples chosen by the burden of rare missense variants and LOF variants and the “fsgs” or “expanded” panels. “fsgs.CASE.syn” and “expanded.CASE.syn” indicate the network distance between the 50 top-ranked genes from case samples by synonymous rare variants and the “fsgs” or “expanded” panels. The background distribution was estimated by bootstrapping 1000 times, and randomly picking 50 genes from the pool of all genes to replace the 50 top-ranked genes. (A) Network distance measured on the basis of the structure from the STRING network. (B) Network distance measured on the basis of the structure from the inBio Map network.

To determine whether this shorter distance of our top-ranked new genes was driven by a few genes or driven by at least several genes, we performed “leave one out” and “leave three out” analyses. As shown in Supplemental Figure 8, the shorter distance was robust to these analyses, confirming that the shorter distance was driven by at least several genes ranked among the top. Taken together, these results strongly suggest that among the variants identified in the most highly ranked genes (Supplemental Table 12, Table 3), a nontrivial subset are in fact true contributors to the development of FSGS.

Discussion

In this study, we performed exome sequencing in a large set of individuals diagnosed with familial or sporadic FSGS and compared this data with exome sequence data from control individuals. We identified rare nonsynonymous variants and LOF variants in Mendelian disease genes that are plausible contributors to disease in 49.6% of cases (not including APOL1, a non-Mendelian FSGS risk gene). This is almost certainly an overestimate of the true percentage of families explained by known disease genes, because (1) similar rare variant genotypes were detected in 35.5% of controls and (2) for a number of known kidney disease genes (specifically MYH9, FN1, SALL1, PODXL, TBX18, ROBO2, COL4A6, SRGAP1, FRAS1, FAT1, PKHD1, ALMS1, DYNC2H1, CUBN, AHI1, NPHP4, WDR35, WDR19, RPGRIP1L, and C5orf42) we identified rare variant genotypes at similar or higher frequencies in controls as in cases. A more conservative estimate, excluding those genes with an equal or higher variant burden in controls, puts the percentage at 33.6% for cases compared with 8.3% in controls. In our analysis, a higher explanation rate was found in familial cases: 37.2% of familial cases and 29.7% of sporadic cases could be attributed to variants in known FSGS genes. This is consistent with the notion that familial cases are more likely to be the result of single gene defects than nonfamilial cases.

Our collection of FSGS subjects reflects the specific nature of our group’s ascertainment over the past 20 years, and may not correlate closely with genetic variant detection in the clinical setting. The same caveats should be applied to clinical decision-making on the basis of other published reports. Nevertheless, our results do suggest that (1) a rare genotype consistent with an inherited form of FSGS or FSGS-like injury is present in about a third of individuals with the histologic diagnosis of FSGS; (2) in a significant subset of these cases, the genotype is of unclear significance, because genetic variants indistinguishable from putative disease-causing variants are found in several Mendelian disease genes in control individuals at rates similar to that in cases. Even by applying stricter genetic score–based filtering according to ACMG guidelines54 in which multiple lines of in silico mutation effect prediction were used, a similar or higher rare variant burden in controls was still present in several genes. This demonstrates the difficulty in using filtering and in silico prediction alone for judging the pathogenicity of rare variants in many genes reported to cause FSGS (or kidney diseases that can phenocopy FSGS). After filtering by ACMG guidelines (and for consistency with observed inheritance pattern in familial cases), the only genes in which the remaining variants were limited to cases but not controls were COL4A5, WT1, INF2, ACTN4 (but only when restricting to the known disease-associated domain), and TRPC6, under a dominant model; and NPHS2, PLCE1, OFD1, NPHP3, COL4A3, C5orf42, CEP290, TTC8, TTC21B, TMEM67, NPHS1, LRIG2, LAMB2, ITGA8, COQ6, and ALMS1, under a recessive model. However, attributing disease to genes with a high control variant burden, such as C5orf42, TTC21B, and ALMS1, should be done with extreme caution.

The difficulty of assigning disease causality to rare variants was also demonstrated by the large proportion of cases with candidate rare variants detected in more than one disease-associated gene. In 14.9% (54 of 363) of familial cases or sporadic cases, we detected two or more genes with a candidate rare variant genotype. After removing those genes with a similar or higher burden in control samples, 5.9% (21 of 363) of the familial and sporadic cases still remain with more than one candidate mutation.

Among the merits of this study is the analysis of case samples together with a set of well matched controls. On the basis of the observation of many putative disease genes in which the rare variant burden in cases was similar to that in control samples, we conclude that special care must be taken to assign disease causality in those genes based solely on in silico genetic prediction. When we used a larger set of candidate genes for analysis, rare variant genotypes were detected in 49.6% (180 of 363) of cases, although a surprisingly high 35.5% (129 of 363) of control individuals showed similar rare variant genotypes. The high rare variant burden in known kidney disease genes in controls was also highlighted recently in a large cohort with 7974 self-declared healthy adults in which such genes were analyzed.55 Panels of genes for analysis in individuals with kidney disease are being used with increasing frequency in the clinic. Our results emphasize that substantial caution should be taken in both the design and interpretation of such panels.

ACMG guidelines are widely used for reducing the false-positive rate in claiming a variant to be disease related. After applying ACMG guidelines in our cohort, 30.9% (112 of 363) of case families have a rare variant genotype in a plausible kidney disease gene, compared with 18.7% (68 of 363) in controls (Figure 1). Thus, a nontrivial proportion of rare variant genotypes remained in control samples after application of this strict filter. Removing high control burden genes gives the values 24.2% and 5.5%, respectively (Figure 1). The high proportion in control samples illustrates the limited help of ACMG guidelines in this setting. Reasons for this may include incomplete penetrance of disease-causing variants and incorrect labeling of some genes as disease-causing in the existing literature.

After filtering by ACMG guidelines and removal of genes with high variant burden in controls, there are still 5.5% of control samples with rare variants in previously reported disease genes that appear similar to what are presumed to be disease-causing variants in cases. This may be because of several reasons: (1) incomplete penetrance, (2) the existence of “neutral” or unknown significance rare variants in our genome, (3) false-positive calls from sequencing errors or unknown phasing information under a compound heterozygous model, and (4) limited power to filter out “neutral” rare variants and false-positive genes. This nontrivial number cannot be attributed to undetected kidney phenotypes in the controls, because rare variants were observed in cases but almost none in controls for several well established disease-causing genes, including COL4A5, WT1, INF2, ACTN4, TRPC6, NPHS2, NPHP3, and APOL1. Unlike other genes with very high control variant burdens, these genes are not “contaminated” by controls in this study. The proportion of case samples with variants that are “neutral” or of unknown significance is expected to be <5.5%, because we have >30% of families in which we sequenced two or more members (and 17% in which we sequenced three or more members). Therefore, we are better able to filter out false-positive variants by checking cosegregation in case families.

On the other hand, this strict filtering removed variant genotypes in several case families which are likely to be in fact disease-causing (for example, in WT1 and NPHS2). There are several possible reasons for this. For example, there is discordance between different genetic prediction algorithms. The availability of increasing numbers of large next-generation sequencing cohorts, together with better resolution of regional constraints in genes and protein domains, is expected to help improve the sensitivity and specificity of in silico prediction tools.56 In the case of two well known autosomal dominant FSGS genes, INF2 and ACTN4, the variants we found in control samples did not localize to the well defined disease-causing domains (Figure 3A, Supplemental Table 3).3,5 This observation highlights the importance of the delineation of disease-causing domains using both functional studies and further analyses of variant localization in large case and control populations. It also suggests that even for genes that are thought to cause a Mendelian form of kidney disease, assigning causality to a specific variant is not simple, particularly when the biology of disease causation is poorly understood and disease-specific domains are not known.

This study confirms the observation that mutations in type IV collagen genes are a frequent contributor to FSGS.11,57 It also confirms that WT1 mutations, which cause a variety of syndromes of abnormal urogenital development with glomerulopathy, are a frequent cause of inherited FSGS.44,58,59 In addition, UMOD variants, a known cause of CKD characterized by tubule dysfunction, can present with a histologic lesion of FSGS (presumably developing secondary to the tubulopathy).60

Although we were not able to identify new FSGS genes in this study, our results make us confident that our top-ranking candidate genes include some that do in fact contribute to FSGS risk. Specifically, in a network analysis, we find that the top 50 genes on our list (after eliminating the known genes) are significantly closer to known FSGS genes as well as a longer list of kidney disease genes than they are to a randomly chosen set of genes. By comparing the rare variant distribution in cases with controls and by comparing the distribution pattern of missense and LOF variants with synonymous rare variants, we observe that there is enrichment of missense and LOF rare variants in case samples (Figure 5, Supplemental Table 10, Table 2). We conclude, therefore, that some of these genes and their variants are almost certainly contributing to FSGS. However, without either analyses of larger families (e.g., large families segregating these variants) or larger sample sizes (more individuals with familial and/or sporadic FSGS), we cannot at present determine which of these genes are most likely to be true contributors to disease.

Disclosures

Dr. Friedman and Dr. Pollak are co-inventors on patents related to APOL1 diagnostics and therapeutics. Dr. Friedman and Dr. Pollak receive research support from and have consulted for Vertex Pharmaceuticals. Dr. Friedman and Dr. Pollak have equity in Apolo1Bio.

Funding

This work was supported by National Institutes of Health grant R01DK54931 (to Dr. Pollak) and by National Institutes of Health award UM1HG006504-08 (Yale Center for Mendelian Genomics, to Dr. Lifton and Dr. Mane). Dr. Chun was supported by an Alberta Innovates Health Solutions Clinician Fellowship and a KRESCENT (Kidney Research Scientist Core Education and National Training Program) postdoctoral fellowship.

Published online ahead of print. Publication date available at www.jasn.org.

We thank the study participants.

Dr. Wang, Dr. Chun, and Dr. Pollak designed the study. Ms. Wilkins, Ms. Benjamin, Ms. Knob, and Dr. Appel carried out patient ascertainment and managed the DNA sequencing. Dr. Wang analyzed the data with the help of Dr. Chun, Dr. Genovese, Dr. Friedman, and Dr. Pollak. Dr. Lifton and Dr. Mane provided control sample data and performed exome sequencing in both cases and controls. Dr. Wang, Dr. Chun, and Dr. Pollak drafted and revised the paper.

Supplemental Material

This article contains the following supplemental material online at http://jasn.asnjournals.org/lookup/suppl/doi:10.1681/ASN.2019020152/-/DCSupplemental.

Detailed methods.

Supplemental Figures:

Figure S1: Illustration figures for the “single-group” analysis (A) and “case-control” analysis (B).

Figure S2: “Single-group” burden test for extreme rare variants of dominant model, showing the statistics for known FSGS genes WT1, COL4A5, ACTN4, INF2, TRPC6, and PAX2.

Figure S3: The rare variant distribution in familial and sporadic families for each dominant disease-causing gene.

Figure S4: The comparison of rare variant CADD and GERP score between the case and control samples for those dominant genes with high burden in control samples.

Figure S5: Diagnosis age distribution for affected individuals with rare variants detected in known disease-causing genes (without genes with high level of burden in control samples), partitioned by inheritance model.

Figure S6: Manhattan plot for the results of the association test of the exonic common variants.

Figure S7: The enrichment of renal glomerulus expression enriched genes in top ranked genes.

Figure S8: Robustness analysis of gene set distance comparison.

Supplemental Tables:

Table S1. Summary of study samples.

Table S2. The gene list of kidney disease-associated genes.

Table S3. The rare variants in dominant/X-linked kidney disease-associated genes.

Table S4. The rare variants in recessive (compound-heterozygous) kidney disease-associated genes.

Table S5. The rare variants in NPHS2 gene.

Table S6. The rare variants in recessive kidney disease-associated genes.

Table S7. Individuals compound heterozygous for rare APOL1 variants.

Table S8. Rare variants detected in APOL1 high risk genotype families.

Table S9. Top 100 genes for moderate and rare variant burden test.

Table S10. Comparison of gene rank sum between case and control group, genes were ranked by synonymous sites.

Table S11. The enrichment of glomerulus-expression enriched genes in top ranked genes.

Table S12. Top ranked 60 genes, known genes and new genes.

Table S13. The rare variants in top ranked 60 genes (Table S12).

Table S14. FSGS disease associated genes.

Table S15. Network distance of top ranked new genes with known genes.

References

1. Rosenberg AZ, Kopp JB: Focal segmental glomerulosclerosis. Clin J Am Soc Nephrol 12: 502–517, 2017 28242845
2. Lovric S, Ashraf S, Tan W, Hildebrandt F: Genetic testing in steroid-resistant nephrotic syndrome: When and how? Nephrol Dial Transplant 31: 1802–1813, 2016 26507970
3. Kaplan JM, Kim SH, North KN, Rennke H, Correia LA, Tong HQ, et al.: Mutations in ACTN4, encoding α-actinin-4, cause familial focal segmental glomerulosclerosis. Nat Genet 24: 251–256, 2000 10700177
4. Winn MP, Conlon PJ, Lynn KL, Farrington MK, Creazzo T, Hawkins AF, et al.: A mutation in the TRPC6 cation channel causes familial focal segmental glomerulosclerosis. Science 308: 1801–1804, 2005 15879175
5. Brown EJ, Schlöndorff JS, Becker DJ, Tsukaguchi H, Tonna SJ, Uscinski AL, et al.: Mutations in the formin gene INF2 cause focal segmental glomerulosclerosis. Nat Genet 42: 72–76, 2010 20023659
6. Vivante A, Hildebrandt F: Exploring the genetic basis of early-onset chronic kidney disease. Nat Rev Nephrol 12: 133–146, 2016 26750453
7. Santín S, García-Maset R, Ruíz P, Giménez I, Zamora I, Peña A, et al.; FSGS Spanish Study Group: Nephrin mutations cause childhood- and adult-onset focal segmental glomerulosclerosis. Kidney Int 76: 1268–1276, 2009 19812541
8. Stokman MF, Renkema KY, Giles RH, Schaefer F, Knoers NV, van Eerde AM: The expanding phenotypic spectra of kidney diseases: Insights from genetic studies. Nat Rev Nephrol 12: 472–483, 2016 27374918
9. Mistry K, Ireland JH, Ng RC, Henderson JM, Pollak MR: Novel mutations in NPHP4 in a consanguineous family with histological findings of focal segmental glomerulosclerosis. Am J Kidney Dis 50: 855–864, 2007 17954299
10. Al-Romaih KI, Genovese G, Al-Mojalli H, Al-Othman S, Al-Manea H, Al-Suleiman M, et al.: Genetic diagnosis in consanguineous families with kidney disease by homozygosity mapping coupled with whole-exome sequencing. Am J Kidney Dis 58: 186–195, 2011 21658830
11. Gast C, Pengelly RJ, Lyon M, Bunyan DJ, Seaby EG, Graham N, et al.: Collagen (COL4A) mutations are the most frequent mutations underlying adult focal segmental glomerulosclerosis. Nephrol Dial Transplant 31: 961–970, 2016 26346198
12. Brown EJ, Pollak MR, Barua M: Genetic testing for nephrotic syndrome and FSGS in the era of next-generation sequencing. Kidney Int 85: 1030–1038, 2014 24599252
13. Andrews S: FastQC: A quality control tool for high throughput sequence data, 2010. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed July X, 2019
14. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26: 589–595, 2010 20080505
15. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491–498, 2011 21478889
16. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al.: From FastQ data to high confidence variant calls: The genome analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 43: 11.10.1–11.10.33, 2013
17. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al.: Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44[D1]: D733–D745, 2016 26553804
18. Zhao H, Sun Z, Wang J, Huang H, Kocher JP, Wang L: CrossMap: A versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30: 1006–1007, 2014 24351709
19. Han L, Abney M: Identity by descent estimation with dense genome-wide genotype data. Genet Epidemiol 35: 557–567, 2011 21769932
20. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al.: The ensembl variant effect predictor. Genome Biol 17: 122, 2016 27268795
21. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al.; Exome Aggregation Consortium: Analysis of protein-coding genetic variation in 60,706 humans. Nature 536: 285–291, 2016 27535533
22. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S: Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput Biol 6: e1001025, 2010 21152010
23. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J: A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46: 310–315, 2014 24487276
24. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al.: A method and server for predicting damaging missense mutations. Nat Methods 7: 248–249, 2010 20354512
25. Kumar P, Henikoff S, Ng PC: Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 4: 1073–1081, 2009 19561590
26. Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, Stenson PD, Cooper DN, et al.: M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet 48: 1581–1586, 2016 27776117
27. Chun S, Fay JC: Identification of deleterious mutations within three human genomes. Genome Res 19: 1553–1561, 2009 19602639
28. Schwarz JM, Rödelsperger C, Schuelke M, Seelow D: MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods 7: 575–576, 2010 20676075
29. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, et al.; 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526: 68–74, 2015
30. Devlin B, Roeder K: Genomic control for association studies. Biometrics 55: 997–1004, 1999 11315092
31. Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM, et al.: Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37: 1243–1246, 2005 16228001
32. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ: Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4: 7, 2015 25722852
33. Besse W, Dong K, Choi J, Punia S, Fedeles SV, Choi M, et al.: Isolated polycystic liver disease genes define effectors of polycystin-1 function. J Clin Invest 127: 1772–1785, 2017 28375157
34. Jin SC, Homsy J, Zaidi S, Lu Q, Morton S, DePalma SR, et al.: Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat Genet 49: 1593–1601, 2017 28991257
35. Fay MP, Proschan MA: Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Stat Surv 4: 1–39, 2010 20414472
36. Ding F, Tan A, Ju W, Li X, Li S, Ding J: The prediction of key cytoskeleton components involved in glomerular diseases based on a protein-protein interaction network. PLoS One 11: e0156024, 2016 27227331
37. Sedgewick R, Wayne K: Algorithms, Boston, Addison-Wesley, 2011
38. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, et al.: STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33: D433–D437, 2005 15608232
39. Li T, Wernersson R, Hansen RB, Horn H, Mercer J, Slodkowicz G, et al.: A scored human protein-protein interaction network to catalyze genomic interpretation. Nat Methods 14: 61–64, 2017 27892958
40. Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS: A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28: 3326–3328, 2012 23060615
41. Luca D, Ringquist S, Klei L, Lee AB, Gieger C, Wichmann HE, et al.: On the use of general control samples for genome-wide association studies: Genetic matching highlights causal variants. Am J Hum Genet 82: 453–463, 2008 18252225
42. Epstein MP, Duncan R, Broadaway KA, He M, Allen AS, Satten GA: Stratification-score matching improves correction for confounding by population stratification in case-control association studies. Genet Epidemiol 36: 195–205, 2012 22714934
43. Giglio S, Provenzano A, Mazzinghi B, Becherucci F, Giunti L, Sansavini G, et al.: Heterogeneous genetic alterations in sporadic nephrotic syndrome associate with resistance to immunosuppression. J Am Soc Nephrol 26: 230–236, 2015 25060053
44. Trautmann A, Bodria M, Ozaltin F, Gheisari A, Melk A, Azocar M, et al.; PodoNet Consortium: Spectrum of steroid-resistant and congenital nephrotic syndrome in children: The PodoNet registry cohort. Clin J Am Soc Nephrol 10: 592–600, 2015 25635037
45. Itan Y, Shang L, Boisson B, Patin E, Bolze A, Moncada-Vélez M, et al.: The human gene damage index as a gene-level approach to prioritizing exome variants. Proc Natl Acad Sci U S A 112: 13615–13620, 2015 26483451
46. Shyr C, Tarailo-Graovac M, Gottlieb M, Lee JJ, van Karnebeek C, Wasserman WW: FLAGS, frequently mutated genes in public exomes. BMC Med Genomics 7: 64, 2014 25466818
47. Fuentes Fajardo KV, Adams D, Mason CE, Sincan M, Tifft C, Toro C, et al.; NISC Comparative Sequencing Program: Detecting false-positive signals in exome sequencing. Hum Mutat 33: 609–613, 2012 22294350
48. Tory K, Menyhárd DK, Woerner S, Nevo F, Gribouval O, Kerti A, et al.: Mutation-dependent recessive inheritance of NPHS2-associated steroid-resistant nephrotic syndrome. Nat Genet 46: 299–304, 2014 24509478
49. Genovese G, Friedman DJ, Ross MD, Lecordier L, Uzureau P, Freedman BI, et al.: Association of trypanolytic ApoL1 variants with kidney disease in African Americans. Science 329: 841–845, 2010 20647424
50. Tzur S, Rosset S, Shemer R, Yudkovsky G, Selig S, Tarekegn A, et al.: Missense mutations in the APOL1 gene are highly associated with end stage kidney disease risk previously attributed to the MYH9 gene. Hum Genet 128: 345–350, 2010 20635188
51. Kopp JB, Nelson GW, Sampath K, Johnson RC, Genovese G, An P, et al.: APOL1 genetic variants in focal segmental glomerulosclerosis and HIV-associated nephropathy. J Am Soc Nephrol 22: 2129–2137, 2011 21997394
52. Moreau Y, Tranchevent LC: Computational tools for prioritizing candidate genes: Boosting disease gene discovery. Nat Rev Genet 13: 523–536, 2012 22751426
53. Cowen L, Ideker T, Raphael BJ, Sharan R: Network propagation: A universal amplifier of genetic associations. Nat Rev Genet 18: 551–562, 2017 28607512
54. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al.; ACMG Laboratory Quality Assurance Committee: Standards and guidelines for the interpretation of sequence variants: A joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17: 405–424, 2015 25741868
55. Rasouly HM, Groopman EE, Heyman-Kantor R, Fasel DA, Mitrotti A, Westland R, et al.: The burden of candidate pathogenic variants for kidney and genitourinary disorders emerging from exome sequencing. Ann Intern Med 170: 11–21, 2019 30476936
56. Samocha KE, Kosmicki JA, Karczewski KJ, O’Donnell-Luria AH, Pierce-Hoffman E, MacArthur DG, et al.: Regional missense constraint improves variant deleteriousness prediction [published online ahead of print June 12, 2017]. bioRxiv doi:10.1101/148353
57. Voskarides K, Damianou L, Neocleous V, Zouvani I, Christodoulidou S, Hadjiconstantinou V, et al.: COL4A3/COL4A4 mutations producing focal segmental glomerulosclerosis and renal failure in thin basement membrane nephropathy. J Am Soc Nephrol 18: 3004–3016, 2007 17942953
58. Sadowski CE, Lovric S, Ashraf S, Pabst WL, Gee HY, Kohl S, et al.; SRNS Study Group: A single-gene cause in 29.5% of cases of steroid-resistant nephrotic syndrome. J Am Soc Nephrol 26: 1279–1289, 2015 25349199
59. Schumacher V, Schärer K, Wühl E, Altrogge H, Bonzel KE, Guschmann M, et al.: Spectrum of early onset nephrotic syndrome associated with WT1 missense mutations. Kidney Int 53: 1594–1600, 1998 9607189
60. Devuyst O, Pattaro C: The UMOD locus: Insights into the pathogenesis and prognosis of kidney disease. J Am Soc Nephrol 29: 713–726, 2018 29180396
Keywords: focal segmental glomerulosclerosis; human genetics; kidney

Copyright © 2019 by the American Society of Nephrology