Cumulative Allele Score Analysis
Figure 1 shows the ORs, estimated in a logistic regression model, associated with increasing cumulative allele scores for CD subjects compared with a reference group of CD subjects with ≤44 risk alleles, which represented the lowest 20% of total samples.
Interestingly, the difference in average cumulative allele scores (±SD) between CD subjects (49.1 ± 3.3; range, 37–60) and controls (46.9 ± 3.4; range, 36–56) was relatively small (Fig. 2A). Not surprisingly, this absolute difference in the average cumulative allele scores between CD and controls became increasingly smaller after sequentially removing the SNPs that are most strongly associated with CD risk (Fig., Supplemental Digital Content 1, http://links.lww.com/IBD/A165) until the top 49 CD-associated SNPs were removed. This finding highlights that the difference in the cumulative allele scores between CD and controls is relatively small and is driven by more than just a few SNPs.
Figure 2B shows predicted CD risk from a logistic regression model that includes all 71 CD SNPs in relation to the cumulative allele score. The Spearman correlation coefficient was 0.80 (dominant mode), indicating good correlation between CD risk and the cumulative allele score.
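As a rough illustration of the correlation reported above, Spearman's ρ can be computed as the Pearson correlation of rank-transformed values, with tied values sharing their average rank. The sketch below is not the study's code; the function names and the toy inputs are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical toy data: cumulative allele scores and model-predicted risks.
scores = [44, 46, 47, 49, 52]
predicted_risk = [0.21, 0.18, 0.35, 0.30, 0.55]
rho = spearman_rho(scores, predicted_risk)
```

In the study itself, `scores` would be the per-subject cumulative allele scores and `predicted_risk` the fitted probabilities from the 71-SNP logistic regression model.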
Genetic Interactions Analysis Using LR
Applying LR to the 71 CD meta-analysis SNPs in the CC/UP GWAS dataset, a best model of 5 Trees that included 13 SNPs was identified through the LR randomization model comparison process:
- Tree 1: rs2076756|NOD2 and [(rs3764147|C13orf31 and (rs3792109c|ATG16L1)) or (rs3024505|IL10/IL19 or (rs1736020c|chr21q))]
- Tree 2: (rs7927997c|C11orf30) and [(rs13073817c|chr3p) or (rs3180018c|MUC1/SCAMP3)]
- Tree 3: (rs3792109c|ATG16L1) or (rs6568421|PRDM1 or rs8005161|GALC/GPR65)
- Tree 4: (rs7714584c|IRGM) and rs2476601|PTPN22
- Tree 5: rs11209026c|IL23R
(where ‘c’ means the complement of the SNP risk allele, i.e., not carrying the risk allele).
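The five Trees listed above are Boolean expressions over carrier status. A minimal sketch of how they could be evaluated for one subject follows, assuming dominant coding (True when the subject carries at least one risk allele); the function names and the input dictionary are hypothetical and are not taken from the LR software.

```python
from typing import Dict


def c(carrier: bool) -> bool:
    """Complement: True when the subject does NOT carry the risk allele."""
    return not carrier


def evaluate_trees(g: Dict[str, bool]) -> Dict[str, bool]:
    """Evaluate Trees 1-5 for one subject.

    g maps an rs ID to True if the subject carries the risk allele
    (dominant coding, an assumption of this sketch).
    """
    tree1 = g["rs2076756"] and (
        (g["rs3764147"] and c(g["rs3792109"]))
        or (g["rs3024505"] or c(g["rs1736020"]))
    )
    tree2 = c(g["rs7927997"]) and (c(g["rs13073817"]) or c(g["rs3180018"]))
    tree3 = c(g["rs3792109"]) or (g["rs6568421"] or g["rs8005161"])
    tree4 = c(g["rs7714584"]) and g["rs2476601"]
    tree5 = c(g["rs11209026"])
    return {
        "Tree 1": tree1,
        "Tree 2": tree2,
        "Tree 3": tree3,
        "Tree 4": tree4,
        "Tree 5": tree5,
    }


SNPS = [
    "rs2076756", "rs3764147", "rs3792109", "rs3024505", "rs1736020",
    "rs7927997", "rs13073817", "rs3180018", "rs6568421", "rs8005161",
    "rs7714584", "rs2476601", "rs11209026",
]

# Hypothetical subject carrying the risk allele at all 13 SNPs.
all_carrier = {snp: True for snp in SNPS}
result = evaluate_trees(all_carrier)
# Trees 1 and 3 are True; Trees 2, 4, and 5 are False.
```

Each Tree then enters the regression model as a single binary covariate alongside the individual SNPs.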
To test whether the LR-identified SNP interactions (i.e., Trees) added information beyond the SNPs from which they were composed, we examined each Tree’s association with CD (except for Tree 5, which contained only one SNP) after adjusting for the marginal effects of its component SNPs. For example, for Tree 1, we assessed significance after adjusting for the 5 SNPs from which Tree 1 was composed.
All four tested Trees (Trees 1–4) remained significant (nominal P ≤ 0.002) after adjusting for the SNPs from which the Trees were composed (Table 2).
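One common way to test whether a single added term (here, a Tree indicator) improves a model that already contains its component SNPs is a likelihood ratio test between the nested models. The text does not state which test was used, so the sketch below is only an assumption about one reasonable approach; the function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_p_value_1df(loglik_reduced: float, loglik_full: float) -> float:
    """P value of a likelihood ratio test with 1 degree of freedom.

    loglik_reduced: maximized log-likelihood of the model with the
                    component SNPs only.
    loglik_full:    maximized log-likelihood after adding one Tree term.
    For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods, for illustration only.
p = lrt_p_value_1df(loglik_reduced=-1250.0, loglik_full=-1244.0)
```

A Wald test on the Tree coefficient in the full model would be an equally plausible alternative.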
Analysis of Model Discriminative Accuracy of 71 CD SNPs and Their Interactions in the Discovery CC/UP Cohort
We next assessed whether the SNP interactions (i.e., Trees 1–5) improved CD risk prediction. The AUC was 0.75 for prediction based on the 71 CD SNPs and was significantly improved to 0.77 (P < 0.0001) after adding the SNP interactions (i.e., Trees 1–5).
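The AUC can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison; it is illustrative only, the names and toy values are hypothetical, and it does not reproduce the significance test used to compare the two AUCs.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for a in case_scores:
        for b in control_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks for two cases and two controls.
example = auc([0.8, 0.4], [0.6, 0.2])  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise loop is quadratic in sample size; for cohorts of thousands of subjects a rank-based formulation would be faster but gives the same value.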
Replication Using the WTCCC Case–Control Dataset
Distributions of cumulative allele scores in CD subjects and controls and predicted CD risk in the replication WTCCC dataset are shown in Figures A and B, Supplemental Digital Content 2, http://links.lww.com/IBD/A166. Similar to the CC/UP cohort, the average cumulative allele scores (±SD) in the WTCCC were 49.1 ± 3.3 (range, 38–59) for CD and 46.9 ± 3.4 (range, 35–58) for controls. The Spearman correlation coefficient (ρ) between the cumulative allele score and the predicted probability was 0.79, indicating a good correlation between CD risk and the cumulative allele score.
Figure, Supplemental Digital Content 3, http://links.lww.com/IBD/A167, shows that the ORs rose with increasing cumulative allele score relative to a reference group carrying ≤44 risk alleles in a logistic regression model. For example, compared with the reference group, subjects carrying 45, 47, 49, 51, and 53 risk alleles had ORs of 1.46 (95% CI: 0.94–2.28, P = 0.09), 2.48 (95% CI: 1.65–3.73, P < 0.0001), 3.10 (95% CI: 2.09–4.59, P < 0.0001), 5.21 (95% CI: 3.38–8.03, P < 0.0001), and 8.51 (95% CI: 5.08–14.3, P < 0.0001), respectively.
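For a single score category compared with the reference group, an unadjusted OR and its Wald 95% CI can be obtained from a 2×2 table of cases and controls; this matches what a logistic regression with one indicator covariate and no other adjustment would give. The sketch below makes that simplifying assumption, and the counts are hypothetical rather than taken from the WTCCC data.

```python
import math
from typing import Tuple


def odds_ratio_ci(
    cases_group: int,
    controls_group: int,
    cases_ref: int,
    controls_ref: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Odds ratio versus the reference group with a Wald confidence interval.

    The standard error of log(OR) is sqrt(1/a + 1/b + 1/c + 1/d),
    where a..d are the four cell counts (all must be > 0).
    """
    odds_ratio = (cases_group * controls_ref) / (controls_group * cases_ref)
    se = math.sqrt(
        1 / cases_group + 1 / controls_group + 1 / cases_ref + 1 / controls_ref
    )
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: one score category versus the <=44 reference group.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

Because the interval is symmetric on the log scale, the product of its two limits equals the squared OR, which is a quick consistency check on reported values.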
We applied the same 5 genetic interactions (Trees 1–5) identified in the primary CC/UP cohort in the larger independent replicate WTCCC dataset. The AUC was 0.733 for prediction based on the 71 CD SNPs with a borderline increase to 0.736 (P = 0.08) after adding the genetic interactions (i.e., Trees).
We examined Trees 1–4 (not Tree 5, because it contained only one SNP) in the WTCCC cohort to determine whether they provided additional information beyond the SNPs from which they were composed. Only Tree 1 remained significant (P = 0.0004) after adjusting for the SNPs from which the Trees were composed (Table, Supplemental Digital Content 4, http://links.lww.com/IBD/A168).
Analysis of Genetic Heritability Explained by the 71 CD SNPs and the LR-identified Genetic Interactions (Trees)
We estimated the likelihood-based rescaled pseudo R2 for both CC/UP and WTCCC cohorts. In the CC/UP discovery cohort, the explained heritability of the 71 CD SNPs alone in dominant mode was 24% and increased to 27% after adding the LR-identified high-order genetic interactions (Trees 1–5). In the replicate WTCCC cohort, only a small increase of explained heritability was observed (from 21% to 22%) by adding the same SNP interactions (Trees 1–5) into the 71 CD SNP model.
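A likelihood-based rescaled pseudo R2 is often computed as the Cox and Snell R2 divided by its maximum attainable value (the max-rescaled or Nagelkerke form). The sketch below assumes that definition, which the text does not spell out; the function name and inputs are hypothetical.

```python
import math


def rescaled_pseudo_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    """Max-rescaled (Nagelkerke-type) pseudo R2 from two log-likelihoods.

    loglik_null:  maximized log-likelihood of the intercept-only model.
    loglik_model: maximized log-likelihood of the fitted model.
    n:            number of subjects.
    """
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_attainable = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_attainable


# Hypothetical values: 100 subjects with a 50/50 case-control split.
ll_null = 100 * math.log(0.5)
r2 = rescaled_pseudo_r2(ll_null, loglik_model=-60.0, n=100)
```

Comparing this quantity for the 71-SNP model with and without the Tree terms gives the kind of increase in explained variability reported above.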
We investigated the predictive value of combining the 71 CD meta-analysis SNPs in a case–control study and replicated the findings in a larger independent population-based study. We found that the combination of the 71 CD SNPs (in dominant mode) has modest to good predictability of CD risk (AUC 0.75 in our discovery CC/UP cohort and 0.73 in the replicate WTCCC cohort). Higher cumulative allele scores predict higher CD risk. However, our study also highlights that the absolute difference in cumulative allele scores between CD subjects (49.1 ± 3.3) and controls (46.9 ± 3.4) is relatively small, although statistically significant (P < 0.001). This small difference was mostly driven by several of the top CD-associated SNPs. We also identified high-order genetic interactions among the 71 CD loci using a novel approach (LR) and found that risk prediction improved by adding genetic interactions to the 71 CD SNP model (AUC increased from 0.75 to 0.77, P < 0.0001). High-order SNP interaction Tree 1, composed of 5 SNPs in NOD2, ATG16L1, IL10/IL19, C13orf31, and chr21q, was successfully replicated in a large independent WTCCC cohort. To our knowledge, this is the first study investigating CD predictability using the combined 71 CD meta-analysis loci and showing improved model predictability and explained CD heritability by adding the genetic interactions identified through a novel application of LR.
There are several algorithms to search for interactions, including 2 well-known adaptive algorithms CART20 and MARS.21 In this study, we used LR, which has been shown to outperform CART and MARS in situations in which the underlying model is particularly complicated.9–11 LR is a tool to detect interactions between binary predictors that may play important roles in complex traits. Applying LR to the 71 CD SNPs, we identified 5 high-order logical SNP interactions (i.e., Trees 1–5). Trees 1, 2, 3, and 4 were composed of 2 to 5 SNPs, whereas Tree 5 had only one SNP (rs11209026 in IL23R).
Gene–gene interactions can be approached in statistical or mechanistic ways.22 Looking for genes that interact mechanistically may be more useful than identifying statistical interactions for understanding which factors are part of the biological mechanisms influencing CD susceptibility. To assess whether the LR-identified SNP interactions could add information beyond that from the individual SNPs that constituted the Tree models, we applied a conventional logistic regression model to compare a model consisting of the 70 single SNPs to a model that included both the 70 SNPs and the identified Trees. The SNP interactions remained statistically significant after excluding the marginal effects of the SNPs that made up the Trees (P ≤ 0.002, Table 2). By examining the excess risk in a logistic regression model, our approach is similar to testing for mechanistic interaction to identify the underlying causal mechanisms.22 When we applied the same Trees model in the replicate WTCCC cohort, only Tree 1 remained significant (P = 0.0004). This finding further supports the evidence of genetic interaction discovered in Tree 1.
Once high-order genetic interactions are identified through LR, the next challenge relates to interpreting and proving that the statistically identified genetic interactions are of importance at a biological level. Tree 1, composed of SNPs in NOD2, ATG16L1, IL10/IL19, and C13orf31, and a gene poor region on chr21q, demonstrates a good example of biological plausibility for the identified genetic interactions by highlighting mechanisms of host response to microbial organisms. NOD2 was the first gene identified to be associated with CD23,24 and its variants affect the leucine-rich repeat domain of the nucleotide-binding oligomerization domain (Nod) microbial-recognition receptors. ATG16L1 is part of the autophagy pathway, which is a crucial element of the innate immune response to intracellular bacteria. Recent studies have provided a mechanistic link between NOD2 and ATG16L1,25,26 both of which were included in the SNP interaction term Tree 1. One study demonstrated that bacterial autophagy triggered by Nod proteins relied on interactions with ATG16L1. 
In the absence of other activators, Nod1 and Nod2 induced autophagy by interacting with ATG16L1 and promoted the sequestration of bacteria inside autophagosomes.26 Similarly, a study from our group showed evidence of functional interactions between ATG16L1 and NOD2 in an autophagy-dependent antibacterial pathway that is altered by CD-associated variants in a cell type–specific manner.25 Interleukin (IL)-10 is an anti-inflammatory cytokine with several lines of evidence to support its role in the pathogenesis of IBD, including identification of genetic variants in the IL-10 receptor in subjects with early-onset IBD.27–30 From a possible gene–gene interaction standpoint, peripheral blood mononuclear cells from patients with the NOD2 3020insC mutation show defective release of IL-10 after stimulation with microbial organisms.31 Furthermore, a recent study illustrated the biological relationship between NOD2 3020insC loss-of-function mutations and inhibited IL-10 transcription by blocking phosphorylation of the nuclear ribonucleoprotein hnRNP-A1.32 IL-19 is a member of the IL-10 family and has been shown to have a role in protecting the host from colonic bacterial infection following injury to intestinal epithelial cells.33
A candidate gene, USP25 (ubiquitin-specific deubiquitination protease), at gene-poor region chr21q11.2,34 located 200 kb away from SNP rs1736020 in Tree 1, is a member of the DUB (deubiquitinating enzymes) superfamily. The first DUB-associated disease, familial cylindromatosis, disclosed the biological relationship between the nuclear factor-kappa B signaling pathway and deubiquitination enzymes.35 In addition, DUB proteins have increasingly been found to be involved in human pathological disorders such as Parkinson’s disease, prostate cancer, lung cancer, and ataxia.36,37 Another locus, C13orf31 (chromosome 13 open reading frame 31), which accounted for SNP rs3764147 in Tree 1, has been found to be associated with leprosy38 and IBD.39 C13orf31 encodes laccase (multicopper oxidoreductase) domain-containing 1 protein, which acts as an antioxidant molecule, scavenging free radicals in the bloodstream and eliminating the pro-cancerous environmental estrogen bisphenol A.40 Based on all these lines of evidence, we conclude that there is biological plausibility for interaction between the 5 genes/loci identified in our replicated Tree 1. Follow-up functional work will be necessary, although we believe that the methodology we have used can help identify pathway-based approaches to unravel the pathogenesis of IBD.
As previously reported, only 23% of CD heritability can be explained by the 71 CD GWAS meta-analysis SNPs.1 For case–control studies, disease heritability can be alternatively approximated by the proportion of disease variability explained in a SNP model using generalized R2 measures implemented in the logistic regression model. In our primary CC/UP GWAS, 24% of phenotypic variability was explained by the 71 CD SNPs in dominant mode. After adding the SNP interactions (i.e., Trees) into the 71 CD SNP model, the explained heritability increased from 24% to 27%. More heritability would be expected to be explained by searching more complicated interaction models within or between IBD-related biological pathways (e.g., Th17/Th1 pathway, tumor necrosis factor-α, nuclear factor-κB signaling, the KIR-family, and the NOD2 pathway).41 Of note, in the replicate WTCCC dataset, only a minimal increase in explained CD variability was observed (from 21% to 22%) by adding the SNP interactions into the 71 CD SNP model, which probably relates to heterogeneity between the 2 populations.
The limitations of this study include the relatively small sample size in our CC/UP GWAS (692 CD and 1,437 controls) that may explain why some of the 71 CD meta-analysis SNPs did not achieve genome-wide level of significance. Also, the suboptimal imputation quality in a few SNPs from the replicate WTCCC cohort may hinder the reproducibility of the selected SNP interactions model. Despite using a high imputation accuracy program like MaCH,16,17,41 several factors can still affect imputation results, including the density of genotype platform in the experimental sample (the WTCCC Affymetrix 500k gene chip is considered relatively low density for a GWAS platform), level of linkage disequilibrium in different genetic regions, minor allele frequency of the marker being imputed, and degree of genetic relationship or population heterogeneity between the experimental and reference populations.
In conclusion, this is the first study investigating CD risk using the 71 CD meta-analysis loci and demonstrating increased risk with increasing CD risk allele burden. However, we also highlighted the small difference in cumulative allele scores between CD subjects (49.1 ± 3.3) and healthy controls (46.9 ± 3.4), which may explain why only 23% of CD heritability is attributed to the 71 CD loci. Through a novel approach, we found a high-order genetic interaction, which was successfully replicated in a large independent WTCCC cohort. Using the 71 CD SNPs, the AUC improved from 0.75 to 0.77 and estimated explained heritability improved from 24% to 27% after adding the genetic interactions. Potential future applications of our work include verifying these findings in a larger population-based cohort, including environmental factors to explore gene–environment interactions, defining the biological significance of suggested interactions in cell line or animal models, and searching for genetic interactions using pathway-based approaches.
The authors acknowledge Dr. M. Ilyas Kamboh from the University of Pittsburgh for the contribution of genotype data that was used in this study. The authors thank the patients and the controls for participating in this study. The authors acknowledge the Feinstein Institute for Medical Research of the North Shore-Long Island Jewish Health System for Illumina Genotyping BeadChip processing. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.
1. Franke A, McGovern DPB, Barrett JC, et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat Genet. 2010;42:1118–1125.
2. Liu S, Song Y. Building genetic scores to predict risk of complex diseases in humans: is it possible? Diabetes. 2010;59:2729–2731.
3. Yang Q, Khoury MJ, Botto L, et al. Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. Am J Hum Genet. 2003;72:636–649.
4. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528.
5. Weersma RK, Stokkers PCF, Cleynen I, et al. Confirmation of multiple Crohn's disease susceptibility loci in a large Dutch-Belgian cohort. Am J Gastroenterol. 2009;104:630–638.
6. Peter I, Mitchell A, Ozelius L, et al. Evaluation of 22 genetic variants with Crohn's disease risk in the Ashkenazi Jewish population: a case-control study. BMC Med Genet. 2011;12:63.
7. Park JH, Gail MH, Weinberg CR, et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci USA. 2011;108:18026–18031.
8. Zuk O, Hechter E, Sunyaev SR, et al. The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci USA. 2012;109:1193–1198.
9. Ruczinski I, Kooperberg C, LeBlanc L. Logic regression. J Comput Graph Stat. 2003;12:475–511.
10. Ruczinski I, Kooperberg C, LeBlanc L. Exploring interactions in high-dimensional genomic data: an overview of logic regression, with applications. J Mult Anal. 2004;90:178–195.
11. Schwender H, Ruczinski I. Logic regression and its extensions. Adv Genet. 2010;72:25–45.
12. Achkar JP, Klei L, Bakker PIW, et al. Amino acid position 11 of HLA-DR[beta]1 is a major determinant of chromosome 6p association with ulcerative colitis. Genes Immun. 2012;13:245–252.
13. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. 2009;84:210–223.
14. Marchini J, Howie B, Myers S, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–913.
15. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678.
16. Li Y, Willer C, Sanna S, et al. Genotype imputation. Annu Rev Genomics Hum Genet. 2009;10:387–406.
17. Li Y, Willer CJ, Ding J, et al. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–834.
18. Makowsky R, Pajewski NM, Klimentidis YC, et al. Beyond missing heritability: prediction of complex traits. PLoS Genet. 2011;7:e1002051.
19. Stokes ME, Davis C, Koch G. Categorical Data Analysis Using the SAS System. Second Edition. Cary, NC: SAS Institute Inc.; 2000.
20. Breiman L, Friedman JH, Stone C, et al. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks; 1984.
21. Friedman JH, Roosen CB. An introduction to multivariate adaptive regression splines. Stat Methods Med Res. 1995;4:197–217.
22. Berzuini C, Dawid P, Zhang H, et al. Analysis of interaction for identifying causal mechanisms. In: Berzuini C, Dawid P, Bernardinelli L, eds. Causality: Statistical Perspectives and Applications. Wiley; 2012:192–207.
23. Hugot JP, Chamaillard M, Zouali H, et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature. 2001;411:599–603.
24. Ogura Y, Bonen DK, Inohara N, et al. A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature. 2001;411:603–606.
25. Homer CR, Richmond AL, Rebert NA, et al. ATG16L1 and NOD2 interact in an autophagy-dependent antibacterial pathway implicated in Crohn's disease pathogenesis. Gastroenterology. 2010;139:1630–1641.
26. Travassos LH, Carneiro LAM, Ramjeet M, et al. Nod1 and Nod2 direct autophagy by recruiting ATG16L1 to the plasma membrane at the site of bacterial entry. Nat Immunol. 2010;11:55–62.
27. Li MC, He SH. IL-10 and its related cytokines for treatment of inflammatory bowel disease. World J Gastroenterol. 2004;10:620–625.
28. Kuhn R, Lohler J, Rennick D, et al. Interleukin-10-deficient mice develop chronic enterocolitis. Cell. 1993;75:263–274.
29. Kotlarz D, Beier R, Murugan D, et al. Loss of interleukin-10 signaling and infantile inflammatory bowel disease: implications for diagnosis and therapy. Gastroenterology. 2012;143:347–355.
30. Glocker EO, Kotlarz D, Boztug K, et al. Inflammatory bowel disease and mutations affecting the interleukin-10 receptor. N Engl J Med. 2009;361:2033–2045.
31. Netea MG, Kullberg BJ, de Jong DJ, et al. NOD2 mediates anti-inflammatory signals induced by TLR2 ligands: implications for Crohn's disease. Eur J Immunol. 2004;34:2052–2059.
32. Noguchi E, Homma Y, Kang X, et al. A Crohn's disease-associated NOD2 mutation suppresses transcription of human IL10 by inhibiting activity of the nuclear ribonucleoprotein hnRNP-A1. Nat Immunol. 2009;10:471–479.
33. Azuma YT, Matsuo Y, Kuwamura M, et al. Interleukin-19 protects mice from innate-mediated colonic inflammation. Inflamm Bowel Dis. 2010;16:1017–1028.
34. Valero R, Marfany G, Gonzalez-Angulo O, et al. USP25, a novel gene encoding a deubiquitinating enzyme, is located in the gene-poor region 21q11.2. Genomics. 1999;62:395–405.
35. Kovalenko A, Chable-Bessia C, Cantarella G, et al. The tumour suppressor CYLD negatively regulates NF-kappaB signalling by deubiquitination. Nature. 2003;424:801–805.
36. Leroy E, Boyer R, Auburger G, et al. The ubiquitin pathway in Parkinson's disease. Nature. 1998;395:451–452.
37. Graner E, Tang D, Rossi S, et al. The isopeptidase USP2a regulates the stability of fatty acid synthase in prostate cancer. Cancer Cell. 2004;5:253–261.
38. Zhang FR, Huang W, Chen SM, et al. Genomewide association study of leprosy. N Engl J Med. 2009;361:2609–2618.
39. Barrett JC, Hansoul S, Nicolae DL, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet. 2008;40:955–962.
40. Keri RA, Ho SM, Hunt PA, et al. An evaluation of evidence for the carcinogenic activity of bisphenol A. Reprod Toxicol. 2007;24:240–252.
41. Cho JH, IIBDGC. Immunochip-based analysis of a large IBD case-control cohort identifies 50 novel loci, refining definitions of disease pathways. Gastroenterology. 2012;142:S149–S150.
genetics; genetic interaction; cumulative genetic effect; Crohn’s disease
© Crohn's & Colitis Foundation of America, Inc.