Inflammatory Bowel Diseases:
Original Clinical Articles
A Novel Approach to Detect Cumulative Genetic Effects and Genetic Interactions in Crohn's Disease
Wang, Ming-Hsi MD, PhD*,†; Fiocchi, Claudio MD*,†; Ripke, Stephan MD‡; Zhu, Xiaofeng PhD§; Duerr, Richard H. MD‖,¶; Achkar, Jean-Paul MD*,†
*Department of Gastroenterology and Hepatology, Digestive Disease Institute, Cleveland Clinic, Cleveland, Ohio;
†Department of Pathobiology, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio;
‡Division of Gastroenterology, Hepatology, & Nutrition, Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania;
§Department of Human Genetics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania;
‖Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio; and
¶Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts.
Reprints: Jean-Paul Achkar, MD, Department of Gastroenterology and Hepatology, Digestive Disease Institute, Cleveland Clinic, Desk A31, Cleveland, OH 44195 (e-mail: firstname.lastname@example.org).
Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (www.ibdjournal.org).
Supported by T32 DK083251, National Institutes of Health–National Institute of Diabetes and Digestive and Kidney Diseases (M.-H.W., C.F.); DK068112 (J.-P.A.), DK062420 (R.H.D.) and DK076025 (R.H.D.); a Crohn’s & Colitis Foundation of America Senior Research Award (R.H.D.); and funds generously provided by Kenneth and Jennifer Rainin, Gerald and Nancy Goldberg, and Victor and Ellen Cohn.
The authors have no conflicts of interest to disclose.
Received October 29, 2012
Accepted December 10, 2012
Genome-wide association studies have identified at least 71 Crohn’s disease (CD) genetic risk loci, but the role of gene–gene interactions is unclear. The value of genetic variants in clinical practice is not defined because of limited explained heritability.
We examined model predictability of combining the 71 CD risk alleles and genetic interactions in an ongoing inflammatory bowel disease genome-wide association study. The Wellcome Trust Case Control Consortium inflammatory bowel disease genome-wide association study was used as a replicate cohort. We used logic regression (LR), an adaptive regression methodology, to search for high-order binary predictors (e.g., single-nucleotide polymorphism [SNP] interactions).
The combined 71 CD SNPs had good CD risk predictability (area under the curve of 0.75 and 0.73 in the 2 cohorts). Higher cumulative allele score predicted higher CD risk, but a relatively small difference in cumulative allele scores was observed between CD and controls (49 versus 47, P < 0.001). Through LR, we identified high-order genetic interactions and significantly improved the model predictability (area under the curve, from 0.75 to 0.77, P < 0.0001). A genetic interaction model, including NOD2, ATG16L1, IL10/IL19, C13orf31, and chr21q loci, was discovered and successfully replicated in the independent Wellcome Trust Case Control Consortium cohort. The explained heritability of the 71 CD SNPs alone was 24% and increased to 27% after adding the genetic interactions.
A novel approach allowed the identification and replication of genetic interactions among NOD2, ATG16L1, IL10/IL19, C13orf31, and chr21q loci. CD risk can be predicted by a model of 71 CD loci and improved by adding genetic interactions.
The pathogenesis of Crohn’s disease (CD) is probably the result of an inappropriate persistent immune response, in which genetic, microbial, and environmental factors play important roles. Recent genome-wide association studies (GWAS) and meta-analyses have identified at least 71 CD-associated genetic risk loci.1 However, the value of currently known genetic variants in clinical practice is still unclear.
Single genetic marker analysis has limited value in predicting risk for complex diseases such as CD, and studies suggest that the predictive value of combining multiple common genetic variants can be improved, especially when GWAS data are incorporated.2–4 The development of cumulative CD genetic risk scores based on GWAS results to predict disease risk could provide clinical strategies for disease screening, extent of diagnostic testing, and potentially treatment. Weersma et al5 proposed genetic risk profiles based on 10 CD single-nucleotide polymorphisms (SNPs) identified from GWAS, but the tradeoff between sensitivity and specificity limited the discriminative accuracy of such models. Similarly, genetic risk profiles consisting of 22 GWAS meta-analysis CD SNPs could not explain the excessive prevalence of CD in individuals of Ashkenazi Jewish descent.6 To our knowledge, CD risk prediction using the recent 71 CD GWAS genetic loci has not been investigated.
Furthermore, even after accounting for cumulative effects of the 71 CD loci, only 23% of CD heritability can be explained.1 Using a model effect size distribution,7 only a small increase in the explained percentage of CD heritability is predicted when adding further, new CD genetic loci.1 Therefore, it is possible that identifying genetic interactions could uncover some of the missing heritability left out from genetic additive models and could provide better CD risk predictive models. In addition, such an approach could expand the understanding of the pathogenic mechanisms of CD.8
The aims of this study were to assess the distribution and CD risk predictability of the 71 loci identified in CD GWAS meta-analyses and also to explore high-order genetic interactions and associated additional contributions to model predictability. To assess interaction effects, we used logic regression (LR), a novel regression method that can identify logical interactions of binary predictors.9–11 We applied LR to the 71 CD SNPs to search for logical SNP combinations that can explain the disease risk better than an additive model. We then replicated the identified genetic interactions in a larger independent population-based dataset and also assessed whether these interactions could recover any missing heritability. To our knowledge, this is the first study to evaluate model predictability of combining the 71 GWAS meta-analysis CD loci with high-order genetic interactions.
MATERIALS AND METHODS
Two GWAS datasets were used for this study, the Cleveland Clinic/University of Pittsburgh (CC/UP) inflammatory bowel disease (IBD) GWAS and The Wellcome Trust Case Control Consortium (WTCCC) IBD GWAS. The CC/UP GWAS dataset was used for the cumulative risk allele analysis and as the discovery dataset for evaluation of high-order genetic interaction. The study design and data collection of this GWAS have been previously described.12 Of note, the full GWAS has not yet been completed as the replication phase of the study is ongoing. However, we were able to pursue the current study because its main purposes were to predict CD risk using the 71 CD GWAS loci and to identify high-order genetic interactions through a novel methodological approach. In brief, this GWAS consists of 692 CD cases and 1,428 controls, all of non-Jewish, European ancestry, who were genotyped using the Illumina HumanOmni-1 Quad beadchip (Illumina, San Diego, CA) at the Feinstein Institute for Medical Research of the North Shore-Long Island Jewish Health System. All participants gave written informed consent. Genotype imputation was done using a 5-Mb region across the whole genome with an imputation program (BEAGLE).13 All 71 CD meta-analysis SNPs were imputed with good quality (R2 > 0.90) and with Hardy–Weinberg equilibrium P > 10^−5 in controls.
The WTCCC IBD GWAS (Affymetrix GeneChip 500K arrays) was used as the replicate dataset for the interaction analysis. We downloaded the genotype data called by the CHIAMO algorithm14 for the CD samples and the shared controls (the 1958 Birth Cohort and UK Blood Service sample) from the WTCCC website. After excluding individuals with evidence of non-European ancestry or poor call rates, 1,748 CD cases and 2,980 controls remained. We applied the following quality control criteria to exclude SNPs: (1) Hardy–Weinberg equilibrium test P < 5.7 × 10^−7 in controls; (2) minor allele frequency <1% in CD cases and controls; (3) call rate <99%; and (4) plate relatedness (suggesting batch effect) provided by WTCCC. In total, 469,557 SNPs passed the quality control filters.15 Of the 71 CD meta-analysis SNPs, 24 were successfully genotyped in the WTCCC dataset. The remaining 47 ungenotyped SNPs were imputed using the MaCH program16,17 with 1000 Genomes phase 1 haplotype data as reference. We used parameters of 60 iterations of the Markov sampler and 200 states. The 1000 Genomes reference panel was obtained from the University of Michigan Abecasis laboratory, version 20100804 (http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-2010-08.html). A total of 566 reference haplotypes for European ancestry served as the reference panel. Among the 47 imputed CD SNPs, 43 had good imputation quality (R2 > 0.60) and 4 had borderline or low quality of imputation, namely chr19: rs281379 (R2 = 0.36), chr19: rs736289 (R2 = 0.33), chr19: rs12720356 (R2 = 0.30), and chr9: rs4077515 (R2 = 0.27).
Cumulative Allele Scores for the 71 CD GWAS Loci
Risk alleles were designated as the alleles reported to be associated with CD in the GWAS meta-analysis article.1 Odds ratios (ORs) and 95% confidence intervals (CIs) were calculated to estimate effects for risk alleles and genotypes in dominant mode. These analyses were implemented in the Golden Helix SVS software suite 7 (Golden Helix, Bozeman, MT).
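As an illustration of the dominant-mode effect estimate described above, the following minimal Python sketch computes an OR and a Wald 95% CI from a 2 × 2 table of risk-allele carriers versus non-carriers. The function name and the counts in the usage note are hypothetical, and the Wald interval is an assumption: the text does not state which interval the Golden Helix software computed.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  ctrl_carriers, ctrl_noncarriers, z=1.96):
    """Odds ratio and Wald CI for carrying >=1 risk allele (dominant mode).

    Hypothetical helper for illustration; all four cell counts must be > 0.
    """
    odds_ratio = (case_carriers * ctrl_noncarriers) / (
        case_noncarriers * ctrl_carriers)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / case_carriers + 1 / case_noncarriers
                          + 1 / ctrl_carriers + 1 / ctrl_noncarriers)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, `odds_ratio_ci(20, 10, 10, 20)` (made-up counts) gives an OR of 4.0 with an interval that is symmetric around it on the log scale.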
The cumulative allele score was calculated by summing up the number of risk alleles (in dominant mode) across the 71 CD SNPs for each study subject with complete genotype information. Prediction of CD risk by cumulative allele score was conducted through conventional logistic regression model using SAS 9.2/Genetics package PROC LOGISTIC procedures (SAS Institute, Cary, NC). The CD predicted probability was calculated from the log odds (i.e., risk = odds/(1 + odds)) using the logistic regression model including all 71 CD SNPs. The correlation between the cumulative allele score and the predicted probabilities was assessed by the Spearman correlation coefficient (rho, ρ).
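The scoring and the odds-to-risk conversion described above can be sketched as follows. This is a minimal illustration, not the SAS procedure used in the study; it assumes that "dominant mode" means each SNP contributes 1 if the subject carries at least one risk allele, which is consistent with mean scores near 49 across 71 SNPs. The function names are hypothetical.

```python
import math


def cumulative_allele_score(risk_allele_counts):
    """Sum of dominant-mode indicators across SNPs for one subject.

    risk_allele_counts: per-SNP number of risk alleles (0, 1, or 2),
    or None where the genotype is missing. Subjects with incomplete
    genotype information get no score (None), as in the text.
    """
    if any(count is None for count in risk_allele_counts):
        return None
    return sum(1 if count >= 1 else 0 for count in risk_allele_counts)


def risk_from_log_odds(log_odds):
    """Convert a logistic-model linear predictor to a probability:
    risk = odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)
```

A log odds of 0 corresponds to a predicted risk of 0.5, and odds of 3 correspond to a predicted risk of 0.75.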
Exploring High-order Genetic Interactions Using LR Among 71 CD Genes
LR (R package Logic.Reg ver. 1.4.10) was used to search for models of high-order SNP combinations. All 71 CD SNPs were recoded into binary predictors in dominant mode to have consistency as we calculated the cumulative allele score. Simulated annealing, a stochastic search algorithm with increase of model size (i.e., the complexity of SNP logical combinations), implemented in LR was used to search for models composed of logical SNP interactions (called “Trees”), which could explain the disease risk better than single SNP models. Through the randomization processes, a test for different model sizes can be used to determine an optimal model size and, in combination with the greedy algorithm, a best model can then be identified.9–11 To assess the significance of genetic interactions contributed from the chosen Trees (i.e., high-order SNP interactions), we evaluated the CD association of each Tree after excluding the marginal effects of the SNPs from which the Tree was composed.
We then applied logistic regression modeling to investigate the predictive accuracy of models derived from the 71 CD SNPs with and without including the SNP interactions (i.e., Trees). Discriminative accuracy was evaluated using the area under the receiver operating characteristic curves (area under the curve [AUC]). The AUC can range from 0.5 (lack of discrimination) to 1.0 (perfect discrimination). AUCs were calculated for the predicted risks of the logistic regression models, the cumulative allele score, and the linear predictor values of the logistic regression models. AUCs were compared with SAS 9.2 GPLOT procedures (SAS Institute). Goodness-of-fit tests between the models of 71 CD SNPs with and without genetic interactions (i.e., Trees) were conducted through the PROC LOGISTIC and receiver operating characteristic functions in SAS 9.2.
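To make the AUC concrete, the sketch below computes it as the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half (the Mann-Whitney formulation). This is an illustrative assumption about how the quantity can be computed, not the SAS routine used in the study, and the function name is hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of case-control pairs in which the case has
    the higher score; tied pairs contribute 0.5.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

The scores can be predicted risks, linear predictor values, or cumulative allele scores; identical score distributions in cases and controls give 0.5, and complete separation gives 1.0.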
Assessment of Heritability
The estimate of the proportion of disease variation explained by a model can be regarded as a measure of goodness of fit to the data, and this can be done using the generalized pseudo-R2 estimate from the logistic regression model. SAS gives the likelihood-based pseudo R2 and its rescaled estimates for the proportion of disease variation explained by the genetic predictors.18,19
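As a rough sketch of these two quantities, the generalized (likelihood-based) R2 and its max-rescaled version can be computed from the log-likelihoods of the intercept-only and fitted models. The formulas below are the standard Cox-Snell and Nagelkerke definitions; we assume these are what the text refers to, and the function names are hypothetical.

```python
import math


def generalized_r2(loglik_null, loglik_model, n):
    """Likelihood-based pseudo R2: 1 - (L0 / L1) ** (2 / n)."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)


def max_rescaled_r2(loglik_null, loglik_model, n):
    """Generalized R2 divided by its maximum, 1 - L0 ** (2 / n),
    so that a perfectly fitting model reaches 1."""
    r2_max = 1.0 - math.exp(2.0 * loglik_null / n)
    return generalized_r2(loglik_null, loglik_model, n) / r2_max
```

Here `loglik_null` is the log-likelihood of the intercept-only model, `loglik_model` that of the model with the genetic predictors, and `n` the number of subjects.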
RESULTS
Individual Effects of 71 SNPs on CD Risk
Table 1 shows the CD risk of each of the 71 CD SNPs in the CC/UP dataset. The association patterns of effect sizes (ORs) and association directions are similar to those in the recent CD meta-analysis article,1 suggesting that our dataset is representative of those included in the meta-analysis study, which consisted of 6,333 CD cases and 15,056 controls.1
Cumulative Allele Score Analysis
Figure 1 shows the ORs for CD associated with increasing cumulative allele scores, estimated in a logistic regression model relative to a reference group of subjects with ≤44 risk alleles, which represented the lowest 20% of total samples.
Interestingly, the difference in average cumulative allele scores (±SD) between CD subjects (49.1 ± 3.3; range, 37–60) and controls (46.9 ± 3.4; range, 36–56) was relatively small (Fig. 2A). Not surprisingly, this absolute difference in the average cumulative allele scores between CD and controls became increasingly smaller after sequentially removing the SNPs that are most strongly associated with CD risk (Fig., Supplemental Digital Content 1, http://links.lww.com/IBD/A165) until the top 49 CD-associated SNPs were removed. This finding highlights that the difference in the cumulative allele scores between CD and controls is relatively small and is driven by more than just a few SNPs.
Figure 2B shows predicted CD risk from a logistic regression model that includes all 71 CD SNPs in relation to the cumulative allele score. The Spearman correlation coefficient was 0.80 (dominant mode), indicating good correlation between CD risk and the cumulative allele score.
Genetic Interactions Analysis Using LR
Applying LR to the 71 CD meta-analysis SNPs in the CC/UP GWAS dataset, a best model of 5 Trees that included 13 SNPs was identified through the LR randomization model comparison process:
- Tree 1: rs2076756|NOD2 and [(rs3764147|C13orf31 and (rs3792109c|ATG16L1)) or (rs3024505|IL10/IL19 or (rs1736020c|chr21q))]
- Tree 2: (rs7927997c|C11orf30) and [(rs13073817c|chr3p) or (rs3180018c|MUC1/SCAMP3)]
- Tree 3: (rs3792109c|ATG16L1) or (rs6568421|PRDM1 or rs8005161|GALC/GPR65)
- Tree 4: (rs7714584c|IRGM) and rs2476601|PTPN22
- Tree 5: rs11209026c|IL23R
(where ‘c’ means the complement of the SNP risk allele, i.e., not carrying the risk allele).
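Each Tree above is a Boolean expression over dominant-mode carrier indicators, so it can be evaluated per subject with ordinary logical operators. The sketch below encodes Trees 1 and 4 as written; the dictionary layout and function names are hypothetical, and a trailing "c" in the Tree notation is rendered as `not`.

```python
def tree1(carrier):
    """Tree 1: NOD2 and [(C13orf31 and not ATG16L1)
    or (IL10/IL19 or not chr21q)].

    carrier maps SNP id -> True if the subject carries >=1 risk allele.
    """
    return carrier["rs2076756"] and (
        (carrier["rs3764147"] and not carrier["rs3792109"])
        or carrier["rs3024505"]
        or not carrier["rs1736020"]
    )


def tree4(carrier):
    """Tree 4: (not IRGM) and PTPN22."""
    return (not carrier["rs7714584"]) and carrier["rs2476601"]


# Hypothetical subject: carries the NOD2 and C13orf31 risk alleles only.
subject = {
    "rs2076756": True,   # NOD2
    "rs3764147": True,   # C13orf31
    "rs3792109": False,  # ATG16L1
    "rs3024505": False,  # IL10/IL19
    "rs1736020": True,   # chr21q
}
tree1_value = tree1(subject)  # True for this subject
```

The resulting 0/1 value per subject is what enters the logistic regression as an additional predictor alongside the individual SNPs.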
To test whether the LR-identified SNP interactions (i.e., Trees) added information beyond the SNPs from which they were composed, we examined each Tree’s CD association (except for Tree 5 because it contained only one SNP) after adjusting for the marginal effects of the SNPs from which the Tree was composed. For example, for Tree 1, we assessed significance after adjusting for the 5 SNPs from which Tree 1 was composed.
All four tested Trees (Trees 1–4) remained significant (nominal P ≤ 0.002) after adjusting for the SNPs from which the Trees were composed (Table 2).
Analysis of Model Discriminative Accuracy of 71 CD SNPs and Their Interactions in the Discovery CC/UP Cohort
We next assessed whether the SNP interactions (i.e., Trees 1–5) improved CD risk prediction. The AUC was 0.75 for prediction based on the 71 CD SNPs and was significantly improved to 0.77 (P < 0.0001) after adding the SNP interactions (i.e., Trees 1–5).
Replication Using the WTCCC Case–Control Dataset
Distributions of cumulative allele score between CD and controls and predicted CD risk in the replication WTCCC dataset are shown in Figures A and B, Supplemental Digital Content 2, http://links.lww.com/IBD/A166. Similar to the CC/UP cohort, the average cumulative allele scores (±SD) in the WTCCC were 49.1 ± 3.3 (range, 38–59) for CD and 46.9 ± 3.4 (range, 35–58) for controls. The Spearman correlation coefficient (rho, ρ) between the cumulative allele score and the predicted probability was 0.79, indicating a good correlation between CD risk and the cumulative allele score.
Figure, Supplemental Digital Content 3, http://links.lww.com/IBD/A167, shows that, in a logistic regression model, the ORs increase with increasing cumulative allele scores relative to a reference group having ≤44 risk alleles. For example, compared with the reference group (i.e., carrying ≤44 risk alleles), subjects carrying 45, 47, 49, 51, and 53 risk alleles had increasing ORs of 1.46 (95% CI: 0.94–2.28, P = 0.09), 2.48 (95% CI: 1.65–3.73, P < 0.0001), 3.10 (95% CI: 2.09–4.59, P < 0.0001), 5.21 (95% CI: 3.38–8.03, P < 0.0001), and 8.51 (95% CI: 5.08–14.3, P < 0.0001), respectively.
We applied the same 5 genetic interactions (Trees 1–5) identified in the primary CC/UP cohort in the larger independent replicate WTCCC dataset. The AUC was 0.733 for prediction based on the 71 CD SNPs with a borderline increase to 0.736 (P = 0.08) after adding the genetic interactions (i.e., Trees).
We examined Trees 1–4 (not Tree 5 because it has only one SNP) in the WTCCC cohort to determine whether they provided additional information beyond the SNPs from which they were composed. Only Tree 1 remained significant (P = 0.0004) after adjusting for the SNPs from which it was composed (Table, Supplemental Digital Content 4, http://links.lww.com/IBD/A168).
Analysis of Genetic Heritability Explained by the 71 CD SNPs and the LR-identified Genetic Interactions (Trees)
We estimated the likelihood-based rescaled pseudo R2 for both CC/UP and WTCCC cohorts. In the CC/UP discovery cohort, the explained heritability of the 71 CD SNPs alone in dominant mode was 24% and increased to 27% after adding the LR-identified high-order genetic interactions (Trees 1–5). In the replicate WTCCC cohort, only a small increase of explained heritability was observed (from 21% to 22%) by adding the same SNP interactions (Trees 1–5) into the 71 CD SNP model.
DISCUSSION
We investigated the predictive value of combining the 71 CD meta-analysis SNPs in a case–control study and replicated the findings in a larger independent population-based study. We found that the combination of the 71 CD SNPs (in dominant mode) has modest to good predictability of CD risk (AUC 0.75 in our discovery CC/UP cohort and 0.73 in the replicate WTCCC cohort). Higher cumulative allele scores predict higher CD risk. However, our study also highlights that the absolute difference of CD cumulative allele scores in CD (49.1 ± 3.3) versus controls (46.9 ± 3.4) is relatively small but statistically different (P < 0.001). This small difference was mostly driven by several of the top CD-associated SNPs. We also identified high-order genetic interactions among the 71 CD loci using a novel approach (LR) and found that risk prediction improved by adding genetic interactions to the 71 CD SNP model (AUC increased from 0.75 to 0.77, P < 0.0001). High-order SNP interaction Tree 1, composed of 5 SNPs in NOD2, ATG16L1, IL10/IL19, C13orf31, and chr21q, was successfully replicated in a large independent WTCCC cohort. To our knowledge, this is the first study investigating CD predictability using the combined 71 CD meta-analysis loci and showing improved model predictability and explained CD heritability by adding the genetic interactions identified through a novel application of LR.
There are several algorithms to search for interactions, including 2 well-known adaptive algorithms CART20 and MARS.21 In this study, we used LR, which has been shown to outperform CART and MARS in situations in which the underlying model is particularly complicated.9–11 LR is a tool to detect interactions between binary predictors that may play important roles in complex traits. Applying LR to the 71 CD SNPs, we identified 5 high-order logical SNP interactions (i.e., Trees 1–5). Trees 1, 2, 3, and 4 were composed of 2 to 5 SNPs, whereas Tree 5 had only one SNP (rs11209026 in IL23R).
Gene–gene interactions can be approached in statistical or mechanistic ways.22 Looking for genes that interact mechanistically may be more useful than identifying statistical interactions for understanding which factors are part of the biological mechanisms influencing CD susceptibility. To assess whether the LR-identified SNP interactions added information beyond that from the individual SNPs that constituted the Tree models, we applied a conventional logistic regression model to compare a model consisting of the 70 single SNPs with a model that included both the 70 SNPs and the identified Trees. The SNP interactions remained statistically significant after excluding the marginal effects of the SNPs that made up the Trees (P ≤ 0.002, Table 2). By examining the excess risk in a logistic regression model, our approach is similar to testing for mechanistic interaction to identify the underlying causal mechanisms.22 When we applied the same Trees model in the replicate WTCCC cohort, only Tree 1 remained significant (P = 0.0004). This finding further supports the evidence of genetic interaction discovered in Tree 1.
Once high-order genetic interactions are identified through LR, the next challenge relates to interpreting and proving that the statistically identified genetic interactions are of importance at a biological level. Tree 1, composed of SNPs in NOD2, ATG16L1, IL10/IL19, and C13orf31, and a gene-poor region on chr21q, is a good example of biological plausibility for the identified genetic interactions because it highlights mechanisms of host response to microbial organisms. NOD2 was the first gene identified to be associated with CD23,24 and its variants affect the leucine-rich repeat domain of the nucleotide-binding oligomerization domain (Nod) microbial-recognition receptors. ATG16L1 is part of the autophagy pathway, which is a crucial element of the innate immune response to intracellular bacteria. Recent studies have provided a mechanistic link between NOD2 and ATG16L1,25,26 both of which were included in the SNP interaction term Tree 1. One study demonstrated that bacterial autophagy triggered by Nod proteins relied on interactions with ATG16L1.
In the absence of other activators, Nod1 and Nod2 induced autophagy by interacting with ATG16L1 and promoted the sequestration of bacteria inside autophagosomes.26 Similarly, a study from our group showed evidence of functional interactions between ATG16L1 and NOD2 in an autophagy-dependent antibacterial pathway that is altered by CD-associated variants in a cell type–specific manner.25 Interleukin (IL)-10 is an anti-inflammatory cytokine with several lines of evidence to support its role in the pathogenesis of IBD, including identification of genetic variants in the IL-10 receptor in subjects with early-onset IBD.27–30 From a possible gene–gene interaction standpoint, peripheral blood mononuclear cells from patients with the NOD2 3020insC mutation show defective release of IL-10 after stimulation with microbial organisms.31 Furthermore, a recent study illustrated the biological relationship between NOD2 3020insC loss-of-function mutations and inhibited IL-10 transcription by blocking phosphorylation of the nuclear ribonucleoprotein hnRNP-A1.32 IL-19 is a member of the IL-10 family and has been shown to have a role in protecting the host from colonic bacterial infection following injury to intestinal epithelial cells.33
A candidate gene, USP25 (ubiquitin-specific deubiquitination protease), at the gene-poor region chr21q11.2,34 located 200 kb away from SNP rs1736020 in Tree 1, is a member of the DUB (deubiquitinating enzymes) superfamily. The first DUB-associated disease, familial cylindromatosis, disclosed the biological relationship between the nuclear factor-kappa B signaling pathway and deubiquitination enzymes.35 In addition, DUB proteins have increasingly been found to be involved in human pathological disorders such as Parkinson’s disease, prostate cancer, lung cancer, and ataxia.36,37 Another locus, C13orf31 (chromosome 13 open reading frame 31), which accounted for SNP rs3764147 in Tree 1, has been found to be associated with leprosy38 and IBD.39 C13orf31 encodes laccase (multicopper oxidoreductase) domain-containing 1 protein, which acts as an antioxidant molecule, scavenging free radicals in the bloodstream and eliminating pro-cancerous environmental estrogen bisphenol A.40 Based on all these lines of evidence, we conclude that there is biological plausibility for interaction between the 5 genes/loci identified in our replicated Tree 1. Follow-up functional work will be necessary, although we believe that the methodology we have used can help identify pathway-based approaches to unravel the pathogenesis of IBD.
As previously reported, only 23% of CD heritability can be explained by the 71 CD GWAS meta-analysis SNPs.1 For case–control studies, disease heritability can be alternatively approximated by the proportion of disease variability explained in a SNP model using generalized R2 measures implemented in the logistic regression model. In our primary CC/UP GWAS, 24% of phenotypic variability was explained by the 71 CD SNPs in dominant mode. After adding the SNP interactions (i.e., Trees) into the 71 CD SNP model, the explained heritability increased from 24% to 27%. More heritability would be expected to be explained by searching more complicated interaction models within or between IBD-related biological pathways (e.g., Th17/Th1 pathway, tumor necrosis factor-α, nuclear factor-κB signaling, the KIR-family, and the NOD2 pathway).41 Of note, in the replicate WTCCC dataset, only a minimal increase of explained CD variability was observed (from 21% to 22%) by adding the SNP interactions into the 71 CD SNP model, which probably relates to heterogeneity between the 2 populations.
The limitations of this study include the relatively small sample size in our CC/UP GWAS (692 CD and 1,437 controls) that may explain why some of the 71 CD meta-analysis SNPs did not achieve genome-wide level of significance. Also, the suboptimal imputation quality in a few SNPs from the replicate WTCCC cohort may hinder the reproducibility of the selected SNP interactions model. Despite using a high imputation accuracy program like MaCH,16,17,41 several factors can still affect imputation results, including the density of genotype platform in the experimental sample (the WTCCC Affymetrix 500k gene chip is considered relatively low density for a GWAS platform), level of linkage disequilibrium in different genetic regions, minor allele frequency of the marker being imputed, and degree of genetic relationship or population heterogeneity between the experimental and reference populations.
In conclusion, this is the first study investigating CD risk using the 71 CD meta-analysis loci and demonstrating increased risk with increasing CD risk allele burden. However, we also highlighted the small differences of cumulative allele scores between CD subjects (49.1 ± 3.3) and healthy controls (46.9 ± 3.4), which may explain why only 23% of CD heritability is attributed to the 71 CD loci. Through a novel approach, we found a high-order genetic interaction, which was successfully replicated in a large independent WTCCC cohort. Using the 71 CD SNPs, model predictability (AUC) improved from 0.75 to 0.77 and estimated explained heritability improved from 24% to 27% after adding the genetic interactions. Potential future applications of our work will include verifying these findings in a larger population-based cohort, including environmental factors to explore gene–environment interactions, defining the biological significance of suggested interactions in cell line or animal models, and searching for genetic interactions using pathway-based approaches.
The authors acknowledge Dr. M. Ilyas Kamboh from the University of Pittsburgh for the contribution of genotype data that was used in this study. The authors thank the patients and the controls for participating in this study. The authors acknowledge the Feinstein Institute for Medical Research of the North Shore-Long Island Jewish Health System for Illumina Genotyping BeadChip processing. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.
1. Franke A, McGovern DPB, Barrett JC, et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat Genet. 2010; 42:1118–1125.
2. Liu S, Song Y. Building genetic scores to predict risk of complex diseases in humans: is it possible? Diabetes. 2010; 59:2729–2731.
3. Yang Q, Khoury MJ, Botto L, et al. Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. Am J Hum Genet. 2003; 72:636–649.
4. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007; 17:1520–1528.
5. Weersma RK, Stokkers PCF, Cleynen I, et al. Confirmation of multiple Crohn's disease susceptibility loci in a large Dutch-Belgian cohort. Am J Gastroenterol. 2009; 104:630–638.
6. Peter I, Mitchell A, Ozelius L, et al. Evaluation of 22 genetic variants with Crohn's disease risk in the Ashkenazi Jewish population: a case-control study. BMC Med Genet. 2011; 12:63.
7. Park JH, Gail MH, Weinberg CR, et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci USA. 2011; 108:18026–18031.
8. Zuk O, Hechter E, Sunyaev SR, et al. The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci USA. 2012; 109:1193–1198.
9. Ruczinski I, Kooperberg C, LeBlanc M. Logic regression. J Comput Graph Stat. 2003; 12:475–511.
10. Ruczinski I, Kooperberg C, LeBlanc M. Exploring interactions in high-dimensional genomic data: an overview of logic regression, with applications. J Multivar Anal. 2004; 90:178–195.
11. Schwender H, Ruczinski I. Logic regression and its extensions. Adv Genet. 2010; 72:25–45.
12. Achkar JP, Klei L, Bakker PIW, et al. Amino acid position 11 of HLA-DR[beta]1 is a major determinant of chromosome 6p association with ulcerative colitis. Genes Immun. 2012; 13:245–252.
13. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009; 84:210–223.
14. Marchini J, Howie B, Myers S, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007; 39:906–913.
15. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007; 447:661–678.
16. Li Y, Willer C, Sanna S, et al. Genotype imputation. Annu Rev Genomics Hum Genet. 2009; 10:387–406.
17. Li Y, Willer CJ, Ding J, et al. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010; 34:816–834.
18. Makowsky R, Pajewski NM, Klimentidis YC, et al. Beyond missing heritability: prediction of complex traits. PLoS Genet. 2011; 7:e1002051.
19. Stokes ME, Davis C, Koch G. Categorical Data Analysis Using the SAS System. Second Edition. Cary, NC: SAS Institute Inc.; 2000.
20. Breiman L, Friedman JH, Stone C, et al. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks; 1984.
21. Friedman JH, Roosen CB. An introduction to multivariate adaptive regression splines. Stat Methods Med Res. 1995; 4:197–217.
22. Berzuini C, Dawid P, Zhang H, et al. Analysis of interaction for identifying causal mechanisms. In: Berzuini C, Dawid P, Bernardinelli L, eds. Causality: Statistical Perspectives and Applications. Wiley; 2012:192–207.
23. Hugot JP, Chamaillard M, Zouali H, et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature. 2001; 411:599–603.
24. Ogura Y, Bonen DK, Inohara N, et al. A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature. 2001; 411:603–606.
25. Homer CR, Richmond AL, Rebert NA, et al. ATG16L1 and NOD2 interact in an autophagy-dependent antibacterial pathway implicated in Crohn's disease pathogenesis. Gastroenterology. 2010; 139:1630–1641.
26. Travassos LH, Carneiro LAM, Ramjeet M, et al. Nod1 and Nod2 direct autophagy by recruiting ATG16L1 to the plasma membrane at the site of bacterial entry. Nat Immunol. 2010; 11:55–62.
27. Li MC, He SH. IL-10 and its related cytokines for treatment of inflammatory bowel disease. World J Gastroenterol. 2004; 10:620–625.
28. Kuhn R, Lohler J, Rennick D, et al. Interleukin-10-deficient mice develop chronic enterocolitis. Cell. 1993; 75:263–274.
29. Kotlarz D, Beier R, Murugan D, et al. Loss of interleukin-10 signaling and infantile inflammatory bowel disease: implications for diagnosis and therapy. Gastroenterology. 2012; 143:347–355.
30. Glocker EO, Kotlarz D, Boztug K, et al. Inflammatory bowel disease and mutations affecting the interleukin-10 receptor. N Engl J Med. 2009; 361:2033–2045.
31. Netea MG, Kullberg BJ, de Jong DJ, et al. NOD2 mediates anti-inflammatory signals induced by TLR2 ligands: implications for Crohn's disease. Eur J Immunol. 2004; 34:2052–2059.
32. Noguchi E, Homma Y, Kang X, et al. A Crohn's disease-associated NOD2 mutation suppresses transcription of human IL10 by inhibiting activity of the nuclear ribonucleoprotein hnRNP-A1. Nat Immunol. 2009; 10:471–479.
33. Azuma YT, Matsuo Y, Kuwamura M, et al. Interleukin-19 protects mice from innate-mediated colonic inflammation. Inflamm Bowel Dis. 2010; 16:1017–1028.
34. Valero R, Marfany G, Gonzalez-Angulo O, et al. USP25, a novel gene encoding a deubiquitinating enzyme, is located in the gene-poor region 21q11.2. Genomics. 1999; 62:395–405.
35. Kovalenko A, Chable-Bessia C, Cantarella G, et al. The tumour suppressor CYLD negatively regulates NF-kappaB signalling by deubiquitination. Nature. 2003; 424:801–805.
36. Leroy E, Boyer R, Auburger G, et al. The ubiquitin pathway in Parkinson's disease. Nature. 1998; 395:451–452.
37. Graner E, Tang D, Rossi S, et al. The isopeptidase USP2a regulates the stability of fatty acid synthase in prostate cancer. Cancer Cell. 2004; 5:253–261.
38. Zhang FR, Huang W, Chen SM, et al. Genomewide association study of leprosy. N Engl J Med. 2009; 361:2609–2618.
39. Barrett JC, Hansoul S, Nicolae DL, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet. 2008; 40:955–962.
40. Keri RA, Ho SM, Hunt PA, et al. An evaluation of evidence for the carcinogenic activity of bisphenol A. Reprod Toxicol. 2007; 24:240–252.
41. Cho JH., IIBDGC. Immunochip-based analysis of a large IBD case-control cohort identifies 50 novel loci, refining definitions of disease pathways. Gastroenterology. 2012; 142:S149–S150.
genetics; genetic interaction; cumulative genetic effect; Crohn’s disease
Copyright © 2013 Crohn's & Colitis Foundation of America, Inc.