How Much Are We Missing in SNP-by-SNP Analyses of Genome-wide Association Studies?

Shi, Min; Weinberg, Clarice R.

doi: 10.1097/EDE.0b013e31822ffbe7

Genome-wide association studies have discovered common genetic variants associated with susceptibility to several complex diseases, but they have been unfruitful for many others. Typically, analysis is done “agnostically,” by considering one single nucleotide polymorphism (SNP) at a time and controlling the overall type I error rate by correcting for multiple testing. Such one-at-a-time analyses may be inadequate for screening genes under realistic causal models. We use oral clefting as a disease model to develop a range of toy example scenarios: risk might involve only genes, or genes and exposure, or genes, exposure, and their supermultiplicative interaction. These examples illustrate how thoroughly important genetic variants can be obscured by a one-SNP-at-a-time analysis when multiple biologic pathways and multiple genes jointly influence etiology. They also highlight the need for better methods for gene-by-environment and gene-by-gene analyses.

From the Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC.

Submitted 9 February 2011; accepted 14 June 2011; posted 30 August 2011.

Supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01-ES040007; Z01-ES45002).

Correspondence: Min Shi, Biostatistics Branch, A3–03 101/A315, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709. E-mail:

The decoding of the human genome and the availability of affordable and accurate high-throughput genotyping have raised great expectations for genomic medicine. Although genome-wide association studies (GWAS) have identified some important genetic variants associated with certain common diseases, GWAS results have not fully lived up to our high expectations.1,2 In addition, variants that have been identified appear to explain only a small proportion of genetic heritability, even for diseases that clearly have a major genetic component.3 Various explanations have been offered. For some outcomes, the genetic contribution could be due to the action of diverse but rare susceptibility variants, which can result in apparent modest associations with any one common SNP.4 In this short paper, we provide examples to illustrate another sort of mechanism, where gene-by-gene interplay of common variants can result in very modest associations when the etiology requires particular genotypes at multiple loci. Here, we will focus on a birth defect (oral clefting), but similar examples can readily be constructed involving common outcomes, such as type II diabetes.

The prevalence at birth of oral clefting ranges from 1/600 to 1/1200, depending on ethnicity, geographic region, and socioeconomic status. The recurrence rate in siblings of affected babies is 40 times that in the general population. The concordance rate for monozygotic twins ranges from 25% to 40%, much higher than that in dizygotic twins (3% to 6%);5 the dizygotic rate is in turn quite close to the risk in a later full sibling of an affected baby. Genetics clearly plays an important role in clefting. However, even with the enormous efforts devoted to studying this condition,6–9 most of its etiology has remained elusive.

Rare variants, structural variations, epigenetic effects, and gene-gene and gene-environment interactions have all been proposed as phenomena that may contribute to the low yield of GWAS.10 The methods of analysis used can also impose limitations. Typically, analysis is done, either for case-parent triad designs or for case–control studies, by “agnostically” considering one SNP at a time, controlling the overall type I error rate by correcting for multiple testing. Here, we develop toy examples to illustrate how a one-SNP-at-a-time analysis can overlook important genetic variants when multiple biologic pathways, multiple genes, and gene-environment interactions jointly play an important role in etiology.

Replete with genes having similar roles, the human genome is well known for “genetic redundancy,” and disruption of a single gene may often be selectively neutral.11 Consequently, it may take the malfunction of several genes in a pathway to produce a particular phenotype. At the same time, genetic heterogeneity is common for complex diseases and many pathways may be involved in a biologic process; disruption of any one of these pathways may produce the same phenotype. Palate formation, for example, requires coordination of several signaling pathways: fibroblast growth factor, transforming growth factor beta, bone morphogenetic protein, and sonic hedgehog pathways.12 Dysfunction of any of these pathways could potentially lead to clefting.

Consider this simple example. Suppose 2 independent pathways can cause clefting, and in each pathway simultaneous disruption of 4 unlinked genes is needed to trigger the event (Figure). In other words, a fetus carrying at least one risk allele at each of 4 particular loci in either one of the pathways will develop oral cleft. With 2 such sets of loci, suppose that all 8 risk alleles are unlinked, and each diallelic locus has a carrier frequency of 0.11 for the susceptibility allele. Suppose a fetus with no susceptibility variant sets has a risk (the baseline risk) of 1/1500 and the risk associated with having all 4 susceptibility variants for either causative set is 1.0; thus, the relative risk for either high-risk sufficient multilocus genotype is 1500. This scenario is summarized in the first column of the Table.
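The arithmetic behind this scenario can be written out as a short worked calculation. It assumes only what is stated above (independent loci, a carrier frequency of 0.11 at each locus, 4 loci per set, 2 sets); the symbols f and p are introduced here for illustration and are not the authors' notation.

```latex
% f = carrier frequency at each locus, p = probability that one 4-locus set is complete
\begin{align*}
p &= f^{4} = 0.11^{4} \approx 1.46 \times 10^{-4},\\
\Pr(\text{at least one complete set}) &= 1 - (1 - p)^{2} \approx 2.93 \times 10^{-4}.
\end{align*}
```

Under these assumptions, roughly 3 fetuses in 10,000 carry a sufficient set, even though 11% of fetuses carry the risk allele at any single locus.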





We performed simulations under the aforementioned assumptions (R code is available at ). We first simulated parental genotypes assuming Hardy–Weinberg equilibrium and generated 2 offspring per family assuming Mendelian transmission. The clefting status of each offspring was then assigned probabilistically through a Bernoulli trial based on the fetal genotype alone, applying the assumed risk model. We continued the simulation until 100,000 families were obtained in which the first offspring was affected.
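The simulation steps just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: all function names are hypothetical, and it assumes that a "carrier" has at least one risk allele, so that the allele frequency implied by a carrier frequency f under Hardy–Weinberg equilibrium is 1 − √(1 − f).

```python
import math
import random

N_SETS, LOCI_PER_SET = 2, 4  # two pathways, four unlinked loci each (as in the text)

def allele_freq(carrier_freq):
    """Risk-allele frequency implied by a carrier frequency under HWE (assumption)."""
    return 1.0 - math.sqrt(1.0 - carrier_freq)

def parent_genotype(rng, q, n_loci):
    """Two independently drawn alleles per locus; True marks a risk allele."""
    return [(rng.random() < q, rng.random() < q) for _ in range(n_loci)]

def child_genotype(rng, mother, father):
    """Mendelian transmission: one randomly chosen allele from each parent per locus."""
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]

def risk(child, baseline, set_risk):
    """set_risk if the child is a carrier at every locus of any set, else baseline."""
    carrier = [a or b for a, b in child]
    for s in range(N_SETS):
        if all(carrier[s * LOCI_PER_SET:(s + 1) * LOCI_PER_SET]):
            return set_risk
    return baseline

def simulate_family(rng, q, baseline, set_risk):
    """Simulate two parents and two offspring; return genotypes and affection status."""
    n_loci = N_SETS * LOCI_PER_SET
    mother = parent_genotype(rng, q, n_loci)
    father = parent_genotype(rng, q, n_loci)
    kids = [child_genotype(rng, mother, father) for _ in range(2)]
    affected = [rng.random() < risk(k, baseline, set_risk) for k in kids]
    return kids, affected

def collect_case_families(n_families, carrier_freq, baseline, set_risk, seed=0):
    """Keep simulating until n_families have an affected first offspring."""
    rng = random.Random(seed)
    q = allele_freq(carrier_freq)
    families = []
    while len(families) < n_families:
        kids, affected = simulate_family(rng, q, baseline, set_risk)
        if affected[0]:
            families.append((kids, affected))
    return families
```

For example, `collect_case_families(100, 0.11, 1/1500, 1.0)` would gather 100 case families under the genes-only parameters above. A pure-Python loop like this is far too slow to reach 100,000 case families at realistic risks, so a run at that scale would need a vectorized implementation.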

Simulation results allowed us to select a scenario (“scenario 1”) for which the population prevalence at birth of clefting was approximately 1/1000 and the recurrence relative risk in siblings was 30. As shown in the first column of the Table, for this scenario the estimated concordance rate in monozygotic twins was 0.305, whereas that in dizygotic twins was 0.029, corresponding to the epidemiologic data. The fraction of cases of clefting attributable to genotype (either set) was 0.305. Despite the huge relative risk (1500) corresponding to co-occurrence of either set of 4 susceptibility alleles, the marginal relative risk for each SNP, when considered alone, was only 2.6.
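Because the genes-only risk model is so simple, the population-level quantities can also be approximated in closed form, which gives a quick check on simulated values. The sketch below is not the authors' method; the function name and its arguments are hypothetical, and it assumes independent loci and a risk capped at 1.

```python
def set_model_summary(carrier_freq, baseline, set_rr, n_sets=2, loci_per_set=4):
    """Closed-form prevalence, attributable fraction, and single-SNP marginal RR."""
    set_risk = min(1.0, baseline * set_rr)  # risk given at least one complete set
    p_set = carrier_freq ** loci_per_set    # P(one particular set is complete)

    def mean_risk(p_first):
        # Average risk when the first set is complete with probability p_first
        # and every other set is complete with probability p_set.
        p_any = 1.0 - (1.0 - p_first) * (1.0 - p_set) ** (n_sets - 1)
        return p_any * set_risk + (1.0 - p_any) * baseline

    prevalence = mean_risk(p_set)
    # A carrier at one locus of the first set needs only that set's other loci.
    risk_carrier = mean_risk(carrier_freq ** (loci_per_set - 1))
    # A noncarrier at that locus can never complete the first set.
    risk_noncarrier = mean_risk(0.0)
    return {
        "prevalence": prevalence,
        "attributable_fraction": (prevalence - baseline) / prevalence,
        "marginal_rr": risk_carrier / risk_noncarrier,
    }

scenario1 = set_model_summary(carrier_freq=0.11, baseline=1 / 1500, set_rr=1500)
print(scenario1)
```

With the parameters given for this scenario, the calculation yields a prevalence close to 1/1000, an attributable fraction of about 0.305, and a marginal relative risk of about 2.6, in line with the simulated values reported above.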

Scenario 2 (second column of the Table) is slightly modified such that the carrier frequency at each of the 8 loci was 0.15 and the risk in a fetus with no susceptibility variant sets was 1/2475. The relative risk for either high-risk multilocus genotype was 1100. The population prevalence at birth of clefting was approximately 1/1200 and the recurrence relative risk in a sibling of a case was 28. The concordance rates in monozygotic and dizygotic twins were 0.23 and 0.024, respectively. All of these numbers approximate the known epidemiology of oral clefting, whereas the marginal relative risk for each of the 8 SNPs was only 3.4. In both scenarios, SNP-wise analysis would overlook most of the information carried by the 4 interacting loci, and the effects of genotype could easily be missed in a GWAS analysis.

As with many other diseases, the etiology of clefting apparently involves environmental effects and gene-environment interactions as well. We next considered a more complicated scenario where, in addition to the genetic risk factors, an environmental factor (eg, maternal smoking) can also increase the risk of clefting. Again, we assumed that 2 genetic pathways cause clefting, each consisting (for simplicity) of 4 unlinked SNPs in a high-risk sufficient set. We assumed that maternal exposure status remains the same across pregnancies; consequently, the concordance rate for dizygotic twins is the same as the sibling recurrence risk. We computed the concordance rate for monozygotic twins as $\mathcal{E}(\mathrm{risk}^2)/\mathcal{E}(\mathrm{risk})$, where $\mathcal{E}$ denotes expectation and the averaging is across possible exposure (E) by genotype (G) categories, by applying the known risk model and the known prevalences for E and G. One such scenario (“scenario 3”) is described in the Table. We assumed an exposure prevalence of 0.25, which does not change between the 2 births for a given mother; that is, smokers continue to smoke. Except for an exposure relative risk of 1.5, all the other parameters remained the same as in scenario 2. Because there is no G × E interaction in this scenario, the relative risk for either causative gene set is 1100 in both exposed and unexposed. We again used simulations to evaluate this scenario. A dichotomous maternal exposure was simulated via Bernoulli trials with outcome probability equal to the designated exposure prevalence. Our simulations showed that under this scenario, the population prevalence at birth of clefting was approximately 1/1000 and the recurrence relative risk in a later sibling was 31, again agreeing well with epidemiologic data. The concordance rate in monozygotic twins became 0.27, whereas that in dizygotic twins was 0.03. The marginal relative risk for each SNP, which is the parameter estimated when testing SNP-by-SNP, diminished to 3.5.
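The monozygotic concordance calculation described above, the expectation of squared risk divided by the expectation of risk, can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that genotype and exposure are independent, that monozygotic twins share both, and that the twins' outcomes are independent given their shared risk.

```python
def mz_concordance(carrier_freq, baseline, set_rr,
                   exposure_prev=0.0, exposure_rr=1.0,
                   n_sets=2, loci_per_set=4):
    """E(risk^2) / E(risk), averaging over genotype-by-exposure categories."""
    p_set = carrier_freq ** loci_per_set
    p_any = 1.0 - (1.0 - p_set) ** n_sets  # P(at least one complete set)
    genotype_cats = ((p_any, set_rr), (1.0 - p_any, 1.0))
    exposure_cats = ((exposure_prev, exposure_rr), (1.0 - exposure_prev, 1.0))

    mean_risk = 0.0
    mean_sq_risk = 0.0
    for p_g, g_rr in genotype_cats:
        for p_e, e_rr in exposure_cats:
            # No G x E interaction: relative risks multiply, capped at 1.
            r = min(1.0, baseline * g_rr * e_rr)
            mean_risk += p_g * p_e * r
            mean_sq_risk += p_g * p_e * r * r
    return mean_sq_risk / mean_risk

# Scenario 3 parameters as given in the text.
print(mz_concordance(0.15, 1 / 2475, 1100, exposure_prev=0.25, exposure_rr=1.5))
```

With the scenario 3 parameters this gives approximately 0.27, and setting the exposure relative risk to 1 gives approximately 0.23, matching the scenario 2 value.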

In scenario 4, we allow a supermultiplicative interaction between smoking and one of the multilocus genotypes (the first causative gene set), but not the other. The relative risk for the first causative gene set was 1100 in an unexposed maternal-fetal pair and 1650 in an exposed maternal-fetal pair (an interaction of 1.5), whereas the relative risk for the second causative gene set was 1100 in both exposed and unexposed pairs (no interaction). The exposure prevalence, exposure relative risk, variant allele carrier frequency, and baseline disease risk remained the same as those in scenarios 2 and 3. Again, the marginal effect for a single SNP was markedly reduced (to 3.3 in the unexposed).
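The absolute risks implied by scenario 4 can be written out explicitly. The calculation below assumes that each gene-set relative risk is expressed relative to a fetus with no complete set in the same exposure stratum; the symbol r_0 for the baseline risk is introduced here for illustration.

```latex
% r_0 = baseline risk (unexposed, no complete set) = 1/2475
\begin{align*}
\text{unexposed, no set:} \quad & r_0 = 1/2475 \approx 4.0 \times 10^{-4}\\
\text{exposed, no set:} \quad & 1.5\, r_0 = 1/1650 \approx 6.1 \times 10^{-4}\\
\text{unexposed, either set:} \quad & 1100\, r_0 = 4/9 \approx 0.44\\
\text{exposed, second set only:} \quad & 1.5 \times 1100\, r_0 = 2/3 \approx 0.67\\
\text{exposed, first set:} \quad & 1.5 \times 1650\, r_0 = 2475/2475 = 1
\end{align*}
```

Under this reading, the interaction raises the risk for an exposed fetus carrying the first set from 2/3 to certainty.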

These examples illustrate how single-SNP analyses can miss important genetic effects and underscore the importance of developing better methods for identifying G × E and G × G joint effects. Pathway libraries exist to identify genes known to be functionally involved in particular pathways, and analyses can be undertaken to assess enrichment of those pathways among the “hits” from GWAS. We refer the interested reader to the review by Wang et al.13 However, regulation of pathways is complex and poorly understood, and many genes can be involved in multiple pathways; thus, methods that do not rely on previous knowledge are also needed. Another problem is that, unlike in our toy examples, the investigator is typically not lucky enough to have typed the causative SNPs, but must rely on “marker” SNPs (SNPs in linkage disequilibrium with the actual causative SNPs) to “tag” the real ones. This unavoidable use of proxies produces considerable measurement error, adding a further level of complication to inference related to multilocus and G × E analyses of GWAS. Future methods development will also need to take this into consideration.

Our message here has been discouraging, but there are some positive implications. If we do a careful GWAS or a candidate-gene study of a complex condition and fail to find evidence for risk-related SNPs that can withstand a Bonferroni correction, perhaps it is not that the genetic effects are small or that we have selected the wrong candidate genes. Perhaps we simply do not yet have the statistical tools needed to tease out the complex interactions that lead to this condition.

We thank Douglas Bell and Dmitri Zaykin for their careful review and valuable comments.

1. Eichler EE, Flint J, Gibson G, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–450.
2. Chee-Seng K, En Yun L, Yudi P, Kee-Seng C. Genome-wide association studies: the success, failure and future. In: Encyclopedia of Life Sciences. Hoboken, NJ: Wiley-Blackwell; 2009.
3. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.
4. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. Rare variants create synthetic genome-wide associations. PLoS Biol. 2010;8:
5. Mitchell LE, Risch N. Mode of inheritance of nonsyndromic cleft lip with or without cleft palate: a reanalysis. Am J Hum Genet. 1992;51:323–332.
6. Birnbaum S, Ludwig KU, Reutter H, et al. Key susceptibility locus for nonsyndromic cleft lip with or without cleft palate on chromosome 8q24. Nat Genet. 2009;41:473–477.
7. Mangold E, Ludwig KU, Birnbaum S, et al. Genome-wide association study identifies two susceptibility loci for nonsyndromic cleft lip with or without cleft palate. Nat Genet. 2010;42:24–26.
8. Beaty TH, Murray JC, Marazita ML, et al. A genome-wide association study of cleft lip with and without cleft palate identifies risk variants near MAFB and ABCA4. Nat Genet. 2010;42:525–529.
9. Grant SF, Wang K, Zhang H, et al. A genome-wide association study identifies a locus for nonsyndromic cleft lip with or without cleft palate on 8q24. J Pediatr. 2009;155:909–913.
10. McCarthy MI, Abecasis GR, Cardon LR, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9:356–369.
11. Kimura M. The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press; 1985.
12. Murray JC, Schutte BC. Cleft palate: players, pathways, and pursuits. J Clin Invest. 2004;113:1676–1678.
13. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet. 2010;11:843–854.
© 2011 Lippincott Williams & Wilkins, Inc.