Genetics: Commentary

Population Stratification Bias

More Widespread Than Previously Thought

Kraft, Peter

doi: 10.1097/EDE.0b013e3182137e03

The specter of population stratification has long haunted genetic epidemiology. If study participants are drawn from a non-randomly mating population (likely true, thanks to geographic and social barriers1) and if the distribution of risk factors differs across preferentially mating subpopulations (toxic waste dumps are more likely to cluster in poor districts, for example, and fresh fruits and vegetables are strikingly difficult to find in African American neighborhoods2–6), then the crude association between a measured genetic marker and disease can be confounded, sometimes greatly so.7 In this issue, Shi et al8 show that this confounding can also affect a method for analyzing family-based studies for gene-environment interaction. This is a surprising result, as this method was previously thought to be unaffected by population stratification.
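The confounding mechanism in unrelated samples can be illustrated with a small expected-value calculation. The sketch below assumes two equally sized subpopulations that differ in both marker-carrier frequency and baseline disease risk; all numbers and names (`Subpopulation`, `marker_rr`, and so on) are hypothetical and chosen only for illustration. The marker has no effect on disease within either subpopulation, yet the crude comparison suggests one.

```python
from dataclasses import dataclass


@dataclass
class Subpopulation:
    name: str
    share: float          # fraction of the total study population (assumed)
    carrier_freq: float   # frequency of marker carriers (assumed)
    baseline_risk: float  # disease risk in non-carriers (assumed)


# Hypothetical values for illustration only; they do not come from any study.
SUBPOPULATIONS = [
    Subpopulation("A", share=0.5, carrier_freq=0.2, baseline_risk=0.05),
    Subpopulation("B", share=0.5, carrier_freq=0.6, baseline_risk=0.15),
]


def stratum_risk_ratio(sub: Subpopulation, marker_rr: float = 1.0) -> float:
    """Risk ratio for carriers vs non-carriers within one subpopulation."""
    return (sub.baseline_risk * marker_rr) / sub.baseline_risk


def crude_risk_ratio(subpops, marker_rr: float = 1.0) -> float:
    """Risk ratio for carriers vs non-carriers, ignoring subpopulation."""
    carriers = noncarriers = 0.0
    carrier_cases = noncarrier_cases = 0.0
    for sub in subpops:
        n_car = sub.share * sub.carrier_freq
        n_non = sub.share * (1.0 - sub.carrier_freq)
        carriers += n_car
        noncarriers += n_non
        carrier_cases += n_car * sub.baseline_risk * marker_rr
        noncarrier_cases += n_non * sub.baseline_risk
    return (carrier_cases / carriers) / (noncarrier_cases / noncarriers)


if __name__ == "__main__":
    for sub in SUBPOPULATIONS:
        print(sub.name, "within-subpopulation risk ratio:", stratum_risk_ratio(sub))
    print("crude risk ratio:", crude_risk_ratio(SUBPOPULATIONS))
```

With these assumed inputs the within-subpopulation risk ratio is 1 in both groups, while the crude risk ratio is 1.5, because carriers are concentrated in the higher-risk subpopulation.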

Controlling for potential confounding by population stratification can be quite challenging. Self-reported race is a poor surrogate for the underlying causal variables—labels like African American or Latino encompass a broad range of genetic and social diversity.9 Data from modern genome-wide genotyping platforms can provide fine-scale measurements of genetic ancestry,10 but many candidate-gene studies will not have collected such data.

Because family-based tests condition on parental genotypes, they remain valid tests of the null hypothesis that a marker is neither a causal variant itself nor in linkage disequilibrium with a causal marker—even in the presence of population stratification. Originally proposed for simple family structures such as a case and parents or disease-discordant sibling pairs, this approach has since been extended to more complicated family structures and other hypotheses, including tests of gene-environment interaction.11,12

As Shi et al8 point out, there is a problem with tests of gene-environment interaction when the marker being tested is associated with disease risk but is not itself the causal variant. This would likely be the case if the marker had been identified by a genome-wide association study. If the strength of the linkage disequilibrium between the marker and the causal variant varies across preferentially mating subpopulations, and the distribution of the exposure also differs across subpopulations, then differences in the marker's apparent genetic effect across subpopulations, which are due solely to differences in linkage disequilibrium, will appear to be due to gene-environment interaction. In other words, population stratification can bias standard family-based tests of gene-environment interaction. Exposure-stratum-specific estimates of the genetic effect will also be more difficult to interpret.13
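The argument above can be made concrete with a deterministic sketch. It is not the analysis of Shi et al; it is a simplified illustration under stated assumptions: a multiplicative causal allele with the same relative risk in exposed and unexposed children (so no true interaction), transmissions counted only from parents heterozygous at the marker, and a hypothetical number of informative transmissions per subpopulation. All names and values are invented for the example. Under the multiplicative model, the odds that a heterozygous parent transmits the marker allele to an affected child equal the ratio of the average risks carried by the two haplotypes.

```python
from dataclasses import dataclass

CAUSAL_RR = 2.0  # assumed per-allele relative risk, identical in both exposure strata


@dataclass
class Subpopulation:
    name: str
    p_causal_given_marker: float      # P(causal allele | haplotype carries marker allele)
    p_causal_given_other: float       # P(causal allele | haplotype carries other allele)
    exposure_prevalence: float        # P(child exposed) among case families (assumed)
    informative_transmissions: float  # transmissions from marker-heterozygous parents (assumed)


# Hypothetical values: strong linkage disequilibrium and common exposure in A,
# no linkage disequilibrium and rare exposure in B.
SUBPOPULATIONS = [
    Subpopulation("A", 0.9, 0.1, exposure_prevalence=0.8, informative_transmissions=1000),
    Subpopulation("B", 0.3, 0.3, exposure_prevalence=0.2, informative_transmissions=1000),
]


def haplotype_risk(p_causal: float) -> float:
    """Average relative risk conferred by a haplotype carrying the causal allele with probability p_causal."""
    return p_causal * CAUSAL_RR + (1.0 - p_causal)


def transmission_odds(sub: Subpopulation) -> float:
    """Odds of transmitting the marker allele to an affected child within one subpopulation.

    The exposure does not enter, because the causal effect is the same in both strata.
    """
    return haplotype_risk(sub.p_causal_given_marker) / haplotype_risk(sub.p_causal_given_other)


def pooled_transmission_odds(subpops, exposed: bool) -> float:
    """Transmitted/untransmitted ratio in one exposure stratum, pooling subpopulations."""
    transmitted = untransmitted = 0.0
    for sub in subpops:
        share = sub.exposure_prevalence if exposed else 1.0 - sub.exposure_prevalence
        n = sub.informative_transmissions * share
        odds = transmission_odds(sub)
        transmitted += n * odds / (1.0 + odds)
        untransmitted += n / (1.0 + odds)
    return transmitted / untransmitted


def apparent_interaction_ratio(subpops) -> float:
    """Ratio of the pooled marker effect in exposed vs unexposed children."""
    return pooled_transmission_odds(subpops, True) / pooled_transmission_odds(subpops, False)


if __name__ == "__main__":
    print("apparent interaction ratio:", apparent_interaction_ratio(SUBPOPULATIONS))
```

Within each subpopulation the marker effect is identical in exposed and unexposed children, so the interaction ratio is exactly 1. Pooled across the two hypothetical subpopulations it is roughly 1.39, because the exposed stratum is dominated by the subpopulation in which the marker tags the causal variant well.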

Shi et al8 construct a test for gene-environment interaction that is unaffected by population stratification, although this test requires that information on the environmental exposure be collected for an unaffected sibling. They fit a saturated model for the genetic main effect within strata defined by the siblings' exposure profile. While this approach is elegant and effective, attempts to extend it beyond the binary exposure case will face practical challenges—for example, it will be difficult to fit a saturated genetic model when some exposure profiles are rare.

The empirical extent and magnitude of the form of population stratification bias identified by Shi et al8 are unknown. The example they give to illustrate the problem is admittedly extreme, but there are realistic scenarios where this bias cannot be readily dismissed. As with crude tests of association in unrelated samples, the likelihood and magnitude of population stratification bias for family-based tests of gene-environment interaction are probably greater for recently admixed populations, such as African Americans and Latinos.14

There is one important application where population stratification will not bias family-based tests of gene-environment interaction. If one wishes to test the global null hypothesis that the marker is not associated with disease in any exposure stratum (ie, there is no genetic main effect and no gene-environment interaction), then the family-based test of gene-environment interaction will not have an inflated Type I error rate. This null hypothesis would be appropriate if one wished to conduct a genome-wide gene-environment interaction study to identify markers that standard marginal-association tests failed to detect.15,16 In contrast, if the goal of a study is to characterize a marker that has already been shown to have a marginal association with disease, then the concern raised by Shi et al8 is relevant.


PETER KRAFT is an Associate Professor of Epidemiology and Biostatistics and Deputy Director of the Program in Molecular and Genetic Epidemiology at the Harvard School of Public Health. He has published papers on statistical methods for the study of gene-environment interaction in both family- and population-based studies.


1. Park H. Who is marrying whom? The New York Times, 2011.
2. Evans GW, Kantrowitz E. Socioeconomic status and health: the potential role of environmental risk exposure. Annu Rev Public Health. 2002;23:303–331.
3. Zenk SN, Schulz AJ, Israel BA, James SA, Bao S, Wilson ML. Fruit and vegetable access differs by community racial composition and socioeconomic position in Detroit, Michigan. Ethn Dis. 2006;16:275–280.
4. Galvez MP, Morland K, Raines C, et al. Race and food store availability in an inner-city neighbourhood. Public Health Nutr. 2008;11:624–631.
5. Morland K, Filomena S. Disparities in the availability of fruits and vegetables between racially segregated urban neighbourhoods. Public Health Nutr. 2007;10:1481–1489.
6. Bodor JN, Rice JC, Farley TA, Swalm CM, Rose D. Disparities in food access: does aggregate availability of key foods from other stores offset the relative lack of supermarkets in African-American neighborhoods? Prev Med. 2010;51:63–67.
7. Knowler W, Williams R, Pettitt D, Steinberg A. Gm3,5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet. 1988;43:520–526.
8. Shi M, Umbach D, Weinberg C. Family based gene-by-environment interaction studies: Revelations and remedies. Epidemiology. 2011;22:400–407.
9. Florez JC, Price AL, Campbell D, et al. Strong association of socioeconomic status with genetic ancestry in Latinos: implications for admixture studies of type 2 diabetes. Diabetologia. 2009;52:1528–1536.
10. Novembre J, Johnson T, Bryc K, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101.
11. Laird N, Lange C. Family-based designs in the era of large-scale gene-association studies. Nat Rev Genet. 2006;7:385–394.
12. Cordell HJ, Barratt BJ, Clayton DG. Case/pseudocontrol analysis in genetic association studies: A unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects. Genet Epidemiol. 2004;26:167–185.
13. Kaufman JS. Interaction reaction. Epidemiology. 2009;20:159–160.
14. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. JNCI. 2000;92:1151–1158.
15. Khoury MJ, Wacholder S. Invited commentary: from genome-wide association studies to gene-environment-wide interaction studies–challenges and opportunities. Am J Epidemiol. 2009;169:227–230, discussion 234–235.
16. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Hum Hered. 2007;63:111–119.
© 2011 Lippincott Williams & Wilkins, Inc.