In this regard, we (and others) have begun to question the soundness of the design of large cohort studies being launched in some countries, such as the United Kingdom, and under active consideration in others.7 For example, the 500,000-subject U.K. BioBank Study has a prototypic “huge, (data-) thin” design. The BioBank appears to have had its sample size determined largely on the basis of what could be afforded. The official, and apparently post hoc, BioBank sample-size calculations are based on the primary objective of detecting gene–environment interactions in the classic epidemiologic sense of biologically significant departures from multiplicative interaction between risk factors, with analyses carried out on nested case–control studies of dichotomous disease outcomes. In fact, the BioBank web site's power calculations suggest that there will be adequate statistical power for detecting such interactions only for the most frequent half-dozen diseases of mid to late life, such as heart disease, diabetes, and the top 3 or 4 cancer sites, even after a full 10 years of follow-up.
As pointed out by Burton and others,8 this low-power problem can be partially addressed by selecting continuously distributed variables as primary outcomes. Examples would be certain markers of disease risk (blood pressure, serum lipid levels) as well as normal variation in traits such as height, IQ, and many developmental scales; and intermediary indicators of biologic pathways that signal altered body metabolic responses to environmental stressors or exposures (eg, diurnal patterns of stress hormone levels in serum, urine or saliva; Table 2).
Continuous outcomes also have less onerous requirements—in sample size terms—for the detection of interactions that constitute departures from additivity. At least one eminent epidemiologist9 has argued that such departures are the only “generally biologically meaningful” sort of interaction anyway. The implication is that spending considerably more resources to detect departures from multiplicative interaction—which requires much larger sample sizes—may not be justified.
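The distinction between the 2 interaction scales can be made concrete with a small worked example. The sketch below uses purely hypothetical risks (they are illustrative assumptions, not BioBank figures) for the 4 combinations of a genetic factor and an environmental exposure. The function names are ours.

```python
def additive_interaction(r00, r10, r01, r11):
    """Interaction contrast on the risk-difference scale (0 = exactly additive)."""
    return r11 - r10 - r01 + r00


def multiplicative_interaction(r00, r10, r01, r11):
    """Joint relative risk divided by the product of the two separate
    relative risks (1 = exactly multiplicative)."""
    return (r11 / r00) / ((r10 / r00) * (r01 / r00))


# Hypothetical risks: r00 = neither factor, r10 = gene only,
# r01 = exposure only, r11 = both factors present.
risks = dict(r00=0.01, r10=0.02, r01=0.03, r11=0.06)

print("additive contrast:", additive_interaction(**risks))
print("multiplicative ratio:", multiplicative_interaction(**risks))
```

With these assumed risks the relative risks are 2, 3, and 6, so the joint effect is exactly multiplicative (ratio of 1), yet the joint risk exceeds the additive expectation by about 0.02. The same data therefore show no interaction on one scale and a clear interaction on the other.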
Furthermore, the U.K. BioBank design has reduced study costs by limiting most of its environmental and lifestyle exposure measures to self-report, apparently at baseline only. As a result, important environmental and lifestyle exposures for chronic disease will be measured with considerable error or not at all, with predictable effects on the modeling of gene–environment interactions, as more fully explicated below.
In Canada and other countries where large cohort studies are being planned, a key design issue is the inherently uneven playing field that surrounds the accurate measurement of genetic factors compared with environmental exposures. Table 1 compares the essential measurement features of these 2 very different types of risk factors. The playing field may become even more uneven with each passing year as genomic measurements become less expensive and more comprehensive (eg, whole genome screens), whereas the list of suspect environmental exposures mushrooms (eg, specific isomers of PCBs and other persistent organic pollutant body burdens).
The power of new genomic approaches is illustrated by the whole genome screen, which recently established that a variant of complement factor H accounts for approximately 50% of the attributable risk of developing age-related macular degeneration,10–13 a common, genetically complex disorder. This landmark study led to the subsequent recognition that 2 other genes in the alternative complement pathway contribute to the risk as well.14 In contrast, the only major environmental contributor to age-related macular degeneration that has been identified to date is tobacco smoking. Because most persons with this condition have never smoked, other important environmental factors must be required for it to develop.
We would reemphasize the obvious: environmental exposure covariates, whether they be physicochemical, biologic (eg, infectious), or sociocultural, are time-varying. This fact, together with our growing understanding of “windows of vulnerability” in human development and disease causation,15–21 dictates the importance of repeated measures of the environment on multiple occasions, especially during those periods of life when a specific exposure is thought to be both potent in its health effects and temporally unstable.
Indeed, without multiple measures of changing exposures over time, the cohort design may lose much of its inherent advantage over case–control or case–cohort designs. Furthermore, regression dilution and other biases22 can be expected to compromise the cohort study's ability to characterize gene–environment interactions accurately.
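Regression dilution can be illustrated with a short simulation. This is a minimal sketch under assumptions of our own choosing: a single continuous exposure, classical (random, nondifferential) measurement error with the same variance as the true exposure, and a true slope of 1.0. None of these values comes from any actual cohort.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, beta=1.0, error_sd=1.0, seed=0):
    """Return the slope estimated from the true exposure and from a
    single error-prone measurement of it (all values hypothetical)."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * x + rng.gauss(0, 1) for x in true_x]
    # One-time measurement = true exposure + independent random error
    observed_x = [x + rng.gauss(0, error_sd) for x in true_x]
    return ols_slope(true_x, y), ols_slope(observed_x, y)


slope_true, slope_observed = simulate()
print("slope using true exposure:", round(slope_true, 2))
print("slope using one noisy measurement:", round(slope_observed, 2))
```

Under these assumptions the expected attenuation factor is var(true) / (var(true) + var(error)) = 0.5, so the slope estimated from the single noisy measurement should be roughly half the true value. Averaging repeated measurements would shrink the error variance and reduce the attenuation.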
Table 2 summarizes the implications of Table 1 for 2 archetypal life-course cohort designs: “huge and (data-) thin” versus “small and (data-) thick.” As Table 2 suggests, the inherently time-invariant character of pure genomic measures will tend to ensure that their effects, even if small for any one locus, swamp those of all but the most powerful and temporally stable environmental covariates, purely on the basis of their much smaller measurement error. Thus, a large life-course cohort with poor-quality or one-time measures of time-varying environmental exposures will likely provide biased results, with the dice loaded in favor of the detection and estimation of purely genetic effects.
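The asymmetry described here can be sketched in a simple simulation. The model below is a hypothetical illustration, not an analysis of any real cohort: a binary genotype measured without error, a continuous exposure measured once with classical error, and equal true sizes (0.5) for the genetic effect, the environmental effect, and their interaction. All parameter values and function names are assumptions.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_gxe(n=40000, b_g=0.5, b_e=0.5, b_ge=0.5, error_sd=1.0, seed=1):
    """Estimate the genetic main effect and the gene-environment
    interaction when only the exposure is measured with error."""
    rng = random.Random(seed)
    # strata[g] holds (observed exposures, outcomes) for genotype g
    strata = {0: ([], []), 1: ([], [])}
    for _ in range(n):
        g = 1 if rng.random() < 0.5 else 0         # genotype, error-free
        e = rng.gauss(0, 1)                        # true exposure
        y = b_g * g + b_e * e + b_ge * g * e + rng.gauss(0, 1)
        e_obs = e + rng.gauss(0, error_sd)         # one noisy measurement
        strata[g][0].append(e_obs)
        strata[g][1].append(y)
    mean = lambda values: sum(values) / len(values)
    # Genetic main effect: difference in mean outcome between genotypes
    g_effect = mean(strata[1][1]) - mean(strata[0][1])
    # Interaction: difference in exposure slope between genotypes
    ge_effect = ols_slope(*strata[1]) - ols_slope(*strata[0])
    return g_effect, ge_effect


g_effect, ge_effect = simulate_gxe()
print("estimated genetic effect (true 0.5):", round(g_effect, 2))
print("estimated interaction (true 0.5):", round(ge_effect, 2))
```

Under these assumptions the genetic main effect is recovered at close to its true value, whereas the interaction estimate is attenuated by about half, because the error in the exposure dilutes the slope within each genotype stratum.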
This potential bias would not be a problem, of course, for Mendelian genetic effects, ie, strong effects of one or a few genes on a given disease virtually independent of environmental exposures. However, few people realistically expect new, large single-gene effects to be discovered for the common diseases. Even if such effects are waiting to be discovered, cohort designs offer no advantage over case–control designs, as exemplified by the elegant studies of complement-related genes and age-related macular degeneration.23
Rather, the complex diseases that afflict most of us, most often, throughout the life course and certainly in later life, can be expected to be associated with genetic variation as well as with a mixture of diverse specific environmental exposures, particularly during certain developmental “windows.” Thus, large cohort studies are most likely to be worth their considerable investment if accurate and repeated—and therefore often expensive—environmental exposure measures (or validated proxies of them such as blood lead levels) are built into the design at the outset.
In addition, the genetic architecture of common diseases is not only complex, as described previously, but is likely to vary widely from one disease to another in terms of effect sizes and frequency of relevant genetic factors at the population level. As examples, we would cite recently discovered associations of common genetic variants with age-related macular degeneration10,11,14; a common variant with moderate effect on prostate cancer24; and multiple rare variants with modest effects on low plasma levels of high-density lipoprotein cholesterol25 and plasma levels of low-density lipoprotein cholesterol.26 This suggests that various analytic tools and strategies, capable of addressing rare as well as common variants, will be required to identify and characterize at the population level the genetic factors underlying common disorders.
For all of these reasons, we in Canada are expending considerable effort in deciding how best to spend the funds we can afford to commit to large cohort platforms. A suboptimally designed cohort, such as a “huge and thin” one that fails to capture critical environmental covariate information, is less likely to yield new etiologic understanding of the common health outcomes. In short, any cohort worth designing is worth designing well.
The ultimate goal is multigenerational cohort studies that are able to provide accurate data on the full spectrum of the determinants of health over the entire life course, as well as flexible and efficient platforms that allow various analytic tools and strategies to be used. Without a careful design that takes account of the dilemmas described previously, and is “epistemologically balanced”—in terms of its likelihood of detecting both genetic and environmental main effects, let alone their interactions—such a cohort may not be worth the investment.
The authors acknowledge the expert editorial assistance of Anne Silversides and the background work completed in 2002–2003 by Stefania Maggi, PhD Candidate at the University of British Columbia.
1. Eaton WW. The logic for a conception-to-death cohort study. Ann Epidemiol
2. Wright AF, Carothers AD, Campbell H. Gene–environment interactions—the BioBank UK study. The Pharmacogenomics Journal
4. The National Institute of Child Health and Human Development. The Longitudinal Cohort Study of Environmental Effects on Child Health and Development. October 2001. Available at: http://www.nichd.nih.gov/despr/cohort/
5. Sanson A, ed. Children's Health and Development: New Research Directions for Australia. Australian Institute of Family Studies: Research Report No. 8. Commonwealth of Australia; 2002.
6. Willms JD. Vulnerable Children: Findings From Canada's National Longitudinal Survey of Children & Youth. Edmonton: University of Alberta Press; Hull: Human Resources Development Canada, Applied Research Branch; 2002.
7. Collins FS. The case for a US prospective cohort study of genes and environment. Nature
8. Burton P. Invited Address to Canadian Institutes of Health Research Expert Meeting on Large Cohort Design Issues. Toronto; November 2004.
9. Rothman KJ. Epidemiology: An Introduction. New York: Oxford University Press; 2002:178–179.
10. Klein RJ, et al. Complement factor H polymorphism in age-related macular degeneration. Science
11. Haines JL, et al. Complement factor H variant increases the risk of age-related macular degeneration. Science
12. Edwards AO, et al. Complement factor H polymorphism and age-related macular degeneration. Science
13. Hageman GS, et al. A common haplotype in the complement regulatory gene factor H (HF1/CFH) predisposes individuals to age-related macular degeneration. PNAS
14. Gold B, Merriam JE, Zernant J, et al. Variation in factor B (BF) and complement component 2 (C2) genes is associated with age-related macular degeneration. Nat Genet
15. Hertzman C, Wiens M. Child development and long-term outcomes: a population health perspective and summary of successful interventions. Soc Sci Med
16. Lynch J, Smith GD. A lifecourse approach to chronic disease epidemiology. Annu Rev Public Health
17. McCormack VA, Santos Silva I dos, De Stavola BL, et al. Fetal growth and subsequent risk of breast cancer: results from long term follow up of Swedish cohort. BMJ
18. Jefferies BJMH, Power C, Hertzman C. Birth weight, childhood socioeconomic environment, and cognitive development in the 1958 British birth cohort study. BMJ
19. Barker DJP. Fetal nutrition and cardiovascular disease in later life. Br Med Bull
20. Dodic M, Peers A, Coghlan JP, et al. Can excess glucocorticoids, in utero, predispose to cardiovascular and metabolic disease in middle age? TEM
21. Wadsworth M. Early life. In: Marmot M, Wilkinson R, eds. Social Determinants of Health. Oxford University Press; 1999:44–63.
22. MacMahon S, Peto R, Cutler J, et al. Epidemiology: blood pressure, stroke, and coronary heart disease. Lancet
23. Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet
24. Amundadottir LT, Sulem P, Gudmundsson J, et al. A common variant associated with prostate cancer in European and African populations. Nat Genet
25. Cohen JC, Kiss RS, Pertsemlidis A, et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science
26. Cohen JC, Pertsemlidis A, Fahmi S, et al. Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. PNAS
© 2006 Lippincott Williams & Wilkins, Inc.