Large Life-Course Cohorts for Characterizing Genetic and Environmental Contributions: The Need for More Thoughtful Designs

Frank, John*†; Di Ruggiero, Erica*†; McInnes, Roderick R.†‡; Kramer, Michael§; Gagnon, France*

doi: 10.1097/01.ede.0000239725.48908.7d
The Changing Face of Epidemiology

From the *Department of Public Health Sciences, University of Toronto, Toronto, Canada; †CIHR Institute of Population and Public Health, Toronto, Canada; the ‡Departments of Pediatrics and Molecular & Medical Genetics, University of Toronto, Toronto, Canada; and the §Departments of Epidemiology, Biostatistics and Occupational Health, and Paediatrics, McGill University, Montreal, Quebec, Canada.

Editors' Note: This series addresses topics that affect epidemiologists across a range of specialties. Commentaries are first invited as talks at symposia organized by the Editors. This paper was originally presented at the 2005 SER meeting in Toronto.

Address correspondence to: John Frank, Professor of Public Health Sciences, CIHR Institute of Population and Public Health, Suite 207-L, Banting Building, University of Toronto, 100 College St., Toronto Ontario, Canada M5G 1L5. E-mail:

Large studies of thousands of ordinary infants, followed closely from birth to adulthood and even into late life, provide a unique resource for illuminating the underlying causes of the health problems of early (and later) life.1 The mother's state of health just before conception, now thought to be potentially important to her offspring's lifelong health, should also be included when feasible. This type of study, usually termed a “birth cohort,” is critical to understanding the complex way that various health conditions unfold at various ages, depending on both genetic characteristics and the precise nature and sequence of environmental exposures that each child has experienced since conception.

In a single cohort, it is efficient to measure a large number of potential risk and protective factors and analyze them against the occurrence of a large number of common health outcomes (diseases and dysfunctions). The only essential requirement is that the right physical and social exposures in early life are selected and that suitable multiple measurements are made as the child ages. For example, appropriate samples of blood, urine, and so on, can provide a physiological record of environmental exposures (such as absorbed environmental contaminants at a given age) or early/intermediary health effects (such as altered daily cycles of stress hormones in the saliva or blood).

Because a single large birth cohort offers an efficient long-term research investment for studying a wide variety of risk and protective factors, and subsequent health and developmental trajectories, we label such cohorts as research “platforms.” Such a platform can ultimately support literally hundreds of specific spinoff studies over succeeding decades. Furthermore, only core data collection and storage costs need to be funded directly within the essential cohort infrastructure. Specific studies to make use of these stored questionnaire and physical examination data and biologic samples can be initiated as new scientific concepts and societal needs emerge.

Cohort studies can start later in life as well. The gigantic U.K. “BioBank” platform will collect baseline interview data and blood samples for genetic and other analyses from some 500,000 healthy Britons over age 45. This study is intended to elucidate the joint genetic and environmental determinants of common conditions of later life—hypertension, diabetes, heart disease, and some cancers—as they develop and are reported through the U.K. National Health Service. Such a late-life study will, of course, have limited ability to uncover new early-life causes of these late-life diseases unless those early-life factors can be reliably remembered by the study subjects after 50 years or more or were recorded at the time in a way that is currently accessible.2

Several countries, recognizing the power of the cohort approach, are either in the early stages of initiating such birth cohorts or have plans to do so. Typically, these cohorts include up to 100,000 infants and are planned to continue for at least 20 years so that early-life risk factors for the common disorders of adolescence can be examined, including both physical disorders such as obesity and behavioral problems such as drug or alcohol abuse. Remarkably, Denmark and Norway, countries with relatively small populations, have committed major funds to start such large studies. In the United States, the still unfunded (and recently threatened3) National Children's Study of 100,000 newborns had been instructed by Congress to focus specifically on potential environmental causes of the main childhood disorders.4 Genetic factors were also to be analyzed in recognition of the genetic contribution to virtually all diseases, usually acting in concert with environmental exposures.

In Australia, $20 million has been allocated by the Ministry of Family and Community Services for the first 9 years of a 10,000-child cohort that will recruit half its subjects before birth and half at age 4 to 6 years.5 Interestingly, the focus of that study will be on social factors (including the family) as well as educational influences on child development and mental health, with little biomedical content. Some Australian researchers therefore regard this cohort as a missed opportunity.

In Canada, we have the National Longitudinal Survey of Children and Youth (NLSCY), currently completing nearly a decade of follow up on an original sample of 25,000 children who are now ages 9 to 20. The NLSCY is unique for Canada as a rich source of population-based information of the kind that can be collected by questionnaire (especially on children's intellectual, social, and emotional development). Standardized school achievement results are also available. Its full potential is only beginning to be realized as analyses of multiple waves of follow-up data begin in earnest.6 Human Resources Development Canada sponsored this study, so it is perhaps not surprising that the collection of physical examination data or biologic samples was not one of the goals. The NLSCY therefore lays the groundwork for a new Canadian national birth cohort that focuses, in a complementary way, on biologic pathways to full function and optimal development as well as on health and disease based on the main and joint effects of genetic and environmental factors.

Typical costs for population-based birth cohorts vary widely. In Scandinavia, where healthcare records of all citizens are computerized and routinely linkable, primary data collection over the first few years of life can be done for a few hundred dollars per subject. At the opposite extreme, the full 20 years of data collection proposed in the U.S. National Children's Study carries a price tag of approximately $27,000 per subject. The high cost of the U.S. birth cohort seems to have been driven in part by a long list of exposure and outcome measurements, including expensive “absorbed body burden” measures of many environmental pollutants and direct samples of the ambient environments of children at various ages. Another factor is the absence of a universal health system or single-party payer in the United States, making computerized linkage of routine medical care records difficult or impossible.
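The scale of these budgets is easier to compare once the quoted figures are put on a common footing. The following is a minimal arithmetic sketch using only numbers quoted in this article; the variable and function names are illustrative, and the dollar amounts are taken as quoted, with no currency conversion or inflation adjustment (an assumption that makes the comparison rough at best).

```python
# Figures quoted in the article; names are illustrative only.
US_SUBJECTS = 100_000          # planned U.S. cohort size
US_COST_PER_SUBJECT = 27_000   # approximate, full 20 years of data collection

AUS_BUDGET = 20_000_000        # allocated for the first 9 years
AUS_SUBJECTS = 10_000          # planned Australian cohort size


def total_cost(subjects: int, cost_per_subject: float) -> float:
    """Implied total budget for a cohort."""
    return subjects * cost_per_subject


def cost_per_subject(budget: float, subjects: int) -> float:
    """Implied budget per enrolled subject."""
    return budget / subjects


us_total = total_cost(US_SUBJECTS, US_COST_PER_SUBJECT)
aus_per_child = cost_per_subject(AUS_BUDGET, AUS_SUBJECTS)

print(f"U.S. implied total: ${us_total:,.0f}")
print(f"Australia implied per child (first 9 years): ${aus_per_child:,.0f}")
```

On these assumptions the U.S. design implies a total in the billions of dollars, whereas the Australian allocation works out to a few thousand dollars per child over a shorter period and with far less biomedical measurement.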

The Critical Issue: What Design Features Simply Cannot Be Scrimped On?

We in Canada are at the critical stage of identifying the essential core elements of any large life-course cohort study that intends to detect the main and joint effects of genetic and environmental influences on health and development. Our thinking has been substantially informed by the recent experiences of other countries in designing their own large cohorts whether these begin in early or later life (Tables 1 and 2).





In this regard, we (and others) have begun to question the soundness of the design of large cohort studies being launched in some countries such as the United Kingdom and under active consideration in others.7 For example, the 500,000-subject U.K. BioBank Study has a prototypic “huge, (data-) thin” design. The BioBank appears to have had its sample size determined largely on the basis of what could be afforded. The official, and apparently post hoc, BioBank sample-size calculations are based on the primary objective of detecting gene–environment interactions in the classic epidemiologic sense of biologically significant departures from multiplicative interaction between risk factors, with analyses carried out on nested case–control studies of dichotomous disease outcomes. In fact, the BioBank web site's power calculations suggest that there will be adequate statistical power for detecting such interactions only for the most frequent half-dozen diseases of mid to late life such as heart disease, diabetes, and the top 3 or 4 cancer sites—even after a full 10 years of follow up.

As pointed out by Burton and others,8 this low-power problem can be partially addressed by selecting continuously distributed variables as primary outcomes. Examples would be certain markers of disease risk (blood pressure, serum lipid levels) as well as normal variation in traits such as height, IQ, and many developmental scales; and intermediary indicators of biologic pathways that signal altered body metabolic responses to environmental stressors or exposures (eg, diurnal patterns of stress hormone levels in serum, urine or saliva; Table 2).

Continuous outcomes also have less onerous requirements—in sample size terms—for the detection of interactions that constitute departures from additivity. At least one eminent epidemiologist9 has argued that such departures are the only “generally biologically meaningful” sort of interaction anyway. The implication is that spending considerably more resources to detect departures from multiplicative interaction—which requires much larger sample sizes—may not be justified.
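The distinction between the two scales of interaction can be made concrete with a small worked example. The sketch below uses hypothetical risks (cases per 1,000 persons) for the 4 combinations of a genetic and an environmental factor; none of these numbers comes from any study cited here, and the function names are illustrative.

```python
def additive_expected(r00, r10, r01):
    """Joint risk expected if the two factors simply add their excess risks."""
    return r10 + r01 - r00


def multiplicative_expected(r00, r10, r01):
    """Joint risk expected if the two factors multiply their risk ratios."""
    return r10 * r01 / r00


def interaction_contrast(r00, r10, r01, r11):
    """Departure from additivity on the risk-difference scale (0 = none)."""
    return r11 - r10 - r01 + r00


def ratio_of_ratios(r00, r10, r01, r11):
    """Departure from a multiplicative model (1 = none)."""
    return (r11 / r00) / ((r10 / r00) * (r01 / r00))


# Hypothetical risks per 1,000: neither factor, gene only, exposure only, both.
baseline, gene_only, env_only, both = 10, 20, 30, 60

print("expected if additive:      ", additive_expected(baseline, gene_only, env_only))
print("expected if multiplicative:", multiplicative_expected(baseline, gene_only, env_only))
print("interaction contrast:      ", interaction_contrast(baseline, gene_only, env_only, both))
print("ratio of ratios:           ", ratio_of_ratios(baseline, gene_only, env_only, both))
```

In this invented example the same data show no departure from a multiplicative model yet a clear departure from additivity, which illustrates why the choice of scale matters before any sample-size calculation is made.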

Furthermore, the U.K. BioBank design has reduced study costs by limiting most of its environmental and lifestyle exposure measures to self-report, apparently at baseline only. As a result, important environmental and lifestyle exposures for chronic disease will be measured with considerable error or not at all, with predictable effects on the modeling of gene–environment interactions, as more fully explicated below.

In Canada and other countries where large cohort studies are being planned, a key design issue is the inherently uneven playing field that surrounds the accurate measurement of genetic factors compared with environmental exposures. Table 1 compares the essential measurement features of these 2 very different types of risk factors. The playing field may become even more uneven with each passing year as genomic measurements become less expensive and more comprehensive (eg, whole genome screens), whereas the list of suspect environmental exposures mushrooms (eg, specific isomers of PCBs and other persistent organic pollutant body burdens).

The power of new genomic approaches is illustrated by the whole genome screen, which recently established that a variant of complement factor H accounts for approximately 50% of the attributable risk of developing age-related macular degeneration,10–13 a common, genetically complex disorder. This landmark study led to the subsequent recognition that 2 other genes in the alternative complement pathway contribute to the risk as well.14 In contrast, the only major environmental contributor to age-related macular degeneration that has been identified to date is tobacco and, because most persons with this condition have never smoked, other important environmental factors must also contribute to its development.

We would reemphasize the obvious: environmental exposure covariates, whether they be physicochemical, biologic (eg, infectious), or sociocultural, are time-varying. This fact, together with our growing understanding of “windows of vulnerability” in human development and disease causation,15–21 dictates the importance of repeated measures of the environment on multiple occasions, especially during those periods of life when a specific exposure is thought to be both potent in its health effects and temporally unstable.

Indeed, without multiple measures of changing exposures over time, the cohort design may lose much of its inherent advantage over case–control or case–cohort designs. Furthermore, regression dilution and other biases22 can be expected to compromise the cohort study's ability to characterize gene–environment interactions accurately.

Table 2 summarizes the implications of Table 1 for 2 archetypal life-course cohort designs: “huge and (data-) thin” versus “small and (data-) thick.” As Table 2 suggests, the inherently time-invariant character of pure genomic measures will tend to ensure that their effects, even if small for any one locus, swamp those of all but the most powerful and temporally stable environmental covariates, purely on the basis of their much smaller measurement error. Thus, a large life-course cohort with poor-quality or one-time measures of time-varying environmental exposures will likely provide biased results, with the dice loaded in favor of the detection and estimation of purely genetic effects.
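The asymmetry described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of any real cohort: a genotype (0, 1, or 2 copies of an allele) and a standardized environmental exposure are given the same true effect on a continuous outcome, the genotype is measured without error, and the exposure is observed with random error of the same variance as the exposure itself. All parameter values and names are hypothetical.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20_000, beta_gene=0.2, beta_env=0.2, error_sd=1.0, seed=1):
    """Estimate gene and exposure effects when only the exposure is mismeasured."""
    rng = random.Random(seed)
    # Genotype counts 0/1/2 with probabilities 0.25/0.5/0.25, measured exactly.
    gene = [rng.choice([0, 1, 1, 2]) for _ in range(n)]
    # True exposure (variance 1) and a one-time noisy measurement of it.
    env_true = [rng.gauss(0, 1) for _ in range(n)]
    env_obs = [e + rng.gauss(0, error_sd) for e in env_true]
    # The outcome depends on the true exposure, not on the noisy measurement.
    outcome = [
        beta_gene * g + beta_env * e + rng.gauss(0, 1)
        for g, e in zip(gene, env_true)
    ]
    return {
        "gene": slope(gene, outcome),
        "env_true": slope(env_true, outcome),
        "env_observed": slope(env_obs, outcome),
    }


estimates = simulate()
for name, value in estimates.items():
    print(f"{name:>12}: {value:.3f}")
```

Under these assumptions the estimated genetic effect should stay close to its true value, while the estimated effect of the mismeasured exposure should shrink toward roughly half of its true value (the classic regression dilution factor, exposure variance divided by exposure variance plus error variance). Repeated measurements reduce the error variance and hence the attenuation.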

This potential bias would not be a problem, of course, for Mendelian genetic effects, ie, strong effects of one or a few genes on a given disease virtually independent of environmental exposures. However, few people realistically expect new, large single-gene effects to be discovered for the common diseases. If such effects are waiting to be discovered, there is no advantage of cohort designs over case–control designs as exemplified by the elegant studies of complement-related genes and age-related macular degeneration.23

Rather, the complex diseases that afflict most of us, most often, throughout the life course and certainly in later life, can be expected to be associated with genetic variation as well as with a mixture of diverse specific environmental exposures, particularly during certain developmental “windows.” Thus, large cohort studies are most likely to be worth their considerable investment if accurate and repeated—and therefore often expensive—environmental exposure measures (or validated proxies of them such as blood lead levels) are built into the design at the outset.

In addition, the genetic architecture of common diseases is not only complex, as described previously, but is likely to vary widely from one disease to another in terms of effect sizes and frequency of relevant genetic factors at the population level. As examples, we would cite recently discovered associations of common genetic variants with age-related macular degeneration10,11,14; a common variant with moderate effect on prostate cancer24; and multiple rare variants with modest effects on low plasma levels of high-density lipoprotein cholesterol25 and plasma levels of low-density lipoprotein cholesterol.26 This suggests that various analytic tools and strategies, capable of addressing rare as well as common variants, will be required to identify and characterize at the population level the genetic factors underlying common disorders.

For all of these reasons, we in Canada are expending considerable effort in deciding how best to spend the funds we can afford to commit to large cohort platforms. A suboptimally designed cohort, such as a “huge and thin” one that fails to capture critical environmental covariate information, is less likely to yield new etiologic understanding of the common health outcomes. In short, any cohort worth designing is worth designing well.

The ultimate goal is multigenerational cohort studies that are able to provide accurate data on the full spectrum of the determinants of health over the entire life course, as well as flexible and efficient platforms that allow various analytic tools and strategies to be used. Without a careful design that takes account of the dilemmas described previously, and is “epistemologically balanced”—in terms of its likelihood of detecting both genetic and environmental main effects, let alone their interactions—such a cohort may not be worth the investment.

The authors acknowledge the expert editorial assistance of Anne Silversides and the background work completed in 2002–2003 by Stefania Maggi, PhD Candidate at the University of British Columbia.

1. Eaton WW. The logic for a conception-to-death cohort study. Ann Epidemiol. 2002;12:445–451.
2. Wright AF, Carothers AD, Campbell H. Gene–environment interactions—the BioBank UK study. The Pharmacogenomics Journal. 2002;2:75–82.
3. The National Children's Study. The National Children's Study's Progress: A Funding Update. March 2006. Available at:
4. The National Institute of Child Health and Human Development. The Longitudinal Cohort Study of Environmental Effects on Child Health and Development. October 2001. Available at:
5. Sanson A, ed. Children's Health and Development: New Research Directions for Australia. Australian Institute of Family Studies: Research Report No. 8. Commonwealth of Australia; 2002.
6. Willms JD. Vulnerable Children: Findings From Canada's National Longitudinal Survey of Children & Youth. Edmonton: University of Alberta Press; Hull: Human Resources Development Canada, Applied Research Branch; 2002.
7. Collins FS. The case for a US prospective cohort study of genes and environment. Nature. 2004;429:475–477.
8. Burton P. Invited Address to Canadian Institutes of Health Research Expert Meeting on Large Cohort Design Issues. Toronto; November 2004.
9. Rothman KJ. Epidemiology—An Introduction. New York: Oxford University Press; 2002:178–179.
10. Klein RJ, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389.
11. Haines JL, et al. Complement factor H variant increases the risk of age-related macular degeneration. Science. 2005;308:419–421.
12. Edwards AO, et al. Complement factor H polymorphism and age-related macular degeneration. Science. 2005;308:421–424.
13. Hageman GS, et al. A common haplotype in the complement regulatory gene factor H (HF1/CFH) predisposes individuals to age-related macular degeneration. PNAS. 2005;102:7227–7232.
14. Gold B, Merriam JE, Zernant J, et al. Variation in factor B (BF) and complement component 2 (C2) genes is associated with age-related macular degeneration. Nat Genet. 2006;38:458–462.
15. Hertzman C, Wiens M. Child development and long-term outcomes: a population health perspective and summary of successful interventions. Soc Sci Med. 1996;43:1083–1095.
16. Lynch J, Smith GD. A lifecourse approach to chronic disease epidemiology. Annu Rev Public Health. 2005;26:1–35.
17. McCormack VA, Santos Silva I dos, De Stavola BL, et al. Fetal growth and subsequent risk of breast cancer: results from long term follow up of Swedish cohort. BMJ. 2003;326:248.
18. Jefferis BJMH, Power C, Hertzman C. Birth weight, childhood socioeconomic environment, and cognitive development in the 1958 British birth cohort study. BMJ. 2002;325:305.
19. Barker DJP. Fetal nutrition and cardiovascular disease in later life. Br Med Bull. 1997;53:96–108.
20. Dodic M, Peers A, Coghlan JP, et al. Can excess glucocorticoids, in utero, predispose to cardiovascular and metabolic disease in middle age? TEM. 1999;10:86–91.
21. Wadsworth M. Early life. In: Marmot M, Wilkinson R, eds. Social Determinants of Health. Oxford University Press; 1999:44–63.
22. MacMahon S, Peto R, Cutler J, et al. Epidemiology: blood pressure, stroke, and coronary heart disease. Lancet. 1990;335:765–774.
23. Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet. 2001;358:1356–1360.
24. Amundadottir LT, Sulem P, Gudmundsson J, et al. A common variant associated with prostate cancer in European and African populations. Nat Genet. 2006;38:652–658.
25. Cohen JC, Kiss RS, Pertsemlidis A, et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305:869–872.
26. Cohen JC, Pertsemlidis A, Fahmi S, et al. Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. PNAS. 2006;103:1810–1815.
© 2006 Lippincott Williams & Wilkins, Inc.