Thomas, Duncan C.
From the Department of Preventive Medicine, Keck School of Medicine, University of Southern California.
Submitted 16 May 2006; accepted 19 May 2006.
Address correspondence to: Duncan C. Thomas, Department of Preventive Medicine, 1540 Alcazar St., Suite CHP-220, University of Southern California, Los Angeles, CA 90089. E-mail: firstname.lastname@example.org.
As David Hunter puts it so aptly in his commentary,1 Pandora's box has been opened: genomics, transcriptomics, metabolomics, proteomics, interactomics, methylomics …. It seems like every day a new genome-wide technology is introduced. Is this to become “a diagnostic dream and an analytical nightmare?”2 What is the real value added that can be expected for molecular epidemiology?
Much of the use of “-omics” technologies to date has been exploratory in nature.3 Analyses are typically performed in 2 stages, using a training sample to search for patterns in high-dimensional data, followed by a validation sample to test the predictions of the model. A wide range of data-mining tools has been developed for this purpose, including hierarchical clustering, classification and regression trees, self-organizing maps, random forests, multifactor dimension reduction, and neural nets, to name a few (see Hoh and Ott4 for a partial review).
I would like to explore a somewhat different paradigm: using these “-omics” technologies to inform hypothesis-directed pathway-based approaches to molecular epidemiology and to help direct genome-wide exploratory analyses into more promising directions. In either case, the basic idea entails using “-omics” data to inform the estimation of the parameters of a model for the epidemiologic data at hand.5 For example, in a candidate gene association study of a highly polymorphic gene like ATM, one might include in silico measures of evolutionary conservation and predicted effects on protein conformation or in vitro functional assays of the effects of each polymorphism6 as prior covariates in a hierarchical model for the relative risk of each variant.7,8 On a broader scale, one might use readily available genomic annotation data to prioritize the signals from an initial genome-wide association scan so as to improve the selection of markers to carry forward for testing in the later stages of a multistage study.9,10 If the multiple comparisons problem in a genome-wide association scan is not daunting enough, consider a study of the genetics of gene expression involving perhaps tens of thousands of expressed genes examined in relation to hundreds of thousands of single nucleotide polymorphisms (SNPs),11 or all possible gene–gene interactions,12 or all possible haplotype associations13 in a genome-wide SNP scan. Here, the opportunity to exploit bioinformatics resources to focus the search has enormous potential that has so far not been tapped.
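To make the idea of “prior covariates in a hierarchical model” concrete, the following is a minimal sketch rather than the method of any of the cited papers. It assumes a two-level normal model: each variant's estimated log relative risk is normally distributed around its true value, and the true values are in turn normally distributed around a linear function of the prior covariates. The function name, the made-up numbers, and the choice to treat the second-stage variance tau2 as known (a semi-Bayes simplification) are all assumptions introduced for illustration.

```python
import numpy as np


def hierarchical_shrinkage(beta_hat, se, Z, tau2):
    """Shrink per-variant log relative risks toward a prior-covariate model.

    Assumed model (illustrative only):
      first stage:  beta_hat_j ~ N(beta_j, se_j^2)
      second stage: beta_j     ~ N(z_j' pi, tau2), with tau2 treated as known
    Returns the estimated second-stage coefficients and the posterior means.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    Z = np.asarray(Z, dtype=float)

    # Marginally beta_hat_j ~ N(z_j' pi, se_j^2 + tau2), so pi is estimated
    # by weighted least squares with weights 1 / (se_j^2 + tau2).
    w = 1.0 / (se**2 + tau2)
    ZtW = Z.T * w
    pi_hat = np.linalg.solve(ZtW @ Z, ZtW @ beta_hat)
    prior_mean = Z @ pi_hat

    # Weight given to the variant's own data; imprecise estimates
    # (large se) are pulled more strongly toward the prior mean.
    b = tau2 / (tau2 + se**2)
    post_mean = b * beta_hat + (1.0 - b) * prior_mean
    return pi_hat, post_mean


# Hypothetical inputs: four variants, an intercept plus one prior
# covariate (for example, a conservation score scaled to 0-1).
beta_hat = np.array([0.40, 0.05, 0.90, -0.10])  # estimated log relative risks
se = np.array([0.20, 0.15, 0.60, 0.25])         # their standard errors
Z = np.array([[1.0, 0.8],
              [1.0, 0.1],
              [1.0, 0.9],
              [1.0, 0.2]])

pi_hat, post_mean = hierarchical_shrinkage(beta_hat, se, Z, tau2=0.05)
print("second-stage coefficients:", pi_hat)
print("shrunken log relative risks:", post_mean)
```

In this sketch, each posterior mean lies between the variant's own estimate and the value predicted from its prior covariates. A full analysis would also estimate tau2 and propagate the uncertainty in the second-stage coefficients, which this simplification ignores.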
In other circumstances, rather than relying on external bioinformatics databases, an investigator might want to apply some of these “-omics” tools directly to the samples from an epidemiologic case–control or cohort study to better characterize intermediate pathways.14 Here, one might think of the “-omics” data as providing the “missing link” among exposures, genes, and disease. The cost of these assays, difficulties in getting subject participation, or the need for special tissue preparations may preclude obtaining such data on all subjects in a large-scale study. Furthermore, the problem of reverse causation in a case–control design (the biomarker being affected by the disease or its treatment rather than the other way around) could make any simple case–control comparison of the biomarker meaningless. This suggests the need for some form of multistage sampling15 combined with modeling of the latent disease process.8 For example, one could use data from an appropriately designed substudy to build a model for the relationship of the biomarker(s) to their genetic and environmental determinants and apply the predictions of this model to the analysis of the full study in which biomarkers were not available.16 Alternatively, one could rely on the idea of “Mendelian randomization.”17 Here, to investigate the role of an intermediate variable on disease, one focuses instead on the associations of a gene with a biomarker for the intermediate variable and with disease separately. A variety of analysis approaches are available for such designs, but the challenges of extending them to genome-wide scale are daunting.18,19
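As a rough illustration of the Mendelian randomization idea described above, the two separate associations (gene with biomarker, gene with disease) can be combined into a ratio estimate of the biomarker's effect on disease. The sketch below uses the commonly described ratio (Wald-type) estimator with a first-order delta-method standard error; it is one simple option, not necessarily the approach of the cited papers. It assumes that the gene influences disease only through the biomarker, that the gene is unrelated to confounders, and that the two association estimates are independent. The function name and the input values are hypothetical.

```python
import math


def wald_ratio(beta_gx, se_gx, beta_gy, se_gy):
    """Ratio estimate of the biomarker-disease effect.

    beta_gx, se_gx: gene-biomarker association and its standard error
    beta_gy, se_gy: gene-disease association (e.g., log odds ratio) and its SE
    The SE uses a first-order delta-method approximation and assumes the
    two estimates are independent (an illustrative simplification).
    """
    ratio = beta_gy / beta_gx
    variance = se_gy**2 / beta_gx**2 + (beta_gy**2 * se_gx**2) / beta_gx**4
    return ratio, math.sqrt(variance)


# Hypothetical summary estimates, for illustration only.
estimate, std_err = wald_ratio(beta_gx=0.5, se_gx=0.05, beta_gy=0.1, se_gy=0.04)
print(f"estimated effect per unit of biomarker: {estimate:.3f} (SE {std_err:.3f})")
```

Because the gene-biomarker association appears in the denominator, a weak association makes the ratio unstable. Extending this kind of calculation to thousands of biomarkers and markers also multiplies the multiple-comparisons burden, which is one of the challenges noted above.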
There remain many challenges in the design and analysis of genome-wide studies using any of the technologies discussed here, and their real potential remains uncertain, even for the SNP association studies currently underway.20 Careful attention to the basic principles of epidemiologic study design, including avoidance of differential handling of case and control samples and confounding by ancestral origins, will be essential.21 Clearly, a fundamental requirement will be replication, as has been well recognized in gene-association studies.22 Indeed, most gene expression and other genome-wide studies have adopted some form of training/validation 2-stage approach, although there may be circumstances in which a joint analysis is preferred.23 Despite these challenges, there is considerable enthusiasm, particularly in such fields as pharmacogenetics,24–26 in which the potential for personalized medicine in the form of genetically tailored preventive or therapeutic drugs is enormous.
As exciting as these opportunities are, there will always remain an important role for classic risk-factor environmental epidemiology. Not every study need understand the molecular mechanism of an exposure–disease relationship for it to provide information that can be useful for informing important public health policy. As David Hunter points out, “the lure of new technologies should not distract us from following social, economic, demographic, or ecologic explanations for disease etiology.”
ABOUT THE AUTHOR
DUNCAN C. THOMAS is Professor of Biostatistics and Verna Richter Chair in Cancer Research at the Keck School of Medicine, University of Southern California, and Co-Director of the Division of Biostatistics. His research has focused on the development of statistical methods in epidemiology, particularly cancer epidemiology, occupational and environmental health, and genetic epidemiology.
REFERENCES
1. Hunter DJ. Genomics and proteomics in epidemiology: treasure trove or “high-tech stamp collecting”? Epidemiology
2. Laird PW. The power and promise of DNA methylation markers. Nat Rev Cancer
3. Goodman L. Hypothesis—limited research. Genome Res
4. Hoh J, Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet
5. Thomas DC. The need for a comprehensive approach to complex pathways in molecular epidemiology [Editorial]. Cancer Epidemiol Biomarkers Prev
6. Rebbeck TR, Spitz M, Wu X. Assessing the function of genetic variants in candidate gene association studies. Nat Rev Genet
7. Hung RJ, Brennan P, Malaveille C, et al. Using hierarchical modeling in genetic association studies with multiple markers: application to a case–control study of bladder cancer. Cancer Epidemiol Biomarkers Prev
8. Conti DV, Cortessis V, Molitor J, et al. Bayesian modeling of complex metabolic pathways. Hum Hered
9. Roeder K, Bacanu SA, Wasserman L, et al. Using linkage genome scans to improve power of association in genome scans. Am J Hum Genet
10. Whittemore AS. Genetic association studies: time for a new paradigm? Cancer Epidemiol Biomarkers Prev
11. Cheung VG, Spielman RS, Ewens KG, et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature
12. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet
13. Lin S, Chakravarti A, Cutler DJ. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet
14. Sellers TA, Yates JR. Review of proteomics with applications to genetic epidemiology. Genet Epidemiol
15. Spiegelman D, Carroll RJ, Kipnis V. Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Stat Med
16. Kraft P, Hunter D. Integrating epidemiology and genetic association: the challenge of gene–environment interaction. Philos Trans R Soc Lond B Biol Sci
17. Davey Smith G, Ebrahim S. What can mendelian randomisation tell us about modifiable behavioural and environmental exposures? BMJ
18. Yasui Y, Pepe M, Thompson ML, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics
19. Feng Z, Prentice R, Srivastava S. Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective. Pharmacogenomics
20. Thomas DC. Are we ready for genome-wide association studies? Cancer Epidemiol Biomarkers Prev
21. Clayton DG, Walker NM, Smyth DJ, et al. Population structure, differential bias and genomic control in a large-scale, case–control association study. Nat Genet
22. Rebbeck TR, Martinez ME, Sellers TA, et al. Genetic variation and cancer: Improving the environment for publication of association studies. Cancer Epidemiol Biomarkers Prev
23. Skol AD, Scott LJ, Abecasis GR, et al. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet
24. Ulrich CM, Robien K, McLeod HL. Cancer pharmacogenetics: polymorphisms, pathways and beyond. Nat Rev Cancer
25. Stoughton RB, Friend SH. How molecular profiling could revolutionize drug discovery. Nat Rev Drug Discov
26. Little J, Sharp L, Khoury MJ, et al. The epidemiologic approach to pharmacogenomics. Am J Pharmacogenomics
© 2006 Lippincott Williams & Wilkins, Inc.