The New World of Data Linkages in Clinical Epidemiology: Are We Being Brave or Foolhardy?

Weiss, Noel S.

doi: 10.1097/EDE.0b013e318210aca5
The Changing Face of Epidemiology
From the Department of Epidemiology, University of Washington, and the Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA.

Supported by the National Institutes of Health (Grant 2K05CA092002).

Correspondence: Noel S. Weiss, University of Washington, BOX 357236, Seattle, WA 98195. E-mail:

Editors' note: Related articles appear on pages 290, 295, 298, and 302.

The conduct of research in clinical epidemiology—the study of the outcomes of illness—can be facilitated by having access to data on a large number of patients at relatively low cost. The availability of computerized databases with information on illness in defined populations, and on tests and treatments received by members of these populations, has enabled a number of clinical epidemiologic studies to be performed. Without these databases, many such studies would be extremely expensive or not possible at all. Large, linked databases have enormous potential for providing information on the impact of tests and treatments on health. This potential will not be realized, however, unless:

1. Actual receipt of these interventions, health outcomes, and potentially confounding variables can be ascertained accurately for individual patients; and

2. Selection bias can be minimized by the identification of an appropriate basis for comparison.

Ascertainment of Interventions

Juurlink et al1 used records of the Ontario Drug Benefit Program to enumerate those residents of Ontario who had been prescribed an angiotensin-converting enzyme (ACE) inhibitor. The authors used a province-wide hospital discharge database to identify 523 persons aged 65 years and older who developed hyperkalemia during 1994–2000 while using such a drug. To these 523 “cases” they matched (on the basis of age, sex, and the presence or absence of renal disease) 25,708 Ontario seniors who had been prescribed an ACE inhibitor but were not hospitalized for hyperkalemia. Again using the records of the Drug Benefit Program, the investigators contrasted cases and controls with regard to the prescription of a potassium-sparing diuretic in the week before each case's hospitalization. In all, 8.2% of cases but only 0.3% of controls had received a potassium-sparing diuretic during that period (adjusted relative risk = 20.3). Had these linked data elements not been available for such a large population, it is unlikely that this striking association could have been documented.
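As a rough check on the scale of this association, the crude exposure odds ratio implied by the two percentages can be computed directly. The sketch below is illustrative only: the function name `crude_odds_ratio` is an assumption introduced here, the inputs are the rounded percentages quoted above, and the calculation ignores the matching and adjustment behind the published estimate of 20.3, so the two figures are not expected to agree.

```python
def crude_odds_ratio(p_exposed_cases: float, p_exposed_controls: float) -> float:
    """Crude exposure odds ratio from the proportion exposed in cases and in controls."""
    odds_cases = p_exposed_cases / (1.0 - p_exposed_cases)
    odds_controls = p_exposed_controls / (1.0 - p_exposed_controls)
    return odds_cases / odds_controls


# Rounded percentages quoted in the text: 8.2% of cases, 0.3% of controls.
crude_or = crude_odds_ratio(0.082, 0.003)
print(f"crude odds ratio: {crude_or:.1f}")  # unadjusted and unmatched, roughly 30
```

Both the crude and the adjusted figures point to a very large relative difference in exposure between cases and controls.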

However, sometimes the information on exposure status contained in the database is not sufficiently accurate for the purpose at hand. For example, a second study in Ontario2 sought to compare the proportion with a previous screening colonoscopy among persons who had died of colorectal cancer (ascertained through the mortality file of the provincial cancer registry) with that among matched controls drawn from the population at large (through the province's Registered Persons Database). However, the source of information on colonoscopy (the records of the Ontario Health Insurance Plan) could not distinguish screening colonoscopies from those done in response to symptoms or signs of illness. Because such diagnostic tests could be expected to be far more common in cases than in controls, at least in the months leading up to the diagnosis of the ultimately fatal cancer, the investigators compared cases and controls only for colonoscopy performed more than 6 months before the case's diagnosis. Unfortunately, restricting attention to tests done so early allowed the study to address no more than the ability of a negative colonoscopy to predict a low risk of death, not the ability of colonoscopy to produce that low risk. A case-control study can address the efficacy of screening colonoscopy against colorectal cancer mortality3 only by considering receipt of screening colonoscopies performed during the entire period when early cancer and detectable cancer precursor lesions are present.

Ascertainment of Health Outcomes

The validity of data on health events in a population under study will vary with the nature of the event and the database. The sensitivity and specificity of a diagnosis of ocular melanoma in a cancer registry will no doubt be greater than those of a diagnosis of depression in an outpatient billing file. Because low specificity of the outcome definition poses a particular threat to validity,4 investigators who use administrative databases have sought data from multiple files to reduce the number of false positives. For example, Ray et al5 used the electronic files of the Michigan Medicaid program to compare persons hospitalized with hip fracture to other Medicaid recipients for receipt of a prescription of a psychoactive drug during the prior month. The investigators examined electronic outpatient files generated prior to the hospitalization to identify and exclude patients hospitalized for a fracture that had occurred some time in the past (and in whom the present hospitalization was for a complication, or for delayed treatment).

Similarly, Gerstman et al6 operationally defined the occurrence of pulmonary embolism or venous thrombosis as not just a hospital diagnosis but also receipt of a prescription for an anticoagulant (as ascertained in a computerized pharmacy file) in the 6 months after hospital discharge. Using a standardized clinician review of paper medical records as the reference standard, this approach led to a substantial decrease in the proportion of false-positive diagnoses of pulmonary embolism or venous thrombosis compared with the hospital discharge diagnoses alone.
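Why low specificity is the particular concern can be seen from a short calculation. The sketch below assumes nondifferential outcome misclassification (the same sensitivity and specificity in exposed and unexposed persons) and uses hypothetical risks; the function names are introduced here for illustration and are not taken from the cited studies.

```python
def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Proportion classified as having the outcome: true positives plus false positives."""
    return sensitivity * true_risk + (1.0 - specificity) * (1.0 - true_risk)


def observed_risk_ratio(risk_exposed, risk_unexposed, sensitivity, specificity):
    """Risk ratio computed from misclassified outcome data (nondifferential errors assumed)."""
    return (observed_risk(risk_exposed, sensitivity, specificity)
            / observed_risk(risk_unexposed, sensitivity, specificity))


# Hypothetical true risks: 2% in the exposed, 1% in the unexposed (true risk ratio 2.0).
rr_low_sensitivity = observed_risk_ratio(0.02, 0.01, sensitivity=0.70, specificity=1.00)
rr_low_specificity = observed_risk_ratio(0.02, 0.01, sensitivity=1.00, specificity=0.98)

print(f"sensitivity 70%, specificity 100%: RR = {rr_low_sensitivity:.2f}")
print(f"sensitivity 100%, specificity 98%: RR = {rr_low_specificity:.2f}")
```

Under these assumptions, missed cases alone leave the risk ratio at 2.0, whereas a 2% false-positive rate pulls it down to about 1.3. This is the rationale for requiring corroborating evidence from a second file before counting an event.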

Control of Confounding Variables

Consideration of potential confounding variables is a feature of virtually every clinical epidemiologic study based on linkage of databases. Confounding variables that are not captured by the data files, or that are captured incompletely, will lead to a distorted estimate of the size of the association between a test or a treatment and the occurrence of a particular health outcome. For example, a number of studies have linked records of influenza immunization received by members of healthcare plans with those containing information on illnesses, hospitalizations, and deaths among these persons. During the flu “season,” mortality was observed to be sharply decreased among people who had been vaccinated.7 Jackson et al8 were concerned that the magnitude of the decrease (about 50%) exceeded the share of deaths that could plausibly be attributed to influenza infection, and that the statistical adjustments were based only on demographic characteristics and limited available clinical information. They conducted their own study, in which mortality among immunized and nonimmunized seniors was compared not only during flu season, but also during the weeks between the administration of vaccine and the onset of a flu epidemic in a given year. After adjustment for several measured characteristics, all-cause mortality was observed to be decreased by 40%–50% during flu season among people who had been vaccinated, but decreased (in relative terms) to an even greater extent (60%) prior to flu season. This observation suggests that some characteristics strongly associated with both immunization and the likelihood of death in the next weeks to months (eg, terminal illness, extreme frailty) were not documented in the computerized administrative records available for analysis, leading to a greatly exaggerated estimate of the relative decrease in mortality associated with immunization.
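The mechanism proposed here can be illustrated with a small calculation in which the vaccine has no effect at all, yet vaccinated persons appear to die less often. Every number below is hypothetical (the frailty prevalence, the vaccination probabilities, and the mortality risks are assumptions chosen for illustration, not values from Jackson et al), and `crude_risk_ratio` is a name introduced here.

```python
def crude_risk_ratio(p_frail, p_vacc_frail, p_vacc_robust, risk_frail, risk_robust):
    """Crude mortality risk ratio (vaccinated vs unvaccinated) when the vaccine has
    no effect and the only determinant of death is an unrecorded frailty indicator."""
    # Population shares of each frailty-by-vaccination group.
    vacc_frail = p_frail * p_vacc_frail
    vacc_robust = (1.0 - p_frail) * p_vacc_robust
    unvacc_frail = p_frail * (1.0 - p_vacc_frail)
    unvacc_robust = (1.0 - p_frail) * (1.0 - p_vacc_robust)

    # Mortality in each vaccination group is a frailty-weighted average.
    risk_vacc = (vacc_frail * risk_frail + vacc_robust * risk_robust) / (vacc_frail + vacc_robust)
    risk_unvacc = ((unvacc_frail * risk_frail + unvacc_robust * risk_robust)
                   / (unvacc_frail + unvacc_robust))
    return risk_vacc / risk_unvacc


# Hypothetical inputs: 10% frail; frail persons are vaccinated less often and die more often.
rr = crude_risk_ratio(p_frail=0.10, p_vacc_frail=0.30, p_vacc_robust=0.70,
                      risk_frail=0.20, risk_robust=0.01)
print(f"crude risk ratio with no true vaccine effect: {rr:.2f}")
```

With these inputs the crude risk ratio is well below 1 even though, within each frailty stratum, vaccinated and unvaccinated persons have identical risks. If frailty is not recorded in the linked files, no amount of adjustment for the recorded variables removes this distortion.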

Sometimes information on a potentially important confounding variable is available in databases only at a point in time that is not germane to the question at hand. For example, in an effort to evaluate the impact of physical activity on the risk of death for persons with cancer, data from a population-health survey (which obtained data on exercise) were linked with those of a cancer registry serving that same population.9 A total of 293 persons with a history of cancer took part in the health survey (on average, 4.9 years after diagnosis). During a mean follow-up period of 5.9 years, those who reported engaging in vigorous activity at least 20 minutes per day for at least 3 days per week had about half the rate of death of more sedentary persons. However, although the analysis was adjusted for the type of cancer, and could have been adjusted for the severity of cancer (in terms of stage) at the time of diagnosis, cancer registry files focus on events at the time of initial diagnosis. They contain no information regarding severity at the time the program of physical activity was initiated, or at the time the physical activity information was collected. It is likely that the presence of more advanced cancer is related to both an increased risk of death and an inability to engage in vigorous exercise. Thus, the negative association between exercise and mortality must be interpreted with caution.

Basis for Comparison

In most clinical epidemiologic studies using linked databases, there is a defined population to which the database refers. This population may be geographically based (eg, residents of the province of Ontario) or a group of persons with a specified form of medical insurance coverage (eg, Medicaid recipients in Michigan). When health outcomes are being identified in defined populations, the choice of a comparison group is generally straightforward. In a cohort study, such a group would comprise nonexposed members of the same population at risk for the event (or a sample thereof). In a case-control study, it would be a sample of members of the population at risk for the event who had not (yet) sustained it. However, the data files may not adequately document which members are or are not at risk. For example, many older women in North America and Europe have previously undergone a hysterectomy and are not at risk for endometrial or cervical cancer. Because the hysterectomy may have occurred years earlier (before the period covered by the data files or before the women became members of the population under study), a study based on data linkages would improperly include some women who in fact had no uterus. To the extent that an exposure (eg, hormone use) differs between women with and without a uterus, the results of such a study would be biased.

A case-control study10 of the efficacy of cervical screening against the development of invasive cervical cancer had to deal with the problem of an appropriate control group. Cases were female residents of New South Wales (Australia) 20–69 years of age with a diagnosis of invasive cervical cancer during 2000–2003. The screening history of each woman was obtained from a state-wide screening register in operation since 1996. Perhaps recognizing the problems of sampling 20–69-year-old women in general (many of whom no longer had a uterus), the investigators actually chose as matched controls a sample of women whose names appeared in the screening register during 1996–2004. Because women who had been screened in 2004 may not have been screened previously, the authors reasoned that “each case would still have had a possibility of being matched to control with no screening history prior to the case's diagnosis.” However, data on women who had the “possibility” of not being screened during 1996–2003 are unlikely to generate a valid estimate of the prevalence of not having been screened. Every woman in the register had, by definition, been screened at some point during 1996–2004, whereas women in the population at risk who were never screened could not be selected as controls. Sampling from the register would therefore lead to a spuriously high estimate of the level of screening in the population at risk, and of screening efficacy.
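The direction of this bias can be shown with hypothetical numbers. The sketch below splits the population at risk into women screened before the reference date, women first screened only afterward (but by 2004), and women never screened; only the first two groups can appear in a register-based control sample. The proportions and the function names are assumptions made for illustration, not data from the New South Wales study.

```python
def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1.0 - p)


def screening_prevalence(screened_before, first_screened_after, never_screened):
    """Return (population prevalence, register-sample prevalence) of prior screening."""
    total = screened_before + first_screened_after + never_screened
    population = screened_before / total
    # Never-screened women are absent from the register, so they drop out of the denominator.
    register = screened_before / (screened_before + first_screened_after)
    return population, register


# Hypothetical split of the population at risk.
pop_prev, reg_prev = screening_prevalence(0.60, 0.10, 0.30)

p_cases_screened = 0.40  # hypothetical proportion of cases screened before diagnosis
or_population_controls = odds(p_cases_screened) / odds(pop_prev)
or_register_controls = odds(p_cases_screened) / odds(reg_prev)

print(f"screened among population controls: {pop_prev:.2f}, OR = {or_population_controls:.2f}")
print(f"screened among register controls:   {reg_prev:.2f}, OR = {or_register_controls:.2f}")
```

Under these assumptions the register-based controls overstate the prevalence of screening (about 0.86 versus 0.60), and the odds ratio moves from about 0.44 to about 0.11, exaggerating the apparent protection.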

Conclusion

Whether they are brave or not, clinical epidemiologists should not hesitate to enter the new world of data linkages, provided that they can:

1. Ascertain exposures, outcomes, and potential confounding variables with reasonable accuracy; and

2. Obtain a valid basis for comparison to the incidence in exposed persons (or to the proportion of cases who had sustained the exposure).

If these conditions cannot be met, it would be foolish to rush into this new world. Just because an analysis can be done does not mean it should be done.

About the Author

NOEL WEISS is a member of the faculties of the Department of Epidemiology, University of Washington School of Public Health and the Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, in Seattle, WA.

Acknowledgments

Thanks to Thomas Koepsell for his suggestions on an earlier draft of this manuscript.

References

1. Juurlink DN, Mamdani M, Kopp A, Laupacis A, Redelmeier DA. Drug-drug interactions among elderly populations hospitalized for drug toxicity. JAMA. 2003;289:1652–1658.
2. Baxter NN, Goldwasser MA, Paszat LF, Saskin R, Urbach DR, Rabeneck L. Association of colonoscopy and death from colorectal cancer. Ann Intern Med. 2009;150:1–8.
3. Weiss NS, Dhillon PK, Etzioni R. Case-control studies of the efficacy of cancer screening: Overcoming bias from nonrandom patterns of screening. Epidemiology. 2004;15:409–413.
4. Koepsell TD, Weiss NS. Epidemiologic Methods: Studying the Occurrence of Illness. New York: Oxford; 2003:233–234.
5. Ray WA, Griffin MR, Schaffner W, Baugh DK, Melton LJ III. Psychotropic drug use and the risk of hip fracture. N Engl J Med. 1987;316:363–369.
6. Gerstman BB, Freiman JP, Hine LK. Use of subsequent anticoagulants to increase the predictive value of Medicaid deep venous thromboembolism diagnosis. Epidemiology. 1990;1:122–127.
7. Simonsen L, Taylor RJ, Viboud C, Miller MA, Jackson LA. Mortality benefits of influenza immunization in elderly people: an ongoing controversy. Lancet Infect Dis. 2007;7:658–666.
8. Jackson LA, Jackson ML, Nelson JC, Neuzil KM, Weiss NS. Evidence of bias in estimates of influenza vaccine effectiveness in seniors. Int J Epidemiol. 2006;35:337–344.
9. Hamer M, Stamatakis E, Saxton JM. The impact of physical activity on all-cause mortality in men and women after a cancer diagnosis. Cancer Causes Control. 2009;20:225–231.
10. Yang B, Morrell S, Zuo Y, Roder D, Tracey E, Jelfs P. A case-control study of the protective benefit of cervical screening against invasive cervical cancer in NSW women. Cancer Causes Control. 2008;19:569–576.

Editors' note: This series addresses topics of interest to epidemiologists across a range of specialties. Commentaries start as invited talks at symposia organized by the Editors. This paper was presented at the 2010 Society for Epidemiologic Research Annual Meeting in Seattle, WA.

© 2011 Lippincott Williams & Wilkins, Inc.