Designs for the Combination of Group- and Individual-level Data : Epidemiology

Methods: Review Article

Designs for the Combination of Group- and Individual-level Data

Haneuse, Sebastien; Bartell, Scott

Epidemiology 22(3):p 382-389, May 2011. | DOI: 10.1097/EDE.0b013e3182125cff

Abstract

In ecologic studies, the fundamental unit of investigation is a group of individuals, rather than individuals themselves.1 Ecologic studies are widely used because group-level (or aggregated) data are easy and inexpensive to obtain, particularly through data depositories such as disease registries and census data. Further, developments in computing (eg, geographical information systems) let researchers combine information at varying levels of aggregation.2 Taking advantage of these strengths, ecologic designs continue to be employed in many epidemiologic settings, including studies of environmental risk factors,3–7 cancer screening,8,9 and investigations of chronic10 and infectious diseases.11,12

Notwithstanding their continued use, ecologic studies are controversial because they directly assess group-level associations, that is, relationships between group-level outcomes and group-level exposure measures. Such associations are sometimes of interest,13 particularly for policymaking.14 Typically, though, the scientific goal in epidemiology is to assess individual-level associations. With group-level data alone, one generally cannot estimate individual-level associations, although one may be tempted to interpret results from an ecologic study in terms of such associations. Doing so has many pitfalls.1,15–21 Of particular concern is that misuse of ecologic results may give rise to the ecologic fallacy, in which conclusions based on a group-level analysis differ from those that would have been drawn had an individual-level analysis been performed.

The fundamental difficulty is that ecologic studies cannot characterize within-group joint outcome/exposure/confounder distributions. This makes estimation of individual-level associations extremely difficult and is analogous to the challenge faced when an important confounder is missing. Unfortunately, the problem cannot be overcome solely via post hoc analytic methods, at least not without making untestable assumptions.22 The only reliable way to address the problem is to collect and incorporate appropriate individual-level data.

Combining group- and individual-level data has intuitive appeal. The individual-level data permit identifiability of individual-level associations via 3 mechanisms: (i) evaluation and control of bias; (ii) separation of contextual, within-, and between-group effects; and (iii) the ability to check models. Once identifiability is established, ecologic data may provide gains in power and efficiency, particularly if they represent large sample sizes and if the exposure of interest exhibits large between-group variation.

The past 20 years have seen numerous study designs and methods proposed to combine group- and individual-level data. Despite an extensive literature on developing models, particularly multilevel models,23,24 little work has been published to help researchers choose among alternative designs and, consequently, to help them plan additional data collection efforts. This paper reviews recently proposed “combined” epidemiologic study designs and describes the statistical frameworks they use to estimate individual-level associations. These ideas are illustrated with a simple, hypothetical study of birth weight, using data from North Carolina. Given individual-level data, the additional complexity of combining 2 sources of information at the analysis stage may make it appealing to ignore the group-level data. However, a simulation study illustrates the potentially substantial benefits of accommodating group-level data.

MODEL SPECIFICATION

Fundamental to each design or method reviewed here is the premise that scientific interest lies with some underlying individual-level model. Suppose the population of interest can be stratified into K groups of sizes N1, ..., NK; in environmental epidemiology, such groups are often based on geographic location.20

Let Yki be some binary outcome of interest for the ith individual in the kth group and πki = P(Yki = 1) the corresponding outcome probability. Consider the following general regression model:

$$g(\pi_{ki}) = X_{ki}^{\top}\beta \qquad (1)$$

where g() is a link function [eg, log() for a log-linear model and logit() for a logistic model] and Xki denotes a vector of covariates. The latter may include exposures of interest, confounders, and potential effect modifiers and, further, may be defined at either the individual or group level. When both individual- and group-level covariates are included, such models are often called multilevel models.23,24

Regardless of the level at which components of Xki are defined, we call the β parameters in model (1) individual-level associations because they correspond to differences in risk between 2 individuals (because the outcome is defined at the individual level).
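
As a minimal sketch of what such an individual-level model computes, the following Python fragment evaluates outcome probabilities under the log and logit links, assuming the linear-predictor form g(π) = x'β. The function names and coefficient values are illustrative assumptions and do not come from the paper.

```python
import math

def inverse_link(eta, link):
    """Map a linear predictor to a probability under the named link."""
    if link == "log":
        return math.exp(eta)
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-eta))
    raise ValueError(f"unsupported link: {link}")

def outcome_probability(x, beta, link):
    """P(Y = 1) for one individual; x includes a leading 1 for the intercept."""
    eta = sum(b * v for b, v in zip(beta, x))
    return inverse_link(eta, link)

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: intercept, one individual-level exposure,
# and one group-level covariate shared by everyone in the group.
beta = [-2.0, 0.5, 0.3]
unexposed = outcome_probability([1.0, 0.0, 0.2], beta, "logit")
exposed = outcome_probability([1.0, 1.0, 0.2], beta, "logit")

# Under the logit link, beta[1] is the log odds ratio comparing two
# individuals who differ only in the exposure.
log_odds_ratio = math.log(odds(exposed) / odds(unexposed))

# Under the log link, the same coefficient is a log relative risk.
log_relative_risk = math.log(
    outcome_probability([1.0, 1.0, 0.2], beta, "log")
    / outcome_probability([1.0, 0.0, 0.2], beta, "log")
)
```

In both cases the coefficient compares two individuals, even though one covariate in the example is defined at the group level.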

DESIGN OPTIONS

The focus of this paper is on designs that supplement readily available group-level data with a sample of individuals for whom outcome and covariate information are observed. The following describes 4 general classes of such designs and their statistical methods.

Aggregate Data Methods

Suppose one has access to the number of cases in each group, denoted N1k. Given group-specific population totals, one can calculate the observed proportion of cases for the kth group as Ȳk = N1k/Nk. Consider the induced model for the group-level outcome, πk = E[Ȳk], obtained by averaging the individual-level model (1) over the Nk individuals in the group:

$$\pi_k = \frac{1}{N_k}\sum_{i=1}^{N_k} g^{-1}\!\left(X_{ki}^{\top}\beta\right) \qquad (2)$$

The right-hand side of (2) shows that the induced model for Ȳk is a function of the underlying individual-level β parameters. Further, it demonstrates that evaluating the model requires only individual-level information on the components of X within the kth group {Xki:i = 1, ..., Nk}. Note that these data constitute the (observed) group-specific (marginal) covariate distribution.
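
The averaging step can be sketched in Python as follows. The example assumes a log link and two binary covariates. The coefficient values and the two groups are hypothetical, chosen so that both groups share the same covariate margins but differ in their joint covariate distribution.

```python
import math

def induced_group_risk(covariate_counts, beta, inverse_link):
    """Average the individual-level risk over a group's covariate distribution.

    covariate_counts maps a covariate pattern (a tuple) to the number of
    individuals in the group with that pattern; beta[0] is the intercept.
    """
    total = sum(covariate_counts.values())
    risk_sum = 0.0
    for pattern, count in covariate_counts.items():
        eta = beta[0] + sum(b * v for b, v in zip(beta[1:], pattern))
        risk_sum += count * inverse_link(eta)
    return risk_sum / total

beta = [-2.5, 0.6, -0.2]  # hypothetical intercept and two slopes

# Two hypothetical groups of 1000 with identical margins (50% with the
# first covariate equal to 1, 50% with the second equal to 1) but
# different joint distributions.
group_a = {(0, 0): 250, (0, 1): 250, (1, 0): 250, (1, 1): 250}
group_b = {(0, 0): 400, (0, 1): 100, (1, 0): 100, (1, 1): 400}

risk_a = induced_group_risk(group_a, beta, math.exp)
risk_b = induced_group_risk(group_b, beta, math.exp)
```

The two induced risks differ although the margins agree, which is why the covariate information must describe the joint distribution within each group and not only its margins.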

Exploiting these 2 features of expression (2), the aggregate data design supplements group-level outcome data with individual-level information on the covariate distribution.25 Such information could be obtained by surveying individuals for information on exposures, confounders, and effect modifiers. As there is no requirement to link this information to individual-level outcomes, it may be possible to take advantage of existing surveys to obtain these data. When the survey represents a complete enumeration of each group (ie, all Nk individuals), the combined design is called the full-survey aggregate data design. When a complete enumeration is not available or feasible, the survey subsample aggregate data design collects individual-level covariate information on a random subsample within each group. Assuming a log link for the individual-level model (1), estimates of β under both designs are obtained as the solution to an estimating equation.25

Both the full-survey and survey subsample aggregate data designs are useful when aggregated outcome counts are available and administering a survey solely for covariates is the most practical option. In some settings, one may be able to administer the survey to collect individual-level information on outcomes and covariates jointly. Combining these data with group-level outcome information, Martinez et al26 proposed the integrated aggregate data design and developed an estimating-equations framework for estimation and inference for β. They showed that combining the two sources of data can improve on analyses that use only survey-based individual-level outcome/covariate information.27

Hierarchical Related Regression

Each of the full-survey, survey subsample, and integrated aggregate data designs employs semiparametric estimating equations for its analyses. The estimating-equations framework is appealing because it does not rely on assumptions regarding the within-group covariate distributions. However, if one is willing to make distributional assumptions, one can take advantage of a fully parametric statistical framework.15,28 Modifying our notation slightly, let πki(x) = P(Yki = 1 | X = x) denote the outcome probability for the ith individual in the kth group, given a covariate vector value of X = x. As with expression (1), πki(x) is taken to be specified in terms of individual-level associations of interest. Let fk(x) denote the joint covariate distribution for the kth group. The induced group-level model is obtained by integrating the individual-level model over fk(x):

$$\pi_k = \int \pi_{ki}(x)\, f_k(x)\, dx \qquad (3)$$

Specifying fk(x) depends on the components of X. For example, Jackson et al29 considered two covariates, X1 and X2. X1 is binary and follows a Bernoulli distribution; X2 is continuous and assumed to be normally distributed, conditional on X1. While specification in general settings does require care, adopting a specific distributional form for fk(x) can improve small-sample bias, efficiency, and power if the assumptions are correct.
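
To illustrate how a distributional assumption makes the integral tractable, the sketch below assumes a log link, a Bernoulli X1, and an X2 that is normal given X1 with a common standard deviation. Under those assumptions the integral has a closed form, because E[exp(b·X2)] = exp(b·μ + b²σ²/2) for a normal X2. All parameter values are hypothetical, and the numerical integration is included only as a check on the closed form.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def group_risk_closed_form(beta, p1, means, sd):
    """Closed-form integral of exp(b0 + b1*x1 + b2*x2) over the assumed f_k(x)."""
    b0, b1, b2 = beta
    total = 0.0
    for x1, weight in ((0, 1.0 - p1), (1, p1)):
        # Normal moment generating function evaluated at b2.
        total += weight * math.exp(
            b0 + b1 * x1 + b2 * means[x1] + 0.5 * (b2 * sd) ** 2
        )
    return total

def group_risk_numeric(beta, p1, means, sd, steps=20000):
    """Midpoint-rule evaluation of the same integral, as a check."""
    b0, b1, b2 = beta
    total = 0.0
    for x1, weight in ((0, 1.0 - p1), (1, p1)):
        lo, hi = means[x1] - 10.0 * sd, means[x1] + 10.0 * sd
        width = (hi - lo) / steps
        inner = 0.0
        for j in range(steps):
            x2 = lo + (j + 0.5) * width
            risk = math.exp(b0 + b1 * x1 + b2 * x2)
            inner += risk * normal_pdf(x2, means[x1], sd) * width
        total += weight * inner
    return total

beta = (-3.0, 0.4, 0.2)   # hypothetical intercept and slopes for X1, X2
p1 = 0.3                  # hypothetical P(X1 = 1) in this group
means = (1.0, 1.5)        # hypothetical mean of X2 given X1 = 0 and X1 = 1
sd = 0.5                  # hypothetical common standard deviation of X2

closed = group_risk_closed_form(beta, p1, means, sd)
numeric = group_risk_numeric(beta, p1, means, sd)
```

With a logit link no such closed form is available in general, so the integral would have to be evaluated numerically or approximated.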

A key advantage of the parametric approach is that one can imbed (3) into a fully Bayesian analysis.2,29 Recently, Jackson et al30 introduced hierarchical related regression as a flexible Bayesian framework for combining group- and individual-level data. Hierarchical related regression extends previous developments based on expression (3) in that it permits the use of within-group joint outcome/covariate information. Viewed as a parametric analog of the integrated aggregate data design, the Bayesian formulation of hierarchical related regression is appealing because it provides flexibility for incorporating prior information and accommodating challenging data features such as spatial structure, measurement error, and missingness. The framework also facilitates data synthesis across various data sources leading to improved power to distinguish individual-level and contextual effects.31

Two-phase Designs

Recently, the two-phase design was proposed as a convenient framework for overcoming ecologic bias.32 Briefly, two-phase studies were proposed as an extension of the case-control design for settings where the exposure of interest is rare.33,34 At phase I, the population is cross-classified according to the outcome and some stratification variable, S. The latter takes on a finite number of levels and is observed on all individuals of the population. The phase I stratification provides an efficient sampling frame from which additional individual-level information is collected on a subsample at phase II.35 In this respect, the design resembles a stratified case-control study with the added advantages of being able to (i) estimate coefficients corresponding to the stratification variable (ie, S) and (ii) obtain general efficiency gains by incorporating stratified outcome totals for the population.

In the ecologic context, group-level data can be used as the basis for the phase I stratification. A simple strategy is to cross-classify the population by case status and group membership. A drawback, however, is that if the number of groups is large, the phase I stratification will have many strata and potentially small cell sizes. This may lead to a breakdown in the analysis methodology. An alternative strategy is to base the phase I stratification on observed group-level covariate measures. We illustrate this approach in our simulation study below.

Estimation and inference for individual-level association parameters using data from a two-phase design proceed via standard weighting or likelihood-based methods; Wakefield and Haneuse give a detailed summary.32
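
The weighting idea can be sketched as follows: each phase II individual is weighted by the inverse of the sampling fraction in their phase I stratum, so that the weighted sample reproduces the population totals. This is a simple inverse-probability-weighting illustration with hypothetical counts and function names, not the maximum likelihood approach, and it ignores variance estimation.

```python
def phase_two_weights(phase1_counts, phase2_counts):
    """Inverse sampling fractions N_s / n_s for each phase I stratum."""
    return {s: phase1_counts[s] / phase2_counts[s] for s in phase2_counts}

def weighted_table(phase2_records, weights):
    """Weighted counts of (y, x) from phase II records (stratum, y, x)."""
    table = {}
    for stratum, y, x in phase2_records:
        table[(y, x)] = table.get((y, x), 0.0) + weights[stratum]
    return table

def odds_ratio(table):
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])

# Hypothetical phase I totals, cross-classified by outcome y and a
# two-level stratification variable S.
phase1 = {(0, "low"): 9000, (1, "low"): 600, (0, "high"): 4500, (1, "high"): 500}

# Hypothetical phase II sample: 100 individuals per stratum, with the
# number found to be exposed (x = 1) listed for each stratum.
exposed_in_sample = {(0, "low"): 10, (1, "low"): 20, (0, "high"): 60, (1, "high"): 70}
phase2 = {stratum: 100 for stratum in phase1}

records = []
for (y, s), n_exposed in exposed_in_sample.items():
    records += [((y, s), y, 1)] * n_exposed
    records += [((y, s), y, 0)] * (100 - n_exposed)

weights = phase_two_weights(phase1, phase2)
table = weighted_table(records, weights)
weighted_or = odds_ratio(table)
```

Because the weights are the inverse sampling fractions, the weighted cell counts sum to the phase I population total.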

Hybrid Designs for Ecologic Inference

Proposed to address ecologic bias directly, the hybrid design for ecologic inference supplements an ecologic study with case-control data drawn from the same underlying population.36 Specifically, the design assumes that group-level outcome and covariate data are available and that individual-level covariate data (stratified by outcome status) are collected from each group.

Assuming an individual-level logistic model, estimation/inference proceeds via the induced hybrid likelihood, derived by averaging the individual-level likelihood over all the possible configurations of the unobserved complete individual-level data. This differs from the various aggregate data designs and the hierarchical related regression, which consider the induced group-level model derived by averaging the individual-level model over the unobserved individual-level data (see expressions (2) and (3)). Estimation and inference based on the hybrid likelihood can proceed via either maximum likelihood or within the Bayesian framework.37
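
What "averaging over configurations" means can be illustrated with a deliberately simplified case: one group, one binary covariate, and group-level margins only. The sketch below sums the individual-level logistic likelihood over every value of the unobserved number of exposed cases that is compatible with the margins, which gives a convolution of two binomials. It is not the hybrid likelihood of reference 36, which also uses the observed case-control sample; the function names and numbers are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def feasible_exposed_cases(n_cases, m_exposed, n_total):
    """Values of the unobserved exposed-case count compatible with the margins."""
    m_unexposed = n_total - m_exposed
    return range(max(0, n_cases - m_unexposed), min(n_cases, m_exposed) + 1)

def margin_likelihood(n_cases, m_exposed, n_total, beta0, beta1):
    """P(N1 = n_cases | covariate margin), summing over hidden configurations."""
    p0, p1 = expit(beta0), expit(beta0 + beta1)
    m_unexposed = n_total - m_exposed
    return sum(
        binom_pmf(a, m_exposed, p1) * binom_pmf(n_cases - a, m_unexposed, p0)
        for a in feasible_exposed_cases(n_cases, m_exposed, n_total)
    )

# Hypothetical group: 30 individuals, 12 exposed, 4 cases.
likelihood = margin_likelihood(4, 12, 30, beta0=-2.0, beta1=0.7)
```

The number of terms in the sum grows with the group sizes and with the number of covariate categories, which is consistent with the computational cost noted for likelihoods of this kind.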

Like the two-phase design, the hybrid design may be viewed as a stratified case-control design. Indeed, the hybrid design that collects case-control samples from each group is equivalent to the two-phase design where the phase I stratification is based on group membership. A key distinction, however, is that under the hybrid design, one can choose not to collect individual-level data or to collect case-only data from certain areas. This provides flexibility at the design stage, where logistical or financial constraints may preclude or limit individual-level data collection for some groups. In contrast, current analysis techniques for the aggregate data design and two-phase design exclude groups for which no or case-only individual-level data are available. Depending on modeling and distributional assumptions, hierarchical related regression also has flexibility to incorporate information from groups with no or case-only individual-level data.

EXAMPLE: LOW-BIRTH-WEIGHT DATA

To illustrate these approaches, we introduce a simple study of low birth weight (LBW; <2500 g) and consider the task of estimating the impact of infant race and sex using data compiled by the North Carolina State Center for Health Statistics (http://www.irss.unc.edu). Restricting to 2003 and 2004, North Carolina had 237,978 births, of which 21,493 were LBW. Across the K = 100 counties, the LBW rate varied from 6.0% to 15.9%; the percent nonwhite from 0.0% to 76.4%; and the percent male from 45.0% to 56.8% (Figure).

FIGURE. Distributions of group-level outcome and exposure data, across 100 counties, for the North Carolina low-birth-weight data: A, county-specific percent low birth weight; B, county-specific percent nonwhite births; and C, county-specific percent male births.

Let Yki be a binary indicator of LBW for the ith infant born in the kth county, and πki be the corresponding probability of LBW. Consider the following individual-level model:

$$g(\pi_{ki}) = \beta_0 + \beta_X X_{ki} + \beta_Z Z_{ki} \qquad (4)$$

where Xki indicates race (0/1 = white/nonwhite), Zki indicates sex (0/1 = female/male), and g() is a link function. In model (4), βX and βZ are the individual-level associations of interest.

Notational Framework

To make explicit differences in observed data structures between the reviewed designs and methods, Tables 1 and 2 present a notational framework for combining group- and individual-level data, based on the North Carolina LBW example. For ease of exposition, a county-specific subscript is omitted but should be taken as implicit throughout.

TABLE 1. Notational Framework for Combining Group- and Individual-level Data From a Generic County, for the North Carolina Low-birth-weight Example

TABLE 2. Data Structures Under Various Designs That Consider Group- and/or Individual-level Data

Consider a generic county with a population size of N. Let N0xz and N1xz denote the number of LBW noncases and cases with race/sex pattern [X = x/Z = z], respectively (Table 1A). Summed across the levels of race and sex, the marginal LBW noncase and case totals are N0 and N1. Summing across the levels of LBW, Mxz denotes the number of individuals with race/sex pattern [X = x/Z = z]. Table 1A shows the Mxz as the marginal totals for Nyxz. Table 1B provides the Mxz as the joint race/sex distribution directly, together with notation for the marginal race and sex distributions: counts Mx+, x = 0/1, and M+z, z = 0/1, respectively.

Table 1A and 1B provide upper-case notation representing all individuals in the county; Table 1C provides analogous, lower-case notation for a subsample of size n. For example, n1xz denotes the number of LBW cases with race/sex pattern [X = x/Z = z] observed in the subsample. Following our review, individual-level data may be observed only on covariates (full-survey and survey subsample aggregate data designs) or jointly on outcomes and covariates (integrated aggregate data design, hierarchical related regression, two-phase, and hybrid designs).
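
The relationships among these counts can be sketched in Python. The joint counts below are hypothetical values for a single generic county, keyed by (y, x, z); the function derives the margins described above from them.

```python
def margins(n_yxz):
    """Derive the marginal totals from joint counts keyed by (y, x, z)."""
    n_y = {
        y: sum(count for (yy, _, _), count in n_yxz.items() if yy == y)
        for y in (0, 1)
    }
    m_xz = {
        (x, z): n_yxz[(0, x, z)] + n_yxz[(1, x, z)]
        for x in (0, 1)
        for z in (0, 1)
    }
    m_x_plus = {x: m_xz[(x, 0)] + m_xz[(x, 1)] for x in (0, 1)}
    m_plus_z = {z: m_xz[(0, z)] + m_xz[(1, z)] for z in (0, 1)}
    return n_y, m_xz, m_x_plus, m_plus_z

# Hypothetical joint counts N_yxz for one county:
# y = LBW (0/1), x = race (0/1 = white/nonwhite), z = sex (0/1 = female/male).
n_yxz = {
    (0, 0, 0): 430, (0, 0, 1): 450, (0, 1, 0): 260, (0, 1, 1): 270,
    (1, 0, 0): 35,  (1, 0, 1): 30,  (1, 1, 0): 40,  (1, 1, 1): 35,
}

n_y, m_xz, m_x_plus, m_plus_z = margins(n_yxz)
```

The same function applies unchanged to lower-case subsample counts nyxz.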

Data Structures

Using the notation of Table 1, Table 2 summarizes observed data structures across various study designs. In an individual-level study, for example, one would observe either the Nyxz totals of Table 1A or the nyxz totals of Table 1C, depending on whether data were obtained on all individuals or a subsample. Taken across the levels of Y/X/Z, the totals are denoted Nyxz and nyxz, respectively. In contrast, an ecologic study design would observe only county-specific marginal LBW, race, and sex totals: {Ny, Mx+, M+z}, where Ny = {N0, N1}, Mx+ = {M0+, M1+}, and M+z = {M+0, M+1}. Table 1A and 1B make this explicit by presenting the Nyxz and Mxz counts within square brackets.

Under the full-survey aggregate data design, group-level outcome totals are supplemented with a survey collecting individual-level data on the covariate distribution. For the LBW example, these correspond to the marginal LBW and joint race/sex counts: {Ny, Mxz}, where Mxz = {M00, M01, M10, M11}. When a full survey is unavailable or unfeasible, the survey subsample aggregate data design supplements the outcome totals with race/sex information on a random subsample of n individuals: {Ny, mxz}, where mxz = {m00, m01, m10, m11}. If one can survey joint individual-level LBW/race/sex information further on a random subsample, the integrated aggregate data design combines these data with the group-level outcome totals: {Ny, nyxz}. The hierarchical related regression framework, which can be seen as a parametric analog of the integrated aggregate data design, uses these data structures and any additional covariate information; hence, the observed data may consist of {Ny, nyxz}, {Ny, Mx+, M+z, nyxz} or {Ny, Mxz, nyxz}. As noted earlier in the text, the flexibility of hierarchical related regression also permits contributions from counties where individual-level data are either unavailable (ie, {Ny, Mx+, M+z}) or case-only (ie, {Ny, Mx+, M+z, n1xz}).

The simplest two-phase study stratifies the entire population by outcome status and county membership. That is, the phase I strata are determined by the Ny across the K = 100 counties. Within each county, a subsample of n0 noncases and n1 LBW cases are sampled and their race/sex status retrospectively determined. Thus, the observed data structures are {Ny, n0xz, n1xz}. An alternative is to use group-level exposure information to stratify the population. For example, Figure B shows county-specific percent non-white rates using 5 strata; Table 3A provides the corresponding phase I stratification. From each of these 10 strata, one could retrospectively sample individuals and observe their race/sex status. Under this design, the observed data structures are {Ny, Mx+, n0xz, n1xz}.
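
A phase I stratification based on a group-level exposure can be sketched as follows. Counties are assigned to exposure bands by cutpoints and their outcome totals are pooled within each band, giving two strata (noncases and cases) per band. The cutpoints and county values are hypothetical and are not those behind Table 3A.

```python
from bisect import bisect_right

def phase_one_strata(counties, cutpoints):
    """Cross-classify county outcome totals by case status and exposure band.

    counties: iterable of (percent_nonwhite, noncases, cases).
    cutpoints: increasing values separating the exposure bands.
    """
    strata = {}
    for pct, noncases, cases in counties:
        band = bisect_right(cutpoints, pct)  # band index 0 .. len(cutpoints)
        strata[(0, band)] = strata.get((0, band), 0) + noncases
        strata[(1, band)] = strata.get((1, band), 0) + cases
    return strata

# Hypothetical cutpoints giving 5 bands of percent nonwhite births.
cutpoints = [15.0, 30.0, 45.0, 60.0]

# Hypothetical counties: (percent nonwhite, LBW noncases, LBW cases).
counties = [
    (5.0, 900, 70),
    (20.0, 1500, 140),
    (22.0, 1000, 95),
    (35.0, 800, 90),
    (50.0, 600, 75),
    (70.0, 400, 60),
]

strata = phase_one_strata(counties, cutpoints)
```

With 5 bands and 2 outcome levels this yields 10 strata regardless of the number of counties, which avoids the small cell sizes that stratifying on county membership can produce.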

TABLE 3. Phase I Stratification for the North Carolina Low-birth-weight Data, for 2 Potential Two-phase Designs, With S Taken to Be A, County-specific Percentage Nonwhite Births and B, County-specific Percentage Male Births

Finally, the hybrid design supplements an ecologic study with individual-level case-control data; hence, the available data structures are {Ny, Mx+, M+z, n0xz, n1xz}. As with hierarchical related regression, the hybrid design permits contributions from some counties from which either no individual-level data or case-only data are observed: {Ny, Mx+, M+z} and {Ny, Mx+, M+z, n1xz}, respectively.
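
For quick reference, the observed data structures described in this section can be collected into a small lookup. The dictionary keys and the helper function are naming choices made for this sketch; the symbols follow the text, written with underscores in place of subscripts.

```python
# Observed data per design, as described in the text for a generic county.
OBSERVED_DATA = {
    "ecologic": ("N_y", "M_x+", "M_+z"),
    "full-survey aggregate": ("N_y", "M_xz"),
    "survey subsample aggregate": ("N_y", "m_xz"),
    "integrated aggregate": ("N_y", "n_yxz"),
    "two-phase (county strata)": ("N_y", "n_0xz", "n_1xz"),
    "two-phase (exposure strata)": ("N_y", "M_x+", "n_0xz", "n_1xz"),
    "hybrid": ("N_y", "M_x+", "M_+z", "n_0xz", "n_1xz"),
    "hybrid (case-only county)": ("N_y", "M_x+", "M_+z", "n_1xz"),
}

def designs_observing(symbol):
    """Names of the designs whose observed data include the given symbol."""
    return sorted(
        design for design, parts in OBSERVED_DATA.items() if symbol in parts
    )

needs_full_enumeration = designs_observing("M_xz")
uses_outcome_totals = designs_observing("N_y")
```

Every design in the lookup uses the group-level outcome totals; they differ in which covariate and joint outcome/covariate information is added.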

Simulation Study

To further illustrate methods for combining group- and individual-level data, we present a short simulation study based on the North Carolina LBW data. To estimate components of model (4), we considered 8 combined designs: (i) full-survey aggregate data design; (ii) survey subsample aggregate data design with n = 200 sampled from each county; (iii) integrated aggregate data design supplementing the survey subsample aggregate data design with n = 500 more random samples from each of the 4 largest counties, for which joint outcome/covariate data are surveyed; (iv) two-phase design with phase I stratification based on county membership and n = 2000; (v) two-phase design with phase I stratification based on county-specific non-white prevalence rates (Table 3A) and n = 2000; (vi) two-phase design with phase I stratification based on county-specific sex prevalence rates (Table 3B) and n = 2000; (vii) hybrid design with 250 cases and 250 controls from each of the 4 largest areas; and (viii) hybrid design with 250 cases from each of the 4 largest areas. For the two-phase designs, phase II sample sizes were balanced across the phase I strata and estimation was based on maximum likelihood.35 For simplicity, we present only frequentist methods in our simulation study and, in particular, present no results for the hierarchical related regression approach. An online eAppendix (https://links.lww.com/EDE/A461) provides the data and code for the simulation study.

For each design, we simulated 10,000 combined group-/individual-level datasets. Throughout, the total number of births and within-county race/sex distributions were held at those in the observed data. Outcome data were generated based on model (4); a log link was used for each aggregate data design; a logit link was used for the two-phase and hybrid designs. Coefficient values for the “true” models were obtained from a fit of the complete individual-level data as follows: (−2.52, 0.59, −0.17) for the log-linear model; (−2.44, 0.66, −0.18) for the logistic model.
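
The data-generating step can be sketched as follows. The coefficient triples are those reported above, read here as (intercept, race, sex); that ordering is an assumption based on model (4). The county cell sizes, the seed, and the function names are hypothetical, and this is not the eAppendix code.

```python
import math
import random

LOG_LINEAR = (-2.52, 0.59, -0.17)  # assumed order: intercept, race, sex
LOGISTIC = (-2.44, 0.66, -0.18)

def lbw_risk(coefs, nonwhite, male, link):
    """Probability of LBW for one race/sex pattern under the given link."""
    eta = coefs[0] + coefs[1] * nonwhite + coefs[2] * male
    return math.exp(eta) if link == "log" else 1.0 / (1.0 + math.exp(-eta))

def simulate_case_counts(m_xz, coefs, link, rng):
    """Draw LBW case counts per race/sex cell, holding the cell sizes fixed."""
    cases = {}
    for (x, z), size in m_xz.items():
        p = lbw_risk(coefs, x, z, link)
        cases[(x, z)] = sum(rng.random() < p for _ in range(size))
    return cases

baseline_log = lbw_risk(LOG_LINEAR, 0, 0, "log")      # white female, log link
baseline_logit = lbw_risk(LOGISTIC, 0, 0, "logit")    # white female, logit link
relative_risk_race = math.exp(LOG_LINEAR[1])
odds_ratio_race = math.exp(LOGISTIC[1])

# Hypothetical county: race/sex cell sizes are illustrative only.
county = {(0, 0): 465, (0, 1): 480, (1, 0): 300, (1, 1): 305}
simulated = simulate_case_counts(county, LOGISTIC, "logit", random.Random(2011))
```

Under this reading of the coefficients, the baseline (white female) risk is about 8% under either link, in line with the overall LBW rate of roughly 9% (21,493 of 237,978 births) reported earlier.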

Table 4 presents small-sample percent bias, relative efficiency, and mean squared error. As analysis techniques for each design/method have been shown to be consistent (asymptotically unbiased), reported bias is due to small samples and, more specifically, not ecologic bias. Further, we note that relative efficiency is defined here as the ratio of the standard error under each design to the standard error for an analysis using individual-level outcome/exposure data. This ratio may be interpreted as how much tighter confidence intervals could be, on average, when one combines the two sources of information compared with using the individual-level data only.
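
These three operating characteristics can be computed from a set of simulated estimates as sketched below. The sketch assumes the standard error is taken as the empirical standard deviation of the estimates across replicates; the estimate lists are hypothetical and far shorter than the 10,000 replicates used in the study.

```python
from statistics import mean, stdev

def percent_bias(estimates, truth):
    """Average deviation from the true value, as a percentage of the truth."""
    return 100.0 * (mean(estimates) - truth) / truth

def relative_efficiency(estimates, reference_estimates):
    """Ratio of empirical standard errors, expressed as a percentage."""
    return 100.0 * stdev(estimates) / stdev(reference_estimates)

def mean_squared_error(estimates, truth):
    return mean([(e - truth) ** 2 for e in estimates])

truth = 0.59  # true race coefficient under the log-linear model

# Hypothetical estimates from a combined design and from an analysis of
# individual-level data only.
combined = [0.58, 0.60, 0.59, 0.61, 0.57]
individual_only = [0.55, 0.63, 0.59, 0.67, 0.51]

bias = percent_bias(combined, truth)
efficiency = relative_efficiency(combined, individual_only)
mse = mean_squared_error(combined, truth)
```

With this definition, values below 100% indicate that the combined design yields smaller standard errors than the individual-level analysis alone.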

TABLE 4. Operating Characteristics of Estimators for the Parameters From an Individual-level Model of Low Birth Weight, Under Various Designs

Across all designs, small-sample bias for the race effect is low (at most −2.8%). For the sex effect, bias under the integrated aggregate data design, two-phase, and hybrid designs is low. For the log-linear model, the full-survey and survey subsample aggregate data design estimators exhibit substantial small-sample biases of 16.7% and −91.0%, respectively. This contrasts with the two estimators that use individual-level outcome/exposure data (1.9% and −2.4%). The contrasting performance is due to the reliance of the full-survey and survey subsample aggregate data designs on between-county exposure variation as their source of information, together with the low variation in the percent male across the 100 counties (Fig. C). As the percent non-white exhibits substantial between-county variation, the full-survey and survey subsample aggregate data designs perform relatively well for the race parameter.

Overall, designs that use group-level data have improved efficiency for estimating the race effect compared with those that use individual-level data only. For the sex effect, the two aggregate data designs that do not access individual-level outcome data suffer from substantially reduced efficiency; in contrast, the integrated aggregate data design retains much of the benefit for the race effect (48.7% relative efficiency) with no tradeoff in the sex effect (95.4% relative efficiency). Each of the two-phase and hybrid designs outperforms or does no worse than a case-control design. Not surprisingly, the two-phase design that stratifies on group-level race measures has greater efficiency gains than the design that stratifies on group-level sex measures (48.8% reduction vs. 8.4%). In addition to substantial gains for the race effect (standard errors reduced by approximately 62%) and despite low between-county variation in the proportion male, the hybrid likelihood exploits this information to provide moderate efficiency gains of approximately 20% for the sex effect. Further, comparing the two hybrid designs indicates that, at least in this context, a case-only hybrid design may be a reasonable approach. The results for mean squared error reflect those of relative efficiency.

DISCUSSION

When scientific interest lies in individual-level associations, either alone or jointly with group-level associations, the only reliable solution to the ecologic inference problem is to collect individual-level data. Epidemiologists have at their disposal a range of designs that facilitate this; we have sought to provide a comprehensive overview of “combined” designs and associated analysis techniques. A short simulation study highlights potentially substantial efficiency gains associated with combining the two types of information in the analyses.

In practice, the specific choice of design will depend on the individual-level model of interest, the nature of available information, and assumptions regarding the data. Currently, the integrated aggregate data design, two-phase, and hierarchical related regression may provide the most convenient and powerful designs for researchers. In addition to the general benefits of the Bayesian framework, hierarchical related regression has the unique advantage of permitting arbitrary link functions (ie, both log() and logit()) to consider both nonrare and rare outcomes. However, hierarchical related regression requires additional input from researchers in distributional and modeling assumptions. Although sensitivity analyses are an option, the semi-parametric analyses of the integrated aggregate data design and two-phase design reduce the need for assumptions and, hence, may be appealing. Under the integrated aggregate data design, the individual-level data are obtained via simple random sampling so that the design will be most useful for nonrare outcomes. Further, Martinez et al26,27 developed their analytic framework assuming a log-linear model. The two-phase design, in contrast, is a stratified case-control study with analysis approaches having been developed assuming a logistic model for the outcome. Hence, it would likely be most appealing for rare outcomes.

Our simulation suggests that the hybrid design experiences the greatest efficiency gain from the inclusion of group-level data. This is likely due to the induced likelihood's direct use of group-level covariate data when characterizing possible configurations of the unobserved joint outcome/covariate data. The aggregate data design does not exploit such information; the two-phase design may use between-group covariate information but only indirectly as part of the phase I stratification. The hierarchical related regression approach of Jackson et al30 also uses group-level covariate data to help inform and estimate within-group covariate distributions. A drawback of the hybrid likelihood, however, is that it is computationally expensive and the statistical development has so far been limited to a few categorical covariates. None of the other reviewed designs is limited in this respect. Although the simulation study did not examine hierarchical related regression, we anticipate its performance to be similar to that of the integrated aggregate data and hybrid designs. A comprehensive statistical evaluation of each of the designs and methods is beyond the scope of this paper but could be useful for researchers considering these designs.

Beyond statistical considerations, when choosing between combined designs, researchers need to weigh numerous practical and epidemiologic issues. For example, logistical and financial constraints may preclude the collection of individual-level data from each group or area. In other settings, researchers may look to supplement readily available individual-level data with appropriate group-level data.38 From an epidemiologic perspective, model specification and interpretation can be challenging in multilevel settings. Specific issues include distinguishing between- from within-group effects; appropriately using between- and within-group exposure variation; characterizing and identifying between- and within-group confounding; identifying potential contextual effects; and ensuring compatibility of differing data sources. These issues are crucial to the design process in that they determine the data elements that require collection.23,24,31

We emphasize that no single design is ideal, and researchers have flexibility to tailor their choice to their specific setting. Indeed, the sequential nature of the designs (that is, collecting individual-level data given group-level data) lends itself to considering design issues that may improve efficiency, with group-level characteristics potentially being incorporated into decision-making. To date, little work has focused on study design in this context.32,39,40 Further work on these competing strategies of sampling and analyses would give researchers practical guidance.

REFERENCES

1. Morgenstern H. Ecologic studies. In: Rothman KJ, Greenland S, Lash T, eds. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008:511–531.
2. Best NG, Cockings S, Bennett J, Wakefield J, Elliott P. Ecological regression analysis of environmental benzene exposure and childhood leukaemia: sensitivity to data inaccuracies, geographical scale and ecological bias. J R Stat Soc Ser A Stat Soc. 2001;164:155–174.
3. Wilkinson P, Thakrar B, Walls P, et al. Lymphohaematopoietic malignancy around all industrial complexes that include major oil refineries in Great Britain. Occup Environ Med. 1999;56:577–580.
4. Whitley E, Darby S. Quantifying the risks from residential radon. In: Barnett V, Stein A, Turkman K, eds. Statistics for the Environment 4: Statistical Aspects of Health and the Environment. Chichester: John Wiley & Sons; 1999:71–89.
5. Maheswaran R, Morris S, Falconer S, et al. Magnesium in drinking water supplies and mortality from acute myocardial infarction in north west England. Heart. 1999;82:455–460.
6. Hu S, Ma F, Collado-Mesa F, Kirsner RS. Ultraviolet radiation and incidence of non-Hodgkin's lymphoma among Hispanics in the United States. Cancer Epidemiol Biomarkers Prev. 2004;13:59–64.
7. Reynolds P, Hurley SE, Gunier RB, Yerabati S, Quach T, Hertz A. Residential proximity to agricultural pesticide use and incidence of breast cancer in California, 1988–1997. Environ Health Perspect. 2005;113:993–1000.
8. Shaw PA, Etzioni R, Zeliadt SB, et al. An ecologic study of prostate-specific antigen screening and prostate cancer mortality in nine geographic areas of the United States. Am J Epidemiol. 2004;160:1059–1069.
9. Das B, Feuer EJ, Mariotto A. Geographic association between mammography use and mortality reduction in the US. Cancer Causes Control. 2005;16:691–699.
10. Grant W. Ecological study of dietary and smoking links to lymphoma. Altern Med Rev. 2000;5:563–572.
11. Pepin J. From the old world to the new world: an ecologic study of population susceptibility to HIV infection. Trop Med Int Health. 2005;10:627–639.
12. Simonsen L, Reichert TA, Viboud C, Blackwelder WC, Taylor RJ, Miller MA. Impact of influenza vaccination on seasonal mortality in the US elderly population. Arch Intern Med. 2005;165:265–272.
13. Goldhagen J, Remo R, Bryant T III, et al. The health status of southern children: a neglected regional disparity. Pediatrics. 2005;116:e746–e753.
14. Michael Y, Yen I. Built environment and obesity among older adults—can neighborhood-level policy interventions make a difference [commentary]? Am J Epidemiol. 2009;169:409–412.
15. Richardson S, Stucker I, Hemon D. Comparison of relative risks obtained in ecological and individual studies: some methodological considerations. Int J Epidemiol. 1987;16:111–120.
16. Piantadosi S, Byar D, Green S. The ecological fallacy. Am J Epidemiol. 1988;127:893–904.
17. Greenland S, Morgenstern H. Ecological bias, confounding, and effect modification. Int J Epidemiol. 1989;18:269–274.
18. Greenland S. Divergent biases in ecologic and individual-level studies. Stat Med. 1992;11:1209–1223.
19. Greenland S, Robins J. Ecologic studies—biases, misconceptions, and counterexamples [commentary]. Am J Epidemiol. 1994;139:747–760.
20. Richardson S, Monfort C. Ecological correlation studies. In: Elliott P, Wakefield JC, Best NG, et al, eds. Spatial Epidemiology: Methods and Applications. Oxford: Oxford University Press; 2000.
21. Wakefield J. Ecological inference for 2 × 2 tables. J R Stat Soc Ser A Stat Soc. 2004;167:385–445.
22. Wakefield J. Sensitivity analyses for ecological regression. Biometrics. 2003;59:9–17.
23. Diez-Roux AV. Bringing context back into epidemiology: variables and fallacies in multilevel analysis. Am J Public Health. 1998;88:216–222.
24. Diez-Roux AV. The study of group-level factors in epidemiology: rethinking variables, study designs, and analytical approaches. Epidemiol Rev. 2004;26:104–111.
25. Prentice RL, Sheppard L. Aggregate data studies of disease risk factors. Biometrika. 1995;82:113–125.
26. Martinez JM, Benach J, Ginebra J. An integrated analysis of individual and aggregated health data using estimating equations. Int J Biostat. 2007;3:10.
27. Martinez JM, Benach J, Benavides FG, et al. Improving multilevel analyses: the integrated epidemiologic design. Epidemiology. 2009;20:525–532.
28. Lasserre V, Guihenneuc-Jouyaux C, Richardson S. Biases in ecological studies: utility of including within-area distribution of confounders. Stat Med. 2000;19:45–59.
29. Jackson CH, Best NG, Richardson S. Improving ecological inference using individual-level data. Stat Med. 2006;25:2136–2159.
30. Jackson CH, Best NG, Richardson S. Hierarchical related regression for combining aggregate and individual data in studies of socio-economic disease risk factors. J R Stat Soc Ser A Stat Soc. 2008;171:159–178.
31. Jackson CH, Richardson S, Best NG. Studying place effects on health by synthesising individual and area-level outcomes. Soc Sci Med. 2008;67:1995–2006.
32. Wakefield J, Haneuse S. Overcoming ecologic bias using the two-phase study design. Am J Epidemiol. 2008;167:908–916.
33. White E. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982;115:119–128.
34. Weinberg CR, Wacholder S. The design and analysis of case-control studies with biased sampling. Biometrics. 1990;46:963–975.
35. Breslow NE, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. J R Stat Soc Ser C Appl Stat. 1999;48:457–468.
36. Haneuse S, Wakefield J. Geographic-based ecological correlation studies using supplemental case-control data. Stat Med. 2008;27:864–887.
37. Haneuse S, Wakefield J. Hierarchical models for combining ecological and case control data. Biometrics. 2007;63:128–136.
38. Stromberg U, Bjork J. Incorporating group-level exposure information in case-control studies with missing data on dichotomous exposures. Epidemiology. 2004;15:494–503.
39. Plummer M, Clayton D. Estimation of population exposure in ecological studies. J R Stat Soc Series B Stat Methodol. 1996;58:113–126.
40. Sheppard L, Prentice RL, Rossing MA. Design considerations for estimation of exposure effects on disease risk, using aggregate data studies. Stat Med. 1996;15:1849–1858.

© 2011 Lippincott Williams & Wilkins, Inc.