In ecologic studies, the fundamental unit of investigation is a group of individuals, rather than individuals themselves.^{1} Ecologic studies are widely used because group-level (or aggregated) data are easy and inexpensive to obtain, particularly through data depositories such as disease registries and census data. Further, developments in computing (eg, geographical information systems) let researchers combine information at varying levels of aggregation.^{2} Taking advantage of these strengths, ecologic designs continue to be employed in many epidemiologic settings, including studies of environmental risk factors,^{3–7} cancer screening,^{8,9} and investigations of chronic^{10} and infectious diseases.^{11,12}

Notwithstanding their continued use, ecologic studies are controversial because they directly assess group-level associations, that is, relationships between group-level outcomes and group-level exposure measures. Such associations are sometimes of interest,^{13} particularly for policymaking.^{14} Typically, though, the scientific goal in epidemiology is to assess individual-level associations. With group-level data alone, one generally cannot estimate individual-level associations, although one may be tempted to interpret results from an ecologic study in terms of such associations. Doing so has many pitfalls.^{1,15–21} Of particular concern is that misuse of ecologic results may give rise to the ecologic fallacy, in which conclusions based on a group-level analysis differ from those that would have been drawn had an individual-level analysis been performed.

The fundamental difficulty is that ecologic studies cannot characterize within-group joint outcome/exposure/confounder distributions. This makes estimation of individual-level associations extremely difficult and is analogous to the challenge faced when an important confounder is missing. Unfortunately, the problem cannot be overcome solely via post hoc analytic methods, at least not without making untestable assumptions.^{22} The only reliable way to address the problem is to collect and incorporate appropriate individual-level data.

Combining group- and individual-level data has intuitive appeal. The individual-level data permit identifiability of individual-level associations via 3 mechanisms: (i) evaluation and control of bias; (ii) separation of contextual, within-, and between-group effects; and (iii) the ability to check models. Once identifiability is established, ecologic data may provide gains in power and efficiency, particularly if they represent large sample sizes and if the exposure of interest exhibits large between-group variation.

The past 20 years have seen numerous study designs and methods proposed to combine group- and individual-level data. Despite an extensive literature on developing models, particularly multilevel models,^{23,24} little work has been published to help researchers choose among alternative designs and, consequently, to help them plan additional data collection efforts. This paper reviews recently proposed “combined” epidemiologic study designs and describes the statistical frameworks they use to estimate individual-level associations. These ideas are illustrated with a simple, hypothetical study of birth weight, using data from North Carolina. Given individual-level data, the additional complexity of combining 2 sources of information at the analysis stage may make it appealing to ignore the group-level data. However, a simulation study illustrates the potentially substantial benefits of accommodating group-level data.

## MODEL SPECIFICATION

Fundamental to each design or method reviewed here is the premise that scientific interest lies with some underlying individual-level model. Suppose the population of interest can be stratified into K groups of sizes N_{1}, ..., N_{K}; in environmental epidemiology, such groups are often based on geographic location.^{20}

Let Y_{ki} be a binary outcome of interest for the *i*^{th} individual in the *k*^{th} group and π_{ki} = *P*(Y_{ki} = 1) the corresponding outcome probability. Consider the following general regression model:

$$g(\pi_{ki}) = X_{ki}^{T}\beta, \qquad (1)$$

where g() is a link function [eg, log() for a log-linear model and logit() for a logistic model] and X_{ki} denotes a vector of covariates. The latter may include exposures of interest, confounders, and potential effect modifiers and, further, may be defined at either the individual or group level. When both individual- and group-level covariates are included, such models are often called multilevel models.^{23,24}

Regardless of the level at which components of X_{ki} are defined, we call the β parameters in model (1) individual-level associations because they correspond to differences in risk between 2 individuals (because the outcome is defined at the individual level).
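To make the interpretation of β concrete, the following minimal sketch evaluates an individual-level model of this form under the log and logit links mentioned above. The coefficient values, the covariate coding (a leading 1 for the intercept followed by a single binary exposure), and all function names are illustrative assumptions, not quantities taken from the studies reviewed here.

```python
import math

def inverse_log(eta):
    """Inverse of the log link: risk = exp(linear predictor)."""
    return math.exp(eta)

def inverse_logit(eta):
    """Inverse of the logit link: risk = 1 / (1 + exp(-linear predictor))."""
    return 1.0 / (1.0 + math.exp(-eta))

def individual_risk(x, beta, inverse_link):
    """Outcome probability for one individual with covariate vector x."""
    eta = sum(b * xj for b, xj in zip(beta, x))
    return inverse_link(eta)

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: intercept and one binary exposure.
beta = [-2.5, 0.6]
unexposed = [1.0, 0.0]
exposed = [1.0, 1.0]

# Under a log link, exp(beta) compares the risks of two individuals.
risk_ratio = (individual_risk(exposed, beta, inverse_log)
              / individual_risk(unexposed, beta, inverse_log))

# Under a logit link, exp(beta) compares their odds.
odds_ratio = (odds(individual_risk(exposed, beta, inverse_logit))
              / odds(individual_risk(unexposed, beta, inverse_logit)))
```

In both cases the contrast is between 2 individuals who differ in the exposure, which is the sense in which the β parameters are individual-level associations.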

## DESIGN OPTIONS

The focus of this paper is on designs that supplement readily available group-level data with a sample of individuals for whom outcome and covariate information are observed. The following describes 4 general classes of such designs and their statistical methods.

### Aggregate Data Methods

Suppose one has access to the number of cases in each group, denoted N_{1k}. Given group-specific population totals, one can calculate the observed proportion of cases for the *k*^{th} group as Ȳ_{k} = N_{1k}/N_{k}. Consider the induced model for the group-level outcome, π_{k} = E[Ȳ_{k}], obtained by averaging the individual-level model (1) over the N_{k} individuals in the group:

$$\pi_{k} = E[\bar{Y}_{k}] = \frac{1}{N_{k}} \sum_{i=1}^{N_{k}} g^{-1}(X_{ki}^{T}\beta). \qquad (2)$$

The right-hand side of (2) shows that the induced model for Ȳ_{k} is a function of the underlying individual-level β parameters. Further, it demonstrates that evaluating the model requires only individual-level information on the components of X within the *k*^{th} group, {X_{ki}: i = 1, ..., N_{k}}. Note that these data constitute the (observed) group-specific (marginal) covariate distribution.
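As a hedged illustration of this averaging step, the sketch below computes the induced group-level probability for a small, entirely hypothetical group. It assumes a log link, a leading intercept column, and a single binary exposure; the function name and the numbers are not from the source.

```python
import math

def induced_group_risk(covariates, beta, inverse_link=math.exp):
    """Average the individual-level risks g^{-1}(x' beta) over a group.

    `covariates` holds one covariate vector per individual in the group.
    No outcome data are needed, only the covariate distribution.
    """
    risks = [inverse_link(sum(b * xj for b, xj in zip(beta, row)))
             for row in covariates]
    return sum(risks) / len(risks)

# Hypothetical group of 5 individuals: columns are intercept, exposure.
group = [[1, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
beta = [-2.5, 0.6]  # hypothetical coefficients

pi_k = induced_group_risk(group, beta)

# With a log link and a binary exposure the average has a closed form
# that depends on the covariates only through the proportion exposed.
proportion_exposed = 3 / 5
closed_form = math.exp(beta[0]) * (
    (1 - proportion_exposed) + proportion_exposed * math.exp(beta[1])
)
```

The agreement between `pi_k` and `closed_form` reflects the point made above: the induced model depends on β and on the within-group covariate distribution, not on which individuals became cases.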

Exploiting these 2 features of expression (2), the aggregate data design supplements group-level outcome data with individual-level information on the covariate distribution.^{25} Such information could be obtained by surveying individuals for information on exposures, confounders, and effect modifiers. As there is no requirement to link this information to individual-level outcomes, it may be possible to take advantage of existing surveys to obtain these data. When the survey represents a complete enumeration of each group (ie, all N_{k} individuals), the combined design is called the full-survey aggregate data design. When a complete enumeration is not available or feasible, the survey subsample aggregate data design collects individual-level covariate information on a random subsample within each group. Assuming a log link for the individual-level model (1), estimates of β under both designs are obtained as the solution to an estimating equation.^{25}

Both the full-survey and survey subsample aggregate data designs are useful when aggregated outcome counts are available and administering a survey solely for covariates is most practical. In some settings, one may be able to administer the survey to collect individual-level information on outcomes and covariates jointly. Combining these data with group-level outcome information, Martinez et al^{26} proposed the integrated aggregate data design and developed an estimating-equations framework for estimation and inference for β. They showed that combining the two sources of data can improve on analyses that use only survey-based individual-level outcome/covariate information.^{27}

### Hierarchical Related Regression

Each of the full-survey, survey subsample, and integrated aggregate data designs employs semiparametric estimating equations for its analyses. The estimating-equations framework is appealing because it does not rely on assumptions regarding the within-group covariate distributions. However, if one is willing to make distributional assumptions, one can take advantage of a fully parametric statistical framework.^{15,28} Modifying our notation slightly, let π_{ki}(x) = *P*(Y_{ki} = 1 | X = x) denote the outcome probability for the *i*^{th} individual in the *k*^{th} group, given a covariate vector value of X = x. As with expression (1), π_{ki}(x) is taken to be specified in terms of individual-level associations of interest. Let f_{k}(x) denote the joint covariate distribution for the *k*^{th} group. The induced group-level model is obtained by integrating the individual-level model over f_{k}(x):

$$\pi_{k} = \int \pi_{ki}(x)\, f_{k}(x)\, dx. \qquad (3)$$

Specifying f_{k}(x) depends on the components of X. For example, Jackson et al^{29} considered two covariates, X_{1} and X_{2}. X_{1} is binary and follows a Bernoulli distribution; X_{2} is continuous and assumed to be normally distributed, conditional on X_{1}. While specification in general settings does require care, adopting a specific distributional form for f_{k}(x) can improve small-sample bias, efficiency, and power if the assumptions are correct.
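The following sketch illustrates this kind of integration for a covariate structure like the one just described: a binary X_{1} and a continuous X_{2} that is normal conditional on X_{1}. It assumes a log link, for which the integral over X_{2} has a closed form through the normal moment generating function, and checks that form against a simple trapezoid rule. All parameter values and function names are hypothetical, and the code is not the implementation used by the cited authors.

```python
import math

def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def induced_risk_closed_form(beta, p1, mu, sd):
    """Integrate exp(b0 + b1*x1 + b2*x2) over X1 ~ Bernoulli(p1) and
    X2 | X1 = x1 ~ Normal(mu[x1], sd[x1]^2)."""
    b0, b1, b2 = beta
    total = 0.0
    for x1, weight in ((0, 1.0 - p1), (1, p1)):
        # E[exp(b2 * X2)] for a normal X2 is exp(b2*mu + (b2*sd)^2 / 2).
        total += weight * math.exp(
            b0 + b1 * x1 + b2 * mu[x1] + 0.5 * (b2 * sd[x1]) ** 2
        )
    return total

def induced_risk_numeric(beta, p1, mu, sd, n_grid=4001, width=10.0):
    """Same integral by the trapezoid rule over mu +/- width * sd."""
    b0, b1, b2 = beta
    total = 0.0
    for x1, weight in ((0, 1.0 - p1), (1, p1)):
        lower = mu[x1] - width * sd[x1]
        step = 2.0 * width * sd[x1] / (n_grid - 1)
        integral = 0.0
        for j in range(n_grid):
            x2 = lower + j * step
            value = math.exp(b0 + b1 * x1 + b2 * x2) * normal_pdf(x2, mu[x1], sd[x1])
            end_weight = 0.5 if j in (0, n_grid - 1) else 1.0
            integral += end_weight * value * step
        total += weight * integral
    return total

# Hypothetical values for one group.
beta = (-3.0, 0.5, 0.3)
p1 = 0.4                 # P(X1 = 1)
mu = {0: 0.0, 1: 0.5}    # mean of X2 given X1
sd = {0: 1.0, 1: 1.2}    # standard deviation of X2 given X1

pi_closed = induced_risk_closed_form(beta, p1, mu, sd)
pi_numeric = induced_risk_numeric(beta, p1, mu, sd)
```

A log link does not constrain risks below 1 for extreme values of X_{2}, so a sketch like this is only sensible for rare outcomes and moderate coefficients. With other links the integral generally has no closed form and would be evaluated numerically.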

A key advantage of the parametric approach is that one can embed (3) into a fully Bayesian analysis.^{2,29} Recently, Jackson et al^{30} introduced hierarchical related regression as a flexible Bayesian framework for combining group- and individual-level data. Hierarchical related regression extends previous developments based on expression (3) in that it permits the use of within-group joint outcome/covariate information. Viewed as a parametric analog of the integrated aggregate data design, the Bayesian formulation of hierarchical related regression is appealing because it provides flexibility for incorporating prior information and accommodating challenging data features such as spatial structure, measurement error, and missingness. The framework also facilitates data synthesis across various data sources, leading to improved power to distinguish individual-level and contextual effects.^{31}

### Two-phase Designs

Recently, the two-phase design was proposed as a convenient framework for overcoming ecologic bias.^{32} Briefly, two-phase studies were proposed as an extension of the case-control design for settings where the exposure of interest is rare.^{33,34} At phase I, the population is cross-classified according to the outcome and some stratification variable, S. The latter takes on a finite number of levels and is observed on all individuals of the population. The phase I stratification provides an efficient sampling frame from which additional individual-level information is collected on a subsample at phase II.^{35} In this respect, the design resembles a stratified case-control study with the added advantages of being able to (i) estimate coefficients corresponding to the stratification variable (ie, S) and (ii) obtain general efficiency gains by incorporating stratified outcome totals for the population.

In the ecologic context, group-level data can be used as the basis for the phase I stratification. A simple strategy is to cross-classify the population by case status and group membership. A drawback, however, is that if the number of groups is large, the phase I stratification will have many strata and potentially small cell sizes. This may lead to a breakdown in the analysis methodology. An alternative strategy is to base the phase I stratification on observed group-level covariate measures. We illustrate this approach in our simulation study below.

Estimation and inference for individual-level association parameters using data from a two-phase design follow from standard weighting or likelihood-based methods; Wakefield and Haneuse give a detailed summary.^{32}
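The weighting idea can be sketched in a few lines. Each phase II individual is weighted by the inverse of the sampling fraction in their phase I stratum, so the subsample can stand in for the full population. The example below uses hypothetical counts with 2 phase I strata (cases and noncases) and estimates a population exposure prevalence. It illustrates only the basic weights, not the maximum likelihood estimator, and all names are assumptions.

```python
def phase_two_weights(phase1_counts, phase2_counts):
    """Inverse sampling fractions N_s / n_s for each phase I stratum s."""
    return {s: phase1_counts[s] / phase2_counts[s] for s in phase2_counts}

def weighted_exposure_prevalence(sample, weights):
    """Weighted proportion exposed among phase II individuals.

    `sample` is a list of (stratum, exposed) pairs with exposed in {0, 1}.
    """
    numerator = sum(weights[s] * exposed for s, exposed in sample)
    denominator = sum(weights[s] for s, _ in sample)
    return numerator / denominator

# Hypothetical phase I totals and phase II sample sizes.
phase1_counts = {"case": 100, "noncase": 900}
phase2_counts = {"case": 50, "noncase": 50}
weights = phase_two_weights(phase1_counts, phase2_counts)

# Hypothetical phase II data: 30 of 50 cases and 10 of 50 noncases exposed.
sample = (
    [("case", 1)] * 30 + [("case", 0)] * 20
    + [("noncase", 1)] * 10 + [("noncase", 0)] * 40
)

prevalence = weighted_exposure_prevalence(sample, weights)
naive_prevalence = sum(exposed for _, exposed in sample) / len(sample)
```

The unweighted proportion overstates the exposure prevalence here because cases are oversampled. The weights correct for this by using the phase I totals.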

### Hybrid Designs for Ecologic Inference

Proposed to address ecologic bias directly, the hybrid design for ecologic inference supplements an ecologic study with case-control data drawn from the same underlying population.^{36} Specifically, the design assumes that group-level outcome and covariate data are available and that individual-level covariate data (stratified by outcome status) are collected from each group.

Assuming an individual-level logistic model, estimation/inference proceeds via the induced hybrid likelihood, derived by averaging the individual-level likelihood over all the possible configurations of the unobserved complete individual-level data. This differs from the various aggregate data designs and the hierarchical related regression, which consider the induced group-level model derived by averaging the individual-level model over the unobserved individual-level data (see expressions (2) and (3)). Estimation and inference based on the hybrid likelihood can proceed via either maximum likelihood or within the Bayesian framework.^{37}
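To give a sense of what averaging over configurations involves, the sketch below treats the simplest possible case: one group, a single binary exposure, and only group-level data (the case total and the exposure margin). The number of exposed cases is unobserved, so the likelihood sums over every value it could take. This is a simplified illustration of the group-level component only. It does not reproduce the hybrid likelihood of the cited work, which also incorporates the case-control sample, and the function names and numbers are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def group_level_likelihood(n_cases, m_unexposed, m_exposed, p0, p1):
    """P(total cases = n_cases) given exposure margins and risks p0, p1.

    Sums over j, the unobserved number of exposed cases; the remaining
    n_cases - j cases must come from the unexposed.
    """
    lowest = max(0, n_cases - m_unexposed)
    highest = min(n_cases, m_exposed)
    total = 0.0
    for j in range(lowest, highest + 1):
        total += (binomial_pmf(j, m_exposed, p1)
                  * binomial_pmf(n_cases - j, m_unexposed, p0))
    return total

def logistic_risk(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical group: 12 unexposed, 8 exposed, 3 observed cases.
b0, b1 = -2.0, 0.7
p0, p1 = logistic_risk(b0, b1, 0), logistic_risk(b0, b1, 1)
likelihood = group_level_likelihood(3, 12, 8, p0, p1)

# Sanity check: the probabilities of all possible case totals sum to 1.
total_probability = sum(
    group_level_likelihood(c, 12, 8, p0, p1) for c in range(21)
)
```

The number of configurations grows quickly with group size and with the number of covariate categories, which is consistent with the computational cost of this style of likelihood.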

Like the two-phase design, the hybrid design may be viewed as a stratified case-control design. Indeed, the hybrid design that collects case-control samples from each group is equivalent to the two-phase design where the phase I stratification is based on group membership. A key distinction, however, is that under the hybrid design, one can choose not to collect individual-level data or to collect case-only data from certain areas. This provides flexibility at the design stage, where logistical or financial constraints may preclude or limit individual-level data collection for some groups. In contrast, current analysis techniques for the aggregate data design and two-phase design exclude groups for which no or case-only individual-level data are available. Depending on modeling and distributional assumptions, hierarchical related regression also has flexibility to incorporate information from groups with no or case-only individual-level data.

## EXAMPLE: LOW-BIRTH-WEIGHT DATA

To illustrate these approaches, we introduce a simple study of low birth weight (LBW; <2500 g) and consider the task of estimating the impact of infant race and sex using data compiled by the North Carolina State Center for Health Statistics (http://www.irss.unc.edu). Restricting to 2003 and 2004, North Carolina had 237,978 births, of which 21,493 were LBW. Across the K = 100 counties, the LBW rate varied from 6.0% to 15.9%; the percent nonwhite from 0.0% to 76.4%; and the percent male from 45.0% to 56.8% (Figure).

Let Y_{ki} be a binary indicator of LBW for the *i*^{th} infant born in the *k*^{th} county, and π_{ki} be the corresponding probability of LBW. Consider the following individual-level model:

$$g(\pi_{ki}) = \beta_{0} + \beta_{X} X_{ki} + \beta_{Z} Z_{ki}, \qquad (4)$$

where X_{ki} indicates race (0/1 = white/nonwhite), *Z*_{ki} indicates sex (0/1 = female/male), and g() is a link function. In model (4), β_{X} and β_{Z} are the individual-level associations of interest.

### Notational Framework

To make explicit differences in observed data structures between the reviewed designs and methods, Tables 1 and 2 present a notational framework for combining group- and individual-level data, based on the North Carolina LBW example. For ease of exposition, a county-specific subscript is omitted but should be taken as implicit throughout.

Consider a generic county with a population size of N. Let N_{0xz} and N_{1xz} denote the number of LBW noncases and cases with race/sex pattern [X = x/Z = z], respectively (Table 1A). Summed across the levels of race and sex, the marginal LBW noncase and case totals are N_{0} and N_{1}. Summing across the levels of LBW, M_{xz} denotes the number of individuals with race/sex pattern [X = x/Z = z]. Table 1A shows the M_{xz} as the marginal totals for N_{yxz}. Table 1B provides the M_{xz} as the joint race/sex distribution directly, together with notation for the marginal race and sex distributions: counts M_{x+}, x = 0/1, and M_{+z}, z = 0/1, respectively.

Table 1A and 1B provide upper-case notation representing all individuals in the county; Table 1C provides analogous, lower-case notation for a subsample of size n. For example, n_{1xz} denotes the number of LBW cases with race/sex pattern [*X* = *x*/*Z* = *z*] observed in the subsample. Following our review, individual-level data may be observed only on covariates (full-survey and survey subsample aggregate data designs) or jointly on outcomes and covariates (integrated aggregate data design, hierarchical related regression, two-phase, and hybrid designs).
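The relationships among these counts can be sketched in code. Starting from a full cross-classification N_{yxz} for one county, the function below derives the marginal outcome totals, the joint race/sex totals, and the marginal race and sex totals. The counts are invented for illustration and the function name is an assumption.

```python
def county_margins(n_yxz):
    """Derive marginal totals from counts indexed as n_yxz[y][x][z].

    y: 0/1 = noncase/case, x: 0/1 = white/nonwhite, z: 0/1 = female/male.
    """
    levels = (0, 1)
    n_y = [sum(n_yxz[y][x][z] for x in levels for z in levels) for y in levels]
    m_xz = [[n_yxz[0][x][z] + n_yxz[1][x][z] for z in levels] for x in levels]
    m_x_plus = [m_xz[x][0] + m_xz[x][1] for x in levels]
    m_plus_z = [m_xz[0][z] + m_xz[1][z] for z in levels]
    return {"N_y": n_y, "M_xz": m_xz, "M_x+": m_x_plus, "M_+z": m_plus_z}

# Hypothetical county: counts[y][x][z].
counts = [
    [[370, 390], [78, 80]],   # noncases by race (rows) and sex (columns)
    [[30, 30], [12, 10]],     # LBW cases by race and sex
]

margins = county_margins(counts)
population_size = sum(margins["N_y"])
```

Each margin sums to the same county population size N, but the margins alone do not determine the joint counts N_{yxz}; recovering that joint information is what the individual-level data contribute.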

### Data Structures

Using the notation of Table 1, Table 2 summarizes observed data structures across various study designs. In an individual-level study, for example, one would observe either the N_{yxz} totals of Table 1A or the n_{yxz} totals of Table 1C, depending on whether data were obtained on all individuals or a subsample. Taken across the levels of Y/X/Z, the totals are denoted **N**_{yxz} and **n**_{yxz}, respectively. In contrast, an ecologic study design would observe only county-specific marginal LBW, race, and sex totals: {**N**_{y}, **M**_{x+}, **M**_{+z}}, where **N**_{y} = {N_{0}, N_{1}}, **M**_{x+} = {M_{0+}, M_{1+}}, and **M**_{+z} = {M_{+0}, M_{+1}}. Table 1A and 1B make this explicit by presenting the N_{yxz} and M_{xz} counts within square brackets.

Under the full-survey aggregate data design, group-level outcome totals are supplemented with a survey collecting individual-level data on the covariate distribution. For the LBW example, these correspond to the marginal LBW and joint race/sex counts: {**N**_{y}, **M**_{xz}}, where **M**_{xz} = {M_{00}, M_{01}, M_{10}, M_{11}}. When a full survey is unavailable or unfeasible, the survey subsample aggregate data design supplements the outcome totals with race/sex information on a random subsample of n individuals: {**N**_{y}, **m**_{xz}}, where **m**_{xz} = {m_{00}, m_{01}, m_{10}, m_{11}}. If one can survey joint individual-level LBW/race/sex information further on a random subsample, the integrated aggregate data design combines these data with the group-level outcome totals: {**N**_{y}, **n**_{yxz}}. The hierarchical related regression framework, which can be seen as a parametric analog of the integrated aggregate data design, uses these data structures and any additional covariate information; hence, the observed data may consist of {**N**_{y}, **n**_{yxz}}, {**N**_{y}, **M**_{x+}, **M**_{+z}, **n**_{yxz}}, or {**N**_{y}, **M**_{xz}, **n**_{yxz}}. As noted earlier in the text, the flexibility of hierarchical related regression also permits contributions from counties where individual-level data are either unavailable (ie, {**N**_{y}, **M**_{x+}, **M**_{+z}}) or case-only (ie, {**N**_{y}, **M**_{x+}, **M**_{+z}, **n**_{1xz}}).

The simplest two-phase study stratifies the entire population by outcome status and county membership. That is, the phase I strata are determined by the **N**_{y} across the K = 100 counties. Within each county, a subsample of n_{0} noncases and n_{1} LBW cases are sampled and their race/sex status retrospectively determined. Thus, the observed data structures are {**N**_{y}, **n**_{0xz}, **n**_{1xz}}. An alternative is to use group-level exposure information to stratify the population. For example, Figure B shows county-specific percent non-white rates using 5 strata; Table 3A provides the corresponding phase I stratification. From each of these 10 strata, one could retrospectively sample individuals and observe their race/sex status. Under this design, the observed data structures are {**N**_{y}, **M**_{x+}, **n**_{0xz}, **n**_{1xz}}.

Finally, the hybrid design supplements an ecologic study with individual-level case-control data; hence, the available data structures are {**N**_{y}, **M**_{x+}, **M**_{+z}, **n**_{0xz}, **n**_{1xz}}. As with hierarchical related regression, the hybrid design permits contributions from some counties from which either no individual-level data or case-only data are observed: {**N**_{y}, **M**_{x+}, **M**_{+z}} and {**N**_{y}, **M**_{x+}, **M**_{+z}, **n**_{1xz}}, respectively.

### Simulation Study

To further illustrate methods for combining group- and individual-level data, we present a short simulation study based on the North Carolina LBW data. To estimate components of model (4), we considered 8 combined designs: (i) full-survey aggregate data design; (ii) survey subsample aggregate data design with n = 200 sampled from each county; (iii) integrated aggregate data design supplementing the survey subsample aggregate data design with n = 500 more random samples from each of the 4 largest counties, for which joint outcome/covariate data are surveyed; (iv) two-phase design with phase I stratification based on county membership and n = 2000; (v) two-phase design with phase I stratification based on county-specific non-white prevalence rates (Table 3A) and n = 2000; (vi) two-phase design with phase I stratification based on county-specific sex prevalence rates (Table 3B) and n = 2000; (vii) hybrid design with 250 cases and 250 controls from each of the 4 largest areas; and (viii) hybrid design with 250 cases from each of the 4 largest areas. For the two-phase designs, phase II sample sizes were balanced across the phase I strata and estimation was based on maximum likelihood.^{35} For simplicity, we present only frequentist methods in our simulation study and, in particular, present no results for the hierarchical related regression approach. An online eAppendix (https://links.lww.com/EDE/A461) provides the data and code for the simulation study.

For each design, we simulated 10,000 combined group-/individual-level datasets. Throughout, the total number of births and within-county race/sex distributions were held at those in the observed data. Outcome data were generated based on model (4); a log link was used for each aggregate data design; a logit link was used for the two-phase and hybrid designs. Coefficient values for the “true” models were obtained from a fit of the complete individual-level data as follows: (−2.52, 0.59, −0.17) for the log-linear model; (−2.44, 0.66, −0.18) for the logistic model.
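As a rough sketch of the data-generating step, the code below converts the reported coefficient values into LBW probabilities for each race/sex cell and draws case counts for a single county with its race/sex totals held fixed. It assumes the coefficients are ordered as intercept, race, and sex. The county counts, function names, and random seed are hypothetical, and this is not the code from the eAppendix.

```python
import math
import random

# Reported "true" coefficients, assumed ordered (intercept, race, sex).
LOG_LINEAR_BETA = (-2.52, 0.59, -0.17)
LOGISTIC_BETA = (-2.44, 0.66, -0.18)

def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def cell_probabilities(beta, inverse_link):
    """LBW probability for each (race x, sex z) cell under model (4)."""
    b0, bx, bz = beta
    return {(x, z): inverse_link(b0 + bx * x + bz * z)
            for x in (0, 1) for z in (0, 1)}

def simulate_county(cell_counts, probabilities, rng):
    """Draw the number of LBW cases in each cell, holding M_xz fixed."""
    return {cell: sum(rng.random() < probabilities[cell] for _ in range(m))
            for cell, m in cell_counts.items()}

log_linear_probs = cell_probabilities(LOG_LINEAR_BETA, math.exp)
logistic_probs = cell_probabilities(LOGISTIC_BETA, inverse_logit)

# Hypothetical county race/sex totals M_xz, keyed by (x, z).
county = {(0, 0): 400, (0, 1): 420, (1, 0): 90, (1, 1): 90}
rng = random.Random(2024)  # arbitrary seed for reproducibility
cases = simulate_county(county, logistic_probs, rng)
```

Under either link the implied LBW probabilities are higher for nonwhite infants and slightly lower for male infants, matching the signs of the reported coefficients. Repeating the draw across all counties and many replicates would give the kind of simulated datasets described above.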

Table 4 presents small-sample percent bias, relative efficiency, and mean squared error. As analysis techniques for each design/method have been shown to be consistent (asymptotically unbiased), reported bias is due to small samples and, more specifically, not ecologic bias. Further, we note that relative efficiency is defined here as the ratio of the standard error under each design to the standard error for an analysis using individual-level outcome/exposure data. This ratio may be interpreted as how much tighter confidence intervals could be, on average, when one combines the two sources of information compared with using the individual-level data only.
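These summaries can be computed from replicate estimates as in the sketch below. It assumes the standard error under each design is approximated by the empirical standard deviation of the estimates across simulated datasets, and it expresses relative efficiency as a percentage. The function names and toy inputs are illustrative only.

```python
import statistics

def percent_bias(estimates, truth):
    """Average estimate minus the true value, as a percentage of the truth."""
    return 100.0 * (statistics.fmean(estimates) - truth) / truth

def relative_efficiency(design_estimates, reference_estimates):
    """Ratio (in percent) of the standard error under a combined design to
    that of the individual-level-only reference analysis."""
    return (100.0 * statistics.stdev(design_estimates)
            / statistics.stdev(reference_estimates))

def mean_squared_error(estimates, truth):
    return statistics.fmean((e - truth) ** 2 for e in estimates)

# Toy replicate estimates of a coefficient whose true value is 0.6.
truth = 0.6
combined_design = [0.5, 0.7]
individual_only = [0.4, 0.8]

bias = percent_bias(combined_design, truth)
efficiency = relative_efficiency(combined_design, individual_only)
mse = mean_squared_error(combined_design, truth)
```

In this toy example the relative efficiency is 50%, meaning the combined design has half the standard error of the reference analysis, so values below 100% favor the combined design.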

Across all designs, small-sample bias for the race effect is low (at most −2.8%). For the sex effect, bias under the integrated aggregate data design, two-phase, and hybrid designs is low. For the log-linear model, the full-survey and survey subsample aggregate data design estimators exhibit substantial small-sample biases of 16.7% and −91.0%, respectively. This contrasts with the two estimators that use individual-level outcome/exposure data (1.9% and −2.4%). The contrasting performance is due to the reliance of the full-survey and survey subsample aggregate data designs on between-county exposure variation as their source of information, together with the low variation in the percent male across the 100 counties (Fig. C). As the percent non-white exhibits substantial between-county variation, the full-survey and survey subsample aggregate data designs perform relatively well for the race parameter.

Overall, designs that use group-level data have improved efficiency for estimating the race effect compared with those that use individual-level data only. For the sex effect, the two aggregate data designs that do not access individual-level outcome data suffer from substantially reduced efficiency; in contrast, the integrated aggregate data design retains much of the benefit for the race effect (48.7% relative efficiency) with no tradeoff in the sex effect (95.4% relative efficiency). Each of the two-phase and hybrid designs outperforms or does no worse than a case-control design. Not surprisingly, the two-phase design that stratifies on group-level race measures has greater efficiency gains than the design that stratifies on group-level sex measures (48.8% reduction vs. 8.4%). In addition to substantial gains for the race effect (standard errors reduced by approximately 62%) and despite low between-county variation in the proportion male, the hybrid likelihood exploits this information to provide moderate efficiency gains of approximately 20% for the sex effect. Further, comparing the two hybrid designs indicates that, at least in this context, a case-only hybrid design may be a reasonable approach. The results for mean squared error reflect those of relative efficiency.

## DISCUSSION

When scientific interest lies in individual-level associations, either alone or jointly with group-level associations, the only reliable solution to the ecologic inference problem is to collect individual-level data. Epidemiologists have at their disposal a range of designs that facilitate this; we have sought to provide a comprehensive overview of “combined” designs and associated analysis techniques. A short simulation study highlights potentially substantial efficiency gains associated with combining the two types of information in the analyses.

In practice, the specific choice of design will depend on the individual-level model of interest, the nature of available information, and assumptions regarding the data. Currently, the integrated aggregate data design, the two-phase design, and hierarchical related regression may provide the most convenient and powerful options for researchers. In addition to the general benefits of the Bayesian framework, hierarchical related regression has the unique advantage of permitting arbitrary link functions (ie, both log() and logit()) to consider both nonrare and rare outcomes. However, hierarchical related regression requires additional input from researchers in distributional and modeling assumptions. Although sensitivity analyses are an option, the semiparametric analyses of the integrated aggregate data design and two-phase design reduce the need for assumptions and, hence, may be appealing. Under the integrated aggregate data design, the individual-level data are obtained via simple random sampling so that the design will be most useful for nonrare outcomes. Further, Martinez et al^{26,27} developed their analytic framework assuming a log-linear model. The two-phase design, in contrast, is a stratified case-control study with analysis approaches having been developed assuming a logistic model for the outcome. Hence, it would likely be most appealing for rare outcomes.

Our simulation suggests that the hybrid design experiences the greatest efficiency gain from the inclusion of group-level data. This is likely due to the induced likelihood's direct use of group-level covariate data when characterizing possible configurations of the unobserved joint outcome/covariate data. The aggregate data design does not exploit such information; the two-phase design may use between-group covariate information but only indirectly as part of the phase I stratification. The hierarchical related regression approach of Jackson et al^{30} also uses group-level covariate data to help inform and estimate within-group covariate distributions. A drawback of the hybrid likelihood, however, is that it is computationally expensive and the statistical development has so far been limited to a few categorical covariates. None of the other reviewed designs is limited in this respect. Although the simulation study did not examine hierarchical related regression, we anticipate its performance to be similar to that of the integrated aggregate data and hybrid designs. A comprehensive statistical evaluation of each of the designs and methods is beyond the scope of this paper but could be useful for researchers considering these designs.

Beyond statistical considerations, when choosing between combined designs, researchers need to weigh numerous practical and epidemiologic issues. For example, logistical and financial constraints may preclude the collection of individual-level data from each group or area. In other settings, researchers may look to supplement readily available individual-level data with appropriate group-level data.^{38} From an epidemiologic perspective, model specification and interpretation can be challenging in multilevel settings. Specific issues include distinguishing between- from within-group effects; appropriately using between- and within-group exposure variation; characterizing and identifying between- and within-group confounding; identifying potential contextual effects; and ensuring compatibility of differing data sources. These issues are crucial to the design process in that they determine the data elements that require collection.^{23,24,31}

We emphasize that no single design is ideal, and researchers have flexibility to tailor their choice to their specific setting. Indeed, the sequential nature of the designs (that is, collecting individual-level data given group-level data) lends itself to considering design issues that may improve efficiency, with group-level characteristics potentially being incorporated into decision-making. To date, little work has focused on study design in this context.^{32,39,40} Further work on these competing strategies of sampling and analyses would give researchers practical guidance.

## REFERENCES

1. *Modern Epidemiology*. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008:511–531.
2. *J R Stat Soc Ser A Stat Soc*. 2001;164:155–174.
3. *Occup Environ Med*. 1999;56:577–580.
4. *Statistics for the Environment 4: Statistical Aspects of Health and the Environment*. Chichester: John Wiley & Sons; 1999:71–89.
5. *Heart*. 1999;82:455–460.
6. *Cancer Epidemiol Biomarkers Prev*. 2004;13:59–64.
7. *Environ Health Perspect*. 2005;113:993–1000.
8. *Am J Epidemiol*. 2004;160:1059–1069.
9. *Cancer Causes Control*. 2005;16:691–699.
10. *Altern Med Rev*. 2000;5:563–572.
11. *Trop Med Int Health*. 2005;10:627–639.
12. *Arch Intern Med*. 2005;165:265–272.
13. *Pediatrics*. 2005;116:e746–e753.
14. *Am J Epidemiol*. 2009;169:409–412.
15. *Int J Epidemiol*. 1987;16:111–120.
16. *Am J Epidemiol*. 1988;127:893–904.
17. *Int J Epidemiol*. 1989;18:269–274.
18. *Stat Med*. 1992;11:1209–1223.
19. *Am J Epidemiol*. 1994;139:747–760.
20. *Spatial Epidemiology: Methods and Applications*. Oxford: Oxford University Press; 2000.
21. *J R Stat Soc Ser A Stat Soc*. 2004;167:385–445.
22. *Biometrics*. 2003;59:9–17.
23. *Am J Public Health*. 1998;88:216–222.
24. *Epidemiol Rev*. 2004;26:104–111.
25. *Biometrika*. 1995;82:113–125.
26. *Int J Biostat*. 2007;3:10.
27. *Epidemiology*. 2009;20:525–532.
28. *Stat Med*. 2000;19:45–59.
29. *Stat Med*. 2006;25:2136–2159.
30. *J R Stat Soc Ser A Stat Soc*. 2008;171:159–178.
31. *Soc Sci Med*. 2008;67:1995–2006.
32. *Am J Epidemiol*. 2008;167:908–916.
33. *Am J Epidemiol*. 1982;115:119–128.
34. *Biometrics*. 1990;46:963–975.
35. *J R Stat Soc Ser C Appl Stat*. 1999;48:457–468.
36. *Stat Med*. 2008;27:864–887.
37. *Biometrics*. 2007;63:128–136.
38. *Epidemiology*. 2004;15:494–503.
39. *J R Stat Soc Series B Stat Methodol*. 1996;58:113–126.
40. *Stat Med*. 1996;15:1849–1858.