Study designs using enhanced sampling to detect associations are ubiquitous in epidemiological research. The case–control study^{1–3} is a very commonly applied design, and from its principle of enriching the sample with increased response (and possibly exposure) variability, many other efficient designs have emerged, including the nested case–control design,^{4} the case–cohort design,^{5} the case–crossover design,^{6},^{7} and various two-phase designs.^{8–10}

We discuss a class of enhanced sampling study designs for longitudinal, binary response data wherein subjects are sampled prospectively with probability depending on an auxiliary variable measured at screening (or baseline) that is related to the longitudinal response. Sampled subjects are then followed over time. Even though the auxiliary sampling variable is not of interest for analyses, because of its relationship with the outcome, sampling based on it can improve efficiency and precision by increasing the event rate in the sample. A supporting aim is to describe and illustrate methods of data analysis tailored to these study designs.

To support this methodologic demonstration, we exploit an existing cohort, as subset of participant from the Lung Health Study, on the natural history of cardiovascular disease in smokers and consider the hypothetical scenario wherein complete data are available on only a subset of needed variables measured at baseline. Then, a targeted subsample of participants from the parent cohort is drawn and referred for assessment on the entire panel of variables needed for analysis.

As an illustrative example, suppose that the aim of our study is to estimate the effect of current (past year) smoking on a time-varying chronic obstructive pulmonary disease (COPD) diagnosis. We consider the absence of COPD according to a standard clinical definition and, for illustrative purposes, severe COPD, to be defined later. Because the Lung Health Study participants tended to have mild COPD, both binary outcomes under study (COPD absence and severe COPD) are somewhat rare in the Lung Health Study cohort. This is important because, analogous to case–control studies, our designs provide the greatest benefits in precision when outcomes are rare.

In a scientifically ideal setting, individuals would be assigned to smoking or nonsmoking status each year through a structured random process controlled by the investigator, as subjects were followed over time. Assuming participants adhered to their assigned smoking status, such a design would yield unambiguous estimates of the causal effect of smoking. Because this is of course not possible, we exploit available covariates to control confounding to the extent possible in an observational study. In particular, we adjust for 1-year lag of current smoking to account for time-dependent confounding between respiratory symptoms and current smoking, as well as other potential time-varying and time-invariant confounders described later in the article.

## METHODS

### The Lung Health Study: Participants and Measures

The Lung Health Study^{11},^{12} is a 10-center randomized clinical trial on smokers with mild COPD. Lung Health Study protocols were approved by the institutional review boards at each clinical center, and participants were enrolled after written informed consent was obtained. The trial was designed to test the efficacy of a smoking intervention program and the use of an inhaled bronchodilator to slow the rate of pulmonary decline over time. Data from the first five annual clinic visits in Lung Health Study are available at the National Center for Biotechnology Information database of genotypes and phenotypes^{13} (accession number phs000335.v2.p2; https://www.ncbi.nlm.nih.gov/gap). Links to instructions for downloading data are provided in the eAppendix; http://links.lww.com/EDE/B283.

Table 1 summarizes the Lung Health Study data after modest data cleaning. The original data set contained N = 4391 subjects, and after cleaning, N = 4213 (96%) remained. Among those excluded, 78 died during follow-up. The rest were removed because of lack of available spirometry data. Although we recognize that bias could arise with these exclusion criteria, we felt the impact would be minimal, particularly for purposes of the present methodological investigation.

We consider a hypothetical study that seeks to estimate the effect of past-year smoking on current COPD, a longitudinal binary response realized at each of the five annual clinic visits, defined for illustrative purposes in two different ways. In the Lung Health Study, spirometry was conducted during the screening visit and at all annual follow-up visits. A clinical diagnosis of COPD is realized if the ratio of post-bronchodilator forced expiratory volume in the first second of exhalation (FEV1) to forced vital capacity (FVC) is less than 0.7; as such, we define COPD absence as occurring when FEV1/FVC ≥ 0.7. COPD is absent in approximately 21% of subjects at the screening visit and 23% of subjects at the first follow-up visit (Table 1). This is a standard definition for absence of COPD (e.g., see http://www.goldcopd.org). A second outcome, generated for methodological demonstration purposes to be rare, was severe COPD, defined at each visit as having FEV1/FVC < 0.57. Severe COPD occurs in 11% of subjects at the screening visit and 12% of subjects at the first follow-up visit. Although this is not a standard definition of severe COPD, it provides a useful platform for illustrating the impact of prevalence on the efficiency of our design.

Risk factors for COPD available in the Lung Health Study database and used in analyses include (Table 1) gender, age, years of education, height, weight, and lifetime smoking status (in pack years) collected at baseline. We use years of education as a surrogate for socioeconomic status. Through a short annual survey, dust exposure at work, asthma symptoms during the previous year, and chronic bronchitis are assessed in each wave. In annual clinic visits, current smoking is assessed via a cotinine test. Even though asthma and chronic bronchitis were measured over time, for purposes of a stratified design discussed later, we assume they are time fixed and only use their values measured at the first follow-up visit.

### Subsampling from the Cohort

Although Lung Health Study data are available on N = 4213 subjects, for methodological demonstration, we consider a hypothetical but realistic scenario wherein resources limit sampling to only n = 800 participants from the baseline sample for longitudinal follow-up. Sampling may be simple random sampling of n = 800. However, with either of the two COPD end points, the response prevalence at any wave is fairly small, limiting information in the resulting data set.

To address this limitation, sampling may be enhanced by basing it on an auxiliary variable measured at or before baseline that is related to the COPD end points over time (auxiliary variable sampling). Here, we define auxiliary sampling variables to be the binary indicators of COPD absence (FEV1/FVC ≥ 0.7) or of severe COPD (FEV1/FVC < 0.57) at screening. In our particular implementation of auxiliary variable sampling, our goal was to sample approximately equal numbers of high- and low-risk subjects. For COPD absence, we sampled independently from 896 high-risk subjects (i.e., COPD absence at screening) with probability 400/896 = 0.45 and from 3317 low-risk subjects (with COPD at screening) with probability 400/3317 = 0.12. For the rarer, severe COPD outcome, we sampled high-risk participants with probability 400/445 = 0.90 and low-risk participants with probability 400/3768 = 0.11. Two key features of auxiliary variable sampling are (1) investigator-specified sampling probabilities and (2) oversampling of “high-risk” subjects for whom the auxiliary variable equals 1. This oversampling serves to increase the observed response prevalence across waves, thereby, enriching information in the sample for quantifying associations with risk factors of interest.

Sampling that also depends on a key covariate, usually rare, can additionally increase statistical efficiency for the coefficient of that covariate or its correlates. We conducted such an exposure and auxiliary variable sampling design using four strata defined by baseline COPD (either absence or severe) and baseline asthma. We sampled with probability 1.0 all subjects with asthma and allocated the remainder of sampled subjects equally between those with and without baseline COPD absence or severe COPD and without asthma at baseline (Table 2).

### Statistical Analysis

Regardless of sampling design, the n = 800 sampled participants will each yield up to five annual waves of data, with the dichotomous time-varying end point *Yij* being either COPD absence or severe COPD for participant *i* at visit *j*. Under a random sampling design, analysis proceeds using a population average logistic regression model across the five visits, estimated with generalized estimating equations (GEE).^{14} In this model, the set of predictors includes the time-varying current smoking indicator, the exposure of interest; time-varying adjustors included to control confounding, including 1-year lag of current smoking; and time-invariant baseline variables, also included to control confounding; complete model specification is given in Table 3. Note that because all participants are smokers at baseline, the 1-year lagged value of current smoking is 1 at wave 1 for all participants.

Importantly, all analyses used a GEE working independence covariance structure because of the likely violation of the full covariate conditional mean assumption,^{15},^{16} wherein the cross-sectional model of interest, namely COPD prevalence at time *j* as a function of baseline predictors and predictors also measured at visit *j* (plus lagged smoking), does not equal the full covariate conditional prevalence, namely COPD prevalence as a function of predictors measured at all visit times and taken as an ensemble. This is also referred to as a “no interference” assumption in the causal inference literature.

Under auxiliary variable sampling, some minimal notation is required to formalize the design and analysis. These study designs rely on an auxiliary variable—in this case, screening COPD absence or severe COPD—upon which sampling is based. Denote this variable *Z* _{i} Typically, *Z* _{i} = 1 will indicate a high risk (for *Yij* = 1) stratum and *Z* _{i} = 0 a low-risk stratum. While not of direct scientific interest, if well chosen, *Z* _{i} will be highly related to *Yij* for all *j*, thereby yielding enriched response prevalence (and variability) and more efficient study designs. Formally, we sample individuals with investigator-specified probability *π*(*Z* _{i}). When the outcome is rare, *π*(1) will be high compared to *π*(0).

In addition to measuring and sampling based on *Z* _{i}, in some cases, we also sample based an additional time-invariant exposure (predictor) variable *V* _{i} available at baseline. Such an exposure and auxiliary variable sampling design can further improve efficiency when *V* _{i} is rare by increasing exposure prevalence and, thereby, (co)variability between *Y* _{ij} and *V* _{i} Formally, we sample from four strata (0,0),(0,1),(1,0),(1,1) with probabilities

In most cases, *π*(0,0) is low, *π*(1,1) is high, and *π*(1,0) and *π*(0,1) depend on the relative prevalence of *Z* _{i} and *V* _{i}. In our methodological demonstration, we implemented such a strategy, letting *V* _{i} indicate baseline asthma and sampling to achieve equal numbers of subjects in the other strata defined jointly by asthma and either COPD absence or severe COPD (Table 2).

#### Analysis Approach 1: Sequential Offsetted Regressions

Sequential offsetted regressions^{17} is an analysis procedure for the designs proposed in this article. We refer interested readers to the eAppendix; http://links.lww.com/EDE/B283 and to Schildcrout and Rathouz for details.^{17} Briefly, with sequential offsetted regressions, we conduct two offsetted logistic regressions to estimate parameters of scientific interest. The first, auxiliary variable regression, captures the relationship of auxiliary variable *Z* _{i} to observed end point and predictor data (*Y* _{ij},*X* _{ij}). The second logistic regression is for the question of scientific interest. It is corrected for the biased auxiliary variable sampling or exposure and auxiliary variable sampling design.

To understand the first logistic regression, let *S* _{i} = 1 (or 0) indicate whether the *i*th participant in the parent cohort is (is not) sampled for longitudinal follow-up. Then, note from Bayes’ Theorem that

where the last factor is free of (*Y* _{ij},*X* _{ij}) because sampling only depends on *Z* _{i}. The left-hand side is the disease odds model for *Z* _{i} under the auxiliary variable sampling *sample*, while the first factor on the right-hand side is the corresponding disease odds model in the *population*, i.e., the true model for *Z* _{i} It is also true that the last factor is known *by design* to the investigator and is equal to *π*(1)/*π*(0) Because equivalence (1) is in terms of the odds, a population logistic regression (i.e., log odds) model for the association of *Z* _{i} to (*Y* _{ij},*X* _{ij}) will result in a sample logistic regression model of the same form, with the addition of log{π(1)/π(0)} as an offset term. Schildcrout and Rathouz^{17} provide the extension to the case where sampling is also based on a stratifying covariate *V* _{i} Sequential offsetted regressions is implemented in a Stata and R packages (available for download from http://biostat.mc.vanderbilt.edu/wiki/Main/ODSandLDA), wherein specifying log{π(1)/π(0)} = 0 yields standard GEE.

In our hypothetical auxiliary variable sampling study, we specified the auxiliary model to include the disease response *Y* _{ij} and all the same predictors *X* _{ij} as in the model of interest (Table 3). Additionally, because of the potential for a weakening relationship of *Y* _{ij} to screening *Z* _{i} with increasing *j*, we included the interaction between *Y* _{ij} and annual visit number (study year).

Once Pr(*Z* _{i} = 1|*Y* _{ij},*X* _{ij}) is known and available, it can be combined with Pr(*S* _{i} = 1|*Z* _{i}) (which equals Pr(*S* _{i} = 1|*Z* _{ij},*Y* _{ij},*X* _{ij})) to obtain Pr(*S* _{i} = 1|*Y* _{ij},*X* _{ij}) This in turn is used to compute

for every observation in the data set.

The second logistic regression follows the same pattern as the first one, but this one is aimed at the scientific question of interest in the population. Specifically,

Again, in equation (2), because both the sample and the population model are in terms of the disease odds, a logistic regression model for disease *Y* _{ij} in the sample can be specified in terms of the same logistic regression model in the population, with the addition of log{*B*(*X* _{ij})} as an offset term. In our work, we have estimated this model using GEE with independence correlation structure, although under certain conditions, an alternative correlation structure such as exchangeable or AR(1) is allowed and could further increase statistical efficiency.

#### Analysis Approach 2: Inverse Probability Weighting

Inverse probability weighting (IPW) is commonly applied when samples are intentionally (by design) or unintentionally (missing data) not representative of the source population.^{18},^{19} An alternative to sequential offsetted regressions, IPW is implemented by weighting each subject *i* (or, equivalently each observation) by the known value 1/π(*Z* _{i},*V* _{i}) to approximate the population represented by the original Lung Health Study cohort and estimating the population model using GEE, again with an independence correlation structure. Although IPW is easier to implement than sequential offsetted regressions, in these designs, as we shall see, it can result in marked loss of statistical efficiency.

### Comparative Analyses

Our primary analysis is of the Lung Health Study data generated under our hypothetical auxiliary variable sampling design and analyzed using sequential offsetted regressions. As our primary aim is to demonstrate the strength and features of this methodology “on average,” we replicate the process (using auxiliary variable sampling to sample from the full cohort and using sequential offsetted regressions for analysis of sampled data) 250 times and average the results. We compare the results to those obtained from the full cohort of N = 4213 and additionally from those obtained under a random sampling design and under the auxiliary variable sampling design analyzed with IPW. Specifically, for each replicate, each participant is either sampled with probability 800/4213 for the random sampling design or with auxiliary variable sampling probabilities given in the *Subsampling from the Cohort* subsection. On average, each sample includes n = 800 participants.

#### Design Features Within Auxiliary Variable Sampling

Although auxiliary variable sampling can be a powerful method for increasing statistical efficiency while controlling expenditure of research resources, there are several key features of any auxiliary variable sampling design, which can impact efficiency, some of which may be under control of the investigator and all of which should be given careful consideration. We quantify these effects, using the actual Lung Health Study data,^{20} to explore three data and two design and analysis features. Similar to sensitivity analysis, for each feature studied, we perturb the single feature of the original data/design/analysis, leaving the rest intact, to evaluate the impact on results.

The three data features we study are (1) overall prevalence of the response Pr(*Y* _{ij} = 1), which we alter by changing the cut point that defines COPD-free and severe COPD outcomes, (2) strength of the relationship between *Z* _{i} and *Y* _{i1}, i.e.,

, which we alter by perturbing the original *Z* _{i} values, and (3) correlation among repeated measurements within each subject *r*(*Y* _{ij},*Y* _{ik}), which we alter by regenerating

using a fully parametric, marginalized model^{21–23} estimated from the full cohort. Additionally, we examine the impact of using a more saturated auxiliary model for Pr(*Z* _{i} = 1|*Yij*,*X* _{ij}), which includes not only *Y* _{ij} and all predictors in *X* _{ij} but also all two-way interactions between *Y* _{ij} and each predictor. Finally, we consider an exposure (*V* _{i}) and auxiliary variable (*Z* _{i}) sampling plan (sampling with probability π(*Z* _{i},*V* _{i})), where *V* _{i} indicates asthma as a key predictor at baseline.

We use the inverse of sampling variance to quantify statistical efficiency. That is, the larger the sampling variance, the lower the efficiency. As such, we define relative variance (RV) as the ratio of average estimated variances:

Values greater than one are consistent with efficiency gains in auxiliary variable sampling with sequential offsetted regressions compared to random sampling. We also note that the estimators of the sampling variances are consistent for the true sampling variances,^{17} capturing components of variability due both to sampling the full cohort from the reference population and to subsampling from the full cohort. To put efficiency gains of the design into context, 800 of the 4213 subjects from the Lung Health Study are randomly sampled under the random sampling design. Thus, the efficiency of the full cohort analysis to the random sampling design analysis is 4123/800 = 5.26. As a final note, we conducted a distinct study wherein n = 500 (not 800) participants were sampled. The values of RV were qualitatively similar (see eAppendix; http://links.lww.com/EDE/B283).

## RESULTS

In Table 3, we display logistic regression parameter and standard error estimates from full cohort analyses and average logistic regression parameter estimates and average estimated standard errors across 250 replicates from the auxiliary variable sampling and random sampling design-based analyses. As noted earlier, these standard error estimates capture all components of sampling variability. Focusing first on full cohort analyses for both outcomes, we can see (Table 3) that nearly all independent variables were associated with outcomes, and as expected, estimates have the opposite sign for the two outcomes. Overall, COPD is more severe in current smokers than in nonsmokers, even accounting for past smoking, as measured jointly by all three of pack years at baseline, cigarette per day at baseline, and 1 year lag of current smoking. The odds ratio for COPD-free and for severe COPD for smokers versus nonsmokers is an estimated exp(−0.33) = 0.72 and exp(0.34 = 1.41), respectively. Additionally, even allowing for variations in smoking, COPD is increasing over time: the estimated per-year odds ratios of being COPD free and for severe COPD are exp(−0.14) = 0.87 and exp(0.19) = 1.21, respectively.

Examining results from the random sampling and auxiliary variable sampling designs, we observe the following: first, for the most part, point estimates appear to be similar across the various design and analysis procedures, implying that these procedures reproduced the full cohort results quite well, on average. As expected, random sampling estimates are very similar to those from the full cohort. Auxiliary variable sampling with sequential offsetted regressions and auxiliary variable sampling with IPW vary a bit more from the full cohort, but almost always by less than a half of a standard error. Second, random sampling and auxiliary variable sampling with IPW produce similar average standard error estimates across replicates. Third, auxiliary variable sampling with sequential offsetted regressions produces lower average standard error estimates than the random sampling design and auxiliary variable sampling with IPW. Finally, the increases in efficiency because of the auxiliary variable sampling with sequential offsetted regressions are more pronounced in the rarer, severe COPD outcome analysis, as compared with the COPD-free outcome analysis.

### Design Features Within Auxiliary Variable Sampling

First, we focus on the COPD-free results (Figure 1). In the original Lung Health Study data analysis, denoted with black diamonds in all panels, we used a cutoff of FEV1/FVC ≥ 0.70 to define COPD absence, resulting in Pr(*Y* _{ij} = 1) = 0.20, OR(*Y* _{i1}|*Z* _{i}) ~ 16.4, and *r*(*Y* _{ij},*Y* _{ik}) ~ 0.59. RV, reflecting relative efficiency, ranged from 1.09 (study year) to 1.46 (asthma) across all regression parameters, indicating 9% to 46% more subjects are required under a random sampling design to obtain the same precision we obtain with auxiliary variable sampling with sequential offsetted regressions. Response prevalence had a dramatic impact on RV (Figure 1A). With cutoffs for FEV1/FVC equal to 0.70 and 0.72, the overall prevalence of COPD absence was 0.20 and 0.10, respectively. With lower prevalence (in gray), the efficiency of the design increased dramatically. Though the RV for the study year coefficient was only 1.09, the RV exceeded 1.82 for all other estimates.

In Figure 1B and C, we reduced the strength of the *Z* _{i} ~ *Y* _{i1} relationship and the response correlation, respectively. In both cases, lower relative variances resulted, and in such circumstances, one would question the usefulness of the design. Although we observed similar patterns for the severe COPD model (Figure 2), RVs were higher because of lower response prevalence.

To fully understand circumstances under which the designs may be useful, we examined two additional features for their impact on RV: (1) the richness of the specification of the intermediate, auxiliary variable model, Pr(*Z* _{i} = 1|*Y* _{ij},*X* _{ij}) used in sequential offsetted regressions analyses and (2) the use of an exposure and auxiliary variable sampling design that creates sampling strata not only based on *Z* _{i} but also on a time-fixed, baseline exposure *V* _{i} (Figures 1D and E and 2D and E).

In Figures 1D and 2D, we show the impact that auxiliary variable model choice can have on RV. For the original analysis (labeled “not saturated”), our auxiliary model included response *Y* _{ij}, all predictors *X* _{ij} in the model of interest, and the multiplicative interaction *Y* _{ij} × study year_{ij}. For a more conservative approach, labeled “saturated”, we included the interaction between *Y* _{ij} and all terms in *X* _{ij} The saturated auxiliary model reduced efficiency gains for all parameters except those associated with study year. That is, by unnecessarily estimating many covariate interactions with *Y* _{ij} in the auxiliary model, RVs dropped substantially. Though the auxiliary model must be correct to ensure valid parameter inferences, it is reasonable to expect it to be relatively simple, because *Y* _{ij} should be most strongly related to *Z* _{i}.

In Figures 1E and 2E, we show the impact that further sampling stratification can have on efficiency. In the original analysis, two sampling strata were defined by *Z* _{i}. In Figures 1E and 2E, because asthma is very rare (Table 1), we consider a strategy to gain further efficiency on the parameter associated with presence (*V* _{ij} = 1) or absence (*V* _{ij} = 0) of asthma at baseline. By enriching the sample with those with asthma, we observed enormous efficiency gains even compared to the original auxiliary variable sampling design, e.g., for the asthma parameter in Figure 2, compare 1.63 under auxiliary variable sampling to 4.02 under exposure and auxiliary variable sampling. We also notice that because of its association with asthma, the exposure and auxiliary variable sampling design was far more efficient for the chronic bronchitis parameter, where the RV jumped from 1.82 under auxiliary variable sampling to 2.44 under exposure and auxiliary variable sampling. Predictors weakly associated with baseline asthma (*V* _{i}) were not strongly affected by stratifying on *V* _{i}.

## DISCUSSION

In this article, we have picked up on the classic epidemiologic notion of sampling based on disease events and shown one family of extensions to analysis of longitudinal or clustered data. Our main aims were, first, to focus on study design and, second, to treat data analysis methods as supporting methodologies to these new and more efficient study designs. Therefore, we have not gone into depth on the technical details around estimation and inference. Finally, we focused on longitudinal data but the ideas could be easily adapted to clustered data.

For this new family of study designs, we have demonstrated a method of analysis based on population average logistic regression models, which we call sequential offsetted regressions. In the context of a hypothetical study based on a real cohort, we showed that strategic sampling based on an auxiliary variable related to the binary response series, together with sequential offsetted regressions, yields increased statistical efficiency relative to random sampling and that sequential offsetted regressions can be more statistically efficient than standard IPW. Using the auxiliary variable sampling with sequential offsetted regressions, we demonstrated sensitivity of resulting statistical efficiency to specification parameters of the design and analysis. It is worth noting that in the process of conducting sequential offsetted regressions, we estimated Pr(*S* _{i} = 1|*Y* _{ij},*X* _{ij}). Weights based on this estimate, or associated stabilized weights,^{24–26} could also be used in IPW-based analyses. Though standard error calculations are difficult, bootstrap-based estimators can be used.

Our objective was to describe a class of designs that extend from the case–control sampling principle to the longitudinal response data setting. It is difficult to be comprehensive, and variations of these and other designs will arise according to study-specific constraints. Methodologically oriented biostatisticians will be interested in developing analysis methods for such designs.

## ACKNOWLEDGMENTS

The authors thank the supported effort of the faculty and staff members of the Johns Hopkins University Bayview Genetics Research Facility, NHLBI grant HL066583 (Garcia/Barnes, PI), and NHGRI grant HG004738 (Barnes/Hansel, PI). The Lung Health Study was supported by US Government contract No. N01-HR-46002 from the Division of Lung Diseases of the National Heart, Lung and Blood Institute. Data were downloaded from the NCBI database of genotypes and phenotypes (accession number phs000335.v2.p2) The authors would like to thank the reviewers for their thoughtful comments that led to substantial improvements to the paper.