We describe a novel class of study designs and associated analysis procedures for longitudinal binary response data.1–3 The designs can be used in settings where the response and basic covariate data are available, but an expensive exposure variable must be ascertained retrospectively in order to conduct a new study. To generalize case–control sampling, we propose a stratified sampling based on three strata (0, 1, and 2) empirically defined by summarizing each subject’s response over time into a simple total count. With Yij denoting the binary response value for subject i at observation number j and ni the number of times subject i was observed, we define stratum 0, 1, and 2 with
, respectively. Within this class of stratified sampling designs, we assign members of a stratum (r) a common sampling probability,
, and alternative designs are distinguished by the choices for the stratum-specific sampling probabilities:
An important feature of longitudinal data arises from the fact that both response and exposure variability may result from within-subject changes over time, or from subject-to-subject variability. In this article, we will show that who is informative (i.e., which strata should be sampled with high probability) depends upon the response-exposure associations that are of interest. We emphasize the importance of considering the empirical sources of exposure and response variability (between-subject and within-subject) when conducting the designs.
To demonstrate the utility of the designs, we use data collected at annual clinic visits from the Lung Health Study.4–6 The Lung Health Study protocols were approved by the institutional review boards at each participating clinical center, and participants were enrolled after written informed consent was obtained. Based on genetic associations observed with chronic bronchitis and chronic obstructive pulmonary disease in smokers,7 the exemplar analysis will examine the short term, causal effect of current smoking on chronic bronchitis risk in those with and without at least one T allele on single nucleotide polymorphism rs7671167, and the difference between the two effects. Importantly, it will also study the extent to which other, baseline characteristics are risk factors for chronic bronchitis.
The Lung Health Study collected genetic, covariate, and outcome data for all participants and made them available on the National Center for Biotechnology Information database of genotypes and phenotypes (dbGaP8; access number phs000335.v2.p2). Links to instruction pages for downloading dbGaP data are provided in the eAppendix; http://links.lww.com/EDE/B284. We use a subset of Lung Health Study data to conduct a plasmode simulation study9 that examines the utility of the proposed retrospective study designs under scenarios where the longitudinal outcome and covariate data are available on all subjects but genetic data must be ascertained retrospectively at a sample size limiting cost.
ESTIMANDS AND ESTIMATORS FOR THE LUNG HEALTH STUDY ANALYSIS
The population represented by Lung Health Study5 participants is middle-aged smokers with mild-to-moderate COPD. Although the Lung Health Study was a randomized clinical trial that examined smoking cessation interventions, treatment allocation was not available to us. For this analysis, we seek to emulate a hypothetical sequentially randomized study where participants are randomized to be a smoker or nonsmoker at different longitudinal follow-up visits. We, therefore, conduct a multiple logistic regression analysis that adjusts or controls for potential confounders of the smoking–chronic bronchitis relationship and other baseline characteristics that are also of scientific interest for their association with chronic bronchitis. Our primary goal with these data is to illustrate possible novel sampling designs that could be used to efficiently evaluate a new candidate biomarker that may predict chronic bronchitis or may modify the effect of smoking.
be the prevalence of chronic bronchitis in a population defined by subject i ‘s covariate values X ij at the jth visit, and let smkij be a 0/1 indicator of current smoking. In the analysis dataset, smkij varies both within and between individuals (intracluster correlation coefficient = 0.82), and for this reason, we adopt the “between–within” model described in Neuhaus and Kalbfleisch10 to partition exposure variability. The between–within model decomposes smkij into two components, one that varies exclusively within and the other that varies exclusively between individuals. Specifically, let
, then we can write
, and the two components can be modeled separately. For the Lung Health Study analysis, the analytical model is given by:
where snpi = 1(0) if the T allele is present (absent) at rs7671167, and cij is a vector of confounders and other risk factors that includes baseline covariates lifetime pack–years smoked, average number of cigarettes smoked per day before the Lung Health Study, gender, age, body mass index, and nine site-specific indicator variables; and the time-varying covariate, study year. Notice that
corresponds to a difference in the log odds ratio for smokers (versus nonsmokers) between those with and without the T allele. Because the estimate was very close to 0, we excluded the interaction between snpi and smkb,j in this between–within model.
Although the between–within model is not a standard data-generating, causal model,11,12 it is popular in longitudinal analyses and has been shown to be useful for estimating causal effects under specific assumptions. For example, consider a simplified version of (1), and assume the true model is given by
. Also assume that the subject level confounder ui is unmeasured and the model fit to the data is
. The estimate of
is consistent for
as long as ui has a linear relationship with smkb,i.13 Fitting the standard model
results in an inconsistent estimate of
. This estimate of
has exhibited robustness to the “linear with the average” criterion12 although it has shown sensitivity to high levels of heteroscedasticity14 and can potentially induce confounding due to conditioning on a collider.15,16 For the purpose of providing insights into the operating characteristics of the study designs, we assume the between–within model assumptions are not violated in a severe way. A key benefit of the model is the explicit decomposition of smkij into smkw,ij and smkb,i, which vary exclusively within-subjects (ICC = 0) and between-subjects (ICC = 1), respectively. Design considerations for between- and within-subject covariate effects will be shown to be quite different.
Lung Health Study Cohort
Table 1 shows demographic and other characteristics of 3,932 participants. Seventy-six percentage of participants possessed at least one T allele at rs7671167. The vast majority were observed at four follow-up visits, and at each visit, between 71 and 76 percentage were smokers. Chronic bronchitis occurred in 28–29% of visits. In all, 2,064 participants never experienced chronic bronchitis, 1,404 experienced it at some but not all visits, and 464 experienced chronic bronchitis at all visits.
To formalize the design options, assume a representative cohort of N participants with longitudinal binary outcomes (
) and observed covariate (
) data. In the Lung Health Study, Yij is the binary indicator for presence/absence of chronic bronchitis for subject i at visit j and ni is the number of times that subject i was observed. Assume additional measurement is needed to obtain a new exposure variable
(e.g., snpi) and ascertainment costs limit sample size. Thus, for subject i,
represents the complete design matrix, but Xei is not available initially. Finally, X ij is the covariate vector for subject i at visit j
We proposed1,2 stratified outcome-dependent sampling designs to identify subjects for retrospective Xei ascertainment. Strata are defined by the subject-specific response sum,
. Stratum 0, 1, and 2 is comprised of subjects who had events at none, some, and all follow-up visits, respectively. Within stratum R = r, there are Nr subjects and each is sampled with probability
, which is a design parameter under the control of the investigators and controls the magnitude of possible oversampling within select strata. The sampling probabilities result in an expected sample size from stratum r equal to
. We denote unique candidate designs with
, indicating the size of the subsample taken from each stratum.
For the estimation of time-varying covariate regression parameters, including interactions with time-varying covariates, we have previously shown that subjects with any outcome variability (i.e., those belonging to stratum R = 1) are more informative than those who have constant responses over time.1,2 The information difference between subjects with and without variation in their longitudinal outcomes is particularly large if the time-varying covariates vary exclusively within subjects (i.e., intraclass correlation coefficient equals 0) where subjects in strata 0 and 2 provide little to no information. In our example data, this is the case for the within-subject smoking variable where the intraclass correlation coefficient was, by construction, equal to 0. To the extent that there exists between subject variability in a covariate that is of interest, subjects in strata 0 and 2 may provide information regarding the exposure–outcome association.
Although the general design ideas apply to a broad class of models, we focus on “marginalized models”17–19 for binary response data. Marginalized models (like generalized linear mixed models20,21) can be estimated with likelihood-based approaches. However, in contrast to conditional generalized linear mixed models, marginalized models estimate marginal model parameters similar to generalized estimating equations.22,23 Given Y i and X i, the marginal mean model is
dimensional parameter vector. While this model specifies the (marginal) mean of the distribution
, to fully identify
, we construct a second regression model to characterize within-subject response association. We refer to the second regression component as the conditional mean or response dependence model. In this article, we use the marginalized transition and latent variable model24 where the response dependence model is
can be thought of as an intercept in a logistic regression model that includes a transition or Markov term Yij-1 and a random intercept bi. This response dependence model is appealing because with longitudinal data, it is common to observe both serial response dependence that decays with time separations, and long-range dependence that is captured with the random intercept. A more thorough description of marginalized models and specifically
can be found in the studies by Heagerty,18 Schildcrout and Heagerty,24 and Heagerty and Zeger.25
Within the proposed class of outcome-dependent sampling designs, we deliberately oversample informative subjects, resulting in a nonrepresentative study sample. For example, in many cases the sample will be overrepresented with members of stratum 1 in order to maximize efficiency. Therefore, to validly estimate regression parameters based on the targeted sample, we consider three estimation strategies: ascertainment-corrected maximum likelihood, weighted likelihood, and multiple imputation, which we briefly detail.
Ascertainment-Corrected Maximum Likelihood
, be the parameter vector from a marginalized model and let
be the likelihood contribution by subject i under random sampling. An ascertainment-corrected likelihood under the proposed design can be calculated via Bayes’ theorem with
where Li2 and Li0 are the likelihood contributions by subject i under random sampling had
is the probability of being sampled for all members of stratum R = r, Si = 1 if subject i is sampled, and Si = 0 if not. Ascertainment-corrected maximum likelihood maximizes (4), and we are able to estimate parameters in the original model, that is,
. Notice that this ascertainment-corrected likelihood is similar to the conditional logistic regression likelihood with the exception that we condition on being sampled (Si = 1) rather than on
. While conditional logistic regression is useful in many settings, it cannot estimate baseline covariate effects that are of secondary interest in the Lung Health Study analysis. Further, conditional logistic regression implicitly assumes a random intercept structure (approximately exchangeable correlation), which may be overly simplistic for longitudinal data that often possess serial dependence.
One of the most common approaches to addressing selection bias induced by design-based oversampling is inverse probability weighting.26,27 Inverse probability weighting can be used to correct the standard likelihood score equations (i.e., the derivative of the log likelihood) by weighting by the inverse of the subject-specific probability of having been sampled,
, which must equal one of
. Parameters are estimated by solving the weighted log-likelihood score equation,
, and robust standard errors22,26,27 are used to estimate uncertainty.
A weakness of ascertainment-corrected maximum likelihood and weighted likelihood is that even though (Y, X 0) are available on all subjects, the only information used in parameter estimation and inference is from subjects selected into the subsample for whom Xei is also available (i.e., those with Si = 1). One of the primary strengths of multiple imputation28,29 is its efficient use of all available data. This suggests that we may impute Xei for subjects not selected for detailed exposure evaluation (i.e., those for whom Si = 0). There are different approaches for building the required imputation model, and in the eAppendix, http://links.lww.com/EDE/B284, we detail the specific imputation strategy that we use for the present analyses.
RESULTS FROM ANALYSES OF THE LUNG HEALTH STUDY
We illustrate the potential advantages of an outcome-dependent sampling design over random sampling using the Lung Health Study cohort. We are interested in risk factors for chronic bronchitis, and in particular, the impact of current smoking for those with and without at least one copy of the T allele at rs7671167. Equation 1 shows the regression model. While the primary estimation targets are
, we are also interested in other covariate associations with chronic bronchitis risk (i.e., those described by the parameter
Chronic bronchitis was captured on the Lung Health Study annual questionnaire, and CBij = 1 if subject i experienced chronic bronchitis (at least three months of a phlegm-producing cough) in the year preceding visit j. The smoking variable, smkij was set to 1 if the participant tested positive on a cotinine test at the prior (j - 1) or present (j) visit. Otherwise, smkij = 0 We used this definition primarily to improve the likelihood that those with smkij = 0 were, in fact, nonsmokers. One disadvantage of this approach is that we removed the first annual follow-up visit from the analysis because all subjects smoked before the study.
Designs and Analysis Procedures
Even though the Lung Health Study genotyped all subjects, we are interested in circumstances where we may only genotype a subset. For illustration we assume that approximately 1,600 of the 3,932 subjects can be genotyped, and we investigate several targeted designs. With random sampling, we randomly selected 1,600 subjects to be included for genotyping. The other designs we considered were outcome-dependent sampling designs: D[100, 1400, 100], D[275, 1050, 275] and D[450, 700, 450]. We sampled individuals independently, and in light of the stratum sizes (Table 1), sampling probabilities for the three outcome-dependent sampling designs were [0.048, 0.997, 0.216], [0.133, 0.748, 0.593], and [0.218, 0.499, 0.970]. Importantly, D[100, 1400, 100] sampled effectively all participants in stratum 1, and so it includes all subjects who exhibited response variation over the course of the study. Results from earlier work suggest that this design can yield high precision for time-varying covariate effect estimates (e.g., for
). After conducting each of the three outcome-dependent sampling study designs, we analyzed the resulting data with ascertainment-corrected maximum likelihood, weighted likelihood, and multiple imputation. When conducting the random sampling design, maximum likelihood using subsampled individuals only and an multiple imputation procedure that imputed Xei in unsampled subjects were used for analyses. We replicated the process of sampling participants from the cohort and analyzing the observed data 500 times. We report averages across the replicates for parameter and uncertainty estimates.
Full Cohort Analysis
In Table 2, we show the results from the full cohort logistic regression analysis conducted with standard maximum likelihood and from the random sampling, D[100, 1400, 100] and D[275, 1050, 275] study designs. Results for the D[450, 700, 450] study design can be found in eTable 1, http://links.lww.com/EDE/B285. Under the full cohort analysis, current smoking was highly associated with chronic bronchitis risk, and presence of the T allele modified the effect of smoking. The odds ratio (95% confidence interval) for chronic bronchitis comparing smoking with nonsmoking visits was
in those without the T allele, and
in those with at least one T allele copy. The ratio of odds ratios for those with and without the T allele was estimated to be
consistent with the T allele ameliorating the impact of smoking on chronic bronchitis risk.
Across 500 replicates, on average, all designs and analysis procedure combinations approximately reproduced the point estimates from the FC analysis (Table 2), so we describe study design by analysis procedure combination performance by comparing the estimated precision. Results are displayed as standard errors for all parameters in Table 2. However, in the Figure, we focus on four key parameters in order to gain insights into the utility of the proposed designs. In the Figure, we present the ratio of the average estimated variance using a random sampling design with maximum likelihood analysis to each other design/analysis procedure combination. Thus, for each design/analysis procedure combination, we describe an estimate of “relative efficiency” compared with a standard random sampling design and analysis. It is worth noting that the maximum possible efficiency gain over random sampling is
, which is the comparison of the maximum likelihood analysis using the full cohort to maximum likelihood analysis using the random sampling design.
Figure panels A and B examine time-varying covariate parameters
, and we observe that with the D[100, 1400, 100] design, both ascertainment-corrected maximum likelihood and multiple imputation estimation procedures are very precise. The impressive precision using a targeted sample with less than half the total cohort is due to two important features of the design and model: (1) nearly all the 1,404 subjects with response variation (i.e., those in stratum 1) are sampled under the design and (2) all variability in smkw,ij and smkw,ij occured within subjects. These results are consistent with what we have observed in the past.1,2 Estimation precision drops as fewer subjects from stratum 1 are included in the outcome-dependent sampling, which is the case with the D[275, 1050, 275] and D[450, 700, 450] designs.
One interesting observation is that for estimates of
the relative variance pattern was not monotonic for weighted likelihood analysis under the three outcome-dependent sampling designs. Relative variance is highest for the D[275, 1050, 275] design and is lower for D[100, 1400, 100] and D[450, 700, 450]. There appears to be a tradeoff between the study design and weighted likelihood weighting scheme. An extreme design, for example, D[100, 1400, 100], samples the most informative subjects; however, with weighted likelihood, the design induces an extreme weighting scheme that leads to imprecise estimates. Further research is required to find a general solution for this study design-weighting scheme tradeoff.
Figure panels C and D show that the outcome-dependent sampling designs proposed here are not particularly useful if the target of inference is a subject-level or time-fixed covariate parameter. For example, D[100, 1400, 100] yields lower precision estimates of
than random sampling with maximum likelihood analyses, although precision improves as proportionately more subjects are sampled from strata 0 and 2 with D[275, 1050, 275] and D[450, 700, 450].
Even though multiple imputation was not clearly more precise than ascertainment-corrected maximum likelihood for
, the utility of multiple imputation is clearly displayed in baseline covariate effects such as
. Irrespective of the study design multiple imputation yielded
estimates that were as precise as the full cohort analysis since the covariate, gender, is available on all subjects in the dataset, including those who were not included in the outcome-dependent sample. By imputing Xei, we simply allow ourselves to exploit all the existing data in order to summarize the gender–chronic bronchitis association. Without imputing Xei, we effectively throw away available covariate information.
We discussed a class of epidemiologic study designs for longitudinal binary response data that can be viewed as an extension to the case–control design. Rather than using a single scalar response to create two sampling strata (cases and controls), we summarize the response vector to create three sampling strata (no events, some events, all events). By oversampling subjects with response variation, we can obtain substantial precision gains for time-varying covariate effects and for their interactions with baseline variables. Furthermore, supplementing the design with multiple imputation analyses yields near full efficiency for the effects of baseline covariates.
Our work provides two key results: (1) subjects with response variability are highly informative for estimating time-varying covariate effects, or equivalently subjects without response variability are relatively uninformative and (2) near full efficiency can be obtained for parameters associated with time-varying covariates that vary exclusively within individuals when all subjects with response variability are sampled. Insight into such results is provided by what is known about conditional logistic regression precision for longitudinal binary data. In contrast to random intercepts models, conditional logistic regression only uses within-cluster response and exposure variation to estimate parameters. Generalized linear mixed models “borrow strength” by exploiting between– and within-subject variability. As discussed in the studies by Neuhaus and Kalbfleisch10 and Neuhaus and Lesperance30, generalized linear mixed models are more efficient than conditional logistic regression for all parameters except those associated with time-varying covariates that vary exclusively within subjects (i.e., the intracluster correlation in the covariate is 0), where conditional logistic regression and generalized linear mixed models are equally efficient. That is, toward estimating such parameters, only subjects with response variability contribute information. In the Lung Health Study, the within-subject smoking, smkw,ij variable, had intracluster correlation equal to 0. Thus, the D[100, 1400, 100], which sampled nearly all subjects with response variability, yielded very precise estimates of
Even though we were interested in marginal model parameters, one might also be interested in conditional model estimation (e.g., from mixed effects models). If interest is exclusively in
from a conditional model, Sjölander et al31 showed that a highly informative study design and estimation procedure combination is one that samples only those with within-subject response and exposure variability (i.e., those who are doubly discordant) and that conducts analyses with conditional logistic regression. That is, we would sample only those who exhibited CBij and smkij variability over time. Such a design represents a highly resource-efficient approach and would result in a sample size that is smaller than what we have proposed. Further, it estimates parameters that are analogous to
. What is lost by conducting such a design and analysis is the ability to estimate other parameters efficiently (or at all), to do prediction, to “marginalize” over the exposure distribution to obtain marginal risk contrasts and also the possibility of invalid inferences when the response dependence model differs from a simple exchangeable structure.
The weighted likelihood analysis procedure is not as efficient as ascertainment-corrected maximum likelihood and multiple imputation procedures in settings we studied. However, it is well known that there is a bias-variance tradeoff between maximum likelihood procedures and weighted likelihood in the presence of model misspecification.32–34 Maximum likelihood is more efficient but less robust than weighted likelihood approaches. Such misspecification might arise, for example, if the mean model for
ignores an important interaction.
In this article, we have assumed that within-subject smoking is not an endogeneous exposure that is influenced by past outcomes. If the primary exposure of interest is driven by past outcomes, then tailored statistical methods would be needed that can properly address time-dependent confounding.35,36 Additional research is warranted that considers the potential for efficient outcome-dependent sampling designs in such a complex time-dependent context.
The authors thank the supported effort of the faculty and staff members of the Johns Hopkins University Bayview Genetics Research Facility, NHLBI grant HL066583 (Garcia/Barnes, PI) and NHGRI grant HG004738 (Barnes/Hansel, PI). The Lung Health Study was supported by U.S. Government contract No. N01-HR-46002 from the Division of Lung Diseases of the National Heart, Lung and Blood Institute. Data were downloaded from the NCBI database of genotypes and phenotypes (accession number phs000335.v2.p2). The authors would like to thank the reviewers for their thoughtful comments that led to substantial improvements to this paper.