Environmental, occupational, and nutritional epidemiologists frequently encounter the challenge of estimating health effects of long-term exposures that vary widely over time. Here we define long-term exposures as usual or average exposures over periods much longer than available direct measures, implying that they can be estimated only by using surrogates. The alternative strategies have been classified as either individual- or group-level measures.^{1} Individual-level measures are generally short-term. Examples are nutrition-related biomarkers as surrogates of usual nutrient intake, and personal air pollution samples as surrogates of long-term exposure.^{2,3} These measures (which almost always cover a relatively short time span) are imprecise estimates of longer-term average exposure, leading to underestimation of exposure-response relations.^{4} Group-level measures allow more precise estimation of exposure by averaging across individuals with similar characteristics. Some examples are grouping workers based on features of the work environment, and using household stove/fuel categories as surrogates of indoor air pollution.^{5,6} Although exposure-response analyses using these group-level measures are less susceptible to attenuation bias (because much of the exposure error is Berkson), the cost is decreased precision in estimating the exposure–response relation when there is large between-subject exposure variability unexplained by the grouping factors.^{1,5} With either strategy, exposure measurement error reduces the likelihood of identifying an underlying causal effect.

Methods to analyze and interpret exposure variation among people and over time, often using random-effects and linear mixed-effects models, have been valuable for improving assessment of long-term exposures. For example, analysis of exposure variance components has allowed researchers to determine the number of repeated short-term measures per subject needed to estimate exposure–response relations accurately,^{7,8} compare the reliability of environmental measures and biomarkers of exposure,^{9} select optimal grouping schemes that maximize exposure contrast and precision,^{5} and correct for measurement error bias in health-effect estimates.^{10,11} Tielemans and colleagues^{1} used variance components to estimate attenuation and precision of estimated linear regression coefficients when exposure is measured with various sources and amounts of random error. They concluded that researchers have to choose between individual-level exposure measures, which generate precise although biased exposure–response estimates, and group-level measures, which generate essentially unbiased although less precise exposure–response estimates. These previous studies have focused on estimates of variance components by using random coefficients and effects of group-level predictors of exposure based on fixed coefficients, but they have not calculated individual-level predictions of exposure based on both random and fixed effects.

When both individual- and group-level exposure estimates are available, mixed models offer a third alternative that combines information from both, using variance component estimates to weight the contribution of each to the prediction. This prediction method, described more than 4 decades ago and sometimes referred to as the empirical Bayes estimator,^{13} may improve predictive validity of exposure measures and thereby enhance the sensitivity of epidemiologic studies. Using simulations, Stanek and colleagues^{14} found that mixed-model prediction performed better than direct measures of a subject's cholesterol, fat intake, and physical activity. Also using simulations, Seixas and Sheppard^{15} showed that use of the empirical Bayesian James-Stein shrinkage estimator for exposures in exposure–response analyses simultaneously reduced the bias inherent in individual estimates and the imprecision inherent in group estimates. Despite the apparent advantage of mixed models for improving the design and analysis of exposure assessment, prediction of subject-specific exposure based on mixed models has been underutilized, and validation has been limited. Furthermore, in studies including multiple actual exposure measures per subject, common strategies are simply to assign each subject the mean of their own observed data or the mean of a group to which they belong. The validity of mixed-model prediction has not been compared with these purely individual- or group-level strategies.

This paper compares the predictive validity of alternative measures of average woodsmoke exposure over about 1.5 years in children up to 18 months of age within a randomized controlled trial in Guatemala called RESPIRE (Randomized Exposure Study of Pollution Indoors and Respiratory Effects), designed to test the impact of reduced woodsmoke exposure on incidence of childhood acute lower respiratory infections.^{19} Carbon monoxide (CO) was used as a surrogate for woodsmoke,^{20} a complex mixture of gases and particles implicated in childhood respiratory infections.^{6} Previously reported findings from this study showed that replacing open-fire woodstoves with a chimney woodstove (the plancha) reduced 48-hour mean child CO exposure by 51% (95% confidence interval = 46%–55%) (KR Smith, et al, unpublished data, 2008). The 2 main analytic strategies for studying the impact of woodsmoke reduction on acute lower respiratory infection will be (1) intention-to-treat, with exposure based on randomly assigned stove type; and (2) exposure–response, with exposure assignment based on child-specific measures of exposure. We hypothesize that mixed-model prediction will have greater predictive validity for long-term CO exposure than either the mean of short-term exposures or the group-level estimates based on stove and other subject and residential characteristics.

#### METHODS

##### Data Collection

The study children were part of the RESPIRE clinical trial and living in homes with either open fires (247) or chimney stoves (262) in 23 villages spread over the highlands of San Marcos Department in Guatemala. Before randomization, baseline subject and residential characteristics were collected via questionnaire directed to the mother of the child and also by household inspection by field staff. Children entered the study from in utero to 4 months of age and were followed until dropout, death, the end of the 18th month of life, or the end of the study (15 December 2004).

Personal CO measures were taken with Gastec 1DL passive diffusion tubes (Gastec Corp., Kanagawa, Japan) worn by the children during 48-hour periods. Field procedures and measurement calibration are described in detail in a separate paper (KR Smith, et al, unpublished data, 2008). For the main study, the starting dates of personal CO measurements were staggered between January 2003 and May 2004 and repeated every 3–4 months among all children. For the validation study, an additional set of measures was taken among 70 children randomly selected from the main study population. These data were taken at the same frequency and time period, and on average 19 days before or after (standard deviation [SD] = 15, range 0–64 days) the corresponding measures in the same person and measurement cycle in the main study.

##### Statistical Methods

Due to the right-skewed distribution, the CO data were natural-log-transformed for analyses and presentation. This transformation resulted in model residuals that appeared to derive from an approximately normal distribution.

##### Baseline Model for Within-Child and Between-Child Variability

Estimates of variance components were used to evaluate information about long-term exposure provided by short-term measures. Using SAS (version 9.1; SAS Institute, Cary, NC), a linear model with a random intercept for child and no fixed effects was used to partition exposure variability into within- and between-child components. The log of the *k*^{th} CO measure on the *j*^{th} child (*Y*_{jk}) was modeled as follows (model 1):

$$Y_{jk} = \beta_0 + b_j + \epsilon_{jk}$$

where β_{0} is the overall intercept, *b*_{j} is the random effect for the *j*^{th} child, and *ϵ*_{jk} is the random within-child error. It is assumed that *b*_{j} and *ϵ*_{jk} are independent and normally distributed with variances σ_{b}^{2} and σ_{w}^{2}, respectively. The between-child (σ̂_{b}^{2}) and within-child (σ̂_{w}^{2}) variance components were estimated using restricted maximum likelihood. A random intercept in the absence of any additional model assumption about the correlation among repeated measures induces a compound symmetry structure,^{21} and we compared alternative models for the covariance among repeated measures within-child (unstructured, exponential, autoregressive, and heterogeneous autoregressive) by using Akaike information criterion. This baseline random intercept model was run using all data and then run separately by stove type.

We estimated the intraclass correlation coefficient (r_{ic}, the proportion of total variability in the data attributable to between-subject differences) by:

$$r_{ic} = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_w^2}$$

The 95% CIs were computed using Smith's formula.^{22}
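The partition of variability into between- and within-child components can be illustrated with a minimal sketch. It swaps in the one-way ANOVA (method-of-moments) estimator rather than the restricted maximum likelihood fit used in the study, and it assumes balanced data (the same number of measures for every child). The function name and input layout are hypothetical.

```python
from statistics import mean


def variance_components(measures_by_child):
    """One-way ANOVA (method-of-moments) estimates for balanced data.

    measures_by_child maps a child id to a list of log CO values; every
    child must have the same number of measures (k >= 2).
    Returns (between_var, within_var, icc).
    NOTE: this is an illustrative stand-in for REML, not the study's method.
    """
    groups = list(measures_by_child.values())
    n = len(groups)
    if n < 2:
        raise ValueError("at least 2 children are required")
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("balanced data with >= 2 measures per child required")

    child_means = [mean(g) for g in groups]
    grand_mean = mean(child_means)

    # Mean square within children: pooled within-child variance.
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, child_means) for y in g)
    ms_within = ss_within / (n * (k - 1))

    # Mean square between children.
    ms_between = k * sum((m - grand_mean) ** 2 for m in child_means) / (n - 1)

    # Truncate at zero because a negative variance is not meaningful.
    between_var = max((ms_between - ms_within) / k, 0.0)
    within_var = ms_within
    total = between_var + within_var
    if total == 0:
        raise ValueError("no variability in the data")
    return between_var, within_var, between_var / total
```

With unbalanced data, as in the study, a likelihood-based mixed-model fit is the more appropriate tool; the sketch only shows where the two components come from.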

To evaluate geographic variability, we compared the model fit after including a random intercept for village. We also examined potential spatial correlation among the random child intercepts by plotting their squared differences in relation to distance between households, sometimes referred to as a variogram.
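A rough sketch of the variogram calculation follows. It assumes planar household coordinates and uses the conventional semivariogram scaling (half the squared difference), which is an assumption here since the text refers only to squared differences. The function name, arguments, and binning rule are hypothetical.

```python
from itertools import combinations
from math import dist


def empirical_variogram(coords, intercepts, bin_width):
    """Average half squared difference of intercepts, binned by distance.

    coords: list of (x, y) household locations, in the units of bin_width.
    intercepts: predicted random child intercepts, in the same order.
    Returns {bin_index: (n_pairs, semivariance)}, where bin_index b covers
    distances in [b * bin_width, (b + 1) * bin_width).
    """
    if len(coords) != len(intercepts):
        raise ValueError("coords and intercepts must have the same length")
    sums, counts = {}, {}
    for i, j in combinations(range(len(coords)), 2):
        b = int(dist(coords[i], coords[j]) // bin_width)
        half_sq = 0.5 * (intercepts[i] - intercepts[j]) ** 2
        sums[b] = sums.get(b, 0.0) + half_sq
        counts[b] = counts.get(b, 0) + 1
    return {b: (counts[b], sums[b] / counts[b]) for b in sorted(sums)}
```

A roughly flat profile across distance bins would be consistent with no spatial dependence, whereas values rising with distance would suggest that nearby households have more similar intercepts.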

##### Explaining Exposure Variability

The proportion of each variance component that could be explained by subject and residential characteristics, including stove type, or by temporal factors (eg, season) was evaluated. Among the between-child characteristics, housing structures were classified by whether the kitchen and sleeping area were in the same rooms. Categories were created for floor, wall, and roof by combining similar materials. A household crowding index was calculated by dividing the number of people by the number of bedrooms. The main indicators of ventilation were kitchen volume (measured in cubic meters) and size of eaves space between the top of the walls and the roof. This space was subjectively classified by field workers as none, small, or large. The household altitude was measured using a geographic positioning device. Indicator variables were used for secondhand tobacco smoke exposure, household electricity, and use of a temascal, the wood-fired sauna prominent in Mam culture. Mother's age at baseline was categorized into 5-year intervals from 15–35 and over 35 years. An asset index (range 0–6) was calculated as the sum of the following household possessions: radio, television, refrigerator, bicycle, motorcycle, and automobile (yes or no for each).

Child age was categorized into 4-month intervals and seasons as cold and dry from November–February, warm and dry during March and April, and warm and wet from May–October based on local convention. Indicator variables were used for day of week.

We estimated the proportion of each variance component explained after sequentially adding each of the following fixed effects to the exposure model: stove type, other between-child characteristics, and time-varying characteristics. Backward elimination was used to remove covariates that had *P* values ≥0.1. The proportion of within-child variance explained (R_{within}^{2}) was calculated as in Xu,^{23} by subtracting from 1 the ratio of residual within-child variance under each alternative mixed model to that under the baseline random intercept model (model 1). Between-child variance explained (R_{between}^{2}) was calculated in an analogous way using the between-child variance components.
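The explained-variance calculation reduces to one line per variance component. The sketch below uses a hypothetical function name, and the numbers in the usage note are invented for illustration rather than taken from the study tables.

```python
def proportion_variance_explained(baseline_var, model_var):
    """Return 1 minus the ratio of a residual variance component under an
    alternative model to the same component under the baseline model."""
    if baseline_var <= 0:
        raise ValueError("baseline variance component must be positive")
    return 1.0 - model_var / baseline_var
```

For example, if a between-child variance of 0.50 under the baseline model fell to 0.19 after adding covariates (hypothetical values), `proportion_variance_explained(0.50, 0.19)` would give 0.62. The result can be negative if a residual component increases when covariates are added.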

##### Measures of Long-Term Average CO Exposure

The individual-level measures of long-term exposure we evaluated were the child's first CO measure during the trial period (to test the validity of a single 48-hour measure as a surrogate for long-term exposure) and the child mean, based on each child's observed mean across log-transformed repeated 48-hour measures.

The group-level measures were based only on the fixed effects from mixed models. The first group-level approach included only stove type and the second accounted for additional between-subject characteristics.

In contrast to these purely individual- or group-level estimates, we compared mixed-model predictions of child-specific long-term exposure based on both fixed effects and random child intercepts. We considered predictions from the baseline random-effects model with no covariates, and then from mixed models after sequentially adding stove, other between-child covariates, and time-varying covariates. Although the time-varying covariates were allowed to vary when building the models, we set these covariates at the study population means to predict long-term exposures. We also evaluated predictions based on models allowing separate estimates of within- and between-child random variance by stove type.

To visualize the exposure data and long-term exposure estimates, we present smoothed probability distributions using the density function in R 2.4.0 (The R Foundation for Statistical Computing, www.r-project.org).

##### Validation Study to Compare Measures of Long-Term Exposure

Alternative measures of long-term exposure were compared on their ability to predict the child-specific mean from the validation dataset. Although this validation study measure is in itself an error-prone estimate, it was chosen because it would be expected to vary randomly around the underlying long-term exposure (defined as the child-specific expected mean of repeated CO tube measures during the entire first 1.5 years of life). Furthermore, we assumed that its errors are independent of the errors in the main study estimates. Given these 2 assumptions, any estimate closer on average to the validation study child mean will also be closer on average to the underlying long-term exposure. Predictive validity was assessed by the Pearson correlation and prediction squared error, defined as the average squared difference between each alternative measure and the child mean in the validation study. The prediction squared error is equivalent to the square of what Hornung^{24} referred to as “accuracy” and defined as the mean-squared combination of bias and precision.

Because the comparison is with an imprecise gold standard, the Pearson correlation and prediction squared error underestimate predictive validity and can be used only to assess the relative validity of alternative predictions. A more informative parameter would tell us how close the estimates are to the true underlying exposures. Fortunately, given independent errors between predictions and validation measures, the observed prediction squared error has the same expected value as the sum of the error variances of the main and validation study estimates. Because we assume the child means from the main and validation studies have equal error variance, this error variance is expected to equal half the observed prediction squared error for the child mean. We exploited this relation by subtracting half the observed prediction squared error for the child mean from the observed prediction squared error for each alternative long-term exposure measure, which yields an estimate of that measure's error variance in relation to the underlying exposure. We refer to this estimate as the corrected prediction squared error.
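The correction can be sketched in a few lines. The function names and argument layout are hypothetical; all inputs are assumed to be lists of log-scale values, aligned child by child for the validation subsample.

```python
def prediction_squared_error(estimates, validation_means):
    """Average squared difference between an exposure measure and the
    child means from the validation study."""
    if len(estimates) != len(validation_means) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((e - v) ** 2 for e, v in zip(estimates, validation_means)) / len(estimates)


def corrected_pse(estimates, main_child_means, validation_means):
    """Observed PSE of a measure minus half the observed PSE of the main-study
    child mean, the latter estimating the validation measure's error variance
    under the equal-error-variance and independence assumptions."""
    pse_measure = prediction_squared_error(estimates, validation_means)
    pse_child_mean = prediction_squared_error(main_child_means, validation_means)
    return pse_measure - 0.5 * pse_child_mean
```

Because the correction is a difference of two sample quantities, it can come out slightly negative in small samples even though the quantity it estimates cannot.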

Although children were randomly selected for recruitment into the validation study, the adequacy of the validation dataset is based on untestable assumptions about the error structure of the gold standard. As a complementary way of comparing the predictive validity of alternative estimates of long-term exposure, we performed a cross-validation using only the main exposure dataset, selecting the 471 children with at least 2 measures, and randomly dividing their measures into 2 datasets, each containing 50% of the data and 2 measures per child on average. One dataset was used to develop the models and produce predictions analogous to those in the first validation design described previously, and the other dataset was used to create validation measures based on the child-specific means of observed exposures.
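One way such a split could be implemented is sketched below. The text does not specify how measures were allocated within each child, so the per-child shuffle and the handling of odd counts are assumptions, as are the function and argument names.

```python
import random


def split_measures(measures_by_child, seed=0):
    """Randomly split each child's measures into two halves.

    Children with fewer than 2 measures are excluded. For a child with an
    odd number of measures, the extra measure goes to a randomly chosen
    half (an assumption; the paper does not describe this detail).
    Returns (model_half, validation_half) as dicts keyed by child id.
    """
    rng = random.Random(seed)
    model_half, validation_half = {}, {}
    for child, values in measures_by_child.items():
        if len(values) < 2:
            continue
        shuffled = list(values)
        rng.shuffle(shuffled)
        cut = len(shuffled) // 2
        if len(shuffled) % 2 == 1 and rng.random() < 0.5:
            cut += 1
        model_half[child] = shuffled[:cut]
        validation_half[child] = shuffled[cut:]
    return model_half, validation_half
```

Every included child contributes at least one measure to each half, so a child-specific validation mean is always available.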

#### RESULTS

The main study (subjects = 509, n = 1932) and the validation study (subjects = 70, n = 270) attained an average (SD) of 3.8 (±1.2) and 3.8 (±1.4) measures per child, respectively, both ranging from 1–6. The overall mean (SD) 48-hour CO exposure was 2.12 ppm (±2.32) in the main study and 1.93 ppm (±1.90) in the validation study. In the main study, the mean (SD) exposure was 2.77 ppm (±2.62) among children from open-fire homes and 1.50 (±1.90) among children from chimney-stove homes.

As shown in Table 1, the 70 randomly selected children in the validation study were very similar to the main study population according to measured baseline covariates. The timing of the child CO measures was similar on average in the main and validation studies, as shown in Table 2. The numbers of measures per 4-month age interval rise with the first few categories, due to the distribution of entry into the study between 0–4 months of age, and then decline after 12 months due to death, dropout, and the end of the study.


The main study means and variance components of log child CO overall and by stove type are presented in Table 3. Compound symmetry provided the best fit for the covariance among repeated measures within child. Although the total variance was similar within the 2 stove groups, the chimney-stove group had higher within-child variance and lower between-child variance than the open-fire group, resulting in a substantially lower intraclass correlation coefficient for the chimney-stove group. The mean log CO difference of 0.72 between the stove types is equivalent to a 51% reduction in exposure associated with the chimney stove compared with the open fire.
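As a check on the arithmetic, and assuming the 0.72 difference is on the natural-log scale used throughout, the ratio of geometric means (GM) and the implied proportional reduction are:

$$\frac{\mathrm{GM}_{\text{chimney}}}{\mathrm{GM}_{\text{open fire}}} = e^{-0.72} \approx 0.49, \qquad 1 - e^{-0.72} \approx 0.51$$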

Figure 1 shows the distributions of log 48-hour CO measures (A) and log child means across repeated measures (B), the latter having 43% lower variance in the open-fire group and 58% lower variance in the chimney-stove group.

Mixed models for log 48-hour child CO are presented in Table 4 in order of increasing complexity, along with estimates of the residual within- and between-child variances and proportions of each variance component explained by the covariates. In addition to stove type, between-child covariates included in model 3 were a linear term for mother's age; binary indicators for asset index greater than 2, earth floor, tile kitchen roof, secondhand tobacco smoke, household electricity, wood-fired sauna, and altitude over 2800 m; and a categorical variable for kitchen wall materials (informal, adobe mud, and high quality). This model 3 explained 62% of the between-child variability. In model 4, we added categorical variables for day of the week and for child age in 3-month intervals for the first year or greater than 1 year. These time-varying covariates explained only 4% of the overall within-child variability.

The results of the validation study of alternative measures of long-term exposure are shown in Table 5. Compared with using the stove-specific group mean, using a single 48-hour personal CO measure per child had much lower predictive validity, as indicated by the lower Pearson correlation and substantially larger corrected prediction squared error. The predictive validity of the observed child-specific mean of repeated measures was only slightly better than the stove mean, whereas the group-level measure based on stove and other subject and residential characteristics performed better than the child mean. Although predictions based solely on the random intercept model 1 offered no improvement compared with the child mean or group-level measures, adding the fixed effect for stove type in model 2 substantially improved the predictions. The further gain in predictive validity by adding several other between-child covariates in model 3 was similar in magnitude to that achieved by allowing separate estimates of the variance components by stove type in model 2s, and these improvements are combined in model 3s. Adding the time-varying covariates, child age and day of the week, in models 4 and 4s did not improve predictive validity.

The measure from Table 5 with the best predictive validity was based on the model 3s:

The random components, *b*_{(i)j} and *ϵ*_{(i)jk}, have a subscript *i* to denote that they are estimated separately by stove group. Figure 2 compares the distributions across all 509 children of group-level exposure estimates based on the fixed effects in the model (Fig. 2A) with predictions based on the fixed and random effects (Fig. 2B). The spread of the mixed-model predictions is wider than the group-level estimates but much narrower than the observed child means shown in Figure 1B.


The results of the complementary validation study (shown in the Appendix) were similar regarding the superior performance of mixed-model predictions. The only substantial difference between the results of the 2 validation studies was that the observed child mean of approximately 2 measures had worse predictive validity than a group-level estimate based on stove, whereas the child mean of 4 measures performed better.

The variance of the random village intercept was not substantially different from zero (*P* = 0.207), and the Akaike information criterion model fit statistic did not improve when this random effect was added. Furthermore, the variogram did not show a trend across distance between households, providing no evidence of spatial dependence of the random child intercepts.

#### DISCUSSION

Direct long-term measures of exposures are often not obtainable without expensive and invasive measurement protocols. The alternative surrogates have been classified as either individual-level estimates, generally based on observed short-term exposures such as the 48-hour personal CO measures in RESPIRE, or group-level estimates, generally based on subject characteristics that predict exposure.^{1} Here we sought to compare the predictive validity of these 2 traditional approaches and mixed-model prediction, an alternative that combines individual- and group-level information. As summarized in Table 5 and the Appendix table, long-term exposure predictions based on mixed models had substantially better predictive validity than purely individual- or group-level estimates (as indicated by higher correlation coefficients and lower prediction squared errors).

The low overall intraclass correlation (r_{ic} = 0.33) indicates the low reliability of 48-hour child CO as a measure of long-term exposure. In the 1980s, authors of studies of kitchen pollution levels in Kenya and India reported similar intraclass correlation coefficients to those we observed for personal exposures.^{25,26} These authors concluded that the low between-household variability in air pollution levels made it essentially impossible to demonstrate health effects. However, the presence of large within-child variability does not mean there are no important exposure differences between children; it merely reduces the precision of long-term exposure estimates based on short-term measures. The similar intraclass correlation for particle samples in those other studies suggests our findings for CO may be generalizable to this other major constituent of woodsmoke. (Assessment of variance components across several pollutants in the same setting would be required to determine whether our findings can be extended to the full woodsmoke mixture.) Based on analysis of duplicate measures of kitchen CO, the estimated variance due to random instrument error (0.02) was equivalent to only about 3% of the within-child variance (0.55) observed for personal exposure (KR Smith, et al, unpublished data, 2008), suggesting that the imprecision of short-term measures was due to true temporal variability in exposure. In anticipation of this wide temporal variability, we increased the measurement duration from 24 to 48 hours and collected repeated measures. In spite of these improvements, only a moderate correlation (r = 0.70) was observed between child means in the main and validation studies. These individual-level estimates could be further improved by collecting more repeated measures per subject, although costs would be substantial and participant fatigue could occur.
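The gain from additional repeated measures can be approximated with the Spearman-Brown formula, which is not part of the original analysis and is used here as an illustrative assumption. It presumes equally correlated repeated measures (compound symmetry) with independent within-child errors. Function names are hypothetical.

```python
def reliability_of_mean(icc, k):
    """Reliability of the mean of k repeated measures, given the intraclass
    correlation of a single measure (Spearman-Brown formula)."""
    if not 0 <= icc <= 1 or k < 1:
        raise ValueError("icc must be in [0, 1] and k must be >= 1")
    return k * icc / (1 + (k - 1) * icc)


def measures_needed(icc, target):
    """Smallest number of repeated measures whose mean reaches the target
    reliability."""
    if not 0 < icc <= 1 or not 0 < target < 1:
        raise ValueError("icc must be in (0, 1] and target in (0, 1)")
    k = 1
    while reliability_of_mean(icc, k) < target:
        k += 1
    return k
```

Under these assumptions, with r_{ic} = 0.33 the mean of 4 measures would have a reliability of about 0.66, and `measures_needed(0.33, 0.8)` indicates that 9 measures per child would be needed to reach a reliability of 0.8.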

Group-level exposure estimates are a potential solution when short-term measures vary widely over time. Their strength depends on the amount of between-subject variation explained by measured subject characteristics. In our study, type of stove explained approximately half (49%) of the between-child differences, illuminating why this is a fairly good surrogate of long-term exposure. Measured residential characteristics including stove type explained 62% of between-child variability. In fact, these characteristics predicted the validation study measure better than the mean of each child's own observed short-term exposures. Relying on subject characteristics alone, such as the “exposure prediction rule” used in a recent study of whole-body vibration among taxi drivers,^{27} is particularly useful for estimating exposure among subjects with no direct exposure measure. In studies directly measuring exposure among all subjects, the limitation of a group-level approach is that it ignores information regarding subject-specific long-term exposure contained in the short-term measures but not explained by subject characteristics.

When both individual- and group-level exposure estimates are available, mixed models offer a third alternative that combines information from both. Rappaport and colleagues^{12} combined fixed effects of jobs and process- and task-related covariates with variance components to estimate the probabilities that a random worker would exceed a particular occupational exposure limit on average or on a random day. The aim of that study, however, was to understand the population distribution of exposure rather than to assign an exposure value to specific workers. In a study of dust exposures in sawmills, Friesen and colleagues^{16} compared predictions based on fixed effects models to predictions based on mixed models and found little difference. This study had few measures per subject (mean = 1.2), and the assessment was based on associations between model predictions and the same observed exposures used to fit the models. Because there is a general tendency to overfit models to the particular dataset with which they are developed (which results in underestimation of the true prediction error) it is important to compare model predictions to observations from a separate validation dataset.^{17} Terrell and colleagues^{18} used this approach and found the predictive performance of mixed models significantly better than 2-stage ordinary least squares models for the decay of polybrominated biphenyl concentrations in serum. These serum measures were characterized by very small within-subject variability relative to variability between subjects, and so it is unclear whether these findings apply to the common exposure scenario of relatively large within-subject variability.

In a setting of large temporal variability in exposure, we used mixed models to produce predictions that are essentially a weighted average of an individual-level measure (based on observed short-term child CO) and a group-level measure (based on stove and other residential characteristics that predict child CO) borrowing information from other children to estimate the components of within- and between-child variance used for the weights. As within-child random variation increases relative to between-child, more weight is given to the group-level information and less to the individual-level information. As residual between-child variation increases, less weight is given to the group-level estimate and more to the individual-level estimate. Because children from open-fire households had higher between-child and lower within-child variation, their individual-level information would be weighted more heavily than among children from chimney-stove households. Predictions based on mixed models with separate variance component estimates by stove type had smaller errors, presumably a result of giving more weight where there is more information.
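The weighting described here can be sketched for the simplest case, a random-intercept model in which the group-level estimate comes from covariates that are constant within child. This is the standard shrinkage form for that model, not a reproduction of the fitted RESPIRE models. The function names are hypothetical, and the variance components in the usage note are invented for illustration.

```python
def shrinkage_weight(between_var, within_var, n_measures):
    """Weight given to a child's own observed mean: residual between-child
    variance divided by the variance of the observed child mean."""
    if n_measures < 1:
        raise ValueError("n_measures must be >= 1")
    return between_var / (between_var + within_var / n_measures)


def predict_long_term(group_estimate, child_mean, n_measures, between_var, within_var):
    """Weighted average of the group-level estimate and the child's observed
    mean; a child with no measures receives the group-level estimate."""
    if n_measures == 0:
        return group_estimate
    w = shrinkage_weight(between_var, within_var, n_measures)
    return group_estimate + w * (child_mean - group_estimate)
```

For instance, with 4 measures per child and hypothetical variance components of 0.30 (between) and 0.45 (within) for one group versus 0.15 and 0.60 for another, the weights on the child's own mean would be about 0.73 and 0.50, respectively. This mirrors the qualitative pattern described for open-fire and chimney-stove households.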

The reasons for heterogeneity in variance components by stove type are unknown. The occasional use of open fires in chimney-stove households may increase temporal variability. Other sources of CO exposure that are relatively constant between children may explain lower between-subject variability among the chimney-stove group because household woodsmoke would represent a smaller proportion of their total exposure. The finding that variance components—and therefore the reliability of short-term measures—differ by stove type raises the interesting question of whether short-term measures would have higher reliability in other populations.

Although child age and day of the week are important predictors of short-term exposure measures, adjusting for these factors did not improve the validity of predictions of underlying long-term exposure. What explains the remainder of the large within-child variability remains unknown. Identification of key time-varying predictors could further improve measurement of long-term exposure.

##### Potential Limitations

If the validation study measure and main study measures are not independent conditional on the underlying exposure, the prediction squared error does not estimate the sum of error variances. This concern is raised by the potential for positive correlation between missing data patterns in the main and validation studies, such as that induced by children who entered the study at an older age and therefore had no measure early in life when exposures tended to be high. Positively correlated errors would cause the observed prediction squared error to underestimate the error variance. However, the strength of mixed-model prediction is that, unlike simple means, it does not rely on the assumption that data are missing completely at random. Instead, it assumes that data are missing at random conditional on other data, including covariates in the model and previous exposure measures on the same child. For example, if the errors are only independent conditional on age, this condition may explain why the mixed-model predictions adjusted for age (model 4s) appear to perform slightly worse than those not adjusted for age (model 3s). Although missing-data bias remains a possibility, its effects are unlikely to explain our conclusion that mixed-model predictions perform better than purely individual- or group-level estimates. Furthermore, the complementary validation study presented in the eAppendix (available with the online version of this article) is not susceptible to the same sources of bias because each child's measures were randomly split to form prediction and validation datasets, and the results were similar.
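The role of the independence assumption can be made explicit with a short derivation. The notation is introduced here for illustration only: for child *j*, let *X_j* be the underlying long-term exposure, *M_j* = *X_j* + *e_{Mj}* the main study estimate, and *V_j* = *X_j* + *e_{Vj}* the validation study child mean. Then:

$$E\left[(M_j - V_j)^2\right] = E\left[(e_{Mj} - e_{Vj})^2\right] = E\left[e_{Mj}^2\right] + E\left[e_{Vj}^2\right] - 2\,E\left[e_{Mj}\,e_{Vj}\right]$$

If the two errors are independent and the validation error has mean zero, the cross term vanishes and the prediction squared error estimates the sum of the two mean squared errors. If the errors are positively correlated, the cross term is positive and the observed prediction squared error falls below that sum.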

The increased predictive validity of mixed-model prediction comes at a cost. Although each child's observed short-term exposure may be an unbiased estimate of their true underlying exposure, the mixed-model predictions are biased toward the estimated group-level mean for children with the same covariate values. In fact, observed subject-specific means are generally better predictors for subjects with underlying exposures in the tails of a distribution.^{14} Although it may be true that observed means are better predictors of underlying exposure for some children, the validation study showed that mixed-model predictions were better on average for all children. Furthermore, it is important to consider that not all types of measurement error induce the same bias in exposure–response estimates. Classic exposure measurement error attenuates regression coefficients in exposure–response analyses.^{28,29} The mixed-model predictions not only move closer to the underlying values but also have much lower variance than child means, indicative of less classic error.

#### ACKNOWLEDGMENTS

We thank Byron Arana, Anaité Díaz, Eduardo Canúz, and Rudinio Acevedo from the *Universidad del Valle*, Guatemala City for help in planning and coordinating the field research and data management; the Guatemalan Ministry of Health for their cooperation; and the RESPIRE project staff from San Lorenzo and Comitancillo.