Within the past several years, accelerometry has emerged as an important means of assessing the duration and intensity of physical activity and has served to define primary outcome measures in several observational (^{1,4,20}) and experimental studies (^{6,7,11,15,17}). It is currently being used in the large group-randomized Trial of Activity in Adolescent Girls (TAAG) to examine the effect of a school- and community-based intervention on physical activity in middle-school girls. The uniaxial accelerometer considered in this trial, the ActiGraph, formerly known as the Computer Science and Applications (CSA) and Manufacturing Technologies Inc. (MTI) monitor (ActiGraph, LLC, Fort Walton Beach, FL), is a small (5 × 4 × 1.5 cm) and lightweight (45 g) device that captures vertical acceleration. Acceleration is sampled 10 times per second, the data are summed over a user-specified time interval (e.g., 30 s, 1 min), and the summed value or activity “count” is stored in memory.

Although the measurement protocols vary, most studies involve monitoring physical activity over several days to ensure reliable estimates of usual physical activity behavior and to account for potentially important differences in activity patterns on weekdays versus weekend days (^{19}). Participants are typically instructed to wear the accelerometer during waking hours, except when bathing and showering. Data from accelerometers are summarized in numerous ways, including the mean total count per day (^{11}) and the mean minutes per day spent in moderate or vigorous physical activity (using established count cut points to distinguish specific intensity levels). The analysis is complicated in that 1) activity levels vary among days of the week and times of day and 2) over multiple days of monitoring, missing data arising from removal of the monitor are a common occurrence. Although participant noncompliance accounts for a large fraction of the missing data, legitimate reasons for removing the monitor, such as complying with mandated sports league safety regulations or participation in water-related activity (for monitors that are not waterproof), also contribute to data loss. Thus, the timing and amount of data contributed by each individual vary. If summary statistics are computed using the observed data only, these statistics have the potential to be biased. For example, the total count for a given day clearly underestimates the true level of activity on days in which the monitor is worn only part of a day. Some researchers have tried to minimize this bias by computing summary statistics after excluding accelerometer data on days in which the monitor is worn only part of the day. These are called *incomplete* days of observation. This strategy, however, is not without issues. First, even after excluding days with, say, less than 8 h of wearing time, the number of hours the monitor is worn is still likely to vary. 
Moreover, if included days have intervals when the subject was awake but not wearing the monitor, then total activity will be underestimated. Second, this approach ignores possible differences in activity levels on complete days and incomplete days, making the estimated summary statistics from complete days subject to selection bias.

This paper proposes an analytical approach whereby the observed data are used to help predict activity levels for segments of the day in which the monitor is not worn. The resultant data set is *pseudo-complete* in the sense that each individual will have either observed or imputed data for all segments of each day in which the monitor was intended to be worn. Summary statistics are then estimated from this pseudo-complete data set. This imputation strategy is analogous to imputing missing item responses on multi-item questionnaires. The literature contains numerous examples in which this treatment of item nonresponse has been found to reduce bias effectively (^{5,21}).

In the following section, accelerometer data collected during the feasibility phase of the TAAG trial are used to demonstrate the potential for bias in estimating physical activity when all observed data are used as well as when a subset of data that excludes incomplete days of monitoring is used. We then describe procedures for filling in missing data using single imputation through expectation maximization (EM) and multiple imputation (MI) (^{9,14}). The remaining sections of the paper describe the design of a simulation study to assess the effectiveness of the imputation approaches, present its results, and discuss the effectiveness of imputation as a strategy for dealing with missing data in the context of accelerometry.

## THE PROBLEM OF MISSING ACCELEROMETER DATA IN TAAG

TAAG is a group-randomized, multicenter trial, sponsored by the National Heart, Lung, and Blood Institute, to assess the impact of a school- and community-based intervention on the physical activity of middle-school girls (^{16}). The six field centers involved in TAAG are the University of Maryland, Baltimore, MD; University of South Carolina, Columbia, SC; University of Minnesota, Minneapolis, MN; Tulane University, New Orleans, LA; University of Arizona, Tucson, AZ; and San Diego State University, San Diego, CA. The University of North Carolina is the Coordinating Center.

The data reported here arise from a substudy conducted in the fall of 2002. This substudy was undertaken to inform decisions about measurement protocols and design for the main TAAG trial. In this substudy, each of the six sites recruited two schools and randomly selected 45 eighth grade girls from each school; 80.7% participated (*N* = 436). Girls were asked to wear the monitor for seven complete days, and the epoch of integration was set at 30 s. Because the monitor records the slightest motion as a nonzero count, a sustained (20 min) period of zero counts was judged as a time when the monitor was removed and counts in that period were set to missing. Figure 1 shows the cumulative proportion of girls in our sample classified as wearing the monitor (i.e., nonzero counts were being registered) at specific times of the day.
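
The non-wear rule described above can be sketched in a few lines. This is a minimal illustration, not the TAAG processing code: the function name `mask_nonwear` and its parameters are hypothetical, and it assumes that "sustained" means a run of consecutive zero counts lasting at least 20 min (40 epochs of 30 s).

```python
import numpy as np

def mask_nonwear(counts, epoch_s=30, min_zero_min=20):
    """Set runs of consecutive zero counts lasting >= min_zero_min to NaN.

    Hypothetical helper: assumes a non-wear period is any unbroken run of
    zeros at least min_zero_min minutes long.
    """
    counts = np.asarray(counts, dtype=float).copy()
    run_len = int(min_zero_min * 60 / epoch_s)  # 40 epochs for 30-s data
    is_zero = counts == 0
    # Pad with False so that runs touching either end are still detected.
    padded = np.concatenate(([False], is_zero, [False]))
    change = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = change[::2], change[1::2]  # each zero run is [start, end)
    for start, end in zip(starts, ends):
        if end - start >= run_len:
            counts[start:end] = np.nan
    return counts
```

Shorter runs of zeros are left untouched, so brief sedentary periods remain in the record as observed data.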

The primary outcome for assessing the effectiveness of the TAAG intervention was the mean intensity-weighted minutes of moderate-to-vigorous physical activity (MVPA) per day. To derive this variable, one must first convert accelerometer counts per 30 s to MET values (multiples of resting metabolic rate (RMR), in kilocalories per kilogram per hour) (^{1}) using a calibration equation developed by Treuth et al. (^{18}). Then, the total intensity-weighted minutes (i.e., MET-minutes) of MVPA are computed by summing the MET values of all 30-s epochs above a moderate-intensity threshold (1500 counts per 30 s) and dividing by 2 (to transform from the original 30-s scale to a 1-min scale).
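
The two-step computation of MET-minutes can be illustrated as follows. This is a sketch under stated assumptions: the calibration is taken to be linear in counts, and the `intercept` and `slope` arguments are placeholders supplied by the user, not the published Treuth et al. coefficients. The function name is hypothetical, and whether an epoch exactly at the threshold counts as MVPA is not specified in the text.

```python
import numpy as np

def met_minutes_mvpa(counts, intercept, slope, threshold=1500, epoch_s=30):
    """MET-minutes of MVPA for one day of epoch-level counts.

    intercept and slope define an assumed linear calibration
    (MET = intercept + slope * counts); they are user-supplied placeholders.
    """
    counts = np.asarray(counts, dtype=float)
    mets = intercept + slope * counts
    # Missing epochs (NaN) compare as False and contribute nothing.
    is_mvpa = counts > threshold
    total_mets = np.where(is_mvpa, mets, 0.0).sum()
    # Each epoch lasts epoch_s seconds; for 30-s epochs this divides by 2.
    return total_mets * epoch_s / 60.0
```

For example, with illustrative coefficients `intercept=2.0` and `slope=0.002`, two epochs of 2000 and 3000 counts correspond to 6 and 8 METs and contribute (6 + 8) / 2 = 7 MET-minutes.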

Table 1 shows the average time (h) the monitor was worn by day of the week. As noted earlier, a characteristic of accelerometer data is that activity is not measured over a uniform period each day. The number of hours in which the monitor was worn was much higher on weekdays than on weekend days. Although the genesis of the missing data is not fully understood, many girls reported forgetting to put on the monitor; fewer reported not wearing the monitor because of illness or because of participation in a sporting event where its use was prohibited.

If the primary outcome is computed with an approach that uses observed data, which represents measurement of activity over an average of 12.1 h·d^{−1}, an estimated 136 MET·min of MVPA was accumulated per day. Alternatively, if an approach is used that excludes incomplete days, defined as having less than 8, 10, or 12 h of recorded activity, then the primary outcome was estimated to be 148, 147, or 159 MET·min of MVPA per day, respectively (Table 1). Because the first approach includes days in which only a few hours of activity are measured, the estimate is likely to represent an underestimate of the true mean level of activity in this sample and is not recommended. The approach of excluding incomplete days, on the other hand, involves an interesting trade-off between accurate representation of total activity during a typical day and possible selection bias resulting from differences between days that are dropped and days that are kept in the analysis.

When considering the approaches for dealing with missing data, it is important to understand the possible mechanism that may have given rise to the missing data. Following the terminology of Rubin (^{12}), the missing data are said to be missing completely at random (MCAR) if the “missingness” is independent of all other data, and therefore individuals with incomplete data have the same distribution for the primary outcome as those with completely observed activity data. When the missingness depends on the observed activity or covariates, but not on the missing (unobserved) pattern of activity, the missing data are said to be missing at random (MAR). When the missingness depends on the missing level of activity, the missing data are not missing at random (NMAR). If the distribution of activity for days that are excluded from analysis is the same as that for days that are included (i.e., the MCAR condition holds), the analysis approach in which incomplete days are excluded would result in a valid estimate of activity (^{9}). On the other hand, if people are less (or more) likely to wear the monitor when they are inactive (i.e., the NMAR condition holds), this approach would lead to a biased estimate of activity.

## IMPUTATION OF MISSING DATA

The fundamental idea of imputation is to use observed data values to assist in predicting missing values. How close the estimate of the missing value is to its true value depends on how many predictors are used and their correlation with the missing variable. In general, the greater the number of predictors and the higher the correlations among variables, the more precise the estimates of missing values will be. The quality of estimates is also affected by how much data are observed versus missing for an individual and by the pattern of missing values.

Table 2 shows the correlation matrix for measures of physical activity for each day of the week estimated from girls with seven completely observed days of data in the TAAG substudy (*N* = 181). Here, “complete” was defined as having nonmissing counts over at least 80% of a standard measurement day, with a standard measurement day defined as the length of time in which at least 70% of sample participants were wearing the monitor. For example, on weekdays at least 70% of participants were wearing the monitor between 7:25 and 21:25 (Fig. 1), a standard measurement day of 14 h, so the minimum number of hours of nonmissing data for a weekday to be deemed complete would be 0.80 × 14 h = 11.2 h. For weekend days, at least 7.2 h of nonmissing data would be required.
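
The weekday arithmetic can be checked with a short sketch. The helper name `complete_day_threshold` is hypothetical; the on and off times are the weekday values quoted in the text.

```python
def complete_day_threshold(on_time, off_time, fraction=0.80):
    """Minimum hours of nonmissing data for a day to count as complete.

    on_time and off_time are "H:MM" strings bounding the standard
    measurement day (the window in which >= 70% of participants wore
    the monitor).
    """
    def to_hours(hhmm):
        hours, minutes = hhmm.split(":")
        return int(hours) + int(minutes) / 60.0

    return fraction * (to_hours(off_time) - to_hours(on_time))


weekday_minimum = complete_day_threshold("7:25", "21:25")  # 0.80 * 14 h
```

By the same rule, the 7.2-h weekend requirement corresponds to a 9-h standard measurement day.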

Before we describe the imputation procedures used in this paper, a little notation is needed. Let y_{i} = (y_{i1}, …, y_{ip})′ denote the set of responses from subject i (i = 1,…, *N*) on *p* days of monitoring. The vector y_{i} can be partitioned into values that are observed and missing, y_{i} = (y_{i,obs}, y_{i,mis}), where the dimension of each component depends on the number of observed and missing data values. Let z_{i} = (z_{i1},…, z_{ip})′ denote the ideal vector of responses for the ith subject in which all data have been observed, and let Z be the matrix of the responses z_{i} from all subjects. Assume z_{i} follows a multivariate normal distribution with parameter θ = (μ, Σ). One can generate plausible versions of Z in many ways. The most straightforward approach replaces an individual’s missing values with the mean of observed values. Although mean substitution is easy to use, it tends to decrease variance estimates as more means are added to the data. In turn, covariance estimates also are attenuated (^{8,10}). Nevertheless, mean substitution may be a reasonable approach when correlations between variables (y_{ij} and y_{ik}) are low and missing data are less than 10% (^{3}). A more refined approach, which uses probability density estimates of the missing values instead of point estimates, is the EM algorithm.

Starting with an initial parameter value θ^{(0)}, the EM algorithm (^{2}) repeats the following two steps. The jth iteration of the expectation or E-step consists of imputing missing values of Y by their conditional expected values given the observed data and the current parameter value θ^{(j − 1)}:

Y_{mis}^{(j)} = E(Y_{mis} | Y_{obs}, θ^{(j − 1)}).

For a detailed description of the computations involved, see Schafer (^{14}). Any convenient starting value θ^{(0)} will do; the maximum likelihood estimate for θ computed using only the complete rows of Y was used here. The maximization or M-step uses both Y_{obs} and current imputed data Y_{mis}^{(j)} to reestimate θ, which is used in the next E-step to generate new imputations of Y_{mis}. This process is repeated until convergence, specified to occur when the successive log-likelihood values differ by less than 0.00001. The ideal data matrix Z is taken to be the observed and imputed data matrix produced from the final E-step.
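
The E- and M-steps described above can be sketched for multivariate normal data as follows. This is an illustrative implementation, not the software used in TAAG. The function name `em_impute` is hypothetical, and the sketch assumes at least two monitoring days, enough complete rows to give a nonsingular starting covariance matrix, and at least one observed value in every row. Following the standard formulation, the M-step adds the conditional covariance of the missing values to the sums of squares so that Σ is not understated.

```python
import numpy as np

def em_impute(Y, tol=1e-5, max_iter=1000):
    """EM imputation for multivariate normal data; NaN marks missing values."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)
    complete = ~miss.any(axis=1)
    # Starting values: ML estimates from the complete rows only.
    mu = Y[complete].mean(axis=0)
    sigma = np.cov(Y[complete], rowvar=False, bias=True)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: conditional means of the missing values given theta.
        Z = Y.copy()
        extra = np.zeros((p, p))  # accumulated conditional covariances
        ll = 0.0                  # observed-data log-likelihood at theta
        for i in range(n):
            m, o = miss[i], ~miss[i]
            s_oo = sigma[np.ix_(o, o)]
            resid = Y[i, o] - mu[o]
            weights = np.linalg.solve(s_oo, resid)
            ll -= 0.5 * (o.sum() * np.log(2 * np.pi)
                         + np.linalg.slogdet(s_oo)[1] + resid @ weights)
            if m.any():
                s_mo = sigma[np.ix_(m, o)]
                Z[i, m] = mu[m] + s_mo @ weights
                extra[np.ix_(m, m)] += (sigma[np.ix_(m, m)]
                                        - s_mo @ np.linalg.solve(s_oo, s_mo.T))
        # Stop when successive log-likelihoods differ by less than tol;
        # Z then holds the imputations from the final E-step.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: re-estimate theta from observed plus imputed data.
        mu = Z.mean(axis=0)
        centered = Z - mu
        sigma = (centered.T @ centered + extra) / n
    return Z, mu, sigma
```

The returned `Z` plays the role of the pseudo-complete data matrix, with observed values left unchanged and missing values replaced by their conditional expectations.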

It is important to note that single imputation methods treat imputed values as though they were the actual values. Therefore, the uncertainty about the correct value to impute is not taken into account and the variance of the summary statistic will be underestimated. To properly reflect this uncertainty, the MI (^{13}) procedure replaces each missing value with m > 1 plausible values. For the present work m = 5 imputations were performed. The final result is m versions of Z, where the missing data are replaced by independent random draws from *P*(Z | Y_{obs}), the predictive distribution of Z given the observed data. The resulting data sets Z^{(1)}, Z^{(2)}, …, Z^{(m)} are analyzed separately using standard complete-data methods, and the results combined in a manner that takes the imputation variability into account. In the current setting of multivariate normal data with arbitrary patterns of missing values, the form of *P*(Z | Y_{obs}) is complicated, making the distribution hard to sample directly. Thus, we employed a more convenient strategy whereby missing data are replaced by random draws from a simulated distribution of *P*(Z | Y_{obs}) obtained by the Markov chain Monte Carlo (MCMC) method. The MCMC method constructs a Markov chain that is long enough for the distribution to stabilize to a stationary distribution, which is the distribution of interest.
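
The text does not spell out how the m sets of results are combined. One widely used approach is the set of combining rules usually attributed to Rubin, sketched here as an assumption rather than as the exact procedure used in this study. The function name `pool_estimates` is hypothetical.

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Combine m complete-data estimates and their variances.

    Returns the pooled estimate, the within-imputation variance, the
    between-imputation variance, and the total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    pooled = q.mean()         # average of the m point estimates
    within = u.mean()         # average complete-data variance
    between = q.var(ddof=1)   # variability across the imputed data sets
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total
```

A single imputation supplies only the within-imputation component, which is why treating imputed values as real data understates the variance.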

## DESIGN OF A SIMULATION STUDY TO EVALUATE THE EFFECTIVENESS OF IMPUTATION

We selected girls from the TAAG substudy who had seven complete days of monitoring (as defined in the previous section) to assess whether imputation could effectively estimate missing accelerometer data. This resulted in a sample of 181 girls. Although the primary outcome of interest was the average daily MET-minutes of MVPA, important secondary outcomes were physical activity during weekdays and weekends and during specific periods of the day. In particular, each weekday was partitioned into periods roughly corresponding to before school (6–9 a.m.), during school (9 a.m.–2 p.m.), after school (2–5 p.m.), early evening (5–8 p.m.), and evening (8 p.m.–12 a.m.). On weekends, the first two time periods were segmented at 11 a.m. rather than 9 a.m. to reflect differences in sleep patterns. The relative effectiveness of imputation methods was examined under four conditions formed by crossing two missing data mechanisms (MCAR, NMAR) with two levels of data aggregation (the entire day, or segments of the day such as before/during/after school). Thus, data were missing either for the entire day or for specific segments of the day.

Missing data were created as follows. Let *P*_{ȳ}(0.75) denote the 75th percentile for the mean level of activity across all days and *P*_{y_{j}}(0.75) the 75th percentile for the mean level of activity for the jth day. Missing data were generated from a logistic regression model with three covariates: body mass index (BMI) in tertiles (x_{i1} = 1, 2, or 3); mean level of activity across all days (ȳ_{i}); and activity level on the jth day (y_{ij}). The model had the form:

logit Pr(r_{ij} = 1) = β_{0} + β_{1}x_{i1} + β_{2}I(ȳ_{i} < *P*_{ȳ}(0.75)) + β_{3}I(y_{ij} < *P*_{y_{j}}(0.75)),

where r_{ij} is an indicator variable for missingness (r_{ij} = 1 if y_{ij} is missing) and I(.) denotes the indicator function. In all missing data models, the missing data rate was chosen to match that observed in the TAAG substudy (Table 1), and the following restrictions were imposed:

MCAR: β_{1} = β_{2} = β_{3} = 0 (probability that y_{ij} is missing is unrelated to observed or unobserved data)

NMAR: β_{1} = 0.235, β_{2} = β_{3} = 1.386 (probability that y_{ij} is missing depends on BMI tertile, on the mean level of activity, and on the possibly missing activity level itself)

At one extreme, the MCAR condition imposes a random pattern of missingness; at the other, the NMAR case assumes that the odds that the ith individual’s response on the jth day (or segment of the day) is missing are 1.6 times greater if the individual is in the highest versus the lowest BMI tertile (e^{2 × 0.235} = 1.6), and 4 times greater if their observed mean activity level is below the 75th percentile (e^{1.386} = 4) or if their response on the jth day is below the 75th percentile for the mean responses on that day.
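
A missingness generator of this kind can be sketched as below. This is a minimal illustration under stated assumptions: the log-odds are taken to be linear in BMI tertile and in two below-75th-percentile indicators, with the NMAR coefficients implied by the odds ratios quoted above (0.235 per tertile, 1.386 = ln 4). All function names are hypothetical. The intercept is calibrated by bisection to a single overall target rate, which simplifies the day-specific rates of Table 1.

```python
import numpy as np

def missing_probability(bmi_tertile, y, b0, b1, b2, b3):
    """Probability that each y[i, j] is set to missing (assumed logistic form)."""
    bmi_tertile = np.asarray(bmi_tertile, dtype=float)
    y = np.asarray(y, dtype=float)
    ybar = y.mean(axis=1)                                # mean across days
    low_mean = (ybar < np.quantile(ybar, 0.75)).astype(float)
    low_day = (y < np.quantile(y, 0.75, axis=0)).astype(float)
    eta = (b0 + b1 * bmi_tertile[:, None]
           + b2 * low_mean[:, None] + b3 * low_day)
    return 1.0 / (1.0 + np.exp(-eta))

def calibrate_intercept(target_rate, bmi_tertile, y, b1, b2, b3):
    """Bisection on b0 so the average missingness probability hits target_rate."""
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        rate = missing_probability(bmi_tertile, y, mid, b1, b2, b3).mean()
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def make_missing(y, prob, rng):
    """Return a copy of y with cells set to NaN with probability prob."""
    y = np.asarray(y, dtype=float).copy()
    y[rng.random(y.shape) < prob] = np.nan
    return y
```

Setting `b1 = b2 = b3 = 0` gives every cell the same probability of being missing, which corresponds to the MCAR condition.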

The bias and precision of the estimates of primary and secondary outcomes were both considered important criteria for assessing the performance of the imputation techniques. Bias represents systematic error (e.g., over- or underestimation of parameter estimates), and precision random error (e.g., the spread or variability in estimates about the true values). The mean and SD of the outcome variables were computed before creating the missing values and after imputing the values set to missing, with smaller differences between the two sets of parameters suggestive of less bias. In addition, bias in estimation (“prediction bias”) was summarized with the mean signed difference (MSD):

MSD = (1/*N*_{mis}) Σ (ŷ_{ij} − y_{ij}),

where ŷ_{ij} is the imputed value, y_{ij} is the true value that was set to missing, the sum is taken over all missing data values, and *N*_{mis} is the number of missing data values. The precision in estimation of the missing data values was summarized with root mean square difference (RMSD):

RMSD = [(1/*N*_{mis}) Σ (ŷ_{ij} − y_{ij})^{2}]^{1/2}.

With the MI procedure, MSD and RMSD were computed for each of the five imputations and the average was reported. Smaller values of these criterion measures indicate greater accuracy (MSD) and precision (RMSD).
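
Both criteria are simple to compute once the true values, the imputed values, and the positions set to missing are available. The sketch below assumes the difference is taken as imputed minus true, so that a negative MSD indicates underestimation. The function name `msd_rmsd` is hypothetical.

```python
import numpy as np

def msd_rmsd(true, imputed, was_missing):
    """Mean signed difference and root mean square difference.

    Differences are imputed minus true, evaluated only at the cells that
    were set to missing.
    """
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(was_missing, dtype=bool)
    diff = (imputed - true)[mask]
    return diff.mean(), np.sqrt((diff ** 2).mean())
```

Errors of −10 and +10 MET·min, for instance, give an MSD of 0 but an RMSD of 10, which shows why the two criteria are reported together.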

## RESULTS OF IMPUTATION IN TAAG

The estimated means and SD for MET-minutes of MVPA based on the completely observed data are provided in column 3 of Table 3. The proportion of missing data created under an MCAR mechanism and the parameter estimates computed from the observed data and from both imputation approaches are given in the next four columns. When estimating the mean, we see that there was minimal bias regardless of level of aggregation and analytic approach (observed data only, EM imputation, MI). When estimating the variance, the observed data analysis showed a positive bias while both imputation procedures were essentially unbiased for the weekday, weekend, and daily activity averages. For example, the SD of the weekend average MET-minutes of MVPA was 105 based on the completely observed data, 133 based on the observed data only, and 103 and 100 based on the pseudo-complete data sets created from EM imputation and MI, respectively. The same conclusions can be drawn from the MSD in Table 4, which shows that the average prediction bias when imputing for activity across the entire day was minimal (−5.9 vs −3.7 MET·min of MVPA on weekdays; −2.3 vs −1.1 MET·min of MVPA on weekend days using EM imputation and MI, respectively). The prediction bias on Mondays and Fridays was considerably higher, with the missing activity values underestimated by more than 25 MET·min on Mondays and overestimated by more than 20 MET·min on Fridays. The average prediction bias when imputing for activity in segments of the day was low, with an average discrepancy between imputed and true activity of less than 2.1 MET·min of MVPA.

The last four columns of Table 3 show that when the missing data are NMAR, an analysis of observed data only resulted in mean daily activity values (e.g., Monday–Sunday) that were significantly larger than those from the completely observed data. This was to be expected, as the missing data were generated assuming a greater probability of missingness associated with lower activity levels. The imputation approaches also yielded positively biased estimates, but the magnitude of error was lower. Interestingly, the bias in weekday, weekend, and daily average activity estimates was similar for the observed data analysis and imputation procedures. This follows from the fact that the least active individuals are selectively excluded from the estimation of activity on a given day of the week, while they are not excluded from estimation of weekday, weekend, and daily averages as long as they have at least 1 d observed.

In general, EM imputation and MI performed similarly. Paired *t*-tests did not reveal significant relative biases between estimates of means and SD derived from the two imputation procedures. However, EM imputation showed a slight advantage over MI with regard to the precision with which the imputed values approximated the true values and higher correlations between true and imputed values. For example, under the NMAR condition, the correlations between imputed and true activity during the prespecified hour-blocks were 0.37, 0.38, 0.34, 0.00, and 0.29 using EM imputation versus 0.16, 0.28, 0.13, 0.19, and −0.04 using MI (Table 4).

## DISCUSSION

The analyses and simulations described in this paper focus on the problem of obtaining unbiased estimates of physical activity with accelerometer data collected over multiple days. Although we used data collected during a TAAG substudy, the potential issues of bias and the performance of imputation techniques are expected to generalize to other studies using accelerometry.

Analysis of accelerometer data with intermittent missing data can be handled in a number of ways. One common approach is to restrict the analysis to days in which a sufficient amount of data were recorded. The problem arises in choosing an appropriate cut point for what is to be considered sufficient. In the TAAG data, we showed that the estimate of physical activity was remarkably different depending on the choice of cut points. One possible solution to this problem might be to propose a standard cut point for researchers working with accelerometer data. This approach has two problems. First, the same cut point may not be appropriate for all populations (e.g., young and old), or for all days of the week (e.g., weekday vs weekend). The risk in leaving the choice of cut point to data analysts is that it is very difficult to strike the right balance between setting the cut point high enough to eliminate days when the monitor was clearly not worn long enough to accurately represent that day’s activity versus making it so high that a significant number of individuals are excluded from analysis (thereby introducing the potential for selection bias). This suggests a need for guidelines to determine cut points that are appropriate for each individual analysis. The approach described in this paper, and that being used in the TAAG trial, involves defining a standard measurement day as the period over which at least 70% of the study population have recorded accelerometer data and making the cut point 80% of that observation period.

An alternative approach to dropping days with insufficient accelerometer data is to use imputation techniques to predict activity on those days. The advantages and limitations of imputation have been widely considered (e.g., see Schafer (^{14})). The effectiveness of an imputation technique depends not only on its ability to obtain unbiased estimates of missing values, but also on the ability to reduce bias in summary statistics.

The results of the simulation study clearly show that performance of either imputation technique was affected by the proportion of missing data, the correlation of activity across days of the week, and the missing data mechanism. Imputation performed better on weekdays than weekends, due to the lower percent of missing records and the higher correlation between activity levels on those days.

Both imputation methods yielded unbiased estimates of the mean daily physical activity under conditions in which the data were MCAR. The relative differences between the imputation methods’ ability to reduce bias in summary statistics were small and nonsignificant. The variance of the summary statistic from MI is derived from two parts: the variance of the summary statistic computed from each of the multiply imputed data sets (within-imputation variance) and the variance in the summary statistic across the multiply imputed data sets (between-imputation variance). The EM approach will overstate the precision of the summary statistic because it is unable to estimate the second component of variation. In this particular application, the precision was only slightly overstated. The *ad hoc* analysis approach of dropping incomplete days also resulted in an unbiased estimate of the mean daily activity, but the SD was larger than for the original (complete) data and either imputation approach. Thus, a key limitation of the *ad hoc* approach is a loss of efficiency. In turn, the resulting reduction in precision leads to a loss of power for testing hypotheses.

When the missingness mechanism is NMAR (not missing at random), the imputation techniques failed to eliminate the bias in estimating the overall summary measure of activity, but both performed better than the *ad hoc* procedure when estimating activity on individual days. The mean MET-minutes of MVPA estimated from the complete data was about five units (less than a tenth of a SD) lower than that for the estimates obtained using either the *ad hoc* or imputation approach. However, when examining the estimated activity on each day of the week, the magnitude of this bias was markedly higher using the *ad hoc* method but relatively unchanged with either imputation approach.

Although imputation has emerged as a popular strategy to deal with missing data, an important limitation must be noted. Both the EM algorithm and MI assume the missing data are either MCAR or MAR. Thus, inherent in the problem of missing value imputation is the fact we have no objective way of knowing whether the missing data are MCAR, MAR, or NMAR. However, some analyses can be performed that explore whether known factors are related to the probability of missingness. For example, one could assess whether respondents with the lowest observed activity levels were most likely to have incomplete data. In the present TAAG substudy, the probability of having one or more incomplete days of monitoring was not found to be associated with race, age, or average activity level on completely observed days.

With the growing use of accelerometers to measure physical activity, some appreciation for the potential bias that might result from using various methods for computing summary statistics is needed. Because missing value imputation was never worse than, and was often superior to, the *ad hoc* procedure of deleting incomplete days in these analyses, those analyzing accelerometer data should consider using the available software to undertake imputations. The EM algorithm was found to have slightly greater precision in predicting missing values than MI for the measure of activity used in this analysis (MET-minutes of MVPA) and was the procedure chosen for imputation in TAAG. This may have been due to the slight skewness in the distribution of the outcome variables. Although both the EM algorithm and MI assume that the data come from a multivariate normal distribution, the EM algorithm has been shown to provide consistent estimates under weaker assumptions about the underlying distribution (^{9}). We do not believe, however, that this conclusion is necessarily true for other physical activity outcome variables, such as the average total count per day or even for a transformation of the MET-minutes of MVPA, which may better satisfy the multivariate normality assumption.