Principles of Design and Analyses for the Calibration of Accelerometry-Based Activity Monitors : Medicine & Science in Sports & Exercise


Objective Monitoring of Physical Activity: Closing the Gaps in the Science of Accelerometry

Principles of Design and Analyses for the Calibration of Accelerometry-Based Activity Monitors


Medicine & Science in Sports & Exercise 37(11):p S501-S511, November 2005. | DOI: 10.1249/


Accelerometers have gained acceptance as perhaps the most effective way to obtain objective information about levels of physical activity in the population (45). The Caltrac Personal Activity Monitor (Muscle Dynamics, Torrance, CA) was the first monitor to be fairly widely used in research, but today a number of different devices are commercially available for physical activity researchers (e.g., Tritrac-R3D, ActiGraph, BioTrainer, ActiTrac, Actical, Tracmor, Actiwatch, SenseWear PRO2). The various accelerometry-based devices are based on the same essential principles, but different processes are used to filter, process, and store the raw accelerometry signals. When exported, the resulting outcome variable from accelerometers is a dimensionless unit typically referred to as “accelerometer counts.” The accelerometer counts provide an indicator of overall movement, but a fundamental research challenge has been to determine how counts equate to more meaningful indicators, such as energy expenditure or time spent in moderate to vigorous activity. The process used to convert raw accelerometer counts into more meaningful and interpretable units is generally referred to as “calibration.”

This paper reviews measurement issues associated with the calibration of accelerometers. A review of past calibration research is provided first, followed by guidelines and suggestions for future research. The last section includes a case study demonstrating the utility of a new approach to calibration based on receiver operating characteristic (ROC) curves.


The key goal in calibration research is to determine the relationship between the raw accelerometer output and actual levels of physical activity. Most devices are sold with built-in calibration equations or algorithms, but independent research has been necessary to evaluate how accelerometer output varies across a wide range of intensities and for different types of activities. This type of research is referred to as value calibration (validity) research. Although not as frequently described, interunit calibration of activity monitors is also needed to ensure that the different units of a given monitor provide similar information. Units typically undergo factory calibration, but quantifying the reliability has become increasingly important as accelerometers begin to be used for surveillance and large-scale clinical trials. This type of research is referred to as unit calibration (reliability) research. Both types will be described below.

Value Calibration (Validity) Research

Value calibration allows raw accelerometer counts to be converted into more meaningful and standardized units. The unique characteristics of the internal accelerometers in the different devices necessitate separate calibration studies for each monitor. Early research with accelerometers emphasized the use of controlled laboratory studies to evaluate the agreement between accelerometer data and measured metabolic data from metabolic carts. In later research, the use of free-living activities has become more common. Both types of settings have been helpful in advancing knowledge about the calibration of accelerometers. Freedson et al. (16) have provided a comprehensive review of calibration studies on children, and Matthews (27) has provided a similar review for adults. Therefore, the emphasis here is on principles for the design and evaluation of calibration studies.

Laboratory calibration studies.

Laboratory calibration studies have typically used a series of progressively increasing speeds on a treadmill as the test protocol. Early research with the Caltrac accelerometer reported that the output tended to underestimate energy expenditure (EE) at lower intensities but overestimate EE at higher intensities (1), but subsequent studies noted a tendency for overestimations across a full range of speeds (14,19,31). The equivocal findings with the Caltrac are likely due to the fact that the Caltrac was factory calibrated based on walking.

Similar calibration research has been conducted on the series of second-generation devices that were released after the Caltrac. Freedson et al. (17) conducted a laboratory study on the ActiGraph, formerly known as the Computer Science and Applications (CSA) and Manufacturing Technology Inc. (MTI) (ActiGraph, LLC, Fort Walton Beach, FL), using three different locomotor speeds. Similar paces and protocols were employed by Nichols et al. (30) in a study on the Tritrac-R3D monitor, now sold as the RT3 Triaxial Research Tracker (StayHealthy, Inc., Monrovia, CA), and by Welk et al. (46) in research with the ActiTrac (IM Systems, Baltimore, MD) and the BioTrainer (IM Systems) monitors. The results of these laboratory calibration studies supported the validity of these different accelerometers. The monitors generally exhibited strong associations between activity counts and measured EE (V̇O2) across a range of intensities. The variance accounted for from the regression analyses (R2) ranged from 0.82 to 0.93 for these four monitors, while the SEE values ranged from 1.0 to 1.4 kcal·min−1. A number of other studies have been conducted with other monitors and populations using a variety of criterion measures; the three studies highlighted here demonstrate the typical measurement characteristics that result from laboratory calibration equations.
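The regression statistics reported in these laboratory studies (R2 and SEE) can be illustrated with a minimal sketch of an ordinary least-squares calibration of EE on activity counts. The function name and the data values below are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
import math

def fit_linear_calibration(counts, ee):
    """Ordinary least-squares fit of EE = intercept + slope * counts.

    Returns (slope, intercept, r_squared, see), where see is the
    standard error of estimate with n - 2 degrees of freedom.
    """
    n = len(counts)
    mean_x = sum(counts) / n
    mean_y = sum(ee) / n
    sxx = sum((x - mean_x) ** 2 for x in counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(counts, ee))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(counts, ee))
    total_ss = sum((y - mean_y) ** 2 for y in ee)
    r_squared = 1.0 - residual_ss / total_ss
    see = math.sqrt(residual_ss / (n - 2))
    return slope, intercept, r_squared, see

# Hypothetical counts per minute and measured EE (kcal/min); illustrative only.
demo_counts = [500, 1500, 3000, 4500, 6000, 8000]
demo_ee = [2.1, 3.4, 4.9, 6.8, 8.1, 10.6]
slope, intercept, r_squared, see = fit_linear_calibration(demo_counts, demo_ee)
print(f"EE = {intercept:.3f} + {slope:.5f} * counts, "
      f"R2 = {r_squared:.3f}, SEE = {see:.3f}")
```

Because each participant usually contributes several data points, a fit of this kind treats repeated observations as independent; the mixed model approaches discussed later in the paper address that limitation.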

In any calibration study, the specific characteristics of the sample population or unique aspects of the design may lead to biased R2 values so it is essential to conduct cross-validation research with independent samples. Cross-validation results with the Freedson et al. and Nichols et al. equations were not reported in the original studies but both equations were tested in a cross-validation study comparing several monitors (47).

EE estimates from the ActiGraph were not significantly different from the measured values, but the Tritrac-R3D was found to overestimate at the higher intensities. However, Bland–Altman graphs revealed that the individual error of estimation was actually lower for the Tritrac-R3D than for the ActiGraph (the Tritrac-R3D also had a higher R2 and a lower SEE). Bland–Altman plots such as these have been shown to be quite helpful in this type of research because they allow the errors to be graphically displayed across the full range of values. The observation of large variability in individual estimations has been reported in a number of other studies, and this pattern has led to the general conclusion that accelerometry-based activity monitors can provide accurate data for group comparisons but may not yield precise indicators for individual determinations (45).
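The quantities displayed in a Bland–Altman plot can be computed as in the following sketch. It assumes the conventional definition of the 95% limits of agreement (mean difference ± 1.96 SD of the differences), which is not spelled out in the text above; the function name and data are hypothetical.

```python
import statistics

def bland_altman(predicted, measured):
    """Return the plot coordinates and summary statistics for a Bland-Altman analysis.

    means : per-participant average of predicted and measured values (x axis)
    diffs : per-participant predicted minus measured values (y axis)
    bias  : mean difference (group-level error)
    limits: conventional 95% limits of agreement (individual-level error)
    """
    diffs = [p - m for p, m in zip(predicted, measured)]
    means = [(p + m) / 2 for p, m in zip(predicted, measured)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits

# Hypothetical predicted and measured EE (kcal/min); illustrative only.
demo_predicted = [3.2, 4.8, 6.1, 7.9, 9.4]
demo_measured = [3.0, 5.1, 5.6, 8.4, 9.0]
_, _, demo_bias, demo_limits = bland_altman(demo_predicted, demo_measured)
print(f"bias = {demo_bias:.2f}, limits of agreement = "
      f"({demo_limits[0]:.2f}, {demo_limits[1]:.2f})")
```

A small bias with wide limits of agreement corresponds to the pattern described above: acceptable accuracy for group comparisons but limited precision for individuals.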

Most calibration studies have assumed that the output from accelerometers is linearly related to EE. Several recent studies have tested this assumption by examining the responsiveness of various accelerometers with higher intensities or paces. A study by King et al. (24) tested the accuracy of EE calibration equations across three walking speeds and four running speeds. They observed that the BioTrainer and ActiGraph were not linearly related to movement at the higher running speeds (above 161 m·min−1 or 9.66 km·h−1). The Tritrac-R3D and SenseWear PRO2 (Body Media, Inc., Pittsburgh, PA) exhibited better linearity through the full range of motion. The Tritrac-R3D was found to yield the most accurate estimates at the running speeds, while the SenseWear PRO2 had the best overall accuracy. Brage et al. (7) also reported the leveling off of the ActiGraph monitor during higher intensity movements. The ActiGraph output increased linearly until 9 km·h−1 and then leveled off, causing the ActiGraph to underestimate V̇O2 by 11% at 10 km·h−1 and by 48% at 16 km·h−1. The increasing error has been attributed to wider interindividual variability in gait mechanics and economy of motion at higher intensities of activity as well as to more pronounced effects from anthropometric variables, such as leg length. The use of curvilinear functions may provide a better fit for this type of data and provide more accurate estimations across the full range of exercise intensities. Collectively, these results highlight some of the factors that contribute error to individual- and group-level calibration studies.
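To illustrate why a curvilinear function may fit plateauing output better than a straight line, the sketch below compares linear and quadratic least-squares fits on made-up data in which counts level off above about 9 km·h−1. The data are hypothetical and only mimic the general pattern described above; the small normal-equations solver is written out so that the example needs nothing beyond the standard library.

```python
def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations (low degree only).

    Returns coefficients c such that y ~ sum(c[i] * x**i).
    """
    size = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(size)]
         for i in range(size)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs

def residual_sum_of_squares(x, y, coeffs):
    """Sum of squared differences between observed y and the fitted polynomial."""
    return sum((yi - sum(c * xi ** i for i, c in enumerate(coeffs))) ** 2
               for xi, yi in zip(x, y))

# Hypothetical speeds (km/h) and counts (thousands per minute); illustrative only.
demo_speed = [3, 5, 7, 9, 11, 13, 16]
demo_kcounts = [1.0, 2.6, 4.9, 7.4, 8.0, 8.2, 8.1]
rss_linear = residual_sum_of_squares(
    demo_speed, demo_kcounts, fit_polynomial(demo_speed, demo_kcounts, 1))
rss_quadratic = residual_sum_of_squares(
    demo_speed, demo_kcounts, fit_polynomial(demo_speed, demo_kcounts, 2))
print(f"RSS linear = {rss_linear:.2f}, RSS quadratic = {rss_quadratic:.2f}")
```

A lower residual sum of squares is expected for the quadratic model because it nests the linear one, so the comparison only becomes meaningful when the improvement is tested formally or checked in an independent cross-validation sample.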

Field-based research.

Because laboratory studies cannot provide a true evaluation of how accelerometers perform under real-world conditions, many studies also have examined the validity of accelerometers in the field. The lack of a perfect criterion measure has made it difficult to draw definitive conclusions from these studies. Most studies have relied on self-report measures as a comparison (26,28,40), but the challenge here is the lack of precision in the self-report measure. Investigators also have made comparisons using doubly labeled water (DLW) (4,11,18,22), but the inability to parse the DLW data by time has limited the utility of these comparisons.

A number of studies have been conducted using portable metabolic units to evaluate the validity of accelerometry for assessing common free-living activities (2,8,10,20,47). Participants in these studies have typically performed a variety of lifestyle tasks (e.g., shoveling, raking, sweeping) and/or recreational activities such as golf. The accelerometers tested in these studies have consistently underestimated the energy cost of these lifestyle tasks. The underprediction of EE in these studies is generally expected because of the documented inability of waist-worn accelerometry-based devices to monitor upper body movements. These results, however, helped to document the amount of error that occurs when assessing various free-living activities. Although the point estimates for specific activities were fairly inaccurate, total error across a longer time period would become significant only to the extent to which these lifestyle activities contribute to a person's total daily expenditure.

Some investigators have used lifestyle activities in calibration studies to try to capture the variable responsiveness of accelerometers to these movements (20,42). This concept has intuitive appeal but inherent limitations as well. The equations may accurately predict group level EE, but the R2 values tend to be lower than with locomotor-based equations and there is considerable individual error about the mean. The equations may be useful if individuals are specifically performing the activities used in the calibration study, but use in free-living participants would tend to result in systematic overestimation of activity levels. This is because the activities used in the calibration of the device are not fully captured with the accelerometer (i.e., the accelerometer records fewer counts than would be recorded for other locomotor activities with the same energy cost). The extreme overestimation of activity levels from the use of these lifestyle-based calibration studies has been noted in several studies that have compared various cut points with the ActiGraph (39,41).

Several recent calibration studies have used a combination of locomotor activities and other free-play activities to better simulate the diverse types of activities that youths participate in (35,43). Incorporating other free-living activities in this type of design better reflected the diverse activity patterns of youths, but the relationship between movement counts and EE for these activities is so different from that of the locomotor activities that a single equation may still not accurately characterize the data. In a study by Puyau et al. (35), the Actiwatch (Mini Mitter Co., Bend, OR) and the Actical (Mini Mitter Co.) accounted for over 76% of the variance in measured EE, but the authors suggested that, because of the relatively wide confidence intervals, the equations should only be used for group comparisons and not for individual determinations. In a study by Treuth et al. (43), the investigators determined that an unacceptable number of misclassifications occurred when the equation included data collected while participants were doing aerobics and bicycling. The final equation was developed based on the transition between slow and brisk walking because these activities provided clear distinctions between levels of physical activity. This approach was found to yield fewer misclassifications than approaches that incorporated the diverse activities.

Summary of value calibration research.

Accelerometers are clearly responsive to different intensities of physical activity, but fundamental challenges remain with respect to converting counts into meaningful outcome data. The essential problem is that there are unique relationships between movement and EE for different activities. Equations based on locomotor patterns will tend to underestimate EE (and physical activity levels) from lifestyle activities, while equations based on lifestyle physical activities will tend to overestimate typical activity levels. It is unlikely that a single equation or algorithm can be developed to accurately capture all activities that an individual can perform so alternative strategies may be needed to accurately assess the diverse range of activities that people perform in their daily lives. The use of full room calorimeters, as demonstrated by Puyau et al. (35), may provide the most effective way to capture the appropriate balance between locomotor activities and free-living activities in a noninvasive manner, but additional work is needed to improve the precision of the equations.

Unit Calibration (Reliability) Research

Unit calibration refers to the internal reliability of the accelerometer sensors across multiple units. It is a well-established tenet of science that reliability is an essential prerequisite for validity, but much less research has been conducted on the reliability of accelerometers. Early research with accelerometers tended to assume that units of a given monitor type provide similar information. Although not specified in many cases, researchers interested in validity issues may have chosen to use a single unit for all measurements in order to reduce extraneous error that may have affected the results. This design may have helped to improve the internal validity of the particular study but this design would clearly threaten the external validity of the results. Poor generalizability of equations or cut points developed in these studies may explain some of the equivocal findings that have been reported with accelerometers in past research. The following section briefly summarizes research on the reliability of accelerometers and discusses other factors that must be considered in calibrating accelerometers.

Reliability across units.

Most companies perform some form of calibration check as part of a quality assurance check before shipping. The goal is to ensure that the devices are each measuring and summarizing raw accelerometer data in a similar way. Because this type of calibration is rarely described by the manufacturers, research teams began to incorporate calibration steps into their research or to conduct separate calibration studies to check reliability. The most common design has been to use a mechanical device that provides a standardized amount of acceleration. Studies with the Tritrac-R3D have generally demonstrated good interinstrument reliability (intraclass correlation coefficients (ICC) of r = 0.97) (25) and acceptably small coefficients of variation of 1.8% (30). A study by Metcalf et al. (29) evaluated the reliability of the original CSA ActiGraph monitor using a turntable device. They reported low intrainstrument coefficients of variation (1.83%) and slightly higher interinstrument coefficients of variation of 5%. Research by Brage et al. (5) advanced reliability research by evaluating responses across a wider range of simulated vertical accelerations in a mechanical device. The results confirmed that variability between ActiGraph units is greater than variability within units. They reported that each unit was significantly different from the mean value from all the units and also noted that there was considerable heteroscedasticity across units in the response to different accelerations.

An alternative design for reliability research has been to compare outputs from two units on opposite sides of the body. A study on the Tritrac-R3D (30) reported ICC values ranging from R2 = 0.73 to R2 = 0.87 for two different Tritrac-R3D monitors worn during free-living activity. A laboratory study (46) reported average ICC values of R2 = 0.66 and R2 = 0.84 for the ActiTrac (IM Systems) and the BioTrainer monitors, respectively, across a range of paces. A youth calibration study with the ActiGraph yielded average ICC values of 0.87 (44). Overall, the reliability values for different monitor types appear to be fairly consistent in these types of comparisons. An interesting observation by Jakicic et al. (21) is that reliability appears to depend on the type of activity being assessed. They reported higher ICC for walking and running (range: R2 = 0.76–0.92) than for the stepping, sliding, or cycling tasks (range: R2 = 0.54–0.88).

A limitation in past reliability research is that it has not been possible to determine the source of error influencing these devices. A recent study (49) employed generalizability theory to allow the partitioning of variance due to different sources of error. This study examined the interunit variability (i.e., differences between units of a given monitor) and interindividual variability (i.e., differences between individual responses to a given monitor) of four commercially available activity monitors (Tritrac-R3D, ActiGraph, BioTrainer, and the Actical) for a structured bout of physical activity. The results provided detailed information on the sources of error that may influence the accuracy of accelerometer data. The ActiGraph monitor had the lowest amount of variance across monitors, the lowest between-trial variance, and the highest overall G value (an index of overall reliability). A limitation in this study is that it was not possible to evaluate higher intensity activities or free-living activities. As mentioned previously, several studies (6,24) have demonstrated the nonlinearity of accelerometer output at higher intensities of activities. Differences in accelerometer filtering at higher intensities also may influence accelerometer output. Reliability needs to be examined across a wider range of intensities and activities to better characterize these responses.

Positional influences and mechanical properties.

The output from an accelerometer is dependent on the positioning on the body as well as the inherent mechanical properties of the sensor. Studies have demonstrated that accelerometers can be calibrated for different positions on the body (3). For example, monitor output at the hip, wrist, leg, and ankle has been compared in some studies. Little evidence suggests that one position is better than another, so pragmatic guidelines, such as comfort and ease of use, have taken precedence. The most common position for accelerometers is the hip, but accelerometer output may vary even with position about the hip. In a small pilot study (23), monitors were worn at three different positions about the hip to determine whether multiple monitors could be worn on the same hip without systematically biasing the data. The results revealed that position significantly influenced the results with the ActiGraph but not the results with the BioTrainer or the Tritrac-R3D. A possible explanation for this difference is that each device uses different numbers and orientations of the accelerometers within the devices. The ActiGraph is a vertical accelerometer; the BioTrainer is considered uniaxial but also bidirectional because the accelerometer is positioned 45 degrees to vertical. The Tritrac-R3D is a three-dimensional monitor that incorporates signals from three planes of movement. The results of this pilot study suggest that position may be more of a factor for the vertically oriented accelerometer used in the ActiGraph than for the bidirectional (uniaxial) or three-dimensional monitors. Variability due to positioning can be reduced with careful training and supervision, but this is more difficult when data collection is conducted over multiple days.

A topic that deserves further research is the potential advantage of three-dimensional monitors for interinstrument reliability. Researchers have demonstrated potential advantages of three-dimensional devices for capturing different types of activity (9), but these devices also may prove to have advantages for reliability. The Tritrac-R3D, for example, uses a composite vector magnitude value that reflects the combined value from three orthogonal accelerometers (vector magnitude = √(x² + y² + z²)). Different positions of the accelerometer would result in different contributions from the x, y, and z terms, but the overall magnitude is theoretically independent of accelerometer positioning. In contrast, uniaxial monitors are highly dependent on position. Different forces and movements during locomotor movements would result in different acceleration values picked up by the accelerometer. In the previously described generalizability study (49), we expected the Tritrac-R3D to exhibit a lower monitor × trial interaction term than the other devices, and this was found to be the case. The ActiGraph still yielded better overall results, but it is possible that other factors could have accounted for the results. For example, the ActiGraph monitors in the study were new, whereas the Tritrac-R3D monitors were previously used in field research. Documentation from the manufacturer of the current Tritrac-R3D monitor (RT3 Triaxial Research Tracker, StayHealthy, Inc., Monrovia, CA) points out that accelerometers in the original Tritrac-R3D were positioned by hand, so it is possible that the newer RT3 monitor or the three-dimensional TracMor (Triaxial Accelerometer for Movement Registration, Philips, Eindhoven, The Netherlands) device may outperform uniaxial devices. The Actical monitor features an “omnidirectional” accelerometer that was also expected to result in improved reliability in the generalizability study. The results revealed lower than expected reliability for the Actical, but this may have been due to the use of experimental Actical units. Additional work is needed to understand factors contributing to variability in accelerometer counts.
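The theoretical independence of the vector magnitude from monitor orientation can be checked numerically. The sketch below rotates a made-up instantaneous acceleration vector to mimic a tilted monitor and compares the vector magnitude with the reading on a single vertical axis. It is a simplification: it ignores the filtering and per-axis integration that real devices apply before counts are produced, and all values are hypothetical.

```python
import math

def vector_magnitude(x, y, z):
    """Composite magnitude of three orthogonal acceleration components."""
    return math.sqrt(x * x + y * y + z * z)

def rotate_about_x(vec, angle_deg):
    """Rotate (x, y, z) about the x axis, mimicking a monitor worn at a tilt."""
    x, y, z = vec
    a = math.radians(angle_deg)
    return (x,
            y * math.cos(a) - z * math.sin(a),
            y * math.sin(a) + z * math.cos(a))

# Hypothetical acceleration in g, with z treated as the vertical axis.
upright = (0.2, 0.1, 1.3)
tilted = rotate_about_x(upright, 30)

print("vector magnitude:", vector_magnitude(*upright), vector_magnitude(*tilted))
print("vertical axis only:", upright[2], tilted[2])
```

The vector magnitude is unchanged by the rotation, whereas the single vertical component is not, which is consistent with the greater positional sensitivity reported for the vertically oriented accelerometer.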

Summary of unit calibration research.

Overall, the results presented here demonstrate that much is still unknown about positional and mechanical characteristics of accelerometers. These factors influence the output from accelerometers and, therefore, the reliability and validity of data collected with accelerometers. Additional work is needed to better understand these factors.


To obtain accurate information from accelerometers, it is important to control for as much error as possible. Careful planning of calibration studies and appropriate use of these equations in field-based research can help to improve the utility of accelerometers for assessing physical activity in different populations. This section describes recommended design characteristics for calibration studies and discusses different analytical techniques used to calibrate activity monitors.

Design Issues

Representative sample population.

Calibration equations developed for accelerometers will be accurate only to the extent that the sample population is representative of the overall population. The sample population must be similar in age, size, and behavioral patterns, but large and diverse enough to capture the natural variability that exists in the population. Specific calibration studies have been done on different youth populations including preschool children (13,37), elementary school youths (10,35,36,44,49), and adolescents (35,43), but at present it is not clear what factors should distinguish transitions from early childhood, from childhood into adolescence, and from adolescence into early adulthood. Changes in height, weight, body fatness, economy, and movement patterns occur during maturation, and all these have potential to influence accelerometer output. Recent studies have also demonstrated that MET values associated with various physical activities in youths are not independent of body weight or body fatness in youths (12). If measured MET values are used in the creation of these calibration equations, it may lead to systematic bias in equations for overweight and nonoverweight youths. Equations that include age and body size may be necessary to capture the age-related variability in movement and response characteristics of accelerometers. It is important for future calibration research to directly examine these unique influences in youths because MET values, as defined in the adult literature, are clearly not applicable to youths. If generalized equations are used in youths for group-level estimations, it is important to ensure that the population used in the calibration study is representative of the target population.

The use of accelerometers has not been as common in research with geriatric populations, but it is equally important to determine whether separate calibration equations are needed to characterize the activity patterns and movement characteristics of elderly populations. Demonstrated differences in muscle mass and gait suggest that separate equations may be needed, but it is not clear whether age or function should be used as the basis for calibration in this population. Age may not be an appropriate way to stratify the elderly population due to the large variability that exists in body composition and function for a given age. To develop stable calibration relationships, it is important to have large samples that are representative of the population to be studied. The following guidelines are suggested for future research:

  • Designs should include participants that are as representative as possible.
  • Analyses should examine moderating influences of age, body size, and body fatness on the results from the study.
  • The use of individual calibration equations (based on an individual's own data) will improve the accuracy of prediction equations but may not be feasible for large-scale studies.

Representative monitors.

Calibration equations developed for accelerometers will be accurate only to the extent to which the monitors are representative of the overall population of monitors used in subsequent research. It is essential to employ multiple monitors when developing calibration equations. As previously described, output from monitors can vary by 10–20% for a standardized bout of activity. A single monitor may not be representative of all monitors, so group-level data are more likely to be accurate if the sample of monitors used in a study is representative of the population of monitors. The use of multiple monitors essentially allows variability in individual monitors to somewhat cancel out. To avoid any systematic bias in results, it is important to counterbalance or randomize the monitors across participants.

Accelerometers are increasingly being used for surveillance research (15,32) as well as for outcome measures in intervention research. Because these designs involve the use of multiple monitors on large samples of people and/or data collection across multiple time points, it is important to control for as much error as possible. The following guidelines are suggested for future research:

  • Surveillance or descriptive studies employing large numbers of monitors should report the coefficient of variation (CV = SD/mean) for the monitors for a standardized movement to demonstrate the overall variability of the monitors.
  • Researchers using accelerometers over time (e.g., in intervention studies) should include a monitor calibration check before and after using the devices in the field. Guidelines should be established to detect monitors that are not yielding counts within the expected level of error.
  • Statistical corrections should be employed when possible to remove error due to monitor variability. A recent report from the European Youth Heart Study used the accelerometer unit number as a covariate to control for differences between units (38). This statistical adjustment removes unwanted variability and increases the power to detect group differences or to test links between physical activity and other outcomes.
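The first two guidelines can be sketched in code as follows. The coefficient of variation follows the definition given above (CV = SD/mean); the use of the sample SD, the screening rule based on percentage deviation from the group mean, the 10% tolerance, and the unit labels are all assumptions made for illustration, since no specific criterion is prescribed here.

```python
import statistics

def coefficient_of_variation(values):
    """CV = SD / mean, expressed as a percentage (sample SD assumed)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

def flag_outlying_units(unit_counts, tolerance_pct):
    """Return IDs of units whose counts deviate from the group mean
    by more than tolerance_pct percent (hypothetical screening rule)."""
    group_mean = statistics.mean(unit_counts.values())
    return sorted(
        unit for unit, counts in unit_counts.items()
        if abs(counts - group_mean) / group_mean * 100.0 > tolerance_pct
    )

# Hypothetical counts from a standardized mechanical trial; illustrative only.
demo_units = {"unit_01": 4010, "unit_02": 3950, "unit_03": 4105, "unit_04": 4720}
print(f"interinstrument CV = "
      f"{coefficient_of_variation(list(demo_units.values())):.1f}%")
print("units to recheck:", flag_outlying_units(demo_units, tolerance_pct=10.0))
```

Running the same check before and after field deployment gives a simple record of whether any unit has drifted outside the expected level of error.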

Representative activities and sampling.

Calibration equations developed for accelerometers will be accurate only to the extent to which the activities performed are representative of the type and intensities of those performed by the target population. Because activity patterns vary considerably in the population, it has proven difficult to determine appropriate sets of activities that capture a variety of activity levels while still emphasizing the most common movements. Early research using treadmill-based protocols was questioned because they included a limited range of activities and did not capture lifestyle activities. Field-based research techniques made it possible to include more diverse types of activities in calibration studies, but these approaches have not demonstrated as much promise as expected. As previously described, equations based on field-based activities would be accurate only for periods in which those activities were performed. Because the predominant activity in a person's day is locomotor activity (i.e., walking), calibration equations should emphasize locomotor movements or free-play activities that involve locomotor movements in the study protocol. Activities involving large upper body movements are not captured by accelerometers worn on the hip, so it is not practical to develop calibration equations based on these movements. Approaches based on pattern recognition may be able to detect different types of activities and employ activity-specific equations. However, if the goal is to develop a single prediction equation for estimating activity or EE, the following guidelines are suggested:

  • Accelerometers should be calibrated using activities that can be accurately captured with accelerometry (i.e., primarily locomotor-based movements).
  • Free-living activities are preferable to lab-based protocols on treadmills when possible, but constraints with equipment and space often dictate range of movement.
  • Intermittent activities are preferable to steady-state activities because activities in life are rarely steady state. For example, participants could be observed during basketball play or while working in an office area.

Analytic Issues

In recent years, a number of important methodological advances have occurred in the analytical techniques used to process and interpret accelerometer data. One of the major advances is the use of mixed modeling approaches for calibrating accelerometers. Another innovation involves the use of ROC curves. Both statistical approaches are briefly described.

Mixed model approaches for calibration.

In most calibration studies, participants perform multiple bouts of activity and the accelerometer values are related to some criterion measure to determine the calibration equation or cut point. Traditional design approaches that use multiple data points for each individual technically violate the independence assumption of multiple regression. Mixed modeling allows the repeated nature of data (i.e., multiple data points for each individual) to be modeled in the analyses. The study by Treuth et al. (43), described earlier, is one of the first studies to employ a mixed model (random coefficient) design to determine cut points for moderate to vigorous physical activity. In these analyses, individual slopes and intercepts are computed for each participant completing the designated tasks, but the average values are used for the group-level equation. Quadratic and cubic trends were examined in higher level models. Possible moderating influences from age, body mass index (BMI), and race were also tested to determine whether they contribute to the fit of the model. The flexibility and interpretability afforded by mixed model designs are a significant advance in research design for calibration studies.
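The idea of computing individual slopes and intercepts and then using their averages for the group-level equation can be sketched with a simple two-stage procedure. This is not a fitted mixed model: a true random coefficient model estimates the individual and group-level parameters jointly and weights participants by the precision of their data. The two-stage version below is only an approximation intended to show the logic, and the participant data, function names, and MET threshold are hypothetical.

```python
def ols(x, y):
    """Ordinary least-squares slope and intercept for one participant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def two_stage_calibration(data):
    """data maps participant ID -> list of (counts, METs) pairs.

    Stage 1: fit a separate line for each participant.
    Stage 2: average the individual slopes and intercepts.
    """
    fits = [ols([c for c, _ in bouts], [m for _, m in bouts])
            for bouts in data.values()]
    mean_slope = sum(s for s, _ in fits) / len(fits)
    mean_intercept = sum(i for _, i in fits) / len(fits)
    return mean_slope, mean_intercept

def cut_point(slope, intercept, met_threshold):
    """Counts value at which the group-level line crosses a MET threshold."""
    return (met_threshold - intercept) / slope

# Hypothetical repeated measures for three participants; illustrative only.
demo_data = {
    "p1": [(400, 2.0), (1800, 3.6), (3500, 5.9), (5200, 8.1)],
    "p2": [(350, 1.8), (2100, 4.2), (3900, 6.0), (5600, 8.8)],
    "p3": [(500, 2.3), (1600, 3.3), (3200, 5.2), (4900, 7.4)],
}
group_slope, group_intercept = two_stage_calibration(demo_data)
print("cut point for 3 METs:",
      round(cut_point(group_slope, group_intercept, 3.0)))
```

Unlike a pooled regression on all data points, this approach respects the fact that observations are clustered within participants, although it does not provide the variance components or covariate tests that a mixed model offers.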

ROC curves.

ROC curves provide an alternative approach for establishing accelerometer cut points. A ROC curve is essentially a graphical representation of the relative trade-offs between sensitivity and specificity for different cut points. Indices computed from these graphs provide an empirical basis for determining the most appropriate cut point to minimize misclassification.

ROC curves are frequently used in clinical research to evaluate the effectiveness of diagnostic tests for different diseases. The goal in clinical research is to establish a threshold value that can accurately identify individuals at risk (true positives) without unnecessarily flagging individuals not at risk (false positives). A number of studies have employed ROC curves to evaluate the utility of BMI cut points for detecting adiposity in different populations, but the use of ROC curves in accelerometry-based research has been limited.

Conceptually, the ROC approach offers the same advantages as for other clinical research because it provides a more empirical basis for selecting effective cut points. The essential challenge is to determine a threshold that accurately captures “physical activity” (true positives) without capturing “inactivity” (false positives). Sensitivity (1 minus the false-negative rate) is typically plotted on the y axis, whereas the false-positive rate (1 minus specificity) is typically plotted on the x axis. A good diagnostic test is one that has low false-positive and false-negative rates across a reasonable range of cutoff values (Fig. 1). A bad diagnostic test is one where the only cutoffs that make the false-positive rate low also create a high false-negative rate (Fig. 1). The distribution of points on the ROC curve provides a graphical description of these characteristics. A diagnostic test demonstrating good discrimination properties would be one whose curve rises steeply up the y axis. This means that sensitivity is already high while the false-positive rate is still low. A diagnostic test with poor discrimination properties would be one whose curve follows a diagonal path from the lower left corner to the upper right corner. This indicates that any reduction in the false-positive rate is matched by a corresponding increase in the false-negative rate. The case study presented below will illustrate the utility of ROC curves for research with accelerometers.

FIGURE 1—Hypothetical ROC curves. The dark line shows the best case scenario (perfect sensitivity and specificity) and the dotted line shows the worst case scenario (any increase in sensitivity is matched by a reduction in specificity).
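The points that make up a ROC curve can be computed directly from epoch-level data. The sketch below assumes that an epoch is classified as active when its count is at or above the candidate cut point, and that a criterion measure supplies a true/false label for each epoch. The function name and the sample values are hypothetical.

```python
def roc_points(counts, active, cut_points):
    """Return (false_positive_rate, sensitivity) for each candidate cut point.

    counts: accelerometer counts per epoch
    active: True where the criterion measure coded the epoch as active
    An epoch is classified as active when its count is >= the cut point.
    """
    n_pos = sum(active)
    n_neg = len(active) - n_pos
    points = []
    for cut in cut_points:
        tp = sum(1 for c, a in zip(counts, active) if a and c >= cut)
        fp = sum(1 for c, a in zip(counts, active) if not a and c >= cut)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical epochs: counts with criterion labels.
counts = [10, 40, 90, 150, 200, 260, 310, 400]
active = [False, False, False, True, False, True, True, True]
points = roc_points(counts, active, cut_points=[0, 100, 180, 250, 500])
```

Plotting the second element of each pair against the first gives the curve. A cut point of 0 classifies every epoch as active (upper right corner), and a cut point above the largest count classifies none as active (lower left corner).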

Calibration Case Study

A case study is used to compare results obtained with different calibration approaches. The data for this case study were originally analyzed using multiple regression techniques (48). However, as described above, the use of repeated measures in the design technically violates the independence assumption in multiple regression. The analyses here will compare a mixed modeling approach with analyses done using ROC curves. To date, the use of ROC curves in accelerometer research has only been reported in a few studies (35,37). In both cases, it was used primarily to check the sensitivity and specificity but not to generate the most appropriate cut points. The value in an ROC analysis found to maximize classification agreement can also be used to justify the most appropriate cut point, and this is the approach demonstrated here. The primary goals of this case study example are to highlight some of the design features described earlier and to demonstrate how results of calibration studies may vary depending on the type of analyses employed.


The participants in the study were 30 children ages 8–12 yr old who were enrolled in a summer camp. Participants were fitted with an ActiGraph accelerometer (programmed at 5-s epochs) and completed two different data collection sessions to allow for both calibration and cross-validation in the same study. The first session (used for calibration) included seven semistructured bouts of activity performed individually for 2 min each (sitting, standing and dribbling, walking, walking and dribbling, walking and jogging, jogging and dribbling, and jogging). The second session (used for cross-validation) was an unstructured bout of activity performed in a group setting for approximately 10 min. All sessions were videotaped and coded in the laboratory using an established direct observation technique as the criterion measure.

The study was designed in general accordance with the principles described above. Participants from a range of ages were used to allow the cut points to be usable for a wide range of elementary school children (representative sample). Four different accelerometer units were randomly employed in the study to increase the generalizability of the results (representative monitors). The activities in the study involved movements that would be common during free-play situations in youths (representative activities). Most were performed intermittently, but some continuous activities (e.g., walking and jogging) were also used because these locomotor movements are the movements that accelerometers are best suited to capture. Limitations of the design included the small number of participants (N = 30) and the small number of monitors (N = 4). The size of the sample was somewhat limited by the time-intensive nature of the direct observation; however, the number of monitors used was probably appropriate for the number of participants involved.

Data processing and analyses.

The direct observation was performed using a computerized behavioral observation tool (Behavioral Evaluation System and Taxonomy) that allows users to program the keyboard to code data in “real time.” For this study, the keyboard was programmed to allow coding with a modified version of the Children's Activity Rating Scale (CARS) developed by Puhl et al. (34). The original CARS uses a 5-point classification system that makes distinctions between hard and very hard intensities. This proved too difficult for real-time coding, so the two highest intensity categories were collapsed into a single code. The coding levels were operationalized as 1 = sit (rest), 2 = stand (light), 3 = walk (moderate), and 4 = run (vigorous).

During the coding session, the observer would code the specific type of activity that was taking place by pressing the corresponding key on the keyboard (all observations were done by a single observer). At the end of the observation, the observation data were exported into Microsoft Excel. Customized Visual Basic routines were used to compute the average activity level over the same time intervals used in the activity monitors. For these calculations, each observation value (1, 2, 3, 4) was multiplied by the time spent in the given intensity level, and the sum was divided by the total time in the interval to compute the average intensity level. The resulting observation values ranged from 1 to 4, with higher values reflecting periods with higher amounts of physical activity. Data were processed at multiple time intervals (e.g., 5, 15, 30, and 60 s) to compare the effect of epoch length, but these results are not reported here.
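The time-weighted averaging described above can be illustrated with a short sketch. It is not the Visual Basic routine used in the study. The function name and the coded segments are hypothetical, and the sketch assumes that the coded segments cover the whole interval.

```python
def average_intensity(segments, epoch_start, epoch_length):
    """Time-weighted mean observation code within one interval.

    segments: list of (start_s, end_s, code) with code in {1, 2, 3, 4}
    Each code is multiplied by the time spent at that level inside the
    interval, and the sum is divided by the interval length.
    """
    epoch_end = epoch_start + epoch_length
    weighted = 0.0
    for start, end, code in segments:
        overlap = min(end, epoch_end) - max(start, epoch_start)
        if overlap > 0:
            weighted += code * overlap
    return weighted / epoch_length

# Hypothetical 10 s of coded observation: stand, walk, run, sit.
segments = [(0, 2, 2), (2, 5, 3), (5, 9, 4), (9, 10, 1)]
first = average_intensity(segments, 0, 5)   # (2*2 + 3*3) / 5
second = average_intensity(segments, 5, 5)  # (4*4 + 1*1) / 5
```

Changing `epoch_length` to 15, 30, or 60 produces values for the longer intervals from the same coded record.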

Analyses of the data were conducted using standard multiple regression, mixed model regression, and ROC approaches to examine how results might vary with different calibration procedures. A key determination in establishing a cut point is selecting an appropriate operational definition. The present study was designed to capture children's intermittent activity patterns. A direct observation code of 3.0 would designate steady-state walking, so for this study, an active minute was operationalized as one with an observation value of 2.5 or higher. This level of activity would necessitate that a child was active for at least half of the time period.


Standard multiple regression analyses yielded an R² of 0.59 and an SEE of 0.44 observation units (Fig. 2). The overall equation was y = 1.94 + (0.00194 × ActiGraph 5-s count). Solving this equation for observation values of 2.5 yields accelerometer cut points of 288 for 5 s or 3456 for 60 s. Mixed model regression approaches can account for the fact that there are multiple data points for each individual. An unconditional linear growth model was used for this analysis with slopes and intercepts of individual calibration curves considered as random effects and average intercepts and slopes considered to be fixed effects. The equation was similar to the one for the standard regression analyses: y = 1.68 + (0.002815 × ActiGraph 5-s count). Solving this equation for observation values of 2.5 yields accelerometer cut points of 291 for 5 s or 3492 for 60 s.

FIGURE 2—Graph from standard multiple regression analyses using temporally matched data (5-s intervals) from the ActiGraph and the direct observation values.
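The arithmetic behind the regression-based cut points can be checked with a few lines of code. The sketch below assumes that each slope multiplies the ActiGraph 5-s count, that the solved value is truncated to a whole count (which reproduces the reported figures), and that the 60-s value is 12 times the 5-s value. The function name is hypothetical.

```python
def cut_point(intercept, slope, threshold=2.5, epochs_per_minute=12):
    """Solve threshold = intercept + slope * counts for counts.

    Returns the 5-s cut point truncated to a whole count and the same
    value scaled to 60 s (12 five-second epochs per minute).
    """
    per_epoch = int((threshold - intercept) / slope)
    return per_epoch, per_epoch * epochs_per_minute

standard = cut_point(1.94, 0.00194)   # standard multiple regression
mixed = cut_point(1.68, 0.002815)     # mixed model regression
```

Here `standard` evaluates to (288, 3456) and `mixed` to (291, 3492), matching the values reported in the text.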

As described, the ROC approach uses a very different strategy for determining cut points. Essentially, the ROC process determines the cut point that yields the fewest misclassifications, that is, the point on the curve offering the best combination of sensitivity and specificity. The ROC curve for the present data yielded an optimal cut point at 181 counts for 5-s data (2172 for 60-s data). The reported sensitivity and specificity were 95.9% and 87.6%, respectively (Fig. 3). The area under the curve was 0.957. An area of 1.0 would represent perfect discrimination because some cut point would achieve 100% sensitivity and 100% specificity. For illustration purposes, a curve yielding an area of 0.5 follows the diagonal, so that any gain in sensitivity is matched by an equal loss in specificity. Operationally, this would indicate that classification with the monitor is no better than flipping a coin. The large area under the curve for the present graph indicates that the cut points based on the sample data would likely be effective in characterizing activity levels.

FIGURE 3—Plots from the ROC curve analyses used for detecting physical activity for the ActiGraph (labeled as CSA in plot). A. The trade-off between sensitivity (y axis) and 1 – specificity (x axis). B. The discrimination of data points above and below the threshold of 2.5 for the selected cut point value of 181.
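A minimal sketch of the two quantities discussed above, the misclassification-minimizing cut point and the area under the curve, is given below. It assumes that "optimal" means the fewest misclassified epochs among a set of candidate cut points. Other criteria exist, and the software and criterion used in the study are not specified here. The area is computed as the proportion of active/inactive epoch pairs in which the active epoch has the higher count. Function names and data are hypothetical.

```python
def best_cut_point(counts, active, candidates):
    """Pick the candidate cut point with the fewest misclassified epochs."""
    def errors(cut):
        # An epoch is misclassified when (count >= cut) disagrees with the label.
        return sum((c >= cut) != a for c, a in zip(counts, active))
    return min(candidates, key=errors)

def area_under_curve(counts, active):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [c for c, a in zip(counts, active) if a]
    neg = [c for c, a in zip(counts, active) if not a]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical 5-s epochs with criterion labels.
counts = [10, 40, 90, 150, 170, 200, 260, 310, 400]
active = [False, False, False, True, True, False, True, True, True]
best = best_cut_point(counts, active, candidates=[50, 100, 180, 250])
auc = area_under_curve(counts, active)
```

Note that the area describes the discrimination of the monitor across all cut points, whereas the selected cut point is a single location on the curve.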

To evaluate the utility of the three sets of cut points, the values were applied to the free-play activity in session 2. The classification agreement, sensitivity, and specificity are shown in Figure 4. The regression approaches yielded higher specificity values but relatively low sensitivity values. The high specificity indicates that these cut points are not likely to misclassify inactivity as activity, but the low sensitivity suggests that they often fail to detect activity. The ROC analyses yielded a lower cut point of 2172 counts per minute, selected as the point that leads to the minimal amount of misclassification. With a lower cut point value, more minutes would count as “activity,” which would reduce the number of active periods misclassified as inactivity. The resulting cross-validation with the data from the free-play activity supports this expectation. The specificity values were lower (77%), but the sensitivity of the cut point was higher (64%).

FIGURE 4—Cross-validation analyses reporting the classification agreement and kappa values for detecting activity and inactivity using the unstructured free-living data. A. Results from standard linear regression model. B. Results with mixed model regression. C. Results for ROC curves.
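The cross-validation statistics can be computed from a simple 2 × 2 classification table. The sketch below applies two of the 60-s cut points from the case study (3456 and 2172) to invented minute-level data. The function name, the counts, and the criterion labels are hypothetical and do not reproduce the study results.

```python
def classification_summary(counts, active, cut):
    """Agreement, sensitivity, specificity, and Cohen's kappa for one cut point."""
    tp = fp = tn = fn = 0
    for c, a in zip(counts, active):
        predicted = c >= cut
        if predicted and a:
            tp += 1
        elif predicted and not a:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    agreement = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "agreement": agreement,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (agreement - expected) / (1 - expected),
    }

# Hypothetical free-play minutes: counts per minute and criterion labels.
counts = [500, 1500, 2300, 2600, 3000, 3600, 4000, 5000]
active = [False, False, False, True, True, True, True, True]
regression = classification_summary(counts, active, cut=3456)
roc = classification_summary(counts, active, cut=2172)
```

In this invented example the higher cut point gives perfect specificity but misses two active minutes, while the lower cut point detects every active minute at the cost of one false positive, the same kind of trade-off described in the text.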


This case study illustrates several basic points. The two regression approaches yielded almost identical cut points. This suggests that little bias is introduced by the use of multiple data points in standard regression procedures. Potential bias due to violation of the independence assumption may be minimized because each participant contributes an equal number of data points. The flexibility of the mixed model approach still has considerable merit relative to multiple regression techniques. The other observation was that the ROC curve yielded a more liberal cut point for the same threshold definition of physical activity (an observation value of 2.5 in this case). The resulting cross-validation demonstrates that gains in sensitivity are offset by losses in specificity. The decision regarding what type of cut point to use may depend on determining the most acceptable type of error for a particular research application.

Recommendations for additional research.

Research on calibration of activity monitors has progressed greatly in the past 10 yr, but additional work is clearly needed. New developments and technologies will likely help to resolve some of the current challenges described above. One of the most significant advances is the development of tools that are designed to detect patterns of movement rather than amount of movement. These approaches may prove advantageous because they somewhat circumvent the inherent challenge in calibration research (i.e., determining what counts as physical activity). Through pattern recognition software, movement patterns that involve physical activity can be detected and tracked rather than estimated based on acceleration values. Pattern recognition software is incorporated in the Intelligent Device for Energy Expenditure Assessment (IDEEA) (Minisun, LLC, Fresno, CA) monitor as well as in the SenseWear PRO2 armband monitor. The IDEEA monitor uses multiple electrodes and a complex neural network to determine the predominant activity type that the person is performing at a given time and then employs specific algorithms to estimate EE. The SenseWear PRO2 uses pattern recognition algorithms based on acceleration values and other sensors to detect movement type. Similar approaches are being developed for other accelerometers including the ActiGraph (33).

Recent advances in technology have been spurred, in part, by greater collaboration between manufacturers and research teams. Continued collaboration and more “open-source” development may help standardize accelerometry approaches. For example, reporting of actual acceleration values in units of gravitational force (g) by all manufacturers would allow for direct comparisons across monitors and reduce the confusion created by the plethora of competing devices. Several additional suggestions for future research are provided below.

Cross-validation with appropriate criterion measures.

The availability of multiple cut points or equations has led to much confusion in the accelerometer literature. Determining the most accurate equation or cut point for a particular situation or population is an important priority for future research. Equations must be cross-validated with independent samples, ideally from other settings and environments. They must also be evaluated under free-living conditions to determine their true utility for field-based research. The lack of a criterion measure of physical activity for field-based research has made it difficult to conduct this type of work, but devices such as the IDEEA may be particularly useful as criterion measures. One study (51) reported that the IDEEA can detect the type, onset, duration, and intensity of most fundamental movements with 98% accuracy. Another study (50) demonstrated that the EE estimations from the IDEEA monitor were 99% accurate compared with a mask calorimeter and 95% accurate compared with estimates from a metabolic chamber. These results strongly support the utility of this device as a field-based criterion measure of physical activity and EE.

Balancing accuracy with feasibility.

Advances in design and analytical techniques of calibration studies will continue to improve the accuracy of accelerometry-based activity monitors. A consideration that may guide the progression of this research is to maintain an appropriate balance between accuracy and feasibility. Self-report instruments have typically been the only choice for large-scale research projects where feasibility is a greater concern. This is because they can be administered to large samples of people at low cost. As the price of accelerometers drops and their ease of use increases, it will become more feasible to use accelerometers for large-scale research applications.

Efforts to continually increase the sophistication of accelerometers are certainly needed (and welcomed), but some advances may detract from the usability of the device or keep the price too high for large-scale use. For example, the integration of HR and accelerometry offers clear advantages for improving the accuracy of measurement; however, this level of accuracy may not be necessary in some applications. The need for more invasive HR monitoring may make these devices too cumbersome to wear over time. The ability to have data collected over multiple days with a simple noninvasive device may be a more important consideration than obtaining precise estimates of activity for a given day. The documented utility of pedometers for various field-based applications demonstrates the value of simple measurements for some applications. Individual calibration equations have also been proposed by some investigators, but these procedures add costs and may impose logistical constraints that limit more widespread use. Practical, low-cost monitoring technologies are still needed.

The nature of the research question should be the determining factor in the choice of monitoring technology. As illustrated in the vast literature on the health benefits of physical activity, powerful effects can be detected with relatively crude measures of activity. Increased precision is clearly needed in some lines of research, but if the goal is to characterize activity patterns or to determine differences in activity levels between groups, then high levels of precision are probably not necessary. A balance between accuracy and feasibility is clearly needed.

Advancing the concept of “standard of care.”

Research with accelerometry would progress more systematically if investigators were expected to demonstrate possible advantages of alternative equations (for a given monitor) instead of demonstrating that their approach or equation also works. In medical research, the underlying paradigm for evaluating new treatments is direct comparisons with “standard of care” procedures. New modalities and treatments must be shown to result in improved clinical outcomes (better outcomes or fewer side effects) relative to the existing standard procedures before they are widely accepted and used. If similar expectations were used in accelerometry research, much confusion would be avoided. In this case, the burden of proof should be on the investigative team to demonstrate or explain why an equation or approach is better than current equations available for a given monitor. If multiple equations exist, researchers should be expected to demonstrate that their equation is the most appropriate one for their situation. If a new equation performs as well as or better than the existing one in predicting physical activity, a recommendation to adopt the new equation may be appropriate. If the previous equation proves to be adequate, the researchers should comment that their results provide continued support for the previous equation. This model would allow research to progress in a systematic way without introducing unnecessary confusion into the literature. The concept of standard of care could be upheld through the normal peer-review process. Reviewers should hold authors to a high standard and expect that they are appropriately referencing past work in the area and demonstrating the value of their approach over accepted approaches.


1. Balogun, J. A., D. A. Martin, and M. A. Clendenin. Calorimetric validation of the Caltrac accelerometer during level walking. Phys. Ther. 69:501–509, 1989.
2. Bassett, D. R., Jr., B. E. Ainsworth, A. M. Swartz, S. J. Strath, W. L. O'Brien, and G. A. King. Validity of four motion sensors in measuring moderate intensity physical activity. Med. Sci. Sports Exerc. 32:S471–S480, 2000.
3. Bouten, C. V. C., A. A. H. J. Sauren, M. Verduin, and J. D. Janssen. Effects of placement and orientation of body-fixed accelerometers on the assessment of energy expenditure during walking. Med. Biol. Eng. Comp. 35:50–56, 1997.
4. Bouten, C. V. C., W. P. H. G. Verboeket-Van De Veene, K. R. Westerterp, M. Verduin, and J. D. Janssen. Daily physical activity assessment: comparison between movement registration and doubly labeled water. J. Appl. Phys. 81:1019–1026, 1996.
5. Brage, S., N. Brage, N. Wedderkopp, and K. Froberg. Reliability and validity of the Computer Science and Applications accelerometer in a mechanical setting. Meas. Phys. Educ. Exerc. Sci. 7:101–119, 2003.
6. Brage, S., N. Wedderkopp, L. B. Andersen, and K. Froberg. Influence of step frequency on movement intensity predictions with the CSA accelerometer: a field validation study in children. Pediatr. Exerc. Sci. 15:277–287, 2003.
7. Brage, S., N. Wedderkopp, P. W. Franks, L. B. Andersen, and K. Froberg. Reexamination of validity and reliability of the CSA monitor in walking and running. Med. Sci. Sports Exerc. 35:1447–1454, 2003.
8. Campbell, K. L., P. R. E. Crocker, and D. C. McKenzie. Field evaluation of energy expenditure in women using Tritrac accelerometers. Med. Sci. Sports Exerc. 34:1667–1772, 2002.
9. Chen, K. Y., and M. Sun. Improving energy expenditure estimation by using a triaxial accelerometer. J. Appl. Phys. 83:2112–2122, 1997.
10. Eisenmann, J. C., S. J. Strath, D. Shadrick, P. Rigsby, N. Hirsch, and L. Jacobson. Validity of uniaxial accelerometry during activities of daily living in children. Eur. J. Appl. Physiol. 91:259–263, 2004.
11. Ekelund, U., M. Sjostrom, A. Yngve, et al. Physical activity assessed by activity monitors and doubly labeled water in children. Med. Sci. Sports Exerc. 33:275–281, 2001.
12. Ekelund, U., A. Yngve, S. Brage, K. Westerterp, and M. Sjostrom. Body movement and physical activity energy expenditure in children and adolescents: how to adjust for differences in body size and age. Am. J. Clin. Nutr. 79:851–856, 2004.
13. Fairweather, S. C., J. J. Reilly, S. Grant, A. Whittaker, and J. Y. Payton. Using the Computer Science and Applications (CSA) activity monitor in preschool children. Pediatr. Exerc. Sci. 11:413–420, 1999.
14. Fehling, P. C., D. L. Smith, S. E. Warner, and G. P. Dalsky. Comparison of accelerometers with oxygen consumption in older adults during exercise. Med. Sci. Sports Exerc. 31:171–175, 1999.
15. Felton, G. M., M. Dowda, D. S. Ward, et al. Differences in physical activity between black and white girls living in rural and urban areas. J. School Health 72:250–255, 2002.
16. Freedson, P. S., D. Pober, and K. F. Janz. Calibration of accelerometers for children. Med. Sci. Sports Exerc. 37:S523–S530, 2005.
17. Freedson, P. S., E. Melanson, and J. Sirard. Calibration of the Computer Science and Applications, Inc. accelerometer. Med. Sci. Sports Exerc. 30:777–781, 1998.
18. Gretebeck, R., H. J. Montoye, and W. Porter. Comparison of the doubly labelled water method for measuring energy expenditure with Caltrac accelerometer readings. Med. Sci. Sports Exerc. 29:S60, 1997.
19. Haymes, E. M., and W. C. Byrnes. Walking and running energy expenditure estimated by Caltrac and indirect calorimetry. Med. Sci. Sports Exerc. 25:1365–1369, 1993.
20. Hendelman, D., K. Miller, C. Bagget, E. Debold, and P. S. Freedson. Validity of accelerometry for the assessment of moderate intensity physical activity in the field. Med. Sci. Sports Exerc. 32:S442–S449, 2000.
21. Jakicic, J. M., C. Winters, K. Lagally, J. Ho, R. J. Robertson, and R. R. Wing. The accuracy of the Tritrac-R3D accelerometer to estimate energy expenditure. Med. Sci. Sports Exerc. 31:747–754, 1999.
22. Johnson, R. K., J. Russ, and M. I. Goran. Physical activity related energy expenditure in children by doubly labeled water as compared with the Caltrac accelerometer. Int. J. Obes. 22:1046–1052, 1998.
23. Jones, S. L., K. Wood, R. Thompson, and G. J. Welk. Effect of monitor placement on output from three different accelerometers. Med. Sci. Sports Exerc. 31:S142 (abstract), 1999.
24. King, G. A., N. Torres, C. Potter, T. J. Brooks, and K. J. Coleman. Comparison of activity monitors to estimate energy cost of treadmill exercise. Med. Sci. Sports Exerc. 36:1244–1251, 2004.
25. Kochersberger, G., E. McConnell, M. N. Kuchibhatla, and C. Pieper. The reliability, validity, and stability of a measure of physical activity in the elderly. Arch. Phys. Med. Rehabil. 77:793–795, 1996.
26. Leenders, N. Y. J. M., W. M. Sherman, and H. N. Nagaraja. Comparison of four methods of estimating physical activity in adult women. Med. Sci. Sports Exerc. 32:1320–1326, 2000.
27. Matthews, C. E. Calibration of accelerometer output for adults. Med. Sci. Sports Exerc. 37:S512–S522, 2005.
28. Matthews, C. E., and P. S. Freedson. Field trial of a three-dimensional activity monitor: comparison with self report. Med. Sci. Sports Exerc. 27:1071–1078, 1995.
29. Metcalf, B. S., J. S. Curnow, C. Evans, L. D. Voss, and T. J. Wilkin. Technical reliability of the CSA activity monitor: the EarlyBird Study. Med. Sci. Sports Exerc. 34:1533–1537, 2002.
30. Nichols, J. F., C. G. Morgan, J. A. Sarkin, J. F. Sallis, and K. J. Calfas. Validity, reliability, and calibration of the Tritrac accelerometer as a measure of physical activity. Med. Sci. Sports Exerc. 31:908–912, 1999.
31. Pambianco, G., R. R. Wing, and R. Robertson. Accuracy and reliability of the Caltrac accelerometer for estimating energy expenditure. Med. Sci. Sports Exerc. 22:858–862, 1990.
32. Pate, R. R., P. S. Freedson, J. F. Sallis, et al. Compliance with physical activity guidelines: prevalence in a population of children and youth. Ann. Epidemiol. 12:303–308, 2002.
33. Pober, D. M., C. Raphael, and P. S. Freedson. Novel technique for assessing physical activity using accelerometer data. Med. Sci. Sports Exerc. 36:S198, 2005.
34. Puhl, J., K. Greaves, M. Hoyt, and T. Baranowski. Children's Activity Rating Scale (CARS): description and calibration. Res. Q. Exerc. Sport. 61:26–36, 1990.
35. Puyau, M. R., A. L. Adolph, F. A. Vohra, I. Zakeri, and N. F. Butte. Prediction of activity energy expenditure using accelerometers in children. Med. Sci. Sports Exerc. 36:1625–1631, 2004.
36. Puyau, M. R., A. L. Adolph, F. A. Vohra, and N. F. Butte. Validation and calibration of physical activity monitors in children. Obes. Res. 10:150–157, 2002.
37. Reilly, J. J., J. Coyle, L. Kelly, G. Burke, S. Grant, and J. Y. Paton. An objective method for measurement of sedentary behavior in 3- to 4-year olds. Obes. Res. 11:1155–1158, 2003.
38. Riddoch, C. J., L. B. Andersen, N. Wedderkopp, et al. Physical activity levels and patterns of 9- and 15-yr-old European children. Med. Sci. Sports Exerc. 36:86–92, 2004.
39. Schmidt, M. D., P. S. Freedson, and L. Chasan-Taber. Estimating physical activity using CSA accelerometer and a physical activity log. Med. Sci. Sports Exerc. 35:1605–1611, 2003.
40. Sirard, J., E. Melanson, and P. Freedson. Field evaluation of the Computer Science and Applications, Inc. physical activity monitor. Med. Sci. Sports Exerc. 32:695–700, 2000.
41. Strath, S. J., D. R. Bassett, and A. M. Swartz. Comparison of MTI accelerometer cut-points for predicting time spent in physical activity. Int. J. Sports Med. 24:298–303, 2003.
42. Swartz, A. M., S. J. Strath, D. R. Bassett, Jr., W. L. O'Brien, G. A. King, and B. E. Ainsworth. Estimation of energy expenditure using CSA accelerometers at hip and wrist sites. Med. Sci. Sports Exerc. 32:S450–S456, 2000.
43. Treuth, M. S., K. Schmitz, D. J. Catellier, et al. Defining accelerometer thresholds for activity intensities in adolescent girls. Med. Sci. Sports Exerc. 36:1259–1266, 2004.
44. Trost, S. G., D. S. Ward, and J. R. Burke. Validity of the Computer Science and Application (CSA) activity monitor in children. Med. Sci. Sports Exerc. 30:629–633, 1998.
45. Welk, G. J. Use of accelerometry-based activity monitors to assess physical activity. In: Physical Activity Assessments for Health Related Research. G. J. Welk (Ed.). Champaign, IL: Human Kinetics, 2002.
46. Welk, G. J., J. Almeida, and G. Morss. Laboratory calibration and validation of the Biotrainer and Actitrac activity monitors. Med. Sci. Sports Exerc. 35:1057–1064, 2003.
47. Welk, G. J., S. N. Blair, K. Wood, S. Jones, and K. W. Thompson. A comparative evaluation of three accelerometry-based physical activity monitors. Med. Sci. Sports Exerc. 32:S489–S497, 2000.
48. Welk, G. J., D. Dale, and J. A. Schaben. Application of direct observation techniques for the calibration of activity monitors with children. Res. Q. Exerc. Sport. 73:A16, 2002.
49. Welk, G. J., J. A. Schaben, and J. R. Morrow, Jr. Reliability of accelerometry-based activity monitors: a generalizability study. Med. Sci. Sports Exerc. 36:1637–1645, 2004.
50. Zhang, K., F. X. Pi-Sunyer, and C. N. Boozer. Improving energy expenditure estimation for physical activity. Med. Sci. Sports Exerc. 36:883–889, 2004.
51. Zhang, K., P. Werner, M. Sun, F. X. Pi-Sunyer, and C. N. Boozer. Measurement of human daily physical activity. Obes. Res. 11:33–40, 2003.


© 2005 American College of Sports Medicine