Perspectives for Progress

A Framework to Evaluate Devices That Assess Physical Behavior

Keadle, Sarah Kozey1; Lyden, Kate A.2,3; Strath, Scott J.4; Staudenmayer, John W.5; Freedson, Patty S.2

Exercise and Sport Sciences Reviews: October 2019 - Volume 47 - Issue 4 - p 206-214
doi: 10.1249/JES.0000000000000206

Key Points

  • Body-worn devices have emerged as the measurement tool of choice in many surveillance, experimental, and observational studies.
  • There has been a rapid and relentless proliferation of new devices and data processing methods without an established framework to evaluate their validity. This has resulted in much confusion, driven by nonrobust device development and evaluation methods that do not reflect how the devices will be used in practice.
  • Divergence in summary estimates within and between devices obstructs efforts to pool data and precludes between-study comparisons, which are necessary to develop coherent public health recommendations.
  • We propose the adoption of a phase-based framework to enable the field to move forward in a more systematic manner.
  • This article seeks to stimulate discussion on such a framework by providing goals and recommendations for study methods at each phase and outlining key challenges facing the field.

INTRODUCTION

In the 30 yr since body-worn devices (referred to as devices) were first used in physical activity research, there has been great progress in advancing the hardware, software, and processing methods used to translate device signals into estimates of physical behavior (e.g., posture and activity intensity) (1–4). Because these devices provide an unprecedented level of detail regarding the type, duration, and intensity of physical behaviors needed to improve health and longevity, they are presently the measurement tool of choice in many surveillance studies, prospective cohorts, and randomized clinical trials (5,6). Data collection is ongoing in hundreds of thousands of participants, and we anticipate that these rich data sets will address key research gaps identified in the 2018 Physical Activity Guidelines Advisory Committee Report (7). However, a major challenge facing the field is the rapid and relentless proliferation of new technologies and data processing methods without an established framework to evaluate their validity.

The field of measurement science is focused on developing, validating, and disseminating statistical algorithms to translate device signals into metrics of physical behavior (8). However, there are no defined benchmarks for progress. As such, there is no consensus on how to appropriately take a new method from inception to adoption. For example, some algorithms are developed and tested in participants completing a single walking bout (9), and other algorithms are evaluated in naturalistic conditions compared to a criterion (10); both are described as “validated.” The majority of the method development and validation studies have been conducted in laboratory settings where participant behavior is constrained to a few common tasks (e.g., walking) performed for a fixed duration (11,12). These laboratory studies have been instrumental in advancing the field. However, they offer little insight into how methods will perform in real-world conditions where activity and postural transitions are frequent and occur at irregular intervals. Rigorous validation studies conducted in free-living settings are rare but necessary to predict the performance of a device and algorithm in real-world conditions (10,13,14). Further amplifying the confusion is that both data collection (device and body placement) and data processing (algorithm applied) decisions affect estimates of time spent in different behaviors, and as a result, several recent studies have reached widely disparate interpretations of the amount of moderate-vigorous physical activity (MVPA) needed for longevity (15,16). These differences obstruct efforts to pool data and preclude between-study comparisons, which are necessary to develop clear public health recommendations.

To address some of this confusion, we propose that the measurement field adopt a phase-based framework for developing and evaluating device-based methods for physical behavior assessment. The framework is characterized by flexible and progressive processes and prespecified milestones for advancement, and it allows for a return to earlier stages for refinement and optimization when necessary. The notion that a new discovery (e.g., method, treatment, device) be evaluated under progressive conditions before it is adopted is not new (17–19). To advance a drug to market, there is a phase-based framework with clear milestones for progression that is guided by regulatory requirements (18,19). The proposed framework is similar to the drug development framework in that it borrows concepts and terminology to facilitate interdisciplinary communication. However, the target audience and end goals are different. One important distinction is that the current framework is not meant to be implemented by a regulatory body. Rather, we intend this framework to be used by researchers who develop methods or use devices to assess physical behavior.

SCOPE AND OVERVIEW OF THE FRAMEWORK

The framework applies to devices that are used to predict aspects of physical behavior including activity intensity (e.g., MVPA) and activity types such as walking and driving (3). These devices typically include an accelerometer that measures acceleration signals; however, the framework may be applied to other device signals, for example, heart rate, gyroscope, or electromyogram (EMG), that can be used in conjunction with or independent from acceleration signals. The framework is intended to facilitate the development and validation of processing methods to predict physical behavior from devices, which requires access to the “raw” output, rather than already processed summary estimates. This type of data is typically only available from “research-based” devices rather than “consumer-based” devices; thus, the framework is more relevant to research devices. However, encouraging standard validation within a naturalistic setting also is applicable to consumer devices (8).

The framework is illustrated in Figure 1. The phases of the framework progress from mechanical testing of the sensor signal (phase 0) in a controlled and artificial environment to a comprehensive naturalistic validation in real-world conditions where individual variation is high (phase III). The intermediate phases (I and II) focus on the development of a new method under controlled laboratory or semistructured conditions. Thus, the phases become more reflective of real-world conditions as studies progress along the framework, characterized by increasing individual variability (i.e., varying time spent in different behaviors/postures and frequent transitions). Although the early phases of the framework do not reflect real-world human behavior, they are necessary and useful steps in the development process, particularly as a device or type of signal from the device is evaluated for the first time. As the environment and protocol become increasingly variable, we expect performance to decline, which may necessitate return to an earlier phase for further refinement and optimization. The final phase of the framework (phase IV) involves the application of a method to health studies. By this phase, a method has been determined to be valid and reliable, but factors such as participant and researcher burden are important contributing factors that determine whether a method is adopted into practice.

Figure 1: Overview of a phase-based framework to evaluate body-worn devices to assess physical behavior.

It is important to note that the goal of a framework is not to determine whether there is a single, best device, wear location, or algorithm that should be used by all researchers. Many considerations determine the best choice for a given study, including cost and the burden on participants and researchers. For measurement scientists, the goals of the framework are to 1) rapidly identify efficacious devices and prediction methods, 2) enable direct comparison of performance, 3) standardize terminology to facilitate interdisciplinary collaboration, and 4) minimize differences in summary estimates between different devices and data processing methods. For health researchers, a framework provides clarity on how extensively algorithms have been validated, enabling informed decisions on what monitor to use in a particular study. In the following sections, we will identify the goals for each phase along with methodological considerations and suggestions, which are summarized in Table 1. We hope the proposed framework will stimulate discussions and, ultimately, consensus in the field.

TABLE 1: Summary of goals and study designs for each phase along the framework

Phase 0: Mechanical Signal Testing

Goals

The goal of this phase is to test the reliability and validity of the underlying signals via controlled protocols using electronic machines. Controlled testing is used for testing within-device reliability (i.e., determining the extent to which a monitor produces the same signal when tested multiple times under identical conditions), between-device reliability (i.e., testing different devices that are the same brand), and comparability of the output across different monitor brands (20–22). Phase 0 testing provides insight into the amount of variability in the underlying device technology without the influence of human variation, but it provides little insight into how a device will perform in applied studies. As devices progress from the manufacturing stage into the research arena, it is important to determine whether the signals are responding as expected to a known stimulus.

Study designs and methods

Methods for mechanical testing use electronic devices, such as wheels, tables, and orbital shakers, to examine the acceleration signals under fixed frequencies (21,23). Raw, unprocessed device output (e.g., g’s) should be tested under multiple frequencies that reflect human movement. Frequencies tested should increase in a stepwise fashion in small increments (e.g., 0.1 Hz) that span the range relevant to human movement from sedentary (0–0.4 Hz) to vigorous (>1.1 Hz) (24). To test within-device reliability, devices must undergo multiple testing trials under identical conditions, but given that mechanical stimuli are constant and controlled, each trial needs only to be performed for a short duration (e.g., 2–3 min). Of note, the vast majority of the research in this area has used acceleration signals (in either arbitrary units of “counts” or, more recently, acceleration units (g)) (25,26). Some devices can now collect and store other signals (e.g., heart rate, EMG, gyroscope, magnetometer), and we recommend similar signal testing for between-, within-, and across-device reliability, but the paucity of research in this area precludes more specific recommendations.
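As a rough illustration of the protocol described above, the following Python sketch builds a stepwise frequency schedule and summarizes within- and between-device variability with a coefficient of variation. The 1.5 Hz upper limit, the device names, the readings, and the choice of the coefficient of variation as the reliability statistic are all assumptions made for illustration; the source does not prescribe them.

```python
import statistics

def frequency_steps(start_hz, stop_hz, step_hz):
    """Return test frequencies from start to stop (inclusive) in fixed increments."""
    n_steps = round((stop_hz - start_hz) / step_hz)
    # Rounding removes floating-point drift such as 0.30000000000000004.
    return [round(start_hz + i * step_hz, 10) for i in range(n_steps + 1)]

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Schedule in 0.1 Hz increments; the 1.5 Hz ceiling is an assumed value above 1.1 Hz.
schedule = frequency_steps(0.1, 1.5, 0.1)

# Hypothetical mean acceleration (g) from three repeated trials at one frequency.
trials_by_device = {
    "unit_A": [0.501, 0.499, 0.500],
    "unit_B": [0.512, 0.510, 0.511],
}

# Within-device: variability across repeated trials of the same unit.
within_cv = {unit: coefficient_of_variation(t) for unit, t in trials_by_device.items()}

# Between-device: variability across the per-unit means of same-brand units.
unit_means = [statistics.mean(t) for t in trials_by_device.values()]
between_cv = coefficient_of_variation(unit_means)

print(schedule)
print(within_cv, between_cv)
```

In practice the same summary would be repeated at every frequency in the schedule, so that reliability can be inspected across the full range of movement frequencies.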

Phase I: Laboratory Development

Goals

The goal of the phase I studies, which are sometimes referred to as calibration studies, is to develop an algorithm to estimate activity energy cost or activity type from the device signals under controlled laboratory conditions. Participants complete the same activity for a fixed duration without transitions between activities or postures (e.g., (27)). These controlled conditions are optimal for identifying features of the signal that may be valuable for distinguishing between different activity intensities or types. This phase introduces human variation, but laboratory protocols without transitions are still not reflective of real-world conditions. Thus, these results may not reflect how a device/method will perform in free-living applications.

Study designs and methods

Device data are collected at the same time as criterion measures (e.g., indirect calorimetry or direct observation (DO)) while participants complete a selection of activities within a laboratory environment (25). Activities are performed for a fixed duration (e.g., 7 min of walking, 7 min of sweeping), and the start/stop times of the activities are recorded. Device data are labeled with criterion values and then used to develop an algorithm to predict posture, activity type, or intensity. Using fixed durations and controlling the start/stop of each activity ensure that the data set used to develop the prediction algorithm contains “clean” and continuous sensor signals that are associated with only one distinct criterion value (e.g., walking, moderate intensity). Algorithms are then cross-validated on the same sample using a separate smaller hold-out group (e.g., develop on 70% of the sample and cross-validate on the remaining 30% of the sample) or a hold-one-out method (28). A researcher has two main decisions to make in designing a phase I study: 1) the activities that will be selected in the protocol and 2) the statistical algorithm (e.g., machine learning, linear regression) that will be applied to the data.
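The two cross-validation options mentioned above can be sketched as follows. This is a minimal illustration that assumes the split is made at the participant level (so that no participant contributes data to both the development and hold-out sets); the function names and participant IDs are hypothetical.

```python
import random

def holdout_split(participant_ids, develop_fraction=0.7, seed=0):
    """Randomly assign participants to a development set and a hold-out set."""
    ids = sorted(set(participant_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_develop = round(len(ids) * develop_fraction)
    return ids[:n_develop], ids[n_develop:]

def hold_one_out(participant_ids):
    """Yield (development_ids, held_out_id) pairs, one per participant."""
    ids = sorted(set(participant_ids))
    for held_out in ids:
        yield [p for p in ids if p != held_out], held_out

# Hypothetical sample of 10 participants.
participants = [f"P{i:02d}" for i in range(1, 11)]

develop_ids, holdout_ids = holdout_split(participants)
folds = list(hold_one_out(participants))

print(len(develop_ids), len(holdout_ids), len(folds))
```

With 10 participants this gives 7 for development and 3 for cross-validation, or 10 folds when one participant is held out at a time.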

The type of activities included in the development phase will affect the estimates of time spent in specific intensities or types (29). We recommend that representative activities by domain and intensity should be included in all calibration studies (Fig. 2). In Figure 2, we illustrate the amount of time spent in different domains by age and common activities by intensity for the general American adult population as a starting point for which activities researchers should consider including (31,33). The specific selection of activities should be modified to ensure they are relevant to particular age groups, given the intensity and prevalence of a particular activity differ for children, adolescents, younger adults, and older adults. Achieving consensus on a standard minimum set of common activities for studies would make calibration data sets more generalizable, enable results to be directly compared among studies, and still enable flexibility for investigators to add specific types of activities that are relevant for their population of interest.

Figure 2: Illustration of time use domains and sample activities that should be included in phases I–III of the framework. A. Data are taken from American Time Use Survey Tables and consolidated into five primary domains by age categories (30). Values are hours per day (percentage of waking day). Personal care includes grooming, eating, and drinking; household activities include housework and caring for others; work and education include work for pay and education-related studying and attending class; leisure includes sports, exercise, and recreation, relaxing, and socializing; and other includes purchasing goods and services; organizational, civic, and religious activities; and other activities. B. Example activities are taken from the top 10 most prevalent activities by intensity, where possible (31). Other activities that fit within the intensity and domain also could be used. The proportion of time in different intensity categories within the domains was approximated based on time use data for a general adult sample (aged 20+ yr) (31,32) and is intended to illustrate that a range of intensities is necessary in each domain. Notably, leisure time is predominantly spent in sedentary behaviors, but we recommend oversampling light and moderate-vigorous physical activity (MVPA) because they are the predominant target of health interventions.

With regard to the choice of algorithm, the first studies used linear regression models to relate counts (X) to activity intensity (metabolic equivalents (METs), the Y value). The resulting equation was then inverted to estimate the count “cut point,” above which the intensity exceeded 3 METs (34). Several studies have demonstrated that the signal-intensity relation is not linear, particularly during activities of daily living (35,36) or when the monitor is worn on the wrist (37). This realization, along with advances in device technology that allowed data to be collected and stored at higher frequencies, prompted researchers to apply more sophisticated statistical models. These include nonlinear regression (38) and machine learning approaches (27,39–46). Machine learning models are similar to regression models in that they use inputs (the sensor signal) to predict a response (e.g., activity type or intensity) (47). However, their primary advantage is that they do not assume a rigid parametric relation between the sensor signal and the response. Instead, they use the data to flexibly learn the relation under minimal assumptions. We recommend the field continue to develop and apply machine learning algorithms. However, as we discuss in phase IV (adoption), usability is a barrier to the use of these models by applied researchers, and measurement scientists should consider even at early phases how and what software tools will be required to ensure an algorithm can be used in applied studies.
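The cut-point approach described at the start of this paragraph can be illustrated with a short sketch: fit a line relating counts to METs, then invert it to find the count value at 3 METs. The calibration data below are hypothetical and deliberately linear; they are not taken from any of the cited studies.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cut_point(intercept, slope, threshold_mets=3.0):
    """Invert the fitted line: the count value at which predicted METs equal the threshold."""
    return (threshold_mets - intercept) / slope

# Hypothetical calibration data: counts per minute (X) and measured METs (Y).
counts = [0, 1000, 2000, 3000, 4000]
mets = [1.5, 2.5, 3.5, 4.5, 5.5]

intercept, slope = fit_line(counts, mets)
moderate_cut = cut_point(intercept, slope)

print(intercept, slope, moderate_cut)
```

Because the whole prediction rests on a single straight line, any curvature in the true signal-intensity relation shifts the cut point, which is one way to see why the nonlinearity noted above matters.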

Phase II: Semistructured Evaluation

Goals

The primary goal of a phase II study is to refine algorithms developed in a phase I study to accommodate transitions between distinct activities. As an algorithm progresses in the development framework from phase I to phase II, a goal also is to independently validate the algorithm in a sample different from the one in which it was first developed. This is necessary to determine the generalizability and robustness of the model(s). Compared with phase I, these studies are more reflective of “real-world” conditions because they develop and refine algorithms to accommodate transitions between distinct activities and effectively predict a wider range of activities in which humans may engage from one day to the next.

Study design and methods

Phase II studies use a semistructured routine that consists of activities of varying durations and includes transitions between activities (37,48,49). Like phase I studies, phase II studies take place in a laboratory setting, and thus, researchers are still able to exert a high degree of control over the activities performed. As such, phase II studies still do not reflect true real-world conditions. Similar to the recommendations for phase I, developing a core set of routines facilitates between-study comparisons.
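To make the contrast with phase I concrete, the sketch below expands a routine into second-by-second criterion labels and counts how many fixed 60-s windows contain more than one activity. The routines, the durations, and the 60-s window length are hypothetical choices for illustration only.

```python
def expand_routine(bouts):
    """Turn a list of (activity, duration in seconds) bouts into per-second labels."""
    labels = []
    for activity, duration_s in bouts:
        labels.extend([activity] * duration_s)
    return labels

def mixed_window_fraction(labels, window_s=60):
    """Fraction of complete, non-overlapping windows that span more than one activity."""
    windows = [
        labels[start:start + window_s]
        for start in range(0, len(labels) - window_s + 1, window_s)
    ]
    mixed = sum(1 for window in windows if len(set(window)) > 1)
    return mixed / len(windows)

# Phase I style: fixed-duration bouts whose boundaries align with the windows.
structured = expand_routine([("walking", 420), ("sweeping", 420)])

# Phase II style: varying durations, so transitions fall inside windows.
semistructured = expand_routine(
    [("sitting", 90), ("walking", 150), ("standing", 45), ("sweeping", 195)]
)

structured_mixed = mixed_window_fraction(structured)
semistructured_mixed = mixed_window_fraction(semistructured)

print(structured_mixed, semistructured_mixed)
```

In this toy example no window is mixed under the fixed-duration protocol, whereas a quarter of the windows in the semistructured routine contain a transition, which is the situation a phase II refinement has to handle.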

Phase III: Naturalistic Validation

Goals

The goal of phase III studies is to conduct a rigorous, independent validation of an algorithm in real-world conditions compared with gold standard measures. These studies establish that a new method works well enough to be used in applied research and should be completed before a new method is used in an intervention or etiologic study. Similar to other phase-based frameworks, a new method should be evaluated compared with the “standard of care” (i.e., an accepted existing method) to examine if it is better than what is already available. It is important to note that “better” is typically based on which method is more accurate or precise, but researchers may also consider other factors including compliance to the data collection protocol, cost of the monitor, and the time and cost of data processing (50).

Study design and methods

These studies must take place in real-world conditions where participants are able to complete their typical daily behaviors. In phase III studies, the algorithm should be compared with a gold standard criterion measure (Table 2). Studies should aim to capture a full 24-h day and include a representative range of behaviors. The core principles of a naturalistic validation study are as follows:

TABLE 2: Criterion measures in physical behavior assessment

  1. Collect Data in True Naturalistic Conditions. To inform how a given method will perform in experimental and observational studies, the data must be collected within these naturalistic conditions, such as the participants' own home, workplace, and within their community without instruction on when to start or stop a particular activity.
  2. Use Gold Standard Criterion Measures. Table 2 provides an overview of available criterion measures for physical behavior validation, which are broadly categorized as physiologic criteria for metrics of energy expenditure (EE) or behavioral criteria for metrics like activity type, posture, and steps.
    1. Physiologic Criterion Measures. These provide an estimate of EE; indirect calorimetry and doubly labeled water (DLW) are the most appropriate criteria for device studies (51). Portable indirect calorimeters measure oxygen consumption, and participants wear a lightweight (<2 lb) backpack and a facemask. These systems have been worn for up to 6 h by study participants and provide valid estimates of time spent in different activity intensities (52). However, participants may be uncomfortable wearing them in certain public situations because of the mask and backpack. Whole-room indirect calorimeters also provide estimates of total and intensity-specific EE. Participants can perform some behaviors in a naturalistic manner (varying durations, including transitions), but the breadth of behaviors is restricted because all activities are performed indoors within a confined space. DLW is a criterion of total EE and activity-related EE (when subtracting resting EE). DLW has been widely used to validate monitors, including in several large studies (53,54). This method is unobtrusive and unlikely to affect the participant’s behavior. The primary disadvantages of DLW are that it is costly and does not provide an estimate of activity intensity, type, or temporal patterns.
    2. Behavioral Criterion Measures. For behavioral criteria, we recommend DO using still camera images or video recordings. Still camera images have been used as a criterion for activity type and location. These cameras collect and store images every 20 s for up to 16 h on a single battery and can store data for multiple days (55–57). The primary limitation is that this method cannot assess transitions between activities or intensity. Video-recorded DO is a valid criterion for number of steps (13), postures, and transitions between postures (58) and activity type (59), all of which are important physical behavior metrics that influence health (60). Although video-recorded DO does not directly assess EE, it is useful for categorizing activity intensities (61,62). A limitation of using videos is that evaluation requires a lot of labor and time. Both still images and videos can be annotated by different researchers/across research groups to assess inter- and intrarater reliability. The research community should develop standard definitions of metrics that can be annotated similarly across research groups (63).
  3. Use an Empirical Protocol Based on Time Use to Ensure the Data Are Representative of Habitual Behavior. Ideally, validation studies would take place for a full 24 h over multiple days, which is possible with DLW as a criterion for total EE. However, for some criterion measures that are obtrusive and labor intensive (i.e., indirect calorimetry and DO), collecting data over a full 24-h period is not feasible. In these situations, time use data should be used to inform a protocol to ensure that a representative range of activities are captured across the study. Participants could be provided basic instructions for the session in which they are being observed (e.g., complete some housework and leisure activities). Their instructions should be general (“complete an errand in the community”) rather than specific directions (“go to your car, drive to this store and shop for 10 min”) to ensure the participants complete tasks in as naturalistic a manner as possible. Although it may not be feasible to obtain 24-h worth of data on a single person, this approach would ensure the validation data set contains a representative set of activities typical of a 24-h period across the study sample.
  4. Include a Diverse Sample of Participants. We recommend adopting standard measures and reporting of age, sex, race/ethnicity, physical activity status, body mass index, socioeconomic status, and job classification to facilitate comparisons across studies. To ensure that algorithms generalize to a range of activities, diverse samples completing diverse activities are needed. For example, the types of activities completed by blue collar workers at work are different from those of agricultural workers. Similarly, the movement patterns and common activities of those with functional limitations are different from those of the general population, necessitating targeted recruitment and potentially algorithm refinement (64).
  5. Develop and Validate Processing Methods on a Variety of Data Collection Procedures (Sensors/Attachment Sites). When feasible (due to cost and participant burden constraints), researchers should collect data using multiple devices to enable direct comparison among devices, enabling end users to make an informed decision on which monitor to use and where to place it on the body. Of note, researchers must balance the use of multiple devices with ensuring correct placement (i.e., avoid attaching four devices to a single wrist).
  6. Use Appropriate Statistical Evaluation. The purpose of the statistical evaluation is to determine the accuracy and precision of the processing methods by assessing whether we can be statistically confident that the difference between the method and a criterion is within a given tolerance (e.g., ±5 or 10%). This approach, called equivalence testing, is different from traditional hypothesis testing, which assesses whether the processing method’s estimates are different from the criterion measure and limits the probability of falsely claiming a difference to be less than 5% (typically). The equivalence testing paradigm has been reviewed in general (18,65) and applied to physical activity assessment (66). Previous research also provides detailed explanations on how this approach establishes evidence that a new method is better than a standard method (18,50).
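
One common way to carry out the equivalence test described in principle 6 is the confidence-interval form of two one-sided tests: the method is declared equivalent to the criterion if the 90% confidence interval for the mean paired difference lies entirely within the tolerance, which corresponds to two one-sided tests at the 5% level. The sketch below assumes paired data, a tolerance expressed as a fraction of the criterion mean, and a critical value supplied by the user (1.833 is the one-sided 5% critical value of the t distribution with 9 degrees of freedom). The data are hypothetical.

```python
import math
import statistics

def difference_interval(method, criterion, critical_value):
    """Confidence interval for the mean paired difference (method minus criterion)."""
    diffs = [m - c for m, c in zip(method, criterion)]
    mean_diff = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff - critical_value * std_error, mean_diff + critical_value * std_error

def is_equivalent(method, criterion, tolerance_fraction, critical_value):
    """True if the whole interval lies within +/- tolerance of the criterion mean."""
    bound = tolerance_fraction * statistics.mean(criterion)
    low, high = difference_interval(method, criterion, critical_value)
    return -bound < low and high < bound

# Hypothetical minutes of MVPA per day for 10 participants.
criterion_minutes = [30, 45, 60, 25, 50, 40, 35, 55, 20, 40]
method_minutes = [32, 44, 63, 25, 51, 38, 37, 56, 20, 41]

T_CRIT_DF9 = 1.833  # one-sided 5% critical value, t distribution, 9 degrees of freedom

within_10_percent = is_equivalent(method_minutes, criterion_minutes, 0.10, T_CRIT_DF9)
within_1_percent = is_equivalent(method_minutes, criterion_minutes, 0.01, T_CRIT_DF9)

print(within_10_percent, within_1_percent)
```

The same data can pass at one tolerance and fail at a stricter one, which is why the tolerance should be chosen and reported before the analysis.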

Phase IV: Adoption

Goals

A method is successfully adopted into health research when it is used in an applied study, such as surveillance, experimental, or clinical trials, and observational studies linking physical activity with health-related outcomes. For the measurement scientist, the goal of the adoption phase is to have others implement new methods into applied studies. For the more advanced data processing techniques, this will likely require user-friendly software that is (ideally) freely available to other researchers.

Study design and methods

Although this phase appears outside the traditional validation process, measurement scientists with a primary interest in method development must remember that the end goal is for an end user, with less computational expertise, to use their processing method. To date, few machine learning methods have been used in applied studies to quantify the effectiveness of an intervention or relate physical behaviors to health. Instead, investigators are defaulting to simpler methods (i.e., cut points) and have reported cleaning data by hand, clearly decreasing reproducibility and increasing researcher burden (67). Applied researchers face real barriers to implementing complicated new methods, which are often a “black box” that requires statistical expertise both to understand and to use. For a 7-d assessment period, the average file size is a remarkable 3000-fold larger when collecting raw acceleration in 3 axes at 100 Hz than when collecting data in a single axis with 60-s epochs. This significantly, and often prohibitively, increases storage and computational burden. To facilitate effective adoption, there must be 1) clear communication from measurement scientists to end users about the reliability and validity of different devices and processing methods and 2) open-source, user-friendly software available to process the data. Measurement scientists should design with dissemination in mind, meaning that the end goal should be considered from the early phases (68,69).
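To give a sense of the scale involved, the following sketch counts how many values each collection mode records over a 7-d assessment. It counts recorded values only, not bytes: file size also depends on the storage format, timestamps, and compression, so this arithmetic is an illustration of the data volume and is not expected to reproduce the file-size ratio reported above. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def values_recorded(days, sample_rate_hz, axes):
    """Number of values stored for a given duration, sampling rate, and axis count."""
    return round(days * SECONDS_PER_DAY * sample_rate_hz * axes)

# Single axis summarized into one value per 60-s epoch.
epoch_values = values_recorded(days=7, sample_rate_hz=1 / 60, axes=1)

# Raw acceleration in 3 axes sampled at 100 Hz.
raw_values = values_recorded(days=7, sample_rate_hz=100, axes=3)

ratio = raw_values / epoch_values

print(epoch_values, raw_values, ratio)
```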

PERSPECTIVES FOR PROGRESS

Before providing specific recommendations on the opportunities and challenges of adopting this framework, it is prudent to consider several key observations about the current state of the field. First, there has been substantial progress in device manufacturing that has been driven by technological innovations (i.e., miniaturization of batteries and storage) and physical activity researchers demanding monitor signals in “raw” output (21,23). To continue to progress, the measurement field must be able to efficiently evaluate the impact of new technologies.

Second, despite consensus that laboratory-based methods do not translate well to a naturalistic environment, the field is inundated with phase I studies that are not progressing to rigorous validation studies (phase III) (10,70,71). A 2018 systematic review reported that there are 20 available methods to process acceleration signals (mg), and none have been validated in a natural environment or an independent sample (11). Even the most promising algorithm developed in a phase I study will decline in performance when evaluated in a natural environment.

Third, the activities included in the method development process will affect how well that method is able to estimate physical behavior (27,29). Several decades of research have shown that protocols including only locomotion activities will underestimate time spent in MVPA (12,38). Despite this limitation, new algorithms have been developed for vector magnitude counts and raw acceleration metrics using only locomotion activities (e.g., (9,72)). Regardless of the quality of the features extracted from the monitor or the complexity of the algorithm, if the activities included in the method development do not reflect what people do daily, new methods will not perform well in real-world conditions.

Fourth, when they have been conducted, rigorous validation studies in free-living conditions (with criterion measures) have provided clarity for the field (10,13,55,58). As an example of why these phase III studies are so important, in 2009, we published a machine learning algorithm that was trained and developed on two laboratory-based data sets featuring over 300 participants and 41 activity types (27,73). We then discovered that the model performed poorly on an independent sample, overestimating energy cost by 33% (10). We hypothesized that this overestimation was due to two factors: 1) the method was developed on data that did not include transitions and thereby assumed activities start and stop in minute intervals, and 2) the method was developed using a protocol that did not include sedentary behaviors. Consequently, we used acceleration data collected in the free-living conditions with DO as the criterion to refine our algorithm. Our focus was to identify transitions between postures and activities and to separate periods of activity from stretches of sedentary time (10). This scenario is an example of how even a rigorous laboratory study does not translate to naturalistic conditions, but by combining the knowledge gleaned from the laboratory setting with observations from the free-living setting, we were able to greatly improve the performance of our method. This type of continuous, bidirectional refinement is necessary for method optimization (Fig. 1).

Fifth, there has been an exponential increase in the amount of global data using activity monitors, and over the next 5–10 yr, there will be follow-up on over 500,000 participants for disease outcomes and mortality (6,74,75). These data have the potential to address key research gaps regarding the optimal frequency, type, intensity, and timing of activity for improving health. It is therefore a critical time to begin to establish which devices, attachment sites, and processing methods are providing comparable estimates of physical behavior to ensure coherent public health translation.

What Will It Take to Achieve Consensus?

If this framework were adopted broadly, it would require a change in perspective on when a method is considered validated. We advocate that a device or processing method should not be considered validated for use in a study until it has been compared with a criterion measure in a natural environment (phase III). We recognize that these types of rigorous evaluations are costly and time intensive. Although not exhaustive, we highlight three specific challenges for which future discussion and consensus are needed to move the field forward.

First, a key goal that should drive the future science is to harmonize the core activities to be included at each phase. In Figure 2, we provide common activities across domains and intensities to guide this conversation. Included activities should represent a range of time use domains, include a broad spectrum of intensities, and be age appropriate. In addition to locomotion activities, household, leisure, occupational, and recreational activities involving both lower and upper body movement should be part of the menu of behaviors that are studied. Sedentary behaviors need to be included too. Data from the Time Use Surveys and previous day recalls show the types and intensities of behaviors typically performed within a day, and they can be used to inform the selection of activities across the framework (30,76).

Second, moving the field forward will involve investment in data sharing resources (5). Because the criterion measures and naturalistic conditions are time and cost intensive, data sharing is pragmatic. It is unlikely that a single research group or study will collect data that are generalizable across a range of participants, from different backgrounds, who represent a true range of behaviors. Moreover, computational scientists may lack the expertise necessary to collect and annotate free-living data with human subjects and criterion measures, but it is advantageous for the field if they are able to test and refine algorithms on these robust data sets. In some fields, data sharing has been mandated for federally funded projects. A National Cancer Institute funded project (iData) released a publicly available data set that includes a large sample of older adults with device data and DLW, an important step in this direction (https://biometry.nci.nih.gov/cdas/idata/). Encouraging individual investigators to release data may require incentives in the form of either carrots or sticks. As a carrot, researchers who contribute data to a repository in a standard format could gain access to others’ data as well, and there are currently federally funded data repositories that could be used for device-based measures (77). In contrast, a stick could come in the form of requirements that data be released to maintain funding or publishing opportunities. At minimum, researchers should share the code used to implement their data analyses for the purpose of replication.

Third, successful data pooling and sharing require standard annotations and formatting for data sets, but the field presently lacks harmonized operational definitions for many criterion measures. This is particularly evident for behavioral criteria, including metrics such as steps, specific postures, and activity types. Different studies are choosing to predict different types of activities (e.g., driving, walking/running, housework, stationary vs sitting vs standing) (40,55,57). Importantly, both the type and number of behaviors that the researcher is trying to categorize will affect accuracy and precision. There have been efforts to standardize definitions for key postures (78), but nuanced conceptual distinctions (e.g., sitting vs reclining) may be extremely difficult to operationalize as a criterion using DO. Researchers publishing their manuals for annotating data and operational definitions for these metrics may help the field converge on key metrics (55). For physiologic criteria (DLW and indirect calorimetry), EE is the key metric, which can be presented as activity-related EE when subtracting out (or dividing by) resting energy cost. For indirect calorimetry, estimates of time spent in different absolute intensity categories (e.g., light, moderate, vigorous) are well established, although controversy about the use of METs and absolute versus relative intensity persists (8,79). Although it will be a challenge, reaching consensus on these operational definitions will facilitate collaboration and data sharing of annotated criterion files and sensor data. This will enable more effective collaborations and may set the standard for best practices in this field.
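
The two ways of expressing activity-related EE mentioned above, and the mapping to absolute intensity categories, can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed method: the function names and example values are hypothetical, the ratio form is treated here as a MET-like multiple of a measured resting rate, and the cut-points of 3 and 6 METs are the conventional absolute thresholds, which the text notes remain a subject of debate.

```python
def activity_ee_difference(total_ee, resting_ee):
    """Activity-related EE obtained by subtracting resting energy cost.

    Both inputs must be in the same units (e.g., kcal/min).
    """
    return total_ee - resting_ee


def activity_ee_ratio(total_ee, resting_ee):
    """Activity-related EE expressed as a multiple of resting energy cost."""
    if resting_ee <= 0:
        raise ValueError("resting_ee must be positive")
    return total_ee / resting_ee


def intensity_category(mets):
    """Assign an absolute intensity category from a MET value.

    Assumed conventional cut-points: <3 light, 3 to <6 moderate,
    >=6 vigorous. Sedentary behavior is not separated here because
    its definition also depends on posture.
    """
    if mets < 3.0:
        return "light"
    if mets < 6.0:
        return "moderate"
    return "vigorous"


# Hypothetical example: brisk walking measured by indirect calorimetry.
total = 5.0      # kcal/min, hypothetical measured EE during the activity
resting = 1.25   # kcal/min, hypothetical measured resting EE

print(activity_ee_difference(total, resting))              # kcal/min above rest
print(activity_ee_ratio(total, resting))                   # multiple of rest
print(intensity_category(activity_ee_ratio(total, resting)))
```

Dividing by a measured resting value rather than a fixed standard is one of the choices at issue in the debate over METs (79), so the choice of denominator should be reported with any shared criterion file.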

CONCLUSIONS

In highlighting the challenges, we want to be mindful not to minimize the great progress that has been made. The use of devices to estimate activity and sedentary behavior has the potential to advance our understanding of how these behaviors are linked to health and of the optimal intervention strategies to increase physical activity and decrease sedentary time. However, differences in estimates from different monitors and processing methods impair efforts to pool data, complicate between-study comparisons, and prevent cohesive public health recommendations. We believe that adopting the principles laid out in this framework for method development and evaluation will lead to valid estimates of multiple aspects of physical behavior, shape best practices, and enable more rapid identification, refinement, and use of promising new methods.

References

1. Montoye HJ, Washburn R, Servais S, Ertl A, Webster JG, Nagle FJ. Estimation of energy expenditure by a portable accelerometer. Med. Sci. Sports Exerc. 1983; 15:403–7.
2. Wong TC, Webster JG, Montoye HJ, Washburn R. Portable accelerometer device for measuring human energy expenditure. I.E.E.E. Trans. Biomed. Eng. 1981; 28:467–71.
3. Intille SS, Lester J, Sallis JF, Duncan G. New horizons in sensor development. Med. Sci. Sports Exerc. 2012; 44(Suppl. 1):S24–31.
4. Troiano RP, McClain JJ, Brychta RJ, Chen KY. Evolution of accelerometer methods for physical activity research. Br. J. Sports Med. 2014; 48:1019–23.
5. Wijndaele K, Westgate K, Stephens SK, et al. Utilization and harmonization of adult accelerometry data: review and expert consensus. Med. Sci. Sports Exerc. 2015; 47:2129–39.
6. German National Cohort (GNC) Consortium. The German National Cohort: aims, study design and organization. Eur. J. Epidemiol. 2014; 29:371–82.
7. Physical Activity Guidelines Scientific Report. Physical Activity Guidelines Committee Report. Washington, DC: Department of Health and Human Services; 2018. Available from: https://health.gov/paguidelines/second-edition/report/. Accessed May 2019.
8. Welk GJ, Bai Y, Lee JM, Godino J, Saint-Maurice PF, Carr L. Standardizing analytic methods and reporting in activity monitor validation studies. Med. Sci. Sports Exerc. 2019; 51:1767–80.
9. Vähä-Ypyä H, Vasankari T, Husu P, et al. Validation of cut-points for evaluating the intensity of physical activity with accelerometry-based mean amplitude deviation (MAD). PLoS One. 2015; 10:e0134813.
10. Lyden K, Keadle SK, Staudenmayer J, Freedson PS. A method to estimate free-living active and sedentary behavior from an accelerometer. Med. Sci. Sports Exerc. 2014; 46:386–97.
11. de Almeida Mendes M, da Silva ICM, Ramires VV, Reichert FF, Martins RC, Tomasi E. Calibration of raw accelerometer data to measure physical activity: a systematic review. Gait Posture. 2018; 61:98–110.
12. Lyden K, Kozey SL, Staudenmayer JW, Freedson PS. A comprehensive evaluation of commonly used accelerometer energy expenditure and MET prediction equations. Eur. J. Appl. Physiol. 2011; 111:187–201.
13. Toth LP, Park S, Springer CM, Feyerabend MD, Steeves JA, Bassett DR. Video-recorded validation of wearable step counters under free-living conditions. Med. Sci. Sports Exerc. 2018; 50:1315–22.
14. Jeran S, Steinbrecher A, Pischon T. Prediction of activity-related energy expenditure using accelerometer-derived physical activity under free-living conditions: a systematic review. Int. J. Obes. (Lond). 2016; 40:1187–97.
15. Evenson KR, Wen F, Herring AH. Associations of accelerometry-assessed and self-reported physical activity and sedentary behavior with all-cause and cardiovascular mortality among US adults. Am. J. Epidemiol. 2016; 184:621–32.
16. Lee IM, Shiroma EJ, Evenson KR, Kamada M, LaCroix AZ, Buring JE. Accelerometer-measured physical activity and sedentary behavior in relation to all-cause mortality: the Women's Health Study. Circulation. 2018; 137:203–5.
17. Czajkowski SM, Powell LH, Adler N, et al. From ideas to efficacy: the ORBIT model for developing behavioral treatments for chronic diseases. Health Psychol. 2015; 34:971–82.
18. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. BMJ. 1996; 313:36–9.
19. Lipsky MS, Sharp LK. From idea to market: the drug approval process. J. Am. Board Fam. Pract. 2001; 14:362–7.
20. John D, Morton A, Arguello D, Lyden K, Bassett D. “What is a step?” Differences in how a step is detected among three popular activity monitors that have impacted physical activity research. Sensors (Basel). 2018; 18.
21. John D, Sasaki J, Staudenmayer J, Mavilia M, Freedson PS. Comparison of raw acceleration from the GENEA and ActiGraph GT3X+ activity monitors. Sensors (Basel). 2013; 13:14754–63.
22. Rothney MP, Neumann M, Beziat A, Chen KY. An artificial neural network model of energy expenditure using nonintegrated acceleration signals. J. Appl. Physiol. 2007; 103:1419–27.
23. Rothney MP, Apker GA, Song Y, Chen KY. Comparing the performance of three generations of ActiGraph accelerometers. J. Appl. Physiol. (1985). 2008; 105:1091–7.
24. John D, Miller R, Kozey-Keadle S, Caldwell G, Freedson P. Biomechanical examination of the ‘plateau phenomenon’ in ActiGraph vertical activity counts. Physiol. Meas. 2012; 33:219–30.
25. Freedson P, Bowles HR, Troiano R, Haskell W. Assessment of physical activity using wearable monitors: recommendations for monitor calibration and use in the field. Med. Sci. Sports Exerc. 2012; 44(Suppl. 1):S1–4.
26. van Hees VT, Thaler-Kall K, Wolf KH, et al. Challenges and opportunities for harmonizing research methodology: raw accelerometry. Methods Inf. Med. 2016; 55:525–32.
27. Freedson PS, Lyden K, Kozey-Keadle S, Staudenmayer J. Evaluation of artificial neural network algorithms for predicting METs and activity type from accelerometer data: validation on an independent sample. J. Appl. Physiol. (1985). 2011; 111:1804–12.
28. Staudenmayer J, Zhu W, Catellier DJ. Statistical considerations in the analysis of accelerometry-based activity monitor data. Med. Sci. Sports Exerc. 2012; 44(Suppl. 1):S61–7.
29. Matthews CE, Keadle SK, Berrigan D, et al. Influence of accelerometer calibration approach on moderate-vigorous physical activity estimates for adults. Med. Sci. Sports Exerc. 2018; 50:2285–91.
30. Bureau of Labor Statistics. American Time Use Survey. Available from: http://www.bls.gov/tus/tables.htm.
31. Tudor-Locke C, Johnson WD, Katzmarzyk PT. Frequently reported activities by intensity for U.S. adults: the American Time Use Survey. Am. J. Prev. Med. 2010; 39:e13–20.
32. Church TS, Thomas DM, Tudor-Locke C, et al. Trends over 5 decades in U.S. occupation-related physical activity and their associations with obesity. PLoS One. 2011; 6:e19657.
    33. Ainsworth BE, Haskell WL, Herrmann SD, et al. 2011 Compendium of Physical Activities: a second update of codes and MET values. Med. Sci. Sports Exerc. 2011; 43:1575–81.
    34. Freedson PS, Melanson E, Sirard J. Calibration of the Computer Science and Applications, Inc. accelerometer. Med. Sci. Sports Exerc. 1998; 30:777–81.
    35. Kozey SL, Lyden K, Howe CA, Staudenmayer JW, Freedson PS. Accelerometer output and MET values of common physical activities. Med. Sci. Sports Exerc. 2010; 42:1776–84.
    36. Pober DM, Staudenmayer J, Raphael C, Freedson PS. Development of novel techniques to classify physical activity mode using accelerometers. Med. Sci. Sports Exerc. 2006; 38:1626–34.
    37. Montoye AHK, Begum M, Henning Z, Pfeiffer KA. Comparison of linear and non-linear models for predicting energy expenditure from raw accelerometer data. Physiol. Meas. 2017; 38:343–57.
    38. Crouter SE, Clowers KG, Bassett DR Jr. A novel method for using accelerometer data to predict energy expenditure. J. Appl. Physiol. 2006; 100:1324–31.
    39. Rosenberg D, Godbole S, Ellis K, et al. Classifiers for accelerometer-measured behaviors in older women. Med. Sci. Sports Exerc. 2017; 49:610–6.
    40. Ellis K, Kerr J, Godbole S, Staudenmayer J, Lanckriet G. Hip and wrist accelerometer algorithms for free-living behavior classification. Med. Sci. Sports Exerc. 2016; 48:933–40.
    41. Chowdhury AK, Tjondronegoro D, Chandran V, Trost SG. Ensemble methods for classification of physical activities from wrist accelerometry. Med. Sci. Sports Exerc. 2017; 49:1965–73.
    42. Zhang S, Rowlands AV, Murray P, Hurst TL. Physical activity classification using the GENEA wrist-worn accelerometer. Med. Sci. Sports Exerc. 2012; 44:742–8.
    43. Bonomi AG, Goris AH, Yin B, Westerterp KR. Detection of type, duration, and intensity of physical activity using an accelerometer. Med. Sci. Sports Exerc. 2009; 41:1770–7.
    44. Staudenmayer J, He S, Hickey A, Sasaki J, Freedson P. Methods to estimate aspects of physical activity and sedentary behavior from high-frequency wrist accelerometer measurements. J. Appl. Physiol. 2015; 119:396–403.
    45. Sasaki JE, Hickey AM, Staudenmayer JW, John D, Kent JA, Freedson PS. Performance of activity classification algorithms in free-living older adults. Med. Sci. Sports Exerc. 2016; 48:941–50.
    46. Kate RJ, Swartz AM, Welch WA, Strath SJ. Comparative evaluation of features and techniques for identifying activity type and estimating energy cost from accelerometer data. Physiol. Meas. 2016; 37:360–79.
    47. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning; Data Mining, Inference, and Prediction. 2nd ed. Stanford: Springer; 2008, 764 pp.
    48. Ellingson LD, Hibbing PR, Kim Y, Frey-Law LA, Saint-Maurice PF, Welk GJ. Lab-based validation of different data processing methods for wrist-worn ActiGraph accelerometers in young adults. Physiol. Meas. 2017; 38:1045–60.
    49. Ellingson LD, Schwabacher IJ, Kim Y, Welk GJ, Cook DB. Validity of an integrative method for processing physical activity data. Med. Sci. Sports Exerc. 2016; 48:1629–38.
    50. Welk GJ, McClain J, Ainsworth BE. Protocols for evaluating equivalency of accelerometry-based activity monitors. Med. Sci. Sports Exerc. 2012; 44(Suppl. 1):S39–49.
    51. Hills AP, Mokhtar N, Byrne NM. Assessment of physical activity and energy expenditure: an overview of objective measures. Front. Nutr. 2014; 1:5.
    52. Crouter SE, DellaValle DM, Haas JD, Frongillo EA, Bassett DR. Validity of ActiGraph 2-regression model, Matthews cut-points, and NHANES cut-points for assessing free-living physical activity. J. Phys. Act. Health. 2013; 10:504–14.
    53. Chomistek AK, Yuan C, Matthews CE, et al. Physical activity assessment with the ActiGraph GT3X and doubly labeled water. Med. Sci. Sports Exerc. 2017; 49:1935–44.
    54. Matthews CE, Kozey Keadle S, Moore SC, et al. Measurement of active and sedentary behavior in context of large epidemiologic studies. Med. Sci. Sports Exerc. 2018; 50:266–76.
    55. Willetts M, Hollowell S, Aslett L, Holmes C, Doherty A. Statistical machine learning of sleep and physical activity phenotypes from sensor data in 96,220 UK Biobank participants. Sci. Rep. 2018; 8:7961.
    56. Ellis K, Kerr J, Godbole S, Lanckriet G, Wing D, Marshall S. A random forest classifier for the prediction of energy expenditure and type of physical activity from wrist and hip accelerometers. Physiol. Meas. 2014; 35:2191–203.
    57. Kerr J, Patterson RE, Ellis K, et al. Objective assessment of physical activity: classifiers for public health. Med. Sci. Sports Exerc. 2016; 48:951–7.
    58. Kozey-Keadle S, Libertine A, Lyden K, Staudenmayer J, Freedson PS. Validation of wearable monitors for assessing sedentary behavior. Med. Sci. Sports Exerc. 2011; 43:1561–7.
    59. Kozey Keadle S, Lyden K, Hickey A, et al. Validation of a previous day recall for measuring the location and purpose of active and sedentary behaviors compared to direct observation. Int. J. Behav. Nutr. Phys. Act. 2014; 11:12.
    60. Young DR, Hivert MF, Alhassan S, et al. Sedentary behavior and cardiovascular morbidity and mortality: a science advisory from the American Heart Association. Circulation. 2016; 134:E262–79.
    61. Lyden K, Petruski N, Staudenmayer J, Freedson P. Direct observation is a valid criterion for estimating physical activity and sedentary behavior. J. Phys. Act. Health. 2014; 11:860–3.
    62. Welch WA, Swartz AM, Cho CC, Strath SJ. Accuracy of direct observation to assess physical activity in older adults. J. Aging Phys. Act. 2016; 24:583–90.
    63. Bourke AK, Ihlen EAF, Helbostad JL. Development of a gold-standard method for the identification of sedentary, light and moderate physical activities in older adults: definitions for video annotation. J. Sci. Med. Sport. 2019; 22:557–61.
    64. Strath SJ, Pfeiffer KA, Whitt-Glover MC. Accelerometer use with children, older adults, and adults with functional limitations. Med. Sci. Sports Exerc. 2012; 44(Suppl. 1):S77–85.
    65. Lenth RV. Some practical guidelines for effective sample size determination. Am. Stat. 2001; 55:187–93.
    66. Dixon PM, Saint-Maurice PF, Kim Y, Hibbing P, Bai Y, Welk GJ. A primer on the use of equivalence testing for evaluating measurement agreement. Med. Sci. Sports Exerc. 2018; 50:837–45.
    67. Albinali F, Beaudin J. SPADES: survey of needs. Accessed May 2019. Available from: http://spades-documentation.s3-website-us-east-1.amazonaws.com/survey-of-needs.html.
    68. Glasgow RE, Vinson C, Chambers D, Khoury MJ, Kaplan RM, Hunter C. National Institutes of Health approaches to dissemination and implementation science: current and future directions. Am. J. Public Health. 2012; 102:1274–81.
    69. Klesges LM, Estabrooks PA, Dzewaltowski DA, Bull SS, Glasgow RE. Beginning with the application in mind: designing and planning health behavior change interventions to enhance dissemination. Ann. Behav. Med. 2005; 29:66–75.
    70. Pavey TG, Gilson ND, Gomersall SR, Clark B, Trost SG. Field evaluation of a random forest activity classifier for wrist-worn accelerometer data. J. Sci. Med. Sport. 2017; 20:75–80.
    71. Gyllensten IC, Bonomi AG. Identifying types of physical activity with a single accelerometer: evaluating laboratory-trained algorithms in daily life. I.E.E.E. Trans. Biomed. Eng. 2011; 58:2656–63.
    72. Sasaki JE, John D, Freedson PS. Validation and comparison of ActiGraph activity monitors. J. Sci. Med. Sport. 2011; 14:411–6.
    73. Staudenmayer J, Pober D, Crouter S, Bassett D, Freedson P. An artificial neural network to estimate physical activity energy expenditure and identify physical activity type from an accelerometer. J. Appl. Physiol. 2009; 107:1300–7.
    74. Lee IM, Shiroma EJ. Using accelerometers to measure physical activity in large-scale epidemiological studies: issues and challenges. Br. J. Sports Med. 2014; 48:197–201.
    75. Doherty A, Jackson D, Hammerla N, et al. Large scale population assessment of physical activity using wrist worn accelerometers: the UK Biobank Study. PLoS One. 2017; 12:e0169649.
    76. Eurostat. Harmonised European time use surveys. 2008. Available from: https://ec.europa.eu/eurostat/ramon/statmanuals/files/KS-RA-08-014-EN.pdf.
    77. Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000; 101:E215–20.
    78. Tremblay MS, Aubert S, Barnes JD, et al. Sedentary Behavior Research Network (SBRN)—Terminology Consensus Project process and outcome. Int. J. Behav. Nutr. Phys. Act. 2017; 14:75.
    79. Kozey S, Lyden K, Staudenmayer J, Freedson P. Errors in MET estimates of physical activities using 3.5 ml x kg(−1) x min(−1) as the baseline oxygen consumption. J. Phys. Act. Health. 2010; 7:508–16.
Keywords: wearable sensor; physical activity; sedentary behavior; measurement; exercise; accelerometer; activity monitor

    Copyright © 2019 by the American College of Sports Medicine