Pediatric concussion is a widespread public health concern that has received significant media attention in recent years. Between 1997 and 2007, emergency department visits for concussion increased by 100% in children aged 8 to 13 years and by 200% in those aged 14 to 19 years.1 Clinical research supports the idea that children and adolescents take longer to recover from a concussion than adults.2 A recent Institute of Medicine committee statement suggested that more research should be performed to "establish objective, sensitive, and specific metrics and markers of concussion diagnosis, prognosis, and recovery in youth."3 Evaluation of concussion includes balance assessment,4–6 often involving the Balance Error Scoring System (BESS).4,7–9
Administering the BESS involves testing postural control in 3 positions (double leg stance, tandem stance, and single leg stance) on both a firm and a foam surface. The reliability of the BESS has been well studied in the adult population. Intrarater reliability in adult and high school populations has been demonstrated to have intraclass correlation coefficients (ICCs) ranging between 0.63 and 0.92.10–14 Interrater reliability has been shown to be between 0.44 and 0.96 for each portion of the test,11,14,15 with an overall test ICC of 0.57 in an adult population.11 The minimum detectable change (MDC) has also been studied in an adult population, in which the mean intrarater and interrater MDCs across 3 testers were 7.3 and 9.4 points, respectively.11
Data for BESS testing in the pediatric population, however, are much scarcer.16 Two small studies by Valovich McLeod et al examined the BESS in youth aged 9 to 14 years. These studies used a modest sample of 50 children and demonstrated high intrarater ICCs, ranging from 0.87 to 0.95 for the various subcomponents of the test, with an ICC of 0.98 for the total score,17 and a test–retest reliability of 0.70.18
To our knowledge, the reliability of the BESS has not been studied in a large pediatric population. Our aim was to examine the interrater and intrarater reliability, the test–retest reliability, and the MDC, and to evaluate for the presence of a learning effect, for the BESS in a large pediatric population between the ages of 5 and 14 years.
This prospective observational study was approved by the University of Utah Institutional Review Board. Written parental permission and verbal child permission were required from all subjects.
Children aged 5 to 14 years were recruited from local school districts (elementary and junior high schools), after approval from their school board, by approaching principals and physical education teachers. Children without a history of any known visual, vestibular, or balance disorder were included in the study. Potential subjects with a lower extremity injury or pain within the last 6 months were excluded, as were children with active concussion symptoms. Parents of subjects were asked about previous head injuries, and further information about the number, perceived recovery, and timing of the concussions was recorded. Parental opinion of the child's balance ability was also recorded.
After obtaining parental and subject consent, children were screened for any active sources of lower extremity pain, and height and weight were measured. The BESS was performed, as described below, at the child's school as part of their physical education class. Testing took place in a quiet location separate from the gymnasium, free of distractions. Up to 4 testing stations were used at any one time, with portable dividers used to minimize distractions from nearby subjects. The testing surface was generally equivalent to a concrete or linoleum floor, and a nonslip grip mat was placed under the foam pad in the foam conditions to prevent the pad from slipping under the children's feet. Children, who were dressed in regular school attire, were asked to remove their shoes and identify which foot they preferentially used to kick a ball; this was used as the dominant foot in the BESS testing protocol. If they could not reliably identify which foot they preferred to kick with, they were asked which hand they write with, and the ipsilateral foot was used as the dominant foot. Children were videotaped and retested 2 to 3 weeks later in a similar fashion for the purposes of reliability determinations. This time frame was chosen not only for logistic and feasibility purposes but also because we feel it mirrors a reasonable time frame within which a health care provider may reassess a young concussed athlete. Videotaping was performed to facilitate the completion of interrater assessments.
The BESS testing, as described by Riemann et al,15 was performed—a summary is provided in Table 1. Briefly, subjects performed three 20-second standing balance conditions with hands on the iliac crests and eyes closed: 1 with the feet together, 1 with the dominant leg lifted, and 1 in a tandem position with a leading dominant leg. These conditions were repeated with the participants standing on a piece of medium-density foam (Airex Balance Pad; Alcan, Inc, Montreal, Canada). Single points were given for errors in balance—lifting the hands off the iliac crests, opening the eyes, stepping, stumbling, falling, remaining out of position for more than 5 seconds, moving the hip into more than 30° of flexion or abduction, or lifting the forefoot or heel. A maximum of 10 points per condition is allowed,15 for a total score ranging from 0 to 60. If the subject was unable to maintain stability for 5 seconds at any point during the trial, a maximum score of 10 was given for that condition. A total score was determined from the sum of the error points given for each of the six conditions.
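The scoring rule described above reduces to a simple computation: sum the error points across the 6 conditions, capping each condition at 10. The following minimal sketch illustrates that rule; the function name and input format are our own illustrative choices, not part of the published protocol.

```python
def bess_total(condition_errors):
    """Total BESS score from per-condition error counts.

    condition_errors: a sequence of 6 error counts, one per
    stance/surface combination (double, single, and tandem stance
    on the firm and foam surfaces).  Each condition is capped at
    10 error points, so the total ranges from 0 to 60.
    """
    if len(condition_errors) != 6:
        raise ValueError("The BESS has exactly 6 conditions")
    return sum(min(errors, 10) for errors in condition_errors)
```

For example, a subject who accrued 12 errors in one condition would still receive only 10 points for that condition, per the capping rule.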
Test–retest reliability was obtained by comparing BESS scores as recorded live at the two separate time points by the same rater. Intrarater reliability was assessed by having raters review videos of their own live assessments; only raters who tested a minimum of 10 subjects were asked to review their own videos. Interrater reliability was assessed by having 4 raters review videos of 50 random subjects distributed evenly by age and sex across the entire cohort. All raters, for both intrarater and interrater reliability determinations, had experience administering the BESS in clinical encounters and received additional training in performing the BESS. This training took the form of live clinical assessments with the lead author, who routinely evaluates patients in the age demographic of this study, and review of an online course produced by the lead author in which BESS video demonstrations of actual patients in this age demographic were viewed, scored, and discussed. All subjects were part of a project aimed at deriving normative data for the BESS and modified BESS in children; results on the normative data are discussed elsewhere.19
We described participant demographic characteristics using mean, standard deviation, median, and proportion as appropriate. We also generated matrix scatter plots displaying the BESS scores recorded by the 4 reviewers.
To examine intrarater, interrater, and test–retest reliabilities of the BESS, ICCs and their associated 95% Confidence Intervals (CIs) were estimated. The interrater observations were based on the BESS total scores across the 4 reviewers overall; the test–retest observations were based on the first live and the second live tests of all the reviewers; the intrarater observations were based on the first live test and first video within individual reviewer. Furthermore, the MDC for the intrarater, interrater, and test–retest reliabilities were computed for the total BESS scores; individual stance data were not analyzed.
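The ICC computations above can be sketched as follows. This is a generic two-way random-effects, absolute-agreement, single-measure ICC, i.e., ICC(2,1), computed from a standard ANOVA decomposition; it is offered only as an illustration of the statistic, and the exact model options used by the statistical packages in this study may differ.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores: (n_subjects, k_raters) array, e.g., BESS total scores with one
    column per rater.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-rater means

    # ANOVA sums of squares
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Perfect agreement between raters yields an ICC of 1.0, while rater disagreement relative to between-subject spread pulls the coefficient toward 0.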
For reliability testing, a power analysis determined that a sample size of 46 subjects, with 4 raters rating each subject on video review, achieves 99% power to detect an ICC of 0.3 using an F-test with an alpha of 0.05.
A P-value of less than 0.05 was considered statistically significant. All analyses were performed using SPSS 22 (IBM Corp Released 2013: IBM SPSS Statistics for Windows, Version 22.0; IBM Corp, Armonk, NY) and SAS 9.2 (SAS Institute Inc, Cary, NC).
Study data were collected and managed using Research Electronic Data Capture (REDCap) tools hosted at the University of Utah.20 Research Electronic Data Capture is a secure, web-based application designed to support data capture for research studies, providing (1) an intuitive interface for validated data entry; (2) audit trails for tracking data manipulation and export procedures; (3) automated export procedures for seamless data downloads to common statistical packages; and (4) procedures for importing data from external sources.
A total of 388 potential subjects were screened for the study, of whom 373 participated. Six potential subjects were excluded for a previous diagnosis of a visual, vestibular, or balance disorder; 8 were excluded for a lower extremity injury within the last 6 months; and 1 child was excluded because of a recent, symptomatic concussion. The median age of participants was 9 years (range, 5-14 years); 58.7% were male, and 83.7% participated in team or individual sports. A total of 10.5% of participants had sustained a concussion in the past; of these, 84.6% had sustained only 1 concussion, 66.7% of the concussions had occurred greater than 1 year ago, 94.9% reported complete recovery, and 5.1% reported that they were mostly recovered. These subjects were included in the study. A total of 16 reviewers participated in test administration. A thorough description of subject demographics and BESS test results is available separately.19
Of 373 subjects, 331 (89%) completed both the first and second assessments. The test–retest reliability was 0.90 (95% CI, 0.88-0.92). Regarding the possibility of a learning effect, the mean difference in BESS score between the two live test administrations was 0.036 (95% CI, −0.53 to 0.60), and the paired t test (P = 0.900) revealed no significant learning effect at the group level. However, some individuals showed rather large differences between the first and second tests: in 9% of subjects the BESS score did not change, in 44% the BESS score worsened (mean change, 4.43; SD, 3.70; range, 1-27), and in 47% the BESS score improved (mean change, 4.22; SD, 3.00; range, 1-15).
Overall intrarater ICC was 0.96 (95% CI, 0.95-0.97) with individual intrarater ICCs ranging between 0.69 and 0.99. Eight of 16 reviewers had completed enough live assessments and subsequent video reviews to be eligible for analysis of intrarater reliability. See Table 2 for a summary of intrarater ICCs.
Overall interrater ICC was 0.93 (95% CI, 0.79-0.97). Figure 1 graphically demonstrates the BESS correlations between raters, for which correlation coefficients ranged from 0.85 to 0.95.
Minimum detectable change was calculated according to various scoring comparisons. Minimum detectable change based on the first and second live test administration was 6.2 at the 90% CI, 7.3 at the 95% CI, and 9.6 at the 99% CI. Minimum detectable change based on the intrarater scoring variation was 3.9 at the 90% CI, 4.6 at the 95% CI, and 6.0 at the 99% CI. Finally, MDC based on the interrater variation was 8.1 at the 90% CI, 9.6 at the 95% CI, and 12.6 at the 99% CI. These are illustrated in Table 3 for reference.
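The MDC values above follow from the conventional relationship between reliability and measurement error: the standard error of measurement (SEM) is SD × √(1 − ICC), and the MDC at a given confidence level is z × SEM × √2, the √2 reflecting that a change score involves two measurements. The sketch below assumes this standard formula; the function and its inputs are illustrative.

```python
import math

def mdc(sd, icc, confidence=0.95):
    """Minimum detectable change from a reliability coefficient.

    sd:         standard deviation of the scores in the sample
    icc:        reliability coefficient (e.g., a test-retest ICC)
    confidence: two-sided confidence level; standard z-values hard-coded.
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    sem = sd * math.sqrt(1.0 - icc)   # standard error of measurement
    return z * sem * math.sqrt(2.0)   # change involves two measurements
```

For illustration only: a hypothetical SD of about 8.3 points combined with an ICC of 0.90 yields an MDC95 of roughly 7.3 points, the same order as our test–retest result; the SD here is an assumption for the example, not a reported figure.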
A good test has 2 important features: validity and reliability. To achieve validity, a test must measure what it claims to measure. There are several types of validity: criterion validity, content validity, and construct validity. In a pediatric concussion population, the criterion validity of the BESS has ranged from poor to adequate in comparisons with the Immediate Post-Concussion Assessment and Cognitive Testing and the Post-Concussion Symptom Scale.21 Excellent content validity has been demonstrated in adults,22 and the BESS has shown good construct validity in identifying balance deficits in different conditions.22 A good test must also be reliable; that is, it must produce consistent information about its subjects, with scores that are not unduly affected by measurement error. Validity does not necessarily imply reliability, and vice versa; for a test to be considered good, both validity and reliability evidence are required.
These results provide valuable information on the reliability of the BESS in a pediatric population. Although well studied in adult populations,10–15 literature regarding the reliability of this test in the pediatric population is scarce.16 These data expand on previous work by Valovich McLeod et al,17,18 using a larger population of healthy children between the ages of 5 and 14 years.
The test–retest reliability of the BESS was previously demonstrated to be 0.70 in a population of 50 pediatric athletes tested 60 days apart,18 which is much lower than our value of 0.90. The difference may arise from the fact that our children were retested much sooner (2-3 weeks after initial testing). Valovich McLeod et al17 suggested that there may be a learning effect for the BESS in a pediatric population when retested within days; this has also been demonstrated in young adult populations.23–25 We found no evidence for a learning effect over our longer interval, as scores for initial and repeat testing, performed 2 to 3 weeks apart, showed no significant difference. We used a similar population but a much longer break between test administrations, which could explain the loss of a potential learning effect that may have been present at an earlier time point. Interestingly, Valovich McLeod et al18 tested 50 youth sport participants over 60 days and demonstrated a significant difference in scores between tests, which did not appear in our similar study. One difference may be that the previous study specifically recruited youth athletes, who may see the testing as a challenge and something to improve on, or may have a better ability to train their bodies based on a single test.
Intrarater reliability of the BESS has been commonly studied in young adult and adult populations and ranges from 0.63 to 0.92.10–14 Our reviewers had similar reliability values, ranging from 0.69 to 0.99. Most ICC values were above 0.9, and the overall ICC was 0.96, which is similar to previous work in the pediatric population that demonstrated an ICC of 0.98.17 That study, by Valovich McLeod et al,17 examined a relatively small group of 50 athletes between the ages of 9 and 14 years, whereas in our study 250 subjects between the ages of 5 and 14 years were used to derive the intrarater data.
Interrater reliability was high in our population of 50 healthy children, more so than that recorded in young adult populations.11,14 This is an interesting phenomenon, as young adult populations tend to make fewer errors on the BESS than pediatric populations.17–19 It is possible that the types of errors seen in pediatric populations are more exaggerated in appearance and easier to identify than those in adults. We also used different types of raters (physicians, physical therapists, and certified athletic trainers), which suggests that different types of clinicians can administer the test reliably.
We examined MDC scores according to repeated assessments over time to identify what constitutes a significant change in BESS score, as well as the MDC when different assessors conduct the examination. For serial BESS administrations performed by the same reviewer, our data (MDC of 7.3 at the 95% CI) suggest that a change of 8 or more points scored by the same reviewer has a greater than 95% probability of reflecting a true change in postural stability rather than intrarater variability. Valovich McLeod et al18 used a reliable change index method rather than the MDC in their study of 50 student athletes between the ages of 9 and 14 years who were administered the BESS approximately 60 days apart; their data were reported at 70%, 80%, and 90% CIs. They reported that a significant worsening (or increase) in BESS score at a 90% CI was 5.3 units, or 6 points on the BESS, and that a significant improvement (or decrease) at a 90% CI was 9.4 units, or 10 points.
For intrarater variability in scoring the same test administration, our MDC of 4.6 (95% CI) is slightly lower (better) than the intrarater MDC of 7.3 points reported by Finnoff et al11 in an adult population. Their study did not, however, report the CI at which they established their intrarater MDC.
Regarding the MDC for assessments performed by different raters, our MDC of 9.6 points at a 95% CI suggests that a change of 10 points in the BESS when scored by different reviewers reflects a true change in postural stability with greater than 95% certainty. This is almost identical to the findings of Finnoff et al,11 who demonstrated an interrater MDC of 9.4 points in an adult population. Together, these findings suggest that most of the variability in testing comes from differences between raters.
Clearly, future studies need to investigate the performance of children with concussion on BESS testing to establish discriminant validity (the ability of the BESS to accurately distinguish concussed from nonconcussed children at the time of initial evaluation) and convergent validity (whether the BESS is related to similar instruments). It will also be important to determine the timing of normalization of BESS performance in children with concussion, and how these measures compare with other measures, such as neuropsychological tests or symptom scores. Our study was limited in that our data were drawn from a sample of local schools; hence, the results may not be generalizable to other populations. Because testing was performed on a group of healthy children, for whom one could expect relatively little variability, reliability may be lower in an affected population. Finally, the comparatively broad age range of our subjects could contribute to a relatively lower reported reliability, particularly as the younger children (ages 5-10 years) included in our cohort tend to have greater variability in their BESS scores and make more errors.19
In conclusion, this is the first large study examining the reliability of the BESS in a pediatric population. The BESS was demonstrated to have a test–retest reliability of 0.90 (MDC, 7.3 points), an intrarater ICC of 0.96 (MDC, 4.6 points), and an interrater ICC of 0.93 (MDC, 9.6 points). This suggests that the BESS is a reliable measure of balance in the pediatric population, but it may lack responsiveness, as relatively large MDCs are required to detect change. Our study should increase the confidence of health care providers in using the BESS as a standard part of the physical examination in children with suspected concussion, even in those as young as 5 years of age.
The authors thank Shirley D. Hon from the University of Utah College of Engineering, Christine Cheng from Roseman University College of Pharmacy, and Jeremy D. Franklin from the University of Utah College of Education for assistance with data analysis.
1. Bakhos LL, Lockhart GR, Myers R, et al. Emergency department visits for concussion in young child athletes. Pediatrics. 2010;126:e550–e556.
2. Field M, Collins MW, Lovell MR, et al. Does age play a role in recovery from sports-related concussion? A comparison of high school and collegiate athletes. J Pediatr. 2003;142:546–553.
3. Graham R, Rivara FP, Ford MA, Spicer CM; National Research Council Committee on Sports-Related Concussions in Youth, Institute of Medicine. Sports-related concussions in youth: improving the science, changing the culture. Washington, DC: National Academies Press; 2014.
4. SCAT3. Br J Sports Med. 2013;47:259.
5. Giza CC, Kutcher JS, Ashwal S, et al. Summary of evidence-based guideline update: evaluation and management of concussion in sports: report of the Guideline Development Subcommittee of the American Academy of Neurology. Neurology. 2013;80:2250–2257.
6. McCrory P, Meeuwisse W, Johnston K, et al. Consensus statement on concussion in sport—the 3rd international conference on concussion in sport held in Zurich, November 2008. PM R. 2009;1:406–420.
7. McCrea M, Guskiewicz KM, Marshall SW, et al. Acute effects and recovery time following concussion in collegiate football players: the NCAA Concussion Study. JAMA. 2003;290:2556–2563.
8. Murray N, Salvatore A, Powell D, et al. Reliability and validity evidence of multiple balance assessments in athletes with a concussion. J Athl Train. 2014;49:540–549.
9. Riemann BL, Guskiewicz KM. Effects of mild head injury on postural stability as measured through clinical balance testing. J Athl Train. 2000;35:19–25.
10. Erkmen N, Taşkın H, Kaplan T, et al. The effect of fatiguing exercise on balance performance as measured by the balance error scoring system. Isokinetics Exerc Sci. 2009;17:121–127.
11. Finnoff JT, Peterson VJ, Hollman JH, et al. Intrarater and interrater reliability of the Balance Error Scoring System (BESS). PM R. 2009;1:50–54.
12. Hunt TN, Ferrara MS, Bornstein RA, et al. The reliability of the modified balance error scoring system. Clin J Sport Med. 2009;19:471–475.
13. Susco TM, Valovich McLeod TC, Gansneder BM, et al. Balance recovers within 20 minutes after exertion as measured by the balance error scoring system. J Athl Train. 2004;39:241–246.
14. Valovich McLeod TC, Armstrong T, Miller M, et al. Balance improvements in female high school basketball players after a 6-week neuromuscular-training program. J Sport Rehabil. 2009;18:465–481.
15. Riemann BL, Guskiewicz KM, Shields EW. Relationship between clinical and forceplate measures of postural stability. J Sport Rehabil. 1999;8:71.
16. Davis GA, Purcell LK. The evaluation and management of acute concussion differs in young children. Br J Sports Med. 2014;48:98–101.
17. Valovich McLeod TC, Perrin DH, Guskiewicz KM, et al. Serial administration of clinical concussion assessments and learning effects in healthy young athletes. Clin J Sport Med. 2004;14:287–295.
18. Valovich McLeod TC, Barr WB, McCrea M, et al. Psychometric and measurement properties of concussion assessment tools in youth sports. J Athl Train. 2006;41:399–408.
19. Hansen C, Cushman D, Anderson N, et al. A normative data set of the balance error scoring system in children between the ages of 5 and 14. Clin J Sport Med. 2016;26:497–501.
20. Harris PA, Taylor R, Thielke R, et al. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377–381.
21. Barlow M, Schlabach D, Peiffer J, et al. Differences in change scores and the predictive validity of three commonly used measures following concussion in the middle school and high school aged population. Int J Sports Phys Ther. 2011;6:150–157.
22. Bell DR, Guskiewicz KM, Clark MA, et al. Systematic review of the balance error scoring system. Sports Health. 2011;3:287–295.
23. Mancuso J. An investigation of the learning effect for the Balance Error Scoring System and its clinical implications [Abstract]. J Athl Train. 2002;37:S10.
24. Mulligan IJ, Boland MA, McIlhenny CV. The balance error scoring system learned response among young adults. Sports Health. 2013;5:22–26.
25. Valovich TC, Perrin DH, Gansneder BM. Repeat administration elicits a practice effect with the balance error scoring system but not with the standardized assessment of concussion in high school athletes. J Athl Train. 2003;38:51–56.