The serious problem of falls in old age, and the balance control difficulties that commonly accompany falling, has led to the development of standardized measures of balance. Two assessment tools commonly used in physical therapy, and specifically in the evaluation of older individuals who have fallen or are at risk of falling, are the Berg Balance Scale (BBS)1 and the Dynamic Gait Index (DGI).2,3 These evaluation instruments not only serve as baseline measures of the balance problem but also assist the therapist in developing a treatment program for the client's specific problems and in monitoring improvement.
Low scores on the BBS and DGI have been associated with an increased risk of falling in the older population (ie, a BBS score of <45 and a DGI score of <19).3–6 Therapists use these cutoff points as a reference to guide treatment and monitor progress. Cutoff scores are often taken as absolute values to determine the success of a particular intervention and are commonly used to report patient progress. For example, a patient who improves from an initial BBS score of 40 to a final score of 46 can be reported as having a much lower risk of falling.7 This anchor-based method should be used with caution because assessment instruments are not always responsive enough to detect small changes in function. Responsiveness, or sensitivity to change, refers to the ability of an instrument to detect change over time when a patient has actually experienced a gain or loss of function. Clinicians need to consider that meaningful changes in function can be masked by the multiple sources of error associated with an assessment instrument.
Unfortunately, when results differ from one evaluation to the next, it cannot be assumed that “true” change has occurred; some or all of the change could be attributed to measurement error. Error can be inherent to the test, a result of evaluators' administration of the test, or simply a representation of the naturally existing fluctuation in patients' performance. The amount of error across measurements of the same test is related to the reliability of the test. Several statistical methods have been widely used to estimate reliability. These include methods of relative reliability, such as the Pearson correlation coefficient and the intraclass correlation coefficient (ICC), and methods of absolute reliability, such as the coefficient of variation and the standard error of measurement (SEM).8–10 Whereas relative reliability methods focus on the strength of the correlation between repeated measures, taking into account both the total group variability (between subjects) and the variability of individual measurements (within subjects), absolute reliability methods focus on the variability of the scores from measurement to measurement (within subjects). The latter approach offers the advantage of providing estimates that are not affected by the range of measured scores and can, therefore, be used to interpret the consistency of individual scores.9,11
A number of investigators have used the SEM to estimate the absolute reliability of clinical performance measures.10–15 The SEM expresses measurement error in the same units as the original instrument and is not influenced by variability among subjects.10 Therefore, the SEM is an excellent indicator of the influence of measurement error on an individual's score. The SEM is closely related to the concept of minimal detectable change (MDC) expressed by Stratford et al.16 The MDC is the amount of change in a given measure that must be observed to determine whether true change has occurred between 2 testing occasions. The MDC is expressed as a confidence interval around the SEM, indicating the values that are within the range of error attributable to the measuring instrument. The MDC can provide clinicians with useful and easy-to-understand criteria for assessing change in patients' performance.
Although correlation methods used to calculate relative reliability are excellent sources of information for comparing groups of patients, the SEM is more appropriate for clinical practice, that is, when making decisions about individual patients.17 However, few investigators have used a measurement of absolute reliability to investigate the psychometric properties of the BBS, and no published study has addressed this issue with the DGI.12–15 Stevenson15 used the SEM to investigate error associated with the use of the BBS in patients with stroke. He found an SEM of 2.49 (in BBS units) in patients with stroke receiving inpatient rehabilitation. In addition, he calculated a confidence interval around the SEM and found that a change of 6 BBS points was needed to be 90% confident that true change had occurred. Similar results were obtained by others. Steffen and Seney14 investigated individuals with Parkinson disease and found an MDC95% value of 5 BBS points. Conradsson and colleagues12 explored this issue in elderly individuals living in assisted living facilities and dependent in activities of daily living. The results of their investigation suggest that a change of 8 BBS points is necessary to detect a change with 95% confidence. These results suggest that clinical decisions based on these assessment tools must take into account the amount of error associated with multiple administrations of the assessment.
Further investigation of the MDC in different populations and under different test-retest conditions is needed. In addition, understanding how MDC values change at different score levels of an assessment instrument (ie, the BBS and DGI cutoff points of <45 and <19, respectively) may be valuable for clinicians who are faced with making treatment decisions on the basis of these values. The purpose of this study was to use the SEM to estimate the MDC associated with the BBS and DGI. This study aimed to improve the clinical use of these standardized instruments by providing clinicians with estimates of measurement error that are easy to interpret and can be used to assist in clinical decision making.
The sample consisted of 42 subjects (26 men and 16 women, aged 65 years and older) participating in a larger research study investigating the link between smoking and recovery from frailty in older Floridians. This study was approved by the institutional review board of the University of Florida and the Research and Development Committee at the North Florida/South Georgia Veterans Affairs Medical Center. Inclusion criteria were community dwelling, a history of falls or near falls in the past 12 months, the ability to walk 20 feet (with or without an assistive device), and a score of 24 or more on the Mini-Mental State Examination.18 All participants presented multiple comorbid conditions, including, but not limited to, diabetes, hypertension, neuropathies, orthopedic conditions, Parkinson disease, previous stroke, and general deconditioning. Participants were recruited from a gait and balance disorders clinic at the North Florida/South Georgia Veterans Affairs Medical Center, from local offices of geriatric physicians, and from the general community.
The Berg Balance Scale
The BBS is a frequently used performance-based ordinal scale that assesses postural balance.1 The test consists of 14 motor tasks, which simulate those tasks that older adults encounter during daily activity: sitting to standing, standing unsupported, sitting unsupported, standing to sitting, transfers, standing with eyes closed, standing with feet together, reaching forward with outstretched arm, retrieving an object from floor, turning to look behind, turning 360°, placing alternate foot on stool, standing with 1 foot in front, and standing on 1 foot. Each item is rated on a 5-point ordinal scale of 0 (indicating the lowest level of function) to 4 (indicating the highest level of function), with the total score ranging from 0 to 56. The BBS can be administered in 15 to 20 minutes and requires minimal equipment.
A number of investigations suggest that the BBS is a valid measure of balance. Initially, Berg et al4 correlated BBS scores with a general rating of balance made by a therapist (Pearson r = 0.81). Other studies by the same authors have also demonstrated high correlations between the BBS and other measures of balance. For instance, the Pearson r correlations between the BBS and the balance subscale of the Tinetti Performance-Oriented Mobility Assessment and the Barthel Index mobility subscale were 0.91 and 0.67, respectively.1 Other researchers have also found high correlations between BBS scores and other motor and functional measurements: Fugl-Meyer test motor and balance subscales (Pearson r = 0.62-0.94), timed up and go test scores (Pearson r = −0.76), Emory Functional Ambulation Profile (Pearson r = −0.60), and gait speed (Pearson r = 0.81).19,20 BBS scores also correlated moderately with the DGI (Spearman coefficient = 0.67) and with center-of-pressure measures (−0.40 to −0.67 [Kendall coefficient]).20
Several studies have also reported high intra- and interrater reliability for the BBS. Berg et al21 used videotaped evaluations of the BBS to obtain interrater reliability (ICC = 0.98 for total BBS scores). The same researchers replicated these results in a test-retest format, producing a within-rater ICC of 0.97 and a between-rater ICC of 0.98. The large majority of studies investigating the reliability of the BBS have used some form of correlation coefficient, such as the Pearson product moment correlation coefficient (r) or the ICC, with the latter becoming more popular in recent times. A fundamental problem of ratio indexes such as the ICC is that the error of measurement and true variability are expressed in relative terms. An ICC is a ratio of between-subject variability to total variability (between-subject plus within-subject variability); thus, the range of genuine differences in any attribute is sample dependent. Therefore, reported high ICC values for the BBS must be considered with caution.
The Dynamic Gait Index
The DGI was developed by Shumway-Cook and Woollacott2 to assess balance in older adults who are at risk for falling. This functional gait scale consists of 8 common gait tasks: walking at different speeds on a level surface, walking with horizontal and vertical head turns, ambulating over and around obstacles, ascending and descending stairs, and making quick turns. Each item is scored on a 4-level ordinal scale, where 3 = “normal,” 2 = “minimal impairment,” 1 = “moderate impairment,” and 0 = “severe impairment.” The maximum possible score is 24 points. The DGI can be administered in 10 minutes and requires minimal equipment.
The psychometric properties of the DGI have not been extensively investigated. Validity of the scale has been supported by moderate correlation with the BBS (Spearman rank order correlation, ρ = 0.71).22 Sensitivity and specificity to identify individuals with a history of falls have been established at 59% and 64%, respectively.3 The test developers investigated the interrater and test-retest reliability of the scale using a small sample of 5 older adults and 5 raters. They found an interrater ICC of 0.96 and a test-retest ICC of 0.98 when subjects were retested a week later by 2 therapists.23 Intrarater reliability has only been reported in individuals with multiple sclerosis (0.76-0.99, Pearson bivariate analysis, P < .05).24 Thus, despite the DGI's wide use in the clinic, its psychometric properties remain insufficiently investigated.
The BBS and DGI were administered as part of a larger battery of physical performance and self-reported behavioral measures. For the purpose of this investigation, only initial evaluations (initial live evaluation and videotaped rescore of this evaluation) were considered. Evaluations were conducted by 2 experienced physical therapists, who specialized in gait and balance disorders in the older adult population (>7 years of experience in geriatric physical therapy). At each evaluation, the BBS and DGI assessments were videotaped by a research assistant with a Sony DCR-VX2100 digital camcorder (The Sony Corporation of America, New York, New York). Recorded sessions were rescored at a later time (time between initial and rescores >2 weeks) by the same therapists. Therapists used a television screen or computer monitor to view the recorded evaluations, and they were blinded to previous score (live score) and whether the recordings were from an initial or subsequent evaluation. All participants were assessed (live) and reassessed (recorded) by the same therapist.
All statistical analyses and graphical representations were performed with SPSS 13.0 software for Windows (SPSS Inc, Chicago, Illinois) and Microsoft Office Excel software for Windows (Microsoft Corporation, Redmond, Washington).
Box plots were used to investigate the presence of outliers in the data. The distribution of the absolute differences between tests (initial BBS and DGI and rescored BBS and DGI) was plotted. Cases with values between 1.5 and 3 box lengths (interquartile ranges) from the upper or lower edge of the box were considered mild outliers. Cases with values more than 3 box lengths from the upper or lower edge of the box were considered extreme outliers. The SEM (also referred to as the absolute reliability) was calculated using the following formula: SEM = SD × √(1 − ICC). Because raters did not assess all subjects, a completely random 1-way design was applied to calculate the ICC. This corresponds to the ICC (1,1) classification of Shrout and Fleiss.25
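The ICC (1,1) and SEM computations above can be sketched as follows. This is a minimal illustration, assuming a subjects-by-measurements score matrix under a 1-way random-effects model, and using the standard formula SEM = SD × √(1 − ICC); the example matrix in the test below is synthetic, not the study data.

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC(1,1) (Shrout & Fleiss)
    for an (n subjects x k measurements) score matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    subj_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects and within-subject mean squares
    ms_between = k * ((subj_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - subj_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def sem_from_scores(ratings):
    """SEM = SD x sqrt(1 - ICC), with SD taken over all scores."""
    ratings = np.asarray(ratings, dtype=float)
    return ratings.std(ddof=1) * np.sqrt(1.0 - icc_1_1(ratings))
```

Note that the choice of SD is a modeling decision: some authors use the pooled SD of all scores, as here, while others use the SD of the baseline measurement only.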
The SEM was used to calculate the MDC. Minimal detectable change is the product of the SEM, the tabled z score for a desired confidence level, and √2. The √2 term acknowledges that 2 measurements are being compared. For a 95% confidence level, MDC = SEM × 1.96 × √2 (1.96 = z value associated with a 2-sided 95% confidence interval). Minimal detectable change was also calculated at the 90% and 80% confidence levels (MDC [90%] = SEM × 1.645 × √2, and MDC [80%] = SEM × 1.28 × √2).
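The MDC computation reduces to a product of three terms; a minimal sketch, using the 2-sided z values quoted above and, purely as a worked check, Stevenson's reported SEM of 2.49 BBS points:

```python
from math import sqrt

# Two-sided z values for the confidence levels used in the text
Z = {0.95: 1.96, 0.90: 1.645, 0.80: 1.28}

def mdc(sem, confidence=0.95):
    """Minimal detectable change: SEM x z x sqrt(2).
    sqrt(2) accounts for error in both of the 2 compared measurements."""
    return sem * Z[confidence] * sqrt(2)

# Worked check: 2.49 x 1.645 x sqrt(2) is roughly 5.8, consistent with
# the ~6-point, 90%-confidence criterion reported by Stevenson.
print(round(mdc(2.49, 0.90), 1))  # → 5.8
```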
Normality was visually explored with normal Q-Q plots and tested with the Kolmogorov-Smirnov normality test. The use of the SEM, because it assumes a normal distribution of error, requires that the measurement error not be related to the magnitude of the measured variable; data that violate this assumption are termed heteroscedastic.9 In heteroscedastic data, individuals who score the highest on a particular test also show the greatest amount of measurement error. Heteroscedasticity was formally examined by plotting the absolute differences between the initial (live) and rescored (videotaped) values against the mean score. In addition, the Spearman correlation (ρ) was used to rule out a relationship between each individual's absolute score difference and his or her mean.
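This heteroscedasticity check can be sketched as follows. The live/rescored pairs here are synthetic stand-ins generated with a score-independent error term, not the study data; the analysis steps (Bland-Altman style means, absolute differences, Spearman ρ) mirror those described above.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
# Synthetic live/rescored pairs (illustrative only): scores plus an
# error term whose size does not depend on the score (homoscedastic)
live = rng.integers(20, 56, size=42).astype(float)
rescored = live + rng.normal(0.0, 2.4, size=42)

means = (live + rescored) / 2        # Bland-Altman style x axis
abs_diff = np.abs(live - rescored)   # absolute disagreement, y axis

rho, p = spearmanr(means, abs_diff)
# A small, non-significant rho is consistent with homoscedastic error
print(f"rho = {rho:.2f}, p = {p:.3f}")
```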
The SEM and MDC procedures described previously were also used to investigate the amount of error associated with individuals at different levels of the BBS and DGI rating scales. Because the true score on these 2 assessments is unknown, the mean of the initial and rescored values of the BBS and DGI was used to dichotomize the participants into 2 groups. Commonly used cutoff scores (<45 for the BBS and <19 for the DGI) were used to form the groups. Dichotomizing the DGI scores resulted in a sample size not suitable for this analysis. Therefore, the SEM and MDC were calculated only for the high and low BBS performance groups.
A total of 42 participants were assessed with the BBS and the DGI. The average age was 75.6 years (range 59–88 years). There were 26 men (62%) and 16 women (38%) in the sample. The participants' mean initial BBS score was 39 points (SD = 8.9, range 17–53). The rescored mean value was 40 points (SD = 8.8, range 19–55). For the DGI, the mean initial value was 12.9 (SD = 4.5, range 3–21), and the rescored mean was 12.7 (SD = 4.6, range 4–22).
The distribution of absolute values of the difference between initial and rescored BBS scores was investigated with a box plot. For the BBS, 3 participants' scores were identified as outliers; their absolute values were 8, 11, and 6. These scores were considered mild outliers because their values lay between 1.5 and 3.0 times the interquartile range below the first quartile or above the third quartile. Therefore, these 3 scores were included in all subsequent analyses. The distribution of absolute values of the difference between initial and rescored DGI scores was also explored with a box plot. No outliers were identified in this case.
A frequency distribution of the absolute difference between initial and rescored values is found in Table 1. The BBS presented a mean absolute difference of 2.6 points (SD = 2.4, range 0–11). Fifty-seven percent of the participants had a BBS absolute difference of 2 points or less, whereas 5% presented an absolute difference of more than 6 BBS points. For the DGI, the mean absolute difference was 1.29 (SD = 0.99, range 0–3). Seventy-four percent of participants had an absolute difference of 1 DGI point or less, while 14% had an absolute difference of 3 DGI points.
Figure 1 presents a plot of the BBS combined mean (initial and rescored) against the difference between initial and rescored values. Figure 2 replicates the same plot with DGI values. These plots offer a visual representation of the spread of the scores from perfect reliability (ie, initial minus rescored value equal to 0). In addition, the plots indicated that the data were reasonably homoscedastic for both the BBS and the DGI; that is, the spread of the scores did not increase or decrease with higher or lower combined mean values. Spearman ρ correlation coefficients (BBS, ρ = 0.17, P > .05; DGI, ρ = 0.3, P > .05) confirmed the lack of relationship between the mean score and the absolute difference in scores (initial minus rescored values).
To estimate MDC values for the BBS and DGI, the SEM was calculated. SEM values for the BBS and DGI were 2.35 and 1.04, respectively. Next, MDC was obtained using 95%, 90%, and 80% confidence levels (MDC95%, MDC90%, and MDC80%). A summary of MDC values is given in Table 2. The results revealed MDC95% values of 6.5 BBS and 2.9 DGI points, respectively. Therefore, a change greater than 6.5 BBS points and 2.9 DGI points is necessary to reveal a change that exceeds the measurement error associated with these instruments and show “genuine” change with 95% confidence.
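As an arithmetic check, the MDC95% values reported above follow directly from the SEMs and the formula MDC = SEM × z × √2:

```python
from math import sqrt

# Reproducing the reported MDC95 values from the study's SEMs
for name, sem in [("BBS", 2.35), ("DGI", 1.04)]:
    print(f"{name}: MDC95 = {sem * 1.96 * sqrt(2):.1f}")
# → BBS: MDC95 = 6.5
# → DGI: MDC95 = 2.9
```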
Subjects were divided into functional groups based on commonly used cutoff scores for the BBS (<45) and DGI (<19). Because of sample size limitations, the MDC was calculated only for the BBS functional groups. The BBS functional grouping resulted in 30 cases classified as low function (<45 points) and 12 cases classified as high function (≥45 points). The MDC95% was 7.3 BBS points for the low-function group and 6.3 BBS points for the high-function group. Other MDC values are summarized in Table 2.
Consistent with previously reported findings,12–15 the results of this investigation suggest that a considerable amount of change is necessary to be confident that genuine change has occurred between 2 BBS testing occasions. Similar results were obtained for the DGI, suggesting that clinicians using these instruments must be cautious when interpreting results obtained from repeated measures.
The BBS and DGI are widely used in clinical practice to measure individuals' balance ability and gait and to monitor improvement in these areas. High reliability values have been reported for both instruments.12,21,26–29 However, previous investigations have used a form of correlation, such as the Pearson product moment correlation or the ICC, to investigate the reliability of these instruments. Although correlational investigations are suitable for investigating the degree of agreement between groups of subjects in repeated measures, they offer little information about the amount of change that an individual needs to achieve “genuine” change, that is, the amount of change beyond the error associated with the instrument.
Absolute reliability is a more appropriate way of investigating the reliability of an instrument intended for use in a clinical setting, where clinicians are more concerned about individual change. In this investigation, the results of the absolute reliability of the BBS and the DGI indicate that 6.5 and 2.9 points, respectively, are required to be 95% confident that genuine change has occurred between 2 testing occasions for older adults who have reported falling. This information is valuable for clinicians working in geriatric settings where these results can be used to make individual decisions and assess improvement in function over time.
Although the 95% confidence level is widely accepted in the research community, one could argue that in clinical practice a lower confidence level could be of practical use for making appropriate clinical decisions. In this investigation, confidence levels of 90% and 80% were also calculated. The BBS showed an MDC90% of 5.5 points and an MDC80% of 4.3 points. The DGI presented an MDC90% of 2.4 points and an MDC80% of 1.9 points. Even at the lowest confidence level, both instruments demonstrated an estimated amount of error that should be considered when making clinical judgments. In this investigation, the DGI demonstrated half the amount of estimated error compared with the BBS. This comparison does not, however, take into account the range of values of the 2 instruments. That is, the MDC of 6.5 points on the BBS is equal to 11.6% of the total possible BBS score (56 points), whereas the MDC of 2.9 DGI points is equal to 12.1% of the total possible DGI score (24 points). Therefore, based on these results, both instruments presented similar MDCs.
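The scale-relative comparison above is a one-line calculation; as a sketch:

```python
# Express each MDC95 as a percentage of the instrument's maximum score
for name, mdc95, max_score in [("BBS", 6.5, 56), ("DGI", 2.9, 24)]:
    print(f"{name}: {100 * mdc95 / max_score:.1f}% of scale")
# → BBS: 11.6% of scale
# → DGI: 12.1% of scale
```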
Distribution-based approaches such as the SEM used in this investigation assume that the measurement error is constant across the range of possible scores. In this investigation, individuals were dichotomized into 2 functional groups to investigate the possible fluctuation of the SEM at 2 different levels of the scale. For the BBS, individuals with lower performance (BBS < 45) demonstrated higher SEM values. Their MDC95% was 7.3 BBS points. In contrast, participants with higher performance scores (BBS > 45) showed lower SEM values. Their MDC95% was 6.3 BBS points. These results suggest that measurement error is not constant across different levels of function as assessed by this instrument. Clinicians need to consider this limitation and apply the appropriate criteria when assessing patients of different functional ability. However, in this investigation, dichotomizing the initial pool of participants reduced the sample size of the groups. Therefore, our results should be interpreted with caution. Further investigation is needed to substantiate the possibility of different MDC levels based on performance. In addition, because of sample size limitations, we were unable to investigate MDC values at different DGI levels of performance. Future research needs to address this issue.
A methodological issue worth considering is the use of videotaped evaluations to establish the reliability of an instrument. This method has been widely used and published. In fact, the initial reliability study conducted by Berg et al30 used videotaped assessments to investigate the intrarater reliability of the BBS. A clear disadvantage of this design is that it does not take into account the natural fluctuation of a participant's performance when tested on 2 separate occasions. Clinical decisions based on this method must, therefore, be made with caution because not all sources of error are considered.
In a recent publication, Stevenson15 found an MDC95% value of 6.9 BBS points when assessing patients with stroke in a test-retest design. Stevenson's results are comparable to those of the present experiment, perhaps suggesting that the variation seen in both experiments is mostly attributable to the instrument itself rather than to within-patient variability. An interesting point to note is that the test-retest experiment of Stevenson15 used the best performance of 3 trials as the value for each item. With this approach, the true score is likely more easily captured, and the within-subject variability is decreased. In addition, Stevenson15 used the data reported by Berg and colleagues21 in their reliability study to calculate the MDC95% and found an MDC95% of 6.2. Again, the investigation by Berg et al21 employed a test-retest design with participants with stroke and produced results similar to those of the present study.
A significant advantage of using videotaped sessions to assess absolute reliability is the ability of this procedure to isolate measurement error inherent to the use of a particular assessment instrument. Fluctuations in individuals' performance must be considered in clinical decisions. However, to investigate and improve the measurement properties of an assessment instrument, we must be able to isolate the intrinsic psychometric properties of the instrument and the ability of the raters to produce consistent ratings. An additional advantage of using videotaped sessions is that it eliminates the possibility of a learning effect when assessing participants on 2 separate occasions. When testing and retesting subjects within a short period of time, subjects could perform better after becoming familiar with the test and the testing environment.
Although researchers often focus on significant group mean changes in the variable of interest to draw conclusions about the effectiveness of a particular intervention, clinicians face the need to assess individual patients to judge a particular condition or monitor improvement. In this study, the BBS and DGI demonstrated mean values between test occasions of less than 1 point, suggesting that as a group, both testing occasions provided almost indistinguishable results. At the individual level, however, these 2 instruments demonstrated a considerable amount of variability. Clinicians must, therefore, be aware of this issue and consider the MDC values when making decisions on the basis of these instruments.
The methodology employed in this investigation adds a level of complexity to traditional reliability studies. This complexity results, however, in more user-friendly reliability estimates. This statistical approach allows clinicians to readily determine when genuine change has occurred between 2 testing occasions.
This experiment is the first attempt to investigate the MDC of the BBS and the DGI in older community dwellers participating in a rehabilitation program. The results from this investigation demonstrate that a change of 6.5 points in the BBS and 2.9 points in the DGI is necessary to be 95% confident that genuine change in function has occurred between 2 assessments. This information is important for assessing individuals' performance, monitoring progress, and guiding treatment in clinical practice. Future investigations are needed to explore MDC at different functional levels.