Original Articles

Test-Retest Reliability of Traumatic Brain Injury Outcome Measures

A Traumatic Brain Injury Model Systems Study

Bogner, Jennifer A. PhD; Whiteneck, Gale G. PhD; MacDonald, Jessica PsyD; Juengst, Shannon B. PhD; Brown, Allen W. MD; Philippus, Angela M. BA; Marwitz, Jennifer H. MA; Lengenfelder, Jeannie PhD; Mellick, Dave MA; Arenth, Patricia PhD; Corrigan, John D. PhD

The Journal of Head Trauma Rehabilitation: September/October 2017 - Volume 32 - Issue 5 - p E1-E16
doi: 10.1097/HTR.0000000000000291


LONGITUDINAL and intervention studies often utilize participant self-report instruments to measure outcomes. The establishment of test-retest reliability is critical to the determination of whether self-reported outcomes have changed because of time, intervention effects, or other factors. Unfortunately, many measures used to evaluate change in individuals with moderate-severe traumatic brain injury (TBI) do not have data on reliability with this diagnostic group. For example, the majority of measures used in the National Institute on Disability, Independent Living, and Rehabilitation Research TBI Model Systems (TBIMS) National Dataset, the largest longitudinal study of persons with TBI in the world, do not have established reliability. While some demographic items, health status questions, and substance use questions derived from national surveys have reliability data for the general population, the measures have not been specifically evaluated with persons with TBI. Even the most conventional measures, such as the Disability Rating Scale (DRS1,2) and FIM,3 lack information on reliability when used with persons with moderate-severe TBI and/or when administered by telephone, as is done with the TBIMS National Dataset and other studies.

The use of a self-report instrument without established reliability is of concern due to multiple potential sources of error that can introduce noise into the evaluation of change. Self-report can be especially vulnerable to error because in the course of answering questions, a number of cognitive and emotional processes are employed that can vary between persons and their current state. The Cognitive Aspects of Survey Methodology paradigm specifies 4 cognitive processes used to generate a response to a survey question: (1) understanding and interpreting the question; (2) retrieving relevant information; (3) forming a judgment based on integration of retrieved information, and (4) “mapping” the judgment to the response options.4 Even respondents without cognitive impairments do not always proceed through this sequence in an optimal manner; those with cognitive and executive functioning deficits can be especially challenged.

Emotional state, level of effort, and level of fatigue can also vary across time points and interact with cognitive processes. Some respondents may engage in “satisficing,” that is, employing minimal effort when formulating responses.5 Respondents provide an answer that appears satisfactory but may not represent the full consideration of the question that is expected by the interviewer. Respondents are more likely to satisfice with increased task difficulty, when they are cognitively impaired, or when not sufficiently motivated. Respondents who are asked questions that exceed their cognitive abilities are more likely to satisfice. Even when the cognitive demands are consistent with respondent abilities, the resources needed to provide the optimal response can be reduced by fatigue. Instrument design can mitigate some of these fluctuations, as can careful training of interviewers on how to recognize and alleviate difficulties in responding.

Given that persons with moderate-severe TBI experience cognitive difficulties, fatigue, and fluctuations in emotional state and effort that can increase vulnerability to inconsistent responding, the reliability of self-report instruments used with this population cannot be assumed. The purpose of the current study was to evaluate the test-retest reliability of each of the measures that comprise the follow-up interview of the TBIMS. Despite the multiple threats to reliable responding described for the general population, we hypothesized that the majority of measures would have an acceptable level of reliability due to the piloting and vetting procedures used for evaluating measures before they are added to the dataset, the training provided to the interviewers, and the ongoing data quality improvement efforts. A second aim of this study was to determine whether less reliable respondents could be identified at the time of enrollment in the study. In addition to demographics, predictor variables were chosen based either on their relation to cognitive abilities or to sensitive questions. We also evaluated whether there was a relation between consistent responding and contemporaneous cognitive functioning as well as follow-up period.



Participants were drawn from the National Institute on Disability, Independent Living, and Rehabilitation Research–funded TBIMS National Dataset. All participants sustained a moderate-severe TBI as indicated by loss of consciousness (LOC) exceeding 30 minutes, duration of posttraumatic amnesia greater than 24 hours, emergency department Glasgow Coma Scale Score of less than 13, or neuroimaging abnormalities as a result of trauma to the head. Additional inclusion criteria included 16 years of age or older; presented to a TBIMS-affiliated acute care hospital within 72 hours of the injury; and received comprehensive rehabilitation at a TBIMS-designated site. Currently, there are 16 centers enrolling participants and contributing acute, rehabilitation, and follow-up data to the longitudinal database. Follow-up interviews are typically conducted with the participant or proxy by phone, in person, or by a mail-out questionnaire at 1, 2, and 5 years postinjury and every 5 years thereafter. Each follow-up period has a designated data collection window: 4-month window for the 1-year follow-up, 6-month window for the 2-year follow-up, and a 1-year window for all remaining follow-ups.

Six centers provided test-retest data for the current study, with 1 to 2 interviewers at each site. A minimum total sample size of 200 participants (with a priori target of at least 30 participants per site) was chosen because it would be sufficiently large to ensure that coefficients have relatively narrow confidence intervals; this is particularly important when interpreting statistics that may fall on the lower end of the acceptable range.6–8 Participants whose follow-up windows opened during the designated 6-month data collection period were eligible for the study. Only persons with brain injury, not proxies, were asked to participate, because the focus of this study was on the reliability of the measures when administered to persons with TBI. In addition, participants who completed their initial follow-up interview by mail, who did not speak English, or who did not complete the initial follow-up interview during the data collection period were excluded. A consecutive sample (based on the window-opening date by site) of participants who met the inclusion criteria was approached for the study. If the participant's window opened but the participant could not be contacted by the end of the designated data collection period, attempts continued until 6 months after the window-opening date. Two sites required an extended data collection window.


The institutional review boards at each participating site approved the procedures used in this study. After completing the standard interview for follow-up year 1, 2, 5, 10, 15, or 20, each participant was asked whether he or she was willing to complete the interview a second time within the next 14 to 28 days. The same interviewer completed both interviews for each participant but did so without accessing the first interview. While it is possible that some interviewers may have remembered answers to the first interview, interviewers were cautioned to be alert to inadvertently cuing participants regarding previous answers. The conduct of the second interview followed the same standard procedures used for collecting the data for the initial interviews, presenting the same questions in the same order.

The minimum interval between interviews (14 days) was chosen based in part on responses from interviewers regarding the rate of memory decay they experience for interviews they previously administered, with the assumption that persons with TBI being interviewed would experience memory decay at a similar or greater rate. The maximum length of the interval (28 days) was chosen to minimize the likelihood that “true” change would occur.

Data collectors kept a record of second interviews that were not completed because of use of a proxy for the first interview, completion by mail rather than phone, refusal, failure to successfully complete the second interview with persons who initially agreed to participate (with reason), or any other reason. Standard strategies for maximizing follow-up that are mandated by the TBIMS National Data Center were employed for both interviews. Participants who did not complete the second interview were replaced with another participant.


Reliability was examined for all 66 variables included in the follow-up interview for the TBIMS National Dataset as of October, 2013, including (a) single-item measures of residence; marital status; educational level; employment; economic status; general health as well as specific health conditions; rehospitalization; height and weight; tobacco, alcohol, and other drug use; transportation; mental health and (b) multi-item instruments: FIM3,9,10, Participation Assessment with Recombined Tools—Objective (PART—O),11 Disability Rating Scale—Interview Version (DRS),1,2 Glasgow Outcome Scale—Extended (GOSE),12 Supervision Rating Scale (SRS),13 Satisfaction with Life Scale (SWLS),14 TBI Quality of Life Anxiety and Depression items (TBI-QOL),15 and The Ohio State University TBI Identification Method (OSU TBI-ID).16 Details about each of the measures can be accessed at

Data analysis

The reliability of each measure was evaluated using the scores or categories commonly used in the field. For example, the FIM instrument was transformed to Rasch scores,9,10 the categories of the Supervision Rating Scale were collapsed using the schema most often employed by TBIMS researchers (see, and summary indices were used to describe alcohol consumption (eg, drinks per week).

The reliability of continuous measures was estimated using intraclass correlation coefficients (ICC) based on a 2-way random effects model evaluating for consistency (models evaluating absolute agreement were also run, but coefficients were nearly identical and so are not reported). Test-retest reliability designs are ideally suited for 2-way models since trials and subjects are fully crossed. Trials and subjects were considered random with the intent of generalizing to other studies. The ICC was also used to calculate the standard error of measurement and the denominator for the reliable change index, measures that clinicians can use to identify meaningful change in patients.17,18
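
The calculation described above can be sketched as follows. This is a minimal illustration, not the study's actual analysis code: it assumes the single-measure form of the 2-way consistency ICC (the text does not state whether single or average measures were used), the commonly cited formula SEM = SD × √(1 − ICC), and a reliable change index denominator of √2 × SEM. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def icc_consistency(time1, time2):
    """Single-measure, 2-way, consistency-type ICC for two administrations.

    Assumption: single-measure form; the article does not specify it.
    """
    time1, time2 = list(time1), list(time2)
    n, k = len(time1), 2  # n participants, k = 2 trials
    grand = mean(time1 + time2)
    row_means = [(a + b) / k for a, b in zip(time1, time2)]
    col_means = [mean(time1), mean(time2)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between trials
    ss_total = sum((x - grand) ** 2 for x in time1 + time2)
    ss_error = ss_total - ss_rows - ss_cols                 # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC); sd is the score standard deviation."""
    return sd * sqrt(1 - icc)


def rci_denominator(sem):
    """Standard error of the difference between two scores: sqrt(2) * SEM."""
    return sqrt(2) * sem
```

Because the consistency form ignores a constant shift between administrations, scores that all rise by the same amount at time 2 still yield an ICC of 1; an absolute-agreement form would penalize that shift.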

The reliability of categorical variables was evaluated using Cohen κ, or weighted kappa (quadratic weighting) for ordinal variables. Ordinal variables that are often used as continuous variables were also assessed using ICC. Ninety-five percent confidence intervals are provided for each coefficient, with the lower confidence limit giving the reader an estimate of the lowest value that might be found with similar samples.
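
For readers unfamiliar with these coefficients, the sketch below computes unweighted Cohen kappa and quadratic-weighted kappa from paired time 1 and time 2 responses. It is an illustration under stated assumptions rather than the authors' code: the function name and the example data are hypothetical, confidence intervals are omitted, and the coefficient is undefined (division by zero) when every response falls in a single category.

```python
def kappa(time1, time2, categories, quadratic=False):
    """Cohen kappa (quadratic=False) or quadratic-weighted kappa (quadratic=True).

    `categories` lists the response options; for weighted kappa they must be
    given in their ordinal order.
    """
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(time1)

    # Cross-tabulate time 1 (rows) against time 2 (columns).
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(time1, time2):
        observed[index[a]][index[b]] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: squared distance, or 0/1 for unweighted kappa.
        return (i - j) ** 2 if quadratic else int(i != j)

    cells = [(i, j) for i in range(k) for j in range(k)]
    weighted_observed = sum(weight(i, j) * observed[i][j] for i, j in cells)
    weighted_expected = sum(
        weight(i, j) * row_totals[i] * col_totals[j] / n for i, j in cells
    )
    return 1 - weighted_observed / weighted_expected


# Hypothetical example: four participants answering a yes/no item twice.
example = kappa(["no", "no", "yes", "yes"], ["no", "yes", "yes", "yes"],
                categories=["no", "yes"])
```

With quadratic weights, a disagreement of two categories counts four times as much as a disagreement of one, which is why weighted kappa suits ordinal items.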

Given that the size of a reliability coefficient is affected by various factors including the distribution of responses and prevalence, acceptability was evaluated relative to those distributions. For dichotomous variables, the following indices were also provided to aid in evaluation: (a) prevalence index (a high index indicates the prevalence is very high or very low, and that chance agreement is high, which can decrease kappa value, especially large kappa values)19; (b) bias index (a high index indicates the degree to which the proportion of positive cases differed between time 1 and time 2, with a large bias inflating the kappa value, especially small kappa values)19; and (c) percent of negative and positive agreements (used to determine whether there is a difference between the reliability of positive vs negative responses).20 For continuous variables, the coefficient of variation is provided (SD/mean × 100, time 1 and time 2 averaged) (higher values indicate more dispersion, when high ICCs may be artificially inflated; less dispersion may suppress ICC).21
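
The interpretive indices listed above can be illustrated with a short sketch. It assumes the standard 2 × 2 definitions (prevalence index = |a − d|/n, bias index = |b − c|/n, positive agreement = 2a/(2a + b + c), negative agreement = 2d/(2d + b + c)) and uses the sample standard deviation for the coefficient of variation; the article does not state which SD estimator was used. Names and the example counts are hypothetical.

```python
from statistics import mean, stdev


def dichotomous_indices(a, b, c, d):
    """Interpretive indices for a 2 x 2 test-retest table.

    a = yes at both times, b = yes then no, c = no then yes, d = no at both times.
    """
    n = a + b + c + d
    return {
        "prevalence_index": abs(a - d) / n,
        "bias_index": abs(b - c) / n,
        "positive_agreement": 2 * a / (2 * a + b + c),
        "negative_agreement": 2 * d / (2 * d + b + c),
    }


def coefficient_of_variation(scores):
    """SD / mean x 100 (sample SD is an assumption)."""
    return stdev(scores) / mean(scores) * 100


def averaged_cv(time1, time2):
    """Coefficient of variation averaged over the two administrations."""
    return (coefficient_of_variation(time1) + coefficient_of_variation(time2)) / 2


# Hypothetical rare event: 5 consistent "yes", 92 consistent "no", 3 switches.
rare_event = dichotomous_indices(a=5, b=2, c=1, d=92)
```

In the hypothetical rare-event table the prevalence index is 0.87, and negative agreement (about 0.98) is much higher than positive agreement (about 0.77), the same pattern the indices are meant to expose for low-prevalence items.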

When available, reliability estimates from other studies using the same or similar questions have also been provided for comparison purposes (when the items were not exactly the same, the item name is provided in italics in the tables). The study that yielded the greatest number of comparisons was the Substance Abuse and Mental Health Services Administration (SAMHSA) 2010 reliability study on its National Survey on Drug Use and Health.22 More than 2700 participants completed the survey twice, with a test-retest interval of 5 to 15 days. A second study that also examined a large number of variables common to the TBIMS data set was conducted with a South Australian general population sample (n = 154 in reliability study), using computer-assisted telephone interviews administered 13 to 35 days apart.23 Numerous other studies examined smaller sets of variables (data collection details are provided in the notes under the tables).

In addition to estimating the reliability of the measures, the ability to predict which participants would be less reliable was also assessed. The percentage of responses that were not equivalent between time 1 and time 2 was calculated for each participant as a measure of the participant's reliability. Multiple regression was run using predictor variables representing characteristics that would be known at the time of enrollment into the study (by discharge from rehabilitation) and were thought to be associated with reliable responding: age, race/ethnicity, gender, level of education, special education, payor source of Medicaid or Charity, FIM Cognitive at discharge (Rasch-adjusted), incarceration prior to injury, and drug use prior to injury. A second regression model added FIM Cognitive at follow-up (Rasch-adjusted, first administration) and the follow-up period to determine the influence of contemporaneous factors. Assumptions for the regression model were verified.
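
A minimal sketch of the per-participant inconsistency score follows. It assumes that "not equivalent" means an exact mismatch between the time 1 and time 2 values and that variables missing at either interview are skipped; the article does not spell out either rule. The function name, variable names, and example values are hypothetical.

```python
def percent_nonequivalent(first, second):
    """Percentage of variables whose time 1 and time 2 responses differ.

    `first` and `second` map variable names to responses. Variables that are
    absent or None at either interview are ignored (an assumption).
    """
    shared = [
        name for name in first
        if name in second and first[name] is not None and second[name] is not None
    ]
    if not shared:
        raise ValueError("no variables answered at both interviews")
    mismatches = sum(1 for name in shared if first[name] != second[name])
    return 100.0 * mismatches / len(shared)


# Hypothetical participant with three usable variables, one of which changed.
interview1 = {"residence": "private", "employed": "yes", "drinks_per_week": 3, "height": None}
interview2 = {"residence": "private", "employed": "no", "drinks_per_week": 3, "height": 70}
score = percent_nonequivalent(interview1, interview2)
```

The resulting percentage would then serve as the dependent variable in the regression models described above.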


A total of 231 eligible participants completed both interviews; however, 7 participants completed the second interview outside of the predetermined window of 14 to 28 days (ranging from 11 to 34 days). These latter participants were excluded from the analysis, yielding a final sample size of 224 participants. An additional 166 TBIMS participants with open follow-up windows were excluded from the study for the following reasons: (a) 60 participants did not complete the first interview because of death, incarceration, refusal, loss to follow-up, or withdrawal; (b) 11 participants refused the second interview; (c) 15 participants completed the data collection by mail; (d) 51 participants required a proxy to complete the interview; (e) 9 participants did not speak English; and (f) 20 participants were lost to follow-up after completing the first interview.

The number of participants from each site ranged from 19 to 68, with 5 of the 6 sites meeting the original sampling goal of at least 30 participants per site (the remaining site had a limited participant pool from which to draw due to a funding hiatus). Demographic and injury characteristics of the primary sample are provided in Table 1.

Table 1. Demographic and injury characteristics of the sample

Tables 2 to 5 provide the ICC, kappa, and weighted kappa values (as appropriate to the scale of the variable), reliability coefficients obtained for the same or similar measures in other studies, prevalence index, bias index, percentage of negative and positive agreement, the coefficient of variation, standard error of measurement, and the denominator for the reliable change index. The ICC values ranged from 0.65 (binge drinking past month) to 0.99 (number of days from injury to employment), indicating that all values were good to excellent using traditional benchmarks.40 Weighted kappa ranged from 0.54 (rating of general emotional health) to 0.99 (highest level of education achieved), and kappa ranged from 0.43 (psychiatric hospitalization) to 1.00 (residence). The following kappa/weighted kappa estimates fell below 0.60 (common benchmark for “good” or “substantial” reliability40,41): arrested in past year, psychiatric hospitalization in past year, number of days not in good physical health, and rating of general emotional health. The weighted kappa for binge drinking categorized on an ordinal scale was 0.48, but the kappa for “any binge drinking” was 0.68. For arrests and psychiatric hospitalization, participants were more consistent from time 1 to time 2 when they denied either incident as compared with when they endorsed an arrest or hospitalization.

Table 2. Demographics: Test-retest reliability coefficients, confidence intervals, interpretive indices, and comparison studies
Table 3. Instruments: Test-retest reliability coefficients, confidence intervals, interpretive indices, and comparison studies
Table 4. Mental health items: Test-retest reliability coefficients, confidence intervals, interpretive indices, and comparison studies
Table 5. Physical health items: Test-retest reliability coefficients, confidence intervals, interpretive indices, and comparison studies

When considering factors that can have an impact on the reliability coefficient, relatively high coefficients of variation (>100%) were found for number of days living at current address, number of days from injury to employment (however, sample size was very small), binge drinking, average number of drinks per week, number of days not in good physical health, number of TBIs with any LOC, number of TBIs with LOC greater than 30 minutes, and the DRS. High coefficients of variation suggest that the ICC value may be inflated because of the large dispersion; therefore, weighted kappa was also calculated to provide a measure of reliability not based on variance.42 Low coefficients of variation, which may suppress ICC values, were found for each FIM score and height; however, these ICC values were all high, so the low dispersion is not of concern.

Relatively high prevalence indices (>0.80), which can suppress kappa, were found for residence, arrested, psychiatric hospitalization, suicide attempt, congestive heart failure, myocardial infarction, stroke, cancer, liver disease, and any TBI with LOC before the age of 15 years. The bias indices were all very small (highest was 0.08), so it is unlikely that systematic differences in responding at time 1 and time 2 inflated kappa values.

Participant inconsistency from time 1 to time 2 ranged from 2% to 36% nonequivalent responses on the 55 interview variables (multiple scores from the same measure were not included to avoid overweighting some measures over others, eg, PART Total score was not included, but the 3 domain scores were included). The mean percent inconsistency was 20% (SD = 6%). The regression model that included characteristics known by rehabilitation discharge yielded an adjusted R² of 0.13. As shown in Table 6, the only significant predictors were black race, Hispanic ethnicity, and being incarcerated prior to injury, all of which were associated with less consistent responding. When follow-up period and FIM Cognitive at follow-up were added to the model, the adjusted R² increased to 0.23. Black race and Hispanic ethnicity remained significant, but FIM Cognitive at follow-up was the strongest significant predictor, with higher cognitive FIM scores associated with greater consistency of responses.

Table 6. Prediction of percent of nonmatching responses


The findings from this study support the conclusion that reliable self-report can be obtained from persons with moderate-severe TBI using the measures and follow-up methodology employed by the TBIMS. With few exceptions, these measures used to evaluate outcomes at follow-up meet conventional benchmarks for “good” or “substantial” reliability. Furthermore, the reliability figures obtained with this sample compare relatively well with those of other studies. The coefficients obtained from a larger general population study conducted by SAMHSA22 were nearly identical (within the confidence intervals) or lower than those obtained in the current sample for all but 3 of 20 measures. Similar favorable comparisons were found with other studies of a more select set of variables. The exception was a lower coefficient for the DRS; however, the sample size for the prior study was extremely small (n = 40) and the method of administration differed.29

Variables with low reliability relative to other variables included being arrested, psychiatric hospitalization, and rating of general emotional health. These measures request information that participants may consider sensitive, which could affect consistency in responses, as suggested by the lower agreement obtained from time 1 to time 2 when either arrests or psychiatric hospitalization was endorsed. Relatively low coefficients were also obtained by the SAMHSA study for arrests and psychiatric hospitalization (though the latter was only obtained for youth). Other measures of mental health included in the TBIMS interview also had lower coefficients than those for non–mental health measures; however, they were still sufficiently strong to give the user viable options for measuring these constructs.

The item measuring number of days not in good physical health had an ICC value that may have been inflated by high dispersion. The weighted kappa value was low compared with that obtained with a general population sample,39 though still within a range that would be considered acceptable for many studies. This item is commonly used in health surveillance, though in light of the current results, it may be prudent to encourage interviewers to assist respondents who are having apparent difficulty providing a numerical answer, as recommended by the Centers for Disease Control and Prevention.43

There were minimal systematic differences between participants who responded more or less reliably, based on the low effect sizes obtained when attempting to predict inconsistent responding. Therefore, we conclude that the strong reliability reported in the current study can generalize to the full participant-reported TBIMS longitudinal database and possibly to other samples of persons with moderate-severe TBI who are able to complete the interviews themselves (the findings cannot be generalized to proxy-reported responses). However, the weak but significant relations observed with black race and Hispanic ethnicity suggest that minority status could increase the likelihood that an individual may feel more vulnerable when responding, decreasing the consistency of responses across time. The relation between inconsistent responding and incarceration prior to the injury is not surprising, given that arrests in the year prior to the follow-up interview was one of the least reliable measures. Nor is it surprising that better cognitive functioning at the time of the follow-up interview was associated with reliable responding, though the strength of the relation was only moderate (increasing the adjusted R² by only 0.10).

Clinical implications

The results of this study imply that the majority of these self-report measures can be used to measure change over time or following interventions. The reliability coefficients generated from the current study were used to calculate the denominator for the reliable change index 17,18,44 and the standard error of measurement,17,45 both of which can be used to determine whether a patient's status has changed beyond what might be expected because of measurement error.
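
As an illustration of how a clinician might apply these values, the sketch below divides a patient's change score by the reliable change index denominator and compares the result with a critical value. The 1.96 cutoff (95% confidence) is a conventional choice and an assumption here, as are the function names and the example numbers; the denominators reported in the tables would be substituted in practice.

```python
def reliable_change_index(score_time1, score_time2, rci_denominator):
    """Change score divided by the standard error of the difference."""
    return (score_time2 - score_time1) / rci_denominator


def has_reliably_changed(score_time1, score_time2, rci_denominator, z_crit=1.96):
    """True when the change exceeds what measurement error alone would explain.

    z_crit = 1.96 corresponds to a 95% criterion (an assumed convention).
    """
    rci = reliable_change_index(score_time1, score_time2, rci_denominator)
    return abs(rci) > z_crit


# Hypothetical measure with an RCI denominator of 4.0 points.
large_change = has_reliably_changed(20, 30, rci_denominator=4.0)  # RCI = 2.5
small_change = has_reliably_changed(20, 25, rci_denominator=4.0)  # RCI = 1.25
```

Under these assumed numbers, a 10-point change would count as reliable while a 5-point change would fall within the range attributable to measurement error.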


Approximately 40% of the source sample was not eligible for participation in the reliability study. The primary reason was that the participant did not complete the initial follow-up interview. Loss to follow-up is a common problem in longitudinal studies, though the TBIMS strategies for maximizing follow-up have reduced rates of loss overall. The second most common reason a participant was deemed ineligible was proxy-completed interview. The results of the current study cannot be generalized to participants who rely on a proxy to respond, who complete the interview by mail, or who do not speak English.

Recruitment rates varied across sites as did reasons for ineligibility. Two sites experienced a funding hiatus, which presented challenges to following up with participants with whom they had not been able to maintain regular contact. In addition, some sites had more proxy respondents than other sites.

While interviewers did not review their first interview with participants before administering the second interview, it is still possible that they recalled previous answers and inadvertently cued responses. Regular monthly meetings with interviewers were used to assess and reinforce fidelity to the protocol, so it is believed that the likelihood of cued responses is very minimal. In addition, it is not possible to know the extent to which true change may have influenced the findings, and for this reason, measures of reliability are always considered to be estimates (but any true change would have contributed to underestimated reliability).


In conclusion, the standardized measures chosen by the TBIMS Center investigators and the telephone-based responses recorded by protocol-trained interviewers involved in regular continuous improvement activities yielded stable responses overall on repeated administration. This supports the effectiveness of the TBIMS National Data and Statistical Center's data collector training and data quality improvement programs. In addition, these findings indicate that research publications reporting analyses using the TBIMS data set—in the past and future—use data that are reliable and reproducible. Finally, these data indicate that individuals with moderate to severe TBI are capable of providing reliable self-reports in the years following injury.


1. Rappaport M, Hall KM, Hopkins K, Belleza T, Cope DN. Disability rating scale for severe head trauma: coma to community. Arch Phys Med Rehabil. 1982;63:118–123.
2. Malec JF, Hammond FM, Giacino JT, Whyte J, Wright J. Structured interview to improve the reliability and psychometric integrity of the Disability Rating Scale. Arch Phys Med Rehabil. 2012;93:1603–1608.
3. Granger CV, Hamilton BB, Keith RA, Zielezny M, Sherwin FS. Advances in functional assessment for medical rehabilitation. Top Geriatr Rehabil. 1986;1:59–74.
4. Tourangeau R, Rips LJ, Rasinski K. The Psychology of the Survey Response. Cambridge, England: Cambridge University Press; 2000.
5. Krosnick JA. Response strategies for coping with the cognitive demands of attitude measures in surveys. Appl Cogn Psychol. 1991;5:213–236.
6. Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med. 2012;31:3972–3981.
7. Hadzi-Pavlovic D. Sample size for kappa. Acta Neuropsychiatr. 2010;22:199–201.
8. Cantor AB. Sample-size calculations for Cohen's kappa. Psychol Methods. 1996;1:150–153.
9. Heinemann AW, Linacre JM, Wright BD, Hamilton BB, Granger C. Relationships between impairment and physical disability as measured by the Functional Independence Measure. Arch Phys Med Rehabil. 1993;74:566–573.
10. Linacre JM, Heinemann AW, Wright BD, Granger CV, Hamilton BB. The structure and stability of the Functional Independence Measure. Arch Phys Med Rehabil. 1994;75:127–132.
11. Whiteneck GG, Dijkers MP, Heinemann AW, et al. Development of the Participation Assessment with Recombined Tools-Objective for use after traumatic brain injury. Arch Phys Med Rehabil. 2011;92:542–551.
12. Wilson JT, Pettigrew LE, Teasdale GM. Structured interviews for the Glasgow Outcome Scale and the extended Glasgow Outcome Scale: guidelines for their use. J Neurotrauma. 1998;15:573–585.
13. Boake C. Supervision Rating Scale: a measure of functional outcome from brain injury. Arch Phys Med Rehabil. 1996;77:765–772.
14. Diener E, Emmons RA, Larsen RJ, Griffin S. The Satisfaction With Life Scale. J Pers Assess. 1985;49:71–75.
15. Tulsky DS, Kisala PA, Victorson D, et al. TBI-QOL: development and calibration of item banks to measure patient reported outcomes following traumatic brain injury. J Head Trauma Rehabil. 2016;31:40–51.
16. Corrigan JD, Bogner J. Initial reliability and validity of the Ohio State University TBI Identification Method. J Head Trauma Rehabil. 2007;22:318–329.
17. Wright A, Hannon J, Hegedus EJ, Kavchak AE. Clinimetrics corner: a closer look at the minimal clinically important difference. J Man Manip Ther. 2012;20(3):160–166.
18. Perdices M, Tate R. Single subject designs as a tool for evidence-based clinical practice: are they unrecognized and undervalued? Neuropsychol Rehab. 2009;19(6):905–927.
19. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005;85:257–268.
20. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990;43:551–558.
21. Hopkins WG. Measures of reliability in sports medicine and science. Sports Med. 2000;30:1–15.
22. Substance Abuse and Mental Health Services Administration. Reliability of Key Measures in the National Survey on Drug Use and Health (Office of Applied Studies, Methodology Series M-8, HHS Publication No. SMA 09-4425). Rockville, MD; 2010.
23. Dal Grande E, Fullerton S, Taylor AW. Reliability of self-reported health risk factors and chronic conditions questions collected using the telephone in South Australia, Australia. BMC Med Res Methodol. 2012;12:108.
24. Cham PM, Chen SC, Grill JP, Jonk YC, Warshaw EM. Reliability of self-reported willingness-to-pay and annual income in patients treated for toenail onychomycosis. Br J Dermatol. 2007;156:922–928.
25. Green J, Forster A, Young J. A test-retest reliability study of the Barthel Index, the Rivermead Mobility Index, the Nottingham Extended Activities of Daily Living Scale and the Frenchay Activities Index in stroke patients. Disabil Rehabil. 2001;23:670–676.
26. Petrunoff NA, Xu H, Rissel C, Wen LM, van der Ploeg HP. Measuring workplace travel behaviour: validity and reliability of survey questions. J Environ Public Health. 2013;2013:423035.
27. Masedo AI, Hanley M, Jensen MP, Ehde D, Cardenas DD. Reliability and validity of a self-report FIM (FIM-SR) in persons with amputation or spinal cord injury and chronic pain. Am J Phys Med Rehabil. 2005;84:167–176.
28. Ottenbacher KJ, Hsu Y, Granger CV, Fiedler RC. The reliability of the Functional Independence Measure: a quantitative review. Arch Phys Med Rehabil. 1996;77:1226–1232.
29. Gouvier WD, Blanton PD, LaPorte KK, Nepomuceno C. Reliability and validity of the Disability Rating Scale and the Levels of Cognitive Functioning Scale in monitoring recovery from severe head injury. Arch Phys Med Rehabil. 1987;68:94–97.
30. Pettigrew LE, Wilson JT, Teasdale GM. Reliability of ratings on the Glasgow Outcome Scales from in-person and telephone structured interviews. J Head Trauma Rehabil. 2003;18:252–258.
31. Wilson JT, Edwards P, Fiddes H, Stewart E, Teasdale GM. Reliability of postal questionnaires for the Glasgow Outcome Scale. J Neurotrauma. 2002;19:999–1005.
32. Pavot W, Diener E, Colvin CR, Sandvik E. Further validation of the Satisfaction with Life Scale: evidence for the cross-method convergence of well-being measures. J Pers Assess. 1991;57:149–161.
33. Rosengren L, Jonasson SB, Brogardh C, Lexell J. Psychometric properties of the Satisfaction with Life Scale in Parkinson's disease. Acta Neurol Scand. 2015;132:164–170.
34. Bogner J, Corrigan JD. Reliability and predictive validity of the Ohio State University TBI identification method with prisoners. J Head Trauma Rehabil. 2009;24(4):279–291.
35. Cuthbert JP, Whiteneck GG, Corrigan JD, Bogner J. The reliability of a computer-assisted telephone interview version of the Ohio State University TBI identification method. J Head Trauma Rehabil. 2016;31:E36–E42.
36. Stein AD, Lederman RI, Shea S. The Behavioral Risk Factor Surveillance System questionnaire: its reliability in a statewide sample. Am J Public Health. 1993;83:1768–1772.
37. Bonevski B, Campbell E, Sanson-Fisher RW. The validity and reliability of an interactive computer tobacco and alcohol use survey in general practice. Addict Behav. 2010;35:492–498.
38. Brownson RC, Jackson-Thompson J, Wilkerson JC, Kiani F. Reliability of information on chronic disease risk factors collected in the Missouri Behavioral Risk Factor Surveillance System. Epidemiology. 1994;5:545–549.
39. Andresen EM, Catlin TK, Wyrwich KW, Jackson-Thompson J. Retest reliability of surveillance questions on health related quality of life. J Epidemiol Community Health. 2003;57:339–343.
40. Cicchetti DV. The precision of reliability and validity estimates revisited: distinguishing between clinical and statistical significance of sample size requirements. J Clin Exp Neuropsychol. 2001;23:695–700.
41. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
42. Fleiss JL, Cohen J. The equivalence of weighted kappa and intraclass correlation coefficient as measures of reliability. Educ Psychol Meas. 1973;33:613–619.
43. Centers for Disease Control and Prevention. Measuring Healthy Days. Atlanta, GA: Centers for Disease Control and Prevention; 2000.
44. Jacobson NS, Follette WC, Revenstorf D. Psychotherapy outcome research: methods for reporting variability and evaluating clinical significance. Behav Ther. 1984;15:336–352.
45. Weir JP. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J Strength Cond Res. 2005;19:231–240.

Keywords: brain injuries; psychometrics; test-retest reliability

    Copyright © 2017 Wolters Kluwer Health, Inc. All rights reserved.