Measurement of Health Outcomes: Reliability, Validity and Responsiveness

Roach, Kathryn E. PhD, PT

JPO: Journal of Prosthetics and Orthotics: January 2006 - Volume 18 - Issue 6 - p P8-P12

We measure health outcomes to help us make decisions about managing our patients. Outcome measures help us predict which patients will benefit most from a particular intervention and document whether the patient improves after the intervention is provided. There is a wide range of important health outcomes, including disability and quality of life. Reliability is a necessary but not sufficient characteristic of an outcome measure. It also is necessary to determine whether the measure actually captures the aspect of the phenomenon of interest. Validity is not a characteristic of an instrument. It can be determined only in relation to a particular question as it pertains to a defined population. Finally, outcome measures used to evaluate changes in patients over time must be responsive in their ability to detect real change. It is important to understand the types and psychometric properties of outcome measures to select the measure best suited to a particular purpose.

KATHRYN E. ROACH, PhD, PT, University of Miami, Miller School of Medicine, Department of Physical Therapy, Miami, Florida.

Correspondence: Kathryn E. Roach, PhD, PT, 5915 Ponce de Leon Blvd., Plumer Building, 5th Floor, Coral Gables, FL 33146; e-mail:

Measurement is an abstraction. It involves developing a set of rules to assign numbers to represent a concept. The rules must provide a precise system for assigning numerical values to differentiate and order levels of the concept.1 To develop that set of rules, you must simplify the concept by focusing on those aspects of the concept that are most important to your purposes.2 This requires a clear understanding of your purpose for using the measure. The need to understand the purpose for making the measurement is true even for simple concepts, such as body mass. Body mass could be defined as an amount of weight measured by a scale (units would need to be specified as metric or English). Body mass could also be defined as a volume by measuring the amount of water displaced (again units would need to be defined). If you were trying to determine how many passengers a plane could carry based on the aerodynamics of flight, you would want to measure the body mass of the passengers in terms of their scale weight. If you wanted to determine how many passengers could comfortably fit into the cabin of the airplane, you might find the amount of water displaced as a better measure of body mass.

To understand something, you must be able to measure it. Without measurement you could say that someone was fat or thin. However, without some way of assigning numbers to represent body mass, it would be very difficult to accurately determine whether someone was fatter than someone else or whether someone's body mass had increased or decreased. It would certainly be impossible to determine how body mass relates to other things, such as fitness, health, or risk of disease. We measure health outcomes to help us make decisions about providing care for our patients.2,3 Outcome measures help us predict which patients will benefit most from a particular intervention and document whether the patient improves after the intervention is provided.

There is a wide range of important health outcomes. Historically, death and disease were the outcomes of interest to physicians.4 However, disability, discomfort, and dissatisfaction now are also recognized as critically important outcomes. Disability, in particular, has received a great deal of attention as an important health outcome. The World Health Organization International Classification of Functioning, Disability and Health5 describes three separate levels of disablement: impairment, activity limitation, and participation restriction,6 all of which present separate measurement issues. Activity limitation is the focus of a wide range of outcome measures that differ greatly in terms of the specific activities included. These measures also employ a variety of measurement dimensions, including amount of assistance, degree of difficulty, frequency, time, and quality.

In addition to disability measures, the broad concept of health-related quality of life (HRQOL)7 has become an outcome of interest. Health-related quality of life reflects many aspects of life, ranging from disease and impairment to social support and environmental obstacles, making HRQOL an extremely complex concept. Obviously, no single instrument or battery of instruments can measure all the possible outcomes of interest in all possible groups of patients.

Just as there is a wide range of health outcomes, there is an even wider range of possible outcome measures from which to select. Deciding which outcome measure to use can be a daunting task. Therefore, it is important to understand the types and characteristics of outcome measures to select the measure best suited to your purpose.

There are two broad types of outcome measures.8 Categorical (also referred to as diagnostic) measures are used to classify patients to make clinical decisions. Each category has a set of criteria. Patients who meet the criteria for a particular category are placed in that category. The Medicare Functional Classification Levels were used in this way to determine prosthetic prescription.9 Descriptive measures describe the patient in terms of the extent of a phenomenon.8 The phenomenon described may be as simple as body mass or as complex as HRQOL. Descriptive measures can be used prognostically to predict some future event. For example, The Amputee Mobility Predictor Score is designed to predict the ability of an amputee to walk with a prosthetic device.10 Descriptive measures can also be used to predict who will most benefit from a particular intervention or to determine the optimal intensity and duration of an intervention.8

Descriptive measures can be used to examine the relationship between two or more factors, such as the relationship between leg length and walking speed.2 Perhaps the most important use of descriptive measures is to detect change resulting from an intervention. For example, a 6-minute walk test11 could be used to demonstrate change in walking ability resulting from the addition of more sophisticated prosthetic components.

Outcome measures are used to answer questions about patients. The answers to these questions help guide clinical decisions and plans of care. The quality of the information provided by outcome measures depends, in part, on the psychometric properties of those measures. The psychometric properties of outcome measures include such things as level of measurement, reliability, validity, and responsiveness.1,2,12

There are four measurement scales or levels of measurement. Each scale has rules for interpreting and manipulating the data. Nominal level data are produced when a measure assigns people to categories according to a set of criteria.1 These categories are mutually exclusive and exhaustive. Although numbers may be assigned to these categories, these numbers do not represent levels of a concept. For example, a left-sided transtibial amputee could be coded 1 and a right-sided transtibial amputee could be coded 2, but these numbers would not imply any hierarchical relationship between individuals assigned to the two categories. The only mathematical operation that is permitted with this type of data is counting the number of individuals in each category.

Ordinal level data also involve mutually exclusive and exhaustive categories, but for ordinal data the categories do have order.1 Individuals could be classified as “unable to ambulate,” “able to ambulate with a walker,” “able to ambulate with a cane,” or “able to ambulate without a device.” These categories could be assigned numbers 1 through 4, respectively. In this case, the numbers would imply a hierarchical relationship relative to ability to ambulate. However, it would be difficult to demonstrate that the difference in the level of the construct “ability to ambulate” was the same between the categories “unable” and “with a walker” as it was between the categories “with a cane” and “without a device.” From a purely mathematical perspective, because the intervals between categories cannot be assumed to be equal, ordinal data should not be used to perform mathematical operations such as adding, subtracting, multiplying, and dividing. However, many outcome measures produce these types of data, and mathematical operations are often used to generate total scores by adding or averaging item scores. There is widespread disagreement about the severity of the problems presented by such practices.1

Both interval and ratio level data involve order and equal intervals between units of measurement. Examples of these types of data include height measured in centimeters and temperature measured in degrees. The difference between interval and ratio data is that interval data lack a true zero.1 Temperature is a good example of interval data. The real world temperature associated with the zero value on a thermometer varies depending on the scale being used (e.g., Centigrade versus Fahrenheit). The zero on neither scale represents a true zero or absence of temperature. Height, on the other hand, produces true ratio data. No matter which measurement scale is used, zero height represents a complete absence of height. Ratio level data can be used to perform all possible mathematical operations, including multiplication and division.
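
The rules attached to each level of measurement can be sketched in a few lines of Python. This is only an illustration: the function name `summarize`, the sample data, and the choice of the median as the ordinal summary are assumptions made for the example, not part of the article.

```python
from statistics import mean, median_low

def summarize(values, level):
    """Return a summary that respects the level of measurement."""
    if level == "nominal":
        # Only counting is permitted: tally how many fall in each category.
        counts = {}
        for value in values:
            counts[value] = counts.get(value, 0) + 1
        return counts
    if level == "ordinal":
        # Order is meaningful but the intervals are not, so report the
        # middle category instead of an arithmetic mean.
        return median_low(values)
    if level in ("interval", "ratio"):
        # Equal intervals make adding and averaging meaningful.
        return mean(values)
    raise ValueError(f"unknown level of measurement: {level}")

# Hypothetical data for illustration only.
amputation_side = [1, 2, 2, 1, 2]    # nominal: 1 = left, 2 = right
ambulation_level = [1, 3, 4, 2, 4]   # ordinal: 1 = unable ... 4 = without a device
height_cm = [160.0, 172.5, 181.0, 168.0, 175.5]  # ratio: zero means no height

side_counts = summarize(amputation_side, "nominal")
middle_category = summarize(ambulation_level, "ordinal")
mean_height = summarize(height_cm, "ratio")
```

Only the ratio data support statements such as "twice as tall"; the codes 1 and 2 for amputation side support nothing beyond counting.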

Many outcome measures take the form of clinical indices that calculate a total score by combining item level scores. For many indices, the level of measurement for item scores is nominal or ordinal. Although from a mathematical perspective these types of data should not be added, total scores typically are calculated by adding item scores. Explicit weighting may be used to calculate total scores by attaching greater mathematical value to certain items. Some degree of implicit weighting occurs in every index. Implicit weighting derives from the relative number of items representing particular aspects of the construct being measured.
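
A minimal sketch of explicit and implicit weighting follows. The item names, scores, weights, and domain labels are hypothetical and chosen only to make the arithmetic visible; they do not come from any published index.

```python
def total_score(item_scores, weights=None):
    """Sum item scores, optionally applying explicit weights per item."""
    if weights is None:
        # No explicit weighting: every item counts once.
        weights = {item: 1 for item in item_scores}
    return sum(score * weights[item] for item, score in item_scores.items())

def implicit_weights(item_domains):
    """Share of items devoted to each domain of the construct."""
    counts = {}
    for domain in item_domains.values():
        counts[domain] = counts.get(domain, 0) + 1
    n_items = len(item_domains)
    return {domain: count / n_items for domain, count in counts.items()}

# Hypothetical four-item activity index.
scores = {"bed_mobility": 2, "toilet_transfer": 1, "chair_transfer": 2, "walking": 3}
domains = {"bed_mobility": "transfers", "toilet_transfer": "transfers",
           "chair_transfer": "transfers", "walking": "locomotion"}

unweighted = total_score(scores)
# Explicit weighting: the walking item is counted twice as heavily.
weighted = total_score(scores, {"bed_mobility": 1, "toilet_transfer": 1,
                                "chair_transfer": 1, "walking": 2})
domain_shares = implicit_weights(domains)
```

Even with no explicit weights, three of the four items in this made-up index concern transfers, so transfers drive three quarters of the total score.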

If you want to make decisions based on an outcome measure, you must be confident that, if no real change has occurred, your outcome measure will produce the same number each time you use it. A measure of body mass would not be very helpful if the numbers it produced varied substantially from time to time. This concept of measurement is called reliability.1,2,12 It is important to remember that a measure is never universally reliable. A measure is only reliable for use with a particular population. For example, a bathroom scale that was consistent within a half pound would be a sufficiently reliable measure of weight for adults but would not be sufficiently reliable for use with neonates. Sometimes a particular characteristic of a test will make it difficult to achieve consistent results with certain types of subjects. For example, a test that required subjects to follow complex instructions might not be reliable for use with small children or individuals with cognitive problems.

There are several types of reliability that should be examined. Self-report measures that require individuals to respond to a series of written questions should be examined for test–retest reliability.1 Test–retest reliability is examined by having individuals complete the measure on more than one occasion with the assumption that no real change will have occurred between sessions. Problems with the test–retest reliability of self-report measures are most often due to problems with the wording of items. In particular, wording problems can occur when instruments developed for use in one country are used in another.13 This can be a significant problem for translated instruments. It can also be a problem for instruments used in countries where the same language is spoken with a different use of idiom. For example, words may be used very differently in Cuban Spanish and Mexican Spanish or American English and British English. Instruments with good reliability in one country may not be as reliable in a second country.

Outcome measures that are designed to test only one concept should also be examined for a type of reliability called internal consistency. Internal consistency is a measure of the extent to which all of the items in the outcome measure address the same underlying concept.1,2 Internal consistency is a desirable characteristic for measures such as depression scales, for which all the items should deal with depression. Internal consistency is neither necessary nor desirable for measures such as the SF-3614 that intentionally address more than one concept.2
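
The article does not name a statistic for internal consistency. One commonly reported choice is Cronbach's alpha, sketched below as an assumption; the function name and the sample responses are hypothetical.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Transpose rows (respondents) into columns (items).
    item_columns = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical data: four respondents, three items.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
inconsistent = [[1, 4, 2], [2, 1, 4], [3, 3, 1], [4, 2, 3]]

alpha_consistent = cronbach_alpha(consistent)
alpha_inconsistent = cronbach_alpha(inconsistent)
```

When every item ranks the respondents identically, alpha reaches 1.0. When the items disagree about who scores high, alpha falls sharply, which is what would be expected from an instrument that deliberately covers several concepts.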

Performance-based outcome measures require the participation of a rater. Raters must be trained to follow a standard set of rules to administer and score the measure. If raters do not adhere to rules, measurement errors may occur that adversely affect the reliability of the measure. Two types of rater reliability can be examined. Intra-rater reliability indicates how consistently a rater administers and scores an outcome measure. Inter-rater reliability indicates how well two raters agree in the way they administer and score an outcome measure.1,2,12 Intra- and inter-rater reliability are examined by administering a test to a group of subjects on multiple occasions. Again, the assumption exists that no real change in the subjects has occurred between sessions. Sometimes this assumption is difficult to meet. It is possible that the subject's performance as well as the rater's performance may vary from session to session because of things such as learning effects and fatigue. Some types of patients may have more problems with consistent performance than others. This is one of the reasons reliability is not a stable characteristic of an outcome measure. An outcome measure can be considered reliable only for a particular purpose with a particular type of subject.

Coefficients of reliability represent the true score variance divided by the sum of the true score variance and the error variance. A reliability coefficient of 1.0 represents perfect reliability, indicating that all of the differences between scores represent real differences between individuals.1 A reliability coefficient of 0.43 indicates that 43% of the variance is due to true score and 57% of the variance is due to measurement error. In general, reliability coefficients below 0.50 are considered poor, those from 0.50 to 0.75 are considered moderate, and those above 0.75 are considered good.
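
The definition of the coefficient can be written out directly. In the sketch below, the function names are illustrative, and treating a coefficient of exactly 0.50 as moderate is an assumption made so that the bands have no gap.

```python
def reliability_coefficient(true_variance, error_variance):
    """True score variance as a share of total (true + error) variance."""
    return true_variance / (true_variance + error_variance)

def interpret(coefficient):
    """Rough verbal label for a reliability coefficient."""
    if coefficient < 0.50:
        return "poor"
    if coefficient <= 0.75:
        # Assumption: exactly 0.50 is grouped with the moderate band.
        return "moderate"
    return "good"

# 43 units of true score variance and 57 units of error variance.
r = reliability_coefficient(43, 57)
label = interpret(r)
```

With these made-up variances the coefficient is 0.43, so 57% of the observed differences between individuals would reflect measurement error, not real differences.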

Reliability is a necessary but not sufficient characteristic of an outcome measure. If a measure does not produce consistent findings, it cannot be depended upon to measure what we want to measure. However, an outcome measure can produce consistent findings and still not provide the information that we need. Validity is defined as the degree to which an instrument measures what we intend to measure.1,2,12 Because all measurement involves assigning numbers to represent some limited aspect of a phenomenon, it is critical to determine whether the measure you are using actually captures the aspect of the phenomenon of interest. Validity is not a characteristic of an instrument. Validity can be determined only in relation to a particular question as it pertains to a defined population. To again use the example of body mass, if you measure body mass to determine how many people you could place on an elevator without exceeding the elevator's load capacity, you would probably want to measure the weight of your subjects using a scale. If you wanted to know how many people you could fit on an elevator, it would probably be more useful to know their volume as measured by water displacement. While the two ways of measuring body mass are clearly related, they provide slightly different kinds of information. To complicate things further, if you wanted to examine the relationship between body mass and risk of adverse health outcomes, neither of these measures would be particularly useful. Both weight and volume measures give you bigger numbers for taller people. If you are interested in body mass from the standpoint of obesity (our original qualitative measure of who is fatter and who is thinner), neither of the measures we have been talking about really measures this or is valid for this purpose. The body mass index is a calculation that is used to adjust weight for height. This is a better measure of body mass if you are interested in the question of obesity because taller people are not penalized. It could be argued that the body mass index is a more valid measure of body mass for the purpose of predicting health outcomes.
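
The height adjustment can be made concrete with the standard body mass index formula, weight in kilograms divided by the square of height in meters. The two individuals below are hypothetical.

```python
def body_mass_index(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

# Two hypothetical people whose scale weights differ by 17 kg.
taller = body_mass_index(81.0, 1.80)
shorter = body_mass_index(64.0, 1.60)
```

Scale weight alone would rank the taller person as having much more body mass, yet both have an index of about 25, which is the sense in which taller people are not penalized.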

Although reliability can be examined experimentally in a fairly straightforward manner, the question of validity is far more complex. There are many types of validity. Some types of validity, such as face validity and content validity, can be evaluated only subjectively. Face validity is the extent to which a test appears to measure what it is intended to measure.1,2 This type of validity would be important if you are advocating for the universal use of a particular test by a clinical community. Such a test would need to appear to measure what the clinicians in the community intended to measure. A related form of validity is content validity. Content validity is the degree to which a test includes all the items necessary to represent the concept being measured.1,2 The content validity of a test may vary widely depending on the question the test is being used to ask and the population involved. If you are attempting to measure activity limitation in older individuals in an assisted living facility (ALF), you would want an outcome measure that included items dealing with toilet transfers and ambulation with an assistive device. If you are attempting to measure activity limitations in younger athletic individuals, you would want an outcome measure that included items dealing with running, jumping, and climbing. An outcome measure that had good content validity for the young athletic population would have very poor content validity for the older population in the ALF.

Neither face nor content validity can be examined experimentally, and both are considered lower levels of validity. The higher forms of validity, criterion and construct, can both be objectively examined. Criterion validity is the most straightforward type of validity.1,2 The validity of an outcome measure is tested by comparing the results of the outcome measure or target test to a gold standard or criterion test. If the target test measures what it is intended to measure, then its results should agree with the results of the gold standard criterion test. This type of validity can be examined by giving both tests at the same time (concurrent validity) or by giving the target test first to determine whether it predicts the findings of the gold standard test administered at a later time (predictive validity). The primary problem with criterion validity is that it requires an established gold standard test. There are very few situations in rehabilitation where such a gold standard test exists.

Construct validity reflects the ability of a test to measure the underlying concept of interest to the clinician or researcher.1,2 There is no simple way to definitively establish the construct validity of an outcome measure. However, the question of whether a test actually measures what it is intended to measure for a particular purpose in a particular population is critical to selecting an appropriate outcome measure.

There are a number of strategies available to examine the construct validity of an outcome measure. The known groups method can be used to support the construct validity of a test. This approach is based on the assumption that if you give the test to two groups of subjects that you know differ on the construct of interest, the test scores of the groups should differ if the test actually measures what it is supposed to measure.1,2 For example, to examine the construct validity of the 6-minute walk test as a measure of functional mobility in the elderly, you could compare the test results of a group of frail elderly individuals who reside in an assisted living facility to those of a group of healthy elderly who volunteer at the local community hospital. These two groups would be expected to differ in terms of functional mobility. If the 6-minute walk test is a valid measure of functional mobility in this group of subjects, there should be a difference in 6-minute walk scores between the groups.

Convergent and discriminant validity can be used to support the construct validity of a test. Convergent validity is demonstrated when scores on the test being examined are highly correlated with scores on a test thought to measure similar or related concepts.1,2 For example, scores on a gait index should be correlated with scores from an activity limitation measure because the concepts of gait and activity limitation are related. It is possible that the scores would be correlated when the measures were administered to a group of elderly individuals but not correlated when administered to a group of young athletic individuals. The lack of correlation in the second situation could arise from a ceiling effect on one or both measures. If the activity limitation measure dealt with basic activities of daily living but did not deal with more vigorous activities, all the subjects in the younger group would receive the highest possible score. The measure would provide no information on the activity limitations experienced by the younger subjects and would not correlate with the scores on the gait index. Discriminant validity is demonstrated when scores on the test being examined are not correlated with scores on a test meant to measure a very different construct.1,2 To continue the example above, assuming none of the subjects had dementia, the scores from an intelligence test should not correlate with the scores from a measure of activity limitation. If the activity limitation test were strongly correlated with the intelligence test, it could be argued that the activity limitation test was not measuring the intended construct.
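
The correlations discussed above are often computed as Pearson coefficients. The sketch below assumes that choice; the scores are invented, and returning `None` when a measure shows no variation is a convention adopted here to make the ceiling effect visible.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation, or None if either variable has no variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Every subject got the same score, so correlation is undefined.
        return None
    return sxy / sqrt(sxx * syy)

# Hypothetical elderly group: gait index and activity limitation scores.
gait_elderly = [10, 14, 18, 22, 26]
activity_elderly = [30, 42, 50, 61, 75]

# Hypothetical young athletes: everyone reaches the top activity score.
gait_young = [40, 44, 47, 51, 55]
activity_young = [100, 100, 100, 100, 100]

r_elderly = pearson_r(gait_elderly, activity_elderly)
r_young = pearson_r(gait_young, activity_young)
```

In the elderly group the two measures move together, which supports convergent validity. In the younger group the activity measure has hit its ceiling, so it carries no information that could correlate with anything.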

If an outcome measure is used to evaluate changes in patients over time, the measure must be able to detect this change. This concept has been described in a number of ways. Longitudinal validity has been defined as a measure's ability to detect change whether or not the change detected is clinically meaningful.2 Responsiveness has been defined as the ability of an instrument to accurately detect change when it has occurred.15 Two types of responsiveness have been identified.16 Internal responsiveness is defined as the ability of a measure to change during a prespecified time frame. Internal responsiveness is often examined by administering a measure before and after a treatment of known efficacy. External responsiveness reflects the extent to which changes in a measure relate to changes in other measures of health status. Like other measurement characteristics, responsiveness is not a constant characteristic of a measure. It can be evaluated only when a measure is used for a particular purpose with a particular group of subjects. Reliability is a critical component of responsiveness. Measures with poor reliability will have difficulty detecting real change because the noise introduced by measurement error will obscure any real change that has occurred.
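
The article does not prescribe a statistic for internal responsiveness. One index used in the responsiveness literature is the standardized response mean, the mean change divided by the standard deviation of the change; it is offered here as an assumption, with hypothetical scores taken before and after a treatment.

```python
from statistics import mean, stdev

def standardized_response_mean(before, after):
    """Mean change score divided by the standard deviation of the changes."""
    changes = [post - pre for pre, post in zip(before, after)]
    return mean(changes) / stdev(changes)

# Hypothetical scores for four patients, before and after treatment.
before = [10, 12, 11, 13]
after = [14, 15, 16, 16]

srm = standardized_response_mean(before, after)
```

A noisy, unreliable measure inflates the spread of the change scores and therefore shrinks this ratio, which mirrors the point that poor reliability obscures real change.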

There are a number of other characteristics that influence the responsiveness of a measure. Some measures are less responsive because of their construction. Measures designed to place patients into a limited number of categories tend not to be responsive because large changes in status usually are required to change categories. Measures that have ceiling effects, in that almost all respondents initially achieve the highest possible scores, will not be responsive because there is no room for improvement. Multi-item measures that include a few items that should change in response to an intervention and many others that are very unlikely to change will tend to be unresponsive.
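
A quick check for a ceiling effect is to see what share of respondents already sit at the maximum score at baseline. The function name and the scores below are hypothetical.

```python
def ceiling_proportion(scores, max_score):
    """Fraction of respondents who already have the highest possible score."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_ceiling = sum(1 for score in scores if score == max_score)
    return at_ceiling / len(scores)

# Hypothetical baseline scores on a 20-point activity measure.
baseline = [20, 20, 20, 18, 20]
share_at_ceiling = ceiling_proportion(baseline, max_score=20)
```

Here four of five respondents cannot score any higher, so the measure has almost no room to register improvement in this group.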

For a measure to be responsive, it must be reliable and include multiple items dealing with aspects of the construct that are likely to change, and the scoring of the items must allow for improvement. For example, a measure of activity limitation that included items dealing with bed mobility and toilet transfers and scored the items based on the level of assistance required to perform the activity might be a very responsive measure when used with elderly patients who had recently undergone surgery to repair a hip fracture. However, this measure would be very unresponsive if used with a group of young amputees being trained for high-level sports activities.

The effective use of outcome measures is an important aspect of clinical care. Deciding which outcomes are relevant to a particular type of client and then selecting appropriate measures of those outcomes requires an understanding of the clinical situation, as well as an understanding of the measurement properties of the outcome measures. When selecting an outcome measure, you should ask the following questions.

Why are you measuring? What type of question are you trying to answer? Do you want to make a diagnostic decision, determine response to an intervention or predict a future outcome?

What are you measuring? Are you interested in some aspect of the disability model or are you interested in some aspect of quality of life? How would you operationally define the construct for your specific purpose?

Who are you measuring? What are the clinical and demographic characteristics of the population you are trying to measure?

Outcome measures can be important tools for guiding clinical decision making. However, to function well these tools must be used with skill and understanding.

1. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. 2nd ed. Upper Saddle River, NJ: Prentice Hall, Inc.; 2000.
2. Fitch E, Brooks D, Stratford PW, et al. Physical Rehabilitation Outcome Measures: A Guide to Enhanced Clinical Decision Making. 2nd ed. Hamilton, Ontario: Canadian Physiotherapy Association; 2002.
3. Granger CV. The emerging science of functional assessment: our tool for outcomes analysis. Arch Phys Med Rehabil 1998;79:235–240.
4. Fletcher RH, Fletcher SW, Wagner EH. Clinical Epidemiology. 2nd ed. Baltimore: Williams & Wilkins; 1988.
5. World Health Organization. International Classification of Functioning, Disability and Health (ICF). Geneva: World Health Organization; 2001.
6. Grimby G. Editorial. Outcome measures in rehabilitation. J Rehabil Med 2002;34:249–250.
7. Ware JE. Conceptualization and measurement of health-related quality of life: comments on an evolving field. Arch Phys Med Rehabil 2003;84 (S2):S43–S51.
8. Wade DT. Editorial. Assessment, measurement and data collection tools. Clin Rehabil 2004;18:233–237.
9. Medicare. Medicare Region C Durable Medical Equipment Regional Carrier, Supplier Update Workshop, 1995.
10. Gailey RS, Roach KE, Applegate EB, et al. The Amputee Mobility Predictor (AMP): a functional assessment instrument for the evaluation of the lower limb amputee's ability to ambulate. Arch Phys Med Rehabil 2002;83:613–627.
11. Cooper K. A means of assessing maximal oxygen uptake. JAMA 1968;203:201–204.
12. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 2nd ed. New York: Oxford University Press Inc.;1995.
13. Beaton DE, Bombardier C, Guillemin F, et al. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine 2000;25:3186–3191.
14. McHorney C, Ware J, Raczek A. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993;31:247–263.
15. Beaton DE, Bombardier C, Katz JN, et al. A taxonomy for responsiveness. J Clin Epidemiol 2001;54:1204–1217.
16. Husted JA, Cook RJ, Farewell VT, et al. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol 2000;53:459–468.

outcomes; reliability; responsiveness; validity

© 2006 American Academy of Orthotists & Prosthetists