MASSOF, ROBERT W. PhD, FAAO
False scales are the Lord's abomination; Correct weights are dear to his heart. - Proverbs 11:1
A major social issue confronting American ophthalmology at the beginning of the twentieth century was the lack of a uniform disability scale for visual impairments (partial sight). Workmen's compensation laws in different states made reference to loss of sight, loss of vision, or reduction of vision as eligibility criteria for benefits. But in most states, the laws made no reference to specific measurable visual impairment or functional vision criteria for compensation. 1 Because of this void, the courts were filled with compensation cases requiring expert testimony from ophthalmologists whose opinions frequently differed from one another. The frustration from the lack of disability guidelines is captured in the comments of Dr. John A. Donovan from Butte, Montana. These comments were published in a 1925 issue of the Journal of the American Medical Association:
Compensation has become a very serious subject. Juries are giving terrific verdicts, and beware-each one of us is constantly liable to become the prey of these vultures who are forever looking up and advising lawsuits-the grasping corporations, the labor unions whose sole teachings are less work, more pay-never once do they advise as we do, efficiency, but always what? … In every village is the man of note, reading the law at a distance of 3 inches with his myopic eye, and with his hypertropic (sic) eye supervising the neighbors…. Men come to me because the examiner in another division rejected them for train service. I request their reappointment, and vice versa. Who was right and at which time? We who have to deal daily with claim agents and claim attorneys of our employers and yet must give satisfaction and advise conscientiously our patients, anticipate the difficulty of making any iron bound standards, yet know it must be done.2
By 1925, visual acuity <20/200 was considered to be equivalent to total vision loss. Snellen inadvertently had much to do with the establishment of this benchmark when, in 1862, he made the 20/200 letter the limit of vision measurements on his chart (i.e., if you could not read any letters on the eye chart, you were blind). 3 Most countries in the world adopted this scientific standard as the legal definition of blindness for purposes of compensation and entitlements. Later, with the passage of the Social Security Act in 1935, the U.S. raised the standard to ≤20/200. 4 Those who met the visual acuity criteria for blindness were eligible for workmen's compensation in nearly all the states. The issue was where to draw the line for those who were not legally blind, but had abnormal vision in the better eye that could not be corrected (usually from work-related injuries). There was a need to establish a scale of partial disability that was attached to visual impairment measures.
VISION ECONOMICS
In the late 19th and early 20th century, vision economists believed that they could develop mathematical formulae for estimating vision disability from measures of visual impairments. Their basic theory was that a person's earning potential was proportional to his functional ability, technical ability, and the market value of his skills, knowledge, personal appearance, and other competitive factors that would influence the employer's judgment of him. 5 An injury could reduce earning potential by reducing functional ability and/or by reducing competitiveness (e.g., because of disfigurement or prejudice). Technical ability was believed to be something acquired with training and education and therefore would not be lost as a consequence of disease or injury.
Zehender, in 1889, was the first to offer a formula that provided estimates of visual ability, in terms of earning potential, as a function of visual acuity. 6 Zehender's scale, which made earning potential 50% at 20/30 visual acuity and 9% at 20/200, is illustrated with the filled circles in Fig. 1. His seminal article launched the field of vision economics, but attracted a storm of criticism that effectively silenced him on the subject for the rest of his career. 7,8
Most of the early work in vision economics was done in Germany, culminating in an opus by H. Magnus in 1894. 9 Magnus developed a complicated formula to estimate earning potential that was the product of functional ability, technical ability, and competitive ability. The competitive ability factor was a root for which the exponent varied with the severity of functional damage. The open circles in Fig. 1 illustrate Magnus' earning ability estimates as a function of visual acuity. The curve is similar in shape to Zehender's scale; however, earning ability drops more rapidly for Magnus, reaching zero at 20/200. Magnus' theory was introduced to America by Würdemann's rewrite of his book in English. 10
With the exception of Würdemann's promotion of Magnus' work, 5,10 Howard F. Hansell was the only American to tackle the problem of vision economics at the beginning of the 20th century. 11 He polled insurance companies and learned that they had no compensation guidelines for visual impairments and blindness. Rather, insurance companies negotiated with beneficiaries, writing policies to allow them to designate the value of their own eyesight, and premiums were adjusted accordingly. In the absence of standards, Hansell offered his own earning potential scale based on his clinical experience. Hansell's scale, which sets earning potential to 80% at 20/40 and 20% at 20/200, is illustrated with X's in Fig. 1.
The early vision economists were harshly critical of each other's scales and accused one another of mathematical incompetence. Although it seems that earning potential would be a straightforward variable to measure as a function of visual acuity, no one collected any data that could be used to confirm or refute the various scales offered over the years.
In the absence of data relating earning potential to visual impairments, states were free to use any compensation criteria they chose. Of those states that adopted visual acuity criteria for compensation, there was little agreement with other states, and there was no compelling rationale for the acuity values chosen to represent different levels of disability. In fact, despite the voluminous work of the vision economists, many of the adopted criteria were grounded in a misunderstanding of the Snellen notation for visual acuity. 12 Several influential ophthalmologists, as well as lobbyists and lawmakers, shared this misunderstanding. The Snellen fraction was taken literally. An individual with 20/40 visual acuity was thought to have lost half of his sight, and someone with 20/200 visual acuity had only 10% of his vision left. A remarkable number of early scholars felt compelled to write articles that presented arguments against the belief that a decimal visual acuity represented the fraction of remaining vision. 13-18 Nevertheless, this belief prevailed.
AMA VISUAL EFFICIENCY SCALE
In 1925, the American Medical Association (AMA) published guidelines on compensation for eye injuries. 19 These guidelines consisted of tables of visual efficiency (0% to 100%) as a function of visual acuity, extent of the visual field, and oculomotor function. Vision disability was simply 1 - visual efficiency. The dependence of visual efficiency on visual acuity was said to be grounded in experience, experiment, and authoritative opinion. However, in reality, the AMA visual efficiency vs. acuity table was based entirely on the speculations of Dr. Albert C. Snell, an ophthalmologist from Rochester, NY, and his mathematically-skilled colleague, Mr. Scott Sterling.
Contrary to claims of the AMA Committee on Compensation for Eye Injuries, no experiments were performed, unless reference was being made to the incomplete attempts by Dr. Snell and Mr. Sterling to measure visual efficiency with simulated visual impairments using etched glass filters (they never published anything more than a description of the etched glass and some anecdotal reports). 20 Published experience consisted of a case study by Dr. Snell that literally described the visual acuities, ocular disorders, and work-related problems of 100 selected patients from his practice. 21 Using today's standards, it is unlikely that Dr. Snell's disability tables would be adopted so uncritically. However, to Dr. Snell's credit, he thought he was basing his vision disability scale on a reasonable extrapolation of what was then well-accepted psychophysical theory, Weber's and Fechner's laws. But he and Mr. Sterling got it wrong.
Fechner's law, which is built on Weber's law, is a mathematical description of the magnitude of sensation as a function of the magnitude of the physical stimulus. Weber observed that the smallest change in stimulus magnitude required to produce a minimally perceptible change in sensation is a constant proportion of the stimulus magnitude. 22 If S refers to the intensity (or magnitude) of sensation and I refers to the intensity (or magnitude) of the physical stimulus, then Weber's empirical psychophysical law, as generalized by Fechner, is
$$\Delta S = K\,\frac{\Delta I}{I} \qquad (1)$$
at the threshold of detection for constant ΔS, where K is a constant. Fechner idealized the finite Δ's with differentials and integrated Equation 1. 23 The result, which Fechner and his followers believed solved the mind-body problem (and is celebrated on Fechner Day, the anniversary of Fechner's epiphany while lying in bed on the morning of October 22, 1850 24), is what is now called Fechner's law:
$$S = K \ln I + C \qquad (2)$$
where C is the constant of integration.
Snell and Sterling defined visual efficiency, E, as the ratio of the patient's resolving capability, R, to the resolving capability of the normal person, R_N, or
$$E = \frac{R}{R_N} \qquad (3)$$
They argued that equal increases in the visual angle [of visual acuity] correspond to equal proportionate reductions in the visual efficiency. Mathematically this statement is
$$\frac{\Delta E}{E} = -c\,\Delta a \qquad (4)$$
where a is the visual angle and c is a constant. This equation is Weber's law but backwards. The visual angle is the stimulus, and resolvability is the property of visual sensation. So the equation should have been written as
$$\Delta E = -c\,\frac{\Delta a}{a} \qquad (5)$$
instead. Undaunted, Snell and Sterling proceeded to compute visual efficiency as
$$E = k\,e^{-c a} \qquad (6)$$
where e is the base of the natural logarithm and c and k are constants. Like Fechner's derivation of the logarithmic law, Equation 6 is the integral of Equation 4 if the Δ's are idealized as differentials (for obvious reasons to the optical community, we will not call this novel derivation Snell's law).
Snell and Sterling justified their scale with some curious reasoning about how their etched glass filters reduced visual acuity. They reasoned that because one filter reduced visual acuity to 20/40, and six such filters in series were required to reduce vision to light perception (no form vision), then 20/40 must represent 1/6th of total vision loss (i.e., a 16% loss of visual efficiency, which was a number that agreed with their calculation of visual efficiency for 20/40 acuity). They also took comfort in the similarity of their visual efficiency scale to the one published earlier by Hansell. They failed to note, however, that Hansell's scale was completely arbitrary and represented his personal opinion. 11
Fig. 2 illustrates the 2000 AMA visual efficiency values vs. visual acuity (filled circles), along with the result of Snell's and Sterling's exponential disability model (solid curve), upon which the AMA values are based. The constants in the equation were chosen to make visual efficiency 100% for 20/20 visual acuity and 20% (80% disability) for 20/200. The X's represent Hansell's scale. The similarity to Snell and Sterling's model is remarkable, and one wonders how much influence Hansell had on their choice of values for the constants. Also, for comparison, Fig. 2 shows visual efficiency based on a literal interpretation of the Snellen fraction (open circles). The AMA disability scales have been updated several times since their adoption in 1925. 25-27 However, until 2001, the AMA visual efficiency vs. visual acuity scale was the one developed by Snell. Therefore, for 75 years, compensation cases in the United States were decided on the basis of an AMA vision disability scale that was conceived in error.
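The exponential model and its two calibration points can be checked numerically. The sketch below is only an illustration, under two assumptions: that the visual angle is expressed as the minimum angle of resolution in minutes of arc (Snellen denominator divided by numerator, so 20/20 is 1 and 20/200 is 10), and that visual efficiency has the exponential form E = k·exp(-c·a). The function names are hypothetical. It solves for the two constants from the anchor values quoted above and compares the result with a literal reading of the Snellen fraction.

```python
import math

def mar(numerator, denominator):
    """Minimum angle of resolution (arcmin) for a Snellen fraction, e.g. 20/40 -> 2."""
    return denominator / numerator

def fit_exponential(a1, e1, a2, e2):
    """Solve E = k * exp(-c * a) for (k, c) from two anchor points (assumed model form)."""
    c = math.log(e1 / e2) / (a2 - a1)
    k = e1 * math.exp(c * a1)
    return k, c

# Anchors taken from the text: 100% at 20/20 and 20% at 20/200.
K, C = fit_exponential(mar(20, 20), 1.0, mar(20, 200), 0.2)

def exponential_efficiency(numerator, denominator):
    """Visual efficiency under the assumed exponential model."""
    return K * math.exp(-C * mar(numerator, denominator))

def literal_snellen(numerator, denominator):
    """Visual efficiency if the Snellen fraction is (mis)read literally."""
    return numerator / denominator

for den in (20, 40, 100, 200):
    print(f"20/{den}: exponential {exponential_efficiency(20, den):.1%}, "
          f"literal {literal_snellen(20, den):.1%}")
```

Under these assumptions the model gives roughly 84% efficiency at 20/40, in line with the 16% loss quoted for the etched-glass argument, whereas the literal reading gives 50%.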
Interest in functional vision scales resurfaced at the close of the 20th century. Today, however, the demand in the United States for measures of functional vision is driven by the needs of managed care and the Centers for Medicare and Medicaid Services (formerly known as the Health Care Financing Administration, which administers Medicare and Medicaid), rather than by worker's compensation. With the publication of the World Health Organization's (WHO) classifications of impairments, disabilities, and handicaps, 28 the definition of disability has broadened beyond the employment and economic consequences of chronic impairments to include every aspect of daily life. Third-party payers and cost managers of health care services are interested in the value and outcome of services in terms of health-related benefits or health-related quality of life. 29 They argue that treatment authorization and reimbursement decisions should be made, at least in part, on the basis of the patient's perception of the value of health care services and the likelihood that rendered services will improve the patient's well-being and daily functioning.
The purpose of this article is to critically examine the problem of measuring vision disability. The first part of the article reviews modern attempts to define and measure vision disability, comparing them to the historical development of other types of measurements. The second part of the article presents in a tutorial fashion the historical evolution of psychometric theory and shows how it applies to the development and validation of vision disability measurement scales.
CURRENT STATUS OF VISION DISABILITY SCALES
Before we can measure vision disability, we must define what it is. At the turn of the last century, vision disability was defined economically. 5,10,19 Vision disability, expressed as a percentage, was interpreted as the probability of being unemployed or underemployed because of chronic vision impairments. The AMA vision disability scale is often thought of as an estimate of relative earning potential as a function of visual acuity or visual field dimensions.
More recently, Colenbrander took issue with the AMA vision disability scale and its narrow focus on employment potential. 30 He argued that disability must be measured in terms of the person's ability to function in everyday life, not just in a job. People with visual impairments have varying amounts of ability, and disability is defined as the visually impaired person's ability to function relative to a norm. Colenbrander used the WHO functional classifications of visual impairments 28 to guide his development of a revised vision disability scale. Like the AMA's visual efficiency scale, Colenbrander's visual ability scale is linked to visual acuity (the Functional Acuity Score, or FAS) and to visual field diameter (the Functional Field Score, or FFS). 31
The AMA visual efficiency scale is a ratio scale. It ranges from 0 to 100%, and zero means no visual efficiency. By definition, visual efficiency cannot be <0, i.e., there are no negative values. The upper bound of 100% is artificial; it represents the lower limit of normal, and in principle the scale could continue on to infinity. As illustrated in Fig. 3, ratio scales have an absolute origin; zero means zero. In the case of visual efficiency, zero must mean no useful vision. However, in the more general case of ability, absolute zero must mean that the person is unable to do anything. Short of using death as a criterion definition of zero ability, the true origin is undefined. The consequence of a more general definition of ability is that equating a person with 50% vision disability to a person with 50% physical disability (on a ratio scale), which might be reasonable in terms of employment potential, implies that the two people have the same ability to function in everyday life. Furthermore, on a ratio scale, 50% disability means that both people are halfway between normal ability (100%) and the origin (i.e., death), which leads to the absurd conclusions that both people can do half as much as a person with normal ability and that the two of them could team up and do as much as a normal person.
Colenbrander's visual ability scale also covers the range of 0 to 100. However, as illustrated in Fig. 4, the FAS is proportional to visual acuity on a logarithmic axis (i.e., log minimum angle of resolution, or logMAR). Unlike the AMA visual efficiency scale, Colenbrander's visual ability scale is consistent with Fechner's law. Where the AMA visual efficiency scale is asymptotic on a logarithmic acuity axis, i.e., visual efficiency continuously approaches zero as logMAR worsens, Colenbrander's FAS continues past zero to take on negative values for visual acuities worse than 20/2000. Although this has no practical significance because one could equate all visual acuities worse than 20/2000 with no useful vision, the FAS defines an interval scale because the origin is arbitrary. Unlike ratio scales, as illustrated in Fig. 3, the zero point on interval scales cannot be interpreted as zero quantity of the variable. Frequently, interval scales extend indefinitely in both directions (ratio scales can also have negative values, e.g., debits and credits relative to no money). Temperature is usually measured on an interval scale, i.e., whether measured in degrees Fahrenheit or Celsius, zero does not mean the absence of heat, and the temperature can fall below zero. In that the FAS is an interval scale, one cannot interpret the FAS value as a percent of normal. Although an FAS value of 50 is halfway between 100 and 0 on the scale, the value cannot be interpreted as 50% of normal. This limitation on the meaning of Colenbrander's scale occurs because the zero point is arbitrary (just as 50°C is halfway between the freezing and boiling points of water, but does not represent 50% of the heat that is required to boil water).
Ian Bailey was the first to propose that visual ability is proportional to logMAR, predating Colenbrander by about 4 years. He developed the Visual Acuity Rating scale which, like the FAS, ranged from 100 at 20/20 to zero at 20/2000 and had a value of 50 at 20/200. He also compared his Visual Acuity Rating scale to the Snell-Sterling visual efficiency scale. 32 The Visual Acuity Rating scale survives today in the right-hand column of some Bailey-Lovie visual acuity charts. Even though it is identical to the FAS, it was not promoted as a vision disability scale.
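The three anchor values quoted above (100 at 20/20, 50 at 20/200, and zero at 20/2000) are consistent with a straight line in logMAR. The sketch below assumes exactly that linear form, score = 100 - 50 × logMAR. The function names are hypothetical, and any rounding conventions used in the published scales are not reproduced.

```python
import math

def log_mar(numerator, denominator):
    """logMAR for a Snellen fraction, e.g. 20/200 -> 1.0."""
    return math.log10(denominator / numerator)

def acuity_score(numerator, denominator):
    """Linear-in-logMAR score (assumed form fitted to the three quoted anchors)."""
    return 100.0 - 50.0 * log_mar(numerator, denominator)

for den in (20, 200, 2000, 4000):
    print(f"20/{den}: {acuity_score(20, den):.1f}")
```

Acuities worse than 20/2000 give negative scores under this form, which illustrates why such a scale has an arbitrary zero and behaves as an interval scale rather than a ratio scale.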
The AMA visual efficiency scale was also defined in terms of visual field radius. In this case, as illustrated with the open circles in Fig. 5, visual efficiency is proportional to visual field radius (1.6% per degree of radius). Visual efficiency is zero when the field radius is zero, and it is 100% when the visual field radius is 63°. As with visual acuity, visual efficiency as a function of visual field radius is a ratio scale with an arbitrary upper boundary. However, unlike the attempt made with visual acuity, there is no theoretical or experimental rationale for the AMA visual efficiency scale as linked to visual field radius.
Esterman modified the AMA visual efficiency scale as a function of visual field radius by giving different parts of the visual field different functional weights. 33 Central field areas are weighted more heavily than are areas in the periphery. The inferior field has greater weight than does the superior field. Esterman's weighting scheme, which was his own invention and was inspired by intuition, not by data, has its largest modifying effects on the AMA visual efficiency scale at visual field radii <20°.
Like the FAS, which is 50 at 20/200, Colenbrander also defined his FFS to be 50 at the threshold of legal blindness, i.e., 10° radius. Similarly, he defined the FFS to be 100 at the threshold of normal, i.e., a visual field radius of 60°. In keeping with Fechner's tradition, the other values of the FFS were chosen to be about proportional to the logarithm of visual field radius (filled circles and solid curve in Fig. 5).
Colenbrander has succeeded in persuading the AMA and Centers for Medicare and Medicaid Services to adopt WHO's broader and more inclusive definition of disability. He also has persuaded the AMA to officially endorse his disability scales, the FAS and FFS, replacing Snell's visual efficiency scale. 31 The FAS and FFS are consistent with Fechner's law and therefore rest on a stronger theoretical foundation. However, in terms of empirical support, the FAS and FFS scales are just as arbitrary as all of the other vision disability scales that went before them.
MEASUREMENT OF VISION DISABILITY
The more than 100-year history of vision disability scales has been one of well-positioned and persistent people, through sheer force of personality, persuading their colleagues and law makers to accept and codify their ideas. For the past century, vision disability theorists and their proponents were not restrained in the least by the shackles of data. Disability theorists have enjoyed unfettered freedom because the experimental side of disability science has been slow to develop. Very few attempts at measuring vision disability were made before 1980. However, over the past 20 years, since the introduction of Colenbrander's and WHO's broader concept of disability, several investigators have tried to measure the impact of visual impairments on daily life.
Three basic strategies for measuring the functional consequences of visual impairments have been used. One approach that is favored by psychophysicists is to measure the behavior of visually impaired people in highly controlled conditions. Examples include laboratory measurements of reading speed or reading comprehension, 34,35 measurements of performance variables in driving or mobility courses, 36,37 measurements of face discrimination or recognition using standard psychophysical paradigms, 38 and measurements of visual search in real-life scenes projected on a screen. 39 The advantage of this strategy is that the measurements per se are not controversial. The investigators are measuring rate, accuracy, thresholds, or duration. Measurements can be accurate and precise, and the interpretations of the measurements are straightforward. The criticism of this psychophysical approach is that it does not directly measure the subject of interest, viz, the impact of visual impairments on daily life. Rather, it is a measurement of surrogate behavior with the hope that the results can be generalized to daily living.
A second strategy for measuring the functional consequences of visual impairments is to ascertain the opinions of expert judges. This approach is commonly used in physical medicine and rehabilitation, but has been used only occasionally to measure vision disability. For example, the Functional Independence Measure obtains ratings from occupational therapists, physical therapists, or other care providers on the patient's ability to perform specific activities independently. 40 The Functional Independence Measure is too coarse a scale for measuring vision disability, but similar rating methods have been developed for use with orientation and mobility instructors and vision rehabilitation therapists. 41,42 The advantage of this approach is that measurement scales based on expert judgments, ranging from ice skating competition to reviews of grant applications, are well accepted in our society. There are well-validated statistical tests for measuring concordance among judges and for creating normed measurement scales that transcend the individual judge or group of judges. The disadvantage of using expert judgments as measurements is that the resulting scale is ordinal. Expert ratings represent rankings of individual observations relative to past observations or relative to ideals. For example, one cannot argue that the difference between a score of 5.0 and 5.5 in a gymnastics competition is the same as the difference between 9.5 and 10, or that the person awarded a 10 has twice the ability of the person awarded a 5. Also, judges only rate what they observe. One might not be able to generalize observations made in the clinic to the patient's everyday life.
The third strategy for assessing vision disability is to ask the patients questions about how well they are functioning. Most of the attempts to directly assess the functional consequences of visual impairments have used this method. Beginning with the work of Bernth-Petersen in 1981, 43 more than a dozen visual function questionnaires have been developed. 44 Most recently, the National Eye Institute commissioned the development of the NEI Visual Functioning Questionnaire (NEI-VFQ) to be used as a functional outcome measure in clinical research. 45 All visual function questionnaires ask the patient to rate the difficulty of performing specific daily activities. The ratings are scored by rank and averaged to produce an overall test score or a set of subscale scores. The advantage of this method is that it directly ascertains the information of interest. The criticism of this approach is that patient ratings are ordinal for each question and the test scores represent averages of ranks, which are of questionable validity. 46
Of the three strategies for measuring vision disabilities, psychophysical measurements are the most scientific and patient questionnaires are the least. However, like the apocryphal drunk searching for his keys in a well-lit area because it is too dark in the alley where he lost them, psychophysical measures enable us to identify with certainty what we find, but do not offer hope of finding what we seek. To measure vision disabilities, we must venture into the unilluminated scientific back alley of visual function questionnaires.
Visual Function Questionnaires
Visual function questionnaires seem to have been developed and validated within the classical paradigm of psychological and educational testing. Each questionnaire, called an instrument, has a number of individual statements or questions, called items, that require the patient to give categorical responses. Different instruments use different response categories. For example, the 14-item visual functioning index (VF-14) requires the patient to rate the difficulty of each item on a scale of 0 (no difficulty) to 4 (unable to do). 47 The Visual Activities Questionnaire requires the patient to rate on a scale of 1 (never) to 5 (always) the frequency of performing the activity described by each item. 48 And the NEI-VFQ requires ratings of difficulty for about half of the items, ratings of frequency of applicability of a description for about one-fourth of the items, and ratings of agreement with statements for the remaining items, all on a five-point scale. 49 Fig. 6 provides examples of three items from the NEI-VFQ.
All visual function questionnaires are scored by averaging patient ratings across items. In the simplest case, the VF-14 score is computed by averaging the patient's ratings for all 14 items and multiplying by 25 to produce an instrument scale that ranges from 0 to 100. In a more complicated example, the 33 items in the Visual Activities Questionnaire are divided into eight domains (a.k.a. subscales): peripheral vision, acuity, visual search, depth, color, adaptation, glare, and processing speed. A score is computed for each domain by averaging patient ratings for the items that are included in that domain. Similarly, each of the 51 items in the NEI-VFQ is assigned to one of 13 domains: general health, general vision, ocular pain, vision expectations, near vision, distance vision, social problems, mental health, role problems, dependency, driving, peripheral vision, and color vision (actually there are 52 items, but item 9 was overlooked when domain assignments were made). As with the other instruments, a score is computed for each domain by averaging patient responses to the items included in that domain.
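The scoring rule described above amounts to a rescaled mean. The sketch below assumes ratings coded 0 to 4, as described for the VF-14, and a hypothetical item-to-domain mapping for the subscale case. How the published instruments treat skipped items is not covered here, so unanswered items (None) are simply dropped, which is an assumption.

```python
from statistics import mean

def vf14_style_score(ratings):
    """Mean of 0-4 ratings times 25, giving a 0-100 score.

    None marks an unanswered item; dropping it is an assumption, not the
    instrument's documented rule.
    """
    answered = [r for r in ratings if r is not None]
    if not answered:
        raise ValueError("no answered items")
    return 25 * mean(answered)

def domain_scores(ratings_by_item, domains):
    """Average the ratings within each domain.

    `domains` maps a domain name to a list of item ids (hypothetical mapping).
    """
    return {name: mean(ratings_by_item[i] for i in items)
            for name, items in domains.items()}

# Made-up ratings for 14 items, for illustration only.
example = [0, 1, 2, 3, 4, 2, 2, 1, 3, 2, 2, 2, 2, 2]
print(vf14_style_score(example))
```

Note that the score is an average of rank codes, so it inherits the ordinal character of the individual ratings.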
The instrument or domain scores computed from averages of patient ratings are accepted at face value, as if they represent a quantity of the variable of interest. This practice of averaging response categories across items to obtain a composite score is an imitation of the traditional scoring methods used in educational testing. In aptitude or achievement testing, one first defines the domain to be tested, for example, verbal skills or mathematics. By consulting content experts in the domain, one then creates a set of items that is a representative sample of the skills or knowledge that define the domain. The test-taker's response to each item is scored as correct or incorrect, and the percentage of correct answers represents the fraction of the domain mastered by the test-taker. Individual test scores can be compared with the distributions of test scores for subpopulations (e.g., all third graders or all college applicants), and transformed to percentiles. 50
The same logic applied to visual function questionnaires would argue that the items are representative samples of each domain. The patient's responses are then averaged to produce a score that represents a fraction of the entire domain. The problem with this reasoning is that unlike aptitude and achievement tests, there are no correct answers to the items in a visual function questionnaire. Each patient sets his or her own personal criteria for rating the level of difficulty of an item. Unless we know the response criteria used by each patient for the different rating categories, we cannot interpret test scores, and we cannot compare patients to each other in a meaningful way. Despite the liberal use of educational testing jargon by instrument developers (e.g., domains, content validity, construct validity, and criterion validity), visual function instruments, as they are routinely scored, are not comparable to aptitude and achievement tests. Rather, they produce Likert scales.
Likert Scales
Rensis Likert, a famous and politically powerful sociologist from New York University, published a landmark article in 1932 that described his method for measuring attitudes and opinions using rating scales. 51 Likert was not the first to offer a method of measuring attitudes with rating scales, but because of the simplicity of his response scoring and the long since forgotten supporting arguments that he presented, every imaginable type of subjective survey developed over the past 70 years has incorporated Likert scales. Within our own field, Likert scales are so taken for granted that every visual function questionnaire incorporates them uncritically with no reference made by the developers to their origin or rationale (indeed many developers may not even be aware that they are using Likert scales).
Likert argued that the level of agreement with specific statements could be used to measure an individual's attitude (or opinion) about the subject of the statements. Each subject is an item, and Likert considered each item to be independent of all other items. For example, Likert was interested in racial attitudes among college students, and one of his questions was, In a community in which the negroes [sic] outnumber the whites, under what circumstances is the lynching of a negro [sic] justifiable? The possible responses for this question were (a) Never. (b) In very exceptional cases where a specially brutal crime against a white person calls for swift punishment. (c) As punishment for any brutal crime against a white person. (d) As punishment for any gross offense [felony or extreme insolence] committed against a white person. (e) As punishment for any act of insolence against a white person. Presumably, the more racist a person is, the lower their threshold will be for justifying a lynching. By asking several such questions covering different subjects related to racial attitude and scoring the responses, Likert believed that he could quantitatively measure a person's attitude toward other races.
Likert assumed that the rating response to each item represented a separate measure of attitude. For a given population, Likert further assumed that an attitude, such as opinions about other races, would be normally distributed. For each item, the mean of the attitude distribution for the population could be shifted from the overall mean, depending on sampling biases and confounding variables specific to the item. However, it was expected that the average distribution across all items would reflect the population's distribution of attitude. Thus, within Likert's framework, all items are expected to be equal a priori, and each item provides a sample measurement with some amount of error (this is the rationale for using split-half reliability testing, today replaced with Cronbach's alpha, which is equivalent to the average of all possible split-half reliability estimates 52).
Likert recognized that to achieve a meaningful measurement of attitude, the subject response scores for each item must be positioned on an interval scale. To accomplish this positioning, Likert transformed response categories to normal standard deviation units (i.e., sigma units, today called z-scores). Fig. 7 illustrates Likert's reasoning and how he generated his scale.
First, Likert assumed that the sample of subjects responding to his questionnaire was drawn from a population for which the attitude of interest is normally distributed (bell-shaped curve in Fig. 7). The response categories are ordered for each item (ranging from the most negative to the most positive attitude), and successive response categories are represented by successive intervals on the attitude scale (intervals labeled a through e in Fig. 7). With five response categories, there are four category boundaries on the attitude scale (long vertical lines in Fig. 7). Theoretically, the number of people who respond with category a is estimated as the area under the normal distribution, measured from negative infinity to the boundary of category a (shaded area in Fig. 7). Therefore, by counting the subjects who responded with category a, one can estimate the location of the boundary for category a by looking up the z-score that corresponds to the percentage of subjects who gave response a (Likert mistakenly reported that he used Table 22 in Thorndike's book 53 to do this; he actually used Table 23). Applying the same reasoning to the other response categories, Likert estimated all of the category boundaries on the attitude scale. The boundaries illustrated in Fig. 7 are estimated from the sample data that Likert published in his Table 2. 51
For reasons that will become clear in a moment, Likert constructed his sigma scale by averaging the interval boundaries (short vertical lines in Fig. 7 labeled with the letters). However, the first and last intervals (a and e) have only one boundary each (the other boundary is at minus and plus infinity, respectively). To obtain sigma scale values for the first and last categories, Likert invented extra category boundaries (dashed vertical lines in Fig. 7) corresponding to 1% and 99% areas under the normal distribution (rather than the infinite limits for 0% and 100%, where they really are).
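The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reproduction of Likert's calculations: the response counts are hypothetical, the function name `likert_sigma_values` and its `tail` parameter are invented here, and the sigma value of each category is taken as the midpoint of its two boundaries, as the text describes.

```python
from statistics import NormalDist

def likert_sigma_values(counts, tail=0.01):
    """Return (boundaries, sigma_values) for one item.

    counts: respondents in each ordered category (hypothetical data).
    tail:   area assigned to the invented outer boundaries
            (0.01 gives the 1% and 99% limits described in the text).
    Assumes no cumulative proportion equals 0 or 1 and that the interior
    boundaries fall inside the invented outer boundaries.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    total = sum(counts)

    # Interior boundaries: z-scores of the cumulative response proportions.
    boundaries = []
    cumulative = 0
    for count in counts[:-1]:
        cumulative += count
        boundaries.append(std_normal.inv_cdf(cumulative / total))

    # Invented outer boundaries stand in for minus and plus infinity.
    edges = ([std_normal.inv_cdf(tail)]
             + boundaries
             + [std_normal.inv_cdf(1 - tail)])

    # Sigma value of each category: midpoint of its two boundaries.
    sigma_values = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    return boundaries, sigma_values

counts = [10, 20, 40, 20, 10]  # hypothetical counts for categories a to e
boundaries, sigma_values = likert_sigma_values(counts)
for label, value in zip("abcde", sigma_values):
    print(label, round(value, 2))
```

With five categories the function returns four interior boundaries and five sigma values. Only the three interior sigma values are determined entirely by the data; the two extreme values depend on the `tail` argument.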
The sigma scale for attitude is an interval scale. Each response category for each item has a corresponding sigma value that is estimated from the response data for all subjects. Based on his belief that each item was an independent sample of attitude, Likert estimated an attitude measure for each subject by averaging across all items the sigma values that correspond to the subject's responses. The resulting attitude measure is equivalent to a percentile for the population represented by the subject sample.
Likert's next step, which forever endeared him to survey developers, was to argue that the sigma values, which are on an interval scale, are equivalent to the ordinal ranks of the response categories. Fig. 8 illustrates response category ranks (1 to 5 for categories a through e) as a function of the sigma value for that category. The filled circles in Fig. 8 are from Likert's Table 2. 51 Without presenting any data to support his claim (other than the values in Table 2), Likert reported that the response category ranks were proportional to the sigma values. Therefore, Likert concluded, averaging response ranks is equivalent to averaging sigma values, and for all practical purposes, the average of response ranks across items produces a score that is equivalent to a measure on an interval scale. Likert supported his conclusion with consistently high correlations between average response ranks and average sigma values. However, correlations only establish that the two estimates are monotonically related, not that they are proportional (i.e., correlations assume a linear relationship; they do not prove it).
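The point about correlations can be made concrete with a small numerical sketch. The values below are hypothetical and are not taken from Likert's tables: the five ranks are paired with a deliberately curved set of scale values (the cubes of the ranks), and the helper `pearson_r` is written out here only to keep the example self-contained.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

ranks = [1, 2, 3, 4, 5]
curved = [rank ** 3 for rank in ranks]  # hypothetical, clearly nonlinear values

r = pearson_r(ranks, curved)
ratios = [value / rank for rank, value in zip(ranks, curved)]

print(round(r, 3))  # high correlation despite the curvature
print(ratios)       # proportionality would require a constant ratio
```

The correlation is above 0.94 even though the ratio of scale value to rank changes by a factor of 25 across the categories, which is why a high correlation alone cannot establish proportionality.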
The conventional method of scoring visual function questionnaires follows Likert's prescription, with the presumption that the response ranks to each item are proportional to the interval-scaled sigma values. Janet Szlyk and colleagues 54 published the results of a visual function questionnaire that they administered to visually impaired patients with retinitis pigmentosa. Table 2 in their article lists the number of patients who responded with each of the categories (difficulty ratings from 1 to 5) for each of the 33 items in their functional assessment questionnaire. Using Likert's methods, I estimated sigma values for the three middle rating categories (i.e., categories 2, 3, and 4) for each item. As shown in Fig. 9, with the exception of items 10, 16, and 23, there is roughly the same linear relation between response category rank and sigma value for all items (same slope, different intercepts). This is seen more clearly in Fig. 10, which illustrates the response rank as a function of the average sigma value across items for the Szlyk et al. 54 data. This almost linear relation would seem to confirm Likert's scale (i.e., average response rank) for this visual function questionnaire. However, note that unlike Likert's analysis, I omitted sigma values and response ranks for the most extreme response categories (1 and 5).
The apparent validity of Likert's scale rests on a mathematical sleight of hand. The extreme response categories are unbounded, so there are no sigma values corresponding to those categories. To use averages of response ranks as a substitute for estimating interval scales, Likert needed sigma values for his extreme categories. He invented boundaries for these categories that put the response ranks vs. the average sigma values on the regression line, but made no mention of this critical step anywhere in his article (these numbers simply appear in Table 2 without explanation). However, the choice of criteria for the extreme category boundaries is completely arbitrary. If Likert had chosen extreme category boundaries that correspond to 0.1% and 99.9% areas under the normal distribution, instead of 1% and 99%, then the average sigma values would have plotted where the open circles are in Fig. 8. The open circles would move farther from the regression line as the criteria for the extreme category boundaries moved closer to 0% and 100%. In short, the sigma values for the extreme categories are illegitimate, and the rationale for Likert scales depends on a flawed argument. Because Likert scales are averages that include the rankings for the extreme response categories, they are uninterpretable, irrespective of the quasi-linearity with sigma values for the interior response categories. Therefore, visual function questionnaires that generate Likert scales, as they all do, produce numbers that cannot be interpreted as measurements.
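The dependence of the extreme sigma values on the invented boundaries is easy to check numerically. The sketch below assumes, purely for illustration, that 10% of subjects chose the lowest category, and it takes the sigma value of that category as the midpoint between the invented outer boundary and the first real boundary; the function name `extreme_sigma` is hypothetical.

```python
from statistics import NormalDist

def extreme_sigma(first_boundary_area, tail):
    """Sigma value assigned to the lowest response category.

    first_boundary_area: proportion of subjects in the lowest category,
                         which fixes the first real boundary.
    tail:                area chosen for the invented outer boundary.
    """
    z = NormalDist().inv_cdf  # z-score for a given cumulative area
    return (z(tail) + z(first_boundary_area)) / 2

# Same data (10% in the lowest category), three arbitrary tail choices.
for tail in (0.01, 0.001, 0.0001):
    print(tail, round(extreme_sigma(0.10, tail), 2))
```

The response data are identical in all three cases, yet the sigma value for the extreme category keeps moving outward as the tail criterion approaches 0%, while its rank stays fixed at 1.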
Developers and supporters of visual function questionnaires would take issue with the conclusion that the numbers produced by their instruments are not valid measurements. Through the use of inferential validation arguments, they try to build a circumstantial case that their instrument scores represent measurements. They hold focus groups of visually impaired patients to identify problem areas that should be assessed, 49 survey experts to achieve a consensus on the content of items, 47 and demonstrate (usually with Cronbach's alpha 52) high interitem consistency 55 (Cronbach's alpha actually is a reliability coefficient; it is sensitive to internal consistency, but internal consistency should be independent of the number of items, and Cronbach's alpha is not). They demonstrate that instrument scores correlate in the expected direction with other measurements that are more believable, such as visual acuity 56 (external validity). They use factor analysis to demonstrate that items can be grouped into statistically consistent domains 48 (factorial validity). And they show that instrument scores can be used to discriminate groups of patients that are expected a priori to have different characteristics or levels of problems 55 (criterion validity). Such arguments, however, can only lead us to hypothesize that the numbers chosen to represent responses to the items in the questionnaire must in some way be related to the variable of interest; they do not establish that those numbers are measurements of the variable. To be considered measurements, strict criteria must be met by the operations that generate the numbers and by the interpretation of the quantities recorded. 57 Above all, there must be an explicit theory of how the variable is related to the number that is recorded as a measurement. 58
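The remark that Cronbach's alpha depends on the number of items can be illustrated with the standardized form of the coefficient, which expresses alpha in terms of the number of items and the average interitem correlation. The sketch below uses that standard formula; the average correlation of 0.3 is a hypothetical value chosen only for illustration.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha for n_items with average
    interitem correlation mean_r."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

# Hold the average interitem correlation fixed and vary only the item count.
for n_items in (5, 10, 20, 40):
    print(n_items, round(standardized_alpha(n_items, 0.3), 3))
```

With the interitem correlation held constant, alpha rises steadily as items are added, so a high alpha can reflect the length of a questionnaire as much as the consistency of its items.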
MEASUREMENT PRINCIPLES
Measurements begin with observations. In the case of weight, by lifting objects, a person can observe that object A is heavier than object B. In the case of temperature, a person can observe from his level of comfort that yesterday was warmer than it is today. In the case of vision disability, the patient can observe that activity A, such as sewing a hem, is more difficult than activity B, such as taking a bath. All observations are ordinal. In other words, all observations entail judgments of greater than, less than, or about equal to. 46
Multiple observations can be ordered by making successive judgments of inequality. For example, a set of objects can be ordered from lightest to heaviest, a selection of days can be ordered from coldest to warmest, and a list of daily activities can be ordered from least difficult to most difficult. In those instances where we can establish for a population that a consensus exists on how observations should be ordered, we have the basis for developing a measurement.
Measurements consist of counting the number of units that are required to achieve equivalence with some trait of the object of the measurement. The role of observation in measurement is to judge equivalence. For example, the ancient Arabs used the qirat, the seed of a coral tree, to define a unit of weight. The object to be weighed was placed on one pan of a balance scale, and qirats were added to the other pan until the scales were judged to be balanced. The weight of the object was simply the number of qirats required to achieve balance. 59
To measure vision disability, we must define a unit of measurement that we can count and use to judge equivalence. The Likert unit that is used by most visual function questionnaires is a rank assigned to the patient's categorization of the perceived difficulty of the activity described by the item. To construct an analogy, if we were going to weigh an object in Likert units, each person would pick up the object to be weighed and rate on a scale of 1 to 5 how heavy it is. For example, the object would be given a Likert score of 1 if it is judged to be very light, a score of 2 if it is judged to be moderately light, a score of 3 if it is judged to be somewhat heavy, a score of 4 if it is judged to be very heavy, and a score of 5 if it is judged to be impossible to lift. By averaging Likert scores across people, we would obtain a value in Likert units that represents that object's weight. It is likely that objects weighed in this way would be ordered correctly by weight. But calling this a measurement is absurd because we know that the Likert units in this example, which are defined to have equal intervals, are very unlikely to produce equal intervals of weight. And we have no a priori expectation that two objects with the same average Likert score necessarily would balance.
Allowing people to invent their own definitions of criteria against which observations are judged (e.g., personal definitions of degrees of light and heavy) might seem to present insurmountable problems for developing measures of vision disability. After all, there most likely would be wide variations between people in their definitions of response category criteria, so there would be no consistent scale for the measurement. This problem, however, is not new to measurement, and our forebears learned how to deal with it.
Dating to a period before recorded history, human dimensions were used as units for measuring distance. These units included the finger (width of one finger), inch (end of the thumbnail to the first knuckle), palm (four fingers), foot (heel to toe, usually of the shoe), cubit (elbow to the tip of the middle finger), yard (nose to the end of the middle finger on a laterally stretched arm), fathom (distance between fingertips of laterally stretched arms), etc. It was obvious to everyone that there were variations between people in the sizes of these personal units. But these variations created an issue of measurement precision, not one of measurement accuracy. Precision refers to the consistency of a measurement, irrespective of its accuracy. Accuracy refers to systematic bias in the measurement, irrespective of its precision. In cases of commerce, where consistency is important, measurement precision was improved by always using the same personal unit. For example, cloth was sold by the yard and, for a given shop, it was always measured by the same person who, by definition, had a standard yard arm. The comparative shopper would not only have to compare prices, but also the biases in the quantity of cloth represented by the yard. 60,61
The statistical nature of measurements was made explicit in the legal definition of the rod. The rod, a measure used by surveyors, was defined in the 16th century as 16 feet (today it is 16.5 feet). In surveying, accuracy was the more important issue. To calibrate a surveying rod, the surveyor would stand outside a church and recruit 16 average men from the congregation after the service. The 16 men would stand in line with their left feet touching heel to toe. The surveyor and a government functionary would mark off the distance spanned by the feet of the 16 men to define a right and lawful rod. 60,62
In the case of visual function questionnaires, individual differences in response category criteria can be managed statistically. Like the definition of the rod in terms of the average foot, we can talk about average response category criteria for the population. We still must acknowledge the individual differences, but like the individual yard arm, these differences will have the greatest impact on the accuracy of our measurements of vision disability for individual people. In the case of measurements of vision disability for groups of people, like the rod, the individual differences in response category criteria will impact precision.
Just as the right and lawful rod can be operationally defined by law, one might argue that the units and rules of vision disability measurements could simply be defined by edict and then we could get on with the business of making measurements. In effect, the National Eye Institute has done just that with the creation of the NEI-VFQ and their encouragement of investigators seeking NEI funding to use it as a measure in their studies. 63 Also, some Medicare carriers now require the use of the NEI-VFQ or similar instruments to make measurements of medical necessity to support claims for reimbursement of low-vision rehabilitation services. 64 But arbitrarily declaring the NEI-VFQ to be a measure is equivalent to saying that measurement is nothing more than the assignment of numbers to observations according to rules. This view of measurement was formalized by S. S. Stevens in the 1950s, 65 but a wide variety of measurement theorists and philosophers have thoroughly and convincingly discredited such liberal definitions of measurement as pseudoscience. 58
A big difference between measuring weight, capacity, or distance and measuring vision disability is that the units of weight, capacity, and distance can be seen and counted publicly (called manifest or extensive variables). In the case of these measures, there is a direct and easily appreciated proportionality between the measurement and the thing being measured (e.g., doubling the quantity doubles the weight). Also, for weight, capacity, and distance there is an intuitive understanding of the variable being measured as something that can be divided and counted.
In the case of vision disability measurements, the observations and the criteria against which the observations are judged are personal and private (called latent or intensive variables). It is difficult to visualize how vision disability could be divided for the purpose of counting. And reaching a consensus on when the sum of units that are counted is equivalent to a quantity of vision disability seems irresolvable. These problems, however, are not much different from those faced by people early in our history who were trying to make measurements of temperature.
Estimates of warmth are highly individual. Like the feelings of difficulty of performing different activities, the feelings of warmth and cold can be talked about and ordered, but cannot be compared in any publicly observable way. Before the 16th century, there was no method of reconciling differences between people in their judgments of temperature. Although inequalities in temperature could be judged and agreed upon, there was no theory of what constituted the temperature variable, and there was nothing tangible that could be divided and counted. It was clear that feelings of temperature must be linked to something physical, but nothing could be seen (except the consequences of extreme heat or cold).
In the late 16th century, Galileo performed an experiment that demonstrated a visible correlate with temperature. His apparatus consisted of a glass tube that was open at one end and enlarged into a hollow bulb at the other end. The open end of the tube was immersed in a bowl of spirits (probably wine) and the fluid would rise part way up the tube. When the glass bulb was warmed, e.g., by holding it in his hand, the fluid in the tube would descend. When the glass bulb was chilled, the fluid in the tube would rise. We now know that warming and cooling the bulb changed the temperature, and therefore changed the pressure of the air trapped in the bulb. When the bulb was warmed, the pressure of the trapped air went up and forced the fluid out of the tube. When the bulb was cooled, the pressure of the air decreased, and the higher pressure of the air outside the bulb forced fluid up the tube. Galileo called his invention a thermoscope because it allowed one to visualize changes in temperature. 60,66 The thermoscope ultimately led to the development of the thermometer. 67
Like other areas of measurement, the history of thermometers is one of calibration and standardization of scales and the development of more reliable and accurate measuring instruments. However, unlike distance, capacity, and weight, temperature measurements are indirect. To measure temperature, we measure surrogate variables that we can see, such as the height of a column of mercury or alcohol. The temperature variable is a theoretical construction that today we understand to be equivalent to the average disposable thermokinetic energy of a substance. When thermometers were first invented, people thought they were measuring the effects of an invisible massless substance, called caloric, that flowed between bodies. Temperature measures were interpreted to be indirect measures of the quantity of caloric in the substance. However, the measurement art was ahead of the science, and that is why today we commonly measure temperature with interval scales (Fahrenheit, Celsius, and Reaumur) rather than using the ratio scale that is tied to thermodynamics (Kelvin). 60,68
As with the measurement of temperature, to measure vision disability, we must begin with a theoretical construction of the variable to be measured. Like the measurement of temperature, the measurement of vision disability must be indirect because we cannot see and do not have direct access to the phenomenon that we want to measure. Like temperature, the creation of a vision disability scale must be accomplished through theory and scientific observations. And like early measures of temperature, we know that there must be an absolute zero for vision ability, but we do not know how to position what we measure in relation to the true origin. Therefore, given the state of our knowledge, we must use interval scales, rather than ratio scales, to measure vision disability. However, unlike measurements of temperature, there are no obvious physical phenomena that change in a measurable way with vision disability and could be used as a manifest surrogate variable. For this reason, the development of disability measurements has followed the path created by the development of psychological and educational measurements.
PSYCHOMETRICS OF VISUAL FUNCTION QUESTIONNAIRES
As Colenbrander pointed out, when we talk about disability, we actually are discussing ability.30,31 We say that a person is disabled when, because of disease, disorder, or injury, he has abnormal difficulty with (or cannot perform) routine activities that he otherwise would be able to do. The variable of interest is ability. The level of disability is simply the difference between normal ability and the ability of the patient.
Ability is a trait of the person. Different people have different levels of ability, and that is what we want to measure. If differences between people in ability can be attributed mainly to differences in visual impairments, then we are talking about differences in vision ability. Vision ability is just that component of ability that can be modulated by changes in visual function. Similarly, cognitive ability is that component of ability that can be modulated by changes in cognitive function, physical ability is that component of ability that can be modulated by changes in physical function, etc. For a given population, e.g., people with low vision, ability will have some distribution. This assumption is the same as the starting assumption of Likert in the development of his attitude scale. It also was the starting assumption of Likert's contemporary, Louis L. Thurstone of the University of Chicago, who was the first to suggest that latent psychological traits, such as attitudes, could be measured on an interval scale (Thurstone pioneered the field of attitude testing; Likert was a critic and offered his scale in response to Thurstone's work).
Thurstone's Psychometric Theory
Activities can be itemized. Likert assumed that all items measured the same thing, and that differences between item measures represented measurement errors. Contrary to that idea, Thurstone believed that different items represented different values on the attitude scale. Thurstone's method of measuring latent psychological variables is based on that premise.
Thurstone generalized the sensory psychophysical laws of Weber and Fechner to other psychological variables that introspectively seem to vary in magnitude, such as qualitative judgments of attitudes or difficulty. Thurstone argued that because of instability and imperfections in the nervous system, a stimulus with intensity I in Fechner's equation gives rise to a distribution of sensory magnitudes, S. For stimulus i, the mean of the sensation distribution is $S_i$ and the standard deviation is $\sigma_i$, and for stimulus j, the mean is $S_j$ and the standard deviation is $\sigma_j$. For the person to be able to say reliably that the two sensations are just noticeably different, the difference between the mean sensations would have to exceed the variability of the sensations by some criterion factor, i.e., $\Delta S = (S_i - S_j)/\sigma_{i-j}$, where $\sigma_{i-j}$ is the standard deviation of the difference between the two sensory distributions (called discriminal dispersion by Thurstone). The variance of the difference between two distributions is the sum of the variances minus twice the covariance, or $\sigma_{i-j}^2 = \sigma_i^2 + \sigma_j^2 - 2 r_{ij} \sigma_i \sigma_j$, where $r_{ij}$ is the correlation between the sensory distributions for the two stimuli. These equations describe Thurstone's law of comparative judgments and basically say that the standard deviation is the unit of discrimination between quantities of psychological variables. Within Thurstone's framework, the constant $\Delta S$ of Weber's law, as written by Fechner, is a constant ratio of the difference between the means to the standard deviation 69 (i.e., constant signal-to-noise ratio in the jargon of modern psychophysics 70).
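As a worked illustration of the law of comparative judgments, the sketch below computes the discriminal dispersion as the square root of the sum of the two variances minus twice the covariance, and then expresses the separation of two means in units of that dispersion. The function names and the numerical values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt

def discriminal_dispersion(sd_i, sd_j, r_ij=0.0):
    """Standard deviation of the difference between two distributions:
    sqrt(sd_i^2 + sd_j^2 - 2 * r_ij * sd_i * sd_j)."""
    return sqrt(sd_i ** 2 + sd_j ** 2 - 2 * r_ij * sd_i * sd_j)

def separation(mean_i, mean_j, sd_i, sd_j, r_ij=0.0):
    """Difference between the means in units of the discriminal dispersion.
    Undefined when the dispersion is zero (e.g., equal SDs with r_ij = 1)."""
    return (mean_i - mean_j) / discriminal_dispersion(sd_i, sd_j, r_ij)

# Equal standard deviations and uncorrelated distributions:
# the dispersion reduces to sd * sqrt(2).
print(round(discriminal_dispersion(1.0, 1.0), 3))

# Positive correlation shrinks the dispersion, so the same difference
# in means becomes easier to discriminate.
print(round(separation(3.0, 1.0, 1.0, 1.0, r_ij=0.0), 3))
print(round(separation(3.0, 1.0, 1.0, 1.0, r_ij=0.5), 3))
```

When the standard deviations are equal and the correlation is zero, a fixed separation in dispersion units corresponds to a fixed distance on the underlying scale, which is the condition under which equal discriminability implies equal intervals.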
In the case of measuring attitudes, Thurstone thought of each item in the attitude survey as a stimulus that generated a response that corresponded to a specific level on the attitude scale (with some uncertainty represented by the standard deviation). For items to be judged as representing different attitudes, the difference between the mean scale values would have to exceed the standard deviation by a criterion amount (law of comparative judgments). Thus, given a large number of items, judges should be able to sort the items into piles that represent just noticeable differences (JNDs) in attitude from the neighboring piles. If the standard deviations were about equal for all items (a condition that Thurstone called case V), then the piles of items that correspond to equivalent attitudes would represent equal intervals on the attitude scale. The items, sorted into equal JND intervals, could then serve as a ruler for measuring the attitude of each individual by eliciting the level of agreement with each item.
Thurstone implemented his theory with a 130-item survey on opinions about the church that he developed with E. J. Chave, a professor of religious education at the University of Chicago. 71 Examples of items in the survey are item 1, "I have seen no value in the church"; item 90, "I like the ceremonies of my church but do not miss them much when I stay away"; and item 4, "I believe that membership in a good church increases one's self-respect and usefulness." Each of the 130 items was printed on a small slip of paper. Three hundred subjects were asked to sort the items into 11 piles labeled A through K that represented an evenly graduated series of opinions ranging from the most antichurch to the most prochurch. The 300 judges were instructed to ignore their own opinions about the church and simply judge the meaning of each statement in terms of the strength of the pro or con opinion about the church. Thurstone then counted the frequency that each item was sorted into each of the 11 categories and constructed cumulative frequency functions on the 11-point equal-appearing interval attitude scale.
Fig. 11 illustrates the cumulative frequency functions from Thurstone and Chave for items 1 (filled circles) and 4 (open circles). These frequencies are plotted on the attitude scale defined by response categories, which are assumed to be positioned at equally spaced intervals (the last category for item 1 is omitted because it contains all of the remaining items that were too extreme to sort into the other 10 categories). The solid curves through the data are best-fit cumulative normal distribution functions. The value on the attitude scale that corresponds to a cumulative frequency of 0.5 is the mean for that item, and the slope of the cumulative frequency function is inversely proportional to the item's standard deviation.
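One simple way to estimate an item's mean and standard deviation from cumulative sorting frequencies is sketched below. It is an assumption of this example, not a description of how the curves in Fig. 11 were fitted: the cumulative proportions are converted to z-scores and a straight line is fitted by least squares, so that the slope equals the reciprocal of the standard deviation. The data are synthetic, and the function name `fit_cumulative_normal` is hypothetical.

```python
from statistics import NormalDist

def fit_cumulative_normal(scale_positions, cumulative_proportions):
    """Estimate (mean, sd) of a cumulative normal by fitting the line
    z = (x - mean) / sd to the z-scores of the proportions.
    Proportions of exactly 0 or 1 have no finite z-score and are skipped."""
    z = NormalDist().inv_cdf
    points = [(x, z(p))
              for x, p in zip(scale_positions, cumulative_proportions)
              if 0 < p < 1]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_z = sum(zv for _, zv in points) / n
    slope = (sum((x - mean_x) * (zv - mean_z) for x, zv in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_z - slope * mean_x
    sd = 1 / slope                    # slope is the reciprocal of the SD
    item_mean = -intercept / slope    # scale position where proportion = 0.5
    return item_mean, sd

# Synthetic check: cumulative proportions generated from a known item
# (mean 6.0, SD 1.5) on an 11-point equal-appearing interval scale.
positions = list(range(1, 12))
true_item = NormalDist(mu=6.0, sigma=1.5)
proportions = [true_item.cdf(x) for x in positions]

item_mean, item_sd = fit_cumulative_normal(positions, proportions)
print(round(item_mean, 3), round(item_sd, 3))
```

With real sorting data the points would scatter around the fitted line, and the size of that scatter is one indication of how well the equal-interval assumption holds for the item.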
Thurstone theorized that the standard deviation of the item distribution on the attitude scale is a combination of the uncertainty of each judge in the comparative judgment and differences between judges in their interpretations of the item or variations in their reactions to the item. Large standard deviations signaled poor item consistency, which would lower measurement precision. The means of the best-fit cumulative normal distributions were taken as the attitude, on the interval attitude scale, that corresponds to that statement. The standard deviations were used to decide which items should be excluded from the final attitude measurement instrument.
The attitudes of individuals were measured by asking each subject to check the items that expressed their sentiment toward the church. An individual would be expected to agree only with a narrow range of items on the attitude scale that are in the neighborhood of his or her attitude. Thus, Thurstone used the average scale values of the items checked to compute the subject's score. The standard deviation of the scale values for the checked items provided an estimate of the precision of the attitude measure for each subject.
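The scoring rule can be written out directly. In the sketch below the item labels and scale values are hypothetical, and the sample standard deviation is used for the precision estimate, which is an assumption of this example because the text does not specify which form of the standard deviation was used.

```python
from statistics import mean, stdev

def thurstone_score(item_scale_values, checked_items):
    """Score a subject as the average scale value of the endorsed items.
    Returns (score, spread); spread is None when fewer than two items
    are checked, because no standard deviation can be computed."""
    values = [item_scale_values[item] for item in checked_items]
    score = mean(values)
    spread = stdev(values) if len(values) > 1 else None
    return score, spread

# Hypothetical scale values on an 11-point attitude scale.
item_scale_values = {"A": 2.0, "B": 3.0, "C": 4.0, "D": 8.5}

score, spread = thurstone_score(item_scale_values, ["A", "B", "C"])
print(score, spread)
```

A subject who endorses items that are widely separated on the scale, such as A and D here, would receive a large spread, signaling an imprecise attitude measure for that person.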
The Sickness Impact Profile, 72 which has been used to assess vision disability 73 and to validate some visual function questionnaires, 47 was developed using Thurstone's equal-appearing intervals method. The Sickness Impact Profile consists of 312 unique items that are simple statements describing behavioral dysfunction. An example of an item is, "I stay home most of the time." Twenty-five judges (consisting of graduate nursing students, medical students, health service administration students, and physicians) rated the severity of dysfunction described by each of the items on an 11-point scale that ranged from minimally dysfunctional to severely dysfunctional. The rating categories are equivalent to the piles into which Thurstone's judges sorted the items that were printed on slips of paper. Item values were estimated from cumulative functions on an interval scale constructed from the rating categories. To use the Sickness Impact Profile, subjects identify those items that accurately describe themselves. The average item value of the items checked defines the score for that individual.
Thurstone's method of equal-appearing intervals has two theoretical limitations. First, it is an article of faith that the rating categories, or piles, are spaced at equal intervals. Even within Thurstone's framework of the law of comparative judgments, there is no guarantee that equal JND steps will represent equal intervals on the latent variable scale (i.e., the JND interval is proportional to the standard deviation). Uncertainty in judgment could be different for different parts of the scale, especially if items that cluster together on the scale have similar properties that influence judgment. Unequal intervals will show up as noise in the fit of the cumulative function to cumulative frequencies. Second, the judges who create the interval scale might not be representative of the population that will be tested with the instrument. Thus, even if the scale were valid for the judges, it might not be valid for the subjects of the measurement. This second criticism was one of Likert's principal objections to Thurstone's method and one that he hoped to correct by constructing a scale from the responses of the test subjects.
In reality, there is little difference between Likert's assumption that the ranks of response categories used by subjects form equal intervals and Thurstone's assumption that the rating categories used by judges form equal intervals. Thurstone used a larger number of categories; therefore, the variations in category intervals were probably small compared with the total range of the scale. In comparison, Likert scales typically use a small number of categories and, therefore, there could be relatively large inequalities in category intervals. The main difference between Thurstone and Likert is that Thurstone assumed that different items could occupy different positions on the attitude scale, whereas Likert assumed that all items had the same average position on the attitude scale, but were sampled with some error. This is a very important difference between theories that has significant consequences for the interpretation of instrument scores. To better understand the difference between Likert's and Thurstone's theories and to show how the current use of Likert scales by visual function questionnaires violates Likert's most basic assumptions, we will examine Likert's methods within Thurstone's framework.
Like Thurstone, to measure attitudes, Likert invented a number of statements that represented a range of opinions about a particular topic. Thurstone simply collected as many statements as possible that were related to the topic and then had his judges rank order those statements on an attitude scale. Each of those statements was considered an item in Thurstone's instruments. In contrast, Likert identified a large number of specific subtopics that were related to the topic of interest. It was his belief that the average of the person's attitudes toward the individual subtopics (e.g., opinion about lynching) would reflect the person's overall attitude toward the topic (e.g., opinion about other races).
Each of the subtopics was considered an item in Likert's instruments. For each subtopic, Likert invented a statement that would represent the most negative opinion, a statement that would represent the most positive opinion, and a statement that represented a neutral position. He then selected a statement that represented an opinion midway between the most negative and neutral attitudes and another statement that represented an opinion halfway between the neutral and most positive attitudes. These statements were assigned ranks that were used for scoring. In effect, Likert created Thurstone's equal-appearing interval scale for each subtopic using the instrument developers as the judges. If the ordered statements for each subtopic were treated as an item, all items across all subtopics most likely would plot on Thurstone's scale as five distributions at five different positions. Because the subject is instructed to choose one of the five statements for each subtopic, the subject most likely will choose the statement that is closest on the attitude scale to his or her opinion. If the subject's attitude falls between two of the response alternative distributions, he or she might choose half of the statements from one distribution and half from the other distribution. Thus, the average rank score for that person would fall halfway between the two distributions. Within Thurstone's framework, Likert's and Thurstone's scales would be expected to agree (within the limits of the inequalities of the intervals between distributions).
In the case of visual function questionnaires, individual items are descriptions of specific activities. For example, in the VF-14, one item is reading small print such as labels on medicine bottles, a telephone book, or food packages and another item is reading a large print book or large print newspaper. The ordered response alternatives for each item range from no difficulty to unable to do on a five-point scale. Although ratings across items are averaged to produce a score, which is Likert's method, the design of the VF-14 violates Likert's premises. That is, the items themselves are not equivalent; reading small print requires far more visual ability than does reading large print. Thus, the location of the average score for a person on the Likert scale, and the confidence that can be placed in that score as an estimate of person ability, depends entirely on the choice of items that make up the instrument and on which items the subject chooses to answer (one possible response is not applicable, which is treated as missing data). In practice, like the VF-14, most visual function questionnaires are a mixture of Thurstone and Likert methods that produce scores that cannot be interpreted within either framework.
Classical Test Theory
If we were to apply classical psychometric methods to visual function questionnaires, we would analyze the characteristics of responses to the individual items by people with different scores. In educational testing, classical test theory is based on the premise that different items in the test represent different levels of difficulty. This assumption allows one to interpret test scores as representing different levels of mastery. A person with a low level of mastery might answer the easy items correctly, but miss the items that have moderate or high levels of difficulty. A person with a higher level of mastery will also answer the easy items correctly, but should be able to answer some of the more difficult questions too. The person with the highest level of mastery should not only be able to correctly answer all of the easy and moderately difficult questions, but also correctly answer the most difficult questions. If the items can, in fact, be ordered by difficulty, then it is easy to believe that the number of correct answers on a test should order the test takers by mastery of the subject of the test.
Although there are no right or wrong answers to items on visual function questionnaires, we can set up a theoretical structure that parallels that of classical test theory. To use the methods of classical test theory, we need only to develop a concept of true score for visual function questionnaires. Classical test theory does not impose any other conditions. Frederic M. Lord pointed out that classical test theory has no theory; rather, as he says, it is actuarial science. 74 Classical test scales are analogous to temperature scales in the absence of a theory of heat (even an erroneous one). Anything that changes monotonically with temperature can be used as a measurement of temperature. Within the classical framework, it is assumed that the ability of each test taker is monotonically related to a true score on the test. Monotonicity simply means that the two variables would rank the test takers in the same order. The actual test score is regarded as an estimate of the true score with some amount of error.
Classical test theory is tautological. The true score is just the expected value of the observed score (i.e., the average score if the test could be administered to the same person multiple times). The relation of the test score to the ability of the test taker is unique for each test. For example, based simply on the numerical value of the test score, people will seem to have more ability when taking easy tests than when taking hard tests. The difficulty of an item in the test is defined only by the proportion of the test takers who answer the question correctly. Different versions of a test produce different scales, so they cannot be compared unless both tests are calibrated with the same group of people. And missing responses to items (e.g., responding not applicable on visual function questionnaires) change the test scale in unknown ways. 74
In the case of most educational and achievement tests (ignoring tests that use partial credit scoring), the score for each item is dichotomous, i.e., 1 (correct) or 0 (incorrect). In contrast, most visual function questionnaires use polytomous response scales, e.g., an ordinal difficulty rating for each item in the VF-14 that ranges from 0 to 4. Although not explicitly stated or tested by the instrument developers, our expectation is that each item in the VF-14, or any other visual function questionnaire, describes an activity that requires a specific level of visual ability in order for the patient to report that the activity is easy (response rating = 0). Presumably, it requires somewhat less ability for the patient to report that the activity is somewhat difficult (response rating = 1), even less to report that the activity is moderately difficult (response rating = 2), and less still to report that the activity is very difficult (response rating = 3). To anchor the scale, for each item there will be some threshold level of ability that must be exceeded in order for the patient to report that they can perform the activity at all (otherwise the rating response = 4). If the Likert score for the VF-14 is an estimate of a true score, it must be monotonic with the patients' ability; consequently, the frequency with which patients' response ratings for a particular item exceed any given response category must increase monotonically with the instrument score.
Steinberg et al. 47 administered the VF-14 to 550 preoperative cataract patients. Using raw patient response data from that study (which were generously shared with the author by the investigators), I tested the hypothesis that VF-14 Likert scores were estimates of true scores that are monotonic with ability. Fig. 12, top, illustrates item characteristic curves for the four reading items in the VF-14 (i.e., read small print, read ordinary newsprint, read large print, and read signs) using a response criterion of greater than not difficult (i.e., >0, which dichotomizes the response to not difficult vs. at least some difficulty). Average response ratings (VF-14 score divided by 25) were computed for each subject across all items to which the subject responded. For this group of preoperative cataract patients, the average ratings ranged from 0 to 3. Based on their average ratings, subjects were grouped into eleven bins ranging in average rating value from 0.125 to 2.75. The number of patients who responded with a rating greater than zero for the item was counted for each bin. The curves in Fig. 12, top, illustrate the fraction of patients who responded with a rating greater than zero, as a function of the average rating, for each of the four reading items in the VF-14. Fig. 12, bottom, illustrates item characteristic curves for the same four reading items, but using a response criterion of a rating greater than moderately difficult (i.e., >2).
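The binning procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the toy ratings, and the bin edges are hypothetical and are not the VF-14 data or the bins used for Fig. 12. Each person's average rating is taken over the items that person answered, and an unanswered item is simply absent from that person's record.

```python
def item_characteristic_curve(people, item, criterion, bin_edges):
    """Fraction of people in each average-rating bin whose rating on `item`
    exceeds `criterion`. Bin k covers [bin_edges[k], bin_edges[k + 1])."""
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    exceed = [0] * n_bins
    for ratings in people:
        if item not in ratings:
            continue  # "not applicable" is treated as missing data
        # Average rating across all items this person answered
        average = sum(ratings.values()) / len(ratings)
        for k in range(n_bins):
            if bin_edges[k] <= average < bin_edges[k + 1]:
                counts[k] += 1
                exceed[k] += int(ratings[item] > criterion)
                break
    # None marks an empty bin, for which no fraction can be computed
    return [exceed[k] / counts[k] if counts[k] else None for k in range(n_bins)]


# Hypothetical toy data: four people, two items, ratings on a 0 to 4 scale
people = [
    {"small print": 0, "large print": 0},
    {"small print": 2, "large print": 0},
    {"small print": 3, "large print": 1},
    {"small print": 4, "large print": 3},
]
edges = [0.0, 1.5, 4.1]
small_curve = item_characteristic_curve(people, "small print", 0, edges)
large_curve = item_characteristic_curve(people, "large print", 0, edges)
```

In this toy example the curve for the small print item rises at a lower average rating than the curve for the large print item, which is the kind of horizontal displacement between items that the analysis looks for.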
Overall in Fig. 12, the trend is for a monotonic increase in the proportion of subjects whose responses exceed the rating scale criterion, as a function of the average rating, for all four items. One could almost imagine that the different items have the same item characteristic curves, but they occupy different positions on the average rating axis. In classical test theory, the difficulty of the item is estimated from the average score for the item across all test takers. If, in the case of the VF-14, the average rating could be treated as an interval scale, then the item characteristic curves could be translated along the x axis until they superimposed. Fig. 13 illustrates the results of translating the item characteristic curves in Fig. 12, top, by the amount of the difference between the average rating for each item and the average rating for reading small print (filled circles). The curves were translated by -0.2 average rating units for reading ordinary newsprint (X's), by -1.23 units for reading signs (open circles), and by -1.89 units for reading large print (open triangles). If the average rating across items represents an interval measure of ability, then the amount of translation of the different item characteristic curves to superimpose the curves would represent differences in the required ability to perform the different activities with no difficulty. This interpretation of the positions of the item characteristic curves would also apply to the rating scale data of Szlyk et al., 54 plotted in another way in Fig. 9. The horizontal intercepts of the lines fit to the difficulty ratings vs. sigma values in that presentation of the data would correspond to different average difficulties for the different items in their instrument.
Fig. 14 illustrates for two items from the VF-14, reading small print and reading large print, the item characteristic curves for different rating category criteria. For both items, the item characteristic curves for more difficult rating categories are progressively shifted to the right on the average rating axis. Fig. 15 illustrates the item characteristic curves for reading small print shifted on the average rating axis by the difference between the average ratings for each item characteristic curve and the average rating for the rating >0 item characteristic curve. When shifted to equate mean scores, the item characteristic curves nearly superimpose.
The trend for all of the VF-14 item characteristic curves is a monotonic increase in the proportion of patients responding with a particular rating category or higher as the average difficulty rating increases. These results imply that the average difficulty rating across items in the VF-14 produces a score that is likely to be monotonic with person ability. The similarity of the shapes of different item characteristic curves (e.g., Figs. 13 and 15) while occupying different positions on the average difficulty rating axis implies that different items require different levels of ability to elicit a particular difficulty rating. These results imply that patients with different levels of ability, i.e., different average difficulty scores, are ordering instrument items by difficulty in the same way.
Although the results of this item analysis are encouraging, they do not validate average difficulty ratings as a measure of person ability. The interpretation of the average difficulty rating is unique to the full VF-14. Missing data, in effect, change the instrument and produce a new instrument scale. For example, if patients respond not applicable to all of the easy items, they will seem to have low ability. The same patients will seem to have higher ability if they respond not applicable to all of the difficult items. Dropping or adding items to the instrument will change the scale in a way that it can no longer be interpreted. There is no means of comparing scores of different visual function questionnaires, except to administer each questionnaire to the same set of patients. These limitations are not a property of visual function questionnaires; they are a consequence of the application of classical test theory.
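The effect of missing data on an average rating can be shown with simple arithmetic. The following sketch uses hypothetical items and ratings (not VF-14 data) for one person who rates two easy items as not difficult (0) and two hard items as very difficult (3); `None` stands in for a not applicable response.

```python
def average_rating(responses):
    """Mean difficulty rating over answered items; None marks 'not applicable'."""
    answered = [r for r in responses.values() if r is not None]
    return sum(answered) / len(answered)


# Hypothetical ratings for a single person (0 = no difficulty, 4 = unable to do)
full = {"read large print": 0, "read signs": 0,
        "read newsprint": 3, "read small print": 3}
skip_easy = {**full, "read large print": None, "read signs": None}
skip_hard = {**full, "read newsprint": None, "read small print": None}

print(average_rating(full))       # all four items answered
print(average_rating(skip_easy))  # easy items marked not applicable
print(average_rating(skip_hard))  # hard items marked not applicable
```

The same person receives three different average ratings depending only on which items are skipped, even though nothing about the person's ability has changed.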
Item Response Theory
Because of severe limitations on the interpretation and generalization of instrument scales, educational and psychological testing research has largely abandoned classical test theory in favor of more modern item response theory (IRT). 75 Where classical test theory begins with a test score, IRT begins with an explicit mathematical model that describes the relationship between responses to the instrument's items and the trait of the person to be measured. In effect, IRT returns to basic Thurstonian principles of psychological measurement, which were lost in the tautologies of classical test theory.
In the case of vision disability, the trait of visually impaired people that we wish to measure is their ability to perform everyday activities. Ability cannot be observed directly; it is a latent variable that must be inferred from patients' behavior or reports about problems they experience with daily activities. In IRT, the person trait of interest, such as mastery of a subject, is often represented with the symbol θ (theta). We will use the symbol α (alpha) instead as a mnemonic to explicitly represent the latent ability variable that we wish to measure.
Like Thurstone's theory, IRT assumes that each item represents some level of the person trait. In the case of achievement testing, the items vary in difficulty, which means that the items vary in the level of mastery required to answer the question correctly. Usually, the symbol bi is used in IRT to represent the level of the person trait required by item i. In the case of vision disability assessment instruments, the items describe specific daily activities. Like the conclusions from the analyses in Figs. 12 and 13, we would argue that each activity requires the person to have some criterion level of ability in order for the person to report that the activity can be performed with ease. We will use the symbol ρi (rho) as a mnemonic to represent the required ability (i.e., the required value of α) to perform the activity described in item i with some criterion level of ease.
The practical goal of IRT is to model item characteristic curves, such as those shown in Fig. 12, top. The underlying assumption of the model is that the true score is monotonic with the person variable α, and that the item characteristic curve will be transformed to an item response probability function, usually an ogive, when the test score is transformed to α. 74 Different IRT models use different equations for the item response function ogives. 76 Most models use a variation of the logistic. Birnbaum's logistic has four parameters: (1) item difficulty, ρi (controls the position of the item characteristic curve on the abscissa), (2) item discriminability, di (controls the slope of the item characteristic curve), (3) chance performance level for the item, ci (controls the level of the lower asymptote of the item characteristic curve), and (4) the careless error rate for the item, ui (controls the level of the upper asymptote of the item characteristic curve). The equation for Birnbaum's logistic is

$$p(\alpha_n) = c_i + (u_i - c_i)\,\frac{e^{d_i(\alpha_n - \rho_i)}}{1 + e^{d_i(\alpha_n - \rho_i)}} \tag{9}$$

which represents the probability that a person with ability αn answers item i correctly. 77 Fig. 16 illustrates ogives generated by Equation 9 for different values of ρi (Fig. 16 a), di (Fig. 16 b), ci (Fig. 16 c), and ui (Fig. 16 d). In the case of vision disability assessments, there are no correct or incorrect answers, therefore ci = 0, ui = 1, and Equation 9 simplifies to

$$p(\alpha_n) = \frac{e^{d_i(\alpha_n - \rho_i)}}{1 + e^{d_i(\alpha_n - \rho_i)}} \tag{10}$$

which has two parameters, di and ρi.
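To make the roles of the four parameters concrete, here is a minimal Python sketch of a four-parameter logistic. The functional form is an assumption (the standard four-parameter logistic, with ui treated as the upper asymptote so that ci = 0 and ui = 1 give the two-parameter case), and the function name and default values are hypothetical.

```python
import math


def birnbaum_logistic(alpha, rho, d=1.0, c=0.0, u=1.0):
    """Assumed four-parameter logistic ogive.

    alpha: person ability; rho: ability required by the item (position);
    d: item discriminability (slope); c: lower asymptote; u: upper asymptote.
    With c = 0 and u = 1 this reduces to the two-parameter logistic.
    """
    return c + (u - c) / (1.0 + math.exp(-d * (alpha - rho)))


# Changing rho shifts the curve along the ability axis without changing its shape
print(birnbaum_logistic(alpha=0.0, rho=0.0))   # midpoint of the ogive
print(birnbaum_logistic(alpha=0.0, rho=1.0))   # same person, harder item
# Changing d steepens or flattens the curve about rho
print(birnbaum_logistic(alpha=0.5, rho=0.0, d=3.0))
```

When α equals ρ the function returns the midpoint between the two asymptotes, whatever the slope.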
The Two-Parameter Item-Response Model Applied to Polytomous Rating Scales
Item response models require item scores to be dichotomous, i.e., only two response alternatives, such as correct and incorrect. Most visual function questionnaires employ polytomous response categories, e.g., difficulty ratings on a scale of 0 to 4. Therefore, to apply IRT to vision disability, we must modify Equation 10 to incorporate response criteria for the different rating categories. We will use the symbol τx (tau) as a mnemonic for the threshold value of α - ρ for response category x. Thus, Equation 10 would be rewritten as the probability of responding with category x, or greater, given person ability αn and required ability ρi for item i

$$p(x \mid \alpha_n, \rho_i) = \frac{e^{d_i(\alpha_n - \rho_i - \tau_x)}}{1 + e^{d_i(\alpha_n - \rho_i - \tau_x)}} \tag{11}$$

where τx, like ρi, simply manipulates the position of the ogive on the abscissa (see Fig. 16 a; this is analogous to the interpretation of the item characteristic curves in Figs. 14, top, and 15).
At this stage of development, the logistic in Equation 11 has no theoretical significance; it is simply an equation that generates an ogive with a slope that can be manipulated with changes in di and a position that can be manipulated with changes in ρi and/or τx. We could easily replace Equation 11 with a cumulative normal function 78

$$p(x \mid \alpha_n, \rho_i) = \frac{1}{\sigma_i\sqrt{2\pi}} \int_{-\infty}^{\alpha_n - \rho_i - \tau_x} e^{-t^2/(2\sigma_i^2)}\,dt \tag{12}$$

which is more difficult than the logistic to manipulate algebraically, but gives us greater insight into the IRT parameters. In terms of density functions, as shown in Fig. 17, there is no practical difference between the derivative of the logistic in Equation 11 and the normal distribution that is used to produce Equation 12. The item discrimination parameter, di, in Equation 11 is inversely proportional to the item-dependent standard deviation, σi in Equation 12.
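The closeness of the logistic and cumulative normal ogives, and the inverse relation between di and σi, can be checked numerically. The sketch below assumes the widely used scaling constant of about 1.702 between the two functions; that constant is not taken from this article, and the variable names are hypothetical.

```python
import math
from statistics import NormalDist


def logistic_ogive(x, d):
    """Logistic ogive with slope parameter d, centered at zero."""
    return 1.0 / (1.0 + math.exp(-d * x))


sigma = 1.0          # standard deviation of the normal ogive
d = 1.702 / sigma    # assumed scaling: d is inversely proportional to sigma
normal = NormalDist(mu=0.0, sigma=sigma)

# Largest absolute difference between the two ogives over a fine grid
grid = [i / 100 for i in range(-500, 501)]
max_gap = max(abs(logistic_ogive(x, d) - normal.cdf(x)) for x in grid)
print(max_gap)
```

With this scaling the two curves stay within about one percentage point of each other everywhere on the grid, which is why the choice between them is a matter of algebraic convenience.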
Similar to Likert's assumption about the distribution of attitudes in the population, we will assume that ability, α, is distributed in the population of visually impaired patients. For convenience, we will assume that the distribution of α is normal (however, as will become clear later, the actual shape of the α distribution is irrelevant). Following the assumptions of IRT, we assume that each item describes an activity that requires a criterion value of α to be performed with ease, i.e., ρi. Each person can make a judgment of how difficult it is to perform a specific activity. Theoretically, the person is judging the difference between his or her personal ability and the ability that is required to perform the activity, i.e., αn - ρi. If the person's ability far exceeds the ability required by the activity, that person most likely will report that the activity is easy to perform. On the other hand, if the person's ability is less than that required to perform the activity, then the person most likely will report that the activity is impossible to do. The difference between the person's ability and the ability required to perform the activity is called functional reserve.79 The more functional reserve a person has for a given activity, the easier the activity will be for that person. As illustrated by the extreme examples, each person can assign a level of difficulty to each activity based on his or her functional reserve for that activity. 80
The criterion level of ability that is required for each activity, ρi, most likely will depend on how the activity is performed (e.g., working distance) and on the circumstances surrounding the performance of the activity (e.g., illumination). Also, depending on wording, different items might be subject to different interpretations. Consequently, as Thurstone argued, we expect that irrespective of the person's ability, the inherent difficulty of each activity will vary across people. In other words, we must assume that there is a distribution of ρ for each activity. As illustrated in Fig. 18, each item would be represented by a distribution of ρ on the α axis with a mean ρi and a standard deviation σi.
The distribution of α in the population of visually impaired patients will have a mean μα and a variance σα². The distribution of functional reserve for this population, i.e., α - ρ, for item i will have a mean, μΔ(i), which is the difference between the two means, μα - ρi. The variance of the functional reserve distribution, σΔ²(i), is the same as Thurstone's equation, i.e., the sum of the component variances minus twice the covariance, or

$$\sigma_{\Delta(i)}^2 = \sigma_\alpha^2 + \sigma_i^2 - 2 r_i \sigma_\alpha \sigma_i \tag{13}$$

where ri is the correlation between α and ρ for item i.
Similar to Likert's reasoning, which is illustrated in Fig. 7, each person will have a threshold value of functional reserve, tx, that must be exceeded to respond with difficulty rating x. Because of individual differences, we expect tx to be distributed across people and, probably because of inconsistencies in judgment, distributed within each person individually. However, on average, everyone should order thresholds for the different response categories in the same way. The mean value of tx for the population is τx, and the variance of tx is σx². Both within-person and between-person variability in tx will contribute to errors in the estimate of each person's ability. However, as we will see, variability in tx has little effect on the estimate of the required ability for each item. The variability in tx appears like within-person variability in α. 81 Therefore, we will assume that σx is included in σα in our analysis. (If tx for any given person-item encounter is τx + εnx, where εnx incorporates both within- and between-person variations in tx and, similarly, tx+1 is τx+1 + εnx+1, then the decision rule tx ≤ αn - ρi < tx+1 for person n is equivalent to τx ≤ αn + δn - ρi < τx+1, where δn = -(εnx + εnx+1), which assigns all variability to αn.)
Returning to Fig. 7 and following Likert's steps, we can proceed to estimate z-scores (i.e., sigma values) that correspond to each response category for each item individually. However, in this case, the variable on the abscissa is functional reserve, i.e., α - ρ, and the functional reserve distribution has a mean μΔ(i) and a standard deviation σΔ(i). The z-score corresponding to response category x for item i is

$$z(x \mid i) = \frac{\tau_x - \mu_{\Delta(i)}}{\sigma_{\Delta(i)}} \tag{14}$$

where the average response threshold, τx, is independent of the item parameters. There would be an equation like Equation 14 for each response category (x) and each item (i). As Likert did, the values of z (x|i) can be estimated from the proportion of people who responded with category x to item i.
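If we consider two items, i and j, we can estimate z-scores for each response category for each item. Equation 14 can be rewritten as

$$\tau_x = \sigma_{\Delta(i)}\, z(x \mid i) + \mu_{\Delta(i)} \tag{15}$$

and

$$\tau_x = \sigma_{\Delta(j)}\, z(x \mid j) + \mu_{\Delta(j)}$$

which, when equated, leads to the linear expression

$$z(x \mid i) = \frac{\sigma_{\Delta(j)}}{\sigma_{\Delta(i)}}\, z(x \mid j) + \frac{\mu_{\Delta(j)} - \mu_{\Delta(i)}}{\sigma_{\Delta(i)}} \tag{16}$$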
Thus, when comparing estimated z-scores between items for different response categories, the slope of the best-fit line corresponds to the ratio of the standard deviations, and the intercept is equivalent to the difference between the means weighted by the standard deviation of one of the items.
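The linear relation between the z-scores of two items can be verified with a small numerical sketch. It assumes that the z-score for category x on an item is the mean threshold τx minus the mean functional reserve for that item, divided by the standard deviation of functional reserve. All names and parameter values below are hypothetical, and the z-scores are computed exactly rather than estimated from response proportions.

```python
def z_scores(taus, mu_delta, sigma_delta):
    """Assumed z-score for each category threshold: (tau - mean) / sd."""
    return [(tau - mu_delta) / sigma_delta for tau in taus]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


taus = [-1.5, -0.5, 0.5, 1.5]   # hypothetical mean category thresholds
mu_i, sd_i = 0.8, 1.2           # functional reserve mean and sd for item i
mu_j, sd_j = -0.4, 1.5          # functional reserve mean and sd for item j

# Regress the z-scores for item i on the z-scores for item j
slope, intercept = fit_line(z_scores(taus, mu_j, sd_j), z_scores(taus, mu_i, sd_i))
print(slope, sd_j / sd_i)               # slope equals the ratio of the sds
print(intercept, (mu_j - mu_i) / sd_i)  # intercept equals the weighted mean difference
```

Because the thresholds are shared across items, the points fall exactly on a line in this noise-free case; with z-scores estimated from real response proportions the fit would be approximate.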
Equation 16 can be rewritten with Equation 13 substituted for σΔ(i) and σΔ(j), μα - ρi substituted for μΔ(i), and μα - ρj substituted for μΔ(j). The resulting expression is

$$z(x \mid i) = \sqrt{\frac{\sigma_\alpha^2 + \sigma_j^2}{\sigma_\alpha^2 + \sigma_i^2}}\; z(x \mid j) + \frac{\rho_i - \rho_j}{\sqrt{\sigma_\alpha^2 + \sigma_i^2}} \tag{17}$$

if ri and rj in Equation 13 are zero (setting the correlations to zero is for presentation convenience; nonzero correlations will not alter the conclusions or change the method of calculating model parameters). Equation 17 shows that the slope of the best-fit line will be unity unless σj² ≠ σi² (and/or ri ≠ rj) and the difference between them is large relative to σα². Thus, differences in slopes for different item pairs are due entirely to differences in variability (and/or covariability) in item difficulty distributions. The intercept is proportional to the difference between the mean item difficulties.
Equation 17 also illustrates that the estimation of item difficulties (from the intercept) is independent of the distribution of α. The variability of α in the population, σα², will determine the scale of the item parameters (i.e., all parameters are normalized to σα), but not the relative difficulty intervals. The weighting term on the difference between item parameters can be factored out, which also contributes to the scale of item difficulties, but does not affect the relative intervals.
By comparing every item to every other item and performing a linear regression on the estimated z-scores, one obtains a matrix of slopes and a matrix of intercepts. If the row index is i and the column index is j, then the cells in the two matrices above the diagonal contain slopes and intercepts for z (x|i) regressed on z (x|j), and the cells below the diagonal contain slopes and intercepts for z (x|j) regressed on z (x|i). By averaging the slope with the reciprocal of the slope and averaging the intercept with the negative of the intercept divided by its slope in the corresponding cell of the slope matrix, the two matrices are reduced to triangular matrices with slopes and intercepts for the average component line through the z-scores. For example, returning to the tabularized data of Szlyk and colleagues, 54 we can estimate a z-score for every response category for every item. Fig. 19 illustrates representative z-scores for four estimates of τx for two different item pairs that have different slopes. Also, plotted in Fig. 19 along with the z-score estimates is the best-fit component line for each item pair.
The units for item difficulties can be set arbitrarily to the standard deviation for one of the items, e.g., σΔ(1) for the first item, Q1. In this example, the average slopes in column 1 of the matrix would be the estimates of standard deviations for the items in the rows of the matrix. By dividing the average slope in each cell by the slope in the first column of that row, we obtain an estimate of the standard deviation for each column, in Q1 standard deviation units, for each item listed in the row. Theoretically, the item standard deviation estimates are described by Equation 13.
The final step is to multiply the average intercept estimates in the matrix by the corresponding average item standard deviations. This step removes the standard deviation term from the intercept, leaving the differences between the estimates of item difficulty (i.e., ρi - ρj). By arbitrarily setting one of the item difficulties to zero (e.g., for Q1), we can estimate all of the other item difficulties for each cell in the matrix. Then, as we did for the item standard deviation estimates, we can average across columns to obtain an estimate of difficulty for each item.
This analysis can be applied to the Steinberg et al. 47 VF-14 data used earlier (e.g., Figs. 12 through 15). Also, using Equation 15 with the average item difficulty and standard deviation estimates, we can estimate τx for each difficulty response category by averaging Equation 15 across items using the estimated values of z (x|i) for each item.
The parameter estimates can be used with either Equation 11 or Equation 12 to estimate person abilities, αn, from the item responses. First, we must compute the probability that person n will use response category x for item i. This response probability, π(xni), is the probability that τx < αn-ρi < τx-1, which is simply the difference between the probability that αn - ρi > τx and the probability that αn - ρi > τx-1 (note that for difficulty ratings, τx = -∞ for the largest value of x). Fig. 20 demonstrates for a single patient how π(xni) varies as a function of α for the different items on the VF-14 (note that this patient responded not applicable to two of the items, so those data are missing). The value of α at the maximum of each curve represents an estimate of αn for that patient based on the patient's response to that item. The ogives in Fig. 20, which have maxima at plus or minus infinity, represent items to which the patient responded with an extreme rating category (i.e., 1 or 5).
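The category response probability described above can be sketched as a difference of two cumulative probabilities. The sketch assumes the logistic form for the probability that αn - ρi exceeds a threshold, rating categories numbered from 0 (no difficulty) upward, thresholds that decrease as the rating increases, and an implicit threshold of -∞ for the highest category. The function names and threshold values are hypothetical.

```python
import math


def p_exceeds(alpha, rho, tau, d=1.0):
    """Probability that functional reserve (alpha - rho) exceeds threshold tau."""
    return 1.0 / (1.0 + math.exp(-d * (alpha - rho - tau)))


def category_probs(alpha, rho, taus, d=1.0):
    """Probability of each rating category 0..len(taus).

    taus lists the finite thresholds in decreasing order, one per category
    except the last, whose threshold is taken to be minus infinity.
    """
    cumulative = [p_exceeds(alpha, rho, tau, d) for tau in taus] + [1.0]
    probs = []
    previous = 0.0  # no category lies above category 0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


taus = [1.5, 0.5, -0.5, -1.5]   # hypothetical thresholds for ratings 0 to 3
able = category_probs(alpha=3.0, rho=0.0, taus=taus)
impaired = category_probs(alpha=-3.0, rho=0.0, taus=taus)
print(able)      # most of the probability falls on rating 0
print(impaired)  # most of the probability falls on the highest rating
```

The probabilities for the two extreme categories are monotonic in α, which is why the corresponding curves have their maxima at plus or minus infinity.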
Next, we need to estimate the probability of obtaining the observed pattern of responses to all of the items by person n. The assumption is made within IRT that a person's response to an item is locally independent. That is, after the main variable (αn in our case) is accounted for, the probability of obtaining a particular response to an item is independent of the response probabilities for other items. Mathematically, this means that the probability of responding x to item i and y to item j is the product of the individual probabilities of responding x to item i and responding y to item j. Thus, using Equation 11 as an example, the probability of obtaining response pattern xn (bold face denotes an array of values), given a particular value of α for person n, is

$$\pi(\mathbf{x}_n \mid \alpha_n) = \prod_i \left[ \frac{e^{d_i(\alpha_n - \rho_i - \tau_{x_{ni}})}}{1 + e^{d_i(\alpha_n - \rho_i - \tau_{x_{ni}})}} - \frac{e^{d_i(\alpha_n - \rho_i - \tau_{x_{ni}-1})}}{1 + e^{d_i(\alpha_n - \rho_i - \tau_{x_{ni}-1})}} \right] \tag{18}$$

where the xni and xni -1 subscripts on τ refer to the actual response of patient n to item i.
From Equation 18 we can estimate π(xn|αn) for all possible values of αn. This is the same as multiplying all of the curves in Fig. 20 together as a function of αn (which, for the purpose of locating the maximum, is equivalent to taking the geometric mean of the probability curves for each item). The value of αn that gives us the maximum estimate of π(xn|αn) is the most likely value of α for person n. The likelihood ratio is π(xn|αn)/[1 - π(xn|αn)], and with Equation 18, the maximum of the likelihood ratio is the same as the maximum of Equation 18. Fig. 21 illustrates on a logarithmic scale the likelihood estimates from Equation 18, as a function of α, for the patient whose data are illustrated in Fig. 20. The maximum likelihood occurs at αn = -1.4.
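The search for the most likely value of α can be sketched as a grid search over the product of the category probabilities, computed as a sum of logarithms. This is an illustration only: it assumes the logistic form for the cumulative probabilities, ratings numbered from 0, thresholds that decrease with the rating, and item parameters that are already known. The function names, parameter values, and grid are hypothetical, and `None` marks a not applicable response.

```python
import math


def p_exceeds(alpha, rho, tau, d=1.0):
    """Probability that functional reserve (alpha - rho) exceeds threshold tau."""
    return 1.0 / (1.0 + math.exp(-d * (alpha - rho - tau)))


def category_prob(x, alpha, rho, taus, d=1.0):
    """Probability of rating x; the highest rating has threshold minus infinity."""
    p_x = 1.0 if x == len(taus) else p_exceeds(alpha, rho, taus[x], d)
    p_prev = 0.0 if x == 0 else p_exceeds(alpha, rho, taus[x - 1], d)
    return p_x - p_prev


def log_likelihood(alpha, responses, rhos, taus, ds):
    """Log of the product of category probabilities over answered items."""
    total = 0.0
    for x, rho, d in zip(responses, rhos, ds):
        if x is None:
            continue  # missing responses drop out of the product
        total += math.log(category_prob(x, alpha, rho, taus, d))
    return total


def estimate_alpha(responses, rhos, taus, ds, grid):
    """Grid value of alpha that maximizes the likelihood of the response pattern."""
    return max(grid, key=lambda a: log_likelihood(a, responses, rhos, taus, ds))


taus = [1.5, 0.5, -0.5, -1.5]          # hypothetical category thresholds
rhos = [-1.0, 0.0, 1.0]                # hypothetical item difficulties
ds = [1.0, 1.0, 1.0]                   # hypothetical item discriminabilities
grid = [i / 100 for i in range(-400, 401)]

print(estimate_alpha([2, 2, 2], rhos, taus, ds, grid))
print(estimate_alpha([2, None, 2], rhos, taus, ds, grid))
```

A grid search is used here for transparency; a response pattern made up entirely of an extreme rating has no interior maximum, so the estimate falls on the edge of the grid.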
Missing data do not affect the interpretation of the estimate of αn, the scale of the variable, or its expected value. Rather, missing data affect the precision of the estimate (analogous to fewer samples in the geometric mean). The units of α are the same as those of ρ, i.e., the standard deviation of α - ρi, σΔ(i), for the item that was arbitrarily chosen to normalize the matrix of slopes.
Fig. 22 illustrates the maximum likelihood estimates of αn for the 550 preoperative cataract patients who responded to the VF-14. These estimates are plotted as a function of the corresponding average response rating to the items on the VF-14. This scatter plot confirms that a VF-14 score based on the average response rating is approximately monotonic with the estimated latent ability variable, but the relation between the two is nonlinear. The solid curve fit to the data is an estimate of this nonlinear relation and can be used to back transform α to an estimate of the true score for the VF-14. The shape of this nonlinear function will depend on the choices of items to include in the VF-14. Missing item responses will contribute to the scatter in the data about the regression.
To complete the discussion of the two-parameter IRT model fit to the VF-14 data, Fig. 23, top, illustrates an example of item characteristic curves that are well fit by the model (for the question about difficulty going up or down steps and stairs). Fig. 23, bottom, illustrates an example of item characteristic curves that are poorly fit by the model (for the question about difficulty engaging in sports and outdoor activities). The goodness of fit of the IRT model to the item characteristic curve is influenced by the sample size, which was relatively small for each average difficulty rating bin (particularly for the higher ratings), and by missing data.
Equation 17 demonstrates that estimates of ρi and τx are independent of α (except for the dependence of the scale on ςα). Nevertheless, IRT theorists typically would not estimate model parameters in the sequential manner just described. Rather, they would simultaneously estimate αn, ρi, di (or ςi), and τx using a maximum likelihood procedure based on Equation 18.
IRT has attracted a large number of critics who dismiss it as simply a data-fitting exercise with a large statistical model. 82,83 A number of studies using synthesized data, ranging from random responses to data synthesized by other models, have demonstrated that IRT happily estimates believable model parameters irrespective of the structure of the data (i.e., garbage in, roses out). 84 IRT proponents respond by demonstrating their impressive fits of item-characteristic curves for large databases, 85 evidence of external validation, 86 and a counter-battery of their own simulation studies. 87
The most serious criticisms of IRT are those that challenge its status as a measurement model. 88 The IRT model was developed around the premises that α and ρ are independent traits of persons and items. That is, the value of ρi does not vary with the various abilities of the people responding to the item and the value of αn does not change with changes in the item that the person is asked to rate. The IRT model describes how α and ρ are related to the manifest variable p(xni), which empirically is understood to be the relative frequency that person n gives response x to item i. However, to work backwards and obtain valid estimates of α and ρ from observations of p(xni), the relationships among the three variables must conform to what measurement theorists call noninteractive conjoint structure. 89
If α and ρ are measurable quantities, then they (or a monotonic transformation of them) must exhibit an additive relationship. 57 Consider the case of three people with abilities αm, αn, and αp, respectively, and three items with required abilities ρi, ρj, and ρk. If αm - ρi ≥ αn - ρj and αn - ρk ≥ αp - ρi, then from additivity, the relation αm - ρk ≥ αp - ρj must be true. This can be seen by adding ρj to both sides of the first inequality to obtain αm - ρi + ρj ≥ αn and adding ρk to both sides of the second inequality to obtain αn ≥ αp - ρi + ρk. These two expressions lead to the conclusion that αm - ρi + ρj ≥ αp - ρi + ρk, or after canceling ρi and rearranging terms, αm - ρk ≥ αp - ρj.
A noninteractive conjoint structure says that the manifest variable, i.e., the variable that can be publicly observed, must exhibit the same ordered relationship as the added latent variables. This required condition is called double cancellation. In the case of the IRT model, the corresponding relations of the manifest variable must be as follows: if p(xmi) ≥ p(xnj) and p(xnk) ≥ p(xpi), then p(xmk) ≥ p(xpj). For this to be true of Equation 11, the following relations must also be true: if di(αm - ρi) ≥ dj(αn - ρj) and dk(αn - ρk) ≥ di(αp - ρi), then dk(αm - ρk) ≥ dj(αp - ρj). These relations are true only if dj ≥ di, di ≥ dk, and (necessarily) dj ≥ dk. But di, dj, and dk are free parameters and can take on any values in the model. Therefore, in the general case, the two-parameter IRT model does not satisfy the condition of double cancellation.
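The double cancellation condition can be checked numerically. The sketch below compares logits d(α - ρ) rather than probabilities, which is legitimate because the logistic function is monotonic. The function names and the particular parameter values are hypothetical; the values were chosen only to exhibit one violation when the discrimination parameters differ between items.

```python
from itertools import product

def logit(alpha, rho, d=1.0):
    """Logit of the response probability, d * (alpha - rho)."""
    return d * (alpha - rho)

def double_cancellation_holds(alphas, rhos, ds):
    """Check one instance of double cancellation for persons (m, n, p)
    and items (i, j, k). Returns True when the premises fail (the
    condition is then vacuous) or when the conclusion follows."""
    am, an, ap = alphas
    ri, rj, rk = rhos
    di, dj, dk = ds
    premises = (logit(am, ri, di) >= logit(an, rj, dj)
                and logit(an, rk, dk) >= logit(ap, ri, di))
    if not premises:
        return True
    return logit(am, rk, dk) >= logit(ap, rj, dj)

# Equal discriminations (the Rasch case): exhaustive check on an integer grid.
grid = range(-2, 3)
rasch_ok = all(
    double_cancellation_holds(a, r, (1, 1, 1))
    for a in product(grid, repeat=3)
    for r in product(grid, repeat=3)
)

# Unequal discriminations: one hypothetical set of values that violates
# the condition (premises 2.5 >= 2.0 and 2.0 >= 1.9 hold, but 3.5 < 3.8).
two_param_ok = double_cancellation_holds(
    alphas=(2.5, 1.0, 1.9), rhos=(0.0, 0.0, -1.0), ds=(1.0, 2.0, 1.0)
)
```

A single violation is enough to show that the condition is not guaranteed when d is free; the grid check for equal discriminations is only an illustration of the algebraic argument, not a proof.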
A second condition that must be satisfied by noninteractive conjoint structures is called solvability. 57 Solvability essentially requires that the latent and manifest variables are continuous. That aspect of the requirement is easily met by the IRT model. However, solvability says that if αm - ρi = αn - ρj, then p(xmi) = p(xnj), which in terms of Equation 11 means that di(αm - ρi) = dj(αn - ρj). This condition is true if and only if dj = di. Therefore, the general two-parameter IRT model violates the condition of solvability.
The third condition that noninteractive conjoint structures must satisfy is the Archimedean axiom. 57 The Archimedean axiom says that no value of α - ρ is infinitely larger than any other value of α - ρ (or, more formally, if αm - ρi > αn - ρj, there is a number, N, for which N(αn - ρj) > αm - ρi). In the case of the logistic model, this condition translates to the requirement that 0 < p(xni) < 1. The two-parameter IRT model satisfies the Archimedean condition as long as the discrimination parameter is nonzero and finite, but it would require an infinite number of observations to prove that this requirement is met empirically.
Finally, to be a noninteractive conjoint structure, the effects of α and ρ on p(xni) must be independent. In terms of the ordered structure, independence means that if ρi > ρj > ρk > … > ρz, then it must be true that α - ρi < α - ρj < α - ρk < … < α - ρz for any value of α. Furthermore, the sum of person measure values across persons must obey the relationships (Σnαn) - ρi < (Σnαn) - ρj < (Σnαn) - ρk < … < (Σnαn) - ρz. Similarly, if αm > αn > αp > … > αz then it must be true that αm - ρ > αn - ρ > αp - ρ > … > αz - ρ for any value of ρ. The same is true of the sum of the item measures, i.e., αm - (Σiρi) > αn - (Σiρi) > αp - (Σiρi) > … > αz - (Σiρi). If persons are ordered by α and items are ordered by ρ, then the same relations must hold for the manifest variable, i.e., p(xni) < p(xnj) < p(xnk) < … < p(xnz) for every person and p(xmi) > p(xni) > p(xpi) > … > p(xzi) for every item. In terms of Equation 11, independence requires di(α - ρi) < dj(α - ρj) < dk(α - ρk) < … < dz(α - ρz) for all values of α, and di(αm - ρi) < di(αn - ρi) < di(αp - ρi) < … < di(αz - ρi) for all values of ρ. This latter relation for different values of α is easily satisfied, but the former relation for different values of ρ is not because the discrimination parameter, d, is unrestricted. Thus, the general two-parameter IRT model fails to satisfy the independence requirement.
The two-parameter IRT model fails to comply with the requirements of noninteractive conjoint structures, which must be satisfied by all measurement systems. The model fails because the discrimination parameter, di, causes the item trait to interact with the person trait in the determination of the manifest variable p(xni). Consequently, the latent traits do not produce independent effects on p(xni), which means that the ordered relationships of p(xni) are not governed by the additive structure of α and ρ. From the perspective of measurement theory, the two-parameter IRT model is not a valid measurement model.
The discrimination parameter, di or ςi, makes the two-parameter IRT model a descriptive statistical model for the population, rather than a measurement model of the individual patient responses. Variations in the item difficulty parameter, ρi, reflect differences between people in circumstances surrounding the performance of the activity or interpretation of the item, not variation within a person on the value of ρ. For a given person, the average value of ρi might not agree with the personal value of ρi, but that disagreement would translate to an error in the estimate of α for the person. It would not manifest itself as an item-dependent change in the slope of the psychometric function describing that person's probability of a particular response. (As discussed for variability in tx, person-dependent variations in ρi, εni, would be incorporated in the person measure variability, i.e., δni = -[εnx + εnx+1 + εni].) The two-parameter IRT model has persisted because of the tradition from classical test theory of evaluating item performance on the basis of the slope of the item characteristic curve. It also has persisted because the two-parameter model does a better job of fitting item characteristic curves than do models that do not include a discrimination parameter. 75
Rasch Analysis
In the late 1950s, the Danish mathematician Georg Rasch developed a family of psychometric models that are based on many of the same theoretical principles as those used by IRT, but constrained by the same axioms articulated in modern measurement theory. 90 Even though Rasch developed his theory independently of IRT, Rasch's models are often classified as one-parameter IRT models. 75 Many critics hastily dismiss Rasch models along with IRT models because of their mathematical similarities. 84 But Rasch models differ from IRT in important fundamental ways. In particular, Rasch models meet the conditions of noninteractive conjoint structures so, unlike IRT models, they are valid measurement models. 88 The Rasch logistic model for dichotomous responses is the same as Equation 11 with di set to one for all items. The log odds ratio of the Rasch logistic model,

$$\ln\left[\frac{p(x_{ni})}{1 - p(x_{ni})}\right] = \alpha_n - \rho_i, \tag{19}$$

is a monotonic transformation of the manifest variable, p(xni), that is an additive function of the latent variables. Because monotonic transformations maintain ordinal relations, the Rasch logistic model satisfies the double cancellation, solvability, and Archimedean conditions of noninteractive conjoint structures.
Like IRT, αn in Rasch's model is a trait of the individual and ρi is defined by the population. However, unlike IRT, for Rasch's model, the ogive for the probability of a response represents the behavior of the individual, not the behavior distribution in the population. In terms of measuring vision disability, the probability of a patient responding with a particular difficulty rating depends only on functional reserve. This assumption is the same as that used by psychophysical models of vision thresholds. 81 With the absence of the confounding discrimination parameter, the Rasch model realizes mathematically what IRT tries to say in words. But Rasch models cannot be used to fit item characteristic curves. Item characteristic curves require independent estimates of ςi.
In the case of polytomous response categories, the a priori probability that person n will respond with rating x to item i is not simply the difference between two cumulative probabilities, as used in Equation 18. That difference represents the a posteriori probability that a person with ability α, who was drawn at random from the population, responded with rating x to item i. For the individual, rating scale responses are mutually exclusive. So, if we are given the information that person n responded to item i with rating category x or x - 1, then the posterior probability of that person responding with any other rating category is zero. This condition underlies a version of the Rasch model for polytomous response categories that was developed by Wright and Masters, 91 building on the rating scale model of Andrich 92 and the partial credit model of Masters, 93

$$\frac{\pi_{nix}}{\pi_{ni(x-1)} + \pi_{nix}} = \frac{\exp(\alpha_n - \rho_i - \tau_x)}{1 + \exp(\alpha_n - \rho_i - \tau_x)}, \tag{20}$$

which is the conditional probability that person n responded with rating x to item i, πnix, given that he/she responded with rating x or x - 1. Equation 20 reduces to the dichotomous Rasch model when πnix + πnix - 1 = 1 (which is true for dichotomous responses).
To obtain an expression for πnix, Masters 93 used odds ratios. The odds ratio for Equation 20 is

$$\frac{\pi_{nix}}{\pi_{ni(x-1)}} = \exp(\alpha_n - \rho_i - \tau_x), \tag{21}$$

and the product of the odds ratios for successive rating response categories is

$$\frac{\pi_{nix}}{\pi_{ni0}} = \exp\left[\sum_{j=1}^{x} (\alpha_n - \rho_i - \tau_j)\right]. \tag{22}$$

From the condition that the rating response categories are mutually exclusive,

$$\sum_{k=0}^{m} \pi_{nik} = 1, \tag{23}$$

and substituting Equation 22 for πnix,

$$\pi_{ni0} \sum_{k=0}^{m} \exp\left[\sum_{j=1}^{k} (\alpha_n - \rho_i - \tau_j)\right] = 1, \tag{24}$$

where m is the highest rating category and the empty sum for k = 0 is taken to be zero.
Combining Equations 22 and 24, we obtain the Wright and Masters model for polytomous rating categories,

$$\pi_{nix} = \frac{\exp\left[\sum_{j=1}^{x} (\alpha_n - \rho_i - \tau_j)\right]}{\sum_{k=0}^{m} \exp\left[\sum_{j=1}^{k} (\alpha_n - \rho_i - \tau_j)\right]}, \tag{25}$$

which is the probability that person n will respond with rating category x to item i. A maximum likelihood procedure, similar to Equation 18, is used to estimate αn, ρi, and τx from Equation 25 for a set of patient item responses. 91
As can be seen from Equation 21, the log odds ratio reduces the manifest variable, πnix, to an additive function of the latent variables, α and ρ. Thus, the polytomous response model of Wright and Masters also satisfies the conditions of noninteractive conjoint structures.
Despite profound theoretical differences, the two-parameter IRT model and the Rasch model generate similar estimates of item parameters. Fig. 24 compares estimates of ρ from the two models for the VF-14 items. Fig. 25 similarly compares the estimates of α from the two models. In both cases, there are disagreements between individual model estimates, sometimes quite large, but there is an overall trend of agreement (solid lines in the figures). Mathematically, the models are so similar in structure that it is not surprising that they agree as much as they do. However, part of the agreement is a fortuitous choice of a data set that minimized the differences between models (e.g., there was relatively little variation between items in the discrimination parameter of the IRT model, effectively making it a noisy Rasch model).
The aim of IRT is to find model parameters that produce the best fit of the model to the data. In contrast, Rasch analysis uses Rasch models to test the construct validity of the attempted measurement. The latent variables α and ρ are theoretical constructs. If α and ρ exist and are measurable with a visual function questionnaire, then patient responses to the items must adhere to the conditions of conjoint measurement. In other words, the data must agree with the Rasch model.
Using the difficulty rating decision rule of responding x when τx ≤ αn - ρi < τx - 1 (e.g., x = 0 to 4 for the VF-14), the patient's ordinal ratings theoretically are a monotonic transformation of αn - ρi, assuming that tx has the same order as x. By virtue of its additive structure, the Rasch model satisfies the requirements of conjoint measurement, and any monotonic transformation of αn - ρi must also satisfy those requirements. Thus, the order of the sum of the ratings across items for each person, called the person raw score, must agree with the order of αn (i.e., if αm ≥ αn ≥ αp ≥ etc., then Σi xmi ≤ Σi xni ≤ Σi xpi ≤ etc.; the inequalities are reversed in this example because a larger value of x, greater difficulty, corresponds to a lower value of α, less ability). Similarly, the order of the sum of the ratings across people for each item, called the item raw score, must agree with the order of ρi (i.e., if ρi ≥ ρj ≥ ρk ≥ etc., then Σn xni ≥ Σn xnj ≥ Σn xnk ≥ etc.). If the rating responses are to provide evidence of additive (and therefore measurable) latent variables, then when the patients are ordered according to the patient raw score and items are ordered according to the item raw score, the ratings across patients will have the same rank order for every item (i.e., xmi ≥ xni ≥ xpi ≥ etc., for all items) and the ratings across items will have the same rank order for every patient (i.e., xni ≥ xnj ≥ xnk ≥ etc., for all patients). This ordered structure in the data matrix is called a Guttman pattern. 94
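The ordering requirement just described can be written as a short check on a matrix of ratings. This is a minimal sketch under the assumption of a complete matrix (no missing responses); the function name and the two small example matrices are hypothetical.

```python
def is_guttman(matrix):
    """Return True if the rating matrix shows a strict Guttman pattern.

    Rows are persons, columns are items. Persons are ordered by raw score
    (row sum) and items by raw score (column sum); the pattern holds when
    ratings never decrease along any ordered row or any ordered column.
    """
    n_items = len(matrix[0])
    rows = sorted(matrix, key=sum)
    col_order = sorted(range(n_items), key=lambda i: sum(r[i] for r in rows))
    ordered = [[r[i] for i in col_order] for r in rows]
    rows_ok = all(r[i] <= r[i + 1]
                  for r in ordered for i in range(n_items - 1))
    cols_ok = all(ordered[n][i] <= ordered[n + 1][i]
                  for n in range(len(ordered) - 1) for i in range(n_items))
    return rows_ok and cols_ok

# Hypothetical ratings: three persons by three items.
perfect = [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
noisy = [[0, 0, 1], [2, 0, 1], [1, 2, 3]]
```

Here `is_guttman(perfect)` is true and `is_guttman(noisy)` is false, because one person in the second matrix rates an easier item as more difficult than a harder one.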
For real data, the requirement of a Guttman pattern is too strict. We already have discussed between- and within-person variability in tx and between person variability in ρi, all of which was assigned to the error term, δni, in the response decision rule. There undoubtedly are many other sources of response variability that we have not made explicit. Thus, at best, we would expect the data matrix to exhibit a statistical Guttman pattern.
If we repeatedly present the same visual function questionnaire to the same group of patients, then the theoretical manifest variable, π(xni), would empirically be equivalent to the frequency that patient n gave response x to item i. But for practical reasons, data usually consist of a single presentation of the questionnaire to a group of patients (sometimes it is repeated once to evaluate test-retest reliability). In this case, the data matrix must be considered a single observation out of the large sample of possible matrices that would be required to directly estimate the π(xni) matrix.
Rasch analysis uses statistical tools to test the hypothesis that the observed matrix of patient responses to questionnaire items satisfies the requirements of conjoint measurement given the estimated values of α for each patient and ρ for each item. Estimating the α and ρ values for Rasch analysis is done much the same way as described for the IRT model, but using the Wright and Masters polytomous response model. Many IRT practitioners stop at this point and declare the estimates to be valid if they obtain reasonable reliability coefficients and reasonable fits to item characteristic curves. But for Rasch analysis, at this point the job is only half complete. The next step is to test the validity of the conjoint structure of the observed ratings and the latent variables. This is accomplished by evaluating the distributions of differences between expected and observed responses.
For each person-item encounter, the expected rating category response (i.e., average response of person n to item i) is

$$E_{ni} = \sum_{k=0}^{m} k\,\pi_{nik}, \tag{26}$$

for which m is the highest rating category and Equation 25 is used to calculate πnik, incorporating the model parameters estimated from the data set. Because it is a probabilistic model, the Rasch model also has an expected variance in responses for person-item encounters. The model's expected variance of the responses of person n to item i is

$$W_{ni} = \sum_{k=0}^{m} (k - E_{ni})^2\,\pi_{nik}, \tag{27}$$

where, again, Equation 25 is used to calculate πnik with model parameters estimated from the data set. The difference between the actual subject response, xni, and the model's expected response is the error of the estimate for the person-item encounter (also called the response residual). This difference can be normalized to the average expected error for that person-item encounter, i.e., the square root of Equation 27,

$$z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}},$$

which expresses the residual in units of the standard error of the estimate. The average normalized variance from the model's expectations across all patients for item i is the mean squared error for the item,

$$MSE_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^2,$$

for which N is the number of patients who responded to the item. The average normalized variance from the model's expectations across all items for patient n is the mean squared error for that patient,

$$MSE_n = \frac{1}{I}\sum_{i=1}^{I} z_{ni}^2,$$

for which I is the number of items answered by the patient. In both cases, the expected value of the MSE is one. Values less than one indicate that the average error is less than that expected by the model; values greater than one indicate that the average error exceeds the model's expectations. 91
A limitation of the average mean squared error is that extreme outliers can distort the evaluation of goodness of fit. An anomalous response to one or two items by a person, or a few anomalously responding people to an item, can disproportionately influence the value of the average mean squared error. For example, a small expected variance, Wni, relative to the error in the estimate of the response, can magnify the standard error out of proportion to the absolute magnitude of the error. A second fit statistic, a weighted average mean squared error, that is less sensitive to the influence of outliers has been used. 95 This weighted mean squared error for each item is the average of the squared errors across all patients for the item divided by the average of the expected errors across all patients for the item,

$$wMSE_i = \frac{\sum_{n=1}^{N} (x_{ni} - E_{ni})^2}{\sum_{n=1}^{N} W_{ni}},$$

where E_ni denotes the model's expected response for the person-item encounter.
Similarly, the weighted mean squared error for each patient is

$$wMSE_n = \frac{\sum_{i=1}^{I} (x_{ni} - E_{ni})^2}{\sum_{i=1}^{I} W_{ni}},$$

where, in both cases, Wni is defined by Equation 27. Like the MSE (which is also called the outlier sensitive fit statistic or outfit 96), the expected value of the wMSE (which is also called the information weighted fit statistic or infit 96) is one. Values less than one indicate that the error for the person or item is less than the expected error. Values greater than one indicate that the error exceeds expectations.
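The two item fit statistics can be sketched as follows. The sketch assumes that the category probabilities follow the usual rating scale parameterization, in which the log odds of rating k relative to rating k - 1 is α - ρ - τk; that parameterization, the function names, and the use of `None` for missing responses are assumptions made for illustration. The person statistics follow the same pattern with the sums taken across items instead of across patients.

```python
import math

def category_probs(alpha, rho, tau):
    """Probabilities of ratings 0..m for one person-item encounter.

    tau holds the m step parameters tau_1..tau_m (assumed parameterization).
    """
    log_num = [0.0]                       # rating 0: empty sum
    for t in tau:
        log_num.append(log_num[-1] + (alpha - rho - t))
    top = max(log_num)                    # subtract the max for stability
    num = [math.exp(v - top) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

def moments(alpha, rho, tau):
    """Expected rating and expected variance for one encounter."""
    p = category_probs(alpha, rho, tau)
    e = sum(k * pk for k, pk in enumerate(p))
    w = sum((k - e) ** 2 * pk for k, pk in enumerate(p))
    return e, w

def item_fit(X, alpha, rho, tau, i):
    """Return (MSE_i, wMSE_i), i.e., outfit and infit, for item i.

    X[n][i] is the rating of person n on item i, or None if missing.
    """
    z2_sum, resid2_sum, w_sum, count = 0.0, 0.0, 0.0, 0
    for n, row in enumerate(X):
        if row[i] is None:
            continue
        e, w = moments(alpha[n], rho[i], tau)
        resid2 = (row[i] - e) ** 2
        z2_sum += resid2 / w              # squared standardized residual
        resid2_sum += resid2
        w_sum += w
        count += 1
    return z2_sum / count, resid2_sum / w_sum

# Dichotomous check: alpha = rho, one step at zero, observed rating 1.
mse, wmse = item_fit([[1]], alpha=[0.0], rho=[0.0], tau=[0.0], i=0)
```

In this check the expected rating is 0.5 and the expected variance is 0.25, so the squared residual equals the expected variance and both statistics equal one.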
The fit statistics provide evidence of construct validity at the level of individual items and persons. If confounding constructs such as ambiguous wording (e.g., use of conjunctions and unrelated examples), variability in individual circumstances (e.g., the size of the TV screen), or inappropriate content (e.g., asking a question about an activity that does not depend on vision) heavily influence person responses to an item, then the fit statistics will identify that item as a misfit to the model. If confounding constructs such as cognitive impairments, physical impairments, psychological disorders, etc., for individual people govern their responses to the items, then the fit statistic will identify the person as a misfit to the model. The nature of confounding constructs can be explored in greater depth by performing principal components analyses on the response residuals. 97
The MSE and wMSE distributions can be transformed to normalized t distributions with Wilson-Hilferty transformations. 98 The outfit and infit statistics can then be expressed as normal deviates with an expected value of zero and a standard deviation of one. Fig. 26, top, illustrates the covariance of the normalized MSEi with the normalized wMSEi for the Rasch model applied to the VF-14 data. Items 4 (recognize people when they are close to you), 6 (read traffic signs, street signs, or store signs), and 14 (drive at night) poorly fit the expectations of the model with values for MSEi and wMSEi that exceed the expected value by more than two standard deviations (box). Fig. 26, bottom, similarly illustrates the covariance of normalized MSEn with normalized wMSEn for the Rasch model applied to the VF-14 data. In this case, 8% of the patients have MSEn and wMSEn that exceed the expected value by more than two standard deviations (box). Because only 2.5% of the patients would be expected to have person measures more than two standard deviations greater than the mean, some of the patients have response patterns that are statistically inconsistent with the expectations of the model. With the exception of these misfitting items and patients, we would conclude that a valid interval metric of vision disability for preoperative cataract patients can be estimated from patient responses to the VF-14.
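For readers who want to reproduce the normalization, one commonly used form of the Wilson-Hilferty cube-root transformation is sketched below. The specific formula, the function name, and the argument q (the model standard deviation of the mean square, which must be computed separately from the model) are assumptions drawn from common Rasch practice and are not taken from the analysis reported here.

```python
def wilson_hilferty(mean_square, q):
    """Transform a mean-square fit statistic to an approximately
    unit-normal deviate with the cube-root (Wilson-Hilferty) formula.

    q is the model standard deviation of the mean square and is
    assumed to be supplied by the caller.
    """
    if q <= 0:
        raise ValueError("q must be positive")
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

# Hypothetical values: a mean square of 2.0 with q = 0.3.
z = wilson_hilferty(2.0, 0.3)
```

With these hypothetical inputs the transformed value is greater than two, so the item or person would fall in the misfit region marked by the boxes in Fig. 26.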
The fit statistics are indices of measurement accuracy. Construct validity means that the instrument is accurately measuring what it intends to measure. Reliability is an index of measurement precision. If measurements are reliable, they are consistent. The reliability coefficient, r², is the ratio of the true variability (i.e., the variability due to the distribution of the latent variable) to the observed variability (which is the sum of true variability and estimation error), i.e.,

$$r^2 = \frac{\varsigma_o^2 - \varsigma_e^2}{\varsigma_o^2}, \tag{32}$$

where ςo² is the observed variability and ςe² is the variance of the estimation error. If the estimation error is zero, then r² is one. If all of the observed variability is due to estimation error (i.e., ςo² = ςe²), then r² is zero.
For the VF-14 data, the reliability of the item measures is 0.99, and the reliability of the person measures is 0.77. One way to interpret reliability is to think of it as an index of how well the estimated measures statistically separate items and persons. The ratio r²/(1 - r²) is an estimate of true variability expressed in estimation error variance units (i.e., [ςo² - ςe²]/ςe²). The person measure distribution for the VF-14 data has a standard deviation (ςo) of 0.86. The root mean square of the person measure estimate (ςe) is 0.49. Thus, the true person measure standard deviation is 1.81 times the standard error of the person measure estimate. If we consider a difference of three standard errors to be statistically resolvable, then the person measure distribution (±2 standard deviations) can be divided into approximately three statistically distinct strata. The item measure standard deviation is 0.97, and the root mean square of the item measure estimate is 0.09. Thus, the true item measure standard deviation is 10.61 times the standard error of the item measure estimate. This ratio means that the item measure distribution can be divided into approximately 14 statistically distinct strata.
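The arithmetic that links reliability, separation, and strata can be sketched as follows. The reliability and separation functions follow directly from the definitions in the text. The strata formula, (4G + 1)/3, is an assumption: it is a convention from the Rasch literature that reproduces the counts of three and 14 strata quoted above, but the text does not state which formula was used. All function names are hypothetical.

```python
import math

def reliability(observed_var, error_var):
    """Ratio of true variance to observed variance."""
    return (observed_var - error_var) / observed_var

def separation(r2):
    """True standard deviation in standard error units, from reliability."""
    return math.sqrt(r2 / (1.0 - r2))

def strata(g):
    """Number of statistically distinct strata for separation g
    (assumed convention: (4g + 1) / 3)."""
    return (4.0 * g + 1.0) / 3.0

person_strata = strata(1.81)   # separation ratio quoted for persons
item_strata = strata(10.61)    # separation ratio quoted for items
```

With the separation ratios quoted in the text, the formula gives about 2.7 strata for persons and about 14.5 for items, which round to the reported values.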
Monte Carlo Simulation
Although we can statistically evaluate the validity of the Rasch model for any instrument and any population of patients, it is still difficult to be convinced, without some external verification, that we are making measurements of vision disability. Without independent knowledge of the true values of α for each patient and ρ for each item, how can we be confident that the Rasch model produces accurate estimates of these variables? Skeptics could still claim (and some do 99) that we are engaged in a tautology and that the estimated variables are nothing more than fictions created by the model. To help convince the skeptic, we need to demonstrate that the Rasch model generates verifiable estimates of the latent variable. This demonstration can be accomplished with Monte Carlo simulations.
We begin with the definition of a 15-item instrument. Each item is assigned a mean value of ρ in arbitrary units. For our demonstration, ρi ranges from -0.5 to 5. We also assign a standard deviation to ρ for each item. Here, ςi ranges from 1 to 1.75. Next, we invent 1000 patients who will be administered our 15-item instrument. Each patient is assigned a value of α that is chosen at random from a normal distribution with a mean (μα) of five and a standard deviation (ςα) of two. The values of α assigned to the patients by the computer in this manner ranged from -0.2 to 10.4. Finally, we choose four average threshold values of functional reserve, α - ρ, that define the boundaries for five rating categories (i.e., mean values of the tx distribution across patients). The selected values for τx are 4, 3, 2.3, and 0.8, respectively. We also assigned a standard deviation to the distribution of tx between and within patients (ςx). For the demonstration, ςx was set to 1.0 for all patients and for all response categories.
The Monte Carlo simulation consisted of 15,000 patient-item encounters in the computer. For each patient-item encounter, a value of ρ was drawn at random from a normal distribution with a mean equal to ρi and a standard deviation equal to ςi. Also, four response category boundaries, tx, were drawn at random from four normal distributions having means equal to τx and standard deviations equal to ςx. The drawn value of ρ was subtracted from the value of α that was assigned to that patient. The difference was then compared with each value of tx drawn for that patient-item encounter. If α - ρ < t4, then response category 5 was assigned to that encounter. If t4 < α - ρ < t3, then response category 4 was assigned. Category 3 was assigned if t3 < α - ρ < t2, and category 2 if t2 < α - ρ < t1. At the other extreme, if α - ρ > t1, then response category 1 was assigned.
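The data-generating procedure described above can be sketched as follows. Several details are assumptions that the text does not specify: the item means and item standard deviations are spaced evenly across their stated ranges and paired in increasing order, and because independently drawn boundaries can fall out of order, the rating is taken to be one plus the number of drawn boundaries that exceed α - ρ (which matches the stated rule whenever the boundaries are ordered). The function name and the seed are hypothetical.

```python
import random

def simulate(n_patients=1000, n_items=15, seed=0):
    """Generate simulated ratings (1 to 5) for every patient-item encounter."""
    rng = random.Random(seed)
    # Assumed even spacing: rho from -0.5 to 5, item SD from 1 to 1.75.
    rho_mean = [-0.5 + 5.5 * i / (n_items - 1) for i in range(n_items)]
    rho_sd = [1.0 + 0.75 * i / (n_items - 1) for i in range(n_items)]
    tau_mean = [4.0, 3.0, 2.3, 0.8]       # mean boundaries t1..t4
    tau_sd = 1.0
    alphas = [rng.gauss(5.0, 2.0) for _ in range(n_patients)]
    ratings = []
    for alpha in alphas:
        row = []
        for mean, sd in zip(rho_mean, rho_sd):
            reserve = alpha - rng.gauss(mean, sd)      # alpha - rho
            bounds = [rng.gauss(t, tau_sd) for t in tau_mean]
            # Category 1 if the reserve exceeds every boundary,
            # category 5 if it falls below every boundary.
            row.append(1 + sum(reserve < b for b in bounds))
        ratings.append(row)
    return alphas, rho_mean, ratings

alphas, rho_mean, ratings = simulate(n_patients=200, seed=1)
```

The resulting matrix can then be passed to any estimation routine for the polytomous Rasch model so that the recovered parameters can be compared with the assigned values.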
A maximum likelihood estimation routine was used to estimate α for each patient, ρ for each item, and τ for each response category from the Wright and Masters version of Rasch's model for polytomous rating scales. 91 Fig. 27 compares the estimates of τ (expressed as rating category boundaries rather than the τ parameters in Equation 24) to the assigned mean values of τx. There is nearly perfect agreement between the two. These results indicate that despite both within-person and between-person variability in response thresholds, the average thresholds can be estimated with high precision.
Fig. 28 a compares the estimates of ρ for each item to the assigned mean value of ρi. The two are in excellent agreement (R² for the fit of the regression line is 0.995), except that the origins and scales are arbitrary. The reliability of the item measure is nearly 1.0, which means that only a small fraction of a percent of the observed item measure variability can be attributed to estimation error.
The estimate of α is compared with the assigned value of α for each patient in Fig. 29 a. The regression line shows that the estimate of α is accurate, but the scatter about the regression shows that it is not precise. The R² for the fit of the regression line is 0.85. The source of the imprecision is ςi and ςx, which represent model parameter variability between patients and within patients, respectively. Some of this variability would have been absorbed by di in the two-parameter IRT model. With the Rasch model, it is made explicit in the fit statistics. The reliability of the person measure is 0.91, which means that 9% of the variability observed in the person measure distribution can be attributed to estimation error.
The infit and outfit mean squares for each item are estimates of the variability in ρi across patients, i.e., ςi², scaled by a constant. Fig. 30 illustrates the relations of wMSEi and MSEi to the actual ςi² used in the simulation. The linear relationships are consistent with the interpretation that the item mean square fit statistics represent variability across patients in the item parameter. The poorer fit to the regression line for the outfit mean squares (Fig. 30, bottom) is a consequence of the MSEi sensitivity to outliers.
Fig. 31 a illustrates a scatter plot of estimated item measures vs. normalized wMSEi (z-score values) from the simulation. Items that fall between the shaded areas are within ±2 standard deviations of the expected value. Four of the 15 items have wMSEi values that exceed the expected value by more than two standard deviations. These items had the largest values of ςi in the simulation. The high variability across persons indicates that responses to these items are influenced heavily by unrelated constructs and therefore are not consistent with responses to the other items in producing a probabilistic conjoint measurement of the vision disability construct. To increase construct validity, these four items would be edited from the instrument. The two items that are more than two standard deviations less than the expected value have the lowest ςi in the simulation. These items would be examined carefully but retained for future iterations of the analysis because they are likely to regress to the expected value when the four misfitting items are removed.
Fig. 32 a illustrates a scatter plot of estimated person measures vs. normalized wMSEn (z-score values) from the simulation. Persons that fall between the shaded areas are within ±2 standard deviations of the expected value. None of the persons had wMSEn values that were less than the expected values by more than two standard deviations. Eight percent of the persons had wMSEn values that exceed the expected value by more than two standard deviations compared with 2.5% that would occur by chance alone.
The results of the simulation demonstrate that maximum likelihood estimates of model parameters for Wright and Masters' polytomous response version of the Rasch model can accurately and reliably recover item difficulties, population estimates of response category thresholds, and individual abilities within the limits of the inherent variability in the data. Because the origin and units of the parameter estimates are arbitrary, like temperature, vision disability necessarily is measured on an interval scale.
Critics of Rasch models worry, and justifiably so, that the estimation routines produce perfectly reasonable estimates of α and ρ, exhibiting good fit statistics, even when the data being analyzed are completely random. 99 For random data, the average rating across items for each person will be the midpoint of the ordinal rating scale. The average rating across persons for each item will also be the midpoint of the rating scale. The only source of variability is the randomness in the data, so there is nothing to measure. The distribution of response residuals will be the same for every person and every item. Therefore, there will be no outliers; every average item residual and average person residual is a sample mean from the same normalized distribution. In this special case, the fit statistics should be exceptionally good. However, in this special case, all of the variance in the responses is random error. The only variability in the item and person measure distributions will be the estimation errors. Consequently, ςo² = ςe², and r² in Equation 32 will be zero.
The fit statistics identify outlying items and persons for which the pattern of responses is statistically inconsistent with the expectations of probabilistic conjoint additivity. The reliability coefficients tell us how consistently the data can separate items and persons according to the magnitudes of the latent variables. To illustrate the behavior of the fit statistics and reliability coefficients for different levels of random error, we repeated the Monte Carlo simulation 15 times, substituting completely random responses for different numbers of items. On the first repetition, the simulated subject responses to Q1 were replaced with random responses. On each subsequent repetition, responses for an additional item were replaced with random responses, so that by the 15th repetition, all responses for all items were random.
Fig. 28 illustrates scatter plots of estimated item measures vs. the true ρ values used in a subset of the simulations. Filled circles represent items with simulated responses, and open circles represent items with random responses. The regression lines are fit to the items with simulated responses (filled circles only). Note that the estimated item measures for the random responses are near the mean. The regression equation and R² value for the best-fit line are listed in each figure along with the item reliability estimated from Equation 32, r². Both R² and r² remain high until the very end. This robustness of item measure reliability is a consequence of careful choices of which items to make random so that the range of item measures was minimally compromised. However, with increasing numbers of random items, the slope of the regression line becomes shallower (i.e., as random variability in item responses increases with increases in the number of random items, the scale of the estimated measure decreases).
Fig. 29 illustrates scatter plots of estimated person measures vs. the true α values used in each simulation. The regression equation and R² value for the best-fit line are listed in each figure. As was seen in Fig. 28 for item measure estimates, the slope of the regression line for person measure estimates becomes shallower with increasing random variability in the item responses. Also listed is the r² for person measure reliability. Both R² and r² decrease with increasing numbers of items with random responses. This can be seen better in Fig. 33, which illustrates the dependence of person measure reliability (filled circles), person measure regression line fit (open circles), item measure reliability (asterisks), and item measure regression line fit (open squares) on the number of items with random responses.
Fig. 31 illustrates how normalized infit mean squares and estimates of item measures change with increasing numbers of items having random responses. In each figure, items with random responses are symbolized with open circles, and items with simulated subject responses are symbolized with filled circles. As shown in Fig. 31 b and c, when fewer than half of the items have random responses, those items are easily identified as grossly misfitting (i.e., infit zstd is much more than two standard deviations from the expected value). Because of the overall increased variability from the random responses, the items with nonrandom responses seem to have significantly less variability than expected by the model (i.e., fall in the shaded area to the left of the expected value). When random responses dominate the data set (Fig. 31 e), then the items with random responses fit model expectations, and items with nonrandom responses seem to be the misfits. Finally, when all responses are random, all item measures fall at the mean, and there are no misfits (Fig. 31 f).
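To make the misfit of a random item concrete, the sketch below computes an information-weighted (infit) mean square, i.e., the sum of squared response residuals divided by the sum of the model variances. It is a simplified illustration under stated assumptions: it uses a dichotomous item rather than the polytomous model analyzed here, it treats the model probabilities as known rather than re-estimated from contaminated data, and it reports the raw mean square rather than the normalized (zstd) value plotted in Fig. 31. All names and numeric settings are hypothetical.

```python
import math
import random

def infit_mean_square(observed, expected, variance):
    """Information-weighted mean square: sum of squared residuals over sum of model variances."""
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variance)

rng = random.Random(2)
n_persons = 20000
# Hypothetical model probabilities of a positive response to one item.
p = [1.0 / (1.0 + math.exp(-rng.uniform(-3.0, 3.0))) for _ in range(n_persons)]
var = [pi * (1.0 - pi) for pi in p]

fitting = [1 if rng.random() < pi else 0 for pi in p]   # responses generated by the model
random_item = [rng.randrange(2) for _ in p]              # responses unrelated to the model

infit_fitting = infit_mean_square(fitting, p, var)
infit_random = infit_mean_square(random_item, p, var)
print(round(infit_fitting, 2), round(infit_random, 2))
```

The model-consistent responses should give a mean square close to its expected value of 1, whereas the random responses should give a value well above 1. This corresponds to the situation in which only a few items are random. It does not reproduce the reversal seen when random responses dominate, because that reversal arises from re-estimating the measures on the contaminated data.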
Fig. 32 illustrates how normalized infit mean squares and estimates of person measures change with increasing numbers of items having random responses. In agreement with the reliability estimates, with increasing randomness in the data, the person measure distribution collapses toward the mean (distribution along the vertical axis). Also, the distribution of infit mean squares improves with increasing randomness. That occurs because the items with random responses contribute noise to every person measure, eventually overpowering the variability of responses of misfitting persons.
These simulations illustrate that both validity and reliability must be evaluated for each data set. If there is too much randomness in the data, the fit statistics cannot reliably discriminate the misfitting items and persons from those that fit the model. But with increasing randomness in the data, the separation reliability of person measures decreases, eventually reaching zero for totally random data. In summary, the fit statistics tell you which items and persons have response patterns that are inconsistent with the overall response pattern in the data, relative to the expectations of a linear measurement model, and separation reliability tells you how well items and persons can be discriminated from one another on the basis of the estimated measures.
Application of Rasch Models to Vision Disability Measurements
Rasch analysis of patients' responses to visual function questionnaires provides a means of measuring vision disability and evaluating the validity and precision of the measurement. Like many other milestones in psychometrics, the use of Rasch analysis to measure vision disability can trace its origins to the University of Chicago. Georg Rasch was the father of Rasch analysis, but Benjamin Wright must be considered its legal guardian. Wright and his students and colleagues at the University of Chicago further developed and advanced Rasch's models, developed and validated analytic tools, and promoted and facilitated applications of Rasch models to a wide variety of fields. 100 In collaboration with the Hines VA, Wright's laboratory used Rasch analysis to measure vision disabilities in blind veterans enrolled in the Hines VA Blind Rehabilitation Center (BRC). Due to a combination of factors, including the untimely death of Dr. Ross Lambert, the driving force behind the collaboration with the BRC, only a small portion of that work was published. 101 Also, this application of Rasch analysis was far ahead of its time and considered too esoteric by the mainstream journals in the vision field. Even so, one article was published in the Journal of Visual Impairment and Blindness. 102 Fortunately, the corpus of this work is preserved in E. M. Schulz's PhD dissertation. 103
I had the opportunity to learn Rasch analysis a few years ago and apply it to visual function questionnaire data obtained from patients in our low-vision service. 80 Many of the items in our questionnaire were similar in content to items used in the assessment of veterans at the Hines BRC. Like the questionnaire used at the BRC, our questionnaire elicited difficulty ratings from the patient for each item. If we can assume that the sample of patients included in Schulz's study 102 at the Hines BRC was similar to our sample of low-vision patients, i.e., both samples were drawn from the same population, then the ability scales estimated from Rasch analysis should be the same for the two studies. 104 The scales themselves might differ by a factor because of differences in average variability for item difficulty (the arbitrary unit for the ability scale) and differ by a constant because of differences in average item difficulty (the arbitrary origin of the ability scale). However, the relationship between the two sets of item difficulties (ρi) should be linear. Fig. 34 confirms that item difficulty from Schulz's study is proportional to item difficulty from my study for the five items that have similar content.
My colleagues, Kathleen Turano and Duane Geruschat, developed a self-assessment questionnaire designed to measure mobility capabilities and administered it to patients with retinitis pigmentosa. We performed Rasch analysis on their data and estimated item difficulties (ρ) and individual patient abilities (α). 105 Many of the items in their questionnaire, which required difficulty ratings, were similar in content to items in the questionnaire developed by Szlyk et al. 54 and also administered to retinitis pigmentosa patients. Because item difficulties were estimated here from the data published by Szlyk et al., 54 we can compare the item difficulties for these two studies, as we just did for the data from the two low-vision samples. Fig. 35 illustrates that the difficulties for the items from the Szlyk et al. 54 study are proportional to the difficulties of items with similar content from the Turano et al. study. Thus, we would conclude that the two instruments, developed in different centers and applied to different samples of retinitis pigmentosa patients, are measuring the same variable, albeit in different units. The difference in units is no more serious than one laboratory measuring temperature in degrees Fahrenheit and another laboratory measuring temperature in degrees Celsius. The different measurements can easily be transformed to a common unit.
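The transformation to a common unit amounts to fitting a straight line to paired values. The sketch below uses the temperature analogy rather than questionnaire data, so that no item difficulties are invented. The helper names are hypothetical, and ordinary least squares is an assumption about how such a line might be fit, not a description of the analysis behind Figs. 34 and 35.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The same three temperatures recorded by two laboratories in different units.
celsius = [0.0, 37.0, 100.0]
fahrenheit = [32.0, 98.6, 212.0]
slope, intercept = fit_line(celsius, fahrenheit)

def to_fahrenheit(c):
    """Map a Celsius reading onto the Fahrenheit scale using the fitted line."""
    return slope * c + intercept

print(slope, intercept, to_fahrenheit(50.0))
```

Applied to item difficulties for items of similar content, the fitted slope would play the role of the arbitrary unit and the intercept the role of the arbitrary origin.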
CONCLUSIONS
This article has emphasized the measurement of ability as a means of estimating disability. However, as recognized by the early vision economists, disability depends on more than ability. Using the broader concept formalized by the WHO, disability refers to the functional consequences of disease with regard to every aspect of the patient's life. Thus, a complete definition of disability must include the importance of activities to the patient, as well as the patient's ability to perform those activities. We could not consider a person disabled if they were unable to do only those things that they would never do anyway. Although this issue should be kept in mind, it carries us beyond the scope of the present discussion of measurement. The development of more precise definitions of vision disability is explored elsewhere. 80
Snell and Sterling had the right idea in applying Weber's and Fechner's law to the development of a theory of vision disability. Their mathematical error, however, led us in the wrong direction. This error most likely was adopted as policy without criticism because it produced a relationship between visual acuity and vision disability that the ophthalmology profession was prepared to accept. The earlier work of Zehender, Magnus, and Hansell, as well as the prevailing misinterpretation of the Snellen fraction, all proposed the same type of exponential decay function for visual efficiency vs. visual acuity (Figs. 1 and 2). The only differences were in the choice of parameter values. Although Colenbrander's newer definition of vision disability is consistent with the expectations of Weber's and Fechner's laws, it is still just a theoretical exercise that is as yet untested by data.
With Rasch analysis, we have the tools to measure vision disability using patient self-assessment questionnaires. Within Thurstone's framework, the units of measurement are minimally resolvable differences in ability, i.e., the psychophysical JND. The units of measurement imply that vision disability is expected to be proportional to psychophysical measures of visual impairments that also depend on JNDs, such as visual acuity. This argument anticipates the experimental validation of Colenbrander's vision disability scales.
In a recent study, Dr. Donald Fletcher and I measured vision disability in 340 low-vision patients by estimating α from Rasch analysis of patient responses to part II of the NEI-VFQ. 106 As part of that study, we compared our estimates of α to measurements of patients' binocular visual acuity. As reproduced in Fig. 36, α is proportional to log visual acuity (expressed as logMAR). This proportionality confirms the form of Colenbrander's FAS.
As part of their study, Turano et al. 105 compared measures of α to the stage of retinitis pigmentosa progression for 145 patients. Stage of disease progression was expressed as years past critical age, which is proportional to log visual field area. 107 As reproduced in Fig. 37, they observed the trend that α is proportional to years past critical age and, therefore, expected to be proportional to log visual field area. Because log visual field area is twice log visual field diameter, plus a constant, the Turano et al. 105 data support the form of Colenbrander's FFS.
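The step from field area to field diameter can be written out explicitly. The sketch below assumes, for illustration only, a circular visual field of diameter d and area A; the symbols d, A, a, and b are introduced here and are not taken from the cited studies.

```latex
A = \frac{\pi}{4}\, d^{2}
\quad\Longrightarrow\quad
\log A = 2 \log d + \log\frac{\pi}{4},
\qquad
\alpha = a + b \log A
\quad\Longrightarrow\quad
\alpha = \left(a + b \log\frac{\pi}{4}\right) + 2b \log d .
```

Under that assumption, a measure that is linear in log field area is also linear in log field diameter, with twice the slope and a shifted intercept.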
It would seem that at least the logarithmic form of the vision disability scales proposed by Colenbrander is consistent with the data that we now have. Although encouraging, the data are limited and of unproven generalizability beyond the respective patient samples. Thus, it would be premature to declare Colenbrander's scales valid. However, given the alternative, they are better founded and better supported than the AMA visual efficiency scales based on the erroneous Snell and Sterling formula.
Rasch analysis provides us with the tools to measure the functional ability of patients with visual impairments on an interval scale. By making measurements, it is now possible to advance the science of vision disability beyond theorizing to hypothesis-testing. Rasch analysis also provides us with the tools for interpreting patient responses to self-assessment questionnaires in a meaningful way. Many scientists have long been suspicious of the cavalier assertions by developers and users of visual function questionnaires that the average of patient ratings across questionnaire items is a valid measurement scale. With Rasch analysis, the validity of an instrument does not depend on inferential arguments and correlations with external variables. Rather, it rests on objective statistical tests of the model as an explanation of the data.
ACKNOWLEDGMENTS
Supported by grant EY12045 from the National Eye Institute, National Institutes of Health, Bethesda, Maryland.
REFERENCES
1. Editorial: estimating visual impairment. The Nation's Health 1924; 6: 34-5.
2. Donovan JA. Abstract of discussion. JAMA 1925; 85: 1372.
3. Snellen H. Probebuchstaben zur Bestimmung der Sehschärfe. Utrecht: PW van de Weijer, 1862.
4. Bishop VE. Visually handicapped people and the law. J Visual Impair Blin 1987; 81: 53-8.
5. Würdemann HV. The economic valuation of vision. Ann Ophthalmol 1900; 9: 219-56.
6. Zehender W. Ueber den zahlenmässigen Ausdruck der Erwerbsunfähigkeit gegenüber der Unfall-Versicherungs-Gesellschaft. Klin Monatsbl Augenheilkd 1889; 27: 265-82.
7. Groenouw A. Anleitung zur Berechnung der Erwerbsfähigkeit bei Sehstörungen. Wiesbaden: J.F. Bergmann, 1896.
8. Magnus H. Die Blindheit, ihre Entstehung und ihre Verhütung. Breslau: Kern, 1883.
9. Magnus H. Leitfaden für Begutachtung und Berechnung von Unfallsbeschädigungen der Augen. Breslau: Kern, 1897.
10. Magnus H, Würdemann HV. Visual Economics, with Rules for Estimation of the Earning Ability after Injuries to the Eyes. Milwaukee: C Porth, 1902.
11. Hansell HF. Estimation of the amount of injury to the earning capacity of the individual from partial or complete loss of vision. Ann Ophthalmol 1901; 10: 599-611.
12. Snell AC. A Treatise on Medicolegal Ophthalmology. St. Louis: CV Mosby, 1940: 119-51.
13. Landolt E. The Refraction and Accommodation of the Eye. Philadelphia: JB Lippincott, 1886: 230.
14. Fuchs E, Duane A. Text-Book of Ophthalmology, 3rd ed. Philadelphia: JB Lippincott, 1908: 738.
15. Minutes of the Proceedings: 39th Annual Meeting. Trans Am Ophthalmol Soc 1903; 10: 24.
16. Green J. Notes on the clinical determination of the acuteness of vision, including the construction and gradation of optotypes, and on systems of notation. Trans Am Ophthalmol Soc 1905; 10: 644-54.
17. Green J. On a new series of test-letters for determining the acuteness of vision. Trans Am Ophthalmol Soc 1868; 4: 68-71.
18. Donders FC. On the Anomalies of Accommodation and Refraction of the Eye: with a Preliminary Essay on Physiological Dioptrics. London: The New Sydenham Society, 1864: 194.
19. Black NM, Snell AC, Patton J, Gradle HS. Report of Committee on Compensation for Eye Injuries. JAMA 1925; 85: 113-5.
20. Snell AC, Sterling S. The percentage evaluation of macular vision. Arch Ophthalmol 1925; 54: 443-61.
21. Snell AC. Visual efficiency of various degrees of subnormal visual acuity: its effect on earning ability. JAMA 1925; 85: 1367-72.
22. Weber EH. Der Tastsinn und das Gemeingefühl. In: Wagner R, ed. Handwörterbuch der Physiologie, Vol 3. Braunschweig: Vieweg, 1846: 481-588.
23. Fechner GT, Adler HE, Howes DH, Boring EG. Elements of Psychophysics, Vol 1. New York: Holt Rinehart & Winston, 1966.
24. Baird JC, Noma EJ. Fundamentals of Scaling and Psychophysics. New York: Wiley, 1978.
25. The Committee on Medical Rating of Physical Impairment. Guides to the evaluation of permanent impairment: committee report. JAMA 1958; 168: 475-85.
26. American Medical Association. Committee on Rating of Mental and Physical Impairment. Guides to the Evaluation of Permanent Impairment. Chicago: American Medical Association Press, 1971.
27. Doege TC, Houston TP, eds. Guides to the Evaluation of Permanent Impairment, 4th ed. Chicago: American Medical Association Press, 1993.
28. World Health Organization. International Classification of Impairments Disabilities and Handicaps: a Manual of Classification Relating to the Consequences of Disease. Geneva: World Health Organization, 1980.
29. Weiner JP. Introduction to managed care. In: Massof RW, Lidoff L, eds. Issues in Low Vision Rehabilitation: Service Delivery, Policy, and Funding. New York: AFB Press, 2001: 97-106.
30. Colenbrander A. Dimensions of visual performance. Trans Am Acad Ophthalmol Otolaryngol 1977; 83: 332-7.
31. Colenbrander A. The visual system. In: Cocchiarella L, Andersson BJ, eds. Guides to the Evaluation of Permanent Impairment, 5th ed. Chicago: American Medical Association Press, 2000: 277-304.
32. Bailey IL. Measurement of visual acuity: towards standardization. In: Baldwin WR, ed. Vision Science Symposium: a Tribute to Gordon G. Heath. Bloomington, IN: Indiana University, 1988: 215-30.
33. Esterman B. Functional scoring of the binocular field. Ophthalmology 1982; 89: 1226-34.
34. Legge GE, Ross JA, Luebker A, LaMay JM. Psychophysics of reading: VIII. The Minnesota Low-Vision Reading Test. Optom Vis Sci 1989; 66: 843-53.
35. Legge GE, Ross JA, Maxwell KT, Luebker A. Psychophysics of reading: VII. Comprehension in normal and low vision. Clin Vis Sci 1989; 4: 51-60.
36. Szlyk JP, Seiple W, Viana M. Relative effects of age and compromised vision on driving performance. Hum Factors 1995; 37: 430-6.
37. Rubin GS. Prevalence of visual disabilities and their relationship to visual impairment. In: Massof RW, Lidoff L, eds. Issues in Low Vision Rehabilitation: Service Delivery, Policy, and Funding. New York: AFB Press, 2001: 27-38.
38. Bullimore MA, Bailey IL, Wacker RT. Face recognition in age-related maculopathy. Invest Ophthalmol Vis Sci 1991; 32: 2020-9.
39. Kuyk T, Elliott JL, Fuhr PS. Visual correlates of mobility in real world settings in older adults with low vision. Optom Vis Sci 1998; 75: 538-47.
40. Granger CV, Hamilton BB. The Uniform Data System for Medical Rehabilitation report of first admissions for 1991. Am J Phys Med Rehabil 1993; 72: 33-8.
41. Long RG, Crews JE, Mancil R. Creating measures of rehabilitation outcomes for people who are visually impaired: the FIMBA project. J Visual Impair Blin 2000; 94: 292-306.
42. Szlyk JP, Arditi A, Bucci PC, Lederman D. Self-report in functional assessment of low vision. J Visual Impair Blin 1990; 84: 61-6.
43. Bernth-Petersen P. Visual functioning in cataract patients: methods of measuring and results. Acta Ophthalmol (Copenh) 1981; 59: 198-205.
44. Massof RW, Rubin GS. Visual function assessment questionnaires. Surv Ophthalmol 2001; 45: 531-48.
45. Ellwein LB, Fletcher A, Negrel AD, Thulasiraj RD. Quality of life assessment in blindness prevention interventions. Int Ophthalmol 1994; 18: 263-8.
46. Wright BD, Linacre JM. Observations are always ordinal; measurements, however, must be interval. Arch Phys Med Rehabil 1989; 70: 857-60.
47. Steinberg EP, Tielsch JM, Schein OD, Javitt JC, Sharkey P, Cassard SD, Legro MW, Diener-West M, Bass EB, Damiano AM, Steinwachs DM, Sommer A. The VF-14: an index of functional impairment in patients with cataract. Arch Ophthalmol 1994; 112: 630-8.
48. Sloane ME, Ball K, Owsley C, Bruni JR, Roenker DL. The visual activities questionnaire: developing an instrument for assessing problems in everyday visual tasks. In: Noninvasive Assessment of the Visual System. OSA Technical Digest, Vol 1. Washington, DC: Optical Society of America, 1992: 26-9.
49. Mangione CM, Berry S, Spritzer K, Janz NK, Klein R, Owsley C, Lee PP. Identifying the content area for the 51-item National Eye Institute Visual Function Questionnaire: results from focus groups with visually impaired persons. Arch Ophthalmol 1998; 116: 227-33.
50. Thorndike RL. Applied Psychometrics. Boston: Houghton Mifflin, 1982.
51. Likert R. A technique for the measurement of attitudes. Arch Psychol 1932; 140: 55.
52. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297-334.
53. Thorndike EL. An Introduction to the Theory of Mental and Social Measurements. New York: The Science Press, 1904.
54. Szlyk JP, Fishman GA, Alexander KR, Revelins BI, Derlacki DJ, Anderson RJ. Relationship between difficulty in performing daily activities and clinical measures of visual function in patients with retinitis pigmentosa. Arch Ophthalmol 1997; 115: 53-9.
55. Mangione CM, Lee PP, Pitts J, Gutierrez P, Berry S, Hays RD. Psychometric properties of the National Eye Institute Visual Function Questionnaire (NEI-VFQ). NEI-VFQ Field Test Investigators. Arch Ophthalmol 1998; 116: 1496-504.
56. Coren S, Hakstian R. Visual screening without the use of technical equipment: preliminary development of a behaviorally validated questionnaire. Appl Opt 1987; 26: 1468-72.
57. Krantz DH, Luce RD, Suppes P, Tversky A. Foundations of Measurement, Vol 1. New York: Academic Press, 1971.
58. Michell J. Measurement in Psychology: Critical History of a Methodological Concept. Cambridge, UK: Cambridge University Press, 1999.
59. Kisch B. Scales and Weights: a Historical Outline. New Haven: Yale University Press, 1965.
60. Zupko RE. British Weights and Measures: a History from Antiquity to the 17th Century. Madison: University of Wisconsin Press, 1977.
61. Kula W. Measures and Men. Princeton, NJ: Princeton University Press, 1986.
62. Klein HA. The Science of Measurement: a Historical Survey. New York: Simon and Schuster, 1974.
63. Kupfer C. The expanded role of randomized clinical trials. Am J Ophthalmol 1996; 122: 883-5.
64. Medicare. Medicare part B: Kansas/Nebraska/Western Missouri Low Vision Services. Medical Policy Manual Low Vision Services. Washington, DC: Center for Medicare and Medicaid Services, 1999: 1-16.
65. Stevens SS. Mathematics, measurement and psychophysics. In: Stevens SS, ed. Handbook of Experimental Psychology. New York: Wiley, 1951: 1-49.
66. Fahie JJ. Galileo, His Life and Work. New York: J Pott, 1903.
67. Bolton HC. Evolution of the Thermometer, 1592-1743. Easton, PA: The Chemical Publishing, 1900.
68. Middleton WEK. A History of the Thermometer and its Use in Meteorology. Baltimore: Johns Hopkins Press, 1966.
69. Thurstone LL. A law of comparative judgment. Psychol Rev 1927; 34: 273-86.
70. Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Wiley, 1966.
71. Thurstone LL, Chave EJ. The Measurement of Attitude: a Psychophysical Method and Some Experiments with a Scale for Measuring Attitude Toward the Church. Chicago: University of Chicago Press, 1929.
72. Bergner M, Bobbitt RA, Carter WB, Gilson BS. The Sickness Impact Profile: development and final revision of a health status measure. Med Care 1981; 19: 787-805.
73. Janz NK, Wren PA, Lichter PR, Musch DC, Gillespie BW, Guire KE. Quality of life in newly diagnosed glaucoma patients: The Collaborative Initial Glaucoma Treatment Study. Ophthalmology 2001; 108: 887-97; discussion 98.
74. Lord FM. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: L Erlbaum Associates, 1980.
75. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of Item Response Theory. Newbury Park, CA: Sage Publications, 1991.
76. Hambleton RK, Cook LL. Latent trait models and their use in the analysis of educational test data. J Educ Measurement 1977; 14: 75-96.
77. Birnbaum A. Some latent trait models and their use in inferring an examinee's ability. In: Lord FM, Novick MR, eds. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley, 1968: 397-472.
78. Samejima F. Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika 1974; 39: 111-21.
79. Kirby RL. The nature of disability and handicap. In: Basmajian JV, Kirby RL, eds. Medical Rehabilitation. Baltimore: Williams & Wilkins, 1984: 14-8.
80. Massof RW. A systems model for low vision rehabilitation: II. Measurement of vision disabilities. Optom Vis Sci 1998; 75: 349-73.
81. Burgess AE, Colborne B. Visual signal detection: IV. Observer inconsistency. J Opt Soc Am A 1988; 5: 617-27.
82. Blinkhorn SF. Past imperfect, future conditional: fifty years of test theory. Br J Math Stat Psychol 1997; 50: 175-85.
83. Goldstein H, Wood R. Five decades of item response modelling. Br J Math Stat Psychol 1989; 42: 139-67.
84. García-Pérez MA. Fitting logistic IRT models: small wonder. Span J Psychol 1999; 2: 74-94.
85. van der Linden WJ, Hambleton RK. Item response theory: brief history, common models, and extensions. In: van der Linden WJ, Hambleton RK, eds. Handbook of Modern Item Response Theory. New York: Springer, 1997: 1-28.
86. Lord FM. Item characteristic curves estimated without knowledge of their mathematical form: a confrontation of Birnbaum's logistic model. Psychometrika 1970; 35: 43-50.
87. Hambleton RK, Cook LL. Robustness of item response models and effects of test length and sample size on the precision of ability estimates. In: Weiss DJ, Bock RD, eds. New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing. New York: Academic Press, 1983: 31-49.
88. Fisher WP. The Rasch debate: validity and revolution in educational measurement. In: Wilson M, ed. Objective Measurement: Theory Into Practice, Vol 2. Norwood, NJ: Ablex, 1994: 36-72.
89. Michell J. An Introduction to the Logic of Psychological Measurement. Hillsdale, NJ: L Erlbaum Associates, 1990.
90. Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut, 1960.
91. Wright BD, Masters GN. Rating Scale Analysis: Rasch Measurement. Chicago: MESA Press, 1982.
92. Andrich D. Rating formulation for ordered response categories. Psychometrika 1978; 43: 561-73.
93. Masters GN. A Rasch model for partial credit scoring. Psychometrika 1982; 47: 149-74.
94. Guttman L. A basis for scaling quantitative data. Am Soc Rev 1944; 9: 139-50.
95. Andersen EB. A goodness of fit test for the Rasch model. Psychometrika 1973; 38: 123-40.
96. Smith RM. Person fit in the Rasch model. Educ Psychol Measurement 1986; 46: 359-72.
97. Smith RM, Miao CY. Assessing unidimensionality for Rasch measurement. In: Wilson M, ed. Objective Measurement: Theory Into Practice, Vol 2. Norwood, NJ: Ablex, 1994: 316-27.
98. Smith RM. The distributional properties of Rasch item fit statistics. Educ Psychol Measurement 1991; 51: 541-65.
99. Wood R. Fitting the Rasch model: a heady tale. Br J Math Stat Psychol 1978; 31: 27-32.
100. Fisher WP, Wright BD. Applications of probabilistic conjoint measurement. Int J Educ Res 1994; 21: 557-664.
101. Becker SW, Lambert RW, Schulz EM, Wright BD, Burnet DL. An instrument to measure the activity level of the blind. Int J Rehabil Res 1985; 8: 415-24.
102. Schulz EM, Lambert RW, Becker SW, Wright BD, Bezruczko N. An assessment of the needs of rehabilitated veterans. J Visual Impair Blin 1985; 79: 301-5.
103. Schulz EM. Functional Assessment in Rehabilitation: an Example with the Visually-Impaired. PhD Dissertation. University of Chicago, 1987.
104. Fisher WP Jr, Harvey RF, Taylor P, Kilgore KM, Kelly CK. Rehabits: a common language of functional assessment. Arch Phys Med Rehabil 1995; 76: 113-22.
105. Turano KA, Geruschat DR, Stahl JW, Massof RW. Perceived visual ability for independent mobility in persons with retinitis pigmentosa. Invest Ophthalmol Vis Sci 1999; 40: 865-77.
106. Massof RW, Fletcher DC. Evaluation of the NEI visual functioning questionnaire as an interval measure of visual ability in low vision. Vision Res 2001; 41: 397-413.
107. Massof RW, Dagnelie G, Benzschawel T, Palmer RW, Finkelstein D. First order dynamics of visual field loss in retinitis pigmentosa. Clin Vis Sci 1990; 5: 1-26.