In recent years, performance assessments have become increasingly popular in medical education. While the term “performance assessment” can be applied to many different types of assessments,1 in medical education this term usually refers to some sort of simulated patient encounter, such as an objective structured clinical examination (OSCE) or a computer simulation of an encounter. These types of assessments appeal to many educators because the tasks or items used are often seen as more realistic than items on multiple-choice examinations. However, this increased “realism” or apparent authenticity comes at a cost—performance examinations are typically more time-consuming and expensive both to administer and to score. On an OSCE, each encounter with a standardized patient is typically scored as a single item, often resulting in an examinee's completing only four to eight items in a two-hour testing period. In contrast, an examinee might complete 100 to 150 items during a two-hour multiple-choice examination.
The fact that performance examinations are typically relatively short means that test users must pay particular attention to the reliability and validity of test scores. In general, other things being equal, a shorter test will result in scores that are less reliable than a longer test. Lower reliability reflects greater error. Adding more items is one way that test developers may increase reliability. On a multiple-choice test, it is relatively inexpensive to write and administer additional items. However, on a performance test both the development and administration of even a single new item can be expensive, and often must be justified in terms of expected gains in score precision.
A second consideration in performance examinations is that scoring is typically more difficult and expensive than scoring of multiple-choice examinations. Expert or trained raters are generally required to review each performance or a sample of performances. Such ratings may be used to score specific performances or to develop scoring criteria or weighting schemes. In either case, raters are a potential source of error.
Generalizability theory2 provides a framework for estimating the relative magnitudes of various sources of error in a set of scores. In most performance assessments, both items and raters are potential sources of error. Generalizability theory allows estimation of the error associated with each of these sources separately, as well as the relevant interaction effects. In a generalizability study (G study), the variance in a set of scores is partitioned in a manner similar to that used in the analysis of variance. However, in a G study the emphasis is not on testing for statistical significance, but rather on assessing the relative magnitudes of the variance components. Depending on the study design, different variance components can be estimated. Once the variance components are estimated, additional analyses can be conducted. In the framework of generalizability theory, the second stage of analysis is referred to as a decision study (D study). In a D study, the estimated variance components are used to estimate generalizability coefficients (comparable to reliability coefficients) under various measurement conditions. Thus, using the results from a single test administration, it is possible to estimate the impacts of changing both the number of raters and the number of items. This is an important benefit of conducting analyses based on generalizability theory. However, it must be stressed that the variance components and G coefficients are estimates, and as such will vary depending on the specific sample used.
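As a concrete illustration of this first stage, the variance components for a fully crossed person × item × rater (p × i × r) design can be estimated by computing the usual ANOVA mean squares and solving the expected-mean-square equations. The sketch below shows one way this might be coded (in Python with NumPy; the function name and output labels are illustrative, and dedicated software such as GENOVA would typically be used in practice):

```python
import numpy as np

def g_study(X):
    """Estimate variance components for a fully crossed p x i x r design.

    X: 3-D array of scores with shape (n_p, n_i, n_r),
       i.e., persons x items x raters, one observation per cell.
    Returns a dict of the seven variance-component estimates.
    """
    n_p, n_i, n_r = X.shape
    grand = X.mean()
    # Marginal means over the other facets
    m_p = X.mean(axis=(1, 2))   # person means
    m_i = X.mean(axis=(0, 2))   # item means
    m_r = X.mean(axis=(0, 1))   # rater means
    m_pi = X.mean(axis=2)       # person-by-item means
    m_pr = X.mean(axis=1)       # person-by-rater means
    m_ir = X.mean(axis=0)       # item-by-rater means

    # Mean squares (sums of squares divided by degrees of freedom)
    ms_p = n_i * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_i = n_p * n_r * np.sum((m_i - grand) ** 2) / (n_i - 1)
    ms_r = n_p * n_i * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms_pi = n_r * np.sum((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2) \
        / ((n_p - 1) * (n_i - 1))
    ms_pr = n_i * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) \
        / ((n_p - 1) * (n_r - 1))
    ms_ir = n_p * np.sum((m_ir - m_i[:, None] - m_r[None, :] + grand) ** 2) \
        / ((n_i - 1) * (n_r - 1))
    resid = (X - m_pi[:, :, None] - m_pr[:, None, :] - m_ir[None, :, :]
             + m_p[:, None, None] + m_i[None, :, None] + m_r[None, None, :]
             - grand)
    ms_pir = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1) * (n_r - 1))

    # Solve the expected-mean-square equations, truncating negative
    # estimates at zero as is conventional
    v = {}
    v['pir,e'] = ms_pir
    v['pi'] = max((ms_pi - ms_pir) / n_r, 0.0)
    v['pr'] = max((ms_pr - ms_pir) / n_i, 0.0)
    v['ir'] = max((ms_ir - ms_pir) / n_p, 0.0)
    v['p'] = max((ms_p - ms_pi - ms_pr + ms_pir) / (n_i * n_r), 0.0)
    v['i'] = max((ms_i - ms_pi - ms_ir + ms_pir) / (n_p * n_r), 0.0)
    v['r'] = max((ms_r - ms_pr - ms_ir + ms_pir) / (n_p * n_i), 0.0)
    return v
```

Given a scores array of shape (200, 16, 4) such as the full sample described later, this routine would return estimates for the seven components (p, i, r, pi, pr, ir, and the confounded pir,e term).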
Given that the results of generalizability analyses are often used to make practical decisions about test implementation, it is important to collect the data for a G study in a way that will maximize the precision of the variance-components estimates. Given also that performance assessments are costly to administer and score, and that resources (time, raters, and money) are typically limited, the question of how available resources should be allocated for a G study is an important one. Is it preferable to collect data from 100 examinees on 16 items, or 200 examinees on eight items? Should four raters score 50 examinee performances, or should two raters score 100 performances? Decision studies may help to inform these types of decisions after the data are collected and analyzed, but D studies are based on G studies. To date, we are aware of no research to guide the planning of data collection for a G study, particularly when resources are constrained.
The purpose of the present study was to examine the impacts of different G-study designs. All of the designs simulated here contain the same number of data points, but the distributions of the data points over examinees, items, and raters are varied. By starting with a relatively large data set (200 medical student examinees, completing 16 items each, scored by four raters each for a total of 12,800 data points), we were able to conduct repeated sampling of different data-collection conditions and to construct empirical confidence intervals for variance-components estimates. Computed confidence intervals were also constructed4 and compared with the empirical intervals. A series of D studies was then conducted to illustrate how different sampling strategies, and different samples within those strategies, could have substantial impacts on the decisions that would likely be made based on such analyses. It should be stressed that the focus of this study was to illustrate the impacts of various sampling strategies, rather than to make decisions about this particular data set. We hope to inform and remind test designers and users that estimates are based on samples, and as such contain variability, and to illustrate how strongly that variability depends on the data-collection procedure used.
Data. The data set used here, hereafter referred to as the “full sample,” consisted of four expert ratings of 200 medical students on 16 performance items related to a computer simulation. Each examinee performance was rated by each of the four independent raters on a holistic nine-point rating scale. From this data set, samples were selected according to five data-collection designs or conditions. The numbers of persons or examinees (P), items (I), and raters (R) for each condition were as follows: condition 1, P = 25, I = 16, R = 4; condition 2, P = 50, I = 8, R = 4; condition 3, P = 50, I = 16, R = 2; condition 4, P = 100, I = 4, R = 4; condition 5, P = 100, I = 8, R = 2. These five conditions were chosen so that all samples contained the same total number of observations (1,600). While many other combinations were possible, it was beyond the scope of the present study to investigate every possible design. These five conditions were considered representative and realistic. One hundred replications were conducted for each condition in constructing the empirical confidence intervals. For the computed confidence intervals for conditions 1 through 5, one sample was selected at random, and computations were based on that single sample.
Analysis. For each of the 500 samples, and for the full data set, a person × item × rater (p × i × r) G study was performed, and variance components were estimated using GENOVA.3 The 100 replications of each sampling condition provided an empirical sampling distribution for each of the variance components and allowed empirical estimation of means, standard deviations, and 95% confidence intervals for each variance component. The percentage of variance due to each variance component was also calculated, along with the appropriate 95% confidence intervals for these percentages. These empirical confidence intervals were compared with the confidence intervals obtained using Satterthwaite's technique.4
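For reference, Satterthwaite's approximation treats a variance-component estimate, expressed as a linear combination of independent mean squares, as approximately chi-square distributed, with effective degrees of freedom computed from the mean squares themselves. A minimal sketch of the computation (Python with SciPy; the function name is hypothetical):

```python
from scipy.stats import chi2

def satterthwaite_ci(coefs, mean_squares, dfs, alpha=0.05):
    """Satterthwaite confidence interval for a variance component
    estimated as sum(c_j * MS_j) over independent mean squares MS_j,
    each with degrees of freedom df_j."""
    est = sum(c * ms for c, ms in zip(coefs, mean_squares))
    # Effective degrees of freedom for the approximating chi-square
    nu = est ** 2 / sum((c * ms) ** 2 / df
                        for c, ms, df in zip(coefs, mean_squares, dfs))
    lower = nu * est / chi2.ppf(1 - alpha / 2, nu)
    upper = nu * est / chi2.ppf(alpha / 2, nu)
    return est, (lower, upper)
```

For example, the person component in a p × i × r design is estimated as (MSp − MSpi − MSpr + MSpir)/(ni · nr), so the coefficients would be (1, −1, −1, 1)/(ni · nr) applied to those four mean squares with their respective degrees of freedom.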
To assess the practical implications of the differences in the variance components, a series of D studies was conducted. Because the results of the G studies suggested that only a small percentage of the variance was associated with the rater facet, the number of raters was fixed at four for all D studies, while the number of items varied from one to 30. Two sets of D studies were conducted for each of the five simulated conditions. This was done in order to illustrate how results could differ even under the same data-collection design. The specific samples were chosen so that the person variance component was at the 10th and 90th percentiles of the distribution for that condition. A D study was also conducted on the full data set.
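The D-study computations rest on the standard formula for the relative G coefficient in a p × I × R design: Eρ² = σ²p / (σ²p + σ²pi/ni + σ²pr/nr + σ²pir,e/(ni·nr)). A brief sketch of how the search for the minimum number of items might look (Python; the variance-component values shown are hypothetical, for illustration only):

```python
def g_coefficient(var, n_i, n_r):
    """Relative (norm-referenced) G coefficient for a p x I x R D study.

    var: dict of variance components with keys 'p', 'pi', 'pr', 'pir,e'.
    """
    rel_error = (var['pi'] / n_i + var['pr'] / n_r
                 + var['pir,e'] / (n_i * n_r))
    return var['p'] / (var['p'] + rel_error)

def items_needed(var, target=0.80, n_r=4, max_items=30):
    """Smallest number of items yielding a G coefficient >= target,
    with the number of raters fixed; returns None if never reached."""
    for n_i in range(1, max_items + 1):
        if g_coefficient(var, n_i, n_r) >= target:
            return n_i
    return None

# Hypothetical variance components, for illustration only
v = {'p': 1.0, 'pi': 3.0, 'pr': 0.2, 'pir,e': 1.0}
```

Because the person-by-item component is divided by the number of items, increasing ni shrinks the relative-error term, which is why lengthening the test raises the G coefficient.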
The results of the G studies using the full data set and the five different conditions are summarized in Table 1. For conditions 1 through 5, the percentages associated with each variance component represent the averages across the 100 replications. The confidence intervals reported here are based on the empirical distributions of these percentages in the 100 replications. The confidence intervals obtained using Satterthwaite's technique are reported below the empirical confidence intervals.
Comparing the average percentage of variance associated with each of the facets across the five sampled conditions, it appears that differences between conditions are minimal. The average percentages are also very similar to the variance-components estimates obtained using the full data set. However, because the results for the various sampling conditions were based on the variance components averaged across 100 samples, it is important to consider the associated confidence intervals, which indicate the variability in the sampling distributions. A review of the empirical confidence intervals suggests differences in the stability of the estimates obtained under various conditions. For example, the widths of the empirical confidence intervals for the item component range from about 9% (condition 3) to 36% (condition 4), suggesting that condition 3 provides a more stable estimate of the item-variance component. Considering all five sampling conditions, condition 1 provides the most stable estimates of four of the seven variance components. By contrast, condition 4 provides the least stable estimates of five of the seven components.
The computed confidence intervals show considerable variability across conditions in the widths of the intervals and the values of the lower and upper limits. Sixteen of the 35 computed confidence intervals for conditions 1 through 5 were wider than the empirical intervals; the remaining 19 were not. Twelve of the 35 computed confidence intervals did not contain the value of percentage of variance estimated from the full sample, and 12 did not contain the value of the mean percentage of variance estimated from the 100 samples of the specified condition.
As noted above, a series of D studies was conducted to illustrate how estimates of G coefficients might vary depending on the sampling design and the specific sample used in the G study. Because such decision studies are often used to determine a minimal number of items to be administered to obtain a specified G coefficient (much as the Spearman-Brown prophecy formula is used in classical test theory), the number of items was varied from 1 to 25. These results are presented in Table 2. One result of interest is the number of items estimated to be needed to obtain a G coefficient of .80. This value is in bold in each column.
Considering the 90th percentile samples for all five conditions, it can be seen that for four of the five conditions the estimate of the number of items needed to achieve a G of .80 is 12; for the second condition the estimate is 11. The estimate based on the full sample is 15 items. Considering the 10th percentile samples for the five sampling conditions, the estimated numbers of items needed range from 19 to 24. Comparing the 10th and 90th percentile samples within conditions, substantial differences in estimates are apparent. For instance, for condition 5 (Np = 100, Ni = 8, Nr = 2) the number of items estimated to be necessary at the 10th percentile is 24, versus only 12 items if the sample at the 90th percentile is used.
Discussion and Implications
The results presented above suggest that, at least for the data set used here, different data-collection designs would have had little impact on average on the variance-components estimates obtained. In other words, collecting ratings of 25 examinees on 16 items using four raters would have resulted in approximately the same variance-components estimates as collecting ratings of 100 examinees on eight items using two raters. In all the conditions studied here, including the full sample, it was clear that a far higher percentage of the variability in scores was related to the item facet, and the associated interactions, compared with that associated with the rater facet and associated interactions. Thus, conclusions as to the relative impacts of item and rater would have been similar regardless of which data-collection design was used.
While the average percentages of variance associated with the individual facets were very similar across the five conditions, the widths of the associated confidence intervals (both empirical and computed) did vary across conditions. In addition, for all of the five conditions, estimates of the numbers of items needed to obtain a given level of generalizability varied considerably depending on whether the 10th percentile or the 90th percentile sample was used. Since in practice investigators have only one sample, and no knowledge of where their sample falls in the distribution, it is important to be aware of the fact that substantially different estimates might have resulted if a different sample had been used. The results of the D studies reported here highlight this. Depending on the condition, the numbers of items required to obtain a G of .80 differed by as much as 100% depending on the specific sample used in the G study. This is particularly important given the time and expense associated with most performance assessments. In this case, analysis of the full data set suggests that 15 items would be needed to achieve a G coefficient of .80. While this is also, of course, a sample, it can be considered our best estimate of the true number of items needed. If, instead of the full data set, we had had only one of the samples investigated here, we might have come to the conclusion that a test of only 11 or 12 items would result in a G coefficient of .80. Were we then to administer a 12-item test based on these results, it is likely that the results would be less generalizable than expected, a result with potentially serious consequences for test developers and users. In contrast, using a different sample we might conclude that 24 items were needed to obtain sufficiently generalizable results.
While this overestimation would not be a problem from a psychometric perspective, the costs associated with administering more items than are in fact needed to achieve a specified G could be. In fact, in some circumstances, an overestimation error could result in a decision that a particular testing format is not feasible given cost and time constraints.
It is difficult to interpret the results found for the computed confidence intervals, particularly since these were calculated based on a single sample and would be expected to differ if a different sample were selected. In comparing the computed confidence intervals with the empirical confidence intervals, substantial discrepancies were found, especially for the components that accounted for higher percentages of the variance. In some cases discrepancies were found not only in the width of the interval but also between the values within the interval—at times the intervals were not even overlapping. These findings raise questions as to the usefulness of Satterthwaite's technique in this instance.
It is important for test developers, psychometricians, and test users to remember that generalizability coefficients and other reliability coefficients are estimates based on samples, and as such may be expected to vary depending on the specific sample used in estimation. The results of the study reported here highlight how different samples may produce different results, which can in turn lead to very different decisions. The fact that the computed confidence intervals differed substantially from the empirical confidence intervals, and in some cases did not even contain the appropriate percentage, suggests that computing confidence intervals from a single sample will not necessarily improve decision making. This study does not allow us to make specific recommendations regarding the number or distribution of data points required when conducting a G study. Which design provides the most stable estimates will depend on the nature of the data collected. It is ideal, naturally, to obtain the largest sample possible; but when smaller samples are used, it is crucial that the stability of the estimates be taken into consideration before decisions are made based on a specific sample, as was illustrated in this study.