In recent years, the standards for accrediting medical education programs have increasingly been framed in the language of general competencies.1,2 It has proven challenging, however, to measure attainment of these competencies in learners. In this commentary, the authors examine these difficulties within the context of modern views of test validity.
Educational Competencies and Social Values
The trend in medical education toward framing assessment around general competencies reflects a larger, more holistic idea of physician competence as seen from the societal perspective. In this view, professional competence goes beyond mere mastery of information; a fully competent medical professional is expected to skillfully deploy factual knowledge in the context of communications skills, clinical reasoning, professional ethics, social engagement, interpersonal conduct, cross-cultural awareness, and perhaps other areas as well.
This movement has been associated with a widespread expansion of medical curricula to include elements that have historically been conceived (and sometimes thereby unfairly disparaged) as part of the “art” of medicine. Many U.S. medical schools now list their school-specific competencies on their Web sites. The framework of general competencies has also been endorsed by the American Board of Medical Specialties3 and, thus, seems poised to influence lifelong certification requirements for nearly every practicing physician in the United States.
Beyond a general movement in the direction of competencies, however, there seems to be little consensus about how exactly competencies might be defined. A recent literature review4 identified at least 173 different sets of published definitions of specific competencies for medical practice in various settings and specialties. Such definitions are typically negotiated by various working groups of experts and stakeholders. Although such groups may all have access to the same published literature on this topic, the final language of competencies reflects the specific biases, outlooks, and negotiations of individual committee members, as well as dynamic trends in society's expectations of medical professionals in various settings.
This method of generating competencies is thus similar to organizational processes to derive mission statements. Like mission statements, competencies represent an attempt to negotiate a concise statement of shared overarching values. As such, the language of competencies necessarily embodies a series of ideological and political compromises. Competencies are therefore pragmatic, flexible, and focused on the perceived needs of particular groups of people. Thus, just as with mission statements, it would be useful for sponsoring organizations to revisit the competencies from time to time to ensure that they remain relevant and topical.
Educational Competencies as Measurement Constructs
Having defined specific competencies, accreditation organizations may then ask learners to show evidence that they are on a path toward achieving them. Such accreditation requirements make the tacit assumption that people can internalize organizational values and that the degree of internalization can be discerned by direct observation of how people behave, by their attainment of specific outcomes, or by their scores on various assessment tools. Although it may seem reasonable to think that certain measures may relate to specific competencies, empirical relationships between behaviors and competencies need to be rigorously established, particularly for high-stakes testing situations. Thus, in an official statement about assessment of its proposed general competencies, the Accreditation Council for Graduate Medical Education (ACGME) states that “results of an assessment should allow sound inferences about what learners know, believe, and can do.”5 This document goes on to state that the first two requirements of such sound inferences are validity and reliability.
We agree wholeheartedly with the spirit of this data-oriented approach, and we recently published a systematic review6 that explored the evidence linking various outcomes-based assessment methods to the ACGME's legislated competencies. We found little evidence that current assessment tools directly reflect the ACGME competencies. We concluded that this is not surprising, in that there was no a priori reason that a series of socially negotiated constructs would necessarily reflect themselves in how people actually behave, or in how working faculty would interpret learners' behaviors in real-world classroom and clinical settings.
Educational Competencies and Test Validity
We believe that our results are intelligible in the context of the modern theory of test validity, which does not see validity as an intrinsic property of a test. Rather, the concept of validity rests on the web of logical and statistical arguments that can be made to support a particular interpretation of test results. That is, rather than the simplistic question, “Is this test valid?” the relevant question is, “What does the evidence tell us about how the results of this test relate to particular outcomes?” As Kane7 points out in his discussion of test validity,
It is the interpretation and uses of test scores that are validated, and not the tests themselves. Although it can be quite reasonable to talk elliptically about the “validity of a test,” this usage makes sense only if the interpretation/use has already been adopted, either explicitly or implicitly.
Here is a similar idea from Messick,8 who expands this view to include the entire social milieu of a testing situation:
Validity is not a property of the test or assessment as such, but rather of the meaning of test scores. These scores are a function not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment. In particular, what needs to be valid is the meaning or interpretation of the score, as well as any implications for action that this meaning entails.
Evidence to support the existence of stable theoretical constructs (also known as “abilities,” “skills,” “competencies,” etc.) may be inferred from patterns of relationships between test scores and other outcomes. Such constructs should be logically coherent in the face of empirical data from test results. In this way, it is not only tests that are validated in terms of constructs but also constructs that are validated in terms of test results—the logical web of relationships among actual test results and theoretical constructs constitutes the so-called “nomological net” of construct validity.9
From this perspective, we find little evidence that available tests are valid measures of the ACGME competencies. The problem is not that the available tests are inherently good or bad; rather, the difficulty is that, in the aggregate, they do not yield a coherent pattern of inferences about the proposed ACGME competencies.
What Is the Evidence for the Validity of Available Tests in Terms of Hypothesized Competencies?
Green and Holmboe10 provide a spirited critique of our findings and conclude that we are part of the “nihilistic din that the instruments currently available are inadequate for evaluating residents in the six competencies.” We read this statement with some dismay, as we fear that we were not sufficiently clear about the conclusions of our study. The problem, as we see it, is not that available tests are inadequate to measure the ACGME competencies, but rather that these hypothetical competencies do not adequately explain the results of available testing. It is the interpretation of a test that must be validated; in this case, the ACGME competencies do not appear to be a high-fidelity representation of available test results.
In their article, Green and Holmboe10 criticize our search strategy and conclude that our review “missed many important studies … given that investigators frequently develop and test evaluation instruments without the Outcome Project framework specifically in mind.” In light of our discussion of test validity, however, we point out that if a test is not specifically evaluated in terms of the competencies, then little can be concluded about the validity of such a test in terms of them. Indeed, in reviewing the ACGME toolbox, we were unable to locate any articles to directly refute our findings that were not already turned up by our search.
Furthermore, as we pointed out in our article, several reviews have looked more broadly at skills implied by the ACGME competencies. These reviews consistently fail to find that such intuitive-sounding constructs as “communications skills” or “professionalism” emerge as measurable dimensions from the psychometric data. Thus, we are confident that expanding our search strategy would not have fundamentally altered our conclusions.
Green and Holmboe10 also conclude that “the biggest problem in evaluating the competencies [is] not the lack of adequate assessment instruments but, rather, the inconsistent use and interpretation of those available by unskilled faculty.” From a purely practical point of view, we question the utility of mandating tools and specifying competencies that a majority of working faculty find nonintuitive to apply and difficult to interpret. In response to this implicit concern, Green and Holmboe10 go on to state, “Faculty can develop and maintain evaluation skills, but development and maintenance require substantial training and ongoing practice.” However, they offer no empirical evidence that training can improve faculty's ability to assess the ACGME competencies, or that “substantial training and ongoing practice” would be feasible and cost-effective on the scale required to reliably assess the many tens of thousands of U.S. medical trainees, let alone on a global scale.
We suspect that trainees bring lifelong patterns of organizing information, skills, and attitudes, as attested by the vast psychological literature in each of these areas. Faculty, for their part, bring their own lifelong patterns of attributing meaning to others' observed behaviors. Such overrehearsed patterns of social cognition can scarcely be expected to be influenced by recently legislated guidelines from regulatory bodies. Thus, we strongly support efforts to develop naturalistic understandings of competency,11 which could inform the development of ecologically meaningful assessment tools. Such efforts would produce rating scales that discriminate among dimensions that trainees and evaluators actually find meaningful and exhibit in the course of their daily work.
To address a lack of validity among existing measures of competencies, Green and Holmboe10
endorse another type of discrimination—the ability of an instrument to discriminate among different levels of performance within a single competency, or a subdomain of a single competency. … Instruments with this type of discriminative validity would be useful to chart residents' development along “milestones” of competence.
Of course, it is true that many scales have been found to discriminate between trainees at various levels of experience, and many tests may thereby prove useful for that purpose. But one has no need for the ACGME competencies to reach that conclusion. The finding that more experienced test takers get higher scores is the weakest possible evidence about whether such scores thereby represent one hypothesized competency or another.
Testing as a Way of Discovering Educational Constructs
Green and Holmboe10 state that the ACGME toolbox “is not overflowing with perfect instruments.” We agree with this to a point—it is always possible to develop new instruments and improve existing ones. If, however, the goal is to develop an instrument that directly reflects one or more politically defined ACGME competencies, then we expect that such a project will lead to ongoing frustration. This is because the competencies, as currently specified, do not seem to have any demonstrated empirical basis and, thus, cannot yield themselves to measurement. Competencies may be created by negotiation and agreement, but social consensus alone cannot compel them to materialize in actual testing outcomes.
Thus, the difficulty with measuring the competencies, as we see it, is that they have not been permitted to adapt themselves to the results of empirical testing. As Downing12 recently pointed out,
Validity always refers to score interpretations and never to the assessment itself. The process of validation is closely aligned with the scientific method of theory development, hypothesis generation, data collection for the purpose of hypothesis testing and forming conclusions concerning the accuracy of the desired score interpretations.
Implicit in Downing's argument is the novel-sounding suggestion that, in the spirit of the scientific method, the results of empirical studies could actually help to guide and refine theoretical models of ability. Currently, however, competencies are virtually never framed in this spirit of intellectual inquiry. Rather, once legislated into being, presented at national forums, and finally promulgated as accreditation requirements, they may take on the false appearance of self-evident and immutable physical truths. As such, the only conceivable role for scientific study vis-à-vis the competencies would be simple quantification. This view of competencies may thus partake of the logical error variously known as “reification” or “hypostatization” (the act of treating an idea as concrete reality), or the “fallacy of misplaced concreteness.”13
Although we recognize the imperative of developing socially responsive guidelines for medical education, this urgency may have led to some confusion about the appropriate roles for professional competencies. Indeed, we suspect that many faculty recognize that the mandated competencies correspond only imperfectly to the real skills, attributes, and attitudes that they assess on a daily basis. Perhaps it is the poor fit of the competencies to the real work of teaching and assessment that has led to the “nihilistic din” to which Green and Holmboe10 refer.
Thus, rather than continuing to falsely imagine that the competencies represent coherent and measurable human attributes, we believe that they are best understood as overarching aspirations for curricular reform. Such an understanding would prevent unrealistic expectations about the competencies from interfering with the empirical work of developing data-driven models of professional competence. In contrast to the current situation, where hundreds of proposed competencies compete for the “validity” of a limited number of assessment tools, such scientific work could establish measurable models of clinical performance. This effort could yield evidence-based outcomes that could be directly compared across time, settings, and learners. Such an effort could set the stage for a more evidence-based approach to defining and assessing professional performance.