When Assessment Data Are Words: Validity Evidence for Qualitative Educational Assessments

Cook, David A., MD, MHPE; Kuper, Ayelet, MD, DPhil; Hatala, Rose, MD, MSc; Ginsburg, Shiphra, MD, MEd

doi: 10.1097/ACM.0000000000001175
Articles

Quantitative scores fail to capture all important features of learner performance. This awareness has led to increased use of qualitative data when assessing health professionals. Yet the use of qualitative assessments is hampered by incomplete understanding of their role in forming judgments, and lack of consensus in how to appraise the rigor of judgments therein derived. The authors articulate the role of qualitative assessment as part of a comprehensive program of assessment, and translate the concept of validity to apply to judgments arising from qualitative assessments. They first identify standards for rigor in qualitative research, and then use two contemporary assessment validity frameworks to reorganize these standards for application to qualitative assessment.

Standards for rigor in qualitative research include responsiveness, reflexivity, purposive sampling, thick description, triangulation, transparency, and transferability. These standards can be reframed using Messick’s five sources of validity evidence (content, response process, internal structure, relationships with other variables, and consequences) and Kane’s four inferences in validation (scoring, generalization, extrapolation, and implications). Evidence can be collected and evaluated for each evidence source or inference. The authors illustrate this approach using published research on learning portfolios.

The authors advocate a “methods-neutral” approach to assessment, in which a clearly stated purpose determines the nature of and approach to data collection and analysis. Increased use of qualitative assessments will necessitate more rigorous judgments of the defensibility (validity) of inferences and decisions. Evidence should be strategically sought to inform a coherent validity argument.

D.A. Cook is professor of medicine and medical education, associate director, Mayo Clinic Online Learning, and consultant, Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minnesota.

A. Kuper is assistant professor, Department of Medicine, Faculty of Medicine, University of Toronto, scientist, Wilson Centre for Research in Education, University Health Network/University of Toronto, and staff physician, Division of General Internal Medicine, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada.

R. Hatala is associate professor of medicine and director, Clinical Educator Fellowship, University of British Columbia, Vancouver, British Columbia, Canada.

S. Ginsburg is professor, Department of Medicine, Faculty of Medicine, University of Toronto, scientist, Wilson Centre for Research in Education, University Health Network/University of Toronto, and staff physician, Mount Sinai Hospital, Toronto, Ontario, Canada.

Funding/Support: None reported.

Other disclosures: None reported.

Ethical approval: Reported as not applicable.

Correspondence should be addressed to David A. Cook, Division of General Internal Medicine, Mayo Clinic College of Medicine, Mayo 17-W, 200 First St. SW, Rochester, MN 55905; telephone: (507) 284-2269; e-mail: cook.david33@mayo.edu.

Not everything that can be counted counts, and not everything that counts can be counted.

—W. B. Cameron1

Recent trends in health professions education toward competency-based education, milestones, entrustable professional activities, and mastery learning all indicate a need for robust assessments providing data that inform defensible decisions about trainees.2–6 As educators increasingly recognize that quantitative scores fail to capture all important features of a trainee’s performance,7,8 they have begun to incorporate and embrace qualitative data.9–11 This renaissance in learner assessment parallels the increased presence of qualitative and mixed-methods approaches in education research over the past 30 years.12 Some might argue that recognition of qualitative assessments’ role and value is long overdue, and experts have begun to elaborate on this approach to assessment.9–11,13–15 Yet at present such voices constitute a small minority. We believe that a fuller use of qualitative assessment is hampered in part by incomplete understanding of its role in forming judgments, and also by a lack of consensus in how to appraise the rigor or merits of judgments therein derived.16

The purpose of this article is to (1) articulate the role of qualitative assessment as part of a comprehensive program of assessment, (2) translate the concept and language of validity from quantitative scores to also apply to judgments arising from qualitative assessments, (3) illustrate this approach using examples from published research, and (4) highlight issues requiring more complete understanding. We elaborate on principles some of us recently introduced that relate the validity argument to both quantitative and qualitative assessment,17 focusing on the latter and explaining how evidence might be collected to support decisions based on qualitative assessments. In doing this, we build on existing frameworks for evaluating rigor in qualitative research.18–21

A Brief Overview of Validity Frameworks

The science of validation has evolved substantially in recent years.17,22,23 The “classical” model of three types of validity—content, criterion (also referred to as correlational, concurrent, or predictive), and construct—has been replaced by two contemporary frameworks that offer complementary approaches to support the interpretations and uses of assessments. In Messick’s22 framework all validity is unified under the overarching umbrella of construct validity, which is supported by evidence derived from five different sources (defined below). The focus on sources of evidence rather than types of validity emphasizes the centrality of the construct being measured (e.g., knowledge, skill, attitude), the concept that validity is a testable hypothesis, and the imperative to seek multiple and complementary sources of evidence to support that hypothesis. However, Messick’s framework does not prioritize among the different evidence sources or indicate how such prioritizing might vary for different assessments. To address this problem, Kane23 developed a framework focusing on the process of collecting validity evidence (validation), specifically highlighting four key phases or inferences in planning and evaluating a validity argument for an assessment’s scores. In practice, an educator would typically choose to use one of these two contemporary frameworks when collecting and interpreting data to support the validity of scores and decisions. However, in this article we invoke both to show readers how each might be used.

As we elaborate below, neither of these frameworks is restricted to a positivist paradigm (i.e., that an objective reality exists with specific, discoverable natural laws); rather, both lend themselves to an appropriately interpretivist stance (i.e., that reality is socially constructed through language and shared meaning and is open to multiple interpretations).

The Role of Qualitative Assessment in Making Defensible Decisions

The ultimate purpose of any assessment method is to come to a valid (i.e., defensible) decision or judgment about the person being assessed. Such decisions might include promotion, identification of areas for remediation, admission into a program, or tailored feedback. Some decisions are dichotomous (e.g., pass/fail) while others are more nuanced and heterogeneous (e.g., strengths and weaknesses in communication skill). These decisions, alone or in combination, have important consequences for the lives of learners and, in the world of medicine, for the patients, peers, and systems with which they work.24 Formal assessment decisions have historically been grounded largely in quantitative data, even though numeric scores are inherently limited to capturing attributes and actions prospectively identified as important, and thereby miss unanticipated but potentially important aspects of performance and context. Other deficiencies in numeric scores include rater biases (leniency, strictness, and idiosyncrasies), restriction of range, teaching to the test (or testing to what was taught), and the false security of “objectivity.”11,25 Qualitative assessments may be less (or differently) susceptible to many of these limitations.

Of course, qualitative assessment has its own share of concerns, including susceptibility to the personal biases and idiosyncrasies of assessors and analysts, lack of generalizability across contexts, and practical concerns about the time and training required to collect and analyze what is often a large volume of qualitative data. Nonetheless, a focus on decisions rather than scores provides an interpretive lens through which quantitative and qualitative data can figure equally in offering useful information.

What Counts as Qualitative Assessment?

Broadly speaking, qualitative data can include any non-numeric information, but for present purposes we limit this to words, such as the narratives contained in in-training evaluation comments, learner portfolios, supervisor field notes, and reference letters. These sources are typically rich in description about those being assessed and often capture elements of performance different from those captured by numeric scores.26–30 It is common practice in both research and assessment to convert such narratives to numeric scores (e.g., rating an in-training evaluation narrative on a 10-point scale), but while such scores are useful and have a role in assessment, they do not (by our definition) count as qualitative assessment. Rather, in true qualitative assessment educators analyze the raw narrative data to identify common themes, patterns, disagreements, examples, and disconfirming examples regarding learner performance, and synthesize these into an interpreted narrative and summary judgment. When properly done, such narrative syntheses retain a depth, breadth, nuance, and richness of meaning that would be lost in reducing the narratives to mere numbers. As Hodges16 noted, “Perhaps the translation of behaviours into numbers and then numbers back into statements is an unnecessary detour.” While quantitative and qualitative approaches to analysis of narrative data can complement one another, in talking about qualitative assessment we refer here only to narrative syntheses and judgments.
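
For readers who think in terms of data structures, the following minimal Python sketch (our own hypothetical illustration; the comments, codes, and function names are invented, and the interpretive act of coding remains an analyst’s work) shows one way the raw material for such a synthesis might be organized so that supporting and disconfirming examples of each theme can be reviewed together.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical raw narrative comments, each tagged by an analyst with one or more
# codes. The tagging itself is the interpretive, human step; the code below only
# organizes the result for review.
coded_comments: List[Tuple[str, List[str]]] = [
    ("Explains plans clearly to patients and families.", ["communication"]),
    ("Rushed the counselling conversation when the clinic was busy.",
     ["communication", "time management"]),
    ("Case presentations are well organized and appropriately prioritized.", ["organization"]),
]

def group_by_code(comments: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Gather the comments filed under each code, so supporting and disconfirming
    examples for a theme can be read side by side."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, codes in comments:
        for code in codes:
            grouped[code].append(text)
    return dict(grouped)

for code, examples in group_by_code(coded_comments).items():
    print(f"{code}: {len(examples)} comment(s)")
```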

We wish to make a brief note regarding terminology. When discussing qualitative assessment, we eschew the word rater when referring to the person observing and documenting the learner’s performance, since it connotes the creation of a numeric rating, preferring instead a neutral term such as assessor or observer. We likewise avoid the words rating and score, and use instead the words narrative (to refer to the words or raw data generated from the observation) and synthesis or interpretation (to refer to narratives that have been analyzed, interpreted, and triangulated with other data). Finally, we note that whereas the analysis of ratings or scores from a given quantitative assessment is usually straightforward (e.g., summing or averaging across items or cases), in qualitative research the synthesis and interpretation of narratives typically require analysts with specific training and experience in qualitative methods.

Can We Talk About “Validity” in Qualitative Assessment?

A decision is only as good as the data on which it was founded. Decision makers need a way to judge the quality, rigor, and robustness of the information they use. In quantitative assessment, rigor is established using evidence such as detailed descriptions of instrument creation, random sampling of observations, controlled conditions, interrater reliability, and statistical associations with related scores. This evidence is grouped and interpreted through the lens of validity, which is defined as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.”31 (p11)

By contrast, rigor in qualitative research is judged very differently, relying on approaches such as purposive sampling, naturalistic settings, rich data, insightful interpretation, and transferability. Qualitative researchers eschew the concept of study validity, and instead interpret rigor using terms such as credibility and trustworthiness.18–21 These different ways of describing and labeling concepts related to the collection, analysis, and interpretation of data are critically important to each set of methodologies, as they allow for a shared understanding of processes that enhance rigor within each domain.

Yet we believe that the concept of validity can, in fact, be appropriately applied to qualitative assessment without sacrificing the core values and inherently different paradigms that define this field and approach. In asserting this, we specifically restrict our claim to the evaluation of assessment validity; judgments of research validity constitute an entirely different set of methods and standards,32 and fall outside the scope of this article. We see at least three reasons to invoke the concepts of validity and validation when referring to qualitative assessments.

First, validation refers to the collection of evidence to support the defensibility of a given decision.23,33 This broad definition makes no restriction on the nature of the evidence used, and in fact much of the evidence collected to support quantitative scores is qualitative in nature (e.g., content evidence often includes a narrative description of test development34). The focus on decisions also enables us to move beyond the lay notion that assessment is synonymous with numeric scores, and accept a broader and more nuanced range of assessment data including judgments and narrative syntheses emerging from qualitative analyses. Good decisions can be based on information presented as either numbers or words—and, perhaps ideally, both in concert.

Second, for the field of educational assessment to move forward and embrace a greater diversity of data types we need to find common ground in language and vocabulary. We acknowledge that the vocabulary chosen to define rigor in qualitative assessment is no trivial matter. Language shapes beliefs, defines paradigms, and affords power to some discourses over others. Indeed, words figure centrally in all phases of qualitative analysis. Some might argue to retain two independent vocabularies—distinct languages—or to adopt the language of qualitative research and operationalize it for quantitative assessments. However, the concept of validation in the quantitative assessment literature is already quite well developed and reflects a robust framework that is accepted by many educators. Thus, even though two of us (A.K., S.G.) work primarily in a qualitative/interpretivist research paradigm, we came to a pragmatic consensus that researchers and educators might benefit from reoperationalizing and broadening terms typically used in the quantitative assessment literature such that they align with both the language and paradigms of qualitative research.

Third, and most important, the evidence and paradigms traditionally used to establish rigor in qualitative research can be readily reinterpreted through the lens of contemporary validity frameworks to support the rigor of qualitative assessments.17 , 35 The classical validity model drew heavily on psychometric traditions but is now considered out of date. By contrast, Messick’s and Kane’s frameworks readily guide the collection and interpretation of evidence that supports the defensibility of decisions founded on qualitative assessments.

To help educators understand how they might apply contemporary validity frameworks to qualitative assessments, we first identified established guidelines for high-quality qualitative research18–21 and considered how they might specifically translate to qualitative assessments (see Table 1). We then used Messick’s “five sources of evidence” and Kane’s “four inferences of validation” frameworks to reframe these domains of high-quality qualitative inquiry into a comprehensive approach to establishing the defensibility of judgments (see Tables 2 and 3).

Applying Messick’s Framework to Qualitative Assessment: Five Sources of Evidence

Current standards31 endorse the model first espoused by Messick22 in which the data to support (or refute) the validity or defensibility of score interpretations can be broadly classified as evidence of content, response process, internal structure, relationships to other variables, and consequences. Importantly, these sources of evidence are not different types of validity but, rather, different categories of evidence that inform judgments about validity. Evidence should be sought from several sources to support any given interpretation. Table 2 contains definitions for these five sources, together with examples of evidence that might be collected to support a qualitative assessment.

To summarize briefly:

  • Content evidence evaluates the “relationship between the content of a test and the construct it is intended to measure.”31 (p14) For a qualitative assessment, content evidence might include the method for selecting specific prompts (questions), the wording of prompts and instructions, and the sampling strategy (often deliberate or purposeful, to explore specific perspectives; and iterative, to tailor the duration and focus of data collection). The qualitative research concept of saturation36—continuing purposeful, iterative data collection and analysis until additional observations do not suggest new themes—provides a useful theoretical sampling end point, although in practice “sufficient” rather than complete saturation will usually be the goal.
  • Response process evidence is formally defined as “the fit between the construct and the detailed nature of performance … actually engaged in.”31 (p15) In more practical terms it reflects the processes that link the observation itself to the record (answer, rating, narrative) documenting that observation, such as the assessor’s interpretations of events (i.e., mental processes) that underlie his or her narrative comments or the system by which such comments are recorded (e.g., clerical or computer processes). Evidence might show that assessors follow instructions, that narratives offer rich data and differ between learners for the same assessor, that assessors’ reflexivity (such as their relationships with learners and context) has been considered, or that the computer interface allows extended narratives.
  • Internal structure reflects the relationship among data elements in the same assessment—their coherence (or lack thereof), and the rationale and method for synthesizing these elements into a meaningful message. Evidence to support qualitative assessment might explore the triangulation among different data elements (e.g., narratives collected from different assessors or different contexts), the nature of the analysis itself (e.g., iteration, and often additional data collection, until saturation), or the reflexivity of the analysts. For example, educators might highlight discordant narratives and seek additional data or a more nuanced interpretation that accounts for this seeming inconsistency.37
  • Whereas internal structure looks at associations among elements within the assessment and the approach to synthesis, relationships with other variables examines the associations between the final synthesis and other information sources outside the assessment under study. Application of the qualitative research concepts of triangulation (here, with external data sources) and transferability (e.g., to other contexts or future or past performances) can supply evidence of these relationships.
  • Finally, evidence of consequences looks at the actual impact of the assessment and the associated decisions and actions.24 The collection of such evidence might use qualitative or quantitative research methods to explore learner and instructor impressions of the assessment and how it influenced their lives, objective evidence of intended and unintended consequences, or the degree to which decisions resonate with other stakeholders.
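
To make the bookkeeping concrete, a minimal Python sketch follows (our own hypothetical illustration, not part of Messick’s framework or of any published instrument); it simply files each piece of evidence under one of the five sources and reports which sources still lack evidence.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Messick's five sources of validity evidence.
SOURCES = [
    "content",
    "response process",
    "internal structure",
    "relationships with other variables",
    "consequences",
]

@dataclass
class EvidenceFile:
    """Evidence collected for one qualitative assessment, grouped by source."""
    assessment: str
    evidence: Dict[str, List[str]] = field(
        default_factory=lambda: {source: [] for source in SOURCES}
    )

    def add(self, source: str, item: str) -> None:
        if source not in self.evidence:
            raise ValueError(f"Unknown evidence source: {source}")
        self.evidence[source].append(item)

    def gaps(self) -> List[str]:
        """Sources for which no evidence has been filed yet."""
        return [source for source, items in self.evidence.items() if not items]

# Hypothetical entries for a learning portfolio used for formative feedback.
portfolio = EvidenceFile("learning portfolio (formative feedback)")
portfolio.add("content", "rationale for prompts and purposive sampling of artifacts")
portfolio.add("response process", "assessor reflexivity documented; narratives are rich")
portfolio.add("internal structure", "triangulation across artifacts; analysis iterated to sufficiency")
print("Evidence still needed for:", portfolio.gaps())
# -> ['relationships with other variables', 'consequences']
```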

Applying Kane’s Framework to Qualitative Assessment: Four Validation Inferences

By contrast with Messick’s focus on sources of evidence, Kane’s framework focuses on key inferences in the validity argument (see Table 3). As some of us recently summarized elsewhere17:

essentially, Kane traces an assessment from the Scoring of a single observation (e.g., multiple-choice exam question, skill station, clinical observation, or portfolio item), to using the observation score(s) to generate an overall test score representing performance in the test setting (Generalization), to drawing an inference regarding what the test score might imply for real-life performance (Extrapolation), and then to interpreting this information and making a decision (Implications). Each phase in this process represents an inference laden with several assumptions.

By focusing on the argument itself rather than sources of evidence, Kane’s approach applies equally well to quantitative assessments, qualitative assessments, and programs of assessment.35 Kane emphasizes the imperative to articulate up front both the intended uses and decisions and also the evidence needed to support such decisions (the “interpretation/use argument”),38 just as a researcher would prospectively identify the research question and outline the research data required to answer the question.

The first inference—scoring—concerns the generation of an insightful and accurate response (e.g., narrative comment) based on one or more observations of performance. Data to support the scoring inference might include the wording of questions or prompts inviting a narrative response, the richness of the raw narrative, the credibility of the observers, and the use of “thick description” quotes or images in the final report.

The next inference is generalization, in which individual data elements (e.g., comments) are synthesized into a coherent, insightful, and accurate judgment regarding performance in the test setting. Qualitative assessments are not intended to be generalizable in the usual sense19; rather, in qualitative assessment educators

continually revis[e] initial interpretations until they account for all of the available evidence. Inconsistency in students’ performance across tasks … becomes an empirical puzzle to be solved by searching for a more comprehensive or elaborated interpretation.37

The generalization inference would be supported by rigorous methods for sampling and data analysis, including purposeful sampling, iterative and responsive data collection (ideally continuing until saturation), and triangulation (deliberate selection and synthesis of data from complementary sources). The training and experience of analysts would also inform this inference.

Extrapolation extends the interpretations beyond the test setting and claims relevance to real-life situations and future performance. Extrapolation evidence comes primarily from two sources: steps taken to ensure that the test or assessment reflects the most important aspects of real-life performance, and empiric analyses that evaluate (quantitatively or qualitatively) the relationship between test performance and real-world performance. Note that both the test-setting assessment data and the corresponding real-world data could be quantitative, qualitative, or both.

Finally, the implications inference explores the link between claims for real-world performance and the decisions and actions based thereon. This step evaluates the intended or unintended consequences of assessment-informed decisions and, in most respects, parallels the consequences evidence discussed above.
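
As a rough illustration of how an interpretation/use argument might be laid out and audited, the Python sketch below (hypothetical; the inference names are Kane’s, but the structure, example evidence, and function are our own invention) records the assumptions and evidence for each inference and flags any link in the chain that remains unsupported.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Inference:
    """One link in the validity argument: what must hold, and what supports it."""
    name: str
    assumptions: List[str]  # assumptions underlying this step of the argument
    evidence: List[str]     # evidence gathered to support those assumptions

def unsupported(argument: List[Inference]) -> List[str]:
    """Names of inferences in the chain that currently lack any supporting evidence."""
    return [inf.name for inf in argument if not inf.evidence]

# A hypothetical interpretation/use argument, stated before evidence is gathered.
argument = [
    Inference("scoring", ["prompts elicit rich, accurate narratives"],
              ["review of prompt wording", "credibility of observers"]),
    Inference("generalization", ["sampling and analysis reach thematic sufficiency"],
              ["purposive, iterative sampling plan", "analyst training"]),
    Inference("extrapolation", ["the synthesis reflects real-life performance"],
              []),  # no evidence yet: the weak link
    Inference("implications", ["feedback decisions are sound and beneficial"],
              ["stakeholder agreement with decisions"]),
]

# The argument is traced in Kane's order: scoring -> generalization -> extrapolation -> implications.
assert [inf.name for inf in argument] == ["scoring", "generalization", "extrapolation", "implications"]
print("Unsupported inferences:", unsupported(argument))  # -> ['extrapolation']
```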

Application to an Existing Assessment Approach

We conclude by using published evidence to show how both frameworks can apply to a prototypical qualitative assessment approach, the learning portfolio.13,15,39–58 We selected this assessment approach because it is commonly used and has been the object of substantial research. Our intent in presenting evidence is to illustrate rather than be comprehensive.

Definition and purpose

While there are many definitions of a learning portfolio, for our purposes it is a collection of information (“artifacts”) used to facilitate and measure learning that includes a narrative, reflective component.41 Learning portfolios (hereafter, “portfolios”) are atypical among assessments in that learners are usually their own assessors (i.e., they select and comment on many, if not all, of the artifacts), and learners also frequently contribute to the analysis (i.e., synthesis during the reflective component) along with one or more faculty members.42,56 Since the first step in validation is to define the intended use, in the following exercise we will hypothetically plan to use the portfolio to support a program director in providing formative feedback to a resident. Some of the evidence we cite derives from studies of medical students, nurses, or practicing physicians, but we do not emphasize these distinctions here, since our primary purpose is to illustrate evidence and inferences of validation. In evaluating this (or any other) assessment for a real application, it would be essential to appraise the relevance of evidence deriving from populations and contexts different from our own (what qualitative researchers would refer to as transferability).

Applying Messick’s framework

As detailed in Table 2, we find supportive evidence from all five of Messick’s sources. Two areas of concern are that residents, in the assessor (data collector) role, could bias the assessment by selectively including only favorable data,13,42,44,53 and that the use of excessive structure might impede reflection.57 Both of these issues could be addressed through portfolio redesign (e.g., involving others in artifact selection). Evidence of consequences is quite extensive for portfolios, likely because they have been explicitly viewed as training interventions as well as assessment instruments.39–41 While generally favorable, fairness and overall value have been questioned.43–45,47,52,55

Applying Kane’s framework

Applying Kane’s approach requires that we not only identify the intended use (i.e., to provide formative feedback, as noted above) but that we also articulate the interpretation/use argument38 (i.e., the evidence and interpretations that would support such use) before embarking on a search for evidence.17 To support the proposed use, for the scoring inference we would hope to find evidence that those collecting information (typically the residents) had proper training and/or instructions, that they were allowed sufficient time to collect their artifacts, and that the artifacts accurately reflect the original activity being assessed and contain sufficient information to allow insightful interpretations. To support generalization, we would hope to find evidence that the artifacts represent a variety of meaningful activities and collectively form a coherent, accurate, and complete picture of the resident across multiple clinical situations. Ideally, artifacts suggesting conflicting or incomplete interpretations would prompt additional data collection or a narrative explanation from the resident. We would further expect that those conducting the interpretive analysis (e.g., mentors or program directors) have appropriate training or experience, that the interpretation integrates but does not rely exclusively on the learner’s reflections, and that the analysis process is transparent and rigorous. This might include explicit consideration of personal idiosyncrasies when providing feedback (analyst reflexivity). To support extrapolation, we would hope that the final synthesis is coherent with other information (e.g., qualitative or quantitative assessments) about the resident (triangulation) and that those involved in the original activities would agree with this final synthesis (member check). Finally, to support the proposed implications, we would hope to find evidence that relevant stakeholders (e.g., the resident and program director) agree with the decision (in this case, specific feedback) based on these narratives, and that actions based on this feedback have the desired effect without unanticipated negative impact.

We evaluate the proposed interpretation/use argument using Kane’s framework (Table 3) by reconsidering, reorganizing, and prioritizing the evidence reported using Messick’s framework. Support for the inferences of scoring, generalization, and implications is generally favorable aside from issues regarding sampling, reflexivity, structure, and overall value. Extrapolation evidence is lacking.

Further evidence and summary

To further reinforce this illustration, we refer readers to one study13 that deliberately sought evidence to support the robustness (trustworthiness) of a portfolio-based qualitative assessment system using the qualitative research standards of credibility and dependability (see Tables 2 and 3 for details about this study’s findings).

In summary, we can make a reasonable argument to support the use of portfolios for providing formative feedback, with the caveats noted above and the acknowledgment that all validation is context dependent. We note that using portfolios for another purpose, such as decisions regarding academic advancement, would require a fresh formulation and evaluation of the interpretation/use argument. Readers seeking another illustration may find our application of Kane’s framework to in-training assessments of interest.17

Challenges, Uncertainties, and Next Steps

In this article we advocate the increased use of qualitative assessments when warranted, and simultaneously advocate more robust judgments of the defensibility of inferences, decisions, and actions arising from their use. We illustrate how two contemporary frameworks for evaluating the validity of quantitative assessments can be readily applied to qualitative assessments. Such frameworks illuminate potential gaps in the defensibility of decisions grounded in the assessment results, and have the added benefit of offering a common lens and language to facilitate communication among quantitative and qualitative scholars.

Limitations of this work

This work has limitations. First, the “potential evidence” examples in Tables 2 and 3 are largely hypothetical, and when we sought actual evidence for a real assessment our search was not systematically comprehensive. Second, our effort is, to our knowledge, the first to describe a formal framework for evaluating the defensibility of purely qualitative assessments. We expect that over time this framework will mature, and that what we present herein will be extended or perhaps overturned. Nonetheless, we feel that this work represents a critical advance for the field and a starting point for debate and future research.

Practical issues in qualitative assessment

We do not suggest that qualitative assessment is without limitations or that it is appropriate for all constructs and contexts.16 In the messy reality of real-world assessment, numerous sources of information must commonly be integrated. Just as they do in research, quantitative and qualitative methods in assessment complement one another as they serve different purposes, address different needs, and exemplify different strengths and weaknesses. Choices among available methods of data collection and analysis will hinge on conceptual issues related to the intended use (as discussed above) and practical issues related to the assessment activity, which we discuss below. Focusing on deliberate purposes and defensible decisions provides a framework for prioritization and integration of assessment data regardless of format. As with mixed-methods research,59 robust mixed-methods assessment will ideally focus on each approach (i.e., quantitative and qualitative) separately—ensuring optimal methods, data, and human resources for each—and then triangulate the results.
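
One hypothetical way to make this separation-then-triangulation explicit is sketched below in Python (all names, thresholds, and decision wording are invented for illustration and are not a proposed standard): each strand is summarized on its own terms, and a final step compares them and flags discordance for further data collection.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QuantitativeStrand:
    mean_rating: float    # e.g., mean workplace-based assessment rating on a 1-5 scale
    reliability_ok: bool  # whether reliability was judged adequate for this sample

@dataclass
class QualitativeStrand:
    synthesis: str        # interpreted narrative produced by trained analysts
    themes: List[str]

def triangulate(quant: QuantitativeStrand, qual: QualitativeStrand) -> str:
    """Compare the two independently analyzed strands before a decision is made.
    The thresholds and wording here are arbitrary placeholders."""
    concerns = [t for t in qual.themes if t.startswith("concern:")]
    if not quant.reliability_ok:
        return "Quantitative strand weak: lean on the qualitative synthesis; gather more ratings."
    if quant.mean_rating >= 4.0 and concerns:
        return "Discordant strands: strong ratings but narrative concerns; collect more data."
    if quant.mean_rating >= 4.0:
        return "Concordant strands: proceed with routine formative feedback."
    return "Lower ratings: use the narrative themes to target specific feedback."

print(triangulate(
    QuantitativeStrand(mean_rating=4.3, reliability_ok=True),
    QualitativeStrand(
        synthesis="Strong organization; communication under time pressure raised as a concern.",
        themes=["organization", "concern: communication under time pressure"],
    ),
))
```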

Both the collection of qualitative data and the analysis of data once collected tend to demand more time, more energy, and vastly different skills than do quantitative assessments. Crafting and recording an insightful narrative typically takes longer than marking a checklist or rating scale, and narrative quality may suffer if prompts are preceded by numerous other questions (i.e., assessor fatigue). Meaningful narratives (just like accurate quantitative ratings) also require sufficient observation of performance-in-context, which time-pressed assessors may be tempted to shortchange. Done properly, data collection efforts will yield a large volume of raw narratives that must then be distilled and synthesized into meaningful, actionable insights. Such sense making requires people with qualitative analysis skills, context- and content-specific knowledge, and adequate time. Programs considering the incorporation of qualitative assessments will need to take into account such resource requirements, which might include skill development for assessors and analysts.

All assessments are only as robust as the quality of the data and interpretations, and qualitative assessments are susceptible to quality issues that are different from those of quantitative measures. First, there is inherently greater variability in the amount and content of data; rich narrative comments may be saturated with insights, whereas shallow comments are harder to interpret, and narrative length does not necessarily reflect richness or utility. Factors such as assessor training, motivation, time pressure, and fatigue can influence the quality of comments. Second, both raw narratives and synthesized interpretations will inevitably reflect the idiosyncrasies of assessors and analysts, respectively. For example, assessors may veil their comments to mask their identity or may defer strong indictments to avoid passing judgment.60,61 Indeed, research suggests that qualitative assessment narratives serve multiple social purposes and that assessors use coded language that is difficult to interpret.30 Finally, it may be more difficult to identify problems in the quality of qualitative data (such as shallow comments, or failure to record negative impressions60) in contrast with quantitative assessment, for which measures such as reliability, halo effect, missing data, and discrimination can be readily calculated.
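
By way of contrast, the kind of quantitative quality check mentioned here really can be computed in a few lines; the Python sketch below (using made-up ratings rather than data from any study) implements Cohen’s kappa, a chance-corrected index of interrater agreement, directly from its definition.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same observations:
    kappa = (observed agreement - expected agreement) / (1 - expected agreement)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Invented pass/borderline/fail judgments from two assessors of the same six encounters.
a = ["pass", "pass", "fail", "pass", "borderline", "pass"]
b = ["pass", "fail", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # -> 0.33
```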

Qualitative analyses are not intended to be generalizable in the usual sense19,62 but, rather, to accurately reflect context-bound impressions that may transfer to new situations. Sampling is thus intentionally purposive rather than random, typically aims for thematic sufficiency, and does not necessarily require large samples. However, sampling must be guided by relevant theories, responsive to accumulating evidence, and in most cases attentive to maximizing the variability in data.
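
The logic of sampling toward thematic sufficiency can be caricatured in code; the Python sketch below (the stopping rule, `patience` parameter, and example themes are all hypothetical, and in practice the judgment belongs to the analyst, not an algorithm) keeps adding observations until several in a row contribute no new themes.

```python
from typing import Iterable, List, Set

def sample_until_sufficient(observations: Iterable[Set[str]], patience: int = 3) -> List[Set[str]]:
    """Keep adding observations (each summarized as the set of themes an analyst saw
    in it) until `patience` consecutive observations contribute no new theme. A crude
    stand-in for a judgment of thematic sufficiency, not an actual stopping rule."""
    seen: Set[str] = set()
    collected: List[Set[str]] = []
    runs_without_new = 0
    for themes in observations:
        collected.append(themes)
        if themes - seen:
            seen |= themes
            runs_without_new = 0
        else:
            runs_without_new += 1
        if runs_without_new >= patience:
            break
    return collected

# Invented coded observations from successive clinical encounters.
stream = [{"communication"}, {"organization"}, {"communication"},
          {"time management"}, {"communication"}, {"organization"}, {"communication"}]
print(f"Stopped after {len(sample_until_sufficient(stream))} observations.")  # -> 7
```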

We underscore that the frameworks we set forth apply to qualitative assessment (i.e., the application of qualitative methods to integrate observations and make decisions regarding an individual person) rather than qualitative research (which uses data from multiple individuals and sources to generate potentially transferable insights and understanding). Qualitative research is well served by existing quality frameworks.18–21

Perhaps the greatest challenge—at least in the short term—will be to overcome the biases and preconceptions that question the relevance of qualitative assessment in comparison with the overwhelming dominance of the quantitative/psychometric model of assessment in education.63 We hope that the inroads made by qualitative research over the past decade, together with the pressing needs for an expanded set of assessment tools3,64 and for programmatic assessment,65 will hasten its acceptance. We believe that the ideas set forth in this article will also address this challenge.

Conclusions and next steps

Much remains to be learned about the use of qualitative assessment. Areas of uncertainty include identifying and addressing the training needs of both assessors and analysts, encouraging assessors to provide rich and balanced raw narratives, defining sampling strategies that provide the needed variety and depth of perspectives, supporting analysts in efficiently synthesizing large amounts of qualitative data (which might include the use of electronic tools), and effectively using final interpretations. We also need to better define the performance domains that most benefit from qualitative assessment, and we need to determine how to use qualitative interpretations in conjunction with other information (e.g., quantitative assessments) to optimally inform important decisions.8 Finally, although we now have robust frameworks for evaluating validity (as described above), frameworks for judging the merits of original educational assessment research (i.e., validation studies) are less well developed, even for quantitative assessments. Such frameworks might draw upon standards for testing66 and reporting67 clinical diagnostic tests,68 and for evaluating20,69,70 and reporting18 qualitative research.

We vigorously oppose the segregation of quantitative and qualitative assessment methods.71 Rather, we advocate a “methods-neutral” approach, in which a clearly stated purpose determines the nature of and approach to data collection and analysis. As Patton72 stated,

a consensus has gradually emerged that the important challenge is to appropriately match methods to purposes and inquiry questions, not to universally and unconditionally advocate any single methodological approach for all inquiry situations.

Although we urge the use of a contemporary validity framework when evaluating any assessment, quantitative or qualitative, we do not strongly favor either Messick or Kane. What matters most in validation is that evidence is strategically sought to inform a coherent argument that evaluates the defensibility of intended decisions.

References

1. Cameron WB. Informal Sociology: A Casual Introduction to Sociological Thinking. New York, NY: Random House; 1963.
2. Frank JR, Snell LS, Cate OT, et al. Competency-based medical education: Theory to practice. Med Teach. 2010;32:638–645.
3. Iobst WF, Sherbino J, Cate OT, et al. Competency-based medical education in postgraduate medical education. Med Teach. 2010;32:651–656.
4. Weinberger SE, Pereira AG, Iobst WF, Mechaber AJ, Bronze MS; Alliance for Academic Internal Medicine Education Redesign Task Force II. Competency-based education and training in internal medicine. Ann Intern Med. 2010;153:751–756.
5. ten Cate O. Trust, competence, and the supervisor’s role in postgraduate training. BMJ. 2006;333:748–751.
6. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Mastery learning for health professionals using technology-enhanced simulation: A systematic review and meta-analysis. Acad Med. 2013;88:1178–1186.
7. Ginsburg S, McIlroy J, Oulanova O, Eva K, Regehr G. Toward authentic clinical evaluation: Pitfalls in the pursuit of competency. Acad Med. 2010;85:780–786.
8. Schuwirth LW, van der Vleuten CP. A plea for new psychometric models in educational assessment. Med Educ. 2006;40:296–300.
9. Kuper A, Reeves S, Albert M, Hodges BD. Assessment: Do we need to broaden our methodological horizons? Med Educ. 2007;41:1121–1123.
10. Govaerts MJ, Van de Wiel MW, Schuwirth LW, Van der Vleuten CP, Muijtjens AM. Workplace-based assessment: Raters’ performance theories and constructs. Adv Health Sci Educ Theory Pract. 2013;18:375–396.
11. Hodges B. Assessment in the post-psychometric era: Learning to love the subjective and collective. Med Teach. 2013;35:564–568.
12. Bordage G. Moving the field forward: Going beyond quantitative–qualitative. Acad Med. 2007;82(10 suppl):S126–S128.
13. Driessen E, van der Vleuten C, Schuwirth L, van Tartwijk J, Vermunt J. The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: A case study. Med Educ. 2005;39:214–220.
14. Govaerts M, van der Vleuten CP. Validity in work-based assessment: Expanding our horizons. Med Educ. 2013;47:1164–1174.
15. Johnston B. Summative assessment of portfolios: An examination of different approaches to agreement over outcomes. Stud Higher Educ. 2004;29:395–412.
16. Eva KW, Hodges BD. Scylla or Charybdis? Can we navigate between objectification and judgement in assessment? Med Educ. 2012;46:914–919.
17. Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: A practical guide to Kane’s framework. Med Educ. 2015;49:560–575.
18. O’Brien BC, Harris IB, Beckman TJ, Reed DA, Cook DA. Standards for reporting qualitative research: A synthesis of recommendations. Acad Med. 2014;89:1245–1251.
19. Kuper A, Lingard L, Levinson W. Critically appraising qualitative research. BMJ. 2008;337:a1035.
20. Popay J, Rogers A, Williams G. Rationale and standards for the systematic review of qualitative literature in health services research. Qual Health Res. 1998;8:341–351.
21. Lincoln YS. Emerging criteria for quality in qualitative and interpretive research. Qual Inq. 1995;1:275–289.
22. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education and Macmillan; 1989:13–103.
23. Kane MT. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: Praeger; 2006:17–64.
24. Cook DA, Lineberry M. Consequences validity evidence: Evaluating the impact of educational assessments. Acad Med. 2016;91:785–795.
25. Norman GR, Van der Vleuten CP, De Graaff E. Pitfalls in the pursuit of objectivity: Issues of validity, efficiency and acceptability. Med Educ. 1991;25:119–126.
26. Cohen GS, Blumberg P, Ryan NC, Sullivan PL. Do final grades reflect written qualitative evaluations of student performance? Teach Learn Med. 1993;5:10–15.
27. Ginsburg S, Gold W, Cavalcanti RB, Kurabi B, McDonald-Blumer H. Competencies “plus”: The nature of written comments on internal medicine residents’ evaluation forms. Acad Med. 2011;86(10 suppl):S30–S34.
28. Myers KA, Zibrowski EM, Lingard L. A mixed-methods analysis of residents’ written comments regarding their clinical supervisors. Acad Med. 2011;86(10 suppl):S21–S24.
29. Zibrowski EM, Singh SI, Goldszmidt MA, et al. The sum of the parts detracts from the intended whole: Competencies and in-training assessments. Med Educ. 2009;43:741–748.
30. Ginsburg S, Regehr G, Lingard L, Eva KW. Reading between the lines: Faculty interpretations of narrative evaluation comments. Med Educ. 2015;49:296–306.
31. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Validity. In: Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014:11–31.
32. Hulley SB, Cummings SR, Browner WS, Grady D, Hearst N, Newman TB. Designing Clinical Research: An Epidemiologic Approach. 2nd ed. Philadelphia, Pa: Lippincott Williams & Wilkins; 2001.
33. Cook DA. When I say… validity. Med Educ. 2014;48:948–949.
34. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.e16.
35. Schuwirth LW, van der Vleuten CP. Programmatic assessment and Kane’s validity perspective. Med Educ. 2012;46:38–48.
36. Charmaz K. Constructing Grounded Theory: A Practical Guide Through Qualitative Analysis. London, UK: Sage; 2009.
37. Moss PA. Can there be validity without reliability? Educ Res. 1994;23(2):5–12.
38. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73.
39. Driessen E, van Tartwijk J, van der Vleuten C, Wass V. Portfolios in medical education: Why do they meet with mixed success? A systematic review. Med Educ. 2007;41:1224–1233.
40. Buckley S, Coleman J, Davison I, et al. The educational effects of portfolios on undergraduate student learning: A best evidence medical education (BEME) systematic review. BEME guide no. 11. Med Teach. 2009;31:282–298.
41. Tochel C, Haig A, Hesketh A, et al. The effectiveness of portfolios for post-graduate assessment and education: BEME guide no 12. Med Teach. 2009;31:299–318.
42. Mathers NJ, Challis MC, Howe AC, Field NJ. Portfolios in continuing medical education—effective and efficient? Med Educ. 1999;33:521–530.
43. Lonka K, Slotte V, Halttunen M, et al. Portfolios as a learning tool in obstetrics and gynaecology undergraduate training. Med Educ. 2001;35:1125–1130.
44. Smith K, Tillema H. Long-term influences of portfolios on professional development. Scand J Educ Res. 2001;45:183–203.
45. Tiwari A, Tang C. From process to outcome: The effect of portfolio assessment on student learning. Nurse Educ Today. 2003;23:269–277.
46. Carraccio C, Englander R. Evaluating competence using a portfolio: A literature review and Web-based application to the ACGME competencies. Teach Learn Med. 2004;16:381–387.
47. Rees C, Sheard C. Undergraduate medical students’ views about a reflective portfolio assessment of their communication skills learning. Med Educ. 2004;38:125–128.
48. Driessen EW, van Tartwijk J, Overeem K, Vermunt JD, van der Vleuten CP. Conditions for successful reflective use of portfolios in undergraduate medical education. Med Educ. 2005;39:1230–1235.
49. Driessen EW, Overeem K, van Tartwijk J, van der Vleuten CP, Muijtjens AM. Validity of portfolio assessment: Which qualities determine ratings? Med Educ. 2006;40:862–866.
50. Kjaer NK, Maagaard R, Wied S. Using an online portfolio in postgraduate training. Med Teach. 2006;28:708–712.
51. Driessen EW, Muijtjens AM, van Tartwijk J, van der Vleuten CP. Web- or paper-based portfolios: Is there a difference? Med Educ. 2007;41:1067–1073.
52. Hrisos S, Illing JC, Burford BC. Portfolio learning for foundation doctors: Early feedback on its use in the clinical workplace. Med Educ. 2008;42:214–223.
53. Dannefer EF, Bierer SB, Gladding SP. Evidence within a portfolio-based assessment program: What do medical students select to document their performance? Med Teach. 2012;34:215–220.
54. Donato AA, George DL. A blueprint for implementation of a structured portfolio in an internal medicine residency. Acad Med. 2012;87:185–191.
55. O’Sullivan AJ, Howe AC, Miles S, et al. Does a summative portfolio foster the development of capabilities such as reflective practice and understanding ethics? An evaluation from two medical schools. Med Teach. 2012;34:e21–e28.
56. Dannefer EF, Prayson RA. Supporting students in self-regulation: Use of formative feedback and portfolios in a problem-based learning setting. Med Teach. 2013;35:655–660.
57. Goodyear HM, Bindal T, Wall D. How useful are structured electronic portfolio templates to encourage reflective practice? Med Teach. 2013;35:71–73.
58. McEwen LA, Griffiths J, Schultz K. Developing and successfully implementing a competency-based portfolio assessment system in a postgraduate family medicine residency program. Acad Med. 2015;90:1515–1526.
59. Small ML. How to conduct a mixed methods study: Recent trends in a rapidly growing literature. Annu Rev Sociol. 2011;31:57–86.
60. Dudek NL, Marks MB, Regehr G. Failure to fail: The perspectives of clinical supervisors. Acad Med. 2005;80(10 suppl):S84–S87.
61. Watling CJ, Kenyon CF, Schulz V, Goldszmidt MA, Zibrowski E, Lingard L. An exploration of faculty perspectives on the in-training evaluation of residents. Acad Med. 2010;85:1157–1162.
62. Kuper A, Reeves S, Levinson W. An introduction to reading and appraising qualitative research. BMJ. 2008;337:a288.
63. Hodges B. Medical education and the maintenance of incompetence. Med Teach. 2006;28:690–696.
64. Lurie SJ, Mooney CJ, Lyness JM. Measurement of the general competencies of the Accreditation Council for Graduate Medical Education: A systematic review. Acad Med. 2009;84:301–309.
65. Schuwirth LW, Van der Vleuten CP. Programmatic assessment: From assessment of learning to assessment for learning. Med Teach. 2011;33:478–485.
66. Whiting PF, Rutjes AW, Westwood ME, et al; QUADAS-2 Group. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–536.
67. Bossuyt PM, Reitsma JB, Bruns DE, et al; STARD Group. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
68. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Acad Med. 2013;88:872–883.
69. Giacomini MK, Cook DJ. Users’ guides to the medical literature: XXIII. Qualitative research in health care A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA. 2000;284:357–362.
70. Malterud K. Qualitative research: Standards, challenges, and guidelines. Lancet. 2001;358:483–488.
71. Eva KW. Broadening the debate about quality in medical education research. Med Educ. 2009;43:294–296.
72. Patton MQ. Qualitative Research & Evaluation Methods: Integrating Theory and Practice. 4th ed. Thousand Oaks, Calif: Sage; 2015.
© 2016 by the Association of American Medical Colleges