Applying Messick’s Framework to Qualitative Assessment: Five Sources of Evidence
Current standards31 endorse the model first espoused by Messick22 in which the data to support (or refute) the validity or defensibility of score interpretations can be broadly classified as evidence of content, response process, internal structure, relationships to other variables, and consequences. Importantly, these sources of evidence are not different types of validity but, rather, different categories of evidence that inform judgments about validity. Evidence should be sought from several sources to support any given interpretation. Table 2 contains definitions for these five sources, together with examples of evidence that might be collected to support a qualitative assessment.
To summarize briefly:
- Content evidence evaluates the “relationship between the content of a test and the construct it is intended to measure.”31 (p14) For a qualitative assessment, content evidence might include the method for selecting specific prompts (questions), the wording of prompts and instructions, and the sampling strategy (often deliberate or purposeful, to explore specific perspectives; and iterative, to tailor the duration and focus of data collection). The qualitative research concept of saturation36—continuing purposeful, iterative data collection and analysis until additional observations do not suggest new themes—provides a useful theoretical sampling end point, although in practice “sufficient” rather than complete saturation will usually be the goal.
- Response process evidence is formally defined as “the fit between the construct and the detailed nature of performance … actually engaged in.”31 (p15) In more practical terms it reflects the processes that link the observation itself to the record (answer, rating, narrative) documenting that observation, such as the assessor’s interpretations of events (i.e., mental processes) that underlie his or her narrative comments or the system by which such comments are recorded (e.g., clerical or computer processes). Evidence might show that assessors follow instructions, that narratives offer rich data and differ between learners for the same assessor, that assessors’ reflexivity (such as their relationships with learners and context) has been considered, or that the computer interface allows extended narratives.
- Internal structure reflects the relationship among data elements in the same assessment—their coherence (or lack thereof), and the rationale and method for synthesizing these elements into a meaningful message. Evidence to support qualitative assessment might explore the triangulation among different data elements (e.g., narratives collected from different assessors or different contexts), the nature of the analysis itself (e.g., iteration, and often additional data collection, until saturation), or the reflexivity of the analysts. For example, educators might highlight discordant narratives and seek additional data or a more nuanced interpretation that accounts for this seeming inconsistency.37
- Whereas internal structure looks at associations among elements within the assessment and the approach to synthesis, relationships with other variables examines the associations between the final synthesis and other information sources outside the assessment under study. Application of the qualitative research concepts of triangulation (here, with external data sources) and transferability (e.g., to other contexts or future or past performances) can supply evidence of these relationships.
- Finally, evidence of consequences looks at the actual impact of the assessment and the associated decisions and actions.24 The collection of such evidence might use qualitative or quantitative research methods to explore learner and instructor impressions of the assessment and how it influenced their lives, objective evidence of intended and unintended consequences, or the degree to which decisions resonate with other stakeholders.
Applying Kane’s Framework to Qualitative Assessment: Four Validation Inferences
By contrast with Messick’s focus on sources of evidence, Kane’s framework focuses on key inferences in the validity argument (see Table 3). As some of us recently summarized elsewhere17:
essentially, Kane traces an assessment from the Scoring of a single observation (e.g., multiple-choice exam question, skill station, clinical observation, or portfolio item), to using the observation score(s) to generate an overall test score representing performance in the test setting (Generalization), to drawing an inference regarding what the test score might imply for real-life performance (Extrapolation), and then to interpreting this information and making a decision (Implications). Each phase in this process represents an inference laden with several assumptions.
By focusing on the argument itself rather than sources of evidence, Kane’s approach applies equally well to quantitative assessments, qualitative assessments, and programs of assessment.35 Kane emphasizes the imperative to articulate up front both the intended uses and decisions and also the evidence needed to support such decisions (the “interpretation/use argument”),38 just as a researcher would prospectively identify the research question and outline the research data required to answer the question.
The first inference—scoring—concerns the generation of an insightful and accurate response (e.g., narrative comment) based on one or more observations of performance. Data to support the scoring inference might include the wording of questions or prompts inviting a narrative response, the richness of the raw narrative, the credibility of the observers, and the use of “thick description” quotes or images in the final report.
The next inference is generalization, in which individual data elements (e.g., comments) are synthesized into a coherent, insightful, and accurate judgment regarding performance in the test setting. Qualitative assessments are not intended to be generalizable in the usual sense19; rather, in qualitative assessment educators
continually revis[e] initial interpretations until they account for all of the available evidence. Inconsistency in students’ performance across tasks … becomes an empirical puzzle to be solved by searching for a more comprehensive or elaborated interpretation.37
The generalization inference would be supported by rigorous methods for sampling and data analysis, including purposeful sampling, iterative and responsive data collection (ideally continuing until saturation), and triangulation (deliberate selection and synthesis of data from complementary sources). The training and experience of analysts would also inform this inference.
Extrapolation extends the interpretations beyond the test setting and claims relevance to real-life situations and future performance. Extrapolation evidence comes primarily from two sources: steps taken to ensure that the test or assessment reflects the most important aspects of real-life performance, and empiric analyses that evaluate (quantitatively or qualitatively) the relationship between test performance and real-world performance. Note that both the test-setting assessment data and the corresponding real-world data could be quantitative, qualitative, or both.
Finally, the implications inference explores the link between claims for real-world performance and the decisions and actions based thereon. This step evaluates the intended or unintended consequences of assessment-informed decisions and, in most respects, parallels the consequences evidence discussed above.
Application to an Existing Assessment Approach
We conclude by using published evidence to show how both frameworks can apply to a prototypical qualitative assessment approach, the learning portfolio.13,15,39–58 We selected this assessment approach because it is commonly used and has been the object of substantial research. Our intent in presenting evidence is to illustrate rather than be comprehensive.
Definition and purpose
While there are many definitions of a learning portfolio, for our purposes it is a collection of information (“artifacts”) used to facilitate and measure learning that includes a narrative, reflective component.41 Learning portfolios (hereafter, “portfolios”) are atypical among assessments in that learners are usually their own assessors (i.e., they select and comment on many, if not all, of the artifacts), and learners also frequently contribute to the analysis (i.e., synthesis during the reflective component) along with one or more faculty members.42,56 Since the first step in validation is to define the intended use, in the following exercise we will hypothetically plan to use the portfolio to support a program director in providing formative feedback to a resident. Some of the evidence we cite derives from studies of medical students, nurses, or practicing physicians, but we do not emphasize these distinctions here, since our primary purpose is to illustrate evidence and inferences of validation. In evaluating this (or any other) assessment for a real application, it would be essential to appraise the relevance of evidence deriving from populations and contexts different from our own (what qualitative researchers would refer to as transferability).
Applying Messick’s framework
As detailed in Table 2, we find supportive evidence from all five of Messick’s sources. Two areas of concern are that residents, in the assessor (data collector) role, could bias the assessment by selectively including only favorable data,13,42,44,53 and that the use of excessive structure might impede reflection.57 Both of these issues could be addressed through portfolio redesign (e.g., involving others in artifact selection). Evidence of consequences is quite extensive for portfolios, likely because they have been explicitly viewed as training interventions as well as assessment instruments.39–41 While generally favorable, fairness and overall value have been questioned.43–45,47,52,55
Applying Kane’s framework
Applying Kane’s approach requires that we not only identify the intended use (i.e., to provide formative feedback, as noted above) but that we also articulate the interpretation/use argument38 (i.e., the evidence and interpretations that would support such use) before embarking on a search for evidence.17 To support the proposed use, for the scoring inference we would hope to find evidence that those collecting information (typically the residents) had proper training and/or instructions, that they were allowed sufficient time to collect their artifacts, and that the artifacts accurately reflect the original activity being assessed and contain sufficient information to allow insightful interpretations. To support generalization, we would hope to find evidence that the artifacts represent a variety of meaningful activities and collectively form a coherent, accurate, and complete picture of the resident across multiple clinical situations. Ideally, artifacts suggesting conflicting or incomplete interpretations would prompt additional data collection or a narrative explanation from the resident. We would further expect that those conducting the interpretive analysis (e.g., mentors or program directors) have appropriate training or experience, that the interpretation integrates but does not rely exclusively on the learner’s reflections, and that the analysis process is transparent and rigorous. This might include explicit consideration of personal idiosyncrasies when providing feedback (analyst reflexivity). To support extrapolation, we would hope that the final synthesis is coherent with other information (e.g., qualitative or quantitative assessments) about the resident (triangulation) and that those involved in the original activities would agree with this final synthesis (member check). 
Finally, to support the proposed implications, we would hope to find evidence that relevant stakeholders (e.g., the resident and program director) agree with the decision (in this case, specific feedback) based on these narratives, and that actions based on this feedback have the desired effect without unanticipated negative impact.
We evaluate the proposed interpretation/use argument using Kane’s framework (Table 3) by reconsidering, reorganizing, and prioritizing the evidence reported using Messick’s framework. Support for the inferences of scoring, generalization, and implications is generally favorable aside from issues regarding sampling, reflexivity, structure, and overall value. Extrapolation evidence is lacking.
Further evidence and summary
To further reinforce this illustration, we refer readers to one study13 that deliberately sought evidence to support the robustness (trustworthiness) of a portfolio-based qualitative assessment system using the qualitative research standards of credibility and dependability (see Tables 2 and 3 for details about this study’s findings).
In summary, we can make a reasonable argument to support the use of portfolios for providing formative feedback, with the caveats noted above and the acknowledgment that all validation is context dependent. We note that using portfolios for another purpose, such as decisions regarding academic advancement, would require a fresh formulation and evaluation of the interpretation/use argument. Readers seeking another illustration may find our application of Kane’s framework to in-training assessments of interest.17
Challenges, Uncertainties, and Next Steps
In this article we advocate the increased use of qualitative assessments when warranted, and simultaneously advocate more robust judgments of the defensibility of inferences, decisions, and actions arising from their use. We illustrate how two contemporary frameworks for evaluating the validity of quantitative assessments can be readily applied to qualitative assessments. Such frameworks illuminate potential gaps in the defensibility of decisions grounded in the assessment results, and have the added benefit of offering a common lens and language to facilitate communication among quantitative and qualitative scholars.
Limitations of this work
This work has limitations. First, the “potential evidence” examples in Tables 2 and 3 are largely hypothetical, and when we sought actual evidence for a real assessment our search was not systematically comprehensive. Second, our effort is, to our knowledge, the first to describe a formal framework for evaluating the defensibility of purely qualitative assessments. We expect that over time this framework will mature, and that what we present herein will be extended or perhaps overturned. Nonetheless, we feel that this work represents a critical advance for the field and a starting point for debate and future research.
Practical issues in qualitative assessment
We do not suggest that qualitative assessment is without limitations or that it is appropriate for all constructs and contexts.16 In the messy reality of real-world assessment, numerous sources of information must commonly be integrated. Just as they do in research, quantitative and qualitative methods in assessment complement one another as they serve different purposes, address different needs, and exemplify different strengths and weaknesses. Choices among available methods of data collection and analysis will hinge on conceptual issues related to the intended use (as discussed above) and practical issues related to the assessment activity, which we discuss below. Focusing on deliberate purposes and defensible decisions provides a framework for prioritization and integration of assessment data regardless of format. As with mixed-methods research,59 robust mixed-methods assessment will ideally focus on each approach (i.e., quantitative and qualitative) separately—ensuring optimal methods, data, and human resources for each—and then triangulate the results.
Both the collection of qualitative data and the analysis of data once collected tend to demand more time, more energy, and vastly different skills than do quantitative assessments. Crafting and recording an insightful narrative typically takes longer than marking a checklist or rating scale, and narrative quality may suffer if prompts are preceded by numerous other questions (i.e., assessor fatigue). Meaningful narratives (just like accurate quantitative ratings) also require sufficient observation of performance-in-context, which time-pressed assessors may be tempted to shortchange. Done properly, data collection efforts will yield a large volume of raw narratives that must then be distilled and synthesized into meaningful, actionable insights. Such sense making requires people with qualitative analysis skills, context- and content-specific knowledge, and adequate time. Programs considering the incorporation of qualitative assessments will need to take into account such resource requirements, which might include skill development for assessors and analysts.
All assessments are only as robust as the quality of the data and interpretations, and qualitative assessments are susceptible to quality issues that are different from those of quantitative measures. First, there is inherently greater variability in the amount and content of data; rich narrative comments may be saturated with insights, whereas shallow comments are harder to interpret, and narrative length does not necessarily reflect richness or utility. Factors such as assessor training, motivation, time pressure, and fatigue can influence the quality of comments. Second, both raw narratives and synthesized interpretations will inevitably reflect the idiosyncrasies of assessors and analysts, respectively. For example, assessors may veil their comments to mask their identity or may defer strong indictments to avoid passing judgment.60,61 Indeed, research suggests that qualitative assessment narratives serve multiple social purposes and that assessors use coded language that is difficult to interpret.30 Finally, it may be more difficult to identify problems in the quality of qualitative data (such as shallow comments, or failure to record negative impressions60) in contrast with quantitative assessment, for which measures such as reliability, halo effect, missing data, and discrimination can be readily calculated.
Qualitative analyses are not intended to be generalizable in the usual sense19,62 but, rather, to accurately reflect context-bound impressions that may transfer to new situations. Sampling is thus intentionally purposive rather than random, typically aims for thematic sufficiency, and does not necessarily require large samples. However, sampling must be guided by relevant theories, responsive to accumulating evidence, and in most cases attentive to maximizing the variability in data.
We underscore that the frameworks we set forth apply to qualitative assessment (i.e., the application of qualitative methods to integrate observations and make decisions regarding an individual person) rather than qualitative research (which uses data from multiple individuals and sources to generate potentially transferable insights and understanding). Qualitative research is well served by existing quality frameworks.18–21
Perhaps the greatest challenge—at least in the short term—will be to overcome the biases and preconceptions that question the relevance of qualitative assessment, given the overwhelming dominance of the quantitative/psychometric model of assessment in education.63 We hope that the inroads made by qualitative research over the past decade, together with the pressing needs for an expanded set of assessment tools3,64 and for programmatic assessment,65 will hasten the acceptance of qualitative assessment. We believe that the ideas set forth in this article will also help address this challenge.
Conclusions and next steps
Much remains to be learned about the use of qualitative assessment. Areas of uncertainty include identifying and addressing the training needs of both assessors and analysts, encouraging assessors to provide rich and balanced raw narratives, defining sampling strategies that provide the needed variety and depth of perspectives, supporting analysts in efficiently synthesizing large amounts of qualitative data (which might include the use of electronic tools), and effectively using final interpretations. We also need to better define the performance domains that most benefit from qualitative assessment, and we need to determine how to use qualitative interpretations in conjunction with other information (e.g., quantitative assessments) to optimally inform important decisions.8 Finally, although we now have robust frameworks for evaluating validity (as described above), frameworks for judging the merits of original educational assessment research (i.e., validation studies) are less well developed, even for quantitative assessments. Such frameworks might draw upon standards for testing66 and reporting67 clinical diagnostic tests,68 and for evaluating20,69,70 and reporting18 qualitative research.
We vigorously oppose the segregation of quantitative and qualitative assessment methods.71 Rather, we advocate a “methods-neutral” approach, in which a clearly stated purpose determines the nature of and approach to data collection and analysis. As Patton72 stated,
a consensus has gradually emerged that the important challenge is to appropriately match methods to purposes and inquiry questions, not to universally and unconditionally advocate any single methodological approach for all inquiry situations.
Although we urge the use of a contemporary validity framework when evaluating any assessment, quantitative or qualitative, we do not strongly favor either Messick or Kane. What matters most in validation is that evidence is strategically sought to inform a coherent argument that evaluates the defensibility of intended decisions.
1. Cameron WB. Informal Sociology: A Casual Introduction to Sociological Thinking. New York, NY: Random House; 1963.
2. Frank JR, Snell LS, Cate OT, et al. Competency-based medical education: Theory to practice. Med Teach. 2010;32:638–645.
3. Iobst WF, Sherbino J, Cate OT, et al. Competency-based medical education in postgraduate medical education. Med Teach. 2010;32:651–656.
4. Weinberger SE, Pereira AG, Iobst WF, Mechaber AJ, Bronze MS; Alliance for Academic Internal Medicine Education Redesign Task Force II. Competency-based education and training in internal medicine. Ann Intern Med. 2010;153:751–756.
5. ten Cate O. Trust, competence, and the supervisor’s role in postgraduate training. BMJ. 2006;333:748–751.
6. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Mastery learning for health professionals using technology-enhanced simulation: A systematic review and meta-analysis. Acad Med. 2013;88:1178–1186.
7. Ginsburg S, McIlroy J, Oulanova O, Eva K, Regehr G. Toward authentic clinical evaluation: Pitfalls in the pursuit of competency. Acad Med. 2010;85:780–786.
8. Schuwirth LW, van der Vleuten CP. A plea for new psychometric models in educational assessment. Med Educ. 2006;40:296–300.
9. Kuper A, Reeves S, Albert M, Hodges BD. Assessment: Do we need to broaden our methodological horizons? Med Educ. 2007;41:1121–1123.
10. Govaerts MJ, Van de Wiel MW, Schuwirth LW, Van der Vleuten CP, Muijtjens AM. Workplace-based assessment: Raters’ performance theories and constructs. Adv Health Sci Educ Theory Pract. 2013;18:375–396.
11. Hodges B. Assessment in the post-psychometric era: Learning to love the subjective and collective. Med Teach. 2013;35:564–568.
12. Bordage G. Moving the field forward: Going beyond quantitative–qualitative. Acad Med. 2007;82(10 suppl):S126–S128.
13. Driessen E, van der Vleuten C, Schuwirth L, van Tartwijk J, Vermunt J. The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: A case study. Med Educ. 2005;39:214–220.
14. Govaerts M, van der Vleuten CP. Validity in work-based assessment: Expanding our horizons. Med Educ. 2013;47:1164–1174.
15. Johnston B. Summative assessment of portfolios: An examination of different approaches to agreement over outcomes. Stud Higher Educ. 2004;29:395–412.
16. Eva KW, Hodges BD. Scylla or Charybdis? Can we navigate between objectification and judgement in assessment? Med Educ. 2012;46:914–919.
17. Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: A practical guide to Kane’s framework. Med Educ. 2015;49:560–575.
18. O’Brien BC, Harris IB, Beckman TJ, Reed DA, Cook DA. Standards for reporting qualitative research: A synthesis of recommendations. Acad Med. 2014;89:1245–1251.
19. Kuper A, Lingard L, Levinson W. Critically appraising qualitative research. BMJ. 2008;337:a1035.
20. Popay J, Rogers A, Williams G. Rationale and standards for the systematic review of qualitative literature in health services research. Qual Health Res. 1998;8:341–351.
21. Lincoln YS. Emerging criteria for quality in qualitative and interpretive research. Qual Inq. 1995;1:275–289.
22. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education and Macmillan; 1989:13–103.
23. Kane MT. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: Praeger; 2006:17–64.
24. Cook DA, Lineberry M. Consequences validity evidence: Evaluating the impact of educational assessments. Acad Med. 2016;91:785–795.
25. Norman GR, Van der Vleuten CP, De Graaff E. Pitfalls in the pursuit of objectivity: Issues of validity, efficiency and acceptability. Med Educ. 1991;25:119–126.
26. Cohen GS, Blumberg P, Ryan NC, Sullivan PL. Do final grades reflect written qualitative evaluations of student performance? Teach Learn Med. 1993;5:10–15.
27. Ginsburg S, Gold W, Cavalcanti RB, Kurabi B, McDonald-Blumer H. Competencies “plus”: The nature of written comments on internal medicine residents’ evaluation forms. Acad Med. 2011;86(10 suppl):S30–S34.
28. Myers KA, Zibrowski EM, Lingard L. A mixed-methods analysis of residents’ written comments regarding their clinical supervisors. Acad Med. 2011;86(10 suppl):S21–S24.
29. Zibrowski EM, Singh SI, Goldszmidt MA, et al. The sum of the parts detracts from the intended whole: Competencies and in-training assessments. Med Educ. 2009;43:741–748.
30. Ginsburg S, Regehr G, Lingard L, Eva KW. Reading between the lines: Faculty interpretations of narrative evaluation comments. Med Educ. 2015;49:296–306.
31. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Validity. In: Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014:11–31.
32. Hulley SB, Cummings SR, Browner WS, Grady D, Hearst N, Newman TB. Designing Clinical Research: An Epidemiologic Approach. 2nd ed. Philadelphia, Pa: Lippincott Williams & Wilkins; 2001.
33. Cook DA. When I say… validity. Med Educ. 2014;48:948–949.
34. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.16.
35. Schuwirth LW, van der Vleuten CP. Programmatic assessment and Kane’s validity perspective. Med Educ. 2012;46:38–48.
36. Charmaz K. Constructing Grounded Theory: A Practical Guide Through Qualitative Analysis. London, UK: Sage; 2009.
37. Moss PA. Can there be validity without reliability? Educ Res. 1994;23(2):5–12.
38. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73.
39. Driessen E, van Tartwijk J, van der Vleuten C, Wass V. Portfolios in medical education: Why do they meet with mixed success? A systematic review. Med Educ. 2007;41:1224–1233.
40. Buckley S, Coleman J, Davison I, et al. The educational effects of portfolios on undergraduate student learning: A best evidence medical education (BEME) systematic review. BEME guide no. 11. Med Teach. 2009;31:282–298.
41. Tochel C, Haig A, Hesketh A, et al. The effectiveness of portfolios for post-graduate assessment and education: BEME guide no 12. Med Teach. 2009;31:299–318.
42. Mathers NJ, Challis MC, Howe AC, Field NJ. Portfolios in continuing medical education—effective and efficient? Med Educ. 1999;33:521–530.
43. Lonka K, Slotte V, Halttunen M, et al. Portfolios as a learning tool in obstetrics and gynaecology undergraduate training. Med Educ. 2001;35:1125–1130.
44. Smith K, Tillema H. Long-term influences of portfolios on professional development. Scand J Educ Res. 2001;45:183–203.
45. Tiwari A, Tang C. From process to outcome: The effect of portfolio assessment on student learning. Nurse Educ Today. 2003;23:269–277.
46. Carraccio C, Englander R. Evaluating competence using a portfolio: A literature review and Web-based application to the ACGME competencies. Teach Learn Med. 2004;16:381–387.
47. Rees C, Sheard C. Undergraduate medical students’ views about a reflective portfolio assessment of their communication skills learning. Med Educ. 2004;38:125–128.
48. Driessen EW, van Tartwijk J, Overeem K, Vermunt JD, van der Vleuten CP. Conditions for successful reflective use of portfolios in undergraduate medical education. Med Educ. 2005;39:1230–1235.
49. Driessen EW, Overeem K, van Tartwijk J, van der Vleuten CP, Muijtjens AM. Validity of portfolio assessment: Which qualities determine ratings? Med Educ. 2006;40:862–866.
50. Kjaer NK, Maagaard R, Wied S. Using an online portfolio in postgraduate training. Med Teach. 2006;28:708–712.
51. Driessen EW, Muijtjens AM, van Tartwijk J, van der Vleuten CP. Web- or paper-based portfolios: Is there a difference? Med Educ. 2007;41:1067–1073.
52. Hrisos S, Illing JC, Burford BC. Portfolio learning for foundation doctors: Early feedback on its use in the clinical workplace. Med Educ. 2008;42:214–223.
53. Dannefer EF, Bierer SB, Gladding SP. Evidence within a portfolio-based assessment program: What do medical students select to document their performance? Med Teach. 2012;34:215–220.
54. Donato AA, George DL. A blueprint for implementation of a structured portfolio in an internal medicine residency. Acad Med. 2012;87:185–191.
55. O’Sullivan AJ, Howe AC, Miles S, et al. Does a summative portfolio foster the development of capabilities such as reflective practice and understanding ethics? An evaluation from two medical schools. Med Teach. 2012;34:e21–e28.
56. Dannefer EF, Prayson RA. Supporting students in self-regulation: Use of formative feedback and portfolios in a problem-based learning setting. Med Teach. 2013;35:655–660.
57. Goodyear HM, Bindal T, Wall D. How useful are structured electronic portfolio templates to encourage reflective practice? Med Teach. 2013;35:71–73.
58. McEwen LA, Griffiths J, Schultz K. Developing and successfully implementing a competency-based portfolio assessment system in a postgraduate family medicine residency program. Acad Med. 2015;90:1515–1526.
59. Small ML. How to conduct a mixed methods study: Recent trends in a rapidly growing literature. Annu Rev Sociol. 2011;31:57–86.
60. Dudek NL, Marks MB, Regehr G. Failure to fail: The perspectives of clinical supervisors. Acad Med. 2005;80(10 suppl):S84–S87.
61. Watling CJ, Kenyon CF, Schulz V, Goldszmidt MA, Zibrowski E, Lingard L. An exploration of faculty perspectives on the in-training evaluation of residents. Acad Med. 2010;85:1157–1162.
62. Kuper A, Reeves S, Levinson W. An introduction to reading and appraising qualitative research. BMJ. 2008;337:a288.
63. Hodges B. Medical education and the maintenance of incompetence. Med Teach. 2006;28:690–696.
64. Lurie SJ, Mooney CJ, Lyness JM. Measurement of the general competencies of the Accreditation Council for Graduate Medical Education: A systematic review. Acad Med. 2009;84:301–309.
65. Schuwirth LW, Van der Vleuten CP. Programmatic assessment: From assessment of learning to assessment for learning. Med Teach. 2011;33:478–485.
66. Whiting PF, Rutjes AW, Westwood ME, et al; QUADAS-2 Group. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–536.
67. Bossuyt PM, Reitsma JB, Bruns DE, et al; STARD Group. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
68. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Acad Med. 2013;88:872–883.
69. Giacomini MK, Cook DJ. Users’ guides to the medical literature: XXIII. Qualitative research in health care A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA. 2000;284:357–362.
70. Malterud K. Qualitative research: Standards, challenges, and guidelines. Lancet. 2001;358:483–488.
71. Eva KW. Broadening the debate about quality in medical education research. Med Educ. 2009;43:294–296.
72. Patton MQ. Qualitative Research & Evaluation Methods: Integrating Theory and Practice. 4th ed. Thousand Oaks, Calif: Sage; 2015.
© 2016 by the Association of American Medical Colleges