A high level of professionalism, which encompasses both knowledge and behaviors, is an essential attribute for medical graduates and has become a key focus of medical curricula and accreditation boards. The teaching and assessment of this multifaceted construct, however, present significant challenges.1 Assessment strategies covering knowledge, moral attitudes, and values are particularly limited, and authors have highlighted the need for novel approaches.2
At the University of Melbourne, the doctor of medicine (MD) program incorporates the teaching of professionalism throughout the four years. Prior to the innovation described herein, we assessed student professionalism through students’ reflective essays in Year 2 and through their ratings on professional behavior checklists completed at the end of each year.
In addition to these assessment tasks—and in line with the requirements of graduate licensing bodies—our purpose was to introduce a pregraduation assessment tool for professionalism that (1) highlights the importance of professionalism to students, (2) introduces them to ethical and moral challenges they are likely to encounter as junior doctors, (3) promotes their thinking about issues related to professionalism, (4) provides a means for assessing knowledge of professionalism, and (5) enables feedback that guides students’ learning and helps to identify students in need of remediation. This assessment process needed to include both formative and summative components and be cost-effective, valid, and reliable.
We identified situational judgement tests (SJTs) as having the potential to meet all of our stated goals for our new process of assessing professionalism. SJTs, designed to assess judgement or knowledge of appropriate behavior in work-related settings, are widely used to rank job applicants. In recent years, SJTs have been used in the medical field to inform the selection of applicants to health care courses and postgraduate vocational training programs.3–5 Research has shown these tests to be effective in assessing a broad range of nonacademic factors relating to professional behavior including empathy, integrity, resilience, and teamwork.4
The process of adapting SJTs to assess aspects of professionalism in our student population involved developing test items, creating marking schemes, setting standards, and establishing feedback and evaluation processes.
After consulting with intern supervisors and reviewing curriculum documents and regulatory guidelines, a team of experts (academics and clinicians, including B.D.G., A.T.R., J.W., and R.C.O’B.) developed a bank of test items based on professionalism attributes relevant to the role of an intern (i.e., a first-year junior doctor). We tagged each test item according to its alignment with the five SJT domains adopted in the United Kingdom for selecting medical graduates to the Foundation Programme (effective communication, patient focus, working effectively as part of a team, coping with pressure, and commitment to professionalism)6 because these domains met the requirements for our students at their stage of training. We identified key ethical principles within each of these domains for inclusion in the test items (List 1). An expert panel reviewed and validated the items in the manner described by Lynn.7 In 2013, we pilot-tested the item format, content, delivery, and marking with a group of final-year student volunteers. On the basis of this successful trial, we formally implemented SJT assessment in 2014, using scannable answer sheets and electronic marking to simplify the administration of the tests.
List 1. Key Principles Covered in the Items in Three Situational Judgement Tests Undertaken by Final-Year Medical Students at the University of Melbourne, 2014
Knowledge of Principles
- Patient autonomy
- Informed consent
- Patient welfare
- Patient safety
- Personal safety
- Boundary setting
- Work–leisure balance
- Personal integrity
Application of Principles
- Dealing with uncertainty
- Dealing with error
- Dealing with complaints
- Dealing with gifts
- Dealing with conflicting demands
- Acknowledging limits of competence
- Addressing problems with colleagues
- Receiving feedback
- Giving feedback
- Communicating sensitively
- Communicating assertively
We developed three 40-item tests with content balanced across the five domains. Each test contained 25 ranking items consisting of a scenario followed by five possible responses which students ranked in order from most to least appropriate, plus 15 selection items consisting of a scenario followed by eight responses from which students selected the three most appropriate (see examples in Chart 1). We chose these two item types because they have been previously and extensively used in SJTs designed to inform applicant selection in medical settings.6,8
We administered the three tests as invigilated paper-based assessments to all final-year students during the second semester of 2014 at the end of each month (August, n = 318; September, n = 320; October, n = 316). Students were required to pass two of the three tests.
We modified the customary scoring methods for both item types (8–20 points for ranking items and 0–12 points for selection items)6 to allow scoring on a common scale (0–6 points per item). The score for each item reflected the level of agreement with the ideal response. For the ranking item example given in Chart 1, the correct response BCDEA scores 6 points; CBDEA would score 5 points (each option is in or next to its prescribed position, with only B and C reversed); CBDAE would score 4 points (the order is still relatively close to the prescribed response, with B and C reversed and E and A reversed); and ADECB would score 0 points, as it varies widely from the prescribed response, with no options in the correct position and the best and worst options, B and A, reversed. For the selection item example given in Chart 1, the correct response CEG scores 6 points, CAG would score 4 points (two correct responses), FCA would score 2 points (one correct response), and DHA would score 0 points (no correct responses). The possible scores for each test ranged from 0 to 240.
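The worked examples in the preceding paragraph are consistent with a simple linear rescaling of proximity-based scoring. The Python sketch below is an assumption, not the published marking algorithm: it supposes that a ranking item loses half a point for every position an option sits away from its place in the key, and that a selection item earns 2 points per correct option. The function names are hypothetical.

```python
def score_ranking(response: str, key: str) -> int:
    """Score a five-option ranking item on the 0-6 scale.

    Assumed rule: add up how far each option sits from its position in the
    key (0 to 12 for five options, always an even number), then subtract
    half of that total from 6.
    """
    if sorted(response) != sorted(key):
        raise ValueError("response must rank every option exactly once")
    displacement = sum(abs(response.index(opt) - key.index(opt)) for opt in key)
    return 6 - displacement // 2


def score_selection(response: str, key: str) -> int:
    """Score a choose-three selection item: 2 points per correct option (assumed)."""
    if len(response) != 3 or len(set(response)) != 3:
        raise ValueError("response must contain three distinct options")
    return 2 * len(set(response) & set(key))


# Examples quoted in the text
print(score_ranking("BCDEA", "BCDEA"))  # 6
print(score_ranking("CBDEA", "BCDEA"))  # 5
print(score_ranking("CBDAE", "BCDEA"))  # 4
print(score_ranking("ADECB", "BCDEA"))  # 0
print(score_selection("CAG", "CEG"))    # 4
```

Under these assumptions, 40 items at a maximum of 6 points each give the stated test range of 0 to 240.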
Our novel use of SJTs as summative assessments required a new standard-setting approach. We adopted the Angoff method of standard setting because it is the most widely researched and validated method used in an item-centered context. The Angoff method required us to determine minimal performance criteria for the two item types. For the ranking type questions, we deemed the key judgement to be the likelihood that a borderline student would rank the most appropriate option ahead of the least appropriate option. For the selection items, we deemed the key judgement to be how many of the three correct responses a borderline student would be likely to select. We converted each judgement to a numerical score on the 7-point (0–6) scale. Between 10 and 12 reviewers independently scored all items, and final judgements were pooled to create an indicative cut score for each of the three tests.
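As an illustration of how such judgements can be pooled, the following Python sketch averages the reviewers' 0–6 judgements for each item and sums the item means to give a test-level cut score. The use of an arithmetic mean is an assumption (the text says only that judgements were pooled), and the function name and data are hypothetical.

```python
from statistics import mean


def angoff_cut_score(judgements: list[list[float]]) -> float:
    """Pool Angoff judgements into an indicative cut score.

    judgements[r][i] is reviewer r's estimate (0-6) of the score a
    borderline student would obtain on item i.
    Assumed pooling rule: mean across reviewers, summed across items.
    """
    n_items = len(judgements[0])
    if any(len(row) != n_items for row in judgements):
        raise ValueError("every reviewer must rate every item")
    item_means = [mean([row[i] for row in judgements]) for i in range(n_items)]
    return sum(item_means)


# Hypothetical data: three reviewers, four items
ratings = [
    [4, 3, 5, 4],
    [5, 3, 4, 4],
    [3, 3, 6, 4],
]
print(angoff_cut_score(ratings))  # 16
```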
We provided personalized feedback reports after each test and delivered them via e-mail to students. Reports included each student’s overall score, the cut score, their score for each of the five domains, plus comparative scores and statistics for the whole cohort (see Supplemental Digital Appendix 1 at https://links.lww.com/ACADMED/A429). We also included generic but domain-specific feedback to help guide students’ further learning, and we organized remediation for students who performed below the cut score.
Additional learning activities
To help students prepare for the tests, we provided them with examples of SJT items in their subject guides, and we encouraged them to discuss the professionalism issues raised in the item scenarios with their clinical supervisors. We also asked students to generate example SJT items based on their experiences during their clinical placements.
We evaluated students’ perceptions of the SJTs via a questionnaire administered twice: after the delivery of Test 1 and Test 3. We designed the questionnaire items (scored on a five-point Likert scale) to assess the relevance of the tests to students’ current stage of training, students’ confidence in dealing with the professionalism-related issues covered in the tests, and their perception of the educational value or benefit of the tests and the related activities (see Supplemental Digital Appendixes 2 and 3 at https://links.lww.com/ACADMED/A429).
The results of the three SJTs (shown in Table 1) reveal that, despite a broad range of scores, almost all students met the cut score on each test occasion. The three students who failed to achieve the cut score for Test 1 met with a member of the academic staff for feedback and guidance, including review of the roles and responsibilities of a junior doctor. All achieved a pass score on Test 2 and Test 3.
Although we endeavored to create equivalent tests, overall performance varied with mean scores increasing between Test 1 (191) and Test 2 (196) and then falling for Test 3 (184). Cut scores followed a similar pattern, although the variation between tests was minor. Additional or confounding factors may have contributed to these variations. For example, test conditions changed slightly at Test 2. Although our pilot implementation indicated that a time allocation of 60 minutes for 40 questions was sufficient, some students failed to complete Test 1, others displayed random response patterns to the last few questions, and in the posttest evaluation 94 students (29.7%) reported that they needed more time. The test time was subsequently increased to 80 minutes, and all students completed Test 2 and Test 3 in the allotted time.
Similarly, rather than suggesting a decline in knowledge, the lower mean score for Test 3 may simply reflect the requirement to pass only two of the three tests. After the first two tests, the overwhelming majority of students had satisfied this assessment requirement, potentially reducing their motivation to perform well in the third. In fact, we noted clear evidence of systematic random responses from all four students who fell below the cut score for the third test.
After the first implementation of the SJT assessments, we reviewed our ability to meet our stated goals and purposes.
We critically reviewed all test items through a content validity process. Our scoring method ensured item equivalence and a fair and reproducible score, and our standard-setting procedure involved a novel and defensible process.
In relation to test versus real-world performance, we acknowledge that a paper-based test is well removed from the real professional and ethical dilemmas experienced by practicing clinicians. Nevertheless, the successful use elsewhere of SJTs in candidate selection indicates that the face and content validity of items can be high if the item scenarios and response options capture realistic work situations.3 We included junior doctors in our item-development process to ensure that all scenarios related to the role of a first-year doctor. The majority of the students who responded to the first evaluation questionnaire (82.0%; 259/316) agreed that the test items were relevant to students who were transitioning to internship.
We used a discrimination index (DI) to validate students’ responses for each item. Although the DI may be calculated in various ways, the simplest form compares, for each item, the proportion of students responding correctly in the upper quartile with the proportion in the lower quartile. This comparison produces an index that can be interpreted similarly to a correlation coefficient, with 0.2 commonly proposed as an acceptable threshold. Because the required judgements, and especially the ranking format, are more challenging than standard single-best-answer multiple-choice questions, we adopted a lower threshold of 0.1 and reviewed all items with DIs below this value for content, wording, or scoring issues. We also tested the validity and reliability of the questions by generating response frequency data for each question and comparing these data with the correct response to check for unexpected response patterns that might indicate a problem with the question. These analyses informed minor revisions of 10 to 12 items on each test in anticipation of their future use.
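One common form of the DI is the difference between the proportions of upper-quartile and lower-quartile students who answer an item correctly. The Python sketch below illustrates that form under stated assumptions: "correct" is supplied as a true/false flag per student (how partial credit on the 0–6 scale maps to "correct" is not specified in the text), quartiles are taken as the top and bottom quarter of students by total score, and ties are broken by input order. The function name and data are hypothetical.

```python
def discrimination_index(item_correct: list[bool], total_scores: list[float]) -> float:
    """Upper-minus-lower quartile discrimination index for one item."""
    n = len(total_scores)
    if n != len(item_correct) or n < 4:
        raise ValueError("need matching lists with at least four students")
    # Rank students by total test score, lowest first (stable for ties)
    order = sorted(range(n), key=lambda s: total_scores[s])
    group = n // 4
    lower, upper = order[:group], order[-group:]
    p_upper = sum(item_correct[s] for s in upper) / group
    p_lower = sum(item_correct[s] for s in lower) / group
    return p_upper - p_lower


# Hypothetical data: eight students
totals = [100, 120, 140, 160, 180, 200, 220, 230]
correct = [False, False, True, False, True, True, True, True]
di = discrimination_index(correct, totals)
print(di)        # 1.0
print(di < 0.1)  # False, so this item would not be sent for review
```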
Previous researchers have noted the challenges of calculating a suitable reliability measure for SJTs.9 The internal consistency measures (Cronbach alpha) for the three tests were lower than would be expected for summative tests; however, this finding accords with results from earlier research.9 We expect that both the relative heterogeneity of the underlying construct of professionalism and the different item formats (ranking and selection) contributed to the lower internal consistency measures. Given the potential advantages of SJT items over other item formats for the assessment of professionalism, we considered this decreased reliability a reasonable trade-off: we accepted lower statistical reliability in exchange for a more nuanced assessment of professionalism, recognizing that responses to moral or ethical challenges rarely conform to a simple binary (correct/incorrect) model.10
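For readers unfamiliar with the statistic, Cronbach alpha for a test of k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total scores. The Python sketch below implements this standard formula; the function name and the data are hypothetical and are not drawn from the study.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach alpha, where scores[s][i] is student s's score on item i."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two students")
    # Sample variance of each item across students
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of students' total scores (must be non-zero)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: four students, two items scored 0-6
data = [[6, 4], [4, 4], [2, 2], [6, 6]]
print(round(cronbach_alpha(data), 3))  # 0.914
```

Items that measure different facets of a heterogeneous construct covary less, which shrinks the total-score variance relative to the summed item variances and lowers alpha.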
Learning and assessment
Students were generally positive about the SJT program: Of the 316 students who completed the first questionnaire, 52.8% (n = 167) agreed that Test 1 was a useful learning experience, and 68.7% (n = 217) agreed that it raised their awareness of professionalism issues that junior doctors would likely face. We believe the assessment also positively influenced students’ learning and study behavior: 76.0% (241/317) reported that they had prepared specifically for Test 1 (through practice questions, discussion of SJT-related issues, and/or review of relevant learning materials). Additionally, of the 306 students who answered the question on the second questionnaire, 233 (76.2%) agreed that their own ward-based experiences helped them to understand how to deal with SJT scenarios, and those who agreed that they felt confident in dealing with the professionalism issues assessed in the tests increased from 69.3% on Test 1 (219/316) to 80.1% on Test 3 (246/307).
Students’ responses to the individualized feedback reports were mixed: Only 33.9% of those responding to the second questionnaire (104/307) agreed that the provided feedback was useful. A number of students informally indicated that they would prefer to receive specific item-level feedback rather than the more generic domain-level guidance provided.
We acknowledge that significant costs were involved in establishing the SJT-style assessments of professionalism. The largest related to the staff time spent generating the question bank, developing the marking and standard-setting procedures, and undertaking the content validity review. Additional time costs entailed developing marking sheets, feedback reports, and software tools to automate the marking and grading, as well as the generation and distribution of feedback. All of these will be substantially lower in subsequent years.
We have made several revisions to the SJT testing format since the first full implementation in 2014. In 2015 we modified a number of test items with low DIs, changed our testing protocol to require a pass score in all three tests, and provided more detailed item-level feedback to students. Future plans include implementing a proposed “red flag” system to identify students who demonstrate an increased tendency to rank or select the worst responses above the best and to provide these students with personalized guidance.
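One way such a flag might be computed for ranking items is sketched below in Python. This is purely illustrative: the counting rule, the function names, and the threshold are assumptions, since the proposed system is not specified in detail here.

```python
def worst_above_best_count(responses: list[str], keys: list[str]) -> int:
    """Count ranking items on which the keyed worst option was ranked above the keyed best."""
    count = 0
    for response, key in zip(responses, keys):
        best, worst = key[0], key[-1]
        if response.index(worst) < response.index(best):
            count += 1
    return count


def red_flag(responses: list[str], keys: list[str], threshold: int) -> bool:
    """Flag a student whose count reaches a (hypothetical) threshold."""
    return worst_above_best_count(responses, keys) >= threshold


# Hypothetical data: two ranking items
keys = ["BCDEA", "ABCDE"]
student = ["ADECB", "ABCDE"]
print(worst_above_best_count(student, keys))  # 1
print(red_flag(student, keys, threshold=2))   # False
```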
We also plan to expand the SJT item bank, especially to better balance test items across the key principles (List 1). We plan to review student-contributed items as part of this process—and to further develop the best of these for possible adoption. Use of some readministered items will enable consideration of test–retest performance and contribute to our ability to monitor reliability over time. Also, implementing SJTs earlier in the curriculum will align with recommendations for a longitudinal assessment of professionalism.1
Future research will investigate correlations between SJT performance and a range of outcome measures, including reports or notifications of unprofessional behavior of students during the course of the MD program. We will also explore whether the use of pregraduation SJTs has a beneficial impact on intern performance. Possible approaches include both surveying supervisors regarding interns’ professionalism and evaluating interns’ perceived preparedness for dealing with the complex ethical challenges they encounter in the modern medical workplace.
Acknowledgments: Frances Varian and Lara Cartwright (Warwick University, England), coauthors of The Situational Judgement Test at a Glance, provided valuable guidance for the adaptation of situational judgement tests (SJTs) from a selection tool to their use in teaching and assessment of professionalism in medical students. Stephen Farish (University of Melbourne, Australia) provided scoring and statistical assistance. The authors are grateful to the University of Melbourne medical students at both the Austin Clinical School and the Northern Clinical School who volunteered to participate in the SJT pilot study.
1. Goldie J. Assessment of professionalism: A consolidation of current thinking. Med Teach. 2013;35:e952–e956.
2. Wilkinson TJ, Wade WB, Knock LD. A blueprint to assess professionalism: Results of a systematic review. Acad Med. 2009;84:551–558.
3. Luschin-Ebengreuth M, Dimai HP, Ithaler D, Neges HM, Reibnegger G. Situational judgment test as an additional tool in a medical admission test: An observational investigation. BMC Res Notes. 2015;8:81.
4. Patterson F, Ashworth V, Zibarras L, Coan P, Kerrin M, O’Neill P. Evaluations of situational judgement tests to assess non-academic attributes in selection. Med Educ. 2012;46:850–868.
5. Lievens F, Coetsier P. Situational tests in student selection: An examination of predictive validity, adverse impact, and construct validity. Int J Sel Assess. 2002;10:245–257.
6. The Foundation Programme. FAQS. http://www.foundationprogramme.nhs.uk/pages/home/how-to-apply/SJT-FAQs. Updated 2016. Accessed December 22, 2016.
7. Lynn MR. Determination and quantification of content validity. Nurs Res. 1986;35:382–385.
9. Catano VM, Brochu A, Lamerson CD. Assessing the reliability of situational judgment tests used in high-stakes situations. Int J Sel Assess. 2012;20:333–346.
10. Parker M. Assessing professionalism: Theory and practice. Med Teach. 2006;28:399–403.