Fast, Easy, and Good: Assessing Entrustable Professional Activities in Psychiatry Residents With a Mobile App : Academic Medicine


Innovation Reports

Fast, Easy, and Good: Assessing Entrustable Professional Activities in Psychiatry Residents With a Mobile App

Young, John Q. MD, MPP, PhD; McClure, Matthew MD

Academic Medicine 95(10):p 1546-1549, October 2020. | DOI: 10.1097/ACM.0000000000003390



Medical educators who have implemented workplace-based assessments (WBAs) have encountered significant challenges. Common barriers have been the lack of time and competing demands (such as clinical workload) that interfere with the faculty members’ ability to complete these assessments.1 To facilitate more efficient capture, delivery, and aggregation of assessment data, mobile applications have been developed and tested in multiple specialties (e.g., pediatrics, surgical specialties, internal medicine) and with multiple frameworks (e.g., milestones, competencies, entrustment scales).2,3 Initial outcomes of these assessments have supported the feasibility and utility of mobile platforms for this purpose. At the same time, educators have also raised concerns that the competencies and milestones included in these assessments are too numerous, too granular, and/or too abstract for frontline faculty to use.4 Entrustable professional activities (EPAs) have emerged as an assessment framework that translates competencies into clinical practice.

We are not aware of any published literature on assessments that combine the EPA framework with a mobile platform, though there are numerous initiatives underway, as evidenced by abstracts at national and international meetings. Many of the available apps use milestones or competencies instead of EPAs as the assessment framework.5 Educators in surgical specialties have developed 2 related mobile-based assessment approaches—the O-SCORE (Ottawa Surgical Competency Operating Room Evaluation) and SIMPL (System for Improving and Measuring Procedural Learning). The O-SCORE assigns the level of supervision required (e.g., “I had to talk the trainee through…”) for each of the 9 components of any surgical procedure (e.g., case preparation, postoperative plan) but differs from the EPA framework in that it switches from a level-of-supervision scale to a yes/no determination of competence for the overall procedure.6 SIMPL is another mobile platform that uses a level-of-supervision scale (e.g., active help vs passive help) for the activity, but the activity is not necessarily an EPA.2 Warm and colleagues have published on large WBA datasets captured by mobile devices. This work has employed “observable professional activities” (i.e., often tasks within an EPA) and has not focused to date on the mobile platform itself.7

To address this gap, we designed and implemented a WBA that uses a mobile platform to assess EPAs based on the direct observation of residents. Here we describe that process as well as our evaluation of the feasibility and utility of the app.



This pilot occurred in a second-year psychiatry resident outpatient continuity clinic. Residents were paired with the same attending for all clinic sessions, which occurred 1 half day a week, 3 times a month. The pilot was initiated in response to faculty and resident complaints about the challenges of completing the paper-based direct observation assessments used at the time in the clinic. Ten faculty–resident dyads participated, representing a total of 8 faculty (2 faculty members attended in 2 clinics). The pilot test ran for 10 months from September 1, 2017 through June 30, 2018. We did not participate as faculty, and the pilot faculty were not aware of the outcomes being studied.

Design of the mobile app

We developed a mobile app for the iOS platform in Xcode, Apple’s suite of software development tools. The app was written in the Swift programming language. Data were uploaded to Google’s Firebase cloud service and emailed directly to residents via a Firebase Cloud Function. We designed the app to be as streamlined as possible to facilitate adoption. To this end, we adhered closely to the iOS Human Interface Guidelines, a set of documents published by Apple that aim to improve the user experience by making an app’s interface clear, intuitive, and consistent.

During the initial app design phases, we considered collecting a variety of types of data. The final data types included were: name of resident, the 13 end-of-training EPAs for psychiatry that emerged from a 3-stage national Delphi study,8 entrustment ratings, and corrective narrative feedback. Faculty selected the name of the resident and the relevant EPAs from a prepopulated list. They also selected the entrustment rating from a standard 5-item entrustment scale (co-treatment, direct full, direct partial, indirect, and independent). This rating was prospective, that is, based on the question, “What level of supervision does the resident require next time?”

An information button next to each EPA and anchor brought up a definition of the item when tapped. Faculty provided narrative feedback in response to the prompt, “One thing the trainee can do to advance to the next level is …” They entered their narrative feedback into a text field either by typing or using speech-to-text transcription. The app was designed such that an assessment could not be submitted unless all of these data types were entered. Once an assessment was submitted, an email containing the entrustment rating and feedback was automatically generated and sent to the resident, faculty member, and the residency program (see Supplemental Digital Appendix 1 for step-by-step screenshots of the app). In addition to collecting explicitly entered data, the app used a silent timing function to measure and save how long each assessment took to complete. The faculty member’s name and the date of assessment were also automatically uploaded.
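The submission rule described above (no assessment is sent unless every data type is present) can be sketched as follows. This is an illustrative Python model, not the app's Swift code: the class and field names are hypothetical, and the numeric values on the scale follow the coding the article uses later for its correlation analysis.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Entrustment(IntEnum):
    """Five supervision levels named in the text, ordered low to high."""
    CO_TREATMENT = 1
    DIRECT_FULL = 2
    DIRECT_PARTIAL = 3
    INDIRECT = 4
    INDEPENDENT = 5


@dataclass
class Assessment:
    # Field names are hypothetical; the article lists only the data types.
    resident: str
    epas: List[str]
    rating: Optional[Entrustment]
    feedback: str

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        missing = []
        if not self.resident.strip():
            missing.append("resident")
        if not self.epas:
            missing.append("epas")
        if self.rating is None:
            missing.append("rating")
        if not self.feedback.strip():
            missing.append("feedback")
        return missing

    def can_submit(self) -> bool:
        """Submission is allowed only when every required field is filled."""
        return not self.missing_fields()


draft = Assessment("Resident A", ["EPA 1"], None, "")
print(draft.can_submit())      # False
print(draft.missing_fields())  # ['rating', 'feedback']
```

Returning the list of missing fields, rather than a bare yes/no, would let an interface point the user to exactly what still needs to be entered.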

Faculty development

Faculty were sent written instructions and participated in a 30-minute one-on-one meeting where they installed and practiced using the app. In addition, all faculty attended 3 interactive, skills-based 1-hour training sessions: one on direct observation (e.g., how to support resident autonomy while observing), one on EPA-based assessment (including performance dimension and frame-of-reference training), and one on narrative feedback (e.g., practice writing feedback and then receiving rubric-based feedback from a peer).


Faculty were asked to use the mobile app to complete 1 assessment during each continuity clinic in which the resident saw at least 1 patient. We monitored use of the app and sent faculty email reminders to encourage them to continue using the app.


To assess the feasibility, utility, and validity of the intervention, we examined 3 outcomes: (1) utilization, (2) quality of comments, and (3) correlation between the entrustment level assigned to the resident and the resident’s experience.


Utilization.

We measured the number of assessments completed by each faculty member. Over the 10-month pilot, each dyad had approximately 27 sessions. We set a goal of 10 assessments per faculty member (those faculty who were part of more than 1 dyad had to complete 10 assessments per resident). We calculated mean time to complete, median time to complete, and the percentage of assessments that were completed in less than 120 seconds. The silent timer did not account for the time in which the faculty member was interrupted during an assessment. For this reason, completion times greater than 5 minutes (300 seconds) were discarded, as we assumed that this was not reflective of the actual time needed to complete the assessment.
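The timing summary described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name and the example timings are invented, the mean and median are computed after discarding times above 300 seconds, and the percentage under 120 seconds is taken over all assessments.

```python
from statistics import mean, median

MAX_SECONDS = 300   # longer timings are assumed to reflect interruptions
FAST_SECONDS = 120  # threshold for a quickly completed assessment


def summarize_times(seconds):
    """Summarize completion times (in seconds) for a list of assessments."""
    if not seconds:
        raise ValueError("no completion times supplied")
    # Discard timings above the cutoff before computing mean and median.
    kept = [s for s in seconds if s <= MAX_SECONDS]
    if not kept:
        raise ValueError("every completion time exceeded the cutoff")
    # Count fast assessments against all assessments, not only the kept ones.
    fast = sum(1 for s in seconds if s < FAST_SECONDS)
    return {
        "n_total": len(seconds),
        "n_excluded": len(seconds) - len(kept),
        "mean": mean(kept),
        "median": median(kept),
        "pct_fast": 100 * fast / len(seconds),
    }


# Invented timings for illustration only; these are not the pilot data.
print(summarize_times([40, 60, 80, 130, 400]))
```

Whether the excluded assessments belong in the denominator of the percentage is a design choice; this sketch keeps them in.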

Quality of comments.

We used a previously published method for evaluating comment quality.9,10 We characterized each comment, defined as a grouping of words focused on a unique concept or behavior, in 3 dimensions: content, valence or polarity (i.e., reinforcing vs corrective), and specificity (i.e., general comment vs behaviorally specific and actionable). We independently coded each comment, then compared assigned codes. Differences were resolved through consensus. The identities of the comment authors were blinded to us. We calculated the proportion of comments that were corrective and specific.
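The double-coding workflow can be illustrated with a short Python sketch. The data layout (one `(valence, specificity)` pair per comment) and the function names are assumptions made for this example; the consensus discussion itself is a human step and is not modeled.

```python
def disagreements(coder_a, coder_b):
    """Return the positions of comments that the two coders coded differently."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must code the same comments")
    return [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]


def share(codes, position, label):
    """Proportion of comments whose code at `position` equals `label`."""
    if not codes:
        raise ValueError("no coded comments supplied")
    return sum(1 for code in codes if code[position] == label) / len(codes)


# Invented codes for three comments; each code is (valence, specificity).
coder_a = [("corrective", "specific"), ("corrective", "general"),
           ("reinforcing", "specific")]
coder_b = [("corrective", "specific"), ("corrective", "specific"),
           ("reinforcing", "specific")]

print(disagreements(coder_a, coder_b))  # [1]

# After the coders settle comment 1 by consensus (here, coder_b's code is kept):
consensus = coder_b
print(share(consensus, 0, "corrective"))  # proportion of corrective comments
print(share(consensus, 1, "specific"))    # proportion of specific comments
```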

Correlation between entrustment level and the resident’s experience.

To address a dimension of validity—how the scores correlated with other variables—we examined how the entrustment scores varied with the resident’s experience. We converted the entrustment scale into numbers (e.g., co-treatment was assigned 1, direct full 2, etc.). We operationalized resident experience by using the day of the academic year (i.e., July 1 = 1, July 2 = 2), assuming that the resident gained experience with each day of the academic year. We then calculated the Pearson bivariate correlation in SPSS Statistics 26 (IBM, Armonk, New York). The Northwell institutional review board deemed this study exempt.
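The analysis was run in SPSS; the Python sketch below only illustrates the same steps (coding the scale as numbers, converting dates to the day of the academic year, and computing a Pearson correlation). The function names and the example observations are hypothetical, and codes 3 to 5 simply continue the numbering the text begins.

```python
import math
from datetime import date

# Numeric coding of the entrustment scale (1 = most supervision).
SCALE = {
    "co-treatment": 1,
    "direct full": 2,
    "direct partial": 3,
    "indirect": 4,
    "independent": 5,
}


def day_of_academic_year(d):
    """Return 1 for July 1, 2 for July 2, and so on through June 30."""
    start_year = d.year if d.month >= 7 else d.year - 1
    return (d - date(start_year, 7, 1)).days + 1


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Invented observations for illustration only; these are not the pilot data.
observations = [
    (date(2017, 9, 1), "direct full"),
    (date(2018, 1, 15), "direct partial"),
    (date(2018, 6, 30), "indirect"),
]
days = [day_of_academic_year(d) for d, _ in observations]
scores = [SCALE[label] for _, label in observations]
print(round(pearson_r(days, scores), 2))
```

Because the entrustment scale is ordinal, a rank-based coefficient such as Spearman's would be a reasonable alternative; the sketch follows the Pearson coefficient named in the text.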



Utilization

Faculty completed a total of 99 assessments (mean 9.9 per dyad, standard deviation [SD] = 7.6). The number of completed assessments varied widely among faculty. Six of the 10 dyads met the goal of at least 10 completed assessments. One dyad completed 5 assessments, and 3 dyads completed a single assessment each; 2 of these 3 dyads had the same faculty member. The mean time to complete an assessment was 76 seconds (SD = 50 seconds, median = 67 seconds). Fourteen assessments were recorded as taking longer than 300 seconds and were excluded from these calculations. Of all assessments, 72% (N = 71) were completed in less than 2 minutes.

Quality of comments

Of the assessments, 98% (N = 97) generated a single comment and 2% (N = 2) contained 2 distinct comments. Of the comments, 95% (N = 94) were behaviorally specific (e.g., “Assess for substance use routinely”), while 5% (N = 5) were general (e.g., “Experience with more outpatients will help solidify skills, which are already good”). Additionally, 75% (N = 74) of comments had an action verb as the first or second word in the feedback (e.g., “Screen for trauma,” “Ask about progress in therapy”). In terms of valence, 91% (N = 90) of narrative feedback was found to be corrective (e.g., “Assess for adverse effects”), while 4% (N = 4) was reinforcing (e.g., “Good use of open-ended questions”), and 5% (N = 5) contained both corrective and reinforcing comments.

Correlation between entrustment level and the resident’s experience

The entrustment scores given to residents correlated moderately with the day of the academic year, that is, with resident experience (r = 0.43, P < .001).

Next Steps

EPAs can be used to operationalize competency-based medical education. Mobile apps offer the possibility of efficiently completing and capturing feedback based on direct observation. Our innovation combines EPAs and a mobile app into a single package. The outcomes of our pilot indicate that, when designed with user-interface principles in mind, the EPA mobile app assessment can be completed quickly, generate high-quality feedback, and produce entrustment scores that improve as the resident gains experience. The app’s efficiency supports feasibility of its use within a training environment, and the comment quality and correlation with experience support its utility and validity.

By design, each assessment required only a single corrective comment, which reduced the burden of having to enter multiple comments into a smartphone. This differed from a paper-based direct observation tool used in psychiatry, which requested both reinforcing and corrective comments and generated 5 comments per assessment.10 Anecdotally, many faculty and residents in our pilot said they would prefer to have 2 comment boxes: 1 for corrective comments and 1 for reinforcing comments. Our next iteration of the app will likely include that feature, acknowledging the trade-off with efficiency (hassle and time to complete). The relative value of 1 comment vs 5 is unclear, as we do not know how many comments are optimal for learning.

We faced challenges with implementing our EPA app. The most notable challenge was the uneven use of the app by faculty. Three of the 8 faculty involved in the pilot completed 5 or fewer assessments during the intervention despite the trainings and multiple reminders, and 14 of the assessments took longer than 300 seconds to complete, indicating that interruptions occasionally delayed completion. Though this kind of variability in adoption is not uncommon in the implementation of a new technology, it does raise questions about what barriers to use may exist.

Looking forward, we plan to examine the enablers and barriers to adoption of our EPA app from an implementation science perspective. We plan to compare the user experience of the app with that of a paper-based tool. In addition, we plan to have the app data aggregated into learner-specific dashboards that could be used in the context of longitudinal coaching (formative purposes) as well as by the residency program’s clinical competency committee (summative purposes). Also, most residency programs use outside vendors (e.g., MedHub, New Innovations) for their evaluation programs. These vendors have mobile apps of varying quality that represent alternatives to developing or purchasing a stand-alone app. The advantage of using an app from such a vendor is easier incorporation of the captured data into existing clinical competency committee processes; the disadvantage is that the residency program typically has less control over the user interface and overall design. These trade-offs need to be explored through experimentation. Finally, we hope to pilot our EPA app with an integrated dashboard at multiple residency programs to gather additional evidence of its validity and usability, both for coaching and for judging competence.


References

1. Cheung WJ, Patey AM, Frank JR, Mackay M, Boet S. Barriers and enablers to direct observation of trainees’ clinical performance: A qualitative study using the theoretical domains framework. Acad Med. 2019;94:101–114.
2. Bohnen JD, George BC, Williams RG, et al.; Procedural Learning and Safety Collaborative (PLSC). The feasibility of real-time intraoperative performance assessment with SIMPL (System for Improving and Measuring Procedural Learning): Early experience from a multi-institutional trial. J Surg Educ. 2016;73:e118–e130.
3. Fitzpatrick R, Paterson NR, Watterson J, Seabrook C, Roberts M. Development and implementation of a mobile version of the O-SCORE assessment tool and case log for competency-based assessment in urology residency training: An initial assessment of utilization and acceptance among residents and faculty. Can Urol Assoc J. 2019;13:45–50.
4. Malone K, Supri S. A critical time for medical education: The perils of competence-based reform of the curriculum. Adv Health Sci Educ Theory Pract. 2012;17:241–246.
5. Hicks PJ, Margolis MJ, Carraccio CL, et al.; PMAC Module 1 Study Group. A novel workplace-based assessment for competency-based decisions and learner feedback. Med Teach. 2018;40:1143–1150.
6. Saliken D, Dudek N, Wood TJ, MacEwan M, Gofton WT. Comparison of the Ottawa Surgical Competency Operating Room Evaluation (O-SCORE) to a single-item performance score. Teach Learn Med. 2019;31:146–153.
7. Warm EJ, Held JD, Hellmann M, et al. Entrusting observable practice activities and milestones over the 36 months of an internal medicine residency. Acad Med. 2016;91:1398–1405.
8. Young JQ, Hasser C, Hung EK, et al. Developing end-of-training entrustable professional activities for psychiatry: Results and methodological lessons. Acad Med. 2018;93:1048–1054.
9. Lockyer JM, Sargeant J, Richards SH, Campbell JL, Rivera LA. Multisource feedback and narrative comments: Polarity, specificity, actionability, and CanMEDS roles. J Contin Educ Health Prof. 2018;38:32–40.
10. Young JQ, Sugarman R, Holmboe E, O’Sullivan PS. Advancing our understanding of narrative comments generated by direct observation tools: Lessons from the psychopharmacotherapy-structured clinical observation. J Grad Med Educ. 2019;11:570–579.


Copyright © 2020 by the Association of American Medical Colleges