Does Incorporating a Measure of Clinical Workload Improve Workplace-Based Assessment Scores? Insights for Measurement Precision and Longitudinal Score Growth From Ten Pediatrics Residency Programs

Park, Yoon Soo, PhD; Hicks, Patricia J., MD, MHPE; Carraccio, Carol, MD, MA; Margolis, Melissa, PhD; Schwartz, Alan, PhD for the PMAC Module 2 Study Group

doi: 10.1097/ACM.0000000000002381
The Emerging Learning Environment

Purpose This study investigates the impact of incorporating observer-reported workload into workplace-based assessment (WBA) scores on (1) psychometric characteristics of WBA scores and (2) measuring changes in performance over time using workload-unadjusted versus workload-adjusted scores.

Method Structured clinical observations and multisource feedback instruments were used to collect WBA data from first-year pediatrics residents at 10 residency programs between July 2016 and June 2017. Observers completed items in 8 subcompetencies associated with Pediatrics Milestones. Faculty and resident observers assessed workload using a sliding scale ranging from low to high; all item scores were rescaled to a 1–5 scale to facilitate analysis and interpretation. Workload-adjusted WBA scores were calculated at the item level using three different approaches, and aggregated for analysis at the competency level. Mixed-effects regression models were used to estimate variance components. Longitudinal growth curve analyses examined patterns of developmental score change over time.

Results On average, participating residents (n = 252) were assessed 5.32 times (standard deviation = 3.79) by different raters during the data collection period. Adjusting for workload yielded better discrimination of learner performance, and higher reliability, reducing measurement error by 28%. Projections in reliability indicated needing up to twice the number of raters when workload-unadjusted scores were used. Longitudinal analysis showed an increase in scores over time, with significant interaction between workload and time; workload also increased significantly over time.

Conclusions Incorporating a measure of observer-reported workload could improve the measurement properties and the ability to interpret WBA scores.

Y.S. Park is associate professor, Department of Medical Education, University of Illinois at Chicago College of Medicine, Chicago, Illinois.

P.J. Hicks is professor of clinical pediatrics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania.

C. Carraccio is vice president of competency-based assessment programs, American Board of Pediatrics, Chapel Hill, North Carolina.

M. Margolis is senior measurement scientist, National Board of Medical Examiners, Philadelphia, Pennsylvania.

A. Schwartz is Michael Reese Endowed Professor of Medical Education, Department of Medical Education, and research professor, Department of Pediatrics, University of Illinois at Chicago College of Medicine, Chicago, Illinois.

Funding/Support: This study was funded by the National Board of Medical Examiners (NBME) and the American Board of Pediatrics (ABP) Foundation. M. Margolis is an employee of the NBME. C. Carraccio is an employee of the ABP. Y.S. Park, P.J. Hicks, and A. Schwartz were supported in part by contracts between the funders and their respective institutions.

Other disclosures: None reported.

Ethical approval: This study was approved by the institutional review boards at the participating residency programs and the University of Illinois at Chicago.

Correspondence should be addressed to Yoon Soo Park, Department of Medical Education, College of Medicine, University of Illinois at Chicago, 808 S. Wood St., 963 CMET (MC 591), Chicago, IL 60612-7309; telephone: (312) 355-5406; Twitter: @YoonSooPark2.

Graduate medical education (GME) has long used workplace-based assessments (WBAs) to measure learner performance in the authentic clinical environment.1–3 With recent mandates for milestone reporting by the Accreditation Council for Graduate Medical Education, WBAs measuring the developmental progress of learners have become increasingly important.4,5 However, residency programs and educators face the complex challenge of assigning ratings and synthesizing learner performance data from a workplace with varying day-to-day workload demands.

Workload refers to patient demands in relation to clinical resources, including the time and effort required from clinicians and hospital staff.6,7 More specifically, a practitioner’s workload can be influenced by perceived or experienced factors such as supervision level, number of providers per patient, experience of providers, patient or family needs, and patient acuity and complexity.8–10 From a health care microsystems framework, managing clinical workload is critical for safe medical care and improved patient outcomes, and as such, assessment of trainee–provider performance in varying levels of workload is necessary to determine trainee assignments.11

WBAs generally require multiple raters to provide judgments of resident performance through observation of individual patient care encounters or across multiple observations of learners throughout a rotation or longitudinal learning activity. Structured clinical observations (SCOs) are examples of observations of behaviors seen during one patient care activity; multisource feedback (MSF) or end-of-rotation evaluations are used to record observations of behaviors seen over a longer period of time, including observations of behaviors across more than one type of patient care or learning activity.1,2

Although WBAs are used widely in GME, they have also been challenged because of concerns over the reliability of WBA scores and potential rater bias affecting the response process and validity of WBA scores.12–14 Recent efforts to understand the validity of WBA scores in a competency-based framework include factors that contribute to their reproducibility and factorial structure, thus demonstrating sufficient reliability when adequate numbers of WBA scores are obtained.15,16 Moreover, WBA scores have also been shown to predict underperforming residents, providing new insights into improving the quality of WBA scores.3,5,16

In a recent publication, Weller and colleagues17 suggest methods to incorporate score adjustments into WBA scores. In their approach, differences between observed and expected scores were used to adjust WBA scores, demonstrating improved reliability and better identification of low-performing residents.18 The idea of using rater-mediated measures to improve the precision of scores has been noted previously in the measurement literature.19,20 Although the assessment of competence is context dependent, prior studies have not incorporated a measure of workload into WBA scores. Moreover, adjusting WBA scores using a longitudinal analysis of rater-mediated scores has not been systematically studied in the health professions.

In this study, we expand prior work on improving WBA scores by investigating (1) the psychometric characteristics of workload-adjusted WBA scores and (2) the longitudinal trajectory of learner performance, comparing workload-adjusted and unadjusted WBA scores. These analyses are conceptualized as a validity study, contributing validity evidence on the internal structure (psychometric characteristics) and relations to other variables (longitudinal trajectory) of workload-adjusted WBA scores.21 We use workload-adjusted WBA scores to examine the validity of scores obtained from Pediatrics Milestones–based SCO and MSF assessment tools measuring 8 subcompetencies collected from postgraduate year 1 learners at 10 pediatrics residency programs. We hypothesized that workload-adjusted WBA scores would lead to better discrimination, higher reliability, and more insight into longitudinal trends of learner performance. We also examine differences in score variability and reliability by rater groups (faculty vs. senior residents) and by item type (items mapped to specific competencies vs. global ratings).

Method


Study setting and participants

We used a subset of data from a larger collaborative research project, the Pediatrics Milestones Assessment Collaborative (PMAC).22–24 PMAC is a multiphase joint project of the American Board of Pediatrics, Association of Pediatric Program Directors, and National Board of Medical Examiners that aims to develop assessment tools contributing evidence to decisions regarding learners’ readiness to advance. Ten residency programs voluntarily participated in the phase of the project on which we based this study, each contributing a subset of its first-year residents. Data were collected electronically from July 2016 to June 2017 to inform decisions about learner readiness to serve as a pediatric intern on an inpatient service.

Assessment tools: SCO and MSF forms


We studied item-level data collected from three different WBA forms developed for this phase of PMAC: (1) SCO–History & Physical Examination (SCO-H&P: 46 scored items across 17 questions), (2) SCO of Rounds (SCO-R: 23 scored items across 7 questions), and (3) MSF (40 scored items across 18 questions). The SCO-H&P instrument focuses both on how the learner interviewed the patient or family to obtain information to generate differential diagnoses and how the learner performed the physical examination. Depending on what the rater was able to observe, the rater may only complete the history or physical examination sections of the SCO-H&P assessment form. The SCO-R instrument measures how the learner presents the patient based on information available in the chart or gathered during rounds. The MSF instrument measures learner behaviors during the rotation using information gathered through longitudinal direct observations. The MSF is divided into sections on direct patient-care-related activities (e.g., diagnostic reasoning and clinical decision making) and indirect aspects of patient care (e.g., communication and teamwork). The MSF also included an item for a global rating of readiness to be an intern on an inpatient service. These assessment forms all include fields for narrative comments; however, for this study, we only used quantitative scores from faculty and resident raters for analysis and interpretation. (Authors can be contacted directly for specific details about the instruments or for interest in collaboration.)

Subcompetencies and items.

By design, each item on the SCO-H&P, SCO-R, and MSF instruments maps to one of eight subcompetencies in the Pediatrics Milestones (see Table 1 for subcompetencies and their associated descriptors): Patient Care (PC-1, PC-2, PC-5), Interpersonal and Communication Skills (ICS-1, ICS-4), Practice-Based Learning and Improvement (PBLI-5), Professionalism (PROF-2), and Personal and Professional Development (PPD-1).25,26 Items were compiled into a single data file, consisting of 41 unique scored items corresponding to the subcompetency and instrument (e.g., 10 items measuring ICS-1 across SCO-R, SCO-H&P, and MSF instruments).

Table 1



To measure workload, we asked faculty and resident observers to rate an additional item using a visual analogue scale with anchors of “low” and “high” at each extreme.


Data compilation.

For purposes of analysis and interpretation, we rescaled all item scores and workload measures to a 1–5 scale using linear transformation. We calculated workload-adjusted WBA scores at the item level and then aggregated them to the competency level for descriptive statistics and analysis. We calculated three versions of workload-adjusted WBA scores:

  • (1) Formula 1: Adjusted Score 1 = WBA Score – (5 – Workload Score)
  • (2) Formula 2: Adjusted Score 2 = WBA Score – (1 / Workload Score)
  • (3) Formula 3: Adjusted Score 3 = (WBA Score x Workload Score) / 5

Formula 1 is based on the deviance score approach, following a similar method proposed by Weller and colleagues,17,18 which used a difference score based on expected versus observed scores. In this approach, the WBA score is reduced point for point by the gap between the maximum workload (5) and the reported workload. In Formula 2, workload is treated as a contextual factor: the adjustment is the reciprocal of the workload score, which never exceeds 1 point on the 1–5 scale, so the raw WBA score carries more weight. In Formula 3, workload is treated as a covarying factor in the derivation of the workload-adjusted score (workload has a multiplicative effect on the adjusted score).
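
The rescaling step and the three formulas can be sketched in a few lines of Python. This is a minimal illustration rather than the Stata code used in the study; the function names and the 0–100 raw range assumed for the visual analogue scale are hypothetical.

```python
def rescale_1_to_5(raw, raw_min, raw_max):
    """Linearly map a raw score from [raw_min, raw_max] onto the 1-5 scale."""
    if raw_max <= raw_min:
        raise ValueError("raw_max must exceed raw_min")
    return 1.0 + 4.0 * (raw - raw_min) / (raw_max - raw_min)


def adjusted_scores(wba, workload):
    """Return the three workload-adjusted scores for one item rating.

    Both inputs are assumed to be on the 1-5 scale already.
    """
    return {
        "formula_1": wba - (5.0 - workload),  # deviance-style difference
        "formula_2": wba - (1.0 / workload),  # reciprocal adjustment, at most 1 point
        "formula_3": (wba * workload) / 5.0,  # multiplicative weighting
    }


# Hypothetical example: a workload slider stored as 0-100 and an item scored 4.
workload = rescale_1_to_5(75, raw_min=0, raw_max=100)
scores = adjusted_scores(wba=4.0, workload=workload)
print(workload, scores)
```

At the maximum workload of 5, Formulas 1 and 3 return the raw WBA score unchanged, whereas Formula 2 still subtracts 0.2.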

Variance components analysis.

We estimated variance components using mixed-effects regression models to account for the highly unbalanced WBA data structure. We specified workload random effects using an unstructured covariance matrix. Within the model, we also specified a cross-classification of raters by learner nested in programs and competency-specific items nested in competency.27,28 This model resembles the unbalanced random-effects G study design, [(rater: person): program] × (items: competency).5,29 This allowed estimation of variance components for programs, program-specific learners (person: program), program-specific raters (rater: person: program), competencies, and competency-specific items (items: competency).

To estimate variance components, we chose (1) competency-level items and (2) global ratings (items and components were not estimated for the global ratings model, as it is a single-item measure). In addition, we conducted subsample analyses using ratings from faculty or residents alone. This resulted in a total of 18 sets of variance components analyses. Using the estimated variance components, we derived reliability (Φ-coefficient) and the absolute standard error of measurement (SEM: σ(Δ)) using the method of Brennan.29 We calculated projections of reliability and SEM following standard decision study rules.
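
To make the link between variance components, the Φ-coefficient, and the absolute SEM concrete, the sketch below uses a deliberately simplified design (raters nested in persons, crossed with items) that omits the program and competency facets of the full model. The function names and all variance component values are illustrative assumptions, not estimates from this study.

```python
import math


def absolute_error_variance(var_rater, var_item, var_person_item,
                            var_residual, n_raters, n_items):
    """Absolute error variance for a simplified (rater:person) x item design."""
    return (var_rater / n_raters
            + var_item / n_items
            + var_person_item / n_items
            + var_residual / (n_raters * n_items))


def phi_and_sem(var_person, abs_error_var):
    """Phi-coefficient and absolute SEM from person and error variance."""
    phi = var_person / (var_person + abs_error_var)
    sem = math.sqrt(abs_error_var)
    return phi, sem


# Illustrative variance components only (not values from the study).
components = dict(var_rater=0.20, var_item=0.10,
                  var_person_item=0.05, var_residual=0.40)

for n_raters in (1, 2, 4, 8):
    err = absolute_error_variance(n_raters=n_raters, n_items=10, **components)
    phi, sem = phi_and_sem(var_person=0.15, abs_error_var=err)
    print(f"raters={n_raters}  phi={phi:.2f}  SEM={sem:.2f}")
```

A decision study projection amounts to re-evaluating these two quantities for different numbers of raters or items, as the loop does.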

Longitudinal analysis of WBA scores.

As a proxy for training time, we used the month of rotation start date. Identifying July 2016 as the baseline month (time = 0), we conducted longitudinal analyses to examine patterns of score change over time, comparing workload-unadjusted and workload-adjusted scores. To allow for learner variation in the rate of growth (slope), we used quadratic growth curve functions with random effects.28
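
The time coding and the shape of the growth model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the coefficients are invented to show the curve's shape, and the learner-specific intercept is an assumption (the text specifies only learner variation in slope).

```python
from datetime import date

BASELINE = date(2016, 7, 1)  # July 2016 is coded as time = 0


def months_since_baseline(rotation_start, baseline=BASELINE):
    """Whole calendar months between the baseline month and the rotation start."""
    return ((rotation_start.year - baseline.year) * 12
            + (rotation_start.month - baseline.month))


def expected_score(month, fixed, learner=(0.0, 0.0)):
    """Quadratic growth curve with a learner-specific intercept and slope.

    fixed   = (intercept, linear, quadratic) population coefficients
    learner = (intercept deviation, slope deviation) random effects
    """
    b0, b1, b2 = fixed
    u0, u1 = learner
    return (b0 + u0) + (b1 + u1) * month + b2 * month ** 2


# Hypothetical coefficients, chosen only to show the shape of the curve.
fixed = (3.0, 0.20, -0.01)
t = months_since_baseline(date(2017, 2, 13))  # a rotation starting in February 2017
print(t, expected_score(t, fixed), expected_score(t, fixed, learner=(0.1, 0.05)))
```

Under this coding, December 2016 is month 5 and February 2017 is month 7, matching the month labels used in the Results and Discussion.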

Data compilation and analyses were conducted using Stata 14 (College Station, Texas). The institutional review board at the participating programs and the University of Illinois at Chicago approved this study.

Results


Descriptive statistics

We collected data on 252 first-year pediatrics residents during rotations that took place between July 2016 and June 2017. On average, residents were assessed 5.32 times (standard deviation [SD] = 3.79) by different raters during the data collection period; a given rater assessed the same resident only 1.08 times on average (SD = 0.35). Workload scores were provided by a total of 330 observers (faculty: n = 173; peer resident: n = 157). Overall, workload had a low correlation with the WBA score, r = 0.15, P < .001 (correlation varied by competency, r = 0.03–0.23).

The overall variability of scores is illustrated in Figure 1, which shows the distribution of scores for the overall unadjusted score (Figure 1A: median = 4.00, interquartile range [IQR] = 3.75–5.00) and the workload-adjusted scores (see Figure 1B–1D: Formula 1: median = 3.08, IQR = 2.00–5.00; Formula 2: median = 3.80, IQR = 3.49–4.74; Formula 3: median = 3.16, IQR = 2.38–4.00). The bottom four histograms (Figure 1E–1H) show the distribution of workload-adjusted scores among low-performing trainees, defined as those with unadjusted scores in the 1–2 range (Figure 1E shows the unadjusted range as reference; Figure 1F–1H show the corresponding distributions using Formulas 1, 2, and 3, respectively). Across unadjusted score ranges (e.g., 1–2, 2–3, 3–4, and 4–5), there were gradual increases in the distribution of workload-adjusted scores. Based on these descriptive statistics, workload-adjusted scores were better able to discriminate WBA scores for low- and high-performing learners (i.e., wider variability at the extremes of the distribution). Correlations between workload-adjusted scores (using Formulas 1, 2, and 3) ranged between r = 0.81 and r = 0.85, P < .001.

Figure 1

Overall, workload increased significantly over time but plateaued after month 7 (February 2017), P = .001 (July–October: mean = 3.65, SD = 0.84; November–February: mean = 3.88, SD = 0.77; March–May: mean = 3.82, SD = 0.72). Mean WBA scores increased for all subcompetencies, except for PC-5. For PC-5, the mean scores dropped slightly during the February–May months. For PPD-1, mean workload scores dropped after February 2017. For workload-adjusted scores, all subcompetencies had consistent score increase, except for PC-5, which dropped after February 2017. Table 1 shows the descriptive statistics of unadjusted WBA scores, workload-adjusted scores, and workload scores.

Internal structure of scores

Variance components and reliability.

Using scores from items mapped to subcompetencies, we demonstrated a 6.0 to 7.4 percentage-point increase in the share of variance attributable to persons with the use of workload-adjusted scores (unadjusted: 10.8%; Formula 1: 16.8%; Formula 2: 16.9%; Formula 3: 18.2%). The overall structure of variance components was similar between the three formulas (% variance components all within ± 1.5% range). There was also a modest increase in program-level variance and rater variance, indicating wider variability in institutional and rater evaluations of workload. Variability between competencies was minimal, with most of the variance at the item level (items nested in competency). However, adjusting for workload also reduced variability at the item level by more than 20%. As such, adjusting for workload led to better trainee discrimination (greater trainee variability) and higher reliability (workload-unadjusted: Φ-coefficient = 0.60; workload-adjusted: Φ-coefficient = 0.65). For global scores, variance components did not change between workload-unadjusted and adjusted scores. In particular, person variance accounted for over 27% in both adjusted and unadjusted scores. Table 2 shows the variance components by item type.
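
The reported gain in person variance is a difference between percentages of total variance. The short check below reproduces that arithmetic from the four percentages quoted in the paragraph above; the variable names are ours.

```python
# Person variance as a percentage of total variance, as reported in the text.
person_variance_pct = {
    "unadjusted": 10.8,
    "formula_1": 16.8,
    "formula_2": 16.9,
    "formula_3": 18.2,
}

baseline = person_variance_pct["unadjusted"]

# Gain over the unadjusted score, in percentage points.
gains = {
    name: round(pct - baseline, 1)
    for name, pct in person_variance_pct.items()
    if name != "unadjusted"
}
print(gains)
```

The smallest and largest gains correspond to Formula 1 and Formula 3, which bound the 6.0 to 7.4 range quoted above.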

Table 2

Between rater groups, workload-adjusted faculty ratings showed meaningful improvement in reliability due to increase in person variance (4% increase). Moreover, for faculty ratings, adjusting for workload dropped variability due to raters by nearly 6%. On the other hand, resident ratings had minimal change in variance components. Global ratings also had minimal changes in variance components between workload-adjusted and workload-unadjusted scores.

Decision study: Projections in reliability and SEM.

Reliability exceeded 0.60 with two observations per learner for workload-adjusted scores and with four observations per learner for workload-unadjusted scores. Workload-adjusted scores resulted in a 28% reduction in SEM compared with workload-unadjusted scores. Figure 2 shows the plots of reliability (Figure 2A) and SEM (Figure 2B) by number of observations.
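
The logic of such a projection can be sketched with a Spearman-Brown style formula, which assumes that all error variance shrinks in proportion to the number of observations. This is a simplification of a multifacet decision study. The function names and the single-observation reliabilities (0.45 and 0.30) are hypothetical values chosen to mimic the two-versus-four pattern, not estimates from this study.

```python
def projected_reliability(single_obs_reliability, n_obs):
    """Spearman-Brown style projection when error averages over n observations."""
    r = single_obs_reliability
    return n_obs * r / (1.0 + (n_obs - 1) * r)


def observations_needed(single_obs_reliability, target=0.60, max_obs=50):
    """Smallest number of observations whose projected reliability meets target."""
    for n_obs in range(1, max_obs + 1):
        if projected_reliability(single_obs_reliability, n_obs) >= target:
            return n_obs
    return None  # target not reachable within max_obs


def relative_sem_reduction(sem_unadjusted, sem_adjusted):
    """Fractional drop in SEM when moving from unadjusted to adjusted scores."""
    return (sem_unadjusted - sem_adjusted) / sem_unadjusted


# Hypothetical single-observation reliabilities, not values from the study.
print(observations_needed(0.45))  # scores with higher single-observation reliability
print(observations_needed(0.30))  # scores with lower single-observation reliability
```

A modest difference in single-observation reliability can therefore translate into a doubling of the number of observations needed to reach the same threshold.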

Figure 2

Relations to other variables: Longitudinal changes in WBA scores

Longitudinal analysis showed significant increase in scores over time (β = 0.04; 95% CI = 0.01–0.07), with significant interaction between workload and time (month of training), P = .034. This indicates that overall, the rate of change in workload (slope) increased more rapidly as learners progressed in their training. The variability of the slope (rate of increase) was significantly larger when scores were adjusted for workload (Formula 1: SD slope of growth curve = 0.30 [95% CI = 0.27–0.34]; Formula 2: SD slope of growth curve = 0.17 [95% CI = 0.15–0.20]; Formula 3: SD slope of growth curve = 0.24 [95% CI = 0.21–0.27]) compared with scores unadjusted for workload (SD slope of growth curve = 0.13 [95% CI = 0.12–0.15]); that is, adjusting for workload nearly doubled the variability in rates of growth for WBA scores, where some learners had faster improvement in WBA scores, while others had slower rates of improvement.
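
The size of the change in slope variability can be checked directly from the SDs reported above. The snippet divides each workload-adjusted slope SD by the unadjusted slope SD; the variable names are ours.

```python
# SD of learner-specific slopes of the growth curve, as reported in the text.
slope_sd = {
    "unadjusted": 0.13,
    "formula_1": 0.30,
    "formula_2": 0.17,
    "formula_3": 0.24,
}

# Ratio of each adjusted slope SD to the unadjusted slope SD.
ratios = {
    name: round(sd / slope_sd["unadjusted"], 2)
    for name, sd in slope_sd.items()
    if name != "unadjusted"
}
print(ratios)
```

On these reported values the ratio ranges from about 1.3 (Formula 2) to about 2.3 (Formula 1), so "nearly doubled" is best read as a summary across the three formulas.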

Workload-unadjusted scores showed steady linear improvement throughout the data collection period. Figure 3 shows the change in mean scores over time. Figure 3A illustrates the linear increase in unadjusted WBA scores (y-axis) by months from July (x-axis). Figure 3B shows the corresponding changes in workload scores. Figure 3C illustrates changes in WBA scores that adjust for workload (y-axis), by months from July (x-axis); trends were similar between estimated growth curves using the three formulas. While workload-adjusted scores improved rapidly during the initial months of training, growth slowed to more modest improvement just after the five- to six-month training mark. A vertical reference line at the four- (November 2016) to seven-month (February 2017) mark was added to highlight the period in which the rate of change for workload-adjusted scores shifted.

Figure 3

Discussion


This study contributes to the growing literature on potential benefits of adjusting for workload when measuring resident competencies using WBAs. Our items and assessment tools were mapped to the Pediatrics Milestones,25,26 a competency-based framework. Overall, findings from this study show that adjusting for workload yielded better discrimination of learner performance (improvement of 6.0%–7.4% in person variance) and higher reliability, thereby reducing measurement error by 28%. These findings are consistent with the work of Weller and colleagues,17,18 who showed improvement in measurement properties of their WBA scores. Weller and colleagues used a deviance score approach to calculate differences in expected and observed scores to derive adjusted scores; we used this same approach in our Formula 1, but used workload in the adjustment. We also explored other variant formulas, which were based on proportionally weighting workload (Formula 2) and using workload as a covarying factor (Formula 3). Overall, these different formulas for incorporating workload generated highly correlated workload-adjusted scores, showing similar trends. When workload was taken into account, workload-adjusted scores showed greater variability of trainee performance at all levels of performance. This may prove beneficial in identifying low-performing residents; that is, beyond improvements in reliability of WBA scores, our study also showed better discrimination of learners at low (1–2 score range) levels of performance, where unadjusted scores may not necessarily identify underperforming residents.

Results of our variance components analysis showed that scores from residents had only modest improvements in reliability relative to faculty ratings, perhaps indicating that these two observer groups have different perceptions of workload. It may also be possible that residents were more acutely aware of workload than faculty, and as such, already incorporated workload directly into their WBA scores. Prior studies have noted that faculty ratings discriminate learner differences with greater variability than ratings from fellows or peer residents.3,5 Future studies may provide further evidence on rater cognition and on how raters incorporate workload into their judgments.

Using external rater-mediated information to control for assessment scores has been previously discussed in the general psychometrics literature.19,20 However, it has not been systematically examined using a single, global observer-reported measure over a longitudinal period. Although workload is a complex construct requiring the consideration of multiple factors, our study shows potential in using rater-mediated information to yield more precise WBA scores. To advance this area of workload-adjusted WBA scores, better measures of workload incorporating its multidimensional characteristics may be needed.9 In this regard, measures of workload can be developed and their validity evidence examined to further refine methods to derive workload-adjusted scores. Education researchers and psychometricians may also need to identify more efficient methods to calculate workload-adjusted scores, improving beyond the three formulas used in this study, to incorporate a framework where latent and observed measures can be combined.30 These additional studies could refine the measurement of workload and factors that raters could consider, which may include better rater training and faculty development. Benefits and consequences of using global scores need further investigation, as our findings noted minimal effects when global scores were adjusted for workload.

Longitudinal trajectories of learner progress using rotation start date as proxy of time showed significant improvement. However, our study showed differences in the growth trajectories. For example, it was notable that scores for PC-5 declined after February. Unadjusted scores showed a linear increase across competencies with a constant rate of growth (slope). However, adjusting for workload showed a different pattern, where learner performance increased rapidly initially and then increased incrementally after month 5 (December).

The longitudinal trends in Figure 3C indicate that learners’ workload had greater impact on scores in the initial months of training and then tapered off, possibly due to workload reaching the maximum limits of safety. It may also be possible that in the first five- to six-month time frame, supervising faculty were more aware of learner ability and workload. In the second half of the academic year, the faculty may continue to assess the learner’s improving observed performance, but be less discriminating regarding the workload experienced. The sharp increase in the slope of learner performance in the initial months and the subsequent plateau effect at the five- to six-month mark present an interesting finding which could be explored in future research. Beyond differences in the growth curves, our study also found greater variability in the rate of growth when scores are adjusted for workload regardless of formula used, signaling added value in incorporating workload into WBA scores over time.

This work is not without limitations. Our study was based on data from a single specialty and focused on competencies to inform a particular inference—readiness to serve as a pediatric intern in the pediatric inpatient setting. However, learners were recruited from 10 residency programs, which may add to the generalizability of our findings. Our data were also unbalanced, with more WBA scores for some learners relative to others; this is typical of WBA in the practice setting. Another limitation could be the accuracy of the assigned scores by faculty and residents. We have not done any “think-aloud” studies or other measures to provide evidence for the accuracy of the ratings assigned by various observers or roles of observers. There is a possibility that observers initially compensated for workload and adjusted scores accordingly. Additional validity evidence supporting the use of workload and adjusting for workload may be needed; this could serve as the basis for additional psychometric work to evaluate methods to improve scores. Workload-level ratings asked the observer to address the level of workload for the rotation in comparison with the typical level of workload for that particular rotation at the beginning of the instrument. Thus, the workload levels assigned were not absolute but, rather, relative to what were typical for that rotation. We are refining measures of workload for future PMAC phases to examine how workload-adjusted scores can be improved and more accurately measured. As such, while this study addressed internal structure and relations to other variables sources of validity evidence in workload-adjusted WBA scores, we plan to conduct additional studies to address content, response process, and consequences validity evidence as part of future work.31–35

In conclusion, incorporating a measure of observer-reported workload could improve the measurement properties and the ability to interpret WBA scores. Educators may consider adjusting for workload when measuring performance in the workplace to provide more reliable and valid assessments of trainee competence in the clinical learning environment.


The following members of the PMAC Module 2 Study Group also meet the criteria for authorship: Nick Potisek, MD, and Allison McBride, MD (Wake Forest University Medical School); Kathleen Donnelly, MD, and Meredith Carter, MD (Inova Fairfax Medical Campus/Inova Children’s Hospital); Teri Turner, MD, MPH, MEd (Baylor College of Medicine–Houston); Renuka Verma, MD (Unterberg Children’s Hospital at Monmouth Medical Center); Su-Ting Li, MD, MPH (UC Davis Health System); Amanda Osta, MD (University of Illinois College of Medicine at Chicago); Hilary Haftel, MD, MHPE (University of Michigan Medical Center); Lynn Thoreson, DO (University of Texas at Austin Dell Medical School); Linda Waggoner-Fountain, MD, MEd, and Mark Mendelsohn, MD (University of Virginia Health System); and Ann Burke, MD (Wright State University). Additional members who were collaborators on this work are Brian Clauser, EdD, and Tom Rebbecchi, MD (National Board of Medical Examiners).

References


1. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE guide no. 31. Med Teach. 2007;29:855–871.
2. Schwind CJ, Williams RG, Boehler ML, Dunnington GL. Do individual attendings’ post-rotation performance ratings detect residents’ clinical performance deficiencies? Acad Med. 2004;79:453–457.
3. Park YS, Riddle J, Tekian A. Validity evidence of resident competency ratings and the identification of problem residents. Med Educ. 2014;48:614–622.
4. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system—Rationale and benefits. N Engl J Med. 2012;366:1051–1056.
5. Park YS, Zar FA, Norcini JJ, Tekian A. Competency evaluations in the next accreditation system: Contributing to guidelines and implications. Teach Learn Med. 2016;28:135–145.
6. Weissman JS, Rothschild JM, Bendavid E, et al. Hospital workload and adverse events. Med Care. 2007;45:448–455.
7. Reason J. Human error: Models and management. BMJ. 2000;320:768–770.
8. Cachon G, Terwiesch C. Matching Supply With Demand: An Introduction to Operations Management. New York, NY: McGraw-Hill; 2006.
9. Fieldston ES, Zaoutis LB, Hicks PJ, et al. Front-line ordering clinicians: Matching workforce to workload. J Hosp Med. 2014;9:457–462.
10. Schumacher DJ, Slovin SR, Riebschleger MP, Englander R, Hicks PJ, Carraccio C. Perspective: Beyond counting hours: The importance of supervision, professionalism, transitions of care, and workload in residency training. Acad Med. 2012;87:883–888.
11. Kc D, Terwiesch C. Impact of workload on service time and patient safety: An econometric analysis of hospital operations. Manage Sci. 2009;55:1486–1498.
12. Siegel BS, Greenberg LW. Effective evaluation of residency education: How do we know it when we see it? Pediatrics. 2000;105(4 pt 2):964–965.
13. Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. Opening the black box of clinical skills assessment via observation: A conceptual model. Med Educ. 2011;45:1048–1060.
14. Cleland JA, Knight LV, Rees CE, Tracey S, Bond CM. Is it me or is it them? Factors that influence the passing of underperforming students. Med Educ. 2008;42:800–809.
15. Moonen-van Loon JM, Overeem K, Donkers HH, van der Vleuten CP, Driessen EW. Composite reliability of a workplace-based assessment toolbox for postgraduate medical education. Adv Health Sci Educ Theory Pract. 2013;18:1087–1102.
16. Ginsburg S, Eva K, Regehr G. Do in-training evaluation reports deserve their bad reputations? A study of the reliability and predictive ability of ITER scores and narrative comments. Acad Med. 2013;88:1539–1544.
17. Weller JM, Misur M, Nicolson S, et al. Can I leave the theatre? A key to more reliable workplace-based assessment. Br J Anaesth. 2014;112:1083–1091.
18. Weller JM, Castanelli DJ, Chen Y, Jolly B. Making robust assessments of specialist trainees’ workplace performance. Br J Anaesth. 2017;118:207–214.
19. Patterson BF, Wind SA, Engelhard G Jr. Incorporating criterion ratings into model-based rater monitoring procedures using latent-class signal detection theory. Appl Psychol Meas. 2017;41:472–491.
20. DeCarlo LT. A latent class extension of signal detection theory, with applications. Multivariate Behav Res. 2002;37:423–451.
21. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas. 1995;14:5–8.
22. Schwartz A, Young R, Hicks PJ; APPD LEARN. Medical education practice-based research networks: Facilitating collaborative research. Med Teach. 2016;38:64–74.
23. Hicks PJ, Margolis M, Poynter SE, et al; APPD LEARN-NBME Pediatrics Milestones Assessment Group. The Pediatrics Milestones assessment pilot: Development of workplace-based assessment content, instruments, and processes. Acad Med. 2016;91:701–709.
24. Schwartz A, Margolis MJ, Multerer S, Haftel HM, Schumacher DJ; APPD LEARN–NBME Pediatrics Milestones Assessment Group. A multi-source feedback tool for measuring a subset of pediatrics milestones. Med Teach. 2016;38:995–1002.
25. Pediatrics Milestones Working Group. The Pediatrics Milestone Project. January 2012. Chapel Hill, NC: Accreditation Council for Graduate Medical Education/American Board of Pediatrics; Accessed July 20, 2018.
26. Accreditation Council for Graduate Medical Education; American Board of Pediatrics. The pediatrics milestone project. Accessed July 20, 2018.
27. Rabe-Hesketh S, Skrondal A, Pickles A. Generalized multilevel structural equation modeling. Psychometrika. 2004;69:167–190.
28. Skrondal A, Rabe-Hesketh S. Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models. Boca Raton, FL: Chapman; 2004.
29. Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001.
30. Park YS, Xing K, Lee YS. Explanatory cognitive diagnostic models: Incorporating latent and observed predictors. Appl Psychol Meas. 2018;42:376–392.
31. Schaub-de Jong MA, Schönrock-Adema J, Dekker H, Verkerk M, Cohen-Schotanus J. Development of a student rating scale to evaluate teachers’ competencies for facilitating reflective learning. Med Educ. 2011;45:155–165.
32. Cook DA, Thompson WG, Thomas KG. The Motivated Strategies for Learning Questionnaire: Score validity among medicine residents. Med Educ. 2011;45:1230–1240.
33. Archer J, Norcini J, Southgate L, Heard S, Davies H. Mini-PAT (peer assessment tool): A valid component of a national assessment program in the UK? Adv Health Sci Educ. 2010;15:633–645.
34. McDaniel CE, White AA, Bradford MC, et al. The high-value care rounding tool: Development and validity evidence. Acad Med. 2018;93:199–206.
35. Plant JL, van Schaik SM, Sliwka DC, Boscardin CK, O’Sullivan PS. Validation of a self-efficacy instrument and its relationship to performance of crisis resource management skills. Adv Health Sci Educ Theory Pract. 2011;16:579–590.
© 2018 by the Association of American Medical Colleges