Research has shown that deliberate practice is a useful instructional strategy for the development of expertise in fields ranging from violin playing to surgery.1 Ericsson2 characterizes deliberate practice as effective improvement of performance that entails sequential, mindful repetitions of a training task, along with immediate feedback, such that expert performance is acquired gradually.
In the field of medical education specifically, researchers have shown that deliberate practice is effective for tasks such as radiograph and electrocardiogram interpretation, auscultation, and surgery simulations.3–5 The assessment of a deliberately practiced skill traditionally consists of discrete and formative or summative assessments that are usually separated in time and space from the practice itself.6,7 Alternatively, learning curves plot performance against a continuous measure of time spent learning. They allow inspection of the process of a person's learning at a fine level of granularity in real time, and they may allow medical educators to predict how much practice is required to achieve a given level of competency.1,2,8,9
Figure 1 shows a generic learning curve first substantively reported by Thurstone.8 Learning increases with time, according to a negative exponential function.10,11 In other words, as learners complete more cases, the amount of performance improvement gained with each case decreases. Parameters taken from the learning curve model include the y-intercept (the performance level at the beginning of the practice period), the slope of the curve (the rate of learning), and an upper asymptote (the maximal performance attainable with this learning intervention given unlimited time).8,10 Of note, each learning curve has an inflection point at which the rate of learning slows from an initial rapid phase to a slower phase during which each successive unit of practice results in less learning. This “law of diminishing returns” is based on the recognition that the very highest performance is difficult to achieve.9 In medicine, the main use of learning curves as an assessment tool has been for the adoption of new surgical procedures or new health technologies12; however, their use in mainstream medical education is uncommon.
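As a minimal sketch of the curve described above, the following assumes one common parameterization of a negative exponential, y(x) = a - (a - y0) * exp(-k * x), where y0 is the y-intercept, a is the upper asymptote, and k is a rate constant. The function name and the parameter values are illustrative assumptions, not estimates from this study.

```python
import math


def learning_curve(cases, y0, asymptote, rate):
    """Performance after `cases` repetitions under a negative exponential model.

    y0        : performance before any practice (y-intercept)
    asymptote : maximal performance attainable with unlimited practice
    rate      : rate constant (larger values mean faster early learning)
    """
    return asymptote - (asymptote - y0) * math.exp(-rate * cases)


# Hypothetical parameters, chosen only to illustrate the shape of the curve.
y0, asymptote, rate = 0.50, 0.70, 0.01

# The gain from one additional case shrinks as practice accumulates.
gain_early = learning_curve(11, y0, asymptote, rate) - learning_curve(10, y0, asymptote, rate)
gain_late = learning_curve(201, y0, asymptote, rate) - learning_curve(200, y0, asymptote, rate)
print(gain_early > gain_late)
```

Under this parameterization the curve starts at y0, rises monotonically, and approaches but never exceeds the asymptote.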
For this study, we created a deliberate practice exercise for the skill of radiograph interpretation in order to demonstrate how learning curves can describe proficiency improvements. We hypothesized that the results would allow a more detailed assessment of the development of competency in radiograph interpretation both at an individual and group level.
Participant recruitment and setting
We recruited a convenience sample of participants for this study. By e-mail, we invited 32 pediatric trainees who rotated through a pediatric emergency department (PED) in one of two programs (Children's Hospital of New York Presbyterian [New York, New York] and The Hospital for Sick Children [Toronto, Ontario, Canada]). The pediatric trainees ranged from postgraduate training years (PGY) 2 through 5; PGY 4 and 5 trainees were enrolled in a pediatric emergency medicine (PEM) fellowship. For a period of four months (March to June 2008), we sent monthly e-mail reminders to those participants who had not started or completed the exercise. To encourage enrollment, we offered potential participants who completed the study both a chance to win an iPod and a $25 gift certificate. We informed participants of the purpose of the study but blinded them to the ratio of normal to abnormal radiographs and to fracture types included in the exercise. We guaranteed participants' anonymity. We kept participants' results confidential and did not provide them to training programs.
To garner a reference standard, we also invited six PEM attending physicians—three from each institution—to participate.
The research ethics boards at the participating institutions approved this study. All participants provided informed consent.
Radiograph selection and diagnostic classification
We chose ankle films for this research because they can be considered a model of a dichotomous clinical decision. That is, the clinician must either (1) declare the radiograph free of fracture and discharge the patient with only supportive measures or (2) diagnose a fracture and appropriately manage the patient with splinting and arrangements for further care. The details of the image bank development are reported elsewhere.3 In brief, we used 234 radiograph cases that provided the following content: a case-mix frequency of abnormal/normal/normal-variant radiographs that is consistent with the mix seen in actual clinical practice and that includes all the educational content necessary for emergency management of pediatric ankle injuries. The diagnoses represented in the 234 radiographs included 136 (58.1%) normal ankles, 15 (6.4%) normal-variant ankles, 69 (29.5%) growth plate fractures, 12 (5.1%) avulsion fractures, 1 (<1%) combined tibia/fibula fracture, and 1 ankle (<1%) demonstrating osteochondritis dissecans.
A case consisted of three images (anterior–posterior, lateral, and mortise views). These were downloaded, along with the final staff pediatric radiology report, from the institutional picture archiving and communications system and saved in Joint Photographic Experts Group (.jpeg) format. For each case, we wrote a brief clinical history based on the information present on the imaging requisition, and we categorized all cases as either normal or abnormal3 based on the information provided by the official radiology report.
Online software application for presentation of radiograph cases
From May to December 2007, we developed a Web site using HTML (version 4, World Wide Web Consortium, Cambridge, Massachusetts), PHP (version 5, The PHP Group, Cary, North Carolina), and Flash Professional (version 8, Macromedia, San Francisco, California) software. We safeguarded data via a participant name and unique password. Participants were permitted to complete the exercise at a site of their choice. They could also save their work and log out at any time and return later to complete the exercise. The software tracked participants' progress through the cases and recorded their responses immediately to a MySQL database (version 4.0.27-max-log, MySQL AB, Uppsala, Sweden).
For each case, the participant first viewed a screen listing the patient's presenting complaint and the physician's clinical findings. Clicking the appropriate button took the participant to one of the three standard radiograph views of the ankle. We imposed no time limitations, and when the participant was ready, he or she declared the case either “normal” or “abnormal.” If the participant selected “abnormal” for any radiograph, he or she then marked it, using a yellow dot, to indicate where he or she thought the abnormality was located. When the participant committed to an answer (by clicking the “Submit” button), he or she received instantaneous feedback comprising a visual overlay indicating the region of abnormality (if any) and the entire official radiology report. Cases were presented in a random order unique to each participant so as to minimize order effects.
Conceptual model of deliberate practice
This educational intervention meets the criteria of deliberate practice, as defined by Ericsson1 (i.e., sustained practice of a relevant task along with receipt of feedback). This intervention is consistent with the characteristics he detailed, based on the following: (1) all participants were motivated by the high relevance of the task to their field, (2) participants received immediate feedback on their performance, and (3) the overall task of ankle radiograph interpretation remained the same with each repetition, although the specifics of each radiograph varied. These factors allowed the participants to improve as they progressed through the cases.
We considered each case completed by a participant to be one item. We scored normal items dichotomously (correct, incorrect) depending on the match between the participant's response and the original radiology report (the latter having been determined a priori). We considered abnormal items to be correct only if the participant had both classified it as abnormal and indicated the correct region of abnormality on at least one of the images of the case.
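The scoring rule above can be sketched as a small function. This is only an illustration under stated assumptions: the function and argument names are hypothetical, and the study software itself was written in PHP, not Python.

```python
def score_item(truth_abnormal, said_abnormal, marked_correct_region=False):
    """Return True if a response counts as correct under the rule described above.

    truth_abnormal        : classification from the official radiology report
    said_abnormal         : participant's normal/abnormal declaration
    marked_correct_region : True if the participant's mark fell in the correct
                            region on at least one of the three views
    """
    if not truth_abnormal:
        # Normal case: correct only if the participant also called it normal.
        return not said_abnormal
    # Abnormal case: must be called abnormal AND localized correctly.
    return said_abnormal and marked_correct_region


# An abnormal case called "abnormal" but marked in the wrong place is incorrect.
print(score_item(truth_abnormal=True, said_abnormal=True, marked_correct_region=False))
```

Note that the localization requirement makes the rule stricter for abnormal items than for normal ones, which tends to lower sensitivity relative to a purely dichotomous score.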
For the group and each participant, we calculated each of the following measures, collectively referred to as “test characteristics”: accuracy, sensitivity, specificity, positive predictive value, negative predictive value, likelihood ratio positive, and likelihood ratio negative.13
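For reference, the test characteristics listed above follow from the standard two-by-two table definitions. The sketch below assumes that an abnormal item scored correct counts as a true positive and an abnormal item scored incorrect counts as a false negative; that mapping, the function name, and the example counts are assumptions made for illustration.

```python
def diagnostic_characteristics(tp, fp, tn, fn):
    """Standard test characteristics from counts in a two-by-two table.

    Returns None for any measure whose denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    lr_positive = lr_negative = None
    if sensitivity is not None and specificity is not None:
        lr_positive = ratio(sensitivity, 1 - specificity)
        lr_negative = ratio(1 - sensitivity, specificity)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }


# Hypothetical counts for one participant, for illustration only.
result = diagnostic_characteristics(tp=6, fp=2, tn=8, fn=4)
for name, value in result.items():
    print(name, round(value, 3))
```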
Plotting learning curves.
We expected that the participants would show improvement with successive practice and that this could be captured by a running estimate of the aforementioned test characteristics. As participants completed each case, the software computed the test characteristic in question and then graphed it as a function of cases completed to that point, resulting in a learning curve. Because the result is weighed down by the early cases for which performance would not have been as good, these cumulative statistics underestimate the final performance level of each participant. However, the goal was to report, at a fine level of granularity, the relative changes in performance over successive repetitions. In addition to the individual learning curves, the software plotted group-level summary curves for the mean of the test characteristic estimates of all participants.
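A running (cumulative) estimate of this kind can be sketched as follows for sensitivity. The data layout and function name are assumptions for illustration and do not reproduce the study software.

```python
def cumulative_sensitivity(items):
    """Running sensitivity after each completed case.

    items : sequence of (truth_abnormal, scored_correct) pairs in the order
            the participant completed them.
    Returns one value per case; None until the first abnormal case is seen.
    """
    curve = []
    abnormal_seen = 0
    abnormal_correct = 0
    for truth_abnormal, scored_correct in items:
        if truth_abnormal:
            abnormal_seen += 1
            abnormal_correct += int(scored_correct)
        # Normal cases leave sensitivity unchanged but still add a point to the curve.
        curve.append(abnormal_correct / abnormal_seen if abnormal_seen else None)
    return curve


# Hypothetical response sequence: a miss, a normal case, then two hits.
responses = [(True, False), (False, True), (True, True), (True, True)]
print(cumulative_sensitivity(responses))
```

Because every point includes all earlier cases, an early miss continues to pull the estimate down, which is why the cumulative curve understates performance at the end of practice.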
Learning curve form.
To determine how often the individual learners' learning curves fit the Thurstone pattern shown in Figure 1, each of the three authors independently assessed all of the curves by visual inspection and judged whether a given curve fit the classic Thurstone pattern. We reasoned that if the learning curves fit the Thurstone pattern, the collective fit would support the contention that participants should improve as they complete each unit of practice/repetition, but that the rate of improvement slows at advanced levels.14 We defined a fit as a curve that had an initial upslope, an inflection point, and then a slower upslope that ended in maximal performance. We recorded the simple agreement and raw kappa score among the three raters.
We compared the final average sensitivity of all participants with that of six PEM attending physicians. We did not think that the trainees would be able to achieve a level of competency equal to that of the attending physicians. Based on the consensus of three medical educators (distinct from the study authors), we set the criterion for competency for trainees at two-thirds the difference between trainee initial sensitivity and the mean sensitivity of the referent PEM attendings.
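One reading of this criterion is the trainees' initial sensitivity plus two-thirds of the gap to the attendings' mean sensitivity. The sketch below assumes that reading; the function name and the input values are hypothetical and are not the values used in the study.

```python
def competency_criterion(initial_sensitivity, attending_sensitivity, fraction=2 / 3):
    """Criterion set a given fraction of the way from the trainees' initial
    sensitivity to the attending physicians' mean sensitivity."""
    gap = attending_sensitivity - initial_sensitivity
    return initial_sensitivity + fraction * gap


# Hypothetical inputs, for illustration only.
criterion = competency_criterion(initial_sensitivity=0.40, attending_sensitivity=0.70)
print(round(criterion, 2))
```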
We carried out all analyses in September and October 2008 using SPSS 13.0 (Chicago, Illinois) and Stata 10.1 (College Station, Texas).
We initially contacted all 32 trainees rotating through the PEDs of two different programs (via an e-mail solicitation in March 2008); of those, 20 consented to participate. Two participants dropped out of the study; thus, by July 2008, 18 residents (56.3% of those eligible) completed the study protocol: 6 PGY 2–3 trainees (all associated with the Children's Hospital of New York Presbyterian) and 12 PGY 4–5 trainees (7 associated with The Hospital for Sick Children and 5 associated with the Children's Hospital of New York Presbyterian). In addition, all 6 PEM attending physicians completed the protocol. All 18 residents and all 6 attendings who completed the study completed all 234 cases.
Summative performance statistics
The mean accuracy of the participants improved over the course of the 234 radiograph cases they completed. When we compared the accuracy of the first 30 completed cases with that of the last 30, the accuracy improved from 0.71 (SD 0.45) to 0.80 (SD 0.40), with the 95% CI for the difference being +0.04, +0.14. The internal consistency (Cronbach α) of the 234-item tutorial was 0.72.
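As a rough arithmetic check, the reported SDs and interval are what a normal approximation gives for pooled dichotomous item scores. The sketch assumes 18 participants × 30 items treated as independent observations in each block; the source does not state how the interval was actually computed, so this is an assumption, not the authors' method.

```python
import math

# Values reported above.
accuracy_first, accuracy_last = 0.71, 0.80

# Assumption: 18 participants x 30 items pooled as independent observations.
n_per_block = 18 * 30

# For a 0/1 score with mean p, the SD is sqrt(p * (1 - p)).
sd_first = math.sqrt(accuracy_first * (1 - accuracy_first))
sd_last = math.sqrt(accuracy_last * (1 - accuracy_last))

# Normal-approximation 95% CI for the difference between two proportions.
difference = accuracy_last - accuracy_first
standard_error = math.sqrt(sd_first**2 / n_per_block + sd_last**2 / n_per_block)
ci_low = difference - 1.96 * standard_error
ci_high = difference + 1.96 * standard_error

print(f"SDs: {sd_first:.2f}, {sd_last:.2f}")
print(f"95% CI for the difference: {ci_low:+.2f}, {ci_high:+.2f}")
```

Under these assumptions the sketch yields SDs of 0.45 and 0.40 and an interval of +0.04 to +0.14, consistent with the reported figures. Treating items from the same participant as independent ignores clustering, so the true interval may be somewhat wider.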
Group-level learning curves
Figure 2 presents the results of all the cumulative test characteristic group-level learning curves for the 18 participants. For each of the test characteristics, the graph patterns at the group level were similar, qualitatively following the Thurstone pattern shown in Figure 1.
Figure 3 demonstrates more closely the cumulative sensitivity group-level learning curve. Notably, participants required approximately 20 cases for the standardized error of measurement to stabilize; the period of most efficient learning was from 21 to 50 cases, during which sensitivity (95% CI) increased from 0.50 (0.45, 0.57) to 0.54 (0.47, 0.58); there was then an inflection point after which learning slowed but did not stop even at 234 cases, at which point the final cumulative sensitivity was 0.60 (0.54, 0.63). The mean cumulative sensitivity of our 6 PEM attendings was 0.65. Based on inspection of the group-level graph, attaining the sensitivity of PEM attending participants would theoretically require the average resident or trainee in our study to do a further 150 cases using this form of instruction.
Individual-level learning curves
Figure 4 presents each participant's cumulative sensitivity learning curve. Considerable variability occurred in the shapes of learning curves among participants. The three authors individually rated the curves with respect to a qualitative Thurstone pattern fit. In 15 of the 18 cases there was complete agreement between all the raters (κ = 0.75). Of the 18 cases, 11 participants' curves fit the Thurstone pattern and 7 did not. We did not detect one particular alternate pattern among the learning curves of the 7 participants whose curves did not completely follow the general pattern, although 2 of the participants' curves showed a negative slope. This result—that the majority of the learning curves fit the Thurstone pattern—held across the various test characteristics.
Relationship to criterion.
Plotting a criterion line on the learning curves, as in Figure 5, shows that the learning curves can provide some insight into a trainee's progression toward competency. Our participants showed a wide range of ability both initially and as they progressed through the radiograph cases. We observed, on the basis of their running estimate (which did not dip below the threshold), that 3 of the 18 participants were above our chosen level of competency from the beginning of the intervention. Another 7 did not achieve the designated competency level even after 234 cases, and in the middle were 8 participants who made steady progress, eventually crossing the threshold at a point that varied from 70 cases to almost the final case presented.
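The three groups described above can be identified from a running estimate with a simple rule. In this sketch, the crossing point is taken as the first case after the last value below the criterion; that definition, the function name, and the example curves are assumptions for illustration.

```python
def classify_progress(curve, criterion):
    """Classify a running estimate relative to a competency criterion.

    curve     : running estimates, one per completed case (case 1 first)
    criterion : competency threshold on the same scale
    Returns (label, crossing_case), where crossing_case is the first case from
    which the estimate stays at or above the criterion, or None.
    """
    below = [case for case, value in enumerate(curve, start=1) if value < criterion]
    if not below:
        return ("above throughout", None)
    if below[-1] == len(curve):
        return ("not achieved", None)
    return ("crossed", below[-1] + 1)


# Hypothetical curves, for illustration only.
print(classify_progress([0.70, 0.72, 0.75], criterion=0.60))
print(classify_progress([0.40, 0.65, 0.50, 0.62, 0.70], criterion=0.60))
print(classify_progress([0.40, 0.50, 0.55], criterion=0.60))
```

Requiring the estimate to stay above the criterion, rather than merely touch it once, avoids counting early noise as a genuine crossing.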
Using the deliberate practice of interpreting pediatric ankle radiographs as an example, we demonstrated that learning curves can be a useful representation of how trainees acquire a particular skill. In our sample, after an initial period of statistical “noise,” the slope of the learning curve increased rapidly to an inflection point after which learning gains were slower. A qualitative inspection of the learning curves allowed us to determine not only at what point practice was most efficient for learning a skill but also how much practice was required to achieve a given level of competence. These findings are in keeping with the classic Thurstone pattern of the learning curve8 and were evident for all the diagnostic test characteristics examined in this study.
Several state-of-the-art reviews of assessment in medical education have been published.6,7,15 Whereas each describes a range of assessment options from written examinations to standardized-patient-based experiences, none lists the use of learning curves in the way we have described. Epstein's6 recent review does suggest the need for assessment over time, but it does so in the context of multimethod assessment, which emphasizes whether a destination has been achieved rather than what the path of learning looks like. Assessments based on learning curves that detail the results of deliberate practice offer advantages over the traditional methods in that the assessment occurs during learning instead of after the exercise. Medical educators can draw conclusions both about the rate of learning and about the effectiveness and/or efficiency of the learning intervention for either a group or an individual. In addition, the learning curve is a visual representation of a learning trajectory that gives learners (or educators) a sense of where they (the learners) have been and when, or even whether, they are likely to achieve their competency goals.
Learning curves have previously been used in radiology education research. Nodine and colleagues16 created a learning set of 150 mammograms, one-third of which showed a lesion. Thirty-one participants of varying levels of expertise interpreted the mammograms, and the learning curves showed that increased participant experience correlated highly with successful performance on the learning set. However, unlike in our study, the participants did not receive feedback on each case, which may have limited the degree of knowledge gains.17 In addition, the investigators reported transformed group-level analyses such as logarithms of time or performance variables, which may be difficult for the average health educator to interpret. In contrast, we performed our analyses using test characteristics that are familiar to most educators, and we employed qualitative learning curve interpretations that do not require sophisticated analysis in order to generate useful information.
Our results show that an individual's learning curve contains important information. The pattern of approximately two-thirds of the participants matched the predicted Thurstone pattern. This finding supports research by Hicklin,14 who demonstrated that although the specifics (e.g., rate of learning) may be different for each learner, all learners can eventually achieve the same level of mastery in an ideal environment. Those who did not match the Thurstone pattern did not show a particularly characteristic alternate pattern except for two individuals whose learning curves both showed a negative slope. We can only speculate about the reasons these individuals learned less over time. In any case, such negative sloping or other unusual learning curves could be helpful in early identification of individuals who require remediation. Further, if the learners reviewed their curves while working through the cases, they might be able to self-regulate their progress.18 We note that the group- and individual-level curves, although similar in form, can be used for different purposes. As mentioned, medical educators can use individual curves to identify learners requiring remediation or trainees for whom the intervention is not working, while they can use the parameters taken from the overall curve to manage education for a group, including predicting how much practice is required and how efficient it will be.
Using a learning goal or criterion helped establish a context for our participants' learning curves. In many medical education activities, all learners spend the same amount of time learning a skill, resulting ultimately in competency levels that vary among individuals.14 According to the theory of competency-based training, learners vary in terms of the time they spend learning, but they can all eventually achieve the same competency level.14 From a residency program director's perspective, an important advantage of learning curves is that they can show when the learner has achieved competence—without risking inefficient use of training time.
This research has limitations that warrant consideration. Not all skills lend themselves to this type of analysis. Interpreting ankle radiographs is a well-codified task that remains consistent from case to case, and knowledge gained from one case is immediately applicable to the next one. The dichotomous nature of the problem, determining whether a fracture is present or not, allows for unambiguous feedback, which may not be the case for other situations. We used a convenience sample, which raises the concern that learners who participated may represent a different skill level or motivation compared with those who did not. We assigned a consensus competency criterion for illustrative purposes only; the principle behind learning curves (that they show an individual's trajectory toward competence) would likely remain intact even if another level were selected. Participants in this exercise, including PEM staff, did not achieve high levels of competency despite the deliberate practice of over 200 radiograph cases. The final competency estimates that we have reported are likely to be lower than the actual performance because the cumulative test characteristics we reported are weighed down by initial performances and/or because technical inefficiencies may have affected results. Finally, our participant sample size was small, which limits statistical comparisons of results at different points in time. However, residency program directors function on a day-to-day basis with group sizes comparable with that in our study, and our results show that learning curves can distinguish the varying learning paths of even a small number of individuals.
Learning curves describing deliberate practice of radiograph interpretation allow medical educators to define at what point practice is most efficient and how much practice is required of an individual or group to achieve a defined level of mastery.
The authors thank Dr. Adina Kalet, Dr. David Kessler, and Dr. Andrew Mutnick for careful reviews of preliminary drafts.
This research was funded by the Royal College of Physicians and Surgeons of Canada and the Dean's Excellence Fund University of Toronto.
Both the ethics board at Children's Hospital of New York Presbyterian and the ethical review board of The Hospital for Sick Children approved this study.
This research was presented at the Pediatric Academic Societies, Baltimore, Maryland, May 3, 2009.