The Use of Virtual Standardized Patients for Practice in High Value Care : Simulation in Healthcare


Empirical Investigations

The Use of Virtual Standardized Patients for Practice in High Value Care

Bond, William F. MD, MS; Mischler, Matthew J. MD; Lynch, Teresa J. MD; Ebert-Allen, Rebecca A. BS; Mou, Kyle M. BA; Aiyer, Meenakshy MD; Park, Yoon Soo PhD

Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare 18(3):p 147-154, June 2023. | DOI: 10.1097/SIH.0000000000000659



This study examined the influence of high value care (HVC)-focused virtual standardized patients (VSPs) on learner attitudes toward cost-conscious care (CCC), performance on subsequent standardized patient (SP) encounters, and the correlation of VSP performance with educational outcomes.


After didactic sessions on HVC, third-year medical students participated in a randomized crossover design of simulation modalities consisting of 4 VSPs and 3 SPs. Surveys of attitudes toward CCC were administered before didactics and after the first simulation method. Performance markers included automated VSP grading and, for SP cases, faculty-graded observational checklists and patient notes. Performance was compared between modalities using t tests and analysis of variance and then correlated with US Medical Licensing Examination performance.


Sixty-six students participated (VSP first: n = 37; SP first: n = 29). Attitudes toward CCC significantly improved after training (Cohen d = 0.35, P = 0.043), regardless of modality. Simulation order did not impact learner performance for SP encounters. Learners randomized to VSP first performed significantly better within VSP cases for interview (Cohen d = 0.55, P = 0.001) and treatment (Cohen d = 0.50, P = 0.043). The HVC component of learner performance on the SP simulations significantly correlated with US Medical Licensing Examination step 1 (r = 0.26, P = 0.038) and step 2 clinical knowledge (r = 0.33, P = 0.031).


High value care didactics combined with either VSPs or SPs positively influenced attitudes toward CCC. The ability to detect an impact of VSPs on learner SP performance was limited by content specificity and sample size.

The US health care system spends significant resources (approximately $9255 per person per year), with poorer outcomes than comparably wealthy countries that spend less.1 Low value care includes failures of care delivery and coordination, overtesting and treatment, fraud and abuse, and administrative complexity.2,3 Within this broader picture, appropriate screening, testing, and diagnostic strategies can have a profound impact on costs of health care,4 and the ordering provider thus has a significant impact on costs of care. Providers also demonstrate significant variability in ordering patterns for radiologic and laboratory testing.5–8

These concerns drove the introduction of high value care (HVC) into the lexicon of medicine.9,10 Work up variability, costs, the risk of false positives, and concerns for diagnostic radiation exposure led the American College of Physicians9 and the American Board of Internal Medicine11 to put forth recommendations for evidence-based testing strategies. Core competencies have been suggested,10 and leading organizations have created an online curriculum for HVC.12 As of 2012, HVC became a subscore on the In-Training Exam for Internal Medicine Trainees in preparation for HVC content in the internal medicine board examination.11 Curricular development and educational opportunities are part of a much larger picture that includes a formidable unwritten curriculum,13 the need for understanding in evidence-based medicine,14,15 and the ability to exercise shared decision making,16,17 in particular around topics of testing and treatment.18,19

High value care ideally proceeds from a thorough history and physical leading to uncovering red flag findings that would suggest the need for testing if present, followed by shared decision making based on an informed differential diagnosis (DDx). Virtual standardized patients (VSPs) that allow practice in finding the high-risk features, permit the learner to make choices for testing, and give feedback on performance could help learners practice for HVC. We chose to investigate the introduction of VSPs that are driven by natural language processing (NLP) into an existing HVC curriculum. Natural language processing–based avatars with a conversational history gathering approach are a relatively new phenomenon.20,21

Although virtual patients have changed greatly across time, there is only limited evidence for their impact in education,22,23 including use to practice history taking,21,24–28 and clinical decision making.24,28–33 Moreover, with the emergence of educational technology and virtual adaptive platforms to enhance remote learning and assessment, the demand for investigating the educational impact and utility of VSPs has become greater. To address these gaps, we created a series of VSPs as a learning opportunity for HVC and incorporated them into a larger curriculum. This study intended to answer several questions:

  1. Does exposure to a curriculum including VSPs or standardized patients (SPs) change attitudes toward cost-conscious care (CCC)?
  2. Will practice with a series of VSPs improve performance on a series of SP actor cases on HVC topics?
  3. How does performance on a series of VSPs correlate with other markers of success?


Study Design

We followed the reporting guidelines of the Consolidated Standards of Reporting Trials (CONSORT) and Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statements simulation extension.34

  • Learning Objectives: The VSPs and SPs were designed to allow practice in HVC concepts after experiencing an HVC didactic curriculum consisting of two 90-minute sessions with the following learning objectives:
    o Define HVC and why it is relevant
    o Describe barriers that prevent HVC decisions in clinical practice
    o Identify 5 steps toward providing HVC
    o Briefly define quality improvement process and its role in HVC
  • Trial design: This was a randomized crossover design.
  • Participants: University of Illinois College of Medicine at Peoria medical students (MS3s) on a third-year internal medicine medical student clerkship were eligible to participate. This required course could take place at any time in the third medical school year. Students took part in all educational activities, whether or not they consented to the research project. Those who consented underwent additional surveys. Allocation was by block randomization to either VSP first or SP first, with a pragmatic approach allowing the course coordinator (nonstudy personnel) to randomly assign half the rotation cohort to the VSPs or SPs first.
  • Interventions: See study design in Figure 1. Both groups took a survey on cost-conscious care (CCC)35 followed by didactics derived from the American College of Physicians HVC Curriculum.36 The VSP group first experienced the VSP simulations in a computer laboratory. The SP group first experienced the SP simulations in a simulation center. After the first method of simulation, participants filled out a second CCC survey. In most cases, the learners did the second simulation method later in the same day. With regard to consequences, the learners were told the VSP cases were for their education, and only attendance and completion were tracked. The SP case consequences were as before the study: one formative element within their internal medicine clerkship.
  • This study was submitted to the Peoria Area Institutional Review Board, which serves the Peoria regional campus of the University of Illinois College of Medicine. The consent process included introduction to the project, with faculty leaving the room such that research coordinators could collect either signed or unsigned consent forms.
FIGURE 1 - Study design. *Simulations did not take place on the same day as didactics. CCC, Cost-Conscious Care; SP, Standardized Patient; Hx, History; PE, Physical Examination; DDx, Differential Diagnosis.
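
The allocation step described under Participants can be sketched as follows. This is a minimal illustration only: the function name, the arm labels, the seed, and the handling of odd-sized cohorts are assumptions, since the study states only that the course coordinator randomly assigned half of each rotation cohort to each simulation order.

```python
import random

def assign_cohort(student_ids, seed=None):
    """Randomly split one rotation cohort into the two simulation orders."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Assumption: with an odd-sized cohort, the extra student goes to SP first.
    return {
        "VSP first": sorted(shuffled[:half]),
        "SP first": sorted(shuffled[half:]),
    }

# Hypothetical rotation cohort of 10 students identified by study number
cohort = [f"S{n:02d}" for n in range(1, 11)]
allocation = assign_cohort(cohort, seed=1)
```

Treating each rotation cohort as a block keeps the two arms roughly balanced across the academic year, even though final arm sizes can still differ (as they did here, 37 versus 29) once consent and completion are accounted for.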

Virtual SP Component

The VSPs were created within the open-freeware online platform University of Southern California Standard Patient (University of Southern California Institute for Creative Technologies, Los Angeles, CA). The platform software used NLP to generate avatar audible responses to free-text questions that were typed in by the learner. A demonstration video of one of the VSP cases is available as Supplemental Digital Content (see video, Supplemental Digital Content 1, demonstration video of a VSP case).

Virtual Standardized Patient Assessments

For the history-taking (Hx) module within the VSP cases, the learner's free-text queries are mapped via NLP to customized case content created by the faculty educators, leading to an audible avatar response. The physical examination (PE) module within the cases is completed by selecting the body area and receiving a text description of findings. The learners could not return to gather history and physical data after entering their DDx. The DDx used a choose-from-list method, with each case having 6 items on the list. Learners were instructed to choose their primary most likely diagnosis and 2 possible alternative diagnoses. The diagnostic testing choice module was deliberately placed after the DDx module and allowed the learner to select from menus of available diagnostic tests. The list was extensive, similar to what is found within the computerized ordering system of an electronic health record. The treatment module was also choose-from-list, and the number of options varied from 8 to 9.

The VSP automated grading export included: Hx and PE items found relative to faculty consensus of important items, DDx items chosen relative to the primary and 2 “possible” alternative diagnoses over the denominator of 6 possible choices, tests ordered relative to faculty consensus of appropriate testing, and treatments relative to the faculty consensus treatments. Workup costs of the learners were exported along with the consensus faculty workup cost for comparison. These costs are derived primarily from one source and supplemented with other sources for completeness. A negative difference indicated that the mean learner workup cost less than faculty consensus, and a positive difference indicated that the learner workup cost more than faculty consensus. Based on our prior work, we did not feel the need to double-grade the automated scoring but checked a 10% sample for appropriateness of avatar response in the interview.

Virtual Standardized Patient Scoring and Feedback

The platform provides end of case feedback, including overall scoring and module-specific scoring, a list of the content items that were given credit, and a method to discover which items were missed. Specifically, by scrolling over the after action review of each case, they could see exactly which content areas they missed in the history and PE, which should guide them to think about red flag features. Decisions on how the system would score the various components of the cases were arrived at by 3 faculty authors (T.L., M.M., W.B.) using informal consensus during case building. The system gives 3 points for “must ask” items, and 1 point for “nice to ask” items in the Hx and PE. We did not formally calculate the ability of a finding to discriminate among DDx items but informally agreed that “must ask” items were akin to “clinically discriminating” and “nice to ask” items were akin to “thoroughness” items as described by Yudkowsky et al.37 The end of case summary shared with the learner also contained text that explained the diagnosis and why testing may or may not have been indicated. For more details on how the cases were created, consensus scoring developed, and pilot testing for avatar response rate, please see the study group's prior work.38
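
The 3-point and 1-point weighting described above can be sketched as a simple proportion of available points. This is a hypothetical illustration: the item names, the `weighted_score` function, and the choice to report points earned over points possible are assumptions, not the platform's actual export logic.

```python
MUST_ASK_POINTS = 3     # weight stated in the text for "must ask" items
NICE_TO_ASK_POINTS = 1  # weight stated in the text for "nice to ask" items

def weighted_score(case_items, items_found):
    """Return points earned divided by points possible for one module.

    case_items maps an item name to "must" or "nice" (hypothetical labels).
    items_found is the set of item names the learner elicited.
    """
    weights = {"must": MUST_ASK_POINTS, "nice": NICE_TO_ASK_POINTS}
    possible = sum(weights[kind] for kind in case_items.values())
    earned = sum(
        weights[kind] for item, kind in case_items.items() if item in items_found
    )
    return earned / possible if possible else 0.0

# Hypothetical history items for a headache case (not taken from the study cases)
headache_hx = {
    "sudden onset": "must",
    "worst headache of life": "must",
    "neck stiffness": "must",
    "caffeine intake": "nice",
    "sleep pattern": "nice",
}
found = {"sudden onset", "neck stiffness", "sleep pattern"}
score = weighted_score(headache_hx, found)  # (3 + 3 + 1) / (9 + 2)
```

Under this weighting, missing a single "must ask" item costs as much as missing three "nice to ask" items, which is consistent with the stated intent of steering learners toward red flag features.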

Virtual Standardized Patient Case Play

Faculty developed a series of case pairs within the VSP platform, with each presenting complaint having a low-risk (testing not appropriate) and a high-risk (testing appropriate) VSP case. Orientation to the platform included a 1-page written assignment introduction, a video demonstration of the VSP platform, and instruction on how to phrase inquiries to the avatar. This included avoiding complex or compound questions, which are beyond the limitations of the NLP system. The method of obtaining end of case feedback was demonstrated in the orientation (hovering, clicking, color changes indicating successful items, etc). The learners played 2 cases as a small group to practice using the platform and then moved to complete the 4 VSP cases as individuals at a PC with a headset in the computer laboratory. A research assistant was present to answer technical questions. The orientation cases were low back pain (low risk: lumbar muscle strain; high risk: spinal metastatic breast cancer), and the individual learner cases were heartburn (low risk: gastroesophageal reflux; high risk: esophageal cancer) and headache (low risk: cluster headache; high risk: subarachnoid hemorrhage).

Actor-Based SP Component

Standardized Patient Case Development

The SP simulation included 3 cases: high-risk headache (metastatic lung cancer), low-risk back pain (low back strain), and low-risk chest pain (costochondritis). All SP cases were developed on a case template. Standardized patient actors underwent both general and case-specific orientation and were given feedback to improve their consistency before the study. The learners had previously been oriented to SP simulation and were additionally oriented to the flow of the half day. Each case had an HVC patient challenge: the headache patient has cost concerns about testing, the back pain patient is insistent on magnetic resonance imaging to rule out prostate cancer, and the chest pain patient requests a stress test to rule out a cardiac cause. Each case included approximately 15 minutes to interview and examine the SP, 15 minutes to enter the chart note, and 15 minutes to debrief with the faculty. The faculty debriefed the observational checklist (see below) or communications items at their discretion. The faculty could debrief the “less is more” conversation,18 a discussion of the harms that can come from excessive testing or interventions, that the learner may or may not have had with the patient, along with other communication suggestions. Faculty were encouraged to use an advocacy-inquiry framework39 to debrief and were offered feedback on debriefing from our simulation facilitation course faculty (W.B.).

Standardized Patient Scoring and Feedback

Performance on the SP cases was determined through 2 methods. The first was a faculty observation checklist (see pdf, Supplemental Digital Content 2, an example checklist) based on direct observation of the learner by faculty through 2-way glass, both for the primary ratings and, when applicable, for the sample of secondary ratings for interrater reliability (IRR) done by study authors (W.B., T.L., M.M.). Learner study number was worn in place of a name tag, but the learner identity was not completely blinded because of faculty familiarity with the learners in other contexts. This checklist included a critical action item list for Hx and PE items, as well as global ratings for HVC management, communication, and overall management. The global ratings for HVC and case management were on a scale using the language of entrustable professional activities40 where 1 = needs further instruction, 2 = competent, but with close supervision, 3 = competent, with minimal supervision, and 4 = competent to perform independently. The global communications scale was: 1 = unsatisfactory, 2 = satisfactory, 3 = good, 4 = very good, 5 = outstanding. Rating forms were piloted before this study, and unclear items were dropped or modified. Rater training was minimal and consisted of a brief period of instruction followed by a small amount of coaching, primarily during the pilot phase. The intention was that the rating forms should be straightforward and require little explanation. The SPs themselves did not rate the learners but were available to answer questions during debriefing.
The second method of data collection for the SP cases was the written grading rubric (see pdf, Supplemental Digital Content 3, example grading rubric) used by faculty to grade the students' patient notes. Faculty read the pdf export of the student notes, deidentified via study number, and scored them with a binary-item checklist for pertinent positives and negatives on both Hx and PE, as well as DDx items (scored for presence, not order) and testing choices (scored for presence, not workup order). A sample was double graded for IRR.

Subjects were consented for use of test scores including US Medical Licensing Examination for correlation testing. Performance and survey data were gathered using Qualtrics survey software (Qualtrics, Provo, Utah), including the faculty observation and written grading rubric checklists, as well as the CCC surveys. The CCC survey consisted of 18 questions on a 4-point Likert scale, with anchors strongly disagree, moderately disagree, moderately agree, and strongly agree (no neutral answer).35


The CCC survey data were analyzed using analysis of variance (ANOVA). For performance data comparability between scales with different numbers of rating anchors, we normalized measures to a 0 to 1 scale using linear extrapolation, allowing for proportional spacing between anchors. Depending on order, SP or VSP cases first, t tests were used to examine performance on the SP case component areas and VSP component areas. Cohen's d was used to estimate effect sizes. Internal-consistency reliability was estimated for the assessments using Cronbach α, with values of 0.60 to 0.69 considered acceptable, 0.70 to 0.79 considered good, and greater than 0.80 considered excellent.41 Interrater reliability for the faculty observation checklist and written grading rubric was calculated for a sample using intraclass correlation, with 0.40 to 0.59 considered fair, 0.60 to 0.74 good, and 0.75 to 1.00 excellent.42 Correlations in performance with the SP and VSP cases were examined using Pearson correlation coefficients. Data compilation and analyses were conducted using Stata 16 (College Station, Texas). Sample size estimation used repeated-measures ANOVA and assumed a medium effect size of 0.25 (Cohen's d), suggesting that a total of 34 learners (17 per group) would be needed to achieve a power of 0.8 with an α level of 0.05. To account for declines of consent and attrition, and to allow greater sensitivity in detecting discernible effects, we targeted 30 learners per group. For instances in which participants did not complete all parts of the study, data were excluded.
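
Two of the calculations named above can be sketched briefly, assuming a min-max rescaling for the 0 to 1 normalization and the standard formula for Cronbach α. The study's analyses were run in Stata; the function names and the example ratings below are hypothetical.

```python
from statistics import variance

def normalize(rating, low, high):
    """Linearly rescale a rating from the range [low, high] to [0, 1]."""
    return (rating - low) / (high - low)

def cronbach_alpha(item_scores):
    """Cronbach alpha for a table with rows = learners and columns = items."""
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# A rating of 3 lands at different points on the 4-point and 5-point scales
entrustment = normalize(3, low=1, high=4)    # 4-point HVC / case management scale
communication = normalize(3, low=1, high=5)  # 5-point communication scale

# Hypothetical item-level ratings for 4 learners on 3 checklist items
ratings = [[1, 2, 2], [2, 3, 3], [3, 3, 4], [4, 4, 4]]
alpha = cronbach_alpha(ratings)
```

Rescaling in this way places the anchors of each scale at evenly spaced points between 0 and 1, so a rating of 3 contributes about 0.67 on the 4-point scale but 0.50 on the 5-point scale.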



Of the 85 medical students who were eligible to participate, 79 consented to the research, with 66 completing all parts of the research. The primary reasons for noncompletion were illness or scheduling issues. Forty-one of the students were male, and 26 were female. Ages of the students were not collected. Among the participants, 37 were assigned to VSPs first and the remaining 29 to SPs first.

Appropriate Response Rate of VSP Avatars

A random sample of interview transcripts was reviewed to ensure that the avatars responded to student queries appropriately. Of the 10% of VSP avatar responses sampled, 89.2% (553/621) were found to be appropriate. Of the inappropriate avatar responses, 41.2% (28/68) were attributed to an inappropriate learner inquiry (far off topic or humorous questioning, or questions deemed unintelligible to a human reviewer).

Reliability of SP Case Assessments

Aggregate interrater reliability was calculated using a subset of the data (randomly sampled 46/198, 23%). Intraclass correlation was 0.52 (SE = 0.04) for the faculty observation checklist and 0.83 (SE = 0.03) for the written grading rubric note score. Internal-consistency reliability (Cronbach α) was 0.63 for the faculty observation checklist and 0.66 for the written grading rubric note scoring.
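
To make the interpretation of these values explicit, the sketch below applies the bands stated in the statistical methods to the four reported coefficients. The function names and the label used for values below the lowest stated band are assumptions; treating exactly 0.80 as excellent is also an assumption, because the text says only "greater than 0.80".

```python
def interpret_alpha(alpha):
    """Label a Cronbach alpha value using the bands stated in the methods."""
    if alpha >= 0.80:  # assumption: 0.80 itself is counted as excellent
        return "excellent"
    if alpha >= 0.70:
        return "good"
    if alpha >= 0.60:
        return "acceptable"
    return "below the stated bands"

def interpret_icc(icc):
    """Label an intraclass correlation using the bands stated in the methods."""
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "below the stated bands"

reported = {
    "ICC, faculty observation checklist": interpret_icc(0.52),
    "ICC, written grading rubric": interpret_icc(0.83),
    "alpha, faculty observation checklist": interpret_alpha(0.63),
    "alpha, written grading rubric": interpret_alpha(0.66),
}
for measure, label in reported.items():
    print(f"{measure}: {label}")
```

By these bands, interrater agreement was fair for the observation checklist and excellent for the note rubric, while internal consistency was acceptable for both instruments.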

Attitudes Toward CCC

Overall, attitudes toward CCC significantly improved following training (Cohen d = 0.35, P = 0.043), demonstrating moderate effect. Regardless of simulation modality (VSP or SP), the effect was comparable (not statistically different by modality). See Table 1, which combines the VSP- and SP-first groups to highlight the difference before and after the didactic plus intervention. At the individual survey item level, attitudes improved toward two of the concepts: (1) more agreement that physician clinical practices (eg, ordering, prescribing) are key drivers of high health care costs and (2) more disagreement that spending more money on health care leads to better health outcomes.

TABLE 1 - Changes on the CCC Survey
CCC Survey* Precourse† Postcourse† Overall†‡ P§
Physicians should take a more prominent role in limiting use of unnecessary tests 3.69 (0.58) 3.57 (0.78) 3.63 (0.69) 0.319
Physicians should try not to think about the cost to the health care system when making treatment decisions. 2.78 (0.88) 3.00 (0.94) 2.89 (0.91) 0.157
Physicians should be aware of the costs of the tests or treatments they recommend. 3.70 (0.55) 3.64 (0.57) 3.67 (0.56) 0.538
Physicians should talk to patients about the cost of care when discussing treatment options. 3.00 (0.72) 3.12 (0.71) 3.06 (0.71) 0.334
Physicians should change their clinical practices if the cost of care they provide is higher than colleagues who care for similar patients. 2.54 (0.75) 2.78 (0.76) 2.66 (0.76) 0.068
Physicians should provide tests or treatments if a patient requests them. 3.03 (0.70) 3.18 (0.76) 3.10 (0.73) 0.237
Physician clinical practices are key drivers of high health care costs. 2.66 (0.93) 3.21 (0.77) 2.93 (0.89) <0.001∥
The cost of a test or medication is only important if the patient has to pay for it out of pocket. 3.27 (0.73) 3.31 (0.74) 3.29 (0.73) 0.726
It is unfair to ask physicians to be cost-conscious and still keep the welfare of their patients foremost in their minds. 3.01 (0.69) 3.04 (0.77) 3.03 (0.72) 0.813
Cost-effectiveness data should be used to determine what treatments are offered to patients. 2.94 (0.85) 2.93 (0.72) 2.93 (0.79) 0.913
Patients will be less satisfied with the care they receive from physicians who discuss costs when choosing tests and treatments. 2.91 (0.69) 3.04 (0.79) 2.98 (0.74) 0.296
Doctors are too busy to worry about the costs of tests and procedures. 3.21 (0.75) 3.15 (0.78) 3.18 (0.76) 0.653
Trying to contain costs is the responsibility of every physician. 3.18 (0.72) 3.34 (0.57) 3.26 (0.65) 0.143
Managing health care resources for all patients is compatible with physicians' obligations to serve individual patients. 3.33 (0.64) 3.48 (0.59) 3.40 (0.61) 0.161
Eliminating unnecessary tests and procedures will improve patient safety. 3.55 (0.53) 3.67 (0.47) 3.61 (0.50) 0.171
Spending more money on health care leads to better health outcomes. 3.24 (0.74) 3.24 (0.85) 3.24 (0.80) <0.001∥
Practicing CCC will undermine patients' trust in physicians. 3.24 (0.76) 3.30 (0.78) 3.27 (0.77) 0.654
Ordering fewer tests and procedures will increase physicians' risk of medical malpractice litigation. 2.60 (0.80) 2.82 (0.76) 2.71 (0.78) 0.098
Overall 3.10 (0.29) 3.21 (0.32) 3.16 (0.31) 0.043¶
*Survey consisted of 4-point scale for each question, where 1 is least agreement with CCC decision and 4 is most agreement.
†Values in parentheses are SDs.
‡Effect size for pretraining and posttraining difference in survey is Cohen's d = 0.35, P = 0.043.
§P values based on ANOVA.
∥P < 0.001.
¶P < 0.05.
Changes before and after curriculum + VSPs or curriculum + SPs.
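
As a rough check on the overall effect size, Cohen's d can be recomputed from the summary statistics in the Overall row of Table 1. This sketch uses the pooled-SD formula for two groups and assumes 66 respondents at each administration; the study's own calculation may have used a different formulation (for example, one that accounts for the paired design), so the result is only an approximation.

```python
from math import sqrt

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for two sets of scores using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / sqrt(pooled_var)

# Overall row of Table 1: precourse 3.10 (0.29), postcourse 3.21 (0.32).
# n = 66 per administration is an assumption.
d = cohens_d(3.10, 0.29, 66, 3.21, 0.32, 66)
print(round(d, 2))  # 0.36
```

The recomputed value of about 0.36 is close to the reported 0.35; a small gap of this size is what would be expected when working from means and SDs rounded to 2 decimal places.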

Standardized Patient Assessment Performance

No significant differences were found in learner performance on SP cases based on simulation order (Table 2). Overall SP case performances for the SP case note score were 0.43 (SD = 0.08) and, for the faculty observation scores: communication 0.71 (SD = 0.08), HVC components 0.62 (SD = 0.11), case management 0.63 (SD = 0.10), and overall faculty observation checklist 0.69 (SD = 0.06).

TABLE 2 - Standardized Patient (SP) Assessment Scores
SP Case Score Type Sim Order: SP → VSP (n = 29) Sim Order: VSP → SP (n = 37) Combined (n = 66)
Mean SD Mean SD Mean SD
Back pain FOC score 0.71 0.07 0.73 0.08 0.72 0.07
Case management 0.65 0.14 0.63 0.16 0.64 0.15
HVCCC component 0.63 0.17 0.64 0.14 0.63 0.15
Communication 0.69 0.15 0.68 0.15 0.69 0.15
Note score 0.43 0.08 0.46 0.11 0.45 0.10
Chest pain FOC score 0.76 0.10 0.76 0.08 0.76 0.09
Case management 0.71 0.15 0.65 0.18 0.68 0.17
HVCCC component 0.68 0.15 0.63 0.18 0.66 0.17
Communication 0.73 0.12 0.71 0.14 0.72 0.13
Note score 0.49 0.12 0.48 0.11 0.48 0.11
Headache FOC score 0.59 0.08 0.63 0.09 0.61 0.09
Case management 0.56 0.14 0.59 0.17 0.58 0.16
HVCCC component 0.56 0.17 0.57 0.17 0.57 0.17
Communication 0.69 0.12 0.73 0.12 0.71 0.12
Note score 0.38 0.11 0.36 0.08 0.37 0.10
Overall FOC score 0.68 0.05 0.70 0.06 0.69 0.06
Case management 0.64 0.10 0.62 0.11 0.63 0.10
HVCCC component 0.62 0.10 0.62 0.11 0.62 0.11
Communication 0.70 0.09 0.71 0.08 0.71 0.08
Note score 0.42 0.09 0.43 0.08 0.43 0.08
All P > 0.05 (not significant).
Descriptive statistics (mean and SD) and comparison by simulation order. The checklist score, case management score, HVC component score, and communication score were all from the faculty observation checklist (direct faculty observation). The note score was from faculty grading of the learner's patient note.
FOC, faculty observation checklist; HVCCC, high value cost conscious care.

Virtual Standardized Patient Assessment Performance

In examining aggregate performance on the VSP cases (Table 3), scores in the VSP interview section were higher when the VSPs were completed first: 0.42 (SD = 0.08) for VSP first versus 0.34 (SD = 0.12) for SP first. Treatment scores on the VSP cases were also higher when the VSPs were done first: 0.58 (SD = 0.16) versus 0.50 (SD = 0.14).

TABLE 3 - Virtual Standardized Patient (VSP) Assessment Scores
VSP Case Score Type Sim Order: SP → VSP (n = 29) Sim Order: VSP → SP (n = 37) Combined (n = 66) P
Mean SD Mean SD Mean SD
Heartburn/GERD* Interview 0.34 0.12 0.43 0.10 0.38 0.12 0.001
Diagnosis 0.46 0.30 0.62 0.32 0.54 0.32 0.028
Cost† $899 $1045 $63 $817 $757 $935 0.324
Treatment 0.55 0.38 0.74 0.34 0.65 0.37 0.026
Heartburn/esophageal cancer Interview 0.30 0.13 0.38 0.10 0.34 0.12 0.005
Diagnosis 0.89 0.23 0.89 0.25 0.89 0.24 0.996
Test 0.32 0.19 0.38 0.21 0.35 0.20 0.192
Cost† –$303 $1582 –$459 $1303 –$385 $1428 0.723
Treatment 0.64 0.17 0.68 0.22 0.66 0.20 0.394
Headache/subarachnoid hemorrhage Interview 0.37 0.16 0.44 0.14 0.40 0.15 0.051
Diagnosis 0.46 0.30 0.56 0.28 0.51 0.29 0.166
Test 0.23 0.24 0.28 0.22 0.25 0.23 0.298
Cost† –$26 $1102 –$295 $528 –$274 $845 0.873
Treatment 0.28 0.19 0.37 0.20 0.32 0.20 0.064
Headache/cluster headache Interview 0.34 0.16 0.43 0.10 0.38 0.14 0.004
Diagnosis 0.66 0.34 0.68 0.37 0.67 0.35 0.742
Test 0.10 0.20 0.07 0.15 0.09 0.18 0.546
Cost† $148 $572 $275 $426 $213 $51 0.401
Treatment 0.55 0.23 0.52 0.27 0.53 0.25 0.615
Overall Interview 0.34 0.12 0.42 0.08 0.38 0.11 0.001‡
Diagnosis 0.62 0.17 0.69 0.18 0.65 0.18 0.082
Test 0.21 0.15 0.25 0.12 0.23 0.13 0.245
Treatment 0.50 0.14 0.58 0.16 0.54 0.16 0.043‡
Descriptive statistics (mean and SD) and comparison by simulation order. These scores are from the automated grading export from the VSP platform.
*No testing was recommended for this case.
†This value describes the difference in costs between testing workup ordered by the learner and the faculty consensus testing cost. A negative difference indicated the mean learner workup cost less than faculty consensus, and a positive difference indicated the learner workup cost more than faculty consensus.
‡P < 0.05.
GERD, gastroesophageal reflux disease.

Association With the US Medical Licensing Examination

Combined correlations were positive for all VSP and SP cases, for all components of both, for both US Medical Licensing Examination (USMLE) step 1 and USMLE step 2 clinical knowledge (CK; Table 4). Correlations were statistically significant between step 1 and SP performance on overall case management and the HVC component. Correlation was statistically significant between the step 2 CK and the SP HVC component.

TABLE 4 - Correlation of Simulation Outcomes With USMLE Scores
USMLE (N = 66) Simulation Format Score Type Combined
Correlation P
Step 1* VSP Interview 0.22 0.080
Diagnosis 0.08 0.532
Test 0.07 0.593
Cost 0.06 0.722
Treatment 0.04 0.773
SP Note score 0.19 0.134
Checklist score 0.03 0.806
Case management 0.23 0.060
HVCCC component 0.26† 0.038
Communication 0.04 0.773
Step 2 CK* VSP Interview 0.02 0.904
Diagnosis 0.03 0.838
Test 0.07 0.682
Cost 0.34 0.179
Treatment 0.08 0.604
SP Note score 0.13 0.393
Checklist score 0.07 0.640
Case management 0.21 0.172
HVCCC component 0.33† 0.031
Communication 0.08 0.600
Correlations between both steps of the USMLE and performance on SPs and VSPs.
*USMLE step 1: mean = 235.08 (SD = 16.92) and step 2 CK: mean = 244.79 (SD = 14.99); there were no differences by simulation order, P = 0.369 and P = 0.159, respectively.
†P < 0.05.
HVCCC, high value cost conscious care.
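
For readers less familiar with the statistic in Table 4, the sketch below shows how a Pearson correlation coefficient and its associated t statistic are computed. The score values are hypothetical and are not study data; converting the t statistic to a P value requires a t distribution with n - 2 degrees of freedom, which is left to a statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r**2))

# Hypothetical examination scores and normalized HVC component scores
usmle = [220, 231, 238, 245, 252]
hvc = [0.55, 0.60, 0.58, 0.70, 0.66]
r = pearson_r(usmle, hvc)
t = t_statistic(r, len(usmle))
```

Because the t statistic grows with the square root of the sample size, a modest correlation reaches significance only with enough learners, which is relevant when comparing rows of Table 4 that may rest on different numbers of available examination scores.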


While actor-based simulation has been used to teach43 and assess44 HVC, we believe that this is the first, albeit small, study to apply NLP-based VSPs as an adjunct to a broader HVC curriculum. We successfully implemented and integrated VSPs, creating an additional opportunity for deliberate practice for these novice learners. Attitudes toward CCC improved after didactics plus either method of simulation. This is encouraging, because the SP cases were a formative exercise constituting a small part of their clerkship assessment and included faculty-led debriefing after direct faculty observation, which implies that faculty consider the topic important. In contrast, the VSP cases had no formal debrief other than the automated feedback within each case and, from the standpoint of consequences, were likely perceived as less important, because they lacked clerkship credit value other than participation. We considered repeating the CCC survey after didactics only and after the second simulation method but were concerned about learner survey fatigue. Given this limitation, our results suggest that either the change in attitude can be attributed to the didactic portion of the curriculum (the same in both groups) or that both the VSP and the SP encounters can contribute to improving attitudes toward CCC.

We did not see improvement in SP case performance after VSP practice. Potential reasons for this include small effect size with exposure to only 4 VSP cases, limited ability to detect an effect due to small sample size, degree of learner engagement with the VSP end of case feedback, and modality performance specificity (learning acquired from VSPs may not translate to standard SP cases). Content specificity was also likely a factor, as expertise in managing one case or content area does not typically translate into other content areas.45 We suggest that researchers find methods to record feedback interactions and conduct surveys or qualitative exploration of perceived effectiveness of feedback. Content specificity might be overcome by training on a series of cases in a complaint area, for example, chest pain cases, and then returning after a delay to complete a similar series of cases in another testing method.

With regard to correlation with alternative assessments, none of the VSP scores were statistically significant in their correlation with the USMLE step 1 and step 2 clinical knowledge, but we were encouraged that all areas of the VSP cases positively correlated with USMLE performance. The lack of correlation between the VSPs and performance on the SPs may reflect issues including sample size, perceived lower consequences for VSPs, content specificity, or assessment of different underlying constructs. While overall SP note score performances seemed low at 0.43, they are in keeping with scores for our institution’s graduate competency examination stations using SPs.

An interesting finding was that student performance in some areas of the VSPs was better when the VSPs were done first, which may reflect fatigue as a factor when the VSPs were done later in the learning day. The addition of greater consequences may have mitigated this effect to some degree, but given that this was the introduction of VSP work to the educational production environment, we hesitated to attach substantial grade credit. Also relevant to motivation is that the VSPs were simulated patient encounters, and there was very little “fun factor” or game-like quality beyond the novelty of the speaking avatar. The avatar's appropriate response rate of approximately 89% is beginning to reach an acceptable standard for assessment for learning46; for assessment of learning (high-stakes assessment), however, it remains a concern.

Patient workup cost comparisons were not a primary study focus, but the VSP format enables deeper exploration of the topic and provides feedback to learners. The costs of learner-driven workups from the VSPs showed significant variability as demonstrated by the large SDs for cost in Table 3, and simulation modality order (VSP or SP first) did not demonstrate an impact. We note that the pattern of ordering tests and thus costs within the VSPs was as expected for novices: ordering too much in low-risk cases and too little in high-risk cases. Future studies or VSP designs might bring more attention to costs either during the ordering process or during case feedback.

Overall, the VSPs provided a complement to the SPs, allowing an opportunity to practice more cases rapidly, look for red flag features, and choose or avoid diagnostic tests in a standardized fashion. The tested generation of VSPs did not challenge the learner on ordering (asking for a test, for example) or engage in any shared decision making or “less-is-more” conversation. Thus, at this stage, we recommend VSP practice before SPs, with a focus on identification of high-risk history and physical features, an area of strength with immediate feedback. We plan to build patient challenges, such as the desire or reluctance for testing or treatment options, into the VSP case format in future iterations in the hope of improving learner engagement. Future VSP cases could easily integrate links to clinical risk score calculators or evidence-based testing guidance to encourage weighing of literature evidence before a decision on test ordering. Our learners were not restricted from looking up such guidance, but it was not built into the current-generation VSPs. Successful HVC curricular efforts might also include online interactive cases,47,48 rounding tools,49 and value-focused charting reminders,50 in addition to SP methods.43


Limitations included minimal rater training, risk of rater bias due to incomplete blinding, fair IRR, moderate internal consistency reliability, and small sample size. With regard to IRR, we found primarily that faculty had to agree on what reasonable expectations were for MS3s as opposed to resident learners, as many faculty teach and debrief both. Although not a study limitation per se, we debriefed each SP case immediately after the note writing process, which differs from end-of-year or high-stakes assessment, in which no debriefing occurs that might influence performance on the next case. These, along with using only a single study site, are threats to the generalizability of the results.


When combined with didactics, either method of simulation showed a moderate improvement in attitudes toward cost-conscious care. The ability to detect an impact of VSPs on SP performance was likely limited by content specificity and sample size. The study illustrates the challenges in executing VSP simulation studies and shares considerations for future iterative applications.


The authors thank the University of Illinois College of Medicine at Peoria faculty who rated the students during the simulations. The authors also thank Thomas “Brett” Talbot, Jason Taylor, and Dilip Mahale for their work in simulation case development and rating. The authors thank Jessica Fish, Jeremy McGarvey, Dipen Kumar, Keaton McClure, Gabriel Beratio, Rebecca Ruger, and Jenny Doerr for contributions at various stages.


1. Levitt L, Claxton G, Cox C, Gonzales S, Kamal R. Assessing the performance of the U.S. health system. In: Peterson-Kaiser Health System Tracker. Kaiser Family Foundation; November 30th, 2014: Available at: Accessed January 20, 2017.
2. Papanicolas I, Woskie LR, Jha AK. Health care spending in the United States and other high-income countries. JAMA 2018;319(10):1024–1039.
3. Shrank WH, Rogstad TL, Parekh N. Waste in the US health care system: estimated costs and potential for savings. JAMA 2019;322(15):1501–1509.
4. Zhi M, Ding EL, Theisen-Toupal J, Whelan J, Arnaout R. The landscape of inappropriate laboratory testing: a 15-year meta-analysis. PLoS One 2013;8(11):e78962.
5. Chien AT, Lehmann LS, Hatfield LA, et al. A randomized trial of displaying paid price information on imaging study and procedure ordering rates. J Gen Intern Med 2017;32(4):434–448.
6. Durand DJ, Feldman LS, Lewin JS, Brotman DJ. Provider cost transparency alone has no impact on inpatient imaging utilization. J Am Coll Radiol 2013;10(2):108–113.
7. Goetz C, Rotman SR, Hartoularos G, Bishop TF. The effect of charge display on cost of care and physician practice behaviors: a systematic review. J Gen Intern Med 2015;30(6):835–842.
8. Silvestri MT, Bongiovanni TR, Glover JG, Gross CP. Impact of price display on provider ordering: a systematic review. J Hosp Med 2016;11(1):65–76.
9. Owens DK, Qaseem A, Chou R, Shekelle P; Clinical Guidelines Committee of the American College of Physicians. High-value, cost-conscious health care: concepts for clinicians to evaluate the benefits, harms, and costs of medical interventions. Ann Intern Med 2011;154(3):174–180.
10. Weinberger SE. Providing high-value, cost-conscious care: a critical seventh general competency for physicians. Ann Intern Med 2011;155(6):386–388.
11. Smith CD, Levinson WS; Internal Medicine HVC Advisory Board. A commitment to high-value care education from the internal medicine community. Ann Intern Med 2015;162(9):639–640.
12. Smith CD. Teaching high-value, cost-conscious care to residents: the Alliance for Academic Internal Medicine–American College of Physicians Curriculum. Ann Intern Med 2012;157(4):284–286.
13. Leep Hunderfund AN, Starr SR, Dyrbye LN, et al. Imprinting on clinical rotations: multisite survey of high- and low-value medical student behaviors and relationship with healthcare intensity. J Gen Intern Med 2019;34(7):1131–1138.
14. Rao G, Kanter SL. Physician numeracy as the basis for an evidence-based medicine curriculum. Acad Med 2010;85(11):1794–1799.
15. West CP, Jaeger TM, McDonald FS. Extended evaluation of a longitudinal medical school evidence-based medicine curriculum. J Gen Intern Med 2011;26(6):611–615.
16. Elwyn G, Edwards A, Kinnersley P, Grol R. Shared decision making and the concept of equipoise: the competences of involving patients in healthcare choices. Br J Gen Pract 2000;50(460):892–899.
17. Elwyn G, Frosch D, Thomson R, et al. Shared decision making: a model for clinical practice. J Gen Intern Med 2012;27(10):1361–1367.
18. Grady D, Redberg RF. Less is more: how less health care can result in better health. Arch Intern Med 2010;170(9):749–750.
19. Hauer KE, Fernandez A, Teherani A, Boscardin CK, Saba GW. Assessment of medical students' shared decision-making in standardized patient encounters. J Gen Intern Med 2011;26(4):367–372.
20. Talbot TB, Kalisch N, Christoffersen K, Lucas G, Forbell E. Natural language understanding performance & use considerations in virtual medical encounters. Stud Health Technol Inform 2016;220:407–413.
21. Maicher K, Danforth D, Price A, et al. Developing a conversational virtual standardized patient to enable students to practice history-taking skills. Simul Healthc 2017;12(2):124–131.
22. Cook DA, Erwin PJ, Triola MM. Computerized virtual patients in health professions education: a systematic review and meta-analysis. Acad Med 2010;85(10):1589–1602.
23. Cook DA, Triola MM. Virtual patients: a critical literature review and proposed next steps. Med Educ 2009;43(4):303–311.
24. Bergin R, Youngblood P, Ayers MK, et al. Interactive simulated patient: experiences with collaborative E-learning in medicine. J Educ Comput Res 2003;29(3):387–400.
25. Hubal RC, Kizakevich PN, Guinn CI, Merino KD, West SL. The virtual standardized patient. Simulated patient-practitioner dialog for patient interview training. Stud Health Technol Inform 2000;70:133–138.
26. Schittek Janda M, Mattheos N, Nattestad A, et al. Simulation of patient encounters using a virtual patient in periodontology instruction of dental students: design, usability, and learning effect in history-taking skills. Eur J Dent Educ 2004;8(3):111–119.
27. Stevens A, Hernandez J, Johnsen K, et al. The use of virtual patients to teach medical students history taking and communication skills. Am J Surg 2006;191(6):806–811.
28. Vash JH, Yunesian M, Shariati M, Keshvari A, Harirchi I. Virtual patients in undergraduate surgery education: a randomized controlled study. ANZ J Surg 2007;77(1–2):54–59.
29. Fleiszer D, Hoover ML, Posel N, Razek T, Bergman S. Development and validation of a tool to evaluate the evolution of clinical reasoning in trauma using virtual patients. J Surg Educ 2018;75(3):779–786.
30. Forsberg E, Ziegert K, Hult H, Fors U. Assessing progression of clinical reasoning through virtual patients: an exploratory study. Nurse Educ Pract 2016;16(1):97–103.
31. Kleinert R, Heiermann N, Wahba R, Chang DH, Hölscher AH, Stippel DL. Design, realization, and first validation of an immersive web-based virtual patient simulator for training clinical decisions in surgery. J Surg Educ 2015;72:1131–1138.
32. McCoy L, Pettit RK, Lewis JH, Allgood JA, Bay C, Schwartz FN. Evaluating medical student engagement during virtual patient simulations: a sequential, mixed methods study. BMC Med Educ 2016;16:20.
33. Triola M, Feldman H, Kalet AL, et al. A randomized trial of teaching clinical skills using virtual and live standardized patients. J Gen Intern Med 2006;21(5):424–429.
34. Cheng A, Kessler D, Mackinnon R, et al. Reporting guidelines for health care simulation research: extensions to the CONSORT and STROBE statements. Adv Simul (Lond) 2016;1:25.
35. Leep Hunderfund AN, Dyrbye LN, Starr SR, et al. Role modeling and regional health care intensity: U.S. medical student attitudes toward and experiences with cost-conscious care. Acad Med 2017;92(5):694–702.
36. Clancy C, Dine CJ, Williams D, Smith CD. The High Value Care Curriculum for Internal Medicine Residents. (Version 4.0 current, Version 3.0 used in study). American College of Physicians. 2018. Available at: Accessed February 25, 2020.
37. Yudkowsky R, Park YS, Riddle J, Palladino C, Bordage G. Clinically discriminating checklists versus thoroughness checklists: improving the validity of performance test scores. Acad Med 2014;89(7):1057–1062.
38. Bond WF, Lynch TJ, Mischler MJ, et al. Virtual standardized patient simulation: case development and pilot application to high-value care. Simul Healthc 2019;14(4):241–250.
39. Rudolph JW, Simon R, Raemer DB, Eppich WJ. Debriefing as formative assessment: closing performance gaps in medical education. Acad Emerg Med 2008;15(11):1010–1016.
40. ten Cate O, Chen HC, Hoff RG, Peters H, Bok H, van der Schaaf M. Curriculum development for the workplace using entrustable professional activities (EPAs): AMEE guide no. 99. Med Teach 2015;37(11):983–1002.
41. Yudkowsky R, Park YS, Downing SM. Assessment in Health Professions Education. New York: Routledge; 2019.
42. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess 1994;6(4):284–290.
43. Natt N, Starr SR, Reed DA, Park YS, Dyrbye LN, Leep Hunderfund AN. High-value, cost-conscious communication skills in undergraduate medical education: validity evidence for scores derived from two standardized patient scenarios. Simul Healthc 2018;13(5):316–323.
44. Pahwa AK, Eaton K, Apfel A, Bertram A, Ridell R, Cayea D. Effect of a high value care curriculum on standardized patient exam in the core clerkship in internal medicine. BMC Med Educ 2020;20(1):365.
45. Turnbull J, Danoff D, Norman G. Content specificity and oral certification examinations. Med Educ 1996;30(1):56–59.
46. Dannefer EF. Beyond assessment of learning toward assessment for learning: educating tomorrow's physicians. Med Teach 2013;35(7):560–563.
47. Aquifer high value care: a case-based virtual course. 2020. Available at: 2020. Accessed August 20, 2021.
48. Avery CS, Beck J, Padrez R, et al. Developing the PedsValue modules—a national pediatric high value care curriculum. Acad Pediatr 2017;17(5):471–475.
49. McDaniel CE, White AA, Bradford MC, et al. The high-value care rounding tool: development and validity evidence. Acad Med 2018;93(2):199–206.
50. Moser EM, Huang GC, Packer CD, et al. SOAP-V: introducing a method to empower medical students to be change agents in bending the cost curve. J Hosp Med 2016;11(3):217–220.

High value care; virtual standardized patients; standardized patients; educational assessment; deliberate practice; virtual simulation

Copyright © 2022 Society for Simulation in Healthcare