Taking a medical history and using the information to develop appropriate diagnoses are fundamental skills for any successful physician.1 Most methods for assessing history taking involve interaction with standardized patients (SPs) and real patients. Although SP interviews can be standardized, they require significant faculty effort and ongoing support.2,3 In addition, SP interactions can be internally inconsistent, with quality depending on the experience and training of the actor portraying the patient.2,3
Virtual standardized patients (VSPs) are avatar-based representations of human SPs that converse with students using natural language.4,5 Using spoken or typed questions, students can use a VSP to practice their history-taking skills and gain experience before interacting with standardized or real patients. Because understanding the variety of question forms in a typical natural language encounter is difficult, any platform designed to communicate using natural language must employ robust dialogue management that can direct the conversations with a high degree of accuracy.
Lok and coworkers6–8 have described a system that allows students to take histories using chatted or selection-based approaches. Their system uses XML-based keyword matching to find questions in their database that are most similar to questions being asked. In contrast, investigators using the NPC Editor at the Institute for Creative Technologies use a text classifier approach, matching user questions to known answers rather than known questions.9 Other investigators have created similar systems using natural language communication.2,3,10 Based on our literature review, system response accuracy rates for these approaches generally range from 60% to 84%.7–9
We describe herein the development of a conversationally adept VSP system with accuracy that equals or exceeds that reported by other investigators. We review the key components of the system and provide preliminary data on the accuracy and utility of the platform. All studies were approved by the institutional review board of The Ohio State University College of Medicine.
Our system consists of the following 3 components: (1) natural language processing (NLP) software that controls the conversation between the doctor and patient, (2) the virtual environment in which the doctor-patient interaction occurs, and (3) the electronic medical record (EMR) of the virtual patient in which students document the encounter.
The Conversation Engine
Our system is based on the open-source NLP engine ChatScript.11 The system is designed to answer questions appropriate for a complete medical history, with more limited ability to engage in general conversation. ChatScript has several advantages that make it especially suited for doctor-patient dialogues where conversations tend to be fairly narrow (covering a limited domain) but can be quite detailed in their depth and nuance.
- ChatScript has extensive pattern-matching capabilities employing simple concise syntax that allows users to focus on meaning, including the ability to exclude words to avoid improper context.
- ChatScript can make extensive use of topics, which are independent of each other and provide logically structured groupings of rules for particular areas of knowledge, making it easy for multiple authors to work on a project simultaneously.
- ChatScript allows precise control of context and dialogue state, enabling the VSP to accurately respond to similar or identical questions (eg, “Tell me more”) with context-specific answers and to questions containing pronouns (eg, “When did that happen?”).
- ChatScript has extensive analytics and debugging tools to facilitate dialogue management and creation of multiple VSPs with unique characteristics.
- ChatScript is relatively simple to use for the authors, requiring minimal knowledge of programming.
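The topic scoping and context tracking described above can be illustrated with a simplified sketch. The following Python fragment is our illustration, not ChatScript itself, and the topics, patterns, and answers are invented examples; it mimics how rules grouped into independent topics let an ambiguous follow-up such as "Tell me more" resolve against the most recently active topic.

```python
import re

# Invented example rules, grouped into topics as ChatScript would group them.
RULES = {
    "present_illness": [
        (re.compile(r"\b(when|how long).*\b(pain|hurt)", re.I),
         "The back pain started about two weeks ago."),
        (re.compile(r"\btell me more\b", re.I),
         "It is a dull ache that gets worse when I stand for a while."),
    ],
    "medications": [
        (re.compile(r"\b(medications?|drugs?|taking)\b", re.I),
         "I take ibuprofen when the pain is bad."),
        (re.compile(r"\btell me more\b", re.I),
         "Usually two tablets, maybe three or four times a week."),
    ],
}

class Conversation:
    def __init__(self):
        self.current_topic = None  # dialogue state carried between turns

    def respond(self, question):
        # Try the active topic first so ambiguous follow-ups stay in context.
        ordered = [self.current_topic] if self.current_topic else []
        ordered += [t for t in RULES if t != self.current_topic]
        for topic in ordered:
            for pattern, answer in RULES[topic]:
                if pattern.search(question):
                    self.current_topic = topic
                    return answer
        return "I'm sorry, could you rephrase that?"
```

In ChatScript proper, this behavior is expressed declaratively with topics and rejoinders rather than explicit bookkeeping, but the effect is the same: "Tell me more" after a medication question yields medication detail, while the identical words after a pain question yield symptom detail.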
Previous efforts to develop a VSP system used Second Life12 as the immersive environment and artificial intelligence markup language13 as the dialogue engine.14 Although this early system was useful for demonstrating the feasibility of the approach, both the virtual environment and the dialogue management were less than ideal. Because of the inefficiency of the artificial intelligence markup language syntax, managing the conversations required more than 200,000 rules. Even with such a large corpus of patterns, our accuracy rates rarely exceeded 70%.15 As such, we redesigned the application using Unity 3D to create the virtual environments and deployed ChatScript for dialogue management.
The initial set of rules used in ChatScript was designed from the dialogues captured using our previous platform. Because we already had a relatively large set of dialogues available, we did not use any specific approaches for dialogue development such as the Human-Centered Distributed Conversational Modeling approach described by Rossen and Lok.16 Rules were created, tested by the authors, and refined over 2 or 3 iterations. Once the basic dialogue rules had been created, experiments were conducted with medical students to test the accuracy of the system.
Our ChatScript database contains approximately 2500 rules for managing conversations. This number is relatively small because of the narrow focus of the conversations, as opposed to general-purpose systems designed to converse on any topic using tens of thousands of rules.17 Our rules (or patterns) are arranged in the following categories: history of present illness, medical history, family history, and social history. We also have categories for medications and for opening and closing dialogue, and a small set of rules for managing general conversation questions.
To begin a conversation, the student speaks or types a question into the system. ChatScript first performs basic processing of the input (spell checking, canonization, determining parts of speech, classifying the input type [question vs statement], managing interjections, etc). Inputs are then analyzed for specific keywords identifying current history, medical history, or family history questions. Finally, the input is matched against the rules or patterns in the relevant topics, and when a match is found, the output is generated.
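As a rough sketch of this pipeline (ours, not the production code; the canonical forms and keyword sets below are invented examples), input can be normalized, typed, and routed to a topic before pattern matching:

```python
# Invented canonization table and keyword sets for illustration only.
CANONICAL = {"meds": "medication", "hx": "history", "fam": "family"}

TOPIC_KEYWORDS = {
    "family_history": {"family", "mother", "father", "sibling"},
    "medical_history": {"history", "surgery", "illness", "condition"},
    "present_illness": {"pain", "symptom", "start", "worse"},
    "medications": {"medication", "pill", "drug"},
}

def preprocess(text):
    """Strip punctuation, lowercase, and canonize each token."""
    tokens = [t.strip("?,.!").lower() for t in text.split()]
    return [CANONICAL.get(t, t) for t in tokens]

def route(text):
    """Return (input_type, topic); topic is None for general conversation."""
    tokens = set(preprocess(text))
    kind = "question" if text.rstrip().endswith("?") else "statement"
    for topic, keywords in TOPIC_KEYWORDS.items():
        if tokens & keywords:
            return kind, topic
    return kind, None  # fall through to the general-conversation rules
```

Routing first by keyword narrows the set of topics whose patterns must be tried, which is one reason a narrow-domain rule base can stay small.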
If an input question correctly matches the appropriate pattern and returns the appropriate response, we consider that as “answered correctly.” If a question does not match any pattern in our system, we consider that as “not answered.” Finally, if an input question matches a pattern other than the intended pattern, we consider that as “answered incorrectly,” regardless of whether the response generated might be plausible for the question asked.
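This three-way grading can be captured in a few lines. The sketch below is an illustration with invented rule identifiers: it tallies a log of (matched rule, intended rule) pairs into the three outcome categories and reports each as a percentage.

```python
# Grade a log of (matched_rule, intended_rule) pairs into the three
# categories defined above and return each as a percentage of the total.
def score(log):
    counts = {"correct": 0, "incorrect": 0, "unanswered": 0}
    for matched_rule, intended_rule in log:
        if matched_rule is None:
            counts["unanswered"] += 1   # no pattern matched
        elif matched_rule == intended_rule:
            counts["correct"] += 1      # the intended pattern fired
        else:
            counts["incorrect"] += 1    # wrong pattern, even if the answer was plausible
    total = len(log) or 1
    return {k: round(100 * v / total, 1) for k, v in counts.items()}
```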
The 3D characters serve as the virtual interfaces through which the conversation engine functions (Fig. 1). Our agents use supplemental communication channels such as facial expressions and gestures to mimic elements of natural human conversation.
Several design considerations directed the production of these agents: (1) establishing appropriate degrees of agent fidelity, (2) implementing emotional expression and responsiveness, (3) maximizing character diversity, and (4) streamlining the development pipeline. These were considered important to the success of the system as an effective and practical tool that could be easily implemented and extended.
The Unity 3D game engine18 was selected as the development environment for the VSP (Fig. 2) because it offered an intuitive and powerful means of designing interactive applications. In addition to its many features and relatively low learning curve, the Unity environment is capable of exporting builds to multiple platforms capable of running on a variety of devices.
Character models were created with the Autodesk Character Generator (ACG;19 Fig. 3) and refined using Autodesk Maya20 to increase fidelity. Models generated from the ACG can be produced quickly, which facilitates the addition of new patients (Fig. 4).
Agents were animated through the application and blending of animation clips onto the ACG character skeletons. Most of these clips were obtained by motion-capturing student performances of doctor-patient sessions to provide the natural and subtle movements, postures, and gestures of a seated patient. Motion capture was selected over other animation techniques in an attempt to provide movement that was more fluid and representative of the states that the agents were designed to portray (Fig. 5).
The VSP was designed in 2 versions, a web-based application that students run in a web browser (Fig. 6), and a stand-alone version that extends the capabilities by allowing students to interact with life-sized representations of agents using speech-based conversation (Fig. 7). Both versions were created to counterbalance tradeoffs between accessibility and realism. While the web version provides greater accessibility for students, the stand-alone system provides a more natural conversational experience.
The web version of the application was created for students to access on any internet-connected laptop or workstation running a web browser. Although speech-to-text technologies have advanced in recent years, the unpredictable nature of end-user environments in applications using web-based speech recognition makes the implementation challenging.21,22 Without the ability to control for factors such as microphone quality, user proximity, and ambient background noise, smooth verbal conversations over extended periods of time can be difficult to conduct. Moreover, many environments (eg, libraries, coffee houses) are not appropriate for private verbal conversations with a VSP. Implementation of a speech-to-text system across multiple types of laptop hardware and software is expensive and technically challenging. Finally, students with speech and hearing impairments have additional challenges to overcome with speech-based interaction, regardless of the environment.
To address these issues, conversations in the web versions take place through typed text instead of spoken dialogue. Students participate in these conversations by typing questions and reading text-based responses. By providing text-based interaction, problems of accessibility, usability, and variability of speech-to-text accuracy were reduced or eliminated.
Limitations of the web-based system center on the inherent differences between spoken and written conversations. Text-based dialogues tend to be more succinct and focused, with questions typically addressing a single topic. Spoken conversations are more casual, often including unessential dialogue, slang, irrelevant modifiers (eg, “like”), and various disfluencies (eg, “uh”, “um”).23 In our experience, spoken conversations are also more likely to include multiple questions or topics in a single conversational turn.
The stand-alone version of the VSP allows students to speak to a life-sized display of a virtual character in a space resembling an examination room (Fig. 7). In this version, the speech recognition software Dragon Naturally Speaking (http://www.nuance.com)24 was used to capture and translate verbal questions. These questions are captured by a headset microphone or a mounted Microsoft Kinect25 camera containing a multiarray microphone. Questions are translated to text and are submitted by the Unity application to ChatScript for pattern matching and response generation. Responses include patient answers as subtitle text and/or synthesized speech, as well as variables dictating the behaviors the patients will display. These behavior variables are used to initiate gestures, posture, and facial expressions that emulate emotional responses. The Kinect camera is also used to track student movement and gestures. Using custom scripts that rotate the head, neck, and eye joints of the agents, their gaze adjusts to follow students while they move. The system is also programmed to have the VSP occasionally “look away” from the student to more naturally mimic an authentic encounter; a gaze continually fixed on the student proved unnerving. Interactive features such as these were integrated to help establish agent presence, responsiveness, and the illusion of inhabiting physical space.
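The gaze-following behavior can be sketched in simplified form. The fragment below is our illustration, not the authors' Unity scripts: it computes the head yaw needed to face the tracked student position and occasionally applies a random offset to mimic the "look away" behavior; the probability and offset range are invented parameters.

```python
import math
import random

def gaze_yaw(agent_pos, student_pos, look_away_prob=0.05, rng=random):
    """Positions are (x, z) in meters; returns head yaw in degrees (0 = facing forward).

    look_away_prob and the +/-30 degree glance range are illustrative values,
    not parameters from the system described in the text.
    """
    dx = student_pos[0] - agent_pos[0]
    dz = student_pos[1] - agent_pos[1]
    yaw = math.degrees(math.atan2(dx, dz))   # yaw that points the head at the student
    if rng.random() < look_away_prob:
        yaw += rng.uniform(-30.0, 30.0)      # brief glance away from the student
    return yaw
```

In the actual system this kind of angle would drive the head, neck, and eye joint rotations each frame as the Kinect updates the tracked position.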
Electronic Medical Record
The student version of the EMR was created to allow students to practice documentation skills in a learning environment. The demographics of the VSP are prepopulated in the EMR, after which students document key aspects of the encounter on the basis of the information obtained during the interview. The EMR in the virtual encounter is a replica of the version used in clinical practice at Ohio State but without any actual patient data.
Initial experiments involved third-year students taking a focused history from a patient with back pain (Table 1). The patient (VPatient1) presented with back pain, leg pain, and nocturnal urination; was taking ibuprofen and saw palmetto; and had no previous history of illness. The students asked 855 questions, of which 86% were answered correctly, 5% were answered incorrectly (pattern mismatches), and 9% were not answered (no pattern matched in the database).
A second experiment was performed with first-year students taking a more complete history of the patient, including family history, medical history, and social history in addition to the history of present illness (Table 1). Because first-year students had less experience with the differential diagnosis for acute back pain, this case was slightly less complex (VPatient2; no leg pain, no nocturnal urination, and ibuprofen as the only medication) than the previous case. The VSP was able to correctly answer 83% of the questions asked, with approximately 6% answered incorrectly and 10% unanswered.
To assess the ability of the system to manage spoken conversations, we recruited first-year students to take a history from the simple VSP (VPatient2) in our Clinical Skills Simulations and Assessment Center. As previously described, spoken conversations present additional challenges to NLP systems because of the nature of spoken versus typed conversations. The students asked a total of 729 questions, of which 72% were initially answered correctly. Failure of the speech-to-text software to accurately transcribe the spoken question accounted for 26% of the errors, and after manually correcting these errors and resubmitting the actual question asked, the system response accuracy increased to 79.2%, with 9.4% answered incorrectly and 11.4% not answered (Table 1).
To determine whether a VSP could provide sufficient information for students to construct an appropriate differential diagnosis, third-year medical students took a focused history of the complex VSP (VPatient1) and constructed a differential diagnosis. Students were awarded 0 to 4 points on the basis of the accuracy of the diagnoses: 2 points for the most likely diagnosis, 1 point each for 2 other appropriate diagnoses, and −1 point for an inappropriate diagnosis. Most of the students successfully diagnosed this case, with 76.9% receiving the maximum score for the exercise and 11.6% receiving 3 of 4 points (Fig. 8). Only 6 students (4.5%) were unable to obtain the correct diagnosis. The most likely diagnosis was correctly identified by 135 of the 141 students.
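The rubric can be expressed as a small scoring function. The sketch below is our illustration of it: clamping to the stated 0 to 4 range and limiting credit to two "other appropriate" diagnoses are our assumptions, and any diagnosis names passed to it are hypothetical.

```python
def grade_differential(student_dxs, most_likely, appropriate):
    """Score a student's differential: 2 points for the most likely diagnosis,
    1 point each for up to 2 other appropriate diagnoses, -1 per inappropriate
    diagnosis, clamped to the 0-4 range stated in the rubric (our assumption)."""
    points, other_credits = 0, 0
    for dx in student_dxs:
        if dx == most_likely:
            points += 2
        elif dx in appropriate and other_credits < 2:
            points += 1
            other_credits += 1
        elif dx not in appropriate:
            points -= 1
    return max(0, min(4, points))
```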
The experiments described previously were conducted using variations of a VSP presenting with acute lower back pain. We have since expanded the number of VSPs in the system with 7 additional patients presenting with chief complaints including abdominal pain, headache, dizziness, shortness of breath, dysuria, and rhinitis. Each of these VSPs has been used in our curriculum by students to practice their history-taking and clinical decision-making skills. Students have asked a total of more than 12,000 questions, and system response accuracy for these patients was 84.7% (0.9%; range, 76.3%–89.0%).
We have developed a system for students to practice history-taking skills and documentation using natural language conversations with VSPs. Our VSPs engage in contextually appropriate dialogue and display natural movement and emotions appropriate for the questions being asked. Students document key aspects of the encounter, as each VSP has its own record in our EMR.
The VSPs are able to answer questions with a relatively high degree of accuracy. Conversations between a doctor and patient are generally narrow in scope but can be very detailed. A robust dialogue management system must be able to provide context-appropriate answers to similar or identical questions depending on the topic being discussed. For example, “Tell me more” may be asked at several points in the encounter, and the system must be able to understand the context of the question/statement and provide the answer appropriate for that part of the conversation. Similarly, the ability to resolve and understand pronouns is often a challenge for NLP software. “Did that help?” may refer to several different treatments, and the system must be able to parse those correctly. Our system is able to manage these difficult conversational nuances and provide correct responses with a high degree of accuracy. Indeed, we know of no other system that can manage these in-depth conversational challenges with a similar degree of robustness and fidelity.
When the VSP does not understand a particular question, it prompts the student to rephrase that question. Questions that are answered incorrectly result from pattern mismatches and invariably return answers that are inappropriate for the question being asked. As such, students are not misled by incorrect information but rather realize the obvious incongruity and move on.
Dialogue development for each patient is an iterative process requiring several rounds of development and testing. Each new round of questioning improves the dialogue capabilities of the VSP because it is “taught” by programmatically introducing new ways in which students may phrase particular questions.
When performing focused histories, accuracy rates were slightly higher with third-year students because their advanced skills and experience resulted in narrower, more focused conversations. First-year students experienced somewhat less accurate responses because their questions tended to be less specific and involved a more complete history. The questions of first-year students also varied more in style because they had not had as much practice in taking a medical history as more advanced students.
The conversational ability of the VSP was sufficient for students to accurately derive the correct differential diagnosis from the encounter. The case was not especially difficult but did involve over-the-counter herbal remedies and additional symptoms such as frequent urination and leg pain, which made the dialogue more complex and the differential slightly more complicated. Nevertheless, students were readily able to identify the appropriate diagnoses. For future encounters, cases will be enhanced by adding complexity to the history, symptoms, and medications taken.
Considerable effort was devoted to ensuring adequate fidelity of the VSPs. As described previously, the patients were given the ability to display a range of facial expressions and emotive characteristics. The affective capabilities of the system were perhaps most useful in the stand-alone version where students engaged in spoken dialogue with the patients. With this version, students could focus on the agents themselves and were less likely to be distracted. In the web version, students tended to focus on typing and reading, so subtle changes in patient movement or expression were likely less noticed.
SUMMARY AND CONCLUSIONS
In summary, we have developed a VSP system that students use to practice their history-taking and information-gathering skills. Our VSPs engage in natural language conversations and display context-appropriate emotions and responses. Our system can successfully manage in-depth nuanced conversations typical of a doctor-patient interaction with a high degree of accuracy. It is also cost-effective and scalable, and we have developed multiple VSP cases that have been tested with more than 12,000 questions.
These VSP encounters are not designed to replace existing training with SPs or real patients. Rather, they present an opportunity for students to gain early practice on their history-taking skills in safe, nonthreatening environments before real-life or simulated SP encounters.
Virtual standardized patient simulations have the potential to reduce cost, faculty time, and resources needed to assist students in developing their communication skills, and the interactions can be standardized across students. Furthermore, enhancing early skills acquisition with VSPs may facilitate more effective and efficient use of costly SP encounters. Current efforts for the project are focused on adding feedback and assessment capabilities so that students can receive immediate feedback on the quality of their encounters.
1. Courteille O, Josephson A, Larsson LO. Interpersonal behaviors and socioemotional interaction of medical students in a virtual clinical encounter. BMC Med Educ.
2. Hubal RC, Kizakevich PN, Guinn CI, Merino KD, West SL. The virtual standardized patient. Simulated patient-practitioner dialog for patient interview training. Stud Health Technol Inform.
3. Parsons TD. Virtual standardized patients for assessing the competencies of psychologists. In: Information Science and Technology. 3rd ed. Hershey, PA: IGI Global; 2015:297–305.
4. Stevens A, Hernandez J, Johnsen K, et al. The use of virtual patients to teach medical students history taking and communication skills. Am J Surg.
5. Talbot T, Sagae K, John B, Rizzo A. Designing useful virtual standardized patient encounters. Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC); 2012.
6. Lok B, Ferdig RE, Raij A, et al. Applying virtual reality in medical communication education: current findings and potential teaching and learning benefits of immersive virtual patients. Virtual Reality.
7. Carnell S, Halan S, Crary M, Madhavan A, Lok B. Adapting virtual patient interviews for interviewing skills training of novice healthcare students. In: Brinkman WP, Broekens J, Heylen D, eds. Intelligent Virtual Agents, LNAI 9238. Switzerland: Springer International Publishing; 2015:50–59.
8. Rossen B, Lind S, Lok B. Human-centered distributed conversational modeling: efficient modeling of robust virtual human conversations. In: Ruttkay Z, Kipp M, Nijholt A, et al, eds. Intelligent Virtual Agents, LNAI 5773. Berlin Heidelberg: Springer-Verlag; 2009:474–481.
9. Leuski A, Traum D. NPCEditor: creating virtual human dialogue using information retrieval techniques. AI Magazine.
10. Rizzo A, Parsons TD, Kenny P, Buckwalter JG. Using virtual reality for clinical assessment and intervention. In: L'Abate L, Kasier DA, eds. Handbook of Technology in Psychology, Psychiatry, and Neurology: Theory, Research, and Practice. Hauppauge, NY: Nova Science Publishers, Inc.; 2012:277–313.
11. ChatScript (Version 3.81). Available at: http://chatscript.sourceforge.net. Accessed June 16, 2016.
12. Linden Research Inc. Second Life. Available at: http://secondlife.com/. Accessed June 16, 2016.
13. Wallace R. AIML. Available at: http://www.alicebot.org/aiml.html. Accessed June 16, 2016.
14. Danforth DR, Procter M, Heller R, Chen R, Johnson M. Development of virtual patient simulations for medical education. J Virtual Worlds Res.
15. Danforth DR, Procter M, Heller R, et al. Development of virtual patient simulations for medical education. CGEA Spring Conference Program; Chicago, IL; 2009.
16. Rossen B, Lok B. A crowdsourcing method to develop virtual human conversational agents. Int J Hum Comput Stud.
17. Wilcox B, Wilcox S. Making it real: Loebner-winning chatbot design. ARBOR.
18. Unity Technologies. Unity 3D (Version 5.1.0f3). Available at: http://unity3d.com. Accessed June 16, 2016.
19. Autodesk, Inc. Autodesk Character Generator. Available at: https://charactergenerator.autodesk.com/. Accessed June 16, 2016.
20. Autodesk, Inc. Maya 2014. Available at: http://www.autodesk.com/products/maya/overview. Accessed June 16, 2016.
21. Schafer PB, Jin DZ. Noise-robust speech recognition through auditory feature detection and spike sequence decoding. Neural Comput.
22. Rudžionis A, Ratkevičius K, Rudžionis V. Speech in call and web centers. Elektronika ir elektrotechnika.
23. Ortiz CL. The road to natural conversational speech interfaces. IEEE Internet Computing.
24. Nuance Communications. Dragon Naturally Speaking (Version 13). Available at: http://www.nuance.com/dragon/index.htm. Accessed June 16, 2016.
25. Microsoft Corporation. Microsoft Kinect. Available at: https://www.microsoft.com/en-us/kinectforwindows. Accessed June 16, 2016.