Sandella, Jeanne M.; Roberts, William L.; Gallagher, Laurie A.; Gimpel, John R.; Langenau, Erik E.; Boulet, John R.
Accurate written documentation by physicians is essential in providing quality care to patients. The American Osteopathic Association (AOA) and the Accreditation Council for Graduate Medical Education consider professionalism and written communication core competencies for the profession.1,2 The National Board of Osteopathic Medical Examiners (NBOME) administers the Comprehensive Osteopathic Medical Licensing Examination USA Level 2-PE (COMLEX-USA Level 2-PE) and has developed a system to link these competencies by tracking discrepancies between postencounter notes and behavior in an encounter.
Professionalism is defined as “advocacy of patient welfare, adherence to ethical principles, collaboration with health professionals, life-long learning, and sensitivity to a diverse patient population”1 and “accountability to patients, society and the profession.”2 Assessing professionalism is complex. Reddy et al3 concluded that students perceive unprofessional behavior dynamically; they perceived behaviors differently after clerkship experience and were more likely to participate in those behaviors after five months of clerkship rotations.
Assessing written communication skills is essential in evaluating the communication competency. In most clinical skills examinations, there is some form of postencounter exercise in which the examinee must synthesize information gathered in the clinical encounter (e.g., a patient note). During a high-stakes licensing examination, there may be several reasons for errors of omission or fact in the postencounter note. Candidates may neglect (either intentionally or unintentionally) to document certain portions of the history obtained or physical examination maneuvers performed.4–6 Candidates may feel that certain elements are irrelevant and choose not to document those findings in order to finish within the allotted time. Candidates may also inaccurately document history or physical examination findings. Macmillan et al7 compared checklists and postencounter notes and found discrepancies between the two, noting that some students had difficulty translating the encounter to paper, which contributed to errors in the note. Errors of commission, or “fabrication,” are of greater interest in the assessment of professionalism; they involve documentation of history that was not asked of the patient or of physical examination maneuvers that were not performed. Studies investigating the accuracy of patient notes have shown that all three types of error occur, both in the educational arena4,6,7 and in practice.8,9
Links between medical school performance and clinical practice have been shown. Papadakis et al10 found that physicians who were disciplined by state medical boards were three times more likely to have a history of “unprofessional behavior” in medical school. Researchers at the Medical Council of Canada found that deficiencies in communication and clinical decision making on the national licensing performance examination were predictive of future complaints to regulatory authorities.11 A means of systematically identifying unprofessional behavior would therefore be valuable.
NBOME’s mission is to protect the public by providing the means to assess competencies for osteopathic medicine and related health care professions. NBOME policy states that “under no circumstances should a candidate document history items that were not elicited from the patient, or physical exam findings that were not performed.”12 The purpose of this study was to investigate fabrication in postencounter notes to see whether a systematic screening process can be implemented. This procedure represents a novel means by which professionalism can be assessed and provides a way to protect the public from unprofessional behavior as a result of inaccurate documentation.
COMLEX-USA Level 2-PE serves as a clinical skills examination for osteopathic medical students working toward licensure.13 Candidates, primarily fourth-year osteopathic medical students, are tested on 12 standardized patient (SP) encounters. They are given 14 minutes to evaluate and treat an SP with an additional 9 minutes to complete a postencounter SOAP note. Candidates document history findings in the Subjective (S) section, physical findings in the Objective (O) section, integrated differential diagnosis or problem list in the Assessment (A) section, and diagnostic and treatment plan in the Plan (P) section of the SOAP note.12 Candidates are assessed within two domains: Biomedical/Biomechanical and Humanistic. The Biomedical Domain consists of Data Gathering (history and physical examination), synthesis of clinical findings in a SOAP note, and performance of Osteopathic Manipulative Treatment. The Humanistic Domain consists of Doctor–Patient Communication, Interpersonal Skills, and Professionalism.12 COMLEX-USA Level 2-PE uses a conjunctive pass/fail decision system; candidates must pass both domains by demonstrating minimal competency on the skill areas assessed therein.
The process used to identify potential SOAP note fabrication was implemented in the 2007–2008 test cycle. Candidates are informed before the exam that fabrication could result in a fail decision and sign consent forms acknowledging that their scores are screened for quality assurance purposes and may be used for research. No individual scores or outcomes are reported; therefore, IRB approval is not needed. The data represent 3,753 candidates who took the COMLEX-USA Level 2-PE in the 2007–2008 testing cycle. Of the total tested, 7.4% were repeat testers.
Three methods were devised to flag candidates whose SOAP notes warrant further evaluation: screening by SOAP note raters, purposeful SOAP note review, and psychometric indices (Figure 1).
For method one, SOAP note raters report potential discrepancies in the note. Raters undergo case-specific training and are familiar with the facts of each case. Raters are instructed to flag notes that contain two discrepancies in either the S or O section. Internal investigations demonstrated a greater ratio of fabricators to flagged candidates when more than two errors were present.
For method two, patient notes are selected purposefully for review. In the first quarter of the test cycle, various candidate ID numbers were pulled, and three notes from each of these candidates were sent directly to the screening process.
For method three, a psychometric heuristic algorithm is employed to review records for the given month. Using SP case-specific checklists, an encounter-level data gathering score (DG_enc) is calculated as the percentage of history questions correctly asked and physical examination maneuvers adequately performed during each SP encounter. Each note likewise receives encounter-level ratings for its components (S_enc, O_enc, A_enc, and P_enc). For each candidate, these encounter-level scores are averaged across the scoreable encounters for the test day, yielding candidate-level component averages (DG_avg, S_avg, O_avg, A_avg, and P_avg). The screening process uses only the DG_avg, S_avg, and O_avg scores.
Candidate-level averages are then averaged across all candidates tested on the same day, yielding test-day mean scores for DG, S, and O. Each candidate’s average is compared with the corresponding test-day mean, producing three difference scores (DG_diff, S_diff, and O_diff). A candidate is flagged for further review when the candidate’s DG_avg falls one standard deviation or more below the test-day DG mean and the candidate’s S_avg or O_avg lies one standard deviation or more above the corresponding test-day mean. Examinees with lower-than-average DG scores and higher-than-average S or O note scores (or both) are thus most likely to be selected; using one standard deviation captures the more extreme instances of this performance pattern. Flagged score profiles are individually reviewed by physician staff, who compare each encounter’s DG percentage with the average of its S and O note scores and note candidates with disparate scores on individual encounters. Only those candidates with encounter-level discrepancies are screened.
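As a rough illustration, the flagging heuristic can be sketched in code. This is a hypothetical reconstruction, not NBOME’s actual implementation; the data layout, function name, and use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def flag_candidates(day_scores):
    """Flag candidates whose DG average is >= 1 SD below the test-day mean
    while their S or O note average is >= 1 SD above the test-day mean.

    day_scores maps candidate ID -> {"DG": ..., "S": ..., "O": ...}, each
    value a percentage averaged across that candidate's encounters.
    (Hypothetical data layout; population SD is an assumption.)
    """
    # Test-day mean and SD for each component, across all candidates.
    stats = {}
    for comp in ("DG", "S", "O"):
        vals = [s[comp] for s in day_scores.values()]
        stats[comp] = (mean(vals), pstdev(vals))

    flagged = []
    for cid, s in day_scores.items():
        low_dg = s["DG"] <= stats["DG"][0] - stats["DG"][1]
        high_s = s["S"] >= stats["S"][0] + stats["S"][1]
        high_o = s["O"] >= stats["O"][0] + stats["O"][1]
        if low_dg and (high_s or high_o):
            flagged.append(cid)
    return flagged
```

A candidate scoring well below peers on data gathering but well above peers on the written note would be returned for physician review.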
Methods one and three provided greater accuracy in identifying candidates who fabricate. Raters detect errors in the written note (method one), and psychometrics flags discrepancies between DG and note scores (method three). We found that these two methods functioned as independent processes, with different fabricators identified from each method. Method two was discontinued after it failed to systematically identify any fabricators during the first quarter.
Regardless of the flagging method, screening involves physician review of a minimum of three notes from the flagged candidate against the recorded encounter. If two or more patient notes contain fabrication, or if one note contains multiple fabrications, all 12 encounters are reviewed. Notes are reviewed and sent by the vice president for clinical skills testing to a testing committee composed of physician members for final review and decision. If this committee decides that fabrication occurred, then the candidate receives a failing score on the Biomedical/Biomechanical and Humanistic Domains, resulting in an examination failure with annotation of “irregular behavior” on the transcript and score report.
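The escalation rule in the screening step lends itself to a simple check. A minimal sketch, assuming the input is a list of fabrication counts per screened note; the function name and input format are illustrative, not NBOME’s:

```python
def requires_full_review(fabrications_per_note):
    """Return True if all 12 encounters should be reviewed: two or more
    screened notes contain fabrication, or any single note contains
    multiple fabrications. (Illustrative sketch, not NBOME's code.)"""
    notes_with_fabrication = sum(1 for n in fabrications_per_note if n > 0)
    any_multiple = any(n >= 2 for n in fabrications_per_note)
    return notes_with_fabrication >= 2 or any_multiple
```

For example, a screen of three notes with counts [1, 0, 0] would not trigger a full review, while [1, 1, 0] or [2, 0, 0] would.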
For the 2007–2008 testing cycle (July 9, 2007 to May 19, 2008), 3,753 candidates were tested. Ninety-five were flagged for SOAP note fabrication.
Sixty-four candidates were flagged by psychometric review (method three); of these, physician staff screened 17 (26.6%). Four of those screened (23.5%) were sent to committee for review, and three (17.6%) were determined to have fabricated on at least 50% of their SOAP notes.
All 26 candidates (100%) flagged by SOAP note raters (method one) were screened. Five (19.2%) were sent to committee for review, and all five were found to have fabricated.
Five candidates were flagged by purposeful review (method two) in the first quarter of the cycle; none were found to be fabricators.
In total, 48 candidates were screened, and nine (18.8%) were sent to committee for review. The committee failed eight of the nine. Based on the fabrication review, the overall examination failure rate was 0.2%.
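The review funnel reported in these results can be checked arithmetically. The dictionary labels below are mine, but every number comes from the counts reported above:

```python
# Counts from the 2007-2008 COMLEX-USA Level 2-PE fabrication review.
tested = 3753
flagged   = {"psychometric": 64, "rater": 26, "purposeful": 5}
screened  = {"psychometric": 17, "rater": 26, "purposeful": 5}
committee = {"psychometric": 4,  "rater": 5,  "purposeful": 0}
failed    = {"psychometric": 3,  "rater": 5,  "purposeful": 0}

assert sum(flagged.values()) == 95    # candidates flagged
assert sum(screened.values()) == 48   # candidates screened
assert sum(committee.values()) == 9   # sent to committee
assert sum(failed.values()) == 8      # failed for fabrication
print(round(100 * sum(failed.values()) / tested, 1))  # prints 0.2 (% failure rate)
```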
NBOME takes a strong stance on the issue of SOAP note fabrication. Candidates who fabricated SOAP notes are issued a failing score for the Biomedical/Biomechanical and Humanistic Domains. SOAP notes are one component score of the Biomedical/Biomechanical Domain. If SOAP notes are inaccurate, then the domain score is invalid. As fabrication is unprofessional, the candidate will also fail the Humanistic Domain.
In high-stakes examinations, candidates are under pressure to pass, which, combined with the knowledge that the encounter is not a true clinical scenario, may result in unprofessional behaviors.14,15 Students are accustomed to formative, lower-stakes SP exercises at their schools and may therefore document history items or physical examination maneuvers not performed, assuming that they would be negative or normal.8 Additionally, there may be a perception that falsifying a record in an examination does not carry the human consequences it would in the clinical setting. This perception that such behaviors are contextually dependent has been documented in a study by Green et al,16 in which 19% of internal medicine residents felt that protecting a patient from the diagnosis of a stigmatizing illness justified intentional misrepresentation of the medical record.
The identification of overdocumentation in a SOAP note is a valuable task, but it presents operational challenges. With nearly 4,000 candidates tested annually, reviewing every SOAP note against its recorded encounter is impractical. To address this challenge, we have developed a SOAP note rater training program and a psychometric heuristic to identify candidates likely to have fabricated. Given the rates of overdocumentation reported previously,4–6 this approach likely underestimates the incidence of fabrication.
For method one, the flagging process is limited by the ability of a SOAP note rater to recognize possible fabrication on a SOAP note. It is difficult to distinguish a documentation error (“pain for two weeks” when it was three weeks) from fabrication (the candidate never asked when the pain began). This difficulty increases the number of flagged and screened notes that do not contain fabrication. To minimize it, we have integrated fabrication recognition into initial SOAP note rater training and mandatory yearly refresher modules.
Flagging SOAP notes using psychometrics is challenging. Currently, the process compares DG scores with note scores. Direct comparisons among the SP DG checklist items and the S and O portions of the note are problematic because they do not measure the exact same constructs. An SP completes a checklist requiring that history and physical examination maneuvers conform to a predetermined standard. S and O portions of the note are candidates’ perceptions of the encounter. Occasionally, these two “perceptions” do not agree, but no fabrication occurred. Similar discrepancies were noticed by Worzala et al6 in the medical education setting.
Finally, there are limitations in SOAP note evaluation. Judgments are based on what the candidate believed he or she asked or did during the encounter. Inconsistency is limited by having screening evaluations performed by physician trainers who follow set protocols, with notes then sent for review to another team of trained osteopathic physicians. Evaluation errs on the side of the candidate, potentially underestimating the incidence of fabrication.
More research on, and instruction in, documentation accuracy is needed in medical schools. Detecting this type of unprofessional behavior is one way that testing organizations can fulfill their mandate to protect the public.
1 Tunanidas A, Burkhart D. American Osteopathic Association commitment to quality and lifelong learning. J Am Osteopath Assoc. 2005;105:404–407.
3 Reddy ST, Farnan JM, Yoon JD, et al. Third year medical students’ participation in and perceptions of unprofessional behaviors. Acad Med. 2007;82(10 suppl):S35–S39.
4 Szauter KM, Ainsworth MA, Holden MD, Mercado AC. Do students do what they write and write what they do? The match between the patient encounter and patient note. Acad Med. 2006;81(10 suppl):S44–S47.
5 Keenan CF, Adubofour K, Daftary AV. Reducing medical errors in primary care. JAAPA. February 2004 [no longer available].
6 Worzala K, Rattner AL, Boulet JR, et al. Evaluation of the congruence between students’ postencounter notes and standardized patient checklists in a clinical skills examination. Teach Learn Med. 2008;20:31–36.
7 Macmillan MK, Fletcher EA, DeChamplain AF, Klass DJ. Assessing post-encounter note documentation by examinees in a field test of a nationally administered standardized patient test. Acad Med. 2000;75(10 suppl):S112–S114.
8 Horwitz LI, Meredith T, Schuur JD, Shah NR, Kulkarni RG, Jenq GY. Dropping the baton: A qualitative analysis of failures during the transition from emergency department to inpatient care. Ann Emerg Med. 2009;53:701–710.e4.
9 Dresselhaus TR, Luck J, Peabody JW. The ethical problem of false positives: A prospective evaluation of physician reporting in the medical record. J Med Ethics. 2002;28:291–294.
10 Papadakis MA, Teherani A, Banach MA, et al. Disciplinary action by medical boards and prior behavior in medical school. N Engl J Med. 2005;353:2673–2682.
11 Tamblyn R, Abrahamowicz M, Dauphinee D, et al. Physician scores on a national clinical skills examination as predictors of complaints to medical regulatory authorities. JAMA. 2007;298:993–1001.
13 Gimpel JR, Boulet JR, Erichetti AM. Evaluating the clinical skills of osteopathic medical students. J Am Osteopath Assoc. 2003;103:267–279.
14 Ginsburg S, Regehr G, Hatala R, et al. Context, conflict and resolution: A new conceptual framework for evaluating professionalism. Acad Med. 2000;75(10 suppl):S6–S11.
15 Rees CE, Knight LV. Banning, detection, attribution and reaction: The role of assessors in constructing students’ unprofessional behaviours. Med Educ. 2008;42:125–127.
16 Green MJ, Farber NJ, Ubel PA, et al. Lying to each other: When internal medicine residents use deception with their colleagues. Arch Intern Med. 2000;160:2317–2323.