The objective of this study was to develop a semiautomated approach to screening cases from a mandatory, population-based patient safety reporting system for those that describe hazards associated with the electronic health record (EHR).
Potentially relevant cases were identified through a query of the Pennsylvania Patient Safety Reporting System. A random sample of cases was manually screened for relevance and divided into training, testing, and validation data sets to develop a machine learning model. This model was then used to automate screening of the remaining potentially relevant cases.
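The screening workflow described above can be illustrated with a minimal sketch. It assumes manually labeled free-text cases, a 60/20/20 split, and a multinomial naive Bayes classifier with Laplace smoothing. That classifier is a simpler stand-in, not the kernel-based naive Bayes evaluated in the study. The names `split_cases`, `NaiveBayesScreen`, and the toy reports are hypothetical and are not drawn from the reporting system.

```python
import math
import random
import re
from collections import Counter, defaultdict


def split_cases(cases, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle labeled cases and split them into training, testing,
    and validation sets. The 60/20/20 fractions are an assumption."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


def tokenize(text):
    """Lowercase the narrative and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesScreen:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, cases):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for text, label in cases:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior for each label."""
        total = sum(self.class_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]
        scores = {}
        for label, n_cases in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_cases / total)
            for word in tokens:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical, hand-written examples; True means EHR-related hazard.
LABELED = [
    ("order entered in ehr on wrong patient chart", True),
    ("ehr alert did not fire for allergy", True),
    ("medication order disappeared from ehr screen", True),
    ("patient fell while walking to bathroom", False),
    ("specimen mislabeled at bedside by nurse", False),
    ("patient fell in hallway and chart updated in ehr", False),
]

model = NaiveBayesScreen().fit(LABELED)
train, test, validation = split_cases(LABELED)
```

In this toy setup, a report that mentions the EHR only in passing (the last labeled example) is marked as not relevant, which mirrors the distinction the manual screeners had to make.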
Of the 4 algorithms tested, a naive Bayes kernel performed best, with an area under the receiver operating characteristic curve of 0.927 ± 0.023, accuracy of 0.855 ± 0.033, and F score of 0.877 ± 0.027.
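For readers less familiar with these metrics, the following sketch shows one way each can be computed for a binary screening task. It assumes that the F score is the balanced F1 measure and that the area under the curve is computed with the rank-based (Mann-Whitney) formulation. Neither choice is confirmed by the text above, and the labels and scores below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases where the predicted label matches the manual label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive case outscores a random
    negative case, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = relevant) and classifier scores for five cases.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and the F score depend on the decision threshold, whereas the area under the curve summarizes ranking quality across all thresholds, which is one reason to report all three.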
The machine learning model and text mining approach described here are useful tools for identifying and analyzing adverse event and near-miss reports. Although reporting systems are beginning to incorporate structured fields on health information technology and the EHR, these methods can identify related events that reporters classify in other ways. They can also facilitate analysis of legacy safety reports by retrieving health information technology–related and EHR-related events from databases that lack dedicated fields and controlled values for this subject, and by distinguishing those events from reports in which the EHR is mentioned only in passing.
Machine learning and text mining are useful additions to the patient safety toolkit and can be used to semiautomate screening and analysis of unstructured text in safety reports from frontline staff.