Original Clinical Research Report

Prediction of Prolonged Opioid Use After Surgery in Adolescents: Insights From Machine Learning

Ward, Andrew PhD*; Jani, Trisha MS; De Souza, Elizabeth PhD; Scheinker, David PhD§; Bambos, Nicholas PhD*,§; Anderson, T. Anthony PhD, MD

doi: 10.1213/ANE.0000000000005527



  • Question: Can machine-learning models be used to predict adolescents at high risk of prolonged opioid use after surgery (POUS)?
  • Findings: Machine-learning models to predict POUS risk among adolescents show modest predictive performance across a broad range of surgeries and strong performance for some individual surgeries.
  • Meaning: Machine learning’s promising performance for certain surgeries suggests that these surgeries may benefit from specialized risk-prediction models (developed with electronic health record data).

Opioid use disorder is a global disease; its worldwide prevalence is estimated to be at least 16 million people.1 Prescription opioid abuse imposes significant health and economic burdens in Europe and North America.2,3 Opioids are prescribed across a diverse range of health care settings4; children are prescribed opioids in the emergency department and ambulatory care settings.5–7 Six percent of adults prescribed at least 1 day of opioids will continue to use them 1 year later; 2.9% will continue to use them 3 years later.8 Early use (before the age of 15 years) of any substance of abuse is associated with a 550% increase in the risk of subsequent substance use disorder (28.1% vs 4.3%),9 and even the legitimate use of prescription opioids by children before the 12th grade (17–18 years of age) increases the risk of future opioid misuse by 33%.10

Up to 80% of adolescents and adults who undergo surgery receive a postoperative opioid prescription,11 and prolonged opioid use after surgery (POUS; defined as ≥1 opioid prescription in the 90–180 days after surgery) occurs in up to 13% of adult patients.12,13 Certain patient characteristics (eg, preoperative substance use disorders, mood disorder, and chronic pain) and surgeries (eg, cholecystectomy and joint arthroplasty) increase risk.12,13 In particular, preoperative opioid use is strongly associated with POUS in adults.14,15 Fewer studies have examined the risk of continued opioid use in adolescents after surgery, but those published report rates as high as 15%.16,17 As a result of adverse health effects, adults with POUS have increased health care costs, hospitalizations, and emergency room visits.18

Multivariable regression and classification models, with adjustment for confounders and covariates, have been almost exclusively the techniques of choice to determine which variables are associated with postsurgical adverse events such as POUS in different surgical populations and to estimate those associations. These results are valid and clinically relevant and allow adjusted associations between individual risk factors and the outcome to be quantified. However, when the number of variables used in the analysis is comparable to the number of observations in the data (as may be the case with detailed, retrospective medical datasets), there is a risk of model overfitting, which degrades generalization to new patients. Machine-learning (ML) techniques, such as tree-based models, regularization, and evaluation on a held-out test set, are well suited to developing predictive algorithms for health care applications because they allow the processing of large datasets with many variables.

As no generalized prediction model exists to identify surgical patients at risk of developing POUS, we sought to use ML techniques to develop a predictive algorithm that identifies adolescents at risk of developing POUS. Using data from a national insurance claims database, we investigated the utility of ML techniques for POUS prediction model development and assessed model performance across a wide range of surgeries. We hypothesized that these models would identify higher-risk surgeries and variables useful for future predictive models developed within individual health care systems using electronic health records and other patient- and family-specific data.


Due to the deidentified nature of the dataset, the Stanford University Institutional Review Board (Stanford, CA) determined that this research does not involve human subjects as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g) and waived the need for informed consent.

Study Population

Medical claims data from January 1, 2003 to December 30, 2017 were collected from the Optum Clinformatics Data Mart Database (OptumInsight, Eden Prairie, MN), a deidentified database from a national insurance provider. Patients aged 12–21 years (early and late adolescence) who underwent a surgical procedure under general anesthesia between January 1, 2011 and June 30, 2017 were identified using any of 1298 surgical current procedural terminology (CPT) codes (Supplemental Digital Content, Table 1). As opioid prescription practices have varied over time and detailed census data were not available before 2011, surgeries before January 1, 2011 were excluded. Patients were eligible if they had been enrolled for at least 1 year before surgery and remained enrolled for at least 6 months following the surgery; these requirements ensured that a 1-year history of comorbidities and POUS information were available. To ensure that a perioperative opioid prescription from a second surgery was not misidentified as a prescription opioid in the 90 to 180 days after surgery, patients who had anesthesia in the 3 to 180 days after their surgery were excluded. Patients whose surgery date fell within a hospital stay longer than 30 days were excluded, as these more complex patients were not the study’s focus. Patients whose zip code was not available were also excluded.
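The inclusion and exclusion criteria above amount to a sequence of row filters on the claims extract. A minimal sketch in pandas, using hypothetical column names (the actual Optum field names differ):

```python
import pandas as pd

# Hypothetical claims extract, one row per surgical visit. Column names are
# illustrative, not the study's actual field names.
visits = pd.DataFrame({
    "age": [15, 17, 22, 14],
    "enrolled_days_before": [400, 365, 500, 200],
    "enrolled_days_after": [180, 200, 365, 365],
    "anesthesia_3_to_180d_after": [False, False, False, False],
    "zip_code": ["94305", "55344", "10001", None],
})

eligible = visits[
    visits["age"].between(12, 21)               # adolescent age range
    & (visits["enrolled_days_before"] >= 365)   # >=1 y enrollment before surgery
    & (visits["enrolled_days_after"] >= 180)    # >=6 mo enrollment after surgery
    & ~visits["anesthesia_3_to_180d_after"]     # no 2nd anesthetic 3-180 d later
    & visits["zip_code"].notna()                # zip code available
]
```

Only the first 2 hypothetical rows survive all filters; the third fails the age criterion and the fourth fails the pre-surgery enrollment and zip code criteria.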

Outcome Measures

The primary outcome was POUS, defined as having filled at least 1 opioid prescription in the 90 to 180 days after surgery. Opioid prescriptions were identified by the American Hospital Formulary Service (AHFS) classes 280808, 280812, 28080800, and 28081200. The primary measures used to assess a model’s ability to predict POUS were the area under the receiver-operating characteristic curve (AUC), also known as the C statistic, and the mean average precision (MAP). Individual decision thresholds for the best model were compared using sensitivity, specificity, Youden Index, F1 score, and number needed to evaluate (NNE). The primary prediction point was the day of surgery. We defined an additional prediction point at 14 days after surgery (a typical time for a follow-up surgical visit).
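The evaluation metrics named above can all be derived from a model's risk scores and the observed POUS labels. A small illustration with scikit-learn, using hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels (1 = POUS) and model risk scores for 8 patients.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.10, 0.45, 0.70, 0.20, 0.90, 0.40, 0.15, 0.35])

auc = roc_auc_score(y_true, y_score)           # C statistic
ap = average_precision_score(y_true, y_score)  # basis of MAP

# Threshold-specific metrics at an example cutoff of 0.5.
y_pred = (y_score >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)                     # positive predictive value
youden = sensitivity + specificity - 1
f1 = 2 * precision * sensitivity / (precision + sensitivity)
nne = 1 / precision                  # patients evaluated per true POUS case
```

Sweeping the cutoff over all possible values, rather than fixing it at 0.5, traces out the ROC and precision-recall curves that the AUC and MAP summarize.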

Variable Set Creation

The features commonly examined in the adult and pediatric literature on opioid use after surgery were extracted. Only variables that could be obtained from an in-hospital electronic medical record, or by querying a widely available database, were considered. Each patient’s age and gender were included. Variables capturing medical utilization and opioid prescription fill information, including days’ supply and average daily oral morphine milligram equivalent (MME) doses (calculated by dividing the total MME by the days’ supply) in the year before surgery and in the perioperative period (defined as 2 days before or after surgery), were created. This information can be gathered from patients, hospital electronic medical records, or state prescription drug-monitoring programs, such as the California Controlled Substance Utilization Review and Evaluation System. Variables capturing International Classification of Diseases (ICD) and CPT code information were created using predefined lists (Supplemental Digital Content, Tables 1 and 2) and the Clinical Classifications Software (CCS).19 Separately, the surgery (or surgeries) on each patient’s surgery date was (or were) mapped to one of the CCS procedure classes. Geodemographic information was obtained from the American Community Survey (ACS). For the secondary prediction point at 14 days after surgery, we collected additional variables between the day of surgery and 14 days after surgery.
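As a concrete example of the opioid-exposure features, the average daily MME is the total MME divided by the days' supply. A minimal sketch with hypothetical prescription-fill records:

```python
# Hypothetical opioid prescription fills for one patient in the year before
# surgery (values illustrative only).
fills = [
    {"days_supply": 5, "total_mme": 150.0},
    {"days_supply": 3, "total_mme": 90.0},
]

days_supply = sum(f["days_supply"] for f in fills)  # total days' supply
total_mme = sum(f["total_mme"] for f in fills)      # total MME dispensed

# Average daily MME = total MME / days' supply, as defined in the text.
avg_daily_mme = total_mme / days_supply
```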

As some of the variables collected were not expected to have much predictive power, a composite score based on the ML model results was used to prune variables. Details of the variable definition, classification, and selection process can be found in the Supplemental Digital Content.

Training and Test Datasets

As rates of opioid prescribing to children and adolescents were previously shown to decrease between 2011 and 2016,20 patient surgery data were split by year into training and test sets. Surgeries that took place before 2016 were placed into the training set, while surgeries that took place in 2016 and 2017 were placed in the test set. Splitting training and test data in time not only inherently incorporates the effect of temporal drift in patient characteristics and outcome rates in test-set performance metrics but also mirrors the process of evaluating a predictive model on new patients in real time.
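The temporal split described above can be sketched as a simple year-based partition (rows shown are hypothetical):

```python
import pandas as pd

# Hypothetical surgical visits with outcome labels (1 = POUS).
surgeries = pd.DataFrame({
    "visit_id": [1, 2, 3, 4, 5],
    "surgery_year": [2012, 2014, 2015, 2016, 2017],
    "pous": [0, 1, 0, 0, 1],
})

# Temporal split: pre-2016 surgeries train the model; 2016-2017 surgeries
# form the held-out test set, mirroring prospective deployment.
train = surgeries[surgeries["surgery_year"] < 2016]
test = surgeries[surgeries["surgery_year"] >= 2016]
```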

Model Development

Several key ML algorithms were used to train models on the training set: random forests (RF), gradient boosted machines (GBMs), extreme gradient boosting (XGBoost) models, and logistic regression with the standard L2 penalty (LR) and with an L1 lasso (Lasso) penalty. These algorithms were chosen for their ability to effectively incorporate many variables into the trained models. In particular, the tree-based models (RF, GBM, and XGBoost) are able to model complex, high-order interactions between the input variables. These algorithms are core candidates in the ML literature and have been applied to related prediction problems.21,22 Analysis was performed in Python 3.6 (Python Software Foundation) using the scikit-learn and XGBoost packages.23,24

For each ML algorithm, 5-fold cross validation was used on the training set to tune hyperparameters (Supplemental Digital Content, Table 3). The hyperparameters were tuned to control the models’ complexities and prevent overfitting. Once the hyperparameters and variables were selected, the best-performing cross-validated model of each algorithm was retrained on the entire training set. This process was repeated for the 14-day after-surgery prediction point.
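The tuning-and-retraining procedure can be sketched with scikit-learn's GridSearchCV, which performs the cross-validation and, by default (refit=True), retrains the best configuration on the entire training set. The grid values and synthetic data here are illustrative, not the study's actual hyperparameter ranges:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the claims-derived training set
# (~5% positive class, echoing the POUS rate).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.95],
                           random_state=0)

# 5-fold cross-validation over a small, illustrative hyperparameter grid,
# scored by AUC, to control model complexity and limit overfitting.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best_gbm = grid.best_estimator_  # best configuration, refit on all training data
```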

Model Evaluation

During the training phase, models were compared based on the AUC. Once the best-performing model of each algorithm was retrained, final metrics of the best-performing ML model were reported on the held-out test set. The ML model with the highest AUC on the test set was then evaluated across additional performance metrics. To compare model operating points, probability cutoffs were chosen so that the proportion of interventions (ie, the number of patients classified as “positive”) was fixed; for example, a threshold was chosen so that 1% of patients were classified as positive, meaning that the 1% of patients with the highest risk would receive interventions. The sensitivity, specificity, precision (or positive predictive value), NNE, and Youden Index were calculated and reported for each fixed number of interventions. The AUC, MAP, maximum Youden Index, and highest F1 score of the best-performing models are also reported for the most common individual surgery types and for those with the highest AUCs.
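Choosing a probability cutoff so that a fixed proportion of patients is classified as positive reduces to thresholding at the k-th highest risk score. A sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical model-predicted POUS probabilities for 200 patients.
rng = np.random.default_rng(0)
risk_scores = rng.random(200)

# Classify a fixed proportion (here the top 5%) as "positive", ie, selected
# for intervention, by thresholding at the k-th highest score.
proportion = 0.05
k = int(len(risk_scores) * proportion)
threshold = np.sort(risk_scores)[-k]   # k-th highest score
flagged = risk_scores >= threshold     # patients selected for intervention
```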

Variable Importance

Traditionally, feature importance for tree-based methods is reported using the decrease in Gini impurity during the training process, but this approach has a number of shortcomings, particularly in the presence of mixed variable types (binary/categorical/continuous).25 To assess the contributions of variables to the model’s predictions, we used the permutation importance metric.26 This process estimates the contribution of an individual variable to a pretrained model by randomly permuting that variable’s values within the test set. The pretrained model is reevaluated on the modified test set, and the decrease in performance is measured. This analysis was performed on the held-out test set using the best-performing model for each prediction point.
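This procedure is available in scikit-learn as sklearn.inspection.permutation_importance. A sketch on synthetic stand-in data, measuring the AUC drop on a held-out test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the claims-derived variables.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permute each variable in the held-out test set and measure the AUC drop;
# larger drops indicate more important variables.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=5, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```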


Patient Characteristics

Table 1. Preoperative Patient Characteristics
Characteristic Patient visits
With POUS (n = 8410) Without POUS (n = 178,083) Total (n = 186,493)
Male gender 3398 (40) 90,591 (51) 93,989 (50)
Age (y) 18.38 ± 2.34 17.47 ± 2.64 17.51 ± 2.63
Geographic region
 Middle Atlantic 289 (3.4) 10,415 (5.8) 10,704 (5.7)
 Pacific 610 (7.3) 14,205 (8) 14,815 (7.9)
 Mountain 1052 (13) 17,910 (10) 18,962 (10)
 South Atlantic 1845 (22) 39,511 (22) 41,356 (22)
 East South Central 347 (4.1) 6050 (3.4) 6397 (3.4)
 West South Central 1725 (21) 28,165 (16) 29,890 (16)
 West North Central 1060 (13) 26,560 (15) 27,620 (15)
 East North Central 1332 (16) 30,118 (17) 31,450 (17)
 New England 150 (1.8) 5149 (2.9) 5299 (2.8)
Medical history
 History of chronic pain 1772 (21) 19,466 (11) 21,238 (11)
 CCS: screening and history of mental health and substance abuse codes 717 (8.5) 5862 (3.3) 6579 (3.5)
Opioid prescription history
 Days supply of opioids filled in 365 d before surgery 24.99 ± 71.28 6.07 ± 10.61 6.92 ± 18.76
 Days supply of perioperative opioids 4.67 ± 5.03 3.99 ± 4.18 4.02 ± 4.22
 Total MME of opioids filled in 365 d before surgery 1199.40 ± 4945.43 261.63 ± 598.50 303.92 ± 1217.66
 Total MME of perioperative opioids 230.11 ± 306.73 182.71 ± 210.67 184.85 ± 216.15
Characteristics are presented as either “N (%)” or “mean ± standard deviation.”
Abbreviations: CCS, Clinical Classifications Software; MME, oral morphine milligram equivalents; POUS, prolonged opioid use after surgery.

Of the 354,978 patients with an eligible surgery under anesthesia in the dataset, 167,538 met all inclusion criteria (Supplemental Digital Content, Figure 1). Eight hundred forty-seven (0.24%) patients were missing a zip code; 186,493 surgical visits were included (some patients had >1 surgery at least 6 months apart). The surgeries included in the study represented 120 CCS surgery types. Patient cohort characteristics are presented in Table 1. The mean (standard deviation [SD]) patient age was 17.5 (2.6) years at the time of surgery, and 49.6% of visits were for female patients. Compared to those without POUS, patients with POUS had more prior chronic pain (21% vs 11%), higher rates of mental health-related and substance abuse-related ICD codes (8.5% vs 3.3%), and more days’ supply of opioids filled in the year before surgery (25 vs 6 days).

POUS Outcome

There were 8410 (4.5%) cases of POUS identified across all surgical visits. POUS incidence ranged from 1.0% (dental surgery) to 13.1% (noncardiac vascular catheterization) among the 49 surgery types with at least 10 cases of POUS included in the test set. POUS incidence decreased over time (Supplemental Digital Content, Figure 2).

Machine Learning Model Performance

The training set comprised 146,620 surgeries (78.6% of the total); the test set comprised 39,873 surgeries (21.4% of the total). One thousand one variables were used as inputs to the initial day-of-surgery model; of these, 109 were retained in the final model after variable pruning. For the 14-day after-surgery model, 1015 variables were used initially, and 136 were ultimately selected. Variables selected by the ML models are detailed in Supplemental Digital Content, List 1. Figure 1 shows the receiver-operating characteristic (ROC) curves and the precision-recall curves of the best-performing cross-validated LR, Lasso, RF, GBM, and XGBoost models evaluated on the test set for the day-of-surgery prediction (Figure 1A, B) and the 14-days-after-surgery prediction (Figure 1C, D). GBM had the strongest test performance at both time points, achieving a test AUC of 0.711 (95% confidence interval [CI], 0.699-0.723) and MAP of 0.142 on the day of surgery, and a test AUC of 0.713 (95% CI, 0.701-0.725) and MAP of 0.147 at 14 days after surgery.

Figure 1:
Evaluation of machine learning model performance on the held-out test set. ROC curves and precision-recall curves for models predicting on the day of surgery (A, B) and 14 d after surgery (C, D) are shown. For each plot, the best-performing model is denoted with a bolded line. AUC indicates area under the receiver-operating characteristic curve; CI, confidence interval; GBM, gradient boosting machine; Lasso, logistic regression with an L1 Lasso penalty; LR, logistic regression with an L2 penalty; POUS, prolonged opioid use after surgery; RF, random forest; ROC, receiver-operating characteristic; XGBoost, extreme gradient boosting.

Table 2 details the breakdown of AUC, highest Youden Index, and highest F1 score by surgery type for the GBM models. The 10 most common surgery types, as well as the 10 surgeries with the highest AUCs and at least 10 cases of POUS, are shown. Despite an overall AUC of 0.711 on all patients, the GBM model on the day of surgery performed notably better among certain surgical groups (treatment of fracture or dislocation of hip and femur, AUC: 0.823; dental procedures, AUC: 0.812).

Table 2. POUS and GBM Test Set Performance by Surgery Type
Surgery Number of surgeries Number of POUS cases (%) AUC MAP Highest Youden indexa Highest F1 scorea
All surgeries
 Day-of-surgery prediction 39,873 1298 (3.3) 0.711 0.142 0.295 0.200
 14-d-after-surgery prediction 39,873 1298 (3.3) 0.713 0.142 0.291 0.201
Most common surgeries
 Other OR therapeutic procedures on joints 4355 129 (3.0) 0.626 0.070 0.199 0.138
 Tonsillectomy and/or adenoidectomy 3861 130 (3.4) 0.645 0.085 0.226 0.132
 Arthroscopy 3087 80 (2.6) 0.629 0.086 0.197 0.140
 Excision of semilunar cartilage of knee 2911 70 (2.4) 0.658 0.104 0.249 0.170
 Dental surgery 2600 27 (1.0) 0.748 0.204 0.424 0.294
 Other OR therapeutic procedures on bone 2556 75 (2.9) 0.709 0.117 0.350 0.185
 Appendectomy 2407 66 (2.7) 0.720 0.136 0.353 0.264
 Other OR therapeutic procedures on nose, mouth, and pharynx 2285 60 (2.6) 0.636 0.105 0.208 0.144
 Other therapeutic procedures on muscles and tendons 2096 75 (3.6) 0.741 0.144 0.378 0.183
 Other fracture and dislocation procedure 1987 41 (2.1) 0.648 0.054 0.292 0.104
Surgeries with highest AUC
 Treatment; fracture or dislocation of hip and femur 195 11 (5.6) 0.823 0.319 0.560 0.375
 Dental procedure 604 12 (2.0) 0.812 0.290 0.551 0.333
 Other OR therapeutic procedures on musculoskeletal system 412 16 (3.9) 0.807 0.183 0.506 0.308
 Arthrocentesis 179 10 (5.6) 0.804 0.455 0.504 0.533
 Extracorporeal lithotripsy; urinary 192 16 (8.3) 0.800 0.344 0.506 0.400
 Spinal fusion 424 29 (6.8) 0.794 0.276 0.560 0.375
 Myringotomy 568 18 (3.2) 0.792 0.184 0.480 0.311
 Laminectomy; excision intervertebral disk 251 22 (8.8) 0.791 0.459 0.501 0.500
 Debridement of wound; infection or burn 398 12 (3.0) 0.787 0.113 0.636 0.204
 Ureteral catheterization 166 12 (7.2) 0.761 0.333 0.444 0.381
Abbreviations: AUC, area under the receiver-operating characteristic curve; MAP, mean average precision; OR, operating room; POUS, prolonged opioid use after surgery.
aThe highest Youden Index and highest F1 score were selected by searching across all possible decision thresholds.

Figure 2 compares the relative importance of the 25 most important variables for the GBM models on the day of surgery and 14 days after surgery. For both models, opioid prescription information in the 365 days before surgery (the days’ supply of all opioids, the average daily MME of opioids, and the days’ supply of full opiate agonists) comprised the most important variables. The other 7 of the top 10 most important variables for the GBM models included the MME of opioids 3 to 14 days after surgery, the days’ supply of opioids 3 to 14 days after surgery, zip-level education level, zip-level median household income, zip-level marital status, the number of outpatient visits in the 365 days before surgery, and having undergone dental surgery.

Figure 2:
Variable importance of the most important variables for the GBM model predicting POUS on the day of surgery and 14 d after surgery. AUC indicates area under the receiver-operating characteristic curve; GBM, gradient boosting machine; GI, gastrointestinal; MME, oral morphine milligram equivalents; POUS, prolonged opioid use after surgery.
Table 3. Additional GBM Predictive Performance Metrics Specified by Number of Interventions
Prediction point Patients for intervention (%) Sensitivity (%)a Specificity (%)b Precision (%)c NNEd Youden indexe F1 scoref
Day of surgery 1 11 99 37.4 3 0.11 0.18
2 16 98 25.7 4 0.14 0.20
5 23 96 14.8 7 0.18 0.18
10 33 91 10.6 10 0.23 0.16
25 53 76 6.8 15 0.28 0.12
50 76 51 5.0 21 0.27 0.09
75 93 26 4.0 25 0.19 0.08
90 98 10 3.6 29 0.09 0.07
95 99 5 3.4 30 0.04 0.07
98 100 2 3.3 31 0.02 0.06
14 d after surgery 1 12 99 38.2 3 0.11 0.18
2 16 98 25.7 4 0.14 0.20
5 24 96 15.6 7 0.20 0.19
10 34 91 11.1 10 0.25 0.17
25 53 76 6.8 15 0.28 0.12
50 76 51 5.0 21 0.27 0.09
75 93 26 4.0 25 0.19 0.08
90 98 10 3.6 29 0.09 0.07
95 99 5 3.4 30 0.04 0.07
98 100 2 3.3 31 0.02 0.06
Abbreviations: GBM, gradient boosting machine; NNE, number needed to evaluate; POUS, prolonged opioid use after surgery.
aA higher sensitivity indicates that a greater percentage of patients with POUS are classified as at risk for POUS.
bA higher specificity indicates that a greater percentage of patients without POUS are classified as not at risk for POUS.
cA higher precision indicates that a greater percentage of patients classified as at risk for POUS do have POUS.
dA lower NNE indicates that fewer patients classified as at risk for POUS need to be evaluated to capture someone who has POUS.
eA higher Youden Index (closer to 1) indicates that the model has a balance of high sensitivity and specificity.
fA higher F1 score (closer to 1) indicates that the model has a balance of high sensitivity and precision.

The performance of the GBM models is detailed in Table 3. As the number of patient interventions increases, the sensitivity and NNE increase and the specificity and precision decrease. Improved predictive performance 14 days after surgery was negligible.


We investigated the ability of key ML algorithms to predict adolescent POUS and identified variables important for this prediction. Utilizing a large national insurance claims dataset comprising data from over 167,000 eligible surgical patients, ML models achieved modest predictive performance across all surgeries but substantially higher predictive performance for some specific surgeries. The tree-based ML methods performed slightly, but not significantly, better than the regression-based models, suggesting that there were no complex, nonlinear relationships among the variables that the tree-based models could exploit for substantially increased predictive power. Prediction did not improve substantially when the prediction point was delayed to 14 days after surgery, indicating either that (1) no clinical or medical information that substantially improves POUS prediction emerges in the 14-day window after surgery or that (2) such information exists but is not captured by the methodology outlined here. Of more than 1000 variables used as inputs in the initial model, variable importance analysis found that opioid use in the year before surgery and perioperative opioid prescriptions (both the average daily MME and total days’ supply) were most important for POUS prediction, while zip-level socioeconomic factors contributed to a lesser degree. It is reassuring that these variables have been identified as risk factors for POUS in other studies.27–29 Rates of POUS determined in this study are consistent with similar pediatric and adult studies.12,16,17,27–29

We hope that these results may be used by individual health care systems to (1) analyze risk factors associated with patients at high risk of POUS, (2) develop models to predict individual risk, (3) validate the models within the same system, and finally (4) deploy the models as tools to identify high-risk patients. These steps will ultimately allow the development and investigation of targeted prevention strategies. For example, patients at high risk of POUS may benefit from perioperative opioid prescription reduction strategies, intensive pain management, and an interdisciplinary pain care team. In this study, ML only incrementally outperformed logistic regression. Still, over the long term, POUS prediction models may be used at the point of care while supported by a data integration platform that periodically performs analyses on the most recent aggregated data; ML is particularly well suited for such analyses.

In studies like this with imbalanced classes, AUCs alone may not be reflective of clinical effectiveness.30 Hence, performance was evaluated using not only AUC and the ROC curve but also MAP and the precision-recall curve, sensitivity, specificity, precision (positive predictive value), Youden Index, F1 score, and NNE, giving a more holistic view of model performance. Due to the small fraction of POUS cases, either the sensitivity or the precision is low across all of the cutoffs reported. This is consistent with the low precision of prostate and breast cancer screening tests, which also have low outcome rates.31,32 Operationalizing a predictive model in this rare-outcome setting requires considering the trade-offs between false positives and false negatives. The financial and health burdens of POUS are active areas of research,18 so it may be feasible to define an optimal decision threshold based on the cost of an intervention versus the expected or observed cost of POUS.

This study expands on prior publications assessing ML-derived POUS prediction models for individual surgery types. Though not directly comparable due to patient age differences, our day-of-surgery model AUC across all adolescent surgeries is higher than that found for adults undergoing lumbar spine surgery (AUC 0.70),33 and our day-of-surgery model AUCs for the top individual surgeries are higher than those of existing predictive models for total hip arthroplasty (AUC 0.77), lumbar disk herniation (AUC 0.81), and anterior cervical discectomy and fusion surgeries (AUC 0.81).21,34,35 It is encouraging that these results are comparable despite being evaluated on adolescent populations with lower POUS incidence and without recalibration on any individual surgeries. These results provide a principled method of determining which surgical practices might benefit most from POUS prediction, directing future research efforts.

The use of tree-based ML algorithms allowed us to effectively leverage the amount of data available in this dataset to train high-performing models. These models incorporated >180,000 patient encounters and >1000 predictor variables (and complex interactions between variables) into the training process. Despite their additional modeling power, the performance of the tree-based models was similar to that of the logistic regression models. Therefore, when making decisions regarding the implementation of these models, researchers, hospital staff, and clinical care teams should balance the performance benefits of tree-based models against the relative simplicity and ease of implementation of logistic regression models.

The decision to use data after 2016 as the test set accounts for temporal changes in POUS, and the models performed well despite these trends. This contrasts with the standard convention to make random train/test splits, a practice that obscures such temporal patterns and makes results less reproducible or clinically relevant. In addition to AUC, we reported a variety of alternative evaluation metrics at several thresholds, each corresponding to a percentage of patients for which interventions would occur. This method of reporting performance metrics based on the relative number of interventions can inform resource-constrained care providers and systems and illustrates the trade-offs likely to be seen in practice.

This study has limitations. Exclusion of patients who did not have coverage for the 18-month study window may introduce selection bias; the exclusion of patients due to geographic moves, changes in insurance, or switching health care systems within study periods is a limitation of all outcomes studies. There are certainly additional variables that may affect POUS rates that we were unable to include in the models, such as cultural differences across geographic locations, differences between individual hospitals, differences in practice types (academic versus private), managed health care system differences, and provider training variation. While individual county, zip code, and/or state differences may exist, we did include geographic region (the 9 census divisions listed in Table 1) in the ML models. The data used were from an insurance claims database, and not all of these data may be available at the time of surgery. As such, this study can guide hospitals and health care systems in model development (the types of models to train, the variables to include, and the surgeries on which to focus prediction and intervention efforts) using accessible electronic health record data. Due to missing individual sociodemographic data, we used aggregated zip code-level percentage variables; incorporating individual-level data may improve performance. This study was limited to privately insured US patients and may not be generalizable to all populations; however, >50% of children in the United States have private health insurance coverage.


We designed and validated an ML algorithm for predicting adolescents who may develop POUS. The algorithm’s promising performance for certain surgeries suggests that these surgeries may benefit from specialized risk prediction models (developed with electronic health record data) and intervention strategies. Variables identified here as important for adolescent POUS prediction may inform future model analysis and development.


Name: Andrew Ward, PhD.

Contribution: This author helped with research conceptualization and design, data analysis, data interpretation, and writing and final approval of the manuscript.

Name: Trisha Jani, MS.

Contribution: This author helped with research conceptualization and design, data analysis, data interpretation, and writing and final approval of the manuscript.

Name: Elizabeth De Souza, PhD.

Contribution: This author helped with research conceptualization and design, data analysis, data interpretation, and writing and final approval of the manuscript.

Name: David Scheinker, PhD.

Contribution: This author helped with research conceptualization and design and writing and final approval of the manuscript.

Name: Nicholas Bambos, PhD.

Contribution: This author helped with research conceptualization and design and writing and final approval of the manuscript.

Name: T. Anthony Anderson, PhD, MD.

Contribution: This author helped with research conceptualization and design, data analysis, data interpretation, and writing and final approval of the manuscript.

This manuscript was handled by: Honorio T. Benzon, MD.


    1. Degenhardt L, Whiteford HA, Ferrari AJ, et al. Global burden of disease attributable to illicit drug use and dependence: findings from the Global Burden of Disease Study 2010. Lancet. 2013;382:1564–1574.
    2. Florence CS, Zhou C, Luo F, Xu L. The economic burden of prescription opioid overdose, abuse, and dependence in the United States, 2013. Med Care. 2016;54:901–906.
    3. Shei A, Hirst M, Kirson NY, Enloe CJ, Birnbaum HG, Dunlop WC. Estimating the health care burden of prescription opioid abuse in five European countries. Clinicoecon Outcomes Res. 2015;7:477–488.
    4. Guy GP Jr, Zhang K, Bohm MK, et al. Vital signs: changes in opioid prescribing in the United States, 2006-2015. MMWR Morb Mortal Wkly Rep. 2017;66:697–704.
    5. DePhillips M, Watts J, Lowry J, Dowd MD. Opioid prescribing practices in pediatric acute care settings. Pediatr Emerg Care. 2019;35:16–21.
    6. Meckler GD, Sheridan DC, Charlesworth CJ, Lupulescu-Mann N, Kim H, Sun BC. Opioid prescribing practices for pediatric headache. J Pediatr. 2019;204:240–244.e2.
    7. Tomaszewski DM, Arbuckle C, Yang S, Linstead E. Trends in opioid use in pediatric patients in US emergency departments from 2006 to 2015. JAMA Netw Open. 2018;1:e186161.
    8. Shah A, Hayes CJ, Martin BC. Characteristics of initial prescription episodes and likelihood of long-term opioid use—United States, 2006-2015. MMWR Morb Mortal Wkly Rep. 2017;66:265–269.
    9. Whyte AJ, Torregrossa MM, Barker JM, Gourley SL. Editorial: long-term consequences of adolescent drug use: evidence from pre-clinical and clinical models. Front Behav Neurosci. 2018;12:83.
    10. Miech R, Johnston L, O’Malley PM, Keyes KM, Heard K. Prescription opioids in adolescence and future opioid misuse. Pediatrics. 2015;136:e1169–e1177.
    11. Ladha KS, Neuman MD, Broms G, et al. Opioid prescribing after surgery in the United States, Canada, and Sweden. JAMA Netw Open. 2019;2:e1910734.
    12. Brummett CM, Waljee JF, Goesling J, et al. New persistent opioid use after minor and major surgical procedures in US adults. JAMA Surg. 2017;152:e170504.
    13. Johnson SP, Chung KC, Zhong L, et al. Risk of prolonged opioid use among opioid-naïve patients following common hand surgery procedures. J Hand Surg Am. 2016;41:947–957.e3.
    14. Forlenza EM, Lavoie-Gagne O, Lu Y, et al. Preoperative opioid use predicts prolonged postoperative opioid use and inferior patient outcomes following anterior cruciate ligament reconstruction. Arthroscopy. 2020;36:2681–2688.e1.
    15. Lawal OD, Gold J, Murthy A, et al. Rate and risk factors associated with prolonged opioid use after surgery: a systematic review and meta-analysis. JAMA Netw Open. 2020;3:e207367.
    16. Harbaugh CM, Lee JS, Hu HM, et al. Persistent opioid use among pediatric patients after surgery. Pediatrics. 2018;141:e20172439.
    17. Ward A, De Souza E, Miller D, et al. Incidence of and factors associated with prolonged and persistent postoperative opioid use in children 0-18 years of age. Anesth Analg. 2020;131:1237–1248.
    18. Brummett CM, England C, Evans-Shields J, et al. Health care burden associated with outpatient opioid use following inpatient or outpatient surgery. J Manag Care Spec Pharm. 2019;25:973–983.
    19. Bardach NS, Vittinghoff E, Asteria-Peñaloza R, et al. Measuring hospital quality using pediatric readmission and revisit rates. Pediatrics. 2013;132:429–436.
    20. Gagne JJ, He M, Bateman BT. Trends in opioid prescription in children and adolescents in a commercially insured population in the United States, 2004-2017. JAMA Pediatr. 2018;173:98–99.
    21. Karhade AV, Ogink PT, Thio Q, et al. Machine learning for prediction of sustained opioid prescription after anterior cervical discectomy and fusion. Spine J. 2019;19:976–983.
    22. Lo-Ciganic WH, Huang JL, Zhang HH, et al. Evaluation of machine-learning algorithms for predicting opioid overdose risk among medicare beneficiaries with opioid prescriptions. JAMA Netw Open. 2019;2:e190968.
    23. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
    24. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Paper presented at: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016; San Francisco, CA.
    25. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007;8:25.
    26. Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26:1340–1347.
    27. Fields AC, Cavallaro PM, Correll DJ, et al. Predictors of prolonged opioid use following colectomy. Dis Colon Rectum. 2019;62:1117–1123.
    28. Hah JM, Bateman BT, Ratliff J, Curtin C, Sun E. Chronic opioid use after surgery: implications for perioperative management in the face of the opioid epidemic. Anesth Analg. 2017;125:1733–1740.
    29. Politzer CS, Kildow BJ, Goltz DE, Green CL, Bolognesi MP, Seyler TM. Trends in opioid utilization before and after total knee arthroplasty. J Arthroplasty. 2018;33:S147–S153.e1.
    30. Romero-Brufau S, Huddleston JM, Escobar GJ, Liebow M. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use. Crit Care. 2015;19:285.
    31. Carter HB, Pearson JD. Prostate-specific antigen testing for early diagnosis of prostate cancer: formulation of guidelines. Urology. 1999;54:780–786.
    32. Maxim LD, Niebo R, Utell MJ. Screening tests: a review with examples. Inhal Toxicol. 2014;26:811–828.
    33. Karhade AV, Cha TD, Fogel HA, et al. Predicting prolonged opioid prescriptions in opioid-naïve lumbar spine surgery patients. Spine J. 2020;20:888–895.
    34. Karhade AV, Ogink PT, Thio QCBS, et al. Development of machine learning algorithms for prediction of prolonged opioid prescription after surgery for lumbar disc herniation. Spine J. 2019;19:1764–1771.
    35. Karhade AV, Schwab JH, Bedair HS. Development of machine learning algorithms for prediction of sustained postoperative opioid prescriptions after total hip arthroplasty. J Arthroplasty. 2019;34:2272–2277.e1.

    Supplemental Digital Content

    Copyright © 2021 International Anesthesia Research Society