Establishment and validation of an interactive artificial intelligence platform to predict postoperative ambulatory status for patients with metastatic spinal disease: a multicenter analysis

Background: Identifying patients at high risk of being unable to walk after surgery is important for surgeons planning treatment for metastatic spinal disease. However, clinical tools for assessing postoperative ambulatory status in these patients are lacking. The emergence of artificial intelligence (AI) offers a promising opportunity to develop accurate prediction models.
Methods: This study included 455 patients with metastatic spinal disease who underwent posterior decompressive surgery at three tertiary medical institutions. Of these, 220 patients from one medical institution formed the model derivation cohort, while 89 and 146 patients from two other medical institutions formed external validation cohorts 1 and 2, respectively. Patients in the model derivation cohort were used to develop and internally validate the models. To establish the interactive AI platform, machine learning techniques were used to develop prediction models, including logistic regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting machine (eXGBM), support vector machine (SVM), and neural network (NN). Furthermore, to enhance the resilience of the study's model, an ensemble machine learning approach was employed that combined the results of the above six algorithms using a soft-voting method. A scoring system incorporating 10 evaluation metrics was used to comprehensively assess the prediction performance of the developed models. The scoring system had a total score of 0 to 60, with higher scores denoting better prediction performance. An interactive AI platform was then deployed via Streamlit. Prediction performance was compared between medical experts and the AI platform in assessing the risk of postoperative inability to walk among patients with metastatic spinal disease.
Results: Among all developed models, the ensemble model outperformed the six other models with the highest score of 57, followed by the eXGBM model (54), SVM model (50), and NN model (50). The ensemble model had the best performance in accuracy and calibration slope, and the second-best performance in precision, recall, specificity, area under the curve (AUC), Brier score, and log loss. The scores of the LR model, RF model, and DT model were 39, 46, and 26, respectively. External validation demonstrated that the ensemble model had an AUC value of 0.873 (95% CI: 0.809–0.936) in external validation cohort 1 and 0.924 (95% CI: 0.890–0.959) in external validation cohort 2. In the new ensemble machine learning model excluding the number of comorbidities as a feature, the AUC value was still as high as 0.916 (95% CI: 0.863–0.969). In addition, the AUC values of the new model were 0.880 (95% CI: 0.819–0.940) in external validation cohort 1 and 0.922 (95% CI: 0.887–0.958) in external validation cohort 2, indicating favorable generalization of the model. The interactive AI platform was deployed online based on the final machine learning model and is available at https://postoperativeambulatory-izpdr6gsxxwhitr8fubutd.streamlit.app/. Using the AI platform, researchers were able to obtain the individual predicted risk of postoperative inability to walk, gain insights into the key factors influencing the outcome, and find stratified therapeutic recommendations. The AUC value obtained from the AI platform was significantly higher than the average AUC value achieved by the medical experts (P<0.001), indicating that the AI platform outperformed the individual medical experts.
Conclusions: The study successfully developed and validated an interactive AI platform for evaluating the risk of postoperative loss of ambulatory ability in patients with metastatic spinal disease. This AI platform has the potential to serve as a valuable model for guiding healthcare professionals in implementing surgical plans and ultimately enhancing patient outcomes.


Introduction
Metastatic spinal disease is a common complication of cancer, with a reported incidence of 70% [1,2]. It poses significant challenges for patients and clinicians due to its detrimental effects on neurological function and quality of life [3,4]. The disease is characterized by the spread of cancer cells to the vertebral column, leading to spinal instability, compression of neural elements, pain, and neurological deficits [4,5]. The management of metastatic spinal disease often involves surgical intervention to relieve pain, decompress neural structures, and restore spinal stability [6][7][8]. Notably, the ability to maintain or regain ambulatory status after surgery is a critical factor in determining the success of the procedure and the overall prognosis for patients [9].
Despite the importance of postoperative ambulatory status, there is a lack of robust clinical tools to predict whether a patient with metastatic spinal disease will be able to walk after surgery. Currently, the literature on this topic is limited: only a few studies have reported specific risk factors associated with postoperative ambulatory status [10][11][12][13][14], and some traditional scoring systems developed for survival prediction have been used to stratify functional outcomes [8,15,16]. As a result, surgeons lack reliable guidance for anticipating postoperative functional outcomes in patients with metastatic spinal disease [17]. Therefore, there is a pressing need for accurate prediction models to assess postoperative ambulatory status in these patients.
In recent years, the application of artificial intelligence (AI) and machine learning techniques in the field of spinal metastatic tumors has demonstrated excellent achievements [18][19][20]. Machine learning algorithms can effectively analyze complex datasets and identify patterns and relationships that may not be readily apparent to human observers. These algorithms have the potential to improve the accuracy and precision of prediction models [19], providing valuable insights into the prognosis and outcomes of patients with metastatic spinal disease [21,22].
Therefore, the objective of this study is to establish and validate an AI platform for predicting postoperative ambulatory status in patients with metastatic spinal disease. Furthermore, we conduct a comparative analysis between the predictions made by spine surgeons and those of the AI platform to assess its performance in clinical practice. The development of this AI platform has the potential to assist clinicians in making informed decisions regarding surgical intervention and improving patient outcomes.

Patients and study design
This study included 455 patients with metastatic spinal disease who underwent posterior decompressive surgery at three tertiary medical institutions from January 2015 to May 2023. Of these, 220 patients were prospectively collected from one medical institution to form the model derivation cohort, while 89 and 146 patients were collected from two other medical institutions to form external validation cohorts 1 and 2, respectively. The three medical institutions in this study, two of which are located in the northern region of our country and one in the southern region, are all teaching hospitals and are classified as tertiary A-grade hospitals with good reputations. Validating the model in different regions can further confirm its effectiveness by demonstrating its ability to perform consistently across diverse populations and healthcare settings. All patients underwent X-ray and MRI scans to confirm the location of the metastatic lesion. Patients were included if they had radiographic evidence of metastatic spinal disease and at least one of the following symptoms: progressive local mechanical or radiating pain, or impairment of sensory function, lower limb motor function, or sphincter function. Patients who were receiving conservative treatment, those with primary spinal tumors, those with metastatic spinal disease caused by leukemia, and those with intramedullary spinal metastases were excluded from the study. Additionally, patients who had previously undergone surgery at the site of spinal metastases were excluded. A flowchart illustrating the study design is presented in Figure 1. Patients in the model derivation cohort were randomly divided into a training cohort and an internal validation cohort using a 7:3 ratio. Patients from external validation cohorts 1 and 2 were used for the external validation of the model. Based on the specified inclusion and exclusion criteria and the objectives of the analysis, the study employed a per-protocol analysis method, and the study protocol is provided in Supplementary File 1 (Supplemental Digital Content 1, http://links.lww.com/JS9/B973). The study protocol was approved by the research ethics board of our institution, and all patients provided informed consent for the review of their medical records and images. This study was registered at a National Clinical Trial Registry Center. The study was conducted in accordance with the guidelines outlined in the Declaration of Helsinki, and the reporting of the study adhered to the Strengthening the Reporting of Cohort, Cross-sectional, and Case-control Studies in Surgery (STROCSS) criteria [23] and the TRIPOD Checklist [24] (Supplemental Digital Content 2, http://links.lww.com/JS9/B974).

HIGHLIGHTS
• The interactive AI platform can accurately predict postoperative ambulatory status.
• The ensemble model obtained the highest score in a comprehensive scoring system.
• External validation confirms high predictive capability of the AI model.
• The AI platform outperforms medical experts in predicting risk.

Surgical process
The decision to perform surgery was based on indications such as intractable pain due to spinal instability and myelopathy caused by spinal cord compression. The appropriate surgical approach was determined through multidisciplinary collaboration among a neuro-radiologist, spinal tumor surgeon, and oncologist. The surgical management of spinal metastases involved a complex procedure comprising palliative decompression, partial vertebrectomy/en bloc resection of vertebrae, and internal fixation using pedicle screw instrumentation. Under general anesthesia, the patient was positioned in the prone position on the operating table. A midline skin incision was made over the affected area of the spine, and the muscles were carefully dissected to expose the posterior elements of the spine. The surgical procedure involved a posterior approach with either laminectomy or laminotomy to gain access to the spinal cord and nerve roots. The tumor was then meticulously removed using a combination of vertebrectomy and tumor debulking techniques. The extent of the vertebrectomy depended on the tumor involvement, and the decision to perform either subtotal or total vertebrectomy was based on the surgeon's assessment of the tumor and the patient's overall health status. No intradural work was required in the present study. Following tumor excision, the resultant space within the vertebral body was filled with bone cement and artificial vertebral bodies to facilitate fusion and stabilization of the spine. Fusion was achieved by addressing the adjacent vertebrae above and below the corpectomy site, involving the insertion of screws into the pedicles of the affected vertebrae, which were interconnected using rods to ensure stability and mitigate the risk of further deformity. The pedicle screw instrumentation also facilitated the maintenance of the correction achieved during the vertebrectomy. The wound was closed in layers using sutures.

Evaluation of the primary outcome
The primary outcome of this study was the ambulatory status of the patients within one week after surgery. Ambulatory status was defined as the ability to independently take at least two steps with each foot (totaling four steps), even if the use of a cane or walker was necessary [6]. The ability to ambulate after surgery for metastatic spinal disease is of great importance as it directly impacts the patients' overall quality of life and functional independence [25].

Quality control
In this study, several measures were taken to safeguard the accuracy and reliability of the data. First, the research team received extensive training on the data collection protocols to minimize errors and ensure that all members followed the same standardized guidelines during data collection and recording. Second, data were entered using a double-entry verification method, in which two independent individuals entered the data and the two entries were cross-checked to detect discrepancies. Third, validation checks were performed to identify and correct inconsistencies or inaccuracies in the collected data. Fourth, data cleaning was carried out to detect and correct errors, missing data points, and outliers; this involved comparing the collected data with the source documents and resolving any discrepancies. Finally, data were monitored continuously throughout the study so that issues or trends that could affect data quality were identified early. Monitoring included periodic audits, review of data collection forms, and feedback to the research team. Together, these measures were intended to support the validity and integrity of the research findings.

Data preparation
A data preprocessing pipeline built with scikit-learn was used to ensure consistent and reproducible transformation of the data. The pipeline combined multiple preprocessing steps into a single object to improve the accuracy and reliability of the machine learning models. It involved data transformation, feature selection, data splitting, and standardization and normalization. Imbalanced data were addressed using a SMOTETomek resampling strategy, which combines the Synthetic Minority Oversampling Technique (SMOTE) and Tomek Links undersampling [26]. This strategy generates a new dataset with a larger sample size and a more balanced distribution of data, enhancing statistical power and generalizability of the findings. For example, the sample size of the model derivation cohort increased to 334 after implementing the SMOTETomek resampling strategy, with a positive rate of the primary outcome of 50%. Similarly, the sample sizes of external validation cohorts 1 and 2 increased to 134 and 228, respectively. In addition, a stratified strategy was employed to maintain consistent proportions of the outcome classes (ambulatory and nonambulatory patients).
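The stratified 7:3 division described above can be illustrated with a minimal, standard-library sketch. This is not the study's scikit-learn pipeline; the function name `stratified_split`, the seed, and the class mix of the example labels are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, valid_idx) so that each outcome class is split 7:3."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random assignment within a class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical labels: 1 = unable to walk after surgery, 0 = ambulatory.
# The 80/140 class mix is illustrative only and is NOT taken from the study.
labels = [1] * 80 + [0] * 140
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

Splitting within each class, rather than over the pooled sample, is what keeps the proportion of ambulatory and nonambulatory patients the same in the training and internal validation cohorts.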

Modeling
A comprehensive analysis was conducted using a range of models, including logistic regression (LR) and five machine learning algorithms: extreme gradient boosting machine (eXGBM), support vector machine (SVM), random forest (RF), neural network (NN), and decision tree (DT). Furthermore, to enhance the resilience of the study's model, an ensemble machine learning approach was employed using a soft-voting method [27,28]. All models were provided with the same input features to ensure consistency, and the model features were identified using subgroup analysis. Grid and random hyperparameter searches were performed to identify the optimal hyperparameters for each model, with the area under the curve (AUC) used as the optimization metric. During these searches, the model parameters were optimized and validated through 10-fold cross-validation on the training data. By training and evaluating each model on different combinations of hyperparameters using 10-fold cross-validation, we obtained a more reliable estimate of the model's performance. In addition, to accommodate the variability in model performance, wide ranges were established for the hyperparameters. For instance, the DT depth range was set from 2 to 100, allowing for a broad exploration of different tree depths. Similarly, the 'min_child_weight' hyperparameter, which sets the minimum sum of instance weights required in a child node, was varied from 1 to 100. The 'min_samples_split' and 'min_samples_leaf' hyperparameters were set from 2 to 200, allowing for flexibility in determining the minimum number of samples needed to split an internal node or form a leaf node, respectively. Learning curves were used to identify overfitting and underfitting in the models. In a learning curve, the model's performance on both the training and validation datasets is plotted against the number of training instances or iterations. Machine learning algorithms were implemented using Python (version 3.9.7), and hyperparameter tuning was conducted using Python scikit-learn (version 1.2.2).
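As an illustration of the soft-voting idea, the sketch below averages the predicted probabilities of several base models for each patient. It is a minimal standard-library example rather than the study's implementation; the function `soft_vote`, the equal weights, the 0.5 cut-off, and the example probabilities are all assumptions.

```python
def soft_vote(prob_by_model, weights=None):
    """Weighted average of per-model positive-class probabilities, per patient."""
    names = list(prob_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}    # assumption: equal weights
    n = len(prob_by_model[names[0]])
    if any(len(prob_by_model[name]) != n for name in names):
        raise ValueError("all models must score the same patients")
    total = sum(weights[name] for name in names)
    return [
        sum(weights[name] * prob_by_model[name][i] for name in names) / total
        for i in range(n)
    ]

# Hypothetical predicted risks from three base models for three patients.
probs = {
    "LR":  [0.20, 0.70, 0.95],
    "RF":  [0.10, 0.90, 0.45],
    "SVM": [0.30, 0.80, 0.40],
}
ensemble = soft_vote(probs)
predicted_class = [int(p >= 0.5) for p in ensemble]   # assumed 0.5 cut-off
print(ensemble, predicted_class)
```

For the third patient, two of the three models fall below 0.5, so a hard majority vote would predict the negative class, whereas the averaged probability (about 0.6) predicts the positive class. Soft voting lets a confident model outweigh two weakly dissenting ones.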

Validation
The internal and external validation cohorts were used to validate the models, and multiple evaluation metrics were employed, including the AUC, accuracy, precision, recall, specificity, Brier score, log loss, discrimination slope, calibration slope, and intercept-in-large value. The AUC value was obtained using 100 bootstraps. The accuracy, precision, and recall were calculated using a confusion matrix. In the following equation, TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

The Brier score was calculated using the following formula, where $N$ represents the total sample size, $p_i$ represents the predicted risk, and $o_i$ represents the observed outcome.

$$\mathrm{Brier\ score} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2$$

The log loss, calculated using the scikit-learn formula, is a metric that evaluates the quality of classification model predictions. It takes into account the number of samples ($N$), the number of classes ($M$), the true labels ($y_{ij}$), and the predicted probabilities ($p_{ij}$).

$$\mathrm{Log\ Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log\left(p_{ij}\right)$$
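The confusion-matrix metrics, Brier score, and log loss described above can be sketched in plain Python as follows. This is an illustrative re-implementation for the binary case, not the study's code; the 0.5 classification threshold, the clipping constant `eps`, and the toy data are assumptions.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, and specificity from a 2x2 confusion matrix."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - o) ** 2 for o, p in zip(y_true, y_prob)) / len(y_true)

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Log loss for two classes; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Toy data (hypothetical): 1 = unable to walk after surgery.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
metrics = classification_metrics(y_true, y_prob)
brier = brier_score(y_true, y_prob)
loss = binary_log_loss(y_true, y_prob)
print(metrics, brier, loss)
```

With two classes, the double sum in the log loss reduces to the familiar two-term form used in `binary_log_loss`, because exactly one of the two labels is 1 for each patient.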
The discrimination slope was calculated as the mean difference between the predicted probabilities of patients with and without postoperative ambulatory status. The calibration slope and intercept-in-large value were obtained from the calibration curve. Additionally, a scoring system was used to comprehensively evaluate the prediction performance of the models [29,30], with each metric rated on a scale of 1 to 6. The scoring system ranged from 0 to 60. Finally, decision curve analysis (DCA) was employed to determine the clinical net benefits for each model.
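The discrimination slope and a rank-based score of the kind described above might be computed as in the following sketch. The exact scoring rule is not spelled out in the text, so `rank_points` (best model receives 6 points, the next 5, and so on, with ties broken by input order) is an assumption for illustration. The three discrimination slopes are the values reported for the eXGBM, NN, and SVM models in the internal validation results.

```python
def discrimination_slope(y_true, y_prob):
    """Mean predicted risk in patients with the event minus that in patients without."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)

def rank_points(values, distance, max_points=6):
    """Assumed rule: smallest `distance` is best and earns `max_points`,
    the next model earns one point fewer, never dropping below zero."""
    ordered = sorted(values, key=lambda name: distance(values[name]))
    return {name: max(max_points - rank, 0) for rank, name in enumerate(ordered)}

# Toy predictions (hypothetical) to show the slope calculation.
slope = discrimination_slope([1, 1, 0, 0], [0.9, 0.7, 0.2, 0.4])

# Discrimination slopes reported in the text; higher is better, so negate.
slopes = {"eXGBM": 0.584, "NN": 0.539, "SVM": 0.538}
points = rank_points(slopes, distance=lambda v: -v)
print(slope, points)

# For metrics with a target value, the distance would change accordingly, e.g.
# calibration slope: distance=lambda v: abs(v - 1); Brier score: distance=lambda v: v.
```

Summing such points over the 10 metrics would give a total with a maximum of 60, consistent with the upper bound stated for the scoring system.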

Feature importance
The Shapley additive explanation (SHAP) values were used to determine the importance of model features [26]. The SHAP values were calculated using the following equation, where $g$ represents the interpretation model, $M$ represents the number of input parameters, $\phi_0$ represents a constant, $\phi_j$ represents the Shapley value of each model feature, and $z'_j$ represents the coalition vector.

$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$$

Among the coalition vectors, '1' indicates that the feature is the same as the feature of the case $x$ to be explained, while '0' indicates that the feature is missing in the present case $x$. Therefore, for case $x$ itself, all simplified features equal 1, and the SHAP expression simplifies to the following.

$$g(x') = \phi_0 + \sum_{j=1}^{M} \phi_j$$
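The additive form above can be checked on a small example by computing exact Shapley values through enumeration of all feature coalitions. The sketch below is only feasible for a handful of features and is not how the SHAP library or the study computed the values. The toy `model`, the case `x`, and the choice to represent a "missing" feature by a zero baseline are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (small M only)."""
    phi = [0.0] * n_features
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(len(others) + 1):
            # Weight of a coalition S of this size: |S|! (M - |S| - 1)! / M!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value_fn(s | {j}) - value_fn(s))
    return phi

# Hypothetical toy model with one interaction term, and the case x to explain.
x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]     # assumption: a missing feature takes this value

def model(v):
    return 0.5 + 1.0 * v[0] + 2.0 * v[1] + 0.5 * v[0] * v[2]

def value_fn(coalition):
    """Model output when only the features in `coalition` take the values of x."""
    z = [x[k] if k in coalition else baseline[k] for k in range(len(x))]
    return model(z)

phi0 = value_fn(set())                      # constant term (base value)
phi = shapley_values(value_fn, len(x))
print(phi0, phi, phi0 + sum(phi), model(x))
```

The base value plus the sum of the per-feature values reproduces the model output for the case, which is the simplified expression with all coalition entries equal to 1. The interaction between the first and third features is shared equally between them.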

Establishment of the interactive AI platform
An AI platform was developed to estimate the risk of not gaining ambulatory status in patients undergoing decompressive surgery for metastatic spinal disease. The AI platform was built with Streamlit and designed to be user-friendly and accessible. The code supporting the development of the AI platform is available at https://github.com/Starxueshu/postoperativeambulatory. The platform allows users to customize the input parameters, calculates the probability of not gaining ambulatory status based on the selected parameters, and provides an interface that explains the model's methodology and performance. Patients were categorized into high-risk or low-risk groups based on a threshold, and corresponding intervention recommendations were provided according to this stratification. In addition, a human-machine comparative experiment was conducted to compare the performance of 6 medical experts and the AI platform in predicting the outcome of not gaining ambulatory status among patients with spinal metastatic tumors. AUC values were calculated for each medical expert.
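To make the human-machine comparison concrete, the sketch below computes an AUC for a set of platform risks and for one expert's binary judgments, together with a percentile bootstrap interval. It is a standard-library illustration under stated assumptions: the data are invented, the percentile method and the seed are assumptions, and the DeLong test used in the study for the formal comparison is not implemented here.

```python
import random

def auc(y_true, y_score):
    """Probability that a random positive case scores above a random negative
    case, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, y_score, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if 0 < sum(yt) < n:                 # resample must contain both classes
            stats.append(auc(yt, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lower, upper

# Hypothetical data: 1 = unable to walk after surgery.
y_true = [1, 1, 1, 0, 0, 0]
platform_risk = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]   # predicted probabilities
expert_call = [1, 0, 1, 1, 0, 0]                 # one expert's yes/no judgments

auc_platform = auc(y_true, platform_risk)
auc_expert = auc(y_true, expert_call)
ci_low, ci_high = bootstrap_auc_ci(y_true, platform_risk)
print(auc_platform, auc_expert, (ci_low, ci_high))
```

Because an expert's yes/no call takes only two values, many positive-negative pairs are tied, and each tie contributes one half to the AUC.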

Statistical analysis
Continuous variables were summarized using the mean and SD for normally distributed data, while the median and interquartile range (IQR) were used for non-normally distributed data. Categorical variables were presented as proportions. Student's t-test was used to compare normally distributed continuous variables, while the Wilcoxon rank-sum test (Mann-Whitney U test) was used to compare non-normally distributed variables. The χ² test was conducted to compare the distribution of categorical variables. Statistical power analysis was conducted for significant variables. The DeLong test was used to compare the prediction performance between medical experts and the AI platform. Statistical analysis was performed using R (version 4.1.2), and a two-sided P-value less than 0.05 was considered statistically significant.

Patient clinical characteristics
In the model derivation cohort, 220 patients were included in the analysis. The median age was 60.00 (53.00, 68.00) years, with 67.3% of patients being male and 20.9% being current smokers (Table 1). The most common primary tumor type was rapid growth (38.6%), followed by moderate growth (35.0%) and slow growth (26.4%). The burden of comorbidities was relatively heavy, with 47.3% of patients having at least one comorbidity. The most prevalent comorbidities were hypertension (30.91%) and diabetes (13.64%) (Supplementary Table 1, Supplemental Digital Content 3, http://links.lww.com/JS9/B975). Regarding systemic therapies, preoperative chemotherapy, targeted therapy, and endocrine therapy were received by 17.3, 9.5, and 9.1% of patients, respectively. The tumor burden was also relatively heavy: 43.6% of patients had extravertebral bone metastases, 26.8% had visceral metastases, and 19.1% had an Eastern Cooperative Oncology Group (ECOG) score of four, indicating inability to care for oneself in daily life. The majority of patients were treated with palliative decompression (89.5%) at the thoracic and thoracolumbar site (67.7%). In the entire cohort, 58.6% of patients had a Bilsky score of 3, indicating severe spinal cord compression, and 49.1% of patients had lost their ability to walk before surgery. During surgery, 80.5% of patients received intraoperative blood transfusion. More detailed information on preoperative laboratory examinations, including albumin, cholesterol, hemoglobin, and prothrombin time (PT), is summarized in Table 1.

Identification of risk factors by subgroup analysis
Subgroup analysis revealed that patients who experienced postoperative inability to walk exhibited certain characteristics.They tended to be older (P = 0.041), have a higher number of comorbidities (P = 0.050), have a higher ECOG score (P < 0.001), a higher rate of surgical site involvement in the thoracic and thoracolumbar regions (P = 0.011), lower levels of preoperative albumin (P = 0.034) and total cholesterol (P = 0.004), higher levels of prothrombin time (PT) (P = 0.001), higher degrees of Bilsky score (P < 0.001), and a higher rate of preoperative inability to walk (P < 0.001) (Table 1).Consequently, these nine variables were utilized as feature inputs to train and optimize machine learning-based models.

Modeling and prediction evaluation
Before training the models, the SMOTETomek resampling strategy was applied to produce a new database with a more balanced data distribution, and the incidences of the positive and negative outcome were both 50%. The baseline clinical characteristics of the new database are presented in Supplementary  2 and Table 2). However, the NN model exhibited the highest accuracy (0.861), followed by the SVM model (0.842), the eXGBM model (0.832), and the RF model (0.832). The eXGBM model demonstrated the best prediction performance in terms of precision (0.837) and specificity (0.843), and the second-best performance in terms of log loss (0.382) (Fig. 3). The calibration curve for each model, depicted in Figure 4A, indicated favorable calibration as the curves were closely aligned with the perfectly calibrated curve. Calculation of the calibration slope and intercept-in-large value for each model revealed that the eXGBM model had the best intercept-in-large value (−0.002), which was the closest to 0, and the SVM model had the best calibration slope (0.859), which was the closest to 1 (Supplementary Figure 1, Supplemental Digital Content 7, http://links.lww.com/JS9/B979). Figure 4B displays plots of the mean predicted probability against count for each model, revealing distinct distribution patterns of the predicted risk of experiencing inability to walk. When the mean predicted probability was classified by the actual postoperative ambulatory status (Fig. 5), density curve analysis demonstrated that, with the exception of the DT model, the other models, particularly the eXGBM model and NN model, exhibited excellent discrimination ability. These models displayed less overlap and a larger distinguishing area between ambulatory and nonambulatory patients. The violin plot of discrimination slope confirmed that the eXGBM model had the highest discrimination slope (0.584), followed by the NN model (0.539) and the SVM model (0.538) (Fig. 6). DCA revealed similar trends, with the models exhibiting favorable clinical net benefits, except for the DT model (Fig. 7).
In detail, among the seven models, the ensemble model had the best performance in accuracy and calibration slope (Supplementary Figure 2E, Supplemental Digital Content 8, http://links.lww.com/ Therefore, the ensemble model served as the optimal model in this study.

External validation
External validation was performed on two cohorts (n = 89 and n = 146) using the optimal model. The baseline characteristics of the two cohorts can be found in Supplementary  3B, Supplemental Digital Content 13, http://links.lww.com/JS9/B985), indicating a strong discriminative power of the ensemble model in distinguishing between patients who were likely to experience postoperative inability to walk and those who were not. In addition, the accuracy was 0.739 in external validation cohort 1 and 0.855 in external validation cohort 2 (Supplementary Table 9, Supplemental Digital Content 14, http://links.lww.com/JS9/B986), with corresponding precision values of 0.864 and 0.846, respectively. The discrimination slopes were 0.316 (Supplementary Figure 3C, Supplemental Digital Content 13, http://links.lww.com/JS9/B985) and 0.533 (Supplementary Figure 3D, Supplemental Digital Content 13, http://links.

Feature importance
In the entire cohort, the analysis of feature importance revealed that the top five influential factors were ECOG, preoperative ambulatory status, total cholesterol, PT, and albumin for both the training cohort (Supplementary Figure 4A, Supplemental Digital Content 15, http://links.lww.com/JS9/B987) and the validation cohort (Supplementary Figure 4B, Supplemental Digital Content 15, http://links.lww.com/JS9/B987). Moreover, feature importance analysis was conducted for individualized cases using the SHAP method. Supplementary Figures 4C and D (Supplemental Digital Content 15, http://links.lww.com/JS9/B987) highlight a true negative case, where ECOG, preoperative ambulatory status, surgical site, Bilsky score, age, PT, and number of comorbidities were identified as protective factors, while total cholesterol was considered a risk factor. In this case, the three most significant features were ECOG, preoperative ambulatory status, and surgical site. Each feature was assigned a SHAP value, and the total SHAP value in each case represented the cumulative sum of values from all features. For this particular case, the total SHAP value was −5.588, which was lower than the base value of 0.022, indicating a low risk of experiencing inability to walk. Conversely, Supplementary Figures 4E and F (Supplemental Digital Content 15, http://links.lww.com/JS9/B987) illustrate a true positive case, where the majority of features were identified as risk factors, with the exception of albumin. Additionally, the three most important features in this case were ECOG, preoperative ambulatory status, and PT.

Comparison of prediction performance in the ensemble model with and without the number of comorbidities
According to the feature importance analysis, the number of comorbidities emerged as the least significant feature. Furthermore, the statistical power analysis revealed that only the number of comorbidities had a statistical power value lower than 0.80 (Supplementary Table 10, Supplemental Digital Content 16, http://links.lww.com/JS9/B988). Consequently, we compared the predictive performance of the previous model with that of a new model excluding the number of comorbidities. We found that the AUC value of the new model was still as high as 0.916 (95% CI: 0.863–0.969) (Supplementary Figure 5A, Supplemental Digital Content 17, http://links.lww.com/JS9/B989), slightly greater than the AUC value (0.911) of the previous model including the number of comorbidities. Thus, including the number of comorbidities in the model did not improve the prediction performance. More detailed metrics are summarized in the Supplementary We also tested the prediction performance of the new model in the two external validation cohorts (Supplementary Table 12, Supplemental Digital Content 19, http://links.lww.com/JS9/B991). The results demonstrated that the model exhibited favorable predictive performance in the two external validation cohorts. The AUC values of the new model were 0.880 (95% CI: 0.819–0.940) in external validation cohort 1 (Supplementary Figure 6A, Supplemental Digital Content 20, http://links.lww.com/JS9/B992) and 0.922 (95% CI: 0.887–0.958) in external validation cohort 2 (Supplementary Figure 6B, Supplemental Digital Content 20, http://links.lww.com/JS9/B992), still indicating very favorable prediction performance of the new model. The corresponding discrimination slopes were 0.339 (Supplementary Figure 6C, Supplemental Digital Content 20, http://links.lww.com/JS9/B992) and 0.540 (Supplementary Figure 6D, Supplemental Digital Content 20, http://links.lww.com/JS9/B992), respectively, and the calibration slopes were 1.323 (Supplementary Figure 6E, Supplemental Digital Content 20, http://links.lww.com/JS9/B992) and 1.035 (Supplementary Figure 6F, Supplemental Digital Content 20, http://links.lww.com/JS9/B992), respectively. The probability density curve demonstrated the favorable discrimination of the new model in both cohorts (Supplementary Figure 6G, Supplemental Digital Content 20, http://links.lww.com/JS9/B992 and Supplementary Figure 6H, Supplemental Digital Content 20, http://links.lww.com/JS9/B992), and DCA also showed favorable clinical net benefit (Supplementary Figure 6I, Supplemental Digital Content 20, http://links.lww.com/JS9/B992 and Supplementary Figure 6J, Supplemental Digital Content 20, http://links.lww.com/JS9/B992).

Deployment of the interactive AI platform
Based on the machine learning-based model, the AI platform has been deployed on the internet at https://postoperativeambulatory-izpdr6gsxxwhitr8fubutd.streamlit.app/. Healthcare professionals can access the AI platform by visiting the provided web link from any internet-connected device, such as a smartphone, computer, or iPad. Upon visiting the website, users can select the relevant parameters for the model features. By clicking the 'Submit' button, users receive the predicted risk of postoperative inability to walk and recommended individualized intervention strategies (Fig. 9). In addition, by leveraging the AI platform and analyzing the contribution of specific features to the prediction for each individual case, researchers can identify variables that serve as protective factors or risk factors for the patient. Furthermore, the AI platform assigns importance rankings to variables on an individualized basis, highlighting the significant risk factors that require greater clinical attention.
If the website is not functioning, users can reactivate the online calculator by clicking 'Yes, get this app back up!'. By entering a patient's data into the AI platform, users can obtain the predicted risk for that specific individual. For instance, a 78-year-old patient with an ECOG score of 2, treated at the thoracic spine, with a preoperative albumin level of 40.5 g/l, total cholesterol of 3.8 mmol/l, PT value of 11.2 s, Bilsky score of 3, and preoperative inability to walk had a predicted risk of postoperative inability to walk of 89.23%. Furthermore, the AI platform provides a summary of feature importance for the individual patient and corresponding recommended intervention strategies based on risk stratification. A video showing how to use the platform is presented in Supplementary File 2 (Supplemental Digital Content 21, http://links.lww.com/JS9/B993).

A comparison of prediction performance between the AI platform and medical experts
The AUC values obtained from the six medical experts varied, ranging from 0.651 to 0.763 (Fig. 10 and Supplementary Table 13, Supplemental Digital Content 22, http://links.lww.com/JS9/B994). This variability suggested differences in predictive ability among the medical experts, with some performing better than others. Notably, the average AUC value of the medical experts was 0.698, which was significantly lower than the AUC value achieved by the AI platform (P < 0.001, DeLong test). This substantial difference in AUC values highlights the superior predictive performance of the AI platform compared with the individual medical experts and underscores its potential to surpass their predictive capabilities. By harnessing advanced algorithms and machine learning techniques, the AI platform can provide more accurate and consistent predictions, leading to more informed clinical decision-making.
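As an illustration of the quantity being compared above, the sketch below computes the AUC as the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it, with ties counted as one half. The scores are hypothetical toy values, not study data, and the DeLong test used for the significance comparison is not implemented here.

```python
def auc(y_true, y_score):
    """AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # event ranked above non-event
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores for six patients (1 = unable to walk after surgery).
y_true = [1, 1, 1, 0, 0, 0]
platform_scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2]
expert_scores = [0.6, 0.5, 0.4, 0.7, 0.5, 0.2]

auc_platform = auc(y_true, platform_scores)
auc_expert = auc(y_true, expert_scores)
print(round(auc_platform, 3), round(auc_expert, 3))
```

The pairwise form is quadratic in the number of patients, which is acceptable for cohorts of a few hundred patients; a rank-based implementation would be preferable for much larger datasets.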

Principal findings
This study aimed to establish and validate an interactive AI platform for predicting postoperative ambulatory status in patients with metastatic spinal disease. The ensemble model performed the best in terms of prediction performance, with other models such as eXGBM, NN, and SVM also showing good results. Key variables influencing the outcome included age, ECOG score, surgical site, preoperative albumin, cholesterol levels, prothrombin time, Bilsky score, and preoperative ambulatory status. We further created an interactive AI platform, providing healthcare professionals with predicted risk assessments and personalized therapeutic recommendations. The platform is also transparent about how individual features contribute to the prediction for each case. Integration of the model into applications and patient-centered explanations of predictions represent opportunities for incorporation into healthcare systems as decision tools in the future.

Figure 7. Decision curve analysis for machine learning-based models. The x-axis represents the threshold probability, which is the probability at which a positive outcome is considered. The y-axis represents the net benefit, which is the difference between the proportion of true positives and the proportion of false positives, adjusted for the relative harm of false positives and false negatives. The curve visually demonstrates the net benefit of using the models across different threshold probabilities. The higher the curve lies above the baseline, the greater the potential benefit of using the models in decision-making.

Epidemiology of postoperative ambulatory status
The incidence and recovery rates of postoperative ambulation were examined in this study. In the model derivation cohort, 76.82% of patients were able to walk after surgery, and 98.21% of those who were ambulatory before surgery maintained their ambulatory ability. These findings are consistent with previous research [12,14]. For example, Chaichana et al. [12] reported that 78% of patients were able to ambulate postoperatively, and among those who could walk prior to surgery, 89% maintained ambulation. Another study found that 75.4% of patients were able to walk, with 61.9% of previously nonambulatory patients regaining ambulatory status [11]. A meta-analysis including 25 studies showed that 69.7% of patients were able to walk after treatment, with 84.7% maintaining ambulation [14]. It is important to note that ambulatory ability has a significant impact on quality of life and overall survival in patients with spinal metastases [31-33]. Loss of ambulatory ability resulted in an 82% reduction in quality of life among nonambulatory patients [25] and was associated with a significant decrease in survival outcomes [9]. Hence, it is crucial to prioritize the preservation or restoration of ambulatory status for these patients. Additionally, postoperative ambulatory status is a critical consideration for clinicians when determining the appropriateness of surgical interventions and oncological treatment strategies [8,17].

Factors affecting postoperative ambulatory status
Multiple studies have identified various risk factors that influence postoperative ambulatory status [10-14]. For example, a retrospective study revealed that preoperative ambulatory status, presence of pathological vertebral compression fractures at presentation, and utilization of radiation therapy were significant factors associated with postoperative ambulatory status. Preoperative ambulation and postoperative radiotherapy were found to be protective factors for patients with metastatic epidural spinal cord compression [12]. Another study highlighted that presurgical neurological status and prompt surgery for nonambulatory patients were predictors of improved functional outcomes in patients undergoing decompressive surgery for metastatic spinal cord compression [11]. An additional prospective study identified primary tumor type (breast or prostate), ambulatory status grade, bladder function, and prior chemotherapy as significant predictors of increased odds of ambulation after radiotherapy [13]. Notably, a meta-analysis encompassing 25 studies and 4897 patients, of whom 3173 received radiotherapy, revealed that ambulatory status before treatment, time interval from symptom onset to treatment, and time of developing motor deficits were significant prognostic factors for post-treatment ambulatory status in patients with metastatic spinal cord compression [14]. Our study revealed that lower preoperative levels of albumin and total cholesterol, as well as higher PT values, were associated with an increased likelihood of postoperative inability to walk. Reduced albumin and total cholesterol levels are indicative of malnutrition or underlying health conditions, which may compromise the body's capacity to heal and recuperate following surgical intervention, thereby impacting ambulatory function. Elevated PT values, in turn, signify impaired blood clotting function, potentially resulting in prolonged recovery and impeding mobility. By identifying the risk factors that influence postoperative ambulatory status, healthcare professionals can better assess and predict the mobility outcomes of patients undergoing surgery. This knowledge can guide treatment decisions, such as the timing of surgery and the use of interventions like radiation therapy, to optimize patient outcomes. Furthermore, by incorporating these factors into predictive models, healthcare professionals can more accurately predict the postoperative ambulatory status of patients.

Prediction of ambulatory status
Several scores have been developed to predict ambulatory status among patients with metastatic spinal cord compression who underwent radiotherapy. For instance, a retrospective study proposed a scoring system incorporating five features: primary tumor type, time interval from tumor diagnosis, presence of visceral metastases at the time of radiotherapy, motor function prior to radiotherapy, and time of developing motor deficits [34]. In the scoring system, each feature was assigned a score, and the cumulative score was used to estimate the likelihood of post-treatment ambulation, with higher scores indicating higher ambulatory rates. This scoring system was subsequently validated in a prospective study, demonstrating its effectiveness in distinguishing different probabilities of post-treatment ambulatory status [35]. Nonetheless, these scores were primarily designed for patients undergoing radiotherapy, and evaluation metrics such as AUC, accuracy, and sensitivity were not assessed. Therefore, their generalizability to patients receiving decompressive surgery might be very limited. The present study addresses a critical clinical need by developing an AI platform specifically designed to assess the risk of postoperative loss of ambulatory ability in patients with metastatic spinal disease. This fills a gap in current clinical tools, providing healthcare professionals with a valuable resource for making informed therapeutic decisions. Additionally, the study employed a variety of machine learning techniques to develop prediction models. This comprehensive approach allowed for a thorough evaluation of different modeling methods and identified the ensemble machine learning-based model as the optimal one. Furthermore, we rigorously fine-tuned and evaluated the model using cross-validation, splitting the dataset into 10 equal parts to assess its generalization ability. This approach helped to mitigate overfitting and to select the hyperparameters that yielded the best average performance across all folds. By searching over extensive hyperparameter ranges, the AI platform can capture the diverse characteristics and complexities of individual cases, leading to more accurate predictions. It is worth noting that while adding more features to the model could improve prediction performance, an excessive number of features may diminish clinical utility, which is not conducive to the clinical application of our AI platform. Therefore, our final model included eight clinical features as input features and achieved an AUC value of 0.916, with AUC values of 0.880 in external validation cohort 1 and 0.922 in external validation cohort 2.
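The 10-fold splitting described above can be sketched as follows. This is a minimal illustration using only the standard library; the cohort size of 100, the random seed, and the function name are assumptions chosen for the example and do not reproduce the study's actual pipeline.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k near-equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, validation


# Hypothetical cohort size, for illustration only.
splits = list(k_fold_indices(100, k=10))
for train, validation in splits:
    # A candidate hyperparameter setting would be fitted on `train` and scored
    # on `validation`; the setting with the best mean score is then kept.
    pass
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each patient appears in exactly one validation fold, so every hyperparameter setting is scored on data that was not used to fit it.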
The study's methodology, which involved collecting data from multiple medical institutions and using external validation cohorts, provides a strong foundation for assessing the generalizability of the developed AI models. By including patients from different institutions, the study has taken a step toward capturing the variability and diversity present in real-world clinical settings. The results of the external validation demonstrated favorable performance of the ensemble machine learning model across the different cohorts, indicating the potential generalizability of the model in predicting postoperative ambulatory status for patients with metastatic spinal disease. Furthermore, the deployment of an interactive AI platform based on the final machine learning model allows for broader accessibility and utilization of the predictive tool. Because the AI platform is available online, researchers and healthcare professionals from various institutions and clinical settings can use the predictive model to gain insights into individual predicted risks and access stratified therapeutic recommendations.
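The ensemble model referred to in this section combines the six base algorithms by soft voting. A minimal sketch of that idea is given below, assuming equal weights for all models and a classification threshold of 0.5; both are illustrative assumptions, and the per-model probabilities are hypothetical values rather than study output.

```python
def soft_vote(prob_by_model, threshold=0.5):
    """Average per-model predicted risks for one patient and apply a threshold.

    Equal weighting and the 0.5 threshold are assumptions for illustration.
    """
    probs = list(prob_by_model.values())
    if not probs:
        raise ValueError("at least one model probability is required")
    avg = sum(probs) / len(probs)
    return avg, int(avg >= threshold)


# Hypothetical predicted risks of postoperative inability to walk
# from the six base models for a single patient.
patient_probs = {
    "LR": 0.62, "DT": 0.40, "RF": 0.71,
    "eXGBM": 0.68, "SVM": 0.55, "NN": 0.64,
}
ensemble_risk, ensemble_label = soft_vote(patient_probs)
print(round(ensemble_risk, 2), ensemble_label)
```

Averaging probabilities rather than counting class votes lets a confident model outweigh several uncertain ones, which is the usual motivation for soft over hard voting.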

Intervention guidance under the AI model
The AI platform developed in our study can automatically stratify patients into high-risk and low-risk groups for postoperative ambulatory status once healthcare professionals enter the patient's variable parameters. This functionality streamlines the process for healthcare professionals and enables them to quickly and accurately assess patient risk, leading to more targeted and effective interventions and ultimately improving patient outcomes. In detail, identifying patients at high risk of inability to walk after surgery allows surgeons to tailor their therapeutic strategies. For the high-risk group, aggressive surgical interventions and close monitoring may be warranted, including considerations such as wide excision of metastatic tumors, earlier initiation of physical therapy, utilization of assistive devices, intensive rehabilitation programs, and proactive management of pain and complications. Additionally, closer follow-up visits and frequent assessments can help detect any deterioration in ambulatory status and allow for timely intervention. Notably, if patients are at high risk of postoperative ambulatory impairment and their potential benefits from surgery, such as pain relief, are exceedingly limited, clinical consideration of the appropriateness of surgery for such patients is crucial. In such cases, a careful evaluation is warranted to weigh the potential risks and benefits, ensuring that surgical interventions are aligned with the best interests of the patient's overall well-being. This study's interactive AI platform, which accurately predicts postoperative ambulatory status, can aid healthcare professionals in making informed decisions about surgical interventions, further enhancing the patient-centered approach to care. The AI platform is particularly valuable in cases where the risk of postoperative ambulatory loss is high, as it allows for more tailored and patient-specific treatment recommendations. On the other hand, for the low-risk group, a more conservative approach may be taken. Resources can be allocated more efficiently, and interventions can be focused on minimizing complications rather than restoring ambulatory status. This group may require less intensive rehabilitation and follow-up, allowing resources to be redirected to patients who need them more. By stratifying patients into high-risk and low-risk groups, this AI model provides clinicians with valuable information to guide their decision-making and optimize patient outcomes. It can help clinicians allocate resources effectively, provide personalized care plans, and tailor interventions based on individual patient needs. Ultimately, this approach can lead to improved patient experiences, better surgical outcomes, and more efficient healthcare resource utilization.
We propose that the AI platform be used before surgery for all patients with metastatic spinal disease. By entering the essential model parameters, such as the patient's ECOG score, Bilsky score, and preoperative albumin level, and clicking the 'Submit' button, healthcare professionals can obtain not only the patient's postoperative ambulatory risk and risk stratification but also a report detailing the importance ranking of risk variables. Based on the report, researchers can identify variables that act as protective factors or risk factors for postoperative ambulation in the patient. The AI platform also ranks the importance of variables on an individualized basis, highlighting the significant risk factors that should receive more clinical attention. For instance, if the ranking of risk factors for a patient suggests that comorbidities have a substantial impact on postoperative ambulatory status, clinical focus should be placed on the treatment and management of these comorbidities. This personalized approach allows healthcare professionals to prioritize interventions and treatments based on the specific needs and risks of each patient, optimizing patient outcomes and improving the overall quality of care.
Notably, the DCA curve demonstrated that interventions had greater benefits when the threshold risk probability was lower, which is reasonable because low-threshold interventions are those with lower costs that can be widely implemented. However, in situations where medical resources were limited and only patients above higher risk thresholds could be selected for intervention, the DCA results showed no net benefit of the model in external validation cohort 1. This may be due to some bias in the sample distribution, such as a tendency toward predicting positive results, which could be related to the sampling done during the modeling process that increased the positivity rate from around 20% to 50%. In external validation cohort 2, however, the DCA showed that interventions could still be beneficial when medical resources were limited and only patients above higher risk thresholds were selected. Therefore, we suggest that interventions can be carried out at low cost and can remain effective even when medical resources are limited.
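The net benefit underlying the DCA discussion above is commonly computed as the true-positive proportion minus the false-positive proportion weighted by the odds of the threshold probability. The sketch below assumes that standard formula and uses hypothetical toy data; it is not the study's analysis code.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients with predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: intervene on every patient regardless of risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy cohort (1 = unable to walk after surgery).
y_true = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.2, 0.1, 0.3, 0.4, 0.05, 0.15, 0.35]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(round(nb_model, 3), round(nb_all, 3))
```

A decision curve is obtained by evaluating these functions over a range of thresholds; a model is useful at a given threshold only if its net benefit exceeds both the treat-all value and zero (the treat-none strategy).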

Limitations
The limitations of the study should be taken into consideration when interpreting the results. Firstly, the AI platform developed in this study is heavily reliant on the availability and quality of input data, and the complex nature of the model and potential data noise could lead to overfitting. Secondly, while the AI platform demonstrated good predictive performance, the dynamic nature of the disease and potential changes in patient status over time may affect the accuracy of the platform. In addition, the developed AI model was primarily designed to predict short-term ambulatory status after surgery, and thus additional validation and calibration are required for predicting long-term ambulatory prognosis. Thirdly, cross-validation could help to mitigate the risk of overfitting, but it also results in fewer outcomes per fold, which might affect the reliability of the model's performance assessment. Fourthly, the integration of additional variables, such as the duration between spinal cord metastasis onset and surgical intervention, holds potential to enhance the predictive capabilities of the AI platform. Fifthly, while these models provide accurate predictions, the complex algorithms used in machine learning techniques can make it challenging to understand and explain the deeper reasoning behind the predictions. Finally, the generalizability of the AI models should be carefully considered. External validation demonstrated favorable performance, but as we conducted external validation in only two hospitals, it is important to continually assess and refine the models as new data become available to ensure their applicability across different patient populations and clinical settings, such as nontertiary hospitals and hospitals in other regions. Thus, while the AI platform shows promise in predicting postoperative ambulatory status in patients with metastatic spinal disease, extensive validation of the model with a large sample size is needed to enhance its applicability and utility in clinical practice.

Figure 1. Schematic depiction of study design and machine learning process.

Figure 4. Calibration curve and histogram of predicted probability for machine learning-based models. A. Calibration curve for machine learning-based models. In the curve, the x-axis represents the predicted probabilities or scores generated by the models, while the y-axis represents the observed probabilities or actual outcomes. The curve visually displays how well the model's predictions align with the actual outcomes. A perfectly calibrated model would exhibit a diagonal line, indicating a close match between predicted and observed probabilities; B. Histogram of predicted probability of postoperative inability to walk for machine learning-based models.

Figure 5. Probability density curve for machine learning-based models. A. Logistic regression; B. Extreme gradient boosting machine; C. Support vector machine; D. Random forest; E. Neural network; F. Decision tree. The curve was plotted with predicted risk against probability density. Blue indicates patients who were able to walk after surgery, whereas red indicates patients who were unable to walk after surgery.

Figure 6. Violin plot presenting the discrimination slope value for machine learning-based models. A. Logistic regression; B. Extreme gradient boosting machine; C. Support vector machine; D. Random forest; E. Neural network; F. Decision tree. Blue indicates patients who were able to walk after surgery, whereas red indicates patients who were unable to walk after surgery.

Figure 8. Heatmap to comprehensively evaluate the prediction performance for machine learning-based models and the ensemble model. Red indicates superior prediction performance, while blue indicates relatively inferior prediction performance.

Figure 9. The human-machine interactive artificial intelligence platform. A. Section for inputting data; B. Section for calculating the risk of postoperative inability to walk; C. Section for reporting risk factors and showing model information. Upon accessing the website, users can select the relevant parameters for the model features. Upon clicking the 'Submit' button, users obtain the predicted risk of postoperative inability to walk along with recommended individualized intervention strategies. By utilizing the AI platform and analyzing the contribution of specific features to the prediction for each individual case, researchers can identify variables that act as protective factors or risk factors for the individual.

Figure 10. A comparison of prediction performance between the AI platform and medical experts using the area under the curve analysis (P < 0.001, DeLong test).

Table 2 (Supplemental Digital Content 4, http://links.lww.com/JS9/B976). A comparison of these clinical characteristics was performed between the training cohort and the internal validation cohort, and the results indicated that all the clinical characteristics were similar (All P > 0.05, Supplementary Table 3, Supplemental Digital Content 5, http://links.lww.com/JS9/B977). This suggests that there were no significant differences in the baseline clinical characteristics between the two groups. The hyperparameters of all the models are presented in Supplementary Table 4 (Supplemental Digital Content 6, http://links.lww.com/JS9/B978). Among the six developed models, the RF model demonstrated the highest AUC value of 0.926

Table 1
Patients' clinical characteristics and a comparison of clinical characteristics between patients with and without postoperative walking ability.

JS9/B980), and the second-best performance in precision, recall, specificity, AUC, Brier score, and log loss. The scores of the LR model, RF model, and DT model were 39, 46, and 26, respectively.

Table 2
Prediction performance of models.