Education: Review Article
Risk Stratification Tools for Predicting Morbidity and Mortality in Adult Patients Undergoing Major Surgery: Qualitative Systematic Review
Moonesinghe, Suneetha Ramani F.R.C.A.*; Mythen, Michael G. M.D.†; Das, Priya M.B.B.S.‡; Rowan, Kathryn M. Ph.D.§; Grocott, Michael P. W. M.D.‖
Risk stratification is essential for both clinical risk prediction and comparative audit. There are a variety of risk stratification tools available for use in major noncardiac surgery, but their discrimination and calibration have not previously been systematically reviewed in heterogeneous patient cohorts.
Embase, MEDLINE, and Web of Science were searched for studies published between January 1, 1980 and August 6, 2011 in adult patients undergoing major noncardiac, nonneurological surgery. Twenty-seven studies evaluating 34 risk stratification tools were identified which met inclusion criteria. The Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality and the Surgical Risk Scale were demonstrated to be the most consistently accurate tools that have been validated in multiple studies; however, both have limitations. Future work should focus on further evaluation of these and other parsimonious risk predictors, including validation in international cohorts. There is also a need for studies examining the impact that the use of these tools has on clinical decision making and patient outcome.
ACCURATE prediction of perioperative risk is an important goal—to enable informed consent for patients undergoing surgery and to guide clinical decision making in the perioperative period. In addition, by adjusting for risk, an accurate risk stratification tool enables meaningful comparison of surgical outcomes between providers for service evaluation or clinical audit. Some risk stratification tools have been incorporated into clinical practice, and indeed, have been recommended for these purposes.1
Risk stratification tools may be subdivided into risk scores and risk prediction models. Both are usually developed using multivariable analysis of risk factors for a specific outcome.2
Risk scores assign a weighting to factors identified as independent predictors of an outcome, with the weighting for each factor often determined by the value of the regression coefficient in the multivariable analysis. The sum of the weightings then gives the score, with a higher total reflecting greater risk. Risk scores have the advantage that they are simple to use in the clinical setting. However, although they may score a patient on a scale on which other patients may be compared, they do not provide an individualized risk prediction of an adverse outcome.3
Examples of risk scores are the American Society of Anesthesiologists’ Physical Status score (ASA-PS)4 and the Lee Revised Cardiac Risk Index.5
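As an illustration of how a risk score of this kind might be assembled, the following minimal sketch converts regression coefficients into integer points and sums them for one patient. The risk factors, coefficient values, and the rule of dividing by the smallest coefficient are assumptions chosen for illustration; they are not taken from any published tool.

```python
# Hypothetical regression coefficients (log odds ratios), for illustration only.
COEFFICIENTS = {
    "age_over_70": 0.69,
    "emergency_surgery": 1.39,
    "renal_impairment": 0.41,
}


def points_from_coefficients(coefficients):
    """Scale each coefficient by the smallest one and round to whole points."""
    base = min(abs(value) for value in coefficients.values())
    return {name: round(value / base) for name, value in coefficients.items()}


def risk_score(patient, points):
    """Sum the points for every risk factor that is present in the patient."""
    return sum(points[name] for name, present in patient.items() if present)


POINTS = points_from_coefficients(COEFFICIENTS)
example_patient = {
    "age_over_70": True,
    "emergency_surgery": True,
    "renal_impairment": False,
}
print(POINTS)
print(risk_score(example_patient, POINTS))
```

The result is an ordinal total that ranks patients against one another; it is not a probability of the outcome.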
By contrast, risk prediction models estimate an individual probability of risk for a patient by entering the patient’s data into the multivariable risk prediction model. Although risk prediction models may be more accurate predictors of an individual patient’s risk than risk scores, they are more complex to use in the day-to-day clinical setting.
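By way of contrast with an additive score, a risk prediction model based on logistic regression might be sketched as follows. The intercept and coefficients are hypothetical values chosen for illustration, and the function name is an assumption rather than part of any published model.

```python
import math

# Hypothetical intercept and coefficients, for illustration only.
MODEL_INTERCEPT = -4.0
MODEL_COEFFICIENTS = {
    "age_over_70": 0.69,
    "emergency_surgery": 1.39,
    "renal_impairment": 0.41,
}


def predicted_risk(patient):
    """Return an individual probability of the outcome from a logistic model."""
    linear_predictor = MODEL_INTERCEPT + sum(
        MODEL_COEFFICIENTS[name] * float(present)
        for name, present in patient.items()
    )
    # The logistic function maps the linear predictor onto the 0 to 1 range.
    return 1.0 / (1.0 + math.exp(-linear_predictor))


low_risk = {"age_over_70": False, "emergency_surgery": False, "renal_impairment": False}
high_risk = {"age_over_70": True, "emergency_surgery": True, "renal_impairment": True}
print(f"{predicted_risk(low_risk):.3f}")
print(f"{predicted_risk(high_risk):.3f}")
```

The additional step of exponentiation is what makes such models harder to apply at the bedside than a simple sum of points.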
Despite increasing interest in more sophisticated risk prediction methods, such as the measurement of functional capacity by exercise testing,6 risk stratification tools remain the most readily accessible option for this purpose. However, clinical experience tells us that they are not commonly used in everyday practice. Lack of use may be due to poor awareness amongst clinicians of the available options and concerns regarding their complexity and accuracy.7
In other clinical settings, low uptake of risk stratification tools has been ascribed to a lack of clarity on the precision of available tools, resulting from perhaps unnecessary efforts to make minor refinements to existing methods, or to developing novel methods, with the aim of achieving greater predictive accuracy.8
With the aim of summarizing the available risk stratification tools in perioperative care, in order to make recommendations about which methods are appropriate for use both in clinical practice and in research, we have undertaken a qualitative systematic review of the available evidence. The specific question we sought to answer was “What is the performance of risk stratification tools, validated for morbidity and/or mortality, in heterogeneous cohorts of surgical (noncardiac, nonneurological) patients?” The review had three main objectives: to summarize the available risk prediction methods, to report on their performance, and to comment on their strengths and weaknesses, with particular focus on accuracy and ease of application.
Materials and Methods
Previously published standards for reporting systematic reviews of observational studies were adhered to when undertaking this study.9
A Preferred Reporting Items for Systematic reviews and Meta-analyses checklist10 was used in the preparation of this report (appendix 1).
Definitions for the Purposes of This Study
A “risk stratification tool” was defined as a scoring system or model used to predict or adjust for either mortality or morbidity after surgery, and which contained at least two different risk factors. “Major surgery” was defined as a procedure taking place in an operating theatre and conducted by a surgeon; thus, studies of cohorts of patients undergoing endoscopic, angiographic, dental, and interventional radiological procedures were excluded. A “heterogeneous patient cohort” was defined as a cohort of patients including at least two different surgical specialities. Studies of gastrointestinal surgery, which included hepatobiliary surgery, were included. We excluded studies that consisted entirely of cohorts undergoing ambulatory (day case) surgery and cohorts that included cardiac or neurological surgery.
Search Strategy and Study Eligibility
A search for articles published between January 1, 1980 and August 6, 2011 was undertaken using MEDLINE, Embase, and Web of Science. No language restriction was applied. The search strategy and inclusion and exclusion criteria are detailed in appendix 2. Of note, articles reporting development studies were excluded, unless the article included validation in a separate cohort.
Data Extraction and Quality Assessment of Studies
Data extraction was independently undertaken by Drs. Moonesinghe and Das, using standardized tables relating to the study characteristics, quality, and outcomes. Where there was disagreement in the data extraction between these two authors, Dr. Moonesinghe resolved the query by referring again to the original articles. Study characteristics extracted from each article included the number of patients, the country where the study was conducted, the outcome measures and endpoints of each study, and the risk stratification tools being assessed. Data were also extracted regarding the most detailed description of the types of surgery included in each study cohort reported in the articles. We also extracted clinical outcome data (morbidity and mortality) for the cohorts in each study.
Assessment of study quality was based on the framework for assessing the internal validity of articles dealing with prognosis developed by Altman.11
The following criteria were used: the number of patients included in analyses, whether the study was conducted at a single site or at multiple sites, the timing of data collection (prospective vs. retrospective), whether a description of baseline characteristics for the cohort was included (including comorbidities, type of surgery, and demographic data), and selection criteria for patients included in the study (to assess for selection bias). Selection bias was judged to be present if a study restricted the type of patient who could be enrolled based on age, ethnicity, sex, premorbid condition, urgency of surgery, or postoperative destination (e.g., critical care). In addition, we reported the setting of each validation study, i.e., whether the validation was conducted in a split sample of the original development cohort or whether the validation cohort was entirely different from that in which the tool was developed.13 Finally, as a measure of their clinical usability and reproducibility, we reported whether each risk stratification tool used variables which were objective (e.g., blood results), subjective (e.g., chest radiograph interpretation), or both.14
Data Analysis and Statistical Considerations
The performance of each risk stratification tool was evaluated using measures of discrimination and, where appropriate, calibration. Discrimination (how well a model or score correctly identifies a particular outcome) was reported using either the area under the receiver operating characteristic curve (AUROC) or the concordance (c-) statistic. We considered an AUROC of less than 0.7 to indicate poor performance, 0.7–0.9 to be moderate, and greater than 0.9 to reflect high performance.15
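For readers less familiar with the AUROC, the following minimal sketch computes it as the proportion of patient pairs (one with the outcome, one without) in which the patient with the outcome has the higher score, counting ties as one half. It then applies the thresholds stated above. The function names and the small data set are illustrative assumptions.

```python
def auroc(scores, outcomes):
    """Concordance (c-) statistic: ties between a case and a non-case count 0.5."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    non_cases = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for case_score in cases:
        for non_case_score in non_cases:
            if case_score > non_case_score:
                concordant += 1.0
            elif case_score == non_case_score:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


def classify_discrimination(value):
    """Apply the thresholds used in this review (<0.7, 0.7-0.9, >0.9)."""
    if value < 0.7:
        return "poor"
    if value <= 0.9:
        return "moderate"
    return "high"


# Hypothetical risk scores and outcomes (1 = died, 0 = survived).
example_scores = [1, 2, 2, 3, 4, 5]
example_outcomes = [0, 0, 1, 0, 1, 1]
value = auroc(example_scores, example_outcomes)
print(f"{value:.3f}", classify_discrimination(value))
```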
Calibration is defined as how well the prognostic estimation of a model matches the probability of the event of interest across the full range of outcomes in the population being studied. Where reported, either Hosmer–Lemeshow or Pearson chi-square statistics were extracted as an evaluation of calibration; a P value of more than 0.05 was taken to indicate that there was no evidence of lack-of-fit.
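The Hosmer–Lemeshow approach can be sketched as follows, assuming the common convention of ten equal-sized groups ordered by predicted risk and a chi-square reference distribution with (groups - 2) degrees of freedom. The helper for the chi-square tail probability uses the closed form that holds for even degrees of freedom only; all names and the example data are illustrative assumptions.

```python
import math


def hosmer_lemeshow(probabilities, outcomes, groups=10):
    """Sum over risk-ordered groups of (observed - expected)^2 / (n * p * (1 - p))."""
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        mean_p = expected / len(chunk)
        if 0.0 < mean_p < 1.0:  # skip groups with zero binomial variance
            statistic += (observed - expected) ** 2 / (len(chunk) * mean_p * (1.0 - mean_p))
    return statistic


def chi_square_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical cohort: ten risk levels, 20 patients each, events equal to expected.
levels = [0.05 + 0.10 * g for g in range(10)]
probabilities = [p for p in levels for _ in range(20)]
outcomes = []
for g in range(10):
    events = 2 * g + 1
    outcomes += [1] * events + [0] * (20 - events)

calibrated = hosmer_lemeshow(probabilities, outcomes)
halved = hosmer_lemeshow([p / 2.0 for p in probabilities], outcomes)
print(f"{calibrated:.3f}", f"{chi_square_sf_even_df(calibrated, 8):.3f}")
print(f"{halved:.1f}", chi_square_sf_even_df(halved, 8) < 0.05)
```

In this constructed example, the first model matches the observed event counts exactly, whereas halving every predicted probability produces a large statistic and a P value below 0.05.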
Results
In the initial search, 139,775 articles on MEDLINE and 71,841 on Embase were listed, and the titles and abstracts of these were screened to identify articles which described risk stratification tools used in any adult noncardiac, nonneurological surgery. Seven hundred fifty-one articles then underwent a review. Hand searching of reference lists and citations identified a further 432 studies which were also reviewed in detail.
Three studies were identified that graphically displayed receiver operating characteristic curves in their results but did not report AUROCs.16–18
The authors of these studies were contacted for additional information; none responded, so these studies were excluded from the analysis. Six foreign language studies, which may have been eligible for inclusion based on review of the abstracts, but for which we were unable to obtain translations, were also omitted from the analysis.19–24
The flow chart for the review is detailed in figure 1.
A total of 27 studies evaluating 34 risk stratification tools were included in the analysis. All were cohort studies. Eight tools were validated in multiple studies; the most commonly reported were the ASA-PS (four studies, total number of patients, n = 4,014), the Acute Physiology and Chronic Health Evaluation II (APACHE II) scoring system (four studies, n = 5,897), the Physiological and Operative Severity Score for the enUmeration of Mortality and Morbidity (POSSUM; three studies, n = 2,915), the Portsmouth variation of POSSUM (P-POSSUM; five studies, n = 10,648; mortality model only), the Surgical Risk Scale (three studies, n = 5,244; mortality model only), the Surgical Apgar Score (three studies, n = 10,795), the Charlson Comorbidity Index (two studies, n = 2,463,997), and the Donati Surgical Risk Score (two studies, n = 7,121). The accuracy of a further 26 tools was evaluated in single-validation studies. A comparison of tools that were validated in multiple studies is detailed in tables 1. The general characteristics of all included studies are summarized in table 3.
The quality assessment of included studies is summarized in table 3. Seven studies were multicenter and 21 were single center. The data collection was prospective in 19 studies, retrospective in 7, and based on administrative data in 2 studies. Sixteen studies used mortality as an outcome measure, four used morbidity, and eight used both. The study endpoints included 30-day outcome in 12 articles, hospital discharge in 15 articles, and 3 articles also included shorter or longer follow-up times ranging from 1 day to 1 yr. Nineteen studies of the total 28 reported baseline patient characteristics of physiology or comorbidity, surgery, and demographics; selection bias was evident in 12 studies.
Outcomes are summarized in table 4. Surgical mortality at 30 days varied between 1.25 and 12.2% and at hospital discharge between 0.8 and 24.7%.
All but one25 of the six studies which separately tested the discrimination of stratification tools for morbidity and mortality reported that morbidity prediction was less accurate. There was considerable heterogeneity in the definition of morbidity in the 12 studies that reported this outcome (see appendix 3 for summary), and in keeping with this, there was wide variation in complication rates in different studies (between 6.726
Calibration was poorly reported: 16 studies did not report calibration at all; of the remaining 11 articles, 2 reported only whether the models were of “good fit,” without reporting the appropriate statistics. One article did not report calibration in their results, despite stating in the methods that they would calculate it.27
Risk Stratification Tools Using Preoperative Data Only
Four entirely preoperative risk stratification tools (ASA-PS, Surgical Risk Scale, Surgical Risk Score, and the Charlson Comorbidity Index) were validated in multiple studies. The Surgical Risk Scale and the Surgical Risk Score both contain the ASA-PS, and the urgency and severity of surgery; both have also been multiply validated. The Surgical Risk Score28 was developed and originally validated in Italy29 and contains the ASA-PS, a 3-point scale modification of the Johns Hopkins surgical severity criteria and a binary definition of surgical urgency (elective vs. emergency). The only published study evaluating the Surgical Risk Score after its initial validation found it to be poorly predictive of inpatient mortality.28
The Surgical Risk Scale30–32 uses the ASA-PS alongside United Kingdom definitions of operative urgency (a 4-point scale defined by the United Kingdom National Confidential Enquiry into Postoperative Death and Outcome) and severity (the British United Provident Association classification which is used to rank surgical procedures for the purposes of financial billing in the private sector). Both studies validating this system after its initial development found it to be a moderately discriminant tool (AUROC >0.8).30
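A minimal sketch of how an additive score with these three components might be computed is given below. The category labels and point values (urgency 1 to 4, severity 1 to 5, ASA-PS 1 to 5, giving a total of 3 to 14) are assumptions based on how the scale is commonly described and are not stated in this review; they should be checked against the original publication before any use.

```python
# Assumed point allocations, for illustration only (not taken from this review).
URGENCY_POINTS = {"elective": 1, "scheduled": 2, "urgent": 3, "emergency": 4}
SEVERITY_POINTS = {
    "minor": 1,
    "intermediate": 2,
    "major": 3,
    "major plus": 4,
    "complex major": 5,
}


def surgical_risk_scale(asa_ps, urgency, severity):
    """Add the ASA-PS grade to the urgency and severity points."""
    if asa_ps not in range(1, 6):
        raise ValueError("ASA-PS is assumed here to range from 1 to 5")
    return asa_ps + URGENCY_POINTS[urgency] + SEVERITY_POINTS[severity]


print(surgical_risk_scale(1, "elective", "minor"))
print(surgical_risk_scale(3, "urgent", "major"))
```

Because the total is a simple sum of three values, it can be calculated without a computer, which is the practical appeal of a parsimonious score.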
A further 18 different risk stratification tools using solely preoperative data were validated in single publications. Several of these were originally derived and validated for purposes other than the prediction of generic morbidity and mortality: these include cardiac risk prediction scores,27 measures of nutritional status,34 and frailty indices.27 These tools are described in appendix 4.
Risk Stratification Tools Incorporating Intra- and Postoperative Data
The POSSUM and P-POSSUM scores were the most frequently used tools in heterogeneous surgical cohorts. The POSSUM score was derived by multivariable logistic regression analysis and contains 18 variables, of which 12 were measured preoperatively and 6 at hospital discharge; two separate equations, for morbidity and mortality, were developed and validated.17
After recognition that the POSSUM model overpredicted adverse outcome, the Portsmouth variation (P-POSSUM) was developed to predict mortality, using the same composite variables but a different calculation.36
P-POSSUM has been used in a larger number of more recent studies28–30,37 than the original POSSUM25 and has been found to be of moderate to high discriminant accuracy (AUROC varying between 0.68 and 0.92) with the exception of one Australian study.37
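To illustrate what "the same composite variables but a different calculation" means in practice, the sketch below applies two logistic equations to the same physiological and operative severity scores. The coefficients are those commonly quoted in the literature for the POSSUM and P-POSSUM mortality equations; they are not given in this review and are included here as an assumption that should be verified against the original publications.

```python
import math


def _logistic(x):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def possum_mortality(physiological_score, operative_score):
    # Assumed coefficients, as commonly quoted for the original POSSUM.
    return _logistic(-7.04 + 0.13 * physiological_score + 0.16 * operative_score)


def p_possum_mortality(physiological_score, operative_score):
    # Assumed coefficients, as commonly quoted for the Portsmouth variation.
    return _logistic(-9.065 + 0.1692 * physiological_score + 0.1550 * operative_score)


# Lowest possible component scores under the assumed scoring (12 and 6).
print(f"POSSUM:   {possum_mortality(12, 6):.4f}")
print(f"P-POSSUM: {p_possum_mortality(12, 6):.4f}")
```

Under these assumed coefficients, the original equation assigns a predicted mortality of roughly 1% even to the fittest patient undergoing the most minor procedure, whereas the Portsmouth equation gives a considerably lower value for the same inputs.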
Medical Risk Prediction Tools Adapted for Surgical Risk Stratification
Two multiply validated risk stratification tools, APACHE II38 and the Charlson Index,39 were developed for the purposes of risk adjustment and prediction in nonsurgical settings. APACHE II was developed in 1985 as a tool for predicting hospital mortality in patients admitted to critical care; the score consists of 12 physiological variables and an assessment of chronic health status. Its use in surgical patients has face validity, as APACHE II is a summary measure of acute physiology and chronic health, both of which may influence surgical outcome. Only one of the four studies reporting the APACHE II score’s predictive accuracy used it in the way originally intended: by incorporating the most deranged physiological results within 24 h of critical care admission.40
The Charlson comorbidity score was developed to predict 10-yr mortality in medical patients.39
A combined age-comorbidity score was subsequently validated for the prediction of long-term mortality in a population of patients who had essential hypertension or diabetes and were undergoing elective surgery.41
It is the original Charlson score, however, which is used in two studies identified in our search to stratify risk of short-term outcome.42
These two studies reported very different predictive accuracy for the Charlson score; however, the largest single study included in this entire review found the Charlson score (measured using administrative data) to be a moderately accurate tool.44
Discussion
The purpose of this systematic review was to identify all risk stratification tools that have been validated in heterogeneous patient cohorts, and to report and summarize their discrimination and calibration. We have found a plethora of instruments that have been developed and validated in single studies, which unfortunately limits any assessment of their usefulness and generalizability. A smaller number of tools that could be used universally for perioperative risk prediction have been multiply validated; of these, the P-POSSUM and Surgical Risk Scale have been demonstrated to be the most consistently accurate systems.
Risk Stratification Tools in Practice: Complexity versus Parsimony
There are two key considerations when assessing the clinical utility of the various risk stratification tools reviewed in our study. First, what level of predictive accuracy is fit for the purposes of risk stratification? Second, what is the likelihood that each of the described instruments may be used in everyday practice by clinicians? Although the answer to the first question may be to aim as “high” (accurate) as possible, this must also be balanced against the issues raised by the second question. Risk models incorporating over 30 variables may be highly accurate but are less likely to be routinely incorporated into preoperative assessment processes than scores of similar performance that use only a few data points. Furthermore, clinical experience tells us that the clinician is less likely to use complex mathematical formulae, as opposed to additive scores, when attempting to risk stratify patients at the bedside or in the preoperative clinic.1
The P-POSSUM model was developed in the United Kingdom and has since been validated in Japan, Australia, and Italy. Although this is the most frequently and widely validated model identified by our study, it has some limitations. First, it includes both preoperative and intraoperative variables, and therefore cannot be used for preoperative risk prediction. Second, several of the variables are subjective (e.g., chest radiograph interpretation), carrying the risk of measurement error. Third, in common with the original POSSUM, the P-POSSUM tends to overestimate risk in low-risk patients. Fourth, it contains 18 variables, which must be entered into a regression equation to obtain a predicted percentage risk value, and clinicians may not wish to use such a complex system. Finally, the inclusion of intraoperative variables, particularly blood loss, which may be influenced by surgical technique, runs the risk of concealing poor surgical performance, thereby jeopardizing its face validity as a risk adjustment model for comparative audit of surgeons or institutions.
Surgical Risk Scale
The Surgical Risk Scale consists entirely of variables that are available before surgery, making it a useful tool for preoperative risk stratification for the purposes of clinical decision making. However, there are also some limitations. First, it incorporates the ASA-PS, which may be subject to interobserver variability and therefore measurement error.44–46
Second, the surgical severity coding is not intuitive, and some familiarity with the British United Provident Association system would be required for bedside estimation, unless a reference manual was available. Finally, it has only been validated in single-center studies within the United Kingdom; therefore, its generalizability to patient populations in the United States and worldwide is unknown.
The ASA-PS is widely used as an indicator of whether or not a patient falls into a high-, medium-, or low-risk population, but it was not originally intended to be used for the prediction of adverse outcome in individual subjects.4
It is perhaps surprising that the ASA-PS was reported as having good discrimination for predicting postoperative mortality, as it is a very simple scoring system, which has been demonstrated to have only moderate to poor interrater reliability.44–47
Nevertheless, the ASA-PS has face validity as an assessment of functional capacity, which is increasingly thought to be a significant predictor of patient outcome, as demonstrated by more sophisticated techniques such as cardiopulmonary exercise testing.48
Although it is possible that this provides some explanation for the high discriminant accuracy for ASA-PS found in this systematic review, it is possible that publication bias, favoring studies with “positive” results, may also be a factor.
The Biochemistry and Hematology Outcome Model is a parsimonious version of POSSUM, which omits the subjective variables such as chest radiography and electrocardiogram results. It also has the advantage of consisting of variables which are all available preoperatively, with the exception of operative severity. Given the Biochemistry and Hematology Outcome Model’s similarity in predictive accuracy to P-POSSUM in the one study we identified that made a direct comparison,32 this system warrants further evaluation. Finally, the Identification of Risk In Surgical patients score was developed in The Netherlands and consists of four variables (age, acuity of admission, acuity of surgery, and severity of surgery). In the study that developed and validated it on separate cohorts, the validation AUROC was 0.92.49
Again, further investigation of this simple system would be useful.
Generalizability of Findings
Clinical and Methodological Heterogeneity.
Clinical heterogeneity (both within- and between-cohort patient heterogeneity) and methodological heterogeneity (between-study differences in the outcome measures used) are both likely to have had a significant influence on some of our findings. For example, between-cohort heterogeneity, and variation in how morbidity is defined (appendix 3), may explain the wide range of morbidity rates reported in different studies. Heterogeneity of morbidity definitions may also in part explain the lower accuracy of models for predicting morbidity compared with mortality. On a different note, our study included all populations of patients who were determined to be heterogeneous, using the definitions described in our methods. However, the degree of heterogeneity varied among studies, including whether or not patients of all surgical urgency categories were included, and this may have affected the predictive accuracy of models in different studies.
Objective versus Subjective Variables and Issues Surrounding Data Collection Methodology.
The variables included in risk stratification tools may be classified as objective (e.g., biochemistry and hematology assays), subjective (e.g., interpretation of chest radiographs), and patient-reported (e.g., smoking history). In some clinical settings, the reliability of nonobjective data may be questionable; for example, previous reports have demonstrated significant interrater variation in the interpretation of both chest radiographs50
Patients may also under- or overestimate various elements of their clinical or social history when questioned in the hospital setting. Despite these concerns, the discrimination of predictors incorporating patient-reported and patient-subjective variables was high in the studies included. This may be due to publication bias; it may also be explained by the fact that in all of these studies, data were collected prospectively by trained staff. Previous work has demonstrated an association between interobserver variability in the recording of risk and outcome measures, and the level of training that data collection staff have received.52
These caveats are important when considering the generalizability of our findings to the everyday clinical setting, where data reporting and interpretation may be conducted by different types and grades of clinical staff. Finally, concerns have also been raised over the clinical accuracy of administrative data used for case-mix adjustment purposes.53
However, one large study included in our review43 showed high discriminant performance when using International Classification of Diseases 9 and 10 administrative coding data to define the Charlson Index variables.
Limitations of This Study
This study has a number of limitations. First, the focus was on studies that measured the discrimination and/or calibration of risk stratification tools in cohorts that were heterogeneous in terms of surgical specialities; therefore, a large number of single-speciality cohort studies identified in the search were excluded from the analysis.
Second, although the inclusion criteria for our review ensured that a standard measure of discrimination was reported (AUROC or c-statistic), many studies did not report measures of calibration. However, in a systematic review such as this, calibration may be seen to be a less important measure of goodness-of-fit than discrimination for a number of reasons. Calibration can only be used as a measure of performance for models that generate an individualized predicted percentage risk of an outcome (e.g., the POSSUM systems) as opposed to summative scores, which use an ordinal scale to indicate increasing risk (e.g., the ASA-PS). Calibration drift is likely to occur over time and will be affected by changes in healthcare delivery; good calibration in a study over 30 yr ago may be unlikely to correspond to good calibration today.55
Although such calibration drift may affect the usefulness of a model for predicting an individual patient’s risk of outcome, poorly calibrated but highly discriminant models will still be of value for risk adjustment in comparative audit. Finally, the probability of the Hosmer–Lemeshow statistic being significant (thereby indicating poor calibration) increases with the size of the population being studied.57
This may explain why many of the large high-quality studies we evaluated did not report calibration or reported that calibration was poor.
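The dependence of the Hosmer–Lemeshow statistic on sample size can be illustrated with a small calculation. In the sketch below, the predicted risks, the assumption that observed event rates are uniformly 10% higher than predicted, and the group sizes are all hypothetical; the critical value is the upper 5% point of the chi-square distribution with 8 degrees of freedom.

```python
CHI_SQUARE_CRITICAL_DF8 = 15.507  # upper 5% point, 8 degrees of freedom


def hl_statistic_from_rates(predicted, observed, group_size):
    """Hosmer-Lemeshow statistic when every group has the same number of patients."""
    return sum(
        group_size * (o - p) ** 2 / (p * (1.0 - p))
        for p, o in zip(predicted, observed)
    )


# Hypothetical mean predicted risk in each of ten groups.
PREDICTED = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40]
# Assumed miscalibration: observed rates 10% higher than predicted in every group.
OBSERVED = [1.1 * p for p in PREDICTED]

small_cohort = hl_statistic_from_rates(PREDICTED, OBSERVED, group_size=100)
large_cohort = hl_statistic_from_rates(PREDICTED, OBSERVED, group_size=10_000)
print(f"n = 1,000:   {small_cohort:.2f}", small_cohort > CHI_SQUARE_CRITICAL_DF8)
print(f"n = 100,000: {large_cohort:.2f}", large_cohort > CHI_SQUARE_CRITICAL_DF8)
```

The degree of miscalibration is identical in both cohorts, yet the statistic grows in proportion to the number of patients, so only the larger cohort crosses the threshold for statistically significant lack-of-fit.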
Third, by using the AUROC as the sole measure of discrimination, a number of studies were excluded, particularly earlier articles that used correlation coefficients between risk scores and postoperative outcomes. This was felt to be necessary, as a uniform outcome measure provides clarity to the reader. Fourth, publication bias, where studies are preferentially submitted and accepted for publication if the results are positive, is likely to be a particular problem in cohort studies. Finally, despite an extensive literature search, it is possible that some studies which would have been eligible for inclusion may have been missed. Multiple strategies have been used to prevent this; however, in a review of this size, it is possible that a small number of appropriate articles may have been omitted.
Undertaking clinical risk prediction should be a key tenet of safe, high-quality patient care: it facilitates informed consent and enables the perioperative team to plan their clinical management appropriately. Equally, accurate risk adjustment is required to enable meaningful comparative audit between teams and institutions, to facilitate quality improvement for patients and providers. Although we identified dozens of scores and models which have been used to predict or adjust for risk, very few of these achieved the aspiration of being derived from entirely preoperative data, and of being accurate, parsimonious, and simple to implement. The Surgical Risk Scale is the system that comes closest to achieving these goals; the P-POSSUM score is more accurate, but its value is limited by the fact that some of the variables are only available after surgery has been completed. Future work which might be of value would include further comparison of the Surgical Risk Scale, P-POSSUM, and objective models such as the Biochemistry and Hematology Outcome Model in international multicenter cohorts and further investigation of models which combine novel variables such as measures of functional capacity, nutritional status, and frailty.
There is another possible approach. The American College of Surgeons’ National Surgical Quality Improvement Program was created in the 1990s to facilitate risk-adjusted surgical outcomes reporting in Veterans’ Affairs hospitals, and now also includes a number of private sector institutions. Risk adjustment models are produced annually, and observed-to-expected ratios of surgical outcomes are reported back to institutions and surgical teams to facilitate quality improvement. This organization has published a number of risk calculators to help clinicians to provide informed consent and plan perioperative care. However, none of these calculators have been included in our review, as they have all been developed and validated for use in either specific types of surgery (e.g. surgery) or for specific outcomes (e.g., cardiac morbidity and mortality).61
A parsimonious, entirely preoperative National Surgical Quality Improvement Program model for predicting mortality in heterogeneous cohorts would be of value in the United States; its validation in international multicenter studies would also be a worthwhile endeavor.
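The ratio of observed to expected outcomes used in risk-adjusted reporting of this kind can be sketched as follows. The expected number of events is taken to be the sum of the individual predicted risks; the function name and the two small hypothetical units are assumptions for illustration and do not represent the program's actual methodology.

```python
def observed_to_expected_ratio(predicted_risks, outcomes):
    """Observed events divided by the sum of model-predicted probabilities."""
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("expected number of events must be positive")
    return sum(outcomes) / expected


# Hypothetical units with the same crude mortality (2 of 5) but different case mix.
unit_a_risks = [0.02, 0.05, 0.10, 0.40, 0.43]   # expected deaths: about 1.0
unit_a_outcomes = [0, 0, 0, 1, 1]
unit_b_risks = [0.20, 0.30, 0.40, 0.50, 0.60]   # expected deaths: about 2.0
unit_b_outcomes = [0, 0, 1, 0, 1]

print(f"{observed_to_expected_ratio(unit_a_risks, unit_a_outcomes):.2f}")
print(f"{observed_to_expected_ratio(unit_b_risks, unit_b_outcomes):.2f}")
```

A ratio above 1 indicates more events than the model predicted for that case mix, and a ratio near 1 indicates outcomes in line with prediction, which is why the usefulness of the comparison depends on the discrimination and calibration of the underlying model.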
Finally, although there are multiple studies aimed at developing and validating risk stratification tools, we do not know how widely such tools are used. Use of mobile technology, such as apps to enable risk calculation using complex equations at the bedside, might increase the use of accurate risk stratification tools in day-to-day practice. Importantly, in surgical outcomes research, there is an absence of impact studies, measuring the effect of using risk stratification tools on clinician behavior, patient outcome, and resource utilization. Randomized, controlled trials to evaluate impact, further validation of existing models across healthcare systems, and establishing the infrastructure required to facilitate such work, including the routine data collection of risk and outcome data, should be of the highest priority in health services research into surgical outcome.62
The authors thank Judith Hulf, F.R.C.A., Past President, Royal College of Anaesthetists, London, United Kingdom.
Appendix 2. Search Strategy
Risk adjustment.mp. or exp Health Care Reform/or exp Risk Adjustment/or exp “Outcome Assessment (Health Care)”/or exp Models, Statistical/or exp Risk/OR exp Risk Assessment/or risk prediction.mp. or exp Risk/or exp Risk Factors/OR predictive value of tests.mp. or exp “Predictive Value of Tests”/OR exp Prognosis/or risk stratification.mp. OR case mix adjustment.mp. or exp Risk Adjustment/OR severity of illness index.mp. or exp “Severity of Illness Index”/OR scoring system.mp.
Surgical Procedures, Operative/OR surgery.mp. or General Surgery/OR operation.mp. or exp Postoperative Complications/
mortality.mp. or exp Hospital Mortality/or exp Mortality/OR morbidity.mp. or exp Morbidity/OR outcome.mp. or exp Fatal Outcome/or exp “Outcome Assessment (Health Care)”/or exp “Outcome and Process Assessment (Health Care)”/or exp Treatment Outcome/OR postoperative complications.mp. or exp Postoperative Complications/OR intraoperative complications.mp. or exp Intraoperative Complications/OR exp Perioperative Care/or perioperative complications.mp. OR prognosis.mp. or exp Prognosis/.
Risk Factor/or risk adjust$.mp. OR cardiovascular risk/or high risk patient/or high risk population/or risk assessment/or risk factor OR risk stratification.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR *“Scoring System”/OR “Severity of Illness Index”/OR Multivariate Logistic Regression Analysis/or Logistic Regression Analysis OR logistic models/or risk assessment/or risk factors/OR exp Scoring System OR Prediction/or possum.mp. or Scoring System/OR exp Risk Assessment/or risk stratification.mp. OR predict$.mp. OR exp Quality Indicators, Health Care/OR Risk Adjustment/.
exp Surgery/OR exp Surgical Procedures, Operative/OR specialties, surgical/or surgery/OR surg$.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR peri-operative period.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR perioperative.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR postoperative.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR perioperative care/or intraoperative care/or postoperative care/or preoperative care.
complicat$.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR adverse outcome/or prediction/or prognosis/OR exp Postoperative Complication/co, di, ep, su, th [Complication, Diagnosis, Epidemiology, Surgery, Therapy] OR exp Perioperative Complication/or exp Perioperative Period/OR exp Mortality/or exp Surgical Mortality/OR exp Morbidity/OR outcome.mp. or “Outcome Assessment (Health Care)”/or “Outcome and Process Assessment (Health Care)” OR treatment outcome/.
1980 to August 31, 2011
(“all infant (birth to 23 months)” or “all child (0 to 18 years)” or “newborn infant (birth to 1 month)” or “infant (1 to 23 months)” or “preschool child (2 to 5 years)” or “child (6 to 12 years)” or “adolescent (13 to 18 years)”) or (cats or cattle or chick embryo or dogs or goats or guinea pigs or hamsters or horses or mice or rabbits or rats or sheep or swine) or (communication disorders journals or dentistry journals or “history of medicine journals” or “history of medicine journals non index medicus” or “national aeronautics and space administration (nasa) journals” or reproduction journals) or Angioplasty, Balloon/or Angioplasty, Laser/or Angioplasty/or Angioplasty, Balloon, Laser-Assisted/or Angioplasty, Transluminal, Percutaneous Coronary/or ANGIOPLASTY.mp. OR Eye/or Ophthalmology/or Eye Diseases/or OPTHALMOLOGY.mp. or Hearing Loss OR CARDIAC SURGERY.mp. or HEART SURGERY.mp. or Myocardial Revascularization/or Coronary Artery Bypass/or CORONARY SURGERY.mp. or Coronary Artery Bypass, Off-Pump/.
Hand Searching of Reference Lists
The following keywords were searched separately on MEDLINE, Embase, and ISI Web of Science:
POSSUM + surgery
In addition, the original development studies for all risk prediction models identified in the initial search were then snowballed by hand searching for citations on MEDLINE, Embase and ISI Web of Science.
Studies were eligible if they fulfilled the following criteria:
* Studies in adult humans undergoing noncardiac, nonneurological surgery
* Study cohorts that included at least two different surgical subspecialities
* Studies that described the predictive precision of risk models using analysis of receiver operating characteristic curves
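The area under the receiver operating characteristic curve, the discrimination measure required by the last criterion, can be read as the probability that a randomly chosen patient who had the outcome was assigned a higher predicted risk than a randomly chosen patient who did not. The following is a minimal sketch of that pairwise calculation; the function name and the data are made up for illustration and do not come from any included study.

```python
def auroc(predicted_risks, outcomes):
    """Area under the ROC curve by pairwise comparison.

    predicted_risks: predicted probability (or score) for each patient.
    outcomes: 1 if the patient had the outcome (e.g. death), 0 otherwise.
    Tied predictions count as half-concordant.
    """
    events = [p for p, y in zip(predicted_risks, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted_risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one patient with and one without the outcome")

    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0  # event patient ranked above non-event patient
            elif e == n:
                concordant += 0.5  # tie
    return concordant / (len(events) * len(non_events))


# Made-up predictions and outcomes for six hypothetical patients
risks = [0.02, 0.10, 0.35, 0.80, 0.05, 0.60]
died = [0, 0, 1, 1, 0, 0]
print(auroc(risks, died))  # 0.875
```

A value of 0.5 indicates discrimination no better than chance, and 1.0 indicates perfect separation of patients with and without the outcome. The nested loop is quadratic in cohort size, so a rank-based implementation would be preferable for large datasets.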
Studies were excluded on the basis of these criteria:
* Cohorts including children (under the age of 14 yr)
* Cohorts including patients undergoing cardiac surgery
* Cohorts including patients who did not undergo surgery
* Single-speciality cohort studies (e.g., vascular, orthopedic)
* Studies of ambulatory (day case) surgery
* Studies describing the development of a risk prediction model without subsequent validation in a separate cohort (either in the original study or subsequent cohorts), with the exception of studies of data from the American College of Surgeons’ National Surgical Quality Improvement Programme
* Studies in which the items comprising the risk stratification tool were not disclosed in the study report or available from other sources (such as references)
* Studies using outcomes other than morbidity or mortality as their sole outcome measures (e.g., discharge destination, length of stay)
* Studies using only a single pathological outcome measure (e.g., reoperation, cardiac morbidity, infectious complications, renal failure)
1. Nashef SA, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg. 1999;16:9–13
2. Adams ST, Leveson SH. Clinical prediction rules. BMJ. 2012;344:d8312
3. Grobman WA, Stamilio DM. Methods of clinical prediction. Am J Obstet Gynecol. 2006;194:888–94
4. Saklad M. Grading of patients for surgical procedures. Anesthesiology. 1941;2:281–4
5. Lee TH, Marcantonio ER, Mangione CM, Thomas EJ, Polanczyk CA, Cook EF, Sugarbaker DJ, Donaldson MC, Poss R, Ho KK, Ludwig LE, Pedan A, Goldman L. Derivation and prospective validation of a simple index for prediction of cardiac risk of major noncardiac surgery. Circulation. 1999;100:1043–9
6. Hennis PJ, Meale PM, Grocott MP. Cardiopulmonary exercise testing for the evaluation of perioperative risk in non-cardiopulmonary surgery. Postgrad Med J. 2011;87:550–7
7. Liao L, Mark DB. Clinical prediction models: Are we building better mousetraps? J Am Coll Cardiol. 2003;42:851–3
8. Noble D, Dent T, Greenhalgh T. Re: Comparisons of established risk prediction models for cardiovascular disease: Systematic review. (Rapid response). BMJ. 2012;345:e4357
9. Mallen C, Peat G, Croft P. Quality assessment of observational studies is not commonplace in systematic reviews. J Clin Epidemiol. 2006;59:765–9
10. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Med. 2009;6:e1000097
11. Altman DG. Systematic reviews of evaluations of prognostic variables. BMJ. 2001;323:224–8
12. Altman DG. Systematic reviews of evaluations of prognostic variables. In: Egger M, Davey Smith G, Altman DG, eds. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd edition. London: BMJ Books; 2001:228–47
13. Altman DG, Vergouwe Y, Royston P, Moons KG. Prognosis and prognostic research: Validating a prognostic model. BMJ. 2009;338:b605
14. Moons KG, Altman DG, Vergouwe Y, Royston P. Prognosis and prognostic research: Application and impact of prognostic models in clinical practice. BMJ. 2009;338:b606
15. Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988;240:1285–93
16. Arvidsson S, Ouchterlony J, Sjöstedt L, Svärdsudd K. Predicting postoperative adverse events. Clinical efficiency of four general classification systems. The project perioperative risk. Acta Anaesthesiol Scand. 1996;40:783–91
17. Copeland GP, Jones D, Walters M. POSSUM: A scoring system for surgical audit. Br J Surg. 1991;78:355–60
18. Ding LA, Sun LQ, Chen SX, Qu LL, Xie DF. Modified physiological and operative score for the enumeration of mortality and morbidity risk assessment model in general surgery. World J Gastroenterol. 2007;13:5090–5
19. Carneiro AV, Leitão MP, Lopes MG, De Pádua F. [Risk stratification and prognosis in critical surgical patients using the Acute Physiology, Age and Chronic Health III System (APACHE III)]. Acta Med Port. 1997;10:751–60
20. Zhang H, Zhu D-M, Xue Z-G, Luo J-F, Jiang H. Performance of APACHE II models in surgical intensive care unit. Fudan Univ J Med Sci. 2004;31:417–20
21. Saba V, Goffi L, Jassem W, Ghiselli R, Necozione S, Mattei A, Carle F. Prognostic value of the Apache II scoring system daily preoperative use in major general surgery. Chirurgia. 1997;10:187–94
22. Martin Graczyk AI, Molina Hernandez MJ, Vazquez PC, Mora FJ, Hierro VM, Gomez PJ, Ribera Casado JM. Preoperative geriatric assessment in major surgery in the aged. Anales de Medicina Interna. 1995;12:270–4
23. Kuo HS, Chuang JH, Tang GJ, Hou CC, Chou SS, Lui WY, P’eng FK. Development of a new prognostic system and validation of APACHE II for surgical ICU mortality: A multicenter study in Taiwan. Chung Hua i Hsueh Tsa Chih - Chin Med J. 1999;62:673–81
24. Krenzien J, Roding H, Mummelthey R. Surgical risk in old age: Prospective evaluation of a prognosis index. Zentralblatt fur Chirurgie. 1990;115:717–27
25. Jones DR, Copeland GP, de Cossart L. Comparison of POSSUM with APACHE II for prediction of outcome from a surgical high-dependency unit. Br J Surg. 1992;79:1293–6
26. Davenport DL, Bowe EA, Henderson WG, Khuri SF, Mentzer RM Jr. National Surgical Quality Improvement Program (NSQIP) risk factors can be used to validate American Society of Anesthesiologists Physical Status Classification (ASA PS) levels. Ann Surg. 2006;243:636–41 discussion 641–4
27. Makary MA, Segev DL, Pronovost PJ, Syin D, Bandeen-Roche K, Patel P, Takenaga R, Devgan L, Holzmueller CG, Tian J, Fried LP. Frailty as a predictor of surgical outcomes in older patients. J Am Coll Surg. 2010;210:901–8
28. Haga Y, Ikejiri K, Wada Y, Takahashi T, Ikenaga M, Akiyama N, Koike S, Koseki M, Saitoh T. A multicenter prospective study of surgical audit systems. Ann Surg. 2011;253:194–201
29. Donati A, Ruzzi M, Adrario E, Pelaia P, Coluzzi F, Gabbanelli V, Pietropaoli P. A new and feasible model for predicting operative risk. Br J Anaesth. 2004;93:393–9
30. Brooks MJ, Sutton R, Sarin S. Comparison of Surgical Risk Score, POSSUM and p-POSSUM in higher-risk surgical patients. Br J Surg. 2005;92:1288–92
31. Sutton R, Bann S, Brooks M, Sarin S. The Surgical Risk Scale as an improved tool for risk-adjusted analysis in comparative surgical audit. Br J Surg. 2002;89:763–8
32. Neary WD, Prytherch D, Foy C, Heather BP, Earnshaw JJ. Comparison of different methods of risk stratification in urgent and emergency surgery. Br J Surg. 2007;94:1300–5
33. Dasgupta M, Rolfson DB, Stolee P, Borrie MJ, Speechley M. Frailty is associated with postoperative complications in older adults with medical problems. Arch Gerontol Geriatr. 2009;48:78–83
34. Kuzu MA, Terzioğlu H, Genç V, Erkek AB, Ozban M, Sonyürek P, Elhan AH, Torun N. Preoperative nutritional risk assessment in predicting postoperative outcome in patients undergoing major surgery. World J Surg. 2006;30:378–90
35. Copeland GP, Sagar P, Brennan J, Roberts G, Ward J, Cornford P, Millar A, Harris C. Risk-adjusted analysis of surgeon performance: A 1-year study. Br J Surg. 1995;82:408–11
36. Whiteley MS, Prytherch DR, Higgins B, Weaver PC, Prout WG. An evaluation of the POSSUM surgical scoring system. Br J Surg. 1996;83:812–5
37. Organ N, Morgan T, Venkatesh B, Purdie D. Evaluation of the P-POSSUM mortality prediction algorithm in Australian surgical intensive care unit patients. ANZ J Surg. 2002;72:735–8
38. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: A severity of disease classification system. Crit Care Med. 1985;13:818–29
39. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: Development and validation. J Chronic Dis. 1987;40:373–83
40. Stachon A, Becker A, Kempf R, Holland-Letz T, Friese J, Krieg M. Re-evaluation of established risk scores by measurement of nucleated red blood cells in blood of surgical intensive care patients. J Trauma. 2008;65:666–73
41. Charlson M, Szatrowski TP, Peterson J, Gold J. Validation of a combined comorbidity index. J Clin Epidemiol. 1994;47:1245–51
42. Atherly A, Fink AS, Campbell DC, Mentzer RM Jr, Henderson W, Khuri S, Culler SD. Evaluating alternative risk-adjustment strategies for surgery. Am J Surg. 2004;188:566–70
43. Sundararajan V, Henderson T, Perry C, Muggivan A, Quan H, Ghali WA. New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. J Clin Epidemiol. 2004;57:1288–94
44. Haynes SR, Lawler PG. An assessment of the consistency of ASA physical status classification allocation. Anaesthesia. 1995;50:195–9
45. Grocott MP, Levett DZ, Matejowsky C, Emberton M, Mythen MG. ASA scores in the preoperative patient: Feedback to clinicians can improve data quality. J Eval Clin Pract. 2007;13:318–9
46. Aronson WL, McAuliffe MS, Miller K. Variability in the American Society of Anesthesiologists Physical Status Classification Scale. AANA J. 2003;71:265–74
47. Mak PHK, Campbell RCH, Irwin MG. The ASA physical status classification: Inter-observer consistency. Anaesth Intensive Care. 2002;30:633–40
48. Snowden CP, Prentis JM, Anderson HL, Roberts DR, Randles D, Renton M, Manas DM. Submaximal cardiopulmonary exercise testing predicts complications and hospital length of stay in patients undergoing major elective surgery. Ann Surg. 2010;251:535–41
49. Liebman B, Strating RP, van Wieringen W, Mulder W, Oomen JL, Engel AF. Risk modelling of outcome after general and trauma surgery (the IRIS score). Br J Surg. 2010;97:128–33
50. Robinson PJ, Wilson D, Coral A, Murphy A, Verow P. Variation between experienced observers in the interpretation of accident and emergency radiographs. Br J Radiol. 1999;72:323–30
51. Trzeciak S, Erickson T, Bunney EB, Sloan EP. Variation in patient management based on ECG interpretation by emergency medicine and internal medicine residents. Am J Emerg Med. 2002;20:188–95
52. Dindo D, Hahnloser D, Clavien PA. Quality assessment in surgery: Riding a lame horse. Ann Surg. 2010;251:766–71
53. Mohammed MA, Deeks JJ, Girling A, Rudge G, Carmalt M, Stevens AJ, Lilford RJ. Evidence of methodological bias in hospital standardised mortality ratios: Retrospective database study of English hospitals. BMJ. 2009;338:b780
54. Hall BL, Hirbe M, Waterman B, Boslaugh S, Dunagan WC. Comparison of mortality risk adjustment using a clinical data algorithm (American College of Surgeons National Surgical Quality Improvement Program) and an administrative data algorithm (Solucient) at the case level within a single institution. J Am Coll Surg. 2007;205:767–77
55. Copeland GP. The POSSUM system of surgical audit. Arch Surg. 2002;137:15–9
56. Tilford JM, Roberson PK, Lensing S, Fiser DH. Differences in pediatric ICU mortality risk over time. Crit Care Med. 1998;26:1737–43
57. Kramer AA, Zimmerman JE. Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited. Crit Care Med. 2007;35:2052–6
58. Parikh P, Shiloach M, Cohen ME, Bilimoria KY, Ko CY, Hall BL, Pitt HA. Pancreatectomy risk calculator: An ACS-NSQIP resource. HPB (Oxford). 2010;12:488–97
59. Gupta PK, Franck C, Miller WJ, Gupta H, Forse RA. Development and validation of a bariatric surgery morbidity risk calculator using the prospective, multicenter NSQIP dataset. J Am Coll Surg. 2011;212:301–9
60. Cohen ME, Bilimoria KY, Ko CY, Hall BL. Development of an American College of Surgeons National Surgery Quality Improvement Program: Morbidity and mortality risk calculator for colorectal surgery. J Am Coll Surg. 2009;208:1009–16
61. Gupta PK, Gupta H, Sundaram A, Kaushik M, Fang X, Miller WJ, Esterbrooks DJ, Hunter CB, Pipinos II, Johanning JM, Lynch TG, Forse RA, Mohiuddin SM, Mooss AN. Development and validation of a risk calculator for prediction of cardiac risk after surgery/clinical perspective. Circulation. 2011;124:381–7
62. Grocott MP. Improving outcomes after surgery. BMJ. 2009;339:b5173
63. Osler TM, Rogers FB, Glance LG, Cohen M, Rutledge R, Shackford SR. Predicting survival, length of stay, and cost in the surgical intensive care unit: APACHE II versus ICISS. J Trauma. 1998;45:234–7 discussion 237–8
64. Prytherch DR, Whiteley MS, Higgins B, Weaver PC, Prout WG, Powell SJ. POSSUM and Portsmouth POSSUM for predicting mortality. Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity. Br J Surg. 1998;85:1217–20
65. Gawande AA, Kwaan MR, Regenbogen SE, Lipsitz SA, Zinner MJ. An Apgar score for surgery. J Am Coll Surg. 2007;204:201–8
66. Regenbogen SE, Ehrenfeld JM, Lipsitz SR, Greenberg CC, Hutter MM, Gawande AA. Utility of the surgical apgar score: Validation in 4119 patients. Arch Surg. 2009;144:30–6 discussion 37
67. Haynes AB, Regenbogen SE, Weiser TG, Lipsitz SR, Dziekan G, Berry WR, Gawande AA. Surgical outcome measurement for a global patient population: Validation of the Surgical Apgar Score in 8 countries. Surgery. 2011;149:519–24
68. Goffi L, Saba V, Ghiselli R, Necozione S, Mattei A, Carle F. Preoperative APACHE II and ASA scores in patients having major general surgical operations: Prognostic value and potential clinical applications. Eur J Surg. 1999;165:730–5
69. Hightower CE, Riedel BJ, Feig BW, Morris GS, Ensor JE Jr, Woodruff VD, Daley-Norman MD, Sun XG. A pilot study evaluating predictors of postoperative outcomes after major abdominal surgery: Physiological capacity compared with the ASA physical status classification system. Br J Anaesth. 2010;104:465–71
70. Hadjianastassiou VG, Tekkis PP, Poloniecki JD, Gavalas MC, Goldhill DR. Surgical mortality score: Risk management tool for auditing surgical performance. World J Surg. 2004;28:193–200
71. Hobson SA, Sutton CD, Garcea G, Thomas WM. Prospective comparison of POSSUM and P-POSSUM with clinical assessment of mortality following emergency surgery. Acta Anaesthesiol Scand. 2007;51:94–100
72. Nathanson BH, Higgins TL, Kramer AA, Copes WS, Stark M, Teres D. Subgroup mortality probability models: Are they necessary for specialized intensive care units? Crit Care Med. 2009;37:2375–86
73. Pillai SB, van Rij AM, Williams S, Thomson IA, Putterill MJ, Greig S. Complexity- and risk-adjusted model for measuring surgical outcome. Br J Surg. 1999;86:1567–72
74. Stachon A, Becker A, Holland-Letz T, Friese J, Kempf R, Krieg M. Estimation of the mortality risk of surgical intensive care patients based on routine laboratory parameters. Eur Surg Res. 2008;40:263–72
75. Story DA, Fink M, Leslie K, Myles PS, Yap SJ, Beavis V, Kerridge RK, McNicol PL. Perioperative mortality risk score using pre- and postoperative risk factors in older patients. Anaesth Intensive Care. 2009;37:392–8
© 2013 American Society of Anesthesiologists, Inc.