Introduction
Anaesthesiologists have been pioneers in the development and use of risk scores and risk prediction. Apgar^{1} was an experienced anaesthesiologist and an astute clinician. On the basis of her careful observations of thousands of newborns, she proposed in 1953 a ‘new method of evaluation’ that became the Apgar score (five signs summing to a maximum score of 10). Infant mortality varied inversely with the Apgar score: 14% for scores 0–2, 1.1% for scores 3–7 and 0.1% for scores 8–10. Fifty years later, the Apgar score remains just as relevant in predicting neonatal mortality: 24% for scores 0–3, 0.9% for scores 4–6 and 0.02% for scores 7–10.^{2} Even earlier, in 1941, the American Society of Anesthesiologists promulgated a physical status score (the ASA score) for patients prior to anaesthesia and surgery.^{3} Although specifically not claiming to predict operative risk, the ASA score has in fact been demonstrated to be a risk score with a probabilistic interpretation for mortality and morbidity.^{4}

Medical diagnosis, the identification of the nature and cause of a condition or event, reflects the current vitality of the patient – their present condition. A numerical Apgar score is a diagnosis of neonatal vigour; from low to high, it triggers a spectrum of resuscitation from aggressive efforts to simple observation. The Apgar score is also a risk score for prognosis. Prognosis is ‘to know before’: a forecast of the probable course and outcome of a disease, a procedure, a drug and so on. The baseline risk of an outcome for a population can come from several sources, including large randomised controlled trials, observational studies and retrospective analyses of hospital or administrative databases. Multiple studies of each type may be assembled into a meta-analysis. The first very large study of perioperative mortality was a review of medical records at 10 US university hospitals. Records of 600 000 patients during 5 years (1948–1952) uncovered 8 000 deaths, an in-hospital rate of 1.3% or one death for every 75 surgeries; of these, expert opinion classified only 224 deaths as anaesthetic related.^{5} Fifty years later, the US Medicare Provider Analysis and Review database (MEDPAR), with 35 million surgical patients from 2001 to 2006, demonstrated an in-hospital mortality of 3.1% or one death for every 32 surgeries.^{6} It is unlikely that this higher mortality reflects a deterioration of care; rather, it results from an ageing population, a different mix of surgical procedures, an increased prevalence of concomitant diseases, the redirection of healthy patients to ambulatory surgery and so on. Clearly, some risk probabilities only reflect earlier or different times.

A complete prognosis includes the expected time course and outcomes of the disease or surgery – the extreme outcomes being life or death. Risk is the potential that a disease or an action will lead to an undesirable outcome. Risk is always a matter of probability, with the probability (P_{i}) lying between 0 and 1. Probability reflects the continuum between absolute certainty (P_{i} = 1) and absolute impossibility (P_{i} = 0). The future of an individual patient cannot be foretold, but the probable outcome for a group of similar patients can be estimated. Using the MEDPAR data, an older patient could be informed preoperatively that their P_{i} for death is 0.031 (by assigning each patient the baseline risk for US Medicare patients). This is an unsatisfying strategy. The desire to define categories of patients with similar risk has prompted the exploration of risk factors and probabilistic risk predictions. Reports on the quantification of risk have now become commonplace. The methods usually require statistical tools that are not familiar to most anaesthesiologists. Additionally, these statistical tools can be unintentionally misused, the results misinterpreted and biased descriptions of risk reported. We will briefly describe methods for identifying risk factors and risk scores, including the use of statistical regression techniques and a framework for assessing the performance of prediction models. This is not a systematic explanation or description of any particular use of risk factors; illustrative examples will be cited to assist the reader.

Risk factors
Selection
Observational data have become a fertile source of material for epidemiological reports; patient records covering diseases, treatments, events, outcomes and so on are collected by hospital and government information systems. There are methodological shortcomings in primary studies of prognosis based on observational sources because of problems with the data. These potential biases have been previously described and include failure to clearly define and describe the source population and failure to adequately measure the putative prognostic factors.^{7} Apgar used clinical judgment to derive the Apgar score. Some identification of risk factors is still achieved by consensus panels of expert physicians, such as the Berlin Questionnaire for the identification of obstructive sleep apnoea.^{8} Simple classification of outcomes into contingency tables was used in the National Halothane Study to relate physical status and anaesthetic technique to the standardised death rate.^{9}

Current practices for identifying risk factors are generally numerical methods (Table 1);^{1,9–19} the most commonly used is logistic regression. About 50 years ago, an algorithm was proposed for automatically selecting a statistical model (i.e. choosing risk factors) when there are a large number of potential risk factors and no underlying theory on which to base the model selection; this is the ‘stepwise algorithm’.^{10} The statistical notation is given by the multivariable, linear logistic model equation:

logit(P_{i}) = ln[P_{i}/(1 − P_{i})] = β_{0} + β_{1}x_{1,i} + β_{2}x_{2,i} + … + β_{k}x_{k,i}

The x_{j,i}s represent the covariates that might be risk factors; these observed covariates are the independent (explanatory) variables speculated to be risk related. They may be dichotomous, for example gender, with coding x_{j,i} of 0 or 1. The β_{j}s are the regression (β)-coefficients. This regression equation is estimated multiple times; at each step the calculated values of some or all of the β-coefficients may change. The model [i.e. the covariates chosen and the regression coefficients estimated (β_{0}, β_{1}, …, β_{k})] is selected to maximise the goodness of fit of covariate values to the presence or absence of the event. An automatic variable selection process adds (forward selection) or removes (backward elimination) covariates from the model according to a statistical measure of the model fit. When succeeding steps no longer include or exclude covariates, the stepwise process stops and reports the identity of the remaining covariates and the β-coefficients in the model. These remaining covariates are commonly called the ‘independent predictors’ or the ‘independent risk factors’. The ‘independence’ of these risk factors is frequently misunderstood: it is a purely statistical concept – if a β-coefficient divided by its standard error exceeds 1.96 in absolute value, it is significant at a P value of 0.05; the statement of statistical significance for each risk factor is only valid within the context of the total set of final risk factors and within that particular statistical model; the association of risk factors with outcome does not imply causality; and covariates that fail inclusion in the stepwise model may nevertheless be causal factors.^{20}
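The significance rule just described (a coefficient more than 1.96 standard errors from zero is significant at P = 0.05) can be sketched in a few lines of Python; the coefficient and standard error values below are hypothetical, chosen only to illustrate the Wald test.

```python
import math

def wald_p_value(beta: float, se: float) -> float:
    """Two-sided P value for the Wald test of a regression coefficient.

    z = beta / se is referred to the standard normal distribution;
    the normal CDF is computed from the error function (math.erf).
    """
    z = abs(beta / se)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z
    return 2.0 * (1.0 - phi)

# A hypothetical coefficient exactly 1.96 standard errors from zero
# sits right at the conventional significance threshold of P = 0.05.
p = wald_p_value(beta=0.98, se=0.50)   # z = 1.96, p very close to 0.05
```

As the text cautions, such a P value is only valid within the particular fitted model; it says nothing about causality.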

Table 1: Methods for selection of risk factors

Multiple problems with the statistical methodology of the usual application of stepwise regression have been identified. Prognosis studies may include a large number of covariates, and as each covariate is added to the stepwise regression the number of possible models increases by a factor of two. With 50 covariates there are 2^{50} (approximately 10^{15}) possible candidate models. Some risk factors may be synergistic, for example smoking and hypertension. To include the possibility of synergy and antagonism between risk factors, first-order interactions (β_{jj'}x_{j,i}x_{j',i}) must be added to the linear sum of covariates; if interactions were allowed for all 50 covariates, there could be 2^{1275} (approximately 10^{384}) possible models. This explosion makes it impossible to estimate all the possible models. If all models could be estimated, information-theoretic statistics offer methods to choose the best models.^{21} In an attempt to reduce the number of candidate covariates, univariable screening often precedes stepwise regression. Each covariate is tested for significance by a simple t-test (continuous variables) or χ^{2}-test (categorical variables). Covariates having a P value less than 0.1–0.25 are included in the stepwise regression; covariates with higher P values are discarded. Unfortunately, pre-screening may wrongly reject potentially important covariates.^{22} More fundamentally, stepwise algorithms will create sets of risk factors composed of both the true and the false, and the two cannot be distinguished from each other.^{23} If large sets of variables generated by a random number process are subjected to stepwise regression, the appearance of association will usually result. From noise, independent predictors with publishable P values less than 0.05 can be created.^{11}
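The claim that screening manufactures ‘significant’ predictors from pure noise is easy to demonstrate by simulation. The sketch below is an illustration, not a re-analysis of any cited study: the sample sizes and seed are arbitrary, and a normal-approximation two-sample test stands in for the t-test. It screens 200 random covariates against a random group split; roughly 5% pass P < 0.05 despite there being no true association anywhere.

```python
import math
import random

def two_sample_p(xs, ys):
    """Approximate two-sided P value comparing two sample means.

    Uses the normal approximation to the two-sample t-test, adequate
    for the moderately large samples simulated here.
    """
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = abs(mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

rng = random.Random(2008)          # arbitrary fixed seed for reproducibility
n_per_group, n_covariates = 200, 200

# Every "covariate" is pure noise: identical distribution in both groups.
false_positives = 0
for _ in range(n_covariates):
    group0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    if two_sample_p(group0, group1) < 0.05:
        false_positives += 1
# Expect about 5% (roughly 10 of 200) spurious 'independent predictors'.
```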

Another weakness is the inconsistent analysis and reporting of potential risk factors. In 2005, Brotman et al.^{20} reported over 100 risk factors for cardiovascular outcomes in 117 published studies: ‘any claim that variable X is an independent risk factor for a given cardiovascular outcome (except within the context of a particular study) ignores the likelihood of residual confounding i.e. that valuable predictors also associated with X have been excluded, poorly measured, or incorrectly modelled.’ As another example, Ip et al.^{24} in a systematic review found 48 studies (23 037 patients) reporting risk factors for postoperative pain and analgesic consumption. Pre-existing pain, anxiety, age and type of surgery were the four most important risk factors. Yet of the 32 (of a total of 48) studies reporting on postoperative pain, the majority did not report results for these risk factors; only 8, 15, 12 and 6 studies, respectively, reported on pre-existing pain, anxiety, age and surgery type as risk factors. Were these risk factors not analysed in most primary studies and, thus, could not be reported? Were these risk factors actually analysed in the other 24, 17, 20 and 26 studies, but not reported because no association with pain intensity was found? The answers to these questions are not known.

Advances in statistical theory and software have provided alternatives to stepwise regression for choosing risk factors. These include propensity analysis,^{13} Bayesian approaches to logistic regression^{17} and the various approaches of machine learning, such as decision tree learning and artificial neural networks;^{15} all have been used in topics of anaesthetic interest (Table 1). The incorporation of prior knowledge about the outcome in choosing risk factors is a particularly interesting aspect of Bayesian methods, and in simulations it has better statistical properties.^{18} The poor generalisability of risk factors from stepwise regression when applied to new patients is in part due to insufficient numbers of patients in the original data; the result is called an over-optimistic model. Attempts to correct for this overfitting by applying penalised likelihood, shrinkage factors, bootstrapping and so on^{25} now appear in anaesthesia journals. Regardless of the method of derivation, there must be validation and replication of risk factors.

Biomarkers
A biomarker is a naturally occurring substance used as an indicator of a biological state. Biomarkers include plasma enzymes, antibodies, chromosomal re-arrangements, gene expression and so on. Biomarkers are chosen because their presence can be identified in a particular pathological or physiological process, disease and so on. Frequently a biomarker indicates a change in the expression, concentration or state of a protein that correlates with the risk or progression of disease. Enormous effort has been spent on searching for biomarkers of cancer. Many associations of single or multiple proteins with different cancers have been reported. However, efforts to use these biomarkers as risk factors have often foundered because of the difference between statistical significance and prognostic discrimination.^{26,27} Brain natriuretic peptide (BNP) and troponins have been of special interest for anaesthesia and surgical risk.^{28} As an example, BNP is a risk factor for major adverse cardiovascular events (cardiac death or nonfatal myocardial infarction) after non-cardiac surgery. BNP is a circulating hormone synthesised by cardiomyocytes in response to increased ventricular wall stress or ischaemia and has natriuretic and vasodilator properties. In a systematic review of 15 studies (about 5 000 patients having non-cardiac surgery) using study-level aggregate data, elevated BNP obtained preoperatively had a highly elevated odds ratio (OR) of 19.77 [95% confidence interval (CI) 13.18–29.65] for short-term major adverse cardiovascular events.^{29} Interpreting this very large OR is made difficult by the lack of a common definition for the threshold value separating BNP into normal and abnormal ranges, a frequent problem in systematic reviews of prognosis. One generally accepted upper limit of normal plasma BNP is 100 pg ml^{−1}; in the 15 studies, the threshold separating normal from abnormal BNP ranged from as low as 35 pg ml^{−1} to as high as 255 pg ml^{−1}.

When a large sample of individual patient data can be used, a better definition of the relationship between biomarker and outcome becomes possible. In a prospective observational study (>4 000 patients) of cardiovascular disease, higher serum cardiac troponin T concentrations predicted increased risk of heart failure and cardiovascular mortality in older adults.^{30} In a meta-analysis using individual patient data from seven studies (about 19 000 patients), the creatine kinase (CK-MB) ratio (the ratio between the peak CK-MB and the upper limit of normal of CK-MB) was calculated for each patient following coronary artery bypass grafting. Thirty-day mortality increased with increasing CK-MB ratio; the relative risk was over four-fold higher for patients with a CK-MB ratio of 10–20 compared with those with a CK-MB ratio less than 1.^{31}

Multivariable probabilistic risk predictions
The relationship between outcome and a risk factor may be examined individually, as described above. Because risk factors may interact, the effects of multiple risk factors on outcome are analysed together in order to minimise spurious correlations. This is performed by multivariable regression, a statistical tool for determining simultaneously the unique contributions of various factors to a single event or outcome. These statistical models should be assessed for their overall performance, discrimination and calibration.

Estimation
For estimation, the strength of association between an outcome and a dichotomous risk factor is usually specified by the OR, not by the probability value of a statistical test. If the risk (P_{i}) is 0.1, then for every 100 patients the event occurs in 10 patients. Odds is a concept from gambling, but it is also used in medicine. If there are 10 patients with an event and 90 without, the odds are one to nine (written as the ratio 1 : 9 or as the decimal fraction 0.111…). Risk and odds are mathematically convertible:

odds = P_{i}/(1 − P_{i}) and, conversely, P_{i} = odds/(1 + odds)

In an unadjusted OR only two variables, the binary outcome and the binary predictor, are used for the calculation (Table 2). The OR is the change in the odds with the presence of the risk factor. An OR can also be calculated for a continuous risk factor such as age; the OR then represents the change in the odds of the outcome for a one-unit change in the risk factor. The calculation of the 95% CI is also straightforward. At an OR of 1 (the identity value), there is no association of risk factor with outcome. If the OR is greater or less than 1, there is still no demonstrated association if the upper and lower bounds of the 95% CI lie on opposite sides of the identity value. By contrast, if both bounds of the 95% CI are firmly above or firmly below the identity value, the OR has achieved statistical significance. There is a distinction for continuous risk factors between an OR sufficient to demonstrate association and an OR magnitude necessary for useful prognosis; the OR must be much larger for clinical usefulness.^{32} An adjusted OR is a statistical method for isolating each risk factor's independent effect within the total set of risk factors from a multivariable model. The statistical methods for adjusted ORs are considerably more complicated and require iterative solutions to multivariable logistic regression equations.^{33}
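The unadjusted OR and its 95% CI can be computed directly from a 2 × 2 table. The sketch below uses hypothetical counts (not from Table 2), chosen so that the CI straddles the identity value and so demonstrates no association; the risk/odds conversion is included as well.

```python
import math

def odds_from_risk(p: float) -> float:
    """Convert a risk (probability) to odds: a risk of 0.1 gives odds of 1:9."""
    return p / (1.0 - p)

def risk_from_odds(odds: float) -> float:
    """Convert odds back to a risk (probability)."""
    return odds / (1.0 + odds)

def odds_ratio_ci(a: int, b: int, c: int, d: int):
    """Unadjusted OR and 95% CI from a 2x2 table.

        a = exposed with event,    b = exposed without event
        c = unexposed with event,  d = unexposed without event

    The CI uses the standard log-OR variance 1/a + 1/b + 1/c + 1/d.
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - 1.96 * se_log)
    upper = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lower, upper

# Hypothetical table: 10/90 events among exposed, 5/95 among unexposed.
or_, lower, upper = odds_ratio_ci(10, 90, 5, 95)
# OR is about 2.1, but the 95% CI spans 1: no association is demonstrated.
```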

Table 2: Calculation of the odds ratio

A probability prediction rule assigns a probability P_{i} (called a prediction, a forecast or a prognosis) for the occurrence of a specified event to a patient; these predictions are mainly descriptive, not mechanistic or explanatory, of the associations between risk factors and outcome. Using essentially the same mathematical methods (stepwise logistic regression) as for identifying risk factors, a statistical model of risk factors is chosen:

z_{i} = β_{0} + β_{1}x_{1,i} + β_{2}x_{2,i} + … + β_{k}x_{k,i}

The adjusted OR for each risk factor is obtained by exponentiating the β-coefficient: e^{β} (Table 2). The ‘z’ is the risk score for the model; it is also called a risk index score or a risk index. Inherent in the model is the calculation of P_{i} for a new patient with specific risk factor values (x_{j,i}s). This is calculated by the inverse logit function:

P_{i} = e^{z}/(1 + e^{z}) = 1/(1 + e^{−z})

(Fig. 1). The literature of anaesthesia is now replete with risk scores that offer a P_{i} formula. For example, z = −2.28 + 1.27 × (female sex) + 0.65 × [history of postoperative nausea and vomiting (PONV) or motion sickness] + 0.72 × (non-smoker) + 0.78 × (postoperative opioid administration) is a risk score for PONV.^{34} If all risk factors are zero, then the baseline P_{i} is determined only by the intercept (β_{0} = −2.28), yielding a value of 0.09. A non-smoking female with a history of motion sickness receiving postoperative morphine would have a calculated risk score of −2.28 + 1.27 × 1 + 0.65 × 1 + 0.72 × 1 + 0.78 × 1, summing to 1.14 with a P_{i} of 0.76 – a 76% probability of having PONV.
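The PONV calculation above is straightforward to code. The coefficients are those quoted in the text from the published score;^{34} the inverse logit function converts the risk score z into P_{i}, and exponentiating a coefficient gives its adjusted OR.

```python
import math

def inv_logit(z: float) -> float:
    """Inverse logit: converts a risk score z into a probability P_i."""
    return 1.0 / (1.0 + math.exp(-z))

def ponv_risk_score(female: int, history_ponv: int,
                    non_smoker: int, postop_opioids: int) -> float:
    """Risk score z for PONV using the published coefficients (ref. 34).

    Each argument is 0 or 1 for absence/presence of the risk factor.
    """
    return (-2.28 + 1.27 * female + 0.65 * history_ponv
            + 0.72 * non_smoker + 0.78 * postop_opioids)

# Baseline patient (all risk factors absent): only the intercept applies.
p_baseline = inv_logit(ponv_risk_score(0, 0, 0, 0))      # about 0.09

# Non-smoking female, history of motion sickness, postoperative opioids.
z = ponv_risk_score(1, 1, 1, 1)                          # -2.28 + 3.42 = 1.14
p_high = inv_logit(z)                                    # about 0.76

# The adjusted OR for female sex is the exponentiated coefficient e^beta.
or_female = math.exp(1.27)
```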

Fig. 1: no caption available.

Validation
All the methodological problems of determining risk factors also apply to risk scores and probabilities. Clearly, the predictions must be valid for the patients of the dataset used for development (internal validation).^{35} Numerous problems may arise when users wish to transfer predictive algorithms from one period or one practice setting to a different time or another place.^{36} These include the following: the statistical model may have been excessively optimistic in the choice and weighting of predictor variables within the original dataset; in a new time or a new place, other variables not relevant to the original model may become important; predictor variables may no longer be ‘predictive’; and the functional relationship of predictor to outcome may have changed. Even when a risk factor has a true association with outcome, it has been empirically observed that the OR is usually inflated (too large).^{27} External validation is always necessary before any risk score with probabilistic predictions can be accepted. Examples of anaesthesia risk scores for which external validation has been reported include PONV and postoperative mortality.^{6,34,37}

In 2000, Eberhart et al.^{37} reported an external validation study of three published risk scores for postoperative nausea and vomiting and postoperative vomiting in adults during the first 24 h following surgery. The assessment of predictors in an external validation study is illustrated with a re-analysis of these results in the Supplementary Digital Content 1 (http://links.lww.com/EJA/A23).

Overall performance measures
Frameworks for the analysis of external validation studies have been proposed.^{38–41} The mathematical distance between the forecast and the actual outcome is the essence of quantifying overall performance – the shorter the distance, the better the model. Overall performance is measured by the distance of the predicted outcome (P_{i}) from the actual outcome (Y_{i}), in which Y_{i} is set to 0 or 1 for the non-occurrence or occurrence of the outcome. A good model of risk will have a short average distance. The accepted measures of overall performance of a risk score in validation datasets are the Brier score statistic and Nagelkerke's R^{2} statistic. Nagelkerke's R^{2}, ranging from 0 to 1, is the proportion of the variation of the response variable explained by the risk score. In the Eberhart data, the predictor scores of Apfel et al.,^{42} Koivuranta et al.^{43} and Palazzo and Evans^{44} had Nagelkerke's R^{2} values of 0.15, 0.19 and 0.09, respectively. Thus, only a small part of the variation in the probability of PONV (15, 19 and 9%, respectively) among patients was explained by the P_{i} calculated from the risk score.
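Both overall performance measures can be computed directly from the paired predictions and outcomes. The sketch below uses a small invented dataset purely for illustration; Nagelkerke's R^{2} is implemented, as is conventional, by rescaling the Cox–Snell R^{2} by its maximum attainable value.

```python
import math

def brier_score(preds, outcomes):
    """Mean squared distance between predictions P_i and outcomes Y_i (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

def log_likelihood(preds, outcomes):
    """Binomial log-likelihood of 0/1 outcomes given probability predictions."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(preds, outcomes))

def nagelkerke_r2(preds, outcomes):
    """Nagelkerke's R^2: the Cox-Snell R^2 rescaled to the 0-1 range.

    The null model predicts the observed event rate for every patient.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    ll_null = log_likelihood([base_rate] * n, outcomes)
    ll_model = log_likelihood(preds, outcomes)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell

# Invented example: four patients, two events.
outcomes = [1, 0, 1, 0]
preds = [0.8, 0.2, 0.7, 0.4]
b = brier_score(preds, outcomes)       # 0.0825
r2 = nagelkerke_r2(preds, outcomes)    # about 0.69
```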

Discrimination
Discrimination is the ability of the predictions to rank order patients with different outcomes. Discrimination of risk scores uses the notation of diagnostic tests (Table 3a and b)^{39} and is displayed by the receiver operating characteristic (ROC) curve. With the multiple values of P_{i} from a risk score, multiple 2 × 2 tables may be calculated (one less than the number of distinct P_{i} values) by choosing all possible threshold values of P_{i}. An ROC curve is the line plot connecting the paired (sensitivity, specificity) values of these multiple 2 × 2 tables, extending the performance of a prediction across the whole prediction space. If all patients with outcome 1 have predictions P_{i} greater than the largest prediction of the patients with outcome 0, there is perfect discrimination. Conversely, if the predictions are uniformly interspersed for the patients with outcomes 1 and 0, there is no discrimination.

Table 3: The 2 × 2 tables diagnostic test definitions and an example

The area under the curve (AUC) is the measure of discrimination; it is also called the concordance (C-) statistic (range 0–1). This is the probability that, given two patients, one with and one without the eventual occurrence of an event, the larger risk score will be assigned to the former.^{45} Perfect discrimination has a C-statistic of unity, whereas a C-statistic of 0.5 indicates random predictions (no better than flipping a coin); a C-statistic of 0.5 is the area under the diagonal line from lower left to upper right. The C-statistic compares only the ranking of P_{i}s for cases and non-cases. For example, a validation study that found all cases with a P_{i} of 0.52 and all non-cases with a P_{i} of 0.51 would have perfect discrimination.^{46} There are a variety of methods, both parametric and nonparametric, for estimating the C-statistic.^{47} The simplest is the two-sample rank sum statistic (known as the Mann–Whitney–Wilcoxon or Wilcoxon rank-sum test), an unbiased estimator of the AUC. Although a good risk model will have high discrimination, by itself the C-statistic is not optimal in assessing the performance of risk scores.^{46} The three risk scores for postoperative nausea and vomiting assessed by Eberhart et al. had very similar AUCs of about 0.7 (Fig. 2).^{37,42–44} Another aspect of discrimination is the difference in the mean of predictions between outcomes; this is denoted the discrimination slope.^{39} For the three risk scores, the mean P_{i}s for those without and with an event were Apfel et al. (0.21, 0.32; slope = 0.11), Koivuranta et al. (0.34, 0.48; slope = 0.14) and Palazzo and Evans (0.14, 0.26; slope = 0.12). Although the absolute P_{i}s were different, the discrimination slopes were approximately the same, as were the AUCs.
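The C-statistic can be computed exactly as the proportion of case/non-case pairs in which the case receives the larger prediction (ties counting one-half), which is the rank-sum estimator mentioned above; the discrimination slope is simply the difference in mean predictions. The toy values below reproduce the perfect-discrimination example in the text (all cases at 0.52, all non-cases at 0.51).

```python
def c_statistic(case_preds, control_preds):
    """Concordance (C-) statistic by exhaustive pairwise comparison.

    Equivalent to the Mann-Whitney-Wilcoxon estimate of the area under
    the ROC curve: the fraction of case/non-case pairs in which the case
    has the higher prediction, counting ties as one-half.
    """
    concordant = 0.0
    for p_case in case_preds:
        for p_control in control_preds:
            if p_case > p_control:
                concordant += 1.0
            elif p_case == p_control:
                concordant += 0.5
    return concordant / (len(case_preds) * len(control_preds))

def discrimination_slope(case_preds, control_preds):
    """Difference between mean predictions for cases and non-cases."""
    return (sum(case_preds) / len(case_preds)
            - sum(control_preds) / len(control_preds))

# All cases predicted 0.52, all non-cases 0.51: perfect discrimination
# (C-statistic of 1) despite nearly identical absolute probabilities.
auc = c_statistic([0.52, 0.52, 0.52], [0.51, 0.51, 0.51])
slope = discrimination_slope([0.52, 0.52, 0.52], [0.51, 0.51, 0.51])
```

The exhaustive pairwise loop is quadratic and is used here only for clarity; a rank-based implementation is preferable for large datasets.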

Fig. 2: no caption available.

Calibration
Discrimination and calibration are distinct properties of risk scores. Calibration is the agreement between observed outcomes and predictions (i.e. the correctness of prediction probabilities on an absolute scale). For example, if the predicted probability of in-hospital mortality is 20%, then about 20% of the patients with that predicted probability should die in hospital. Calibration is assessed by regression statistics and is visualised by a calibration graph. The statistics include the goodness-of-fit χ^{2}-statistic of Hosmer–Lemeshow and the logistic regression approach of Cox.^{48,49} The plot for one of the risk scores in the study by Eberhart et al. shows poor calibration, with both over-prediction and under-prediction (Fig. 3).^{37,44}
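Calibration assessment can be sketched with the Hosmer–Lemeshow statistic: patients are sorted by predicted probability, split into groups, and observed and expected event counts are compared with a χ² sum. The two-group toy data below are constructed so that observed and expected counts agree exactly, giving a statistic of 0 (perfect calibration); real data and the usual ten groups would of course behave less tidily.

```python
def hosmer_lemeshow(preds, outcomes, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit chi-squared statistic.

    Sorts patients by predicted probability, splits them into roughly
    equal groups and sums (observed - expected)^2 / expected over both
    the event and non-event counts in each group.
    """
    pairs = sorted(zip(preds, outcomes))
    n = len(pairs)
    chi2 = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        size = len(group)
        observed = sum(y for _, y in group)   # observed events in the group
        expected = sum(p for p, _ in group)   # expected events: sum of P_i
        chi2 += (observed - expected) ** 2 / expected
        chi2 += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return chi2

# Constructed toy data: five patients at P=0.2 with one event, five at
# P=0.4 with two events -- observed equals expected in both groups.
preds = [0.2] * 5 + [0.4] * 5
outcomes = [1, 0, 0, 0, 0] + [1, 1, 0, 0, 0]
chi2 = hosmer_lemeshow(preds, outcomes, n_groups=2)   # essentially 0
```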

Fig. 3: no caption available.

Discussion
With the advent of statistical software and data collection tools, hundreds of studies of prognosis are published annually, commonly involving the fate of patients with cancer, heart disease, pulmonary disease, trauma and so on. We have highlighted methods for the development and interpretation of risk scores. Apart from studies of mortality, the anaesthesia literature also includes reports of scoring systems for the prognosis of everyday problems such as airway management, postoperative shivering, postoperative nausea, postoperative cognition and so on. One of the goals is to obtain risk probabilities adjusted for specific patient characteristics. However, these studies sometimes do not complete some of the necessary steps of prognosis research.^{50} In their qualitative systematic review, Ip et al. identified 48 studies reporting predictors of postoperative pain and analgesic consumption.^{24} Only two studies reported any validation of their predictors; neither study had an external validation.

Wyatt and Altman^{51} posed this provocative question more than a decade ago: ‘prognostic models, clinically useful or quickly forgotten?’ Inevitably, it seems, the first publication of a forecast system is excessively optimistic, and later use is accompanied by a degradation of the predictions. Generalisability is the ability of the forecast to provide accurate predictions in a new sample of patients. The forecast should, of course, be reproducible in patients from the identical population obtained contemporaneously at the original institution; this is internal validation. More importantly, the forecast should be transferable to different populations; external validation is necessary to show that temporal changes (a later year), geographic changes (a different continent), a different spectrum of illness and so on do not defeat the prediction rule. What if external validation shows unfavourable characteristics of a scoring system? Several choices are evident. First, among the three prediction scores for postoperative nausea and vomiting, that proposed by Koivuranta et al. performed best when assessed by Eberhart et al.^{37} Yet at another external site a different risk score may be the best.
It may be decided that scoring system predictions are so inconsistent when applied elsewhere that they should be abandoned.^{52} Second, a prediction system for severe postoperative pain was derived and then tested by its originators at a different hospital several years after the initial publication.^{53} Statistical methods were used to recalibrate the prediction rule.^{54} It is unresolved whether adjusting clinical prediction rules to fit a specific institution will prove practical and useful.^{55} Third, changes in medical care and patient risk may eventually degrade the calibration of even well-established risk scores; 10 years after publication, the EuroSCORE predictor for cardiac surgery now over-estimates mortality and is being updated from a new large dataset.^{56,57}

Wyatt and Altman^{51} also insisted that evidence of clinical effectiveness should be expected of risk scores and risk prediction. Risk estimations can improve a patient's informed consent, are useful in benchmarking between institutions and physicians and may help in resource allocation. However, can the clinician use the presence of risk factors or a risk score as a decision aid to change care and change outcomes?^{58} Even with the publication by an international panel of guidelines for PONV management, including the use of risk factors to guide therapy, there is still controversy about the usefulness of PONV risk scores.^{52,59,60} Anaesthesiologists should demand empirical evidence from randomised controlled trials that a risk score is clinically effective.^{51,61} In fact, our risk predictions might be reconfigured to be conditioned on the chosen anaesthetic management – the treatment choices to be included in our risk score calculations along with biomarkers and clinical risk factors.^{61} Risk factors and risk scores will remain a subject of intense interest and research.

Acknowledgements
No external funding or conflicts of interest declared.

References
1. Apgar V. A proposal for a new method of evaluation of the newborn infant.

Cur Res Anesth Analg 1953; 32:260–267.

2. Casey BM, McIntire DD, Leveno KJ. The continuing value of the Apgar score for the assessment of newborn infants.

N Engl J Med 2001; 344:467–471.

3. Sakland M. Grading of patients for surgical procedures.

Anesthesiology 1941; 2:281–284.

4. Davenport DL, Bowe EA, Henderson WG, et al. National Surgical Quality Improvement Program (NSQIP)

risk factors can be used to validate American Society of Anesthesiologists Physical Status Classification (ASA PS) levels.

Ann Surg 2006; 243:636–641.

5. Beecher HK, Todd DP. A study of the deaths associated with anesthesia and surgery: based on a study of 599, 548 anesthesias in ten institutions 1948–1952, inclusive.

Ann Surg 1954; 140:2–35.

6. Sessler DI, Sigl JC, Manberg PJ, et al. Broadly applicable

risk stratification system for predicting duration of hospitalization and mortality.

Anesthesiology 2010; 113:1026–1037.

7. Hayden JA, Cote P, Bombardier C. Evaluation of the quality of prognosis studies in systematic reviews.

Ann Intern Med 2006; 144:427–437.

8. Netzer NC, Stoohs RA, Netzer CM, et al. Using the Berlin Questionnaire to identify patients at

risk for the sleep apnea syndrome.

Ann Intern Med 1999; 131:485–491.

9. Summary of the National Halothane Study. Possible association between halothane anesthesia and postoperative hepatic necrosis.

JAMA 1966;

197 :775–788.

10. Bewick V, Cheek L, Ball J. Statistics review 14: logistic regression.

Crit Care 2005; 9:112–118.

11. Pace NL. Independent predictors from stepwise logistic regression may be nothing more than publishable p values.

Anesth Analg 2008; 107:1775–1778.

12. Bishop MJ, Souders JE, Peterson CM, et al. Factors associated with unanticipated day of surgery deaths in Department of Veterans Affairs Hospitals.

Anesth Analg 2008; 107:1924–1935.

13. Gayat E, Pirracchio R, Resche-Rigon M, et al. Propensity scores in intensive care and anaesthesiology literature: a systematic review.

Intensive Care Med 2010; 36:1993–2003.

14. Ellenberger C, Tait G, Beattie WS. Chronic beta blockade is associated with a better outcome after elective noncardiac surgery than acute beta blockade: a single-center propensity-matched cohort study.

Anesthesiology 2011; 114:817–823.

15. Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes.

J Clin Epidemiol 1996; 49:1225–1231.

16. Traeger M, Eberhart A, Geldner G, et al. Prediction of postoperative nausea and vomiting using an artificial neural network. Anaesthesist 2003; 52:1132–1138.

17. Greenland S. Bayesian perspectives for epidemiological research. II. Regression analysis. Int J Epidemiol 2007; 36:195–202.

18. Swartz MD, Yu RK, Shete S. Finding factors influencing risk: comparing Bayesian stochastic search and standard variable selection methods applied to logistic regression models of cases and controls. Stat Med 2008; 27:6158–6174.

19. Biagioli B, Scolletta S, Cevenini G, et al. A multivariate Bayesian model for assessing morbidity after coronary artery surgery. Crit Care 2006; 10:R94.

20. Brotman DJ, Walker E, Lauer MS, O’Brien RG. In search of fewer independent risk factors. Arch Intern Med 2005; 165:138–145.

21. Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. New York, New York: Springer-Verlag; 2002.

22. Sun G-W, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol 1996; 49:907–916.

23. Wiegand RE. Performance of using multiple stepwise algorithms for variable selection. Stat Med 2010; 29:1647–1659.

24. Ip HY, Abrishami A, Peng PW, et al. Predictors of postoperative pain and analgesic consumption: a qualitative systematic review. Anesthesiology 2009; 111:657–677.

25. Moons KG, Donders AR, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol 2004; 57:1262–1270.

26. Levinson SS. Weak associations between prognostic biomarkers and disease in preliminary studies illustrates the breach between statistical significance and diagnostic discrimination. Clin Chim Acta 2010; 411:467–473.

27. Ioannidis JP. Why most discovered true associations are inflated. Epidemiology 2008; 19:640–648.

28. Ray P, Le Manach Y, Riou B, Houle TT. Statistical evaluation of a biomarker. Anesthesiology 2010; 112:1023–1040.

29. Ryding AD, Kumar S, Worthington AM, Burgess D. Prognostic value of brain natriuretic peptide in noncardiac surgery: a meta-analysis. Anesthesiology 2009; 111:311–319.

30. deFilippi CR, de Lemos JA, Christenson RH, et al. Association of serial measures of cardiac troponin T using a sensitive assay with incident heart failure and cardiovascular mortality in older adults. JAMA 2010; 304:2494–2502.

31. Domanski MJ, Mahaffey K, Hasselblad V, et al. Association of myocardial enzyme elevation and survival following coronary artery bypass graft surgery. JAMA 2011; 305:585–591.

32. Pepe MS, Janes H, Longton G, et al. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 2004; 159:882–890.

33. Katz MH. Multivariable analysis: a primer for readers of medical research. Ann Intern Med 2003; 138:644–650.

34. van den Bosch JE, Kalkman CJ, Vergouwe Y, et al. Assessing the applicability of scoring systems for predicting postoperative nausea and vomiting. Anaesthesia 2005; 60:323–331.

35. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000; 19:453–473.

36. Altman DG, Vergouwe Y, Royston P, Moons KG. Prognosis and prognostic research: validating a prognostic model. BMJ 2009; 338:b605.

37. Eberhart LHJ, Högel J, Seeling W, et al. Evaluation of three risk scores to predict postoperative nausea and vomiting. Acta Anaesthesiol Scand 2000; 44:480–488.

38. Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer; 2009.

39. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010; 21:128–138.

40. Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biom J 2008; 50:457–479.

41. Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med 2009; 150:795–802.

42. Apfel CC, Greim CA, Haubitz I, et al. A risk score to predict the probability of postoperative vomiting in adults. Acta Anaesthesiol Scand 1998; 42:495–501.

43. Koivuranta M, Läärä E, Snåre L, Alahuhta S. A survey of postoperative nausea and vomiting. Anaesthesia 1997; 52:443–449.

44. Palazzo M, Evans R. Logistic regression analysis of fixed patient factors for postoperative sickness: a model for risk assessment. Br J Anaesth 1993; 70:135–140.

45. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.

46. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007; 115:928–935.

47. Qin GS, Hotilovac L. Comparison of nonparametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test. Stat Methods Med Res 2008; 17:207–221.

48. Hosmer D, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med 1997; 16:965–980.

49. Cox DR. Two further applications of a model for binary regression. Biometrika 1958; 45:562–565.

50. Hemingway H, Riley RD, Altman DG. Ten steps towards improving prognosis research. BMJ 2009; 340:410–414.

51. Wyatt JC, Altman DG. Prognostic models: clinically useful or quickly forgotten? BMJ 1995; 311:1539–1541.

52. Eberhart LHJ, Morin AM. Risk scores for predicting postoperative nausea and vomiting are clinically useful tools and should be used in every patient: con – ‘life is really simple, but we insist on making it complicated’. Eur J Anaesthesiol 2011; 28:155–159.

53. Janssen KJ, Vergouwe Y, Kalkman CJ, et al. A simple method to adjust clinical prediction models to local circumstances. Can J Anaesth 2009; 56:194–201.

54. Janssen KJM, Moons KGM, Kalkman CJ, et al. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 2008; 61:76–86.

55. Brasher PM, Beattie WS. Adjusting clinical prediction rules: an academic exercise or the potential for real world clinical applications in perioperative medicine? Can J Anaesth 2009; 56:190–193.

56. Berg KS, Senseth R, Pleym H, et al. Mortality risk prediction in cardiac surgery: comparing a novel model with the EuroSCORE. Acta Anaesthesiol Scand 2011; 55:313–321.

57. Jokinen JJ. Why do we have to predict mortality rates? Acta Anaesthesiol Scand 2011; 55:255–256.

58. Janes H, Pepe MS, Bossuyt PM, Barlow WE. Measuring the performance of markers for guiding treatment decisions. Ann Intern Med 2011; 154:253–259.

59. Pierre S. Risk scores for predicting postoperative nausea and vomiting are clinically useful tools and should be used in every patient: pro – ‘don’t throw the baby out with the bathwater’. Eur J Anaesthesiol 2011; 28:160–163.

60. Kranke P. Effective management of postoperative nausea and vomiting: let us practise what we preach! Eur J Anaesthesiol 2011; 28:152–154.

61. Windeler J. Prognosis: what does the clinician associate with this notion? Stat Med 2000; 19:425–430.