We are very familiar with risk models in cardiac surgery. The first well-known one was derived by Parsonnet and colleagues for mixed adult cardiac surgery, and it is important to remember from the outset how it came about. By the 1980s, teams that were successful in coronary artery surgery were reporting case series with very low mortalities and it was not infrequent to hear the proud boast ‘We have had no deaths in the last hundred cases'. Kirklin taught us at the University of Alabama that the risk for coronary surgery ‘approaches zero'.[a] However, the prevalent practice amongst these groups at the time was to be selective. Surgeons shied away from the elderly, those with poor ventricles already damaged by infarction, diabetics, hypertensives and those with significant co-morbidity. But these patients may have been very symptomatic with angina. They wanted a better answer than ‘No'. These high-risk patients were also in fear of their lives from myocardial ischaemia. Coronary grafts improved blood supply to the myocardium and hence there was an implicit assumption that there was also a survival advantage.
The problem, put simply, is that low-risk patients have a relatively good prognosis however they are managed. The factors that mark out a patient to be at higher risk around the time of surgery are much the same as the indicators of a bad natural history. For these patients an operation carries a higher risk but in return they receive a greater benefit. In the USA, in the 1980s, university surgeons such as Parsonnet in tertiary referral centres, or any other institution which was the ‘last port of call' for the sick, had to think beyond the short-term operative risk to the overall benefit for the patient.
The problem that faced Parsonnet and other ‘university' surgeons, when being compared with the private practitioners, was that they were being judged unfairly by a single measure: the perioperative mortality rate. Parsonnet devised and introduced to clinical practice his method of risk adjustment, which was soon validated in British practice. It has now been largely replaced by EuroSCORE. The central purpose is to provide ‘a method of uniform stratification of risk' so that if comparisons are to be made, they are made more fairly.
Predicting the risk of a single outcome from a single variable
To understand the process of estimating risk it may be helpful to consider the effect of one variable on one outcome. Familiar to cardiothoracic anaesthetists is the time relationship between ischaemia and organ injury. For how long will the brain tolerate ischaemia if we use hypothermic circulatory arrest at 18°C? This has been studied in detail in humans and various species down to and including the Mongolian gerbil [6,7].
The risk of neurological damage is time related. In the instance discussed here, the defining outcome was a spastic gait in dogs following ischaemia. A series of 31 dogs was subjected to circulatory arrest for 30, 40, 50 or 60 min while the body temperature was held at 16–18°C. After they had recovered from the experience, some of the dogs were noted to have a permanent neurological deficit with a characteristic high-stepping gait. This affected about half of the animals (17/31). There was a clear time relationship, with 0%, 20%, 75% and 100% being affected as the time increased in 10-min intervals from 30 to 60 min (Fig. 1). These proportions are plotted with error bars representing the 70% CI.[b] The same data were later put into a logistic regression model,[c] which produces a characteristic S-shaped curve as the probability of the event (in this example spasticity of the limbs) asymptotically approaches 100%. There is a recognizable change in risk, a threshold of a sort, at a little beyond 30 min, followed by an approximately linear rise in risk between 40 and 50 min, and a near certainty of damage beyond 60 min (Fig. 1). It is reassuring for those who rely on evidence from animal experiments that this relationship of time to clinically evident injury is similar in pigs, gerbils and babies [9,10].
In the animals it was permissible (at least in the USA 20 yr ago) to ‘test to destruction' and find the full shape of the curve (Fig. 1). In clinical practice there is a point at which we call a halt and curtail risk so we will never have the data to fully characterize the risk of death associated with surgery in this manner. Also, the risk around the time of surgery is related to not one but many factors.
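The S-shaped curve described for the dog experiment can be sketched by fitting a one-variable logistic regression to grouped data. The sketch below is illustrative only: the text reports the proportions affected and the overall total (17/31) but not the number of dogs in each group, so the group sizes here are assumptions chosen to match those figures, and the Newton-Raphson fitting routine is a generic one rather than the method used in the original analysis.

```python
import math

# Circulatory arrest time (min) -> (dogs affected, dogs in group).
# Group sizes are ASSUMED: the text gives only the proportions and the 17/31 total.
ASSUMED_GROUPS = {30: (0, 8), 40: (1, 5), 50: (6, 8), 60: (10, 10)}


def fit_logistic(groups, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson, with x = (minutes - 45) / 10."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for minutes, (affected, n) in groups.items():
            x = (minutes - 45) / 10
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            resid = affected - n * p   # observed minus expected events
            w = n * p * (1 - p)        # binomial variance weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 Newton step [h00 h01; h01 h11] * delta = [g0; g1].
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def risk(minutes, b0, b1):
    """Predicted probability of injury after the given arrest time."""
    x = (minutes - 45) / 10
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


b0, b1 = fit_logistic(ASSUMED_GROUPS)
for t in (30, 40, 50, 60):
    print(f"{t} min: predicted risk {risk(t, b0, b1):.2f}")
```

Centring and scaling the time variable keeps the fitting step numerically well behaved; it does not change the shape of the fitted curve.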
Building the model
The first task is to acquire data. The variables collected should:
- bear a putative relationship to the risk, which is to be predicted,
- ideally be available on all patients,
- be routinely collected rather than derived from special or research investigations,
- be objective measurements when possible and
- be non-negotiable.
This last point is very important. It was noted that after New York CABG mortality data were published with risk adjustment, the number of patients with a clinical diagnosis of chronic obstructive pulmonary disease (COPD) jumped up remarkably with the effect that the mortality figures all looked more favourable. A tick in the COPD box can easily be justified on subjective assessment. It should be ticked to correct for additional risk, but ticking it because it makes the results look better may also be tempting.
Data on lung resections have been prospectively collected as part of the European Thoracic Database project (led by RB) described in more detail elsewhere. The data are from 27 units in 14 countries. The total dataset concerns 3488 patients undergoing 3517 procedures.
An important step before analysing the data is to check for completeness and for errors. The data should be ‘cleaned' by running checks to ensure that they all make sense. For example, the operation date should be after the birth date. Odd outliers can be checked for transcription errors. An unexpected bimodal distribution can occur if data are entered in two different units, or if some are absolute measurements and others are indexed to body weight or surface area. After exclusions on prescribed criteria given in Table 1, there were 3426 cases available for analysis.
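Checks of this kind are easy to automate. The following is a minimal sketch, assuming hypothetical field names (`birth_date`, `operation_date`, `fev1_litres`) and an arbitrary plausibility range for FEV1; neither is taken from the database described here.

```python
from datetime import date

# Hypothetical field names and plausibility limits, chosen for illustration only.
FEV1_LITRES_RANGE = (0.3, 7.0)


def check_record(record):
    """Return a list of human-readable problems found in one patient record."""
    problems = []
    for field in ("birth_date", "operation_date", "fev1_litres"):
        if record.get(field) is None:
            problems.append(f"missing {field}")
    birth, op = record.get("birth_date"), record.get("operation_date")
    if birth is not None and op is not None and op <= birth:
        problems.append("operation date is not after birth date")
    fev1 = record.get("fev1_litres")
    if fev1 is not None:
        low, high = FEV1_LITRES_RANGE
        if not low <= fev1 <= high:
            # Could be a transcription error or a value entered in other units.
            problems.append(f"FEV1 {fev1} outside {low}-{high} L: check units")
    return problems


records = [
    {"birth_date": date(1940, 5, 1), "operation_date": date(2003, 2, 10), "fev1_litres": 2.1},
    {"birth_date": date(2003, 2, 10), "operation_date": date(1940, 5, 1), "fev1_litres": 2.4},
    {"birth_date": date(1951, 7, 9), "operation_date": date(2002, 11, 3), "fev1_litres": 2100},
]
for i, rec in enumerate(records):
    print(i, check_record(rec) or "ok")
```

Flagged records are listed for checking against the source rather than silently corrected or dropped, which keeps the cleaning step separate from any later temptation to exclude inconvenient cases.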
Once the analyst starts working, messages begin to emerge. If for any reason the conclusions do not hang together or they do not fit preconceived ideas, there is a temptation for the researcher driving the project to make adjustments such as excluding ‘outliers'. This has a selective effect in changing data we do not like while accepting data that may be as flawed but which fit the story. We will later see such a concern in this analysis concerning the ASA (American Society of Anesthesiologists) grading.[d]
The database was designed to permit free text entry of procedure type and diagnosis, resulting in an unwieldy array of different descriptions for these two factors. For the purpose of analysis, mutually exclusive categories of procedure type were defined on inspection of the data but prior to any analysis of the mortality data. For each patient, the reported type of procedure was matched to one of these categories. Some amalgamations were made so that, for example, the small number of sleeve resections were gathered under lobectomy and so on. Table 2 shows the categories used in the model and the associated mortality rates. The scientific report carries more detail.
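Matching free-text descriptions to fixed categories can be sketched as a simple keyword lookup. The keywords and category names below are hypothetical, apart from the amalgamation of sleeve resections under lobectomy mentioned above; the categories actually used are those of Table 2.

```python
# Hypothetical category names and keywords; the actual categories are those of Table 2.
KEYWORD_TO_CATEGORY = [
    ("pneumonectomy", "pneumonectomy"),
    ("sleeve", "lobectomy"),   # sleeve resections were gathered under lobectomy
    ("lobectomy", "lobectomy"),
    ("wedge", "wedge resection"),
]


def categorise(free_text):
    """Map a free-text procedure description to one category, or None if unmatched."""
    text = free_text.strip().lower()
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in text:
            return category
    return None   # left for manual review rather than guessed


for description in ("Right upper SLEEVE resection", " Left Pneumonectomy ", "thymectomy"):
    print(repr(description), "->", categorise(description))
```

The rules are checked in order and each description receives at most one category, so the categories stay mutually exclusive. Fixing such a table before looking at the mortality data mirrors the precaution described in the text.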
Reserving data for testing the model
The clinicians having spent considerable energy gathering data on as many cases as possible, the analyst will blithely remove a large proportion of these cases from the model building process. These data are reserved for later testing of the model's performance. Evaluating the performance of a model amongst the same data that were used in its development often leads to an unduly optimistic view of the accuracy of the model. In our study, of the 3426 procedures 60% (2056) were selected at random to be used to develop the model leaving the remaining 1370 for testing.
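A random 60/40 partition of this kind can be sketched in a few lines. The seed value and the use of consecutive integers as case identifiers are assumptions made for illustration; the procedure actually used to draw the random sample is not described here.

```python
import random


def split_cases(case_ids, development_fraction=0.6, seed=2024):
    """Randomly partition case identifiers into development and test sets."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is reproducible
    n_dev = round(development_fraction * len(shuffled))
    return shuffled[:n_dev], shuffled[n_dev:]


development_set, test_set = split_cases(range(3426))
print(len(development_set), len(test_set))
```

Recording the seed means the same split can be regenerated later, so the test set stays untouched however many times the development analysis is rerun.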
The first phase: univariate analysis
The first column of Table 3 shows the variables and the second column the number where the variable of interest was present. They were tested against the outcome, which was, for this analysis, death in hospital. There were 40 deaths in the development set consisting of 6/582 women and 34/1474 men (1% vs. 2.3%).
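As an illustration of a univariate comparison, the sketch below takes the counts quoted above for women and men and computes the two mortality rates and a Pearson chi-squared test for a 2×2 table. The choice of test is an assumption made for illustration; the test used to produce the P values in Table 3 is not stated here.

```python
import math


def chi_squared_2x2(events_a, total_a, events_b, total_b):
    """Pearson chi-squared statistic (1 df, no continuity correction) and its P value."""
    a, b = events_a, total_a - events_a
    c, d = events_b, total_b - events_b
    n = total_a + total_b
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2))   # upper tail of chi-squared, 1 df
    return statistic, p_value


women_rate = 6 / 582
men_rate = 34 / 1474
statistic, p_value = chi_squared_2x2(6, 582, 34, 1474)
print(f"women {women_rate:.1%}, men {men_rate:.1%}, chi2 = {statistic:.2f}, P = {p_value:.3f}")
```

With only 40 deaths in the development set, the counts in some cells are small, so an exact test might be preferred in practice.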
Age was associated with the risk of death as was the category of operation and a measure based on FEV1 (the forced expiratory volume in the first second), the most commonly used measurement from spirometry. The use of predicted postoperative (ppo) FEV1 is favoured in the British Thoracic Society guidelines. A calculation is made which depends on the number of functioning segments that will be lost as a proportion of the number of functioning segments before resection. It has the merits of being based on a measurement that is reproducible and it is usually measured three times to ensure reliability. If these factors had not emerged as significantly associated with death, it would have been surprising.
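The ppoFEV1 calculation described above amounts to scaling the preoperative value by the fraction of functioning segments that will remain. A minimal sketch follows; the function name, parameter names and example numbers are hypothetical.

```python
def predicted_postoperative_fev1(preoperative_fev1, segments_to_remove, functioning_segments):
    """Scale preoperative FEV1 by the fraction of functioning segments that remain."""
    if functioning_segments <= 0:
        raise ValueError("functioning_segments must be positive")
    if not 0 <= segments_to_remove <= functioning_segments:
        raise ValueError("segments_to_remove must lie between 0 and functioning_segments")
    remaining_fraction = 1 - segments_to_remove / functioning_segments
    return preoperative_fev1 * remaining_fraction


# Hypothetical patient: FEV1 of 2.0 L, losing 5 of 20 functioning segments.
print(predicted_postoperative_fev1(2.0, 5, 20))
```

The same function applies whether FEV1 is given in litres or as a percentage of the predicted value, since the result is in the units of the input.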
The other potential risk factors that were associated with mortality at the univariate level were subjective assessments of the patient made by the clinician: the ASA, ECOG performance status[e] and the MRC dyspnoea score.
A measurement of DLCO% (the diffusion capacity for carbon monoxide[f]) was available for only 23% of cases, which was too few to enter the multivariate model. All the other factors were considered as candidate variables for the multivariate model.
Multivariate analysis and its vagaries
The final column of Table 3 shows the P values for the factors that remained significant in a multivariate analysis. Note, in particular, that the ppoFEV1 has dropped out of consideration. It is important to stress that this is not necessarily a reflection on its independent value in assessing patients. Age, ASA, MRC dyspnoea score and the operation type are in the model.
The MADCAP chart in Figure 2 illustrates the performance of the model amongst the remaining 40% of the cases (the test set). The model performs reasonably well but seems to underestimate risk amongst those patients in the middle quintile of predicted risk (cases 560–840) and to overestimate risk amongst the highest quintile.
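The kind of comparison described here, with cases ranked by predicted risk and predicted deaths set against observed deaths, can be sketched as two running totals. This is an illustration in the spirit of the chart rather than a reproduction of the published MADCAP method (reference 13), and the data are invented.

```python
from itertools import accumulate


def cumulative_predicted_and_observed(predicted_risks, died):
    """Order cases by predicted risk and return running totals of predicted and observed deaths."""
    ordered = sorted(zip(predicted_risks, died), key=lambda pair: pair[0])
    cumulative_predicted = list(accumulate(risk for risk, _ in ordered))
    cumulative_observed = list(accumulate(int(outcome) for _, outcome in ordered))
    return cumulative_predicted, cumulative_observed


# Toy data, invented for illustration.
risks = [0.30, 0.01, 0.10, 0.02, 0.05]
outcomes = [True, False, False, False, True]
predicted, observed = cumulative_predicted_and_observed(risks, outcomes)
for rank, (pred, obs) in enumerate(zip(predicted, observed), start=1):
    print(f"case {rank}: predicted so far {pred:.2f}, observed so far {obs}")
```

Where the observed total runs ahead of the predicted total over a stretch of cases, the model is underestimating risk for that band of patients; where it lags, the model is overestimating.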
A first-level model: clinical predictors
The full mathematical details of this first model are given in a scientific report. Table 4 gives the presentation and predicted risk for a number of patients selected to display the range of predicted risk.
It looks as though what has been created is a model dominated by clinical evaluation rather than objective laboratory measurements. This is not nonsensical because we know that several factors other than the single measure of a patient's ability to blow out air enter our clinical evaluation, however important that measure may be. However, this squeezing-out of the clinical measurement in the model sacrifices information which we are not inclined to reject. Furthermore, it deprives us of the objectivity provided by measurement as opposed to subjective clinical judgements. The pitfall has already been illustrated with reference to COPD in the New York CABG outcomes.
It is intuitively acceptable that increasing age, worsening MRC dyspnoea score and reducing amounts of remaining functioning lung will increase the risk of death but look again at ASA. The risk increases with increasing ASA but then reduces again at the most severe level (the bottom row of Table 4). Various possible explanations have been offered. It may be unwitting ‘loading' of the preoperative risk for added caution in some units or it has been suggested that, presented with this score, the anaesthetists made special provision. We cannot know for sure. We do know that ASA classifications are treated with caution by anaesthetists as being inconsistent predictors of risk.
In clinical practice, the decision to operate on a patient emanates from a risk–benefit analysis conducted over time. There are two very clear phases. When first clinically assessed, many patients can be judged on clinical grounds to be totally unfit for surgery and others appear to be very fit candidates. The former group may well drop out of consideration. This first risk model, which includes clinically available means of assessment, may serve at this stage. We are prepared to offer this as a first clinical assessment risk model, subject to the caveats given later in this paper.
A second model: refined by laboratory measurement
At a later stage of the selection process, a further assessment based on laboratory measurement is customary. We constructed a second-tier model based solely on clinical measurement to predict risk amongst patients with a diagnosis of neoplasia. This second model, derived from age and ppoFEV1%, was built using the ‘enter' method of logistic regression in which no statistical criteria are used to select which factors contribute to the model. This second round of model development went beyond the planned analysis.
Of the 1753 patients with a diagnosis of tumour growth in the development set, age was unavailable in five cases and ppoFEV1% was unavailable in a further 54 cases (one death). This gave a sample of 1694 patients (33 deaths) for the development of model 2. For a given patient, the predicted risk p of in-hospital death according to model 2 is given by the expression
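As a reminder of the general shape such an expression takes, a logistic regression model with age and ppoFEV1% entered as linear terms (an assumption) can be written as below. The symbols β0, β1 and β2 are placeholders for the fitted coefficients, whose values are not assumed here.

```latex
p = \frac{1}{1 + \exp\left[-\left(\beta_0 + \beta_1 \times \mathrm{age} + \beta_2 \times \mathrm{ppoFEV1\%}\right)\right]}
```

In this form a positive coefficient raises the predicted risk as its factor increases and a negative coefficient lowers it, and p always lies between 0 and 1.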
An EXCEL spreadsheet which can be used to perform this calculation can be downloaded from http://www.ucl.ac.uk/operational-research. Figure 3 shows how this model performs against the actual tally of deaths in the data not used directly in the development of the model.[g]
The role of risk models in clinical case selection
Besides using risk models to make fair (or at least fairer) comparisons of outcomes, it is clearly tempting to make use of the available predictions of risk of surgical mortality in clinician–patient discussions concerning whether to opt for surgery. Whilst risk models do have a role to play in this decision process, considerable care is required.
In general, we are making a judgement about whether what we have to offer with our surgery, in terms of relief of symptoms and/or improvement in the prospect of survival, is worth the risk that the patient will face in the attempt. The decision of whether or not to operate and the information used in making such a clinical decision go way beyond the information that can be obtained from any mathematical model that is feasible.
Clinical risk factors that are rare or that have a true but relatively small impact on risk will often not feature in risk models due to the criteria of statistical significance used (quite rightly) in the development of risk models. One solution to this problem is to amass more and more data and to build bigger and better models. However, there comes a point of diminishing return in terms of the number of extra cases required to make improvements to the accuracy of a model. Also, an increase in the data that need to be collected for each case will raise the proportion of cases for whom some data, and hence a prediction of risk, are unavailable.
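The diminishing return can be illustrated with the simplest possible case, the precision of a single observed mortality rate. The sketch below uses the development-set figure of 40 deaths in 2056 cases and the usual binomial standard error; it illustrates the general point only and says nothing about the precision of the models described here.

```python
import math


def standard_error(proportion, n_cases):
    """Binomial standard error of an observed proportion."""
    return math.sqrt(proportion * (1 - proportion) / n_cases)


mortality = 40 / 2056   # in-hospital deaths in the development set
for n in (2056, 4 * 2056, 16 * 2056):
    print(f"{n:6d} cases: standard error {standard_error(mortality, n):.4f}")
```

Because the standard error falls with the square root of the number of cases, four times as many cases are needed to halve it.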
There is another problem when considering the role of risk models in case selection – the fundamental logical inconsistency in using risk models built only on data from cases that were accepted for surgery to determine which cases should be accepted for surgery. In simple terms, if a patient with, say, recent stroke or unstable angina is never considered for pneumonectomy, these factors will not enter the risk model and yet all rational teams would advise against it. Figure 4 illustrates this problem. This is a specific illustration of the more general problem of ‘work-up bias'.
In lung resection surgery, we predominantly deal with lung cancer which has a higher than 95% probability of death within 5 yr if untreated. The issues are dominated by stark choices that may make the difference between life and death.
In order to inform the decision of whether or not to operate we want to know:
- Is the diagnosis of lung cancer proven beyond doubt?
- Depending on its pathological type and stage we can give an estimate of what the survival prospects are for the patient without surgery. For lung cancer, survival is easier to predict than for many other conditions. Take for instance abdominal aortic aneurysm. There is a single lethal event. It is sometimes referred to as a ‘time bomb' but erroneously. It is more like a bomb with a string fuse and we can only ask ‘how long is that piece of string'. In the case of lung cancer, provided we know the diagnosis is correct, although there is a range of uncertainty, there is high probability that death will follow within months or very few years.
- What benefit can we offer by lung resection?
- This estimation is also well developed. If patients are well staged by a combination of CT, PET and nodal biopsies, we can decide whether surgery has a high probability of cure (80% 5-yr survival in Stage Ia, T1N0M0) or more like 50% if there is N1 but not N2 disease.
- What are the perioperative risks that must be faced in order to achieve that benefit?
- This is highly patient specific and all the many factors that may be clinically evident or discovered on investigation of a patient cannot possibly be covered in a risk model. The risk model may give a reasonably robust number, that would be agreed on the basis of routine tests, and agreed by other specialists, but may be completely overridden by individual features that are simply not in the calculation.
The pipe dream of a computer-based algorithm that answers the question will not be realized and it is therefore unwise to judge risk models by this standard. Due to the statistical and logical considerations discussed above, the role of risk modelling will always be limited to informing this process.
The original purpose of this model building exercise was to allow fair and well-informed monitoring of outcomes, not to identify high-risk patients so that they can be refused cancer surgery. Risk modelling can have a secondary use by providing an impartial and objective standard to inform the decision between patient and surgeon.
But is peri-procedural death/survival what we really want to know in lung cancer? The chance of being cured of cancer is an important consideration for the patient and a well-informed patient may accept a risk to have that chance.
1. Parsonnet V, Dean D, Bernstein AD. A method of uniform stratification of risk for evaluating the results of surgery in acquired adult heart disease. Circulation
2. Miller DC, Stinson EB, Oyer PE et al. Discriminant analysis of the changing risks of coronary artery operations: 1971–1979. J Thorac Cardiovasc Surg
3. Kirklin J, Barratt-Boyes B. Cardiac Surgery. New York: John Wiley & Sons, 1986.
4. Nashef SA, Carey F, Silcock MM, Oommen PK, Levy RD, Jones MT. Risk stratification for open heart surgery: trial of the Parsonnet system in a British hospital. BMJ
5. Treasure T. The safe duration of total circulatory arrest with profound hypothermia. Ann R Coll Surg Engl
6. Treasure T. The safe duration of total circulatory arrest with profound hypothermia. University of London, 1982.
7. Treasure T, Naftel DC, Conger KA, Garcia JH, Kirklin JW, Blackstone EH. The effect of hypothermic circulatory arrest time on cerebral function, morphology, and biochemistry. An experimental study. J Thorac Cardiovasc Surg
8. Mohri H, Barnes RW, Winterscheid LC, Dillard DH, Merendino KA. Challenge of prolonged suspended animation: a method of surface-induced deep hypothermia. Ann Surg
9. Brunberg JA, Reilly EL, Doty DB. Central nervous system consequences in infants of cardiac surgery using deep hypothermia and circulatory arrest. Circulation
10. Brunberg JA, Doty DB, Reilly EL. Choreoathetosis in infants following cardiac surgery with deep hypothermia and circulatory arrest. J Pediatr
11. Berrisford R, Brunelli A, Rocco G, Treasure T, Utley M. The European Thoracic Surgery Database project: modelling the risk of in-hospital death following lung resection. Eur J Cardiothorac Surg
12. BTS guidelines: guidelines on the selection of patients with lung cancer for surgery. Thorax
13. Gallivan S, Utley M, Pagano D, Treasure T. MADCAP: a graphical method for assessing risk scoring systems. Eur J Cardiothorac Surg
14. Blackstone EH, Lauer MS. Caveat emptor: the treachery of work-up bias. J Thorac Cardiovasc Surg
15. Naruke T, Tsuchiya R, Kondo H, Asamura H. Prognosis and survival after resection for bronchogenic carcinoma based on the 1997 TNM-staging classification: the Japanese experience. Ann Thorac Surg
16. Treasure T. Whose lung is it anyway? Thorax
[a] This statement appears on p. 230 of the first edition of Kirklin's major textbook and was an emphatic statement of his while TT was in his unit during 1981.
[b] The use of 70% confidence intervals is dealt with on p. 183 of Kirklin's textbook. The idea was that whether the 70% confidence limits of two proportions held up for comparison did or did not overlap approximated to a statistical test for difference at the P = 0.05 level. It was either never really understood or seen as idiosyncratic and has not stood the test of time.
[c] Gene Blackstone was a pioneer of mathematical applications in cardiac surgery and lifelong colleague of John Kirklin. TT worked with them at UAB in 1981. GB now works at the Cleveland Clinic.
[d] http://www.surgical-tutor.org.uk/default-home.htm?core/preop1/fitness.htm. This site gives the method and its application.
[e] http://www.ecog.org/general/perf_stat.html. ECOG stands for the Eastern Cooperative Oncology Group. This very simple scale of impairment of performance (0–5, ranging from fully active to dead) is favoured by oncologists in the context of ability to take chemotherapy.
[f] http://health.allrefer.com/health/lung-diffusion-testing-info.html. This site describes the test and its features.
[g] Note that even though Figure 3 looks encouraging, we built this second model after reviewing the performance of the first model amongst the test cases, so we do not claim to have tested the model in a scientifically rigorous way.