ICISS

An International Classification of Disease-9 Based Injury Severity Score

Osler, Turner MD; Rutledge, Robert MD; Deis, Joan RN; Bedrick, Edward PhD

The Journal of Trauma: Injury, Infection, and Critical Care: September 1996 - Volume 41 - Issue 3 - p 380-388

Abstract

Key Words: ICISS, Trauma, Outcome prediction, Injury Severity Score, ROC analysis.

Although it seems obvious that the scientific study of trauma must depend upon accurate description and measurement of injury, the modern era of injury quantification began only 40 years ago when De Haven's [1] research on light plane accidents led him to attempt the objective measurement of human injury. Interest in automobile crashes prompted a small group of physicians, engineers, and researchers to continue development of an injury description system, an effort that culminated in the publication of the first Abbreviated Injury Scale (AIS) [2] in 1971. This scale included only 73 general injuries, but did include a consensus-derived severity measure for each injury, varying from 1 (minor) to 6 (fatal). Only blunt injuries were included in this first AIS, and no attempt was made to create a comprehensive listing of all possible injuries. Further, no mechanism was proposed for summarizing multiple injuries into a single score for an individual patient. This limitation immediately proved problematic and was addressed three years later, in 1974, by Baker et al. [3] with the creation of the Injury Severity Score (ISS).

The ISS has reigned as the standard for injury measurement for over 20 years with only occasional challenges. Champion et al. [4] noted in 1980 that the ISS remained "congenitally attached" to the subjective determinations of injury severity in the AIS, and proposed a data-driven alternative, the Anatomic Index (AI), which was based upon the vocabulary of the HICDA-8 codes and empirically measured severities. Although Champion's AI performed about as well as the ISS (both had misclassification rates of 4.2%), it failed to displace the ISS, probably because it offered no actual improvement. In retrospect, the AI could probably have performed substantially better had it used the information in all of a given patient's injuries, but, as defined by Champion, the AI used only the single worst injury to predict outcome.

Levy et al. [5] and Goldberg et al. [6] approached injury measurement in a manner similar to that of Champion, but used the information in all of a given patient's injuries. They first assigned a probability of survival to each International Classification of Diseases Adapted for Use in the United States (ICDA) injury code and then defined a patient's probability of survival ("RESP index") as the product of all the single-condition survival rates for that patient's various injuries. Levy and Goldberg's score also failed to displace the ISS, however, because the ISS was found to perform better. [6]

More recently still, Champion and coworkers have conceived A Severity Characterization of Trauma (ASCOT), [7] a predictive measure of outcome that incorporates AIS injury descriptions, age, and physiologic data into a single score. Because ASCOT requires physiologic data for its calculation, it is not, strictly speaking, a summary measure of anatomic injury. We believe that summary measures of physical injury are more useful, because they can be applied when physiologic data are unavailable and can also serve as a predictor variable in a logistic model when physiologic data are available. ASCOT has not been generally adopted, probably because it provided only slightly improved predictive power over the Trauma and Injury Severity Score (TRISS) at the price of substantially increased complexity of calculation. [8] Moreover, a recent comparison of ASCOT and TRISS scoring found no significant difference between the Receiver Operator Characteristic (ROC) curve areas of these two approaches to outcome prediction, confirming that the predictive power of the two methods is essentially equal. [9]

Despite the failure of early attempts to improve upon the ISS, it is likely that better summary measures of trauma exist. Not only is the ISS marred by its dependence upon the consensus derived approximations of individual injury severities attached to the AIS descriptor vocabulary, but it is further limited because it can consider at most only three of a given patient's injuries, and in practice often fewer. Professor Baker herself has called for testing a revised injury severity score that might be improved by allowing the three highest AIS scores to be used, regardless of body region (personal communication, 1995). This modification of the ISS will be the subject of a forthcoming report.

We present here a new injury severity score, the International Classification of Diseases 9th Edition Injury Severity Score (ICISS), which addresses both deficiencies of the ISS. We show that the ICISS more accurately predicts mortality than ISS at a level that is highly statistically significant and may prove to be clinically relevant. Moreover, this improvement is largely maintained when age, mechanism, and Revised Trauma Score (RTS) are added to the model to produce Probability of Survival (POS) scores.

MATERIALS AND METHODS

The University of New Mexico Hospital maintains two parallel but independent trauma registries. These registries include all patients presenting with trauma to the University Hospital who survived long enough to be admitted to the hospital or operating room. One registry uses the AIS90 lexicon and is maintained by a single data registrar trained by the American Association for Automotive Medicine (AAAM). The second registry uses the ICD-9 lexicon and is maintained by trauma nurse specialists, also AAAM trained. [10] Both registries are meticulously maintained, with the mandatory participation of admitting resident physicians and operating attending surgeons, frequent chart reviews, and weekly service conferences during which all injuries in every patient are reviewed. The two registries share 3,142 patients admitted to the University of New Mexico Trauma Service over the 4 years from 1991 through 1994. Patients were predominantly young (92% under age 56) and sustained blunt injuries (75%). Ninety-one percent survived to hospital discharge. These two competing registries provide a natural laboratory for the comparison of ICD-9 based and AIS90 based scoring of injuries.

The North Carolina Hospital Discharge Registry is composed of all patients surviving illness or injury long enough to be admitted to any hospital in North Carolina. ICD-9 values were assigned by individual hospital information system personnel, and no independent validation of ICD-9 completeness or accuracy was undertaken for this research. This registry has accumulated 314,402 patients who sustained 504,934 traumatic injuries (average 1.6 injuries per patient) since 1990. We used this data set to calculate a Survival Risk Ratio (SRR) for every trauma ICD-9 code (range 800-959.9, with 958.0 (air embolism), 958.4 (traumatic shock), and 958.5 (traumatic anuria) excluded) as follows: the number of times a given ICD-9 occurs in a surviving patient is divided by the total number of occurrences of that ICD-9 in the North Carolina registry. A given ICD-9's SRR thus represents the likelihood that any individual patient will survive that particular ICD-9 injury. The corpus of 2,034 different injury ICD-9s and their associated SRRs thus represents the knowledgebase that infuses the experience of North Carolina into ICISS outcome prediction.
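The SRR calculation just described can be sketched as follows (Python; the flat list of per-patient code lists and outcome flags is a hypothetical representation of the discharge registry, not the paper's actual data format):

```python
from collections import Counter

def compute_srrs(patients):
    """Compute a Survival Risk Ratio (SRR) for every ICD-9 code.

    `patients` is a list of (icd9_codes, survived) pairs.  The SRR of a
    code is the fraction of its occurrences that fall in patients who
    survived, per the definition in the text.
    """
    occurrences = Counter()   # total occurrences of each ICD-9
    survivals = Counter()     # occurrences in surviving patients
    for codes, survived in patients:
        for code in codes:
            occurrences[code] += 1
            if survived:
                survivals[code] += 1
    return {code: survivals[code] / occurrences[code] for code in occurrences}

# Toy registry (codes and outcomes are illustrative only).
registry = [
    (["807.0", "861.21"], True),
    (["861.21"], True),
    (["861.21", "864.04"], False),
]
srrs = compute_srrs(registry)
# "861.21" occurs 3 times, twice in survivors -> SRR = 2/3
```

A code's SRR falls toward 0 as the injury proves more lethal in the registry, and equals 1.0 for injuries no patient has died with.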

Following the methodology of Levy [5] and Goldberg, [6] we define the ICISS as the product of all the SRRs for each of an individual patient's injuries. This may be a single SRR, corresponding to a patient with a single injury, or it may be as many as ten SRRs, as in the case of a patient with ten different injuries:

ICISS = PSurv(Inj 1) × PSurv(Inj 2) × ... × PSurv(Inj last)   (Equation 1)

where PSurv(Inj 1) is the probability of survival of Injury 1, PSurv(Inj 2) is the probability of survival of Injury 2, and PSurv(Inj last) is the probability of survival of the last injury.

This formula for injury severity summarization has several appealing characteristics. First, more severe injuries (i.e., lower SRRs) produce lower ICISS scores. Second, patients with a greater number of injuries will also tend to have lower ICISS scores; in fact, the addition of any injury other than a trivial one (SRR = 1) will always lower the ICISS score. And finally, the actual calculation of ICISS involves only multiplication.
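The product formula and the properties above can be sketched directly (Python; the SRR table and codes are hypothetical, and treating codes absent from the table as SRR = 1.0 is an assumption of this sketch, not a policy stated in the paper):

```python
import math

def iciss(srr_table, patient_codes):
    """ICISS = product of the SRRs of all of a patient's injuries.

    Codes missing from the SRR table are treated as trivial (SRR = 1.0)
    here; how to handle unseen ICD-9s is left open by the text.
    """
    return math.prod(srr_table.get(code, 1.0) for code in patient_codes)

# Hypothetical SRRs: a severe head injury and a minor rib fracture.
table = {"854.05": 0.55, "807.0": 0.98}
score = iciss(table, ["854.05", "807.0"])   # 0.55 * 0.98
```

Note that adding any injury with SRR < 1 can only lower the score, matching the second property above.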

The ISSs were calculated for each patient following the protocol of Baker [3] using the computer program IS90SCOR. [11] IS90SCOR also produces TRISS Probability of Survival Scores. [12]

Comparisons between the ISS and the ICISS for the University of New Mexico data set included misclassification rates, receiver operating characteristic (ROC) curve analysis, [13] and the Hosmer-Lemeshow goodness-of-fit statistic, [14] all calculated by SAS statistical software (version 6.08), Proc Logistic. [15] The p values and 95% confidence intervals for the ROC statistics, misclassification rates, and the proportion-survival curves were calculated using a resampling approach, [16] implemented in Paradox Application Language (Borland International) on a desktop IBM-compatible personal computer.
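The resampling approach used for the confidence intervals can be illustrated with a generic percentile bootstrap (a sketch only; the paper's exact paired-resampling procedure is not specified at this level of detail, and all names below are illustrative):

```python
import random

def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic.

    Resamples the data with replacement n_boot times, computes the
    statistic on each resample, and reads off the alpha/2 and
    1 - alpha/2 percentiles of the resulting distribution.
    """
    rng = random.Random(seed)
    n = len(values)
    boots = sorted(stat([values[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example: 95% CI for a survival proportion of 91% (as in the
# New Mexico series), from a toy sample of 100 patients.
outcomes = [1] * 91 + [0] * 9
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(outcomes, mean)
```

The same machinery, applied to paired differences in ROC area or misclassification rate, yields the p values reported below.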

RESULTS

We first sought to determine how many injuries should be included to optimize ICISS model performance. We created ten models: model 1 with only the most severe injury for each patient, model 2 with only the two most severe injuries for each patient, up to model 10. We find that the ROC values for the models increase monotonically for models 1 through 10, from 0.90 asymptotically approaching 0.92 (Figure 1). We conclude that outcome prediction is improved by using up to the five worst injuries, but is not harmed by using all available injuries. Because using all injuries avoids having to decide which injuries are the worst, we simply define the ICISS as the product of all available SRRs.
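The ten models can be sketched by restricting each patient's score to the k lowest (most severe) SRRs (the helper name and example values are hypothetical):

```python
import math

def iciss_k_worst(srrs, k):
    """ICISS restricted to a patient's k most severe injuries.

    Severity is ordered by SRR: a lower SRR means a more dangerous
    injury, so the k smallest SRRs are the k worst injuries.
    """
    return math.prod(sorted(srrs)[:k])

patient = [0.55, 0.90, 0.99, 0.999]      # SRRs for four injuries
model_1 = iciss_k_worst(patient, 1)      # worst injury only
model_4 = iciss_k_worst(patient, 4)      # all injuries
```

Because every additional nontrivial SRR is below 1, each successive model's score can only fall, while its discrimination (ROC area) can only improve or hold steady, as Figure 1 shows.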

Figure 1
Figure 1:
Receiver Operator Characteristic Curve (ROC) areas as a function of the number of injuries considered in ten different ICISS models. Note that predictive power (ROC area) improves with the addition of more injuries up to a total of six injuries, but is not harmed by the addition of further injuries.

We next compare the ISS to the ICISS for the complete data set. A graphic comparison of ISS and ICISS shows better separation of the survivors and nonsurvivors when ICISS scoring is employed, and this impression is confirmed by the greater separation of the median values for survivors and nonsurvivors in ICISS (survivors 0.93, nonsurvivors 0.45 = 48% of possible range) over ISS (survivors 10, nonsurvivors 29 = 25% of possible range) (Figure 2). Optimal decision cut points (50% mortality) derived from a logistic regression analysis and their corresponding misclassification rates are also shown for ISS and ICISS. We next calculate survival rates by decile for ISS and ICISS and use a resampling methodology to determine 95% confidence intervals (CI) (Figure 3). The sigmoidal shape of the ICISS curve with its steep midrange is typical of tests that discriminate well between two conditions. The more linear graph for ISS implies weaker discrimination. Also of note, an ICISS score of 0.5 corresponds to a 50% survival rate, a convenient characteristic for an outcome prediction function.

Figure 2
Figure 2:
Separation of survivors from nonsurvivors by two injury summary measures: ISS and ICISS. The nonsurvivor populations have been magnified 10-fold to make trends easier to see. Misclassifications are shown as the cross-hatched areas.
Figure 3
Figure 3:
Proportion of patients surviving by decile for ISS and ICISS. The wide 95% confidence intervals in the more lethal ranges of both ISS and ICISS are owing to the relative sparsity of patients in these regions.

Standard statistical measures confirm these observations: ISS has an ROC area of 0.872 (95% CI = 0.85-0.89), whereas ICISS has an ROC area of 0.921 (95% CI = 0.90-0.94). This difference of 0.049 in ROC areas is highly statistically significant (p < 0.0001, paired resampling methodology) and may be clinically relevant (Table 1). Because a perfect scoring system has an ROC area of 1.0, replacing ISS with ICISS closes 38% of the gap between ISS and perfect prediction (0.049/0.128). Misclassification rates (total misclassifications/total sample size) were 7.67% (95% CI = 6.6-7.8%) for ISS and 5.95% (95% CI = 4.7-6.24%) for ICISS (p < 0.0001). This difference represents an overall improvement of 22% in misclassification rate by ICISS over ISS. When the complete data set was divided into blunt and penetrating data sets, ICISS maintained its advantage in both data subsets (Table 1).

Table 1
Table 1:
Resultsa of ISS vs. ICISS.

Finally, we ask whether the improvement in outcome prediction from using the ICISS over the ISS is maintained when Revised Trauma Score (RTS), age, and mechanism (blunt/penetrating) are added to the model to create a POS model. To ensure an accurate comparison, age was modeled for the ICISS probability of survival in the same way as it has been for the traditional TRISS score (age < 55 vs. age ≥ 55). The University of New Mexico data set was divided into two subsets: a blunt data set (n = 2,337) and a penetrating data set (n = 805). Two logistic regression models were constructed for each data subset, one based upon ISS and one based upon ICISS. All patients were then scored using each model, and ROC areas were calculated and compared for both the ICISS-based POS and the traditional ISS-based POS. In both the blunt and penetrating data sets, the ICISS-POS outperformed the ISS-POS: blunt ISS-POS ROC = 0.934, blunt ICISS-POS = 0.950, penetrating ISS-POS = 0.959, penetrating ICISS-POS = 0.973 (Table 1). Again, these results are highly statistically significant (p < 0.0001, resampling methodology).
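The form of such a POS model can be sketched as a logistic function of the anatomic score, RTS, and dichotomized age (the coefficients below are illustrative placeholders, not the values fitted from these data; in practice they come from logistic regression run separately on the blunt and penetrating subsets):

```python
import math

def pos(iciss, rts, age, coeffs):
    """Probability of survival from a TRISS-style logistic model.

    Age is dichotomized at 55, as in TRISS; b0..b3 are regression
    coefficients (hypothetical values used here for illustration).
    """
    b0, b1, b2, b3 = coeffs
    age_flag = 1 if age >= 55 else 0
    logit = b0 + b1 * iciss + b2 * rts + b3 * age_flag
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative coefficients only -- not the paper's fitted model.
blunt_coeffs = (-3.0, 4.0, 0.5, -1.5)
p_young = pos(iciss=0.80, rts=7.84, age=30, coeffs=blunt_coeffs)
p_old = pos(iciss=0.20, rts=7.84, age=70, coeffs=blunt_coeffs)
```

With coefficients of these signs, lower ICISS and age ≥ 55 both depress the predicted probability of survival, as expected.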

DISCUSSION

The evidence that the ICISS better separates survivors from nonsurvivors is statistical, but even the statistically disinclined will appreciate the graphic evidence of better separation of the living from the dead by the ICISS (Figure 2). Misclassification rates are self-explanatory, but can be expected to vary from data set to data set depending upon the proportion of difficult cases in each data set. (Not surprisingly, any predictor system works less well for data sets that contain many cases in which the patients are severely, but not obviously fatally, injured. Such difficult cases will prove problematic for any prediction approach.) Nevertheless, comparison of misclassification rates for two different predictors applied to the same data set is a reliable way to compare the relative accuracy of two predictors. By this measure ICISS outperforms ISS by 22%.

The ROC area approach to evaluating the efficiency of a test was developed for the solution of certain problems in electrical engineering 40 years ago, but has been employed in medicine for only a decade. Formally defined as the area under a graph of sensitivity versus (1 - specificity), an ROC area of 1 represents perfect prediction, whereas an ROC area of 0.5 represents prediction no better than chance and thus a test with no value whatsoever. How much improvement is represented by moving from the ISS ROC area of 0.872 to the ICISS ROC area of 0.921 is less intuitive than the improvement in misclassification rates of ICISS over ISS, but it is also highly significant (p < 0.0001).
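The ROC area also has an equivalent rank interpretation that makes it easy to compute directly: it is the probability that a randomly chosen survivor receives a higher score than a randomly chosen nonsurvivor, with ties counted half. A brute-force sketch (Python; the scores are illustrative):

```python
def roc_area(scores_survivors, scores_nonsurvivors):
    """ROC area via the rank (Mann-Whitney) interpretation.

    Counts, over all survivor/nonsurvivor pairs, the fraction in which
    the survivor scores higher (ties count 0.5).  Equivalent to the
    area under the sensitivity vs. (1 - specificity) curve.  O(n*m).
    """
    wins = 0.0
    for s in scores_survivors:
        for d in scores_nonsurvivors:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(scores_survivors) * len(scores_nonsurvivors))

# A perfectly separating score has ROC area 1.0; a useless one, 0.5.
area = roc_area([0.9, 0.8, 0.7], [0.4, 0.3])
```

On this reading, moving from 0.872 to 0.921 means the chance that a score correctly ranks a random survivor above a random nonsurvivor rises by about five percentage points.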

It is important to note that ICISS was developed using only the North Carolina Hospital Discharge Data Set as a knowledgebase. No information about the University of New Mexico test data set was incorporated in the construction of ICISS. Thus, although the actual performance of ICISS in other data sets will vary depending upon the proportion of indeterminate cases (i.e., the proportion of cases in which predicted mortality is near 50%) present in the data set, ICISS should almost always outperform ISS. In 10,000 resamplings of the University of New Mexico data set, ISS never outperformed ICISS. If the University of New Mexico experience can be considered typical of trauma services elsewhere, this implies that ICISS will outperform ISS in more than 10,000 data sets for every single data set in which ISS proves superior.

Not only does ICISS predict outcome better than ISS, but as a measure of trauma severity ICISS is better behaved. The logit transformation of the ICISS more accurately models the data (Hosmer-Lemeshow goodness of fit 10.6, p = 0.226) than does the logit transformation of ISS (Hosmer-Lemeshow 51.3, p < 0.001) (Table 1).

The sources of improvement of ICISS over ISS are twofold: all injuries are allowed to contribute to prediction, and individual injuries are more accurately modeled. Although most of the information about overall injury severity is contained in a patient's three worst injuries, ICISS derives some advantage from its use of all injuries. More importantly, because ICISS uses all injuries, it necessarily includes a patient's three worst injuries, wherever they occur. ISS, by contrast, is constrained to use only the single worst injury in each of the three most severely injured body regions. Frequently, this constraint results in the ISS ignoring more severe injuries in favor of less severe injuries that happen to be in a different body region. When the ISS was recalculated for the University of New Mexico data set based upon each patient's worst three injuries regardless of body region, 60% of ISSs increased and outcome predictive power significantly improved: the ROC area for ISS rose from 0.87 to 0.90 (p < 0.001, resampling methodology). We infer that about 50% of the improvement in predictive power of ICISS over ISS is owing to this single effect. The balance of the improved predictive power of ICISS is likely because of more accurate modeling of individual injury severities and the use of all injuries in calculating the final outcome measure.

Not only are injury severities measured rather than assigned in the ICISS model, but a virtually infinite number of possible ICISS values are allowed because any number between 0 and 1 is allowed as a SRR, and the ICISS is simply the product of any number of such SRRs. In the University of New Mexico data set, 2,703 different ICISS scores were assigned to 3,142 patients in the data set. Only occasionally were patients with nonidentical injuries assigned equal ICISSs. The ISS is, by contrast, somewhat more idiosyncratic in its behavior and is perhaps best thought of as a categorization scheme, which maps all possible injury combinations into the 44 allowed ISS scores between 0 and 75.

The use of measured rather than assigned values for injury severity by ICISS also contributes to the improved accuracy of prediction. A close study of the SRRs for the 2,034 injury ICD-9s in the North Carolina knowledgebase shows that only about 100 make much of a contribution to mortality, with SRRs of less than 0.80. The identities of these "99 most dangerous injuries" will be grimly familiar to those involved in clinical trauma care (Table 2). Sixty percent are head injuries, 20% are burn injuries, and 13% are major thoracic and abdominal vascular injuries. The remaining "most dangerous injuries" are major extremity amputations, pelvic fractures, and cervical spine fractures. Massive liver injury is also singled out as particularly dangerous (SRR = 0.73).

Table 2
Table 2:
Representative sample from the 99 most lethal traumatic ICD-9s in the North Carolina Hospital Admission Registry and their SRRs.

Although ICISS is a substantial improvement over ISS, ICISS too can be improved upon in at least four ways. First, the North Carolina data base from which the SRRs are derived may not be entirely representative of particular populations that it may be of interest to model. The injuries and outcomes of the entire state of North Carolina might be expected to be very different from those seen at a Level I Trauma Center in Albuquerque, New Mexico. The fact that North Carolina SRRs predict outcome so successfully in New Mexico attests to the robustness of the ICISS methodology. Nevertheless, ICISS would likely perform better if several different sets of SRRs were available: one for rural hospitals, one for urban trauma centers, etc.

Second, the calculation of SRRs uniformly overestimates the severity of trivial injuries. Recall that we define the SRR of each ICD-9 as the simple proportion of all patients sustaining that ICD-9 injury who survive. Unfortunately, when trivial injuries occur in patients with other, more severe injuries who then go on to die, the SRRs of the trivial injuries are underestimated. In effect, the SRRs of trivial injuries are occasionally "contaminated" when they occur in patients who have also sustained more severe injuries.

A third problem is that the proposed multiplicative ICISS model assumes that each injury independently affects outcome, an assumption that is almost certainly wrong. It is well known to trauma surgeons that a gunshot wound to the pancreas and colon is in some way more dangerous than the simple sum of an injury to the colon added to an injury to the pancreas. The number of potential interactions is vast, and a better model may prove elusive.

A final, less thorny difficulty concerns the size of the knowledgebase used to calculate the SRRs in our implementation of ICISS. Although our dictionary of 2,034 SRR values is based upon over half a million injuries in over 300,000 patients, some ICD-9 values were poorly represented in our knowledgebase. One hundred and sixty trauma ICD-9s occurred only a single time in the entire North Carolina knowledgebase, and fully half of trauma ICD-9s occurred fewer than 30 times (Figure 4). It is impossible to derive reliable SRRs on such small patient cohorts. Although ICD-9s that rarely occur are unlikely to have much effect on the overall predictive power of ICISS, they may powerfully affect occasional individual case outcome prediction. The simple dependence of ICISS upon a single SRR dictionary allows an easy remedy, however. We need only expand the knowledgebase used in the construction of SRRs. Such a knowledgebase might require periodic updates should the treatment of traumatic injury evolve and outcomes improve, but this should also be easy.
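Identifying the unstable SRRs is straightforward once occurrence counts are available (a sketch; the threshold of 30 echoes the skew described above, and the codes shown are illustrative, not drawn from the actual knowledgebase):

```python
from collections import Counter

def unreliable_codes(all_injury_codes, min_count=30):
    """Flag ICD-9 codes seen fewer than `min_count` times.

    SRRs estimated from so few patients are unstable; flagged codes
    are candidates for exclusion or for re-estimation once the
    knowledgebase is expanded.
    """
    counts = Counter(all_injury_codes)
    return {code for code, n in counts.items() if n < min_count}

# Toy stream of injury codes: one common code, one rare code.
codes = ["861.21"] * 40 + ["958.3"] * 2
rare = unreliable_codes(codes)
```

In a production SRR dictionary, such flagged codes could simply inherit a pooled SRR from a broader category until enough cases accumulate, though the paper's proposed remedy is to expand the knowledgebase itself.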

Figure 4
Figure 4:
Number of occurrences of each ICD-9 in the North Carolina knowledgebase. Although the average number of occurrences for an ICD-9 was 250, the distribution of occurrences was extremely skewed; hip fractures occurred very commonly (19,000 times), but about one-half of ICD-9s occurred 30 or fewer times. One hundred and fifty ICD-9s occurred only once.

The most legitimate use for ICISS is as a global measure of the likelihood of death. Patients with lower ICISS scores can be said to be "more injured" in the sense that they are more likely to die. In passing, we should note that there are other measures by which a patient could be said to be "more injured:" length of hospital stay, for example, or cost of care, or even amount of pain suffered. ICISS is derived using an indisputable outcome (death), which simplifies its definition and calculation, but limits its interpretation. Thus, two patients with very different injuries will be scored by ICISS as "the same" if they have similar likelihoods of dying, even if their injuries and outcomes are otherwise very different.

Another arena where scoring might be employed is in clinical outcome prediction for individual patients. It is common for the uninitiated to assume that a model with a, say, 5% misclassification rate can be clinically useful by providing guidance to physicians faced with a clinical conundrum. They forget that the 5% error rate applies to the complete data set, not to individual patients. Clinically difficult problems usually represent only a fraction of the entire data set; when this much more difficult to predict subset is modeled, outcome prediction can be very much worse, perhaps in the 40% misclassification range. We have grown accustomed to the fact that machines play chess better, take cube roots more accurately, and recall long lists more completely than the human brain. But in the sphere of pattern recognition, for example, recognizing the clinical picture of futile care, the human brain is without peer. In fact, when it has been studied, clinical acumen has usually proved better than any mathematical model. [17-19] This idea is portrayed in Figure 3: the error bars in the midregion of ICISS scores, where clinicians might want guidance, are disconcertingly wide. In this region of the survival curve, we scarcely know what proportion of patients will survive, much less which individual patients will survive. The midregion overlap of survivors and nonsurvivors for both ISS and ICISS in Figure 2 underscores this problem and is frankly discouraging. It is thus unlikely that any scoring system will be of assistance in directing clinical decision making.

Another hoped-for use of outcome prediction is in assessing resource utilization. Although such a model would be interesting and important, its outcome variable (cost, length of stay, etc.) is obviously not the same as survival. ICISS is thus likely to be a poor predictor of resource utilization. The ICISS methodology might be helpful in constructing such a model, however.

A better system of outcome prediction cannot add information; it can only optimally interpret the available information. Inaccurately or incompletely described patients will not have accurate outcome predictions regardless of the methodology employed. The success of ICISS in the New Mexico data set is in large measure because of the extreme attention given to the assignment of ICD-9 descriptors in our trauma registry. ICD-8 descriptors less carefully assigned have in the past proved far less useful, even useless, [6] but this failure may have in part been caused by the weakness in the ICD-8 as an injury descriptor system. The hope that expensive, carefully maintained trauma registries can be replaced by a bit of clever software using the discharge face sheet of the hospital record will need to be tested and may prove disappointing. A comparison of the accuracy of ICISS scores derived from the University of Vermont's trauma registry and from its hospital information system will be the subject of a forthcoming report.

It is a statistical truism that there is never a best model, only better models. Eventually, however, incremental improvements in our quest for the perfect predictor of outcome will slow and then stop. At that point we will be measuring the "noise level" in outcome prediction: how bad prediction must be rather than how good it can be. Our task is to minimize the number of patients assigned a probability of survival of 50%. For those individuals who inevitably remain in the 50% survival category, however, outcome will remain a coin toss; it is possible to predict the behavior of populations but individuals remain unpredictable.

CONCLUSIONS

ICISS significantly outperforms ISS when predicting outcome, both alone and when incorporated in a full probability of survival model with age, mechanism, and RTS added. This improvement is owing to the use of all injuries as well as to better modeling of each individual injury using data-derived SRRs. One should not conclude that the ICD-9 lexicon offers any intrinsic improvement over the AIS lexicon, however. It is likely that a data-driven modification of AIS severities would result in similar or greater improvements. Indeed, it may be that AIS-based registries and ICD-9-based registries can "peacefully coexist" as long as results are compared through the intermediary of an SRR-based scoring system.

It would be premature to recommend the overthrow of ISS as the lingua franca of the trauma world. Considerable resources are currently allocated to injury description and scoring in the AIS-ISS lexicon, and any change would be expensive and time consuming. Although it may be possible, and considerably less expensive, to use ICD-9 codes from the hospital discharge face sheets in place of independent AIS assignment, this approach has not succeeded in the past. [6] Such an approach would need to be carefully tested before its general adoption could be recommended.

It is difficult to assess how much more accurately or less expensively an injury description system must predict outcome in order to merit its general implementation. In some measure, this will depend upon the uses to which outcome predictions are put, and these are currently far from uniform. If a sufficiently improved system is discovered, we should not shrink from its implementation, but the decision will not be easy.

ADDENDUM

The authors are happy to provide a table of the Survival Risk Ratios derived from the North Carolina Trauma Registry to interested readers. From this table, ICISS scores can be calculated easily. Additionally, a computer program that automatically calculates ICISS scores, called ICISS, is available without charge. This program accepts either individual patient descriptions or batches of ICD-9 codes (in ASCII, *.DBF, *.WK1, or *.DB flat or normalized files) and produces a table of ICISS scores. The program is written in Paradox Application Language, but is compiled and runs as a stand-alone program under DOS. To obtain a copy of ICISS, please send three formatted high-density 3.5-inch floppy diskettes and a return mailer to: Turner Osler MD, University of Vermont, Department of Surgery, Given Building D-319, Burlington, VT 05405. Alternatively, the ICISS program can be downloaded from our World Wide Web site at: http://www.vtmednet.org/surgery/iciss

Acknowledgments

The authors thank Professor Susan Baker for reading this manuscript in draft form. Her suggestions, comments, and criticisms have proven invaluable. The University of New Mexico Trauma Database was made possible by the contributions of over 100 resident and attending physicians and was held together by computer code written by Ron Martinez. Data base management and queries were overseen by Teresa Mosley-Rhodes. Our World Wide Web site is made possible through the generosity of the Division of Neurosurgery at the University of New Mexico.

PAPER DISCUSSION

Dr. Thomas J. Esposito (Maywood, Illinois): The premise of this study is straightforward. Information gleaned from analysis of ICD-9 codes for all injuries sustained is easier and less expensive to obtain while being a better predictor of outcome than is ISS. At first glance, the results are exciting, and their implications far-reaching, but closer scrutiny raises several questions and concerns to be explored before we can say the authors have truly built a better, rather than just different, mousetrap.

My first concern is not with the ISS comparison, but with the reference population upon which it is based. The concern relates to the validity of the mortality risk ratios and whether the North Carolina hospital discharge data base truly represents the universe of trauma patients. Specifically, does the fact that North Carolina has a well-organized and reportedly effective trauma center bias SRRs and their application to other data bases?

Were ICD-9s from the patients who died in the ED or DOA or admitted less than 24 hours included in the discharge data, and what is the impact of their inclusion or exclusion on the SRRs?

Why were certain ICD-9s, such as traumatic shock and traumatic anuria excluded, possibly affecting SRRs in the characterization of population outcomes? What of ICD-9s for complications such as pneumonia?

The average number of injuries per patient, 1.6, seems low in comparison to other data bases, for example, three in MTOS. Might this reflect incomplete cataloguing of injuries, thereby skewing risk ratios?

Was there some check on the completeness and accuracy of ICD-9 coding in the North Carolina data base?

As far as the actual comparison, does the software used to generate New Mexico standard ISS and its TRISS outcomes influence the performance relationships exhibited? Would the use of purely manual ISS calculation or commercially available software, such as Tri Code or ICD-9 mapping, which also address the issues of variability and labor intensity of ISS calculation, produce different results and conclusions?

While differences in ROC characteristics and misclassification rates are statistically significant, are they operationally significant? The authors acknowledge that misclassification rates and ISS performance in other data sets will vary depending on the number of "difficult, hard, or indeterminate cases." Perhaps they can define these terms and the impact of this on the universality of their results and conclusions.

Nationally, ICD-9 coding practices and personnel vary greatly. Some discharge data bases limit the number of codes captured. Some assign and rank codes based on reimbursement rather than severity of illness. Coding in the New Mexico data set was highly controlled. This again limits our ability to accept these results and conclusions in the context of the real world.

Other scoring systems exist which address the failure of ISS to take into account the synergistic and additive effects of all injuries on outcome. However, these are not acknowledged in the manuscript. Both the anatomical profile and ASCOT have shown substantial improvements in outcome prediction over ISS based scores. The authors should consider comparison of their method with these tools for ease of use, predictive performance, and cost.

Finally, can the authors provide any data that actually show a reduction in cost, time, or other tangible benefits for an individual hospital or trauma systems as a result of replacing ISS with ICISS methodology?

In summary, the study leaves me beguiled and bedazzled but unconvinced and yearning for more powerful evidence that this tool is better, rather than different. The manuscript states that it is difficult to assess how much more accurate, or less expensive, an injury description must be to merit its general implementation.

While I applaud these efforts, this study reinforces that statement in not adequately addressing either of these key issues. I encourage the authors to refine and continue this important and sentinel work, and I thank the Association for the privilege of being asked to comment upon it.

Dr. Howard R. Champion (Baltimore, Maryland): I would like to congratulate the authors on both the science and the marketing behind this new product. I believe that some of the statements and conclusions are warranted, even if not supported by data!

I have one question which I think is very serious, and I hope the answer is not what I suspect from reading the manuscript; and that is--it appears to me that this "risk ratio" was derived on the same data that it was tested on. If so, how would this influence the results? It would not be surprising that the results are fairly convincing. How are the authors planning to correct this methodological problem?

The other thing, just to reiterate what Dr. Esposito said, it would be worthwhile comparing this to the anatomical profile, particularly as they think that some of the problems with ISS are related to the number of injuries and location of injuries. This has been corrected with the anatomical profile methodology.

I think this is important work and should proceed.

Dr. Frederick A. Moore (Denver, Colorado): Traditional trauma scores have been designed to predict in-hospital mortality. Unfortunately, in-hospital deaths occur in two different peaks, early and late. While the early deaths are a test of our ability to deliver acute care (often in the middle of the night), it is the patients who die later in the ICU that consume a lot of resources. They are different populations that most likely require different predictive models. Have you tested your system in a trauma intensive care unit? Additionally, for both end points (i.e., early deaths and late deaths) other models have required the inclusion of host factors and acute physiologic data to optimize prediction. Have you considered this? Thank you.

Dr. John R. Clarke (Philadelphia, Pennsylvania): I would remind the audience that the form described for the calculation is in fact the odds ratio form of Bayes theorem and is a very traditional and acceptable form of combining information. It does, however, require the assumption of independence. If you independently include two items that are in fact dependent, your discriminating ability will remain the same, but the probabilities will be forced towards one or zero. This might explain the separation that they see; that is, they correctly classify people as being alive or dead, but it is not as well calibrated as it could be, in that, if it predicts a 75 percent probability, it should be 3 of 4.
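[Editor's note: Dr. Clarke's point can be illustrated numerically. The following toy sketch is not from the paper; the function name `combine`, the prior of 0.5, and the likelihood ratio of 3.0 are illustrative assumptions. It shows that counting the same evidence twice in the odds-ratio form of Bayes' theorem leaves the ranking of patients unchanged but pushes the predicted probability toward an extreme.]

```python
def combine(prior_p, likelihood_ratios):
    """Odds-ratio form of Bayes' theorem: posterior odds equal the
    prior odds times the product of the likelihood ratios. Assumes
    the ratios are independent; if they are not, the result is
    overconfident."""
    odds = prior_p / (1 - prior_p)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

lr = 3.0  # one item of evidence favoring death 3:1 (illustrative)
print(combine(0.5, [lr]))      # 0.75: correct when the item is counted once
print(combine(0.5, [lr, lr]))  # 0.90: same evidence counted twice
```

With the evidence counted once, the model predicts 75 percent and, if calibrated, 3 of 4 such patients should die; counted twice, the same information yields an overconfident 90 percent, exactly the miscalibration Dr. Clarke describes.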

I wonder, if the authors have looked at any of the conditional dependencies that might be predicted?

In reference to their independent, isolated predictions, did they identify any whose confidence limits spanned one, that is, whose confidence limits were such that they could be eliminated from the equation because they could not predictably drive the mortality one way or the other? Thank you.

Dr. Michael D. McGonigal (St. Paul, Minnesota): I enjoyed the paper, although I admit I was slightly confused listening to the presentation and comparing that with the abstract I read in the book.

I share Dr. Esposito's concerns about the underlying data in this study. It appears that it depends completely on accurate death data for each ICD-9 code. Now, this mortality data will vary by region, by state, and even by hospital. It also will slowly change with time. What data should you use? Should you do as the authors say, and use a larger and larger data set, or should you use a smaller one that more accurately reflects the local flavor of care as well as the local population?

And how often do you have to update this data to maintain its accuracy?

I enjoyed the paper.

Dr. Turner M. Osler (closing): I want to thank Dr. Esposito for his thoughtful criticism, and the other discussants for their questions. If I may, I will take these questions out of order.

Dr. Champion, it is crucial to recognize that two separate data sets were used in this project. The survival risk ratios used to predict outcome in the New Mexico data set were derived from a completely independent data set collected in North Carolina. Thus, we were not "chasing our tails," that is, we were not using the New Mexico results to predict New Mexico results.

This said, Dr. Esposito, the North Carolina data set is a problem. It is certain that not all injuries were coded, since we had only 1.6 injuries per patient in the North Carolina data set, differing sharply from all other published data sets. The great surprise for me, however, is that this incomplete data set from North Carolina worked so well in predicting outcome in New Mexico. Had a better reference data set been available to calculate survival risk ratios, it is likely that ICISS would have performed even better. Thus, we can view the shortcomings of the North Carolina data set as a vindication of the ICISS methodology.

Dr. McGonigal points out that the predictive power of ICISS will depend upon the reference data set used to calculate survival risk ratios, and asks which reference data set is appropriate, and how often such data sets should be updated. We have at present only a single reference data set which is representative of an average of the entire state of North Carolina. It is likely that survival risk ratios derived from a patient population more like the Level I Trauma Center in Albuquerque would have better predicted outcome in our test data set. In practice, I can imagine that several sets of survival risk ratios would be useful: one for urban trauma centers, one for non-trauma centers, one for rural hospitals, and so on. This arrangement would allow better outcome prediction and fairer comparisons of performance. Sets of survival risk ratios will obviously need to be updated as trauma care evolves. I would imagine that updates every few years would be adequate.
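[Editor's note: the ICISS calculation described here — a survival risk ratio per ICD-9 code derived from a reference data set, multiplied across a patient's injuries — can be sketched as follows. The records, counts, and code numbers below are made-up toy data, not taken from the North Carolina or New Mexico registries.]

```python
from collections import Counter
from math import prod

# Hypothetical reference data set: each record lists the ICD-9 trauma
# codes assigned to one patient and whether that patient survived.
reference = [
    (["807.03", "860.0"], True),   # rib fractures + pneumothorax, survived
    (["807.03"], True),
    (["860.0"], False),
    (["860.0"], True),
    (["805.2"], True),
]

# Survival risk ratio (SRR) for a code: the fraction of reference
# patients carrying that code who survived.
seen, survived = Counter(), Counter()
for codes, alive in reference:
    for code in set(codes):
        seen[code] += 1
        if alive:
            survived[code] += 1
srr = {code: survived[code] / seen[code] for code in seen}

def iciss(codes):
    """ICISS for a patient: the product of the SRRs of all injuries,
    which assumes the injuries act independently on survival."""
    return prod(srr[c] for c in codes)

print(iciss(["807.03", "860.0"]))  # 1.0 * (2/3) on this toy data
```

Swapping in a different reference data set changes only the `srr` table, which is why, as Dr. McGonigal notes, the choice and currency of that reference population drive the score's predictive power.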

Dr. Clarke points out that the ICISS methodology of simply taking the product of all survival risk ratios to arrive at a final prediction of survival requires the assumption of independence. We have not yet been able to solve the problem of potential interactions between injuries, since there are over 4 million potential binary interactions between the 2,034 ICD-9 trauma codes, but we are working on this problem. Dr. Clarke's concern that our inability to deal with the independence problem may result in poor calibration of the ICISS model seems not to be a problem, however, since the Hosmer-Lemeshow statistic for ICISS of 10.6 (p = 0.226) suggests our model is well calibrated.

Dr. Moore points out the need to add physiologic data to the anatomic ICISS score. We have done this with the New Mexico data set, and, not surprisingly, predictive power increases. Gratifyingly, ICISS augmented with age, RTS, and mechanism significantly outperforms the standard TRISS methodology (ROC TRISS = 0.938, ROC ICISS + age + RTS + mechanism = 0.954).

To address quickly the remainder of Dr. Esposito's questions: Three "trauma" ICD-9 (800-959.9) codes were excluded in the calculation of ICISS: 958.0 (air embolism), 958.4 (traumatic shock), and 958.5 (traumatic anuria). Since these diagnoses are not anatomic injuries, they are not available to ISS. We felt that they should be excluded to assure a fair comparison between ISS and ICISS. Had they been included, ICISS would have performed even better.

I agree with Dr. Esposito that the ICD-9s used in our data set were better than those available on most hospital discharge face sheets. It is not at all certain that outcome prediction based upon hospital information system generated ICD-9s would outperform ISS. This, however, is a testable hypothesis, and we plan to do it soon.

Finally, Dr. Esposito, so far the University of New Mexico has saved very little money on this project. In fact, in order to compare ISS with an ICD-9 based system, we have borne the expense of two trauma registries for the last 5 years. I calculate that the data I have shown you this morning cost us $100,000 to accumulate.

What we have shown is that carefully assigned ICD-9s, coupled with empirically derived survival risk ratios, substantially outperform the venerable but dated ISS. Whether expensive trauma registries can be replaced by ICD-9s from hospital discharge summary face sheets coupled with a bit of software remains to be seen.

REFERENCES

1. De Haven H: The Site, Frequency and Dangerousness of Injury Sustained by 800 Survivors of Light Plane Accidents. Crash Injury Research, Department of Public Health and Preventive Medicine, Cornell University Medical College, New York, July, 1952
2. Committee on Medical Aspects of Automotive Safety: Rating the severity of tissue damage. JAMA 215:277, 1971
3. Baker SP, O'Neill B, Haddon W, et al: The injury severity score: A method for describing patients with multiple injuries and evaluating emergency care. J Trauma 14:187, 1974
4. Champion HR, Sacco WJ, Lepper RL, et al: An anatomic index of injury severity. J Trauma 20:197, 1980
5. Levy PS, Goldberg J, Rothrock J: The revised estimated survival probability index of trauma severity. Public Health Rep 97:452, 1982
6. Goldberg JL, Goldberg J, Levy PS, et al: Measuring the severity of injury; the validity of the revised estimated survival probability index. J Trauma 24:420, 1984
7. Champion HR, Copes WS, Sacco WJ, et al: A new characterization of injury severity. J Trauma 30:539, 1990
8. Markle J, Cayten CT, Burne DW, et al: Comparison between TRISS and ASCOT methods in controlling for injury severity. J Trauma 33:326, 1992
9. Champion HR, Copes WS, Sacco WJ, et al: Improved predictions from a severity characterization of trauma (ASCOT) over trauma and injury severity score (TRISS): Results of an independent evaluation. J Trauma 40:42, 1996
10. Cales RH: Hospital Trauma Register, version 4.1, 1986. One Pacific Marina, #708, Alameda, CA 94501
11. Osler T: Injury severity scoring: Perspectives in development and future directions. Am J Surg 165:43S, 1993
12. Boyd CR, Tolson MA, Copes WS: Evaluating trauma care: The TRISS method. J Trauma 27:370, 1987
13. Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29, 1982
14. Hosmer DW, Lemeshow S: Applied Logistic Regression. New York, John Wiley & Sons, 1989, pp 140-145
15. SAS Institute Inc., Logistic Regression Examples Using the SAS System, Version 6, First Edition. Cary, North Carolina, SAS Institute, 1995, pp 1-73
16. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. New York, Chapman & Hall, 1993, pp 1-113
17. Meyer AA, Messick WJ, Young P, et al: Prospective comparison of clinical judgment and APACHE II score in predicting the outcome in critically ill surgical patients. J Trauma 32:747, 1992
18. Marks RJ, Simons RS, Blizzard RA, et al: Predicting outcome in intensive therapy units--a comparison of APACHE II with subjective assessments. J Intens Care Med 17:159, 1991
19. Kruse JA, Thill-Baharozian MC, Carlson RW: Comparison of clinical assessment with APACHE II for predicting mortality risk in patients admitted to a medical intensive care unit. JAMA 260:1739, 1988
© Williams & Wilkins 1996. All Rights Reserved.