In contemporary clinical practice, a large number of laboratory tests are performed to facilitate diagnosis, to assess disease progression or the effect of a therapy, and for prognostication. With the advent of genomic science, an even larger number of variables (mRNAs, proteins, and metabolites) can be measured in blood or tissues in a cost-effective way. For anesthesiologists, it is likely that many different metabolites, proteins, or neural variables will be measured in the perioperative setting that could provide data useful for patient assessment and treatment. Genetic risk factors can be measured to identify individuals at increased risk for disease or perioperative complications, or for predicting surgical outcome. Also, analyzing joined datasets from multiple institutions' anesthesia information management systems (AIMS) through data-mining techniques could identify threshold parameters that are associated with undesirable outcomes or that can be used to assess risk.

There is reason to believe that additional data and genetic risk factor measurements will improve our ability to diagnose, stratify, and optimize the perioperative care of our patients. For example, a quantitative simulation demonstrated how the use of pharmacogenetic information to individualize drug dosage has the potential to significantly improve treatment outcome.^{1}

Nevertheless, several publications have suggested that newly discovered laboratory tests would have limited diagnostic utility for individual patients. The argument advanced was that in order for a laboratory measurement to have clinically useful performance characteristics (sensitivity and specificity), the magnitude of the odds ratio (OR) for the test (Text Box) must be considerably higher than those seen in most etiologic or epidemiologic studies.^{2}^{–}^{4} For example, a hypothetical analysis by Ware^{2} indicated that an OR of 228 would be required for a test result to have sufficient diagnostic or predictive utility (80% specificity and 80% sensitivity).

## Text Box: OR Definitions and Data Distribution.

The OR compares the measured value for a risk factor within a population with a disease to that in a control population without the disease. Traditionally, the relationship between a binary test result and a binary outcome is represented in a two-way table. For example, the Bispectral Index (BIS) is used to titrate the administration of drugs during general anesthesia to minimize the risk of intraoperative awareness.^{5} This measure can be used to illustrate how a quantitative measure can be converted into a binary parameter (“BIS status”). Suppose analysis of an AIMS database that includes adverse outcome data leads to a putative test to predict the occurrence of intraoperative awareness if the BIS were more than 70 for more than 5 min during the interval from intubation to the end of surgery. Such a test might be useful to guide the rapidity with which BIS increases should be treated as well as to identify patients who need postoperative follow-up to assess and possibly treat the consequences of intraoperative awareness. Cases where the BIS was elevated as such are characterized as BIS+ and those where there were no such epochs as BIS-. Then, we can construct the following two-way table:

| | Awareness | Amnesia |
|---|---|---|
| BIS- | n_{11} | n_{12} |
| BIS+ | n_{21} | n_{22} |

We have a total of n individuals (n = n_{11}+n_{12}+n_{21}+n_{22}), where n_{11} of them are BIS- and have awareness during the surgery and n_{12}, n_{21} and n_{22} are similarly defined according to the row and column labels for the corresponding cells in the table. Then, the odds of having awareness versus amnesia during the surgery in the BIS negative group are n_{11}/n_{12}. The odds of having awareness versus amnesia in the BIS positive group are n_{21}/n_{22}. The OR is then calculated as the ratio of these two odds: OR = (n_{11}/n_{12})/(n_{21}/n_{22}) = (n_{11}n_{22})/(n_{21}n_{12}). If the criterion of the BIS being more than 70 for >5 min were a good indicator of outcome, then n_{11} and n_{22} would be large, and n_{12} and n_{21} would be small in comparison. As a consequence, a large OR would be obtained. In contrast, a small OR (close to 1) indicates that the test result has a small effect on outcome probability and therefore that the test performs poorly.

Several other important measurements are frequently used to assess test performance. If we denote the individuals having amnesia during the surgery as cases, and those with awareness as controls, then n_{11}, n_{12}, n_{21} and n_{22} are known as true negative (TN), false negative (FN), false positive (FP) and true positive (TP), respectively. *Specificity* is the probability that a control individual is correctly classified, and *sensitivity* is the probability that a case individual is correctly classified. Additionally, the *positive predictive value* and *negative predictive value* are the probabilities that an individual predicted to be a case or a control, respectively, is actually a case or a control. The following table illustrates how these values are calculated:

| Measure | Calculation |
|---|---|
| Specificity | TN/(TN + FP) = n_{11}/(n_{11} + n_{21}) |
| Sensitivity | TP/(TP + FN) = n_{22}/(n_{22} + n_{12}) |
| Positive predictive value | TP/(TP + FP) = n_{22}/(n_{22} + n_{21}) |
| Negative predictive value | TN/(TN + FN) = n_{11}/(n_{11} + n_{12}) |
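The quantities defined in this Text Box can be computed directly from the four cell counts. The following is a minimal Python sketch that uses the cell labels n_{11} to n_{22} as defined above; the class name `TwoWayTable` and the example counts are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class TwoWayTable:
    """Cell counts laid out as in the Text Box (row = test result, column = outcome)."""
    n11: int  # test negative, control (true negative, TN)
    n12: int  # test negative, case    (false negative, FN)
    n21: int  # test positive, control (false positive, FP)
    n22: int  # test positive, case    (true positive, TP)

    def odds_ratio(self) -> float:
        # OR = (n11/n12) / (n21/n22) = (n11 * n22) / (n21 * n12)
        return (self.n11 * self.n22) / (self.n21 * self.n12)

    def specificity(self) -> float:
        # Fraction of controls that test negative.
        return self.n11 / (self.n11 + self.n21)

    def sensitivity(self) -> float:
        # Fraction of cases that test positive.
        return self.n22 / (self.n22 + self.n12)

    def ppv(self) -> float:
        # Fraction of positive results that are cases.
        return self.n22 / (self.n22 + self.n21)

    def npv(self) -> float:
        # Fraction of negative results that are controls.
        return self.n11 / (self.n11 + self.n12)


# Hypothetical counts chosen only for illustration.
table = TwoWayTable(n11=80, n12=20, n21=20, n22=80)
print(table.odds_ratio(), table.sensitivity(), table.specificity())
```

Note that the OR computed this way applies to a single binary cutoff; it is not the percentile-based OR of Ware discussed in the next paragraph.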

However, by focusing on different case and control populations, different methods can be used to calculate the OR, which can markedly affect its value. In the analysis by Ware,^{2} the OR is defined as the ratio of the frequency of an event occurring within the group with the disease (cases) whose risk factor value places them at the 90th percentile (Figure T1, left panel, arrow labeled Group 2) relative to the frequency of events within the control group whose risk factor value is at the 10th percentile (Figure T1, left panel, arrow labeled Group 1). The difference in the mean frequencies between the case and control groups is used to calculate the OR for a risk factor. More conventionally, the OR is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group, where the first group comprises all values at or above a given value (90th percentile; Figure T1, right panel, orange area) and the second group comprises all values at or below a given value (10th percentile; Figure T1, right panel, blue area). The OR can be determined more easily and accurately with the conventional method.

The data distribution within a population also affects laboratory data performance. The data distributions for populations with a normal (blue), a double-exponential (green), or a Cauchy (red) distribution are shown in Figure T2. Note the differences at the extremes of the population distribution, where the Cauchy and double-exponential curves do not tail off as rapidly as the normal curve does. The maximum likelihood method is used to fit the actual data to different types of distributions. The fitted curve can be overlaid on the histogram. Visual inspection provides an estimate of the goodness-of-fit of the fitted distribution, and statistical tests (the Kolmogorov-Smirnov test^{6} for general distributions, and the Jarque-Bera^{7} or Shapiro-Wilk^{8} tests for the normal distribution) can be used to rigorously assess the fit.
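As an illustration of the fitting step described above, the sketch below fits a normal and a double-exponential distribution by maximum likelihood (both have closed-form estimates) and reports the Kolmogorov-Smirnov distance of each fit. It is a simplified, standard-library-only example on simulated data, not the procedure used for the figures; the function names are hypothetical. The Cauchy fit is omitted because its maximum likelihood estimate requires numerical optimization, and P values are omitted because the usual Kolmogorov-Smirnov reference distribution does not apply directly when parameters are estimated from the same data.

```python
import math
import random
import statistics


def fit_normal(xs):
    # Maximum likelihood estimates: sample mean and population SD.
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs, mu)
    return mu, sigma


def fit_laplace(xs):
    # Maximum likelihood estimates for a double-exponential (Laplace) distribution:
    # location = median, scale = mean absolute deviation from the median.
    loc = statistics.median(xs)
    scale = statistics.fmean(abs(x - loc) for x in xs)
    return loc, scale


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def laplace_cdf(x, loc, scale):
    z = (x - loc) / scale
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def ks_distance(xs, cdf):
    # Largest gap between the empirical CDF and the fitted CDF,
    # checked just before and just after each jump of the empirical CDF.
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Simulated double-exponential data (hypothetical, for illustration only).
rng = random.Random(0)
data = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(5000)]

mu, sigma = fit_normal(data)
loc, scale = fit_laplace(data)
d_normal = ks_distance(data, lambda x: normal_cdf(x, mu, sigma))
d_laplace = ks_distance(data, lambda x: laplace_cdf(x, loc, scale))
print(f"KS distance, normal fit: {d_normal:.4f}; double-exponential fit: {d_laplace:.4f}")
```

A smaller distance indicates a closer fit, so for these simulated data the double-exponential fit should come out ahead of the normal fit.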

If this level of certainty is truly required, it would be virtually impossible to identify a genetic test with acceptable performance for the vast majority of clinical situations. This is because ORs for most of the genetic risk factors for quantitative traits or susceptibility to common diseases that have been identified are usually much lower than the proposed threshold. For example, analysis of the 1967 identified human single nucleotide polymorphisms with a reported OR for a studied trait^{a} indicated that 67% have ORs <5, and 95% have ORs <30.

In a similar negative vein, Pepe et al.^{4} investigated the relationship between the OR and classification accuracy. They analyzed hypothetical data with a normal distribution and concluded that an OR of 74 was required to obtain clinically useful performance characteristics (79% sensitivity and 79% specificity). Taken at face value, these 2 analyses provide a pessimistic outlook for the utility of laboratory tests, because they imply that the performance of most laboratory tests will not be sufficient to have much diagnostic or predictive utility.

However, these analyses^{2}^{–}^{4} are based on a fundamental assumption that laboratory data are normally distributed within case and control populations. They also assume that currently used disease definitions and classification schema, developed before the discovery of genetic risk factors or novel laboratory tests, will continue to be applied.

In this article, we demonstrate that the clinical utility of laboratory data has a much better prognosis than that suggested by Ware and by Pepe et al. The vast majority of laboratory results do not have a normal distribution within either control or disease populations. As a consequence, the performance characteristics (i.e., sensitivity and specificity for disease diagnosis) of laboratory tests are substantially improved over Gaussian assumptions. Thus, laboratory data whose performance characteristics are within the usual range observed in epidemiologic or etiologic studies can have substantial diagnostic utility. These considerations lead to a more optimistic assessment of the utility of laboratory tests in improving patient care.

## METHODS

### Analysis of Clinical Laboratory Test Data

All available data for 700 clinical laboratory tests performed between 2000 and 2006 were retrieved from the Stanford Translational Research Integrated Database (STRIDE) according to a protocol that was approved by the IRB. These tests were routinely performed at the Stanford and Lucile Packard Children's Hospitals, and included patients between 0 and 108 years of age. At least 1000 measurements were available for each test (minimum 1001, maximum 2,090,227, median 3466). The Jarque–Bera test^{7} was used to evaluate the normality of the data distribution in each test. The test was performed on the original data and on the logarithm-transformed data, which is a transformation commonly used to analyze data with a significant rightward skewing. The larger of the 2 *P* values obtained was reported. The Benjamini–Hochberg adjustment method was then used to adjust for multiple testing.^{9} Because the null hypothesis was that the distribution was indeed normal, the smaller the adjusted *P* value, the greater the deviation from normal. When the adjusted *P* value was smaller than 0.05, the null hypothesis was rejected at the 5% significance level (i.e., the data were not normally distributed).
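The normality screen described above can be sketched as follows. This is a standard-library Python illustration under stated assumptions, not the code used in the study: the Jarque–Bera statistic is computed from the sample skewness and kurtosis, its *P* value uses the asymptotic chi-square distribution with 2 degrees of freedom, and the handling of nonpositive values (falling back to the raw data only) is an assumption, because the text does not describe it. The function names and simulated datasets are hypothetical.

```python
import math
import random


def jarque_bera_p(xs):
    # JB = n/6 * (S^2 + (K - 3)^2 / 4), with S the sample skewness and K the sample kurtosis.
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 df,
    # whose upper-tail probability is exp(-JB / 2).
    return math.exp(-jb / 2.0)


def larger_p_raw_or_log(xs):
    # Report the larger of the P values for the raw and the log-transformed data.
    p_raw = jarque_bera_p(xs)
    if min(xs) <= 0:
        return p_raw  # assumption: log is undefined here, so only the raw data are tested
    return max(p_raw, jarque_bera_p([math.log(x) for x in xs]))


def benjamini_hochberg(pvals):
    # Step-up adjustment: p_(i) * m / i, made monotone from the largest rank down.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Three simulated "tests" (hypothetical data, for illustration only).
rng = random.Random(0)
tests = {
    "lognormal": [rng.lognormvariate(0.0, 0.5) for _ in range(2000)],
    "exponential": [rng.expovariate(1.0) for _ in range(2000)],
    "normal": [rng.gauss(10.0, 1.0) for _ in range(2000)],
}
raw_p = [larger_p_raw_or_log(xs) for xs in tests.values()]
adjusted_p = benjamini_hochberg(raw_p)
for name, p in zip(tests, adjusted_p):
    print(name, "nonnormal" if p < 0.05 else "not rejected", p)
```

In this sketch, the exponential sample should be rejected both before and after log transformation, whereas the lognormal sample is strongly nonnormal only on the raw scale.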

### Biomarker Data Analysis

For the 3 types of laboratory data studied in detail, deidentified data were obtained from the STRIDE. The International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9) codes for each individual with an available laboratory value were evaluated to identify the control and disease populations by using the STRIDE Anonymous Patient Cohort Discovery Tool that was developed by the Stanford Center for Clinical Informatics.^{10} The hemoglobin A1C (immunoassay), CD19^{+} cell counts (flow immunocytometry), and protein S activity (automated latex immunoassay) were measured using standard protocols in the clinical laboratory at Stanford University.

The following ICD-9 codes were used to classify individuals with available laboratory data: diabetes (250), lymphoma (200,201), and coagulopathy (286). There were 33,958 A1C measurements from 20,590 nondiabetic individuals and 41,541 A1C measurements from 10,677 diabetic individuals from the database. CD19+ cell counts included 17,706 measurements from 4498 individuals not diagnosed as having a lymphoma, and 1861 measurements from 541 individuals with a lymphoma. Protein S activities included 3701 measurements from 3385 individuals without a diagnosed coagulopathy and 519 measurements from 420 individuals with a diagnosed coagulopathy. When a test was performed more than once on an individual, the value obtained on the initial visit was used. (However, results were insensitive to the substitution of lab values from subsequent visits.)

The A1C lab values were strongly right-skewed; therefore, a log transformation was applied to make the data distribution more symmetric. Because values of 0 were obtained for some CD19 and protein S measurements and log(0) is undefined, 1 was added to each original CD19 and protein S value before log transformation. All subsequent analyses were performed on the transformed data.
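The transformation step can be sketched in a few lines of Python. The logarithm base is not stated in the text, so it is left as a parameter here (natural logarithm by default, which is an assumption); the function name and the example values are hypothetical.

```python
import math


def log_transform(values, offset=0.0, base=math.e):
    # offset=1.0 keeps zero values finite, because log(0) is undefined.
    return [math.log(v + offset, base) for v in values]


a1c_values = [5.2, 6.1, 9.8]   # hypothetical A1C values
cd19_counts = [0, 12, 250]     # hypothetical CD19+ counts, including a zero

log_a1c = log_transform(a1c_values)
log_cd19 = log_transform(cd19_counts, offset=1.0)
print(log_a1c, log_cd19)
```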

Additional information about the OR determinations for different data distributions and for A1C and other lab test data, and for the other simulation studies, is provided in the online Supplementary Methods (see Supplemental Digital Content 1, http://links.lww.com/AA/A172) for this journal.

## RESULTS

### Clinical Laboratory Data Usually Are not Normally Distributed

Even if most of the population data appear to follow the shape of a normal distribution, the values at the tails (within the population extremes) may diverge markedly from that predicted by the normal distribution. For instance, the measured serum concentration of a protein could have a nonnormal distribution within the extremes of a population because of a limitation in synthetic capacity, a biological feedback mechanism that limits its maximum concentration, or a threshold level of stimulation that may be required to initiate the synthesis of a disease-associated protein. Any of these effects would flatten the data distribution curve at the extremes of control or diseased populations. This is important because calculation of ORs involves assessments at the tails of the distribution, whether one follows the calculation method described by Ware or the more conventional approach (Text Box).

To assess the possibility that the prior estimates of ORs required for adequate test performance are overly conservative because of deviations from the normal distribution at the tails, we evaluated the data distribution of all 700 clinical laboratory tests. A detailed example follows.

The A1C test (HbA1c, glycated hemoglobin, or glycosylated hemoglobin) is a commonly used laboratory test that reflects the effectiveness of blood glucose regulation.^{11} Histograms of A1C lab values for 20,590 control (nondiabetic) individuals and 10,677 diabetic individuals reveal a deviation of these data from a normal distribution in either population (Fig. 1). Although a superficial visual inspection of the shape of the data distribution might seem to indicate a normal distribution, visual inspection is not a rigorous method for such determinations. Therefore, we used the Jarque–Bera test to assess the normality of these data. The resulting *P* values are nearly zero for the case and control populations, indicating that these data have a highly nonnormal distribution, which is consistent with the shape of the histograms. The same data are also graphed as quantile–quantile plots, where the A1C lab values are plotted in relation to the percentile for the theoretical normal distribution and departures from linearity indicate where the data do not have a normal distribution. These plots demonstrate that the distribution of A1C lab values at the extremes deviates significantly from the normal distribution, especially in the control population (Fig. 1).
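The coordinates of a normal quantile–quantile plot such as the one described above can be generated as follows. This is a minimal sketch using Python's `statistics.NormalDist`; the plotting position (i + 0.5)/n is one common convention and is an assumption here, as are the function name and the simulated data.

```python
import random
from statistics import NormalDist


def normal_qq_points(values):
    # Pair each ordered observation with the standard normal quantile
    # at plotting position (i + 0.5) / n, for i = 0 .. n - 1.
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, SD 1
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


# Right-skewed simulated data (hypothetical, for illustration only).
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 0.5) for _ in range(1000)]
points = normal_qq_points(sample)
print(points[0], points[-1])
```

Plotting the second coordinate against the first gives the quantile–quantile plot; points that bend away from a straight line at either end mark the tails in which the data depart from normality.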

We analyzed the raw data for all 700 laboratory tests with the same methodology, and found that 699 (99.9%) tests were not normally distributed. After log-transformation, the data in 694 (99.1%) tests remained nonnormally distributed. The serum transferrin level was the only test whose data had a normal distribution; only 5 other tests (B-type natriuretic peptide, 1-hour glucose tolerance test, glucostatin, hematocrit, and total protein) were normally distributed after log-transformation. Thus, the data for nearly all laboratory tests do not have a normal distribution.

### A Nonnormal Distribution Significantly Alters Laboratory Data Performance

Because the ORs and laboratory data performance are assessed with data obtained from the extremes of a population (**Text Box**), the distribution of the data within the population extremes has a large effect on its utility for disease diagnosis. For example, if laboratory data were more flatly distributed at the extremes, the data distribution would resemble a double-exponential^{12} or a Cauchy distribution^{13} (**Text Box**). Because the probability densities of these two distributions decay more slowly in the tails, they may fit the actual data distribution at the extremes better than the normal distribution does.
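The difference in tail behavior can be made concrete by comparing the upper-tail probabilities of the three distributions. The sketch below uses closed-form expressions with location 0 and unit scale for each distribution; because the Cauchy distribution has no finite variance, the three scales are not strictly comparable, so the output should be read as a qualitative illustration only. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def upper_tail_normal(z):
    # P(X > z) for a standard normal variable.
    return 1.0 - NormalDist().cdf(z)


def upper_tail_double_exponential(z):
    # P(X > z) for a Laplace variable with location 0 and scale 1.
    return 0.5 * math.exp(-z) if z >= 0 else 1.0 - 0.5 * math.exp(z)


def upper_tail_cauchy(z):
    # P(X > z) for a standard Cauchy variable.
    return 0.5 - math.atan(z) / math.pi


for z in (2.0, 3.0, 4.0, 5.0):
    print(
        f"z = {z:.0f}: normal {upper_tail_normal(z):.2e}, "
        f"double-exponential {upper_tail_double_exponential(z):.2e}, "
        f"Cauchy {upper_tail_cauchy(z):.2e}"
    )
```

The normal tail shrinks fastest, the double-exponential tail shrinks exponentially, and the Cauchy tail shrinks only in proportion to 1/z, which is why far more observations fall in the extremes under the latter two distributions.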

To investigate the potential implications of this effect, we plotted the OR as a function of the sensitivity and specificity for lab test data with a normal or a Cauchy distribution (Supplemental Figs. 1 [see Supplemental Digital Content 2, http://links.lww.com/AA/A173] and 2 [see Supplemental Digital Content 3, http://links.lww.com/AA/A174]; see Supplementary Methods section for figure legends, http://links.lww.com/AA/A172). The dramatically different shapes of these curves demonstrate that laboratory test performance (sensitivity and specificity) at a specified OR is markedly altered if the data are not normally distributed.

Furthermore, the effect of a nonnormal data distribution is especially pronounced under conditions in which high sensitivity and specificity are required. To illustrate this, we prepared tables showing the values of ORs calculated for clinically useful levels of sensitivity and specificity for data with a normal, a double-exponential, or a Cauchy distribution (Tables 1 and 2). The OR definition applied clearly impacts laboratory test performance. In general, use of the Ware definition increased the ORs that were required for a laboratory test to have clinically useful performance (80%–90% sensitivity and 80%–90% specificity) by >10-fold in relation to the conventional OR definition.

However, independent of which OR definition was used, the data distribution within the extremes of a population had a very significant effect on test utility and performance characteristics. A laboratory test with 80% sensitivity and 90% specificity requires an OR (Ware definition) of 231 for normally distributed data. However, using the same assumptions as Ware (the data distribution in the case and control populations has the same shape), an OR of 15 or 25 can provide the same performance for data with a Cauchy distribution or a double-exponential distribution, respectively. The same trend was noted when the conventional OR definition was applied. More dramatically, an OR of 14,910 is required to achieve 90% sensitivity and 90% specificity for data with a normal distribution, whereas an OR of 225 or 6 is required for data that have a double-exponential or Cauchy distribution, respectively (Table 2).

At the sensitivity and specificity values required for clinical utility, the minimum required OR values for data with a Cauchy distribution are substantially smaller than those for data with a normal distribution. The same results were consistently obtained when the shapes of the data distributions in the case and control groups were different (Supplementary Table 1; see Supplementary Methods, http://links.lww.com/AA/A172).

We further investigated whether the distribution of actual laboratory data affects the OR that is required for a laboratory test to achieve a certain level of diagnostic performance. With a cutoff of 2.63 for the log-transformed A1C lab values, this test would achieve 84% specificity and 71% sensitivity for the diagnosis of diabetes in the populations evaluated. Using the actual A1C data, the calculated OR was 24. If the A1C data were normally distributed, the OR required to achieve this performance would have been >10-fold higher (271).

To evaluate the impact that the data distribution had on the required OR, we artificially shifted the distribution curve for the A1C data in the case population to achieve the desired performance characteristics. This enabled us to directly compare the calculated ORs from the actual A1C data (after the hypothetical right shift) with those for data with a normal distribution. By varying the extent of the rightward shift in the diabetic population, any desired level of performance could be achieved, which enabled the impact of the shape of the distribution curve (especially at the 2 tails) on the OR (conventional definition) to be characterized (Table 3).

The required OR for the A1C test to achieve clinically useful performance characteristics is significantly smaller than if the data had a normal distribution. This analysis also indicated that a double-exponential distribution could reasonably approximate the properties of the A1C data distribution at the 2 tails. Thus, the nonnormal data distribution at the extremes had a very significant effect on the performance of a laboratory test; the ORs required to achieve clinically useful performance characteristics were significantly less than expected.

### Nonnormal Data Distribution Improves A1C Performance

We next evaluated the effect that the data distribution had on the performance of A1C laboratory data. To do this, we determined the number of individuals who would be correctly classified as control (nondiabetic) or diabetic on the basis of actual A1C laboratory data, and compared that with the expected results if the A1C data had a normal distribution (Table 3). Using the indicated cutoff of 2.63 (which produced an OR of 24.3) for the A1C data, we correctly identified 3439 more control (nondiabetic) individuals by this result than if the data had a normal distribution. In addition, 334 more diabetic individuals were correctly identified than if the A1C data were normally distributed (Table 3). In other words, the nonnormal distribution of the A1C laboratory data resulted in 13.7% fewer FP and 3.1% fewer FN classifications than if the data had followed a normal distribution.

Generation of a receiver operating characteristic (ROC) curve is a useful procedure for assessing the overall performance of a test. By varying the cutoff for A1C data, we obtain different pairs of specificity and sensitivity. Instead of focusing on a specific pair, we can plot all such pairs in a scatter plot. The resulting curve is an ROC curve, showing the change of sensitivity with respect to the change in specificity (Fig. 2A). An ROC curve close to the ideal point of 100% specificity and 100% sensitivity indicates good performance, and a curve close to the diagonal line from 0% specificity and 100% sensitivity to 100% specificity and 0% sensitivity indicates poor performance. The total area under the ROC curve is also an indicator of the performance, with a value of 1 indicating perfect classification power and a value of 0.5 demonstrating that the variable used for classification is irrelevant to the actual outcome. The improved performance of the actual A1C data in relation to that of the corresponding normal distribution is demonstrated by comparison of their ROC curves (Fig. 2A); the area under the ROC curve for the A1C data (0.84) is >10% larger than that of the normal distribution (0.74).
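The construction just described, sweeping the cutoff and collecting (specificity, sensitivity) pairs, can be sketched as follows. This is a generic illustration with hypothetical values and function names; it assumes that a value above the cutoff counts as a positive result, and it computes the area with the trapezoidal rule.

```python
def roc_points(controls, cases):
    # Sweep the cutoff over every observed value; a result is positive when value > cutoff.
    # Points are (specificity, sensitivity), starting from a cutoff below all values.
    points = [(0.0, 1.0)]
    for cutoff in sorted(set(controls) | set(cases)):
        specificity = sum(v <= cutoff for v in controls) / len(controls)
        sensitivity = sum(v > cutoff for v in cases) / len(cases)
        points.append((specificity, sensitivity))
    return points


def area_under_curve(points):
    # Trapezoidal rule along the curve, in the order the cutoffs were swept.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical values, for illustration only.
controls = [1.0, 2.0, 3.0, 4.0]
cases = [2.5, 3.5, 5.0, 6.0]
curve = roc_points(controls, cases)
print(curve)
print(area_under_curve(curve))
```

The area equals the probability that a randomly chosen case has a higher value than a randomly chosen control, which for these eight values is 13 of the 16 case-control pairs.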

The performances of 2 other laboratory tests were similarly evaluated. The number of CD19^{+} cells is a potential biomarker for the diagnosis of lymphoma, and protein S activity is a potential biomarker for coagulopathy. The histograms and the quantile–quantile plots indicate that the data for both of these biomarkers have a nonnormal distribution (Supplementary Figs. 3 [see Supplementary Digital Content 4, http://links.lww.com/AA/A175] and 4 [see Supplementary Digital Content 5, http://links.lww.com/AA/A176]; see Supplementary Methods for figure legends, http://links.lww.com/AA/A172). The ORs required for these 2 tests to achieve a useful level of diagnostic performance were also significantly smaller than if they had a normal distribution (Table 4). The required ORs were comparable in magnitude to those for data following a double-exponential distribution, and were much larger than those for data following a Cauchy distribution. Therefore, the 2 tails of the distribution curves for the data in the control and disease groups play an important role in determining the relationship between biomarker performance and OR, and this data distribution was better approximated by a double-exponential distribution.

The ROC curves of CD19 and protein S were also compared with the ROC curves of a normally distributed case and control population with the same OR. For the CD19 data, there was a >10% gain in the area under the curve (AUC) for the actual CD19 data in relation to that of the data with a normal distribution and the same OR (Fig. 2B). The performance of the actual data was superior to that of the normally distributed data over most of the range. The protein S data did not have good diagnostic utility because the AUC of the ROC curve was close to 0.5. However, the ROC curve of the actual protein S data had a gain in AUC in relation to that of the normally distributed data (Fig. 2C).

The distributions of the protein S data in the case and control populations were close to each other, which made the impact of data distribution hard to compare. Therefore, the protein S data in the case group was artificially shifted to the right to create a performance of 80% specificity and 80% sensitivity, to investigate the effect that the data distribution had on laboratory test performance when the case and control populations were better separated by the test. In this case, there was a 10% gain in the AUC in the shifted data in relation to that of the normally distributed data (Fig. 2D). These results further demonstrate that the data distribution has a large impact on the performance of laboratory test data.

### Additional Factors Affecting the Laboratory Data Performance

In addition to the data distribution, other factors can also significantly impact the performance of laboratory data. For example, lactic dehydrogenase (LDH) is a prognostic biomarker for survival in adult leukemia–lymphoma cases.^{14} However, the response variable (survival at 5 years) could have an erroneous value in a small percentage of patients if death were due to a nonleukemia-associated cause (e.g., an automobile accident) or because of a failure in mortality reporting. In the OR example presented in the introductory section of this article (Text Box), such misclassification would represent errors due to patients having awareness under anesthesia that was not discovered by the investigators, or to patients claiming to have had awareness when the events recalled did not take place during a general anesthetic.

In a simulation, we evaluated the impact of such misclassification of the response variable on laboratory data performance. At 80% specificity, a 1% (or 10%) misclassification incidence decreased the average sensitivity from 74.5% to 73.7% (or 66.3%) (Table 5). The same decrease occurred when the underlying data distribution was assumed to be normal. This simulation indicates that laboratory data performance decreases as the measurement error in the binary dependent variable increases, but the decrease is not very large if the measurement error is small.
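A simulation of this kind can be sketched as follows. This is not the simulation reported in Table 5: the double-exponential populations, the shift between them, the 50% case fraction, the sample size, and the function names are all assumptions chosen for illustration, so the printed sensitivities will not match the values quoted above. A fraction of the recorded outcomes is flipped at random, the cutoff is set at the 80th percentile of the recorded controls, and the sensitivity is measured among the recorded cases.

```python
import random


def laplace_sample(rng, loc):
    # Double-exponential draw: exponential magnitude with a random sign.
    return loc + rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)


def sensitivity_at_specificity(controls, cases, specificity=0.80):
    # Cutoff = empirical quantile of the controls; positive results lie above it.
    ordered = sorted(controls)
    cutoff = ordered[int(specificity * len(ordered)) - 1]
    return sum(v > cutoff for v in cases) / len(cases)


def simulate(misclass_rate, n=20000, shift=2.0, seed=1):
    rng = random.Random(seed)
    controls, cases = [], []
    for _ in range(n):
        is_case = rng.random() < 0.5
        value = laplace_sample(rng, shift if is_case else 0.0)
        if rng.random() < misclass_rate:
            is_case = not is_case  # the recorded outcome is wrong
        (cases if is_case else controls).append(value)
    return sensitivity_at_specificity(controls, cases)


for rate in (0.0, 0.01, 0.10):
    print(f"misclassification {rate:.0%}: sensitivity {simulate(rate):.3f}")
```

Because the same seed is used for every rate, the three runs share the same underlying values and differ only in which outcomes are flipped.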

In many perioperative medicine studies, data are often collected from multiple centers to achieve a desired sample size. However, laboratories at different centers could produce data with different statistical distribution parameters, which introduces a degree of heterogeneity into the pooled data. Therefore, we evaluated the impact of intercenter differences in a simulation study examining A1C data distributions obtained from 10 hypothetical centers. At 80% specificity, the sensitivity of the A1C data only decreased from 74.5% (if all data were from a single center) to 74.1% if the SD in the distribution parameter across centers was 10% of the control A1C data (Table 5); and the shape of the data distribution curve did not affect the size of the decrease in the sensitivity. Thus, combining data from multiple centers had a small effect on laboratory data performance if the tests achieved a reasonable level of standardization across the centers. If care is taken to minimize the variation among different data centers, the benefit from a larger sample size outweighs any decrease in laboratory data performance resulting from a multicenter study.
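The effect of pooling data from centers with slightly different calibrations can be illustrated with a small simulation. The sketch below is not the study's simulation: it assumes normally distributed values with unit SD, a fixed shift between cases and controls, and a center-specific additive offset drawn from a normal distribution; the function name and all parameter values are hypothetical.

```python
import random


def pooled_sensitivity(center_sd, n_centers=10, n_per_center=4000, shift=2.0, seed=1):
    rng = random.Random(seed)
    controls, cases = [], []
    for _ in range(n_centers):
        # Center-specific calibration offset (assumption: additive and normally distributed).
        offset = rng.gauss(0.0, center_sd)
        for _ in range(n_per_center):
            is_case = rng.random() < 0.5
            value = offset + (shift if is_case else 0.0) + rng.gauss(0.0, 1.0)
            (cases if is_case else controls).append(value)
    # Cutoff at the 80th percentile of the pooled controls (80% specificity).
    ordered = sorted(controls)
    cutoff = ordered[int(0.80 * len(ordered)) - 1]
    return sum(v > cutoff for v in cases) / len(cases)


for sd in (0.0, 0.1, 1.0):
    print(f"between-center SD {sd}: sensitivity {pooled_sensitivity(sd):.3f}")
```

With a small between-center SD the pooled sensitivity should stay close to the single-center value, whereas a between-center SD as large as the within-center SD visibly degrades it.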

## DISCUSSION

The effect that a nonnormal data distribution has on laboratory test performance may at first appear to be a topic that is of interest to statisticians rather than physicians. However, this finding has significant implications for scientists who are discovering and developing new diagnostic markers, and subsequently for physicians who will use the results provided by the next generation of diagnostic tests to care for their patients. This information is also relevant to researchers using retrospective AIMS data to develop predictors of postoperative outcomes on the basis of quantitative data recorded during anesthesia.

First, this analysis demonstrates that a nonnormal distribution of laboratory data enables laboratory tests with moderate ORs (6–30) to provide useful diagnostic tools. Second, these ORs are within the range observed for laboratory data identified in etiologic and epidemiologic studies. This implies that contemporary genomic studies have the potential to produce clinically useful diagnostic tools. Third, when evaluated within the larger context of populations that are affected by common diseases (prevalence >1% of the population), the improvement in laboratory test performance due to the nonnormal data distribution assumes substantial significance. Finally, the evaluation of the usefulness of a marker (i.e., the independent variable) in predicting an outcome (i.e., the dependent variable) requires a determination of the probability distribution of that marker in the population. It is insufficient to make assumptions that the marker follows a normal distribution without evaluating this formally, because the failure to do this may result in a useful test being inappropriately discarded. This caveat applies equally to quantitative factors identified retrospectively from data-mining analyses of AIMS databases to be associated with an undesirable outcome.

For example, it is estimated that 18.2 million people (6.3% of the population) in the United States had diabetes in 2002, and 5.2 million of these were undiagnosed cases.^{15} Let us imagine that a scientist has just discovered that measurement of A1C could be a potential diagnostic marker that could be used for long-term monitoring of patients with diabetes. The scientist has just received a large number of serum samples that were obtained from control and diabetic individuals. After the A1C values were measured, the test developer must establish a threshold value that will enable physicians to determine whether an individual has an abnormal test result. As is shown here, the nonnormal distribution of A1C laboratory values resulted in a 13.7% increase in specificity and a 3.1% increase in sensitivity for the diagnosis of diabetes on the basis of A1C test results. Thus, for every 1 million diabetic individuals that were evaluated by this newly developed test, if the analyst used threshold values that were based upon a double-exponential distribution of A1C laboratory data in the population, 31,000 more test results would be correctly classified as abnormal. Similarly, when the test results from a population of 1 million control individuals were analyzed, the use of these cutoff values would avoid misclassifying 137,000 individuals as having an abnormal test result.
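The two counts quoted above follow directly from applying the percentage gains to populations of 1 million individuals each:

```latex
\begin{aligned}
\text{additional diabetic individuals correctly classified} &= 10^{6} \times 0.031 = 31{,}000,\\
\text{control individuals spared an abnormal result} &= 10^{6} \times 0.137 = 137{,}000.
\end{aligned}
```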

However, diagnoses are not based solely upon a laboratory test result; clinical context and judgment will always be essential for the proper use of any risk stratification tool. The prevalence of a particular condition and the level of pretest suspicion within the tested population (pretest probability) will impact laboratory test performance and the interpretation of test results. For many genomic tests that an anesthesiologist might use, different clinical scenarios might lead to requirements for different test performance characteristics (sensitivity and specificity).
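The dependence of test interpretation on pretest probability can be illustrated with Bayes' rule. The sketch below computes the positive and negative predictive values from sensitivity, specificity, and prevalence; the function name is hypothetical, and pairing the A1C performance values from the Results (71% sensitivity, 84% specificity) with the listed prevalences is purely illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    # Expected fractions of the tested population in each cell of the two-way table.
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tn = specificity * (1.0 - prevalence)
    ppv = tp / (tp + fp)  # probability of disease given a positive result
    npv = tn / (tn + fn)  # probability of no disease given a negative result
    return ppv, npv


for prevalence in (0.01, 0.063, 0.30):
    ppv, npv = predictive_values(sensitivity=0.71, specificity=0.84, prevalence=prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

With sensitivity and specificity held fixed, the positive predictive value rises and the negative predictive value falls as the pretest probability increases.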

For example, suppose that there was a genomic screening test that predicted susceptibility to malignant hyperthermia (MH). For this test, we would want a sensitivity of >99%, because this condition can be fatal and we do not want to miss anyone. However, we could accept a specificity of only 50% (for every 2 patients who are resistant to MH, only 1 will be predicted correctly), because there are effective alternative methods of providing anesthesia using non-MH triggering drugs. Alternatively, suppose another genetic test predicted whether a patient is likely to experience prolonged sedation when given midazolam. For this test, we do not need an extremely high sensitivity, but want a higher level of specificity than for MH. We do not want to deprive too many patients of the anxiolytic benefits of this drug, and we can effectively reverse the effect of midazolam with an antagonist (flumazenil), if necessary. However, knowing whether a patient is at risk for prolonged sedation would be helpful in assessing a patient who is slow to awaken after an anesthetic, and it could improve outcome. If the patient were at risk for excessive sedation, the practitioner might more quickly reach a decision to administer flumazenil than if the patient were not at risk. Avoiding the administration of flumazenil in situations in which slow emergence is not likely due to midazolam is desirable, because provoking a seizure is a potential complication of reversal. These 2 different scenarios indicate how very different test performance characteristics can be required to address different clinical situations.
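The trade-off between these 2 scenarios can be made concrete by counting the expected errors of each kind. In the sketch below, the MH values (99% sensitivity, taken as the lower bound of the >99% target, and 50% specificity) come from the text; the midazolam values (80% sensitivity, 95% specificity) and the group sizes of 1000 are hypothetical and serve only to illustrate the contrast.

```python
def expected_errors(n_at_risk: int, n_not_at_risk: int,
                    sensitivity: float, specificity: float) -> tuple[int, int]:
    """Return (missed at-risk patients, not-at-risk patients flagged as at risk)."""
    missed = round(n_at_risk * (1 - sensitivity))
    false_alarms = round(n_not_at_risk * (1 - specificity))
    return missed, false_alarms


# Hypothetical group sizes: 1000 at-risk and 1000 not-at-risk patients per test.
scenarios = {
    "MH susceptibility": (0.99, 0.50),   # from the text (>99% taken as 99%)
    "prolonged sedation": (0.80, 0.95),  # hypothetical values for illustration
}

for name, (sens, spec) in scenarios.items():
    missed, false_alarms = expected_errors(1000, 1000, sens, spec)
    print(f"{name}: missed={missed}, false alarms={false_alarms}")
```

Under these assumptions, the MH profile tolerates many false alarms in order to miss very few susceptible patients, whereas the midazolam profile accepts more missed cases in exchange for withholding the drug from fewer patients.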

The results from this study are especially important for organizations such as the Anesthesia Quality Institute (AQI)^{b} and the Multicenter Perioperative Outcomes Group (MPOG),^{c} both of which have recently initiated efforts to improve the quality of anesthesia care through the retrospective analysis of anesthesia data. The finding that up to a 10% error in survival classification has a small impact on test performance is fortunate. Typically, the cause of death cannot be determined from mortality databases that are publicly available, and there is also incomplete reporting to these databases. If small reporting errors adversely affected test performance, misclassification of the cause of death would call into question the interpretation of such studies. Also encouraging for AIMS research involving collaborative databases is that statistical variation in the distributions of reported parameters among the various contributing centers is not likely to have a major impact on the analysis. However, AQI and MPOG need to consider the implications of the data distribution of identified independent variables and describe these distributions in their published reports. The data distributions of the independent variables will also affect power analysis calculations (which usually are based on the assumption that the data are normally distributed) for subsequent prospective studies, because the group sizes necessary to demonstrate a specified change in outcome will be influenced.

The prognosis for laboratory data is further improved when viewed through a future-oriented lens. First, although this analysis evaluated commonly available laboratory data, it has significant implications for newly emerging diagnostic tests identified using contemporary transcriptional, proteomic, or metabolomic technologies. It is likely that the data for these newly discovered tests would also have a similar (nonnormal) distribution, so we have reason to be more optimistic about their diagnostic performance. Second, this analysis evaluated the performance of a single laboratory test. However, it is likely that many laboratory tests will be identified by contemporary genomic analyses.

As has been noted for disease-associated gene expression signatures,^{16} it is likely that a combination of genetic risk factors and laboratory test data will produce better clinical stratification tools. Fortunately, many statistical learning methods and modeling tools,^{17} including nonlinear mixed-effects modeling,^{18} have been developed to identify optimal combinations for clinical stratification.

Also, this analysis evaluated test performance within the context of existing systems for disease classification. Currently, very large and diverse groups of individuals are included within a disease grouping that is largely based upon a collection of symptoms, which are often disparate, and that sometimes incorporates various laboratory variables. However, genetic risk factors and laboratory data have the potential to transform our diagnostic and stratification practices. In the not too distant future, patient groupings may be redefined using genetic risk factors and laboratory test measurements. When emerging laboratory test results and genetic data are incorporated into disease classification criteria, patients with common underlying predispositions and pathogenesis will be similarly classified. The sensitivity and specificity of the laboratory results that are used for diagnosis and prognosis will then improve.

## ACKNOWLEDGMENTS

We thank Dr. Gomathi Krishnan for retrieving the biomarker data, and Dr. Bob Lewis for helpful discussions.