Let's Think Clinically Instead of Mathematically About Device Accuracy

Morey, Timothy E. MD; Gravenstein, Nikolaus MD; Rice, Mark J. MD

doi: 10.1213/ANE.0b013e318219a290
Technology, Computing, and Simulation: Commentary

Published ahead of print April 25, 2011.

Supplemental Digital Content is available in the text.

From the Department of Anesthesiology, University of Florida College of Medicine, Gainesville, Florida.

Supported by institutional funds only.

The authors declare no conflicts of interest.

Reprints will not be available from the authors.

Address correspondence to Timothy E. Morey, MD, Department of Anesthesiology, University of Florida College of Medicine, PO Box 100254, Gainesville, FL 32610-0254. Address e-mail to

Accepted March 7, 2011

We read with interest the article by Macknet et al.1 in the December 2010 issue of Anesthesia & Analgesia describing the accuracy of the Masimo CO-Oximetry device. The statistical analysis that compared this new diagnostic measurement device to a known “gold standard” may lead to incorrect clinical decisions. Just as Clarke et al.2 redefined the thinking about the accuracy of glucose measurement devices, we believe devices that measure hemoglobin deserve another look.

There are several points describing the accuracy of the noninvasive CO-Oximetry device that merit comment. First, the average difference (bias) between the reference hemoglobin and the tested device (CO-Oximeter) is reported as −0.15 g/dL. This merely (and accurately) reports that the errors above and below the reference measurement essentially cancel one another on average. The errors above and below could be very large, but if they are equally large in both directions, then the bias is small. Second, the y-axis (CO-Oximeter) in Figure 1 of Macknet et al.1 is compressed compared with the x-axis (reference measurement), giving the visual impression of a smaller difference between the CO-Oximeter and reference measurements. Third, although it is mathematically correct to state that this device is accurate within 1.0 g/dL, inspection of Figure 1 in the article by Macknet et al.1 shows that statement to be clinically misleading. As one example of many, in Figure 1 of Macknet et al.,1 a noninvasive CO-Oximeter measurement of approximately 10 g/dL corresponds to a reference (true) measurement of anywhere from approximately 9 g/dL to >13 g/dL. Fourth, the formula for the linear regression of the data in Figure 1 of Macknet et al.1 does not clearly note a slope or y-intercept. Fifth, their Figure 2 is not a Bland-Altman plot. The Bland-Altman method plots the difference in paired values on the ordinate against the average of the paired values on the abscissa. The average value on the abscissa is necessary because the gold standard will itself have some variation. Instead, Figure 2 of Macknet et al.1 is a plot of the difference versus the gold standard values. In their original article describing this statistical technique, Drs. Bland and Altman state, “It would be a mistake to plot the difference against either value separately because the difference will be related to each, a well-known statistical artefact [sic].”3 Sixth, precision in the Bland-Altman analysis refers to the possible variation in the limits of agreement, not the limits of agreement themselves and not a single standard deviation as it appears in the referenced publication.
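The first and fifth points can be illustrated with a minimal Python sketch. It assumes that the difference is taken as device minus reference and that the limits of agreement are the conventional bias ± 1.96 standard deviations of the differences. The function name `bland_altman` and the paired readings are hypothetical; they are not data from Macknet et al.

```python
from statistics import mean, stdev


def bland_altman(reference, device):
    """Return plot coordinates and summary values for paired readings.

    means       -> abscissa: average of each pair
    differences -> ordinate: device minus reference (assumed sign convention)
    """
    differences = [d - r for r, d in zip(reference, device)]
    means = [(d + r) / 2 for r, d in zip(reference, device)]
    bias = mean(differences)
    sd = stdev(differences)  # sample standard deviation of the differences
    lower_limit = bias - 1.96 * sd
    upper_limit = bias + 1.96 * sd
    return means, differences, bias, lower_limit, upper_limit


# Hypothetical paired hemoglobin readings in g/dL (invented for illustration).
reference = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
device = [10.0, 7.0, 12.0, 9.0, 14.0, 11.0]

means, differences, bias, lower, upper = bland_altman(reference, device)
print(f"bias = {bias:.2f} g/dL, limits of agreement = {lower:.2f} to {upper:.2f} g/dL")
```

In this invented example every device reading is wrong by 2 g/dL, yet the bias is zero because the errors cancel. Only the limits of agreement reveal the spread.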

Figure 1

The statistical evaluation of this device as presented might lead the clinician to believe that this device can be used in place of a traditional central laboratory device, but the data and analysis do not support such a conclusion. In addition, we maintain that the Bland-Altman technique may be mathematically rigorous and useful for some tests, but it does not capture everything that matters in the operating room environment. Therefore, we propose 2 complementary methods for reviewing and presenting such data, as Clarke et al. did for glucose measurement devices.2

The ideas of Clarke et al. for comparing devices that measure blood glucose are instructive when considering methods for comparing devices that measure hemoglobin. They developed an error grid that shows the absolute value of the new test, the absolute value of the reference test, the difference between these values, and the clinical significance of the difference (as determined by experienced clinicians). We propose that an adaptation of the glucose error grid for hemoglobin determination is informative and applied it to the data of Macknet et al. We did this without benefit of the raw data by taking Figure 1 from Macknet et al., resizing it to make x and y axes symmetric, and then overlaying the proposed error grid.

Figure 1 shows hemoglobin concentrations between 0 and 16 g/dL. All hemoglobin measurements from the reference device (tHb) and the new oximeter device (SpHb) are plotted as ordered pairs. Ideally, all points would be on a line of unity shown as the dashed diagonal in Figure 1. Coincident placement of all points on the unity line, however, is highly unlikely because of the imprecision of the reference and new method, and the bias. Therefore, the majority of points will not lie on the unity line, but rather deviate from it. The clinical interpretation of the magnitude and direction of deviation allows one to construct error zones based on practice guidelines2 and clinical decision making. Although the Clarke et al. glucose error grid includes 5 zones, we suggest that only 3 zones (A, B, and C) are needed for hemoglobin error grid analysis.

Zone A (Green)

For the purposes of hemoglobin, we believe that a deviation of ±10% of the new technique from the reference method seems reasonable and would form a useful area in which one could expect ≥95% of all points to exist. Although selection of these values of ±10% and 95% occupancy is indeed arbitrary, the numbers are selected to be analogous to the glucose monitoring error grid analysis (wherein glucose values may deviate ±20% from the reference method and Zone A contains ≥95% of points) that is suggested by the United States Food and Drug Administration, “… to estimate the clinical significance of bias results between the two methods.”2,4 Although many values may be considered for hemoglobin measurement, we selected ±10% because this deviation represents 1 g/dL hemoglobin at the upper range of transfusion consideration (10 g/dL).5 One unit of packed red blood cells is the smallest amount of potential risk for a transfused patient and will increase a patient's hemoglobin concentration by approximately 1 g/dL. We chose 95% of points to reside in Zone A to be consistent with standard accepted type I error in science (i.e., P < 0.05) and the work of Clarke et al.2 Regardless of our selected values, others may use values they think appropriate to analyze the clinical relevance of bias in hemoglobin measurement.

Zone A has a lowermost section for hemoglobin values <6.0 g/dL, an isthmus, and a larger uppermost region for hemoglobin concentrations >10.0 g/dL. The selection of these particular hemoglobin values was based on the 2006 practice guidelines for transfusion developed by the American Society of Anesthesiologists Task Force on Perioperative Blood Transfusion and Adjuvant Therapies. In that publication, red blood cell transfusion is typically recommended for hemoglobin values <6.0 g/dL and probably not needed for values >10.0 g/dL.5

In the uppermost Zone A, any bias in the 2 measurements will not fundamentally affect the patient because probably no transfusion will occur. In the lowermost Zone A, the patient will likely be transfused irrespective of which method was used to measure the hemoglobin. The more critical isthmus section of Zone A is the clinical decision-making region of this zone in which the hemoglobin may be a key determinant of whether a patient is transfused. Clearly, this Zone A isthmus is the most important section of the entire grid and is where the reference and comparator device must most closely agree. We suggest that most of the experimental observations should be in this isthmus so that the bias is not unduly affected by points (i.e., hemoglobin >10 g/dL) that are of little interest to anesthesiologists. Many publications, including that of Macknet et al., however, contain a great preponderance of points with hemoglobin values >10 g/dL, which may unduly affect the bias of the more critical range. This situation is exactly analogous to most glucose measurement publications, in which the great majority of reported values are in the normal glucose range and of little interest in the device evaluation.6 Recognizing this fact, the Food and Drug Administration sometimes requires glucose “clamp” experiments to evaluate new glucose measurement devices wherein subjects, during closely supervised clinical trials, have their glucose markedly decreased with insulin infusions.

Zone B (Yellow)

This area represents significant errors in hemoglobin measurement, although the magnitude of the errors is not as great as in Zone C, described subsequently. This zone is defined as the region between the upper line of Zone A and the upper Zone C and between the lower line of Zone A and the lower Zone C. Using the glucose error grid as an analog, <5% of all points should be in Zone B.

Zone C (Red)

For this zone, major therapeutic errors may occur with large risks to patients without any potential benefits. In the upper Zone C, a hemoglobin value measured by the new method would clearly overestimate the reference method. More importantly, an anesthesiologist would likely not transfuse blood for a measured hemoglobin >10 g/dL, although the true hemoglobin may be <6.0 g/dL. In this case, failure to diagnose and treat anemia may occur with potential injury to a patient's organs (e.g., brain). The lower Zone C region also promotes major missteps in patient care in that a patient may be transfused unnecessarily. More specifically, one may obtain a hemoglobin of <6.0 g/dL with the new method when the true hemoglobin is >10.0 g/dL. In this case, the patient experiences unnecessary transfusion without any benefit. Rather, the patient may experience a potential acute adverse reaction (e.g., ABO incompatibility) or longer term, chronic effects (e.g., immunosuppression, cancer progression, or infection). Because of the danger to patient safety without possible benefit, no points should be in either the upper or lower Zone C.
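The three zones described above can be approximated with a short classifier. This is a simplified Python sketch, not the grid as drawn in Figure 1. It assumes that the lowermost and uppermost sections of Zone A cover every pair in which both readings are <6.0 g/dL or both are >10.0 g/dL, and that Zone A elsewhere is the ±10% band around the reference value. The exact boundaries in the figure may differ. The function names and the sample pairs are hypothetical.

```python
from collections import Counter


def classify_zone(reference_hb, device_hb, tolerance=0.10, low=6.0, high=10.0):
    """Assign one (reference, device) hemoglobin pair in g/dL to zone A, B, or C."""
    # Zone C: the two readings point to opposite transfusion decisions.
    if device_hb > high and reference_hb < low:
        return "C"  # upper Zone C: anemia missed
    if device_hb < low and reference_hb > high:
        return "C"  # lower Zone C: unnecessary transfusion
    # Zone A: device within +/-10% of the reference value.
    if abs(device_hb - reference_hb) <= tolerance * reference_hb:
        return "A"
    # Zone A (assumed geometry): both readings lead to the same decision.
    if device_hb < low and reference_hb < low:
        return "A"
    if device_hb > high and reference_hb > high:
        return "A"
    # Everything between Zone A and Zone C.
    return "B"


def zone_fractions(pairs):
    """Return the fraction of pairs that fall in each zone."""
    counts = Counter(classify_zone(ref, dev) for ref, dev in pairs)
    return {zone: counts.get(zone, 0) / len(pairs) for zone in "ABC"}


def meets_proposed_criteria(fractions):
    """At least 95% of points in Zone A and none in Zone C."""
    return fractions["A"] >= 0.95 and fractions["C"] == 0


# Hypothetical (reference, device) pairs in g/dL, invented for illustration.
pairs = [(8.0, 8.5), (8.0, 9.5), (5.5, 10.5), (12.0, 14.0)]
fractions = zone_fractions(pairs)
print(fractions, meets_proposed_criteria(fractions))
```

The tolerance and the 6.0 and 10.0 g/dL thresholds are parameters, so other values can be substituted by those who think them more appropriate.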

Bland-Altman analysis and the proposed error grid analysis are based on continuous data. Anesthesiologists working in a highly technical, rapidly changing environment make a binary decision when considering transfusion: Do I give packed red blood cells or not? Some statistical measure that reflects this clinical situation would be useful when comparing 2 devices that purport to measure hemoglobin. Although not perfect, one accepted method of assessing the agreement between 2 methods for nominal data is Cohen's κ statistic.7 This value is superior to percent agreement because it accounts for agreement expected by chance. In this case, we wish to know what the agreement is (all patient and surgical conditions being equal) for a hemoglobin value measured by both the experimental and a reference instrument that may lead an anesthesiologist to provide a transfusion. The value varies from 0 where there is poor agreement to 1.0 where almost perfect agreement is present. Intermediate values of κ can be interpreted as 0.00 to 0.20 (slight agreement), 0.21 to 0.40 (fair agreement), 0.41 to 0.60 (moderate agreement), and 0.61 to 0.80 (substantial agreement). In addition, we propose that the values be calculated in the critical isthmus of Zone A from 6 to 10 g/dL. Because we do not have the dataset detailed by Macknet et al., we cannot calculate the κ statistic for those observations. Instead, we provide an example using data that we collected on 2 devices to measure hematocrit to show how inclusion of many normal hematocrit values leads to a bias favorable for acceptance of the new device. Recently, we completed testing of 2 devices that measure hematocrit in human blood (data not shown; hemoglobin calculated as hematocrit/3.0) and determined the κ statistic, noted in parentheses, for all data (0.90), hemoglobin <10 g/dL (0.25), and hemoglobin 6 to 10 g/dL (0.21) for a decision to transfuse at a hemoglobin of 10 g/dL. That is, the agreement of the 2 methods was a function of the range of hemoglobin concentrations selected to include in the test. The many values of hemoglobin >10 g/dL where there was good agreement clearly biased our data compared with the clinically relevant zone of 6 to 10 g/dL. Would any anesthesiologist be comfortable providing packed red blood cells with their attendant risks with an agreement κ statistic of 0.21? For this reason, we propose that the κ statistic at hemoglobin values of 6.0 to 10.0 g/dL be used when evaluating new devices to remove the bias of the far more common hemoglobin values >10.0 g/dL found in most research subject study populations.
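The dependence of κ on the selected hemoglobin range can be sketched in Python. The example makes several assumptions that are not in the text: pairs are selected by the reference reading, the transfusion trigger is a hypothetical 8.0 g/dL (chosen only so that both decisions occur inside the 6 to 10 g/dL window; the example in the text used 10 g/dL), and all readings are invented. The function names are hypothetical, and the dataset described in the text is not reproduced here.

```python
def hematocrit_to_hemoglobin(hematocrit_percent):
    """Convert hematocrit (%) to hemoglobin (g/dL) as hematocrit/3.0."""
    return hematocrit_percent / 3.0


def transfuse(hemoglobin, trigger):
    """Binary decision: transfuse when hemoglobin is below the trigger."""
    return hemoglobin < trigger


def cohen_kappa(decisions_a, decisions_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(decisions_a)
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    labels = set(decisions_a) | set(decisions_b)
    expected = sum(
        (decisions_a.count(label) / n) * (decisions_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return float("nan")  # undefined when both methods give one constant decision
    return (observed - expected) / (1.0 - expected)


def kappa_for_range(pairs, trigger, low=None, high=None):
    """Kappa for pairs whose reference reading lies in [low, high] (assumed rule)."""
    selected = [
        (ref, dev) for ref, dev in pairs
        if (low is None or ref >= low) and (high is None or ref <= high)
    ]
    ref_decisions = [transfuse(ref, trigger) for ref, _ in selected]
    dev_decisions = [transfuse(dev, trigger) for _, dev in selected]
    return cohen_kappa(ref_decisions, dev_decisions)


# Hypothetical (reference, device) hemoglobin pairs in g/dL.
high_range = [(10.5, 10.9), (11.0, 11.6), (11.5, 11.0), (12.0, 12.5), (12.5, 13.0),
              (13.0, 12.4), (13.5, 13.1), (14.0, 13.5), (14.5, 15.0), (15.0, 14.6)]
low_range = [(5.0, 5.5), (5.5, 5.0)]
window = [(7.0, 7.4), (7.5, 7.0), (7.8, 7.6), (7.2, 8.4),
          (8.5, 8.8), (9.0, 9.5), (8.2, 7.6), (9.4, 7.9)]
pairs = high_range + low_range + window

kappa_all = kappa_for_range(pairs, trigger=8.0)
kappa_window = kappa_for_range(pairs, trigger=8.0, low=6.0, high=10.0)
print(f"all data: {kappa_all:.2f}, 6 to 10 g/dL only: {kappa_window:.2f}")
```

With these invented readings, the many concordant pairs above 10 g/dL raise κ for the whole dataset well above κ for the 6 to 10 g/dL window, which is the pattern described in the text.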

We believe that the provision of transfused erythrocytes remains a serious clinical decision with many potential adverse ramifications. As the anesthesiology community grows more dependent on point-of-care testing, more rigorous clinically oriented methods of device evaluation are needed to complement traditional techniques. To this end, we propose the following analyses: Bland-Altman analysis, error grid analysis, and Cohen's κ statistic over a 6.0 to 10.0 g/dL hemoglobin range to reflect the nature of clinical decision making to transfuse a patient (or not).

Name: Timothy E. Morey, MD.

Contribution: This author helped analyze the data and write the manuscript.

Attestation: Timothy E. Morey approved the final manuscript.

Name: Nikolaus Gravenstein, MD.

Contribution: This author helped analyze the data and write the manuscript.

Attestation: Nikolaus Gravenstein approved the final manuscript.

Name: Mark J. Rice, MD.

Contribution: This author helped analyze the data and write the manuscript.

Attestation: Mark J. Rice approved the final manuscript.

1. Macknet MR, Allard M, Applegate RL II, Rook J. The accuracy of noninvasive and continuous total hemoglobin measurement by pulse CO-Oximetry in human subjects undergoing hemodilution. Anesth Analg 2010;111:1424–6
2. Clarke WL, Cox D, Gonder-Frederick LA, Carter W, Pohl SL. Evaluating clinical accuracy of systems for self-monitoring of blood glucose. Diabetes Care 1987;10:622–8
3. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–10
4. Food and Drug Administration Clinical Chemistry and Toxicology Devices Branch, Division of Clinical Laboratory Devices, Office of Device Evaluation. Review criteria assessment of portable blood glucose monitoring in vitro diagnostic devices using glucose oxidase, dehydrogenase or hexokinase methodology. Silver Spring, MD: Food and Drug Administration, 1997
5. American Society of Anesthesiologists Task Force on Perioperative Blood Transfusion and Adjuvant Therapies. Practice guidelines for perioperative blood transfusion and adjuvant therapies: an updated report by the American Society of Anesthesiologists Task Force on Perioperative Blood Transfusion and Adjuvant Therapies. Anesthesiology 2006;105:198–208
6. Rice MJ, Pitkin AD, Coursin DB. Review article: glucose measurement in the operating room—more complicated than it seems. Anesth Analg 2010;110:1056–65
7. Rigby AS. Statistical methods in epidemiology. v. Towards an understanding of the kappa coefficient. Disabil Rehabil 2000;22:339–44
© 2011 International Anesthesia Research Society