What are the current preferred statistical and graphical methods used to evaluate and present these types of comparative data? Should bias and precision be the focus of an evaluation? What are the best metrics, both graphically and numerically, to help clinicians make good decisions regarding these devices? And is there a better way to look at these data?
CURRENT METHODS OF DATA PRESENTATION
There are several graphical methods for presenting accuracy data from a device measuring noninvasive Hb. Figure 3 is a scatterplot from Macknet et al.10 graphing the reference CLD Hb on the x-axis versus the noninvasive Hb value on the y-axis. They studied 20 volunteers undergoing 30 mL/kg of acute crystalloid hemodilution, from which 165 paired samples were generated comparing the noninvasive measurement to a CLD. Several metrics are reported with this plot, including bias and precision. Neither the plot nor the statistical metrics provide guidance on transfusion decisions.
The Bland–Altman difference plot,13 as shown in Figure 4 from Macknet et al.,10 is probably the most commonly used graphical presentation of device accuracy data. The mean of the reference and tested device values is plotted on the x-axis versus the difference between the values on the y-axis. A horizontal line can be drawn representing the bias, which is the mean difference between values, as shown by the red line in Figure 1. In addition, the LOA lines may be added to the Bland–Altman plots; these show the limits within which 95% of the points should fall (calculated as the bias ± 1.96 times the standard deviation of the differences). When the sample size is not large (e.g., n < 30), prediction limits should be plotted.
In the original Bland and Altman article,13 the 2 measurement methods (neither one specifically designated as the “gold standard”) were averaged on the x-axis, and this plot is correctly called a Bland–Altman plot. If the reference measurement method is accepted as a truly accurate and precise gold standard, then this can be plotted alone on the x-axis and the plot can then be called a modified Bland–Altman plot.
Bland and Altman, however, developed this standard method comparison tool with some assumptions that may not be met. These assumptions include only 1 sample per subject to maintain independence of an observation from another observation, a constant parameter (i.e., Hb concentration that does not change over the course of repeated observations in each subject), a relatively large number of observations, equal variance within subjects, and normal distributions of data. In fact, Bland and Altman have offered an alternative analysis to consider when multiple observations per subject are collected on parameters of changing values (such as Hb concentration during surgery). This technique yields not only the traditional bias and LOA, but also estimates of within-subject and between-subject variability. We suggest that this treatment of the data is superior when the previous assumptions are not met during data collection. In addition, use of subject number (instead of a constant symbol) on the graphical plot allows visual consideration of data distribution between and within subjects.
ARE BIAS, PRECISION, AND LIMITS OF AGREEMENT USEFUL METRICS FOR ASSESSING DEVICE ACCURACY?
Almost all articles that report the accuracy of the Masimo Radical-7 stress bias as a useful metric for accuracy evaluation. We believe this is, at best, not helpful and, at worst, grossly misleading. Figure 1 from Berkow et al.3 reported the bias for these data as −0.1 g/dL; as the first reported metric in the Abstract, it would appear to impart great significance to the very small bias. With an average Hb of 10 g/dL, a bias value of 0.1 g/dL is a mere 1% of the total Hb. To the casual observer, this might imply (incorrectly) great accuracy.
Bias is the mean of all the differences between each paired measurement (new method and old method) for all data points. As shown in Figure 1, even with large variation the bias may approximate 0 g/dL. Analogously, a shotgun blast, centered on a target with an equal number of points above and below the center, would have a bias of exactly zero. Even if the shotgun was fired from a distance and the shot pellets were spread widely onto a very large target, if centered exactly, the blast would still have a bias of zero. Bias merely implies the balance around the mean, but informs little about accuracy. Notwithstanding, graphing a Bland–Altman bias plot can illuminate cases where bias is a function of the magnitude of the measurement, the proportional effect, which violates statistical assumptions of Bland–Altman analysis. Presence of the proportional effect precludes calculation of bias because it varies across the range of values.
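The shotgun analogy can be made concrete with a minimal sketch. The two sets of paired differences below are hypothetical values chosen for illustration, not study data; both are centered on zero, so both have a bias of zero even though their spread differs by more than 10-fold.

```python
from statistics import mean, stdev

# Hypothetical paired differences (device minus reference, g/dL); not study data.
tight = [-0.2, -0.1, 0.0, 0.1, 0.2]   # a tightly clustered "blast"
wide = [-3.0, -1.5, 0.0, 1.5, 3.0]    # a widely scattered "blast"

def bias(differences):
    """Bias is the mean of the paired differences."""
    return mean(differences)

tight_bias, wide_bias = bias(tight), bias(wide)
tight_sd, wide_sd = stdev(tight), stdev(wide)

# Both biases are zero, yet the scatter is very different.
print(f"tight: bias={tight_bias:.2f} g/dL, SD={tight_sd:.2f} g/dL")
print(f"wide:  bias={wide_bias:.2f} g/dL, SD={wide_sd:.2f} g/dL")
```

A reader shown only the bias could not tell these two hypothetical devices apart.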
How about precision? In the Berkow et al.3 results, precision was the second reported metric after bias, and was recorded as ±1 g/dL. In this context, precision is the sample standard deviation of the differences between paired values measured by the 2 methods. It is a measurement of how closely the results are clustered, but those data can be very inaccurate. As long as the results are clustered, the precision will be a low value (high precision) but could be very far from the zero line on the Bland–Altman plot (true accuracy). Additionally, the inventors of the bias/LOA technique assume that only 1 measure per subject determines these values; otherwise, the investigators are actually measuring repeatability (i.e., within-subject variance) for a given patient. The mixture of multiple samples per patient and differing physiologic conditions influences accurate appraisals of device performance.
The LOA are also commonly graphed on Bland–Altman plots and provide insight into dispersion of the data around the bias line. LOA are values equal to the bias ±(1.96 times the precision) and are plotted above and below the bias line. Approximately 95% of all values should be between the positive and negative LOA if the aforementioned assumptions are met. As opposed to bias and precision, we submit that LOA are a useful metric of the accuracy of these devices. The LOA should always be viewed from a clinical perspective. If the LOA are minimal, then the test may well add value to patient care, although the raw data should nevertheless always be reviewed. If the LOA are high, the noninvasive test is not useful. Although the LOA represent the standard deviation of a population, we treat 1 patient at a time. Large LOA preclude using noninvasive Hb concentration data for an individual patient.
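The three metrics defined above can be computed as in the following sketch. It assumes 1 independent pair per subject, as the original Bland–Altman method does; the function name `bland_altman` and the sample values are hypothetical and are used only for illustration.

```python
from statistics import mean, stdev

def bland_altman(reference, device):
    """Return bias, precision, and 95% limits of agreement (LOA).

    Assumes one independent pair per subject; repeated measures
    per subject would need the alternative analysis noted in the text.
    """
    if len(reference) != len(device):
        raise ValueError("reference and device must be paired")
    differences = [d - r for r, d in zip(reference, device)]
    bias = mean(differences)          # mean of the paired differences
    precision = stdev(differences)    # sample SD of the paired differences
    half_width = 1.96 * precision
    return {
        "bias": bias,
        "precision": precision,
        "loa_lower": bias - half_width,
        "loa_upper": bias + half_width,
    }

# Hypothetical paired Hb values (g/dL); not study data.
reference = [8.0, 9.0, 10.0, 11.0, 12.0]
device = [9.0, 8.0, 10.0, 12.0, 11.0]
result = bland_altman(reference, device)
print(result)
```

In this hypothetical sample the bias is 0 g/dL, yet the LOA span nearly ±2 g/dL, which is the information a clinician would need to judge a single reading.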
PRECISION AND BIAS ARE NOT CLINICAL DECISION METRICS
The important question to be answered by a noninvasive Hb measurement device is: does the patient need to be transfused? Precision and bias certainly do not provide the clinician with any clue as to how well the device is able to answer this. It is clear that a large bias implies inaccuracy. However, a bias of zero, as noted earlier, reveals little about accuracy because it may simply mean equally large positive and negative inaccuracy around the bias line. Similarly, high precision (a low value) may or may not imply high accuracy. Therefore, emphasizing these 2 metrics when assessing the accuracy of a device can be misleading in the absence of raw data and an educated reading.
Is accuracy over a range of values equivalent? The aviation industry offers some insight. The United States Federal Aviation Administration (FAA) requires a different accuracy from altimeters depending on the particular altitude14 (Fig. 5). For example, at an altitude of 12,000 ft, the required altimeter accuracy is ±90 ft because vertical separation of airplanes at this level is at least 1000 ft. However, at 500 ft, the required accuracy is ±20 ft. Near the ground, as when landing in fog, the needed accuracy is certainly higher.
Need we care if our Hb measurement device is highly accurate at 13 g/dL, the FAA equivalent of 30,000 ft? Perhaps not, but we do want it highly accurate at an Hb value of 6 or 7 g/dL, the FAA equivalent of 500 ft, at which point a critical decision may be made to transfuse. In the clinical studies referenced,2–12 the overwhelming majority of the Hb concentrations were >10 g/dL, a range in which few, if any, patients would receive a packed red blood cell (RBC) transfusion. Because bias is calculated for all data points over the entire Hb concentration range studied (e.g., 6–15 g/dL), whereas the clinically important range of operation is 6 to 10 g/dL, inclusion of data points from 10 to 15 g/dL essentially conceals the bias in the range in which transfusion is being considered (6–10 g/dL). Moreover, the great preponderance of data points in the 10 to 15 g/dL range obscures the accuracy of the device in the critical transfusion range. It is the data points collected between 6 and 10 g/dL that reflect the clinically significant operational range of these devices, in which they should be most accurate. Therefore, the clinical community remains largely ignorant of the performance characteristics of noninvasive Hb detectors in this 6 to 10 g/dL range because of the selection bias of the data reported to date. Although it is not current policy, we recommend that the Food and Drug Administration require companies to separately analyze and report data in the important 6 to 10 g/dL Hb concentration range when considering approval.
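How pooling can conceal bias in the transfusion range is sketched below. The paired values are hypothetical and deliberately constructed so that a device reading 1.5 g/dL high between 6 and 10 g/dL is masked by many readings that are 0.5 g/dL low above 10 g/dL; the helper `bias_in_range` is an illustrative name, not an established method.

```python
from statistics import mean

def bias_in_range(pairs, low, high):
    """Bias using only pairs whose reference Hb lies within [low, high]."""
    differences = [dev - ref for ref, dev in pairs if low <= ref <= high]
    if not differences:
        raise ValueError("no pairs fall inside the requested range")
    return mean(differences)

# Hypothetical (reference, device) pairs in g/dL; not study data.
pairs = [
    (7.0, 8.5), (8.0, 9.5),                      # 6-10 g/dL: device reads 1.5 high
    (11.0, 10.5), (12.0, 11.5), (12.5, 12.0),    # >10 g/dL: device reads 0.5 low
    (13.0, 12.5), (13.5, 13.0), (14.0, 13.5),
]

overall_bias = bias_in_range(pairs, float("-inf"), float("inf"))
critical_bias = bias_in_range(pairs, 6.0, 10.0)

print(f"bias over all data:      {overall_bias:+.2f} g/dL")
print(f"bias within 6-10 g/dL:   {critical_bias:+.2f} g/dL")
```

Reporting the two figures separately exposes a clinically relevant error that the pooled bias hides.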
WHAT ACCURACY IS REQUIRED?
With the average adult (male) blood volume of approximately 5 L (the product of 65 mL/kg and 75 kg) and the average unit of diluted packed RBCs having a volume of approximately 500 mL (Hb approximately 13 g/dL), 1 unit of blood is approximately 10% of the blood volume. Thus, when a physician decides to transfuse a single unit of RBCs, the smallest increment of exposure to the risks of infection and transfusion reaction, the choice is made to increase the blood volume by approximately 10%. By extension, an Hb measurement system needs to be accurate to ±10% in the range within which transfusion decisions are made. For these reasons, we propose that any new device demonstrate accuracy to ±10% in a critical anemic range. What is that range?
The American Society of Anesthesiologists transfusion guidelines15 from 2006 state that transfusion of RBCs should typically not be done when the Hb is >10 g/dL, and almost always for an Hb <6 g/dL. Thus, between 6 and 10 g/dL is where an operating room noninvasive Hb device needs to be accurate within 1 g/dL. Below 6 g/dL, accuracy is probably not as critical because within this range, transfusion would almost universally be recommended. Similarly, with an Hb >10 g/dL, the patient would rarely be transfused, so accuracy is again not as crucial.
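The arithmetic behind the ±10% requirement can be checked with a short sketch. The constants are the approximate figures quoted above; the variable and function names are illustrative only.

```python
# Approximate figures taken from the text above.
BLOOD_VOLUME_PER_KG_ML = 65   # mL/kg
BODY_WEIGHT_KG = 75           # kg
UNIT_VOLUME_ML = 500          # mL per unit of diluted packed RBCs

blood_volume_ml = BLOOD_VOLUME_PER_KG_ML * BODY_WEIGHT_KG   # 4875 mL, about 5 L
unit_fraction = UNIT_VOLUME_ML / blood_volume_ml            # roughly 0.10

def allowed_error(hb_g_dl, tolerance=0.10):
    """Absolute error (g/dL) permitted by a relative tolerance at a given Hb."""
    return tolerance * hb_g_dl

print(f"blood volume: {blood_volume_ml} mL")
print(f"1 unit is about {unit_fraction:.1%} of blood volume")
for hb in (6.0, 8.0, 10.0):
    print(f"Hb {hb:.1f} g/dL: +/-{allowed_error(hb):.1f} g/dL")
```

A ±10% tolerance therefore corresponds to about 0.6 g/dL at an Hb of 6 g/dL and 1 g/dL at an Hb of 10 g/dL, which is consistent with the 1 g/dL figure stated above for the upper end of the range.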
Additionally, the accuracy of the device should not deteriorate with patient conditions that might put them at greater risk of an inaccurate measurement. For example, blood from a patient in a low perfusion state, such as shock, should be as easily and accurately measured as that from a well-perfused patient. This is especially important, given that shock may well be the result of low Hb. In such a patient, it is critical to obtain an accurate Hb value because making the diagnosis and prescribing the treatment (which may include transfusion) may rely on this Hb result. Although Frasca et al.7 studied intensive care unit patients, they did not report any measurements done in patients with a low perfusion state.
THE ERROR GRID
In 1987, Clarke et al.16 introduced the glucose error grid to evaluate the clinical accuracy of point-of-care glucose meters compared with traditional laboratory devices. Figure 6 is an example showing the reference measurement on the x-axis and the sensor on the y-axis.10 There are 5 error zones on the plot, with the severity of the error increasing from A to E. “In the Clarke glucose error grid, zone A represents a 20% deviation from reference, zone B consists of results with a [>]20% deviation that would result in no or benign treatment, zone C represents overcorrection of acceptable blood glucose levels, zone D represents clinically dangerous failure to detect errors, and zone E results in erroneous treatment (contradictory decisions).”17
In 2011, our group introduced the Hb error grid (Fig. 7)18 as a method to clinically evaluate the accuracy of Hb measurement. (For a complete discussion of the Hb error grid, see Morey et al.18) The CLD reference and tested device measurements are placed on the x and y axes, respectively. Ideally, the resulting points generate a straight line. We created 3 zones: A, B, and C.
Zone A (in green) begins with an isthmus of a 10% error on either side of a perfectly accurate measurement between the Hb values of 6 and 10 g/dL because of the aforementioned American Society of Anesthesiologists guidelines and clinical relevance. Below 6 g/dL, transfusion will likely occur, so accuracy at this level is less important. Hb measurements >10 g/dL will likely result in no transfusion, so high accuracy is not needed. Zone C (red) signifies errors that may be critical. If the “true” Hb is <6 g/dL, but the device reports a value >10 g/dL, RBCs may be withheld, resulting in possible harm to the patient. Conversely, if the reference Hb is >10 g/dL, but the device reads <6 g/dL, an unnecessary transfusion would occur. Zone B (yellow) is between zone A and zone C, signifying an error that might result in harm, depending on the circumstances, but not as serious as a zone C error. From the broader perspective, this type of Hb error analysis is adaptable to other medical environments besides the operating room. If physicians practice in conditions wherein different Hb concentrations are important in clinical decision making, we encourage them to modify the zones to fit the needs of their particular patient population to understand whether the device meets their own needs. Any zone C errors are serious, and we believe that such a device should not be used in the clinical arena.
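The zone logic described above can be sketched as a simple classifier. This is an illustration, not the published grid: the zone A isthmus (10% between 6 and 10 g/dL) and the zone C definitions follow the text, but the exact zone A boundaries outside 6 to 10 g/dL are not given here, so the sketch assumes that a pair falls in zone A whenever the reference and device values lie on the same side of the 6 to 10 g/dL window. The function name and default thresholds are hypothetical and can be adjusted for other patient populations.

```python
def hb_error_zone(reference, device, low=6.0, high=10.0, tolerance=0.10):
    """Classify one (reference, device) Hb pair, in g/dL, as zone A, B, or C."""
    # Zone C (from the text): readings on opposite sides of the decision window,
    # so RBCs may be withheld or an unnecessary transfusion may occur.
    if (reference < low and device > high) or (reference > high and device < low):
        return "C"
    # Zone A isthmus (from the text): within 10% of reference between 6 and 10 g/dL.
    if low <= reference <= high:
        return "A" if abs(device - reference) <= tolerance * reference else "B"
    # ASSUMPTION (not from the text): outside the window, agreement on which side
    # of the window the Hb lies counts as zone A, because the decision is the same.
    if (reference < low and device < low) or (reference > high and device > high):
        return "A"
    # Everything else lies between zone A and zone C.
    return "B"

# Hypothetical pairs; not study data.
for ref, dev in [(8.0, 8.5), (8.0, 9.5), (5.0, 11.0), (12.0, 5.0), (5.0, 8.0)]:
    print(f"reference {ref:.1f}, device {dev:.1f}: zone {hb_error_zone(ref, dev)}")
```

Counting the pairs that fall in each zone, and in particular any in zone C, gives a summary that is tied directly to the transfusion decision.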
THE BOTTOM LINE
The essential purpose of noninvasive Hb technology in the operating room is to assist clinicians in deciding whether to transfuse. For that reason, we suggest that not only should noninvasive Hb devices and the gold standard method produce statistically similar results, they should also lead to comparable clinical decisions. To that end, a test of decision making around the relevant Hb concentration range of 6 to 10 g/dL should be used. It is our opinion that the published accuracy data from the Masimo Radical-7 device, especially in the aforementioned critical range, do not guide clinicians to make transfusion decisions.
Name: Mark J. Rice, MD.
Contribution: The author helped prepare the manuscript.
Attestation: Mark J. Rice, MD, attests to having approved the manuscript; Dr. Rice is designated as the archival author.
Name: Nikolaus Gravenstein, MD.
Contribution: The author helped prepare the manuscript.
Attestation: Nikolaus Gravenstein, MD, attests to having approved the manuscript.
Name: Timothy E. Morey, MD.
Contribution: The author helped prepare the manuscript.
Attestation: Timothy E. Morey, MD, attests to having approved the manuscript.
This manuscript was handled by: Dwayne R. Westenskow, PhD.
a Smith JL. The Pursuit of Noninvasive Glucose: “Hunting the Deceitful Turkey”. Available at: http://www.mendosa.com/noninvasive_glucose.pdf. Accessed November 12, 2012.
1. Severinghaus JW. Takuo Aoyagi: discovery of pulse oximetry. Anesth Analg. 2007;105:S1–4
2. Applegate RL 2nd, Barr SJ, Collier CE, Rook JL, Mangus DB, Allard MW. Evaluation of pulse cooximetry in patients undergoing abdominal or pelvic surgery. Anesthesiology. 2012;116:65–72
3. Berkow L, Rotolo S, Mirski E. Continuous noninvasive hemoglobin monitoring during complex spine surgery. Anesth Analg. 2011;113:1396–402
4. Butwick A, Hilton G, Carvalho B. Non-invasive haemoglobin measurement in patients undergoing elective Caesarean section. Br J Anaesth. 2012;108:271–7
5. Causey MW, Miller S, Foster A, Beekley A, Zenger D, Martin M. Validation of noninvasive hemoglobin measurements using the Masimo Radical-7 SpHb Station. Am J Surg. 2011;201:592–8
6. Colquhoun DA, Forkin KT, Durieux ME, Thiele RH. Ability of the Masimo pulse CO-Oximeter to detect changes in hemoglobin. J Clin Monit Comput. 2012;26:69–73
7. Frasca D, Dahyot-Fizelier C, Catherine K, Levrat Q, Debaene B, Mimoz O. Accuracy of a continuous noninvasive hemoglobin monitor in intensive care unit patients. Crit Care Med. 2011;39:2277–82
8. Gayat E, Bodin A, Sportiello C, Boisson M, Dreyfus JF, Mathieu E, Fischler M. Performance evaluation of a noninvasive hemoglobin monitoring device. Ann Emerg Med. 2011;57:330–3
9. Lamhaut L, Apriotesei R, Combes X, Lejay M, Carli P, Vivien B. Comparison of the accuracy of noninvasive hemoglobin monitoring by spectrophotometry (SpHb) and HemoCue® with automated laboratory hemoglobin measurement. Anesthesiology. 2011;115:548–54
10. Macknet MR, Allard M, Applegate RL 2nd, Rook J. The accuracy of noninvasive and continuous total hemoglobin measurement by pulse CO-Oximetry in human subjects undergoing hemodilution. Anesth Analg. 2010;111:1424–6
11. Miller RD, Ward TA, Shiboski SC, Cohen NH. A comparison of three methods of hemoglobin monitoring in patients undergoing spine surgery. Anesth Analg. 2011;112:858–63
12. Nguyen BV, Vincent JL, Nowak E, Coat M, Paleiron N, Gouny P, Ould-Ahmed M, Guillouet M, Arvieux CC, Gueret G. The accuracy of noninvasive hemoglobin measurement by multiwavelength pulse oximetry after cardiac surgery. Anesth Analg. 2011;113:1052–7
13. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–10
14. National Archives and Records Administration. Part 43-Maintenance, preventive maintenance, rebuilding and alteration. Title 14: Aeronautics and space. CFR. 2012
15. American Society of Anesthesiologists Task Force on Perioperative Blood Transfusion and Adjuvant Therapies. Practice guidelines for perioperative blood transfusion and adjuvant therapies: an updated report by the American Society of Anesthesiologists Task Force on Perioperative Blood Transfusion and Adjuvant Therapies. Anesthesiology. 2006;105:198–208
16. Clarke WL, Cox D, Gonder-Frederick LA, Carter W, Pohl SL. Evaluating clinical accuracy of systems for self-monitoring of blood glucose. Diabetes Care. 1987;10:622–8
17. Rice MJ, Pitkin AD, Coursin DB. Review article: glucose measurement in the operating room: more complicated than it seems. Anesth Analg. 2010;110:1056–65
18. Morey TE, Gravenstein N, Rice MJ. Let’s think clinically instead of mathematically about device accuracy. Anesth Analg. 2011;113:89–91
19. Rice MJ, Coursin DB. Continuous measurement of glucose: facts and challenges. Anesthesiology. 2012;116:199–204
20. Piper HG, Alexander JL, Shukla A, Pigula F, Costello JM, Laussen PC, Jaksic T, Agus MS. Real-time continuous glucose monitoring in pediatric patients during and after cardiac surgery. Pediatrics. 2006;118:1176–84
© 2013 International Anesthesia Research Society