Article: Research Report

Hemoximetry as the “Gold Standard”? Error Assessment Based on Differences Among Identical Blood Gas Analyzer Devices of Five Manufacturers

Gehring, Hartmut MD*; Duembgen, Lutz PhD; Peterlein, Mareike*; Hagelberg, Söhnke MD*; Dibbelt, Leif PhD

doi: 10.1213/

There is no test system for identifying the measurement error of the reference procedure for pulse oximetry, i.e., hemoximetry, also known as CO-oximetry. Hemoximetry discontinuously measures hemoglobin oxygen saturation and dyshemoglobins from blood samples. All medical devices deliver the measured values with a certain error.

CO-oximetry or hemoximetry is an essential component of blood gas analyzer systems. Hemoximetry presents hemoglobin oxygen saturation, dyshemoglobins, and total hemoglobin concentration data. The functional oxygen saturation measured by this procedure is the basis for calibrating pulse oximeters. Pulse oximeters cannot be calibrated using physical procedures, but only by directly comparing the reported measurements and the parallel arterial oxygen saturation measured by hemoximetry in a group of healthy subjects (1). Thus, the errors of hemoximetry are carried over into measurements made by the pulse oximeter.

There is no standard procedure for checking the measurement error of the hemoximeter. Usually the devices are calibrated by the manufacturer before delivery. For calibrating hemoximetry in everyday clinical routine, aqueous solutions supplied by the manufacturers are used. A gravimetric procedure that cannot be applied in normal clinical routine has been described (2).

Comparisons of hemoximeters from leading manufacturers, which are intended for use as reference procedures, have not been published. The hemoximeters of the companies participating in the present study were introduced in papers dealing with the reference analysis of pulse oximeters (3–9).

For pulse oximeter calibration and confirmation tests done on volunteers, the United States Food and Drug Administration (FDA) requires only one CO-oximeter device for measuring sO2 used according to the hemoximeter manufacturer’s recommended procedures and supplies (1).

Bland and Altman pointed out that not just test devices but also reference devices produce errors (10). The Bland–Altman procedure for error analysis and presentation is useful for the presentation of pulse oximeter accuracy data (11).

The objective of this study was to determine the measurement error associated with the determination of arterial oxygen saturation by hemoximetry.

Because no reference procedure is available for measuring oxygen saturation with CO-oximeters, the pairwise differences between identical devices from five leading manufacturers, as well as the differences between manufacturers, were assessed in order to estimate the error between the hemoximeters.

Methods

Device Preparation

Table 1 lists the participating companies as well as the serial numbers and model descriptions of the devices. Table 2 describes the technical measurement procedure. All devices were assembled pairwise in a climatically controlled room and serviced by employees from the manufacturer before they were put into operation. The air pressure gauges integrated within the devices were checked and synchronized with the reference system ( During monitoring phases, the prevailing air pressure values were recorded hourly. The devices were adjusted so that they reported identical units and denotations of the variables.

Table 1:
Manufacturers Included in the Study with Serial Numbers of the Devices A and B
Table 2:
Measurement Principles of the Hemoximeter

The amount of blood required to provide sufficient material for the entire range of blood gas analyzers was tested under several conditions. At least three syringes specially designed for blood gas sampling (PICO™ 50 with dry electrolyte-balanced heparin (80 IU, Radiometer Copenhagen)), and filled with 2 mL of blood, appeared sufficient to provide all systems with an adequate amount of blood.

Study Protocol

A setup was chosen for the study that

  1. reflected the reality of the operating room or the intensive care unit, with blood gas analyses from patients, and
  2. also corresponded to the standards of the desaturation studies for the testing and calibration of pulse oximeters.

Twelve healthy male and female volunteers were investigated over 4 days regarding standardized calibration procedures for pulse oximeters corresponding to the requirements of the FDA over the range of 70%–100% sO2. The study was approved by the Ethical Committee of the University of Luebeck, Germany, and all participants gave written informed consent. All volunteers breathed an oxygen/nitrogen mixture with high flow (15 L/min) given by a Trajan 808™ (Draeger Medical, Luebeck) via a valve-less face mask. Three N-3000 with finger clip and one N-595 with a forehead sensor (all Nellcor, Pleasanton, CA) served as reference pulse oximeters for breathe down control. Mean values, as well as individual data points from the systems, were presented continuously on a display. The protocol followed the standard procedure of the FDA (1) where five levels (L) were established in the sO2 range between 100% and 70%, and five blood samples were withdrawn under steady state plateau conditions for testing, with the hemoximeter at the breathe-down laboratory (2 OSM 3 and 2 ABL 725, Radiometer Copenhagen, Denmark). At the end of three of the levels (L97 near to 97% sO2, L85 at 85% sO2 and L75 at 75% sO2) and also under the presence of plateau conditions, three syringes were rapidly filled representing the sample for one level. The breathe-down procedure was repeated, so that at least six sampling points materialized for each volunteer. The three syringes marked only with a colored spot were mixed and then transferred within 30 s to the adjacent study laboratory. Three technical assistants, blinded to the syringes, were randomly assigned to a manufacturer and the Devices A or B for testing. Data from each blood gas analyzer were stored on disk and in printed form before they were transferred to an Excel™ data file.

Statistical Analysis

It should be remembered that 1) each device received blood that was randomly selected from a sample, and that 2) there was no value for the true saturation.

The variables sO2 (hemoglobin oxygen saturation), cO2Hb (oxygen content of hemoglobin), cHHb (deoxyhemoglobin concentration), cCOHb (carboxyhemoglobin content), cMetHb (methemoglobin content), and ctHb (total hemoglobin content) were analyzed. The evaluation was based on the raw data and on the differences between Devices A and B at each level, taking into account session, subject, manufacturer, and device.

The measurements are modeled as:

\[ X_{ijklm} = M_{ijk} + \Delta_{jlm} + \varepsilon_{ijklm} \]

with i = 1, 2: session; j = 1, 2, 3: level; k = 1, …, 12: subject; l = 1, …, 5: manufacturer; m = 1, 2: device.

Here Mijk is a randomized effect depending on session, level, and subject, Δjlm is an effect depending on level, manufacturer and device, and εijklm is a measurement error with the expected value 0 and standard deviation σjl, depending on level and manufacturer.
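The additive structure of this model can be illustrated with a small simulation. The following is a minimal sketch, not the authors' analysis code: the function name and all numerical values (device effects, standard deviation, range of the sample effect) are hypothetical, and normally distributed errors are assumed. It shows that the sample effect Mijk cancels in the pairwise difference A − B, so that the differences have mean Δjl1 − Δjl2 and variance 2σjl².

```python
import random

def simulate_pair(m_ijk, delta_a, delta_b, sigma, rng):
    """Return (reading_A, reading_B) for one blood sample under the additive model."""
    x_a = m_ijk + delta_a + rng.gauss(0.0, sigma)
    x_b = m_ijk + delta_b + rng.gauss(0.0, sigma)
    return x_a, x_b

rng = random.Random(0)
# Hypothetical values, for illustration only (not taken from the study).
delta_a, delta_b, sigma = 0.3, -0.1, 0.5

diffs = []
for _ in range(20000):
    m = rng.uniform(70.0, 100.0)  # sample effect M_ijk, shared by both devices
    a, b = simulate_pair(m, delta_a, delta_b, sigma, rng)
    diffs.append(a - b)           # M_ijk cancels in the difference

n = len(diffs)
mean_diff = sum(diffs) / n
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
print(round(mean_diff, 2), round(var_diff, 2))
```

Under these assumptions the printed mean should lie close to delta_a − delta_b and the printed variance close to 2 · sigma², regardless of how widely the sample effect varies.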

Wilcoxon signed rank tests with Bonferroni–Holm adjustment were applied to calculate P values for differences between devices and differences between manufacturers. See also Appendix A.

The question regarding which variable was responsible for the increasing errors in sO2 measurements with respect to levels 97, 85, and 75 was answered by analysis of variance proportion. The procedure for analyzing the proportion of the variance of cHHb and cO2Hb that contributed to the variance of the sO2 measurements is listed in Appendix B.

Results

Each of the test systems was used to analyze n = 72 samples. The distribution of the absolute values for the individual manufacturers is given in Figure 1. At level 97, significant differences were found between the mean of Devices A and B from one manufacturer and the mean of all other companies; at levels 85 and 75, these differences were masked by the increasing variances.

Figure 1:
Mean and SD of the absolute values, summarized for Sessions 1 and 2. *P < 0.05; test between Group 1 (mean of Devices A and B) versus Group 2 (mean of the other companies).

The pairwise differences between Devices A and B, taken as a measure of the error of hemoximeters within a series, increased clearly and significantly for all manufacturers between levels 97 and 85. This effect was even stronger between levels 97 and 75 (Fig. 2).

Figure 2:
Mean and SD of the differences of the devices (A–B) of each company, ordered by level, summarized for Sessions 1 and 2. P < 0.05; test between Devices A and B.

Summarized over all samples, the differences between Devices A and B are given as means and standard deviations for each manufacturer in Table 3, supplemented by the values for cHHb and cO2Hb as well as the sums of cO2Hb and cHHb, which represent the denominator of the formula sO2 = cO2Hb/(cO2Hb + cHHb).

Table 3:
Differences Between Devices (A–B) for Five Manufacturers

The measurement of cCOHb showed no dependence on sO2 level or session, but revealed significant differences between the manufacturers, as well as between the Devices A and B (Table 3). Overall, the absolute measured values for cCOHb were scattered over a broad range between 0% and 4%.

The measurement of cMetHb also showed no dependence with regard to sO2 level and session, but here, just as with cCOHb, significant differences were seen between the manufacturers as well as between the Devices A and B (Table 3). Overall, the scattering of the measured values was restricted to a narrow range of between 0% and 1% cMetHb.

The variances associated with the measurement of cHHb were disproportionately responsible for the increasing differences between Devices A and B (Table 4).

Table 4:
Variance Proportion Test. See Appendix B for Abbreviations

Discussion

For all manufacturers, the differences of the sO2 values, measured with identical devices of a series, increase as saturation falls. For the analysis of sO2, one can therefore assume that a measurement error also exists even for the “gold standard” of hemoximetry, and that this will influence pulse oximeter calibration.

For the absolute values, significant differences between the instrument manufacturers already occurred at level 97, an effect that was masked by the increasing variance at levels 85 and 75. The variance for the measurement of cHHb can be identified as an important cause underlying the error. The measurement errors for cMetHb were significant, but restricted to a range of 0%–1%. However, there were significant variances in the measurement error of cCOHb between 0% and 4%.

A standardized production of blood samples with a defined saturation level between 0% and 100% can only be achieved in isolated cases using a tonometrically based gravimetric procedure (2). A primary calibration can only occur in the factory before the devices are shipped to the customer. Any further calibration is then based merely on the application of aqueous sample materials; therefore, a fundamental determination of the measurement error is not possible. In particular, the error due to an inadequate hemolysis of the cellular substances in the blood is lost when calibrating with the aqueous samples.

As a possible means for approaching the unknown values for the true sO2 levels, laboratories involved in calibrating pulse oximeters use several hemoximeters of an identical design from one or several companies, and then compute mean values (12) from them. This assumption is based, however, on a randomly distributed error that becomes minimized upon averaging.
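The limitation of averaging can be made explicit with a short worked equation. This is a minimal sketch under assumptions that are not taken from the paper: each of n hemoximeters is assumed to report Y_i = s + b + e_i, where s is the unknown true saturation, b is a bias shared by all devices, and the e_i are independent random errors with mean zero and variance σ².

```latex
\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = s + b + \bar{e}, \qquad
\mathrm{E}[\bar{Y}] - s = b, \qquad
\mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{n}
```

Under these assumptions, averaging reduces the random part of the error by a factor of n in variance, but leaves any shared systematic error b unchanged.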

The study presented here was based on the fact that all test devices received randomized blood from a population sample. In order to test for reproducibility, the breathe down procedure was carried out twice for each test subject (Sessions 1 and 2). In the statistical analysis, an effect based on differences between Sessions 1 and 2 could be excluded.

Considering the results presented, we can also assume a marked error within the reference devices. This is an effect that was already clearly identified by Bland and Altman, a fact which led them to establish their own evaluation procedure (10). A modified presentation, related only to the values obtained from the reference devices (11), can therefore be recommended only with reservations.
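For readers less familiar with the procedure, the Bland–Altman quantities can be sketched as follows. This is an illustrative implementation with hypothetical readings, not data from the study; the function name and the 1.96 multiplier for approximate 95% limits of agreement are choices made here. Because the reference is itself subject to error, the differences are related to the mean of both methods rather than to the reference value alone.

```python
from statistics import mean, stdev

def bland_altman(test, reference):
    """Bias, SD of differences, and approximate 95% limits of agreement."""
    diffs = [t - r for t, r in zip(test, reference)]
    # x axis of a Bland-Altman plot: mean of both methods, not the reference alone
    averages = [(t + r) / 2 for t, r in zip(test, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "bias": bias,
        "sd": sd,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
        "averages": averages,
    }

# Hypothetical sO2 readings in percent, for illustration only.
test_readings = [97.0, 85.5, 75.0, 96.0]
reference_readings = [96.0, 85.0, 76.0, 96.5]
result = bland_altman(test_readings, reference_readings)
print(result["bias"], result["limits"])
```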

The algorithms and corrective procedures established by the manufacturers represent a further gray zone with regard to the reporting of measurements within the reference systems. As a single example, the number of wavelengths used in the devices investigated clearly varies between the different manufacturers (Table 2).

The variable sO2 reported by the hemoximeters refers to the functional oxygen saturation. The calculation is based on hemoglobin that can bind oxygen (functional sO2 = cO2Hb/(cO2Hb + cHHb)) and is directly comparable to the value reported by the pulse oximeters. A new pulse oximeter (12) is now also able to measure cMetHb and cCOHb so that, in principle, a measurement of the fractional oxygen saturation (fractional sO2 = cO2Hb/(cMetHb + cCOHb + cO2Hb + cHHb)) is also possible by pulse oximetry. In a paper by Barker et al. (12), a Bland–Altman calculated difference (bias) of −1.12% and a precision of ±2.19% were reported for the cCOHb pulse oximetric value. This represents an order of magnitude that was also measured in the present test with the reference devices.
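The two definitions can be compared directly in a few lines. The sketch below implements the two formulas as stated in the text; the function names and the hemoglobin fractions are hypothetical and serve only to show how a dyshemoglobin share lowers the fractional value relative to the functional one.

```python
def functional_so2(c_o2hb, c_hhb):
    """Functional saturation in percent: cO2Hb / (cO2Hb + cHHb)."""
    return 100.0 * c_o2hb / (c_o2hb + c_hhb)

def fractional_so2(c_o2hb, c_hhb, c_cohb, c_methb):
    """Fractional saturation in percent: cO2Hb relative to all four hemoglobin species."""
    return 100.0 * c_o2hb / (c_o2hb + c_hhb + c_cohb + c_methb)

# Hypothetical hemoglobin fractions (percent of total hemoglobin).
o2hb, hhb, cohb, methb = 93.1, 4.9, 1.5, 0.5
functional = functional_so2(o2hb, hhb)
fractional = fractional_so2(o2hb, hhb, cohb, methb)
print(round(functional, 1), round(fractional, 1))
```

With 2% dyshemoglobins in this example, the functional value is about 95% and the fractional value about 93%, so errors in cCOHb and cMetHb enter only the fractional saturation.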

A fundamental problem is the lack of uniformity in the nomenclature used by the different manufacturers for the reported variables. Only the measurement of functional oxygen saturation (sO2 in %) was applied uniformly both for the hemoximeter and the pulse oximeter. The term “fractional oxygen saturation” will find increasing use due to the application of the new pulse oximeters in clinical settings. Errors arising from different spellings and definitions are more or less inevitable. In Germany, a consensus-building conference involving the leading manufacturers of blood gas analysis devices has already taken place, the results of which were published in the “Qualitest Consensus” (13).

At the time when the testing was performed, the devices examined were state-of-the-art. New developments in oximeter testing and standardization are desirable. A standardized test procedure for hemoximeters is important for the variables sO2, cO2Hb, cHHb, cCOHb, and cMetHb, similar to those already established for the measurement of ctHb (14).

Conclusions

Errors in hemoximeter determination of sO2 depend both on the manufacturer of the hemoximeter and on the sO2 range. The variable cHHb disproportionately contributes to the measurement error for sO2. Further, the measurement error for cCOHb is related to the absolute values and the variance is clearly greater than is the case for cMetHb, a fact relevant to calculating the fractional saturation.

Finally, we strongly recommend the Bland and Altman procedure as the preferred analysis method for reporting and presenting hemoximeter errors.

Appendix A

Estimators and Confidence Intervals for Single Standard Deviations σjl

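We assume that the differences

\[ D_{ijkl} = X_{ijkl1} - X_{ijkl2} \]

between the readings X_{ijklm} of the two devices (m = 1, 2) are distributed according to N(Δjl1 − Δjl2, 2σjl²), apart from a small fraction of outliers, whereby Δjl1 − Δjl2 is the systematic difference between two identical devices. (Note that the effect Mijk disappears here.) Denoting the available differences Dijkl for given level j and manufacturer l with D1, D2, …, Dn, conventional parametric confidence intervals for σjl are then given by

\[ \left[ \sqrt{\frac{\sum_{a=1}^{n} (D_a - \bar{D})^2}{2\,\chi^2_{n-1;\,1-\alpha/2}}},\; \sqrt{\frac{\sum_{a=1}^{n} (D_a - \bar{D})^2}{2\,\chi^2_{n-1;\,\alpha/2}}} \right] \qquad \text{[A1]} \]

Here \bar{D} is the mean of D1, D2, …, Dn, and \chi^2_{m;\beta} denotes the β-quantile of the χ² distribution with m degrees of freedom.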

To safeguard against outliers and obtain a robust estimate for σjl, we deleted the smallest and largest differences Da. For this purpose, one has to replace the χ² quantiles in Eq. [A1] with new quantiles taking into account this trimming. The latter have been determined using extensive Monte-Carlo simulations. Using the 50% quantile, we also calculated a robust point estimator for σjl, i.e., a “trimmed version” of

\[ \hat{\sigma}_{jl} = \sqrt{\frac{1}{2(n-1)} \sum_{a=1}^{n} (D_a - \bar{D})^2} \]
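The idea of replacing χ² quantiles with simulated quantiles can be sketched as follows. This is not the authors' code: it assumes that exactly one smallest and one largest difference are deleted, that the untrimmed differences are normal with variance 2σ², and it uses a modest number of Monte Carlo repetitions. All function names and example values are hypothetical.

```python
import random

def trimmed_ss(values):
    """Sum of squared deviations after dropping the smallest and the largest value."""
    kept = sorted(values)[1:-1]
    center = sum(kept) / len(kept)
    return sum((v - center) ** 2 for v in kept)

def simulated_quantiles(n, probs, reps=5000, seed=1):
    """Monte Carlo quantiles of trimmed_ss for n standard normal draws."""
    rng = random.Random(seed)
    sims = sorted(
        trimmed_ss([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
    )
    return [sims[min(reps - 1, int(p * reps))] for p in probs]

def robust_sigma(diffs, alpha=0.05):
    """Point estimate and (1 - alpha) interval for sigma, assuming Var(D) = 2 * sigma**2."""
    q_lo, q_med, q_hi = simulated_quantiles(
        len(diffs), [alpha / 2, 0.5, 1 - alpha / 2]
    )
    ss = trimmed_ss(diffs)
    estimate = (ss / (2 * q_med)) ** 0.5          # 50% quantile gives the point estimate
    interval = ((ss / (2 * q_hi)) ** 0.5, (ss / (2 * q_lo)) ** 0.5)
    return estimate, interval

# Hypothetical A - B differences; the last value plays the role of an outlier.
example = [0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.4, -0.3, 0.1, 2.5]
estimate, interval = robust_sigma(example)
print(estimate, interval)
```

The simulated quantiles take over the role of the χ² quantiles in Eq. [A1]; because the outlier is removed before the sum of squares is formed, it no longer inflates the estimate.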

Confidence Intervals for Ratios σcl/σbl

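For 1 ≤ b < c ≤ 3, a standard (1 − α)-confidence interval for the ratio σcl/σbl is given by

\[ \left[ \frac{\hat{\sigma}_{cl}}{\hat{\sigma}_{bl}} \Big/ \sqrt{F_{n(c)-1,\,n(b)-1;\,1-\alpha/2}},\; \frac{\hat{\sigma}_{cl}}{\hat{\sigma}_{bl}} \Big/ \sqrt{F_{n(c)-1,\,n(b)-1;\,\alpha/2}} \right] \]

Here F_{m(1),\,m(2);\,\beta} denotes the β-quantile of Fisher’s F distribution with m(1) and m(2) degrees of freedom, and

\[ \hat{\sigma}_{jl} = \sqrt{\frac{1}{2\,(n(j)-1)} \sum_{a=1}^{n(j)} \left( D_{ajl} - \bar{D}_{jl} \right)^2} \]

is the usual estimator of σjl based on all n(j) available differences Dajl = Dijkl.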

Again we modified this classical procedure by means of trimmed samples and corresponding surrogates for the F quantiles.

Statistical Inference About the Device Differences Δjl1 − Δjl2

The null hypothesis “Δjl1 − Δjl2 = 0”, i.e., that there is no systematic difference between the two devices of manufacturer l at level j, was tested via Wilcoxon’s signed rank test. Using Wilcoxon’s signed rank test rather than a corresponding t-test has the advantage that the results are less sensitive to outliers, and one could even relax the model assumptions considerably. Since we are testing 15 null hypotheses simultaneously (5 manufacturers, 3 levels), we adjusted our P values via the Bonferroni–Holm procedure (15).
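The Bonferroni–Holm step can be sketched in a few lines. The function below is an illustrative implementation of Holm's step-down adjustment applied to raw P values that are assumed to have been computed already (for example by a signed rank test); the function name and the three example P values are hypothetical, whereas the study adjusted 15 P values.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # the smallest P value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # adjusted values must not decrease along the sorted order
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]  # hypothetical raw P values
adjusted = holm_adjust(raw)
print(adjusted)
```

A null hypothesis is rejected at level α when its adjusted P value is at most α, which controls the familywise error rate without the full conservatism of a plain Bonferroni correction.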

With D1, D2, …, Dn as in “Estimators and Confidence Intervals for Single Standard Deviations σjl” section, a (1 − α)-confidence interval for Δjl1 − Δjl2 consists of all numbers Δ such that the Wilcoxon signed rank statistic W(D1 − Δ, D2 − Δ, …, Dn − Δ) lies within the thresholds for the two-sided Wilcoxon’s signed rank test at level α. In other words, we determine the set of all values Δ such that the latter test does not reject the null hypothesis “Δjl1 − Δjl2 = Δ”.

Comparing Manufacturers

To test whether the devices of manufacturer l are significantly different from the other manufacturers’ devices at a certain level j, we considered the means

\[ X_{ijkl\cdot} = \frac{X_{ijkl1} + X_{ijkl2}}{2} = M_{ijk} + \Delta_{jl} + \varepsilon_{ijkl} \]

over two identical devices with Δjl = (Δjl1 + Δjl2)/2 describing the average effect of manufacturer l at level j and εijkl = (εijkl1 + εijkl2)/2 having mean zero and standard deviation

\[ \sigma_{jl} / \sqrt{2} \]

Now we considered the differences

\[ X_{ijkl\cdot} - \bar{X}_{ijkl\cdot} = \left( \Delta_{jl} - \bar{\Delta}_{jl} \right) + \left( \varepsilon_{ijkl} - \bar{\varepsilon}_{ijkl} \right) \]

where the bar stands for averaging over the remaining four manufacturers. With these differences we tested the 15 null hypotheses “\Delta_{jl} = \bar{\Delta}_{jl}” via Wilcoxon’s signed rank test plus Bonferroni–Holm adjustment, analogously to the preceding section on the device differences.
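The construction of these differences can be sketched for a single blood sample. The code below is illustrative only: the function name and the readings are hypothetical. For each manufacturer, the mean of its two devices is compared with the average of the device means of the remaining manufacturers, so that the sample effect Mijk cancels.

```python
def manufacturer_contrasts(readings):
    """readings[l] = (device_A, device_B) for one sample; returns one contrast per manufacturer."""
    means = [(a + b) / 2 for a, b in readings]  # mean over the two identical devices
    total = sum(means)
    k = len(means)
    # own mean minus the average over the remaining k - 1 manufacturers
    return [m - (total - m) / (k - 1) for m in means]

# Hypothetical sO2 readings (percent) of five manufacturers for one sample.
sample = [(97.0, 97.2), (96.8, 96.6), (97.4, 97.0), (96.9, 97.1), (97.5, 97.5)]
contrasts = manufacturer_contrasts(sample)
print([round(c, 3) for c in contrasts])
```

Collecting one such contrast per sample for a given manufacturer and level yields the values that enter the signed rank test described above.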

Appendix B

Our aim is to quantify the contribution of errors in cO2Hb and cHHb measurements to the total error in measuring sO2. To this end, we model the former measurements as cO2Hb = μ1 + ε1 and cHHb = μ2 + ε2 with true concentrations μ1, μ2 and independent random errors ε1, ε2 having mean zero and standard deviations σ1, σ2. Assuming that σ1 + σ2 ≪ μ1 + μ2, we may expand sO2 as follows:

\[ \mathrm{sO_2} = \frac{\mu_1 + \varepsilon_1}{\mu_1 + \mu_2 + \varepsilon_1 + \varepsilon_2} \approx \frac{\mu_1}{\mu_1 + \mu_2} + \frac{\mu_2 \varepsilon_1 - \mu_1 \varepsilon_2}{(\mu_1 + \mu_2)^2} \]

The second summand on the right hand side has variance

\[ \frac{\mu_2^2 \sigma_1^2 + \mu_1^2 \sigma_2^2}{(\mu_1 + \mu_2)^4} \]

Thus, the variance proportion (VP) of cO2Hb, i.e., the contribution to the overall variance of sO2, is essentially equal to

\[ \mathrm{VP}_{\mathrm{cO_2Hb}} = \frac{\mu_2^2 \sigma_1^2}{\mu_2^2 \sigma_1^2 + \mu_1^2 \sigma_2^2} \]

To obtain an estimate for this variance proportion, we plug in classical estimates for the unknown means μi and standard deviations σi based on all measurements from a certain level. That means, we average over all test subjects, sessions, manufacturers, and devices. This leads to

\[ \widehat{\mathrm{VP}}_{\mathrm{cO_2Hb}} = \frac{\hat{\mu}_2^2 \hat{\sigma}_1^2}{\hat{\mu}_2^2 \hat{\sigma}_1^2 + \hat{\mu}_1^2 \hat{\sigma}_2^2} \]
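The plug-in formula is easy to evaluate numerically. The sketch below uses hypothetical values (not estimates from the study) and a function name chosen here; it assumes the first-order approximation derived above.

```python
def variance_proportion_o2hb(mu1, mu2, sigma1, sigma2):
    """Share of the sO2 variance attributable to the cO2Hb error (first-order approximation)."""
    o2hb_part = (mu2 * sigma1) ** 2   # cO2Hb error is weighted by the cHHb level
    hhb_part = (mu1 * sigma2) ** 2    # cHHb error is weighted by the cO2Hb level
    return o2hb_part / (o2hb_part + hhb_part)

# Hypothetical means and standard deviations, equal error size for both fractions.
vp_o2hb = variance_proportion_o2hb(mu1=85.0, mu2=15.0, sigma1=0.5, sigma2=0.5)
vp_hhb = 1.0 - vp_o2hb
print(round(vp_o2hb, 3), round(vp_hhb, 3))
```

In this example the two fractions are measured with the same standard deviation, yet about 97% of the sO2 variance is attributed to cHHb, because each error is weighted by the squared mean of the other fraction and cO2Hb is by far the larger of the two.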

References

1. ISO 9919. Medical electrical equipment—particular requirements for the basic safety and essential performance of pulse oximeter equipment for medical use. 2005
2. Zander R, Schaffartzik W. Qualitest Häm-Oxymeter. Georg Thieme Publisher (ISSN 1434–0143) 1999;5:0–8
3. Dumas C, Wahr JA, Tremper KK. Clinical evaluation of a prototype motion artifact resistant pulse oximeter in the recovery room. Anesth Analg 1996;83:269–72
4. Senn O, Clarenbach CF, Kaplan V, Maggiorini M, Bloch KE. Monitoring carbon dioxide tension and arterial oxygen saturation by a single earlobe sensor in patients with critical illness or sleep apnea. Chest 2005;128:1291–6
5. Van de Louw A, Cracco C, Cerf C, Harf A, Duvaldestin P, Lemaire F, Brochard L. Accuracy of pulse oximetry in the intensive care unit. Intensive Care Med 2001;27:1606–13
6. Berkenbosch JW, Tobias JD. Comparison of a new forehead reflectance pulse oximeter sensor with a conventional digit sensor in pediatric patients. Respir Care 2006;51:726–31
7. Kugelman A, Wasserman Y, Mor F, Goldinov L, Geller Y, Bader D. Reflectance pulse oximetry from core body in neonates and infants: comparison to arterial blood gas oxygen saturation and to transmission pulse oximetry. J Perinatol 2004;24:366–71
8. MacLeod DB. The desaturation response time of finger pulse oximeters during mild hypothermia. Anaesthesia 2005;60:65–71
9. Hummler HD, Engelmann A, Pohlandt F, Högel J, Franz AR. Accuracy of pulse oximetry readings in an animal model of low perfusion caused by emerging pneumonia and sepsis. Intensive Care Med 2004;30:709–13
10. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–10
11. Bickler PE, Feiner JR, Severinghaus JW. Effects of skin pigmentation on pulse oximeter accuracy at low saturation. Anesthesiology 2005;102:715–9
12. Barker SJ, Curry J, Redford D, Morgan S. Measurement of carboxyhemoglobin and methemoglobin by pulse oximetry: a human volunteer study. Anesthesiology 2006;105:892–7
13. Zander R. Qualitest Consensus. Georg Thieme Publisher (ISSN 1434–0143) 2005;8:0–7
14. Gehring H, Hornberger C, Dibbelt L, Roth-Isigkeit A, Gerlach K, Schumacher J, Schmucker P. Accuracy of point-of-care-testing (POCT) for determining hemoglobin concentrations. Acta Anaesthesiol Scand 2002;46:980–6
15. Holm S. A simple sequentially rejective multiple test procedure. Scand J Statistics 1979;6:65–70
International Anesthesia Research Society