We read with interest the recent evaluation of the 3 M SpotOn thermometer by Iden et al.1 Encouragingly, their results in noncardiac surgical patients were generally similar to ours in cardiac surgical patients.2 There are nonetheless several aspects of their report where clarification might be helpful.
Perhaps the most striking aspect of the report by Iden et al. is that 37 of 120 enrolled patients, very nearly a third, were excluded from analysis. Granted, two patients were switched to spinal anaesthesia and the SpotOn thermometer malfunctioned in four others. But 19 patients were eliminated because of ‘calibration failure of the sublingual probe’. That an electronic sublingual probe would fail in more than 15% of cases seems remarkable; after all, oral thermometers are standard hospital equipment and rarely fail in normal use. Twelve additional patients were excluded because surgery did not last long enough to obtain the final planned temperature measurement at 75 min.
The protocol specifies that ‘patients with unexpected blood loss, haemodynamic instability …, and the need for postoperative ventilation were excluded’. At least two of these could only be determined during or after surgery, thus well after enrolment. However, no patients appear to have been excluded for these reasons. Did none of 120 enrolled patients meet any of these exclusion criteria?
The general rule in clinical research is to analyse data from all enrolled patients to avoid data-selection bias. There is no compelling reason to exclude patients with missing sublingual temperatures from comparisons to nasopharyngeal temperature (which is anyway the more reliable temperature-monitoring site). Similarly, the initial measurements could perfectly well be evaluated in patients whose final temperatures were missing.
The nasopharyngeal sensor is described as being ‘inserted through a nostril just posterior to the soft palate’. How far was the probe actually inserted? Probe insertion distance may be important because sublingual temperatures are reported as being ‘greater’ than both nasopharyngeal and SpotOn temperatures by about a third of a degree Celsius. This is an unusual pattern since the nasopharynx, as a reliable core temperature-monitoring site, should be warmer than the mouth.
When assessing agreement between two methods, it is important to report the average difference, or bias. However, the variability of the individual differences is even more important, as even a wildly variable thermometer might well look good ‘on average’. Clinicians need to be confident that any single measurement reliably reflects actual core temperature. It would thus be helpful if the authors presented the mean bias and ±2 SD range for each measurement for all enrolled patients, along with the fraction of SpotOn measurements that were within the ±0.5°C they describe as ‘adequate for temperature monitoring’.
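As a minimal sketch of the summaries requested here, the following computes the mean bias, the ±2 SD range, and the fraction of readings within ±0.5°C — on simulated paired temperatures, since the study's data are not available to us:

```python
import numpy as np

# Hypothetical paired temperatures (simulated; not the study's data).
rng = np.random.default_rng(0)
core = rng.normal(36.5, 0.4, size=100)             # reference temperatures
device = core + rng.normal(0.05, 0.25, size=100)   # device: small bias + noise

diff = device - core
bias = diff.mean()
sd = diff.std(ddof=1)
lower, upper = bias - 2 * sd, bias + 2 * sd        # mean bias ±2 SD range
within = np.mean(np.abs(diff) <= 0.5)              # fraction within ±0.5 °C

print(f"bias {bias:.2f} °C, ±2 SD range [{lower:.2f}, {upper:.2f}]")
print(f"fraction of readings within ±0.5 °C: {within:.2f}")
```

The point of reporting all three quantities together is that the bias alone hides the spread, while the ±2 SD range and the within-limits fraction describe how far any single reading may stray.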
The authors claim that SpotOn temperatures were ‘almost identical’ to nasopharyngeal temperatures, and report a mean bias of 0.07°C and a P value of 0.14 for a paired t test. However, paired t tests are not the right way to assess agreement: they evaluate mean bias rather than agreement amongst individual pairs, which is what we need to know. When used to assess agreement, as in Iden et al., a t test can be misleading. For example, a t test will fail to reject (and thus appear to show ‘good agreement’) when the average difference is zero, even when within-pair differences are large. Conversely, it will reject the null and suggest ‘poor agreement’ when a tiny but consistent bias (i.e. excellent agreement) is coupled with low variability (e.g. all differences of similar magnitude). Paired t tests were simply the wrong analysis tool.
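The first failure mode is easy to demonstrate with deliberately constructed (hypothetical) numbers: a device whose errors alternate between +0.8 and −0.8°C has a mean difference of exactly zero, so the paired t test finds nothing, yet not a single reading is clinically acceptable.

```python
import numpy as np
from scipy import stats

# Hypothetical core temperatures; device errors alternate +0.8 / -0.8 °C,
# so the mean difference is zero but agreement is terrible.
core = np.linspace(36.0, 37.5, 40)
errors = np.tile([0.8, -0.8], 20)
device = core + errors

t, p = stats.ttest_rel(device, core)
within = np.mean(np.abs(device - core) <= 0.5)
print(f"paired t test p = {p:.2f}")              # p = 1.00: 'no difference'
print(f"readings within ±0.5 °C: {within:.0%}")  # 0%: agreement is awful
```

A reader told only ‘P = 1.00, no significant difference’ would conclude the thermometers agree, which is exactly backwards here.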
Bland and Altman3 developed their now-famous methods for assessing agreement in the 1980s specifically to combat the misuse of paired t tests and correlation coefficients for such studies. The authors did include a Bland-Altman method – which should have been their primary analysis. However, they used the wrong version. Specifically, they used the original method which is only applicable for studies with a single set of measurements from each patient, with all pairs being independent. Their analysis, though, included data from three distinct time points for each patient, thus introducing within-patient correlation across time. The proper tool would thus have been the repeated-measures version of the technique.4 Furthermore, graphical and numerical displays of the data should have highlighted the three time intervals, rather than lumping them together, since agreement and accuracy may have varied over time.
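A sketch of the repeated-measures variant, under the assumptions of a balanced design (three paired differences per patient, as in the study) and simulated data, is below. The key difference from the original method is that the limits of agreement combine the within-patient and between-patient components of variance rather than treating all pairs as independent:

```python
import numpy as np

# Simulated differences (device minus reference), hypothetical numbers:
# 30 patients x 3 time points, each patient with a persistent offset.
rng = np.random.default_rng(2)
n_pat, n_times = 30, 3
subject_bias = rng.normal(0.05, 0.15, size=(n_pat, 1))           # per-patient offset
d = subject_bias + rng.normal(0.0, 0.20, size=(n_pat, n_times))  # differences

mean_d = d.mean()
ms_within = np.mean(d.var(axis=1, ddof=1))            # within-patient mean square
ms_between = n_times * d.mean(axis=1).var(ddof=1)     # between-patient mean square
var_between = max((ms_between - ms_within) / n_times, 0.0)
sd_total = np.sqrt(var_between + ms_within)           # SD for a single difference

loa = (mean_d - 1.96 * sd_total, mean_d + 1.96 * sd_total)
print(f"bias {mean_d:.3f} °C, 95% limits of agreement "
      f"[{loa[0]:.3f}, {loa[1]:.3f}]")
```

Ignoring the between-patient component (as the single-measurement version does) understates the limits whenever patients have persistent individual offsets, which is precisely what repeated intraoperative measurements tend to show.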
As paired t tests should never have been used, it is obvious that the authors’ sample-size estimate based on t tests is faulty. The estimate instead should have been based on having sufficient data to precisely estimate the standard deviation of the differences or the proportion of pairs within the a priori defined clinically acceptable limits of ±0.5°C. For example, assuming that 70% of differences were expected to fall within the acceptable limits, they would have needed n = 150 patients to have a 95% confidence interval width of 0.15 for the proportion and n = 350 for a width of 0.10.
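The quoted sample sizes can be roughly checked with the standard normal-approximation formula for the confidence-interval width of a proportion; this simple calculation (our own, not the authors') lands close to the figures above, with the gap plausibly reflecting rounding up and allowance for dropouts:

```python
import math

def n_for_ci_width(p: float, width: float, z: float = 1.96) -> int:
    """Patients needed so a 95% CI for proportion p has the given total width."""
    half = width / 2.0
    return math.ceil(z ** 2 * p * (1 - p) / half ** 2)

print(n_for_ci_width(0.70, 0.15))  # 144, in line with 'about 150'
print(n_for_ci_width(0.70, 0.10))  # 323, in line with 'about 350'
```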
The authors incorrectly refer to the Spearman correlation coefficient (r) as the ‘coefficient of determination’. Actually, the coefficient of determination is R2, the proportion of variance of one measure explained by the other, and would be obtained by squaring the Pearson correlation coefficient. As the authors only report the Spearman correlation, the coefficient of determination should not have been mentioned. Much more importantly, though, the Spearman rank-order correlation and Pearson correlation do not assess agreement, only the strength of the association. Either correlation can therefore be high even when agreement between two measures is terrible.
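This is again easy to illustrate with a hypothetical extreme case: a device that reads exactly 1°C high is perfectly correlated with the reference by both Spearman and Pearson measures, yet not one of its readings agrees to within ±0.5°C.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a constant +1 °C offset preserves rank order and
# linearity, so both correlations are (essentially) perfect.
core = np.linspace(36.0, 38.0, 25)
device = core + 1.0

rho, _ = stats.spearmanr(core, device)
r, _ = stats.pearsonr(core, device)
within = np.mean(np.abs(device - core) <= 0.5)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")  # both ~1.0
print(f"readings within ±0.5 °C: {within:.0%}")          # 0%
```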
The proper statistic would have been Lin's concordance correlation coefficient.5 Lin's concordance is a standard measure of agreement for continuous data which indicates how close the pairs are to the 45° line of identity in a scatterplot of the two methods. Intuitively, it is a function of both the distance from the 45° line and the strength of association measured by the Pearson correlation (r). So for data that are strongly correlated (‘tight’ on the scatterplot, with high r) but in poor agreement (i.e. off the 45° line), Lin's concordance will be low, appropriately indicating poor agreement.
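A minimal sketch of the coefficient, applied to the same hypothetical constant-offset data, shows the intended behaviour: Pearson's r stays at 1 while the concordance drops well below it.

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient for two paired samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

core = np.linspace(36.0, 38.0, 25)     # hypothetical reference temperatures
device = core + 1.0                    # perfectly correlated, 1 °C off

print(round(np.corrcoef(core, device)[0, 1], 3))  # Pearson r: 1.0
print(round(ccc(core, device), 3))                # concordance: ~0.42
```

The squared mean-difference term in the denominator is what penalises systematic offset; it vanishes only when the two methods agree on average as well as tracking each other.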
As the authors defined two comparisons of interest (SpotOn to nasopharyngeal, and SpotOn to sublingual), a Bonferroni correction should have been applied rather than defining ‘P < 0.05’ as significant and using it for each comparison. Another mistake is that the authors specify that ‘power (β) = 95%’; however, power is 1-β. Finally, as Iden et al. were comparing to a gold standard temperature, they could have also reported the slope and intercept of the regression line of gold standard on 3 M SpotOn. To the extent that the slope differed from 1.0, there might be a predictable relationship between device and gold standard that could be used to improve device calibration. For example, for sublingual temperature, the slope in their Fig. 3 appears to be less than 1.0.
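The calibration idea can be sketched on simulated numbers (hypothetical; chosen so that, as in the paper's Fig. 3, the relationship has a slope below 1.0): regress the gold standard on the device and use the fitted line to correct subsequent readings.

```python
import numpy as np

# Hypothetical data: device tracks the gold standard with slope < 1.
rng = np.random.default_rng(3)
gold = rng.normal(36.8, 0.5, size=60)
device = 0.9 * gold + 3.6 + rng.normal(0.0, 0.05, size=60)

slope, intercept = np.polyfit(device, gold, 1)   # gold regressed on device
corrected = slope * device + intercept           # recalibrated readings

def rms(e):
    return np.sqrt(np.mean(e ** 2))

print(f"fitted slope {slope:.2f}, intercept {intercept:.2f}")
print(f"RMS error, raw readings:      {rms(device - gold):.3f}")
print(f"RMS error, after calibration: {rms(corrected - gold):.3f}")
```

Because the identity line is one of the candidates the least-squares fit considers, the recalibrated readings can never be worse than the raw ones in root-mean-square terms; whether the improvement is clinically worthwhile depends on how far the slope departs from 1.0.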
In summary, proper methods for assessing agreement between two continuous measures include a Bland-Altman comparison of differences versus the mean of the two measures, Lin's concordance correlation coefficient, the proportion of differences falling within predefined limits, and the mean difference (or bias) and its standard deviation. Iden et al. attempted a Bland-Altman analysis, but used the wrong version, and they did not use the other methods at all. We encourage Iden et al. to apply the proper statistical analyses and include all available data. Although doing so is unlikely to substantively change their conclusions, it would improve readers' confidence in them.
And finally, we were surprised that the authors report no conflicts of interest as Berthold Bein apparently serves on a 3 M advisory board.
Acknowledgements related to this article
Assistance with the letter: none.
Financial support and sponsorship: none.
Conflicts of interest: the Department of Outcomes Research is funded by 3 M (manufacturer of the SpotOn thermometer) and various other temperature monitoring and management companies. DIS serves on advisory boards for 3 M and the 37Company, and consults for other temperature-related companies. He donates all temperature-related payments to charity.
1. Iden T, Horn EP, Bein B, et al. Intraoperative temperature monitoring with zero heat flux technology (3 M SpotOn sensor) in comparison with sublingual and nasopharyngeal temperature: an observational study. Eur J Anaesthesiol
2. Eshraghi Y, Nasr V, Parra-Sanchez I, et al. An evaluation of a zero-heat-flux cutaneous thermometer in cardiac surgical patients. Anesth Analg
3. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet
4. Bland JM, Altman DG. Agreement between methods of measurement with multiple observations per individual. J Biopharm Stat
5. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics