Throughout diagnostic nuclear medicine, an appropriate balance is required between image quality and radiation dose. The pressure to reduce injected activities of radiopharmaceuticals has reached new levels because of the recent worldwide molybdenum crisis, whereas financial pressure on gamma camera time makes more rapid scanning desirable.
In practice, this means collecting the minimum number of counts necessary to generate an adequate image for making accurate clinical diagnoses. In recent years, several computational image enhancement methods have been proposed that aim to take low-count, noisy images and improve their quality, allowing the required count number to be decreased. Pixon image processing is one particularly promising method derived from astronomy, using a process of spatially adaptive, minimal complexity image reconstruction [1]. It aims to strip away the noise inherent in low-count images by looking for the simplest underlying true image that can fit the data. This has been likened to Ockham’s Razor and is an intuitively plausible approach. To give a practical example, when presented with a patch of color interspersed with random pixels of different intensities, Pixon will interpret this as one confluent patch with unwanted noise and strip away that noise. It has shown promising results both in computer-modeled tests [1] and in a number of small clinical trials [2–5] and is now available commercially for postacquisition processing of bone scans [6].
Despite these successes and the possibility of significant reductions in cost and radiation exposure to the patient, Pixon processing combined with half-count scanning has not yet been widely accepted clinically. This may be because of insufficient high-quality evidence of its clinical efficacy – for example, the relatively small size of the clinical trials previously performed (18–29 patients) or the subjective measures of image quality previously used, in which clinicians were asked to score images for quality and diagnostic confidence but were unable to assess whether the diagnoses themselves were accurate.
This study aims to address some of these concerns. We focused on bone scanning for suspected metastatic cancer, as this represents the largest volume of work for many nuclear medicine departments. The study aimed to compare objective agreement in diagnoses made using current standard (full-count) scans, scans recorded in half the time (half-count), and half-count scans that were subsequently image enhanced with Pixon image processing (Pixon-processed). Ideally, a diagnostic gold standard (e.g. biopsy and histology of all suspicious lesions) is needed. However, this is clinically impractical; hence, this study considered agreement between clinicians and between scanning modalities as a proxy measure for diagnostic accuracy.
The aim was to ascertain whether a switch from full-count to half-count scanning would introduce significant errors in diagnoses and whether this situation could be improved using Pixon image processing.
The secondary outcome was the confidence of clinicians in making diagnoses using each of the three scanning modalities.
Materials and methods
A total of 150 adult patients, referred for suspected metastatic bone disease and scanned between April and August 2010 at a large London teaching hospital, were selected for retrospective analysis. Patients were injected with 600 MBq of technetium-99m-labeled methylene diphosphonate (MDP) and were scanned 3–4 h later. The scans were performed on either a Siemens Symbia T16 (Siemens AG, Munich, Germany), Philips Irix (Philips Healthcare, Best, the Netherlands), or ADAC Vertex gamma camera (Philips Healthcare), producing 512×1024 pixel images (256×1024 each for anterior and posterior views) with a scan speed of 10 cm/min (Fig. 1a). The sensitivities of the cameras ranged between 70 and 87 cps/MBq, and their resolution at minimum scanning distance on posterior images ranged between 9 and 12.5 mm.
The patients were selected by a clinician who read their final clinical reports, initially selecting patients in chronological order until there were 75 negative results and then including patients with positive results to reach a total of 150. This clinician did not participate further in reporting scans for this study. The total number of patients was determined by an a priori power calculation requiring an 80% probability of correctly identifying (at the 95% confidence level) an increase in diagnostic disagreement from 5 to 10%. The figure of 5% was chosen as a rough estimate of the current disagreement rate between clinicians and because it matched the 5% false-positive rate inherent in the normal ranges quoted for diagnostic laboratory tests frequently used in clinical practice. This assumption was retrospectively supported by a similar measured value of 8%. The 10% figure was selected because we felt it would be important to be able to detect a doubling in disagreement rate and that this would be an achievable goal in this study. There is a case for setting more stringent goals, but the number of patients required rapidly becomes unmanageable.
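As a rough check on the sample size, the sketch below applies one common normal-approximation formula for a one-sample test of a proportion. The source does not state which formula or sidedness was used, so the function name, the one-sided default, and the formula itself are assumptions made purely for illustration.

```python
import math
from statistics import NormalDist


def sample_size_one_proportion(p0, p1, alpha=0.05, power=0.80, two_sided=False):
    """Approximate n needed to detect a shift in a proportion from p0 to p1.

    Normal-approximation formula for a one-sample test of a proportion.
    This is an assumed method; the paper does not specify its calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)


# Disagreement rising from 5% to 10%, 80% power, 5% significance level.
n_one_sided = sample_size_one_proportion(0.05, 0.10)
n_two_sided = sample_size_one_proportion(0.05, 0.10, two_sided=True)
print(n_one_sided, n_two_sided)
```

Under these assumptions the one-sided calculation gives 150, in line with the number of patients in the study, whereas a two-sided test would call for 185.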
The images were anonymized and then degraded using Poisson-preserving binomial resampling [7] to obtain their half-count equivalents (Fig. 1b). In this method, each photon detected in each pixel of the gamma camera image is randomly included in or discarded from the half-count scan with a 50% probability. It is important to use this approach rather than simply dividing the counts by two, which would halve the noise (standard deviation) instead of reducing it by the factor of √2 expected from Poisson statistics, resulting in physically unrealistic images. The method was validated by assessing how well it could convert simulated pixels with expected counts of 1, 2, 10, and 25 to their half-count equivalents, comparing the results with values predicted by the Poisson distribution describing stochastic decays from a radioactive source. Figure 2 shows the resulting histogram plots, with the y-axis giving the probability of obtaining each number of counts measured over 10 000 iterations. There is near-perfect overlap between the resampled and predicted lines, supporting the use of this resampling method.
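The resampling step and its validation can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the function names, the seed, and the choice of 20 000 simulated pixels are assumptions. It thins simulated Poisson counts photon by photon and contrasts the result with simple halving.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson-distributed count (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def binomial_thin(counts, keep_prob, rng):
    """Keep each recorded count independently with probability keep_prob."""
    return [sum(1 for _ in range(n) if rng.random() < keep_prob) for n in counts]


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


rng = random.Random(0)  # fixed seed (arbitrary) for repeatability
full = [poisson_sample(10, rng) for _ in range(20000)]  # expected count 10
half = binomial_thin(full, 0.5, rng)   # photon-by-photon resampling
scaled = [n / 2 for n in full]         # naive halving, for comparison

print(mean_and_variance(half))    # mean and variance both close to 5
print(mean_and_variance(scaled))  # mean close to 5, variance close to 2.5
```

For a Poisson process the variance equals the mean, so a realistic half-count pixel should show both near 5 here. Naive halving reproduces the mean but understates the variance by a factor of two.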
An alternative to our scanning and resampling approach would have been to prospectively scan patients and then rescan them over half the time. This was rejected because it would introduce unwanted variables such as whether the patient was equally still or in the same position in the two scans and whether there was significant decay or excretion of tracer in the time between scans. It would also require more time on the part of staff and patients.
The resulting half-count images were subsequently processed using the Pixon-Enhanced Planar Processing module distributed by Siemens Medical Solutions (Munich, Germany) as part of their Syngo Molecular Imaging Applications suite. The default processing parameters were used (denoise parameter=1.7, maximum kernel radius=10, number of kernels=12, maximum iterations=20, α blending=30%). This α value was chosen after a blinded survey of five clinicians showed that this default setting corresponded to their preferred value. An example of the resulting image is shown in Fig. 1c.
Two experienced clinicians were presented with 450 images (full-count, half-count, and Pixon-processed) from 150 patients (Fig. 1) in a blinded, randomized order, using a custom software application written in Matlab (MathWorks, Natick, Massachusetts, USA) (Fig. 3). All images were viewed on the same Microsoft Windows workstation attached to a 19-inch Fujitsu E19-9 1280×1024 color LCD monitor (Fujitsu Limited, Tokyo, Japan) with brightness and contrast unaltered between viewings, under ambient lighting conditions. The primary outcome was whether the clinician diagnosed metastatic bone disease (yes/no). The secondary outcome was how confident they were in their diagnosis on a four-point ordinal scale (‘confident’, ‘fairly confident’, ‘not very confident’, ‘not at all confident’).
The full-count images were used as the reference standard, and both half-count and Pixon-processed images were assessed against this, using intraobserver agreement as the measure of quality (Figs 4 and 5). These full-count versus half-count and full-count versus Pixon-processed intraobserver agreement levels were calculated as fractions and κ values [8]. As it is unrealistic to expect perfect agreement between scans (indeed, other studies have shown this not to be the case in practice [9]), these calculated levels were compared with the interobserver agreement for full-count images, representing the currently acceptable disagreement rate between clinicians in the department.
The interobserver agreement levels for half-count and Pixon-enhanced scans were also calculated, as a measure of the consistency in diagnosis between clinicians within these new modalities.
95% confidence intervals were calculated assuming a binomial distribution of disagreements and applying the normal approximation. All p-values were calculated using the χ² goodness-of-fit test.
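The normal-approximation interval for a binomial proportion can be sketched as below. The function name and the example input (12 disagreements among 150 paired readings) are illustrative assumptions rather than values taken from the study's analysis code.

```python
import math


def binomial_ci_normal(events, total, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    p = events / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Illustrative input: 12 disagreements among 150 paired readings.
lower, upper = binomial_ci_normal(12, 150)
print(f"disagreement rate 8.0%, 95% CI {lower:.1%} to {upper:.1%}")
```

The approximation is simplest to compute but becomes unreliable when the number of events is very small, which is worth bearing in mind for disagreement rates close to zero.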
The study received prior approval from the North West London Regional Ethics Committee.
Results
The first clinician (Dr A) diagnosed 37 full-count scans as representing metastatic bone disease compared with 33 half-count scans and 40 Pixon-processed half-count scans (Table 1). The corresponding figures for Dr B were 47, 45, and 49, respectively. Dr A and Dr B made the same diagnosis in 138 of the 150 full-count scans, corresponding to a κ value of 0.80 (the interobserver agreement rate for full-count scans). They arrived at the same diagnosis for 138 of the 150 half-count scans (κ=0.79) and for 131 of the 150 Pixon-processed scans (κ=0.70).
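The interobserver κ values quoted above can be reproduced from the reported counts. The sketch below assumes the standard two-rater Cohen's κ for a yes/no outcome; the function name is hypothetical, and the source describes its κ method only by citation.

```python
def cohen_kappa_binary(n_pos_a, n_pos_b, n_agree, n_total):
    """Cohen's kappa for two raters making yes/no calls on the same cases.

    n_pos_a, n_pos_b: number of positive calls by each rater
    n_agree: number of cases where both raters made the same call
    """
    p_observed = n_agree / n_total
    p_a, p_b = n_pos_a / n_total, n_pos_b / n_total
    # Agreement expected by chance: both say yes, or both say no.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)


# (positives for Dr A, positives for Dr B, agreements) out of 150 scans each
reported = {
    "full-count": (37, 47, 138),
    "half-count": (33, 45, 138),
    "Pixon-processed": (40, 49, 131),
}
for modality, (a, b, agree) in reported.items():
    print(modality, round(cohen_kappa_binary(a, b, agree, 150), 2))
```

Rounded to two decimal places, this gives 0.8, 0.79, and 0.7, matching the values reported for the three modalities.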
Dr A came to the same diagnosis when viewing a patient’s full-count scan as when viewing his or her half-count scan in 144 of 150 cases (96% or κ=0.89), as shown in Table 2. The corresponding figure for Dr B was 142 of 150 cases (95% or κ=0.87). Summing these results across both clinicians yielded an agreement in diagnosis for 286 of 300 cases (95% or κ=0.88). The agreement between Dr A’s diagnosis on a full-count scan compared with the diagnosis for the same patient made using Pixon-processed half-count scans was 143 of 150 cases (95% or κ=0.88). The corresponding figure for Dr B was 142 of 150 cases (95% or κ=0.88). Summing the results of the two clinicians gave a figure of 285 of 300 cases (95% or κ=0.88).
The secondary outcome of the study was the confidence each clinician felt in making diagnoses, according to the scanning modality they were viewing. Of the 150 full-count scans viewed by Dr A, three were rated ‘not at all confident’, 13 ‘not very confident’, 74 ‘fairly confident’, and 60 ‘confident’. This compares with 7, 14, 88, and 41 cases, respectively, for the Pixon-processed scans viewed by Dr A. The remaining figures, including the combined scores of both clinicians, are presented in Table 3.
Some of the key interobserver and intraobserver agreement rates are plotted in Fig. 5, whereas Fig. 6 shows the combined confidence scores of the two clinicians.
Discussion
The levels of agreement between the diagnoses of metastatic bone disease shown in Fig. 5 enable us to address the main question of this paper: would switching from full-count scanning to half-count scanning introduce significant errors when diagnosing metastatic bone disease, and would Pixon processing reduce these errors?
The first column in Fig. 5 (interobserver agreement when viewing full-count scans) gives a benchmark against which to measure the other levels of agreement. It represents how often two clinicians in the department, independently reviewing current standard scans, arrive at the same diagnosis (i.e. they agree in 92% of scans).
This compares to an intraobserver agreement of 95% for full-count versus half-count scans (Fig. 5, column 4), suggesting that more disagreement is introduced by changing the reporting clinician than by changing to half-count scans; however, this effect is not statistically significant (p=0.15). The level of agreement is unchanged at 95% when switching from full-count to Pixon-processed half-count scans (Fig. 5, column 5), suggesting neither an improvement nor a loss of diagnostic accuracy when the Pixon-processing step is included.
It is also possible to assess consistency between clinicians when both are viewing the same modality. Hence, two clinicians viewing half-count scans agree in 92% of cases (Fig. 5, column 2), similar to the full-count scans used at present, whereas they agree only 87% of the time when viewing Pixon-processed half-count scans (column 3). Although this is again not statistically significant (p=0.18), it does suggest that the Pixon processing step may introduce further disagreement between clinicians.
With respect to the secondary outcome measure – that is, the subjective confidence of clinicians in the diagnoses they are making (Table 3 and Fig. 6) – we see a difference between the modalities (p=0.04): in the highest ‘confident’ category, the largest number of images were full-count scans, followed by half-count scans and then Pixon-processed images. At the other extreme, in the ‘not at all confident’ category the largest number of images were Pixon-processed, followed by half-count and then full-count scans. In other words, clinicians felt most confident making diagnoses using full-count scans, less confident viewing half-count scans, and least confident viewing Pixon-processed scans.
We can now bring together the conclusions from the primary and secondary outcome measures.
We observed no statistically significant difference in diagnostic agreement levels between the current full-count scans and half-count scans, supporting a move to cheaper, lower-radiation half-count scanning.
We observed no improvement in agreement when applying Pixon processing to these half-count scans and even showed a nonsignificant trend toward worse agreement.
However, the clinicians felt less confident making diagnoses using half-count scans.
There are three possible interpretations of the inconsistency between our objective agreement and the subjective confidence results:
The inconsistency may represent a genuinely lower ability to make difficult diagnoses, which could be demonstrated objectively in a larger study, or in one with more difficult clinical cases.
The full-count scans may be esthetically more pleasing, resulting in greater clinician confidence, without conferring any true ability to make more accurate diagnoses.
The confidence of clinicians may be an important outcome measure in itself, irrespective of correlation with objective diagnostic accuracy. It will determine whether they order further expensive imaging and the clinical management they recommend.
There are a number of limitations to this study that are worth noting:
It does not contain a gold standard for diagnosis of metastatic disease (e.g. biopsy of suspicious lesions), which would have allowed the accuracy of different scans to be measured directly. Instead, we relied on agreement with the currently accepted method. It is theoretically possible that the disagreement between Pixon-processed half-count scans and full-count scans indicates that the Pixon images are more accurate and achieve the correct diagnosis more often. However, this is extremely unlikely because the Pixon-processed scans were derived from degraded data and hence started with less information, showed greater disagreement between clinicians, and attracted lower subjective confidence from clinicians making diagnoses.
Interobserver agreements between full-count and half-count scans were not assessed. We chose instead to look at intraobserver agreement between full-count and half-count images, which isolates the disagreement due to changing modality while keeping the clinician the same. It was not possible to consider both values because the study was not powered for multiple comparisons.
Only two clinicians participated in the study, giving maximum statistical power for a given number of images viewed, but there is a risk of results being biased by the idiosyncrasies of the two clinicians selected.
There are five parameters that can be set for Pixon processing, including α blending, the denoise parameter, the maximum kernel radius, the number of kernels, and maximum iterations. We performed a blind survey of five clinicians that showed that they preferred an α value of 30%, which corresponded to the default setting, and we similarly used the default values for the other parameters as distributed by Siemens Medical Solutions, because we believed that this would be how a typical user uses Pixon. It is possible that these values are suboptimal and that different results would have been obtained with different settings.
It is also possible that different Pixon settings would be appropriate depending on the gamma camera or scanning technique used. The sensitivities and resolutions of the cameras used in this study were very similar, and there were no discernible differences in image quality between the different models. Although Pixon is marketed by Siemens Medical Solutions, it was not developed for a specific camera or manufacturer and, unlike resolution recovery software, does not require knowledge of camera-specific parameters. It therefore seems unlikely that Pixon parameters would need to be altered for different cameras or that the effect of the software would be changed with standard scanning conditions, but we did not test for this effect in this study.
Conclusion
In principle, it would be advantageous to reduce the number of counts collected in bone scanning, both from a financial and a radiation protection point of view, provided this can be achieved without loss of diagnostic accuracy.
This study allows us to draw the following conclusions:
Halving the number of counts does not introduce significantly more disagreement when diagnosing metastatic bone disease.
Pixon image processing does not significantly improve half-count scans and may reduce diagnostic agreement.
Clinicians feel less confident in making diagnoses with fewer-count scans.
The critical question facing decision makers is whether they can safely switch all their bone scanning to half-count scans, and this paper goes part of the way towards supporting such a move. Further work, either with a larger sample size or focusing on difficult clinical cases, would reassure readers that such a move would not lead to diagnostic errors in the small number of complex cases.
Acknowledgements
The Matlab program was based on code written in the Medical Physics Department of Leeds Teaching Hospitals NHS Trust to study the effect of scanning time on the quality of bone studies, which was undertaken with the aid of a Regional Research and Development (Northern and Yorkshire) response funding small project grant.
The study was approved by the North West London Regional Ethics Committee.
Conflicts of interest
MH has received consultancy fees from GE Healthcare relating to other tracers, and book royalties from Springer Verlag. For the remaining authors there are no conflicts of interest.