Computer versus human diagnosis of melanoma: evaluation of the feasibility of an automated diagnostic system in a prospective clinical trial

Dreiseitl, Stephan (a,b); Binder, Michael (c); Hable, Krispin (a); Kittler, Harald (c)

doi: 10.1097/CMR.0b013e32832a1e41
ORIGINAL ARTICLES: Clinical research
The aim of this study was to evaluate the accuracy of a computer-based system for the automated diagnosis of melanoma in the hands of nonexpert physicians. We performed a prospective comparison between nonexperts using computer assistance and experts without assistance in the setting of a tertiary referral center at a university hospital. Between February and November 2004 we enrolled 511 consecutive patients. Each patient was examined by two nonexpert physicians with low to moderate diagnostic skills who were allowed to use a neural network-based diagnostic system at their own discretion. Every patient was also examined by an expert dermatologist using standard dermatoscopy equipment. The nonexpert physicians used the automatic diagnostic system in 3827 pigmented skin lesions. In their hands, the system achieved a sensitivity of 72% and a specificity of 82%. The sensitivity was significantly lower than that of the expert physician (72 vs. 96%, P = 0.001), whereas the specificity was significantly higher (82 vs. 72%, P<0.01). Three melanomas were missed because the physicians who operated the system did not choose them for examination. The system as a stand-alone device had an average discriminatory power of 0.87, as measured by the area under the receiver operating characteristic curve, with optimal sensitivities and specificities of 75 and 84%, respectively. The diagnostic accuracy achieved in this clinical trial was lower than that achieved in a previous experimental trial of the same system. In summary, the performance of a decision-support system for melanoma diagnosis under real-life conditions is lower than that expected from experimental data and depends upon the physicians who are using the system.

(a) Department of Software Engineering, Upper Austria University of Applied Sciences, Hagenberg

(b) Department of Biomedical Engineering, University of Health Sciences, Medical Informatics and Technology, Hall

(c) Department of Dermatology, University of Vienna Medical School, Vienna, Austria

Correspondence to Dr Harald Kittler, MD, Medical University of Vienna, Waehringer Guertel 18-20, Vienna, Austria

Tel: +43 140400 7711; fax: +43 1408 1287 2099;

e-mail: harald.kittler@meduniwien.ac.at

Preliminary results of this study were presented at the AMIA Meeting 2007 in Chicago and were published in AMIA Annu Symp Proc 2007:191–195 under the title ‘Applying a decision-support system in clinical practice: results from melanoma diagnosis’.

Received 3 November 2008 Accepted 5 February 2009

Introduction

The recent developments in computer technology have raised expectations that automated diagnostic systems will become available to assist physicians in the diagnosis of human diseases including different types of cancer. Systems have been developed to streamline the diagnostic decision-making process in general and to analyze medical images in particular. With regard to the latter, the computer diagnosis of melanoma has been a highly active field of research. The current data suggest that the computer diagnosis of melanoma is as reliable as the diagnosis by human experts [1,2]. The drawback of all these systems is that they have been tested only under experimental conditions that are prone to different types of bias, including selection bias and verification bias. Moreover, little attention has been paid to how such a system would actually be used under real-world conditions and how physicians will influence the performance of such systems. Reviews and discussions of these problems in different fields of medicine can be found in the literature [3–9].

In contrast to the above, this paper sets out to investigate the more clinically relevant question of using a decision-support system for melanoma diagnosis in routine patient encounters. Building on our previous expertise in this area [1,10], we developed, implemented, and used a neural-network-based decision-support tool to analyze dermatoscopy images of pigmented skin lesions (the complete software system is available free of charge from the authors). Dermatoscopy, also known as dermoscopy, is a noninvasive technique for taking high-resolution images of skin lesions by making the superficial layers of the skin translucent. This technique has been shown to be beneficial in the noninvasive assessment of pigmented skin lesions because it improves the diagnosis of melanoma when compared with examination with the unaided eye [11].

In our study, the output of a previously tested automatic diagnostic system was made available to physicians of varying levels of experience who could use it at their own discretion. This was accomplished by connecting the automatic diagnostic system to digital dermatoscopy equipment in such a way that the diagnostic system was automatically invoked every time the dermatoscopy equipment was used. The main motivating questions for our research were how a system that had previously been tested in an experimental setting performs diagnostically in a real-life environment, and what role the physician who operates the system plays. Related questions, such as the performance and replicability of system ratings on individual lesions, are reported elsewhere [12].

Materials and methods

System development

The development of the system has been described earlier [13]. In short, we trained an artificial neural network on a data set of 1311 pictures of lesions that were taken at the Department of Dermatology of the Medical University of Vienna using a MoleMax II dermatoscopy instrument (Derma Medical Systems, Vienna, Austria). The sample for the development phase was taken from a larger consecutive sample of images collected between January 1998 and December 2003. Of the 1311 lesions, 125 were confirmed to be melanomas by histopathology and 1186 were diagnosed as benign by histopathology (if excised) or expert opinion after a 1-year follow-up (if not excised). The images of lesions at 30-fold magnification were stored as bitmaps at a resolution of 752 × 582 pixels and a color depth of 24 bits. Image segmentation was performed by a combination of local and global thresholding algorithms. A feature extraction step represented each segmented image as a combination of 38 shape, form, and color descriptors. Image segmentation and feature extraction were implemented in ImageJ (NIH, Bethesda, USA). We used a stepwise feature selection method to identify 29 features relevant for the classification process. A Matlab neural network model (The Mathworks, Natick, Massachusetts, USA) trained on these features achieved, on an independent test set, a discrimination value of 0.94 as measured by the area under the receiver operating characteristic (ROC) curve [13]. We used the Netlab package (Neural Computing Research Group, Aston University, Birmingham, UK) to train the neural network, with a conjugate gradient optimization algorithm and early stopping to avoid overtraining.

The trained classifier was then combined with the segmentation and feature extraction module to form a classifier module that could process digital dermatoscopy images. The classifier module was automatically invoked every time a lesion image was taken. As output of the classifier module, the segmented image was displayed on the screen, along with a visual rendering of the classifier's malignancy rating (a number between 0 and 1). A sample screen shot of the system output is shown in Fig. 1. The continuous malignancy scale ranging from 0 to 1 was visualized as a colored rectangular area, with the left part (green, from 0 to 0.1) corresponding to a low rating, the middle part (yellow, from 0.1 to 0.4) to a medium rating, and the right part (red, from 0.4 to 1) to a high rating. The malignancy rating for a particular lesion was visualized by the position of a slider on this rectangular scale. On account of a nonlinearity in the neural network output, the sizes of the regions on the output scale are not linearly related to the lengths of the intervals that they represent. Based on the experimental data, the color intervals were chosen such that the upper threshold had a specificity of 95% and the lower threshold a sensitivity of 95%. The green zone of the scale was considered benign, the yellow zone suspicious, and the red zone malignant. The instruction given by the system was ‘no excision’ if the lower threshold was not reached and ‘excision’ otherwise.
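
The mapping from malignancy rating to color zone and instruction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rate_lesion` is hypothetical, and the treatment of ratings that fall exactly on a threshold is an assumption, since the text does not specify it.

```python
from typing import Tuple

LOWER_THRESHOLD = 0.1  # green/yellow boundary given in the text
UPPER_THRESHOLD = 0.4  # yellow/red boundary given in the text


def rate_lesion(rating: float) -> Tuple[str, str]:
    """Map a malignancy rating in [0, 1] to (color zone, instruction).

    Assumption: a rating exactly on a threshold belongs to the higher zone.
    """
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating must lie between 0 and 1")
    if rating < LOWER_THRESHOLD:
        # Lower threshold not reached: benign zone.
        return "green", "no excision"
    if rating < UPPER_THRESHOLD:
        # Suspicious zone: the lower threshold was reached.
        return "yellow", "excision"
    # Malignant zone.
    return "red", "excision"
```

Under these assumptions, a rating of 0.05 falls in the green zone with the instruction ‘no excision’, whereas ratings of 0.25 and 0.8 both lead to ‘excision’.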

Fig. 1

Patients and data collection

Five hundred and eleven patients presenting at the Department of Dermatology of the Medical University of Vienna between February and November 2004 were enrolled into the study. The pigmented skin lesion unit of the Department of Dermatology at the Medical University of Vienna serves as a secondary and tertiary referral center. It can thus be expected that the prevalence of melanoma in the study population is higher than in the general population. The sample of patients was a consecutive sample. The study was approved by the local ethics committee and all patients gave written informed consent to participate in the study.

As part of the routine consultation process, all study participants were examined by an expert dermatologist using standard dermatoscopy instrumentation without the decision-support system. In addition, they were also examined, in turn, by two out of a pool of six other physicians participating in the study (one more experienced, one less experienced). The physicians took part based on availability and consented to participate in the study. The training of the participating physicians ranged from no training in dermatology to 4 years of training in dermatology. No physician was specifically trained in dermatoscopy. The physicians were instructed to perform an independent routine examination on the study participants. They could thus choose which, and how many, lesions to examine with the system on each patient. The six study physicians used a MoleMax II instrument (Derma Medical Systems, Vienna, Austria) with the added decision-support system. The system malignancy rating was automatically generated for each examined lesion and was displayed immediately on a monitor. The physicians' decisions were then entered into a spreadsheet by a medical student who served as study coordinator. Independently of this, and unknown to the physicians operating the system, the expert dermatologist determined in advance which lesions, if any, had to be excised to rule out melanoma histopathologically. Histopathology served as the gold standard of diagnosis. Lesions that were not excised were monitored for 6 months. If the lesion did not change during this interval it was regarded as benign. If the lesion changed according to previously defined criteria [14] it was excised and sent for routine histopathologic examination.

Overall, the physicians performed a total of 3827 lesion examinations with the system. After the removal of cases with missing data on account of missing follow-up information, 3021 lesion examinations remained in the study, corresponding to 458 patients with at least one lesion for which all required information was available. Of these, 27 patients had at least one melanoma (with a total of 31 melanomas), and 431 were unaffected.

Outcome variables

The outcomes considered in this study were the diagnosis of single lesions and health status of entire patients (healthy/diseased). Note that all benign and all malignant lesions have to be diagnosed correctly for a true-negative or true-positive diagnosis for the entire patient. We compared three diagnostic entities for these outcomes: the expert without the system, the system as a stand-alone device, and the system in the hands of the study physicians.
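
The step from lesion-level diagnoses to a patient-level outcome can be sketched as below. The sketch assumes one reading of the rule above: a diseased patient counts as a true positive only if every melanoma is called malignant, and a healthy patient counts as a true negative only if every lesion is called benign. The `Lesion` tuple layout and the function name are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical layout: (is_melanoma, called_malignant) for one examined lesion.
Lesion = Tuple[bool, bool]


def patient_outcome(lesions: Iterable[Lesion]) -> str:
    """Classify a patient as 'TP', 'FN', 'TN' or 'FP' from lesion diagnoses."""
    lesions = list(lesions)
    if not lesions:
        raise ValueError("patient has no examined lesions")
    # Calls made on the patient's melanomas (empty for a healthy patient).
    melanoma_calls = [call for truth, call in lesions if truth]
    if melanoma_calls:
        # Diseased patient: every melanoma must be flagged.
        return "TP" if all(melanoma_calls) else "FN"
    # Healthy patient: a single malignant call makes a false positive.
    return "FP" if any(call for _, call in lesions) else "TN"
```

With this rule, a healthy patient with three benign lesions becomes a false positive as soon as one lesion is flagged, which is why patient-level specificity is lower than lesion-level specificity.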

Statistical analysis

Chi-square tests were used for the comparison of proportions. All given P values are two tailed, and a P value of less than 0.05 indicates statistical significance.
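
A chi-square comparison of two proportions can be sketched as follows. This is an illustration only: the text does not state whether a continuity correction was applied, so the sketch uses the uncorrected Pearson statistic, and the counts in the usage line are made up for demonstration and are not study data.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, two-tailed P value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Illustrative counts: 80 of 100 correct in one group, 60 of 100 in the other.
stat, p_value = chi_square_2x2(80, 20, 60, 40)
```

A P value below 0.05 would then be read as a statistically significant difference between the two proportions.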

Results

Performance of the system as a stand-alone device

Using ROC analysis on the 3021 lesion examinations in the study, we obtained an area under the curve of 0.87 [95% confidence interval (0.82–0.92)] as a measure of the discriminatory power of the decision-support system. This value is significantly lower than the result of 0.94 obtained on a test set during network training (P = 0.01). The optimal sensitivity and specificity values for diagnosing individual lesions, as measured by the point on the ROC curve closest to the upper-left corner of the unit square, were 75 and 84%, respectively. Using these thresholds, the sensitivity was 88% and the specificity 48% for the diagnosis of entire patients.
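
The two ROC quantities used above, the area under the curve and the operating point closest to the upper-left corner of the unit square, can be sketched in plain Python. The function names and the toy ratings are hypothetical, and the sketch assumes that a lesion is called malignant when its rating is at or above the threshold.

```python
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as P(positive outscores negative); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def closest_to_corner(scores: Sequence[float],
                      labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) nearest to the point (0, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        # Squared distance to the corner with sensitivity = specificity = 1.
        dist = (1 - sens) ** 2 + (1 - spec) ** 2
        if best is None or dist < best[0]:
            best = (dist, t, sens, spec)
    return best[1], best[2], best[3]


# Toy ratings (not study data): 1 = melanoma, 0 = benign.
toy_scores = [0.05, 0.2, 0.3, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_threshold, toy_sens, toy_spec = closest_to_corner(toy_scores, toy_labels)
```

The pairwise formulation of the area under the curve is quadratic in the number of lesions, which is acceptable for a few thousand examinations; a rank-based computation would be the usual choice for larger data sets.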

Performance of the system in the hands of physicians

The outcome measures (sensitivity and specificity) for the performance of the system in the hands of physicians were calculated by patient and not by lesion because this is clinically more relevant. A summary of these outcome measures, stratified by the level of experience of the physicians, is given in Table 1. The data indicate that overall, the study physicians had lower sensitivities, but higher specificities than the control expert dermatologist. These differences are significant, both for sensitivity (72 vs. 96%, P = 0.0012) and specificity (82 vs. 72%, P = 0.00007). Table 2 compares the number of lesions examined by the more-experienced physicians with the number examined by the less-experienced ones. Less-experienced physicians examined more lesions than more-experienced ones. The difference in the number of lesions examined per patient, although not large, was statistically significant (3.44 vs. 3.15, P = 0.0059).

Table 1

Table 2

Melanomas missed by the study physicians

Three melanomas were missed by the study physicians because they were never chosen for examination with the system. The benefit of a decision-support system, even if it is correct 100% of the time, is thus limited by the fact that, for practical reasons, it is used at the discretion of the physician and cannot be applied to all lesions of a patient.

Disagreement between the physicians and the system

In the 2781 cases that the system rated as benign (green area of the rating scale, output below 0.1), the physicians agreed 2651 times (95%) and took no further action. In the 240 cases in which the output of the system was between the thresholds 0.1 and 0.4 (where the output bar of the decision-support system moves from the green to the yellow region), the physicians decided to perform an excision in 76 cases (32%). Of the 140 lesions that were labeled as malignant by the system (output higher than 0.4, where the output bar moves from the yellow to the red region), 59 cases (42%) would have been excised by the study physicians. A contingency table summarizing all combinations of benign and malignant diagnoses for the three system output ranges shown as green, yellow, and red is given in Table 3.

Table 3

Comparison between the stand-alone system and the experienced physician who did not use the system

The expert, who did not use the system, correctly diagnosed 26 of 27 melanoma patients (96% sensitivity) and 311 of 431 healthy individuals (72% specificity). To compare these numbers with the stand-alone system, we first had to transform the continuous-scale output of the system to a dichotomous diagnosis. As mentioned above, the system achieved sensitivity and specificity values of 68 and 54%, respectively, when using the optimal threshold. If we use a threshold for which the system achieves the same specificity as the expert, which is 93%, then the system specificity for diagnosing entire patients increases to 72%, whereas the sensitivity drops to 48%.
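
As a worked check of the expert's figures, sensitivity and specificity follow directly from the patient counts quoted above. The helper names are hypothetical; the counts (26 of 27 and 311 of 431) are taken from the text.

```python
def sensitivity(true_positives: int, diseased: int) -> float:
    """Fraction of diseased patients who were correctly identified."""
    if diseased <= 0:
        raise ValueError("need at least one diseased patient")
    return true_positives / diseased


def specificity(true_negatives: int, healthy: int) -> float:
    """Fraction of healthy patients who were correctly identified."""
    if healthy <= 0:
        raise ValueError("need at least one healthy patient")
    return true_negatives / healthy


expert_sens = sensitivity(26, 27)    # melanoma patients diagnosed correctly
expert_spec = specificity(311, 431)  # healthy individuals diagnosed correctly
print(f"sensitivity {expert_sens:.0%}, specificity {expert_spec:.0%}")
```

Rounded to whole percentages, these ratios reproduce the 96% sensitivity and 72% specificity reported for the expert.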

Discussion

A major finding of our study is that the performance of an automated diagnostic system for melanoma is lower under real-life conditions than expected from experimental data in a laboratory setting. Not unexpectedly, we also found that the performance of a decision-support system, in part, depends on the physicians who are operating such a system. In the hands of physicians the system had higher specificities, but lower sensitivities than a control expert dermatologist who did not use the system. The lower sensitivity can be partly explained by the fact that 10% of melanomas were not analyzed by the system because the physicians operating the system overlooked them. In other words, unless a technique is provided that guarantees the analysis of all lesions of a patient, an automated diagnostic system for melanoma is always dependent on the skills of the operators who are selecting the lesions.

Our study was not designed to compare the performance of unaided physicians with the performance of physicians using a decision-support system, both at the same level of expertise. Previous research, however, has shown that inexperienced physicians benefit the most from decision-support technology [15–17]. A smaller study on the use of decision-support systems in melanoma diagnosis reports a significant improvement in sensitivity for physicians in combination with a melanoma diagnosis system [18]. We also did not attempt to measure the degree to which the physicians were influenced by the decision-support system. A study we had conducted earlier found that in a similar setting, 24% of physicians changed their diagnosis when contradicted by a decision-support system [19].

The rather large difference in specificities on the lesion level versus the patient level, as shown in Table 1, can be explained by the fact that a correct true-negative patient diagnosis requires all of the lesion diagnoses of this patient to be true negatives. For sensitivity, this difference is not very large, because there were only few patients with more than one melanoma. It has to be pointed out that the numbers in Table 1 present a summary of performance only for those lesions that were examined by the study physicians. There were three melanomas that were missed by both study physicians. This drawback, however, is inherent in the examination process, as it cannot be expected that the physicians will examine every single lesion on a patient. Further developments in imaging modalities, such as total body photography, may be able to remedy this situation by automatically providing computer-generated assessments for all lesions on a patient.
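
The gap between lesion-level and patient-level specificity can be illustrated with a simplified calculation. The sketch assumes that lesion diagnoses on one patient are independent, which real lesions on the same patient are unlikely to satisfy exactly, and the input values are illustrative rather than taken from Table 1.

```python
def patient_level_specificity(lesion_specificity: float,
                              lesions_per_patient: int) -> float:
    """Probability that all lesions of a healthy patient are true negatives.

    Assumption: lesion diagnoses are independent of each other.
    """
    if not 0.0 <= lesion_specificity <= 1.0:
        raise ValueError("specificity must lie between 0 and 1")
    if lesions_per_patient < 1:
        raise ValueError("need at least one lesion per patient")
    return lesion_specificity ** lesions_per_patient


# Illustrative values: 84% lesion-level specificity, three lesions per patient.
approx = patient_level_specificity(0.84, 3)
```

Under these assumptions, a lesion-level specificity of 84% already falls to about 59% at the patient level with three examined lesions, and it keeps falling as more lesions are examined.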

The performance of our system, as measured during this study, is lower than the performance on an independent test set during neural network training [1,20]. This was not expected, as the system was applied to a patient sample drawn from the same population as during system training, and the hardware for taking the pictures was also the same. One possible explanation for the decrease in performance is that the system was applied to lesions chosen by the study physicians. These physicians are likely to examine lesions that are hard to diagnose. In contrast, the lesions used for neural network training were chosen to be representative of typical lesions on a patient, and thus included both easy and hard cases. Other possible explanations for the decrease in performance are the sensitivity of the dermatoscopy equipment to the pressure and tilt with which it is applied to the patient's skin, and the poor color calibration of the camera. Recent research suggests that this latter point might have a larger effect on diagnostic performance than previously thought [21,22]. Further research on the standardization of dermatoscopy equipment, and on its standardized usage, will be required to resolve this issue. Future research will be targeted at making the system more robust, by investigating the causes of the lower performance in real-life settings, and at improving the user interface of the system in order to provide explanations for system outputs.

Acknowledgements

Statement on financial disclosure/conflicts of interest: Michael Binder has been involved in the development of the MoleMax II device and holds a patent for parts of the device. This work was partially funded by the Austrian National Bank Funds (No. 9952).

References

1. Dreiseitl S, Ohno-Machado L, Kittler H, Vinterbo S, Billhardt H, Binder M, et al. A comparison of machine learning methods for the diagnosis of pigmented skin lesions. J Biomed Inform 2001; 34:28–36.
2. Sboner A, Eccher C, Blanzieri E, Bauer P, Cristofolini M, Zumiani G, et al. A multiple classifier system for early melanoma diagnosis. Artif Intell Med 2003; 27:29–44.
3. Rajan P, Tolley DA. Artificial neural networks in urolithiasis. Curr Opin Urol 2005; 15:133–137.
4. Bicciato S. Artificial neural network technologies to identify biomarkers for therapeutic intervention. Curr Opin Mol Ther 2004; 6:616–623.
5. Yang ZR. Biological applications of support vector machines. Brief Bioinform 2004; 5:328–338.
6. Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002; 35:352–359.
7. Lisboa PJG. A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Netw 2002; 15:11–39.
8. Sherriff A, Ott J. Applications of neural networks for gene finding. Adv Genet 2001; 42:287–297.
9. Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol 1996; 49:1225–1231.
10. Binder M, Kittler H, Dreiseitl S, Ganster H, Wolff K, Pehamberger H, et al. Computer-aided epiluminescence microscopy of pigmented skin lesions: the value of clinical data for the classification process. Melanoma Res 2000; 10:556–561.
11. Kittler H, Pehamberger H, Wolff K, Binder M. Diagnostic accuracy of dermoscopy. Lancet Oncol 2002; 3:159–165.
12. Dreiseitl S, Binder M, Vinterbo S, Kittler H. Applying a decision support system in clinical practice: results from melanoma diagnosis. AMIA Annu Symp Proc 2007:191–195.
13. Hable K. Robust parameters in automated diagnosis of melanoma. Master's thesis, Upper Austria University of Applied Sciences, 2004.
14. Kittler H, Pehamberger H, Wolff K, Binder M. Follow-up of melanocytic skin lesions with digital epiluminescence microscopy: patterns of modifications observed in early melanoma, atypical nevi, and common nevi. J Am Acad Dermatol 2000; 43:467–476.
15. Chang PL, Li YC, Wang TM, Huang ST, Hsieh ML, Tsui KH, et al. Evaluation of a decision-support system for preoperative staging of prostate cancer. Med Decis Making 1999; 19:419–427.
16. Berner ES, Maisiak RS. Influence of case and physician characteristics on perceptions of decision support systems. J Am Med Inform Assoc 1999; 6:428–434.
17. Friedman CP, Elstein AS, Wolf FM, Murphy GC, Franz TM, Heckerling PS, et al. Enhancement of clinicians' diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 1999; 282:1851–1856.
18. Sboner A, Bauer P, Zumiani G, Eccher C, Blanzieri E, Forti S, et al. Clinical validation of an automated system for supporting the early diagnosis of melanoma. Skin Res Technol 2004; 10:184–192.
19. Dreiseitl S, Binder M. Do physicians value decision support? A look at the effect of decision support systems on physician opinion. Artif Intell Med 2005; 33:25–30.
20. Rubegni P, Cevenini G, Burroni M, Perotti R, Dell'Eva G, Sbano P, et al. Automated diagnosis of pigmented skin lesions. Int J Cancer 2002; 101:576–580.
21. Grana C, Pellacani G, Seidenari S. Practical color calibration for dermoscopy, applied to a digital epiluminescence microscope. Skin Res Technol 2005; 11:242–247.
22. Seidenari S, Pellacani G, Grana C. Computer description of colours in dermoscopic melanocytic lesion images reproducing clinical assessment. Br J Dermatol 2003; 149:523–529.

Keywords: computer diagnosis; dermatoscopy; melanoma; prospective clinical trial

© 2009 Lippincott Williams & Wilkins, Inc.