Original Clinical Study

Validation of Soft Labels in Developing Deep Learning Algorithms for Detecting Lesions of Myopic Maculopathy From Optical Coherence Tomographic Images

Du, Ran MD; Xie, Shiqi MD; Fang, Yuxin PhD; Hagino, Shinichi MS; Yamamoto, Shinji BS; Moriyama, Muka MD; Yoshida, Takeshi PhD; Igarashi-Yokoi, Tae MD; Takahashi, Hiroyuki PhD; Nagaoka, Natsuko PhD; Uramoto, Kengo MD; Onishi, Yuka MD; Watanabe, Takashi MD; Nakao, Noriko MD; Takahashi, Tomonari MD; Kaneko, Yuichiro MD; Azuma, Takeshi MD; Hatake, Ryoma MD; Nomura, Takuhei MD; Sakura, Tatsuro MD; Yana, Mariko MD; Xiong, Jianping MD; Chen, Changyu MD; Ohno-Matsui, Kyoko PhD

Asia-Pacific Journal of Ophthalmology: May-June 2022 - Volume 11 - Issue 3 - p 227-236
doi: 10.1097/APO.0000000000000466


Myopia is a global health issue, especially in East Asian countries.1,2 In urban areas of these countries, 80% to 90% of adolescents completing high school are myopic, and 10% to 20% of these have high myopia.1,3 The increase in the prevalence of myopia is not restricted to East Asia; it is also increasing in North America, Germany, Spain, and Russia.4–7 An increase in the prevalence of high myopia results in an increase in pathologic myopia, the most serious form of myopia. Myopic maculopathy, which can be induced by the progression of pathologic myopia,8 is the major cause of irreversible vision reduction worldwide, especially in East Asian countries.9–12 Myopia-related blindness is expected to increase in many parts of the world, and a shortage of myopia specialists is a major concern for governmental leaders. Optical coherence tomography (OCT) is a widely used, noninvasive method for detecting retinal disorders, and it has become the general examination method to assess the retina of myopic patients.

With the rapid advancement over the past decade in artificial intelligence, deep learning, and big data, which provide a strong foundation for the fourth industrial revolution, the vast volume of ocular images can now be analyzed with good accuracy to assist in making a diagnosis.13–22 Nevertheless, we still have to develop the necessary image-processing software and hardware to handle image-based data from different sources so as to derive meaningful output.23,24 More recent algorithms tend to be lighter, requiring fewer floating-point operations (FLOPs), a measure of the computational cost of a model. A low operation count makes deep learning models easier to use on ordinary machines or even on mobile devices. Prior studies have proposed detecting macular lesions in OCT images in an automated manner in patients with age-related macular degeneration, diabetic maculopathy, and other retinal disorders.25,26 However, the effects of extreme axial lengths and the characteristics of myopic maculopathy in OCT images vary considerably, and few OCT-related studies have been performed on myopic patients. Thus, it is difficult for physicians to make a definitive diagnosis based on the OCT images alone. Moreover, artifacts and errors are commonly observed in the OCT images of eyes with pathologic myopia, and these noise signals could confuse the grading results. In addition, mirror artifacts caused by extreme axial lengths or outward protrusions of the eye, and a breakdown of retinal segmentation in the OCT images due to the altered coherence signals of the retinal layers, both of which are frequently seen in eyes with pathologic myopia, contribute to the uncertainties. Because of these uncertainties, training algorithms on OCT images of eyes with pathologic myopia is difficult.

Peterson et al27 recently suggested that explicitly training models with soft labels could narrow the gap caused by human uncertainties, improve the generalization ability, and confer robustness to adversarial attacks. However, whether this method can be used on the OCT images of eyes with myopic maculopathy has not been determined.

Therefore, the primary aim of this study was to validate and compare deep learning models trained by hard labels and by soft labels in detecting myopic retinopathy from OCT images. The secondary aim was to evaluate the differences between human graders and deep learning models, and thereby assess the feasibility of automated approaches to the grading of OCT-based myopic maculopathy.


Data Acquisition

A schematic diagram of the workflow is provided in Figure 1. All patients and corresponding image data were collected from the Advanced Clinical Center for Myopia, which was established in 1974 under the name Tokyo High Myopia Clinic. Over 6000 patients with pathologic myopia were registered, and many were followed with periodic examinations in the outpatient clinic of the Advanced Clinical Center. To develop the algorithms, OCT images from patients with high myopia, defined as an axial length over 26.5 mm or a spherical equivalent less than -6.0 diopters, were collected between October 2015 and March 2019. In total, 9176 swept-source OCT (SSOCT) images were obtained with the Topcon DRI OCT-1 Atlantis (DRI-OCT Atlantis, Topcon Co, Tokyo, Japan) for further processing. All images were obtained from horizontal and vertical scans of 4588 follow-up examinations of highly myopic eyes.

Figure 1:
Schematic of workflow. Our project consisted of 5 steps: data collection, image grading, data processing, model construction, and performance comparison. We first collected images according to the inclusion criteria, and graders then graded them for myopic neovascularization, myopic traction maculopathy, and dome-shaped macula. Models were constructed based on soft labels and hard labels, respectively. Finally, the models were compared.

Ethics and Data Protection

All procedures adhered to the principles of the Declaration of Helsinki, and they were approved by the Ethics Committee of Tokyo Medical and Dental University (Application Number, M2018-261). The Institutional Review Board and Ethics Committee approved the medical data collection methods (Application Number: M2018-261-3). All images were anonymized irreversibly. Only retrospective data were used, without active participation by any patients. No commercial interests were involved in the design and conduct of this research.

Data Handling and Grading

The SSOCT images were examined to determine the presence of myopic neovascularization (MNV), myopic traction maculopathy (MTM), and a dome-shaped macula (DSM). These lesions were selected because they are the most common lesions in myopic maculopathy and are the main causes of vision decrease in eyes with pathologic myopia. Typical MNVs are observed as signs of exudation, which appear as retinal cysts and subretinal fluid, as shown in the grading samples (Supplementary Digital Content, Fig. 1). MNVs in the active stage appear as a dome-shaped elevation above the retinal pigment epithelium (RPE) with a hyper-reflective component. The scar stage of MNVs is characterized by hyper-reflectivity of the inner surface of the MNV with an attenuation of the reflectivity of the tissue below it. Typical MTM has the following features: retinal detachment with the neurosensory retina detached from the RPE, a macular hole, a breakage above the RPE layer, an anvil-shaped deformity of the crack edges of the retina, and retinoschisis, a splitting of the inner retina from the outer retinal layers with multiple perpendicular columnar structures connecting the split retinal layers. Typical DSM is seen as an inward bulging of the RPE above a line connecting the RPE lines on both sides away from the DSM. In addition, images that could not be graded were marked as “delete”; images marked as “delete” by all graders were labeled “negative” for all features.

The SSOCT images were assessed by a team of trained professional graders. The 21 ophthalmologists were divided into 7 groups of 3 graders each. Each group contained 1 specialist in pathologic myopia with more than 5 years of experience and 2 ophthalmologists who had received at least 6 months of training before doing the grading. Before the formal grading, 100 images were graded twice, with a 2-week interval, by all graders, and the results were used to determine the intra-grader reliability of each grader and the inter-grader reliability among the graders. The SSOCT images were divided into 7 sets, and each set was analyzed by 1 of the 7 groups. All graded images and the results from each grader were anonymized and compacted into a binary dataset for model construction. The anonymization was performed mainly to remove the patients’ identifying information, and it was irreversible.

Grouping and Pretreating Data

The images were randomly divided into 2 groups (ratio of 6:1): the training and validation dataset and the test dataset. Eyes in the test dataset never appeared in the training or the validation dataset. Random cropping and augmentation were used for the pretreatment of the OCT images. Random cropping is a data augmentation technique in which a random subset of an original image is created. This helps the model generalize because the object of interest is not always wholly visible in the image or at the same scale in our training data. After resizing from 1024 × 992 pixels to 512 × 512 pixels, all images were graded and then randomly cropped to 480 × 480 pixels for training. In addition, the training and validation dataset was further processed by augmentation: the number of images was doubled by horizontal flipping, which increased the image heterogeneity without losing any of the characteristics of the lesions.
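The cropping and flipping steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' C++/Caffe pipeline; the function names `random_crop` and `horizontal_flip`, the nested-list image representation, and the fixed random seed are assumptions made for the example.

```python
import random

def random_crop(image, crop_h, crop_w, rng=random):
    """Return a random crop_h x crop_w window of a 2-D image (list of rows)."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, height - crop_h)   # randint bounds are inclusive
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

# Toy stand-in for a resized 512 x 512 B-scan (pixel value encodes position).
resized = [[r * 512 + c for c in range(512)] for r in range(512)]
rng = random.Random(0)                      # hypothetical seed, for repeatability
crop = random_crop(resized, 480, 480, rng)  # 480 x 480 training crop
flipped = horizontal_flip(crop)             # augmented copy
```

Because the crop window moves by at most 32 pixels in each direction, most of the scan is retained while the lesion position varies slightly between training passes.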

Hard Labels and Soft Labels

Each SSOCT image was graded by at least 3 graders. Although the overall grading was consistent among the graders, some grading results remained inconsistent. To determine the model with the best performance, the images were labeled in 2 ways. The first used hard labels, which were decided by majority vote among the graders. All the images were marked with exact binary labels, ie, positive or negative, during the construction of the models. The second transformed all labels into probabilities, or soft labels. For example, if 2 of the 3 graders labeled the image as MNV positive, this image would be fed into the MNV model with a label of 2/3 positive probability.
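The two labeling schemes can be illustrated with a short sketch. The function names are hypothetical, and the handling of ties (treated as negative) is an assumption that does not arise with 3 graders.

```python
def hard_label(votes):
    """Majority vote: 1 if more than half of the graders voted positive, else 0.

    Ties (possible only with an even number of graders) are treated as
    negative here; this tie rule is an assumption for the example.
    """
    return 1 if sum(votes) * 2 > len(votes) else 0

def soft_label(votes):
    """Fraction of graders who voted positive, used as a target probability."""
    return sum(votes) / len(votes)

votes = [1, 1, 0]            # 2 of 3 graders marked the image MNV positive
hard = hard_label(votes)     # binary target for the hard label model
soft = soft_label(votes)     # 2/3 target for the soft label model
```

The hard label discards the dissenting vote, whereas the soft label keeps it as part of the training target.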

Model Construction

Models were trained with the Caffe framework (Berkeley AI Research, University of California, Berkeley, CA, US).28 The training platform was an NVIDIA GeForce GTX 1080 Graphics Card (NVIDIA Corporation, CA, US) running Ubuntu 18.04 LTS (Canonical, London, UK). All programs were written in the C++20 programming language (ISO/IEC 14882:2020).

Darknet-19 was used as the training neural network. For hard labels, the classification loss was calculated by the “SoftmaxWithLoss” layer. For soft labels, the classification loss was calculated by the “SigmoidCrossEntropyLoss” layer. The hyperparameters are shown in Supplementary Digital Content, Table 1. During the grading, we found that our data were imbalanced: the number of positive cases was much smaller than the number of negative cases. To deal with this imbalance and to reduce the false-negative rate, an oversampling method was applied to the data before training. Images in the minority class were augmented by oversampling until their number was close to that of the opposite class. In addition to oversampling, models were also trained with a high number of iterations (over 12,000) to be certain that we had reached the decision order of the models.
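The difference between the two losses can be shown with a small numerical sketch. The functions below reproduce the standard formulas that softmax cross-entropy and sigmoid cross-entropy layers compute; they are written in plain Python for illustration and are not the Caffe implementation. The function names and the example logits are assumptions.

```python
import math

def softmax_cross_entropy(logits, class_index):
    """Cross-entropy of softmax(logits) against one hard class index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[class_index]

def sigmoid_cross_entropy(logit, target):
    """Cross-entropy of sigmoid(logit) against a target probability in [0, 1].

    Uses the stable form max(x, 0) - x * t + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Hard label: the image is simply class 1 (positive) out of {0, 1}.
hard_loss = softmax_cross_entropy([0.2, 1.5], class_index=1)

# Soft label: 2 of 3 graders voted positive, so the target is 2/3.
soft_target = 2 / 3
loss_at_match = sigmoid_cross_entropy(math.log(2), soft_target)  # sigmoid = 2/3
loss_overconfident = sigmoid_cross_entropy(4.0, soft_target)     # sigmoid ~ 0.98
```

With a soft target, the loss is smallest when the predicted probability equals the graders' vote fraction, so an overconfident prediction is penalized even when it falls on the majority side.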

The training and validation datasets were used to train separate models to recognize MNVs, MTMs, and DSMs. Because both the labels of the images and the oversampling method could affect how a model “learns” and “extracts” features, each model was optimized with either hard labels or soft labels, and with either oversampling or high iteration.
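Oversampling of the minority class, as described in the model construction above, can be sketched as random duplication until the classes are balanced. This is one common variant and an assumption for illustration; the text states only that minority images were oversampled until their number was close to that of the opposite class. The function name, the use of binary labels, and the exact balancing target are hypothetical.

```python
import random

def oversample(samples, labels, rng=random):
    """Duplicate randomly chosen minority samples until both classes match.

    Assumes binary labels (0 or 1); returns shuffled (sample, label) pairs.
    """
    positives = [s for s, y in zip(samples, labels) if y == 1]
    negatives = [s for s, y in zip(samples, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    if len(positives) < len(negatives):
        minority, minority_label = positives, 1
    else:
        minority, minority_label = negatives, 0
    deficit = abs(len(positives) - len(negatives))
    extra = [(rng.choice(minority), minority_label) for _ in range(deficit)]
    balanced = list(zip(samples, labels)) + extra
    rng.shuffle(balanced)
    return balanced

# Toy dataset: 2 positive and 8 negative image identifiers.
image_ids = list(range(10))
image_labels = [1, 1] + [0] * 8
balanced = oversample(image_ids, image_labels, random.Random(0))
```

Duplicated images are typically combined with random cropping or flipping so that the repeated copies are not pixel-identical during training.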

Evaluations of Models

The test dataset was used to test the performance of the models after its images were resized to 512 × 512 pixels. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve, the precision-recall (PR) curve, F-score, least-square errors (L2 loss), and sensitivity and specificity were used to evaluate the performance of the models. The least absolute deviation, the L1 loss, was visualized with a calibration plot.29 SSOCT images overlaid with heat maps were used to view the areas of interest using the Class Activation Mapping (CAM) algorithm, which is well-known for visual explanations for deep networks by gradient-based localization.
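Several of the metrics listed above follow directly from the confusion matrix at a chosen threshold. The sketch below uses the standard definitions; the function name, the toy predictions, and the convention of returning 0.0 when a denominator is zero are assumptions for the example.

```python
def binary_metrics(probabilities, labels, threshold=0.5):
    """Precision, recall (sensitivity), specificity, F-score, and mean squared error."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    mse = sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f_score": f_score, "mse": mse}

# Toy model outputs and ground-truth labels (not data from the study).
toy_probabilities = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
metrics = binary_metrics(toy_probabilities, toy_labels)
```

AUC and the PR curve are obtained by repeating this calculation across all thresholds rather than at a single cut-off.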

Statistical Analyses

Because of the skewed distribution of the dataset, basic information such as age, sex, axial length, and other factors was compared between the training and validation dataset and the test dataset by Mann-Whitney U tests. Differences were considered significant when P was <0.05 (two-tailed). Cohen kappa values were used to evaluate the intra-rater reliability and for the comparisons between the human graders and the models. The images were subdivided into 7 sets, and each set was graded by the 3 graders of 1 group. The mean kappa values were used to assess the consistency of all the human graders (21 graders) because different images were graded by the different grading groups. Moderate agreement was defined as a kappa value >0.4, and substantial to high agreement was defined as a kappa value >0.6. To confirm the grading consistency between the models and the experienced myopia specialists, the grading results for images in the test dataset were used to evaluate the inter-rater reliability. The grading results from the experienced myopia specialists that corresponded to the test dataset were extracted, and kappa values were recalculated from them.
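Cohen kappa compares the observed agreement between 2 raters with the agreement expected by chance. A minimal sketch of the standard formula follows; the function name and the toy gradings are assumptions, and returning 1.0 when chance agreement is already perfect is a convention chosen for the example.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two equal-length sequences of categorical gradings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)

# Toy gradings of 10 images by two graders (1 = positive, 0 = negative).
grader_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
grader_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(grader_1, grader_2)
```

In this toy case the observed agreement is 0.80 and the chance agreement is 0.52, giving a kappa of about 0.58, which would count as moderate agreement under the thresholds stated above.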

Data Availability

The clinical datasets used were collected from the Tokyo High Myopia Clinic, Tokyo Medical and Dental University. This study and the collection of data were approved by the IRB of the hospital. The datasets are not publicly available, and restrictions apply to their use. All other data supporting the findings of this study are available within the article and its Supplementary Digital Content. The code used in this study for developing the algorithm and the pretrained parameters of the models are shared at


A total of 9176 OCT images, acquired from 2400 highly myopic eyes of 1327 patients, were graded and the details of the images are shown in Table 1. The mean axial length of these eyes was 29.31 ± 7.02 mm. The mean age was 59.16 ± 17.73 years, and the mean number of images (interquartile range) from each eye was 2 (2, 4).

Table 1 - Summary of Patients’ Information and Optical Coherence Tomography Image Details
Patient information (medians are given with 25% quartile, 75% quartile)
 No. patients 1327
 No. eyes 2400
 Female / male 977 / 350
 Median age (y) 62.58 (50.59, 71.29)
 Median axial length (mm) 29.31 (27.87, 30.77)
Image counts (positive / negative)
 No. images 9176
 Myopic neovascularization 1227 / 7949
 Myopic traction maculopathy 4079 / 5097
 Dome-shaped macula 525 / 8651

We first compared the consistency among the graders. All the images were collected without image quality control and were graded separately. For grading the presence of MNV, MTM, and DSM, intra-grader consistency was high, with mean kappa values of 0.91, 0.83, and 0.78, respectively (all P < 0.01). Inter-grader consistency was moderate to high, with mean kappa values of 0.704, 0.624, and 0.519, respectively. Thus, even though each grader was highly consistent across the 2 grading sessions, some uncertainty between graders remained in the grading of myopic maculopathy based on the OCT images in our database.

We then compared the performance of the MNV and MTM models under different training conditions. Six models were trained and compared for each lesion: hard label and soft label models, each trained alone, with oversampling, or with oversampling and high iteration (Fig. 2). The MNV model trained by soft labels with oversampling and iteration had the best performance in the total test dataset (AUC = 0.985), with a high area under the precision-recall curve (AUPR) of 0.908. Although the MTM model trained by soft labels with oversampling and iteration had a higher specificity and a larger AUPR (0.876) than the others, its sensitivity was close to that of the other models, with an AUC of 0.946.

Figure 2:
Performances of different models in categorizing myopic neovascularization, myopic traction maculopathy, and dome-shaped macula. The receiver operator characteristic curve (the first row) and the precision-recall curve (second row) show that models trained by soft label with oversampling and iteration had better performances than that of the other models.

Regarding DSM, positive cases accounted for only 6.1% of the total, even after taking uncertain cases into account. DSM models without oversampling and high iteration were therefore considered to have limited practical significance. The differences among the MNV and MTM models with or without oversampling and high iteration were mostly less than 5%, or even less than 1%, and such differences would become even smaller when models were trained on a low number of positive cases. Thus, DSM models were trained directly with oversampling and high iteration, by soft labels and by hard labels. The results showed that even though there were fewer positive cases, soft labels with oversampling and iteration performed better, with an AUC value of 0.978, which was higher than that of hard labels. The activation maps from each model trained by soft labels are shown in Figure 3.

Figure 3:
Heatmaps of myopic neovascularization (A and B), myopic traction maculopathy (C and D), and dome-shaped macula (E and F).

The models trained either by hard labels or by soft labels had moderate to high consistency with experienced myopia specialists, and there were only slight differences among the kappa values (Fig. 4). However, on closer examination, the predictions varied among the images: models trained by hard labels were highly certain in most cases, whereas models trained by soft labels expressed uncertainties similar to those of the physicians. The performance of the models on the MNV test dataset, in which the predicted probability was categorized in 10% intervals and paired with the labels given by the human graders, is shown in Table 2. In this table, there were 6 (0.56%) cases that were graded as “negative” by all physicians but that the hard label model answered as “positive” with a high certainty. In contrast, none of the cases were graded with these extreme errors by the soft label model. In addition, there were 10 (8.55%) cases that were graded as “positive” by all physicians but that the hard label model answered as “negative” with a high certainty, and the number decreased to 3 (2.56%) in the soft label model. The same results were found in the MTM and DSM models (Supplementary Digital Content, Tables 2 and 3). We also visualized the L1 loss of the models for the different myopic maculopathy features in Figure 5. The results revealed that the predicted probabilities supplied by the soft label model were more likely to agree with the observed probabilities than those of the hard label model.

Figure 4:
Comparisons between models and human myopic experts. Kappa values showed that models trained by soft label or hard label had a moderate to high consistency with myopia specialists in recognizing myopic neovascularization, myopic traction maculopathy, and dome-shaped macula.
Table 2 - Myopic Macular Neovascularization (MNV) Model Outputs as Levels of Predicted Probabilities With Oversampling and Iterations
Column headings give the number of graders (of 3) who labeled the image MNV positive.

Hard label with OS and Iter (N = 1315 images)
Predicted probability (x); 0/3 (Negative) N = 1068; 1/3 (Negative) N = 60; 2/3 (Positive) N = 70; 3/3 (Positive) N = 117; L2 loss
x < 0.1; 1053 (98.6%); 40 (66.7%); 17 (24.3%); 10 (8.55%); 0.001
0.1 ≤ x < 0.2; 4 (0.37%); 1 (1.67%); 4 (5.71%); 1 (0.85%); 0.066
0.2 ≤ x < 0.3; 0 (0.00%); 3 (5.00%); 1 (1.43%); 1 (0.85%); 0.086
0.3 ≤ x < 0.4; 1 (0.09%); 1 (1.67%); 1 (1.43%); 0 (0.00%); 0.000
0.4 ≤ x < 0.5; 1 (0.09%); 0 (0.00%); 0 (0.00%); 1 (0.85%); 0.006
0.5 ≤ x < 0.6; 0 (0.00%); 2 (3.33%); 1 (1.43%); 1 (0.85%); 0.001
0.6 ≤ x < 0.7; 1 (0.09%); 0 (0.00%); 3 (4.29%); 0 (0.00%); 0.025
0.7 ≤ x < 0.8; 1 (0.09%); 0 (0.00%); 1 (1.43%); 2 (1.71%); 0.007
0.8 ≤ x < 0.9; 1 (0.09%); 1 (1.67%); 3 (4.29%); 1 (0.85%); 0.105
0.9 ≤ x; 6 (0.56%); 12 (20.0%); 39 (55.7%); 100 (85.5%); 0.027
Avg. L2: 0.032

Soft label with OS and Iter (N = 1315 images)
Predicted probability (x); 0/3 N = 1068; 1/3 N = 60; 2/3 N = 70; 3/3 N = 117; L2 loss
x < 0.1; 1020 (95.5%); 14 (23.3%); 5 (7.14%); 3 (2.56%); 0.000
0.1 ≤ x < 0.2; 20 (1.87%); 12 (20.0%); 5 (7.14%); 2 (1.71%); 0.007
0.2 ≤ x < 0.3; 10 (0.94%); 11 (18.3%); 7 (10.0%); 0 (0.00%); 0.002
0.3 ≤ x < 0.4; 6 (0.56%); 5 (8.33%); 9 (12.9%); 2 (1.71%); 0.009
0.4 ≤ x < 0.5; 3 (0.28%); 4 (6.67%); 2 (2.86%); 5 (4.27%); 0.010
0.5 ≤ x < 0.6; 5 (0.47%); 6 (10.0%); 5 (7.14%); 6 (5.13%); 0.002
0.6 ≤ x < 0.7; 3 (0.28%); 2 (3.33%); 8 (11.4%); 7 (5.98%); 0.000
0.7 ≤ x < 0.8; 0 (0.00%); 2 (3.33%); 3 (4.29%); 8 (6.84%); 0.005
0.8 ≤ x < 0.9; 1 (0.09%); 3 (5.00%); 10 (14.3%); 14 (12.0%); 0.006
0.9 ≤ x; 0 (0.00%); 1 (1.67%); 16 (22.9%); 70 (59.8%); 0.002
Avg. L2: 0.004

OS indicates oversampling; Avg. L2, average L2 loss; Iter, iteration.

Figure 5:
Calibration plot of models. The subplots visualize the L1 loss of the models for the different myopic maculopathy features. In each plot, the x axis shows the expected value of the predicted label and the y axis shows the expected value of the true label. If the x value is close to the y value, the dot lies on the reference line (oblique red line); in other words, the closer the dots are to the reference line, the better the model performs. The plots show that the predicted probabilities supplied by the soft label model were more likely to agree with the observed probabilities than those supplied by the hard label model for myopic neovascularization, myopic traction maculopathy, and dome-shaped macula.
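The points of a calibration plot of this kind can be computed by grouping images by predicted probability and comparing the mean prediction with the mean observed label in each group. The sketch below assumes 10 equal-width bins and uses the graders' vote fraction as the observed probability; both choices, along with the function names and toy values, are assumptions and may differ from how Figure 5 was produced.

```python
def calibration_points(predicted, observed, n_bins=10):
    """Return (mean predicted, mean observed, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    points = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            points.append((mean_p, mean_y, len(members)))
    return points

def mean_absolute_gap(points):
    """Count-weighted mean distance of the points from the diagonal (an L1 gap)."""
    total = sum(n for _, _, n in points)
    return sum(abs(p - y) * n for p, y, n in points) / total

# Toy predictions with the graders' vote fraction as the observed probability.
toy_predicted = [0.05, 0.05, 0.95, 0.95, 0.65]
toy_observed = [0.0, 0.0, 1.0, 1.0, 2 / 3]
points = calibration_points(toy_predicted, toy_observed)
gap = mean_absolute_gap(points)
```

A well-calibrated model yields points near the diagonal and therefore a small gap.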

We also evaluated the predictions of the models by dividing the images into 2 groups according to the grading results of the physicians (Table 3): the certain images group and the uncertain images group. In the certain images group, all graders gave each image the same diagnosis; in the uncertain images group, the graders disagreed on whether each image was “positive” or “negative”. The results showed no obvious differences among the models on the certain images, but the results varied on the uncertain images. When the cut-off value was set at 0.5 (positive = predicted probability ≥0.5), the models trained by soft labels had a higher recall rate and precision rate than those trained by hard labels in the uncertain images group, especially the models trained on a minority-biased dataset such as DSM. The results indicated that the soft label models had a higher recall rate in most situations, for both the certain images and the uncertain images.

Table 3 - Models’ Performance in Certain Images Group and Uncertain Images Group
The first 5 values in each row are for the certain images group (0/3 and 3/3); the last 5 values are for the uncertain images group (1/3 and 2/3).
Model Precision Recall F-score MSE Specificity Precision Recall F-score MSE Specificity
MNV models
 Hard label 0.925 0.846 0.884 0.019 0.993 0.684 0.557 0.614 0.111 0.700
 Soft label 0.892 0.846 0.868 0.019 0.989 0.667 0.600 0.632 0.083 0.650
 Hard label with OS 0.838 0.932 0.883 0.019 0.980 0.655 0.786 0.714 0.125 0.517
 Soft label with OS 0.871 0.923 0.896 0.020 0.985 0.714 0.714 0.714 0.053 0.667
 Hard label with OS and iteration 0.920 0.889 0.904 0.096 0.992 0.758 0.671 0.712 0.472 0.750
 Soft label with OS and iteration 0.921 0.897 0.909 0.099 0.992 0.750 0.600 0.667 0.455 0.767
MTM models
 Hard label 0.896 0.845 0.870 0.035 0.977 0.667 0.351 0.460 0.106 0.877
 Soft label 0.917 0.851 0.882 0.040 0.982 0.632 0.315 0.421 0.083 0.871
 Hard label with OS 0.827 0.938 0.879 0.038 0.955 0.625 0.482 0.545 0.094 0.796
 Soft label with OS 0.903 0.861 0.881 0.041 0.976 0.667 0.351 0.460 0.077 0.877
 Hard label with OS and iteration 0.898 0.912 0.905 0.191 0.976 0.699 0.447 0.545 0.279 0.865
 Soft label with OS and iteration 0.910 0.938 0.924 0.220 0.979 0.691 0.491 0.574 0.332 0.847
DSM models
 Hard label with OS and iteration 0.825 0.868 0.846 0.033 0.994 0.455 0.250 0.323 0.217 0.821
 Soft label with OS and iteration 0.791 0.895 0.840 0.049 0.992 0.640 0.400 0.492 0.331 0.866
MNV indicates myopic neovascularization; DSM, dome-shaped macula; MSE, mean square error; MTM, myopic traction maculopathy; OS, oversampling.


We compared the hard label model, trained on definite diagnoses, with the soft label model, trained on diagnoses that retain the graders’ uncertainty. The results indicated that incorporating image-level information about the physicians’ uncertainty can improve performance in OCT image recognition. In addition to high accuracy on certain OCT images, the models trained with human uncertainty showed promising generalization ability and robustness on the uncertain OCT images of myopic maculopathy that physicians encounter.

The main strength of this study lies in the methods used for training and comparing the models. It is well-known that deep learning is inherently prone to overfitting, and a better deep learning model needs to be more robust rather than merely fitting all the training data. All models in this study were trained on images without quality control, with the aim of mimicking the real-life clinical situation in patients with pathologic myopia. The actual situation is much more complex than the experimental environment, and it is common for physicians to encounter uncertainties when they examine some clinical images. Unlike previous studies of OCT image analysis, we took these uncertainties into account by employing soft labels. The soft label models seemed to be somewhat uncertain when examining those images, but these uncertainties did not affect the results (Supplementary Digital Content, Tables 2 and 3). On the contrary, the results of the soft label models were more robust. Another strength of this study lies in the dataset employed. The present study was hospital-based: all potential cases were collected without calculating or comparing the basic information among minority features, and even low-quality cases were included. Although no sample size estimation was performed, the abundant cases and results were persuasive in showing the positive effects of soft labels when applied to minority cases and uncertain cases under conditions that mimic real situations.

Deep learning is widely used for analyzing images in medical science. Generally, deep learning models are trained and validated based on the hypothesis that everything in nature can be clearly categorized into specific classes. However, the actual situation in the medical field is that not all lesions can be clearly identified by physicians from only 1 image. During the grading of myopic maculopathy in the real world, for example, ophthalmologists are more likely to grade an image as “this lesion is probably an MNV” rather than “this lesion is definitely an MNV” in most situations. This is because they also need to differentiate an MNV from other similar lesions, and such differentiation cannot be made completely from only 1 OCT image. As shown in the examples (Supplementary Digital Content, Fig. 2), it is difficult to conclude whether an eye has MNV, MTM, or DSM. To the best of our knowledge, this study is the first to take the effects of uncertainty into account and validate the function of soft labels in predicting myopic maculopathy. Our results indicated that models trained by soft labels can mimic the actual diagnosis conducted by humans.

In an earlier study, Li et al30 examined the vision-threatening factors in high myopia by deep learning models. They trained models with hard labels and improved the recall rate by merging the outputs from 2 images (horizontal and vertical scan lines) into the final diagnosis. In our study, we supplied another choice that could provide more information to physicians and assist in the diagnosis. Basically, hard label models that are trained on a sparse matrix tended to overfit the training data, and their output was an extreme prediction (“positive” or “negative”). The soft label method takes uncertainty into account, and the resulting probability is more meaningful not only for a referral model but also for physicians. For example, a model output of “positive with a possibility of 60%” says that 6 out of 10 (60%) experts would judge the image as positive. From these results, we believe that this information would effectively help physicians manage the follow-ups of pathologic myopia patients, in which first-line screening with the models is followed by that of the physicians.

Myopic maculopathy due to pathologic myopia has been the major cause of vision impairment worldwide. Specifically, earlier studies revealed that myopic maculopathy continues to progress with time, and thus prompt fundus screening is necessary.31 However, the various appearances of myopic maculopathy in OCT images make the situation harder, especially in remote areas where there is a lack of myopia specialists. There is an urgent need for automated methods to assist in patient management. Our models would be especially useful for helping physicians detect myopic maculopathy in places where OCT devices are available but experts are not. The present study only validated the usefulness of soft labels in MNV, MTM, and DSM, and further validation in datasets of OCT images from other settings is expected.

We examined 3 main lesions of pathologic myopia, viz. MNV, MTM, and DSM. An earlier study reported that nearly 5.2% to 11.3% of eyes with pathologic myopia were affected by MNV, and it is the most frequent cause of visual reduction.32 In addition, previous studies reported that 92.7% of MNVs in the active or scar phase would progress to MNV-related macular atrophy during the follow-up period, which would lead to an irreversible vision loss.31 According to Panozzo et al,33 the MTM staging system focused on all vision-threatening changes of the retina, especially in the foveal area. The recognition of MTM and timely surgery for complications could protect patients from vision reduction. As indicated by previous research, DSM is significantly associated with Bruch membrane (BM) defects, in which the presence of an intact BM is a protective factor against the myopia-associated changes in the overlying retina. Thus, automated grading of myopic maculopathy has potential benefits not only in increasing efficiency and reproducibility, but also in reducing barriers to access. Correctly diagnosing pathologic myopia changes and treating myopic maculopathy in a timely manner could further improve patient outcomes and prevent myopia-related vision reduction by providing early detection and treatment.

There are limitations to this study. First, all the patients in our training and testing data were Japanese, and transfer learning may be required when the models are applied to Caucasians and individuals of other ethnicities. Second, we collected images regardless of patients’ prior surgeries, such as vitreoretinal surgeries. These eyes accounted for <1% of the total database. Their images may differ from those of eyes without prior surgeries, but we considered that including them would help the models enhance their generalization ability and be closer to real-life conditions. Third, some lesions such as macular atrophy were not examined. Since the main purpose of this study was to validate the influence of soft label training on models for myopia OCT images, future studies are needed to train and validate the performance of models on all myopic fundus lesions. In addition, it is possible that controversial cases were underdiagnosed; for example, MNV with subretinal hemorrhage due to rupture of the BM often lacks retinal cysts or subretinal fluid. The main task of this study was to validate the function of soft labels using OCT images without checking the detailed medical history. We purposely did not differentiate these controversial cases. Instead, we left this uncertainty to the model, as happens in real settings. Finally, because the core of this study was assessing the usefulness of soft labels rather than optimizing the threshold, we set the threshold at 0.5 when categorizing the positive cases. Further work is needed to determine the optimum threshold value for each feature so as to balance sensitivity and specificity.

In conclusion, we validated the positive influence of soft labels in developing deep learning models that detect myopic retinopathy in OCT images and predict lesion probabilities in an automated manner. Soft labels may perform similarly to hard labels in general circumstances but could potentially outperform hard labels in particular conditions in which the diagnosis is uncertain or difficult. In addition, the probabilities predicted by models trained with soft labels should help physicians manage patients with pathologic myopia efficiently. These findings should inspire novel approaches to training deep learning models and promote the wider use of soft labels in medicine.
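The difference between soft and hard labels in uncertain cases can be illustrated with a minimal sketch. It assumes a binary cross-entropy loss and a hypothetical soft target of 0.6 (for example, a case on which graders disagree); neither the loss function nor the value is taken from this study. The sketch shows only that a soft target rewards a calibrated intermediate probability, whereas a hard target pushes the prediction toward certainty.

```python
import numpy as np


def binary_cross_entropy(target, prob, eps=1e-12):
    """Binary cross-entropy between a target in [0, 1] and a predicted probability."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))


# Candidate predicted probabilities: 0.01, 0.02, ..., 0.99
probs = np.linspace(0.01, 0.99, 99)

soft_target = 0.6  # hypothetical soft label for an ambiguous image
hard_target = 1.0  # the same image forced into the positive class

# Prediction on the grid that minimizes the loss for each kind of label
best_p_soft = float(probs[np.argmin(binary_cross_entropy(soft_target, probs))])
best_p_hard = float(probs[np.argmin(binary_cross_entropy(hard_target, probs))])

print(f"loss-minimizing prediction, soft label: {best_p_soft:.2f}")
print(f"loss-minimizing prediction, hard label: {best_p_hard:.2f}")
```

With the soft target, the loss is lowest when the predicted probability matches the label uncertainty; with the hard target, it is lowest at the most extreme probability on the grid.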


The authors thank Professor Emeritus Duco Hamasaki of the Bascom Palmer Eye Institute, University of Miami, FL, for discussions and thorough editing of the manuscript without managing data.


1. Morgan IG, Ohno-Matsui K, Saw SM. Myopia. Lancet 2012; 379:1739–1748. doi:10.1016/S0140-6736(12)60272-4.
2. Taylor HR. Global Blindness: The progress we are making and still need to make. Asia Pac J Ophthalmol (Phila) 2019; 8:424–428. doi:10.1097/APO.0000000000000264.
3. Morgan IG, French AN, Ashby RS, et al. The epidemics of myopia: Aetiology and prevention. Prog Retin Eye Res 2018; 62:134–149. doi:10.1016/j.preteyeres.2017.09.004.
4. Reed DS, Ferris LM, Santamaria J, et al. Prevalence of myopia in newly enlisted airmen at Joint Base San Antonio. Clin Ophthalmol 2020; 14:133–137. doi:10.2147/OPTH.S233048.
5. Hopf S, Korb C, Nickels S, et al. Prevalence of myopic maculopathy in the German population: results from the Gutenberg health study. Br J Ophthalmol 2020; 104:1254–1259. doi:10.1136/bjophthalmol-2019-315255.
6. Alvarez-Peregrina CC, Sanchez-Tena M, Martinez-Perez CC, Villa-Collar CC. Prevalence and risk factors of myopia in Spain. J Ophthalmol 2019; 2019:3419576. doi:10.1155/2019/3419576.
7. Bikbov MM, Gilmanshin TR, Kazakbaeva GM, et al. Prevalence of myopic maculopathy among adults in a Russian population. JAMA Netw Open 2020; 3:e200567. doi:10.1001/jamanetworkopen.2020.0567.
8. Ohno-Matsui K. What is the fundamental nature of pathologic myopia? Retina 2017; 37:1043–1048. doi:10.1097/IAE.0000000000001348.
9. Iwase A, Araie M, Tomidokoro A, et al. Prevalence and causes of low vision and blindness in a Japanese adult population: the Tajimi Study. Ophthalmology 2006; 113:1354–1362. doi:10.1016/j.ophtha.2006.04.022.
10. Dong L, Kang YK, Li Y, et al. Prevalence and time trends of myopia in children and adolescents in China: a systemic review and meta-analysis. Retina 2020; 40:399–411. doi:10.1097/IAE.0000000000002590.
11. Xu L, Wang Y, Li Y, et al. Causes of blindness and visual impairment in urban and rural areas in Beijing: the Beijing Eye Study. Ophthalmology 2006; 113:1134. doi:10.1016/j.ophtha.2006.01.035.
12. Gao LQ, Liu W, Liang YB, et al. Prevalence and characteristics of myopic retinopathy in a rural Chinese adult population: the Handan Eye Study. Arch Ophthalmol 2011; 129:1199–1204. doi:10.1001/archophthalmol.2011.230.
13. Balyen L, Peto T. Promising artificial intelligence-machine learning-deep learning algorithms in ophthalmology. Asia Pac J Ophthalmol (Phila) 2019; 8:264–272. doi:10.22608/APO.2018479.
14. Bakshi SK, Lin SR, Ting DSW, et al. The era of artificial intelligence and virtual reality: transforming surgical education in ophthalmology. Br J Ophthalmol 2021; 105:1325–1328. doi:10.1136/bjophthalmol-2020-316845.
15. Nuzzi R, Boscia G, Marolo P, Ricardi F. The impact of artificial intelligence and deep learning in eye diseases: a review. Front Med (Lausanne) 2021; 8:710329. doi:10.3389/fmed.2021.710329.
16. Ruamviboonsuk P, Cheung CY, Zhang X, et al. Artificial intelligence in ophthalmology: evolutions in Asia. Asia Pac J Ophthalmol (Phila) 2020; 9:78–84. doi:10.1097/
17. Gunasekeran DV, Wong TY. Artificial intelligence in ophthalmology in 2020: a technology on the cusp for translation and implementation. Asia Pac J Ophthalmol (Phila) 2020; 9:61–66. doi:10.1097/01.APO.0000656984.56467.2c.
18. O’Byrne C, Abbas A, Korot E, Keane PA. Automated deep learning in ophthalmology: AI that can build AI. Curr Opin Ophthalmol 2021; 32:406–412. doi:10.1097/ICU.0000000000000779.
19. Hanif AM, Beqiri S, Keane PA, Campbell JP. Applications of interpretability in deep learning models for ophthalmology. Curr Opin Ophthalmol 2021; 32:452–458. doi:10.1097/ICU.0000000000000780.
20. Ng WY, Zhang S, Wang Z, et al. Updates in deep learning research in ophthalmology. Clin Sci (Lond) 2021; 135:2357–2376. doi:10.1042/CS20210207.
21. He M, Li Z, Liu C, et al. Deployment of artificial intelligence in real-world practice: opportunity and challenge. Asia Pac J Ophthalmol (Phila) 2020; 9:299–307. doi:10.1097/APO.0000000000000301.
22. Benet D, Pellicer-Valero OJ. Artificial intelligence: the unstoppable revolution in ophthalmology. Surv Ophthalmol 2021; doi:10.1016/j.survophthal.2021.03.003.
23. Cheng CY, Soh ZD, Majithia S, et al. Big data in ophthalmology. Asia Pac J Ophthalmol (Phila) 2020; 9:291–298. doi:10.1097/APO.0000000000000304.
24. Xu LL, Yang Z, Tian B. Artificial intelligence based on images in ophthalmology. Zhonghua Yan Ke Za Zhi 2021; 57:465–469. doi:10.3760/cma.j.cn112142-20201224-00842.
25. Schlegl T, Waldstein SM, Bogunovic H, et al. Fully automated detection and quantification of macular fluid in OCT using deep learning. Ophthalmology 2018; 125:549–558. doi:10.1016/j.ophtha.2017.10.031.
26. Lee CS, Baughman DM, Lee AY. Deep learning is effective for the classification of OCT images of normal versus age-related macular degeneration. Ophthalmol Retina 2017; 1:322–327. doi:10.1016/j.oret.2016.12.009.
27. Peterson JC, Battleday RM, Griffiths TL, Russakovsky O. Human uncertainty makes classification more robust. 2019. arXiv:1908.07086.
28. Jia Y, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding. 2014. arXiv:1408.5093.
29. Nixon J, Dusenberry M, Jerfel G, et al. Measuring calibration in deep learning. 2019. arXiv:1904.01685.
30. Li Y, Feng W, Zhao X, et al. Development and validation of a deep learning system to screen vision-threatening conditions in high myopia using optical coherence tomography images. Br J Ophthalmol 2020; doi:10.1136/bjophthalmol-2020-317825.
31. Fang Y, Yokoi T, Nagaoka N, et al. Progression of myopic maculopathy during 18-year follow-up. Ophthalmology 2018; 125:863–877. doi:10.1016/j.ophtha.2017.12.005.
32. Wong TY, Ferreira A, Hughes R, et al. Epidemiology and disease burden of pathologic myopia and myopic choroidal neovascularization: an evidence-based systematic review. Am J Ophthalmol 2014; 157:9–25. doi:10.1016/j.ajo.2013.08.010.
33. Panozzo G, Mercanti A. Optical coherence tomography findings in myopic traction maculopathy. Arch Ophthalmol 2004; 122:1455–1460. doi:10.1001/archopht.122.10.1455.

Keywords: deep learning; myopia; myopic maculopathy; optical coherence tomographic; soft label

Copyright © 2021 Asia-Pacific Academy of Ophthalmology. Published by Wolters Kluwer Health, Inc. on behalf of the Asia-Pacific Academy of Ophthalmology.