Comparing the predictive performance of a decision tree with logistic regression for oral cavity cancer mortality: A retrospective study : Cancer Research, Statistics, and Treatment

Original Article – Statistical Resource

Comparing the predictive performance of a decision tree with logistic regression for oral cavity cancer mortality: A retrospective study

Sevvanthi, K; Ganapathy, Sachit; Penumadu, Prasanth; Harichandrakumar, KT

Cancer Research, Statistics, and Treatment 6(1):p 103-110, Jan–Mar 2023. | DOI: 10.4103/crst.crst_234_22



Cancer is one of the most lethal diseases with 19.3 million new cases and almost 10 million deaths globally in 2020.[1] In India, it was projected that there would be 13,92,179 patients with cancer by the year 2020, with the most common sites of cancer in the head-and-neck, breast, and cervix.[2] In India, the cancer incidence burden was 98.7 per 100,000 population with head-and-neck cancer (HNC) being the most common cancer in several states. Furthermore, lung and oral cancers were found to be the leading cancer types among men, while breast and cervix uteri were the most common cancer types among women of India.[3] HNC accounts for 30% of all cancers in men and 11% to 16% of cancers in women.[3] Among men, buccal mucosa cancers are the major contributor, and between 60% and 90% of patients with HNCs present at an advanced stage, which is substantially higher than the proportion in developed countries.[4]

In India, oral cavity cancer is a major public health issue when compared to other cancer types. Oral cavity cancer accounts for approximately 77,000 new cases and 52,000 deaths annually in the world.[5] Among Indian men and women, oral cavity cancer ranks as the second and fifth most common cancer, respectively.[5] Lack of awareness and late diagnosis may be the major reasons for the reduced overall survival of patients with oral cavity cancer in India.[6–8] The most common risk factors for oral cavity cancer are tobacco use, smoking, alcohol consumption, and, specifically in India, betel leaf chewing.[5] In India, tobacco use and smoking are common, and oral cavity cancers are very common in regions with a high prevalence of tobacco and alcohol habits.[9]

Predictive modeling techniques are complex and efficient ways of predicting outcomes.[9] Predictive modeling and machine learning techniques have gained interest in biomedical research in recent years.[9–12] They are used to forecast future oncological outcomes that can help improve patient care. Some of the most widely used machine learning techniques are artificial neural network, random forest, decision tree, support vector machines, and Bayesian networks.[10,13]

In this study, we compared the predictive accuracy of logistic regression and decision tree models. Logistic regression and decision tree models were used as classification models to predict the mortality caused by disease among patients with HNC.


General study details

This was a retrospective analysis of a secondary dataset from a cohort of patients with oral cavity cancer, conducted in the Department of Surgical Oncology, Jawaharlal Institute of Post Graduate Medical Education and Research (JIPMER), a tertiary healthcare hospital in Puducherry, South India. The de-identified records of the patients with oral cavity cancer, who received treatment at our center during the period from 2011 to 2017, were included in the study. Approval for procuring the data and conducting the study was obtained from the Institute Ethics Committee (Project No. JIP/IEC/2020/341) on February 22, 2021; the project was considered to pose less than minimal risk and was therefore exempted from review [Supplementary Appendix 1]. The Ethics Committee waived the requirement to obtain written informed consent as we would be using a secondary dataset with no primary contact with the patients. The study was conducted according to the ethical guidelines established by the Declaration of Helsinki and other guidelines such as Good Clinical Practice Guidelines and those established by the Indian Council of Medical Research. No funding was availed for the study. As this was a cohort study and not a clinical trial, it was not registered in a publicly accessible clinical trials registry.

Supplementary Appendix 1


Patients with oral cavity cancer who had received treatment between 2011 and 2017 were considered for enrollment in the study. Since histopathological data were available only for the 275 patients who underwent surgery, only those cases were included for developing the models.


Our primary objective was to assess the performance of logistic regression and decision tree models for predicting mortality due to disease among patients diagnosed with oral cavity cancer.

Study methodology

The data on variables such as stage of the cancer, tumor (T) category, margin status, and node (N) category were extracted from the de-identified patient records. Staging of the cancer was according to the American Joint Committee on Cancer (AJCC), 8th edition. The details were extracted from the data collected as part of a study conducted in the Department of Surgical Oncology, JIPMER (data maintained at the institute). The outcome considered for this study was mortality due to disease at the end of 2 years of follow-up from the date of diagnosis. We considered a binary outcome instead of a time-to-event analysis, as we were interested in the outcome at the end of 2 years of follow-up for assessing the performance of the models. In this study, two classification models, multiple binary logistic regression and decision tree, were employed to predict the mortality caused by disease among patients with oral cavity cancer.


The planned sample size was calculated as 275 based on the prevalence of oral cancer of 23.15%[3] and with a 5% level of significance.
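The text does not state the formula or the precision behind this calculation. As a hedged illustration only, the sketch below assumes the common single-proportion formula n = z²p(1 − p)/d² with an absolute precision d of 5 percentage points (an assumption, not a value given in the article). Under those assumptions, the result is close to, but not identical to, the reported 275; the function name and the precision are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, precision, alpha=0.05):
    """Sample size to estimate a proportion p within +/- precision.

    Uses n = z^2 * p * (1 - p) / d^2, where z is the two-sided normal
    quantile for the chosen significance level alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return math.ceil(z ** 2 * p * (1 - p) / precision ** 2)


# Prevalence of 23.15% is from the text; the 5% precision is an assumption.
n = sample_size_for_proportion(p=0.2315, precision=0.05)
print(n)
```

The small gap between this value and the reported sample size may reflect rounding or a different precision; the article does not say.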

The data were analyzed using the Statistical Package for the Social Sciences (IBM Corp. Released 2010. IBM SPSS Statistics for Windows, Version 19.0. Armonk, NY: IBM Corp.).[14] The distribution of the categorical variables such as demographic and clinical characteristics was summarized as frequencies with percentages. The Chi-square test was performed to assess the association between categorical variables and outcomes. Independent Student’s t-test/Mann-Whitney U-test was performed to compare the continuous variables such as age and neutrophil count, based on the mortality status, depending upon the normality assumption of the variables. The variables found to be significantly associated with outcome were included for developing the logistic regression and decision tree models. These models were compared in terms of receiver operating characteristic (ROC) curve analysis and predictive accuracy. Although we performed multiple statistical tests, these were performed for different outcomes of interest, and hence, no adjustment for multiplicity was deemed necessary. All the statistical analyses were carried out at a 5% level of significance, and a P value less than 0.05 was considered statistically significant.
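The Chi-square test of association described above can be sketched for a 2 × 2 table as follows. This is a minimal illustration written in plain Python rather than the SPSS procedure the authors used; the counts are hypothetical, no continuity correction is applied, and the P value formula is valid only for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and P value for a 2 x 2 table (df = 1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = factor present/absent, columns = died/alive.
toy_table = [[20, 30], [40, 10]]
chi2, p_value = chi_square_2x2(toy_table)
print(round(chi2, 3), p_value < 0.05)
```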

Logistic regression analysis

Logistic regression is a generalized linear model, developed to explore the relationship between a categorical outcome variable and a set of predictor variables.[15] The independent variables may be measured on either a continuous or a categorical scale.

A simple binary logistic regression is used to study the relationship between a binary dependent variable and a single predictor variable. The model can be written as follows:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

However, multiple binary logistic regression allows us to study the relationship between a binary outcome variable and several predictor variables. The model can be written as follows:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Where β’s are regression coefficients, x’s are independent variables such as clinical stage, pathological stage, and margin status, and p is the probability of the occurrence of the outcome.

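Let x1, x2, …, xn be the independent variables (clinical stage, pathological stage, and margin status) and y be the outcome variable, that is, mortality due to HNC. The expected probability of mortality is given as follows:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}$$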

The Hosmer-Lemeshow test was used to assess the goodness of fit of the multiple logistic regression model.[15]
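To make the link between the regression coefficients and the predicted probability concrete, the following minimal sketch applies the logistic function to a linear predictor. The function name, the coefficients, and the predictor values are hypothetical and are not estimates from this study.

```python
import math


def predicted_probability(intercept, coefficients, x):
    """Probability of the outcome from a fitted logistic regression model.

    Computes p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn))), which is
    algebraically the same as exp(z) / (1 + exp(z)).
    """
    linear_predictor = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients for two binary indicators (for illustration only).
p_low = predicted_probability(-2.0, [0.8, 1.2], [0, 0])
p_high = predicted_probability(-2.0, [0.8, 1.2], [1, 1])
print(round(p_low, 3), round(p_high, 3))
```

A subject would typically be classified as a predicted death when the probability exceeds a chosen cut-off, commonly 0.5; the article does not state the cut-off used.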

Decision tree analysis

Decision trees are widely used for binary classification. They have a flowchart-like structure in which subjects are classified by following a sequence of tree-based rules.[16] The decision tree uses information from independent variables to classify subjects into the most likely category of the outcome variable.

Classification and regression tree (CART) is a method of decision tree modeling that splits the data into homogeneous subgroups with respect to the outcome variable.[16] It is a binary split method where a node (sub-branches of possible outcomes) can be split into two child nodes. A node in which all cases have the same value for the dependent variable is known as a pure node; it requires no further splitting and becomes a terminal node. The tree starts from the root node that contains the entire population or sample, and following this, each node is divided into two more homogeneous subsets. The nodes are split based on Gini impurity indices. The best predictor is chosen based on the lowest Gini impurity index; that is, we choose the attribute that yields the most homogeneous sub-nodes after the split.[16] The Gini index is calculated by subtracting the sum of the squared class probabilities from one:

$$\text{Gini} = 1 - \sum_{i} p_i^2$$

Where p_i is the probability of a subject being classified into a particular class. The Gini impurity index takes values between 0 and 1. Higher impurity refers to higher heterogeneity, and lower impurity refers to higher homogeneity. The first split is based on the attribute with the lowest Gini index, the second split on the attribute with the second-lowest Gini index, and so on. Splitting stops according to predefined stopping rules that specify the minimum number of cases in each node.
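The Gini calculation described above can be sketched in a few lines. This is an illustrative implementation, not the SPSS routine used in the study; the function names and the toy counts are hypothetical. The second function shows the size-weighted impurity of the child nodes, which is the quantity usually compared across candidate splits.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: one minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("a node must contain at least one case")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini_after_split(children):
    """Size-weighted average Gini impurity of the child nodes of a split."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini_impurity(child) for child in children)


# Toy node with 5 deaths and 5 survivors, split into two child nodes.
parent = [5, 5]
children = [[4, 1], [1, 4]]
print(gini_impurity(parent), round(weighted_gini_after_split(children), 2))
```

A pure node such as `[10, 0]` has an impurity of 0, and a perfectly mixed two-class node has an impurity of 0.5, so a split is useful when the weighted child impurity is lower than the parent impurity.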

In this study, three different decision tree models were developed with different predefined node sizes. Decision tree models 1, 2, and 3 were developed with the minimum number of cases in the parent and child nodes specified as 30 and 15; 20 and 10; and 10 and 5, respectively. These three models were compared to assess whether the change in the node size had any impact on the predictive accuracy of the models.
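The effect of these stopping rules can be illustrated with a small sketch. The minimum node sizes are those stated above; the function name, the rule that a split is rejected when any child falls below the minimum, and the example node are assumptions made for illustration.

```python
# Minimum (parent, child) node sizes for the three models, as stated in the text.
NODE_SIZE_RULES = {1: (30, 15), 2: (20, 10), 3: (10, 5)}


def split_allowed(parent_size, child_sizes, min_parent, min_child):
    """Return True if a candidate split satisfies both minimum-size rules."""
    if parent_size < min_parent:
        return False
    return all(size >= min_child for size in child_sizes)


# Hypothetical node of 24 patients that a candidate split divides into 16 and 8.
results = {
    model: split_allowed(24, (16, 8), min_parent, min_child)
    for model, (min_parent, min_child) in NODE_SIZE_RULES.items()
}
print(results)
```

Under these assumptions, only the most permissive rule (model 3) would allow this split, which is why smaller minimum node sizes tend to produce deeper trees.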

Comparison of logistic regression and decision tree

The decision tree and logistic regression models were compared to assess their robustness in terms of accuracy in predicting the mortality from disease among patients diagnosed with oral cavity cancer. The models were compared in terms of their sensitivity, specificity, predictive accuracy, and area under the curve (AUC) by ROC curves. Accuracy was defined as the proportion of patients who were correctly classified as dead or alive by each model. Sensitivity was the proportion of patients who died due to disease whose death was correctly predicted by the model. Specificity was the proportion of patients who did not die due to disease whose alive status was correctly predicted by the model. The AUC summarized the overall discriminative ability of the model.
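These four measures can be computed as in the following sketch. The counts and predicted probabilities are hypothetical and are not the study's results. The AUC is computed here from its rank interpretation, that is, the probability that a randomly chosen patient who died receives a higher predicted score than a randomly chosen patient who survived, with ties counted as one half.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # deaths correctly predicted
        "specificity": tn / (tn + fp),  # survivors correctly predicted
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def auc_from_scores(scores_dead, scores_alive):
    """AUC as the share of (dead, alive) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for dead_score in scores_dead:
        for alive_score in scores_alive:
            if dead_score > alive_score:
                wins += 1.0
            elif dead_score == alive_score:
                wins += 0.5
    return wins / (len(scores_dead) * len(scores_alive))


# Hypothetical confusion-matrix counts and predicted probabilities.
metrics = classification_metrics(tp=40, fn=10, tn=80, fp=20)
auc = auc_from_scores([0.9, 0.6, 0.4], [0.5, 0.3, 0.2, 0.1])
print(metrics, round(auc, 3))
```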


We included 275 patients who had undergone surgery for oral cavity cancer between 2011 and 2017, and for whom, histopathology details were available. As we used a secondary dataset, details regarding patient enrollment and follow-up were not available, and thus, a patient flow diagram could not be generated. The data considered in our study were for the purpose of demonstration.

Among the enrolled patients, 180 (65.5%) were male. The mean age of the patients was 55 years (range, 26-87). Table 1 provides the results of the univariate analysis of the sociodemographic variables such as age, sex, place of residence, smoking status, alcohol use, betel quid, and tobacco status. All details were available; that is, there were no missing data. None of these variables were significantly associated with the occurrence of death due to disease. Various clinical characteristics such as clinical T category (cT), clinical node status (cN), clinical stage, pathological T category (pT), pathological node status (pN), and margin status were significantly associated with the occurrence of death due to disease. No quantitative variables such as white blood cell, platelet, neutrophil, or eosinophil counts were found to be significantly associated with the mortality status.

Table 1:
Bivariate association of demographic and clinical characteristics with mortality

Following the univariate analysis, the multiple binary logistic regression model was built based on the variables that had been found to be significantly associated with mortality in the univariate analysis. The results of the multiple binary logistic regression model are provided in Table 2. Among the variables used for developing the multiple logistic regression model, only pT and margin status were noted to be significant (P < 0.001). The multiple binary logistic regression model correctly predicted the mortality status of 71% of the patients.

Table 2:
Effect estimates of prognostic factors of mortality estimated by multiple binary logistic regression for patients with squamous cell carcinoma of the oral cavity

The decision tree model was developed based on the variables that were associated with 2-year mortality. The nodes in the decision tree were divided based on the Gini impurity index. The details on the impurity measures are shown in Table 3. The variable pT category had the lowest impurity among all the variables. Therefore, the first split of the decision tree was made on the variable “pT category.” The second split was based on the variable “margin status,” and the third split was based on the variable “clinical stage.”

Table 3:
Gini impurity index for potential prognostic risk factor selection

Decision tree models 1 and 2 were built with the minimum number of cases in the parent and child nodes set to 30 and 15, and 20 and 10, respectively, as shown in Supplementary Appendix 2 [Figure S1] and 3 [Figure S2]. The tree started from the root node that contained the death status of the patients. Of 275 patients, 191 (69.5%) survived or died due to reasons other than disease, and 84 (30.5%) died due to disease. Subsequently, the patients were further classified based on the pT category of the disease. The first stage of pT had an impurity of 0.014, and the second, third, and fourth stages had impurities of 0.130, 0.131, and 0.121, respectively. In the second step, splitting was carried out using the variable of margin status. In the third step of splitting, the clinical stage was considered. The details on the gains for nodes for decision tree models 1 and 2 are shown in Supplementary Appendix 5 [Table S1] and 6 [Table S2], respectively. In both decision tree models 1 and 2, the gain was high in node 5 at 75%. The predictive accuracy was 72% for both models.
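As a worked illustration of the Gini formula from the Methods, the impurity of the root node can be computed from the counts reported above (191 patients who did not die of disease and 84 who did, out of 275). This is simple arithmetic on the reported counts; it is not a value reported by the authors and is not directly comparable with the per-variable figures in Table 3, which may be computed differently by the software.

```latex
\text{Gini}_{\text{root}}
  = 1 - \left(\frac{191}{275}\right)^{2} - \left(\frac{84}{275}\right)^{2}
  \approx 1 - 0.4824 - 0.0933
  \approx 0.424
```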

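Supplementary Appendix 2, Figure S1:
Decision tree model 1
Supplementary Appendix 3, Figure S2:
Decision tree model 2
Supplementary Appendix 5, Table S1:
Gains for nodes for decision tree model 1
Supplementary Appendix 6, Table S2:
Gains for nodes for decision tree model 2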

Decision tree model 3 was built with the minimum number of cases in the parent and child nodes set to 10 and 5, respectively, as shown in Supplementary Appendix 4 [Figure S3]. As with decision tree models 1 and 2, the tree started from the root node that contained the death status of all 275 patients. The details on the gain for each node in decision tree model 3 are shown in Supplementary Appendix 7 [Table S3]. The gain was high in node 14 at 59.5%, and the predictive accuracy of the model was 72.7%.

Supplementary Appendix 4, Figure S3:
Decision tree model 3
Supplementary Appendix 7, Table S3:
Gains for nodes for decision tree model 3

The predictive accuracies of the three decision tree models with different node sizes were almost the same (magnitudes were compared; no statistical tests were employed for comparison). Therefore, the change in node size did not appear to affect the predictive accuracy of the model. A comparison of the predictive accuracy, sensitivity, specificity, and AUC of the decision tree and logistic regression models is shown in Table 4. The predictive accuracies of all the models were noted to be similar in magnitude. Based on this comparison, it was found that the accuracies of logistic regression and decision tree models were similar for predicting disease-related mortality among patients with oral cavity cancers. The ROC curves of the various classification techniques are shown in Figure 1.

Table 4:
Comparison of predictive accuracies of different classification techniques
Figure 1:
Receiver operating characteristic curve for comparing decision tree and logistic regression. DT—decision tree, LR—logistic regression model


We conducted this study to compare the accuracy of decision tree with that of logistic regression for predicting disease-related mortality among patients with oral cavity cancer. We observed that the accuracy of the decision tree model was slightly better than that of the logistic regression model. Previously, studies have been conducted to predict or classify the outcomes among patients with breast cancer, HNC, diabetes, cardiovascular illness, and coronavirus disease (COVID-19).[17–20] Some studies have shown that a decision tree performed better than other models, while some other studies showed that the logistic regression model performed better in predicting outcomes.[20–22] We found that the predictive accuracy was similar in magnitude across the decision trees with different node sizes, although no statistical test was used for this comparison. This suggests that, within the range of node sizes examined, predictive accuracy was not sensitive to node size.

In the present study, we performed univariate analysis to identify the factors associated with disease-related mortality in oral cavity cancer. We found that tumor- and therapy-related clinical variables such as clinical tumor stage, overall tumor, node, and metastasis (TNM) stage, node status, and margin status were associated with mortality due to disease. In multiple logistic regression, the margin status (P = 0.001) and pT (T2: P = 0.010; T3: P = 0.024; T4: P = 0.027) were found to be statistically significant predictors. It is well known that the pathological stage of cancer is the most important predictor of mortality in patients with squamous cell cancer of the oral cavity.[23,24] Studies have also suggested that patients with a positive surgical margin status have an 11% to 15% lower overall survival rate, and even radiation therapy may be ineffective in these patients.[25] Other significant prognostic factors include the patient’s age, treatment modality, comorbidities, vascular invasion, and perineural invasion.[26–28]

A limitation of our study is that, owing to the lack of available data, we could not consider all the variables with the potential to influence mortality among patients with oral cavity cancer. This is because we extracted the data from a secondary dataset, which contained only the variables recorded for patients who had undergone treatment for oral cavity cancer. Therefore, further prospective studies with larger sample sizes that include more variables with the potential to influence the outcomes are required.


Both decision tree and logistic regression models show similar accuracy in predicting a binary outcome in the prognosis of oral cavity cancer. Hence, decision tree models can be used as an alternative to logistic regression models in binary classifications.

Author’s contribution

Study concept and design: SK, HCK, PP; data collection: SK, HCK, PP; data analysis and interpretation: SK, HCK, SG, PP; manuscript writing: SK, HCK, SG, PP; approval of final article, and accountability for all aspects of the work: all authors.

Data sharing statement

Individual de-identified data that were used in the study will be considered to be shared on a case-to-case basis, on request. To avail the data, researchers can write to the corresponding author. Based on the merit of the request and with the permission and agreement of all the stakeholders of the data, the de-identified data can be shared.

Financial support and sponsorship

No funding was availed for the study.

Conflicts of interest

There are no conflicts of interest.


1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2021;71:209–49
2. Mathur P, Sathishkumar K, Chaturvedi M, Das P, Sudarshan KL, Santhappan S, et al. Cancer statistics, 2020: Report from National Cancer Registry Programme, India. JCO Glob Oncol 2020;6:1063–75
3. Shetty R, Mathew RT, Vijayakumar M. Incidence and pattern of distribution cancer in India: A secondary data analysis from population-based cancer registries. Cancer Res Stat Treat 2020;3:678–82
4. Singla A, Goel AK, Oberoi S, Jain S, Singh D, Kapoor R. Impact of demographic factors on delayed presentation of oral cancers: A questionnaire-based cross-sectional study from a rural cancer center. Cancer Res Stat Treat 2022;5:45–51
5. Laprise C, Shahul HP, Madathil SA, Thekkepurakkal AS, Castonguay G, Varghese I, et al. Periodontal diseases and risk of oral cancer in Southern India: Results from the HeNCe Life study. Int J Cancer 2016;139:1512–9
6. Dhumal S, Patil V, Noronha V, Prabhash K. Immune checkpoint inhibitors: Promising but still inaccessible treatment option for patients with head-and-neck cancer in India. Cancer Res Stat Treat 2022;5:587–8
7. Roy S, Mandal TK, Das S, Srinivas S, Agarwal A, Gupta A, et al. Demography and pattern of care of patients with head-and-neck carcinoma: Experience from a tertiary care center in North India. Cancer Res Stat Treat 2020;3:730–5
8. Madankumar PD, Iyer K, Soni S, Nagarajan L, Kumar K, Solomon S, et al. A simple screening program for oral cancer in a defined geographic area in southern India: A community-based cross-sectional study. Cancer Res Stat Treat 2022;5:226–31
9. Hawkes N. Cancer survival data emphasise importance of early diagnosis. BMJ 2019;364:l408
10. Han SS, Azad TD, Suarez PA, Ratliff JK. A machine learning approach for predictive models of adverse events following spine surgery. Spine J 2019;19:1772–81
11. Lezcano-Valverde JM, Salazar F, León L, Toledano E, Jover JA, Fernandez-Gutierrez B, et al. Development and validation of a multivariate predictive model for rheumatoid arthritis mortality using a machine learning approach. Sci Rep 2017;7:10189
12. Sato M, Morimoto K, Kajihara S, Tateishi R, Shiina S, Koike K, et al. Machine-learning approach for the development of a novel predictive model for the diagnosis of hepatocellular carcinoma. Sci Rep 2019;9:7704
13. Bhattacharjee A, Vishwakarma GK, Banerjee S. A Bayesian approach for dynamic treatment regimes in the presence of competing risk analysis. Cancer Res Stat Treat 2018;1:51–7
14. IBM Corp. IBM SPSS Statistics for Windows. Armonk, NY. 2010
15. Pal A. Logistic regression: A simple primer. Cancer Res Stat Treat 2021;4:551–4
16. Larose DT, Larose CD. Discovering knowledge in data: An introduction to data mining. 2nd ed. Hoboken, New Jersey: John Wiley and Sons, Inc; 2014
17. Grabauskytė I, Tamošiūnas A, Kavaliauskas M, Radišauskas R, Bernotienė G, Janilionis V. A comparison of decision tree induction with binary logistic regression for the prediction of the risk of cardiovascular diseases in adult men. Informatica 2018;29:675–92
18. Cheng Z, Nakatsugawa M, Hu C, Robertson SP, Hui X, Moore JA, et al. Evaluation of classification and regression tree (CART) model in weight loss prediction following head and neck cancer radiation therapy. Adv Radiat Oncol 2018;3:346–55
19. Ahmad A, Safi O, Malebary S, Alesawi S, Alkayal E. Decision tree ensembles to predict coronavirus disease 2019 infection: A comparative study. Complexity 2021;2021:e5550344
20. Omurlu IK, Ture M, Unubol M, Katranci M, Guney E. Comparing performances of logistic regression, classification and regression trees and artificial neural networks for predicting albuminuria in type 2 diabetes mellitus. Int J Sci 2014;16:173–87
21. Momenyan S, Baghestani AR, Momenyan N, Naseri P, Akbari ME. Survival prediction of patients with breast cancer: Comparisons of decision tree and logistic regression analysis. Int J Cancer Manag 2018;11:e9176
22. Ye Y, Xiong Y, Zhou Q, Wu J, Li X, Xiao X. Comparison of machine learning methods and conventional logistic regressions for predicting gestational diabetes using routine clinical data: A retrospective cohort study. J Diabetes Res 2020;2020:4168340
23. Zanoni DK, Montero PH, Migliacci JC, Shah JP, Wong RJ, Ganly I, et al. Survival outcomes after treatment of cancer of the oral cavity (1985–2015). Oral Oncol 2019;90:115–21
24. Oliveira LL, Bergmann A, Melo AC, Thuler LC. Prognostic factors associated with overall survival in patients with oral cavity squamous cell carcinoma. Med Oral Patol Oral Cir Bucal 2020;25:e523–31
25. Patel V, Galloway TJ, Liu JC. The impact of positive margin on survival in oral cavity squamous cell carcinoma. Oral Oncol 2021;122:105499
26. Katna R, Kalyani N, Agarwal S, Singh S, Deshpande A, Bhosale B. Impact of comorbidities on perioperative outcomes for carcinoma of oral cavity. Ann R Coll Surg Engl 2020;102:232–5
27. Nair D, Mair M, Singhvi H, Mishra A, Nair S, Agrawal J, et al. Perineural invasion: Independent prognostic factor in oral cancer that warrants adjuvant treatment. Head Neck 2018;40:1780–7
28. Spoerl S, Gerken M, Fischer R, Mamilos A, Spoerl S, Wolf S, et al. Lymphatic and vascular invasion in oral squamous cell carcinoma: Implications for recurrence and survival in a population-based cohort study. Oral Oncol 2020;111:105009

Decision tree; logistic regression; mortality; predictive accuracy

Copyright: © 2023 Cancer Research, Statistics, and Treatment