A narrative review of survival analysis in oncology using R : Cancer Research, Statistics, and Treatment


Statistical Resource

A narrative review of survival analysis in oncology using R

Pawar, Akash; Chowdhury, Oindrila R.1; Salvi, Omkar2

Cancer Research, Statistics, and Treatment: Jul–Sep 2022 - Volume 5 - Issue 3 - p 554-561
doi: 10.4103/crst.crst_230_22
  • Open



Survival analysis involves the analysis of data obtained for a time-to-event outcome. Clinical trials in oncology largely rely on survival analysis, that is, on the time to an event of interest as the main outcome under assessment. The time here is reported as survival time, disease-free survival time, or event-free survival time. Each of these times is calculated on the basis of different assumptions. Survival time is usually defined as the duration from the diagnosis of disease (cancer) to the time of death (here, death is considered the event of interest). For censored patients, the time until the last follow-up is taken as the survival time. Progression-free survival (PFS) is defined as the duration from the initiation of treatment to the time of disease progression or worsening. Disease-free survival (DFS) is defined as the duration, counted from the end of the primary treatment of a tumor, for which the patient survives without any symptoms or signs of the cancer.

Survival data are usually analyzed using non-parametric methods because survival times are not normally distributed. The data are typically skewed, as most of the events may happen early and only a few occur toward the end of follow-up. Survival analysis treats time as a separate dimension along which the events are distributed. Hence, dedicated methods are needed that account for both the elapsed time and the occurrence of the event.

Censoring is broadly categorized as right censoring and left censoring. Left censoring occurs when the event of interest has already occurred before the patient is enrolled in the study; this is very rare. Right censoring occurs when a subject or patient leaves the study before an event occurs, when the study ends before the event has occurred, or when the subject is lost to follow-up.
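As a minimal sketch of how right censoring is encoded in R, the Surv function from the survival package pairs each follow-up time with an event indicator. The follow-up times and statuses below are made-up values used for illustration only.

```r
library(survival)

# Hypothetical follow-up times in days (illustrative values, not real data)
follow_up <- c(120, 340, 365, 500, 730)
# Event indicator: 1 = event observed, 0 = right-censored
event <- c(1, 0, 1, 0, 1)

# Surv() combines time and status into a single survival object;
# right-censored observations are printed with a trailing "+"
surv_obj <- Surv(follow_up, event)
print(surv_obj)

# Number of observed events and number of right-censored observations
n_events <- sum(event == 1)
n_censored <- sum(event == 0)
print(c(events = n_events, censored = n_censored))
```

A censored patient still contributes information: the patient is known to have been event-free up to the recorded time.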

In this article, we aim to present the basic concepts of survival analysis, the formation of a survival model, Cox proportional hazards model, and introduce the formulae for estimating the confidence intervals (CIs) for the survival estimate. We will discuss the ways to obtain the survival estimates, and to generate the survival curves using statistical software. We will also attempt to provide guidance regarding the interpretation of the survival results and the use of statistical tests to compare the survivals between subgroups of patients based on various factors under study.


We used R software for performing the survival analysis and fitting the model on survival data in oncology. Multiple R packages that support survival analysis are freely available on the Comprehensive R Archive Network (CRAN). We searched the internet using Google for packages available on CRAN that support survival modelling, survival analysis, obtaining Kaplan–Meier curves, and fitting Cox proportional hazards models, as depicted in Figure 1. We identified the following packages: survival, survminer, rms, prodlim, km.ci, eha, NADA, survey, spatstat, rhosp, survPresmooth, coxed, coxphw, and simPH. Among the listed packages, we used the survival package developed by Terry M. Therneau to demonstrate survival analysis. It was the very first package available for survival analysis, dating back to the S software that preceded R. We used version 3.4-0 of the survival package, the latest at the time of writing. The survminer package version 0.4.9 was used to visualize the survival curves. These packages are commonly used by medical statisticians. The survival package provides user-friendly functions to fit the Kaplan–Meier method and the Cox proportional hazards model, analyze survival data, and obtain survival and hazard estimates for clear interpretation. The survminer package provides attractive and informative visualization of the survival curves using the results obtained from the functions of the survival package. We used the rotterdam data set on primary breast cancer patients available in the survival package. Other packages were excluded as they provide add-on methods and functions for obtaining survival estimates which are beyond the scope of this article.

Figure 1:
Flow diagram depicting the search and selection process of the packages for inclusion in the review article on survival analysis demonstration (CRAN = Comprehensive R Archive Network)


Survival analysis can be defined as a set of statistical methods or approaches for investigating the time taken for an event to occur. The event can be death, cardiac arrest, or any other outcome of interest.[1] There are two widely used methods of survival analysis: one is the estimation of the probability of survival introduced by Kaplan and Meier,[2,3] and the second is the Cox proportional hazards model,[4,5] also known as the regression model.[6]

The Kaplan–Meier approach is used to calculate the probability of survival at time $t_i$ as follows:

$$S(t_i) = S(t_{i-1}) \left(1 - \frac{d_i}{n_i}\right)$$

where

$S(t_i)$ = the probability of surviving until time $t_i$

$S(t_{i-1})$ = the probability of surviving until time $t_{i-1}$

$n_i$ = the number of patients alive and not yet censored (at risk) just before $t_i$

$d_i$ = the number of events at $t_i$

The starting conditions are $d_0 = 0$ (the number of events at the start of the study is 0) and $S(0) = 1$ (the survival probability at the start of the study is 1).
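The product-limit calculation, in which each event time multiplies the previous survival probability by (1 - d_i/n_i), can be checked by hand on a small example. The sketch below uses a hypothetical toy data set of six patients (not taken from the rotterdam data) and compares the manual result with the estimate returned by survfit.

```r
library(survival)

# Hypothetical toy data (illustrative values, not from the rotterdam data set)
time   <- c(2, 3, 3, 5, 7, 8)
status <- c(1, 1, 0, 1, 0, 1)   # 1 = event, 0 = censored

# Distinct observed times, in increasing order
t_i <- sort(unique(time))
# n_i: number of patients still at risk just before each t_i
n_i <- sapply(t_i, function(tt) sum(time >= tt))
# d_i: number of events at each t_i
d_i <- sapply(t_i, function(tt) sum(time == tt & status == 1))

# S(t_i) = S(t_{i-1}) * (1 - d_i / n_i), starting from S(0) = 1
km_by_hand <- cumprod(1 - d_i / n_i)

# The same estimate from survfit() for comparison
km_fit <- survfit(Surv(time, status) ~ 1)
print(data.frame(time = t_i, at_risk = n_i, events = d_i,
                 by_hand = km_by_hand, survfit = km_fit$surv))
```

At a time where only a censoring occurs (d_i = 0), the factor equals 1 and the estimate does not change, which is why the curve steps down only at event times.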


The estimated probability changes its value only at the times when events occur. A confidence interval[7] can also be estimated along with the survival probability. The Cox proportional hazards model is used to evaluate the association of a risk factor with survival time. It is formulated in terms of the hazard rate, as follows:

$$h(t \mid X) = h_0(t) \exp(\beta X)$$

where

$X$ = vector of risk predictors,

$h(t \mid X)$ = the hazard function with covariate $X$,

$h_0(t)$ = the baseline hazard function with $X = 0$

After dividing both sides of the equation by the baseline hazard function and taking the logarithm, we can rewrite the Cox model as a linear regression model:

$$\log\left(\frac{h(t \mid X)}{h_0(t)}\right) = \beta X$$

Now, between two individuals, the regression coefficient $\beta$ is interpreted as the logarithm of the relative hazard rate for a one-unit change in $X$.

When there are multiple coefficients and covariates, the hazard equals

$$h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$

The values of $\exp(\beta_i)$ are called hazard ratios. A value of $\beta_i$ greater than 0, or equivalently a hazard ratio greater than 1, indicates that as the value of the ith covariate increases, the hazard of the event increases and thus the length of survival decreases.
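To make the link between a coefficient and a hazard ratio explicit, consider two individuals who differ only in a single binary covariate X (for example, X = 1 if treated and X = 0 otherwise; this coding is an assumption made for illustration). Under the Cox model, in which the hazard is the baseline hazard multiplied by exp(βX), the ratio of their hazards works out as follows:

```latex
\frac{h(t \mid X = 1)}{h(t \mid X = 0)}
  = \frac{h_0(t)\,\exp(\beta \cdot 1)}{h_0(t)\,\exp(\beta \cdot 0)}
  = \exp(\beta)
```

The baseline hazard cancels, so the ratio does not depend on t; this is the proportional hazards property. For instance, a coefficient of β = 0.25 would give exp(0.25) ≈ 1.28, that is, a hazard roughly 28% higher in the group with X = 1.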


Quantitative research can offer valuable insights into the probability of patient survival using survival analysis. We will take the example of a prospective, randomized controlled cancer clinical trial.

We will be using a built-in data set named “rotterdam” from the survival package in R. The rotterdam data set comprises the data of 2982 patients with primary breast cancer available in the Rotterdam tumor bank. The rotterdam data set was used in the paper published by Royston and Altman in 2013.[8]

In that paper, the survival model was fitted using the rotterdam data and then validated using the gbsg data set; both data sets are available in the survival package in R.

The meaning of the names of the variables in the rotterdam data and their description are as follows:

“pid” = Variable indicating patient’s identification number

“year” = Variable for the year of surgery

“age” = Patient’s age at surgery

“meno” = Patient’s menopausal status (0 = Premenopausal, 1 = Postmenopausal)

“size” = Variable indicating the range of tumor sizes (≤20, 20–50, >50)

“grade” = Grade of tumor differentiation

“nodes” = Variable indicating the number of lymph nodes noted to be positive

“pgr” = Biomarker of progesterone receptor (fmol/l)

“er” = Biomarker of estrogen receptor (fmol/l)

“hormon” = Variable indicating whether hormonal treatment was received or not (0 = No, 1 = Yes)

“chemo” = Variable indicating whether chemotherapy was received or not

“rtime” = Time to tumor relapse or to the last follow-up, in days

“recur” = Variable indicating whether the patient has had a recurrence or not (1 = Relapse, 0 = No relapse)

“dtime” = Time to death or to the last follow-up, in days

“death” = Variable indicating whether the patient is alive or not (0 = Alive, 1 = Died)
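Before fitting any model, it can help to inspect the variables listed above. The following is a minimal sketch that assumes only that the survival package is installed, so that the rotterdam data set is available once the package is loaded.

```r
library(survival)

# rotterdam is shipped with the survival package
str(rotterdam)

# Number of patients, recurrences, and deaths
n_patients <- nrow(rotterdam)
n_recur <- sum(rotterdam$recur == 1)
n_death <- sum(rotterdam$death == 1)
print(c(patients = n_patients, recurrences = n_recur, deaths = n_death))

# Distribution of the hormonal treatment indicator (0 = No, 1 = Yes)
print(table(rotterdam$hormon))
```

Checking the event counts and the coding of the status variables at this stage guards against passing the wrong time and status pair to the Surv function later.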

We will be using the survfit and coxph functions from the survival package for performing the survival analysis. We have two objectives: first, to estimate the recurrence-free survival by setting the Surv object using the “rtime” and “recur” variables, and second, to estimate the overall survival by setting the Surv object using the “dtime” and “death” variables.


#To use the survfit and coxph functions, we first need to install the survival package;[9] the survminer package[10] is used to obtain attractive survival curves that we will see later.

install.packages(c("survival", "survminer"))

#loading the packages using the library function

library(survival)

library(survminer)

#We will fit two survival models: model1 for recurrence-free survival and model2 for overall survival

model1 <- survfit(Surv(rtime, recur == 1) ~ 1, data = rotterdam)

#printing the model shows the number of events and the median with its 95% CI

print(model1)

#survival probability at five years (1825 days)

summary(model1, times = 1825)

Here, out of the total of 2982 patients, recurrence occurred in 1518 patients. The median recurrence-free survival is 2983 days, which when expressed in years with 95% CI (LCL = Lower Confidence Limit, UCL = Upper Confidence Limit) is 8.17 (7.45–8.75), and the recurrence-free survival probability at five years is 59.3% (57.6–61.1), as shown in Table 1.


model2 <- survfit(Surv(dtime, death == 1) ~ 1, data = rotterdam)

print(model2)

summary(model2, times = 1825)

Here, out of the total of 2982 patients, 1272 had died. The median survival is 4033 days, which when expressed in years with 95% CI is 11.04 (10.65–11.81), and the survival probability at five years is 74.4% (72.8–76.0); the detailed output is shown in Table 1.

Thus, we have calculated the median recurrence-free survival time and the median overall survival time.
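The median and the five-year estimate can also be extracted programmatically rather than read off the printed output. The sketch below does this for overall survival; the object names are illustrative, and the conversion assumes an average year of 365.25 days.

```r
library(survival)

# Overall survival, fitted as in model2 of the text
fit_os <- survfit(Surv(dtime, death == 1) ~ 1, data = rotterdam)

# Median survival time with its confidence limits, in days
med <- quantile(fit_os, probs = 0.5)
median_days <- c(median = unname(med$quantile),
                 LCL = unname(med$lower),
                 UCL = unname(med$upper))

# Convert days to years (assumes an average year of 365.25 days)
median_years <- median_days / 365.25
print(round(median_years, 2))

# Survival probability at five years (1825 days) with 95% CI, in percent
s5 <- summary(fit_os, times = 1825)
five_year <- 100 * c(estimate = s5$surv, LCL = s5$lower, UCL = s5$upper)
print(round(five_year, 1))
```

Replacing dtime and death with rtime and recur gives the corresponding estimates for recurrence-free survival.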

We can now perform the log-rank test to compare the survivals between the two groups.

Here, we will test the hypothesis as follows:

H0 = There is no survival difference between the patients receiving hormonal therapy and those not receiving hormonal therapy.

H1 = There is a survival difference between the patients receiving hormonal therapy and those not receiving hormonal therapy.


model1_diff <- survdiff(Surv(rtime, recur) ~ hormon, data = rotterdam)

print(model1_diff)

model2_diff <- survdiff(Surv(dtime, death) ~ hormon, data = rotterdam)

print(model2_diff)


Thus, we get a Chi-squared test statistic and the corresponding P value[11] for drawing a conclusion regarding our hypothesis, as shown in Table 1.
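As a small sketch of where the test statistic and the P value come from, the chi-squared statistic stored in the survdiff result can be referred to a chi-squared distribution with degrees of freedom equal to the number of groups minus one. The object names below are illustrative.

```r
library(survival)

# Log-rank test for recurrence-free survival by hormonal therapy
lr <- survdiff(Surv(rtime, recur) ~ hormon, data = rotterdam)

# Chi-squared statistic and its degrees of freedom (number of groups - 1)
chisq <- lr$chisq
df <- length(lr$n) - 1

# P value from the upper tail of the chi-squared distribution
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
print(c(chisq = chisq, df = df, p = p_value))
```

A P value below the chosen significance level (commonly 0.05) leads to rejection of H0, that is, the survival curves of the two groups are judged to differ.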

Table 1:
The outputs from survival models (model1 indicates estimates for recurrence-free survival, and model2 indicates estimates for overall survival)

The log-rank test is used to assess for any differences in survivals between various subgroups; for example, gender.


#fitting Cox proportional hazards models to obtain hazard ratios

cox_model1_diff <- coxph(Surv(rtime, recur) ~ hormon, data = rotterdam)

summary(cox_model1_diff)

cox_model2_diff <- coxph(Surv(dtime, death) ~ hormon, data = rotterdam)

summary(cox_model2_diff)


Given survival up to a certain point in time, the hazard function gives the instantaneous potential of having an event at that time. It is primarily used to quantify the risk of having an event in survival analysis.

The hazard ratio of 1.276 is interpreted as 27.6% higher risk of recurrence over time in patients who received hormonal therapy compared to those who did not receive hormonal therapy.

The hazard ratio of 1.511 is interpreted as 51.1% higher risk of death over time in the patients who received hormonal therapy compared to those who did not receive hormonal therapy.
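The step from a fitted coefficient to a statement such as "27.6% higher risk" can be sketched as follows: the hazard ratio is the exponential of the coefficient, and the percentage change is 100 × (hazard ratio - 1). The object names are illustrative.

```r
library(survival)

# Cox model for recurrence-free survival by hormonal therapy
fit <- coxph(Surv(rtime, recur) ~ hormon, data = rotterdam)

# Hazard ratio = exp(coefficient); its confidence limits are the
# exponentiated limits of the coefficient
hr <- exp(coef(fit))
hr_ci <- exp(confint(fit))

# Percentage change in hazard relative to the reference group (hormon = 0)
pct_change <- 100 * (hr - 1)

print(round(c(HR = unname(hr), LCL = hr_ci[1, 1], UCL = hr_ci[1, 2],
              percent_change = unname(pct_change)), 3))
```

A confidence interval for the hazard ratio that excludes 1 corresponds to a statistically significant association at the matching level.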

Survival curves for recurrence-free survival and overall survival can also be obtained for the subgroups based on whether hormonal therapy was received. The plot is formatted in such a way that the events are represented by every downward step of the curve. The median survival is marked with dotted lines along the x-axis and y-axis, and the P value from the log-rank test comparing the survival curves is displayed on the plot.

The code used to generate the graphs is provided in Table 2.

Table 2:
Codes for plotting survival curves using ggsurvplot function from the survminer package in R
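As a hedged illustration of the kind of call involved (a sketch, not necessarily the authors' exact code), survival curves by hormonal therapy can be drawn with the ggsurvplot function when survminer is installed. The axis and legend labels are assumptions chosen for this example, and a base-graphics fallback is included in case survminer is unavailable.

```r
library(survival)

# Overall survival curves by hormonal therapy
fit_by_hormon <- survfit(Surv(dtime, death) ~ hormon, data = rotterdam)

if (requireNamespace("survminer", quietly = TRUE)) {
  # ggsurvplot() returns a plot object; printing it draws the figure
  km_plot <- survminer::ggsurvplot(
    fit_by_hormon,
    data = rotterdam,
    risk.table = TRUE,          # numbers at risk below the curves
    pval = TRUE,                # log-rank P value on the plot
    surv.median.line = "hv",    # dotted lines marking the median survival
    xlab = "Time (days)",
    ylab = "Overall survival probability",
    legend.labs = c("No hormonal therapy", "Hormonal therapy")
  )
  print(km_plot)
} else {
  # Fallback with base graphics if survminer is not installed
  plot(fit_by_hormon, col = c("black", "red"),
       xlab = "Time (days)", ylab = "Overall survival probability")
  legend("bottomleft",
         legend = c("No hormonal therapy", "Hormonal therapy"),
         col = c("black", "red"), lty = 1)
}
```

Using rtime and recur in place of dtime and death gives the corresponding recurrence-free survival curves.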


As tumor heterogeneity is evolving, personalized medicine has taken center stage in the treatment of cancers. Hence, it is crucial that research trials in clinical oncology use pertinent analytical methodology for recognizing and validating various treatment methods.

In this article, we have presented the framework for the statistical approach to analyzing survival data. We have presented how survival models can be fitted and how the estimates can be obtained. Common issues in survival data include negative survival durations due to errors in data entry and unequal numbers of cases in the various subgroups. A limitation of the Kaplan–Meier estimate is that it cannot be used for multivariate analysis, as it only studies the effect of one factor at a time. Additionally, only categorical variables can be evaluated with the Kaplan–Meier method.

On the other hand, the Cox proportional hazards model plays an important role as it provides a way to understand the effect of continuous variables as well as categorical variables. This method is further used to fit a multivariate model to study the overall effects of more than one variable on survival. The fundamental assumption in the Cox model is that the hazards are proportional (PH), such that the relative hazard between different predictor or covariate levels remains constant over time. The assumption can be checked by visually inspecting the Kaplan–Meier curves, which should not cross each other, or by testing the Schoenfeld residuals. In case there is a violation of the assumption, one can use the stratified Cox method by stratifying the data on the assumption-violating variable or factor. Also, Schemper’s weighted model and restricted mean survival time methods can be used in cases where the proportional hazards assumption is violated. After fitting the data in the multivariate model, it needs to be validated using various statistical tests and estimates.
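The checks described above can be sketched with the cox.zph function of the survival package, which tests the proportional hazards assumption using scaled Schoenfeld residuals. The choice of covariates (hormon, age, and nodes) is an assumption made for illustration, not a recommended model.

```r
library(survival)

# Multivariable Cox model for recurrence-free survival
# (the choice of covariates is an assumption for illustration)
fit_mv <- coxph(Surv(rtime, recur) ~ hormon + age + nodes, data = rotterdam)

# Test of the proportional hazards assumption based on Schoenfeld residuals;
# a small P value for a covariate suggests non-proportional hazards
ph_test <- cox.zph(fit_mv)
print(ph_test)

# If a categorical covariate violates the assumption, one option is to
# stratify on it so that each stratum gets its own baseline hazard
fit_strat <- coxph(Surv(rtime, recur) ~ age + nodes + strata(hormon),
                   data = rotterdam)
print(exp(coef(fit_strat)))
```

Note that stratifying on a variable removes its hazard ratio from the output, so this option suits variables that need adjustment but are not themselves the effect of interest.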

We have discussed the frequently used approach. We have used an advanced method for generating and presenting the survival curves, which are shown in Figure 2a–d.

Figure 2:
(a) Overall survival curve with numbers at risk. (b) Recurrence-free survival curve with numbers at risk. (c) Overall survival curves for patients receiving hormone therapy and those not receiving hormone therapy. (d) Recurrence-free survival curves for patients receiving hormone therapy and those not receiving hormone therapy

Given the various challenges, it is necessary to ensure the benefits of using this type of data and methodology for analysis. There is a continuous need to develop flexible, robust, and easy methods of survival data analysis and to improve the methods for fitting multivariate models, Bayesian approaches,[12] and risk assessments.

Additional detailed, easy-to-understand information on the methodology for survival analysis is provided in the textbooks by Collett[13] and Klein and Moeschberger.[14] Readers are also directed to other published works for a deeper understanding of survival analysis.[15,16]


We discussed the process of using statistical methods and inferences to estimate survival and to establish the prognostic factors affecting survival. We have also introduced the mathematical survival model and the Cox proportional hazards model. There are multiple functions and packages available in R and in other statistical programs like the Statistical Package for the Social Sciences (SPSS). The most commonly used package, survival, includes various functions to easily analyze survival data and obtain the required results. However, these functions need to be thoroughly studied to correctly fit the survival data. The assumptions and limitations of each method should be considered and accounted for before obtaining the required estimates.

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.


1. Emmert-Streib F, Dehmer M. Introduction to survival analysis in practice. Mach Learn Knowl Extr 2019;1:1013-38.
2. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457-81.
3. Chakraborty S. A step-wise guide to performing survival analysis. Cancer Res Stat Treat 2018;1:41-5.
4. Dessai S, Simha V, Patil V. Stepwise Cox regression analysis in SPSS. Cancer Res Stat Treat 2018;1:167-70.
5. Dessai S, Patil V. Testing and interpreting assumptions of COX regression analysis. Cancer Res Stat Treat 2019;2:108-11.
6. Cox DR. Regression models and life-tables. J R Stat Soc Series B Stat Methodol 1972;34:187-202.
7. Darling HS. Are you confident about your confidence in confidence intervals? Cancer Res Stat Treat 2022;5:139-44.
8. Royston P, Altman DG. External validation of a Cox prognostic model: Principles and methods. BMC Med Res Methodol 2013;13:33.
9. Therneau T. Package for Survival Analysis in R. R package version 3.4-0. 2022. Available from: https://cran.r-project.org/web/packages/survival/index.html. Last accessed on 2022 Sep 11.
10. Kassambara A, Kosinski M, Biecek P. Survminer: Drawing Survival Curves using ‘ggplot2’. R package version 0.4.9. 2021. Available from: https://CRAN.R-project.org/package=survminer. Last accessed on 2022 Sep 11.
11. Darling HS. To “P” or not to “P”, that is the question: A narrative review on P value. Cancer Res Stat Treat 2021;4:756-62.
12. Bhattacharjee A, Vishwakarma GK, Banerjee S. A Bayesian approach for dynamic treatment regimes in the presence of competing risk analysis. Cancer Res Stat Treat 2018;1:51-7.
13. Collett D. Modelling Survival Data in Medical Research. New York: Chapman and Hall/CRC Press; 2015.
14. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. New York: Springer; 2003.
15. Altman DG. Practical Statistics for Medical Research. 1st ed. London: Chapman and Hall/CRC Press; 1990.
16. Piantadosi S. Clinical Trials: A Methodologic Perspective. New York: John Wiley and Sons; 2017.

Breast cancer; clinical data; Kaplan–Meier; survival data

Copyright: © 2022 Cancer Research, Statistics, and Treatment