INTRODUCTION
Logistic regression is a regression model with a categorical (typically binary) outcome variable and predictor variables that can be either continuous or categorical. In this short tutorial, the basic workflow of logistic regression using the R statistical software is shown. As introduced earlier,[1] R is a statistical programming language widely used in the field of data science and statistics. R can be downloaded for Windows, MacOS and Linux platforms from the webpage https://www.r-project.org/. RStudio is an integrated development environment for R; a free version can be downloaded from https://rstudio.com/. R can be used without RStudio, but RStudio makes many tasks easier, such as reading data into R.
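The packages used later in this tutorial, readxl and tidyverse, are not part of base R. If they have not been installed yet, they can be installed once with the standard install.packages() command; this step is not part of the original workflow and is shown here only as a reminder.
install.packages(c('readxl', 'tidyverse'))   # needed only once per R installation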
METHODS
In this example, a dataset by Unda et al.[2] is used to show a basic workflow of logistic regression. The purpose of this short tutorial is not to explain the mathematical background behind the regression models.
First, the data should be stored in the desired location (the desktop in this case) and read into R as described in the earlier Statistical corner.[1]
library(readxl)
Dataset <- read_excel('~/Desktop/Dataset.xls')
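To confirm that the file was read in correctly, the first rows of the data can be printed; this quick check is an addition to the workflow described here.
head(Dataset)   # print the first six rows of the imported data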
Then, all the variable names are cleaned. Everything other than letters and numbers is replaced with the _ symbol to avoid possible problems later on. '[^[:alnum:]]' is a regular expression meaning 'anything other than alphanumeric characters'. If variable names contain mathematical symbols or other special characters, there may be problems during the analyses.
names(Dataset) <- gsub('[^[:alnum:]]', '_', names(Dataset))
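The result of the renaming can be verified by listing the variable names; this check is likewise an optional addition.
names(Dataset)   # list the cleaned variable names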
Then, a simple boxplot that shows the rough relationship between age and the patients' condition at discharge is created [Figure 1].
Figure 1: Example analysis. Boxplot of the age distribution of the different condition groups of the patients
boxplot(Age ~ mRS_at_dIscharge, data = Dataset)
To create a dichotomous variable, the transform() function is used. This piece of code tells R to create a new variable, categorical_mRs, and give it the value '0' if mRS_at_dIscharge is either 1, 2 or 3. All other values of the new variable are set to '1'. So, the value of categorical_mRs is '0' in patients with a good recovery and '1' when the recovery is poorer. The basic form of the ifelse() function is explained at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse.
Dataset <- transform(Dataset, categorical_mRs = ifelse(mRS_at_dIscharge == 1 | mRS_at_dIscharge == 2 | mRS_at_dIscharge == 3, 0, 1))
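A simple way to verify the recoding is to cross-tabulate the new variable against the original one; this check is not part of the original workflow.
table(Dataset$categorical_mRs, Dataset$mRS_at_dIscharge)   # rows: new dichotomous variable, columns: original mRS values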
Then, only the variables of interest are selected and all the rows with missing values are dropped using tidyverse functions:
library(tidyverse)
Dataset <- Dataset %>%
  select(Age, BMI, mRS_at_dIscharge, categorical_mRs) %>%
  drop_na()
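If desired, the structure of the reduced dataset can be inspected with the tidyverse function glimpse(); this optional step simply shows the variable types and the remaining rows.
glimpse(Dataset)   # variable types and a preview of the remaining observations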
It is not necessary to get rid of the extra variables, but with bigger datasets the view might be confusing if the whole dataset is printed on the screen; it is only the author's personal preference to handle as small a dataset as possible. Then, the logistic regression model is built using the generalized linear model command glm() from base R. Both age and body mass index (BMI) are chosen as predictor variables to see if they explain the condition of the patient at discharge:
model <- glm(categorical_mRs ~ Age + BMI, family = 'binomial', data = Dataset)
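Because glm() with family = 'binomial' reports coefficients on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. This step is an addition to the original workflow; the confidence intervals below are Wald-type approximations.
exp(coef(model))              # odds ratios for Age and BMI
exp(confint.default(model))   # approximate (Wald) 95% confidence intervals for the odds ratios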
RESULTS
When summary(model) is typed, R shows the overall information about the model. In this case, BMI does not seem to explain the condition of the patient (P > 0.05), but age seems to be a significant predictor (P < 0.001). The deviance of a model describes its overall fit: the larger the deviance, the poorer the fit of the model. The null deviance is the deviance of a 'null model', that is, a model that contains only a constant (intercept) term. The residual deviance is the deviance of the model with the given predictors. In this case, the null deviance is 205.27 and the residual deviance 192.49, which means that the fitted model fits better than the 'null model', which is of course a good thing. To analyse further, one could build other models on the same dataset and test which of them explains the chosen outcome better, using anova() to compare the models.
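As a minimal sketch of that comparison, assuming the objects defined above, the null model can be fitted explicitly and compared with the full model using a likelihood-ratio (chi-squared) test; the exact output depends on the data.
summary(model)                               # coefficients, P-values, null and residual deviance
null_model <- glm(categorical_mRs ~ 1, family = 'binomial', data = Dataset)
anova(null_model, model, test = 'Chisq')     # likelihood-ratio test of Age and BMI together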
CONCLUSION
To conclude, performing basic statistical modelling in R is a fairly simple and straightforward procedure. Analysing the model fit is essential for reporting the results.
REFERENCES
1. Pyysalo M, Vesterinen T. Statistical corner: Using R to build, analyse and plot clinical neurological datasets. J Cerebrovasc Sci. 2020;8:107–12.
2. Unda SR, Labagnara K, Birnbaum J, Wong M, de Silva N, Terala H, et al. Impact of hospital-acquired complications in long-term clinical outcomes after subarachnoid hemorrhage. Clin Neurol Neurosurg. 2020;194:105945.