# An overview of correlation and regression methods in medical research – An ophthalmologic perspective

## Abstract

Correlation and regression are two statistical techniques commonly used in modelling medical data. It is essential for clinicians to understand the methodology of these techniques in general and their implications in their clinical area of expertise. This article provides an overview of correlation and regression methods from an ophthalmologic perspective.

## INTRODUCTION

Statistics in health research primarily deals with methods for designing studies and making statistical inferences about the research question, based on data, in the presence of uncertainty. The selection of a particular method for statistical inference depends on various aspects of the study, such as the research question, study design, and type of variables. A carefully framed research question helps us select the appropriate research design. Further, the researcher must be clear about which variables are to be measured and of what type they are. The statistical analysis plan primarily depends on the research question, study design, and type of variables measured in the study. There are primarily two types of variables measured in a study:

- Categorical variables: Variables that can take one of a fixed number of possible categories. For example, gender (male/female), diabetes status (no, mild, moderate, severe), presence of glaucoma (yes/no), eye color, and success of surgery are all categorical variables. Categorical variables can be binary, with essentially two categories, or can have more than two categories. These variables are measured on either a nominal or an ordinal scale.
- Continuous variables: Variables that can take any real value within a given interval. For example, age measured in years or days, intraocular pressure (IOP), visual acuity, and contrast sensitivity are all continuous variables. These variables are measured on a ratio or interval scale.

In medical research, we very often attempt to understand the relationship between the different variables measured. Depending upon the relationship between these variables, they can be broadly classified as independent and dependent variables. We are interested in how these variables are correlated, and in how changes in the independent variable affect the dependent variable on average. The statistical methods covering these types of analysis are known as correlation and regression analysis. The types of the variables, the type of the dependent variable, and the number of independent variables affecting the dependent variable all have a bearing on deciding what type of correlation and regression analysis is to be performed. The present article explores these methods with some relevant examples.

## CORRELATION

Correlation deals with the study of possible relationships between two continuous or ordinal variables and is assessed using a measure called the correlation coefficient. The correlation coefficient measures the degree of the linear relationship between two continuous or ordinal variables. The population correlation coefficient is represented by “ρ” and the estimate for the correlation coefficient which is computed from sample data is represented by “*r*”. Primarily, the correlation coefficient is of the following two types:

i. Pearson’s correlation coefficient – This measure estimates the degree of the linear relationship between two continuous variables that are normally distributed. If “X” and “Y” are the two variables measured, then Pearson’s correlation coefficient is computed by the formula,

$$r = \frac{Cov(X,Y)}{SD(X)\,SD(Y)}$$

where Cov(X,Y) is the covariance between X and Y, SD(X) is the standard deviation of X, and SD(Y) the standard deviation of Y.

ii. Spearman rank correlation coefficient – This estimates the degree of the relationship between two ordinal variables, or between two continuous variables that are not normally distributed. Spearman’s rank correlation coefficient is computed by,

$$r_s = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$

where D_i is the difference in the ranks of the i^{th} pair of observations, obtained by ranking each variable separately, and n is the number of pairs.

The correlation coefficient ranges between –1 ≤ *r* ≤ 1. The sign of the correlation coefficient gives the direction, and the magnitude the strength, of the linear relationship. We can test the significance of the correlation coefficient using a statistical test. The null hypothesis for this test of significance is H_{0}: ρ = 0. The test statistic for testing the significance of the correlation coefficient is given by,

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which follows a t distribution with n – 2 degrees of freedom under the null hypothesis.

The confidence interval for “r” is obtained through Fisher’s z-transformation and is given by,

$$z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), \qquad z \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}$$

where the resulting limits are transformed back to the scale of r.

Let us now consider an example, where the linear relationship between IOP and intracranial pressure (ICP) is assessed in order to use IOP as a non-invasive surrogate for ICP.[1] In a study with 130 subjects, the correlation coefficient between these two variables was estimated to be 0.33 in the left eye and 0.29 in the right eye. There seems to be a positive correlation between IOP and ICP in both eyes. The question now is whether this correlation is statistically significant and clinically meaningful. To test the significance of “*r*,” the “t” statistic was estimated to be *t* = 3.95 and *t* = 3.43 for the left and right eyes, respectively, both larger than the critical value at a 5% level of significance, *t* = 1.98. The p-value was estimated to be <0.001 for both correlations, indicating a significant positive correlation between the two variables. The 95% confidence intervals for these correlations are (0.17, 0.49) for the left eye and (0.12, 0.46) for the right eye, which further confirms the earlier interpretation. These results show that IOP and ICP are positively and significantly correlated.

In summary, the correlation coefficient is a measure of the linear relationship between two continuous or ordinal variables, and it varies between –1 and +1. The significance of the correlation coefficient can be tested with a statistical test. We must be wary that the magnitude of the correlation coefficient is sensitive to sample size and extreme observations. Therefore, while making inferences, we must consider the magnitude of the correlation coefficient, its confidence interval, and the significance of the test of hypothesis.
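As a concrete illustration, the following short Python sketch computes Pearson’s correlation coefficient and its test statistic directly from the formulas above. The data are made up for illustration and are not from the study cited earlier:

```python
import math

def pearson_r(x, y):
    """Pearson's r = Cov(X, Y) / (SD(X) * SD(Y)), computed from sample data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sdx * sdy)

def t_statistic(r, n):
    """Test statistic for H0: rho = 0; follows a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical paired measurements (e.g., two pressure readings per subject).
x = [10, 12, 14, 16, 18, 20]
y = [11, 13, 16, 15, 19, 22]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
```

Comparing t against the critical value of the t distribution with n – 2 degrees of freedom then decides the test, exactly as in the IOP–ICP example.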

## SIMPLE LINEAR REGRESSION

Once we have established the correlation between two variables, the next step is to move to the concept of regression. Here, we try to define one variable, usually the dependent variable, as a function of the other variable(s), the independent variable(s). Linear regression derives a linear equation for estimating the average value of the dependent variable Y for given values of the independent variable X, where both variables are continuous. The linear regression model is given by,

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where “*β*_{1}” is the mean change in *Y* for every unit change in the *X* value, “*β*_{0}” is the value of *Y* corresponding to *X* = 0, and *ε* is the error term, which is assumed to follow a normal distribution with mean 0 and a constant variance for every given value of *X*. The regression coefficients *β*’s are estimated from the sample data using the method of least squares. The regression coefficient estimates the relationship between the dependent and independent variables. The regression coefficient is given by,

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

Similar to correlation analysis, the regression coefficients can also be tested for their statistical significance, and their confidence intervals can be estimated. The confidence interval of the regression coefficient is given by,

$$\hat{\beta}_1 \pm t_{n-2,\,\alpha/2}\, SE(\hat{\beta}_1)$$

The test statistic for the regression coefficient is given by,

$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

which follows a t distribution with n – 2 degrees of freedom under the null hypothesis H_{0}: β_{1} = 0.

Consider the example taken earlier, with a linear relationship between IOP and ICP. Let ICP be the dependent variable, which we regress on IOP as the independent variable. Through simulation, the regression equation estimated was,

The standard error of the regression coefficient was estimated to be 0.203, and therefore the test statistic for the regression coefficient *β*_{1} corresponding to IOP is *t* = 1.643, which is less than the critical value of *t* = 1.98, with p-value = 0.103. The 95% confidence interval for the regression coefficient was estimated to be (–0.73, 0.06), which includes 0, the point of nullity. Therefore, the regression coefficient was not statistically significant. While interpreting the result of the simple linear regression, we state that with every one-unit increase in IOP, there is a decrease of 0.334 units of ICP on average.

To determine the goodness of fit of the regression equation, the coefficient of determination (R^{2}) is estimated. The coefficient of determination is defined as the proportion of variability in the dependent variable explained by the model. It is represented as *R*^{2} and is given by,

$$R^2 = 1 - \frac{SS_{error}}{SS_{total}}$$

The confidence interval for *R*^{2} is given by,

Statistical tools are available to assess whether the fitted regression equation satisfies the assumptions on which it was estimated. The residual is the difference between the observed value of the dependent variable and the value predicted by the regression model. It is given by,

$$e_i = y_i - \hat{y}_i$$

Residuals are assumed to be independently and normally distributed. The residual plot is the graph of the residuals on the Y-axis against the independent variable on the X-axis.
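The least-squares fit, together with R^{2} and the residuals described above, can be sketched as follows. This is a minimal illustration with hypothetical data, not a substitute for a statistical package:

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1 * x.

    Returns the intercept b0, the slope b1, the coefficient of
    determination R^2 = 1 - SS_error / SS_total, and the residuals
    e_i = y_i - y_hat_i.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx                 # slope: mean change in y per unit change in x
    b0 = my - b1 * mx              # intercept: value of y at x = 0
    fitted = [b0 + b1 * a for a in x]
    residuals = [b - f for b, f in zip(y, fitted)]
    ss_err = sum(e ** 2 for e in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_err / ss_tot, residuals

# Hypothetical data lying close to y = 2x, so the slope comes out near 2.
b0, b1, r2, res = fit_simple_linear([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

Plotting the residuals against the x values or the fitted values gives exactly the residual plots used for regression diagnostics.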

### Regression diagnostics

- *Testing non-linearity*: The assumption is that the relationship between the dependent variable and the independent variable is linear. We expect the plot of the residuals against the values of the independent variable to be roughly scattered around zero.
- *Testing homogeneity of error variance*: The assumption of homoscedasticity is tested by plotting the residuals against the predicted values. If the plot shows no clear pattern, the error variance is said to be homoscedastic.
- *Testing normality of residuals*: The assumption of normality of the residuals is tested using a normal probability plot or with the Shapiro–Wilk test.

## MULTIPLE LINEAR REGRESSION

Multiple linear regression is an extension of simple linear regression. It derives the prediction equation for estimating the mean value of the dependent variable for given values of more than one independent variable. The equation is given by,

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

In the case of multiple linear regression, *β*_{0} is the average value of *Y* when all the corresponding *X*_{i}'s = 0, *β*_{i} is the average change in *Y* for a unit change in *X*_{i}, for *i* ranging from 1 to *p*, after adjusting for the other variables in the model, and *ε* is the error term, which is assumed to follow a normal distribution with mean 0 and a constant variance. The estimate of the regression coefficient vector β′ = (β_{0}, β_{1},…, β_{p}) in the multiple linear regression model is given by,

$$\hat{\beta} = (X'X)^{-1}X'y$$

Here, *X* is the design matrix of *n* observations on the *p* variables (with a leading column of ones for the intercept), and *y* is the vector of the dependent variable for the *n* subjects. Similar to simple linear regression, the test statistic for testing the significance of an individual regression coefficient is given by,

$$t = \frac{\hat{\beta}_i}{\sqrt{\hat{\sigma}^2 C_{ii}}}$$

The confidence interval for each regression coefficient is given by,

$$\hat{\beta}_i \pm t_{n-p-1,\,\alpha/2}\sqrt{\hat{\sigma}^2 C_{ii}}$$

Here, σ̂^{2} is the estimate of the variance of the error term and *C*_{ii} is the i^{th} diagonal element of the (*X*'*X*)^{–1} matrix.

Consider the previous example of simple linear regression. Suppose we need to assess the relationship between IOP and ICP after accounting for the effect of age, as it has been observed that age has a negative relationship with ICP.[2] In this case, a multiple linear regression model can be fitted, which is given by,

$$ICP = \beta_0 + \beta_1 \times IOP + \beta_2 \times Age + \varepsilon$$

The regression coefficients associated with IOP and age can then be estimated, along with their confidence intervals, and tests of significance of the estimates can be carried out. While interpreting, we state that with every one-unit change in IOP, there is an average change of *β*_{1} units in ICP after adjusting for the effect of age.

To determine the goodness of fit of the multiple linear regression model with more than one independent variable, the coefficient of determination adjusted for the number of variables in the model is estimated. The adjusted R^{2} value is given by,

$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$$

Linear regression provides a structure for the linear relationship between one or more independent variables and a normally distributed continuous outcome variable. The multiple linear regression model estimates the average of the response variable in the presence of multiple independent variables, with the effects of covariates adjusted for. The global test for model adequacy, or the significance of the regression model, can be carried out using analysis of variance (ANOVA) by calculating the sum of squares explained by the model and the error sum of squares.
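The matrix formula β̂ = (X′X)^{–1}X′y can be made concrete in code. The sketch below, written in plain Python with a naive Gaussian-elimination solver (fine for illustration, not for production use), recovers the coefficients of a hypothetical model with two independent variables:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve the linear system A w = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]        # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))  # pivot row
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            m[r] = [v - f * u for v, u in zip(m[r], m[i])]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):                        # back substitution
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_multiple_linear(X, y):
    """Estimate beta_hat = (X'X)^(-1) X'y; a column of 1s is added for the intercept."""
    Xd = [[1.0] + list(row) for row in X]
    Xt = [list(col) for col in zip(*Xd)]                  # transpose
    XtX = matmul(Xt, Xd)
    Xty = [sum(x * v for x, v in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)

# Data generated exactly from y = 1 + 2*x1 + 3*x2, so the fit recovers (1, 2, 3).
X = [[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]]
y = [1 + 2 * a + 3 * b for a, b in X]
beta = fit_multiple_linear(X, y)
```

Statistical packages solve the same normal equations internally, along with the standard errors needed for the tests and confidence intervals above.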

## SIMPLE LOGISTIC REGRESSION

Linear regression works when both the dependent and independent variables are continuous. However, in health research we very often encounter a binary or categorical dependent variable together with a set of one or more independent variables of all types. The linear regression model fails to model such data. Logistic regression is a very powerful regression technique that enables us to model such data and study the association between all types of independent variable(s) and a categorical outcome or dependent variable.[3] In the case of a binary outcome or dependent variable, we refer to it as binary logistic regression. The fundamental change is that we work with a probability model.

Suppose Y is the binary dependent variable with possibilities “Yes” denoted by 1 and “No” denoted by 0. Then the binary logistic regression model is given by,

$$p = P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

This is the probability of occurrence of the variable Y = 1, or the event, under a given level of the independent variable. There is no distributional assumption for the independent variables in logistic regression analysis, which means they can be either continuous or categorical. We generally consider a logit link function to linearize the relationship between the independent and dependent variables, the equation of which is given by,

$$logit(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X$$

The error term *ε* in the logistic regression model assumes one of the two possible values. If the outcome of interest occurs, then *ε* = 1–*p*, with a probability of p whereas if the outcome does not occur, then *ε* = –*p* with a probability of 1-p. Therefore, *ε* follows a binomial distribution with mean 0 and variance equal to p(1-p).
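The probability model and its logit link can be illustrated in a few lines of code (a sketch with arbitrary coefficient values, not estimates from any study):

```python
import math

def logistic(b0, b1, x):
    """P(Y = 1 | X = x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """The logit link: log-odds ln(p / (1 - p))."""
    return math.log(p / (1 - p))

# With b0 = -2 and b1 = 0.5, the linear predictor is 0 at x = 4,
# so the modeled probability crosses 0.5 there.
p_at_4 = logistic(-2.0, 0.5, 4.0)
```

Applying `logit` to the modeled probabilities recovers the linear predictor b0 + b1·x, which is precisely why the logit link linearizes the relationship between the independent variable and the outcome.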

Consider a study looking at the possible factors associated with retinal break/tear.[4] The outcome here is dichotomous or binary. Let posterior hyaloid face (PHF) separation during 23-gauge pars plana vitrectomy (PPV) be a potential risk factor associated with a retinal tear. Here, the outcome variable is binary, given as retinal tear, Yes or No. The independent variable, PHF separation, is also categorized, as suction-only induced or adherent. Let adherent PHF separation be the exposure of interest, which we examine as a risk factor associated with retinal tears. Therefore,

P (Retinal tear = Yes| Adherent) = Probability of retinal tear among those who had adherent PHF separation. It is given by,

P (Retinal tear = No| Adherent) = Probability of no retinal tear among those who had adherent PHF separation. It is given by,

P (Retinal tear = Yes| Suction-only) = Probability of retinal tear among those who had suction-only PHF separation. It is given by,

P (Retinal tear = No| Suction-only) = Probability of no retinal tear among those who had suction-only PHF separation. It is given by,

Therefore, the odds of retinal tear among subjects who had adherent PHF separation are given by,

$$Odds_{adherent} = \frac{P(\text{Retinal tear} = \text{Yes} \mid \text{Adherent})}{P(\text{Retinal tear} = \text{No} \mid \text{Adherent})}$$

Similarly, the odds of retinal tear among subjects who had suction-only PHF separation are given by,

$$Odds_{suction-only} = \frac{P(\text{Retinal tear} = \text{Yes} \mid \text{Suction-only})}{P(\text{Retinal tear} = \text{No} \mid \text{Suction-only})}$$

Hence, the odds ratio of retinal tear for adherent PHF separation as compared to suction-only PHF separation is given by,

$$OR = \frac{Odds_{adherent}}{Odds_{suction-only}} = e^{\beta}$$

The significance of the regression parameter is tested using the Wald statistic given by,

$$W = \frac{\hat{\beta}}{SE(\hat{\beta})}$$

The confidence interval for the odds ratio is given by,

$$e^{\hat{\beta}\, \pm\, z_{\alpha/2}\, SE(\hat{\beta})}$$

In the above-mentioned study on retinal tears, the estimate was *β* = 1.325, and hence the odds ratio is *OR* = 3.763. The standard error of *β* was estimated to be 0.861. Therefore, the test statistic is *t* = 1.539, which is less than the critical value at a 5% level of significance, and the p-value was estimated to be 0.124. The confidence interval for the odds ratio is (0.696, 20.34).
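The arithmetic of this example can be reproduced directly from the reported coefficient and standard error. The sketch below uses the study’s reported β = 1.325 and SE = 0.861 (small rounding differences from the published figures are expected) to compute the odds ratio, the Wald statistic, and the 95% confidence interval:

```python
import math

def logistic_summary(beta, se, z_crit=1.96):
    """Odds ratio exp(beta), Wald statistic beta/SE, and CI exp(beta +/- z*SE)."""
    odds_ratio = math.exp(beta)
    wald = beta / se
    ci = (math.exp(beta - z_crit * se), math.exp(beta + z_crit * se))
    return odds_ratio, wald, ci

# Reported estimates from the retinal-tear example: beta = 1.325, SE = 0.861.
odds_ratio, wald, ci = logistic_summary(1.325, 0.861)
```

Because the interval (0.696, 20.34) contains 1, the null value on the odds-ratio scale, the same non-significant conclusion follows as from the Wald test.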

The log-likelihood ratio test is used to assess the goodness-of-fit of the model by comparing the likelihood functions of the current and the null models. The likelihood ratio test statistic is given by,

$$G = -2\ln\left(\frac{L_{null}}{L_{current}}\right)$$

The test statistic asymptotically follows a chi-square distribution with *N*_{1} – *N*_{2} degrees of freedom, where *N*_{1} is the number of parameters in the current model and *N*_{2} is the number of parameters in the null model.

## MULTIPLE LOGISTIC REGRESSION

In health research, we generally have a multifactorial environment and study the effect of several independent variables on the dependent variable or outcome. Under such a circumstance, we need to assess the effect of all these factors on the binary outcome variable in a single model. Multiple logistic regression, an extension of simple logistic regression, is the modeling method applied in such a scenario. The multiple logistic regression model is given by,

$$p = P(Y = 1 \mid X_1, \ldots, X_p) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

This is the probability of occurrence of the dependent variable Y = 1, or the event, under fixed levels of more than one independent variable. The logit function of the multiple logistic regression model is given by,

$$logit(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

The regression coefficients *β*’s are estimated by the maximum likelihood method. The log-likelihood function is differentiated with respect to the p + 1 coefficients. The resulting likelihood equations are given as,

$$\sum_{i=1}^{n}\left(y_i - p_i\right) = 0, \qquad \sum_{i=1}^{n} x_{ij}\left(y_i - p_i\right) = 0$$

Here, *j* = 1, 2,…, *p* indexes the independent variables. By taking the second partial derivatives of the log-likelihood function, we get a square matrix of order p + 1, which is called the observed information matrix. In this matrix, the diagonal elements represent the variances and the off-diagonal elements the covariances of the estimates. The odds ratio estimated in multiple logistic regression is termed the adjusted odds ratio. Similar to univariate logistic regression, the test statistic used to test the significance of a regression coefficient is the Wald statistic,

$$W_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

*exp*(*β*) provides the adjusted odds ratios. The calculation of confidence intervals for the regression coefficients is a bit more involved for the multiple logistic regression model, as it is expressed in matrix notation, and the estimate is given by,

$$\hat{\beta}_j \pm z_{\alpha/2}\sqrt{\left[(X'\hat{V}X)^{-1}\right]_{jj}}$$

where V̂ is the diagonal matrix with elements p̂_{i}(1 – p̂_{i}).

However, with the advent of software, the calculation of the confidence intervals for the odds ratio or the regression coefficient is no longer tedious. Most statistical software packages routinely calculate the confidence intervals and provide the required estimates. Based on the confidence interval of regression coefficients, the confidence interval of odds ratios also can be calculated by taking antilog or directly using packages. Similar to the log-likelihood test, there are other tests to assess the goodness of fit of the logistic regression model. The Hosmer–Lemeshow test is used for testing the goodness-of-fit of the logistic regression model by assessing whether the observed event rate matches the expected event rate. Hosmer–Lemeshow test statistic is given by,

$$H = \sum_{k=1}^{g} \frac{(O_k - n_k\bar{\pi}_k)^2}{n_k\bar{\pi}_k(1-\bar{\pi}_k)}$$

where g is the number of groups, *n*_{k} is the total number of subjects in the k^{th} group, *O*_{k} is the observed number of events, and π̄_{k} is the average estimated probability in that group.
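As an illustration of how the Hosmer–Lemeshow statistic is assembled, the sketch below computes it for three hypothetical groups (real applications typically use ten groups based on deciles of predicted risk):

```python
def hosmer_lemeshow(observed, group_sizes, mean_probs):
    """H = sum over groups of (O_k - n_k * pi_k)^2 / (n_k * pi_k * (1 - pi_k))."""
    return sum((o - n * p) ** 2 / (n * p * (1 - p))
               for o, n, p in zip(observed, group_sizes, mean_probs))

# Hypothetical groups: observed events, group sizes, average predicted probabilities.
H = hosmer_lemeshow([2, 4, 9], [50, 50, 50], [0.05, 0.10, 0.18])
```

The statistic is referred to a chi-square distribution; a small H (large p-value) indicates that the observed and expected event rates agree, i.e., an adequate fit.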

In summary, binary logistic regression studies the association between all types of independent variables and a binary outcome or dependent variable. It can adjust for covariates and derive the odds ratio along with its confidence interval. It is a very powerful technique for analyzing data from case-control, cohort, and randomized controlled trial designs with binary outcomes and multiple covariates or confounders. It is also applicable to prognostic modeling with binary outcomes.

Further, when the outcome variable has more than two categories, we need to apply a multinomial logistic regression model, and when the outcome categories have an order, an ordinal logistic regression model. For matched studies, we need to go for conditional logistic regression modeling. If the outcome is count data or data expressed as rates, we have the option of a Poisson regression model.

## CONCLUSION

The analysis plan for any study is determined by the research question, study design, and the type and number of variables. Apart from this, we also need to be mindful of the applied field of medicine from which the data are generated. In ophthalmology, very often observations are assessed from both eyes. A pair of observations is therefore generated from a single subject, and hence we have to treat them as paired observations in our analysis plan. Another way to deal with this is to combine the pair of observations coming from the two eyes; for example, for IOP, the mean IOP of both eyes may be considered for the analysis. The origin of the data is very important in deciding the analysis plan and selecting the model.

It is also important during the analysis to understand the assumptions and the setting for which the analysis is pertinent. For example, we have the assumption of normality for the dependent variable in linear regression. Where this assumption is violated, we can look at alternatives such as transformation of the variable. There are numerous transformations, such as log, square root, and inverse, which can be applied depending on the skewness of the distribution of the variable. We must be cautious about this as well before going ahead with the analysis.

### Financial support and sponsorship

Nil.

### Conflicts of interest

There are no conflicts of interest.

## REFERENCES

**Keywords:**

Correlation; linear regression; logistic regression