# Statistics in Medicine

Summary: The scope of biomedical research has expanded rapidly during the past several decades, and statistical analysis has become increasingly necessary to understand the meaning of large and diverse quantities of raw data. As such, a familiarity with this lexicon is essential for critical appraisal of medical literature. This article attempts to provide a practical overview of medical statistics, with an emphasis on the selection, application, and interpretation of specific tests. This includes a brief review of statistical theory and its nomenclature, particularly with regard to the classification of variables. A discussion of descriptive methods for data presentation is then provided, followed by an overview of statistical inference and significance analysis, and detailed treatment of specific statistical tests and guidelines for their interpretation.

Stanford, Calif.

From the Department of Surgery, Stanford University School of Medicine.

Received for publication November 16, 2009; accepted March 29, 2010.

**Disclosure:** *The authors have no financial interest to declare in relation to the content of this article.*

Geoffrey C. Gurtner, M.D.; Stanford University School of Medicine; 257 Campus Drive, GK-201; Stanford, Calif. 94305; ggurtner@stanford.edu

Plastic surgery is unique in the extent to which its outcomes possess an inherent subjectivity. Rather than binary measures (e.g., dead and alive) or linear metrics (e.g., blood pressure), the plastic surgeon must frequently operate within the nebulous space enveloping “better” and “worse.” As such, defining appropriate endpoints is critical. The purpose of this overview is to give the reader some familiarity with the underlying theory governing statistical methods and provide a tool for investigative design and the interpretation of results.

The scope of biomedical research has expanded rapidly during the past several decades, and statistical analysis has become increasingly essential to translate large and diverse quantities of raw data into readily interpretable results. A broad array of rigorous statistical methods has evolved to validate and standardize these various approaches. Currently, nearly every published study incorporates some elements of statistical analysis, and many journals have even begun to use dedicated statistical reviewers to ensure a standard of quality.^{1} As such, a degree of statistical literacy is critical not only for article preparation but also for accurate interpretation of published data.

This article attempts to provide a practical overview of medical statistics, with an emphasis on the selection, application, and interpretation of specific tests. This includes a brief review of statistical theory and its nomenclature, particularly with regard to the classification of variables. A discussion of descriptive methods for data presentation is then provided, followed by an overview of statistical inference and significance analysis, and detailed treatment of specific statistical tests and guidelines for their interpretation.

## BASIC PRINCIPLES

In statistics, *variables* refer to measurable or observable attributes that vary among individuals or over time (e.g., body mass, postoperative complications). *Data* consist of the corresponding measured or observed *values* assumed by these variables under specific conditions (e.g., 70.5 kg, abdominal hematoma). The goal of data analysis is to make statements about the distributional properties of variables, individually or collectively. However, the diverse nature of statistical variables is such that no single analytical method may be applied to all forms of data. As such, understanding the properties of the variables under study is critical to ensure selection of appropriate statistical tests.

At their most primitive level, all statistical variables may be classified as either *categorical* or *numerical*. Categorical variables represent *qualitative* observations (e.g., postoperative complications, preoperative diagnosis), whereas numerical variables refer to *quantitative* observations (e.g., body mass, operative time). Generally speaking, numerical data are those whose values are numbers (e.g., 70.5 kg), whereas categorical data delineate one or more groups (e.g., hematoma, herniation). Numerical variables may be further subdivided into *discrete* and *continuous* types. Discrete variables are those whose values are restricted to a predefined set, typically the integers (e.g., number of revisions). By contrast, continuous variables may assume any intermediate value within a given range (e.g., body mass). These include most clinical measurements, such as weight and length.

Statistical data are also frequently categorized according to their *level of measurement* or *data scale*. These two terms are synonymous and refer to the relationships among observable values as restricted by the nature of the measurement system. Measurement of categorical variables may be either *ordinal*, in which case values have an obvious ordering (e.g., severe pain is greater than moderate pain), or *nominal*, in which no explicit ranking exists (e.g., hematoma versus herniation). All numerical variables conform to an innate ordering system, and most are measured along an *interval* scale [the exception being pseudonumerical variables such as cancer staging, where the quantitative differences between values do not reflect magnitudes of effect (e.g., stage 1 versus stage 2 versus stage 3 colon cancer)]. These categories are further summarized in Table 1. In general, the specific properties of a variable or group of variables (e.g., continuous numerical) will dictate the appropriate statistical measures for their analysis.

## DESCRIPTIVE STATISTICS

In broad terms, analysis of data may be regarded as either descriptive or inferential. *Descriptive statistics* are those that *describe* basic features of a data set, including most forms of graphic analysis. These facilitate identification of patterns and form the basis of nearly every quantitative analysis. *Inferential statistics*, by contrast, evaluate relationships among variables (e.g., increased body mass is associated with an increased complication rate). Treatment of clinical data will typically begin with broad descriptive statistics covering all study data and then proceed to the application of inferential statistics for specific subsets of interest.

In their most basic form, descriptive statistics are simply an efficient method of data presentation. This is particularly important for large data sets, where reporting raw values quickly becomes unwieldy. For categorical data, this is typically achieved through a frequency table, in which the absolute frequencies (i.e., raw counts) or relative frequencies (i.e., fractions of the total) for each group are listed. Common clinical applications include the tabular reporting of complications, risk factors, and patient demographics. Graphic analysis of categorical data may facilitate comparison of relative frequencies, often in the form of bar graphs or pie charts. Presentation of numerical data requires a more complex descriptive approach using the various metrics described below.
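As a minimal sketch of the frequency table described above, the following Python snippet tallies absolute and relative frequencies for a categorical variable. The complication records are invented for illustration and do not come from any study.

```python
from collections import Counter

# Hypothetical complication records, one entry per patient.
complications = ["none", "hematoma", "none", "infection", "none",
                 "hematoma", "none", "none", "herniation", "none"]

absolute = Counter(complications)            # absolute frequencies (raw counts)
total = sum(absolute.values())
relative = {category: count / total          # relative frequencies (fractions)
            for category, count in absolute.items()}

print(f"{'Complication':<12} {'n':>3} {'%':>6}")
for category, count in absolute.most_common():
    print(f"{category:<12} {count:>3} {100 * relative[category]:>6.1f}")
```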

### Central Tendency

Measures of central tendency provide estimates of the “middle” of a data set, and are particularly useful for describing numerical data. These include the familiar metrics of *mean*, *median*, and *mode*, which are formally defined in Table 2. These refer to the arithmetic average, midpoint, and most common values of a set, respectively. Here, “midpoint” signifies the middle value of the ordered set (or the average of the middle two numbers if the sample size is even). Typically, only one of these indicators will be used to describe any given variable, often dictated by the data scale (Table 1). The mean is by far the most frequently used measure of central tendency, although specific instances where the median is more appropriate (particularly when dealing with outliers) are discussed in the paragraphs that follow. The mode is rarely mentioned by name, but it is nearly ubiquitous in any discussion of categorical data (e.g., “the most common…,” “the majority…”).
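The three measures can be illustrated with Python's standard `statistics` module. The operative times below are hypothetical; the last value is a deliberate outlier, included to show how it pulls the mean away from the median.

```python
import statistics

# Hypothetical operative times in minutes; the final value is an outlier.
operative_times = [90, 95, 95, 100, 105, 110, 300]

mean_time = statistics.mean(operative_times)      # arithmetic average
median_time = statistics.median(operative_times)  # middle value of the ordered set
mode_time = statistics.mode(operative_times)      # most common value

print(f"mean={mean_time:.1f}, median={median_time}, mode={mode_time}")
```

In this example the single long operation raises the mean well above the median, whereas the median and mode are unaffected by it.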

### Measures of Dispersion

Dispersion refers to the spread of the values around the central tendency (typically, the mean). The *range* is the simplest measure of dispersion, defined as the difference between the highest and lowest values in a set. For data that are interval in nature, a more informative estimate is provided by the *variance*, which measures how closely individual values cluster around the mean. The *standard deviation (SD)*, defined as the square root of the variance, provides an estimate for the typical distance of a value from the mean. Another related measure commonly used to describe the behavior of a variable is the *skewness*, which evaluates the symmetry of data relative to their mean. Together, these metrics provide a concise numerical description of the distribution of values throughout a data set (Tables 1 and 2).
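The following sketch computes these measures for a small, invented set of body masses. The variance and SD use the sample (n - 1) convention of Python's `statistics` module, and the skewness uses one common moment-based definition; the formal definitions in Table 2 may differ in detail.

```python
import statistics

# Hypothetical body masses in kilograms.
body_mass_kg = [62.0, 68.5, 70.5, 74.0, 81.0, 95.0]

n = len(body_mass_kg)
mean = statistics.mean(body_mass_kg)
value_range = max(body_mass_kg) - min(body_mass_kg)  # highest minus lowest
variance = statistics.variance(body_mass_kg)         # sample variance (n - 1)
sd = statistics.stdev(body_mass_kg)                  # square root of the variance

# Moment-based skewness (assumed definition): third central moment
# divided by the second central moment raised to the power 1.5.
m2 = sum((x - mean) ** 2 for x in body_mass_kg) / n
m3 = sum((x - mean) ** 3 for x in body_mass_kg) / n
skewness = m3 / m2 ** 1.5

print(f"range={value_range}, variance={variance:.1f}, "
      f"SD={sd:.1f}, skewness={skewness:.2f}")
```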

### Statistical Distributions

The relationships observed among measures of central tendency and dispersion frequently follow specific well-characterized patterns, which may be described mathematically in the form of statistical distributions. The most common of these is the *normal distribution*, which corresponds to a symmetric, bell-shaped curve (Fig. 1, *above*). This distribution has several important and unique properties, including identical values of the mean, median, and mode. Moreover, a normal distribution implies that approximately 68, 95, and 99.7 percent of all values will fall within 1, 2, and 3 SD of the mean, respectively.^{2}
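The 68, 95, and 99.7 percent figures can be checked directly from the cumulative distribution function of the standard normal distribution. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `fraction_within` is introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: fraction_within(k) for k in (1, 2, 3)}
for k, fraction in coverage.items():
    print(f"within {k} SD: {100 * fraction:.1f} percent")
```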

In general, the mean is the preferred measure of central tendency for symmetrical (e.g., normal) distributions; however, it is strongly influenced by outliers and can thus be misleading in skewed distributions (Fig. 1, *below*). The median typically provides a more reliable metric for heavily or predictably skewed distributions. Although most clinical data can usually be characterized by a normal distribution, several variables such as length of stay and body mass may be predictably skewed (e.g., both frequently exhibit positive skewness because of unbalanced outliers on the upper end). Graphic analyses such as Q-Q plots and histograms are often helpful for evaluating deviations from normality, and more sophisticated computational methods have also been developed.^{3} When identified, nonnormal data can occasionally be mathematically transformed to adopt a normal distribution (e.g., using the variable's logarithm, square root, or multiplicative inverse). This is strongly preferable where possible, as normality permits the use of highly specific *parametric tests*, which use knowledge of the underlying data distribution to more precisely evaluate significance. Analysis of nonnormally distributed data requires a separate, more general class of statistical methods; incorrectly applying normality-based tests to these data will yield inaccurate results.
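To illustrate the effect of skew and of a logarithmic transformation, the sketch below uses invented length-of-stay data with a long upper tail. The `skewness` helper uses a moment-based definition chosen for this example, and the transformation is shown only as one of the options mentioned above, not as a recommendation for any particular data set.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical lengths of stay in days, with a long upper tail.
length_of_stay_days = [1, 2, 2, 3, 3, 4, 5, 8, 14, 30]
log_stay = [math.log(d) for d in length_of_stay_days]

raw_mean = statistics.mean(length_of_stay_days)
raw_median = statistics.median(length_of_stay_days)
skew_raw = skewness(length_of_stay_days)
skew_log = skewness(log_stay)

print(f"mean={raw_mean}, median={raw_median}")
print(f"skewness raw={skew_raw:.2f}, after log transform={skew_log:.2f}")
```

Here the mean is roughly twice the median, and the transformed values are considerably closer to symmetric than the raw values.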

## INFERENTIAL STATISTICS

Where descriptive statistics describe basic features of a data set (e.g., mean body mass), statistical inference makes claims about the nature and relationships of the underlying variables that give rise to the observed data (e.g., “increased body mass leads to an increased incidence of hematomas”). This process is critically dependent on the ability to distinguish true relationships from random variation, and rigorous numerical analysis is often required to achieve this end.

### Standard Error

Generally speaking, any collection of observed data represents only a small subset of all the data that *could* be observed for that variable under similar conditions. As such, a variable's distribution *within* a given data set is merely a sampling of the true behavior of that variable. By assuming that the degree of variance exhibited by the observed data is approximately that of the true variable, it is possible to estimate the sampling error associated with each variable under study. For example, consider intercanthal measurements of 10 adults with Apert syndrome. This is a small sample relative to all affected adults and, as such, repeating these measurements for another 10 patients would be expected to yield a different mean distance. The degree to which this sample mean will vary on repeated measurements can be estimated based on the variance within the (initial) observed data. The resulting quantity, termed the *standard error of the mean (SEM)*, is frequently used as a measure of the reliability of a data set (Table 2).
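A minimal sketch of this calculation, using the standard formula SEM = SD / √n, is shown below. The ten distances are invented for illustration and are not real patient measurements.

```python
import math
import statistics

# Hypothetical distances in millimeters for a sample of 10 patients.
distances_mm = [38, 40, 41, 42, 43, 44, 45, 46, 48, 53]

n = len(distances_mm)
sample_mean = statistics.mean(distances_mm)
sample_sd = statistics.stdev(distances_mm)  # sample SD (n - 1 denominator)
sem = sample_sd / math.sqrt(n)              # standard error of the mean

print(f"n={n}, mean={sample_mean}, SD={sample_sd:.2f}, SEM={sem:.2f}")
```

Because the SEM shrinks with the square root of the sample size, quadrupling the number of patients would be expected to halve it.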

### Confidence Intervals

Confidence intervals provide another index of the reliability of a sample mean, denoting numerical limits within which the true mean is expected to lie. These limits may be adjusted to any desired level of confidence but are typically defined using the 95 percent threshold (i.e., with specific error tolerance α = 0.05). When the data can be assumed to follow a normal distribution, confidence intervals may be quickly computed by treating the SEM as the SD of the sample mean, because approximately 95 percent of normally distributed values fall within 2 SD of their center. Here, the 95 percent confidence interval is approximately the mean ± 2 · SEM. However, if the true distribution is unknown, the SEM is often an unreliable estimator. For such cases, a more robust interval may be obtained using other methods, as described by Efron and Tibshirani.^{4}
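The normal-approximation interval can be sketched as follows. The function name `normal_ci` and the sample values are assumptions made for this example. The multiplier comes from the normal distribution (about 1.96 for 95 percent confidence); for small samples a *t*-based multiplier would be somewhat larger, and it is not available in the Python standard library.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_ci(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    center = mean(values)
    sem = stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 percent
    return center - z * sem, center + z * sem

# Hypothetical measurements in millimeters.
sample = [38, 40, 41, 42, 43, 44, 45, 46, 48, 53]
lower, upper = normal_ci(sample)
print(f"95 percent CI: {lower:.1f} to {upper:.1f} mm")
```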

In certain situations, confidence intervals may also be used to evaluate statistical significance. This is most commonly achieved by computing the confidence intervals for two variables and determining whether they overlap. For example, consider the 95 percent confidence intervals for mean artery diameter (6.2 ± 1.1 mm) and vein diameter (8.9 ± 1.3 mm). Because these two intervals (5.1 to 7.3 mm and 7.6 to 10.2 mm) do not overlap, this indicates a statistically significant difference with *p* < 0.05. Overlapping intervals, however, do not by themselves establish the absence of a significant difference.
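The overlap check in the artery and vein example is simple enough to express directly. The helper `intervals_overlap` is a name introduced for this sketch.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# 95 percent confidence intervals from the example in the text (millimeters).
artery = (6.2 - 1.1, 6.2 + 1.1)   # 5.1 to 7.3
vein = (8.9 - 1.3, 8.9 + 1.3)     # 7.6 to 10.2

print("intervals overlap:", intervals_overlap(artery, vein))
```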

### Hypothesis Testing

Hypothesis testing is the most common and most general form of statistical inference. This process is centered around the generation of a *null hypothesis*, which serves as the predicate for subsequent evaluation. The essence of hypothesis testing is to specify a proposition (before data collection) and then use the sample data to attempt to reject it. This process typically proceeds through several general stages, as described in Table 3, including computation of a *p* value: the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. The smaller the *p* value, the less compatible the data are with the null hypothesis. Although an explicit statement of the null hypothesis is often omitted, it is implicit in declaring that a given difference is “statistically significant with *p* < 0.05” that the null hypothesis of “no difference exists” is rejected at an α threshold of 0.05.

There is always a level of uncertainty associated with rejecting (or failing to reject) the null hypothesis. An incorrect rejection (i.e., concluding that a difference exists, when in fact one does not) is referred to as a *type I error*; when the null hypothesis is true, the probability of such an error is set by the significance threshold α, not by the observed *p* value. A *type II error* describes the failure to reject a faulty null hypothesis (i.e., not finding a difference, when one actually exists). The *power* of a study is defined as the probability of *not* making a type II error; that is, the likelihood of detecting a statistically significant difference when one actually exists. This is dependent on the sample size, the significance threshold (α), and the extent of true difference. When designing a clinical study, a *power analysis* is frequently used to determine the minimum sample size required to achieve an acceptable rate of type II error. These computations are heavily distribution dependent, and their specific formulae are beyond the scope of this review; however, a thorough treatment of this subject may be found in Cohen's text.^{5}
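As a rough illustration of a power analysis, the sketch below uses the textbook normal approximation for a two-sided comparison of two group means with a common, known SD. This is only one of many possible formulations, it is not taken from the article or from Cohen's text, and the function names and the example effect size are assumptions.

```python
import math
from statistics import NormalDist

_z = NormalDist().inv_cdf  # quantile function of the standard normal

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference `delta`."""
    n = 2 * ((_z(1 - alpha / 2) + _z(power)) * sigma / delta) ** 2
    return math.ceil(n)

def approximate_power(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power for a given n per group (far tail ignored)."""
    standard_error = sigma * math.sqrt(2 / n_per_group)
    return NormalDist().cdf(delta / standard_error - _z(1 - alpha / 2))

# Hypothetical design: detect a difference of half an SD.
n_required = sample_size_per_group(delta=0.5, sigma=1.0)
print("patients per group:", n_required)
print(f"power at that size: {approximate_power(n_required, 0.5, 1.0):.3f}")
```

Under these assumptions, halving the detectable difference roughly quadruples the required sample size, which is why the expected extent of true difference matters so much at the design stage.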

## STATISTICAL TESTS

Selecting an appropriate statistical test is critical for accurate data analysis. Determining the optimal method for a given data set must take into account several factors, chief among which are the limitations (e.g., continuous versus discrete) and distributional properties (e.g., normal versus skewed) of the variables under study. Table 4 provides a general blueprint for statistical test selection based on the various combinations of numerical and categorical data.

### Statistical Tests for Categorical Data

For comparisons among categorical variables, the null hypothesis typically states that the distribution of one variable is independent of those of the other variables (i.e., no relationship exists). Statistical tests, then, evaluate whether the observed frequencies of joint occurrences differ from the products of the individual observed frequencies for each variable (i.e., the expected overlap for independent distributions). *Pearson's chi-square test* (χ^{2}) is the most commonly used method and may be applied to the majority of categorical data. For example, to evaluate the association between diabetes and postoperative infection, a chi-square test would compare the observed incidence of postoperative infections in diabetic patients with that otherwise expected based on the observed frequencies of the remaining category pairs (e.g., postoperative infections in nondiabetics, diabetics without postoperative infections). When the expected number of co-occurrences for any pair is very small, *Fisher's exact test* offers an exact alternative^{6} that does not rely on the large-sample approximation underlying the chi-square test.
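Both tests can be sketched for a 2 × 2 table using only the Python standard library. The counts below are invented, the chi-square statistic is computed without a continuity correction, and the two-sided Fisher *p* value sums all tables no more probable than the observed one, which is one common convention. The function names are introduced for this example.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p value (1 df) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, chi-square with 1 df
    return chi2, p_value

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value from the hypergeometric distribution."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical counts. Rows: diabetic, nondiabetic. Columns: infection, none.
table = [[12, 38], [10, 140]]
chi2, p_chi2 = chi_square_2x2(table)
p_fisher = fisher_exact_two_sided(12, 38, 10, 140)
print(f"chi-square={chi2:.2f}, p={p_chi2:.4f}; Fisher exact p={p_fisher:.4f}")
```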

### Statistical Tests for Numerical Data

Analysis of numerical variables typically takes the form of either correlation or regression analysis. Where the former simply evaluates the degree of relationship among variables, regression analysis seeks to formalize this relationship as a mathematical model. In deciding which form of analysis is more appropriate, it is helpful to distinguish between *independent* (predictive) variables and *dependent* (predicted) variables. Typically, dependent variables represent measured endpoints (e.g., complications), whereas independent variables are those hypothesized to influence these endpoints (e.g., body mass).

Correlation analysis is typically used to evaluate the degree of association among large numbers of independent variables without establishing causal relationships. The most common method is *Pearson's correlation* (*r*), which measures the linear dependence of normally distributed numerical variables. This test assigns an *r* value to the degree of correlation, ranging from –1 to 1, with 0 representing no correlation and +1/–1 corresponding to perfect correlation/anticorrelation. When the assumption of normality cannot be made, *Spearman's rank correlation coefficient (ρ)* provides a nonparametric alternative to Pearson's correlation. Both measures are typically reported in conjunction with a *p* value, representing the probability of observing a relationship at least this strong by chance alone if no true correlation existed; however, it is important to note that this *p* value itself does not represent the degree of the association and thus should function only as a means of validating large correlation coefficients. A high correlation also does not imply agreement. For example, Pearson's correlation could be used to evaluate the fidelity of automated blood pressure readings compared with physician-administered measurements, where a correlation coefficient of 1 may correspond either to perfect agreement or to a perfectly reproducible error (e.g., the automated reading is always exactly 3 mm Hg higher than that of the physician).
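The blood pressure example can be reproduced with a short, hand-written implementation of both coefficients. The readings are invented, and the helpers `pearson_r`, `ranks`, and `spearman_rho` are names introduced for this sketch; Spearman's coefficient is computed here as Pearson's correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for paired samples x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical systolic readings; the device always reads 3 units high.
physician = [110, 118, 124, 131, 140, 152]
automated = [p + 3 for p in physician]
print(f"r = {pearson_r(physician, automated):.3f}")  # perfect correlation, constant error
```

A monotonic but nonlinear relationship (for example, y equal to the square of x for positive x) would give a Spearman coefficient of 1 while Pearson's *r* stays below 1.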

Regression analysis is more appropriate to evaluate the specific relationship between a dependent variable and one or more independent variables. The most familiar application involves the comparison of two variables through *linear regression analysis*; that is, for independent variable x and dependent variable y, determining the optimal linear equation (i.e., y = a + b · x) to fit the set of observed data pairs {x, y}. Such analyses are frequently accompanied by a scatterplot, which illustrates the observed distribution of data pairs relative to the best linear fit. Although less intuitive, linear regression analysis may be similarly extended to interrogate any number of independent variables (i.e., y = a + b_{1} · x_{1} + b_{2} · x_{2} + …). When more complex, nonlinear relationships exist between variables, the data may occasionally be transformed (e.g., substituting the variable's natural logarithm) to establish a linear relationship; otherwise, more elaborate techniques such as *logistic regression analysis* may be required.
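For a single independent variable, the best-fitting line y = a + b · x under the ordinary least-squares criterion has a closed-form solution, sketched below. The function name `fit_line` and the data pairs are assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data pairs {x, y}.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
print(f"y = {a:.2f} + {b:.2f} * x")
```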

### Statistical Tests for Mixed Categorical and Numerical Data

When comparing categorical data with numerical data, it is again important to distinguish between *independent* and *dependent* variables. Statistical analyses of independent numerical and dependent categorical data focus on developing mathematical classifiers based on one or more numerical variables to predict the outcome of the (dependent) categorical variable. These methods are frequently used to evaluate diagnostic and therapeutic implications (e.g., “prostate-specific antigen >6.0 is associated with a 35 percent risk of prostate cancer”); however, the computational complexity of these *discriminant analyses* grows quickly with the number of variables and generally requires more sophisticated mathematical software, for which consultation with a biostatistician with the necessary resources is advised.^{7}

The analysis of dependent numerical data with independent categorical data is broader in scope, with distinct approaches based on the specific properties of the dependent data, as summarized in Table 5. The simplest comparisons consist of one independent categorical variable and one dependent numerical variable. These will typically involve two groups of data (i.e., one categorical variable with two possible values, such as treatment versus control). This requires a two-sample *t* test, which may be either *paired* or *unpaired*. The unpaired *t* test is the more common of the two and may be applied for the majority of comparisons. The paired *t* test is only necessary when the independent variable distinguishes between two interdependent measurements. For example, when comparing the healing properties of two wounds on opposite sides of the same mouse, one treated with a dressing and the other a “control,” the independent variable (treatment status) does not fully separate the dependent measurements (rate of healing), because both come from the same mouse. A paired *t* test is required to make the necessary statistical adjustments in such cases. In addition, each *t* test may also be *one-tailed* or *two-tailed*, depending on whether experimental deviations are restricted to one direction. Nearly all clinical analyses should use a two-tailed test, with a one-tailed test used only for situations where differences in one direction or the other are not merely unexpected but impossible.^{8}
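The difference between the two forms of the test can be seen by computing both statistics on the same paired data. The healing rates below are invented, the unpaired version is the pooled-variance (Student) form, and the function names are introduced for this sketch. Only the *t* statistic and degrees of freedom are computed, because the Python standard library does not provide the *t* distribution needed for a *p* value.

```python
import math
import statistics

def unpaired_t(a, b):
    """Pooled-variance two-sample t statistic; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    pooled_variance = (((na - 1) * statistics.variance(a)
                        + (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    standard_error = math.sqrt(pooled_variance * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, na + nb - 2

def paired_t(a, b):
    """Paired t statistic on within-pair differences; returns (t, df)."""
    differences = [x - y for x, y in zip(a, b)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1

# Hypothetical healing rates (mm/day): two wounds on each of six mice.
treated = [1.9, 2.4, 1.5, 2.8, 2.1, 1.7]
control = [1.6, 2.2, 1.2, 2.6, 1.8, 1.5]

t_unpaired, df_unpaired = unpaired_t(treated, control)
t_paired, df_paired = paired_t(treated, control)
print(f"unpaired: t={t_unpaired:.2f}, df={df_unpaired}")
print(f"paired:   t={t_paired:.2f}, df={df_paired}")
```

In this example the mouse-to-mouse variation is large relative to the treatment effect, so the unpaired statistic falls below the two-tailed 5 percent critical value from standard tables (2.228 for 10 degrees of freedom), whereas the paired statistic far exceeds its critical value (2.571 for 5 degrees of freedom).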

*Analysis of variance* is a generalized version of the two-sample *t* test that permits comparisons of more than two groups (i.e., a discrete variable with more than two possible values). Analysis of variance may be either *one-way* or *two-way*, similar to the unpaired and paired forms of the *t* test. *Factorial analysis of variance* may be used to evaluate the effects of more than one independent variable, and *multivariate analysis of variance* is used to study more than one dependent variable. All *t* tests, and even the more general analysis of variance, are based on the assumption that the underlying variables obey a normal distribution. In situations where the assumption of normality is not valid, nonparametric alternative tests must be used. Although more robust, these nonparametric tests have less statistical power than their parametric counterparts and are thus avoided where possible. Table 5 lists the most common nonparametric alternatives to each of the statistical tests described above.
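A one-way analysis of variance reduces to comparing the variation between group means with the variation within groups. The sketch below computes the resulting *F* statistic by hand for three invented groups; the function name `one_way_anova_f` is an assumption, and converting *F* to a *p* value requires the *F* distribution, which is not in the Python standard library.

```python
import statistics

def one_way_anova_f(groups):
    """One-way ANOVA F statistic; returns (F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical outcome scores for three treatment groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

With exactly two groups, this *F* statistic equals the square of the pooled-variance unpaired *t* statistic, which is the sense in which analysis of variance generalizes the *t* test.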

## INTERPRETATION OF STATISTICS

One of the primary purposes of statistical analysis is to provide a standardized method of presenting diverse experimental findings. This serves to facilitate the interpretation of results and to ensure a level of quality control, and deviations from these standardized procedures should be appraised critically. A failure to correct for multiple comparisons (e.g., post hoc hypothesis generation), for example, will lead to artificially low *p* values, as would the application of normality-dependent statistical tests to data that are not normally distributed. Nonstandard tests should always be accompanied by a brief justification and reference to avoid the perception of “test shopping” for rare or cryptic tests that may generate lower *p* values. In contrast, statistical errors can also serve to undersell data, such as the failure to use a paired or one-tailed version of a *t* test when appropriate, thereby reducing the statistical power.
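The effect of uncorrected multiple comparisons can be made concrete with a short calculation. The sketch below uses the Bonferroni correction, which is one widely used adjustment and is not specifically named in this article; the function names and *p* values are invented for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjusted p values and the resulting significance calls."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p_adj <= alpha for p_adj in adjusted]

# Hypothetical p values from five comparisons in one study.
p_values = [0.004, 0.03, 0.04, 0.20, 0.60]
adjusted, significant = bonferroni(p_values)

print(f"error rate across 20 tests: {familywise_error_rate(20):.2f}")
for p, p_adj, keep in zip(p_values, adjusted, significant):
    print(f"p={p:.3f} -> adjusted {p_adj:.3f} ({'significant' if keep else 'not significant'})")
```

Three of the five unadjusted values fall below 0.05, but only one remains significant after adjustment.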

As always, it is critical to note the difference between statistical significance and clinical significance. There is often a temptation to distill statistics down to a single *p* value and apply a binary label of “significant” or “not” based on the ubiquitous 0.05 threshold. However, it is important to note that *p* values themselves do not indicate the strengths of relationships, but rather the probability that relationships at least as strong as those observed (as defined by the statistical test of choice) would arise by chance alone if no true relationship existed. Furthermore, true controls rarely exist in a clinical setting, and confounding variables are more often the rule than the exception. Clinically trivial results may exhibit statistical significance and, likewise, results that fail to achieve statistical significance may nonetheless be clinically relevant. Nevertheless, statistics are an essential component of data analysis, and a ready familiarity with this lexicon is critical to the interpretation of medical literature.

Detailed treatment of specific statistical tests and concepts is available through the free Web-based resource, Wolfram MathWorld (http://mathworld.wolfram.com).^{2} For an exhaustive survey of statistical analysis, Walter T. Ambrosius' text, *Topics in Biostatistics*, is recommended.^{6}

## REFERENCES

*BMJ*. 2002;324:1271–1273.

*J Am Stat Assoc*. 1967;62:399–402.

*Stat Sci*. 1986;1:54–75.

*Statistical Power Analysis for the Behavioral Sciences*. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum; 1988.

*Topics in Biostatistics*. Totowa, NJ: Humana Press; 2007.

*Biometrics*. 1979;35:69–85.

*BMJ*. 1997;315:364–366.
