Statistics From A (Agreement) to Z (z Score): A Guide to Interpreting Common Measures of Association, Agreement, Diagnostic Accuracy, Effect Size, Heterogeneity, and Reliability in Medical Research : Anesthesia & Analgesia

Special Articles: Special Article


Schober, Patrick MD, PhD, MMedStat*; Mascha, Edward J. PhD; Vetter, Thomas R. MD, MPH

Anesthesia & Analgesia 133(6):p 1633-1641, December 2021. | DOI: 10.1213/ANE.0000000000005773


Researchers reporting results of statistical analyses, as well as readers of manuscripts reporting original research, often seek guidance on how numeric results can be practically and meaningfully interpreted. With this article, we aim to provide benchmarks for cutoff or cut-point values and to suggest plain-language interpretations for a number of commonly used statistical measures of association, agreement, diagnostic accuracy, effect size, heterogeneity, and reliability in medical research. Specifically, we discuss correlation coefficients, Cronbach’s alpha, I², intraclass correlation (ICC), Cohen’s and Fleiss’ kappa statistics, the area under the receiver operating characteristic curve (AUROC, concordance statistic), standardized mean differences (Cohen’s d, Hedges’ g, Glass’ delta), and z scores. We base these cutoff values on what has been previously proposed by experts in the field in peer-reviewed literature and textbooks, as well as online statistical resources. We integrate, adapt, and/or expand previous suggestions in attempts to (a) achieve a compromise between divergent recommendations, and (b) propose cutoffs that we perceive as sensible for the field of anesthesia and related specialties. While our suggestions provide guidance on how the results of statistical tests are typically interpreted, this does not mean that the results can universally be interpreted as suggested here. We discuss the well-known inherent limitations of using cutoff values to categorize continuous measures. We further emphasize that cutoff values may depend on the specific clinical or scientific context. Rule-of-thumb approaches to the interpretation of statistical measures should therefore be used judiciously.

Science is the interpretation of nature, and man is the interpreter.

George Gore (1826–1908), English electrochemist

The Art of Scientific Discovery (1878)

Statistical methods are the cornerstone of the analysis of research data, and their results form the basis for practice change and decision-making in clinical medicine. To clearly communicate results of their analysis to a target audience, researchers often seek to translate the numeric results into plain language that is intuitively understandable and easily interpretable. For example, stating that a correlation between 2 variables is “very strong,” or reporting that the agreement between 2 raters is “poor,” is generally more comprehensible for most readers than simply numerical results. Conversely, when such plain-language interpretations are lacking, readers of research papers may appreciate benchmarks to guide them in the interpretation of the reported numerical values. Nevertheless, our strong recommendation is that such plain-language interpretations be used to complement, not substitute for, clearly reported numerical results.

In this article, we therefore aim to guide authors and readers of Anesthesia & Analgesia in the interpretation of a variety of commonly reported statistics. Our selection of statistical measures is by no means comprehensive. We focus on those that lend themselves to be categorized into a few distinct descriptors, based in most cases on a finite range of possible values, and those most likely to be relevant to the fields of anesthesia and related specialties. We also focus on the interpretation of these measures, electing only to briefly review the underlying statistical concepts. This article builds on a series of statistical tutorials as well as statistical grand round articles previously published in Anesthesia & Analgesia, to which we refer readers interested in a more in-depth coverage of the underlying principles.1–7


While we aim to provide guidance on how the results of statistical methods are typically interpreted, this does not mean that the results should universally be interpreted as suggested here. We instead stress that rule-of-thumb interpretations should be used judiciously. Using cutoff values or thresholds to categorize continuous measures is inherently problematic.8–10 For instance, values just above a threshold are interpreted differently from values just below the threshold, although they should, everything else being equal, have essentially the same interpretation. Much information is therefore lost by converting continuous measures or values into discrete categories, which explains why statisticians are reluctant to do so in an analysis. All cutoff values are innately arbitrary, are inconsistently applied in the literature, and often depend on the specific clinical or scientific context. Moreover, the absolute value of a measure, and thus its interpretation, often depends on several factors that must be considered, for example, the range of the assessed values (in correlation analyses),1 the number of items in a rating scale (for Cronbach’s alpha),11 and the number of categories12 as well as the prevalence of the attribute that is being scored (for kappa statistics).13–15

When interpreting results of statistical methods, researchers should also assess whether the choice of the respective measure is appropriate, confirm that key assumptions underlying the method have been met, and consider sources of bias.16 Any plain-language interpretation is likely misleading when the underlying process of collecting and/or analyzing the data is inappropriate. And even when data collection and analyses are appropriate, it is important to realize that a statistic is an estimate of a population parameter based on the given sample. Samples are invariably affected by sampling error, and the so-called point estimate observed in a sample (eg, the observed correlation coefficient) may not be very close to the actual population parameter of interest (eg, the true correlation in the underlying population). To account for this uncertainty, the interpretation of statistical measures should not focus only on the point estimate. Authors should also consider the entire range of the reported confidence interval (CI), which provides the best estimate of the plausible values of the population parameter.2
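To make the role of the confidence interval concrete, the following minimal Python sketch computes an approximate 95% CI for a Pearson correlation. It uses the Fisher z transformation, a standard textbook approach that this article does not itself describe; the function name `pearson_ci` and the sample size in the usage line are illustrative assumptions.

```python
from math import atanh, sqrt, tanh


def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate CI for a Pearson correlation via the Fisher z transformation.

    r      : observed correlation coefficient (strictly between -1 and +1)
    n      : number of paired observations (must exceed 3)
    z_crit : critical value of the standard normal (1.96 for a 95% CI)
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = atanh(r)                 # transform r to the z scale
    se = 1.0 / sqrt(n - 3)       # approximate standard error on the z scale
    # Back-transform the interval limits to the correlation scale.
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


# Hypothetical sample size, chosen only for illustration.
lower, upper = pearson_ci(0.47, n=400)
print(f"r = 0.47, 95% CI {lower:.2f} to {upper:.2f}")
```

If the two limits of such an interval fall into different descriptive categories, the plain-language interpretation should reflect that uncertainty rather than the category of the point estimate alone.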

Table 1. Common Measures and Suggested Interpretations
Each measure is followed by rows giving the magnitude of the measure and the suggested interpretation.

Correlation coefficients
  <0.10        Negligible correlation
  0.10–0.39    Weak correlation
  0.40–0.69    Moderate correlation
  0.70–0.89    Strong correlation
  ≥0.90        Very strong correlation

Coefficient of determination (R²)
  <1%          Negligible % variance explained
  1%–15%       Small % variance explained
  16%–48%      Moderate % variance explained
  49%–80%      Substantial % variance explained
  ≥81%         Very high % variance explained

Cronbach’s alpha (α)
  <0.60        Poor reliability
  0.60–0.69    Questionable reliability
  0.70–0.79    Acceptable reliability
  0.80–0.89    Good reliability
  ≥0.90        Very good reliability

I²
  <10%         Negligible heterogeneity
  10%–39%      Low heterogeneity
  40%–59%      Moderate heterogeneity
  60%–89%      High heterogeneity
  ≥90%         Very high heterogeneity

Intraclass correlation and kappa statistics (a)
  <0.40        Poor agreement
  0.40–0.54    Weak agreement
  0.55–0.69    Moderate agreement
  0.70–0.84    Good agreement
  0.85–1.00    Excellent agreement

Area under the receiver operating characteristic curve
  0.50         Chance discrimination
  0.51–0.59    Very poor discrimination
  0.60–0.69    Poor discrimination
  0.70–0.79    Moderate discrimination
  0.80–0.89    Good discrimination
  ≥0.90        Excellent discrimination

Standardized mean difference (Cohen’s d, Hedges’ g, Glass’ Δ)
  <0.10        Trivial effect
  0.10–0.34    Small effect
  0.35–0.64    Medium effect
  0.65–1.19    Large effect
  ≥1.20        Very large effect

Z scores (or z statistics)
  <2.0         May be normal
  2.0–2.9      Unusual observation
  ≥3.0         Highly unusual observation

For correlation coefficients, standardized mean differences, and z scores, absolute values are presented.
(a) Kappa statistics measure agreement beyond that expected by chance, not “agreement” per se.

Table 2. Summary, Indication, Example Reporting, and Interpretation of Common Measures

Correlation coefficients
  What it measures: The strength and direction of association between 2 continuous, normally distributed variables (Pearson correlation) or nonnormal or ranked variables (Spearman correlation).
  Indication: Observational (usually nonexperimental) studies assessing the relationship between 2 continuous or ordinal variables without adjusting for other variables (such as confounding factors).
  Example of proper reporting: “Pearson correlation (95% CI) between patient age and BMI was 0.47 (0.40–0.54).” “Spearman rank-order correlation (95% CI) between VRS pain score and Likert-scaled (0–5) patient satisfaction was −0.73 (−0.65 to −0.81).”
  Example of further interpretation: “Patient age and BMI showed a moderate positive correlation.” “As pain score decreased, patient satisfaction tended to increase, implying a strong negative correlation.”

Coefficient of determination (R²)
  What it measures: The proportion or percentage of the variance of one variable that is explained by another.
  Indication: In conjunction with Pearson correlation analysis, and relatedly in simple linear regression.
  Example of proper reporting: “Pearson correlation between age and BMI was 0.47, corresponding to an R² of 0.22. Age was thus found to explain about 22% of the variation in BMI.”
  Example of further interpretation: “Age was found to explain only a moderate percentage of the variation in BMI.”

Cronbach’s alpha (α)
  What it measures: The reliability (or internal consistency) of the items in a multi-item rating scale or instrument. Measures the degree of correlation among the items in the instrument, and is sensitive to the number of items. Cronbach’s alpha close to 1.0 may mean some items are redundant and can be removed.
  Indication: When constructing a new questionnaire or instrument to measure some underlying construct(s). In studies developing, refining, or evaluating rating scales.
  Example of proper reporting: “The questionnaire comprised 10 Likert-scaled items measuring patient satisfaction with care, and had a Cronbach’s alpha of 0.85.”
  Example of further interpretation: “The questionnaire comprised 10 Likert-scaled items measuring patient satisfaction with care. Cronbach’s alpha was 0.85, indicating good internal consistency among the items.”

I²
  What it measures: The degree of heterogeneity in treatment effects in meta-analysis. Represents the percentage of between-study variation in effect sizes attributable to true heterogeneity rather than sampling error.
  Indication: In meta-analysis to quantify between-study variability in effect size.
  Example of proper reporting: “13 studies included in the meta-analysis had an I² of 70%.”
  Example of further interpretation: “13 studies included in the meta-analysis had an I² of 70%, suggesting high heterogeneity in the true treatment effect across studies.”

Intraclass correlation and kappa statistics: intraclass correlation
  What it measures: The agreement among 2 or more raters (interrater reliability), or among repeated ratings from the same individual (eg, test-retest reliability), as well as correlation within clusters.
  Indication: Assessing interrater or intrarater reliability on continuous or ordinal variables. Establishing the reliability of an end point for a clinical trial, for example.
  Example of proper reporting: “Interrater reliability on NIH stroke scale was assessed by 5 raters independently scoring each of 20 stroke patient videos, then reporting the intraclass correlation.”
  Example of further interpretation: “Interrater reliability on the NIH stroke scale was excellent, with estimated intraclass correlation (95% CI) of 0.90 (0.85–0.95).”

Intraclass correlation and kappa statistics: kappa statistics
  What it measures: Intrarater or interrater agreement beyond that expected by chance for either nominal or ordinal outcomes.
  Indication: Assessing interrater or intrarater reliability on nominal or ordinal variables.
  Example of proper reporting: “The 2 raters agreed (both ‘yes’ or both ‘no’) on 80% of patients, with kappa (95% CI) of 0.60 (0.50–0.70).”
  Example of further interpretation: “With kappa (95% CI) of 0.60 (0.50–0.70) there was moderate agreement beyond that expected by chance alone.”

Area under ROC curve
  What it measures: The ability of a predictor variable to discriminate between those who have and do not have a condition/disease. As such, measures the “accuracy” of the predictor. A function of sensitivity and specificity.
  Indication: In diagnostic testing to assess accuracy of a variable or a statistical model. In prediction to help identify cutpoints maximizing sensitivity and specificity.
  Example of proper reporting: “Intraoperative time-weighted average mean arterial pressure had an area under the ROC curve (95% CI) of 0.75 (0.70–0.80) for predicting 30-d mortality.”
  Example of further interpretation: “… had an area under the ROC curve (95% CI) of 0.75 (0.70–0.80) for predicting 30-d mortality, indicating moderate discriminative ability.”

Standardized mean difference (Cohen’s d, Hedges’ g, Glass’ Δ)
  What it measures: A treatment effect or difference between groups in standard deviation units instead of the raw units of the variable, thus allowing comparison to other effect sizes which may not have the same units.
  Indication: In randomized trials to assess balance between groups at baseline (eg, absolute standardized difference); in meta-analyses when studies measure outcomes on different scales; in sample size calculations. Handles continuous, ordinal, and binary data.
  Example of proper reporting: “Randomized groups were well balanced on baseline variables, evidenced by all absolute standardized differences below 0.10 (Table X).” “Regional anesthesia reduced pain score versus GA across studies, with weighted standardized mean difference (95% CI) of −1.2 (−0.81 to −1.6).”
  Example of further interpretation: “Randomized groups were well balanced on baseline variables, with all absolute standardized differences below 0.10, indicating only trivial differences.” “… with weighted standardized mean difference (95% CI) of −1.2 (−0.81 to −1.6), suggesting a very large effect.”

Z scores and z statistics
  What it measures: The distance of a data point from the sample mean in standard deviation units. Useful especially for normally distributed variables since 95% of observations are expected to fall between −2 and +2. Also key in statistical inference as a “z statistic” that quantifies the observed effect size and allows calculation of the P value.
  Indication: In a study sample, to help identify individual outlier or extreme data points with respect to the sample mean.
  Example of proper reporting: “After correcting several data points with z scores >3 or <−3, the data were more normally distributed.”
  Example of further interpretation: “After correcting several highly unusual data points with z scores >3 or <−3, …”

Abbreviations: BMI, body mass index; CI, confidence interval; GA, general anesthesia; NIH, National Institutes of Health; ROC, receiver operating characteristic; VRS, verbal rating scale.

In the following sections, we provide guidance on how a variety of statistical measures can typically be interpreted, but we again recommend careful use. In the absence of irrefutable evidence to define threshold values for rule-of-thumb interpretations, we base cutoff values on what has been previously proposed by experts in the field in peer-reviewed literature and textbooks, as well as online statistical resources. We integrate, adapt, and/or expand previous suggestions in an attempt to achieve a compromise between divergent recommendations, using our experience as researchers, statisticians, and clinicians to propose cutoff values that we believe to be sensible for the field of anesthesia and related specialties. Importantly, authors should never only provide a plain-language interpretation—they must always report their numerical results. It is the authors’ responsibility to be thoughtful in their interpretation. It is the duty of the editors, reviewers, and readers to gauge whether a particular interpretation is reasonable in its specific context. Table 1 lists the common measures and our suggested interpretation. Table 2 explains what is being measured as well as when and how to report results, including the interpretation.


Correlation coefficients quantify the strength of the association between 2 variables.1,17 They are ubiquitously reported in the anesthesia literature. Pearson’s correlation coefficient describes the strength of a linear relationship, assuming that both variables are continuous and approximately normally distributed. Spearman’s rank correlation makes no assumption about the data distribution other than that the data can be ranked in a meaningful way. It describes the strength of a monotonic relationship: the value of one variable consistently increases or decreases as the value of the other variable increases, but not necessarily in a linear way. Other types of correlation, like Kendall correlation, point-biserial correlation, or polychoric correlation, are less commonly found in the anesthesia literature, but they can be interpreted in a manner similar to Pearson’s correlation coefficients and Spearman’s rank coefficients. Intraclass correlation (ICC) is covered as a separate topic later.

Correlation coefficients discussed here generally range from −1 to +1.17 A positive value indicates that the value of one variable tends to increase as the value of the other increases, whereas a negative value indicates an inverse relationship. Absolute values increasingly closer to 1 indicate an increasingly stronger relationship. Suggested cutoff values shown in Table 1 are based on literature reporting on the appropriate use and interpretation of correlation coefficients.1

The squared Pearson correlation (or R², coefficient of determination) is also commonly reported. It is interpreted as the proportion (or percentage) of the variance of one variable explained by the other. For example, the 5 categories of correlation coefficients (referring to Pearson correlation) in Table 1 correspond to variance explained of <1%, 1% to 15%, 16% to 48%, 49% to 80%, and ≥81%, respectively, and can be interpreted as a “negligible,” “small,” “moderate,” “substantial,” and “very high” amount of variance explained.
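The correspondence between the correlation categories and the variance-explained categories can be checked with a short Python sketch. The function names are illustrative assumptions, and the category boundaries are treated as half-open intervals (for example, 0.10 up to but not including 0.40), which is one reasonable reading of the ranges in Table 1.

```python
def interpret_correlation(r: float) -> str:
    """Map a correlation coefficient to the Table 1 descriptor (uses |r|)."""
    a = abs(r)
    if a < 0.10:
        return "negligible"
    if a < 0.40:
        return "weak"
    if a < 0.70:
        return "moderate"
    if a < 0.90:
        return "strong"
    return "very strong"


def percent_variance_explained(r: float) -> float:
    """Squared Pearson correlation expressed as a percentage."""
    return 100.0 * r * r


# Squaring the correlation cutoffs reproduces the R-squared cutoffs.
for cutoff in (0.10, 0.40, 0.70, 0.90):
    print(f"r = {cutoff:.2f} -> {percent_variance_explained(cutoff):.0f}% of variance")

# The age-versus-BMI example from Table 2.
print(interpret_correlation(0.47), round(percent_variance_explained(0.47)))
```

The loop shows why the variance-explained categories begin at 1%, 16%, 49%, and 81%: these are simply the squared correlation cutoffs.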


Cronbach’s alpha (α) is a measure of the reliability, and more specifically, the internal consistency, of a multi-item rating scale.7 Rating scales are widely used in psychology and social sciences to address so-called latent constructs that are not directly measurable—like self-esteem, anxiety, depression, somatization, catastrophizing, or resilience. A common example is the Likert-type rating scale, in which raters or respondents classify their observations, perceptions, attitudes, knowledge, performance, etc, regarding a number of test or survey items using descriptors for each item such as “strongly agree,” “agree,” “neutral,” “disagree,” or “strongly disagree.” In anesthesia and the related specialties of perioperative medicine, pain medicine, and palliative care, psychometric rating scales are increasingly germane due to an increasing emphasis on aspects of patient well-being that are not directly measurable, including patient satisfaction, quality of recovery, or health-related quality of life.

In studies developing, refining, or evaluating rating scales, despite some important limitations,18 Cronbach’s alpha is the most commonly reported measure of internal consistency. Main assumptions of Cronbach’s alpha are that all scale items are continuous and normally distributed, all scale items refer to the same underlying latent construct, and each item equally contributes to the total scale score (called tau equivalence).18 When all scale items reflect the same construct, different subsets of the entire scale items should produce consistent results. Cronbach’s alpha measures the degree of correlation or interrelatedness of scale items.11 It is a function of the number of scale items, the average covariance between pairs of items, and the variance of the total score. Its value typically ranges from 0 to 1, with a value closer to 1 indicating greater internal consistency. However, the value of Cronbach’s alpha is sensitive to the number of scale items (more items increases Cronbach’s alpha, all else being the same), and very high values can simply indicate high redundancy of the test items.
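As a minimal sketch of how the ingredients named above combine, the following Python function computes Cronbach’s alpha from the number of items, the item variances, and the variance of the total score, using the common textbook formula α = k/(k − 1) × (1 − Σ item variances / variance of total score). The function name and the small data set are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha for a multi-item scale.

    items: one list of scores per scale item; position i in every list
           must refer to the same respondent.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least 2 items are required")
    # Total scale score per respondent (sum across items).
    totals = [sum(scores) for scores in zip(*items)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total score has zero variance")
    sum_item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_var / total_var)


# Hypothetical scores of 5 respondents on a 3-item scale.
scale = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [1, 3, 2, 5, 4],
]
print(f"alpha = {cronbach_alpha(scale):.2f}")  # about 0.84 for these data
```

Because the total-score variance grows with the covariance between items, items that move together push alpha toward 1, which is also why highly redundant items can inflate it.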

While the plain-language interpretation of Cronbach’s alpha varies considerably in literature, a value ≥0.7 is usually considered “acceptable.”19,20 Of note, this is mainly true when the scale is used for research purposes (eg, to compare patient satisfaction between 2 groups of patients). When the scale is used for clinical assessments, a higher value (≥0.9) is typically needed.21,22 Thus, while we consider a value of ≥0.7 “acceptable,” because rating scales are mostly developed for research purposes, we stress that “acceptable” or even “good” reliability may not be good enough when the scale is used for clinical decision-making in patient care.


In a meta-analysis, the variation in observed effect sizes between the included studies is due to random sampling error as well as true variation in effect sizes.23 The true variation in effect sizes across studies is referred to as heterogeneity. The I² statistic is commonly reported to quantify this heterogeneity. I² represents the percentage of the total between-study variation in effect sizes that is attributable to heterogeneity rather than sampling error.4,24 We note that I² is a function of the classical measure of heterogeneity, Cochran’s Q. I² is considered a better summary measure because it does not depend on the number of included studies.

Being a percentage, I² is thus a relative measure of heterogeneity; it does not quantify the magnitude of effect size differences across studies in absolute terms. I² values range from 0% to 100%. In their original description of the I² statistic, Higgins et al24 considered values of 25%, 50%, and 75% as low, moderate, and high heterogeneity, respectively. The cutoff values proposed in Table 1 are based on this original proposal, as well as the Cochrane Handbook for Systematic Reviews of Interventions.25
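The relationship between Cochran’s Q and the heterogeneity percentage can be sketched in Python as follows. The sketch assumes the commonly used definition, 100% × (Q − df)/Q with df equal to the number of studies minus 1 and negative values truncated at 0, and computes Q with inverse-variance weights; the function names and the Q value in the usage line are illustrative assumptions.

```python
def cochran_q(effects: list[float], variances: list[float]) -> float:
    """Cochran's Q using inverse-variance weights."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching effects and variances for >= 2 studies")
    weights = [1.0 / v for v in variances]
    # Inverse-variance weighted pooled effect.
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q: float, n_studies: int) -> float:
    """Heterogeneity percentage derived from Q; truncated at 0%."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical Q of 40 from 13 studies (12 degrees of freedom).
print(f"{i_squared(40.0, 13):.0f}%")  # 70%
```

When Q is no larger than its degrees of freedom, the observed between-study variation is no more than expected from sampling error alone, and the result is reported as 0%.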


The agreement of quantitative data that share the same metric or measurement instrument, like the interrater or intrarater reliability of rating scales, is frequently quantified by an ICC coefficient.5,26 There are at least 10 versions of the ICC,27 and the choice of the most appropriate coefficient depends on several factors. These factors include whether all assessments are performed by the same raters or by different raters; whether or not the raters are considered a random sample; whether the primary interest is in individual ratings or mean ratings; and whether absolute agreement or consistency is being assessed.28

ICC coefficients typically range from 0 to 1 and can generally be thought of as the ratio of the between subject (or rater) variance to the total variance. Interpretations of ICC values are often based on the cutoff points proposed by Landis and Koch29 or the slight adaptation suggested by Altman.30 However, these cutoff values may be too lenient for health care research.31 Of note, these thresholds were originally intended to classify kappa statistics (described in the next section) from poor to almost perfect agreement.
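The variance-ratio idea can be illustrated with one of the simplest versions of the coefficient. The Python sketch below estimates a one-way random-effects, single-rating ICC from the between-subject and within-subject mean squares; choosing this particular version is an assumption made for illustration only, and the appropriate version for a given study depends on the factors listed above. The function name and data are hypothetical.

```python
def icc_oneway(ratings: list[list[float]]) -> float:
    """One-way random-effects, single-rating ICC.

    ratings: one list per subject, each holding k ratings of that subject.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 subjects, each with the same k >= 2 ratings")
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]
    # Between-subject mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square (disagreement among ratings of one subject).
    msw = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical data: 4 subjects, each scored by 2 raters.
scores = [[1, 2], [2, 3], [3, 4], [4, 5]]
print(f"ICC = {icc_oneway(scores):.2f}")  # about 0.74 for these data
```

The estimate approaches 1 when ratings of the same subject barely differ relative to the differences between subjects.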

As the weighted kappa is a special case of the ICC,32 it seems reasonable to interpret ICC and kappa statistics in a similar fashion. The guidelines found in the psychology literature also suggest a common interpretation for both types of statistics.20 We adopted this approach in Table 1, and we propose the same threshold values and interpretations for ICC and kappa statistics. Our proposed cutoff values represent a pragmatic compromise between the more lenient interpretations by Landis and Koch29 and Altman,30 and the more stringent interpretation proposed by McHugh.31

As discussed earlier, Cronbach’s alpha is also a special case of the ICC,27 and it thus should arguably have the same interpretation. However, as the contexts in which ICC and Cronbach’s alpha are typically used differ considerably, their interpretations cannot be meaningfully lumped together.


Cohen’s kappa (κ) statistic quantifies the degree of agreement beyond that expected by chance when 2 raters (observers) classify items into mutually exclusive categories.5 For example, when 2 examiners rate whether anesthesiology residents pass or fail an examination, the interrater agreement can be described using Cohen’s kappa. For more than 2 raters, Fleiss’ kappa is typically used. A weighted version of Cohen’s kappa can be used for ordinal items like the American Society of Anesthesiologists (ASA) physical status classification system score. Whereas Cohen’s kappa treats all disagreement equally, the weighted kappa statistic weighs disagreements differently depending on how far apart the disagreeing values are on the ordinal scale.5

Like the ICC, kappa has an upper limit of +1, indicating perfect agreement beyond chance, but unlike the ICC, kappa has a lower limit of −1, indicating agreement far less than expected by chance. A kappa of 0 occurs when the observed agreement is the same as expected by chance. As described in the previous section, kappa values are often interpreted similarly as ICC values, and we have adopted this approach in Table 1. However, it is important to understand that the kappa statistic is not a measure of absolute agreement but quantifies agreement beyond chance, and it is therefore sensitive to the prevalence of the attribute being scored.13 With a high prevalence (eg, the majority of candidates pass the anesthesiology examination in the above example), the expected agreement is high, and kappa values can accordingly be rather low despite good or even excellent observed agreement.5 Because of this characteristic of the kappa statistic and because kappa is directly a function of the observed and expected agreement, we strongly encourage authors to report kappa values along with both the number of categories, and the observed and expected agreement.
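The dependence of kappa on prevalence can be illustrated with a short Python sketch for 2 raters and 2 categories, using the standard definition kappa = (observed agreement − expected agreement) / (1 − expected agreement). The function name and both data sets are hypothetical; each describes 100 candidates rated pass or fail by 2 examiners, with 80% observed agreement in both cases.

```python
def cohen_kappa_2x2(both_yes: int, yes_no: int, no_yes: int, both_no: int):
    """Observed agreement, expected agreement, and Cohen's kappa.

    both_yes : rater 1 "yes" and rater 2 "yes"
    yes_no   : rater 1 "yes" and rater 2 "no"
    no_yes   : rater 1 "no"  and rater 2 "yes"
    both_no  : rater 1 "no"  and rater 2 "no"
    """
    n = both_yes + yes_no + no_yes + both_no
    if n == 0:
        raise ValueError("no observations")
    p_observed = (both_yes + both_no) / n
    p_yes_1 = (both_yes + yes_no) / n   # proportion "yes" for rater 1
    p_yes_2 = (both_yes + no_yes) / n   # proportion "yes" for rater 2
    # Agreement expected by chance from the raters' marginal proportions.
    p_expected = p_yes_1 * p_yes_2 + (1 - p_yes_1) * (1 - p_yes_2)
    if p_expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, p_expected, kappa


# Balanced prevalence: about half of the candidates pass.
print(cohen_kappa_2x2(40, 10, 10, 40))  # observed 0.80, expected 0.50, kappa 0.60
# High prevalence: most candidates pass.
print(cohen_kappa_2x2(75, 10, 10, 5))   # observed 0.80, expected 0.745, kappa about 0.22
```

With identical observed agreement, kappa drops from “moderate” to “poor” according to Table 1 solely because the expected agreement is higher, which is why reporting the observed and expected agreement alongside kappa is informative.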


Receiver operating characteristic (ROC) analysis is regularly used to estimate the accuracy of a diagnostic test in which subjects are dichotomized as having (“diseased”) or not having (“healthy”) a condition of interest, based on the observed value of some continuous biomarker.6 More broadly, ROC analysis is used to evaluate the predictive performance of statistical models to predict a binary outcome, such as with a logistic regression model.33,34

An ROC curve is a plot of the true positive rate (sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis across different observed cut-point values for the continuous measured variable.6 The area under the curve (AUC) of the ROC curve, also known as the concordance statistic or c statistic, quantifies the diagnostic accuracy, or the accuracy of binary regression model predictions. An AUC of 1 indicates perfect accuracy, 0.5 corresponds to classification by random chance (tossing a coin to classify patients as healthy or diseased), and values <0.5 indicate an accuracy worse than chance. Gorunescu35 proposed cut-point values to guide interpretation, and these or very similar thresholds and interpretations are regularly found in the literature and online resources. In Table 1, we present a version that we slightly modified to obtain mutually exclusive categories.
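The concordance interpretation of the AUC can be made concrete with a small Python sketch: the AUC equals the proportion of all diseased-healthy pairs in which the diseased subject has the higher biomarker value, with ties counted as one half. The sketch assumes that higher values indicate disease; the function name and data are hypothetical.

```python
def auroc(diseased: list[float], healthy: list[float]) -> float:
    """Area under the ROC curve computed as the concordance statistic."""
    if not diseased or not healthy:
        raise ValueError("both groups must contain at least 1 value")
    concordant = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                concordant += 1.0   # pair ranked correctly
            elif d == h:
                concordant += 0.5   # tie counts as half
    return concordant / (len(diseased) * len(healthy))


# Hypothetical biomarker values.
print(auroc([5, 6], [1, 2]))   # 1.0: perfect separation
print(auroc([1, 2], [1, 2]))   # 0.5: no better than chance
```

For large samples this pairwise loop is slow, and rank-based formulas or an established statistical package would normally be preferred.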


In the anesthesia literature, mean differences between groups are usually reported as unstandardized differences in the original unit of measurement (eg, a mean difference in systolic blood pressure of 15 mm Hg) because these units have an intrinsic meaning.2,36 In other fields of study like psychology or social sciences, the scales often do not have an intrinsic unit of measure. Therefore, the effect size is typically reported in terms of the standardized mean difference (SMD), using Cohen’s d, Hedges’ g, or Glass’ Δ. An SMD is the difference in means (or proportions or ranks) between 2 groups, divided by the standard deviation.2,37 In plain language, an SMD of 1 indicates that the means of both groups differ by 1 standard deviation. The SMD statistics differ in the type of standard deviation that is used, with the pooled standard deviation across groups being most common. Still, they are interpreted in essentially the same way.
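As a minimal illustration of how the choice of standard deviation distinguishes these statistics, the Python sketch below computes Cohen’s d with the pooled standard deviation and Glass’ Δ with the standard deviation of the control group only. The function names and data are hypothetical; the small-sample correction that distinguishes Hedges’ g is omitted for brevity.

```python
from math import sqrt
from statistics import mean, stdev, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def glass_delta(treatment: list[float], control: list[float]) -> float:
    """Mean difference divided by the standard deviation of the control group."""
    return (mean(treatment) - mean(control)) / stdev(control)


# Hypothetical outcome scores in 2 small groups.
treated = [2, 4, 6]
controls = [1, 3, 5]
print(cohens_d(treated, controls))     # 0.5: means differ by half a pooled SD
print(glass_delta(treated, controls))  # 0.5 here because both groups have the same SD
```

The two statistics diverge when the group standard deviations differ, for example when the treatment changes the variability of the outcome as well as its mean.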

Even if unstandardized differences (eg, treatment effect expressed as differences in means or proportions) are much more common in anesthesia literature, an SMD is still often encountered. For example, a meta-analysis often uses an SMD to pool or aggregate data when the outcome had been measured with different scales in the included studies.38 Probably the most frequent use of the SMD in anesthesia literature, however, is to assess baseline balance between groups in a randomized controlled trial,39 or before and after propensity score matching or weighting of the study groups.3,40

Cohen41 originally proposed that an SMD of 0.2, 0.5, and 0.8 corresponds to a small, medium, or large effect size, respectively. This interpretation is widely accepted in the literature. It has been further expanded by other authors, for example, to include very small, very large, or huge effect sizes or differences.42,43 In the context of assessing the balance between study groups, an SMD of <0.1 conventionally indicates appropriate balance,3,39,44,45 and it can thus, for all practical purposes in anesthesia research, be considered a trivial difference between the study groups. We therefore adapted Cohen’s proposed interpretation by adding categories indicating trivial or very large differences, and by adding cutoff values to define distinct categories, in line with the approach chosen for other statistics reported in Table 1.


A z score describes the distance of a data point from the mean in units of the standard deviation. In other words, a z score of 0 is at the mean, a z score of 2 is 2 standard deviations above the mean, and a z score of −1 is 1 standard deviation below the mean. While z scores are usually not explicitly reported in the anesthesia literature, they are implicitly used in most anesthesia research.

When making inferences from a sample to a population, we rely heavily on the analogous z statistics and related t statistics, as these form the basis for statistical inference in a variety of hypothesis tests. These statistics measure the distance between a sample mean (or difference in means) and the null hypothesis value, which is usually 0. Results of regression analyses are occasionally reported as standardized regression coefficients, which are coefficients that would be obtained if all variables in the model were converted to z scores prior to the analysis. A related concept, describing mean differences in units of standard deviation, was described in the previous section.

While z scores theoretically have an infinite range, the probability of observing absolute values of >5 is virtually 0, and even values >3 are quite unlikely, because when data are normally distributed, 99.7% of all data points fall within ±3 standard deviations of the mean. Generally, the higher the absolute value, the lower the probability of observing a value of at least that magnitude. A z score is thus a convenient way of assessing how usual or unusual a particular data point is either compared to other data points from the same sample (to screen for potential outliers under certain conditions46) or compared to a reference population (to identify abnormal lab values in clinical care).
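The probabilities mentioned above can be reproduced with a few lines of Python, assuming a normal distribution. The sketch uses the identity that the probability of an absolute z score exceeding a given value equals the complementary error function evaluated at that value divided by √2; the function names and the example values are illustrative assumptions.

```python
from math import erfc, sqrt


def z_score(value: float, mean: float, sd: float) -> float:
    """Distance of a value from the mean in standard deviation units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (value - mean) / sd


def two_sided_tail_probability(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return erfc(abs(z) / sqrt(2.0))


for z in (2, 3, 5):
    print(f"|z| >= {z}: probability {two_sided_tail_probability(z):.7f}")

# Hypothetical example: a value of 130 where the reference mean is 100 and SD is 15.
print(z_score(130, mean=100, sd=15))  # 2.0
```

Absolute z scores of at least 2 have a probability of about 4.6%, those of at least 3 about 0.27%, and those of at least 5 less than 1 in a million, which is consistent with the 99.7% figure for ±3 standard deviations.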

For example, z scores are commonly used in growth and weight charts to identify abnormal child development, and in age-adjusted body mass index (BMI) values, with z scores >+2 or <−2 being interpreted as overweight or underweight, respectively.47 Similarly, reference ranges of many laboratory tests are defined such that they include the middle 95% of a healthy reference population,48 corresponding approximately to the range of z scores from −2 to +2 for normally distributed variables. Based on the common interpretation that absolute z scores <2 are within some “normal” range, and that values ≥3 have a very low probability of occurring, we interpret absolute z scores ≥2 as unusual values, and ≥3 as highly unusual values in Tables 1 and 2.

We note, however, that a z score between −2 and +2 does not necessarily mean that the observation is “normal.” For example, a BMI that is 1 standard deviation above the mean (corresponding to a z score of +1) indicates risk of overweight in children.47 Moreover, a marked increase or decrease in laboratory test values from one measurement to the next—for example, a decrease in hemoglobin value over time—may also be clinically concerning, even when the laboratory test values are still within the “normal” range.

Conversely, z scores above +2 or below −2 do not automatically imply that the values represent some abnormality. In fact, it is entirely expected to observe such z scores in a normal distribution, as approximately 5% of all the data points lie outside the range of −2 to +2 z scores. Moreover, when many data points are examined for unusual values (for example, when multiple laboratory tests are simultaneously performed in 1 patient), the probability of observing at least 1 unusual value by chance markedly increases.49 Furthermore, typical normal ranges based on healthy patients need to be used with caution by clinicians. It is important to note that when a reference or “normal” range is determined for a population based on z scores for normal or healthy patients, it should not be assumed that the values of 2 and −2 can be used as cut-points to distinguish “normal” from “abnormal” patients just because 95% of “normal” patients will be in that range. Instead, a cut-point to best distinguish normal from abnormal patients must incorporate the distributions of both normal and abnormal patients, and their overlap. This is done through diagnostic accuracy studies in which a cut-point might be chosen that maximizes a combination of sensitivity and specificity.6,50
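Two of the points above lend themselves to a brief numeric illustration. The Python sketch below first computes the probability that at least 1 of several test results falls outside a 95% reference range purely by chance, under the simplifying assumption that the tests are independent. It then selects a cut-point by maximizing Youden's index (sensitivity + specificity − 1), which is one common criterion and is used here as an assumption; it is not necessarily the approach taken in the cited references. The function names and the example values are hypothetical.

```python
def prob_at_least_one_outside(n_tests, coverage=0.95):
    """Chance that at least 1 of n independent results lies outside the range."""
    return 1.0 - coverage ** n_tests


def youden_cut_point(healthy, diseased):
    """Return (cut, J) maximizing J = sensitivity + specificity - 1.

    Assumes higher values indicate disease and that a value >= cut is
    classified as test-positive. Ties are resolved in favor of the lowest cut.
    """
    best_cut, best_j = None, -1.0
    for cut in sorted(set(healthy) | set(diseased)):
        sensitivity = sum(v >= cut for v in diseased) / len(diseased)
        specificity = sum(v < cut for v in healthy) / len(healthy)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


for n in (1, 5, 10, 20):
    print(f"{n} tests: {prob_at_least_one_outside(n):.3f}")

# Hypothetical measurements, for illustration only.
cut, j = youden_cut_point(healthy=[1, 2, 3, 4], diseased=[3, 4, 5, 6])
print(f"cut-point {cut}, Youden index {j}")
```

With 20 independent tests, the chance of at least 1 result outside the reference range is roughly 64%. The cut-point search uses the values of both groups, which reflects the point that a useful threshold depends on the overlap of the two distributions and not only on the range observed in healthy patients.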


In this article, we discuss various commonly reported statistical measures of association, agreement, diagnostic accuracy, effect size, heterogeneity, and reliability in medical research. Our intention is to inform and to guide researchers and clinicians on how the numeric results of these statistics can typically be interpreted. We stress the limitations of categorizing continuous measures into distinct categories, emphasize that the interpretation must always consider the specific context, and caution readers to use rule-of-thumb interpretations judiciously. Readers interested in in-depth coverage of the underlying statistical principles, and those interested in developing a deeper understanding of how the measures are appropriately interpreted under various conditions, are encouraged to delve into the literature referenced throughout this article.


Name: Patrick Schober, MD, PhD, MMedStat.

Contribution: This author helped write and revise the manuscript.

Name: Edward J. Mascha, PhD.

Contribution: This author helped write and revise the manuscript.

Name: Thomas R. Vetter, MD, MPH.

Contribution: This author helped write and revise the manuscript.

This manuscript was handled by: Jean-Francois Pittet, MD.


ASA = American Society of Anesthesiologists
AUC = area under the curve
AUROC = area under the receiver operating characteristic curve
BMI = body mass index
CI = confidence interval
ICC = intraclass correlation
GA = general anesthesia
NIH = National Institutes of Health
ROC = receiver operating characteristic
SMD = standardized mean difference
VRS = verbal rating scale


1. Schober P, Boer C, Schwarte LA. Correlation coefficients: appropriate use and interpretation. Anesth Analg. 2018;126:1763–1768.
2. Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: what do P values and confidence intervals really represent? Anesth Analg. 2018;126:1068–1072.
3. Schulte PJ, Mascha EJ. Propensity score methods: theory and practice for anesthesia research. Anesth Analg. 2018;127:1074–1084.
4. Vetter TR. Systematic review and meta-analysis: sometimes bigger is indeed better. Anesth Analg. 2019;128:575–583.
5. Vetter TR, Schober P. Agreement analysis: what he said, she said versus you said. Anesth Analg. 2018;126:2123–2128.
6. Vetter TR, Schober P, Mascha EJ. Diagnostic testing and decision-making: beauty is not just in the eye of the beholder. Anesth Analg. 2018;127:1085–1091.
7. Vetter TR, Cubbin C. Psychometrics: trust, but verify. Anesth Analg. 2019;128:176–181.
8. Naggara O, Raymond J, Guilbert F, Roy D, Weill A, Altman DG. Analysis by categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms. AJNR Am J Neuroradiol. 2011;32:437–440.
9. Ragland DR. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology. 1992;3:434–440.
10. Subramanian V, Mascha EJ, Kattan MW. Developing a clinical prediction score: comparing prediction accuracy of integer scores to statistical regression models. Anesth Analg. 2021;132:1603–1613.
11. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
12. Brenner H, Kliebsch U. Dependence of weighted kappa coefficients on the number of categories. Epidemiology. 1996;7:199–202.
13. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005;85:257–268.
14. Cook RJ. Kappa and its dependence on marginal rates. In: Armitage P, Colton T, eds. The Encyclopedia of Biostatistics. John Wiley & Sons, 1998:2166–2168.
15. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol. 1987;126:161–169.
16. Vetter TR, Mascha EJ. Bias, confounding, and interaction: lions and tigers, and bears, oh my! Anesth Analg. 2017;125:1042–1048.
17. Schober P, Vetter TR. Correlation analysis in medical research. Anesth Analg. 2020;130:332.
18. McNeish D. Thanks coefficient alpha, we’ll take it from here. Psychol Methods. 2018;23:412–433.
19. Taber KS. The use of Cronbach’s alpha when developing and reporting research instruments in science education. Res Sci Educ. 2018;48:1273–1296.
20. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994;6:284–290.
21. Bland JM, Altman DG. Cronbach’s alpha. BMJ. 1997;314:572.
22. Nunnally JC, Bernstein IH. The assessment of reliability. In: Psychometric Theory. 3rd ed. McGraw-Hill, 1994:248–292.
23. Schober P, Vetter TR. Meta-analysis in clinical research. Anesth Analg. 2020;131:1090–1091.
24. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327:557–560.
25. Higgins JP, Thomas J. Analysing data and undertaking meta-analyses. In: Higgins JP, Thomas J, eds. Cochrane Handbook for Systematic Reviews of Interventions. 2nd ed. Wiley-Blackwell, 2019:241–284.
26. Rousson V, Gasser T, Seifert B. Assessing intrarater, interrater and test-retest reliability of continuous measurements. Stat Med. 2002;21:3431–3446.
27. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46.
28. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155–163.
29. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
30. Altman DG. Some common problems in medical research. In: Practical Statistics for Medical Research. Chapman & Hall/CRC, 1991:396–439.
31. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22:276–282.
32. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educat Psychol Measurement. 1973;33:613–619.
33. Zou KH, O’Malley AJ, Mauri L. Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation. 2007;115:654–657.
34. Schober P, Vetter TR. Logistic regression in medical research. Anesth Analg. 2021;132:365–366.
35. Gorunescu F. Classification performance evaluation. In: Data Mining: Concepts, Models and Techniques. Springer-Verlag, 2011:319–330.
36. Schober P, Vetter TR. Effect size measures in clinical research. Anesth Analg. 2020;130:869.
37. Yang D, Dalton J. A unified approach to measuring the effect size between two groups using SAS. 2012. SAS Global Forum 2012, Paper 335. Accessed September 16, 2021.
38. Andrade C. Mean difference, standardized mean difference (SMD), and their use in meta-analysis: as simple as it gets. J Clin Psychiatry. 2020;81:20f13681.
39. Schober P, Vetter TR. Correct baseline comparisons in a randomized trial. Anesth Analg. 2019;129:639.
40. Schober P, Vetter TR. Propensity score matching in observational research. Anesth Analg. 2020;130:1616–1617.
41. Cohen J. The t test for means. In: Statistical Power Analysis for the Behavioral Sciences. Psychology Press, Taylor & Francis Group, 1988:19–74.
42. Matthay EC, Hagan E, Gottlieb LM, et al. Powering population health research: considerations for plausible and actionable effect sizes. SSM Popul Health. 2021;14:100789.
43. Sawilowsky SS. New effect size rules of thumb. J Modern Appl Stat Met. 2009;8:598–599.
44. Normand ST, Landrum MB, Guadagnoli E, et al. Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. J Clin Epidemiol. 2001;54:387–398.
45. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46:399–424.
46. Cousineau D, Chartier S. Outliers detection and treatment: a review. Int J Psychol Res. 2010;3:59–68.
47. Khadilkar V, Khadilkar A. Growth charts: a diagnostic tool. Indian J Endocrinol Metab. 2011;15(suppl 3):S166–S171.
48. Jones G, Barker A. Reference intervals. Clin Biochem Rev. 2008;29(suppl 1):S93–S97.
49. Schober P, Vetter TR. Adjustments for multiple testing in medical research. Anesth Analg. 2020;130:99.
50. Mascha EJ. Identifying the best cut-point for a biomarker, or not. Anesth Analg. 2018;127:820–822.
Copyright © 2021 International Anesthesia Research Society