A 2-sample t test is commonly used to analyze continuous data, but valid statistical inferences rely on its test assumptions being met.
In this issue of Anesthesia & Analgesia, Wong et al1 report a randomized trial of the effects of high-flow nasal oxygenation on safe apnea time (oxygen saturation measured by pulse oximetry [Spo2], >95%) during anesthetic induction of morbidly obese patients. The authors used a 2-sample, unpaired t test and observed a significantly prolonged safe apnea time in the intervention group (Figure).
The 2-sample t test is commonly used to compare 2 independent groups and tests the null hypothesis that the means are equal.2 Its test statistic can be thought of as a “signal-to-noise” ratio3: the ratio of the mean difference between the groups (the “signal”) to a function of the within-group variability (the “noise”). A large mean difference relative to the variability corresponds to a small P value, suggesting that there actually is a difference between the groups.
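The signal-to-noise idea can be made concrete with a minimal sketch of the equal-variance (pooled) form of the test statistic. The function name `pooled_t_statistic` and the data values are hypothetical and chosen only for illustration; they are not taken from Wong et al.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Return (signal, noise, t) for the equal-variance 2-sample t test."""
    n_a, n_b = len(group_a), len(group_b)
    # Signal: difference between the two sample means.
    signal = statistics.mean(group_a) - statistics.mean(group_b)
    # Pooled variance: within-group variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    # Noise: standard error of the mean difference.
    noise = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return signal, noise, signal / noise

# Made-up illustrative values, not data from any study.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
signal, noise, t = pooled_t_statistic(group_a, group_b)
print(f"signal={signal:.3f}, noise={noise:.3f}, t={t:.3f}")
```

Here the mean difference is as large as its standard error, so the ratio is only -1, which would not suggest a difference between the groups.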
Valid inferences with a 2-sample t test rely on several assumptions being met,3 including:
1. The dependent variable is continuous.
2. Observations are independent of each other.
3. The data are approximately normally distributed in each group.
4. The variances are approximately equal in both groups.
Assumptions #3 and #4 apply to the population from which the data were sampled, not the sample itself. Hypothesis tests are available to test these assumptions (eg, Shapiro-Wilk test as used by Wong et al1 for #3 and Levene test for #4) but have low power to detect violations at small sample sizes. These tests should be used in combination with graphical methods (eg, Q-Q plots) and a comparison of the group standard deviations to determine whether the assumptions are met.4
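As a rough sketch of how these checks might be combined, the example below uses SciPy (our choice of software; the source does not name a package) on simulated data. The group names, sample sizes, means, and standard deviations are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only (hypothetical values, not study data).
rng = np.random.default_rng(seed=0)
groups = {
    "control": rng.normal(loc=200.0, scale=40.0, size=20),
    "intervention": rng.normal(loc=260.0, scale=40.0, size=20),
}

qq_correlation = {}
for name, values in groups.items():
    # Shapiro-Wilk: the null hypothesis is that the data come from a normal distribution.
    w_stat, p_normal = stats.shapiro(values)
    # probplot returns the Q-Q coordinates and a straight-line fit;
    # a correlation r close to 1 means the points lie near the line.
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(values, dist="norm")
    qq_correlation[name] = r
    print(f"{name}: Shapiro-Wilk W={w_stat:.3f}, P={p_normal:.3f}, Q-Q r={r:.3f}")

# Levene test: the null hypothesis is that the group variances are equal.
levene_stat, p_levene = stats.levene(groups["control"], groups["intervention"])
# Simple descriptive check: ratio of the larger to the smaller group standard deviation.
sds = [np.std(values, ddof=1) for values in groups.values()]
sd_ratio = max(sds) / min(sds)
print(f"Levene P={p_levene:.3f}, SD ratio={sd_ratio:.2f}")
```

The Q-Q coordinates returned by `probplot` can be passed to any plotting library; a nonsignificant test result at this sample size should not by itself be read as proof that the assumption holds.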
The t test is relatively robust against nonnormality with larger sample sizes. Conventionally, the sample size needs to be ≥30; however, a much larger sample may be necessary if the data distribution is severely skewed. Importantly, with smaller samples, the data do need to be approximately normally distributed to avoid erroneous conclusions. Nonparametric methods (eg, Mann-Whitney U tests) can be used if this normality assumption is violated.
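A minimal sketch of the nonparametric alternative, assuming SciPy is used and using made-up values in which each small group contains one extreme observation:

```python
from scipy import stats

# Made-up, skewed illustrative values (one large outlier per group), not study data.
group_a = [12, 15, 17, 21, 24, 95]
group_b = [30, 34, 38, 41, 47, 160]

# The Mann-Whitney U test is based on ranks, so the size of the outliers
# matters less than it would for a comparison of means.
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
u_statistic = result.statistic
n_pairs = len(group_a) * len(group_b)
# The smaller of U and (n_a * n_b - U) counts the pairs that run against the overall ordering.
u_smaller = min(u_statistic, n_pairs - u_statistic)
print(f"U={u_statistic}, smaller U={u_smaller}, P={result.pvalue:.4f}")
```

Only 5 of the 36 possible pairs have the value from `group_a` exceeding the value from `group_b`, which is what the smaller U statistic reflects.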
The t test is also relatively robust against unequal variances if the sample sizes per group are equal and if the sample is large enough (>15 per group). Alternatives like the Welch t test are available if variances are unequal.
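The sketch below contrasts the pooled (Student) and Welch versions of the test, assuming SciPy and made-up data with unequal group sizes and clearly unequal variances. In SciPy, the Welch variant is selected with `equal_var=False`.

```python
import math
from scipy import stats

# Made-up illustrative values: the smaller group has much lower variability.
group_a = [10, 11, 12, 13, 14]
group_b = [5, 10, 15, 20, 25, 30, 35, 40]

# Student t test: pools the two variances (assumes they are equal).
student = stats.ttest_ind(group_a, group_b, equal_var=True)
# Welch t test: uses each group's own variance and adjusts the degrees of freedom.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={student.statistic:.3f}, P={student.pvalue:.3f}")
print(f"Welch:   t={welch.statistic:.3f}, P={welch.pvalue:.3f}")
```

With equal group sizes the two t statistics would coincide and only the degrees of freedom would differ; the unequal sizes here are what make the two statistics diverge.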
Besides inappropriately using t tests when assumptions are (grossly) violated, a common mistake is to use multiple pairwise t tests to compare >2 groups. Other techniques are instead needed, like analysis of variance (ANOVA) with appropriate post hoc tests.2
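One reason multiple pairwise tests are problematic is that the chance of at least one false-positive result grows with the number of comparisons. The sketch below illustrates this under the simplifying assumption that the comparisons are independent (pairwise tests on shared groups are not, so the figure is only approximate), and then runs a single one-way ANOVA with SciPy on made-up data for 3 hypothetical groups.

```python
from scipy import stats

alpha = 0.05
n_groups = 3
# Number of distinct pairwise comparisons among the groups.
n_comparisons = n_groups * (n_groups - 1) // 2
# Approximate probability of at least one false positive,
# assuming (for simplicity) independent comparisons.
familywise_error = 1 - (1 - alpha) ** n_comparisons
print(f"{n_comparisons} comparisons: familywise error of about {familywise_error:.3f}")

# Made-up illustrative values, not study data.
group_1 = [1.0, 2.0, 3.0]
group_2 = [2.0, 3.0, 4.0]
group_3 = [3.0, 4.0, 5.0]

# One-way ANOVA: a single test of the null hypothesis that all group means are equal.
anova = stats.f_oneway(group_1, group_2, group_3)
print(f"ANOVA: F={anova.statistic:.3f}, P={anova.pvalue:.3f}")
```

A post hoc procedure would still be needed to identify which groups differ if the overall test indicates a difference.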
Although the 2-sample t test is primarily a hypothesis test that yields a P value, statistical software packages routinely report the mean difference between the groups, including its confidence interval. As appropriately done by Wong et al,1 we strongly encourage authors to report this estimate because it provides important information about the magnitude of the treatment effect.5
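For readers who want to see where such an interval comes from, the following is a minimal sketch of a 95% confidence interval for the mean difference under the equal-variance assumption. It uses SciPy only for the t distribution quantile; the variable names and data values are hypothetical.

```python
import math
import statistics
from scipy import stats

# Made-up illustrative values, not data from any study.
intervention = [2.0, 3.0, 4.0, 5.0, 6.0]
control = [1.0, 2.0, 3.0, 4.0, 5.0]

n_i, n_c = len(intervention), len(control)
mean_difference = statistics.mean(intervention) - statistics.mean(control)

# Pooled variance and standard error of the mean difference (equal-variance form).
pooled_var = ((n_i - 1) * statistics.variance(intervention)
              + (n_c - 1) * statistics.variance(control)) / (n_i + n_c - 2)
standard_error = math.sqrt(pooled_var * (1 / n_i + 1 / n_c))

# Two-sided 95% interval: the 97.5th percentile of the t distribution.
degrees_of_freedom = n_i + n_c - 2
t_critical = stats.t.ppf(0.975, degrees_of_freedom)
ci_lower = mean_difference - t_critical * standard_error
ci_upper = mean_difference + t_critical * standard_error
print(f"Mean difference {mean_difference:.2f} (95% CI, {ci_lower:.2f} to {ci_upper:.2f})")
```

In this toy example the interval includes 0, consistent with a nonsignificant t test at the 5% level, while still conveying the range of effect sizes compatible with the data.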
1. Wong DT, Dallaire A, Singh KP, et al. High-flow nasal oxygen improves safe apnea time in morbidly obese patients undergoing general anesthesia: a randomized controlled trial. Anesth Analg. 2019;129:1130–1136.
2. Vetter TR, Mascha EJ. Unadjusted bivariate two-group comparisons: when simpler is better. Anesth Analg. 2018;126:338–342.
3. Vetter TR, Mascha EJ. In the beginning-there is the introduction-and your study hypothesis. Anesth Analg. 2017;124:1709–1711.
4. Vetter TR. Fundamentals of research data and variables: the devil is in the details. Anesth Analg. 2017;125:1375–1380.
5. Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: what do P values and confidence intervals really represent? Anesth Analg. 2018;126:1068–1072.