My cousin, a middle school mathematics teacher, would motivate her students by reminding them that “math is everywhere.” This is clearly true in hematology. We use blood counts, SUVs from PET imaging, and quantitative assessments of minimal residual disease to assess tumor burden and define the need for further therapy. Laboratory studies report percentage of viable cells after exposure to various agents. Mice are subjected to serial bioluminescent imaging to assess whether a novel agent can impact the trajectory of tumor growth. In such a quantitative environment, we rely on statistics to help us process and interpret our abundance of data.
The need for statistical assistance in the design and analysis of clinical, laboratory, and animal experiments in hematology seems to elicit a high level of anxiety in many investigators. Statisticians, bioinformaticians, and computational biologists are available to assist researchers in most academic institutions—universities as well as hospitals—but many of these individuals are focused on theoretical rather than collaborative research. Relatively few of these individuals are actually embedded in hematology, and therefore familiar with leukemia, lymphoma, multiple myeloma, or the many varieties of nonmalignant hematology. Moreover, these individuals are often judged academically on their theoretical research, with little or no acknowledgment of their value in promoting statistical thinking among clinical and basic scientists. We will not focus here on the generation of P values. We will focus instead on the importance of statistical thinking—a quantitative aspect of critical thinking—in the design of scientific investigations, and the role of statistical analysis in interpreting the data collected through such investigations.
The statistical design of experiments is exceedingly important to ensure that a study is large enough to identify a result of scientific importance as statistically significant. It is crucial to distinguish between the hypothesized difference of scientific importance and the observed data generated by the experiment. An additional contribution of statistical thinking is an understanding that the results of any experiment are stochastic. The sample size calculations are the less problematic statistical aspect here, as multiple programs are available, both within statistical analysis packages and on a variety of interactive websites. Using such programs, one can identify the numbers of patients, samples, or mice required to detect specified differences with controlled probabilities of 2 kinds of error. The first is erroneously finding in favor of the hypothesized difference based on the experimental results (type I error or significance level, computed under the assumption that the experimental and control groups do not differ). The second is erroneously failing to find in favor of the hypothesized difference, again based on the experimental results (type II error, with statistical power defined as 1 minus the type II error, computed under the assumption that the experimental and control groups reflect the hypothesized difference). The investigator who seeks self-sufficiency in experimental design would do well to establish a comfortable acquaintance with a statistical package or online program, and to use it either to explore the sample size required for a given type I error, power, and detectable difference or, given a feasible sample size, to identify the magnitude of difference that can be detected with adequate power and appropriate type I error. This will give the investigator a sense of how variations in power or significance level alter sample size, and a concrete understanding of how sample size affects the standard error of the mean and, through that measure, drives the ability to detect larger or smaller differences.
Through direct interaction with an algorithm—just playing around—the investigator can develop a sense of how altering each parameter of design alters the required sample size. Through such iterative experience the investigator will develop his or her own statistical intuition, an excellent first step on the way to statistical thinking.
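As one way of playing around with these parameters, the following is a minimal sketch in Python, not a substitute for a validated statistical package. It assumes a comparison of 2 group means with a common standard deviation and uses the normal approximation, which gives slightly smaller sample sizes than calculations based on the t distribution. The function names `n_per_group` and `detectable_difference` are hypothetical and chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80, sides=2):
    """Approximate per-group sample size for comparing 2 means.

    Normal approximation: n = 2 * ((z_alpha + z_power) * sigma / delta)^2,
    where delta is the hypothesized difference and sigma the common SD.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / sides)  # critical value for type I error
    z_power = z.inv_cdf(power)              # quantile for power (1 - type II error)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


def detectable_difference(n, sigma, alpha=0.05, power=0.80, sides=2):
    """Difference detectable with the given power when n per group is fixed."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / sides)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(2 / n)


# Vary power and sidedness for a difference of 1 standard deviation.
for power in (0.80, 0.90):
    for sides in (1, 2):
        n = n_per_group(delta=1.0, sigma=1.0, alpha=0.05, power=power, sides=sides)
        print(f"power={power:.2f}, {sides}-sided alpha=0.05: n per group = {n}")

# Halving the difference roughly quadruples the required sample size.
print(n_per_group(delta=0.5, sigma=1.0))
```

Changing one argument at a time shows the pattern described above: higher power, a smaller type I error, a 2-sided test, or a smaller hypothesized difference each increase the number of patients, samples, or mice required.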
There are indeed guidelines for these parameters. Statisticians typically suggest that exploratory studies have approximately 80% power to detect a “meaningful” difference, and that a type I error of 0.05 to 0.10 that addresses only superiority (1-sided test of hypothesis) is adequate. More definitive studies require at least 90% power, and, with the goal of presenting results to regulatory authorities such as the European Medicines Agency, a maximum type I error of 0.05 that considers that the experimental intervention may turn out to be either better or worse than the control (2-sided test of hypothesis).
Clinical researchers are accustomed to readily available statistical support for study design because institutional ethics approval is required for research involving human subjects. Basic science researchers may find that identifying such support is quite challenging. Specification of sample size for animal and laboratory studies permits the research to be planned more efficiently. However, in this latter setting, the responsibility for clearly explaining the detail of the experimental plan will fall more heavily on the investigator, as many statisticians are more familiar, and more comfortable, with the statistical needs of clinical studies.
Specification of a statistical plan for an experiment provides an opportunity for colleagues (as well as funding agencies) to reflect on the robustness of the proposal. As an example, an investigator may propose a study that contains both a discovery and a validation phase. A statistical plan should clearly state the rules for identifying promising discoveries to be taken forward for validation. If a high-dimensional discovery phase generates a large quantity of data, identification of promising “hits” to pursue would require specificity, including the adjustment of P values to address the enhanced potential for false discovery associated with multiplicity of testing. Resolution of such issues requires thoughtful discussion within the study team, and both a statistician and a computational biologist would be important members of such a team.
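To make the multiplicity issue concrete, here is a minimal sketch of one widely used adjustment, the Benjamini-Hochberg procedure for controlling the false discovery rate. It is offered as an illustration of the kind of rule a statistical plan might specify, not as the required method. The P values and the 0.05 threshold below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values from a small discovery screen.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
adj = benjamini_hochberg(raw)

raw_hits = [p for p in raw if p < 0.05]
fdr_hits = [p for p, q in zip(raw, adj) if q < 0.05]
print(len(raw_hits), "features pass an unadjusted 0.05 threshold")
print(len(fdr_hits), "features pass after adjustment")
```

In this made-up example, 5 of the 8 features fall below 0.05 before adjustment but only 2 remain after it, which is the kind of difference a prespecified rule for selecting hits is meant to settle in advance.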
Statistics are also used in the interpretation of data. The statistical programs available through commercial statistical packages and the freely downloadable R Project for Statistical Computing (https://www.r-project.org/) cover a broad range of analytical approaches that have been widely vetted within the statistical community. It is common to develop R packages to implement novel statistical and computational algorithms appearing in the literature. Learning R requires a commitment from an investigator; other approaches only require more awareness of how to use the packages. One can perform analyses in a popular spreadsheet program or in a package for laboratory analyses, but this shifts to the investigator the responsibility for statistical thinking to clarify the experimental setting and select the correct analytic approach. The proprietary spreadsheet Excel can, in addition to performing the Student t test to compare 2 groups with equal variances, perform the Welch t test to address inequality of variance in the setting of small sample sizes. The proprietary laboratory package GraphPad Prism can perform repeated measures analysis of variance (ANOVA) on serial measurements of tumor burden over time within mice; options within ANOVA identify the data structures and specifications required to implement the correct analysis.
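The practical difference between the 2 t tests can be seen in a short sketch. It computes only the test statistic and degrees of freedom for each version, using the Python standard library; the data are hypothetical, and the function names are chosen for illustration. A full analysis, including P values, is available in established packages (for example, `scipy.stats.ttest_ind` with `equal_var=False` performs the Welch test).

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Student t statistic and degrees of freedom (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = sqrt(pooled * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, na + nb - 2


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # per-group squared standard errors
    se = sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return (mean(a) - mean(b)) / se, df


# Hypothetical data: a tight group of 8 and a highly variable group of 4.
group_a = [4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
group_b = [2.0, 6.5, 3.5, 8.0]

t_s, df_s = student_t(group_a, group_b)
t_w, df_w = welch_t(group_a, group_b)
print(f"Student: t = {t_s:.2f}, df = {df_s}")
print(f"Welch:   t = {t_w:.2f}, df = {df_w:.1f}")
```

With these made-up values, pooling lets the large, tight group dominate the variance estimate, so the Student version reports a larger statistic on more degrees of freedom than the data from the small, variable group can support.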
Statistical thinking emphasizes the ability to identify the requirements of the statistical test. In the first setting I have cited, the Student t test uses a pooled estimate of the common variance of the 2 groups under study to generate the standard error of the difference for the test. If the 2 variances are quite different, they cannot reasonably be pooled. The Welch t test can accommodate such differences. In the murine example, most investigators clearly wish to test whether the trajectory of tumor growth over time differs between animals receiving a novel agent and those receiving vehicle control. If tumor burden is assessed 9 times in 3 weeks, the issue is not whether there is a difference at the 0.05 level at any 1 of the 9 time points, but whether the trajectory of growth (a single assessment over the 3 weeks) differs between groups. The statistical analysis needs to explicitly account for the repeated tumor measurements within each of the animals, so that the testing is performed with the appropriate number of degrees of freedom. In each of these examples, what is required is an understanding of the assumptions of the analytic approach and how to communicate this to the analysis package.
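One simple way to respect the structure of the murine data is a summary-measure analysis: reduce the 9 measurements from each animal to a single growth rate, and then compare growth rates between groups. This is a different and simpler technique than repeated measures ANOVA, shown here only to illustrate that the animal, not the individual measurement, is the unit of analysis. All values below are simulated under hypothetical growth rates, and the noise pattern is fixed so that the example is reproducible.

```python
from math import exp, log
from statistics import mean

# 9 assessments over 3 weeks (hypothetical schedule, in days).
DAYS = [0, 3, 5, 7, 10, 12, 14, 17, 21]
# Fixed measurement noise on the log scale (hypothetical).
NOISE = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, 0.04, -0.05]


def simulate_animal(rate, start=1.0):
    """Simulated tumor burden for one animal with exponential growth."""
    return [start * exp(rate * d + e) for d, e in zip(DAYS, NOISE)]


def growth_slope(days, burden):
    """Least-squares slope of log tumor burden against time for one animal."""
    y = [log(v) for v in burden]
    d_bar, y_bar = mean(days), mean(y)
    sxy = sum((d - d_bar) * (v - y_bar) for d, v in zip(days, y))
    sxx = sum((d - d_bar) ** 2 for d in days)
    return sxy / sxx


control = [simulate_animal(r) for r in (0.20, 0.22, 0.18)]
treated = [simulate_animal(r) for r in (0.10, 0.12, 0.08)]

# One number per animal: the trajectory is summarized before any testing.
control_slopes = [growth_slope(DAYS, a) for a in control]
treated_slopes = [growth_slope(DAYS, a) for a in treated]

n_animals = len(control) + len(treated)
n_measurements = n_animals * len(DAYS)
print(f"{n_measurements} measurements, but only {n_animals} independent animals")
print("mean control slope:", round(mean(control_slopes), 3))
print("mean treated slope:", round(mean(treated_slopes), 3))
```

Any subsequent comparison of the 2 groups of slopes is based on 6 animals rather than 54 measurements, which is what keeps the degrees of freedom honest.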
Statistical thinking is not a focus on P values; P values only quantify for you the probability of observing data at least as extreme as those actually generated by your experiment when the null hypothesis is indeed true. Statistical thinking is about developing a logical and critical approach to medical research, an understanding of the assumptions of testing and the challenges of analysis. It enhances the sophistication with which you read the literature. Statistical thinking raises the level of your science.