The first article^{1} in this three-part series discussed the question “what cutoff should we use with diagnostic questionnaires?” Here we address a related question: what does it mean to say “The difference in treatments was significant (P = 0.013 by two-tailed t-test)”?

Let us consider some (fictional) data. Over the course of a month, all insomniacs coming to a sleep clinic are offered a chance to participate in a clinical trial of the new anti-insomnia medication Fraudulin. If they agree, they are given either a placebo or a dose of Fraudulin, and they stay overnight at the clinic. That night, doctors with lots of caffeine and reading material measure how long each insomniac sleeps. For the 13 patients who present themselves during that month, they find:

Table: No title available.

The patients might be sleeping better with Fraudulin than placebo, but I would not be comfortable making such an assertion without some mathematics to back it up.

This situation is similar to that of a diagnostic questionnaire. We have two hypotheses, the null hypothesis that Fraudulin has no effect on insomnia, and the other that Fraudulin changes the average length of time a patient sleeps. We want to choose some cutoff below which we decide the null hypothesis, then find the probabilities of error corresponding to that particular cutoff. Such statistical methods are called “hypothesis tests.”

Diagnostic questionnaires specify how to produce a single score for a patient. We don't have any such score predefined for these data, so we must calculate a score, usually called a “summary statistic,” from our data. The original, and still most famous, hypothesis test was derived in 1908 by William Gosset while doing quality control at the Guinness brewery. Guinness regarded statistical methods developed by its staff as trade secrets, so Gosset published under the pseudonym Student.^{2} Student's test was based on a summary statistic that he denoted as t.

In our example, shown in the formula above, m is the average time slept, σ is the standard deviation of the time slept, n is the number of patients, and the subscripts f and p denote values for the groups given Fraudulin and placebo, respectively. Many other tests based on other summary statistics have been derived over the past century, each with its own strengths and weaknesses.
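The statistic described here can be sketched in a few lines. This is a minimal sketch, assuming the common unpooled form t = (m_f - m_p) / sqrt(σ_f²/n_f + σ_p²/n_p); the article's own formula may differ in detail (for instance, by pooling the two standard deviations). The function name `t_statistic` and every number except the Fraudulin mean of 5 hours are hypothetical stand-ins, not the trial's data.

```python
import math

def t_statistic(m_f, sd_f, n_f, m_p, sd_p, n_p):
    """Difference in group means divided by its estimated standard error.

    Assumed (unpooled) form: t = (m_f - m_p) / sqrt(sd_f**2/n_f + sd_p**2/n_p).
    """
    standard_error = math.sqrt(sd_f ** 2 / n_f + sd_p ** 2 / n_p)
    return (m_f - m_p) / standard_error

# Hypothetical summary values in hours of sleep (NOT the article's table).
base = t_statistic(m_f=5.0, sd_f=2.0, n_f=7, m_p=4.0, sd_p=2.0, n_p=6)
# Same means, standard deviations doubled in both groups.
wider = t_statistic(m_f=5.0, sd_f=4.0, n_f=7, m_p=4.0, sd_p=4.0, n_p=6)
# Same means and standard deviations, only 3 patients per group.
fewer = t_statistic(m_f=5.0, sd_f=2.0, n_f=3, m_p=4.0, sd_p=2.0, n_p=3)
# Fraudulin patients sleep less than placebo patients.
worse = t_statistic(m_f=3.0, sd_f=2.0, n_f=7, m_p=4.0, sd_p=2.0, n_p=6)

print(f"base={base:.2f} wider={wider:.2f} fewer={fewer:.2f} worse={worse:.2f}")
```

Under this form, doubling both standard deviations halves t, shrinking the groups pulls t toward zero, and a lower mean for Fraudulin makes t negative.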

Check yourself. Imagine you are given a cutoff of 0.8 on t. What decision do you make on the data set above? What if the standard deviation for each group were twice as large? What if the mean and standard deviation remained the same, but there were 3 patients in each group? What if the mean number of hours slept with Fraudulin was 3 instead of 5?

Answer. For the data above, t has the value 1.09.

Using a cutoff of 0.8, we would decide that Fraudulin had an effect on insomnia. If the standard deviation were twice as large for each group, it would halve the resulting value of t to give t = 0.54. More widely dispersed data would lead us to decide the null hypothesis. If n = 3 for both groups, t would be

Fewer patients constitute weaker evidence, and the value of t goes down. In this case it drops far enough that we decide the null hypothesis. If the mean number of hours slept with Fraudulin was 3 instead of 5, t would have the value

If we proceed as for a diagnostic questionnaire, we would say Fraudulin has no effect. Obviously, something is wrong, since Fraudulin appears to be causing patients to lose sleep.

There is a problem with how we set our cutoff. The null hypothesis is “Fraudulin has no effect,” but using the cutoff as we described, we decide the null hypothesis even if Fraudulin keeps patients awake. “No effect” requires that we set our cutoff in both directions: if the means are no farther apart than some cutoff, in either direction, we decide the null hypothesis. Such a two-sided cutoff is called “two-tailed.” If you wish to be ignorant of harmful effects to your patients, use a one-tailed test. Otherwise, prefer the two-tailed.
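The difference between the two rules is easy to see in code. The sketch below is illustrative only: `decide` is a hypothetical helper, the cutoff of 0.8 and the t values 1.09 and 0.54 come from the worked example, and -1.09 is an assumed stand-in for a trial in which Fraudulin shortens sleep.

```python
def decide(t, cutoff, two_tailed=True):
    """Return which hypothesis we decide for a given t and a positive cutoff."""
    if two_tailed:
        exceeded = abs(t) >= cutoff   # far from zero in either direction
    else:
        exceeded = t >= cutoff        # only large positive values count
    return "other" if exceeded else "null"

cutoff = 0.8
for t in (1.09, 0.54, -1.09):  # -1.09 is a hypothetical "harmful drug" value
    one = decide(t, cutoff, two_tailed=False)
    two = decide(t, cutoff, two_tailed=True)
    print(f"t={t:+.2f}  one-tailed: {one:5s}  two-tailed: {two}")
```

Only the two-tailed rule flags the negative value; the one-tailed rule files it under “no effect.”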

Now that we have something to apply a cutoff to, we can recapitulate the program we established for diagnostic questionnaires: consider the population we are dealing with, the various probabilities of error, and losses associated with each kind of error.

For diagnostic questionnaires, the population consisted of patients. In our fictional trial, the population consists of groups of patients. The rule to remember is that an individual member of the population is the grouping on which you calculate a summary statistic. One group of patients produces one value of t. The population as a whole could be all clinical trials of Fraudulin, all clinical trials of insomnia medications, all clinical trials of drugs in the state of Ohio, or as many other variations as we saw with diagnostic questionnaires.

Once we have established our population, we repeat the mental gymnastics we performed for diagnostic questionnaires: divide the population into two groups, for one of which the null hypothesis is true (Fraudulin has no effect) and for one of which the other hypothesis is true (Fraudulin affects how much patients sleep). We divide each of those groups into those for which we decide correctly and those for which we make a mistake:

Table: No title available.

For diagnostic questionnaires, we estimated the relative numbers of individuals in each of the blocks by referring to a more reliable test (e.g., a structured interview). Yet that interview must have been established based on a yet more reliable test, which must have been established by reference to some even more reliable test, and so on. This way lies madness.

Gosset's contribution was breaking this cycle. Instead of establishing a population's properties based on some more reliable test, a mathematical model is used to approximate the population. We try to find mathematical models whose form depends only on plausible assumptions about the real world. Many such models are known and bear the names of the mathematicians who discovered them. If your measurements arise from the sum of lots of small, independent effects, they will be approximated by the famous bell curve, more properly known as the Gaussian or normal distribution. Sums of a few large, independent effects produce a Cauchy distribution. Poisson gave his name to the distribution of the number of random events of some kind you would expect to happen in a fixed period of time if there were no connections among them. The classic example is the number of soldiers who died from being kicked by a horse each year in the Prussian army. If no horse, no matter how ornery, kills more than one soldier, especially since it is likely to be put down, there will be no clumps of deaths, and this model is extremely accurate. Gumbel, Fréchet, and Weibull figured out the generic distributions of maximum values (e.g., maximum height of a river or the maximum magnitude of an earthquake that will hit an area in some period). Such mathematical models are ideals based on an explicit set of assumptions. It takes common sense, experience, and experimental design to approach these assumptions in real life, and a failure of these assumptions constitutes a new kind of error, one for which we cannot give a probability.
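The claim about sums of many small, independent effects can be checked with a short simulation. This is a rough sketch under assumed settings: each “effect” is a uniform random number between -0.5 and 0.5, each measurement sums 50 of them, and the seed and sample count are arbitrary choices. A Gaussian puts about 68% of its area within one standard deviation of the mean and about 95% within two, so the simulated measurements should come close to those fractions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def summed_effects(n_effects):
    """One measurement built as the sum of many small independent effects."""
    return sum(random.uniform(-0.5, 0.5) for _ in range(n_effects))

n_effects = 50
samples = [summed_effects(n_effects) for _ in range(20000)]

mean = statistics.fmean(samples)
sd = statistics.pstdev(samples)

# Fractions of measurements within one and two standard deviations of the mean.
within_one_sd = sum(abs(x - mean) <= sd for x in samples) / len(samples)
within_two_sd = sum(abs(x - mean) <= 2 * sd for x in samples) / len(samples)

print(f"sd={sd:.2f}  within 1 sd={within_one_sd:.3f}  within 2 sd={within_two_sd:.3f}")
```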

Student's t-test makes three assumptions: each of the measurements we make is independent, each is drawn from the same population, and the population is well approximated by a Gaussian distribution.

“Independence” is a subtle concept. When you make a measurement on some member of a population, you hope it will tell you something about the population as a whole. If all the members of the population are independent, it will only tell you about the population as a whole. As soon as it tells you about parts of the population beyond that one individual, you have lost independence. For example, if your population is a rural town where three quarters of the citizens have the last name Cox, and you are studying congenital diseases, finding your disease in a Cox tells you something about Coxes as opposed to the whole population. Finding the same disease in a random patient in Chicago tells you very little except about whatever population the patient came from. Of course, finding the disease in a person in Chicago also tells you about his or her family. This is where idealization comes in, and why independence is subtle. If you arrange your experiment to exclude multiple family members, then the idealization holds. Of course, some subjects may have arrived in Chicago in the same mass population movements. We can always make an error and have dependence among samples, but without plausible independence, you may as well throw your data away. There are no mathematical tricks that will save you.

Once we have a mathematical model of the population, we use it to calculate error probabilities. Recall the standard names α and β for the probabilities of type I and type II error:

Table: No title available.

In order to calculate the error probabilities, Student assumed that the measurements for each treatment come from a Gaussian distribution. When the null hypothesis is true, the Gaussian distributions for each group are centered at the same value, although they may have different widths, as in

From these two distributions, Student calculated the distribution of his t:

The probability α of type I error (of wrongly concluding there is an effect) is the probability of measuring a value of t beyond the cutoff (the shaded fraction of the area in the figure) when the two Gaussian distributions are centered at the same value.
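One way to make this definition concrete is to simulate many trials in which the null hypothesis is true and count how often t lands beyond the cutoff. The sketch below rests on several assumptions that are not in the article: group sizes of 7 and 6 (13 patients in total), a common mean of 4 hours with a standard deviation of 2 hours, and an unpooled form of the t statistic. With the cutoff set at 1.09, the simulated fraction should come out in the neighbourhood of 0.3.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_statistic(group_f, group_p):
    """Unpooled t statistic computed from two lists of measurements (assumed form)."""
    se = math.sqrt(sample_variance(group_f) / len(group_f)
                   + sample_variance(group_p) / len(group_p))
    return (mean(group_f) - mean(group_p)) / se

def simulated_alpha(cutoff, n_f=7, n_p=6, trials=20000):
    """Fraction of null-hypothesis trials whose t falls beyond +/- cutoff."""
    beyond = 0
    for _ in range(trials):
        # Null hypothesis: both groups come from the same Gaussian (hypothetical values).
        group_f = [random.gauss(4.0, 2.0) for _ in range(n_f)]
        group_p = [random.gauss(4.0, 2.0) for _ in range(n_p)]
        if abs(t_statistic(group_f, group_p)) >= cutoff:
            beyond += 1
    return beyond / trials

alpha = simulated_alpha(cutoff=1.09)
print(f"simulated alpha at cutoff 1.09: {alpha:.3f}")
```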

Now we can define the P-value. To do so, you set the value of t you calculated as your cutoff and calculate α for that cutoff (i.e., the probability of a type I error). When the cutoff is set at t, the value of α is the P-value. To calculate the P-value for our Fraudulin data, we take 1.09 (the t value) as our cutoff and have a computer calculate the fraction of the area under the t distribution beyond 1.09 and –1.09 (shaded areas in the graphic) to produce α, which is our P-value. In this case P = 0.30. If you set the level of significance at the traditional 0.05, you decide the null hypothesis (Fraudulin has no effect). Good luck getting these data published! If the P-value were lower (e.g., P = 0.04), researchers would decide the other hypothesis (the results would be “statistically significant”).
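The “have a computer calculate the area” step can be sketched without any statistics library by integrating the density of Student's t distribution numerically. Two assumptions are made here that the article does not state: the degrees of freedom are taken as 13 - 2 = 11 (two groups totalling 13 patients), and the helper names `t_density` and `two_tailed_area` are hypothetical.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_area(cutoff, df, steps=2000):
    """Area under the t density beyond +cutoff and -cutoff (Simpson's rule)."""
    c = abs(cutoff)
    if c == 0:
        return 1.0
    h = c / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(c, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    # The density is symmetric, so double the integral over [0, c].
    central_area = 2 * (h / 3) * total
    return 1.0 - central_area

# Assumption: 11 degrees of freedom (13 patients, two groups).
p_value = two_tailed_area(1.09, df=11)
print(f"P = {p_value:.2f}")
```

Under that assumption the computed area agrees with the P = 0.30 quoted for the Fraudulin data.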

The P-value is the smallest relevant value of α given your data (i.e., the smallest probability of making a type I error and deciding there is an effect when there isn't one). Let's look at a few other cutoffs, both smaller and larger than 1.09.

Table: No title available.

Given t = 1.09, you would decide the null hypothesis for the cutoffs of 1.5 or 2.0 (α, the probability of deciding the other hypothesis when the null hypothesis is true, is not relevant). For the cutoffs 0.9, 1.0, and 1.09, you decide the other hypothesis. As the cutoffs get smaller, the values of α (shaded areas under the tails in the figure) increase. With a cutoff of 0.9, you would incorrectly decide the other hypothesis in 39 of 100 cases compared with 30 of 100 cases with a cutoff of 1.09. Thus, 1.09 is the largest cutoff for which we would decide the other hypothesis from our data, and it corresponds to the smallest relevant value of α. The P-value represents the smallest value of α you can get from your data while deciding the other hypothesis. It is the lower limit of your probability of making a type I error.
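The comparison across cutoffs can be tabulated directly. As before, this is only a sketch: it assumes 11 degrees of freedom (13 patients in two groups), which the article does not state, and it integrates the t density numerically rather than relying on a statistics package. The five cutoffs are the ones named in the text.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def alpha_for_cutoff(cutoff, df, steps=2000):
    """Two-tailed area beyond +/- cutoff (cutoff > 0), via Simpson's rule."""
    h = cutoff / steps
    total = t_density(0.0, df) + t_density(cutoff, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1.0 - 2 * (h / 3) * total

t_observed, df = 1.09, 11          # df = 11 is an assumption
cutoffs = [0.9, 1.0, 1.09, 1.5, 2.0]
alphas = {c: alpha_for_cutoff(c, df) for c in cutoffs}

# A cutoff is "relevant" when the observed t leads us to decide the other hypothesis.
relevant = [c for c in cutoffs if abs(t_observed) >= c]
for c in cutoffs:
    decision = "other" if c in relevant else "null"
    print(f"cutoff={c:<5} alpha={alphas[c]:.2f} decide {decision}")

smallest_relevant_alpha = min(alphas[c] for c in relevant)
print(f"smallest relevant alpha = {smallest_relevant_alpha:.2f}")
```

Because α shrinks as the cutoff grows, the smallest relevant α always belongs to the largest cutoff the data can still clear, which is the observed t itself.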

The most common misconception about the P-value is that it is the probability of deciding the wrong hypothesis. It is not. It is the lower limit of the probability of deciding the wrong hypothesis when the null hypothesis is true (of deciding there is an effect when there is not). It says nothing about what happens when the other hypothesis is true, nor does it account for the relative probability of each hypothesis.
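A little arithmetic shows why the distinction matters. The sketch below uses entirely hypothetical numbers: a population of trials in which some fraction involve a drug that really works, a fixed α of 0.05, and a β of 0.5. It then asks what share of the trials that reach “there is an effect” reached it wrongly. The function name `share_of_false_alarms` is invented for this illustration.

```python
def share_of_false_alarms(prior_effect, alpha, beta):
    """Among trials where we decide 'there is an effect', the fraction that are wrong."""
    false_alarms = (1 - prior_effect) * alpha      # null true, decided other (type I)
    true_detections = prior_effect * (1 - beta)    # other true, decided other
    return false_alarms / (false_alarms + true_detections)

# Hypothetical populations: half of all trials test a working drug, or only one in ten.
for prior_effect in (0.5, 0.1):
    share = share_of_false_alarms(prior_effect, alpha=0.05, beta=0.5)
    print(f"real effects in {prior_effect:.0%} of trials: "
          f"{share:.0%} of 'significant' results are false alarms")
```

With the same α, the share of wrong “significant” results changes substantially as the mix of hypotheses in the population changes, which is exactly the information a P-value does not carry.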

The P-value seems like a strange thing to report, does it not? Wouldn't it be better to give the cutoff, or perhaps the α and β you used to make your decision? All of the conventions around hypothesis testing in practice are sufficiently odd that a vocal minority of statisticians call for them to be abandoned on a regular basis, and the silent majority shifts uncomfortably in their seats, since they don't really disagree. The peculiarity of hypothesis testing comes from its history. Its conventions were cobbled together from the wreckage of a decades-long dispute over how the theory should work between Ronald Fisher on one side, and Jerzy Neyman and Egon Pearson on the other. Fisher took Gosset's work on the t-test, extended his ideas in many directions, and used them as the foundation for his book Statistical Methods for Research Workers, first published in 1925.^{3} He proposed the P-value as a useful measure: if a researcher looked at his data and claimed he saw an effect, the P-value was the probability of any effect he saw arising purely by chance. Fisher wasn't particularly concerned about the extremely rare problem of researchers looking at their data and saying, “No, I really don't think there's anything there.”

It wasn't until 1933 that Neyman and Pearson^{4} took the first (rather opaque) steps towards the formal idea of decisions and probabilities of error that we use here, and the full structure of decision theory didn't appear until 1939 in the work of Abraham Wald.^{5} By that time, Fisher's book had gone through six more editions, and it would go through another seven before his death in 1962.

To make matters worse, the arbitrary P-value of 0.05 has become enshrined in the scientific community. If your results yield a P-value of 0.049, many sins of method and technique will be forgiven for publication, but let it crawl above that magic 0.05, and suddenly your paper is universally rejected.^{6} This may seem absurd, but it took so much work to achieve consistent reporting of P-values that statisticians hesitate to undertake yet more changes.

Given this, you must exercise good judgment with hypothesis tests. They are ubiquitous and useful for many things. Hammers are also useful for many things, including driving nails and smashing your thumb. Both hammers and hypothesis tests require common sense. Fisher encouraged researchers not to consider P-values in isolation but also to take into account other relevant evidence (e.g., results of previous studies).^{7} He used hypothesis tests in support of his intellect, not in place of it, and so should you.