This is the first in a series of short articles on statistics for working clinicians, whose last math class long ago faded from their minds, but who retain the two vital life skills of balancing checkbooks and calculating the odds of poker hands. Most attempts to teach statistics to doctors focus on the calculation of t-tests, analysis of variance, and other horrifying terms. These calculations are better left to statisticians and computer programs.
On the other hand, every scientist and clinician must be able to interpret the results of statistical calculations. This is not a question of mathematical skill, but of mindset, of knowing the proper questions to ask. We will begin exploring these questions with the simplest possible clinically relevant example: when administering the Patient Health Questionnaire-9 (PHQ-9),1 what cutoff should you use to diagnose major depression?
The PHQ-9 consists of nine questions used to assess depression and anxiety. Each question is answered numerically from 0 (no symptoms) to 3 (severe symptoms). The sum of all of the answers is the patient's score, ranging from 0 to 27. If the score is above some cutoff, then the test diagnoses the patient as depressed.
Before we go further, anyone dealing with statistics must internalize the following three quotations:
- “In any practical problem, any statistical procedure we use can possibly lead to an unfortunate decision.” (p. 12)2
- “Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” (p. 74)3
- “The object [is] to determine a ‘good’ way of guessing, on the basis of the outcome of the experiment, which of the possible underlying [states of nature] is the one which actually governs the experiment whose outcome we are to observe.” (p. 1)2
Specialized fields develop their own types of jargon, and the subfields of statistics are no exception. Rather than force this jungle upon the reader, I have chosen to use the notation and vocabulary that are easiest to remember and most broadly useful across statistics as a whole. A glossary of jargon for translation purposes is provided at the end of the column.
Based on quotation number 3 above, our object is to choose a “good” cutoff to use with the PHQ-9. Good, in this case, means minimizing the repercussions of misdiagnosis. The first step is to analyze the probability of misdiagnosis for various cutoffs.
Probabilities of Error
Imagine two groups of people, one of which consists only of people who suffer from depression, the other only of people who do not. Say you administer the questionnaire to everyone in both groups. For each possible cutoff, you calculate the fraction of each group that would be misdiagnosed. For example, if 2 out of 175 people in the group that was not depressed had scores of 8 or more, then using 8 as the cutoff would give a probability of misdiagnosing a healthy person as depressed of 2/175=0.011. If 5 out of 250 people in the depressed group had scores of 7 or less, then the probability of misdiagnosing a depressed person as not depressed would be 5/250=0.02.
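The counting described above can be sketched in a few lines of Python. This is only an illustration, not the procedure of any particular study: the function name `error_fractions` and both score lists are invented here, and the made-up scores were chosen solely so that the counts match the 2 out of 175 and 5 out of 250 in the example. Following that example, a score equal to the cutoff is treated as a positive result.

```python
def error_fractions(healthy_scores, depressed_scores, cutoff):
    """Return (false-positive fraction, false-negative fraction) for one cutoff.

    A score at or above the cutoff counts as a diagnosis of depression,
    matching the "8 or more" convention in the example.
    """
    false_pos = sum(1 for s in healthy_scores if s >= cutoff)
    false_neg = sum(1 for s in depressed_scores if s < cutoff)
    return false_pos / len(healthy_scores), false_neg / len(depressed_scores)


# Made-up scores (an assumption for illustration, not real data):
# 175 healthy people, 2 of whom score 8 or more;
# 250 depressed people, 5 of whom score 7 or less.
healthy = [3] * 173 + [9] * 2
depressed = [15] * 245 + [5] * 5

fp_rate, fn_rate = error_fractions(healthy, depressed, cutoff=8)
print(round(fp_rate, 3), round(fn_rate, 3))  # 0.011 0.02
```

Calling the same function once for every cutoff from 0 to 27 would give one pair of error fractions per cutoff.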
Adewuya et al did exactly this kind of experiment in a population of Nigerian college students.4 In this population, the probabilities of misdiagnosing a healthy individual with major depression (α) and misdiagnosing an individual suffering from major depression as healthy (β) for different cutoffs are shown in Table 1.
There is a simple mnemonic to keep α and β straight. When faced with two possible states of nature in statistics, the state of no disease, no effect, or no difference is called the “null hypothesis.” It is always listed first, and everything is named in order starting from the null hypothesis. For the PHQ-9, the null hypothesis is “the patient is not depressed.” The probability of error when the null hypothesis is true, that is, of misdiagnosing a healthy patient (a false positive), is α, and the probability of error when the patient is depressed (a false negative) is β. Misdiagnosis when the null hypothesis is true is called Type I error; misdiagnosis in the other case is called Type II error. If you remember “null hypothesis first,” you will never confuse these terms.
Check yourself: If you administered the PHQ-9 to a Nigerian college student who does not suffer from depression using a cutoff of 11, what is the probability that you will incorrectly diagnose him with depression?
Answer: Looking at the row for cutoff 11 in the table of error probabilities above, there are two error probabilities. The question asks for the probability of error when the subject is not depressed, that is, when the null hypothesis is true. From the rule “null hypothesis first”, we want α. In that row of the table, α is 0.385, so, using this cutoff, you would incorrectly diagnose a healthy student almost four times in ten.
The values for α in the table were defined based on data from healthy subjects, while those for β were defined based on data from depressed subjects. In reality, the PHQ-9 is administered because it isn’t known whether a particular patient is depressed or not. How are α and β connected in this situation?
The key is to imagine the population from which the patient is drawn. The best way to think about a population is to think about how you select someone from it. “Someone who comes to my office asking for treatment” defines a very different population than “a random person I stopped on the street”. “Women under 40 who came to my office for treatment” is also a different population than “men over 65 who came to my office for treatment”.
Check yourself: Would you expect someone from the population defined by “a patient who presented himself at my office asking for treatment for depression” to be more or less likely to be depressed than someone from the population defined by “someone who happens to live next door to me”?
Answer: Unless you are truly an appalling neighbor, someone who presented himself for treatment is more likely to be depressed.
Once you have imagined the population, divide it into those who are depressed and those who aren’t. This is a thought experiment, so assume you can know this perfectly. Then administer the PHQ-9 to everyone in each group and divide them by those who score below or above some cutoff. Now we can connect α and β to the population. Let us imagine doing this for 1000 people, and arrange the four groups in a two-by-two table (Table 2). Here α is the fraction of healthy people who score above the cutoff, that is, 150 out of the 827+150=977 people in the first row, so that α=150/977=0.15.
Check yourself: Calculate β for this population.
Answer: There are 23 people in all who are actually depressed, and only one of them tested negative, so β=1/23=0.04 (Table 2).
The following population of 1000 people has the same values of α and β:
Check yourself: Calculate α and β for this population (Table 3).
Answer: There are 600 people who are not depressed, 90 of whom incorrectly test positive, so α=90/600=0.15. There are 400 people who are depressed, 16 of whom incorrectly test negative, so β=16/400=0.04.
Although α and β are the same for the different populations in the two tables above, the first population has 150 false positives and 1 false negative, while the second has 90 false positives and 16 false negatives. Missing one case of major depression in a thousand might be acceptable whereas missing 16 might not. Obviously, α and β are not enough to choose a cutoff. The real probability of error for the PHQ-9 also depends on the prevalence of depression in the population.
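The two calculations above can be checked with a short Python sketch. The counts are the ones stated in the text (the 510 and 384 follow from 600−90 and 400−16); the function name `alpha_beta` and the dictionary layout are illustrative choices, not part of the original column.

```python
def alpha_beta(true_neg, false_pos, false_neg, true_pos):
    """Error probabilities conditioned on the patient's true state."""
    alpha = false_pos / (true_neg + false_pos)  # healthy people who test positive
    beta = false_neg / (false_neg + true_pos)   # depressed people who test negative
    return alpha, beta


# First population: 827 + 150 healthy, 1 + 22 depressed
population_1 = dict(true_neg=827, false_pos=150, false_neg=1, true_pos=22)
# Second population: 510 + 90 healthy, 16 + 384 depressed
population_2 = dict(true_neg=510, false_pos=90, false_neg=16, true_pos=384)

for name, counts in [("first", population_1), ("second", population_2)]:
    alpha, beta = alpha_beta(**counts)
    print(f"{name}: alpha={alpha:.2f}, beta={beta:.2f}")
# first: alpha=0.15, beta=0.04
# second: alpha=0.15, beta=0.04
```

Both populations give the same rounded error probabilities even though only 23 people are depressed in the first and 400 in the second.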
What is the probability that a diagnosis of depression is wrong in the first population? All people who score above the cutoff are in the right column, but only those in the first row of the right column are misdiagnosed. There are 150+22=172 people in the right column, and 150 are misdiagnosed, so the probability of error is 150/(22+150)=0.87. Eighty-seven percent of the time a diagnosis of depression is wrong for this population! On the other hand, there are 828 people in the left-hand column who are diagnosed as healthy, one of whom is actually depressed. This error rate is 1/828=0.001, so that only 0.1% of the time is a diagnosis of depression missed in this population.
Check yourself: Calculate the probability of each diagnosis being wrong in the second population.
Answer: There are 90+384=474 people in the right-hand column who test above cutoff. Of those, 90 are not actually depressed, so the probability of error of a diagnosis of depression is 90/474=0.19. So 19% of the time a diagnosis of depression is wrong for this population. There are 526 people in the left column, 16 of whom are misdiagnosed as healthy, so the probability of error of a diagnosis of no depression is 16/526=0.03. This means that 3% of the time a diagnosis of depression is missed in this population.
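The same arithmetic, now dividing by the column totals instead of the row totals, can be sketched as follows. The counts come from the text; the function name `diagnosis_error` is a hypothetical label chosen for this illustration.

```python
def diagnosis_error(true_neg, false_pos, false_neg, true_pos):
    """Return (P(wrong | diagnosed depressed), P(wrong | diagnosed healthy))."""
    wrong_positive = false_pos / (false_pos + true_pos)  # right column
    wrong_negative = false_neg / (false_neg + true_neg)  # left column
    return wrong_positive, wrong_negative


first = diagnosis_error(true_neg=827, false_pos=150, false_neg=1, true_pos=22)
second = diagnosis_error(true_neg=510, false_pos=90, false_neg=16, true_pos=384)

print(f"first:  {first[0]:.3f} {first[1]:.3f}")    # first:  0.872 0.001
print(f"second: {second[0]:.3f} {second[1]:.3f}")  # second: 0.190 0.030
```

The only difference between the two calls is the number of depressed people in the population, which is what moves the first figure from 87% to 19%.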
When reading the literature on diagnostic tests, you will not find α or β. This particular corner of statistics uses the probabilities of correctness, called “specificity” (= 1 − α) and “sensitivity” (= 1 − β), instead of the probabilities of error. These terms were unfortunately named, since the “null hypothesis first” rule can’t tell you which one corresponds to the null hypothesis.
Specificity indicates the proportion of negatives that are correctly identified (eg, the percentage of healthy people who are correctly identified as not being depressed). Sensitivity indicates the proportion of actual positives that are correctly identified as such (eg, the percentage of depressed individuals who are correctly identified as being depressed). In the rest of statistics, however, error probabilities dominate.
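As a small sketch of the translation between the two vocabularies, the function below converts the error probabilities (alpha for false positives, beta for false negatives) into sensitivity and specificity. The function name is an illustrative assumption, and the inputs are the rounded values 0.15 and 0.04 from the two example populations.

```python
def sensitivity_specificity(alpha, beta):
    """Convert error probabilities into probabilities of correctness."""
    sensitivity = 1 - beta    # depressed people correctly identified
    specificity = 1 - alpha   # healthy people correctly identified
    return sensitivity, specificity


sens, spec = sensitivity_specificity(alpha=0.15, beta=0.04)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity=0.96, specificity=0.85
```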
Using the probabilities of correctness works for diagnostic questionnaires because type I and type II errors in this case are mutually exclusive. If the possible errors overlapped, the situation would become murky indeed. Tolstoy's description of families in the opening line of Anna Karenina (“Happy families are all alike; every unhappy family is unhappy in its own way”) applies well to statistics. If we start with the unique correct decision, we can investigate each error separately without worrying about their overlap, no matter how complicated the possible errors become.
Calculating Loss
It is important to know the prevalence of depression in the population you are dealing with when choosing a cutoff, but even that is not enough. You must also consider the possible losses from each possible state of nature and decision. Loss is a measure of the overall effect of an outcome. It can be clinical outcome, monetary cost, loss of reputation, or even benefits (negative losses). It may be helpful to think of loss as the negative side of an economist's idea of utility—statisticians are pessimists and prefer to think of loss.
The loss from incorrectly diagnosing someone as depressed with the PHQ-9 is the time and money spent on follow-up sessions before realizing that the test was in error. The loss from incorrectly diagnosing someone as not being depressed may be the person's suicide in about one in a hundred cases.5 We may happily accept that 87% of the people who score above cutoff are not depressed, if it means that we miss only one in a thousand who really is. In this situation, we might say that the loss from misdiagnosing someone as healthy is a thousand times worse than the loss from misdiagnosing someone as depressed. The loss from a correct diagnosis is zero. In this case our losses would be 1 for a false positive and 1000 for a false negative (Table 4). As long as all the losses are correctly scaled relative to each other, their exact values don’t matter. We could have assigned 2 and 2000 without changing the final result. The losses in another context, such as an epidemiological survey, may be very different.
Once you have assigned the losses associated with each type of error, you can calculate the expected loss in a specific population. This is done by multiplying the probability of each type of error by the loss associated with that error and adding the results. For example, for the second population above, the zeros have no effect, and we calculated that the probability of error of a result above cutoff was 0.19, and of error below cutoff was 0.03. Then the expected loss is 0.19×1+0.03×1000=30.19.
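The expected-loss arithmetic above can be written as a one-line function. This is a sketch using the rounded error probabilities from the text (0.19 and 0.03) and losses of 1 and 1000; the function and parameter names are assumptions made for illustration.

```python
def expected_loss(p_wrong_positive, p_wrong_negative, loss_false_pos, loss_false_neg):
    """Sum of each error probability times its loss; correct diagnoses cost zero."""
    return p_wrong_positive * loss_false_pos + p_wrong_negative * loss_false_neg


loss = expected_loss(0.19, 0.03, loss_false_pos=1, loss_false_neg=1000)
print(round(loss, 2))  # 30.19

# Doubling both losses doubles the result without changing how
# different cutoffs would rank against each other.
doubled = expected_loss(0.19, 0.03, loss_false_pos=2, loss_false_neg=2000)
print(round(doubled, 2))  # 60.38
```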
This expected loss was calculated with a certain cutoff on the PHQ-9. If we change the cutoff, we will change the probabilities of each error, and thus the expected loss. For example, we might find the expected losses for each cutoff shown in Table 5.
The minimum loss occurs with a cutoff of 13, so this is a rational choice. Information about the patient such as age and sex may change the optimal cutoff. For instance, women have a higher incidence of depression than men, so when the expected loss is calculated in a population of women, the optimal cutoff will be lower than for men. The rates of depression also vary with age, and an optimal cutoff for a 40-year-old may not be optimal for a 19-year-old.
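Putting the pieces together, choosing a cutoff amounts to computing the expected loss for each candidate and taking the smallest. The sketch below does this under loudly stated assumptions: the error probabilities per cutoff and the prevalence are invented for illustration and are not the published values from the tables, so the cutoff it selects says nothing about the PHQ-9 itself. The function name `expected_loss_for_cutoff` is also hypothetical.

```python
def expected_loss_for_cutoff(prevalence, alpha, beta, loss_fp=1, loss_fn=1000):
    """Expected loss as computed in the text: P(diagnosis wrong) times its loss."""
    false_pos = (1 - prevalence) * alpha
    true_neg = (1 - prevalence) * (1 - alpha)
    false_neg = prevalence * beta
    true_pos = prevalence * (1 - beta)
    p_wrong_positive = false_pos / (false_pos + true_pos)
    p_wrong_negative = false_neg / (false_neg + true_neg)
    return p_wrong_positive * loss_fp + p_wrong_negative * loss_fn


# HYPOTHETICAL (alpha, beta) pairs per cutoff, made up for this example only.
made_up_errors = {5: (0.60, 0.0010), 10: (0.15, 0.0012), 15: (0.02, 0.20)}
prevalence = 0.4  # assumed fraction of the population that is depressed

losses = {
    cutoff: expected_loss_for_cutoff(prevalence, alpha, beta)
    for cutoff, (alpha, beta) in made_up_errors.items()
}
best_cutoff = min(losses, key=losses.get)
for cutoff, value in losses.items():
    print(cutoff, round(value, 2))
print("lowest expected loss at cutoff", best_cutoff)
```

Rerunning the sketch with a different `prevalence` shows how the preferred cutoff can shift from one population to another.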
Now you have the tools you need to think about cutoffs in diagnostic questionnaires. You must consider what population you are dealing with, the various probabilities of error, and the losses associated with each kind of error.
Surprisingly, these three things are all you need to interpret any statistic. The details of how you think about populations grow more complicated, the probabilities of error become more sophisticated, and the kinds of error and associated losses multiply, but it is a difference of degree rather than kind. Hypothesis tests and P-values require only a modest increase in sophistication beyond diagnostic questionnaires, as you will see in the next column.
Glossary
The null hypothesis is the state of no disease, no difference, no effect, or that what you are looking for does not exist. Alpha (α) is the probability of error when the null hypothesis is true. Beta (β) is the probability of error when the other hypothesis is true.
In assessing clinical measures, the terms sensitivity (= 1 − β) and specificity (= 1 − α) are often used, although they don’t generally appear elsewhere in statistics. These terms refer to probabilities of correctness, rather than probabilities of error. Specificity indicates the proportion of negatives that are correctly identified (eg, the percentage of healthy people who are correctly identified as not being depressed). Sensitivity indicates the proportion of actual positives that are correctly identified as such (eg, the percentage of depressed individuals who are correctly identified as being depressed). These terms were unfortunately named, since the “null hypothesis first” rule can’t tell you which one corresponds to the null hypothesis.
In the older statistical literature, you will often see the terms size (= α) and power (= 1 − β). These terms are still more confusing than specificity and sensitivity because not only must you remember which one corresponds to the null hypothesis, you must also remember which one is a probability of error (size) and which one is a probability of correctness (power).