# Special Report

## P-Values Have No Value, but There is a Better (Bayesian) Way

Young, Paul MBChB, PhD

doi: 10.1097/01.EEM.0000550364.36707.fe

Conducting research and being science-based is what distinguishes medicine from crystal therapy and is fundamental to modern medical care. Despite many major advances in medicine, we are missing an opportunity to recognize the true potential of research to advance patient care. Some of our research concepts are problematic, and you need to understand the concepts to understand the problems.

Power is one such concept. Power is the probability of demonstrating an effect of a magnitude that you specify in your sample-size calculation if such an effect exists. Then there's the p-value. A p-value of less than 0.05 does not mean there is a one-in-20 chance that the result is a fluke. Instead, assuming that the null hypothesis is correct, the p-value is the probability of obtaining a result equal to or more extreme than what was actually observed.
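To make the definition of a p-value concrete, here is a minimal sketch using a coin-flip experiment. The coin example, the function name `two_sided_p_value`, and the figure of 60 heads in 100 flips are illustrative assumptions and do not come from any trial discussed here. The code assumes the null hypothesis (a fair coin) is correct and adds up the probability of every result at least as far from 50 heads as the one observed.

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Exact two-sided p-value under the null hypothesis of a fair coin.

    Sums the probability of every outcome that is at least as far from
    the expected count (flips / 2) as the observed number of heads.
    """
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count the equally likely head/tail sequences that are this extreme or more.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme_sequences / 2 ** flips


# Hypothetical observation: 60 heads in 100 flips.
p = two_sided_p_value(60, 100)
print(f"p-value for 60 heads in 100 flips: {p:.4f}")
```

Note what the number does and does not say: it is a statement about the data assuming the coin is fair, not the probability that the coin is fair.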

These concepts underpin the frequentist statistical approach. You start with an experimental hypothesis, reframe it as a null hypothesis, gather some data, and then calculate the probability of the observed result under the assumption that the null hypothesis is correct. All this is abstract, and it can be difficult to see why it is such a problem, but in practice you apply a threshold: If the p-value is greater than or equal to 0.05, you are basically accepting the null hypothesis. If it's less, you reject the null hypothesis. The fundamental problem with this paradigm is illustrated by the picture of a young woman shown here. What are the chances this woman worships Satan? Look at the details and try to find the clues.

Whatever probability you came up with, did you consider that there are only 10,000 Satan worshippers in the world? What if I asked you what the chances are that she is Christian? How heavily do you weigh the fact that there are two billion Christians? This is the problem of base rate neglect. We tend to put too much weight on the specific and not enough on the general.

Another example: Let's say I have a test that detects 100 percent of drunk drivers, but it has a five percent false-positive rate. We have a driver with a positive test; what are the chances the driver is drunk? The answer that likely comes into your head is 95 percent. The problem with that answer is the same problem as the young woman: It neglects the base rate.

For the sake of argument, let's say one in every 1,000 drivers is drunk, so there can only possibly be one true-positive: the drunk driver. From the remaining 999 sober drivers, we can expect about 50 false-positives with a five percent false-positive rate. Under this scenario, there is only about a one-in-50 chance that the driver is drunk if the test is positive. We are all familiar with this kind of thinking in relation to the sensitivity and specificity of diagnostic tests. Think of a clinical trial as a diagnostic test of a trial hypothesis, and imagine that you're going to test 100 intensive care interventions in clinical trials, interventions that might reduce mortality. You have 100 things you want to test, and you want to demonstrate that you can save lives.
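The drunk-driver arithmetic can be checked in a few lines of Python. This is only a sketch: the function name `positive_predictive_value` and its parameter names are my own labels, and the inputs are the ones assumed in the example (a base rate of one in 1,000, 100 percent sensitivity, and a five percent false-positive rate).

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """Chance that a positive test is a true positive (Bayes' theorem)."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Inputs taken from the drunk-driver example in the text.
ppv = positive_predictive_value(
    base_rate=1 / 1000,
    sensitivity=1.0,
    false_positive_rate=0.05,
)
print(f"P(drunk | positive test) = {ppv:.3f}")  # prints 0.020, about one in 50
```

The exact figure comes out very close to the one-in-50 estimate, and nowhere near the intuitive answer of 95 percent.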

## Testing Hypotheses

It obviously depends on what you're testing and the effect size you specify, but what percentage do you expect to be positive? The ICU literature is not exactly overrun with interventions that reduce mortality, but for the sake of argument, let's say 10 percent of the hypotheses are correct. With 90 percent power, we can expect to detect nine of the 10 true hypotheses. With a p-value threshold of 0.05, we can expect the p-value to be less than 0.05 in five percent of the remaining 90 trials, where the null hypothesis is correct. These are the false-positives. If the prior probability that the hypothesis was correct is 10 percent and the p-value is less than 0.05, then it's a true-positive around two-thirds of the time, and it's a false-positive around one-third of the time. (See graph.)

Another scenario: If one percent of the hypotheses are correct, then a trial that's statistically significant is more than five times as likely to be false as it is to be true. From now on, the first thing you should do when you read a clinical trial is go to the sample size calculation part of the paper, where they say what their hypothesized effect size was, and think about what you know about medicine, biology, and the world. What is the percentage chance, based on what you already know, that this hypothesis is correct?
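The two scenarios above (10 percent and one percent of hypotheses correct) follow from the same calculation. Here is a minimal sketch, assuming 90 percent power and a p-value threshold of 0.05 as in the text; the function name `chance_significant_result_is_true` is a hypothetical label of my own.

```python
def chance_significant_result_is_true(prior, power=0.90, alpha=0.05):
    """Chance that a statistically significant trial is a true positive.

    prior: fraction of tested hypotheses that are actually correct
    power: chance of a significant result when the hypothesis is correct
    alpha: chance of a significant result when the null hypothesis is correct
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


for prior in (0.10, 0.01):
    p_true = chance_significant_result_is_true(prior)
    false_to_true = (1 - p_true) / p_true
    print(
        f"prior {prior:.0%}: significant result is true {p_true:.0%} of the time; "
        f"false-to-true ratio {false_to_true:.1f}"
    )
```

With a 10 percent prior, a significant result is true about two-thirds of the time. With a one percent prior, false positives outnumber true positives by about 5.5 to one.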

Some effect sizes are manifestly implausible. If the prior probability of a given treatment effect size is zero, then the posterior probability must also equal zero, no matter what the p-value is. If you believe, as I do, that most ICU trial hypotheses are long shots, then you must also accept that the two most likely results from a clinical trial are no difference and a false-positive. This is a problem for all of medicine.

We are facing an epidemic of marginal p-values. There is a rapid upsurge in exciting findings in medical journals with marginal p-values. If you remember only one thing from this article, it should be this: P-values have no value. What we really should be thinking about is probability, particularly when it comes to treatments currently in clinical practice where there is practice variation. We want to have a higher probability of giving patients the right treatment for such comparative effectiveness situations. The p-value is not important. What is important is that we increase the probability of giving the patient the right treatment.

We should see opportunity wherever there is uncertainty. If we randomize patients into a clinical trial, there is an opportunity to shift our understanding of probability. Therapy should be randomized where there is idiosyncratic practice variation. Randomized treatment is the best treatment when it's not plausible for a clinician to tell whether treatment A or B is better based on his clinical experience. Physicians are operating under uncertainty and have cognitive biases, and that means they are likely to make bad choices.

Consider attribution bias: The physician remembers a time he gave one drug to one person who did well, and then he uses that drug for the rest of his career. We are also subject to novelty bias: Doctors like the newest and most expensive treatment; it must be better because it is newer and more expensive. Perhaps most pervasive of all, doctors are biased toward doing something when doing nothing might actually be the best choice. When we don't know what to do, we should be randomizing to treatment A or treatment B and to treatment or control.

## Randomized Treatment

There are likely collateral benefits from doing this. Clinical trials can put the best available standard care into a protocol and provide follow-up for patients who might not otherwise receive it. Randomized treatment is a good way of hedging one's bets and guarding against the cognitive biases that the doctor brings to the table. Not only that, but if we can also come up with a system where we learn rapidly from randomizing patients, then randomizing treatment is the best way to get the right treatment if you are ever sick again.

Figure: Graphical representation of the method of estimation of the chances that a statistically significant result represents a “true” positive based on 100 hypothetical trials where there is a 10 percent chance the hypothesis is correct; experiments are conducted with 90 percent power at an alpha of 0.05. In this example, where there is a 10 percent prior probability that the hypothesis is correct, each box represents a hypothetical trial. The top row of boxes (surrounded by a green line) represents the 10 occasions where the hypothesis is correct; the remaining 90 boxes represent the occasions when the null hypothesis is correct. One would expect in an experiment with 90 percent power to correctly identify nine of 10 correct hypotheses (the area shaded red). Because the alpha is defined as the probability of rejecting the null hypothesis when the null hypothesis is correct, one would also expect to incorrectly reject the null hypothesis on 4.5 of 90 occasions (the area shaded blue). As a result, with a 10 percent prior probability in an experiment with 90 percent power, a true positive result is expected 67 percent of the time when the p-value is less than 0.05.

We need to change the paradigm completely. It should not be research but clinical care and a system to optimize treatment that reliably improves outcomes over time. We have to use an alternative method, the Bayesian approach. Here we have an experimental hypothesis and a null hypothesis, we collect some data, and we calculate two probabilities: the probability of the data if the hypothesis is correct and the probability of the data if the null hypothesis is correct. The first divided by the second is the Bayes factor. Take the prior odds that the hypothesis is true and multiply them by the Bayes factor to get the posterior odds, which convert directly into the posterior probability that the hypothesis is true. We need big data sources that collect important outcomes for our patients, and then we need to randomize.
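The Bayesian update can be sketched in a few lines. Note that this sketch uses the odds form of Bayes' rule, in which the Bayes factor multiplies the prior odds rather than the prior probability; the function name `posterior_probability` is a hypothetical label. As an illustration only, it treats "the trial was statistically significant" as the sole piece of data, so the Bayes factor is power divided by alpha (0.90 / 0.05 = 18), using the figures from the earlier 100-trial example.

```python
def posterior_probability(prior_probability, bayes_factor):
    """Update a prior probability with a Bayes factor (odds form of Bayes' rule)."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# Assumption for illustration: the only data is "p < 0.05", so the Bayes factor
# is P(significant | hypothesis correct) / P(significant | null correct).
bayes_factor = 0.90 / 0.05

for prior in (0.10, 0.01):
    posterior = posterior_probability(prior, bayes_factor)
    print(f"prior {prior:.0%} -> posterior {posterior:.0%}")
```

A 10 percent prior becomes a posterior of about 67 percent, matching the frequency argument made earlier, and the same evidence applied to a one percent prior leaves the hypothesis more likely false than true.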

Using this Bayesian approach, we can then look at what has happened to patients and continually update the probability that particular treatments are better. Randomization can be unequivocally good for patients. We can start by randomizing patients in a 1:1 ratio, randomizing the aspects of ICU treatment that are subject to idiosyncratic practice variation, like blood pressure, oxygen, and nutrition therapy targets. The information can go into a database, and as it looks like particular treatments are better over time, we can continue to randomize, but also weight the coin.
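One way to picture "weighting the coin" is the sketch below. It is not a description of any particular trial platform: it assumes a binary outcome (survived or died), uniform Beta(1, 1) priors on each treatment's survival rate, and a hypothetical rule that allocates patients in proportion to the current probability that treatment A is better, capped so that both arms keep receiving patients. All function names, the patient counts, and the 10 percent floor are assumptions made for illustration.

```python
import random


def prob_a_better(surv_a, died_a, surv_b, died_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(survival rate of A > survival rate of B).

    Each arm's survival rate gets a Beta(1 + survivors, 1 + deaths) posterior,
    which is what a uniform prior gives after observing binary outcomes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + surv_a, 1 + died_a)
        rate_b = rng.betavariate(1 + surv_b, 1 + died_b)
        if rate_a > rate_b:
            wins += 1
    return wins / draws


def allocation_probability(p_a_better, floor=0.10):
    """Chance the next patient is assigned to A, kept within [floor, 1 - floor]."""
    return min(max(p_a_better, floor), 1 - floor)


# With no data, the coin is fair (close to 1:1).
print(allocation_probability(prob_a_better(0, 0, 0, 0)))

# Hypothetical interim data: A has 30 survivors and 10 deaths, B has 20 and 20.
p = prob_a_better(30, 10, 20, 20)
print(f"P(A better) = {p:.2f}, next patient gets A with probability "
      f"{allocation_probability(p):.2f}")
```

The floor is a design choice: it keeps some patients flowing to the arm that currently looks worse, so the system can still correct itself if the early data were misleading.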

As soon as we have information that makes it look like one treatment might be better than another, even if we don't know for sure, we can increase the probability that patients get treatments that work. Doing this creates a new paradigm where there is a fusion of quality improvement and science so that every patient contributes information that improves the care of every subsequent patient. It no longer matters what the p-value is, but we can stop randomizing patients once we have decided a treatment works.

We can also declare that two treatments are equal, and we can systematically use the cheapest treatment from that time on, move on, and study something else. Instead of having indeterminate clinical trials where we wonder if the trial was too small and underpowered, we will have statistically robust results that are never indeterminate.

Optimizing care is a priority for global public health. Those with an acutely-reversible, life-threatening illness will be cared for in an intensive care unit, and there is about a one-in-two chance of needing intensive care over your lifetime if you live in a developed country, so you want the best intensive care therapy you can get. Using this system could save money by not using ineffective treatments, could save lives by immediately incorporating effective treatments into standard care, and could increase the probability that patients will get the treatments that work even before we know for sure that they work. This is the way that research should be.