Lately, social media has been flooded with people sharing studies about various aspects of COVID. This is potentially great. I’m all for people being more engaged with science. Unfortunately, many people lack a good foundation for understanding science, and a common point of confusion is the meaning of “statistically significant.” I’ve written about this at length several times before (e.g., here and here), so for this post, I’m just trying to give a brief overview to hopefully clear up some confusion. In short, “statistically significant” means that there is a low probability that a result as great or greater than the observed result could arise if there is actually no effect of the thing being tested. Statistical tests are designed to show you how likely it is that the sample in a study is representative of the entire population from which the sample was taken. I’ll elaborate on what that means below (don’t worry, I’m not going to do any complex math).
Let’s imagine a randomized controlled drug trial where we take 100 patients, randomly split them into two groups (50 people each), give one group a placebo, give the other group the drug, then record how many of them develop a particular disease over the next month. In the control group, 20% of patients (10 individuals) develop the disease, whereas in the treatment group only 10% (5 patients) developed it. Does the drug work?
This is where the confusion comes in. Many people would look at those results and say, “obviously it helped, because 10% is lower than 20%”. When we do a statistical test on it (in this case a chi-square test), however, we find that it is not statistically significant (P = 0.263), from which we should conclude that this study failed to find evidence that the drug prevents the disease. You may be wondering how that is possible. How can we say that taking the drug doesn’t result in an improvement when there was clearly a difference between our groups? How can 10% not be different from 20%?
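For readers who want to check the arithmetic, here is a minimal sketch of that test in plain Python. The post only says “chi-square test”; the Yates continuity correction used here is an assumption, chosen because it reproduces the quoted P value. The function name `chi_square_2x2` is made up for illustration.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table (1 degree of freedom).

    table = [[a, b], [c, d]]; rows are groups, columns are outcomes.
    Returns (chi-square statistic, P value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction (assumed, see above)
            stat += diff ** 2 / expected
    # With 1 degree of freedom, P(chi-square > stat) equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Rows: placebo, drug. Columns: got the disease, stayed healthy.
stat, p = chi_square_2x2([[10, 40], [5, 45]])
print(f"chi-square = {stat:.3f}, P = {p:.3f}")  # P comes out near 0.263
```

The expected count in each “got the disease” cell is 7.5, so the observed 10 and 5 are only 2.5 cases away from what no effect at all would predict.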
To understand this, you need to understand the difference between a population and sample and the reason that we do these tests. This hypothetical experiment did find a difference between the groups for the individuals in the study. In other words, the treatment and control groups were different in this sample, but that’s not very useful. What we really want to know is whether or not this result can be generalized. We really want to know whether, in general, for the entire population, taking the drug will reduce your odds of getting the disease.
To elaborate on that, in statistics, we are interested in the population mean (or percentage). This may be a literal population of people (as in my example) but it applies more generally, and is simply the distribution of data from which a sample was taken. The only way to actually know the population mean (or percentage) is to test the entire population, but that is clearly not possible. So instead, we take a sample or subset of the population, and test it, then apply that result to the population. So, in our example, those 100 people are our sample, and the percentages we observed (10% and 20%) are our sample percentages.
I know this is starting to get complicated, but we are almost there, so bear with me. Now that we have sample percentages we want to know how confident we can be that they accurately represent the population percentages. This is where statistics come in. We need to know how likely it is that we could get a result like ours or greater if there is no effect of the drug, and that’s precisely what statistical tests do. They take a data set and look at things like the mean (or proportion), the sample size, and the variation in the data, and they determine how likely it is that a result as great or greater than the one that was observed could have arisen if there is no effect of treatment. In other words, they assume that the treatment (drug in this case) does not do anything (i.e., they assume that all results are from chance), then they see how likely it is that a result as great or greater than the observed result could arise given that assumption. Sample size becomes important here, because the larger the sample size, the more confident we can be that a sample result reflects the true population value.
So, in our case, we got P = 0.263. What that means is that if the drug doesn’t do anything, there is still a 26.3% chance of getting a result as great or greater than ours. In other words, even if the drug doesn’t work, there is a really good chance that we’d get the type of difference we observed (10% and 20%). Thus, we cannot be confident that our results were not from chance variation, and we cannot confidently apply those percentages to the entire population.
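One way to make “if the drug doesn’t do anything” concrete is to simulate it. The sketch below is an illustration, not the test used in the post: it assumes the 15 people who got sick would have got sick in either group, deals the 100 patients into two groups at random many times, and counts how often the groups end up 5 or more cases apart. The function name, the number of shuffles, and the seed are all arbitrary choices.

```python
import random

def shuffle_p_value(cases=15, group_size=50, observed_gap=5, trials=20_000, seed=1):
    """Estimate how often chance alone produces a gap at least as large as observed.

    Assumes the drug does nothing, so the same people get sick
    no matter which group they land in.
    """
    rng = random.Random(seed)
    total = 2 * group_size
    hits = 0
    for _ in range(trials):
        # Randomly pick the drug group; people numbered 0..cases-1 are the sick ones.
        drug_group = rng.sample(range(total), group_size)
        sick_on_drug = sum(1 for person in drug_group if person < cases)
        sick_on_placebo = cases - sick_on_drug
        if abs(sick_on_placebo - sick_on_drug) >= observed_gap:
            hits += 1
    return hits / trials

print(f"share of shuffles with a gap of 5+ cases: {shuffle_p_value():.3f}")
```

The share should land close to the 26.3% quoted above, give or take a little simulation noise, which is the point: a gap this size turns up about a quarter of the time with no drug effect at all.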
Having said that, let’s see what happens if we increase the sample size. Imagine we have 1,000 people, but still get 10% for the treatment group and 20% for the control group. Now we get a highly significant result of P = 0.00001. In other words, if the drug doesn’t do anything, there is only a 0.001% chance of getting a difference as great or greater than the one we observed. Why? Well, quite simply, the larger the sample the more representative it is of the population, and the less likely we are to get spurious results. From this, we’d conclude that the drug probably does have an effect.
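To see how the same 10% vs 20% split behaves as the trial grows, here is a rough sketch that reruns the test at a few sample sizes. It assumes equal-sized groups and a Yates-corrected chi-square test (the post does not spell out either detail), and `yates_p` is a hypothetical helper name.

```python
import math

def yates_p(a, b, c, d):
    """P value for the 2x2 table [[a, b], [c, d]] (Yates-corrected chi-square, 1 df)."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return math.erfc(math.sqrt(numerator / denominator / 2))

for total in (100, 200, 500, 1000):
    per_group = total // 2
    sick_placebo = per_group * 20 // 100  # 20% of the placebo group
    sick_drug = per_group * 10 // 100     # 10% of the drug group
    p = yates_p(sick_placebo, per_group - sick_placebo,
                sick_drug, per_group - sick_drug)
    print(f"{total:>5} patients: P = {p:.6f}")
```

The percentages never change, only the number of patients behind them, and the P value shrinks steadily as the sample grows.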
To try to illustrate all of this, imagine flipping a coin. You want to know if a coin is biased, so you flip it 10 times and get 4 heads and 6 tails. That is your sample: 40% heads. Now, is the coin biased? In other words, if you flipped the coin 10,000 times, would you expect, based on your sample, that you’d get roughly 40% heads? How confident are you that your sample result applies to the population? You probably aren’t very confident. We all intuitively know that it is entirely possible for even a totally fair coin to give 4 heads and 6 tails in a mere 10 flips. No one would scream, “but in your test the coin was biased!” We all realize that the sample may not be representative of the population.
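That intuition is easy to check directly. This small sketch lists every possible outcome of 10 flips of a fair coin and adds up the ones that are at least as lopsided as 4 heads and 6 tails. It uses the exact binomial probabilities rather than any particular statistical test.

```python
from math import comb

flips = 10
# Probability of each possible number of heads from a fair coin.
prob = {heads: comb(flips, heads) / 2 ** flips for heads in range(flips + 1)}

# "As lopsided as 4/6 or more" means 4 or fewer heads, or 6 or more heads.
p_lopsided = sum(p for heads, p in prob.items() if abs(heads - flips / 2) >= 1)
print(f"P(4/6 split or more lopsided | fair coin) = {p_lopsided:.3f}")  # 0.754
```

A perfectly fair coin misses an exact 5/5 split about three times out of four, so 4 heads in 10 flips tells you almost nothing.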
Now, however, imagine that you flip it 100 times and get 40 heads and 60 tails. This is your new sample. Now how confident are you? Probably more confident than before, but also probably not that confident. Again, we all realize that there is chance variation. Indeed, if we ran the actual stats on this, we’d get P = 0.2008. In other words, this test says, “assuming that the coin is not biased, there is a 20.08% chance of getting a difference as great or greater than a 40%/60% split.” But what if we did 1,000 flips and got 400 heads and 600 tails? Now, we’d probably think that the coin truly was biased. At that point, we’d expect that our sample probably does apply to the population and that continuing to flip the coin will continue to yield percentages of roughly 40% and 60%. If we actually run the stats on that, our conclusion would be justified. The P value is less than 0.00001, meaning that if the coin is not biased, there is less than a 0.001% chance of getting a result as great or greater than ours. This would be good evidence that the coin itself (the population) is likely biased, and our results are unlikely to be from chance variation in our sample.
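The two P values above can be reproduced under one assumption about how they were computed: a Yates-corrected chi-square test on a 2x2 table that compares the observed flips with an evenly split reference row of the same size. The post does not say which test was run, so treat this as a guess that happens to match the numbers; `yates_p` is an invented helper name.

```python
import math

def yates_p(a, b, c, d):
    """P value for the 2x2 table [[a, b], [c, d]] (Yates-corrected chi-square, 1 df)."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return math.erfc(math.sqrt(numerator / denominator / 2))

observed = {100: (40, 60), 1000: (400, 600)}
for flips, (heads, tails) in observed.items():
    # Second row is the assumed reference: an even split of the same number of flips.
    p = yates_p(heads, tails, flips // 2, flips // 2)
    print(f"{flips:>4} flips, {heads}/{tails}: P = {p:.6f}")
```

A one-sample test against a fixed 50% (an exact binomial test, for instance) is another way to ask the same question and would not necessarily return the same figures.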
That is, in a nutshell, what statistical tests (at least frequentist statistical tests) are doing, and we only consider something to be “statistically significant” when the probability that a result like it (or greater) could arise (given the assumption that there is no effect of treatment) is below some pre-defined threshold. In most fields, that threshold is P = 0.05. In other words, we only conclude that the sample results apply to the population if there is less than a 5% chance of getting a result as great or greater than the one we observed if the thing being tested actually has no effect.
Note: An important topic not covered here is confidence intervals. These show you the range of plausible population values for a given sample value. So, for example, if you had a mean of 20 and a 95% confidence interval of 10-30, that would mean that the method used to build the interval captures the true population mean in 95% of repeated samples, so 10 to 30 is a reasonable range for where the population mean lies.
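To tie this back to the drug trial in the post, here is a rough sketch of a 95% confidence interval for the difference between the two groups’ disease rates. It uses the simple normal-approximation (Wald) formula, which is an assumption on my part and only one of several ways to build such an interval; `diff_ci_95` is a made-up name.

```python
import math

def diff_ci_95(sick_a, n_a, sick_b, n_b):
    """Approximate 95% confidence interval for the difference in two proportions."""
    p_a, p_b = sick_a / n_a, sick_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = 1.96 * se  # 1.96 standard errors covers about 95% of a normal curve
    diff = p_a - p_b
    return diff - margin, diff + margin

for per_group in (50, 500):
    low, high = diff_ci_95(per_group * 20 // 100, per_group,
                           per_group * 10 // 100, per_group)
    print(f"{per_group} per group: placebo rate minus drug rate "
          f"is between {low:+.3f} and {high:+.3f}")
```

With 50 patients per group the interval includes zero (no difference), which matches the non-significant result; with 500 per group it sits entirely above zero.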
Note: This post was revised to change statements that a P value showed the probability that a result as great or greater than the one that was observed could arise by chance to statements that the P value showed the probability that a result as great or greater than the one that was observed could arise if there is no effect of the treatment.
Yes. The strength of science is not in giving us certainties.
Its specialty is giving us probabilities for
with them have the power to make better decisions.
There’s also the issue of statistically significant and clinically significant. That’s something else people don’t understand.
I’m going to share this on Facebook.
Indeed, some of my other posts talk about that important topic.
You mean, size effect?
however, we find that it is not statistically significant (P = 0.263)
So, in our case, we got P = 0.293
OK, you have coin flips that come out 40 heads and 60 tails. Then you look to see if some other result is “great or greater than a 40%/60% split.” I sincerely don’t understand. What would be greater? A 30%/70% split? A 50%/50% split? A 60%/40% split? Which direction is greater?
Anything that would result in a greater difference in your groups. So, for example, 39% heads vs 61% tails would be greater, as would 61% heads vs 39% tails. Technically speaking, it is the test statistic that would be greater, but practically, the easy way to think about it is that the difference between groups becomes greater. So, in my examples, the difference went from a spread of 20 percentage points to a spread of 22 percentage points. Basically, it’s results that make the coin seem even more biased.
(note: this gets into a tangential topic, but if the original question was simply “is this coin biased?” then it could be biased in either direction, which is why either heads or tails could be 39%. If your original hypothesis was simply that the coin is biased towards tails, then we would do a slightly different test, and greater would only be in one direction [e.g., 39% heads vs 61% tails would be greater, but 61% heads vs 39% tails would not])
Let’s see if I have this right. That which is “great or greater” is the difference between the two results (heads or tails). It doesn’t matter if the number of heads or tails increases. There is, in other words, no “lesser.” I need to do some rereading and thinking. I appreciate your response. [“a result as GREATER or GREATER than the observed result” is a slip-up, right?]
A lesser result would be anything that reduces the difference between the groups (i.e., makes the coin less biased). So, a 41/59 split would be a smaller difference (lesser).
To put that another way, if you flip a coin 100 times you aren’t surprised to get 45 heads and 55 tails. That is only a small difference between the groups, and thus a small difference from the expected result if the coin is fair (50/50). The odds of getting a result like that or greater are very high. In contrast, if you got 20 heads and 80 tails (a substantially greater result) you would be surprised and would conclude that the coin probably is biased, because there is a very low probability that such a large difference (or greater) could occur if the coin is fair.
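Those two cases can be put into numbers with a short sketch. It uses the exact binomial probabilities for a fair coin and counts every outcome at least as far from 50/50 as the one observed, in either direction; the function name is just for illustration.

```python
from math import comb

def prob_at_least_as_lopsided(heads, flips=100):
    """Chance a fair coin strays at least as far from an even split as `heads` does."""
    gap = abs(heads - flips / 2)
    favourable = sum(comb(flips, k) for k in range(flips + 1)
                     if abs(k - flips / 2) >= gap)
    return favourable / 2 ** flips

for heads in (45, 20):
    print(f"{heads}/{100 - heads}: {prob_at_least_as_lopsided(heads):.2g}")
```

A 45/55 split or worse is an everyday event for a fair coin, while a 20/80 split or worse is so rare that a biased coin is the far more believable explanation.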