If you’ve ever read a scientific paper, you’ve probably seen a statement like, “There was a significant difference between the groups (P = 0.002)” or “there was not a significant correlation between the variables (P = 0.1138),” but you may not have known exactly what those numbers mean. Despite their widespread use, many people don’t understand what P values actually are or how to deal with the information that they give you, and understanding P values is vital if you want to be able to comprehend scientific results. Therefore, I am going to give a simple explanation of how P values work and how you should read them. I will try to avoid any complex math so that the basic concepts are easy for everyone to understand, even if math isn’t your forte.

*Note: for the sake of this post, I am not going to enter into the debate of frequentist (a.k.a. classical) vs. Bayesian statistics. Although Bayesian methods are becoming increasingly common, classical methods are still extremely prevalent in the literature, and it is, therefore, important to understand them.*

**Hypothesis testing and the types of means**

Before I can explain P values, I need to explain hypothesis testing and the difference between a sample mean and a population mean. To do this, I’ll use the example of turtles living in two ponds. Let’s say that I am interested in knowing whether or not the average (mean) size of the turtles in each pond is the same. So, I go to each pond and capture/measure several individuals. Obviously, however, I cannot capture all of the individuals at each pond (assume that there are hundreds in each pond), so instead I just collect 10 individuals from each pond. The mean carapace lengths (carapace = the top part of a turtle’s shell) from my samples are: pond 1 = 20.1 cm, pond 2 = 18.4 cm.

Now, the all-important question is: can I conclude that, on average, the turtles in pond 1 are larger than the turtles in pond 2? Maybe. The problem is that we don’t know whether or not my sample was actually representative of all of the turtles in the ponds. In other words, those numbers (20.1 cm and 18.4 cm) are *sample means*. They are the average values of my samples, and the sample means are clearly different from each other, but that doesn’t actually tell us much. It is entirely possible that, just by chance, I happened to capture more large turtles in pond 1 than pond 2.

To put this another way, we are actually interested in the *population means*. In other words, we want to know whether or not the mean for *all* of the turtles in pond 1 is the same as the mean of *all* of the turtles in pond 2, but since we can’t actually measure all of the turtles in each pond, we have to use our *sample means* to make an inference about the *population means*. This is where statistical testing and P values come in. The tests are designed to take our sample size, sample means, and sample variances (i.e., how much variation is in each set of samples) and use those numbers to tell us whether or not we should conclude that the difference between our sample means represents an actual difference between the population means.

For these statistical tests, we usually have two hypotheses: a *null hypothesis* and an *alternative hypothesis*. The null hypothesis states that there is no difference/relationship. So in our example, the null hypothesis says that the *population means* are not different from each other. Similarly, if we were looking for correlations between two variables, the null hypothesis would state that the variables are not correlated. In contrast, the alternative hypothesis says that the *population means* are different, the variables are correlated, etc.

**The P value**

Now that you understand the difference between the types of means and types of hypotheses, we can talk about the P value itself. For my fictional turtle example, the appropriate statistical test is the Student’s t-test (note: I got the values of 20.1 cm and 18.4 cm by using a statistical program to simulate two *identical* populations and randomly select 10 individuals from each population). When I ran a t-test on those data, I got P = 0.0597, but what does that mean? This is where things get a bit tricky. Despite what some people will erroneously tell you, the P value is not “the probability that you are correct” or “the probability that the difference is real.” Rather, the P value is the probability of getting a result of your observed difference/correlation strength or greater *if the null hypothesis is actually true*. So, in our example, the difference between our *sample means* was 1.7 cm (20.1 − 18.4 cm), and the null hypothesis was that the *population means* are identical. So, a P value of 0.0597 means that *if the population means are identical*, we will get a difference of 1.7 cm or greater 5.97% of the time (to turn a decimal probability into a percent, just multiply by 100). Similarly, for a correlation test, the P value tells you the probability of getting a correlation as strong or stronger than the one that you got if the variables actually aren’t correlated.

To demonstrate that this works, I took the same simulated ponds that I sampled the first time, and I made 10,000 random samples of 10 individuals from each population. For each sample, I calculated the difference between the sample means for pond 1 and pond 2, resulting in Figure 1. Out of 10,000 samples, 525 had a difference of 1.7 or greater. To put that another way, the two population means were identical and just by chance I got our observed difference or greater 5.25% of the time, which is extremely close to the calculated value of 5.97% (because the sampling was random, you wouldn’t expect the numbers to match perfectly).
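If you want to try a resampling experiment like this yourself, here is a minimal sketch in Python. The population parameters (a true mean of 19 cm and a standard deviation of 2 cm) are my own assumptions, since the post doesn’t give the simulated populations’ exact values; the point is the logic, not the specific numbers.

```python
import random
import statistics

random.seed(42)

# Assumed population parameters: both ponds share the SAME true mean
# and spread, so the null hypothesis is true by construction.
TRUE_MEAN, TRUE_SD = 19.0, 2.0
OBSERVED_DIFF = 1.7      # the difference between our two sample means
REPS, N = 10_000, 10     # 10,000 resamples of 10 turtles per pond

extreme = 0
for _ in range(REPS):
    pond1 = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    pond2 = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    if abs(statistics.mean(pond1) - statistics.mean(pond2)) >= OBSERVED_DIFF:
        extreme += 1     # a difference at least this large, in either direction

print(extreme / REPS)    # typically lands in the 0.05-0.07 range
```

Under these assumed parameters, the fraction of resamples with a difference of at least 1.7 cm comes out near the t-test’s P value, just as the figure described above shows.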

When you look at Figure 1, you may notice something peculiar: I included both differences of =/> 1.7 and =/< -1.7 in my 525 samples. Why did I include the negatives? The answer is that our initial question was simply “are the turtles in these ponds different?” In other words, our null hypothesis was “there is no difference in the population means” and our alternative was “there is a difference in the population means.” We never specified the *direction* of the difference (i.e., our question was not, “are turtles in pond 1 larger than turtles in pond 2?”). A non-directional question like that results in a *two-tailed test*. In other words, because we did not specify the direction of the difference, we were testing for a difference of size 1.7 cm in either direction, rather than testing the notion that pond 1 is 1.7 cm larger than pond 2.

You can do a *one-tailed test* in which you are only interested in differences in one direction, but there are two important things to note about that. First, your hypotheses are different. If, for example, you want to test the idea that turtles in pond 1 are larger than turtles in pond 2, then your null is, “the population mean of turtles in pond 1 is not larger than the population mean of turtles in pond 2” and your alternative is, “the population mean of turtles in pond 1 is larger than the population mean of turtles in pond 2.” Notice, this does not say anything about the reverse direction. In other words, your null is not that the means are equal, so a result that pond 2 is greater than pond 1 would still be within the null hypothesis and would not be considered statistically significant.

Second, if you are going to do a one-tailed test, you have to decide that you are going to do that *before you collect the data*. It is completely inappropriate to decide to do a one-tailed test after you have collected your data because it artificially lowers your P value by ignoring one half of the probability distribution. Look at the bell curve in Figure 1 again. You can see that just by chance you expect to get a result of +/-1.7 or greater 5.25% of the time, but if you ignore the differences on the negative side of the distribution (Figure 2), then suddenly you are looking at a probability of 3.95% because if the null hypothesis is true, getting a difference of =/> 1.7 is less likely than getting a difference that is either =/> 1.7 or =/< -1.7 (typically, one-tailed values are half of the two-tailed value, but because of chance variation, this sample came out with a slight negative bias). If you had a good biological reason for thinking that pond 1 would be greater than pond 2 *before you started*, then you could and should use the one-tailed test because it is more powerful, but you can’t decide to use it after looking at your data because that makes your result look more certain than it actually is (this is something to watch out for in pseudoscientific papers).
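The one-tailed/two-tailed distinction is easy to see in simulation. This sketch counts the same simulated differences two ways (again, the population parameters of 19 cm and 2 cm are my own assumed values for two identical populations):

```python
import random
import statistics

random.seed(7)

# Assumed parameters for two IDENTICAL populations (the null is true).
TRUE_MEAN, TRUE_SD, N, REPS = 19.0, 2.0, 10, 10_000

two_tailed = one_tailed = 0
for _ in range(REPS):
    pond1 = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    pond2 = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    diff = statistics.mean(pond1) - statistics.mean(pond2)
    if abs(diff) >= 1.7:
        two_tailed += 1   # a difference this big in EITHER direction
    if diff >= 1.7:
        one_tailed += 1   # only "pond 1 larger than pond 2" counts

# The one-tailed count is roughly half the two-tailed count, which is
# exactly why switching to a one-tailed test after the fact is cheating.
print(two_tailed / REPS, one_tailed / REPS)
```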

**What does statistical significance mean?**

At this point, I have explained what P values mean in technical terms, but the question remains: what do they mean in practical terms? In our example, we got a P value of 0.0597, but what does that actually mean? In short, we use various cutoff points (known as alpha [α]) to determine whether or not the P value is “statistically significant.” What α you use depends on your field and question, but it always has to be defined before the start of your experiment. In biology, α = 0.05 is standard, but other fields use 0.1 or 0.01. If your P value is less than your α, you reject the null hypothesis, and if your P value is equal to or greater than your α, you fail to reject the null. In other words, if your α = 0.05 and your P value is less than that, your conclusion would be that the observed difference between your sample means probably represents a true difference between your population means rather than just chance variation in your sampling. Conversely, if your P value was 0.05 or greater, you would conclude that there was insufficient evidence to support the conclusion that the differences between the sample means represented a real difference between the population means. This is not the same thing as concluding that there is no difference between the population means (more on that later).
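The decision rule itself is mechanical; here is a tiny sketch (the function name is mine):

```python
def decision(p_value, alpha=0.05):
    """Compare a P value to a pre-chosen alpha. Note that failing to
    reject the null is NOT the same as accepting it."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decision(0.002))    # reject the null hypothesis
print(decision(0.0597))   # fail to reject the null hypothesis
```

Note that with α = 0.05, our turtle result (P = 0.0597) falls just on the "fail to reject" side of the line.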

It should be clear by this point that we are dealing with probabilities, not proofs. In other words, we are reaching conclusions about what is most likely true, not what is definitely true. The astute reader will realize that if a P value of 0.049 means that you will get that result by chance 4.9% of the time if the null hypothesis is actually true, then for every 20 tests with a P value of 0.049, one of them will be a false result (on average). This is what we refer to as a type I error. It occurs when you reject the null but should have actually failed to reject the null, and it is the reason that we like to have small α values: the larger the α, the higher the type I error rate (I explained type I errors in far more detail here). This also explains why some published results are wrong, even if the authors did everything correctly, and it once again demonstrates the importance of looking at a body of literature rather than an individual study.

Now, you may be thinking that we should try to make the α values very tiny, that way we rarely get false positives, but that creates the opposite problem. If the α is tiny, then there will be many meaningful differences which get ignored (this is known as a type II error). Thus, the standard α of 0.05 is a balance between type I and type II error rates.
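One way to see the type I side of this trade-off: when the null hypothesis is true (and the test statistic is continuous), P values are uniformly distributed between 0 and 1, so the fraction of tests falling below α is simply α itself. A quick sketch of that fact:

```python
import random

random.seed(0)

# When the null is true, P values from a continuous test statistic are
# uniform on [0, 1], so drawing uniform numbers mimics 100,000 tests
# of a true null hypothesis.
p_values = [random.random() for _ in range(100_000)]

for alpha in (0.01, 0.05, 0.10):
    false_positive_rate = sum(p < alpha for p in p_values) / len(p_values)
    print(alpha, round(false_positive_rate, 3))  # closely tracks alpha
```

In other words, your chosen α literally *is* your long-run false positive rate on true nulls, which is why shrinking it reduces type I errors but (by making significance harder to reach) raises type II errors.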

**Statistical significance and biological significance are not the same thing**

This is an extremely important point that is true regardless of whether or not you got a statistically significant result. For example, let’s say that chemical X is dangerous at a dose of 0.5 mg/kg, and you do a study comparing people who take pharmaceutical Y to people who don’t, and you find that people who take Y have an average of 0.2 mg/kg and people who don’t take Y have an average of 0.1 mg/kg, and the difference is statistically significant. That doesn’t in any way, shape, or form show that Y is dangerous because the levels of X are still lower than 0.5 mg/kg. In other words, the fact that you got a significant difference does not automatically mean that you found something that is biologically relevant. The different levels of X may actually have no impacts whatsoever on the patients.

Conversely, if you did not get a significant difference, that would not automatically mean that there isn’t a meaningful difference. Statistical power (i.e., the ability to detect significant differences/relationships) is very strongly dependent on sample size. The larger the sample size, the greater the power. Consequently, if the *population means* are very different from each other, then you can get a significant result with a small sample size, but if the population means are very similar, then you are going to need a very large sample size. This is part of why you fail to reject the null, rather than accepting the null. There may be an actual difference between your population means, but you just didn’t have a large enough sample size to detect it.

For example, let’s say that a drug causes a serious reaction in 5 out of every 1000 people and that “reaction” already occurs in 1 out of every 1000 people (those are the population ratios), but when you test the drug, you only use sample sizes of 1000 people in the control group and 1000 people in the experimental group, resulting in sample ratios of 6/1000 and 1/1000. When you run that through a statistical test (in this case a Fisher’s exact test) you get P = 0.1243. So, you would fail to reject the null hypothesis even though the drug actually does cause the reaction that you were testing. In other words, the drug does cause adverse reactions, but your sample size was too small to detect it. If, however, your sample sizes had included 2000 people in each group, and you had gotten the same ratios, you would have had a significant difference (P = 0.0128) because those extra samples increased the power of your test. This is why scientists place so much weight on large sample sizes and so little weight on small sample sizes. Research that uses tiny sample sizes is extremely unreliable and should always be viewed with caution.
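Those drug-reaction numbers can be checked with a small, self-contained implementation of the two-sided Fisher’s exact test. This sketch uses the standard “sum all tables at least as improbable as the observed one” rule; the function name is mine.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]
    (e.g. reactions / no reactions in two groups)."""
    n1, n2 = a + b, c + d        # group sizes
    k_total = a + c              # total "successes" across both groups
    # All possible tables share a common denominator, so we can compare
    # exact integer numerators and avoid floating-point error entirely.
    weights = [comb(n1, k) * comb(n2, k_total - k)
               for k in range(max(0, k_total - n2), min(k_total, n1) + 1)]
    observed = comb(n1, a) * comb(n2, c)
    return sum(w for w in weights if w <= observed) / sum(weights)

# 6/1000 reactions vs 1/1000 with n = 1000 per group
p_small = fisher_exact_two_sided(6, 994, 1, 999)
# Same sample proportions with n = 2000 per group
p_large = fisher_exact_two_sided(12, 1988, 2, 1998)
print(round(p_small, 4), round(p_large, 4))  # ≈ 0.1243 and ≈ 0.013
```

With 1000 people per group the test cannot reach significance, but doubling the sample size at the same proportions pushes P below 0.05, just as described above.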

Finally, it is worth noting that the population means of any two groups will nearly always be different, but that difference may not be meaningful. Going back to my turtle example, for any two ponds, if I measured *all* of the turtles in each pond, it is extremely unlikely that the two means would be identical. There is almost always going to be some natural variation that makes them slightly different, but, with a large enough sample size, you can detect even a very tiny difference. So, for example, if I had two turtle ponds whose population means were 18.01 cm and 18.02 cm, and I sampled several million turtles from each pond, I could actually find that there is a statistically significant difference between those ponds, even though the actual difference is extremely tiny and is not a meaningful difference between the two ponds. My point is simply that the fact that a study found a statistically significant result does not automatically mean that they found a meaningful result, so you should take a good hard look at their data before drawing any conclusions.

**What do you do with a non-significant result?**

The question of what to do with non-significant results is a complicated one, and it is probably the area where most people mess up. For various reasons (some of which I discussed above), you’re never supposed to accept the null hypothesis; rather, you fail to reject it. In other words, you simply say that *you did not detect* a difference rather than saying that *there is no* difference (in reality there is nearly always a difference, it just might not be a meaningful one). In practice, however, there are many situations in which you have to act as if you are accepting the null hypothesis. For example, let’s say that you are comparing two methods, one of which is well established but expensive, while the other is untried and cheap. You do a large study and you don’t find any significant differences between those two methods. As a result, scientists will begin using the cheap method and they will cite your paper as evidence that it is just as good as the expensive method.

Drug trials present a similar dilemma. Let’s say that we are trying a new drug and we find that there are no significant differences in side effect X between people who take it and don’t take it. The FDA, doctors, and general public will treat that result as, “the drug does not cause X” which is essentially accepting the null.

So how do we solve this problem? Do all drug trials violate a basic rule of statistics? No, the key here is sample size and statistical power. Remember from the section above that nearly all real population means will be different, but the difference may be very slight and not meaningful. So, when we accept that a novel method is as effective as the old, for example, we aren’t actually saying that there are *no* differences between the two. Rather, we are saying that there are probably no differences *at the effect size that we were testing*. To put this another way, we would say that the current evidence does not support the conclusion that they are different.

This may seem confusing, but you can think of it like a jury decision. We don’t declare someone “guilty” or “innocent.” Rather, we declare them “guilty” or “not guilty.” The “not guilty” verdict is essentially the same thing as failing to reject the null. It doesn’t mean that the person definitely didn’t commit the crime. Rather, it simply means that we do not have the evidence to conclude that they did commit the crime; therefore, we are going to treat them as if they didn’t.

Jumping back to science, ideally you should do something called a power analysis. This shows you what size difference you would be able to detect given your sample sizes, variance, and the level of power that you are interested in. So, let’s say that during the methods comparison test, anything less than a 0.01 difference between the two methods would be good enough to consider the new one reliable, and you had the statistical power to detect a difference of 0.001. That would mean that although there may be some very tiny difference between the methods, that difference is less than a difference that you would care about, and you had the power to detect meaningful differences. Similarly, if you are doing a drug trial and you have the power to detect a side effect rate of 1 in every 10,000 people, then you cannot conclude that the drug doesn’t cause that side effect, but you can say that if it does cause that side effect, it probably does so at a rate of less than 1 in 10,000 (note [added 1-1-16]: as a general rule, power analysis should be done before conducting the study in order to determine what sample size will be necessary to detect a desired effect size).

All of this connects back to the importance of sample size. If you have a small sample size, then you won’t be able to detect small differences. Let’s say, for example, that a drug trial found improvements in 40 out of 100 people in the control group and 50 out of 100 people in the experimental group. That would result in a P value of 0.2008, which is not statistically significant, but that test would not have much power. As a consequence, that result is not very helpful. It could be that the drug simply doesn’t work, but it could also be that it does have an important effect, and this study just didn’t have a big enough sample size to detect it. Therefore, I am personally very hesitant to use results like this as evidence one way or the other, and I think that when you have results like this, it is best to wait for more evidence before you try to say that there are no meaningful differences.
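To illustrate, here is a hedged sketch that simulates that trial design many times at assumed true improvement rates of 40% and 50%, and asks how often a test would reach significance at each sample size. It uses a simple pooled two-proportion z-test as a stand-in for the test used in the post, so the numbers are approximate; all names are my own.

```python
import math
import random

random.seed(1)

def two_prop_p(x1, n1, x2, n2):
    """Two-sided P value from a pooled two-proportion z-test
    (a simple approximation of the chi-square-style tests above)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(x1 / n1 - x2 / n2) / se
    # Normal CDF via math.erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def simulated_power(rate_ctrl, rate_trt, n, reps=2000, alpha=0.05):
    """Fraction of simulated trials of size n per group that reject the null."""
    hits = 0
    for _ in range(reps):
        x_ctrl = sum(random.random() < rate_ctrl for _ in range(n))
        x_trt = sum(random.random() < rate_trt for _ in range(n))
        if two_prop_p(x_ctrl, n, x_trt, n) < alpha:
            hits += 1
    return hits / reps

power_100 = simulated_power(0.40, 0.50, 100)   # low power: only ~1 in 3 trials detects the effect
power_500 = simulated_power(0.40, 0.50, 500)   # much higher power at n = 500 per group
print(power_100, power_500)
```

Even though the drug "works" in every simulation (the true rates really are 40% vs. 50%), the 100-per-group design usually fails to detect it, which is exactly the problem described above.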

Some people, however, fall for the opposite pitfall. On several occasions, I have encountered people who look at studies with small sample sizes like this and say, “there isn’t enough power to actually test for differences, therefore we should just go with the raw numbers and assume that there is a difference.” This is completely, 100% invalid. Think back to the very start of this post: the whole reason that we do statistics is because without them we can’t tell if a result is real or just the result of chance variation. So you absolutely cannot blindly assume that the difference is real. If you don’t have enough power to do a meaningful test, then you simply cannot draw any conclusions.

**Conclusion**

This post ended up being quite a bit longer than planned, so let me try to sum things up with a bullet list of key points (note: for simplicity, I will talk about means, but the same is true for proportions, correlations, etc., so you can replace “no difference between the means” with “no relationship between the variables,” “no difference between the proportions,” etc.).

- In science, you nearly always sample subsets of the total groups that you are interested in (these are sample means)
- The means of those subsets will nearly always be different, but you actually want to know whether or not the means of the entire groups are different (these are the population means)
- You have two hypotheses. The null says that there is no difference between the population means, and the alternative says that there is a difference between the population means
- The P value is the probability of getting a result where the difference between the sample means is equal to or greater than the difference that you observed, if the null hypothesis is actually true
- The larger your sample size, the greater your ability to detect differences
- If you get a statistically significant result, you reject the null hypothesis; whereas, if you don’t get a significant result, you fail to reject the null hypothesis
- Statistical significance is not the same as biological significance
- Nearly all population means will be different from each other, but that difference may not be meaningful. Therefore, you cannot conclude that no difference exists, but you can provide evidence that if a difference exists, it is very small (or at least smaller than the effect size that you were testing)

*Other posts on statistics:*

- Basic Statistics Part 1: The Law of Large Numbers
- Basic Statistics Part 2: Correlation vs. Causation
- Basic Statistics Part 3: The Dangers of Large Data Sets: A Tale of P values, Error Rates, and Bonferroni Corrections

Very nice description. Should be worth 3 CE units at least. You may want to capitalize the word “student’s” in “student’s T Test” lest someone think you’re talking about an education-level tool, instead of the inventor’s (pseudo) name.


I am slightly unclear if you are advocating a post-hoc power analysis. It should be stressed that power analysis should not be undertaken post hoc on a test that failed to reject the null. This is a pernicious, incorrect, but strangely popular belief amongst biologists and psychologists. The reason it does not work is that any frequentist test that fails to reject the null will always have low power. Power should only be calculated a priori, perhaps after a pilot study or when planning a study. It does not work as a post-hoc analysis. Confidence intervals are a post-hoc way of looking at practical significance (perhaps adjusted for multiple comparisons if necessary).


As a general rule, I agree with you and should have made that more clear. I have added a line to attempt to clarify.
