“Correlation does not equal causation.” It is a phrase that everyone has probably heard, but many people seem to ignore or misunderstand it. Indeed, although useful, the phrase itself can be misleading because it often leads to the misconception that correlation can never equal causation, when in reality, there are situations in which you can use correlation to infer causation. I’ve written about this topic before, but it is really important, so I want to revisit it and explain why correlation does not automatically equal causation as well as the situations in which it does indicate causation.
Why correlation doesn’t always equal causation
First, we need to deal with what correlation is and why it does not inherently signal causation. When two things are correlated, it simply means that there is a relationship between them. This relationship can either be positive (i.e., they both increase together) or negative (i.e., one increases while the other decreases). To put that in a more technical way, we could say that when two variables are correlated, the variation in one variable explains or predicts the variation in the other variable (or at least part of the variation, assuming that the correlation isn’t perfect). Thus, if variables X and Y are positively correlated, then when X increases, Y should increase as well (on average); whereas if they are negatively correlated, then as X increases, Y should decrease (again, on average).
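For readers who like to see the arithmetic, here is a minimal sketch of how a correlation coefficient (Pearson's r) can be computed. The helper name `pearson_r` and the tiny data sets are made up purely for illustration. Squaring r gives the share of the variation in one variable that a straight-line relationship with the other accounts for.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and the variation of each on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data, not real measurements.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]      # tends to rise with x (imperfectly)
y_down = [10, 8, 6, 4, 2]   # falls as x rises (perfectly linear)

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(f"positive example: r = {r_up:.2f}, r^2 = {r_up ** 2:.2f}")  # r = 0.77, r^2 = 0.60
print(f"negative example: r = {r_down:.2f}")                       # r = -1.00
```

Note that nothing in this calculation knows or cares which variable (if either) is the cause. Swapping the two arguments gives exactly the same r.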
Now, when X and Y are correlated (we’ll say positively correlated in this case), why can’t we automatically assume that the change in X is causing the change in Y? After all, if every time that X goes up, Y goes up as well, doesn’t that indicate that the change in X is causing the change in Y? Actually, no, it doesn’t. There are essentially four possible explanations for why X and Y would change together (see note at the end):
- X is causing Y to change
- Y is causing X to change
- A third variable (Z) is causing both of them to change
- The relationship isn’t real and is being caused by chance
As you can hopefully now see, there are multiple possibilities and you can’t jump to the conclusion that X is causing Y. Further, a correlation on its own usually can’t tell you which of these four possibilities is at work.
Nevertheless, there are some helpful examples where the spurious nature of the correlation is pretty clear, and those examples are useful for illustrating why correlation doesn’t automatically equal causation. One of my personal favorites is the correlation between ice cream sales and drowning. As ice cream sales increase, so do drowning accidents. Does that mean that eating ice cream is causing people to drown? Of course not. When you scrutinize the data, it quickly becomes clear that a third variable (time of year/temperature) is driving both the drowning accidents and the ice cream sales (i.e., people both swim more often and eat more ice cream when it is hot, resulting in a correlation between drowning and eating ice cream that is not at all causal).
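The ice cream example can be sketched as a small simulation. Everything here is an assumption made for illustration: the numbers are invented (they are not real sales or drowning data), and `pearson_r` and `residuals` are hypothetical helper names. In this toy world, temperature drives both variables and ice cream has no effect on drowning at all, yet the two still correlate until temperature is accounted for.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def residuals(ys, zs):
    """What is left of ys after removing a least-squares straight-line fit on zs."""
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up daily data: temperature (Z) drives both variables independently.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

raw_r = pearson_r(ice_cream, drownings)
# Correlation of what remains once temperature is accounted for in both.
adjusted_r = pearson_r(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f"raw correlation:              {raw_r:.2f}")
print(f"after removing temperature:   {adjusted_r:.2f}")
```

In this sketch the raw correlation comes out strongly positive, while the adjusted one sits near zero, which is what you would expect when a third variable is doing all of the work.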
Additionally, sometimes two things really do correlate tightly just by chance. The website tylervigen.com has collected a bunch of these, such as the comical correlation between the number of films that Nicolas Cage stars in and the number of drowning accidents in a given year (everything correlates with drowning for some reason).
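To see how easily chance alone can produce a tight correlation, here is a rough sketch that compares one series of pure random noise against thousands of other, completely unrelated random series and keeps the best match. The setup (10 "years" of data, 5,000 candidate series) is an assumption chosen for illustration and has nothing to do with how any particular website gathers its examples.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_years = 10
n_candidates = 5000

# One "target" series and many candidates, all independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_years)]
candidates = [[rng.gauss(0, 1) for _ in range(n_years)]
              for _ in range(n_candidates)]

strengths = [abs(pearson_r(target, c)) for c in candidates]
best_r = max(strengths)
strong_share = sum(s > 0.6 for s in strengths) / n_candidates

print(f"strongest |r| found among unrelated series: {best_r:.2f}")
print(f"share of unrelated series with |r| > 0.6:   {strong_share:.1%}")
```

None of these series has anything to do with the target, yet with short series and enough candidates to search through, a very strong match is all but guaranteed to turn up.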
Examples like that are pretty funny and obvious, but when it comes to pushing an agenda, people often forget just how easy it is for spurious correlations to arise. For example, the anti-vaccine movement likes to cite a correlation between the “rise” in autism rates (see note at end) and increases in the number of vaccines that children receive. The problem is, of course, that this relationship could exist entirely by chance. Indeed, anything that has increased in recent years will correlate with increased autism rates. Thus, things like cell phone use, time spent in front of a screen, etc. will also correlate. Indeed, even things like the sale of organic food correlate with autism.
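Here is a minimal sketch of why anything that has risen over the same period will correlate. The two series below are invented (they are not autism, vaccine, or sales data), they are generated independently of one another, and the only thing they share is an upward trend. Comparing year-to-year changes instead of raw levels is one simple check, which I am using here purely as an illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def yearly_changes(series):
    """Differences between consecutive years."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
years = range(30)

# Two made-up series that both rise over time but never influence each other.
series_a = [5.0 * t + rng.gauss(0, 10) for t in years]
series_b = [0.3 * t + rng.gauss(0, 1) for t in years]

r_levels = pearson_r(series_a, series_b)
r_changes = pearson_r(yearly_changes(series_a), yearly_changes(series_b))

print(f"correlation of the raw series:          {r_levels:.2f}")
print(f"correlation of the year-to-year changes: {r_changes:.2f}")
```

The raw series correlate strongly simply because both go up, whereas the year-to-year changes, which strip out the shared trend, should show a much weaker relationship.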
I singled out autism and anti-vaccers here, but these types of spurious correlations pervade the anti-science movement, and you can find them for anti-fluoride arguments, anti-GMO arguments, etc. As you can hopefully now see, however, those correlations may be completely spurious. Simply saying that X and Y are correlated tells you nothing about whether X is causing Y, unless, of course, you have extra information like I will talk about below.
Correlation can equal causation
Now that we have gone over why correlation does not automatically mean causation, we can talk about the situations where correlation can indicate causation. You see, essentially all scientific tests rely on correlation, so if there was no way to use it to assign causation, science would be in serious trouble. Fortunately, there is a way to go from correlation to causation: controlled experiments. If, for example, a scientist does a large, double-blind, randomized controlled trial of a new drug (X) and finds that people who take it have increased levels of Y, we could then say that taking X is correlated with increased levels of Y, but we could also say that taking X causes increased levels of Y. The key difference between a situation like this and the situations that we talked about previously is that in this case, we controlled all of the other possibilities such that only X and Y changed. In other words, we eliminated the possibilities other than causation.
To illustrate this further, let’s go back to the correlation between autism rates and organic food sales, but this time let’s say that someone was actually testing the notion that organic food causes autism (obviously it doesn’t, but just go with it for the example). Therefore, they select a large group of young children of similar age, sex, ethnicity, medication use, etc. They randomly assign half of them to a treatment group that will eat only organic food, and they randomly assign the other half to a control group that will eat only non-organic food. Further, they blind the study so that none of the doctors, parents, or children know what group they are in. Then, they record whether or not the children develop autism.
Now, for the sake of example, let’s say that at the end, they find that the children who ate only organic food have significantly higher autism rates than those who ate non-organic food. As with the drug example earlier, it would be accurate to say that autism and organic food are correlated, but it would also be fair to say that organic food causes autism (again, it doesn’t, it’s just an example). So, how is this different than the previous example where we simply showed that, over time, organic food sales and autism rates are correlated? Quite simply, the key difference is that this time, we controlled the confounding factors so that the only difference between the groups was the food (X). Therefore, we have good reason to think that the food (X) was actually causing the autism (Y), because nothing else changed.
Let’s walk through this step by step, starting with the general correlation between organic food sales (X) and autism rates (Y) and looking at each of the four possibilities I talked about earlier.
- Could organic food be causing autism? Yes
- Could autism be causing people to buy more organic food? Yes (perhaps families with an autistic family member become more concerned about health and, therefore, buy organic food [note: organic food isn’t actually healthier])
- Could a third variable be causing both of them? Maybe, though I have difficulty coming up with a plausible mechanism in this particular case.
- Could the relationship be from chance? Absolutely. Indeed, this is the most likely answer.
Now, let’s do the same thing, but with the controlled experiment.
- Could the organic diet be causing autism? Yes
- Could autism be causing the diet? No, because diet was the experimental variable (i.e., the thing we were manipulating), thus changes in it preceded changes in the response variable (autism).
- Could it be caused by a third variable? No, because we randomized and controlled for confounding variables. This is critically important. To assign causation, you must ensure that X and Y are the only things that differ among your groups.
- Could the relationship be from chance? Technically yes, but statistically unlikely.
Is the difference clear now? In the controlled experiment, we could assign causation because changes in X preceded changes in Y (thus Y couldn’t be causing X) and nothing other than X and Y changed. Therefore, X was most likely causing the changes in Y.
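The contrast between the two walk-throughs above can be sketched with a toy simulation. Everything in it is hypothetical: a hidden factor Z influences both who chooses the "treatment" and the outcome, while the treatment itself does nothing at all. The function and variable names are made up for the example.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

def mean(values):
    return sum(values) / len(values)

def group_difference(treated_flags, outcomes):
    """Mean outcome of the treated group minus mean outcome of the control group."""
    treated = [y for flag, y in zip(treated_flags, outcomes) if flag]
    control = [y for flag, y in zip(treated_flags, outcomes) if not flag]
    return mean(treated) - mean(control)

n = 20000
hidden = [rng.gauss(0, 1) for _ in range(n)]  # the third variable (Z)

# The outcome depends only on Z. The treatment has NO effect in this toy world.
outcomes = [2.0 * z + rng.gauss(0, 1) for z in hidden]

# Observational data: people with high Z are more likely to choose the treatment.
chose_treatment = [z + rng.gauss(0, 1) > 0 for z in hidden]

# Controlled experiment: a coin flip decides, so Z cannot influence the groups.
assigned_treatment = [rng.random() < 0.5 for _ in range(n)]

obs_diff = group_difference(chose_treatment, outcomes)
rct_diff = group_difference(assigned_treatment, outcomes)

print(f"observational difference between groups: {obs_diff:.2f}")
print(f"randomized difference between groups:    {rct_diff:.2f}")
```

In the observational comparison, the groups differ substantially even though the treatment does nothing, because Z differs between them. With random assignment, Z is balanced across the groups (on average), so the spurious difference should shrink to roughly zero.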
That “most likely” clause is an important one that I want to spend a few moments on. Science does not deal in proof, nor does it provide conclusions that we are 100% certain of. Rather, it tells us what is most likely true given the current evidence. It is always possible that a result arose by chance. Therefore, even when scientists make statements like, “X causes Y” what they really mean is, “based on the current evidence, the most likely conclusion is that X causes Y.” Indeed, science operates on probabilities, and when we do statistical tests, we are usually seeing how likely it is that we could get a result like the one that we observed just by chance. We then use those statistical methods to put confidence intervals around our conclusions, rather than stating something with 100% confidence. Importantly, however, the fact that science does not give us absolute certainty does not mean that it is unreliable. Science clearly works, and the ability to assign probabilities and confidence intervals to our conclusions is a vast improvement over the utter guesswork that we have without it. Further, for well-established conclusions, numerous studies have all converged on the same answer, and it is extremely unlikely that all of them picked up the same false associations just by chance.
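As a rough illustration of what "how likely is it that we could get this result just by chance" means in practice, here is a sketch of a permutation test. That is my choice of method for the example (it is only one of many statistical tests), and the measurements are invented. The idea is to shuffle the group labels many times and count how often a difference at least as large as the observed one shows up when the labels are meaningless.

```python
import random

def mean(values):
    return sum(values) / len(values)

# Hypothetical measurements of Y in two groups (invented numbers).
treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5, 15.8, 14.1]
control = [11.2, 12.8, 12.1, 13.0, 11.9, 12.5, 13.3, 12.2]

observed = mean(treatment) - mean(control)

rng = random.Random(3)  # fixed seed so the sketch is repeatable
pooled = treatment + control
n_treat = len(treatment)
n_shuffles = 10000
as_extreme = 0

for _ in range(n_shuffles):
    # Randomly reassign the group labels, as if the treatment made no difference.
    rng.shuffle(pooled)
    shuffled_diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
    if abs(shuffled_diff) >= abs(observed):
        as_extreme += 1

# Estimated probability of a difference this large arising from chance alone.
p_value = as_extreme / n_shuffles

print(f"observed difference: {observed:.2f}")
print(f"share of shuffles at least as extreme: {p_value:.4f}")
```

A small share means that chance alone rarely produces a difference this large, which is why we would say the result is "statistically unlikely" to be a fluke rather than claiming that a fluke is impossible.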
Before I end this section, I want to make one final point. I talked specifically about randomized controlled trials in this section, and they are generally our most powerful tool, but there are other methods (such as cohort studies) that can also control confounding factors and assign causation. Further, in some cases, cohort studies can even be more powerful than randomized controlled trials, so you should not fall into the trap of thinking that anything less than a randomized controlled trial is unacceptable (I talked more about the different types of studies, their strengths and weaknesses, and which ones can and cannot assign causation here).
Assigning specific causation when general causation has already been established
Next, I want to talk about cases where you can use a correlation between X and Y as evidence of causation based on an existing knowledge of causal relationships between X and Y. In other words, if it is already known that X causes Y, then you can look at specific instances where X and Y are increasing together (if it is a positive relationship) and say, “X is causing at least part of that change in Y” (or, more accurately, “probably causing”).
Let me use an example that I have used before to illustrate this. Look at the data to the right on smoking rates and lung cancer in the US. There is a clear correlation (lung cancer decreases as smoking rates decrease), and I don’t think that anyone would take issue with me saying that the decrease in smoking was probably at least partially the cause for the decrease in lung cancer rates. Now, why can I make that claim? After all, if we run this through our previous four possibilities, surely we can come up with other explanations. So, why can I say, with a high degree of confidence, that the smoking rate is probably contributing to the decrease? Quite simply, because a causal relationship between smoking and lung cancer has already been established. In other words, we already know from previous studies that smoking (X) causes lung cancer (Y). Therefore, we already know that an increase in smoking will cause an increase in lung cancer and a decrease in smoking will cause a decrease in lung cancer. Therefore, when we look at situations like this, we can conclude that the decrease in smoking is contributing to the decrease in cancer rates because causation has already been established. To be clear, other factors might be at play as well, and, ideally, we would measure those and determine how much each one is contributing, but even with those other factors, our prior knowledge tells us that smoking should be a causal factor.
This same line of reasoning is what lets us look at things like the correlation between climate change and CO2 and conclude that the CO2 is causing the change. We already know from other studies that CO2 traps heat and drives the earth’s climate. Indeed, we already know that increases in CO2 cause the climate to warm. Therefore, just like in our smoking example, we can conclude that CO2 is a causal factor in the current warming. Further, in this case, we have also measured all of the other potential contributors and determined that CO2 is the primary one (I explained the evidence in detail with citations to the relevant studies here, here, and here, so please read those before arguing with me in the comments).
The same thing applies to the correlation between vaccines and the decline in childhood diseases. Multiple studies have already established a causal relationship (i.e., vaccines reduce diseases), therefore we know that vaccines were a major contributor to the reduction in childhood diseases (more details and sources here).
Argument from ignorance fallacies
Finally, I want to talk about a common, and invalid, argument that people often use when presenting a correlation as evidence of causation (here I am talking about examples like in the first section where the results aren’t from controlled studies and causation has not previously been established). I often find that people defend their assertions of causation with arguments like, “well what else could it be?” or “prove that it was something else.” For example, an anti-vaccer who is claiming that vaccines cause autism because of the correlation between autism rates and vaccine rates might defend their argument by insisting that unless a skeptic can prove that something else is causing the supposed increase in autism rates, it is valid to conclude that vaccines are the cause.
There are two closely related logical problems that are occurring here. The first is known as shifting the burden of proof. The person who is making a claim is always responsible for providing evidence to back up their claim, and shifting the burden happens when, rather than providing evidence in support of their position, the person making the claim simply insists that their opponent has to disprove the claim. That’s not how logic works. You have to back up your own position, and your opponent is not obligated to refute your position until you have provided actual evidence in support of it.
The second problem is a logical fallacy known as an argument from ignorance fallacy. This happens when you use a gap in our knowledge as evidence of the thing that you are arguing for. A good example of this would be someone who says, “well you can’t prove that aliens aren’t visiting earth, therefore, they are” or, at the very least, “therefore my belief that they are is justified.” Do you see how that works? An absence of evidence is just that: a lack of knowledge. You can’t use that lack of knowledge as evidence of something else. Nevertheless, that is exactly what is happening in situations like the example of our anti-vaccer above. That is what is occurring when someone says something like, “well, you can’t prove that something other than vaccines is causing the increase in autism rates, therefore I am justified in arguing that the correlation is evidence that vaccines are the cause.” It is an argument from ignorance fallacy and it is not logically permissible.
In short, correlation is not automatically evidence of causation because there are many other factors that could be at play. X could be causing Y, Y could be causing X, some third variable could be causing X and Y, etc. Nevertheless, if you can control for all of those other factors and ensure that the changes in X precede the changes in Y and only X and Y are changing, then you can establish causation within the confidence limits of your statistics. Additionally, once a general causal relationship between X and Y has been established, you can use that relationship to assign causation to particular instances of correlation.
Note: If you want to be really technical, you could argue that there are more than four possibilities to explain correlation, but they are all really just special cases of the four major ones I described. For example, you could argue that rather than a single third variable causing both X and Y, there is actually a complicated web of causal relationships involving multiple other variables that ultimately results in changes in both X and Y. That is, however, just a more convoluted way of stating my third option, and the point is the same: something else is causing both changes.
Note: The reported increase in autism rates is at least largely due to changes in diagnostic criteria, rather than an actual increase in autism rates. In other words, people who wouldn’t have been considered autistic 20 years ago are considered autistic today, resulting in the illusion of an increase in autism rates. More details and sources here.