Updated with additional sources on 16-June-16
It is fairly widely known that correlation does not inherently indicate causation. In fact, inappropriately asserting causation is a logical fallacy known simply as a correlation fallacy. Nevertheless, there is a great deal of confusion around this topic, and many people use it selectively. For example, anti-vaccers are adamant that the correlation between the introduction of vaccines and the decline in diseases is not valid evidence that vaccines work, yet they insist that the supposed correlation between autism and vaccines is 100% proof that vaccines are dangerous. Therefore, in this post I will endeavor to unravel the mysteries of correlation and causation.
Let’s start with basic definitions. Correlation is simply a relationship between two variables. Either they both increase together, both decrease together, or one increases as the other decreases. So, for example, among people under the age of 20, there is a correlation between age and height. As age increases, height also increases. In contrast, causation means that not only do the two variables change together, but the change in one variable is actually causing the change in the other variable. The height example is, in fact, a causal one. Being older means that you have had more time to grow. Thus, on average, an 18-year-old will be taller than a 12-year-old because the 18-year-old has had more time to grow.
The problem is that simply being correlated does not mean that one variable causes the other. I’ll use a classic example to illustrate. There is a strong correlation between ice cream sales and drowning accidents. Both of them increase together and decrease together, yet it would clearly be absurd to claim that ice cream sales are causing people to drown. The reality is that both ice cream sales and drowning accidents are caused by a third variable: time of year. People consume more ice cream in the summer than in the winter, and people spend more time in the water in the summer than in the winter (which leads to more drowning accidents). Thus, these factors are correlated, but not causally related. This is why the correlation between increased vaccines and increased autism rates is not evidence that vaccines cause autism. There are countless other factors that could be causing autism to increase. Amusingly, it turns out that the increase in the sale of organic food also correlates well with the increase in autism (note: the supposed increase in autism is largely artificial, as it is mostly caused by a change in the definition of autism, i.e., people who would not have been considered autistic 20 years ago are considered autistic today. Also, there is a great deal of evidence that vaccines do not cause autism).
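The ice cream example can be sketched as a small simulation. This is only an illustration on made-up data: temperature stands in for "time of year", and the coefficients, noise levels, and function names (`pearson`, `partial_correlation`, `simulate`) are all assumptions chosen for the sketch, not real sales or drowning figures. Neither variable influences the other in the code, yet they come out correlated. Statistically removing the effect of temperature (a partial correlation) makes that relationship disappear.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate(days=2000, seed=0):
    """Made-up data: temperature drives both variables; neither drives the other."""
    rng = random.Random(seed)
    temperature = [rng.uniform(0, 35) for _ in range(days)]
    # Hypothetical coefficients and noise levels, chosen only for illustration.
    ice_cream = [10 * t + rng.gauss(0, 50) for t in temperature]
    drownings = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]
    return temperature, ice_cream, drownings


temperature, ice_cream, drownings = simulate()
raw = pearson(ice_cream, drownings)
controlled = partial_correlation(ice_cream, drownings, temperature)
print(f"ice cream vs. drownings: r = {raw:.2f}")
print(f"same, controlling for temperature: r = {controlled:.2f}")
```

Under these assumptions, the first correlation should be strongly positive and the second should be close to zero, which is what you would expect when a third variable is doing all of the work.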
Having now established that correlation does not automatically mean causation, we run into our second point of confusion. Many people are under the impression that you can never use correlation to show causation. In reality, there are two circumstances in which you can use correlation to conclude that there is a causal relationship. The first, and most powerful, is by simply controlling everything except for the two variables you are interested in. This is why scientists carefully design controlled studies such that one variable (the experimental variable) is deliberately changed and another variable (the response variable) is measured to see if it changes in response to the experimental variable, but all other variables are controlled so that they do not change. Under these circumstances, you can conclude that the correlation is causal because there are no other potential causes. If the only things that change are the two variables you are interested in, then the changes in one variable must be caused by the changes in the other variable. This is why, for example, we can claim that there is a causal relationship between increased CO2 in our atmosphere and the increase in our planet’s temperature. We have carefully monitored output from the sun and the other major drivers of our planet’s climate, and none of them correlate closely with the increase in temperature. In other words, without including CO2, we can’t account for the current warming (Stott et al. 2001; Meehl et al. 2004; Allen et al. 2006; Lean and Rind 2008; Imbers et al. 2014).
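The logic of a controlled study can also be sketched with a toy simulation. Everything here is hypothetical: `response` is an invented system in which the true effect of a "dose" is 2 units per unit of dose, and "temperature" is an invented second variable that also affects the outcome. When dose and temperature are allowed to vary together, a simple estimate of the dose effect is badly inflated. When the experimenter sets the dose and holds temperature fixed, the estimate recovers the true effect.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


def response(dose, temperature, rng):
    """Hypothetical system: the true effect of dose is 2 units per unit dose."""
    return 2 * dose + 3 * temperature + rng.gauss(0, 1)


def observational_slope(n=1000, seed=0):
    """Dose and temperature vary together, as they might in uncontrolled data."""
    rng = random.Random(seed)
    temps = [rng.uniform(10, 30) for _ in range(n)]
    doses = [0.5 * t + rng.gauss(0, 1) for t in temps]
    outcomes = [response(d, t, rng) for d, t in zip(doses, temps)]
    return slope(doses, outcomes)


def controlled_slope(n=1000, seed=0):
    """The experimenter sets the dose and holds temperature fixed at 20."""
    rng = random.Random(seed)
    doses = [rng.uniform(0, 10) for _ in range(n)]
    outcomes = [response(d, 20, rng) for d in doses]
    return slope(doses, outcomes)


print(f"uncontrolled estimate of the dose effect: {observational_slope():.2f}")
print(f"controlled estimate of the dose effect:   {controlled_slope():.2f}")
```

Under these assumptions, the uncontrolled estimate should land well above the true value of 2, because the effect of temperature is being credited to dose, while the controlled estimate should land close to 2.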
The second way that we can infer causation is by using additional data. For example, a few months ago, I was debating with someone who was opposed to higher education, and I showed them a figure documenting the increase in salary that accompanied increasing levels of education. They, of course, accused me of a correlation fallacy, but there was a very obvious additional piece of information that demonstrated that the relationship was causal: the high-paying jobs required higher education. This fact makes it abundantly clear that higher education was responsible for the higher salaries. Similarly, with global climate change, we have additional data from laboratory trials that show that increased CO2 traps more heat, and we have data from satellites that show that less IR radiation (the energy that CO2 traps) is leaving the earth now than it was 30 years ago (Harries et al. 2001; Griggs and Harries 2007). These additional pieces of evidence confirm that the relationship between the increase in CO2 and the increase in temperature is a causal one.
There are several key take-home messages here. First, do not be duped by people who are trying to use correlation inappropriately to prove causation. Look carefully at their argument, and if there are other variables that were not controlled for, please tell them that they have committed a logical fallacy, then proceed to ignore their argument. If, however, someone presents you with a carefully controlled study in which only the variables of interest have changed, you can then conclude that the relationship is causal. Similarly, if they can back up the relationship with independent facts which clearly demonstrate that one variable will cause the other, you can conclude that the relationship is causal. Finally, this once again demonstrates the superiority of carefully controlled studies over anecdotal evidence. Anecdotes have no controls whatsoever, so you absolutely cannot conclude that your herb cured your cold or your vaccination caused your autism because there are too many other variables (note: these specific examples are technically post hoc ergo propter hoc fallacies). Using carefully controlled studies is the one and only way to test the relationships among variables.
Other posts on statistics:
- Basic Statistics Part 1: The Law of Large Numbers
- Basic Statistics Part 3: The Dangers of Large Data Sets: A Tale of P values, Error Rates, and Bonferroni Corrections
- Basic statistics part 4: Understanding P values