Most scientific studies are wrong, but that doesn’t mean what you think it means

When faced with scientific studies that disagree with them, many people are prone to claim that they don’t have to accept those studies because most scientific studies are actually wrong. They generally try to support this claim by either citing the work of John P. A. Ioannidis (especially his paper titled, “Why most published research findings are false”) or by quoting Dr. Richard Horton who said,

“The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.”

To the anti-scientist, these are “get out of jail free” cards that let them dismiss any study that they don’t like. In reality, of course, those who oppose science are grossly mischaracterizing and misusing Ioannidis’s work and Horton’s statements. Indeed, they seem to overlook the ironic fact that Ioannidis and Horton are among the world’s top scientists. So if their papers/comments actually gave us carte blanche to ignore any paper that we wanted to, then I could also blindly reject the papers in which they claim that many studies are wrong (hopefully you see the logical paradox there). Further, once you actually understand the arguments that Ioannidis and Horton are making, problems begin to emerge with many of the studies that anti-vaxxers, GMO opponents, naturopaths, etc. love to cite. Therefore, I want to take a look at what Ioannidis’s work and Horton’s statements actually mean and how they should be applied.

Note: I am going to talk generally about the problem being addressed by these two men, rather than focusing on a single comment or paper. The issue is, however, very complex and multifaceted, so I will only hit the most important topics.

 

Most studies may be wrong, but that doesn’t mean that an individual study is wrong
Before I actually talk about the issue itself, I want to make a few comments on the flawed notion that Ioannidis’s and Horton’s papers/statements entitle you to dismiss scientific studies. The fact that a large portion of all published papers are wrong does not mean that most papers on a particular topic are wrong, nor does it mean that an individual study is wrong.

Let me give you an analogy. It is a fact that most cars don’t go above 200 miles per hour, but if your friend owns a Bugatti (the world’s fastest production car) and claims that it goes faster than 200 miles per hour, you don’t get to say, “but most cars can’t go over 200 miles per hour, therefore I can reject your claim.” The fact that most cars can’t crack 200 doesn’t mean that no cars can go that fast. There are plenty of slow cars out there, but there are also some very fast ones. Therefore, you have to look at each model of car individually. Likewise, there are lots of junk scientific studies out there, but there are also some very high quality studies, and you can learn to tell the difference. Indeed, I have written several posts on that very topic. So the take home message is not that you should arbitrarily reject scientific studies, but rather that you should carefully examine them before either accepting or rejecting them.

 

Type 1 errors and publication biases
One of the reasons that there are so many false studies is a statistical problem known as a type 1 error. I explained this in detail here and here, but in short, classical statistics operate off of probabilities that show you how likely you are to get a result of the effect size that you observed (or greater) if there actually is no effect. For example, if you are comparing a drug to a placebo and patients responded better to the drug with a P value of 0.03, that would mean that there is only a 3% chance of you finding the difference that you found (or greater) if the drug is not actually better than the placebo. To determine whether or not a result is statistically significant, we compare that probability to a predesignated value called alpha. This value is typically 0.05 (5%), and anything below it is considered significant.

If you think about the math for a minute, an obvious problem emerges. Because of the alpha level, if you did separate trials on 20 different drugs, you would expect one of them (on average) to give a significant result even if none of the drugs actually work. In other words, you expect 1 in 20 to be a false positive. To be clear, this is not because of fraud or user error, it is just a statistical fluke.
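To make that arithmetic concrete, here is a minimal Python sketch. It assumes that the trials are independent and that none of the drugs work; the function names are mine and are purely illustrative.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of significant results when no effect is real."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that one or more of n independent null tests comes out significant."""
    return 1 - (1 - alpha) ** n_tests


# Assumed scenario: every drug tested is truly ineffective.
for n in (1, 5, 20, 100):
    print(n, expected_false_positives(n), round(prob_at_least_one_false_positive(n), 3))
```

With 20 trials, the expected number of false positives is 1, and the chance of getting at least one is roughly 64%.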

That problem often becomes magnified because of publication biases. Journals tend to strongly prefer positive results over negative results (positive = significant difference/relationship, negative = no significant difference/relationship). Therefore, in our example of 20 drug trials, the one study that found a significant difference just by chance will likely get published, whereas the other 19 studies probably won’t get published. As a result, the type 1 error rate in the published literature tends to be much higher than 5%.
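As a rough illustration of how publication bias inflates that rate, the sketch below considers only trials of drugs that do not work and asks what share of the published ones report a significant effect. The publication probabilities are hypothetical numbers that I picked for the example, not measured values.

```python
def published_false_positive_rate(
    alpha: float = 0.05,
    p_publish_positive: float = 1.0,   # assumed: significant results always get published
    p_publish_negative: float = 0.1,   # assumed: only 1 in 10 negative results gets published
) -> float:
    """Among published trials of ineffective drugs, the share reporting significance."""
    published_positive = alpha * p_publish_positive
    published_negative = (1 - alpha) * p_publish_negative
    return published_positive / (published_positive + published_negative)


# No bias: negatives are published as readily as positives.
print(round(published_false_positive_rate(p_publish_negative=1.0), 3))
# Strong bias: most negatives stay in the file drawer.
print(round(published_false_positive_rate(p_publish_negative=0.1), 3))
```

Under these assumed numbers, the share of published null trials that look positive climbs from 5% to about a third, even though nobody did anything wrong statistically.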

All of this should make you very skeptical of topics with only one or two studies, and you should strive to avoid the “single study syndrome.” In other words, if a topic has only been studied once or twice, there is a pretty high chance that the results are actually type 1 errors, but if a study has been replicated multiple times, and the studies all agree with each other, then you can be fairly confident in the results, because it is unlikely that all of those studies were type 1 errors (note: true replication [i.e., doing the exact same study] is, unfortunately, quite rare, so “replication” generally involves asking the same question from a slightly different angle, rather than repeating exactly what a previous researcher did).

Similarly, for any well studied topic, you expect there to be a few false results just by chance. Therefore, you have to look at the entire body of literature and examine the overarching trends rather than focusing on a handful of cherry-picked studies that disagree with the majority of studies.


Finally, it is worth noting that you can publish negative results under certain circumstances. For example, if there is a lot of concern over the safety of something and you find a negative result (i.e., it’s not dangerous), then you could publish that. Conversely, if there is a treatment that is widely believed to work and you find that it doesn’t work, you could publish that negative. So it is not impossible to publish a negative study, but it is difficult unless there is some external reason for why that negative is interesting.

 

Small sample and effect sizes
Another common problem that Ioannidis and Horton talk about is one that I also talk about frequently: sample size. The bigger the sample size, the more reliable your result. There are, however, lots and lots of unreliable studies with tiny sample sizes. Those studies are far more likely to be false than large studies are. So you should always look at the sample size, and use that in assessing how confident you can be in the results of the study.
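One way to see why sample size matters is statistical power: the probability that a study detects an effect that really exists. The sketch below is my own illustration rather than anything taken from Ioannidis or Horton. It assumes a two-sided, two-sample z-test with equal group sizes, and the effect size of 0.5 standard deviations is a made-up value.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the common
    standard deviation (often called Cohen's d).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical effect of 0.5 standard deviations at several sample sizes.
for n in (10, 20, 64, 200):
    print(n, round(power_two_sample(0.5, n), 2))
```

Under these assumptions, a study with 64 subjects per group has roughly an 80% chance of detecting the effect, while a study with 10 per group will usually miss it. Low power also means that a larger share of the significant results that do appear are flukes.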

A similar problem is effect size. This is the magnitude of the difference or relationship that you observed. If you found a very small difference, then it is likely that the result arose by chance, but if you found a very large difference, then it is more likely to be a true result.

This should make sense if you think back to P values. Any result with a P value less than 0.05 is considered to be significant. Thus a P of 0.049 gets treated as a real result, even though there is a 4.9% chance that you would observe a difference of that size (or greater) just by chance when no real effect exists. Therefore, Ioannidis argues (and I agree) that we should look not just at whether or not the P value was significant, but also at how small it was and how large the effect was (e.g., a P value of 0.0001 is much more likely to reflect a real result than a P value of 0.01). To that end, P values should be accompanied by additional pieces of information (like confidence intervals) that help you to judge whether or not the result is likely to be real.
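To show what reporting a confidence interval alongside a P value might look like, here is a small sketch that works from summary statistics. It uses a normal approximation (a real analysis of a small study would more likely use a t distribution), and both example studies are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def difference_summary(mean_a, mean_b, sd_a, sd_b, n_a, n_b, confidence=0.95):
    """Return (difference, (low, high), two-sided P) using a normal approximation."""
    diff = mean_a - mean_b
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)  # standard error of the difference
    z = NormalDist()
    margin = z.inv_cdf(0.5 + confidence / 2) * se
    p_value = 2 * (1 - z.cdf(abs(diff) / se))
    return diff, (diff - margin, diff + margin), p_value


# Hypothetical study A: huge sample, tiny difference.
print(difference_summary(100.3, 100.0, 15, 15, 20000, 20000))
# Hypothetical study B: small sample, large difference.
print(difference_summary(112.0, 100.0, 15, 15, 25, 25))
```

Both invented studies come out as significant at 0.05, but the intervals tell very different stories: one describes a precisely measured but tiny effect, and the other a large but poorly pinned-down effect.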

Note: Ioannidis and many others also argue that we should move away from traditional P values and adopt other practices like Bayesian approaches. I largely agree with them, but that is a very complex topic that is beyond the scope of this post. 

 

Statistical fishing trips/P-hacking
If you continue to think about P values and type 1 error rates a bit more, another problem should emerge. Given that we will get a false positive if we do a test enough times, it becomes possible to do an experiment that involves many different components, then claim to have found an important result if any one of them is significant.

Let’s say, for example, that you want to know if a chemical causes cancer, but you don’t specify the type of cancer that you are interested in beforehand. Therefore, you take two groups of mice, and one gets the chemical while the other does not. Then, you dissect both groups and look for lung cancer, skin cancer, brain cancer, stomach cancer, etc. If you look for lots of different cancers like that, then you expect, just by chance, that at least one of them will be statistically significant even if the chemical doesn’t actually cause cancer. Indeed, that seems to be exactly what happened with a recent paper that claimed that Splenda causes cancer.

Another approach that has the same problem is to look for an effect in many groups. For example, I previously wrote about the now retracted study by Brian Hooker which claimed that vaccines caused autism in African American boys. One of the many problems with this study was that he looked for relationships in lots of different groups (African American boys, African American girls, Caucasian boys, Caucasian girls, etc.) and only that one group was significant. In other words, that result is probably a statistical fluke that resulted from doing many comparisons.

A closely related issue is a problem known as “P-hacking.” This occurs when you take a large data set and manipulate it until eventually the result that you want comes out. This can happen either consciously (in which case it’s fraud) or unconsciously, but the outcome is the same: you get a cherry-picked result that is almost certainly incorrect.

The correct way to solve these problems is to carefully define your study’s criteria beforehand and use some mechanism for controlling your family-wise type 1 error rate (i.e., the error rate for your entire experiment). In other words, before doing your study, you should specify exactly what comparisons you are going to make, exactly what analyses you are going to use, and exactly what will constitute a significant result. Further, if your analyses will involve multiple comparisons, then you should also use a method for controlling your type 1 error rate such as a Bonferroni correction (this adjusts your alpha based on the number of comparisons that you are making). You can also use false discovery rates and various other techniques depending on the type of study that you are doing. The point is that studies should be carefully designed to avoid P-hacking, and studies with lots of comparisons should be making use of techniques for controlling type 1 error rates, but they often don’t. So when you are reading papers, watch out for these pitfalls.
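For readers who want to see what those corrections do, here is a minimal sketch of a Bonferroni correction and of the Benjamini-Hochberg procedure (one common way of controlling the false discovery rate). The list of P values is invented, standing in for something like ten cancer types tested in one mouse experiment.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag P values that stay significant at a family-wise error rate of alpha."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # alpha is split across all comparisons
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, fdr=0.05):
    """Flag P values as discoveries while controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted P value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical P values from ten comparisons within one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in p_values), "significant without any correction")
print(sum(bonferroni(p_values)), "significant after Bonferroni")
print(sum(benjamini_hochberg(p_values)), "significant after Benjamini-Hochberg")
```

In this made-up example, five comparisons look significant at face value, but only one survives Bonferroni and two survive Benjamini-Hochberg. Bonferroni is the stricter of the two because it controls the chance of even a single false positive.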

 

Multiple studies on the same topic
One rather interesting result of Ioannidis’s analysis is that the probability of a study being wrong is greater when multiple teams have asked the same question. At first, that may seem paradoxical. After all, isn’t replication the gold standard of research? It is (or at least should be), and I’ll talk more about it later, but the problem at hand is actually a bit different, and it once again comes back to type 1 errors.

If many different people study the same question, then some of them will get false results just by chance. For example, if 20 different teams study homeopathy, then you expect that at least one of them is going to find a significant improvement, even though homeopathy doesn’t work. To put this another way, if you only have 5 studies on a topic, then the odds that at least one of them is wrong are much lower than the odds of having at least one bad study among 100 studies.

Importantly, however, there is a huge difference between the odds of having a bad study and the odds of the overarching conclusion of all the studies being wrong. Going back to the homeopathy example, sure, you can find plenty of small, under-powered studies that found a significant effect, even though homeopathy makes no sense whatsoever, but when you look at the entire body of literature, you find that they are statistical noise, and most of the high quality studies failed to find strong evidence of homeopathy working (Ernst 2002; Ultunic et al. 2007; NHMRC 2015).

Finally, competition can actually be a big problem. If you know that another lab is doing a project that is similar to yours, you may rush to publish, and in the process fail to do a thorough job. Similarly, it may be tempting to fudge your results so that they appear better than those of the labs that you are competing with. Don’t get me wrong, competition can be a very good thing, but it can also be a very harmful thing, and I (and many other researchers) tend to think that collaboration is better than competition for many scientific topics.

 

Bias and conflicts of interest
You should always check papers to see if they declared any conflicts of interest, because that can be a real problem, and it definitely adds to the false literature that is published. When you find a conflict of interest, however, you have to be careful about how you treat it. Having a conflict of interest does not automatically mean that a study was flawed. It may have been perfectly fine. Indeed, in most cases, the organizations that provide the funding have no say or control over the results or publication of the study that they funded. So, rather than making you blindly reject a paper, a conflict of interest should simply make you look more closely at it. If you find that it was large, used proper controls, had large effect sizes, and has been replicated or is in agreement with many other studies, then it is likely to be true. In contrast, if it is small, improperly designed/conducted, had small effect sizes, and disagreed with many other studies, then it is probably in error and you should be very cautious about accepting its conclusion.

Also, it is worth noting that you need to actually look for conflicts of interest rather than assuming that they exist. Many people just assume that all pro-vaccine studies were funded by Big Pharma, all pro-GMO studies were funded by Monsanto, etc.  In reality, there are many studies that were not in any way affiliated with pharmaceutical companies, Monsanto, etc. (details here). So, you should not fall into the common trap of assuming that no independent research is ever conducted. Further, studies that disagree with mainstream views often have conflicts as well. There are plenty of organizations that profit tremendously from organic food, alternative medicine, etc., and funding from these groups should be viewed just as critically as funding from pharmaceutical companies.

Similarly, you can have strong biases without having actual conflicts of interest. For example, as I recently explained for the literature on autism and vaccines, several of the papers that suggested that vaccines cause autism were conducted by parents of autistic children who are active members of the anti-vaccine movement and firmly believe that vaccines caused their children’s autism. That type of strong bias can easily lead to erroneous results and should be taken seriously.

 

Quality of the journal
This is not actually one of the problems that Ioannidis talks about frequently, but it is a very real issue that I think is worthy of mentioning. Not all journals are created equal. There are lots of really bad, low quality journals that will publish just about anything. Some of these are for-profit “journals” (what we call “predatory journals”), while others are agenda driven or just really low quality (i.e., they do not use a thorough peer-review process). The easiest way to check the quality of the journal is to look at its impact factor. The higher the impact, the better the journal. If you can’t even find an impact factor, that is a really good indication that it is an untrustworthy journal. You should also check to see if it is a predatory journal.

Using the quality of the journal to assess a paper is very much like using conflicts of interest. You can have a great paper with conflicts of interest, and you can have a terrible paper with no conflicts of interest. Similarly, you can have a great paper that was published in a horrible journal, and you can have a horrible paper that was published in a great journal. As a general rule, however, high impact factor journals publish high quality research, whereas journals with really low impact factors often publish shoddy research that should probably never have been published.

Note: Some good journals have low impact factors simply because they are on a very specific, focused topic, so you should also look at the scope of the journal.

 

Prior probability
Another important consideration that needs to be taken into account is the probability of getting a given conclusion before the study is conducted. In other words, in many cases, we can use our existing knowledge to figure out whether or not a given conclusion is likely to be true, and for many conclusions, that probability is very low. This means that you need much larger effect sizes and sample sizes before accepting them. This is (I think) part of what Horton meant when he said, “an obsession for pursuing fashionable trends of dubious importance.” You can think of this as the scientific plausibility of a conclusion. The more implausible it is, the stronger the evidence needs to be.

Homeopathy (and alternative medicine more generally), once again, makes an excellent illustration of this point. For homeopathy to work, water needs to have “memory” (which we know it doesn’t), things need to become more potent as they are diluted (which is the opposite of reality), and you need “like to cure like” (which also doesn’t work). To put this another way, for homeopathy to work, it would have to break several fundamental concepts of chemistry and physics. Therefore, it is extraordinarily unlikely that it actually works (i.e., it has a very low prior probability). Given that fact, for a study to be convincing, it would need to be extremely large and have an enormous effect size, and no studies of homeopathy have met those criteria.

As Carl Sagan so famously said, “extraordinary claims require extraordinary evidence.”
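The role of prior probability can be put into numbers with a simple Bayesian calculation. The sketch below computes the probability that a significant result is actually true, given a prior probability, the study’s power, and alpha. As far as I can tell, this is algebraically the same as the positive predictive value in Ioannidis (2005) when the bias term is left out, but the specific priors used here are hypothetical values chosen only for illustration.

```python
def positive_predictive_value(prior: float, power: float = 0.8, alpha: float = 0.05) -> float:
    """Probability that a statistically significant finding reflects a real effect."""
    true_positives = prior * power          # real effects that are detected
    false_positives = (1 - prior) * alpha   # null effects that reach significance by chance
    return true_positives / (true_positives + false_positives)


# Hypothetical priors: a plausible hypothesis, a long shot, and a very implausible claim.
for prior in (0.5, 0.01, 0.001):
    print(prior, round(positive_predictive_value(prior), 3))
```

With a 50% prior, a significant result from a well-powered study is true about 94% of the time. With a prior of 1 in 1,000, the same significant result is true less than 2% of the time, which is why implausible claims need far stronger evidence.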

 

The importance of replication
I cannot overstate how important it is for papers to be replicated multiple times (or at least strongly supported by similar research). If several different independent research groups have looked at the same problem and found the same answer, then that answer is most likely correct, because there is a very low probability that all of those studies are wrong. In contrast, when something either hasn’t been replicated or replication attempts have failed, then it is more likely to be wrong. So, you should always be skeptical of a result unless it has been consistently replicated.

Note: when I say “consistently replicated” I mean that the large studies should be consistent. If you have 5 massive studies that agree with each other but disagree with 5 tiny studies, it is far more likely that the 5 massive studies are correct.
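The replication argument can be sketched as a running Bayesian update: each significant study raises the odds that the effect is real, and each failed replication lowers them. This is a deliberately simplified model of my own. It assumes that the studies are independent, unbiased, and share the same power and alpha, and the starting prior of 10% is a hypothetical value.

```python
def posterior_after_studies(prior: float, outcomes, power: float = 0.8, alpha: float = 0.05) -> float:
    """Update the probability that an effect is real after a series of studies.

    outcomes is a sequence of booleans: True for a significant result,
    False for a non-significant one. prior must be less than 1.
    """
    odds = prior / (1 - prior)
    for significant in outcomes:
        if significant:
            odds *= power / alpha              # significant results favor a real effect
        else:
            odds *= (1 - power) / (1 - alpha)  # null results favor no effect
    return odds / (1 + odds)


print(round(posterior_after_studies(0.1, [True]), 3))                # a single positive study
print(round(posterior_after_studies(0.1, [True, True, True]), 3))    # consistently replicated
print(round(posterior_after_studies(0.1, [True, False, False]), 3))  # replication attempts failed
```

Under these assumptions, one positive study moves a 10% prior to about 64%, three consistent studies push it above 99%, and one positive study followed by two failed replications leaves the claim less credible than it was before anyone tested it.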

 

Conclusion/what papers does this apply to?
By this point, it should be clear that although there are real problems with the scientific literature, those problems do not apply to many of the best studies used in support of vaccines, GMOs, etc. For example, when I reviewed the evidence that vaccines do not cause autism, I cited multiple very large studies (often over 100,000 people) that had the power to detect even tiny differences, were published in high quality journals, did not have conflicts of interest, and had been replicated by multiple large studies. In contrast, the anti-vaccine papers tended to have tiny sample sizes and small effect sizes, used questionable statistics, were often published in low quality journals, were not replicated by large studies, and often had conflicts of interest. Thus, when we apply Ioannidis’s work and Horton’s comments to the topic of vaccines and autism, we find that it is the anti-vaccine studies that are likely wrong, not the pro-vaccine studies. To put this another way, neither Horton nor Ioannidis is suggesting that there are no good studies out there. Rather, they are pointing out specific problems that exist with many studies. Those problems do not, however, apply to many of the best studies in favor of GMOs, vaccines, etc.

Indeed, the problems that Ioannidis and Horton are talking about apply largely to anti-science views. You can, for example, find lots of studies saying that homeopathy works, acupuncture is effective, vaccines are dangerous, etc., but when you start applying Ioannidis’s work and Horton’s concerns to those studies, you quickly realize that those studies meet their criteria for research that is likely to be false. Similarly, you should be very cautious about buzz-inducing articles about some new miracle cure, or a common product that increases your cancer risk. Those types of papers tend to be false, and are usually type 1 errors.

In short, there are very real problems with the scientific literature, and a lot of junk science does get published, but that does not mean that you can reject any papers that you want to. Rather, you should critically analyze papers based on their sample size, effect size, statistics, conflicts of interest, journal quality, prior probability, replication, etc. The results of large studies that have been replicated numerous times are unlikely to be false. So you cannot use the work and statements of Ioannidis and Horton as evidence against GMOs, vaccines, climate change, evolution, etc. Rather, it should make you skeptical of the outlier papers that disagree with the overarching trends. You should be slow to accept small sample sizes/effect sizes, and you should demand replication.

Literature cited  

Ernst 2002. A systematic review of systematic reviews of homeopathy. British Journal of Clinical Pharmacology 54:577-582.

Ioannidis 2005. Why most published research findings are false. PLoS Medicine 2:e124.

NHMRC 2015. Evidence on the effectiveness of homeopathy for treating health conditions. National Health and Medical Research Council.

Ultunic et al. 2007. Homeopathy for childhood and adolescence ailments: systematic review of randomized clinical trials. Mayo Clinic Proceedings 82:69-75.

 


One Response to Most scientific studies are wrong, but that doesn’t mean what you think it means

  1. chiphines says:

    I like the article, and it will get a re-read from me more than once. Three things strike me though. First, what the general public faces is tremendous difficulty in reviewing research. It really isn’t that easy for people who don’t devote their career to research. At the same time, intentional or not, the easy to understand summaries are often misleading. There isn’t really a good place to find trusted information from an impartial source.

    Second, the scientists many people come in contact with rarely understand the limitations of science, and probably for the same reason as above: there is so much new, and few people, even professionals like doctors, can do real analysis of the research behind the decisions they make. On average, whether it’s the public or the scientific community, you shouldn’t take research as a whole or consensus as gospel. A scientist typically retreats behind science as God and rarely is willing in a public context to discuss areas of risk, etc. with the general public. This can come to an extreme on the science side as well. How is it helpful to hear the following exchange, which took place between a sleep specialist and myself not long ago? I asked how melatonin might be used as an aid. His response was that the professional sleep disorder organization has not taken a position on melatonin, so he couldn’t help me on the question.

    In the same way, when professionals are using research results or professional consensus to apply to specific cases, they need to also keep in mind the limitations of what they are using. Every individual case needs to be assessed for the possibility that it doesn’t fall within the research indicated solution. In my experience some are clear about doing this, some resent even being asked about issues, and some are in between, but in general the scientific community pushes back against being questioned.

    I think the world would be a better place if it was easier for the public to understand the issues and if the science community would be more open about addressing concerns. This might enable more trust throughout the community.
