Last week, researchers published a paper suggesting that sucralose (Splenda) causes cancer in male mice. This has re-sparked an old debate, and various media outlets have been quick to pounce on the results and flood the internet with articles like, “The scary reason why you should stop using Splenda ASAP” or “Splenda linked to leukemia, study finds: New study finds sweetener unsafe.” Meanwhile, other people have taken a pendulum swing to the opposite conclusion, and are claiming that the study actually showed that Splenda reduced cancer rates in female mice. In situations like this, you always need to look at the actual study, not the headlines, because more often than not, the actual study is very different from what the media is reporting. Therefore, in this post I want to look at the study itself, and use it as a teaching tool to give an example of how you should analyze scientific papers. This is a topic that I have written about many times before (for example, here, here, and here), but I think that this paper provides a particularly nice illustration.
To be clear, I am not going to engage in an overarching discussion about whether or not Splenda is safe. Rather, I am simply going to analyze this particular paper to see whether or not it supports the conclusion that Splenda is dangerous (if you are interested, you can find a nice review of the evidence for its safety in Grotz and Munro 2009).
Overview of the study
The actual paper in question is: “Sucralose administered in feed, beginning prenatally through lifespan, induces hematopoietic neoplasias in male Swiss mice” by Soffritti et al., and it was published in the International Journal of Occupational and Environmental Health. In short, the authors took a large group of Swiss mice and split them up into several treatment groups which received different amounts of sucralose mixed with their food (0 ppm [control], 500 ppm, 2000 ppm, 8000 ppm, 16000 ppm). The mice were first exposed to sucralose as fetuses (i.e., the pregnant females were given the food), and they continued to receive food that contained it every day for the rest of their lives. At the end of the study, surviving mice were euthanized and all of the mice were dissected to look for various cancers. The authors then concluded that in male mice that received 2000 ppm diets or 16000 ppm diets (but not 8000 ppm) there was a significant increase in the rates of certain cancers.
Limitations and applicability
Whenever you read a study like this, one of the most important things to do is look at how widely applicable the results are. As I have previously explained, scientific studies often have very narrow, specific results, whereas the media reports are generally extremely broad, and that is certainly the case for this study.
First, this was a study on mice, not humans. This is extremely important because there are many different types of scientific studies, and different types are used for different purposes. Thus, you always need to look at the type of design being used (details here). Some designs (such as mouse studies) are intended only as preliminary studies that should be used to fuel future research. Mice and humans have biochemical differences, and foods, drugs, etc. often behave differently in mice than in humans. So even if a study convincingly showed that a chemical causes cancer in mice, you could not jump to the conclusion that it causes cancer in humans. Rather, you should use that study as the basis for getting funding to test the chemical in humans.
Additionally, this study only found significant increases in the cancer rates of male mice. There were no significant increases in female mice. So even if you want to prematurely apply this study to humans, you can only apply it to males, not females. Again, look for caveats like this when you read papers.
Also, you always need to remember that the dose makes the poison. Almost everything is toxic in a high enough dose, and even water can be fatally toxic. So whenever you see sensational headlines, you should take a good look at the dose that was being used, because it is often much higher than anything that you would actually be exposed to.
In this particular case, the significant effects were only observed at very high doses. The lowest treatment category was 500 ppm. Using ppm (which stands for parts per million) for dry food is highly unusual (doses are normally expressed as mg/kg [i.e., milligrams of chemical per kilogram of subject]), so I was very confused when I first read the paper, but apparently, the authors have now clarified that 500 ppm is roughly equivalent to 60 mg/kg in mice or 4.9 mg/kg in humans (see Reagan-Shaw et al. 2008 for details on the conversion). This is important because the FDA recommendation is to eat no more than 5 mg/kg of sucralose a day. So, their lowest exposure group was roughly the same as the maximum daily amount recommended by the FDA. However, they did not detect significant effects until 2000 ppm (which I presume is 240 mg/kg in mice or 19.5 mg/kg in humans). In other words, the risk did not increase until a group that received roughly four times the recommended dose. So this study absolutely does not support the conclusion that a normal exposure to Splenda is dangerous. In fact, the category representing the maximum recommended dose had no significant effects. So if anything, this study supports the conclusion that our current exposure rates are safe (remember, everything is dangerous at a high enough dose, so it’s hardly surprising to learn that taking more than the recommended maximum dose is dangerous, and it does not suggest that the recommended dose is dangerous).
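To make the dose arithmetic concrete, here is a rough sketch of the conversions involved. The feed-intake fraction (I'm assuming mice eat roughly 12% of their body weight in feed per day, which reproduces the 500 ppm → 60 mg/kg figure) and the body-surface-area Km factors (mouse Km = 3, human Km = 37, from Reagan-Shaw et al. 2008) are my assumptions about how the numbers were derived, not the paper's stated method.

```python
# Assumed conversion factors (not taken from the paper itself):
FEED_FRACTION = 0.12    # assumed g of feed per g of body weight per day
KM_MOUSE, KM_HUMAN = 3, 37  # Km factors from Reagan-Shaw et al. 2008

def mouse_dose_mg_per_kg(ppm_in_feed):
    """Convert a feed concentration (ppm = mg per kg of feed) to mg per kg of body weight."""
    return ppm_in_feed * FEED_FRACTION

def human_equivalent_dose(mouse_mg_per_kg):
    """Body-surface-area scaling: HED = animal dose * (animal Km / human Km)."""
    return mouse_mg_per_kg * KM_MOUSE / KM_HUMAN

for ppm in (500, 2000, 8000, 16000):
    mouse = mouse_dose_mg_per_kg(ppm)
    human = human_equivalent_dose(mouse)
    print(f"{ppm:>6} ppm -> {mouse:6.0f} mg/kg (mouse) -> {human:6.1f} mg/kg (human equivalent)")
```

Under these assumptions, 500 ppm works out to about 4.9 mg/kg in human-equivalent terms (essentially the FDA's 5 mg/kg maximum), 2000 ppm to about 19.5 mg/kg (roughly four times it), and 16000 ppm to about 156 mg/kg (roughly 31 times it).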
Note (18-Mar-16): originally, I had erroneously claimed that the lowest dose category was 12 times the recommended dose in humans; however, it was pointed out to me in the comments that I simply applied the mg/kg in mice to humans, and I should have done a conversion. I apologize for the mistake and have corrected it throughout the post; however, the mistake did not significantly change any of my arguments or conclusions.
Finally, the applicability of this study is limited because mice received Splenda throughout every life stage (gestation, nursing, adolescence, and adulthood), which makes it impossible to know which stage is important. For example, it could be that the important effects only happen during gestation, and those effects carry over into the rest of life, in which case Splenda would be fine during any stage after birth. Alternatively, it may only be important while nursing or while growing. Based on this study, you simply can’t make the generalization that Splenda is dangerous for adults, because the results in the paper aren’t that specific. In other words, this study took an extremely broad approach to the problem, and you should be aware of the limitations that such a wide approach entails.
Hopefully you can now see the problems with the media hype. Even if the study’s results were reliable (which as I’ll show in a minute, they aren’t), all that it would have shown is that absurdly large doses of Splenda cause cancer in male mice if they receive it over the course of their entire lifespan. That’s hardly a result that should be causing people to flush their Splenda down the toilet.
Problems with the design and statistics
Having now established that the claims being made by the media are clearly bogus, we need to turn our attention to the design of the study to see if it was even set up properly. The first thing to look at is the randomization process. In any study like this, it is very important to properly randomize your subjects, because this minimizes the chance of an unknown confounding factor skewing your results. In this case, they started the trials during gestation, so it is actually the parent mice which needed to be randomized, not the test subjects themselves. What the authors did, however, is a bit odd. They described their randomization process this way:
“The breeders were randomly distributed by weight into three groups of 40 and two groups of 60, encompassing the same number of males and females (n = 240).”
This is very confusing for two reasons. First, what do they mean by, “randomly distributed by weight”? Distributing randomly and distributing by weight are mutually exclusive. Either they grouped them by weight (e.g., a group of 10–15 gram mice, a group of 16–20 gram mice, etc.) or they randomly selected who would go into which group, but they can’t have done both. Second, why three groups of 40 and two groups of 60 instead of six groups of 40 or four groups of 60? It may seem like I am nit-picking here, but this information is actually important. If they did not properly randomize and block their experiment, then their results will be unreliable, and the information provided in the paper simply isn’t enough to tell if they did it right.
By itself, this confusing setup is clearly not enough to discredit the research, but it is a clue that something isn’t right, and it demands closer scrutiny of the paper. As you read scientific papers, you should watch out for irregularities like this, because they are often indicative of underlying problems.
Next up, we have their statistical design, and this is where things become really troubling. Whenever you are writing a scientific paper you are supposed to describe your analyses in enough detail that someone else could take the same data set that you analyzed, do the exact same analyses that you did, and get the same result; however, in this paper, the description of the statistical design is exceedingly unsatisfactory. All that the paper says is,
“The statistical analyses of survival and of the malignant neoplastic lesions were based on the Cox proportional hazard regression model, which adjusts for possible differential survival among experimental groups.”
That’s it. That’s all that the authors told us about their design. In fact, they didn’t even bother to mention another analysis method that they apparently used, because if you look at their results, you will also find that they calculated a Kaplan–Meier estimate, but they never explain how they set that up. This is an enormous problem because it leaves the readers in the awkward position of having to assume that the authors knew what they were doing, and as I’ll explain in a minute, there is a lot that can go wrong here. So, whenever you read a paper, make sure that the stats were described in good detail. Exceedingly terse descriptions like this are completely unacceptable (ideally, there should be a whole subsection on the statistical methods used).
Now, I want to take a close look at the stats, and this is going to get a bit complicated, but bear with me, because it’s important. One of the most important assumptions of statistical tests is that each of your data points is independent of every other data point. In other words, the occurrence of one data point should not be linked to the occurrence of another data point. In this case, however, the experiment started by administering the different levels of sucralose to pregnant females, and their offspring went on to be the experimental subjects that were later dissected (i.e., each offspring was a data point). This means that the data points were clearly not independent, because many of the mice were siblings (and siblings will be far more similar to each other than to individuals from other litters). This is a huge problem that has a very high potential of giving false positives.
Consider, for example, a group like the males given 16000 ppm daily. There were 70 individuals in that group, but you cannot treat each individual as an independent data point, because litter sizes consisted of 12–13 individuals. Thus, you would only have 5–6 litters in that group. This is absolutely critical because if, for some reason, one of the females who produced one of those litters had genes that made her more prone to cancer, those genes would get passed to all of her offspring, and those offspring would then majorly skew the results (note: the study specifically stated that these mice were “outbred,” so genetic differences are entirely possible).
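A toy simulation (my own illustration, not the paper's data) shows how badly this can go wrong. Below, two treatment groups are built from a handful of litters that share a random "litter effect," with no true treatment difference at all, and a naive test that treats every mouse as an independent data point is applied. The litter-effect size and group structure are assumptions chosen only to mimic the scale of this study (6 litters of 12 per group).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def naive_false_positive_rate(n_litters=6, litter_size=12,
                              litter_sd=1.5, n_sims=1000, alpha=0.05):
    """Simulate two identical treatment groups with shared litter effects,
    then test them as if every mouse were independent. Returns the fraction
    of simulations that (wrongly) come out 'significant'."""
    hits = 0
    for _ in range(n_sims):
        tumor_counts = []
        for _group in range(2):
            # Each litter shares a random effect on its log-odds of cancer.
            logits = rng.normal(-1.0, litter_sd, size=n_litters)
            p = 1.0 / (1.0 + np.exp(-logits))
            tumor_counts.append(int(rng.binomial(litter_size, p).sum()))
        n = n_litters * litter_size
        table = [[tumor_counts[0], n - tumor_counts[0]],
                 [tumor_counts[1], n - tumor_counts[1]]]
        _, pval = stats.fisher_exact(table)  # naively assumes independent mice
        hits += pval < alpha
    return hits / n_sims

rate = naive_false_positive_rate()
print(f"false-positive rate of the naive test: {rate:.2f} (nominal level: 0.05)")
```

With litter effects of this size, the naive test rejects far more often than the nominal 5%, which is exactly the kind of false positive that ignoring family structure produces.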
What all of this means is that when the authors built their statistical models, they should have included family as a factor in the model, and they should have done something called “nesting.” This means that they should have built their model such that it would specify that there are multiple measurements per family (i.e., each individual is a measurement within a family), and they should have specified that each family was nested within a single treatment group, while the families were crossed across sex. The math of why you have to do this is complex and beyond the scope of this post, but the point is that they needed to set up a very precise, complex model, and if they failed to do this, then they would have gotten highly erroneous, unreliable results.
So did they do this? I don’t know. This is why it is so important for statistics to be described clearly. As it is, I have no clue if they set up their model properly. At one point, while describing the breeding process, they did state,
“All male and female pups of each litter were used in the experiment to reach the programmed number per sex per group and to allow the evaluation of a potential family effect in the carcinogenic process.”
This suggests that they were at least aware of the potential pitfalls of their data, but there is no indication that they actually set up the model properly, and the results of the family tests are never reported. Now, it is possible that the family effects were not significant and they simply decided not to mention them, but it is also possible that the family effects were huge, so they removed them from the model in order to get an overall significant result, and because they didn’t give us enough information, we simply don’t know which it was. This brings me to another important point. When reading scientific papers, make sure that they actually report the results for all of the tests that they did. Preferably, it should be a full report of P values, confidence intervals, and a measure of the variation in the data.
Now, you may be thinking that my argument here consists entirely of baseless speculation, but if that is what you are thinking, then you are missing my point. I’m not saying that they definitely used the wrong tests. Rather, I am saying that we don’t know if they used the right tests, and that’s a really big problem. This paper is claiming that a rather large body of literature is wrong, and that claim is all well and good if the authors have the data to back it up, but we simply don’t know if they do. Remember, extraordinary claims require extraordinary evidence. If you are going to claim that other studies are wrong, then the burden of proof is on you to provide extremely strong evidence for that assertion, and a paper where we have to guess about whether or not they correctly set up a rather complex model simply does not meet that standard.
Finally, we need to talk about multiple comparisons. This is a topic that I described at length here, so I’ll be brief. When you do a statistical test to compare two groups, you get a P value, which is the probability of getting a difference that is at least as great as the one you observed if there isn’t actually a difference (details here). You can think of this as the probability of getting your results just by chance (that’s not technically correct, but it is a useful way for most people to think about it). This means that sometimes you will get a false positive just by chance (we call this the type 1 error rate, and it’s usually 0.05). So, if you did the same experiment 20 times, and there was not actually a difference between your groups, you would expect to get one false positive, just by chance. In other words, the more times that you test a question, the more likely you are to get a false positive (we call this the family wise type 1 error rate). As a result, when you make multiple comparisons within a data set, you need to account for the number of comparisons that you made (i.e., you need to get the family wise error rate back to 0.05), but there is no indication that the authors did this.
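The arithmetic here is simple enough to check directly. For m independent tests each run at alpha = 0.05, the chance of at least one false positive is 1 − (1 − alpha)^m, and the classic (if conservative) Bonferroni fix is to run each individual test at alpha/m:

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

alpha = 0.05
for m in (1, 4, 10, 20):
    print(f"{m:>2} tests: family wise error rate = {familywise_error_rate(m):.3f}, "
          f"Bonferroni per-test threshold = {alpha / m:.4f}")
```

Even at just four uncorrected comparisons (the number that appear in one category of table 3), the family wise error rate is already about 0.19, nearly four times the nominal 0.05.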
Under the “Animal [sic] bearing hematopotetic neoplastas” category in table 3, for example, there appear to be four comparisons within the same data set, and there is no indication that the error rate was controlled. Further, that category appears to be a subset of the more general “Total animals bearing neoplastas” category, which also underwent comparisons. Similarly, it is not clear how they did the comparisons for each treatment group within each category, but it looks like multiple comparisons were probably involved. This means that they were very likely falsely assigning significance (again, we just can’t be certain what they did).
To be clear, I’m not suggesting that the authors deliberately manipulated the data to get a false result. Rather, I am simply pointing out that the statistics used in this paper are very suspicious, and it seems very likely that the authors did not set them up correctly. Further, since they failed to give adequate descriptions of their methods, we have no way of actually knowing whether or not they were done correctly.
Problems with the results
At this point, I have shown that this study has limited applications to humans and that its statistics are questionable at best. Now, I want to take a look at the results, because this is where things get really interesting.
First, when you look at the column called “malignant neoplastas” in table 3, you can see the first supposedly significant result, but it is barely significant, it applies only to males, and it involves only the highest treatment category (which is 31.1 times the recommended maximum dose for humans). Indeed, 56.4% of control males had malignant tumors, whereas 62.9% of males in the highest dose group had them. That’s not much of a difference, especially when you consider that 67.6% of control females had malignant tumors. So, given the multiple comparison problem that I explained earlier, this is almost certainly a false positive.
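A back-of-the-envelope check gives a feel for how unremarkable that difference is. Here I am assuming roughly 70 males per group (the figure cited above for one group; the paper's actual group sizes vary) and running a simple Fisher exact test on the raw counts. This is not the Cox model the paper used, just an illustration of the effect size.

```python
from scipy import stats

n = 70  # assumed group size (approximate, for illustration only)
control_tumors = round(0.564 * n)     # 56.4% of control males -> 39
high_dose_tumors = round(0.629 * n)   # 62.9% of 16000 ppm males -> 44

table = [[control_tumors, n - control_tumors],
         [high_dose_tumors, n - high_dose_tumors]]
odds_ratio, pval = stats.fisher_exact(table)
print(f"Fisher exact test on the raw counts: p = {pval:.2f}")
```

On the raw counts alone, a difference of this size in groups of this size is nowhere near significant, which is why the claimed result leans so heavily on the (poorly described) survival-adjusted model.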
Next, we get to the “hematopotetic neoplastas” category (these are tumors involving the lymph and blood systems). First, I have to wonder, why is this the only group of tumors that we see comparisons for? The authors collected data on lots of other body systems, so why is this the only place where we see comparisons? I can think of only one reasonable answer: they cherry-picked this system because it was the only one that gave significant results. Indeed, this whole paper reeks of a statistical fishing trip (i.e., this result is likely a type 1 error).
Nevertheless, let’s look closer, because I haven’t even gotten to the most interesting result: the 8000 ppm group. You see, the authors’ results were extremely inconsistent. Before I explain the inconsistency, let me ask you to make a prediction. Suppose that I told you that I was testing the effects of chemical X, and I tested it at doses of 0, 500, 2000, 8000, and 16000 ppm (just like the study). Further, suppose that I said that I found significant increases in cancer rates at 2000 and 16000 ppm. Now, if those increases were actually caused by X, what should we see for the 8000 ppm group? The answer is obvious. If X is actually causing cancer at doses of 2000 and 16000 ppm, then it should also cause cancer at 8000 ppm.
When we look at the actual study, however, we see that the authors report significant increases for males in the 2000 ppm group and the 16000 ppm group, but not the 8000 ppm group. This is an extremely strong indication that the results of this study were produced by chance or a flawed statistical design rather than actual effects of Splenda. It is a dead giveaway that these results are bogus. To put this another way, if the only results that this study reported were the 2000 and 8000 ppm groups, it would look like cancer rates decreased with increasing Splenda use. These results simply aren’t what you expect if Splenda actually causes cancer.
To be fair, if you zoom in specifically on the histiocytic sarcomas subcategory, there was a barely significant effect for the male 8000 ppm group. However, this is, once again, cherry-picking. Further, this category doesn’t really help the study’s case, because it had the following cancer rates: 2000 ppm = 17.5%, 8000 ppm = 10.6%, 16000 ppm = 15.7%. Both the 8000 ppm group and the 16000 ppm group are lower than the 2000 ppm group, which is hardly what you expect if this chemical is actually causing cancer. Granted, some of those differences (particularly the difference between 2000 ppm and 16000 ppm) may not be statistically different, but we don’t know if that comparison is significant, because the authors did such a terrible job of reporting the results of their statistical tests. Further, we would have expected to see a significant increase with increasing dose, and we definitely didn’t see that.
The authors give three justifications for the notion that their result is real and not an artifact, so let’s look at them briefly. First, they say
“The incidence of the concurrent control group (8.5%) falls in the range of the overall historical control incidence (5.7%; range: 0.0–12.5%) and the incidence of hematopoietic neoplasias in the males exposed to 16,000, 8,000, and 2,000 ppm significantly exceeded the higher range of the historical controls.”
Part of this is fair. Their control does appear to be within the normal range for males of this breed, but the rest is suspicious because of the multiple comparison problem. Sure, if you pretend that they didn’t do all of the other comparisons or collect data on all of the other organ systems, their results for 2000, 8000, and 16000 ppm would be different from what has been historically reported, but you can’t ignore all of the other measurements that they made. When you consider the entire study and properly account for the family wise type 1 error rate, that significance goes away. Further, this still doesn’t explain why the 8000 ppm group was lower than the 2000 ppm group.
Their second defense is:
“If among the males exposed to the highest dose we exclude the animals bearing hematopoietic neoplasms, the survival is almost the same as among controls.”
Well of course that’s the case. This is just a restatement of the data. I’m not denying that there were higher cancer rates in the 2000 and 16000 ppm groups. Rather, I am questioning whether that change is a significant one that was caused by Splenda.
Finally, they state:
“The cumulative hazard is much higher among males treated at 16,000, 8,000, and 2,000 ppm than in males exposed to 500 or 0 ppm (Fig.7).”
This is true, but it’s also true that the cumulative hazard for the 8000 ppm group was lower than the cumulative hazard for the 2000 ppm group. That’s the key problem that they aren’t addressing.
As a final note, I have seen several people argue that this study actually found that Splenda caused the cancer rates to decrease in females. It is true that the raw numbers went down, but the difference was not statistically significant, which makes that claim erroneous.
As I have said many times on this blog, the peer-review system is not perfect. Bad papers do sometimes make it through, and this is one of those papers. Its statistics were questionable at best, and it looks like the authors cherry-picked their results and failed to properly control the error rate. Importantly, the reported increases in cancer rates were very inconsistent, which is not at all what you would expect from a true result. Further, even if the results were correct, this study was on mice, it used unrealistically high doses, and it only found significant increases in males. So you can’t make the type of broad, fear-mongering generalizations that so many people are jumping to.
More importantly than the results of this particular paper, however, I hope that you have learned some valuable lessons in critiquing scientific papers, because the things that I was pointing out are the type of things that you should look for in all papers. Reading the scientific literature should not be a passive process. Rather, it is usually an intensive, messy, and complicated process, and it does require you to have a certain level of background knowledge in scientific methodologies and statistics. So, if you really want to understand scientific studies, I strongly encourage you to take several courses in statistics.