Part I – Do we put too much faith in “the latest scientific evidence”?
Context of this post
This post is an edited extract from chapter 3 of my as yet unpublished book about error and fraud in biological and medical research. In chapter 2 of this book I discuss in detail four case studies of major errors in biological and medical science:
- The widespread but now discredited belief in a crisis of world protein supply commonly referred to as the protein gap. Between the early 1950s and mid 1970s it was generally held that protein deficiency was found in most Third World children and was the most serious and widespread dietary deficiency in the world. Billions of pounds/dollars were wasted on measures to solve this illusory problem.
- The promotion of front sleeping for babies in the 1970s and early 1980s that led to a worldwide epidemic of cot deaths and probably cost the lives of hundreds of thousands of babies.
- The widespread belief that antioxidant supplements would substantially prolong human life, even the lives of generally well-nourished people. The consensus of current evidence from controlled trials and meta-analyses is that these supplements are more likely to do net harm than good.
- The belief that tissue called brown fat was capable of burning off surplus calories and thus preventing weight gain and obesity in some people who overeat. Defective brown fat thermogenesis (heat production) was falsely seen as a likely important cause of human obesity and drugs that increased heat production in brown fat were seen as having major potential for preventing and treating human obesity.
I have written about all of these case-studies in my books, academic papers and in educational and newspaper articles. I have also posted articles about them on this blog site.
I believe that these errors have been largely caused by prematurely drawing conclusions and even taking policy actions based almost solely on evidence that is relatively low down on the pyramid or hierarchy of evidence e.g. animal experiments, studies with high risk groups and epidemiological studies.
In chapter 3 of my book and in this post I discuss whether these four case studies are more than isolated examples and are just the tip of the iceberg and symptomatic of a more general problem with the credibility of much of the scientific research that is published.
Doubts raised about the credibility of much of the research that is published
Almost every day, the health and science correspondents of newspapers and other popular media outlets summarise the latest scientific or medical papers that have caught their eye or been brought to their attention. Many of these findings seem counterintuitive, contradictory, or sometimes just unbelievable or even bizarre. For many others it is difficult to see how the research being summarised could make any significant contribution to advancing scientific understanding even if true. Sometimes it seems to me that the more unlikely and bizarre the findings, the more likely they are to transiently attract the attention of these popular journalists.
In 2005, John Ioannidis published a paper with the provocative title “Why most published research findings are false”. By December 2015, this paper had been viewed more than 1.5 million times and cited 1870 times. Clearly Ioannidis believes that there is a major and general credibility issue with published science. Similar negative views have been expressed about particular types of research approach. Young and Karr (2011) questioned the validity of much of the mass of published epidemiological (observational) research that features heavily in the research summaries in popular media and they concluded that “any claim coming from observational studies is most likely to be wrong”. Researchers affiliated to major pharmaceutical companies (e.g. Begley and Ellis, 2012) have reported that many basic preclinical studies, especially those with apparent potential for the development of new cancer drugs, have false or exaggerated findings.
Almost 2 million scientific papers are published each year in around 28,000 journals and the number of refereed/scholarly papers is doubling every 20 years; in the period 1996-2011, 15 million people authored 25 million scientific papers. The conclusion of authors like those above is that many and perhaps most of these papers have false conclusions and that as much as 85% of research resources are wasted. In The Chronicle of Higher Education in June 2010 Mark Bauerlein and his colleagues wrote an article entitled “We must stop the avalanche of low-quality research”.
Thus most reported new associations or effects are, according to such authors, either false or exaggerated. Many of these 2 million annual published papers will only be read by their authors and those involved in the peer review process and there will be no attempt to confirm most of them. Even the assumption that peer reviewers and editors have always read a peer reviewed paper may be optimistic, as shown by the acceptance for publication in open access journals (i.e. where authors pay the publication costs) of papers which have deliberate and obvious flaws (Bohannon, 2013) or even some that contain just strings of meaningless jargon (Davis, 2009; Van Noorden, 2014). One paper was apparently accepted for publication even though it consisted entirely of repetitions of a profane request to remove the author from the journal’s mailing list. According to the Institute for Scientific Information (ISI), 55% of the papers published in journals indexed by ISI in the period 1981-1985 received no citations in the five years after they were published i.e. they were not referred to by any other authors. Even where the results are essentially correct, many publications will do little to advance scientific knowledge or understanding and/or have no potential for practical impact.
A more positive perspective
I initially found the above estimate that 85% of research resources are being wasted shocking, but on reflection I would have found it difficult to make a credible case that this figure was seriously exaggerated in my own area of expertise, nutrition (see Nutrition Research: Past Successes and future direction). Of course the very nature of scientific research means that a relatively high wastage rate is inevitable as people’s bright ideas don’t work and their exciting theories turn out to be incorrect. If one uses the analogy of drilling for oil then one would anticipate that only a proportion of the drillings would yield viable flows of oil; nonetheless drilling is overall a productive activity.
The corollary of the 85% wastage rate is that all of the undoubted major advances in scientific understanding and the development of more effective medical treatments can be attributed to the remaining 15% of expenditure. If this 15% figure could be raised then this would be the equivalent of a free boost to productive research spending e.g. increasing this 15% to 30% would have an impact equivalent to a doubling of research expenditure. This should be a major incentive to trying to improve the conduct and reporting of scientific research and perhaps more importantly to increase efforts to design, plan and fund studies that have the best chance of yielding meaningful or useful results. Going back to our drilling for oil analogy, selecting sites for drilling based upon detailed geological analysis and preliminary seismic studies increases the proportion of successful wells. Of course, the metric used to judge success in science is the published paper so even if researchers themselves doubt the real value or validity of their work they will probably try to publish it somewhere for the sake of their careers, which adds to the avalanche of low quality research publications. In our oil drilling analogy this would be like encouraging others to drill in areas where there is a low probability of success or in extreme cases continuing to develop production facilities around a dry oil well!
Some of the money classified as wasted on unproductive research will have resulted in the training of new researchers which, if they are well-trained, is a useful output in its own right.
Generalisation and personal perspective
In many areas of life, metrics are used to try and get an objective and quantitative measure of things like quality, efficiency and value for money. More subjective judgements even the consensus view of “experts” are seen as unreliable and too prone to bias and prejudice. The problem is that once a metric is identified then rather than simply being an objective incidental measure of quality or efficiency it can become the driver of the way in which organisations are managed and individuals conduct their activities. Management and effort may be focused upon ways to increase scoring on the metric that may not enhance quality or efficiency or in some instances may reduce them. If a government gave large grants to oil exploration companies on the basis of the number of metres drilled then at least in the short term this might encourage companies to drill deep bore holes in accessible but unlikely areas. If a teacher was told that s/he was to be judged on the scores in an end of term examination they might be tempted to find ways to inflate their pupils’ scores e.g. marking more leniently, setting easier questions or specifically preparing students to answer questions on the examination paper. If surgeons are judged on their crude mortality rate then some may be tempted to avoid operating on high risk or difficult cases even if this is against the patient’s best interests.
Many scientists and their managers are so focused upon generating more peer reviewed papers that they lose focus on conducting and publishing the best quality research. Potential projects may be judged on their likelihood of generating papers and maybe generating media interest rather than their potential for advancing scientific understanding or making a real contribution to medical treatment. Researchers may be tempted to improve imperfect data or improperly cherry pick from their results to make them more likely to be accepted for publication. They may carve up data that could be published in one comprehensive paper so that it can be split into multiple publications or even publish slight variations of the same data in multiple publications. At the extreme end of the spectrum some unscrupulous scientists have “forged” very successful careers by publishing fabricated data. Over almost two decades, the Japanese anaesthesiologist Yoshitaka Fujii fabricated data for almost 200 scientific papers including over 120 clinical trials and he currently holds the record for the most retracted papers.
Reproducibility – the guarantee of scientific truth?
What drives professional scientists to make such pessimistic judgements as those above about the output of their fellow scientists? A major factor persuading these scientists to mistrust the bulk of scientific output is the lack of reproducibility of many published findings or sometimes the lack of any attempt to reproduce many apparently important research findings. Just before he published his paper suggesting that most published research findings are false Ioannidis (2005a) published an analysis of 49 very highly cited clinical research studies published in major medical journals. These papers were all published in the period 1999-2003 and had been cited by other authors more than 1000 times; a very high selection threshold. Forty five of these papers (92%) reported positive findings i.e. that the treatment under test was effective. He then searched the literature for later studies that had tested the same treatment using either a larger or similar-sized sample or had used better controlled designs. Of these 45 studies:
- 7 were contradicted by the subsequent studies
- 7 reported findings that were stronger than those of the subsequent studies
- 20 had their results confirmed
- 11 remained largely unchallenged.
So Ioannidis concluded that even very highly cited clinical research papers testing treatment outcomes were sometimes contradicted by later studies, produced stronger effects than the subsequent studies or had not been independently confirmed. Less than half had been independently verified. Of the six nonrandomised studies in the 45, five were refuted by subsequent bigger or better studies.
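The proportions behind this summary can be checked with a few lines of arithmetic. The sketch below is in Python; the counts are those given in the list above, while the labels and the helper function `share` are illustrative names of my own and not anything taken from the Ioannidis paper.

```python
# Fate of the 45 highly cited papers with positive findings (counts from the list above).
outcomes = {
    "contradicted": 7,
    "stronger than later studies": 7,
    "confirmed": 20,
    "unchallenged": 11,
}

total = sum(outcomes.values())  # should equal the 45 positive papers


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in outcomes.items():
    print(f"{label}: {count}/{total} = {share(count, total)}%")
```

The confirmed share (20 of 45) comes out below 50%, which is the basis for the statement that less than half had been independently verified.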
One of the most important pillars of science is that the results of an experiment or other study should be able to be repeated by the original authors and reproduced by other scientists. It is assumed that bad or fraudulent science will be found out when others fail to reproduce it:
“Scientists generally trust that fabrication will be uncovered when other scientists cannot replicate (and therefore validate) findings”. Crocker and Cooper (2011) in an editorial in Science.
Despite this faith in the concept of reproducibility, failure to reproduce the fabricated research findings of serially fraudulent scientists has very seldom been responsible for their unmasking. Repeatability and reproducibility are nevertheless seen as the key factors that guarantee the truth of scientific studies:
A measurement is repeatable if the original experimenter repeats the investigation using the same method and equipment and obtains the same results.
A measurement is reproducible if the investigation is repeated by another person, or by using different equipment or techniques, and the same results are obtained.
In both cases “the same” results means within the limits of normal random error i.e. not exactly the same but essentially similar. An indirect and anecdotal indication of the lack of reproducibility of some studies is the frequency with which headlines in the science and health columns of the popular media fluctuate between support for and opposition to a particular hypothesis. Take for example the following headlines taken from the BBC web-site about St John’s wort over a five year period:
- 31/8/2000 Herb “as effective as antidepressants”
- 9/4/2002 Herb ineffective as antidepressant
- 11/2/2005 Herb “as good as depression drug”
Young and Karr (2011), the two statisticians who suggested that “any claim coming from an observational study is most likely to be wrong”, use reports of contradictory findings from observational research as part of the support for this claim. This is strikingly illustrated by two consecutive papers in the world’s premier medical journal, The New England Journal of Medicine, both reporting data from world famous cohort studies about the effect of postmenopausal hormone (oestrogen) replacement therapy (HRT) on the risk of cardiovascular disease. The first paper concludes that HRT increased the risk of cardiovascular death by 50% whilst the second supported the proposition that HRT more than halved the risk of severe coronary artery disease (Wilson et al, 1985 and Stampfer et al, 1985). Young and Karr discuss the findings of Mayes et al (1988) who had attempted a more systematic demonstration of this phenomenon. Using fairly objective criteria, they identified 56 topics that were the subject of a case-control study and found that for each of them, conflicting findings had also been published. These were findings about whether exposure to things like common drugs including oral contraceptives, medical procedures, infections, vaccinations and lifestyle or environmental factors was associated with the later onset of disease. The topics included:
- Oral contraceptive use and 10 outcomes including breast cancer
- Diazepam and birth defects
- Coffee and bladder cancer
- Saccharin and bladder cancer
- Dogs and multiple sclerosis
- Smoking and cervical cancer.
Note that case-control studies involve comparing past exposure to a proposed factor in those classified as cases (e.g. having a condition) and those classified as controls (e.g. free of the condition). Thus in the above list one might compare the past use of diazepam in women who had a baby with a birth defect and those with normal healthy babies. Significantly higher use of diazepam amongst the cases might be used to support the proposition that diazepam increases the likelihood of a birth defect.
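The comparison at the heart of a case-control study is usually summarised as an odds ratio. The Python sketch below shows the calculation. The counts are entirely hypothetical and chosen only for illustration, and the function name `odds_ratio` is my own rather than anything taken from the studies cited.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of past exposure among cases divided by odds of past exposure among controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls


# Hypothetical counts, NOT real data: 100 mothers of babies with a birth defect (cases)
# and 100 mothers of healthy babies (controls), classified by past use of a drug.
example = odds_ratio(cases_exposed=30, cases_unexposed=70,
                     controls_exposed=15, controls_unexposed=85)
print(f"Odds ratio: {example:.2f}")
```

An odds ratio above 1 means that past exposure was more common among the cases than the controls. Whether that reflects a real effect is a separate question, because biased recall, confounding or chance could produce the same pattern.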
Young and Karr (2011) did a survey of 12 randomised controlled trials that between them tested a total of 52 claims made on the basis of observational studies; none of the 52 observational claims was confirmed in these clinical trials. For five of the claims, the clinical trials produced statistically significant effects in the opposite direction to the observational claims. Several of these claims relate to testing the supposed beneficial effects of antioxidants referred to at the start of this post.
Observational studies are considered important for generating hypotheses that can then be tested by more rigorous methods. There are inherent and well accepted problems with observational research that are beyond the control of the investigator. But what about pre-clinical research using in vitro systems such as isolated cancer cells or animal models where experimenters have the opportunity to conduct very high quality controlled studies? In a 2012 commentary in Nature two cancer researchers (Begley and Ellis, 2012) report that when 53 “landmark” papers that indicated the possibility of significant advances in cancer therapy were repeated by the pharmaceutical company Amgen, in only six cases (11%) were the scientific findings confirmed. Most of these papers were highly cited and in high impact journals. Researchers working for another pharmaceutical company, Bayer, also found that in three quarters of 47 cancer projects, in-house scientists were unable to reproduce work previously published in top journals and were forced to abandon these projects. Substantial time and resources were devoted to these efforts at reproducing this highly rated work. This means that a large proportion of pre-clinical studies even though published in high impact journals cannot be repeated with the same conclusions by an industrial laboratory.
In psychology there has been a more systematic attempt at measuring reproducibility of published materials. A University of Virginia psychologist Brian Nosek initiated what is known as The Reproducibility Project and this has been co-ordinated by him under the umbrella of the Center for Open Science which Nosek helped to establish in 2013. 270 psychologists from around the world were persuaded to try and repeat 100 psychology studies published in three major psychology journals, trying to replicate as far as possible the methodology of the original authors. They contacted the original authors before the attempt at replication to get details of materials and study design and to involve these original authors in checking the replication protocol, and almost all co-operated in this process. 97% of the original papers had statistically significant results whereas only a third of the replications did. The mean “p” value for the original studies was 0.028 whereas that for the replications was 0.302. The average effect size in the replications was half of that in the original papers and 82% of the originals had greater effect sizes than the replicates. The headline figure was that according to pre-set criteria there were 39 replications and 61 non replications; only 47% of original effect sizes were within the 95% confidence limits of the replication study (Open Science Collaboration, 2013). Nosek is currently involved in an ongoing project to try and replicate a subset of experimental results from 50 high impact cancer biology articles published in the period 2010-2012 (see Van Noorden, 2013).
Thus there is a substantial body of literature suggesting that for most effects claimed in observational studies there are papers reporting conflicting results i.e. a lesser effect, no effect or an opposing effect. Attempts by pharmaceutical companies to reproduce basic pre-clinical research prior to beginning translational studies have failed in a majority of cases to reproduce the original findings. Even with randomised controlled trials and meta-analyses, the most powerful methods at the very top of the hierarchy of evidence in medical research, there are many examples of divergent or conflicting results (to be the subject of a later post). Attempts to replicate 100 psychology experiments with the co-operation and input of the original authors have failed to replicate about two thirds of them and found effect sizes in the replications that were smaller than those of the originals in most cases and on average about half the original effect size. There are ongoing efforts to similarly test the reproducibility of experimental findings from 50 high impact cancer biology studies.
Part II – Why is so much published data irreproducible?
All of the authors cited in part I who have discussed the lack of reproducibility of scientific data and the generation of so many apparently false positive conclusions have discussed general flaws in design and execution of studies that produce false positive results. These devices are wittingly or unwittingly used by scientists to bias their data in favour of their preferred outcome i.e. supporting or consistent with their stated hypothesis. The following is an amalgamation of the flaws identified by these authors; the categorisation is somewhat arbitrary because they are often overlapping or sometimes different aspects of the same underlying fault.
Career advancement for most professional and academic scientists depends upon publication of research papers. Even relatively low grade publications can by sheer weight of numbers add to a scientist’s reputation and esteem but the most highly regarded publications in a scientist’s curriculum vitae would be first author papers in one of the top, high impact journals. To satisfy the referees and editors of these top journals, scientists need, or think they need, positive and clear cut data i.e. statistically significant data that supports the author’s hypothesis. The term publication bias refers to the increased likelihood of positive data (e.g. the experimental intervention works) being published compared to negative data (it has no effect). There has historically been a strong bias against negative data at the author, peer review and editorial levels. Where the prevailing view is that the intervention works then negative data may sometimes become the novel and interesting finding.
This striving for significant, hypothesis-affirming results encourages bias and scientists often find ways to confirm their preferred hypothesis. Several of the specific issues with scientific study design and execution discussed below could be regarded as mechanisms for introducing bias. A classic example of bias is provided by studies into the efficacy of acupuncture. In the period 1966-95 there were 47 studies of acupuncture emanating from China, Japan and Taiwan, where there is a long tradition of acceptance and use of acupuncture, and all of them reported that it was an effective treatment. Over the same period, there were 97 studies in the USA, Sweden and the UK and just over half (56%) found any therapeutic benefit (see Lehrer, 2010). The herb St John’s wort seems to be more effective in treating depression when tested by German speaking scientists than when tested by those in the USA and UK (Linde et al, 2008).
In the recent blog post of my case study of the fraudulent cancer specialist Werner Bezwoda it was noted that in phase II trials, high dose chemotherapy (HDC) seemed to substantially increase survival times in advanced breast cancer patients compared to standard treatment. However, Zuhman et al (1997) showed that patients who met the eligibility criteria for receiving HDC fared better than other advanced breast cancer patients even when they received standard chemotherapy and so the improved survival after HDC was probably an artefact due to biased selection. In the recent blog post about my case study of Sir Cyril Burt and the heritability of intelligence I argued that extraordinary bias in the selection of skulls and brains (e.g. not matching for size, age or sex) led to the conclusion that the brains of non-white races were smaller than those of white northern Europeans and thus these non-white races were less intelligent than the Europeans; a view that was held by many scientists in the 19th and early 20th century. Bias may often be unconscious but some scientists may use bias to obtain the result they want. As an analogy, if one were surveying the voting intentions of a British town one would get a very different prediction of the outcome of the election if one just sampled an impoverished working class estate or just an affluent middle class estate. Inadvertent biased sampling could give a false impression of overall voting intentions but it would also be easy to deliberately bias the prediction of voting outcome by biasing selection of the sample from different areas of the town.
Anyone who has read a lot of scientific papers will have observed that most of them report findings that are supportive of the original hypothesis or at least what the authors present as their original hypothesis. As long ago as 1959 the statistician Theodore Sterling found that 97.3% of published psychological studies that used tests of significance found the effect they were looking for and suggested that either psychologists were extremely lucky or more likely they were only publishing the results of “successful” experiments which gave the results that they wanted. In a later revisiting of this topic Sterling et al (1995) found that 95.6% of psychology papers still reported positive effects and a limited survey of three American medical journals found that 85.4% reported positive findings. Daniele Fanelli in a 2011 report analysed 4600 papers in all disciplines published over the period 1990-2007. He reported that around 70% of papers published in 1990-1991 reported a positive outcome i.e. supported the original hypothesis. The odds of reporting a positive result increased by around 6% per year between 1991 and 2007. The chances of reporting a positive outcome varied between disciplines and country of origin. The frequency of positive results increased as one moved from the physical to the biological to the social sciences. In ten of the discipline categories he used, 90% or more of the results were positive by 2007 including clinical medicine, immunology, molecular biology and genetics, neuroscience and behaviour, psychology/psychiatry and pharmacology/toxicology. Authors based in Asian countries were more likely to report positive findings than those in the USA who in turn were more likely to report positive findings than authors based in the EU. He suggests in the title of the paper that:
“Negative results are disappearing from most disciplines and countries”
Fanelli suggests that this bias against negative results not only distorts the literature directly but also discourages scientists from embarking on high risk projects and encourages the falsification or even fabrication of results. Publication bias is discussed again later in relation to selecting data for publication.
The pressure to achieve statistical significance
Scientists claim that a result is statistically significant if the chances are less than 1 in 20 (p<0.05) that such a result could have occurred by chance. If an experimental or observational study yields a difference between two averages or a relationship between two variables that has a less than 1 in 20 (p<0.05) probability of being just due to random chance then this makes it an interesting and potentially publishable data set. Scientists are thus keen that their studies reach this level of significance. This 5% or 0.05 probability (1 in 20) is actually just a scientific convention; it was decided rather arbitrarily in 1922 by the English statistician Ronald Fisher largely on the basis of convenience. This has become the agreed boundary between significance and non-significance and in terms of acceptance of scientific results has become something of a glass ceiling for scientists. In fact there is very little real difference between absolute probabilities just fractionally on either side of this boundary (see the table below). From the scientist’s perspective, however, this may represent the difference between a publishable dataset and an experiment that may need to be repeated or a dataset that is filed away and unused. In their eagerness to reach this significance threshold, scientists may make biased judgements or even unfairly manipulate their data to achieve this goal.
Table The likelihood of a result being due to pure chance at different probability (p) values close to 0.05.
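To give a rough sense of how little separates results on either side of the 0.05 boundary, the Python sketch below converts p values close to 0.05 into the size of the test statistic needed to produce them. It assumes a two-sided test based on the normal distribution (a z-test), which is a simplifying assumption made purely for illustration; it is not a reconstruction of the table, and the function name is hypothetical.

```python
from statistics import NormalDist


def z_for_two_sided_p(p):
    """Absolute z statistic that gives a two-sided p value of p under a standard normal."""
    return NormalDist().inv_cdf(1 - p / 2)


# Compare p values just either side of the conventional 0.05 threshold.
for p in (0.045, 0.049, 0.050, 0.051, 0.055):
    print(f"p = {p:.3f}  ->  |z| = {z_for_two_sided_p(p):.3f}")
```

Under this assumption the statistics for p = 0.049 and p = 0.051 differ by less than 0.02, yet by convention one result is labelled significant and the other is not.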
If there are many groups researching a hot topic then there is strong incentive to be first to publish any positive results and the chance of an isolated positive finding increases. If a widely publicised positive finding is published in a high impact journal then the incentive may then be to publish conflicting negative findings. The phenomenon of alternating extreme claims and refuting claims has been dubbed the Proteus phenomenon.
Two psychologists (EJ Masicampo and Daniel Lalande in 2012) published a study in which they looked at the probability levels claimed in all papers from a whole year’s issues of three highly regarded psychology journals. From these 36 journal issues they identified about 3600 probability values within the range p=0.01-0.1. When they plotted the number of probabilities at each point within this range they found a fairly even spread of values with a declining frequency as the p value went from 0.01 to 0.1. However there was a distinct peak of probabilities just below 0.05 (i.e. just statistically significant). When they narrowed the measurement interval for p progressively from 0.01 → 0.005 → 0.0025 → 0.00125 this peak just within the statistically significant range became more pronounced i.e. the excess of values just below 0.05 was even more obvious (see figure below). Overall they thus found that there were many more probabilities that were just on the statistically significant side of this arbitrary border than would have been expected from the general distribution of probability values e.g. they got more values between 0.045 and 0.05 than would have been expected from the overall distribution. This suggests that authors were either deliberately or subconsciously selecting or manipulating their data to increase the chances of it being below the magical 0.05 (5%) barrier.
Figure The distribution of probability values in the range 0.01-0.1 from papers in 36 issues of three major psychology journals (from Masicampo and Lalande, 2012)
Masicampo and Lalande’s data indicate that “fudging” of results may occur quite frequently, in addition to the headline cases of total data fabrication which are discussed as case-studies in my book.
Selective exclusion/inclusion of outlying results
If one or two results that adversely affect the probability are excluded on the basis that they are likely to be errors or anomalies then this may be enough to push the results of an experiment or correlational study into statistical significance. Likewise inclusion of improbable outliers may tip the results into statistical significance. If this is done without sound reasons for doing so and if what has been done is hidden from the reader then this is a form of scientific misconduct. The researcher probably knows how exclusion or inclusion of an outlier will affect the overall outcome of the study so this may bias their decision even if they are not wilfully trying to cheat. Ways should be found to make such decisions impartial e.g. by getting a third party to make the decision. Any exclusions should be transparent to the reader and the reasoning and mechanism for deciding upon exclusion should be given.
In a well-designed study, a researcher would decide at the planning stage how many observations or repetitions they are going to make. However, some researchers may take advantage of random fluctuations in the data. The researcher collects and analyses data continuously during the study and, if at some point the results become statistically significant, the experiment is terminated there and a statistically significant result is recorded and published. Once again this is certainly bad practice and could be regarded as misconduct.
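The effect of stopping as soon as significance is reached can be illustrated with a small simulation. This is only a sketch under assumptions of my own choosing: the true effect is zero, observations are normally distributed with a known standard deviation of 1 (so a simple z-test applies), the planned sample size is 100, and the "peeking" researcher starts testing after 10 observations. The function names and parameters are invented for the illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng, max_n, min_n, peek):
    """Simulate one study in which the true effect is zero.

    With peek=True the z-test is run after every observation from min_n
    onwards and the study stops at the first p < 0.05. With peek=False the
    test is run once only, at the planned sample size max_n.
    Returns True if the study ends with a "significant" result.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # assumed: mean 0, known SD 1
        if (peek and n >= min_n) or n == max_n:
            z = total / math.sqrt(n)  # sample mean divided by its standard error
            if two_sided_p(z) < 0.05:
                return True
    return False


def false_positive_rate(peek, studies=2000, max_n=100, min_n=10, seed=1):
    """Share of simulated null studies that report p < 0.05."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, max_n, min_n, peek) for _ in range(studies))
    return hits / studies


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"Test once at the planned sample size: {fixed_rate:.3f}")
print(f"Stop at the first p < 0.05:           {peeking_rate:.3f}")
```

With a single test at the planned sample size, the false positive rate should sit close to the nominal 5%; with repeated testing and early stopping it should be several times higher, even though nothing real is being measured.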
Multiple analyses and selective publication
According to Ioannidis (2005), the greater the number and the lesser the selection of tested variables, the less likely the findings are to be true. The results of a large randomised controlled clinical trial set up to confirm previous findings are likely to be true. The findings of hypothesis-generating studies in which lots of variables are tested are less likely to be true e.g. microarray studies to find gene associations with disease and epidemiological studies of associations between dietary variables and disease. Many potential variables are sometimes recorded during the course of a scientific study; it is thought better to collect data that one may not use than to wish, once the study has finished, that one had collected it. Some authors may perform multiple analyses on their data until they manage to produce a difference between two means or a correlation between two variables that reaches the magical 5% significance level. They may even use extra derived values in the statistical analysis until they find something that falls under the 5% barrier or can be manipulated under it. They then publish their work highlighting this one significant finding and perhaps ignoring many of the non-significant comparisons that were made – a form of publication bias. If enough variables are tested then it becomes increasingly likely that one or more of them will be statistically significant just by chance.
Are multiple analyses, aided by selective reporting, the source of some of the very improbable sounding associations that fleetingly attract the attention of the health/science correspondents of the popular media? Consider a case-control study set up to test whether people with a particular disease (cases) consume more of a widely distributed dietary constituent than those who do not have the disease (controls) to see if this supports the hypothesis that this dietary constituent might be causally linked to this disease. The initial analysis might show that the difference in total intake of this constituent between cases and controls failed to achieve statistical significance. If one then performed multiple analyses in which the intake of every food that contained this constituent was compared in cases and controls, one might generate one or more statistically significant results. One might even resort to comparing ratios such as the amount of this constituent divided by the activity of an enzyme involved in its metabolism. In a similar cohort study where many dietary and lifestyle variables were recorded at the outset and many outcomes recorded at the end then it may be possible to test dozens or even hundreds of potential associations between initial measures and final outcomes. If the disease had many forms like cancer then one might resort to testing associations between each of the many dietary and lifestyle variables measured and each specific type of cancer.
Young and Karr use the tongue-in-cheek example of a proposed relationship between jelly bean consumption and acne. If the initial analysis showed no significant association between acne and jelly bean consumption then each of the 20 different colours of jelly beans might be tested separately and in this mythical scenario, a statistically significant association found between green jelly bean consumption and acne (if one tests 20 associations then by chance one might expect one to be significant at the 5% level). This could then be published with the claim that there is a statistically significant link between green jelly bean consumption and acne. One could even imagine these investigators trawling through the literature on food dyes and if they found an obscure report that suggested that green dye promoted growth in a particular bacterial culture they could then retrospectively suggest a hypothesis that “because of the known bacterial growth promoting activity of green food colorant green jelly beans may exacerbate or precipitate acne”.
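The arithmetic behind the jelly bean example can be sketched in a few lines. The calculation assumes that the tests are independent and that there is no real association for any colour; both function names are my own and purely illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of chance 'significant' results among n_tests null tests."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of one or more false positives among n_tests independent null tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(
        f"{n_tests:>3} tests: expect {expected_false_positives(n_tests):.2f} "
        f"false positives, P(at least one) = {prob_at_least_one(n_tests):.2f}"
    )
```

For 20 colours of jelly bean the expected number of chance findings is one, and the probability of getting at least one "significant" colour is about 64%, even though no colour has any real effect.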
Young and Karr also discuss a real example of multiple analysis based on a claim that “females eating breakfast cereal leads to more boy babies” from a paper that suggested that a mother’s diet prior to conception influenced the sex of her baby (Mathews et al, 2008). Maternal diet was assessed at three time periods i.e. prior to conception, in early pregnancy and in late pregnancy. The consumption of 133 individual foods was assessed by food frequency questionnaires and it was found that consumption of breakfast cereal pre-conceptually was highly significantly linked to the chances of having a boy baby. When this data was re-analysed, Young et al (2009) concluded that the link between cereal consumption and gender was probably a false positive caused by multiple testing. This claim was disputed by the original authors but nevertheless the potential for chance false positive findings is very clear: one can test the association between dozens of dietary variables at different times and the chances of having a baby boy, so a chance association even at quite a high significance level would not be unexpected.
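To see why even a "highly significant" result is not surprising when many tests are possible, one can simulate the smallest p value that turns up by chance. The sketch below relies on the fact that, when there is no real effect, a p value from a continuous test statistic is equally likely to fall anywhere between 0 and 1. The figure of 399 tests (133 foods at each of three time periods) is an assumption made purely for illustration and is not the number of tests in the original analysis; treating the tests as independent is also a simplification, since intakes of different foods are correlated.

```python
import random


def smallest_p_value(rng, n_tests):
    """Smallest of n_tests p values when every null hypothesis is true."""
    return min(rng.random() for _ in range(n_tests))


def share_with_striking_result(n_tests, threshold, studies=2000, seed=7):
    """Share of simulated null studies whose best p value beats the threshold."""
    rng = random.Random(seed)
    hits = sum(smallest_p_value(rng, n_tests) < threshold for _ in range(studies))
    return hits / studies


N_TESTS = 133 * 3  # illustrative assumption: every food tested at every time period
share = share_with_striking_result(N_TESTS, threshold=0.001)
print(f"Share of null studies with at least one p < 0.001: {share:.2f}")
```

Under these assumptions roughly a third of studies with no real effects at all would still contain at least one association with p below 0.001.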
Begley and Ellis (2012) attempted to identify why many pre-clinical studies were not reproducible by scientists working in industrial labs. They suggested that frequent features of the non-reproducible studies were:
- Non-blinding of investigators as to which were the control and experimental groups, so increasing the chances of bias.
- Reporting just a single experiment and sometimes reporting just the experiments that were supportive of the authors’ underlying hypothesis.
In those studies that could be reproduced, authors had paid close attention to controls, reagents, investigator bias and describing the complete data set. If one tested a potential cancer treatment using several different cancer cell lines but published only the result or results where the substance had an apparent positive effect then this would be a distortion of the totality of the study.
According to Ioannidis (2005), research findings are more likely to be true in fields that undertake large studies, such as randomised controlled trials with several thousand subjects, and less likely to be true where studies are small. It is known that plotting study size against effect size tends to produce a funnel of results around the true or mean effect i.e. larger studies tend to be concentrated close to the mean effect but as study size decreases the scatter around the mean tends to increase. When the results of meta-analyses (a procedure for effectively combining similar studies into one large study of greater statistical power) are assessed, a lack of small negative trials in this funnel is taken to indicate publication bias i.e. the non-publication of small negative studies. Underpowered studies will tend to produce results that are widely scattered around the true result; if the true result is no effect then some of them may generate a statistically significant positive result, especially if there is some slight bias in the way the study is conducted or analysed.
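The funnel shape described above can be reproduced with simulated trials. In this sketch, which rests on assumptions I have chosen for simplicity, the treatment has no true effect, outcomes in both groups are normally distributed with a standard deviation of 1, and the "effect" of each trial is simply the difference between the two group means. The function names are illustrative.

```python
import random
import statistics


def estimated_effect(rng, n_per_group):
    """Difference in group means for one simulated trial with no true effect."""
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return statistics.fmean(treated) - statistics.fmean(control)


def spread_of_estimates(n_per_group, trials=300, seed=3):
    """Standard deviation of the estimated effects across many simulated trials."""
    rng = random.Random(seed)
    effects = [estimated_effect(rng, n_per_group) for _ in range(trials)]
    return statistics.stdev(effects)


for n in (10, 100, 1000):
    print(f"{n:>5} per group: spread of estimated effects = {spread_of_estimates(n):.3f}")
```

The spread of results should shrink roughly in proportion to the square root of the study size, so the smallest trials scatter widely around zero and a few of them will look like sizeable effects purely by chance.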
Some cohort studies have impressive sounding sample sizes (many tens of thousands) but often the number of cases will be small. A relative risk of 1.5 between highest and lowest exposure may actually be a rise from, say, 10 cases per 10,000 subjects to 15 cases per 10,000 subjects, and it is then much easier to see how any small bias in the study could be responsible for this relatively small effect. For example, if the disease has a long latent period before clinical diagnosis then the behaviour of some of the early cases may be affected by existing but undiagnosed disease.
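The figures in this example can be worked through explicitly. The first calculation uses the numbers given above; the second is a purely hypothetical scenario, invented for illustration, in which bias has placed just two cases in the wrong exposure group.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the high-exposure group divided by risk in the low-exposure group."""
    return (cases_exposed * n_unexposed) / (cases_unexposed * n_exposed)


def extra_cases_per_10000(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Absolute difference in risk, expressed as cases per 10,000 subjects."""
    return 10_000 * (cases_exposed / n_exposed - cases_unexposed / n_unexposed)


# Figures from the text: 15 versus 10 cases per 10,000 subjects.
rr_reported = relative_risk(15, 10_000, 10, 10_000)
extra = extra_cases_per_10000(15, 10_000, 10, 10_000)

# Hypothetical: two of the "high exposure" cases really belong in the low group.
rr_if_two_misplaced = relative_risk(13, 10_000, 12, 10_000)

print(f"Relative risk as reported: {rr_reported:.2f}")
print(f"Extra cases per 10,000 subjects: {extra:.1f}")
print(f"Relative risk if two cases were misplaced: {rr_if_two_misplaced:.2f}")
```

A 50% increase in relative risk here corresponds to only five extra cases per 10,000 subjects, and in this hypothetical scenario moving two cases between groups is enough to reduce the relative risk from 1.5 to under 1.1.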
Small effect size
Many of the headline-generating claims of relationships between diet and disease have relative risk within the range 1-1.5 i.e. the risk in the “higher risk” category is less than one and a half times that in the lower risk category. This compares to a relative risk of 15-40 for smoking and lung cancer depending on level and duration of smoking. The risk of cot death for a front sleeping infant in case-control studies is up to 8 times that of one sleeping on their back. Ioannidis (2005) goes as far as to suggest that some clinical research findings may simply be a measure of the prevailing bias. He takes the theoretical example of 60 dietary factors tested in relation to the risk of developing a specific tumour. If no dietary factors are actually truly related to developing this tumour but relative risks in the range 1.2-1.4 have been reported for the upper and lower thirds of these dietary variables then these effect sizes may simply be a measure of the net bias in these studies. The double blind design of clinical trials recognises the likelihood of experimenter bias affecting the outcome and ensures that those making measurements and collecting patient data do not know which treatment group a patient is in whilst the data is being collected. This blinding of data collection should be used in other types of study, where it is feasible, to reduce opportunities for making biased measurements or assessments.
If one is testing the correlation between two variables in epidemiological studies then there will be other variables that will also affect the outcome measure. There are procedures that allow one to try to statistically correct for and remove the effects of these so-called confounding variables. Decisions about which variables to try to allow for may make a difference to the outcome; different groups may make different decisions about which confounding variables to correct for or use different correction procedures and thus get different results. It may sometimes be difficult to know what all of the likely confounders are and it may be difficult to accurately correct for some. It is, for example, difficult to correct for the effects of activity levels in epidemiological studies; some groups may do it and others not, or different groups may do it in different ways. A more cynical approach might involve correcting for confounding variables one at a time and stopping at the point where the result remains or becomes statistically significant.
Degree of flexibility
The greater the flexibility in designs, definitions, outcomes and analytical modes, the less likely the results are to be true because flexibility offers the opportunity for bias and for transforming negative or non-significant results into significant positive ones. There may be selective outcome reporting or manipulation of the outcomes and analyses reported. If a standard clinical trial protocol is registered in advance with clear unequivocal outcomes like death and with standard, generally agreed statistical methods then the results are more likely to be true. A study using methods that are not fixed and agreed, where there are multiple ways of measuring outcome (like scales for schizophrenia or depression) and different ways of analysing data, is more likely to produce results that are not true. The less opportunity there is for the experimenter to exercise choice and discretion, the less the chance of their biasing the outcome in favour of an expected, desired or statistically significant outcome.
References
Begley, CG and Ellis, LM (2012) Drug development. Raise standards for preclinical cancer research. Nature 483, 531-3.
Bohannon, J (2013) Who’s afraid of peer review? Science 342, 60-65.
Davis, P (2009) Open access publisher accepts nonsense manuscript for dollars. The Scholarly Kitchen.
Fanelli, D (2012) Negative results are disappearing from most disciplines and countries. Scientometrics 90, 891-904.
Ioannidis, JPA (2005) Why most published research findings are false. PLOS Medicine 2, e124.
Ioannidis, JPA (2005a) Contradicted and initially stronger effects in highly cited clinical research. Journal of the American Medical Association 294, 218-28.
Lehrer, J (2010) The truth wears off. The New Yorker. Annals of science. December 13th 2010 issue.
Linde, K, Berner, MM and Kriston, L (2008) St. John’s wort for major depression. Cochrane Database of Systematic Reviews 2008, Issue 4. Art. No.: CD000448. DOI: 10.1002/14651858.CD000448.pub3.
Masicampo, EJ and Lalande, DR (2012) A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology 65, 2271-9.
Mathews, F, Johnson, PJ and Neil, A (2008) You are what your mother eats: evidence for maternal preconception diet influencing foetal sex in humans. Proceedings of the Royal Society B 275, 1661-8.
Mayes, LC, Horwitz, RI, Feinstein, AR (1988) A collection of 56 topics with contradictory results in case-control research. International Journal of Epidemiology 17, 680-5.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349, aac4716.
Stampfer, MJ, Willett, WC, Colditz, GA, Rosner, B, Speizer, FE and Hennekens, CH (1985) A prospective study of postmenopausal estrogen therapy and coronary heart disease. New England Journal of Medicine 313, 1044-9.
Sterling, TD (1959) Publication decisions and their possible effects on inference drawn from tests of significance – or vice versa. Journal of the American Statistical Association 54, 30-34.
Sterling, TD, Rosenbaum, WL and Weinkam, JJ (1995) Publication decisions revisited – the effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician 49, 108-12.
Van Noorden, R (2013) Initiative gets $1.3 million to verify findings of 50 high-profile cancer papers. Nature News Blog 16 October 2013.
Van Noorden, R (2014) Publishers withdraw more than 120 gibberish papers. Nature News 25 February 2014.
Wilson, PWF, Garrison, RJ and Castelli, WP (1985) Postmenopausal estrogen use, cigarette smoking, and cardiovascular morbidity in women over 50 – The Framingham Study. New England Journal of Medicine 313, 1038-43.
Young, SS, Bang, H and Oktay, K (2009) Cereal-induced gender selection? Most likely a multiple testing false positive. Proceedings of the Royal Society B 276, 1211-2.
Young, SS and Karr, A (2011) Deming, data and observational studies. A process out of control and needing fixing. Significance 8, 116-120.
Zuhman, ZU, Frye, DK, Buzdar, AU, Smith, TL, Asmar, L, Champlin, RE and Hortobagyi, GN (1997) Impact of selection process on response rate and long-term survival of potential high-dose chemotherapy candidates treated with standard-dose doxorubicin-containing chemotherapy in patients with metastatic breast cancer. Journal of Clinical Oncology 15, 3171-7.