This is the first in a series of eight posts under the umbrella title “Finding the truth – research in the biomedical sciences”. These posts outline the principles of the methods used to establish things like the cause of disease and the effectiveness of strategies to prevent or treat diseases. They discuss the strengths and weaknesses of these methods and how their misuse or the misinterpretation of their results can lead to misleading or incorrect conclusions. One of the primary aims of these posts is to help readers, even those with limited scientific background, make realistic judgements about the latest reports and media claims about research linking lifestyle factors and diseases or potential new therapies.
The titles of these eight posts are:
I – Introduction and overview
II – Observational epidemiological methods
III – In vitro and animal experiments
IV – Experiments with people
V – Meta-analysis
VI – Decision making in clinical research – hierarchies of evidence
VII – Is most published research really wrong? (This post already exists but a version modified for the context of this series will be re-posted in sequence)
VIII – Clinical trials and meta-analysis – the gold and platinum standards of evidence?
Aims and scope of this post
This post is intended to be the context for the other seven posts of the series, showing how all of these various observational and experimental methods fit into the overall scheme of biomedical research. It starts with a brief non-technical overview of some simple ways in which statistics are used by biomedical researchers; statistics has become an essential tool central to most biomedical research.
Use of statistics in the biomedical sciences
Biological scientists, like other scientists, use statistics to test whether results are likely to be just due to chance or to reflect a “real” effect, i.e. to decide whether the difference between two averages, or an association between two variables, is statistically significant or not.
Differences between means or averages
There is so-called biological variation in measurements made upon people and other living organisms. We talk about average height, average weight, average blood pressure, average heart rate and average body composition, but even if you match people for factors like age, sex and race there will still be variation around that average or mean figure. Many biological measurements approximately follow a so-called normal distribution: if you plot a frequency distribution of the number of people with each measurement then you get a bell-shaped curve centred on the mean. Most individuals are clustered around the mean, and the further away from the mean you get on either side, the fewer values are present. This trailing off of values as you get further from the mean occurs in a symmetrical and mathematically predictable way. A simply calculated value called the standard deviation describes the distribution of values around the mean: about 68% of values lie within 1 standard deviation either side of the mean, about 95% within 2 standard deviations and about 99.7% within 3 standard deviations (see figure 1). Thus in a large sample of British men only about 2.5% would be taller than the mean + 2 standard deviations and just 0.15% taller than the mean + 3 standard deviations.
This means that if, say, 25% of the children attending a primary school were more than 2 standard deviations below the average height-for-age of British children then this might alert health and welfare authorities that some dietary, lifestyle or environmental influence was adversely affecting the growth of children in this area. If an individual child was much more than 3 standard deviations above the average height for age then this might trigger endocrine tests on this individual, particularly if there was no family history of extreme tallness or there were other indicators of a pathological state.
To calculate the standard deviation one only needs to know the number of values, the mean value and the difference between each individual value and the mean.
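As a rough illustration of the calculation just described, the following Python sketch works out a mean and standard deviation from a short list of values. The heights are invented for illustration, and the sketch assumes the usual "sample" form of the standard deviation (dividing by the number of values minus one). The second function uses the mathematical normal curve to check the percentages quoted above.

```python
from math import sqrt
from statistics import NormalDist


def mean_and_sd(values):
    """Return the mean and the sample standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Difference between each individual value and the mean, squared
    squared_diffs = [(v - mean) ** 2 for v in values]
    # Sample standard deviation: divide by n - 1 (an assumption of this sketch)
    sd = sqrt(sum(squared_diffs) / (n - 1))
    return mean, sd


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Hypothetical heights in cm, invented purely for illustration
heights_cm = [168, 172, 175, 177, 180, 183]
mean, sd = mean_and_sd(heights_cm)
print(f"mean = {mean:.1f} cm, standard deviation = {sd:.1f} cm")

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k):.1%}")
```

The loop at the end reproduces the approximate 68%, 95% and 99.7% figures for 1, 2 and 3 standard deviations.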
Figure 1 The distribution of values around the mean (0) in a normal distribution
If one divided a large class of 15-year-old boys in half by alphabetical order then the average heights of the two groups would almost certainly not be exactly the same. If one divided another large mixed class of 15-year-olds according to sex then the difference between the average heights of the two groups would probably be bigger. Statistical analysis should indicate that the difference when division was by alphabetical order was likely to be just chance, but that the difference when the children were divided by sex was real and that boys of this age tend to be taller than girls.
Consider an experiment designed to test whether a drug increased (or decreased) weight gain in rats. One would set up two matched groups of rats and add the drug to the food of one group; the other group would act as the control and be given an identical diet which did not contain the drug. The weight gains of the two groups would then be compared at the end of the experimental period. Once again, the average weight gains of the two groups would almost certainly not be exactly the same even if the treatment had no effect. The scientist needs to be able to decide whether any difference in the average weight gain between the two groups is likely to be a real effect of the drug. To make this judgement objectively, the scientist would perform a statistical test. The average or mean weight gain of each group would be determined and the standard deviation calculated as the indicator of the degree of variability around the average value. A statistical test using the difference between the two averages and the two standard deviations makes it possible to estimate mathematically the chances of a difference in average weight gain at least as large as that found in the experiment occurring simply by chance. If this analysis suggested that a difference of this size was only likely to occur by chance on 1:1000 occasions then one would be confident that this was a real effect caused by the treatment/intervention and this result would be highly significant (p = 0.001 or 0.1%, where p stands for probability). At the other end of the scale, if the analysis told us that the likelihood of a difference of this size occurring by chance was 1:2 (p = 0.5 or 50%) then one would have no confidence that this was a real effect of treatment. One would expect to get a difference of this magnitude by chance every second time and so one would conclude that the difference is too likely to be due to chance to claim any significance, i.e. it is not statistically significant.
By convention, scientists regard a probability of less than 1:20 (p<0.05 or 5%) as statistically significant and the result is claimed as being likely to be due to a real effect of the treatment. Of course, by using this criterion, statistically significant differences can still occasionally occur by chance; theoretically in about one of every 20 experiments where the treatment has no real effect. If a treatment does have a real effect upon the outcome measured then whether or not a test achieves statistical significance will depend upon:
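The idea of asking how often a difference of this size would arise by chance can be sketched directly in Python. Note that this sketch uses a permutation test, not the test based on means and standard deviations described above (usually a t-test): it reshuffles the rats between the two groups in every possible way and counts how often the reshuffled difference is at least as large as the one observed. The weight gains are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(control, treated):
    """Exact two-sided permutation test for a difference between two means.

    Returns the fraction of all possible regroupings of the pooled values
    whose difference in means is at least as large as the observed one.
    """
    pooled = list(control) + list(treated)
    n_control = len(control)
    n_treated = len(pooled) - n_control
    observed = abs(mean(treated) - mean(control))
    total = sum(pooled)

    extreme = 0
    regroupings = 0
    for indices in combinations(range(len(pooled)), n_control):
        sum_a = sum(pooled[i] for i in indices)
        mean_a = sum_a / n_control
        mean_b = (total - sum_a) / n_treated
        # Small tolerance guards against floating-point rounding
        if abs(mean_b - mean_a) >= observed - 1e-12:
            extreme += 1
        regroupings += 1
    return extreme / regroupings


# Hypothetical weight gains in grams, invented purely for illustration
control_gain = [98, 104, 101, 95, 107, 100]
treated_gain = [110, 116, 108, 119, 112, 105]

p = permutation_p_value(control_gain, treated_gain)
print(f"p = {p:.4f}")
print("statistically significant at p < 0.05" if p < 0.05 else "not statistically significant")
```

With six rats per group there are only 924 possible regroupings, so every one can be checked; with larger groups one would normally take a random sample of regroupings or use a conventional test instead.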
- The magnitude of the treatment effect
- The general variability in the outcome measured and any variability in the response to the treatment
- The size of the sample.
It is possible when designing an experiment to roughly predict what sample size is required to produce a statistically significant result (p<0.05 or <5%). Thus in our experiment on rat weight gain, it is possible to predict how large a sample of rats one would need to get a statistically significant effect if the drug increased/decreased weight gain over the course of the experiment by say 10%.
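A minimal sketch of such a sample size prediction is given below. It rests on several assumptions that are not stated in the text: a two-sided significance level of 5%, a desired 80% chance of detecting the effect if it is real (the "power" of the study), equal group sizes and a simple normal approximation. The average weight gain of 100 g and standard deviation of 10 g are hypothetical figures.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate number of animals needed in each of two equal groups.

    effect: the difference between group means that one wants to detect
    sd:     the standard deviation of the outcome within each group
    Uses the normal approximation n = 2 * (z_alpha + z_power)^2 * sd^2 / effect^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd ** 2 / effect ** 2
    return ceil(n)


# Hypothetical figures: control rats gain 100 g on average with an SD of 10 g
mean_gain_g = 100
sd_gain_g = 10

for percent_change in (10, 5):
    effect_g = mean_gain_g * percent_change / 100
    n = sample_size_per_group(effect_g, sd_gain_g)
    print(f"{percent_change}% change in weight gain: about {n} rats per group")
```

The calculation shows the dependence on the three factors listed above: halving the size of the effect to be detected roughly quadruples the number of animals required, as does doubling the variability.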
If any study yields a result that falls just short of the 5% significance level then it may be worth repeating it with a larger sample.
A statistically significant difference between control and experimental treatments at the end of an experiment does not prove the underlying hypothesis; it could be the result of bias in the design or conduct of the experiment rather than a real effect of the treatment under test. For example:
- Some bias in the way the two groups were initially selected
- A difference or bias in the way the control and experimental groups were treated during the experiment
- Some consistent difference or bias in the way any outcome measure was monitored in the two groups.
Such differences could be responsible for any difference between the weight gains of the two groups rather than the treatment under test. One of the advantages of using animals like rats in an experiment is that it is relatively easy to ensure that the experiment is well controlled:
- The two groups of animals should be well matched at the outset e.g. for age, sex, weight and maybe litter matched (one of each pair of siblings assigned to each group)
- They would be fed identical diets throughout and kept under identical environmental conditions e.g. same size cages and occupancy, same room temperature and same light cycle.
If the person conducting the experiment did not know which group the animals belonged to when they were being handled and weighed then this should eliminate the risk of unconscious bias when making these measurements. This so-called blinding of experiments is a feature of good experimental design.
Another problem with statistics is that they are a mathematical statement that, in this case, two averages are probably different. This does not necessarily mean that in a clinical situation, the patient gets any real benefit from the intervention e.g. if something lowers blood cholesterol by a statistically significant but small amount in short-term trials, will this be of any real clinical benefit to the patients? Is the fall in cholesterol enough to make a real difference to heart attack risk and will it be maintained over the long-term? Any effect may be statistically significant but clinically small and of little real benefit to patients. The magnitude or degree of patient benefit from any effect of an intervention may be more critical than the probability value; something can have a small and not clinically useful effect that is nonetheless statistically significant.
Association between variables
Biomedical scientists, particularly epidemiologists, often look for associations between variables:
- Is there any relationship between saturated fat intake and blood cholesterol concentration?
- Is there any link between calorie intake and body fat content in adults?
- Is there a link between daily portions of fruit and vegetables and risk of death from cardiovascular disease?
- Is there any link between measured activity level or fitness and body fatness?
When testing the relationship between most biological variables one can calculate a value known as the (Pearson) correlation coefficient, given the symbol r. Suppose the two variables under test are plotted on the x and y-axis of a graph. If all of the points lie on a perfect straight line with a positive slope (i.e. increases in x are accompanied by increases in y) then this is a perfect positive correlation and r = +1. If all of the points lie on a straight line with a negative slope (increases in x are accompanied by decreases in y) then this is a perfect negative correlation and r = -1. If the points are randomly scattered around a horizontal line then there is no correlation between the measures on the x and y-axis and r = 0 (shown in figure 2).
Figure 2 To illustrate Pearson correlation coefficients of +1, -1 and 0.
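For readers who want to see how r is obtained, the short Python sketch below calculates the Pearson correlation coefficient from paired values. The three small data sets are invented so as to give a perfect positive correlation, a perfect negative correlation and no correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of the deviations from each mean
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Invented illustrative data
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # points on a straight line with a positive slope
falling = [10, 8, 6, 4, 2]   # points on a straight line with a negative slope
unrelated = [1, 2, 3, 2, 1]  # no overall upward or downward trend

for label, y in (("rising", rising), ("falling", falling), ("unrelated", unrelated)):
    print(f"{label}: r = {pearson_r(x, y):+.2f}")
```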
As with differences between means, one can assess the likelihood or probability of any association being due to chance and thus whether the r value is indicative of a real association between x and y. The statistical significance of r values between 0 and +1 / -1 depends upon the sample size. To be statistically significant (i.e. a probability <0.05 in the conventional two-sided test): with a sample of 10 pairs one needs an r value of over 0.63, with 30 pairs an r value of around 0.36, with 100 pairs an r of around 0.2 and with 1000 pairs an r of just 0.06 would be significant. Even though low r values may be statistically significant with large samples, the association between x and y is then only a very weak one. If one squares the r value, this indicates how much of the variation in y is explained by variation in x. With r values of +1 and -1, r² is 1 and so 100% of the variation in y can be explained by variation in x; for other values:
r value (+ or -)    r²      Variation in y explained by variation in x
1                   1       100%
0.7                 0.49    49%
0.5                 0.25    25%
0.2                 0.04    4%
0.1                 0.01    1%
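The way the threshold for significance falls as the sample grows can be sketched in a few lines of Python. This is an approximation only: it uses the Fisher transformation and not exact statistical tables, and it assumes a two-sided test at p<0.05. The exact thresholds depend on whether a one-sided or two-sided test is used. The final column shows how little of the variation in y is explained by an r value that only just reaches significance.

```python
from math import sqrt, tanh
from statistics import NormalDist


def approx_critical_r(n_pairs, alpha=0.05):
    """Approximate smallest r that is significant in a two-sided test.

    Based on the Fisher transformation, under which atanh(r) is roughly
    normally distributed with standard error 1 / sqrt(n_pairs - 3).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z / sqrt(n_pairs - 3))


print("pairs   smallest significant r   variation explained (r squared)")
for n_pairs in (10, 30, 100, 1000):
    r = approx_critical_r(n_pairs)
    print(f"{n_pairs:5d}   {r:22.2f}   {r * r:.1%}")
```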
A weak correlation can be statistically significant without being of much value to a medical researcher trying to explain how, why and if two variables are really linked. A high correlation coefficient indicates that two variables are associated but does not necessarily mean that changes in x cause changes in y. This issue, summed up in the mantra that association does not mean cause and effect, is probably the biggest weakness in the interpretation and analysis of epidemiological findings and is discussed at length in the second post in this series.
When looking at potential causes of disease, it is common to calculate something called the relative risk. In its simplest form this is the risk in a group exposed to the potential cause divided by the risk in the unexposed group. For example, one could monitor a large group of middle-aged people for some years and record the number of cases of lung cancer. The relative risk is the incidence of lung cancer (e.g. cases per 1000) in those who smoked divided by the incidence in those who did not. In this type of study, relative risk may be well over 10, i.e. smokers are more than ten times as likely as non-smokers to develop lung cancer. One could calculate relative risk at different levels or durations of smoking and one might expect to see a graded increase in relative risk with increased exposure. One might also look at other variants like relative risk in ex-smokers compared to non-smokers. Once again, such epidemiological studies only demonstrate association, but in this case the relative risk is so large and so consistent across different types of studies that it was accepted fairly quickly that this association was almost certainly causal.
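The arithmetic of relative risk is simple enough to sketch in Python. All of the counts below are hypothetical, invented only to show the calculation and the kind of graded increase with exposure described above; they are not data from any real study.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


# Hypothetical cohort followed for some years: (lung cancer cases, people in group)
non_smokers = (12, 10_000)
smoking_groups = {
    "light smokers": (40, 5_000),
    "heavy smokers": (110, 5_000),
}

for label, (cases, people) in smoking_groups.items():
    rr = relative_risk(cases, people, *non_smokers)
    print(f"{label}: relative risk = {rr:.1f}")

# All smokers combined versus non-smokers
all_cases = sum(cases for cases, _ in smoking_groups.values())
all_people = sum(people for _, people in smoking_groups.values())
overall_rr = relative_risk(all_cases, all_people, *non_smokers)
print(f"all smokers: relative risk = {overall_rr:.1f}")
```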
The process of scientific investigation
From the very beginning of their science education, students will have been taught that science works by making observations, using these observations to formulate a hypothesis and, where this is feasible, testing this hypothesis by an experiment. If the experimental results support the hypothesis then the hypothesis is, at least temporarily, accepted. If the experimental results do not support the hypothesis then a new or modified hypothesis that explains the observations is formulated and this can then also be tested by experiment:
Observation → hypothesis → experiment → accept/reject/modify hypothesis
If, for example, one found that the cases of cholera in an outbreak of the disease were clustered around a particular source of drinking water then one might hypothesise that this water is the source of the disease. If one further found that people within the normal catchment area of this water source who used an alternative supply were unaffected, and that people outside the normal catchment area who nevertheless had drunk water from this source were affected, then this would strengthen belief in the hypothesis. One could then test this hypothesis by preventing access to this water supply and seeing if this leads to a decline in new cases. This in essence describes the case of the “Broad Street pump handle”, where John Snow was able to provide such convincing evidence that water from this pump in Soho, London was the source of a major outbreak of the disease in 1854 that the local authorities removed the pump handle. Of course, provision of the suspect water to healthy people would probably confirm the hypothesis but would be ethically unacceptable, and animal studies would depend upon whether the species chosen developed cholera-like symptoms when infected with what we now know is the causative organism, Vibrio cholerae. Analysing the water for the causative organism would not have been an option in 1854.
Generation of a testable hypothesis – observational studies
Scientific experiments are set up to test a hypothesis. The hypothesis may have been generated by anecdotal observations or may be the result of a formal collection of data as in an epidemiological study or even another experiment. For example, a clinician may have got the impression that children who have restricted access to sweets and sugary soft drinks seem to have less tooth decay than other children, e.g. diabetic children who traditionally were allowed very little sugar in their diets. Or it may have been observed that rates of tooth decay rose sharply amongst an island population once regular supplies of sugar started to be shipped to the island. Or a short-term experiment may have shown that after consuming sugar the pH of the mouth becomes acidic for a time, increasing the dissolution of tooth enamel. The notion that sugar is implicated as a cause of tooth decay actually stretches back to the Ancient Greeks, presumably on the basis of casual observation. In order to get a firm basis for this suggestion, a formal epidemiological study might be set up whereby the sugar intakes of a large sample of children could be estimated along with some quantitative measure of their dental health, such as the number of decayed, missing and filled teeth (DMF). Alternatively, the average sugar consumption of several populations might be correlated with the average DMF score of children in these populations. This might show a correlation or association between sugar consumption and rates of tooth decay or between consumption of some types of sugary foods and DMF. These observational or epidemiological studies could then be published in their own right and could generate a hypothesis that sugar is a cause of tooth decay or that some forms or patterns of sugar consumption cause tooth decay.
Use of experiments to test a hypothesis
Our hypothesis that sugar consumption is a cause of dental decay could be tested in an experiment with laboratory animals provided that, like humans, they are prone to tooth decay and susceptible to sugar. Matched groups of animals would be given different amounts of sugar, e.g. in their food or drinking water, but otherwise treated identically. If the hypothesis is correct then one might expect to see a graded or dose-dependent increase in rates of tooth decay with increasing sugar consumption. One might conclude nowadays that doing such an experiment with people would be ethically unacceptable, but more than sixty years ago just such a study was conducted at a Swedish mental hospital, the Vipeholm study. Groups of patients were exposed to varying levels of total sugar intake, and the form and frequency of the sugar intake were also varied. This study confirmed that high sugar intake increased the risk of dental decay and also showed that it was not just the total amount of sugar that mattered; the form and frequency of its consumption were also important. The more frequently sugar was consumed between meals, and the more it was in a form that adhered to teeth, the worse its effects upon dental health.
Distinction between observation and experiment
The key difference between an observational study and an experiment is that an experimenter imposes some constraint or intervention with the intention of seeing whether this produces results that support the hypothesis being tested. Despite their complexity and sophistication, even the famous cohort studies mentioned in the second post in this series only collect and correlate data about people’s characteristics, what they choose to do or what the environment does to them. Thus in the original Nurses’ Health Study, the use or non-use of oral contraceptives by many thousands of nurses was recorded and health outcome measures compared in users and non-users, but the decision about whether or not to use these drugs was that of the individual nurse and not the scientist conducting the study. In an experiment (e.g. a randomized controlled trial, RCT) subjects would have been randomly allocated to receive either an oral contraceptive or an identical placebo and the designated outcomes measured at the end of the experimental period. This study would have had to run for quite a few years and neither the subjects nor the scientists would have known which were real and which were placebo tablets until the data had been collected. Clearly this would be a difficult study to set up with many thousands of sexually active married nurses who did not wish to get pregnant.
Measuring levels of blood cholesterol and relating this to the subject’s normal estimated saturated fat intake is an observational study. Inducing a change in the saturated fat intake of subjects to see what effect it has on blood cholesterol levels is an experiment. Observational studies cannot technically prove cause and effect; they can only show that two variables are related and the strength of that association, and thus allow a hypothesis about cause and effect to be generated. Experimental studies can confirm or disprove this hypothesis.
A range of observational and experimental approaches is available to modern-day biomedical researchers and the way in which they fit into the overall research process is illustrated in figure 3.
Figure 3 A flow diagram to illustrate the various observational and experimental methods available for biomedical research
The wide range of different types of observational and experimental methods available to modern biomedical researchers is illustrated within the two main text boxes.
Post II – describes the different observational methods and briefly indicates their relative strengths and weaknesses. These can be simple descriptive studies where, for example, incidence or disease-specific death rates in different populations are correlated with lifestyle, environmental or dietary characteristics of that population. At the other extreme they also include sophisticated and expensive cohort studies like the European Prospective Investigation into Cancer and Nutrition (EPIC), where measurements and dietary assessments were made on over half a million people from different countries who were then followed for several years to see if these initial measurements and characteristics partly predicted their risk of developing and dying of various types of cancer. These observational studies only show association but there is in-depth discussion of criteria that are used to establish whether any association is likely to be causal. Where associations are strong and these criteria are met, they can be used to justify a clinical intervention or a policy where controlled trials are not feasible, or they can be used to justify conducting difficult and expensive controlled trials.
Post III – describes the uses and limitations of in vitro experiments (e.g. using micro-organisms or isolated cells) and experiments with laboratory animals. This post focuses upon their limitations and the problems of applying the results of such studies to clinical and health policy issues. Despite all of their limitations, such studies often provide the evidence that initiates a line of enquiry that ultimately leads to new treatments or preventive strategies. Almost all Nobel Prize winners in physiology or medicine have made use of these in their prize-winning work.
Post IV – reviews the experimental methods that use human subjects. These range from short-term experiments often looking at changes in risk factors like blood cholesterol or measures of oxidant stress through to randomized, placebo-controlled trials (RCTs) which are seen as the gold standard of evidence in medical research.
Post V – describes the highly fashionable technique of meta-analysis. This is a weighted aggregation of similar studies testing the same hypothesis. It is essentially a statistical procedure which can be used to amalgamate almost any type of study including animal experiments and observational studies. It is frequently used to amalgamate controlled trial data and such meta-analyses are at the pinnacle of the evidence hierarchy. A successful meta-analysis combines several smaller studies into one large study of greater statistical power and gives a consensus from the results of these smaller studies. In practice differences between individual studies in things like the level of intervention (e.g. drug dose), selection criteria for subjects and outcome measures may make meta-analysis problematical or impossible.
Post VI – discusses how evidence from all of these different types of studies can be integrated and graded when making decisions about clinical practice or health policy. Evidence from these different observational and experimental methods is arranged into a hierarchy or pyramid, and clinical and health policy decisions should, ideally, be supported by evidence at the top of this hierarchy, like RCTs, where these are feasible. Some of the major scientific errors discussed in previous posts have arisen because policies or interventions have been decided based upon evidence at a relatively low level in this hierarchy, e.g. see the posts about cot death, the protein gap and antioxidants.
Post VII – addresses the question “is most published research really wrong?” (see existing post). It has been claimed that most published research is wrong and that up to 85% of research expenditure is wasted. The first part of this post discusses several reports of the lack of reproducibility of much of the research data that is published. This failure of reproducibility even applies to many apparently “landmark” pre-clinical research papers published in top medical journals. In addition to this irreproducible data there is an avalanche of low-grade research papers published in a host of low impact journals with poor standards of peer review which few people, if any, ever read or cite and which nobody considers worthy of trying to reproduce. The second part of the post tries to identify some of the reasons why so much published research is wrong or irreproducible and some of the characteristics of the faulty research. If the level of wastage of research resources could be substantially reduced this would be the equivalent of a major boost to research expenditure.
Post VIII – discusses some of the flaws and limitations of randomized, controlled clinical trials (RCTs) and meta-analyses, which have been dubbed the gold and platinum standards in medical research. Why do RCTs and meta-analyses testing the same hypothesis not always produce the same results? Traditional narrative reviews are often seen to be biased by authors favouring research that supports their beliefs, but there are also sources of bias in the conduct of meta-analysis which can similarly bias results towards the views of authors.