Professional Perspectives
J Sim, PhD, PT, Professor, Department of Physiotherapy Studies, Keele University, Keele, Staffordshire ST5 5BG, United Kingdom (pta05{at}keele.ac.uk). Address all correspondence to Dr Sim.
N Reid, DPhil, Professor of Health Sciences, Office of the Vice-Chancellor, University of Plymouth, Plymouth, Devon, United Kingdom.
Submitted April 9, 1998; accepted August 23, 1998.
| Abstract |
|---|
Key Words: Confidence intervals, Estimation, Hypothesis testing, Statistical inference.
| Introduction |
|---|
An alternative approach to statistical inference, using confidence intervals (CIs), assists in addressing some of these limitations. In the medical literature, there has been increasing attention focused on the use of CIs.4–7 In a discussion of various aspects of statistical inference, Ottenbacher8 has advocated greater use of CIs in rehabilitation research, and CIs feature prominently in a recent discussion of statistical inference in rehabilitation.9
In this article, we examine some of the merits of an approach to statistical inference based on CIs. The conventional approach to hypothesis testing is described, followed by a discussion of the nature and use of CIs. Key strengths of the CI as a means of statistical inference are then considered, in particular, the precision that they attach to statistical estimates and the light they shed on issues of clinical importance. Finally, recommendations are made concerning the use of CIs.
| Conventional Hypothesis Testing |
|---|
Certain properties of the study sample can be calculated, such as the variance of all the scores for a given variable, the mean of these scores, or the mean difference in scores between 2 groups within the sample. These values are referred to as statistics. They can be calculated for each variable represented in the sample (although different statistics may be appropriate for different variables), and they will normally be slightly different, as different samples are drawn from the population. The corresponding properties of the population are known as parameters, and, because there is only one population, these values are fixed. Thus, the mean of a given population is invariant, but the means of a series of samples drawn from that population will typically vary to some degree. A study statistic, such as the mean, is an estimate of the corresponding population parameter. It is an estimate because the true value of the population parameter is almost always unknown.
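The distinction between a fixed parameter and a varying statistic can be illustrated with a short simulation. The following Python sketch is only an illustration: the population of VAS scores, its mean of 65 mm, its standard deviation of 12 mm, and the sample size of 50 are all hypothetical values chosen for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

# Hypothetical population of 10,000 VAS scores (mm).
# Its mean is a parameter: there is one population, so the value is fixed.
population = [random.gauss(65, 12) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw 5 random samples of 50 and calculate the mean of each (a statistic).
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(5)
]

print(f"population mean (parameter): {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic): {m:.2f}")
```

Each sample mean is an estimate of the same population mean, but the estimates differ from one sample to the next.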
When a study is carried out, one or more relationships between various statistics will often be examined. For example, a study can examine an association between 2 variables within the sample or a difference in the mean or median values of a variable between 2 (or more) subgroups within the sample. The purpose of a statistical hypothesis test is to determine whether such a relationship is a "real" one (ie, it represents a corresponding relationship in the population) or a "chance" one (ie, it has emerged due to sampling variation and, although accurately reflecting the relationship that exists in the sample, does not necessarily represent a corresponding relationship in the population). The way in which the statistical test accomplishes this is by asking the question: What is the likelihood of finding this relationship in the sample, if, in fact, no such relationship exists in the population from which it was drawn? This assumption of no relationship is referred to as the null hypothesis, and the rival assumption (that the relationship does exist within the population) is referred to as the alternative hypothesis.10
A hypothetical example may serve to illustrate the way in which a statistical hypothesis test is utilized. A sample of 50 patients with fibromyalgia syndrome (FMS) is drawn randomly from a population of such patients and then assigned (again randomly) to receive 1 of 2 treatments designed to alleviate pain. The null hypothesis is that a difference in pain relief will not exist between the 2 groups following treatment. The alternative hypothesis is that such a difference will exist. The chosen outcome variable, pain intensity as measured in millimeters on a 10-cm visual analog scale (VAS), is measured before and after treatment, and a pain relief score is thereby obtained (pretest score minus posttest score). The mean pain relief score can then be calculated for each group.
This pain relief score will almost certainly differ in the 2 groups, but the question is whether such a difference in outcome is attributable to an underlying difference in the treatments received by the groups, rather than simply to the effect of sampling variation or of chance differences in group assignment. If there is a sufficiently high probability that the observed difference in outcome can be attributed to such variation in sampling or group assignment (conventionally, a probability of 5% or above [ie, P ≥ .05]), then the assumption of no difference between the treatments (ie, the null hypothesis) is retained. That is, the observed difference between groups is considered to be no greater than the difference expected from variation in sampling or group assignment, at the specified level of probability, and the assumption of no underlying difference between groups, therefore, is considered to be plausible. Conversely, if there is a sufficiently low probability that the observed difference in outcome is due to these chance factors (conventionally, a probability of less than 5% [ie, P <.05]), then the assumption of no difference between treatments is rejected. That is, it is considered more plausible that the difference in outcome is due to an underlying difference between the groups than that it is due to chance factors such as variation in sampling or group assignment. The null hypothesis is rejected in favor of the alternative hypothesis.
Table 1 shows the results of the hypothetical experiment. A statistical test for differences was used in this study. Because the data concerned are continuous, are approximately normally distributed, and can be argued to lie on an interval scale of measurement, a t test for independent measures was performed. The probability associated with the test statistic (P=.043) was less than the conventional critical value of .05. This finding provides sufficient grounds for doubting the null hypothesis, which is therefore rejected in favor of the alternative hypothesis. The difference in pain relief between the 2 groups is said to be real.
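The mechanics of a t test for independent measures can be sketched as follows. The scores below are hypothetical pain relief scores for 2 small groups, not the data of Table 1, and the critical value of 2.145 (two-tailed, 5% level, 14 degrees of freedom) is taken from standard t tables rather than from the article. The function name `independent_t` is likewise an assumption made for the example.

```python
import math
import statistics

def independent_t(group_1, group_2):
    """Return the pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    # Pool the 2 sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * statistics.variance(group_1)
        + (n2 - 1) * statistics.variance(group_2)
    ) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_1) - statistics.mean(group_2)) / se_diff
    return t, df

# Hypothetical pain relief scores (mm on the VAS) for 2 groups of 8 patients.
group_a = [22, 30, 18, 27, 35, 25, 29, 31]
group_b = [15, 21, 12, 24, 19, 17, 26, 14]

T_CRITICAL = 2.145  # two-tailed, P = .05, 14 degrees of freedom (from t tables)

t, df = independent_t(group_a, group_b)
reject_null = abs(t) > T_CRITICAL
print(f"t = {t:.2f} with {df} degrees of freedom; reject null: {reject_null}")
```

Comparing the test statistic with a tabulated critical value is equivalent to asking whether the associated probability is below .05. It gives the same "yes/no" decision without reporting the exact probability.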
The outcomes of statistical tests need to be considered in the context of the situation to which they relate, and outcomes of clinical research must be subjected to clinical judgment. It is worth noting that the converse of the situation just outlined may also arise. A difference in pain relief may be found not to be real, perhaps due to insufficient sample size or a high degree of variance in subjects' scores (Type 2 error). In such a case, although the observed difference in pain relief is real for this sample, it cannot be assumed to reflect a real difference in the population. In order for such a finding to be applicable to general clinical practice, the observed difference must be shown to be real for the population. This will require a reduction in the risk of a Type 2 error, through an increase in sample size, a more precise method of measurement, or other means of reducing random error.
Another unanswered question relates to the magnitude of this difference in pain relief between groups. This difference is an estimate of the difference that would exist if the full population of patients with FMS were studied. All we know is that the difference found in this particular sample is sufficiently great for it to be attributed to a genuine difference between the treatments rather than to chance variation in sampling or allocation to groups. We do not know how good an estimate it is of the true population difference (ie, the difference we would find had we tested the treatments on the whole population of individuals with FMS). The hypothesis test has told us, on a "yes/no" basis, whether the observed difference is real, but it has not enlightened us as to the true value of this difference in the population. As Abrams and Scragg11 point out, a probability value conveys no information about the size of the true effect. This is information, however, that we need in order to inform clinical practice.
The CI assists in addressing these questions as to the clinical importance and magnitude of an observed effect and remedies some of the shortcomings of more conventional approaches to hypothesis testing. These points will be considered in detail following an account of interval estimation.
| The Nature of Confidence Intervals |
|---|
The essential meaning of a 95% CI can be expressed as follows. If we were to draw repeated samples from a population and calculate a 95% CI for the mean of each of these samples, the population mean would lie within 95% of these CIs. Thus, in respect of a particular 95% CI, we can be 95% confident that this interval is, of all such possible intervals, an interval that includes the population mean rather than an interval that does not include the population mean. It does not strictly express the probability that the interval in question contains the population mean, as this must be either 0% or 100%. The population mean is either included or not included.12,13
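This repeated-sampling interpretation can be checked by simulation. In the sketch below, the population mean (65), the population standard deviation (12), and the number of repetitions (2,000) are hypothetical. The critical value of 2.010 is the tabulated two-tailed 95% value of t at 49 degrees of freedom, corresponding to samples of 50.

```python
import random
import statistics

random.seed(42)  # fixed seed so that the sketch is repeatable

TRUE_MEAN = 65.0  # hypothetical population mean
TRUE_SD = 12.0    # hypothetical population standard deviation
N = 50            # sample size, giving 49 degrees of freedom
T_CV = 2.010      # two-tailed 95% critical value of t at 49 df (from t tables)
REPS = 2000       # number of repeated samples

def ci_95(sample):
    """95% CI for the mean: sample mean plus or minus t_cv times the standard error."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - T_CV * se, mean + T_CV * se

hits = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    lower, upper = ci_95(sample)
    if lower <= TRUE_MEAN <= upper:
        hits += 1

coverage = hits / REPS
print(f"{coverage:.1%} of the {REPS} intervals contain the population mean")
```

The proportion printed should be close to 95%. Any single interval, however, either contains the population mean or does not.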
The function of a CI, therefore, is essentially an inferential one. A CI is used when examining a characteristic of a sample (in this case, the mean pretest VAS score) in terms of its degree of variability in the corresponding population. If the researcher's concern with sample statistics, however, is a purely descriptive one (ie, if the researcher is interested only in the pretest VAS score of the sample, without reference to the population from which this sample was drawn), conventional measures of dispersion, such as the standard deviation (for a mean) or the semi-interquartile range (for a median), should be used.
Width of Confidence Intervals
For a given level of confidence, the narrower the CI, the greater the precision of the sample mean as an estimate of the population mean. In a narrow interval, the mean has less "room" to vary. There are 3 factors that will influence the width of a CI at a given level of confidence.
First, the width of the CI is related to the variance of the sample scores on which it is calculated. If this sample variance can be reduced (eg, by increasing the reliability of measurements), the CI will be narrower, reflecting the greater precision of the individual measurements. Selecting a sample that is more homogeneous will reduce the variance of scores and thereby increase their precision. This factor, however, is often outside the researcher's influence.14
Second, following the principles of sampling theory, sampling precision increases in a curvilinear fashion with increasing sample size. This increase in precision occurs because the variance of a statistic, as expressed by its standard error, decreases as sample size increases. Figure 1 shows 4 samples of a progressively greater size drawn from a single population of physical therapists and the mean period of postqualification experience for each sample. The mean is precisely the same in each case, but the CI becomes narrower as the sample size increases. As sampling precision is related to the square root of the sample size, doubling the sample size narrows the CI by a factor of only 1/√2, a decrease in width of about 29%; halving the width requires a 4-fold increase in sample size.15
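The square-root relationship can be made concrete with a few lines of Python. This is a sketch under simplifying assumptions: the standard deviation of 6.0 is hypothetical, and the critical value is held fixed at 1.96 (the normal approximation), whereas the t critical value would itself fall slightly as the sample grows.

```python
import math

def ci_width(sd, n, critical=1.96):
    """Full width of a CI for a mean: 2 * critical value * sd / sqrt(n)."""
    return 2 * critical * sd / math.sqrt(n)

sd = 6.0  # hypothetical sample standard deviation
base = ci_width(sd, 100)
for n in (100, 200, 400):
    width = ci_width(sd, n)
    print(f"n={n:4d}  width={width:.2f}  relative to n=100: {width / base:.3f}")
```

Under these assumptions, the width at double the sample size is 1/√2 of the original width, and the width is halved only when the sample size is quadrupled.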
It is not the case, however, that, at a given confidence level, a narrow CI is any more (or less) likely than a wider CI to be one that contains the population parameter. The probability of including the parameter is determined by the chosen confidence level, not by the width of the particular CI concerned. If a 95% CI is narrow, this means that only a small range of possible values has to be included in order to be 95% confident that the CI contains the parameter. Correspondingly, a wide CI means that a large range of possible values has to be included in order to be 95% confident that the parameter lies within the CI. The probability of inclusion, however, is the same in both cases. A 95% CI is, by definition, one of a set of possible intervals of which 95% contain the population parameter, irrespective of its width.
The width of a CI is indicative of its precision (ie, the degree of random error associated with it), but it does not convey its accuracy (ie, whether it includes the population parameter), which is determined by the chosen level of confidence.9 Choosing a 99% CI rather than a 95% CI will increase the accuracy of the CI (ie, it will have a greater chance of being one of those that includes the population parameter), but will decrease its precision (ie, it will be wider than the corresponding 95% CI).
The usefulness of a CI depends on the point statistic (eg, the sample mean) on which it is based being an unbiased point estimate. If systematic error is present in a study, the point estimate will lie at some distance from the true value of the parameter. In such a case, a CI based on a large sample will, paradoxically, be more misleading than one based on a small sample.16 Consider again Figure 1. Imagine that the point estimate of 9.2 obtained is biased and that the true population mean is 8.5. It is evident that, unlike the 2 wider CIs, the narrow CIs, based on the larger samples, actually exclude this value. In the presence of systematic error, the lesser precision afforded by a wide CI actually increases the likelihood of its including the true population value. This example illustrates the fundamental point that increases in sample size will only assist in dealing with random, not systematic, error. Systematic error is usually an issue in study design rather than a function of the statistics used.
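The effect of bias on intervals of different widths can be sketched numerically. The point estimate of 9.2 and the true mean of 8.5 come from the example above. The standard deviation of 4.0, the 4 sample sizes, and the use of 1.96 as the critical value (a normal approximation) are hypothetical choices, because the data underlying Figure 1 are not reproduced here.

```python
import math

POINT_ESTIMATE = 9.2  # biased sample mean from the example
TRUE_MEAN = 8.5       # true population mean assumed in the example
SD = 4.0              # hypothetical standard deviation, not from the source
Z = 1.96              # normal approximation to the 95% critical value

def ci_95(mean, sd, n):
    """Approximate 95% CI for a mean from the sample SD and sample size."""
    half_width = Z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

results = {}
for n in (50, 100, 400, 1600):  # hypothetical sample sizes
    lower, upper = ci_95(POINT_ESTIMATE, SD, n)
    results[n] = lower <= TRUE_MEAN <= upper
    print(f"n={n:5d}  CI=({lower:.2f}, {upper:.2f})  contains 8.5: {results[n]}")
```

With these assumed values, the 2 wider intervals still capture the true mean, whereas the 2 narrower intervals, although more precise, do not.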
Calculating the Confidence Interval
The 95% CI stated earlier for the mean pretest VAS scores in the FMS study is calculated from the sample mean ($\bar{x}$), the statistic from the t distribution representing a 95% level of confidence at 49 degrees of freedom ($t_{cv}$), and the standard error of the sample mean (SE), according to the following formula:

$$95\%\ \mathrm{CI} = \bar{x} \pm (t_{cv} \times SE)$$
The terms in the calculation relate to some of the basic concepts considered earlier. The t statistic corresponds to a particular probability level (and thus a confidence level), the degrees of freedom are determined by sample size, and the standard error of the mean represents sample variance.
For a 99% CI, tcv would be 2.683, and the CI would accordingly be wider: 62.99, 72.21. Conversely, for a 90% CI, tcv would be 1.676, resulting in a narrower CI of 64.72, 70.48.
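The calculation can be sketched in Python for the 3 confidence levels. Two inputs are inferred rather than quoted: the sample mean of 67.60 is the midpoint of the 90% CI given above, and the standard error of 1.718 is back-calculated from the half-width of that interval divided by 1.676. The 95% critical value of 2.010 is taken from standard t tables for 49 degrees of freedom.

```python
MEAN = 67.60  # inferred: midpoint of the 90% CI quoted in the text
SE = 1.718    # inferred: (70.48 - 64.72) / 2 / 1.676

# Two-tailed critical values of t at 49 degrees of freedom.
# 1.676 and 2.683 are quoted in the text; 2.010 is from standard t tables.
T_CV = {90: 1.676, 95: 2.010, 99: 2.683}

def confidence_interval(mean, se, t_cv):
    """CI for a mean: mean plus or minus (t_cv * SE), rounded to 2 decimals."""
    half_width = t_cv * se
    return round(mean - half_width, 2), round(mean + half_width, 2)

intervals = {
    level: confidence_interval(MEAN, SE, t_cv) for level, t_cv in T_CV.items()
}
for level, (lower, upper) in intervals.items():
    print(f"{level}% CI: {lower:.2f}, {upper:.2f}")
```

Only the critical value changes between the 3 intervals. The mean and the standard error are the same, so a higher confidence level necessarily gives a wider interval.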
A 95% CI for a difference in means would be calculated in an analogous manner:

$$95\%\ \mathrm{CI} = (\bar{x}_1 - \bar{x}_2) \pm (t_{cv} \times SE_{diff})$$

where $\bar{x}_1$ and $\bar{x}_2$ are the 2 sample means and $SE_{diff}$ is the standard error of the difference between these means. Confidence intervals can also be constructed for sample statistics other than the mean and in relation to samples that do not satisfy the assumptions of parametric statistics.17–19
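A CI for a difference in means can be sketched from summary statistics as follows. The group means, standard deviations, and sizes are hypothetical. The pooled-variance form of the standard error is one common choice and is assumed here, and the critical value of 2.011 (two-tailed 95%, 48 degrees of freedom) is taken from standard t tables.

```python
import math

def pooled_se_diff(sd1, n1, sd2, n2):
    """Standard error of the difference in means, using the pooled variance."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def ci_diff(mean1, mean2, se_diff, t_cv):
    """CI for a difference in means: (mean1 - mean2) plus or minus t_cv * SE."""
    diff = mean1 - mean2
    return diff - t_cv * se_diff, diff + t_cv * se_diff

# Hypothetical pain relief summary statistics (mm) for 2 groups of 25.
mean1, sd1, n1 = 30.0, 10.0, 25
mean2, sd2, n2 = 23.0, 12.0, 25
T_CV = 2.011  # two-tailed 95% critical value of t at 48 df (from t tables)

se_diff = pooled_se_diff(sd1, n1, sd2, n2)
lower, upper = ci_diff(mean1, mean2, se_diff, T_CV)
print(f"difference = {mean1 - mean2:.2f}, 95% CI: {lower:.2f}, {upper:.2f}")
```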
| Advantages of Confidence Intervals |
|---|
The way in which the null hypothesis is tested by means of a CI is by determining whether the null value (ie, the value specified in the null hypothesis) lies within the CI. If the null value lies within the CI, we cannot exclude it as being the population parameter at the chosen level of confidence. In contrast, if the null value lies outside the CI, we can exclude the null value from the possible values of the population parameter at this level of confidence. For example, Table 3 shows a 95% CI for the mean difference between the pain relief scores for the 2 groups in the FMS experiment, in addition to the results of the t test reported previously. The null value is that of zero difference, and the 95% CI does not include zero. That is, the researcher can be 95% confident that the difference that would be found in the population of patients with FMS would be greater than zero, which is equivalent to rejecting the null hypothesis of no difference at the P <.05 level by means of a t test.
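The decision rule just described amounts to a simple check of whether the null value falls inside the interval. In the sketch below, the function name `excludes_null` and the interval for a difference in means are hypothetical. The odds-ratio interval of 0.75 to 11.98 is the one reported later in this article for the Physicians' Health Study.

```python
def excludes_null(lower, upper, null_value):
    """True when the null value lies outside the CI (null hypothesis rejected)."""
    return not (lower <= null_value <= upper)

# Difference in means: the null value is 0 (hypothetical interval).
print(excludes_null(0.72, 13.28, null_value=0.0))

# Odds ratio: the null value is 1 (interval reported later in the article).
print(excludes_null(0.75, 11.98, null_value=1.0))
```

The first interval excludes its null value, so the null hypothesis would be rejected at the corresponding level. The second includes it, so the null hypothesis would be retained.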
When parametric analysis involving the testing of multiple hypotheses is being performed, there are techniques that take account of the number of such comparisons more efficiently than manual adjustment of probability values (eg, the various multiple range tests associated with analysis-of-variance procedures).13 In such situations, it is probably most advisable initially to conduct the process of hypothesis testing by a technique of this sort and then report CIs for the pair-wise comparisons found to be both statistically significant and clinically important.
Confidence Intervals Are Informative on Questions of Clinical Importance
The CI can provide valuable information when trying to determine the clinical importance of the outcome of a trial. In the third FMS study (Tab. 5), the mean difference in pain relief of just over 7 mm on the VAS was sufficient to reject the null hypothesis of no difference with an independent t test. Despite the statistical significance (a high probability that the result was real) attained by this outcome, the lower limit of the 95% CI reveals that the true value of the difference between treatments could be as low as 2.63 mm. The CI for the pain-relief scores is wider in this FMS study than in the previous FMS study (Tab. 4) due to the greater variance of these scores. Although a mean difference of 7.16 mm on a VAS is arguably likely to be clinically important and is the best estimate of the corresponding population parameter for this study sample, a difference of 2.63 mm, which is arguably not likely to be clinically important, is also compatible with rejection of the null hypothesis at the P <.05 level. Accordingly, 2 conclusions can be drawn from these results:
We believe that it is equally important for CIs to be reported when the null hypothesis is not rejected. Gore pointed out that a nonsignificant statistical test is "a statement that the trial results are consistent with there being no difference between treatments, and is not at all the same as saying that there is actually no difference."15(p660)
A 95% CI of the differences in scores that includes zero, and thereby causes the null hypothesis to be retained, may nonetheless contain differences, in either direction, that could be clinically meaningful and that may represent the "true" population value. Such information is potentially important to the clinician but is not revealed by inspection of the probability value alone. The Physicians' Health Study23 provides a case in point. In this randomized, controlled trial investigating the effects of aspirin and a placebo in the prophylaxis of stroke (N=22,071), the odds ratio (ie, the ratio of the likelihood of death from stroke in the group that received aspirin to the likelihood of death in the group that received a placebo) was 3.0.23 This 3-fold difference, however, was not statistically significant (P=.16). This finding is confirmed by the fact that the 95% CI for the odds ratio (0.75, 11.98) includes 1, which is the null value for an odds ratio. Inspection of the upper limit of the CI for the odds ratio reveals that an almost 12-fold difference in fatal stroke between the 2 groups cannot be excluded as the population parameter with 95% confidence. Thus, there is considerable lack of precision in the odds ratio for this study, which is due to the low incidence of strokes among the subjects (6 in the group that received aspirin, 2 in the group that received a placebo). Consequently, despite the nonsignificant finding, we would hesitate to conclude that aspirin has no effect on stroke mortality.
| The Role of Confidence Intervals in Meta-analysis |
|---|
In order for a meta-analysis to be performed, a common measure of effect size must be extracted from, or retrospectively calculated for, each study included in the analysis. The odds ratio is an appropriate measure of effect size for studies that examine the relative incidence of a dichotomous outcome, as opposed to differences in an outcome variable measured on a continuous scale.31 The odds ratio is the ratio of the odds (likelihood) of achieving a certain outcome under one treatment condition to the odds of achieving that outcome under another treatment condition, with an odds ratio of 1.0 denoting no difference. To illustrate, in a systematic review of randomized, controlled trials of intensive versus conventional therapy for stroke, Langhorne et al32 found an overall odds ratio for death or deterioration of 0.54. This odds ratio means that the odds of death or deterioration with intensive therapy are 54% of those with conventional therapy.
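An odds ratio and an approximate 95% CI can be calculated from a 2 × 2 table of counts, as in the sketch below. The counts are hypothetical. The interval uses the logarithmic (Woolf) method, which is one common approach and is assumed here for illustration; it is not necessarily the method used in the studies cited.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI (log method).

    a, b: numbers with and without the outcome under treatment 1
    c, d: numbers with and without the outcome under treatment 2
    """
    odds_ratio = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical trial: 20 of 100 patients deteriorate under treatment 1,
# and 35 of 100 deteriorate under treatment 2.
or_value, lower, upper = odds_ratio_ci(20, 80, 35, 65)
print(f"odds ratio = {or_value:.2f}, 95% CI: {lower:.2f}, {upper:.2f}")
```

Because the interval is built on the logarithmic scale, it is not symmetrical about the odds ratio itself.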
Studies will vary in the contributions they make to the total odds ratio, based on sample sizes and other factors that may or may not control random error. Displaying the CI for each study on what is known as a forest plot illustrates clearly the relative merits of the separate studies. Those studies that are based on larger samples have correspondingly narrower CIs, and the CI for the total odds ratio is the narrowest, as this CI is based on the aggregated sample.
Figure 3 shows a forest plot for hypothetical studies of biofeedback versus control treatment for habitual shoulder dislocation, in which the dichotomous outcome measure was whether a recurrence of dislocation occurred within the 8 weeks following treatment. The figure also shows the total odds ratio, which is calculated by statistical analysis of the aggregated data from the individual studies (odds ratios can be calculated retrospectively for studies that used other outcome measures, such as risk ratios). A ratio of less than 1.0 indicates that biofeedback is associated with a lower likelihood of recurrence than the control treatment. Note that, although the odds ratio from study "c" approximates the total odds ratio very closely, the width of the associated 95% CI shows that it would be very difficult to draw a meaningful inference from the results of this study alone. The narrow width of the CI for the total odds ratio reflects the precision that results from aggregating data, and the fact that it excludes 1.0 indicates that the ratio is statistically significant at the P <.05 level. Thus, CIs assist considerably in the interpretation of the results of meta-analytic studies.
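The way in which aggregation narrows the CI can be sketched with a simple pooling calculation. The 4 studies below are hypothetical and are not those plotted in Figure 3. The method shown, fixed-effect pooling of log odds ratios weighted by inverse variance, is one common technique and is an assumption made for this sketch, as the method behind the total odds ratio is not specified here.

```python
import math

def pool_fixed_effect(studies, z=1.96):
    """Pool odds ratios by inverse-variance weighting on the log scale.

    studies: list of (odds_ratio, lower_95, upper_95) tuples.
    Returns the pooled odds ratio with its approximate 95% CI.
    """
    weights, weighted_logs = [], []
    for odds_ratio, lower, upper in studies:
        # Recover each study's standard error from the width of its CI.
        se = (math.log(upper) - math.log(lower)) / (2 * z)
        weight = 1 / se**2
        weights.append(weight)
        weighted_logs.append(weight * math.log(odds_ratio))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
    )

# Hypothetical studies: (odds ratio, lower 95% limit, upper 95% limit).
studies = [
    (0.60, 0.25, 1.44),
    (0.45, 0.20, 1.01),
    (0.55, 0.12, 2.52),
    (0.50, 0.30, 0.83),
]

pooled, pooled_lower, pooled_upper = pool_fixed_effect(studies)
print(f"pooled odds ratio = {pooled:.2f}, "
      f"95% CI: {pooled_lower:.2f}, {pooled_upper:.2f}")
```

Studies with narrower CIs receive greater weight, and the pooled interval is narrower than that of any single study. With these assumed values, it excludes 1.0 even though 3 of the 4 individual intervals do not.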
| Conclusion |
|---|
| References |
|---|