Letters and Responses
Far more research describes the measurement properties of evaluative measures such as the Roland-Morris (RM) scale1 today than a decade ago. The greater volume of studies provides more data that can be used to shape clinical decisions. It also increases the chance that the results of some studies will conflict with those of others. The study of Davidson and Keating2 seems to be an illustration of this phenomenon.
Davidson and Keating2 examined the reliability and responsiveness of 5 functional status questionnaires designed for patients with low back pain (LBP). One of the scales examined was the RM scale, a questionnaire that has been studied extensively by our group and many others. Davidson and Keating found that the reliability of RM scale measurements was low, with an intraclass correlation coefficient (ICC [2,1]) of .53 (95% confidence interval [CI]=.29, .71) for a sample of 47 patients with LBP who reported that their LBP was "about the same," "a little better," or "a little worse." For a smaller subgroup that reported their LBP was "about the same," the ICC (2,1) was lower at .42 (95% CI=−.07, .75). Based in part on these findings, the authors concluded that the RM scale "appeared to lack sufficient reliability and scale width for clinical application."2(p8)
In our opinion, these results are dramatically different from the large volume of evidence reported in the literature on the reliability of RM scale scores (Table).1,2,6–17 The evidence summarized in the Table was collected on diverse samples of patients from different countries with many different LBP diagnoses. Davidson and Keating2 attributed their findings, as compared with previous research, to a variety of "sample differences." For example, they suggested that, because their sample included some patients who were self-referred for physical therapy, these patients added variability, leading to the low reliability. However, several other researchers10,12,17 investigated the reliability of RM scale scores on samples that included self-referred and physician-referred patients, and the ICCs reported in these studies varied from .79 to .88. Another potential explanation for the lower reliability could be the interval between assessments (6 weeks in the study of Davidson and Keating). Yet, other studies8,11 with reassessment intervals of equal or longer duration reported ICCs on the order of .66 to .86.
Why were the point estimates reported by Davidson and Keating2 for the RM scale so low relative to past research? Considering the relatively small sample size, especially for the subgroup self-classified as "about the same," we suspect that there were a few patients who had an unusual amount of variability in their scores. Large variability in a few subjects could lead to low reliability when sample sizes are small. The authors refer to a small group of subjects who had pain for greater than 6 months and who demonstrated "considerable variability" in RM scores despite reporting no change in their condition. Given the relatively small sample sizes in the 2 reliability analyses (n=16 and n=47), it would likely take only a few patients with unusual variability in their scores to skew the reliability data and produce point estimates that are atypical compared with the large amount of evidence that has already been published.
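The sensitivity of a small-sample ICC to a few unusual subjects can be illustrated with a minimal sketch. It assumes the ICC (2,1) formula of Shrout and Fleiss and uses entirely hypothetical scores for 16 subjects on a 0 to 24 scale; none of the numbers come from Davidson and Keating's data, and the function name `icc_2_1` is our own.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1) (two-way random effects, single measurement)
    for an array of shape (n subjects, k occasions)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # one mean per subject
    col_means = x.mean(axis=0)   # one mean per occasion
    # Mean squares from the two-way ANOVA decomposition.
    bms = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    jms = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between occasions
    resid = x - row_means[:, None] - col_means[None, :] + grand
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))          # residual error
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

# Hypothetical test scores for 16 "unchanged" subjects (illustrative only).
test = np.array([2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 20, 22])

# Scenario A: every subject's retest score is within 1 point of the first score.
small_shift = np.array([1, -1, 0, 1, -1, 0, 1, -1, 0, 1, -1, 0, 1, -1, 0, 1])
stable_retest = test + small_shift

# Scenario B: identical, except 3 subjects show large test-retest differences.
outlier_retest = stable_retest.copy()
outlier_retest[[1, 8, 14]] = [16, 2, 6]

icc_stable = icc_2_1(np.column_stack([test, stable_retest]))
icc_outliers = icc_2_1(np.column_stack([test, outlier_retest]))

print(f"ICC(2,1), all subjects stable:     {icc_stable:.2f}")
print(f"ICC(2,1), 3 of 16 highly variable: {icc_outliers:.2f}")
```

In this constructed example, changing the retest scores of only 3 of the 16 subjects lowers the coefficient substantially, which is the mechanism we suspect; it does not show that this is what occurred in the published data.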
Given the extensive evidence that supports the reliability of RM scale scores,1,6–17 we disagree with the authors' recommendation that the RM scale should not be used as a measure of functional outcome in a general clinical population. Some clinicians may be tempted, based on the results reported by Davidson and Keating,2 to discontinue use of the RM scale or to consider other measures in lieu of the RM scale. We think this would be misguided when considering the evidence. We contend that the overwhelming majority of evidence supports use of the RM scale for routine clinical use or for research, and many experts agree with this view.3–5
Department of Physical Therapy
School of Allied Health Professions
Medical College of Virginia Campus
Virginia Commonwealth University
Box 980224
Richmond, VA 23298
School of Rehabilitation Science
Associate Member
Department of Clinical Epidemiology and Biostatistics
McMaster University
Hamilton, Ontario, Canada
References
Correlation indices of reliability such as intraclass correlation coefficients (ICCs) indicate the error in measurements as a proportion of the total variance in scores.1 They are affected by sample variance (ie, the range of scores demonstrated by subjects) as well as inconsistency in measurements. In answering the question "Why are the ICCs lower?" we would like to examine the standard error of measurement (SEM), as we believe that expressions of error in the same scale units (in this case, RMQ units) provide a more useful basis of comparison than the correlation coefficients.
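The link between a reliability coefficient and the SEM can be sketched as follows. This assumes the commonly used formula SEM = SD × √(1 − reliability coefficient); the text above does not state which formula was applied, and the standard deviation in the example is a hypothetical value chosen only for illustration.

```python
import math

def sem_from_reliability(sd, reliability):
    """Standard error of measurement in scale units, assuming
    SEM = SD * sqrt(1 - reliability coefficient)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical sample standard deviation of 5.0 RMQ units (an assumption).
hypothetical_sd = 5.0
for icc in (0.53, 0.79, 0.88):
    sem = sem_from_reliability(hypothetical_sd, icc)
    print(f"ICC = {icc:.2f} -> SEM = {sem:.2f} RMQ units")
```

Because the SEM depends on the sample standard deviation as well as the coefficient, two studies with the same ICC can imply different amounts of error in RMQ units, which is why the comparison is made on the SEM.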
Table 1 shows the SEMs reported for, or that we have calculated from, a number of studies that have reported ICCs and Pearson correlation coefficients.2–13 The SEM provides an indication of the extent to which the average respondent varies (in RMQ units) when retested at a time when his or her condition could reasonably be considered to be unchanged.14 Table 1 shows that SEMs for measurements taken using the RMQ ranged from 1.5 to 4.1 RMQ units. This means that, on average, subjects who are assumed to be unchanged typically could be 1.5 to 4.1 RMQ units to either side of an obtained score. We believe it is likely that the amount of expected error in self-report measurements of activity limitation varies with time between test and retest. Our study and that of Patrick et al12 identified comparable estimates of error. Patrick et al were the only other researchers to report on error associated with retesting a cohort at greater than 6 weeks from the first test. Other researchers who have conducted retests 6 weeks or more after an initial test have pooled these data with data obtained for subjects retested much closer to the first test. The RMQ measurements may display increasing variability as the time between tests increases. Clinicians should consider these findings in the light of the time frames over which they typically monitor patient progress. In contrast, the other instruments used in our study yielded scores that were relatively more stable for the same subjects.
Riddle and Stratford make the reasonable suggestion that an explanation for our observation is that a few subjects exhibiting extreme test-retest differences in a small sample distorted the results. In our group of 47 unchanged subjects, RMQ change scores ranged from −9 to +19 points, and 13 subjects (28%) had change scores of 5 points (21% of the scale width) or more. The interesting question for us is why those subjects who exhibited large variations in RMQ change scores had more stable scores on the other questionnaires (Tab. 2).
The size of the ICC does not tell us what items are more or less useful or whether the magnitude of error is acceptable for the intended use of the instrument. Close examination of patterns in the data, we believe, allows us to explore ways to refine instruments that we use to evaluate people with back pain. Publication bias, in our opinion, almost certainly confers an optimistic message about measurement utility. It is likely that some of the biases that result in underpublication of clinical trials with null findings15 also lead to underpublication of reliability studies with low reliability coefficient values. However, we are not aware of any studies that have explored the extent to which investigators fail to submit, or journals reject, such studies. The responsibility of researchers is to investigate and improve the instruments that we recommend for use in the examination of patients.
The incisive questions raised by Riddle and Stratford regarding our article are appreciated. Our data set nevertheless speaks for itself. We do not consider our findings or conclusions to be aberrant simply because they vary from previous findings. Most previous studies were based on different samples using shorter time frames for retesting. Indeed, the study by Patrick et al,12 in which longer retest periods than ours were used, produced findings that were similar to ours. Clinicians and researchers should weigh this evidence when considering examination instrument choice and should be prepared to change their choice of outcome measurement tools as better options present themselves.
School of Physiotherapy Faculty of Health Sciences
La Trobe University
Victoria 3086 Australia
School of Physiotherapy Faculty of Health Sciences
La Trobe University
References