Literature Reviews
E Croarkin, PT, MPT, NCS, is Neurological Clinical Specialist, Physical Therapy Section, National Institutes of Health, Bldg 10, Room 6S-235, 9000 Rockville Pike, Bethesda, MD 20892-1604 (USA) (ecroarkin{at}cc.nih.gov).
J Danoff, PT, PhD, is Research Consultant, Physical Therapy Section, National Institutes of Health, and Associate Professor, Department of Exercise Science, George Washington University Medical Center.
C Barnes, PT, MSPT, is Contract Physical Therapist, Broaddus Hospital, Philippi, WVa. At the time of data collection, Ms Barnes was a student intern at the National Institutes of Health.
Address all correspondence to Mrs Croarkin.
Submitted March 15, 2002; accepted July 10, 2003.
| Abstract |
|---|
Key Words: Evidence-based rating, Motor function tests, Stroke, Upper extremity.
| Introduction |
|---|
Wade,1 in 1989, was the first author to review upper-extremity motor function tests commonly used in stroke rehabilitation. He reported that adequate tests existed and that these tests should be applied under research and clinical circumstances to establish their strengths and weaknesses. In his review, he indicated whether the degree of validity, reliability, and sensitivity of tests was known; which test domains were covered (eg, impairment or specific [focal] disability); and whether the test was composed of a battery of tasks. His recommendations for using a test were based on the amount of time required to administer the test.
Several other review articles, not specific to stroke rehabilitation, have considered hand and arm function tests.2-4 McPhee2 reviewed hand tests, emphasizing test characteristics and the importance of choosing appropriate tests for specific diagnoses. Bear-Lehman and Abreu,3 in 1989, demonstrated the continued need to provide estimates of the validity and reliability of instruments used to measure hand function. They described instruments that measure hand function in terms of range of motion, edema, performance, sensation, dexterity, and physical capacity evaluation. Rudman and Hannah4 reviewed 4 tests of hand function. They characterized what they believed was the clinical utility of tests and rated the strength of available evidence to support psychometric properties as "initial," "limited," or "questionable."4 Although Rudman and Hannah's review offered a method to judge the strength of the literature describing psychometric properties, it did not include all hand function tests, nor did it delineate psychometric evidence obtained exclusively from patients who have had strokes. None of these previously described reviews demonstrated an extensive search for all available upper-extremity motor function tests, and the tests included were not selected based on a set of predetermined inclusion criteria.
In 1995, the US Department of Health and Human Services published clinical practice guidelines for "post-stroke rehabilitation."6 Each guideline was based on the "level of research evidence" or the "degree of consensus" among experts. Using a modification of Sackett's procedures (Tab. 1),7 the guide described research evidence to support many areas of stroke rehabilitation. Sackett's levels of evidence are most readily applied to areas of research with large numbers of studies and with trials of interventions. Therefore, the group that developed the guide characterized available research designs as either randomized controlled trials or quasi-experimental designs. They also considered the amount of published research available and then made recommendations for the use of standardized assessment tools. Tables in the guide listed investigations in which the validity, reliability, and sensitivity of measurements obtained with various tests of function were examined. Measures of disability and overall motor function were included in these tables, but measures specific to upper-extremity motor function were not included. Evidence supporting the use of upper-extremity motor function outcome measures, therefore, has not been systematically reviewed.
In this review, the importance of having an assessment tool supported by published accounts of validity and reliability is discussed. We have identified the presence or lack of psychometric support for a number of upper-extremity assessment tests based on published reports of interrater reliability (IRR), test-retest reliability (TRT), convergent validity (CVV) or concurrent validity (CCV), and predictive validity (PV) (Appendix 1).1,2,8
| Purposes |
|---|
| Methods |
|---|
Approximately 2,200 article titles were identified in the 2 searches. Of this original group, 170 articles were selected to receive more detailed examination. Selection was based on article inclusion criteria of being published in a peer-reviewed journal and having at least one of several design objectives related to upper-extremity motor function tests. A few discrepancies occurred between the lists generated in the 2 searches. In these cases, we conferred until agreement to include or exclude the article was reached. Appendix 2 describes in more detail the inclusion criteria for this stage of the literature review.
After examination of these 170 articles, we identified 31 different tests that were used for upper-extremity motor function. A set of inclusion criteria related to the nature of the tests and how they were applied was then used to identify a subset of 13 articles. These articles not only described the use of upper-extremity motor function tests on people who had strokes, but also presented data that could be used to evaluate the psychometric properties of the tests. Details of the inclusion criteria are given in Appendix 2. Additional searches were performed using the 31 named tests as key words to be sure that other publications had not been missed in the 2 main literature searches. Tests that did not meet the inclusion criteria are listed in Table 2. Nine tests that qualified are described in Table 3.
Database fields included title, author, year, number of subjects, age of subjects, and whether the subject groups were people with or without stroke. Tests described in the articles were recorded in a checklist data field. Therefore, all tests referenced or used in the studies described in the articles could be added as the project continued. Inclusion criteria were applied to each test from the checklist to establish the list of appropriate tests. Comment fields were used to collect information regarding psychometric properties and statistical analyses.
Levels of Evidence
Upper-extremity motor function tests examined for this review were assigned to ordinal categories (levels I, II, and III) according to how many psychometric properties had demonstrable (published) results for those tests within a group of patients with strokes. If data for all 3 of the psychometric properties (IRR, TRT, and CVV/CCV) were reported for subjects following a stroke and produced significant correlations (P<.05) between repeated evaluations or between final evaluation and a reference instrument, the test was assigned to level I. If 2 of the 3 psychometric properties were supported by significant correlations, the test was assigned to level II. If only one of the psychometric properties was supported by a significant correlation, the test was assigned to level III. For some studies, both correlations and significance were reported, and, for some studies, the principal author (EC) calculated significance values based on reported correlations and the number of subjects in a study.8
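The level assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the set-based input are hypothetical, and the t statistic shown is one standard way to test a correlation against zero given r and n. The article does not state which procedure was used beyond citing Portney and Watkins.8

```python
import math

# The 3 psychometric properties that count toward a level.
# PV is reported separately and does not change the level.
PROPERTIES = ("IRR", "TRT", "CVV/CCV")


def evidence_level(supported):
    """Return 'I', 'II', 'III', or None for a set of supported properties.

    `supported` holds the properties for which a significant correlation
    (P<.05) was published in a sample of patients with strokes.
    """
    count = sum(1 for prop in PROPERTIES if prop in supported)
    return {3: "I", 2: "II", 1: "III"}.get(count)


def t_statistic(r, n):
    """t statistic for testing a correlation r (sample size n) against zero.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
    The result is compared with a tabled critical value (assumed procedure).
    """
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# A test with published IRR and CVV/CCV evidence (plus PV) is level II.
level = evidence_level({"IRR", "CVV/CCV", "PV"})

# Reported values of r=.88 with n=20 give a t statistic with 18 df.
t_value = t_statistic(0.88, 20)
```

Keeping PV out of `PROPERTIES` mirrors the text: PV is credited separately and never raises or lowers a test's level.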
If the PV of a test is established, the clinical utility of that test is improved. Therefore, to give credit to tests for which PV was determined, this finding also was reported. Predictive validity could have been examined to predict placement after discharge, use of assistive device, functional independence during inpatient stay, and so on. These predictions may be projected to a year after stroke onset. Predictive validity only considers conditions at the time of testing. Many other factors may influence the long-term performance capabilities of a patient.
Key: CCV=concurrent validity, CVV=convergent validity, IRR=interrater reliability, PV=predictive validity, TRT=test-retest reliability.
| Results |
|---|
Level II: Established by Evidence for at Least 2 Psychometric Properties (IRR, CVV/CCV Plus PV)
Motor Assessment Scale.
Loewen and Anderson16 and Poole and Whitney19 separated the Motor Assessment Scale into 3 upper-extremity components or subscales and determined IRR for each subscale. Both pairs of investigators used the same subscales: upper arm function, hand movement, and advanced hand activities. Poole and Whitney described IRR between 2 raters using 24 subjects and Spearman correlation coefficients. The Spearman correlation coefficients were 1.00, 1.00, and .98 for the respective subscales. Loewen and Anderson described IRR among 14 raters using 7 subjects and percentage of agreement with kappa coefficients. Percentages of agreement were 96.2%, 100%, and 100%, with kappa coefficients of .93, 1.00, and 1.00, respectively. Both pairs of investigators concluded that high IRR could be attained using the Motor Assessment Scale. In addition to IRR, Poole and Whitney described the CVV of data obtained with the Motor Assessment Scale using Spearman correlations. They estimated CVV values by comparing the Motor Assessment Scale and Fugl-Meyer Sensorimotor Assessment scores for 30 subjects. The CVV values were r=.89 (proximal component of upper-extremity motor function), r=.92 (distal component of upper-extremity motor function), and r=.91 (total upper-extremity motor function).
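Percentage of agreement and kappa, the two IRR statistics mentioned above, answer slightly different questions: kappa discounts the agreement expected by chance. The following minimal sketch shows Cohen's kappa for 2 raters. The scores are invented for illustration, and the 14-rater analysis in the cited study may have used a different multi-rater procedure.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of subjects given identical scores by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal category proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ordinal scores for 10 subjects (not data from any study).
scores_a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
scores_b = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3]

agreement = percent_agreement(scores_a, scores_b)
kappa = cohens_kappa(scores_a, scores_b)
```

In this invented example the raters disagree on one subject, so kappa comes out lower than the raw proportion of agreement.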
In 1990, Loewen and Anderson,17 using Spearman correlations, performed additional psychometric testing to examine the predictive ability of the Motor Assessment Scale. When using the combined arm score of the Motor Assessment Scale, scores at 1 month following stroke correlated with arm scores at discharge (r=.94, P<.0001, n=50).
Motricity Index.
The Motricity Index had evidence for CVV18 as early as 1986, for PV in 1989,21 and for IRR in 1990.10 Collin and Wade,10 using Spearman statistics for 20 subjects, reported IRR of the Motricity Index to be r=.88. Results were obtained by examining scores obtained for the following components of the upper-extremity subscale of the test: grip, elbow flexion, and shoulder abduction. Hsieh et al,14 using intraclass correlation coefficients (ICCs), described the CVV of Motricity Index values by comparing them with Action Research Arm Test scores (r=.87, n=50). Parker et al,18 using linear correlation, also described the CVV of Motricity Index values by comparing them with Nine-Hole Peg Test scores (r=.82, n=187). In 1990, Collin and Wade10 sought to establish the CVV of data obtained with 3 different tests: the Motricity Index, the Trunk Control Test, and the upper extremity subscale of the Rivermead Motor Assessment. They believed the Motricity Index and the Trunk Control Test were the tests requiring comparison, and the Rivermead Motor Assessment was used as the "established" measure. No evidence, other than that reported by Collin and Wade, was found to establish the Rivermead Motor Assessment as having validity or reliability for patients following a stroke. The correlations between the Motricity Index upper-extremity subscale scores and the Rivermead Motor Assessment upper-extremity subscale scores across 3 time periods (6, 12, and 18 weeks after stroke) were .76 (n=27), .73 (n=25), and .74 (n=14), respectively.
Sunderland et al,21 in 1989, also used the Motricity Index to describe the PV of what they called "grip strength" as determined with an electronic dynamometer. They found that the Motricity Index was better than the Nine-Hole Peg Test at identifying subjects who would score above zero on the Frenchay Arm Test 6 months after initial assessment.
Level II: Established by Evidence for IRR and CVV/CCV
Action Research Arm Test.
The IRR of data obtained with the Action Research Arm Test was established by Hsieh et al14 and Lyle.22 Only Hsieh et al, however, reported studying patients with strokes exclusively. The ICC value was established at .98 using 50 patients.
In an effort to establish CCV/CVV of data obtained with the Action Research Arm Test, Hsieh et al14 compared the Action Research Arm Test with 3 other tests: the Motor Assessment Scale, the Modified Motor Assessment Chart, and the Motricity Index. Only the upper-extremity subscales of the Motor Assessment Scale, Modified Motor Assessment Chart, and Motricity Index were used. Concurrent validity of data obtained with the Action Research Arm Test was assessed by using the Motor Assessment Scale. Hsieh et al examined 50 subjects and found the Motor Assessment Scale and the Action Research Arm Test were closely associated (r=.96). Convergent validity of data obtained with the Action Research Arm Test as compared with data obtained with the Modified Motor Assessment Chart and the Motricity Index was established at .94 and .87, respectively, using Pearson correlations.
Chedoke-McMaster Stroke Assessment.
The Chedoke-McMaster Stroke Assessment has both an impairment inventory and a disability inventory. In 1993, Gowland et al12 published a study of the reliability of data obtained for independent components of the impairment inventory. Impairment inventory components included measures of shoulder pain, postural control, and arm (upper-extremity), hand, leg, and foot function. Interrater reliability was established for the arm (upper-extremity) subscale (ICC=.88) and for the hand subscale (ICC=.93). Concurrent validity (r=.95) was reported for the combined arm (upper extremity) and hand components correlated with the combined Fugl-Meyer Sensorimotor Assessment shoulder, elbow, forearm, wrist, and hand scores.
Fugl-Meyer Sensorimotor Assessment.
Interrater reliability and CVV have been established for the Fugl-Meyer Sensorimotor Assessment multiple times. Duncan et al11 performed the first analysis of Fugl-Meyer Sensorimotor Assessment intrarater reliability and IRR on subjects whose mean time from onset of stroke was 51 months and found r=.98 to .99. Sanford et al20 repeated Fugl-Meyer Sensorimotor Assessment reliability studies on patients during rehabilitation 6 days to 6 months following a stroke and found an ICC value of .97 for the upper-extremity component of the test. Sanford et al did not report the mean time from onset of stroke. Of the tests included in this review, the upper-extremity portion of the Fugl-Meyer Sensorimotor Assessment has been compared against the Chedoke-McMaster Stroke Assessment (r=.95)12 and the Motor Assessment Scale (r=.88).19
Level III: Evidence Established by CVV/CCV Plus PV
Modified Motor Assessment Chart.
The Modified Motor Assessment Chart utilizes subscales of upper- and lower-extremity function and standing leg movements; it is a modified version of the Fugl-Meyer Sensorimotor Assessment. This test requires the patient to perform one-handed activities; both arms are evaluated separately.15 Lindmark and Hamrin15 examined the validity of data obtained with the Modified Motor Assessment Chart. Because they did not differentiate between the upper- and lower-extremity portions of the Fugl-Meyer Sensorimotor Assessment, their article did not demonstrate support for CVV/CCV as deemed adequate for this review. Hsieh et al,14 however, compared the Modified Motor Assessment Chart (upper-extremity portion) and the Action Research Arm Test and found a close association (Pearson r=.94). As defined by criteria in this article, PV has been adequately described for the Modified Motor Assessment Chart.15 Using regression analysis, Lindmark and Hamrin15 reported that the Modified Motor Assessment Chart provides prognostic information relative to survival, discharge destination, and functional score on discharge.
Level III: Evidence Established by IRR
Motor Club Assessment.
The Motor Club Assessment was first described as an upper-extremity motor function test for people following a stroke by Ashburn.9 No statistical analyses were performed. To describe IRR, Ashburn reported the number of disagreements among 15 paired observations and noted "minimal error." Although Ashburn did not report kappa values or specific percentages, this was the only article found that described how reliable data obtained with the Motor Club Assessment might be when the test is used between raters.
Sunderland et al21 used the Motor Club Assessment to examine grip force as a prognostic tool. They reported using the Motor Club Assessment to establish grip force as a predictive measure of outcome at 6 months following a stroke, but a secondary finding also was reported when they used the Motor Club Assessment to predict performance on the Frenchay Arm Test. The Motor Club Assessment classified 3% of the "cases" incorrectly, while the Motricity Index had identified all "cases" correctly.
Level III: Evidence Established by CVV/CCV
Rivermead Motor Assessment.
This instrument was demonstrated to have CVV with the Motricity Index (r=.88).10 Collin and Wade's objective was to establish CVV among 3 tests and to determine the validity and reliability of data obtained with the Motricity Index and Trunk Control Test.10 Only upper-extremity subscales were used for the correlations. Collin and Wade's article was the first to report use of the Rivermead Motor Assessment (upper-extremity subscale) with this diagnosis.
| Discussion |
|---|
Some potentially eligible articles may have been overlooked in our review. The inclusion criteria for this review were intended to identify articles that had the purpose of examining evidence to support the psychometric properties of tests, either as the primary objective or while examining upper-extremity motor function following a stroke. We therefore reviewed articles that alluded in their title, abstract, or key words to examining psychometric properties. Some articles may have been excluded inappropriately because reporting on psychometric properties was not part of the primary investigation or the title did not clearly refer to psychometric testing. This problem is inherent in any computerized literature review that is dependent on key words. We believe the chances of overlooking articles were reduced by using follow-up searches where the key words included the names of the tests being investigated.
Articles were retrieved even if the title did not specify an investigation of test psychometric properties (Appendix 2). In the article by Parker et al,18 for example, the title "Loss of arm function after stroke: measurement, frequency, and recovery" does not imply an examination of psychometric properties. However, there are data in this article about the psychometric properties of the Motricity Index. This article was retrieved because of our broad article inclusion criteria. Had this article been omitted upon initial title review, it would have been identified during subsequent searches based on test name.
Some tests may have more than one accepted designation (eg, the Jebsen Hand Test is the same as the Jebsen-Taylor Hand Test), and some tests may not have been coded as key words in the PubMed or CINAHL databases. Multiple combinations of test names were used to search for supporting literature as much as possible, but some variants may have been missed. Redundant literature searches and broad inclusion criteria for articles helped to reduce the possibility of missed articles (Appendix 2).
The quality of observational data depends on partitioning of data variance into a true component (the theoretical exact value) and error components, which may include subject variations, rater variations, and any number of additional environmental variations.23 Because of the latter variations, strong psychometric properties of measurements for one population should not be considered supportive for populations with different diagnoses. Portney and Watkins8 provided a discussion of how this concept ("generalizability theory," originally proposed by Cronbach et al in 197224) may be applied to clinical practice. This criterion limited much of the evidence. For example, the Box and Block Test and the Jebsen Hand Test were reported to be well-established tests,25,26 but no evidence, based on our criteria, was found to support their use with patients following a stroke.
The Box and Block Test met our inclusion criteria and had been used to test upper-extremity motor function in a study of patients following a stroke.25 The authors, however, referenced a previously published article (ie, Desrosier et al26) for reliability and validity. Desrosier et al26 had performed reliability tests with individuals having a variety of diagnoses. Therefore, based on our criteria, this reference did not support the use of the Box and Block Test for patients following a stroke.
Similarly, the Jebsen Hand Test was studied and described in an article about subjects with strokes. Results of an investigation of TRT of data obtained with the Jebsen Hand Test were published.27 In that study, however, only 5 subjects diagnosed with strokes out of a total of 26 subjects with various diagnoses were used. Therefore, we believe this evidence could not be considered definitive for testing psychometric properties in groups of people with the diagnosis of stroke.
A method to examine the quality of evidence has been proposed by Rudman and Hannah.4 According to Rudman and Hannah's definitions or criteria, all of the articles found would have demonstrated "initial support" or "limited support" for each of the psychometric properties. Given Rudman and Hannah's criteria, the rank of the Jebsen Hand Test or the Box and Block Test would be elevated from no support to limited support ("limited support" meaning "inadequacies in the research designs" of studies conducted to investigate reliability or validity).4 Intervention strategies such as the use of therapeutic exercise for people after stroke and clinical assessments that have resulted in extensive literature would be more suitable for examination according to Rudman and Hannah's criteria. Inclusion criteria for this review required that the studies (reporting psychometric properties) had used only subjects who had had a stroke. Consequently, evaluation of the studies in this review is based on quantitative criteria. Operational definitions of Rudman and Hannah's levels of support could be refined for future reviews, but further methodological discussion extends beyond the scope of this article.
The statistics selected for this review are commonly used to measure reliability and validity. They directly address measurement questions of clinical interest. There may be questions in the scientific community as to whether some psychometric properties might be more important than others and which statistical tests would be most appropriate. We have made no attempt to argue for a value system to differentiate the statistics used. We have attempted to describe the psychometric properties and supportive statistical tests reported in the literature with minimum bias. Clinicians and researchers are encouraged to become familiar with the meanings of the different psychometric properties and to decide which are more important for their applications and whether the statistics used in a given study were appropriate.
When the preliminary searches were conducted for this review, there were limited data to describe most psychometric properties of the tests. Therefore, we placed more emphasis on categorizing the available evidence than on rating its quality.
All correlation coefficients reported in this review exceeded critical values (ie, were significantly different from a correlation of zero).8 Colton28 recommended stratifying r values into categories of "poor," "fair," and "excellent." This categorization may allow for subjective and perhaps inaccurate classification. Whether a "significant" correlation coefficient is acceptable should rely on the judgment of both the clinician and the investigator. Error in measurement is unavoidable, given multiple sources of variability such as human factors and environment. The amount of error that is acceptable will depend on the purpose and specific clinical circumstances surrounding test use, and authors should report reliability relative to these issues.29 In addition, Portney and Watkins described the following: "[C]orrelation coefficients cannot be interpreted as proportions.... The difference in the degree of relationships between .50 and .60 is not necessarily the same as the difference between .80 and .90."8(p494) Even values less than .50 can represent strong relationships if the number of subjects is sufficient. No articles were found in this review that reported nonsignificant findings.
Some authors,30,31 however, do not accept linear correlations as appropriate tools for reporting reliability data. Intraclass correlation coefficients are perhaps more appropriate because they describe agreement of the scores and not just covariation (or association). Linear correlation coefficients often can overestimate the reliability of data obtained with a test because the relationship between true variance and observed variance may be overlooked. However, only 2 groups of authors12,14 chose to utilize ICC analysis.
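The distinction between association and agreement can be illustrated with a small numeric sketch. It assumes a two-way random-effects, absolute-agreement, single-rater ICC (often written ICC(2,1)); which ICC form the cited studies used is not stated here, and the scores are invented. One hypothetical rater scores every subject exactly 2 points higher than the other, so the scores covary perfectly but do not agree.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_absolute_agreement(ratings):
    """ICC(2,1) for `ratings`, a list of rows (one row of rater scores per subject)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares for subjects (rows), raters (columns), and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical scores: rater B is always 2 points above rater A.
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]

r = pearson_r(rater_a, rater_b)
icc = icc_absolute_agreement([list(pair) for pair in zip(rater_a, rater_b)])
```

Under these assumptions the linear correlation is perfect while the ICC is substantially lower, because the constant difference between raters counts against agreement but not against covariation.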
Sample size may become an issue when considering psychometric properties. In Appendix 3, the samples used for the reviewed studies are listed. Counts ranged from a low of n=7 in the study by Loewen and Anderson16 to a high of n=231 in the study by Lindmark and Hamrin.15 Summation of individual samples from many studies would be inappropriate for statistical interpretation. However, in general statistical terms, a study with a large sample size has greater power than a study with a smaller sample size; that is, it is less likely to miss a relationship that truly exists. Clinicians and researchers utilizing the tests described in this review might want to attribute greater weight to supportive studies with larger sample sizes.
| Conclusions and Implications |
|---|
In 1991, Physical Therapy published a document titled "Standards for Tests and Measurements in Physical Therapy Practice."32 This document provides guidance to clinicians and investigators to help ensure the development of useful and meaningful measurements and describes standards for "ensuring integrity in measurement standards." We agree with the point made in the document that, although tests may not necessarily meet all of the standards set forth in the document, test users incur the responsibility of knowing the limitations of measurements and making logical arguments to support their test selection.
| Appendix 1 |
|---|
| Appendix 2 |
|---|
| Appendix 3 |
|---|
| Footnotes |
|---|
* As in the article by Rudman and Hannah,4 definitions of rehabilitation domains (eg, impairments and functional limitations as described by the National Center for Medical Rehabilitation Research5) were used in the current investigation.
FileMaker Inc, Corporate Headquarters, 5201 Patrick Henry Dr, Santa Clara, CA 95054-1171.
Initial support, according to Rudman and Hannah,4 would indicate that some studies have had positive results supporting validity or reliability. Limited support would indicate inadequacies in the research design of studies conducted.
| References |
|---|