Research Reports
DP Gross, PT, BScPT, is a doctoral student, Faculty of Rehabilitation Medicine, University of Alberta, 3-48 Corbett Hall, Edmonton, Alberta, Canada T6G 2G4 (dgross{at}ualberta.ca).
MC Battié, PT, PhD, is Professor, Department of Physical Therapy, University of Alberta
Address all correspondence to Mr Gross
Submitted May 2, 2001;
Accepted October 24, 2001
| Abstract |
|---|
Key Words: Functional capacity evaluation Low back pain Reliability Occupational rehabilitation
| Introduction |
|---|
Various types of FCEs exist. Two common approaches have been described as psychophysical and kinesiophysical evaluations.5 Psychophysical FCEs place the worker in control, and performance is stopped when the worker believes maximal function has been reached.5 The kinesiophysical approach places the administering therapist in control, and tasks are stopped when biomechanical signs of maximal effort are observed, such as accessory muscle usage and counterbalancing (altered biomechanics judged as being unsafe).5 A set of standardized criteria for judging increased effort and maximal levels is outlined for the kinesiophysical method.5 Theoretically, this ensures the safety of the injured worker, as assessment is to be stopped prior to overexertion.5
If the FCE is to be considered a useful tool, reliability and validity must be demonstrated.6-9 As determinations require judgments regarding safety, some variance is expected with repeated measures within individual therapists and between therapists. In addition, variations in subject performance due to wellness on the day of the evaluation, motivation, pain levels, or interactions between the client and therapist conducting the evaluation may influence results. With these considerations in mind, interrater and test-retest reliability have been viewed as the most important forms of test reliability.6,10,11
Some work has been done to estimate the reliability of measurements obtained for various aspects of kinesiophysical testing.4,12-14 A limitation of previous studies was the utilization of videotaped subject performance, resulting in a loss of some clinical information such as cardiovascular responses to testing used in maximal effort determination, which is gained during real-life observation. Studies that were done using real-life observation did not overcome the potential bias resulting from one rater influencing the judgment of the other rater when stopping the test. Lastly, all previous studies used a categorical outcome variable, rather than the interval-level outcome of amount of weight handled, as is determined in routine FCE testing.
Our goal was to determine the interrater and test-retest reliability of lifting determinations of maximal safe manual handling levels during kinesiophysical FCE using the Isernhagen Work Systems'* protocol in patients with LBP who were medically stable and receiving workers' compensation.
| Method |
|---|
Subjects were recruited through consultation with treating rehabilitation teams to identify eligible clients nearing the end of their treatment program. All prospective subjects were scheduled for FCE testing at discharge whether or not they participated in the study. Twenty-eight subjects with LBP were enrolled in the study from April to July 2000. At an alpha level of .05, using chi-square tests for categorical variables and independent-sample t tests for continuous variables, no significant differences were observed between our subjects and the entire group of clients with low back injuries discharged from the center during the data collection period. The variables compared were age, sex, National Occupation Classification (NOC) code, job attachment status, duration of injury, and length of time off work, as determined from the center's clinical database for all subjects discharged (Tab. 1).
Five occupational therapists (3 male, 2 female) were enrolled to perform testing and act as raters. All raters had previously been trained by representatives of Isernhagen Work Systems, were conducting FCEs in clinical practice, and had at least 5 years of experience using kinesiophysical observation techniques. Raters reported an average length of time being trained in and performing kinesiophysical FCEs of 7.4 years (range=5-9 years). All raters were full-time employees and reported an average completion of 4.4 evaluations per week using kinesiophysical observation methods. Their average length of time spent in professional practice was 15.4 years.
Prior to the study, kinesiophysical principles and an operational definition of maximal effort were reviewed with the raters. Raters were asked to observe the following signs of increased effort in judging when subjects had reached maximal, safe levels:
Study Protocol
A repeated-measures design was used with the goal of independent, yet simultaneous observation of each subject by 2 raters. Observations occurred on 2 separate occasions separated by 2 to 4 treatment days, a time period during which no significant change was expected in subject performance while allowing some time to lessen recall of the previous performance. Between occasions, raters continued to perform regular work duties, including other FCEs. Time of day and place of testing were held constant. Testing took place within the subject's last week of a rehabilitation program.
The FCE tasks of floor-to-waist, waist-to-crown, and horizontal lifting and front, right, and left side carrying were completed. The specific protocol for each lift and carry was followed as outlined in the Isernhagen Work System's Functional Capacity Evaluation Manual,17 with sets of 5 repetitions being completed for each subtest at each successive weight level.
To obtain independent, yet simultaneous observation by the raters, 3 raters were selected randomly from the group of 5 raters for each enrolled subject. The first rater selected was referred to as the "primary rater." The primary rater's responsibility was to converse with the subject, guide the subject through testing, and upgrade weight in the lifting unit. Weight upgrades were possible in 1.1-, 2.2-, or 4.5-kg increments or any combination of these weights. The primary rater was the only individual with exact knowledge of the weight lifted or carried; the other raters were not able to see into the lifting unit and did not observe weight upgrades. The primary rater documented the amount of weight lifted or carried during each set, and other raters did not have access to this documentation. The primary rater also had the major responsibility for ensuring subject safety and was to stop testing if he or she judged safety to be obviously compromised.
The next 2 raters selected were referred to as "secondary raters." They observed performance and prompted the primary rater throughout testing, but they were instructed not to interact with the subjects. Secondary raters were instructed not to observe or talk to each other, but they were allowed to walk around the testing area for observation angle of choice.
Secondary raters were masked to each other's prompts and determinations in the following manner to avoid any potential bias. For each subject and subtest, the primary rater progressed testing from low to higher weight levels. Sets for each subtest were sequentially numbered on both the primary and secondary rater documentation forms. The primary rater documented the weight level, and secondary raters documented their observations for each set. After observing subject performance on an individual set, secondary raters documented their observations, then were allowed to prompt the primary rater nonverbally as to whether the weight in the lifting unit should be upgraded or testing stopped because maximal levels had been determined. They did this by pointing to one of 2 closely placed boxes with the words "Stop" and "Upgrade" on the bottom of their documentation forms. Documentation stations were placed far enough apart for secondary raters not to see their companion's prompt. Primary raters walked between documentation stations to receive feedback. When a particular set was judged as maximal, the secondary rater pointed to the box stating "Stop," documented the observations, and circled the corresponding set number. All further prompting by this secondary rater was made by indicating "Stop." Testing continued with the primary rater upgrading weight until both secondary raters indicated "Stop." At the end of testing, all raters sealed their documentation forms in envelopes and delivered them to a secure location.
Maximal weight levels (in kilograms), as judged by the secondary raters, were determined through comparison of the primary rater's documentation with the corresponding set circled by each secondary rater. The factor leading to test termination for each lifting subtest also was recorded by the secondary raters. Limiting factors were categorized as physical maximum, cardiovascular limitation, nonfunctional time, or subject desire or pain.
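The scoring step described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the forms or software used in the study: the function names, the list-based representation of the documentation forms, and the example weights are all hypothetical.

```python
STOP, UPGRADE = "Stop", "Upgrade"


def first_stop(prompts):
    """Return the 1-based set number at which a rater first indicated Stop.

    This corresponds to the set number the secondary rater circled.
    Returns None if the rater never indicated Stop.
    """
    for set_number, prompt in enumerate(prompts, start=1):
        if prompt == STOP:
            return set_number
    return None


def score_subtest(primary_log_kg, prompts_a, prompts_b):
    """Look up each secondary rater's maximal weight for one subtest.

    primary_log_kg: weight (kg) the primary rater documented for each set.
    prompts_a, prompts_b: per-set prompts from the two secondary raters.
    Returns (max_kg_rater_a, max_kg_rater_b).
    """
    circled_a = first_stop(prompts_a)
    circled_b = first_stop(prompts_b)
    if circled_a is None or circled_b is None:
        # Testing continued until both secondary raters indicated Stop.
        raise ValueError("both secondary raters must have indicated Stop")
    if len(primary_log_kg) < max(circled_a, circled_b):
        raise ValueError("primary log is shorter than the circled set number")
    return primary_log_kg[circled_a - 1], primary_log_kg[circled_b - 1]


# Hypothetical example: weight upgraded by 2.2, 2.2, then 4.5 kg.
log = [11.0, 13.2, 15.4, 19.9]
rater_a = [UPGRADE, UPGRADE, STOP, STOP]
rater_b = [UPGRADE, UPGRADE, UPGRADE, STOP]
print(score_subtest(log, rater_a, rater_b))
```

Because only the primary rater's form carries the weights, the secondary raters' determinations stay masked to the load until the forms are compared after testing.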
Data Analysis
Intraclass correlation coefficients (ICCs [Shrout and Fleiss model 1,1]18) with 95% confidence intervals (CIs) were calculated for interrater and test-retest reliability of secondary raters' judgments of maximal weight levels measured in kilograms. Two comparisons per subject were available for both forms of reliability. Because ICC values diminish when variance in a sample decreases, which would be the case if duplicate or repeat measures for both raters were used in analysis of test-retest data, calculations were performed separately for the 2 secondary raters' determinations.18 In addition, interrater ICCs were calculated using the first session, with values from the second session used to judge stability of results.
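For readers unfamiliar with the model 1,1 ICC, the following is a minimal sketch of the point estimate computed from one-way analysis-of-variance mean squares. It is not the SPSS procedure used in the study, it does not compute the 95% CI, and the function name and example data are hypothetical.

```python
from statistics import mean


def icc_1_1(ratings):
    """Point estimate of ICC(1,1) (one-way random effects, single rating).

    ratings: list of per-subject lists, each holding k ratings
    (for example, the 2 secondary raters' maximal weights in kg).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 subjects with the same number (>= 2) of ratings")

    grand_mean = mean([x for row in ratings for x in row])
    subject_means = [mean(row) for row in ratings]

    # Between-subjects mean square.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subjects mean square.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical maximal weights (kg) judged by 2 raters for 4 subjects.
example = [[20.0, 22.2], [31.1, 31.1], [15.4, 13.2], [40.0, 42.2]]
print(round(icc_1_1(example), 3))
```

The estimate shrinks toward 0 as between-subject variance falls relative to within-subject variance, which is the sensitivity to sample variance noted in the paragraph above.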
Paired t tests with alpha level set at .05 were used to compare mean differences between occasions on each subtest to determine whether a testing effect existed between days of testing. Kappa values and percentages of agreement were calculated for agreement on factors limiting subject performance. The statistical software package SPSS was used for ICC, t-test, and Kappa calculations.
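As an illustration of the agreement statistics named above, the sketch below computes percentage of agreement and an unweighted Cohen's Kappa for 2 raters' categorical judgments. It assumes the unweighted 2-rater form of Kappa, which the text does not state explicitly; the function names and example labels are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of paired judgments on which the 2 raters agree."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * agreements / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa: agreement corrected for chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b) / 100.0
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical limiting-factor judgments for 4 comparisons.
a = ["physical maximum", "physical maximum", "pain", "pain"]
b = ["physical maximum", "physical maximum", "pain", "physical maximum"]
print(percent_agreement(a, b), cohens_kappa(a, b))
```

Kappa is lower than raw agreement whenever one category dominates, because much of the observed agreement is then expected by chance.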
The ICC is currently the statistic of choice for reliability analyses of interval data; however, classical test theory may not provide a complete understanding of this issue. Generalizability theory may provide a more effective conceptual approach, and comprehensive reviews have been published.19-21 Generalizability coefficients and estimated variance components for the factors controlled for were calculated. Generalizability coefficients represent the relative generalizability of a measurement to the total range of possible scores for that measurement, with results ranging from 0 to 1, similar to the ICC. Estimated variance components show the contribution made to total variance by each controlled factor. These statistics were calculated using formulas discussed elsewhere.20
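As a point of orientation only, the standard relative generalizability coefficient for a fully crossed subjects-by-raters-by-occasions design has the form shown below. The exact formulas used in this study are those of the cited source and are not reproduced here; the fully crossed design and the symbols ($p$ for subjects, $r$ for raters, $o$ for occasions, $n_r$ and $n_o$ for the numbers of raters and occasions) are assumptions made for illustration.

```latex
E\rho^2 =
\frac{\sigma^2_{p}}
     {\sigma^2_{p}
      + \dfrac{\sigma^2_{pr}}{n_r}
      + \dfrac{\sigma^2_{po}}{n_o}
      + \dfrac{\sigma^2_{pro,e}}{n_r\, n_o}}
```

Here $\sigma^2_{p}$ is the variance attributable to subjects, $\sigma^2_{po}$ is the subject-occasion interaction, $\sigma^2_{pr}$ is the subject-rater interaction, and $\sigma^2_{pro,e}$ is the residual. The coefficient approaches 1 when subject variance dominates the interaction and residual terms.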
| Results |
|---|
Mean scores of weight lifted on the 2 days were compared for all subjects who completed testing. Consistently, subjects lifted more on day 2, but these differences were statistically significant only for low-level lifting (21.8 kg for day 1, 25.7 kg for day 2; P=.01) and front carrying (32.2 kg for day 1, 34.7 kg for day 2; P=.02).
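The between-day comparison rests on a paired t statistic, which can be sketched as follows. The example weights are hypothetical and are not the study data; the sketch returns the t statistic and degrees of freedom only, and obtaining the P value would additionally require the t distribution (for example, from a statistics package).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(day1, day2):
    """Paired t statistic and degrees of freedom for day 2 minus day 1."""
    if len(day1) != len(day2) or len(day1) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    diffs = [b - a for a, b in zip(day1, day2)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical maximal weights (kg) for 4 subjects on the 2 days.
t_stat, df = paired_t([20.0, 22.2, 25.0, 18.0], [22.2, 25.0, 26.1, 21.0])
print(round(t_stat, 2), df)
```

A positive t indicates that subjects handled more weight on day 2 on average, the direction reported above.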
Findings from analysis of agreement for factors limiting test performance are summarized in Table 4. Kappa values ranged from .47 to 1.00, and overall percentage of agreement was 86.4% (235/272). Both raters judged a particular subject's performance as physical maximum on 68.8% of the comparisons. Of the 37 incidents where the raters disagreed, the same weight level was judged as maximum in 30 cases, with 26 of these cases being judged as physical maximum versus subject desire.
| Discussion |
|---|
Three subjects returned for day 2 of testing but stated they did not feel capable of participating in manual handling activities due to reported pain exacerbation. The ease with which subjects could withdraw or terminate testing may have led to more subjects declining testing during the second session than would have occurred under normal FCE test conditions. However, the subjects' beliefs and perceptions of pain, disability, and physical capacity that led them to decline testing may represent valid influences on FCE results. The first test session was not cited as the reason for increased pain by any of the subjects who declined testing.
The testing interval was selected to minimize functional change. Return to work was imminent in this group of subjects deemed medically stable, yet the performance of some subjects varied between occasions. This was especially true of those subjects who were unwilling to participate on the second occasion. Variations in subjects' performance between days may have been due to the reasons discussed previously such as wellness, motivation, or pain level. Another potential contribution to the observed variability is a testing effect in subjects participating in both days. Comparison of means between days, with significant increases on the second occasion for low-level lifting and front carry, indicates that a testing effect likely did exist. It was not great enough, however, to diminish test-retest ICC values below acceptable levels.
Estimated variance components for subjects participating on both days clarify what factors were responsible for the variance observed. Consistently, subjects were responsible for the greatest variance, a desirable finding supporting the acceptable ICCs. The subject-occasion interaction, defined by Shavelson and Webb20 as variance arising due to inconsistencies between occasions in particular subjects' performance, was consistently the second leading source of variance. The minimal residual variance in maximal ratings was made of various combinations of other factors, depending on the subtest, but these factors contributed little to the total variance.
Because of the variability observed between days and the fact that 3 subjects felt they could not participate on the second occasion, testing of manual handling over a 2-day period is recommended. The Isernhagen Work Systems FCE protocol acknowledges that client performance may vary between days and likewise recommends assessing manual handling ability over 2 days.
Raters agreed substantially or perfectly on the performance-limiting factor for test termination on most subtests according to the Landis and Koch categorization for Kappa values.22 Agreement on front and left side carrying was moderate.
No previous study has looked specifically at the reliability of determinations of maximal levels using actual weight lifted, but other aspects of reliability of the kinesiophysical approach have been examined. When Isernhagen et al4 studied interrater reliability of gross judgments of lifting effort, raters were able to accurately discriminate between "light" and "heavy" lifting efforts (Kappa=.81). Their study used videotapes of the subjects' performance; therefore, some clinical detail would have been lost. Smith14 studied the ability of trained and experienced therapists to reliably judge whether patients with low back injuries can lift from the floor to waist with "safe body mechanics," as operationally defined by the author. Interrater Kappa values ranged from .62 to .64. In Smith's study, as in the study by Isernhagen et al4 and a study by Gardener and McKenna,12 videotape was used for viewing subject performance. Our study's design allowed clinically realistic observation and gave access to all information gained during a typical FCE, while allowing simultaneous observation of subjects. The slightly higher reliability we found may be due to added information available to our raters such as subject cardiovascular responses, symptoms, and three-dimensional viewing.
In a study by Lechner et al,13 interrater reliability of measurements of maximal effort during another FCE protocol was examined. In this assessment, maximal effort was determined through observation of body mechanics and lifting technique. Interrater Kappa values found for manual handling determinations within Dictionary of Occupational Titles categories ranged from .62 to .88. These findings of substantial to almost perfect reliability are similar to, but slightly lower than, our findings. As the FCE under study was newly developed, raters had minimal experience, with total training time being approximately 20 to 24 hours. Conversely, raters in our study had at least 5 years of experience. The study protocol used by Lechner et al did not achieve independent observation between raters, resulting in a potential bias of one rater by the primary rater responsible for test termination.
One limitation of the present study affecting evaluation of test-retest reliability, in particular, was subject mortality. As noted previously, 3 subjects felt incapable of participating on day 2 of testing. In addition, only partial data sets were obtained from 6 subjects due to rater reporting error, subject lack of desire to perform all subtests, primary rater overruling a decision to upgrade, or lack of time to complete testing. A diminished sample size resulted and may have altered reliability calculations had all subjects been tested on all subtests. Yet, the consistency seen when alternate rater or occasion ICC values were calculated indicates the stability of the findings in the subjects tested. Although our design allowed us to overcome limitations of previous studies, the effect of multiple raters within the test setting as opposed to only one rater as in regular FCE practice is unknown. The effect on reliability when altering factors such as therapist discipline, level of therapist experience, and setting remains unknown.
| Conclusions |
|---|
| Footnotes |
|---|
This study was approved by the University of Alberta Health Research Ethics Board, Panel B, and supported by the Clinical Research Partnership Fund, jointly sponsored by the Alberta Physiotherapy Association and the University of Alberta's Department of Physical Therapy.
* Isernhagen Work Systems, 1015 E Superior St, Duluth, MN 55802.
SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.
| References |
|---|