PTJ

Rapid Responses to:

Research Reports:
Teresa Steffen and Megan Seney
Test-Retest Reliability and Minimal Detectable Change on Balance and Ambulation Tests, the 36-Item Short-Form Health Survey, and the Unified Parkinson Disease Rating Scale in People With Parkinsonism
PHYS THER 2008; 88: 733-746

Rapid Responses published:

Re: On “Test-retest reliability and minimal detectable change on balance…” Steffen T, Seney M. Phys Ther
Teresa M Steffen, Megan Seney   (3 June 2008)
On “Test-retest reliability and minimal detectable change on balance…” Steffen T, Seney M. Phys Ther
Paul W. Stratford   (3 June 2008)

Re: On “Test-retest reliability and minimal detectable change on balance…” Steffen T, Seney M. Phys Ther 3 June 2008
Teresa M Steffen,
Professor
Program in Physical Therapy, Concordia University Wisconsin, Mequon, Wis,
Megan Seney

csteffen1{at}wi.rr.com Teresa M Steffen, et al.

Using an intraclass correlation coefficient (ICC[2,1]) rather than ICC(3,1) changed the ICCs for 13 of the 24 tests by less than one hundredth of a point. ICC(2,1) increased the reliability coefficients for the Berg Balance Scale and the Sharpened Romberg Test with eyes open, reducing the minimal detectable change (MDC) scores by 1 point each. ICC(2,1) decreased the remaining ICCs by one hundredth of a point: the MDC scores of 6 tests increased by 1 point, 2 showed no change, and the MDC for the Six-Minute Walk Test (6MWT) increased to 86 meters. Dr Stratford was sent the gait speed data so that he could apply his suggested ICC(2,2) formula to the tests that incorporated averaged scores. This ICC formula is not available in the SPSS software we used. The analysis did not change the ICC values or the MDCs for the gait speed tests. Our article states that gait speed is the strongest gait outcome variable in the population with parkinsonism, and Stratford's analysis supports this.

We understand Stratford's suggestion that test-retest reliability should always use the ICC(2,k) formula. However, the article by Shrout and Fleiss¹ did not suggest an ICC formula for test-retest reliability, and changing the ICC formula had little effect on our study. Because the same rater performed the same test each session, the formula for intrarater reliability, ICC(3,k), was used. We appreciate Stratford's correction, which arose from our report of the 6MWT being the only test to demonstrate a small learning effect. The incorrect use of the ICC formula can affect test-retest reliability when a systematic error occurs. Table 1 reports ICC(3,k), ICC(2,k), and minimal detectable change values using a 95% confidence interval (MDC95) for all the tests. Eleven MDC95 values had no change, 6 decreased, and 7 increased utilizing ICC(2,k) rather than ICC(3,k).


Table 1.
Intraclass Correlation Coefficients (ICC) for Test-Retest Reliability and Minimal Detectable Change Scores Utilizing a 95% Confidence Interval (MDC95) for Functional Tests, a Quality of Life Measure, and Disease Severity Rating Scale in People With Parkinsonism[a]

Test Performed | ICC(3,k) | MDC95 from ICC(3,k) | ICC(2,k) | MDC95 from ICC(2,k)
Balance tests
Berg Balance Scale[b] (0–56 points) | .94 | 5 | .95 | 4
Activities-specific Balance Confidence Scale[b] (%) | .94 | 13 | .94 | 13
Functional Reach Test[c] (cm): Forward | .73 | 9 | .72 | 9
Functional Reach Test[c] (cm): Backward | .67 | 7 | .67 | 7
Romberg Test[b] (s): Eyes open | .86 | 10 | .86 | 10
Romberg Test[b] (s): Eyes closed | .84 | 19 | .85 | 19
Sharpened Romberg Test[b] (s): Eyes open | .70 | 39 | .71 | 38
Sharpened Romberg Test[b] (s): Eyes closed | .91 | 19 | .90 | 19
Mobility tests
Six-Minute Walk Test[b] (m) | .96 | 82 | .95 | 86
Timed "Up & Go" Test[c] (s) | .85 | 11 | .85 | 11
Gait speed[c] (m/s): Comfortable | .96 | .18 | .96 | .18
Gait speed[c] (m/s): Fast | .97 | .25 | .97 | .25
SF-36[b] (0–100 points)
Physical Functioning | .80 | 28 | .80 | 29
Role–Physical | .85 | 45 | .85 | 44
Bodily Pain | .89 | 25 | .89 | 24
General Health | .85 | 28 | .84 | 29
Vitality | .88 | 19 | .87 | 20
Social Functioning | .71 | 29 | .70 | 30
Role–Emotional | .84 | 45 | .83 | 46
Mental Health | .83 | 19 | .83 | 18
UPDRS[b] (points)
Mentation, Behavior, and Mood (0–16) | .89 | 2 | .89 | 2
Activities of Daily Living (0–52) | .93 | 4 | .93 | 4
Motor Examination (0–108) | .89 | 11 | .89 | 10
Total Score (0–176) | .91 | 13 | .90 | 14

[a] SF-36=36-Item Short Form Health Survey, UPDRS=Unified Parkinson Disease Rating Scale.
[b] ICC: 3,1 and 2,1.
[c] ICC: 3,2 and 2,2.
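
How a change of one hundredth in an ICC can move an MDC95 by a point may be easier to see with a small sketch. It assumes the conventional formulas SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM; this response does not restate the formula or the standard deviations, so the SD used here is a hypothetical placeholder and the printed values are not those of any test in the table.

```python
import math


def sem(sd: float, icc: float) -> float:
    """Standard error of measurement (assumed form: SD * sqrt(1 - ICC))."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sd: float, icc: float) -> float:
    """Minimal detectable change at 95% confidence (assumed form: 1.96 * sqrt(2) * SEM)."""
    return 1.96 * math.sqrt(2.0) * sem(sd, icc)


# Hypothetical between-person standard deviation; NOT taken from the study.
HYPOTHETICAL_SD = 10.0

if __name__ == "__main__":
    for icc in (0.94, 0.95):
        print(f"ICC={icc:.2f}  SEM={sem(HYPOTHETICAL_SD, icc):.2f}  "
              f"MDC95={mdc95(HYPOTHETICAL_SD, icc):.2f}")
```

Under these assumptions a higher ICC always gives a smaller MDC95 for the same SD, which is the direction of the changes reported for the Berg Balance Scale and the Sharpened Romberg Test with eyes open.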


Teresa M Steffen and Megan Seney

References

1 Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428.

On “Test-retest reliability and minimal detectable change on balance…” Steffen T, Seney M. Phys Ther 3 June 2008
Paul W. Stratford,
Professor
McMaster University

stratfor{at}mcmaster.ca Paul W. Stratford

To the editor:

Translating reliability coefficients into clinically meaningful representations of measurement error is a necessary and important step when the goal is to link clinical research to clinical practice. The study by Steffen and Seney¹ investigates the reliability of several balance and ambulation tests and converts the obtained coefficients into minimal detectable change (MDC) estimates. The authors apply Shrout and Fleiss² type 3,k intraclass correlation coefficients (ICC) to quantify relative reliability and, from these estimates, they calculate the standard error of measurement (SEM) to quantify measurement error in the same units as the original measurement. For some of the balance and ambulation tests, 2 trials were performed on each of 2 occasions (eg, Timed "Up & Go" Test [TUG]); for other tests (eg, Six-Minute Walk Test [6MWT]), a single measurement was performed on each of 2 occasions. In the former case, the authors reported a type 3,2 ICC; in the latter case, they presented a type 3,1 ICC.

The authors’ rationale for applying the type 3,k ICC was “The ICC (3,k) was used instead of the Pearson correlation coefficient (r) for test-retest reliability because it assesses rating reliability by comparing the variability of different ratings of the same subject with the total variation across all ratings and all subjects.” In fact, the type 3,1 ICC provides an estimate of reliability similar to the Pearson r because neither coefficient accounts for a systematic difference in scores between the replicate measures (eg, either trials or occasions in Steffen and Seney’s study). Presumably, in a test-retest reliability study one is interested in both systematic and random errors, and, if this is true, the type 2,k ICC is the better choice because it includes both sources of variance in the reliability coefficient calculation. When the systematic error is zero, the type 2,k and 3,k ICCs provide identical estimates of reliability. However, when systematic error is present, as in the case of Steffen and Seney’s 6MWT data, the type 2,k ICC will be less than the type 3,k ICC.
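
For reference, the two single-measure coefficients contrasted above can be written in terms of the mean squares from a persons × measurements ANOVA. The notation is an assumption made here, following the usual presentation of the Shrout and Fleiss formulas: BMS is the between-persons mean square, JMS the between-measurements mean square (trials or occasions), EMS the residual mean square, n the number of persons, and k the number of measurements per person.

```latex
\[
\mathrm{ICC}(2,1) = \frac{\mathrm{BMS} - \mathrm{EMS}}
                         {\mathrm{BMS} + (k-1)\,\mathrm{EMS} + k\,(\mathrm{JMS} - \mathrm{EMS})/n},
\qquad
\mathrm{ICC}(3,1) = \frac{\mathrm{BMS} - \mathrm{EMS}}
                         {\mathrm{BMS} + (k-1)\,\mathrm{EMS}}
\]
```

The only difference is the term k(JMS − EMS)/n in the denominator of the type 2,1 coefficient. It vanishes when the measurement means do not differ beyond residual error, and it lowers the type 2,1 coefficient when a systematic difference is present.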

My second reflection addresses the use of the Shrout and Fleiss classification system in situations where two or more facets exist, such as for the TUG data. Here, the facets are trials and occasions. A dilemma occurs when attempting to interpret the meaning of the type 3,2 ICC reported by Steffen and Seney. It is not clear if the second digit (2) refers to 2 trials, 2 occasions, or 2 trials performed on each of 2 occasions (ie, a total of 4 measurements). I propose that a generalizability³ approach to the analysis has the potential to provide a clearer picture of the sources of variance, their magnitude, and the relative merits of averaging over either trials or occasions, or both.

To illustrate the points raised above, I have generated synthetic data for the TUG. Paralleling the design of Steffen and Seney, the synthetic data represent 2 TUG trials performed on each of 2 occasions for 10 persons. The data presented in Table 1 were contrived to illustrate a systematic difference between occasions, but no systematic difference between trials.

Table 1.
Synthetic Timed "Up & Go" (TUG) Data
Person | Occasion 1, Trial 1 | Occasion 1, Trial 2 | Occasion 2, Trial 1 | Occasion 2, Trial 2
Person 1 | 26.7 | 25.2 | 27.6 | 25.8
Person 2 | 4.6 | 6.9 | 7.6 | 7.1
Person 3 | 8.7 | 6.1 | 12.5 | 15.9
Person 4 | 18.1 | 19.1 | 26.1 | 28.5
Person 5 | 11.1 | 8.0 | 16.6 | 14.7
Person 6 | 20.7 | 24.0 | 20.4 | 22.6
Person 7 | 16.4 | 16.8 | 15.4 | 18.9
Person 8 | 4.3 | 6.4 | 16.0 | 14.2
Person 9 | 13.8 | 12.6 | 16.0 | 17.8
Person 10 | 25.7 | 24.8 | 34.5 | 34.6
Mean | 15.0 | 15.0 | 19.3 | 20.0

Table 2 reports the mean scores for trials and occasions. Of interest is that the trial means averaged over occasions are almost identical; however, the occasion means differ. Stated another way, a systematic difference exists between occasions, but not between trials averaged over occasions.

Table 2.
Trial and Occasion Means
Order | 1 | 2
Trial | 17.1 | 17.4
Occasion | 15.0 | 19.6

Table 3 displays Shrout and Fleiss type 2,1 and type 3,1 ICCs obtained by performing randomized block analysis of variance (ANOVA). Negative variance estimates were set to zero for all analyses. Pearson r values also are reported in this table. That the inter-trial type 2,1 and 3,1 ICCs are identical to 2 decimal places reflects the similarity of trial means shown in Table 2. By contrast, the inter-occasion means shown in Table 2 differed and this systematic difference is not reflected in the type 3,1 ICC or in the Pearson r. Accordingly, the type 3,1 ICC is greater than the type 2,1 ICC because the variance due to occasion is greater than zero.

Table 3.
Type 2,1 and 3,1 Inter-trial and Inter-occasion Intraclass Correlation Coefficients (ICC)
Inter-trial Reliability | Occasion 1 | Occasion 2
Type 2,1 ICC | .96 | .96
Type 3,1 ICC | .96 | .96
Pearson r | .96 | .96
Inter-occasion Reliability | Trial 1 | Trial 2
Type 2,1 ICC | .76 | .72
Type 3,1 ICC | .86 | .85
Pearson r | .86 | .85
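
The inter-occasion comparison can be retraced with a short sketch that applies the standard Shrout and Fleiss mean-square formulas to the Trial 1 scores from the synthetic TUG table. The function names are illustrative rather than the author's, and negative variance estimates are not clamped here because none arise for these two columns.

```python
import math
from statistics import mean

# Trial 1 scores from the synthetic TUG data, one list per occasion.
occasion1 = [26.7, 4.6, 8.7, 18.1, 11.1, 20.7, 16.4, 4.3, 13.8, 25.7]
occasion2 = [27.6, 7.6, 12.5, 26.1, 16.6, 20.4, 15.4, 16.0, 16.0, 34.5]


def mean_squares(columns):
    """Return (BMS, JMS, EMS) for a persons x measurements layout."""
    k, n = len(columns), len(columns[0])
    grand = mean([v for col in columns for v in col])
    person_means = [mean([col[i] for col in columns]) for i in range(n)]
    column_means = [mean(col) for col in columns]
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_columns = n * sum((m - grand) ** 2 for m in column_means)
    ss_total = sum((v - grand) ** 2 for col in columns for v in col)
    ss_error = ss_total - ss_persons - ss_columns
    return ss_persons / (n - 1), ss_columns / (k - 1), ss_error / ((n - 1) * (k - 1))


def icc_2_1(columns):
    """Type 2,1 ICC: systematic column differences count as error."""
    k, n = len(columns), len(columns[0])
    bms, jms, ems = mean_squares(columns)
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def icc_3_1(columns):
    """Type 3,1 ICC: systematic column differences are ignored."""
    k = len(columns)
    bms, _, ems = mean_squares(columns)
    return (bms - ems) / (bms + (k - 1) * ems)


def pearson_r(x, y):
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    cols = [occasion1, occasion2]
    print(f"ICC(2,1)={icc_2_1(cols):.3f}  ICC(3,1)={icc_3_1(cols):.3f}  "
          f"r={pearson_r(occasion1, occasion2):.3f}")
```

For these two columns the results should land close to the Trial 1 inter-occasion entries in Table 3, with the type 3,1 ICC and Pearson r nearly equal and the type 2,1 ICC lower because of the shift in occasion means.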

The following section illustrates a generalizability analysis that includes both trials and occasions in a single analysis. I applied a 3-way random effects ANOVA. The rationale for applying a random effects model was that I wished to generalize beyond the persons, trials, and occasions composing the study sample. The ANOVA and variance components were calculated using MINITAB statistical software (Minitab Inc, Quality Plaza, 1829 Pine Hall Rd, State College, PA 16801-3008) and the results appear in Table 4. Once again, negative variance estimates were set to zero.

Table 4.
Analysis of Variance and Variance Components
Source | Sum of Squares | Degrees of Freedom | Mean Square | Variance Component (σ²)
Person (P) | 2114.88 | 9 | 234.99 | 54.66
Trials (T) | 1.30 | 1 | 1.30 | 0
Occasion (O) | 215.30 | 1 | 215.30 | 10.00
P×T (pt) | 23.17 | 9 | 2.58 | 0.21
P×O (po) | 143.23 | 9 | 15.92 | 6.88
T×O (to) | 1.44 | 1 | 1.44 | 0
Error (e) | 19.40 | 9 | 2.16 | 2.16
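
The variance components in Table 4 can be recovered from the mean squares with a few lines of arithmetic. The sketch below assumes the usual expected-mean-square equations for a fully crossed persons × trials × occasions random-effects design with one observation per cell; the letter does not show MINITAB's procedure, so this is a reconstruction and not the author's code. Negative estimates are set to zero, as described in the text.

```python
# Design sizes from the synthetic data: 10 persons, 2 trials, 2 occasions.
N_PERSONS, N_TRIALS, N_OCCASIONS = 10, 2, 2

# Mean squares copied from Table 4.
MEAN_SQUARES = {
    "p": 234.99, "t": 1.30, "o": 215.30,
    "pt": 2.58, "po": 15.92, "to": 1.44, "e": 2.16,
}


def variance_components(ms, n_p, n_t, n_o):
    """Solve the expected-mean-square equations; clamp negatives to zero."""
    def clamp(value):
        return max(value, 0.0)

    return {
        "p": clamp((ms["p"] - ms["pt"] - ms["po"] + ms["e"]) / (n_t * n_o)),
        "t": clamp((ms["t"] - ms["pt"] - ms["to"] + ms["e"]) / (n_p * n_o)),
        "o": clamp((ms["o"] - ms["po"] - ms["to"] + ms["e"]) / (n_p * n_t)),
        "pt": clamp((ms["pt"] - ms["e"]) / n_o),
        "po": clamp((ms["po"] - ms["e"]) / n_t),
        "to": clamp((ms["to"] - ms["e"]) / n_p),
        "e": ms["e"],
    }


if __name__ == "__main__":
    vc = variance_components(MEAN_SQUARES, N_PERSONS, N_TRIALS, N_OCCASIONS)
    for source, value in vc.items():
        print(f"{source:>2}: {value:.2f}")
```

The trial and trial-by-occasion estimates come out negative before clamping, which is why both appear as 0 in the table.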

Inspection of the variance components reveals the following important findings: (1) there is a large variance among persons and this is desirable; (2) the variance between trials averaged over occasions is zero (this reflects the near identical means reported in Table 2); (3) there is a relatively large variance due to occasions (this reflects the difference in occasion means reported in Table 2); (4) the person by occasion (P×O) variance is substantially greater than the person by trial (P×T) variance (this suggests that averaging over occasion will have a greater effect than averaging over trials); and (5) the residual error is relatively small compared to the person variance.

The variance components reported in Table 4 can be applied to calculate generalizability coefficients that represent inter-trial and inter-occasion reliability. They can also be used to examine the distinct effect of averaging over trials, occasions, or both.

Equation 1:

The theoretical inter-trial reliability (generalizability) for a single trial is obtained by substituting the variance components into Equation 1 and by setting n_t (the number of trials) and n_o (the number of occasions) to 1. The obtained value is .97, and this is analogous to the Shrout and Fleiss type 2,1 inter-trial ICCs of .96 reported in Table 3. The inter-trial reliability for an average of 2 trials can be obtained by setting n_t to 2 and n_o to 1. This yields an inter-trial reliability of .98, which is analogous to a Shrout and Fleiss type 2,2 ICC.

When the goal is to draw inferences about the change status of a person, as is the case when MDC is applied, the inter-occasion reliability (generalizability) coefficient is of interest. It is calculated by applying Equation 2. The theoretical inter-occasion reliability for a single trial is obtained by substituting the variance components into Equation 2 and by setting n_t and n_o to 1. This gives an inter-occasion reliability of .74, which is the average of the 2 inter-occasion reliability estimates reported in Table 3. The inter-occasion reliability for a single trial performed on each of 2 occasions is obtained by setting n_t to 1 and n_o to 2. This yields an inter-occasion reliability of .85.

Equation 2:

Finally, one can examine the inter-occasion reliability for the average of 2 trials on each of 2 occasions. This is accomplished by setting n_t to 2 and n_o to 2 in Equation 2. A value of .86 is obtained and, to my knowledge, there is no equivalent Shrout and Fleiss coding scheme to represent this combination.
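
The inter-occasion values quoted above can be checked with a short sketch. Equation 2 is not reproduced here, so the code assumes the standard absolute-error (dependability) form from generalizability theory: person variance divided by person variance plus every other component, each scaled by the number of trials and occasions averaged over. Whether this matches the author's equation term by term is an assumption, although it does return the three values given in the text. The function name is illustrative.

```python
# Variance components copied from Table 4.
VARIANCE_COMPONENTS = {
    "p": 54.66, "t": 0.0, "o": 10.00,
    "pt": 0.21, "po": 6.88, "to": 0.0, "e": 2.16,
}


def inter_occasion_coefficient(vc, n_t, n_o):
    """Assumed form: var_p / (var_p + absolute error variance)."""
    absolute_error = (
        vc["t"] / n_t
        + vc["o"] / n_o
        + vc["pt"] / n_t
        + vc["po"] / n_o
        + vc["to"] / (n_t * n_o)
        + vc["e"] / (n_t * n_o)
    )
    return vc["p"] / (vc["p"] + absolute_error)


if __name__ == "__main__":
    for n_t, n_o in ((1, 1), (1, 2), (2, 2)):
        value = inter_occasion_coefficient(VARIANCE_COMPONENTS, n_t, n_o)
        print(f"n_t={n_t}, n_o={n_o}: {value:.2f}")
```

Going from one occasion to two raises the coefficient far more than going from one trial to two, in line with the observation that the person-by-occasion variance is much larger than the person-by-trial variance.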

References

1 Steffen T, Seney M. Test-retest reliability and minimal detectable change on balance and ambulation tests, the 36-Item Short-Form Health Survey, and the Unified Parkinson Disease Rating Scale in people with parkinsonism. Phys Ther. 2008 Mar 20 [Epub ahead of print].

2 Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.

3 Brennan RL. Elements of Generalizability Theory. Iowa City, Iowa: ACT Publications; 1983.


Copyright © 2008 by the American Physical Therapy Association.