Research Reports
MJ Faber, PhD, is Senior Researcher, Faculty of Human Movement Sciences, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
RJ Bosscher, PhD, is Associate Professor, Faculty of Human Movement Sciences, Vrije Universiteit Amsterdam
PCW van Wieringen, PhD, is Associate Professor, Faculty of Human Movement Sciences, Vrije Universiteit Amsterdam
(m.faber{at}kwazo.umcn.nl) Address all correspondence to Dr Faber at Centre for Quality of Care Research (WOK), Radboud University Nijmegen Medical Centre, PO Box 9101, 117 KWAZO, 6500 HB Nijmegen, the Netherlands
Submitted May 23, 2005;
Accepted January 31, 2006
Abstract
Key Words: Minimal detectable change, Older people, Performance-Oriented Mobility Assessment, Reliability, Validity
Introduction
Several adapted versions of the POMA have been published, but in this article, only the original 28-point version is considered, as it is the most commonly used version.3 The total POMA scale (POMA-T) consists of a balance scale (POMA-B) and a gait scale (POMA-G). The POMA-B carries the subject through positions and changes in position, reflecting stability tasks that are related to daily activities. In the POMA-G, several qualitative aspects of the locomotion pattern are examined. Each item is scored on a 2- or 3-point scale, resulting in a maximum score of 28 on the POMA-T and maximum scores of 16 and 12 on the POMA-B and the POMA-G, respectively. Originally, the POMA-T was developed to predict falls in an institutionalized population.3 Later, the scale also was used in various clinical contexts as a measure of mobility impairment4–6 and to study the effects of interventions.7–13
A prerequisite for using a clinical measurement tool is that its clinimetric properties, including validity, reliability, and responsiveness, are satisfactory. Validity indicates whether the instrument does indeed measure what it is intended to measure. Concurrent validity refers to the relationship between scores on the scale in question and scores on other scales intended to measure the same construct. Predictive validity refers to the degree to which the scores predict an external criterion. Reliability refers to the extent to which the measurements are objective (interrater reliability) and stable over time (test-retest reliability). Absolute reliability is the degree to which repeated measurements vary for subjects, with the changes being expressed in the units of measurement of the instrument. Relative reliability is the degree to which subjects maintain their position in a sample with repeated measurements, usually assessed with some type of correlation coefficient.14 Responsiveness is defined as the ability of an instrument to accurately detect change when it has occurred.15,16
Only limited clinimetric data on the original POMA have been published. With regard to test-retest reliability of POMA-T scores, intraclass correlation coefficients (ICCs) of .88 (for 40 residents of skilled nursing homes)17 and .97 (for 8 community-dwelling older people)4 have been reported. The concurrent validity of POMA-T scores was investigated in a cross-sectional study6 of 167 older people with mild balance impairments. Spearman correlations (R) of POMA-T scores with the results of several balance-related tests were calculated; these measures included maximum step length (R=.75), tandem stance time (R=.69), stance time on one foot (R=.74), tandem walk time (R=–.62), Timed "Up & Go" Test (TUG) (R=–.65), and 6-Minute Walk Test (R=.62). For a group of 59 community-dwelling older people, a Spearman correlation of .79 between POMA-T scores and gait impairment scores based on a neurologic examination was found.5
With regard to the POMA-B, a test-retest reliability value (ICC) of .93 was reported for a group of 14 residential care facility residents.7 Interrater reliability values in that study, expressed as Pearson correlation coefficients (r), varied from .76 to .90. For a group of 40 residents of skilled nursing homes, the ICC indicating interrater reliability was .75.17 In one study focusing on the interrater reliability of scores on the 8 individual items of the POMA-B, kappa coefficients ranging from .40 to 1.00 were reported across many raters with various levels of experience for 29 hospital inpatients and nursing home residents.18 The predictive validity of scores on the POMA-B for falls was investigated by Verghese et al19 with a group of 60 community-dwelling older people; with a cutoff value set at a score of 10 points, the sensitivity was 61.5% and the specificity was 69.5%.
With regard to the POMA-G, an interrater reliability value (ICC) of .83 was reported for a group of 40 residents of skilled nursing homes.17 The concurrent validity of scores on the POMA-G was investigated for 34 community-dwelling older people by correlating POMA-G scores with their ankle ranges of motion, resulting in a Spearman correlation of .63.4
Although the data presented above are encouraging, the number of clinimetric studies is still relatively small, in particular with regard to validity. Moreover, all reliability values reported so far refer to relative reliability; no findings have been published with regard to absolute reliability or to the related characteristic of responsiveness of the POMA scale. This dearth of published data raises questions about the use of the POMA for monitoring patients' clinical recovery processes or their responses to interventions,20 even though the POMA has been used extensively for these purposes.7–13
Given these considerations, we conducted a large-scale clinimetric study with older adults living in long-term care facilities in order to extend the small database on the relative interrater and test-retest reliability and on the validity (concurrent, discriminant, and predictive) of scores on the original POMA, and to add important information about its absolute reliability and its minimal detectable change, the measure of change we chose for studying responsiveness.15
Method
Of the 278 interested and eligible participants, 33 were excluded because they had Mini-Mental State Examination scores of less than 18. The concurrent and discriminant validity data for the present study were obtained from the remaining 245 participants in the RCT. The reliability and responsiveness data were collected from a sample of 30 participants living in the last 3 included residences. Participants in the RCT who were living in the latter residences could volunteer to participate in the reliability and responsiveness study. Predictive validity was determined for the participants who were randomly assigned to the control group in the RCT; these participants did not receive any intervention, and their fall history was recorded over a period of 10 months after randomization for the RCT (n=72). The characteristics of the participants belonging to the 3 study groups are summarized in Table 1. All participants gave written informed consent.
For the reliability and responsiveness part of the study, 2 graduate students who were studying human movement sciences and who received 8 hours of training in scoring the POMA scored the POMA for the 30 participants on 2 consecutive days while the physical therapists gave the test instructions to the participants. On both days, the students scored the POMA simultaneously but independently of each other. Given the short interval of about 24 hours between the 2 assessments, changes in performance attributable to changing health conditions or interventions seemed highly improbable. As indicated earlier, fall-related predictive validity was determined with the group of 72 control participants in the RCT, that is, participants who were not involved in an intervention program. Fall data were collected by means of fall diaries that were kept by the participants over a period of 10 months. A fall was defined as "an event that results in a person coming unintentionally to rest on the ground or other lower level."25
Measurement Instruments
The original POMA version used in this clinimetric evaluation (Appendix)1 consists of 8 balance items and 8 gait items to be scored on a 2- or 3-point scale. The balance items include sitting balance, rising from a chair and sitting down again, standing balance (eyes open and eyes closed), and turning balance, adding up to a maximum score of 16 points (POMA-B). The gait items include gait initiation, step length, step height, step length symmetry and continuity, path direction, and trunk sway, adding up to a maximum score of 12 points (POMA-G). The total score (POMA-T) ranges from 0 to 28 points. Lower scores indicate poorer performance.
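The subscale structure can be illustrated with a minimal Python sketch. It assumes only what the text states: items scored 0–1 (2-point scale) or 0–2 (3-point scale), and maximum scores of 16 for the POMA-B and 12 for the POMA-G, as given in the Introduction. The `Item` class, the function names, and the item labels and scores in the usage example are hypothetical; the actual item wording and per-item maxima are those of the Appendix.

```python
from dataclasses import dataclass

# Subscale maxima of the original 28-point POMA.
POMA_B_MAX = 16
POMA_G_MAX = 12


@dataclass(frozen=True)
class Item:
    """One POMA item (hypothetical container): observed score and item maximum."""
    name: str
    score: int
    max_score: int

    def __post_init__(self) -> None:
        # A 2-point scale is scored 0-1 and a 3-point scale 0-2.
        if self.max_score not in (1, 2):
            raise ValueError(f"{self.name}: max_score must be 1 or 2")
        if not 0 <= self.score <= self.max_score:
            raise ValueError(f"{self.name}: score {self.score} out of range")


def subscale_total(items: list[Item], expected_max: int) -> int:
    """Sum item scores after checking that the item maxima add up to the subscale maximum."""
    if sum(item.max_score for item in items) != expected_max:
        raise ValueError("item maxima do not add up to the subscale maximum")
    return sum(item.score for item in items)


def poma_scores(balance: list[Item], gait: list[Item]) -> dict[str, int]:
    """Return POMA-B, POMA-G, and POMA-T (0-28; lower = poorer performance)."""
    b = subscale_total(balance, POMA_B_MAX)
    g = subscale_total(gait, POMA_G_MAX)
    return {"POMA-B": b, "POMA-G": g, "POMA-T": b + g}


# Usage with made-up item labels, maxima, and scores:
balance = [Item(f"balance_{k}", s, 2) for k, s in enumerate([2, 2, 1, 2, 0, 1, 2, 2])]
gait = [Item(f"gait_{k}", 1, 2) for k in range(4)] + [Item(f"gait_{k}", 1, 1) for k in range(4, 8)]
print(poma_scores(balance, gait))  # {'POMA-B': 12, 'POMA-G': 8, 'POMA-T': 20}
```

Checking that the item maxima sum to the published subscale maxima is a cheap guard against a mistyped scoring form.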
The TUG is a test of basic functional mobility and is scored as the minimum time needed to stand up from a standard armchair, walk across a distance of 3 m, turn around, walk back to the chair, and sit down again. Interrater reliability (ICC=.99) and test-retest reliability (ICC=.99) of TUG scores have been determined for 22 patients attending a geriatric hospital.22 In that same study, the concurrent validity of TUG scores was determined for a larger group of 60 patients by correlating the time to complete the TUG with the Berg Balance Scale (Pearson r=–.81), a gait speed test (Pearson r=–.61), and the Barthel Index (Pearson r=–.51).22
The FICSIT-4 is used to test a person's ability to maintain balance in parallel stance, semitandem stance, tandem stance, and one-leg stance. Each position was tested for a maximum of 10 seconds, and participants proceeded to the next stance only when the previous stance could be maintained for at least 3 seconds. A summary score for the 4 positions was computed as suggested by Rossiter-Fornoff et al,23 resulting in a scale ranging from 0 to 5 points, with higher scores indicating better balance performance. The test-retest reliability of scores on the FICSIT-3 (similar to the FICSIT-4, but without the one-leg stance) has been determined over intervals between 2 measurements ranging from 3 to 12 months. The Pearson r ranged from .25 to .74, with longer intervals resulting in lower test-retest correlations.23
Fast gait speed was determined across a distance of 6 m, which was marked on the floor with tape. The participants, who were allowed to use their usual walking aid, were asked to walk as fast as possible without running. They were instructed to wait with both feet 1 m behind the starting line and to start walking after a verbal command. Timing began after the leading foot crossed the starting line and stopped after the leading foot crossed the finish line. The participants were instructed to continue walking for a short distance after the finish line was crossed to prevent them from decelerating before this line was reached. Speed was computed by dividing distance (in meters) by time (in seconds).24 The faster of 2 attempts was used for analysis. The test-retest reliability (ICC) for gait speed over an interval of about 2 weeks for a group of 105 frail older people (mean age=78.0 years) was .79.26 The test-retest reliability values (ICCs) determined on the same day for comfortable and maximum gait speeds for a group of 96 subjects between 60 and 89 years of age were .97 and .96, respectively.24
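The speed computation described above reduces to distance divided by time; the minimal sketch below (function names are hypothetical) also applies the rule of keeping the faster of the 2 timed attempts over the 6-m course.

```python
def gait_speed(distance_m: float, time_s: float) -> float:
    """Gait speed in m/s: distance (m) divided by time (s)."""
    if time_s <= 0:
        raise ValueError("time must be positive")
    return distance_m / time_s


def fast_gait_speed(attempt_times_s: list[float], distance_m: float = 6.0) -> float:
    """Highest speed over the timed attempts (2 attempts in this protocol)."""
    if not attempt_times_s:
        raise ValueError("at least one attempt is required")
    return max(gait_speed(distance_m, t) for t in attempt_times_s)


print(fast_gait_speed([6.0, 4.0]))  # 1.5 (6 m in 4 s is the faster attempt)
```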
Self-reported limitations in basic activities of daily living (BADL) and instrumental activities of daily living (IADL) were assessed by means of the Groningen Activity Restriction Scale (GARS).27 The GARS consists of 18 items, covering 11 BADL and 7 IADL tasks, all scored on a 5-point scale (possible scoring range of 18–90 points, with higher scores indicating more limitations). The GARS has been used to determine changes in disablement over time, to differentiate between degrees of disability, and to assess the need for professional care.27 The test-retest correlation, determined within a group of 77 subjects over a 4-month interval, was .74;28 the interrater reliability has not been determined. An indication of concurrent validity was found in a population-based study of 4,777 subjects in which the GARS scores correlated highly with the scores on the physical functioning subscale of the 20-Item Short-Form Health Survey (SF-20) (Pearson r=–.72).29 The latter subscale measures the extent to which health problems interfere with a variety of activities (eg, playing sports, carrying groceries, climbing stairs, and walking).30
Finally, the average number of minutes per day spent on habitual daily physical activities during the preceding 2 weeks was determined by administering the Longitudinal Aging Study Amsterdam Physical Activity Questionnaire (LAPAQ).31 The LAPAQ covers the frequency and duration of walking outside, bicycling, gardening, light and heavy household activities, and sport activities during the preceding 2 weeks. The total amounts of activity measured by the LAPAQ and by means of a 7-day diary were highly correlated (Spearman R=.68; n=356; men and women 65 years of age and older). The test-retest reliability was established with the same group, and the weighted kappa coefficient of the total number of activities measured by the LAPAQ over 1 year was .65.31
Data Analysis
Assumptions of normality were not met for the POMA-T, POMA-B, POMA-G, and TUG scores. Therefore, all calculations of relative reliability and of concurrent and discriminant validity were based on nonparametric statistics. The computation of absolute reliability and responsiveness is based on the differences between paired observations and assumes that these differences are normally distributed. This assumption held for the POMA-T but not for the POMA-B and the POMA-G. Consequently, absolute reliability and responsiveness findings are provided only for the POMA-T.
The relative interrater and test-retest reliability of the POMA scores were expressed in terms of Spearman rank correlations (R). These calculations were complemented by testing the differences between the paired scores given by the 2 raters and between the paired scores on the 2 test days by means of a Wilcoxon signed rank test.
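For readers who want to reproduce these 2 statistics outside SPSS, the following pure-Python sketch computes a Spearman R as the Pearson correlation of average ranks (tied scores share the mean of their ranks) and the Wilcoxon signed rank statistics for paired scores, with a large-sample z approximation corrected for tied absolute differences. It follows textbook conventions and is not the SPSS implementation used in the study: zero differences are simply discarded, and exact small-sample p-values are not computed. The function names and example scores are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Ranks 1..n; tied values receive the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman R: Pearson correlation computed on the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable has no variation")
    return sxy / math.sqrt(sxx * syy)


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, n_nonzero, z, two_sided_p) for paired scores."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation; ties among |d| reduce the variance of W_plus.
    tie_term = sum(t ** 3 - t for t in Counter(abs(d) for d in diffs).values())
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var) if var > 0 else 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    return w_plus, w_minus, n, z, p


rater_1 = [20, 22, 18, 25, 24]  # invented POMA-T scores
rater_2 = [19, 22, 20, 23, 24]
print(round(spearman_r(rater_1, rater_2), 3))       # 0.8
print(wilcoxon_signed_rank(rater_1, rater_2)[:3])   # (3.5, 2.5, 3)
```

A high Spearman R with a nonsignificant Wilcoxon test is what one would hope for: the raters rank subjects alike and show no systematic offset.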
Absolute interrater and test-retest reliability for the POMA-T were visualized by means of Bland-Altman plots with 95% limits of agreement (LOA).32 In those plots, the difference between the 2 observations of each pair is plotted as a function of the average of that pair. Assuming a normal distribution of the differences, 95% of the differences may be expected to fall within the interval d̄ ± (1.96 × SDdiff), where d̄ is the mean difference and SDdiff is the standard deviation of the differences. The mean difference d̄ captures the systematic difference between the paired observations, whereas SDdiff captures the agreement at the level of individual observations.
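The quantities behind a Bland-Altman plot can be computed as in this minimal sketch. The function names and example scores are illustrative; it uses the sample standard deviation of the differences and returns the plotted points as (pair average, pair difference) tuples, leaving the drawing itself to any plotting library.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return (mean difference, lower LOA, upper LOA) for paired observations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least 2 complete pairs")
    diffs = [a - b for a, b in zip(first, second)]
    d_bar = mean(diffs)     # systematic difference between the paired observations
    sd_diff = stdev(diffs)  # spread of the individual differences (n - 1 denominator)
    return d_bar, d_bar - 1.96 * sd_diff, d_bar + 1.96 * sd_diff


def bland_altman_points(first, second):
    """Points for the plot: x = average of each pair, y = difference within the pair."""
    return [((a + b) / 2, a - b) for a, b in zip(first, second)]


day_1 = [20, 22, 18, 25, 24]  # invented POMA-T scores
day_2 = [19, 22, 20, 23, 24]
print(limits_of_agreement(day_1, day_2))
print(bland_altman_points(day_1, day_2)[0])  # (19.5, 1)
```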
The responsiveness of the POMA-T was considered at both the individual level and the group level and is presented in the units of measurement of this scale. Responsiveness at the individual level is captured as the minimal detectable change at a 95% confidence level (MDC95) for an individual subject (MDC95,ind), as follows:
$$\mathrm{MDC}_{95,\mathrm{ind}} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$$
where SEM is the standard error of measurement (ie, the square root of the within-subject variance).15 Changes smaller than MDC95,ind cannot be reliably (with a confidence level of 95%) interpreted as "real" changes in the score for a subject compared with chance fluctuations. The responsiveness to changes at the group level, known as the MDC95 at the group level (MDC95,group), depends on the size of the group (n), as follows33:
$$\mathrm{MDC}_{95,\mathrm{group}} = \frac{\mathrm{MDC}_{95,\mathrm{ind}}}{\sqrt{n}}$$
Changes smaller than MDC95,group cannot be reliably (with a confidence level of 95%) interpreted as "real" changes in the mean score for a group compared with chance fluctuations.
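A minimal sketch of the MDC95 computations, assuming the standard forms MDC95,ind = 1.96 × √2 × SEM and MDC95,group = MDC95,ind / √n, and assuming that the SEM is estimated from the 2 test-retest sessions as SDdiff / √2 (with 2 measurements per subject, the within-subject variance is half the variance of the differences). Under these assumptions MDC95,ind reduces to 1.96 × SDdiff, which ties it to the width of the Bland-Altman limits of agreement. The SEM estimator is one common choice, not necessarily the one used in the study, and the function names and data are hypothetical.

```python
import math
from statistics import stdev


def sem_test_retest(day_1, day_2):
    """SEM = square root of the within-subject variance, estimated as SD_diff / sqrt(2)."""
    diffs = [a - b for a, b in zip(day_1, day_2)]
    return stdev(diffs) / math.sqrt(2)


def mdc95_ind(sem):
    """Smallest individual change distinguishable from measurement error (95% level)."""
    return 1.96 * math.sqrt(2) * sem


def mdc95_group(sem, n):
    """Smallest change in the mean of a group of n subjects distinguishable from error."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return mdc95_ind(sem) / math.sqrt(n)


day_1 = [20, 22, 18, 25, 24]  # invented test-retest POMA-T scores
day_2 = [19, 22, 20, 23, 24]
sem = sem_test_retest(day_1, day_2)
print(round(mdc95_ind(sem), 2), round(mdc95_group(sem, 30), 2))  # 2.91 0.53
```

Because the group threshold shrinks with √n, a group of 30 needs a change only about one fifth as large as a single subject does.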
The concurrent validity of the POMA scores was assessed by calculating their Spearman rank correlations (R) with the scores on the reference tests described above. Discriminant validity was assessed by relating the POMA scores to the type of walking aid commonly used by the participants (none, cane or stick, walker, or wheelchair) by means of a Kruskal-Wallis test with type of walking aid as the grouping factor, followed by post hoc comparisons by means of Mann-Whitney U tests with Bonferroni adjustments.
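The omnibus test and the post hoc adjustment can be sketched as below: a Kruskal-Wallis H statistic computed from pooled average ranks with the usual tie correction (ties are frequent on an ordinal 0–28 scale), plus the Bonferroni-adjusted significance level. The sketch assumes that all pairwise comparisons among the 4 walking-aid categories are made, giving 6 comparisons; the number of comparisons actually tested is not stated here. P-values for H (chi-square with k − 1 degrees of freedom) and for the Mann-Whitney U tests would come from a statistics package, for example `scipy.stats.kruskal` and `scipy.stats.mannwhitneyu`. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def average_ranks(values):
    """Ranks 1..n; tied values receive the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of non-empty samples (one per group)."""
    if any(len(g) == 0 for g in groups):
        raise ValueError("groups must be non-empty")
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])  # ranks of this group in the pooled sample
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are tied")
    return h / correction


def bonferroni_alpha(labels, alpha=0.05):
    """All pairwise comparisons among the labels and the per-comparison alpha."""
    pairs = list(combinations(labels, 2))
    return pairs, alpha / len(pairs)


aids = ["none", "cane or stick", "walker", "wheelchair"]
pairs, alpha_each = bonferroni_alpha(aids)
print(len(pairs), round(alpha_each, 4))  # 6 0.0083
```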
Fall-related predictive validity was determined by predicting future falls on the basis of the POMA scores. A "nonfaller" was defined as a subject who did not fall or fell only once during the follow-up period, whereas a "faller" was defined as a subject who fell at least twice during the follow-up period (as in the study by Tinetti et al3). Predictive validity was expressed in terms of sensitivity and specificity. Sensitivity, in this context, is defined as the probability that a future faller is indeed predicted to be a faller, whereas specificity is defined as the probability that a future nonfaller is indeed predicted to be a nonfaller. Receiver operating characteristic curves were used for selecting the optimal cutoff scores, and 95% confidence intervals were calculated. All analyses were performed with SPSS version 11.5* for Windows.
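The fall outcome and the accuracy measures can be sketched as follows. Two details are assumptions not stated in this section: a participant is predicted to be a faller when the POMA score is at or below the cutoff (consistent with lower scores indicating poorer performance), and the "optimal" cutoff on the receiver operating characteristic curve is taken as the one that maximizes Youden's index (sensitivity + specificity − 1). Each candidate cutoff corresponds to one point on that curve. Confidence intervals are omitted, and the function names and data are invented for illustration.

```python
def is_faller(n_falls: int) -> bool:
    """Faller = at least 2 falls during follow-up; 0 or 1 fall = nonfaller."""
    return n_falls >= 2


def sensitivity_specificity(scores, fallers, cutoff):
    """Predict 'faller' when the score is at or below the cutoff."""
    tp = sum(1 for s, f in zip(scores, fallers) if f and s <= cutoff)
    fn = sum(1 for s, f in zip(scores, fallers) if f and s > cutoff)
    tn = sum(1 for s, f in zip(scores, fallers) if not f and s > cutoff)
    fp = sum(1 for s, f in zip(scores, fallers) if not f and s <= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(scores, fall_counts):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's index."""
    fallers = [is_faller(k) for k in fall_counts]
    if all(fallers) or not any(fallers):
        raise ValueError("need both fallers and nonfallers")
    best = None
    for cutoff in sorted(set(scores)):  # every observed score is a candidate cutoff
        sens, spec = sensitivity_specificity(scores, fallers, cutoff)
        if best is None or sens + spec - 1 > best[1] + best[2] - 1:
            best = (cutoff, sens, spec)
    return best


scores = [10, 12, 15, 18, 20, 22, 25, 27]  # invented POMA-T scores
falls = [3, 2, 0, 2, 1, 0, 0, 1]           # invented fall counts during follow-up
print(youden_cutoff(scores, falls))         # (18, 1.0, 0.8)
```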
Results
Validity
The Spearman correlations between the scores on the POMA scales and the scores on the reference tests (walking speed, TUG, FICSIT-4, GARS, and LAPAQ), indicating the concurrent validity of scores for the scales, are shown in Table 3. All correlations were significant at the .01 level. Except for the correlations with LAPAQ, which were low, all correlations between the POMA-T and the POMA-B on the one hand and the reference tests on the other hand ranged from |.64| to |.70|. The corresponding correlations between the POMA-G and the reference tests were lower, ranging from |.51| to |.56|.
Discussion
From a clinical point of view, relative reliability must be considered less relevant than absolute reliability. The LOA showed that for the POMA-T, no systematic bias was present in the test-retest and interrater situations. The test-retest reliability data have direct implications for responsiveness. The responsiveness findings for the POMA-T indicated that, at a confidence level of 95%, intervention effects should be at least 5 points at the individual level and at least 0.8 point at the group level (with a group size of n=30) before a real improvement rather than a chance fluctuation can be reliably concluded. It should be emphasized, however, that this real change should be attributed to the intervention only when other systematic influences, such as spontaneous recovery, are controlled for by means of an adequate control group.
In earlier clinical trials in which the POMA was used as an outcome measure, statistically significant intervention effects of 3.5 to 5.3 points (relative to the results for a control group) were reported.8,11,34,35 Given these average group effects and the order of magnitude of the critical MDC95,ind determined in the present study, one may safely conclude that reliable intervention effects indeed occurred for at least some subjects. Even in those cases, however, the clinical relevance of the improvement is not beyond doubt. Clinical relevance can be demonstrated by showing that the change scores also exceed the minimal clinically important difference, defined as the smallest change that reflects a clinically relevant improvement. Several methods have been proposed to determine the minimal clinically important difference.36 An anchor-based method is preferred, in which the change in an external criterion, judged from either a clinician's or a patient's perspective, is used to "anchor" improvement. However, finding a valid external criterion, which often will be very difficult,37 was beyond the scope of the present study.
The concurrent validity values for the POMA-T and the POMA-B were quite acceptable, as demonstrated by their associations with other physical performance tests (R=|.64|–|.68|) and with self-reported limitations (R=|.68|–|.70|). The validity of the POMA-G scores was weaker, with Spearman correlations ranging from |.51| to |.56|. The correlations between the scores on the POMA scales and the self-reported amounts of physical activity (LAPAQ) were low, ranging from .33 to .38. It may be argued, however, that self-reported physical activity is a less adequate reference test, because it is a measure not of performance but of perception.38 Generally speaking, the concurrent validity values for the POMA-T and the POMA-B concur with the (sparse) data from previous studies.4–6 For the POMA-G, the only earlier concurrent validity finding was its correlation with ankle range of motion.4
Discriminant validity was demonstrated by finding significant differences between subgroups of subjects defined according to the type of walking aid that they used. Although the POMA-T and the POMA-G differentiated among the same (combined) subgroups and the POMA-B differentiated between other subgroups, there is no evidence for clear differences among the discriminatory powers of the 3 scales.
The predictive validity with regard to falling was not satisfactory for any of the POMA scales. Given optimal cutoff criteria, both the sensitivity and the specificity of the POMA-T and its subscales ranged from 62.5% to 66.1%. Similar values for sensitivity and specificity, however, have been reported in studies that used other versions of the POMA scale. In a prospective study of 60 community-dwelling older adults that used a 16-point version of the POMA-B, the sensitivity was 61.5% and the specificity was 69.5%.19 In another prospective study of 225 community-dwelling adults 75 years of age and older that used a 40-point version of the POMA-T, the sensitivity was 70% and the specificity was 52%.39 In a case-control study of 80 participants that used a modified 57-point version of the POMA, the sensitivity was 70% and the specificity was 65%.40 Only one study, a case-control study involving community-dwelling older people and using a 24-point version of the POMA-B, demonstrated much higher sensitivity and specificity: 95.5% for frequent fallers versus nonfallers.41
Conclusion
Appendix
Footnotes
The medical ethical committee of the Vrije Universiteit Medical Centre approved the study protocol.
* SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.
References