PHYS THER
Vol. 86, No. 11, November 2006, pp. 1496-1498
DOI: 10.2522/ptj.20060002.ic
Invited Commentary
Dorcas Beaton
Scientist and Director
Mobility Program Clinical Research Unit
St Michael's Hospital
Assistant Professor
Department of Occupational Sciences, Graduate Departments of Rehabilitation Sciences and Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
The article by Stratford et al provides an excellent model for analyzing the performance of outcome measures. In the true context of construct validity, they evaluated whether the scales are measuring what they are intended to measure, and they brought modern analytic techniques into the analysis to do so. They analyzed 4 performance-based measures of lower-extremity pain and function with the goal of seeing whether performance-based measures are better able to separate the concepts of pain and function than patient self-report measures have been in previous experience. The work by Stratford et al is an excellent example of the need to push our measurement work and test all of the assumptions under which we believe we are measuring a certain construct.
There are 2 main points regarding the art of statistical modeling and conceptual frameworks that I would like to raise in this commentary, which the readers might consider as they review this work.
The Art of Statistical Modeling
Structural equation modeling is a powerful and attractive tool for this type of work. Indeed, it offers a route into exactly what the authors wanted to explore. However, there is always a bit of art to the modeling process, and there are guidelines on how to make it work best for a given situation.1,2 Structural equation modeling is a large-sample method. Stratford et al used a time-honored method of splitting their data in order to have a test sample and a validation sample to confirm their methods; however, in so doing, they were working at or below the very lowest limits of the sample size requirements. Kline1 suggested a minimum of 100 to 150 people, and approximately 10 to 20 people per parameter estimated. With 8 attributes loading onto 2 factors, the authors were estimating 17 parameters. The small sample size in general and the number of parameters being estimated could lead to misestimation of the model and a failure to converge. The latter was not the case in this study, as the authors reached convergence. Two other things also support the conclusion that the model did not suffer from the small sample. First, the authors had similar findings on the second half of the data. Second, the variables in the model were highly correlated with the latent trait, which means they were less vulnerable to misestimation. The authors made the choice to split the data, taking the risks associated with small samples. Another choice would have been to run the model only on the full sample and take advantage of being closer to the recommended sample sizes. This is an important consideration for the application of analytic techniques such as structural equation modeling: they require large numbers of observations.
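The parameter count and the sample-size guidance in this paragraph can be checked with a little arithmetic. The sketch below assumes a standard confirmatory factor model in which each of the 8 indicators loads on exactly 1 factor, each indicator has its own error variance, factor variances are fixed to 1 for identification, and the 2 factors are free to correlate. That accounting is an assumption that happens to reproduce the 17 parameters mentioned above; it is not a description taken from Stratford et al. The function names are illustrative only.

```python
def count_cfa_parameters(n_indicators: int, n_factors: int) -> int:
    """Count free parameters in a simple-structure CFA (assumed specification).

    Assumes one loading and one error variance per indicator, factor
    variances fixed to 1, and all factor covariances freely estimated.
    """
    loadings = n_indicators
    error_variances = n_indicators
    factor_covariances = n_factors * (n_factors - 1) // 2
    return loadings + error_variances + factor_covariances


def recommended_sample_range(n_parameters: int,
                             per_param_low: int = 10,
                             per_param_high: int = 20,
                             floor: int = 100) -> tuple[int, int]:
    """Apply the 10-to-20-people-per-parameter guideline with a minimum floor."""
    low = max(floor, per_param_low * n_parameters)
    high = max(floor, per_param_high * n_parameters)
    return low, high


n_params = count_cfa_parameters(n_indicators=8, n_factors=2)
low, high = recommended_sample_range(n_params)
print(f"free parameters: {n_params}")
print(f"suggested sample size: {low} to {high} people")
```

Under these assumptions the guideline points to a sample in the range of 170 to 340 people for a 17-parameter model, which illustrates why halving a modest sample places the analysis near or below the recommended minimum.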
Analysts using structural equation modeling also make a series of decisions about how to model. Stratford et al began with a 2-factor model to see whether there were 2 concepts being measured: pain and function. Their description seems to suggest that they loaded the items on their respective factors and tested the model. I wonder whether we would have had even more information had they allowed the items to load on both factors, or had they tried just 1 factor first. Either approach would have allowed them to demonstrate that their measures did not fit a model with cross-loadings or a model with only 1 latent trait. This would have added strength to their finding of a 2-factor model by showing that it offered more information and better fit to the data than a model with all indicators loaded onto 1 factor (latent trait). Indeed, had they done this, they could have used a chi-square difference test to test for the improved explanatory power and fit of their model compared with a 1-factor model.1 The authors were clear in their decisions as to how they modeled, and these comments highlight the interpretive part of any form of path or structural equation modeling.
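For readers unfamiliar with the chi-square difference test for nested models, a minimal sketch follows. The fit statistics in it are hypothetical numbers chosen only for illustration and are not results from Stratford et al. The degrees of freedom assume 8 indicators (36 observed variances and covariances), a 2-factor model with 17 free parameters, and a 1-factor model with 16. The helper `chi2_sf` is a small standard-library implementation of the chi-square upper-tail probability for integer degrees of freedom; in practice a statistics package would normally supply it.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)


def chi_square_difference_test(chi2_restricted: float, df_restricted: int,
                               chi2_full: float, df_full: int):
    """Compare a restricted (e.g., 1-factor) model with a nested fuller model."""
    delta_chi2 = chi2_restricted - chi2_full
    delta_df = df_restricted - df_full
    if delta_df <= 0 or delta_chi2 < 0:
        raise ValueError("models do not look nested in the expected direction")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit statistics, for illustration only.
delta, ddf, p = chi_square_difference_test(
    chi2_restricted=58.4, df_restricted=20,  # 1-factor model (made-up values)
    chi2_full=24.1, df_full=19,              # 2-factor model (made-up values)
)
print(f"delta chi-square = {delta:.1f} on {ddf} df, p = {p:.2g}")
```

A small p value here would indicate that constraining the indicators to a single latent trait fits significantly worse than the 2-factor model, which is the kind of supporting evidence described above.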
Conceptual Frameworks
Measurement requires 2 conceptual frameworks to be considered: (1) that of the target construct and (2) that of the instrument selected to define that construct. We have come a long way in rehabilitation toward an understanding of the former conceptual framework. Stratford et al stated that they were measuring physical function, and they used the definition based on the work of Bellamy3,4 to define their target: "the ability to move around and to look after yourself." Pain was not specifically defined by the authors. They identified these as 2 distinct concepts using the Outcome Measures in Arthritis Clinical Trials conference (OMERACT) III recommendations. What they have not provided is an overall framework that would help "guide our communication, clinical research, and patient care" as Jette noted in this journal.5 Such a framework helps to elaborate on the anticipated relationships between factors such as pain and function, and in this case may have helped in offering an explanation for the differences between self-report and performance-based measurement.
Stratford et al discussed their findings, in which performance-based indicators of function and self-reported pain fell into separate factors, and contrasted them with past experiences of being unable to separate self-reported function and self-reported pain. A broader conceptual framework might have aided in explaining this difference. If placed within Verbrugge and Jette's framework,6 Stratford et al would have been measuring a functional limitation when using a timed test of function. Several of the self-report measures might be measuring more at the level of the whole task in an unrestricted context (the disability level) and are more likely to be influenced by personal and environmental factors than the structured timed tests. This supports Stratford and colleagues' findings, but from a position of difference in concept rather than difference in quality of either type of scale (timed versus self-report measures of physical function). The same would be true if put within the International Classification of Functioning, Disability and Health (ICF) model,7 where the timed test would be considered an activity limitation and appraisal of functioning as a whole might shift that to participation. In both situations, the timed test and self-report both tap physical functioning, but at different levels of the conceptual frameworks we use. Interestingly, the final revision of the model provided by Stratford et al removed one task due to an undesirable correlation between the timed outcome for this test and its pain rating. The task was the stair test, which was distinct from the other tasks in that it was moving toward a higher level of contextualized complexity, much like self-report of physical function at the level of disability.5 Just as with the self-report measures, perhaps the distinction between pain and disability in this severely disabled population is lost when you move toward more applied, complex appraisals of activities in daily life.
The second framework that must be considered is that of the measure that has been chosen. Hopefully, the developers have provided their definition and have described how they operationalized it in the development of their outcome measure; however, many do not.8 The user then must appraise this and make sure it matches with the intended target. Stratford et al have clearly articulated an a priori definition of their target—physical function—that includes both mobility and taking care of oneself. In the end, they have 3 of the 4 measures in the model. All 3 measures focus on the timed performance of mobility tasks: Timed "Up & Go" Test, self-paced walk, and Six-Minute Walk Test. The second domain is the pain experience immediately after each of these tasks. As mentioned above, the most functional one—the stair test—was dropped. Were the 3 remaining tasks measuring something congruent with Stratford and colleagues' definition of physical function, or were they measuring timed performance of mobility tasks alone? Similarly, Stratford and colleagues' pain scale focused on the pain associated with the timed tasks, not pain as we typically might measure it. Is this a broad enough measure of pain related to osteoarthritis, or do we need to consider a broader scale (pain visual analog scale or numeric rating scale without attribution to a task as suggested by OMERACT, or other arthritis-related pain scales)?
Stratford et al have pushed us to demand more from our measures, and specifically to make sure that our measures are measuring what they are supposed to be measuring in the way that we expect. The points raised above offer some possible explanations, all of which are testable. Such a study would field both self-report and performance-based measures, with attributed and nonattributed pain items, in a sample large enough to allow for even more confidence in large-sample modeling techniques, perhaps in people along the course of their experience with osteoarthritis rather than just at the end stage. The door has been opened, and the rehabilitation community could easily collaborate on such a venture.
Measurement is the application of a set of rules to get numeric quantification of attributes—in Stratford and colleagues' case, the pain and physical function concepts. Once we believe we have a good way of measuring these concepts, we keep pushing the boundaries to make sure we can interpret the scores in ways that we should. Stratford and colleagues have done that in their study.
References
1. Kline RB. Principles and Practice of Structural Equation Modeling. 2nd ed. New York, NY: The Guilford Press; 2005.
2. Fayers PM, Hand DJ. Factor analysis, causal indicators and quality of life. Qual Life Res. 1997;6:139–150.
3. Bellamy N. WOMAC Osteoarthritis Index User Guide IV. Herston, Queensland, Australia: University of Queensland; 2000.
4. Bellamy N, Kirwan J, Boers M, et al. Recommendations for a core set of outcome measures for future phase III clinical trials in knee, hip, and hand osteoarthritis: consensus development at OMERACT III. J Rheumatol. 1997;24:799–802.
5. Jette AM. Toward a common language for function, disability, and health. Phys Ther. 2006;86:726–734.
6. Verbrugge LM, Jette AM. The disablement process. Soc Sci Med. 1994;38:1–14.
7. International Classification of Functioning, Disability and Health: ICF. Geneva, Switzerland: World Health Organization; 2001.
8. Lohr KN, Aaronson NK, Alonso J, et al. Evaluating quality-of-life and health status instruments: development of scientific review criteria. Clin Ther. 1996;18:979–992.
Copyright © 2006 by the American Physical Therapy Association.