Clinical Perspectives |
DL Riddle, PhD, PT, is Associate Professor, Department of Physical Therapy, Medical College of Virginia Campus, Virginia Commonwealth University, 1200 E Broad, Richmond, VA 23298-0224 (USA) (driddle@hsc.vcu.edu). Address all correspondence to Dr Riddle.
PW Stratford, PT, is Associate Professor, School of Rehabilitation Science, and Associate Member, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
Submitted December 7, 1998;
Accepted July 7, 1999
| Abstract |
|---|
Key Words: Diagnosis; Tests and measurements, general
| Introduction |
|---|
Studies that can be used to determine whether meaningful clinical inferences can be made based on diagnostic tests are classified as "criterion-related validity studies."5 Criterion-related validity studies take 1 of 2 forms. Researchers can compare a clinical measure with a "gold standard" measure (ideally, a valid diagnostic test or a definitive measure of whether the condition of interest is truly present) obtained at about the same time as the measure being studied. In our illustration, the patient's report of falling is considered the gold standard measure. In other cases, a gold standard measure may be a diagnosis made at the time of surgery or via an invasive diagnostic procedure. Studies in which some form of gold standard is obtained at about the same time as the diagnostic test being studied are commonly called "concurrent criterion-related validity studies."5 Researchers can also compare a measure's prediction of a future event with what actually happens to a patient in the future. These studies are commonly termed "predictive criterion-related validity studies."5
Studies designed to estimate the risk of a future adverse event are often used by clinicians to make judgments about prognoses. For example, investigating whether the BBT can be used to predict whether a person will fall in the future is an illustration of a predictive criterion-related validity study. The gold standard for this type of study would be the subjects' report of falls for a period of time following administration of the BBT.
| The Berg Balance Test |
|---|
Data exist to support the reliability of BBT scores obtained from elderly subjects.3,6,7 For example, Bogle Thorbahn and Newton3 reported an intertester reliability (Spearman rho) value of .88 for 17 subjects aged 69 to 94 years. Evidence also exists to support the content validity,6 construct validity,7,8 and criterion-related validity3,4,8 of test scores for inferring fall risk in elderly subjects tested in a variety of settings. Construct validity has been assessed using a variety of approaches. For example, construct validity was supported to the extent that BBT scores were shown to correlate reasonably well with other measures of balance (Pearson r = .38–.91) and measures of motor performance (Pearson r = .62–.94).7,8 The Pearson r correlation between the BBT and the balance subscale of the Tinetti Performance-Oriented Mobility Assessment9 was .91,8 and the correlation between the BBT and the Barthel Index mobility subscale10 was .67.8
| The Illustration |
|---|
| Diagnostic Test Methodology |
|---|
| Sensitivity and Specificity |
|---|
The authors of both studies in our illustration reported the sensitivity and specificity of the BBT for determining current fall risk. Berg et al8 contended that the best way to interpret scores on the BBT is to use a single cutoff point of 45 to differentiate those at risk for falls (those with scores of <45) and those who are not at risk for falls (those with scores of ≥45). Using a cutoff point of 45, as recommended by Berg et al, the sensitivity for the data collected by Shumway-Cook and colleagues4 was 55% and the specificity was 95%. For the data collected by Bogle Thorbahn and Newton,3 the sensitivity was 82% and the specificity was 87%. When we combined the data from both studies, a cutoff point of 45 yielded a sensitivity of 64% and a specificity of 90% (Tab. 3). A sensitivity of 64% indicates that 64% of subjects who were true fallers had a positive BBT (a score of <45). That is, approximately a third of the subjects who were fallers were missed by the BBT. Although there are no agreed-on standards for judging sensitivity and specificity, we believe the sensitivity of 64% should generally be considered quite low because more than a third of the subjects were misclassified.
A specificity of 90% indicates that 90% of subjects who were nonfallers had a negative BBT (a score of ≥45). That is, only 10% of the nonfallers were missed by the BBT. Specificity was much higher than sensitivity, indicating that the BBT does a better job of identifying subjects who are not fallers than subjects who are fallers. When we use diagnostic tests, we do not know who has the condition of interest and who does not have the condition of interest. That is, sensitivity and specificity have somewhat limited usefulness because they do not describe validity in the context of the test result.1 Rather, they describe validity in the context of the gold standard, a value we do not know when we do diagnostic tests. Sensitivity, for example, does not take into account the false positive test results (Tab. 2) on a group of patients. Stated another way, sensitivity does not describe how often patients with positive tests have the disorder of interest. Sensitivity only describes the proportion of patients with the disorder of interest who have a positive test. Similarly, specificity does not take into account false negative test results (Tab. 2). Specificity does not describe how often patients with negative tests do not have the disorder of interest. Specificity only describes the proportion of patients without the disorder of interest who have a negative test.
Diagnostic testing, in our view, is used because clinicians want to know the probability of the condition existing. Because clinicians make decisions based on diagnostic test results and not necessarily on results of tests that are considered gold standards, some authors1 have contended that positive and negative predictive values (see next section) are more important than sensitivity and specificity for clinical practice.
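The two definitions above reduce to simple ratios over a 2 × 2 table of test result against gold standard. The following is a minimal sketch only: the function names are ours, and the counts are hypothetical values chosen to reproduce the combined 64% and 90% figures, not the actual study data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of people WITH the condition (fallers) who test positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of people WITHOUT the condition (nonfallers) who test negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts (NOT the study data): 100 fallers and 100 nonfallers.
tp, fn = 64, 36  # fallers scoring < 45 (positive) and >= 45 (negative)
tn, fp = 90, 10  # nonfallers scoring >= 45 (negative) and < 45 (positive)

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

Note that each index is computed within a column of the gold standard (fallers only, or nonfallers only), which is why neither one says how often a positive or negative test result is correct.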
| Positive and Negative Predictive Values |
|---|
For many clinicians, the idea of estimating the probability of a disorder prior to doing a diagnostic test (pretest probability) may seem like a new or unusual concept. We believe that some clinicians, based on their experience and training, may use an ordinal-based scale estimate of pretest probability, such as the disease is highly likely, somewhat likely, or not very likely given the patient's signs and symptoms. In our view, however, using percentage estimates of pretest probability is not commonly done by most therapists. We suggest that therapists should make percentage estimates of the pretest probability of the disorder of interest. For example, if a clinician used an ordinal scale similar to the one just described, we contend that the clinician should convert it to a percentage estimate of pretest probability in the following way. If the pretest probability of the disorder were judged to be highly likely, this judgment could be converted to a 75% pretest probability, whereas a rating of "somewhat likely" could be converted to pretest probability of 50%. A rating of "not very likely" might be converted to a pretest probability of 25%. We believe that, as therapists become more comfortable with making percentage estimates of pretest probability, they will become more accurate, although we have no data to support this argument. By using percentage estimates for pretest probability, therapists can take full advantage of positive and negative predictive values (and likelihood ratios, to be discussed elsewhere in this article) reported in the literature. Several examples are discussed elsewhere in this article to illustrate how pretest probability can be estimated and how these estimates can influence the interpretation of the diagnostic test.
Positive predictive value is the proportion of patients with a positive test who have the condition of interest.1 Negative predictive value is the proportion of patients with a negative test who do not have the condition of interest.1 The closer the positive predictive value is to 100%, the more likely the disease is present with a positive test finding. The closer the negative predictive value is to 100%, the more likely the disease is absent with a negative test finding.
In our illustration, the combined data from both studies yielded a positive predictive value of 72% when using a cutoff point of 45 on the BBT (Tab. 4). A positive predictive value of 72% indicates that 72% of patients with a positive test (a BBT of <45) were classified as fallers (the gold standard) and 28% of the patients were misclassified as fallers based on the BBT, an error rate that we consider to be fairly high. A negative predictive value of 85% indicates that 85% of patients with a negative test (a BBT of ≥45) were classified as nonfallers (the gold standard). Our misclassification rate for nonfallers is less than for fallers (ie, we can be more confident about identifying nonfallers than fallers based on BBT test results).
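Predictive values use the same 2 × 2 counts as sensitivity and specificity but are computed within a test result (all positives, or all negatives) rather than within a gold standard category. A small sketch follows; the function names are ours and the counts are hypothetical, chosen only so that the results match the 72% and 85% reported above.

```python
def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Proportion of positive tests that belong to people with the condition."""
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(true_neg: int, false_neg: int) -> float:
    """Proportion of negative tests that belong to people without the condition."""
    return true_neg / (true_neg + false_neg)


# Hypothetical counts (NOT the study data): 100 positive and 100 negative tests.
tp, fp = 72, 28  # positive tests (BBT < 45): fallers, nonfallers
tn, fn = 85, 15  # negative tests (BBT >= 45): nonfallers, fallers

ppv = positive_predictive_value(tp, fp)
npv = negative_predictive_value(tn, fn)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")
print(f"positive tests that are wrong: {1 - ppv:.0%}")
```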
| Issues Related to the Interpretation of Sensitivity, Specificity, and Predictive Values |
|---|
The choice of the cutoff point influences the sensitivity, specificity, and positive and negative predictive values. This concept is illustrated in Table 4. For example, if the cutoff point for the BBT were set at 40, the sensitivity would be 45% and the specificity would be 96%. With a cutoff point of 50, the sensitivity is 85% and the specificity is 73%. Generally, the choice of cutoff point by the researcher will increase one validity index (eg, sensitivity) but will decrease the other validity index (eg, specificity). For example, when sensitivity rises (as seen when going from a cutoff point of 40 to a cutoff point of 50 on the BBT), specificity falls. The same concept holds for positive and negative predictive values. When the positive predictive value rises (as seen when going from a cutoff point of 50 to a cutoff point of 40 on the BBT), the negative predictive value falls (Tab. 4).
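The tradeoff can be seen by recomputing sensitivity and specificity at several cutoff points on one data set. The sketch below uses invented BBT scores purely for illustration (they are not the data behind Table 4), and the helper name is our own.

```python
def sens_spec_at_cutoff(faller_scores, nonfaller_scores, cutoff):
    """Return (sensitivity, specificity), treating a score below `cutoff` as positive."""
    true_pos = sum(score < cutoff for score in faller_scores)
    true_neg = sum(score >= cutoff for score in nonfaller_scores)
    return true_pos / len(faller_scores), true_neg / len(nonfaller_scores)


# Hypothetical BBT scores (0-56 scale), invented for illustration only.
fallers = [30, 36, 38, 41, 43, 44, 46, 48, 49, 52]
nonfallers = [39, 44, 47, 49, 50, 51, 52, 53, 54, 55, 55, 56]

results = {c: sens_spec_at_cutoff(fallers, nonfallers, c) for c in (40, 45, 50)}
for cutoff, (sens, spec) in results.items():
    print(f"cutoff {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these made-up scores, raising the cutoff from 40 to 50 labels more fallers as positive (sensitivity rises) while labeling more nonfallers as positive too (specificity falls).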
The principal factor influencing the clinician's choice of a cutoff point is related to the consequence of misclassifying patients. Broadly speaking, there are 3 choices for a cutoff point: (1) maximize both sensitivity and specificity, (2) maximize sensitivity at the cost of minimizing specificity, and (3) maximize specificity at the cost of minimizing sensitivity. Maximizing sensitivity and specificity is appropriate when the consequences of false positives and false negatives are about equal. Maximizing sensitivity at the cost of minimizing specificity is desirable when the consequence of a false negative (eg, falsely identifying a subject as a nonfaller) exceeds the consequence of a false positive (eg, falsely identifying the subject as a faller). Conversely, maximizing specificity at the cost of minimizing sensitivity is desirable when the consequence of a false positive exceeds the consequence of a false negative. In the case of the BBT, it would appear that sensitivity should be optimized to avoid classifying a faller as a nonfaller. Misclassifying fallers would appear to have serious consequences (eg, fractures).
An important advantage associated with the use of sensitivity and specificity is that they are not influenced by prevalence. Prevalence is defined as the proportion of patients with the disorder of interest among all patients tested.1 A therapist can use sensitivity and specificity estimates from a published report and apply these estimates to a patient as long as the patient is reasonably similar to the subjects in the study.
Predictive values should guide clinical decisions (they estimate validity in the context of the test result), but unlike sensitivity and specificity, predictive values are prevalence dependent.1 That is, as the proportion of those with the disease changes, predictive values also change. Predictive values, therefore, vary when the prevalence of the disorder of interest changes. As the prevalence increases, the positive predictive value increases and the negative predictive value decreases. When the prevalence decreases, the positive predictive value decreases and the negative predictive value increases. Because the chance that an individual patient will have a target disorder varies (ie, the pretest probability changes depending on the patient's signs and symptoms), the prevalence associated with a diagnostic accuracy study may not apply to a given patient. For example, in the study by Shumway-Cook et al,4 there was a prevalence of fallers of 50%. If, for example, a clinician estimated the pretest probability of falling for a patient to be only 10%, the predictive values from the data of Shumway-Cook et al would not provide accurate estimates of positive or negative predictive values for the patient. The positive predictive value from the data of Shumway-Cook and colleagues would be spuriously high (because of the higher prevalence), and the negative predictive value would be spuriously low for the patient with a pretest probability of 10%.
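The prevalence dependence can be checked numerically by holding sensitivity and specificity fixed and recomputing predictive values at different prevalences. This sketch uses the sensitivity (55%) and specificity (95%) reported above for the data of Shumway-Cook et al; the function name is our own, and the resulting predictive values are derived from those rounded figures rather than taken from the article.

```python
def predictive_values(sens: float, spec: float, prevalence: float):
    """Return (PPV, NPV) for a test with the given sensitivity and specificity
    applied to a population with the given prevalence (all as proportions)."""
    true_pos = sens * prevalence
    false_neg = (1 - sens) * prevalence
    true_neg = spec * (1 - prevalence)
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Sensitivity and specificity at a cutoff of 45 for the Shumway-Cook et al data.
ppv_50, npv_50 = predictive_values(0.55, 0.95, prevalence=0.50)
ppv_10, npv_10 = predictive_values(0.55, 0.95, prevalence=0.10)

print(f"prevalence 50%: PPV = {ppv_50:.2f}, NPV = {npv_50:.2f}")
print(f"prevalence 10%: PPV = {ppv_10:.2f}, NPV = {npv_10:.2f}")
```

Under these assumptions, dropping the prevalence from 50% to 10% lowers the positive predictive value and raises the negative predictive value, which is the direction of error described above.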
Unfortunately, predictive values are influenced by prevalence, whereas sensitivity and specificity are not. Sensitivity and specificity, however, are related to positive and negative predictive values in the following way. When specificity is high, the positive predictive value tends to be high, and when sensitivity is high, the negative predictive value tends to be high. That is, when sensitivity is high, a negative test generally indicates the disorder is not present (or, in our illustration, the person is not at risk of falling). When specificity is high, a positive test generally indicates the disorder is present (the person is at risk of falling).2 Table 4 illustrates this concept. When specificity is high, for example, for a BBT cutoff point of 40 (96%), the positive predictive value will generally be high (83%). A clinician might hypothetically believe, for example, that based on medical history and examination data, a patient had a pretest probability of falling of approximately 40% and the patient might subsequently have a score of 37 on the BBT, a score considered positive using a cutoff point of 40 (Tab. 4). The positive predictive value would be 83%, an increase of 43 percentage points from the pretest probability. We contend that the clinician can be reasonably confident the patient is a faller.
Similarly, when sensitivity is high (97% for a cutoff point of 55), the negative predictive value will also generally be high (95%). For example, a clinician might believe, based on a patient's medical history and examination data, that the patient had a pretest probability of falling of approximately 40% (or a pretest probability of not falling of 60%). The patient might subsequently have a score of 56 on the BBT, a score considered negative using a cutoff point of 55 (Tab. 4). The negative predictive value (posttest probability) in this hypothetical example would be 95%, and we argue that the clinician can be very confident the patient is not a faller. We noted earlier that predictive values are dependent on prevalence, and in our examples, the prevalence (pretest probability) for falls was estimated to be 40%, a reasonable approximation of the prevalence reported in our illustration using the BBT data. Had the pretest probabilities for the patient examples been appreciably lower or higher, the predictive values reported in the 2 examples above would not have been accurate estimates of posttest probability.
In summary, sensitivity and specificity are not dependent on prevalence and are therefore seen as useful for clinical practice.1 As a general guide, we believe clinicians should conclude the condition is likely to be present when a test is positive and the specificity for the test is high. Conversely, clinicians should conclude the condition is likely to be absent when a test is negative and the sensitivity for the test is high.1,2 Positive and negative predictive values are, in part, prevalence dependent. As a result, we argue that predictive values are meaningful only when the prevalence reported in a study approximates the pretest probability of the disorder the clinician has estimated for the patient. To be most accurate, pretest probability estimates should be based on sound scientific data.
| Confidence Intervals for Validity Indexes |
|---|
For example, the 95% CI for specificity reported by Bogle Thorbahn and Newton3 ranged from 67% (not very specific) to 100% (perfect specificity). The 95% CI for specificity for the combined data from the studies of Bogle Thorbahn and Newton3 and Shumway-Cook et al4 ranged from 83% to 97% (both values, in our opinion, represent reasonably high specificity).
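The difference in interval width between a single small study and the combined data is largely a matter of sample size. The sketch below computes a 95% CI for a proportion with the Wilson score method; this is our assumption for illustration, since the method used for the intervals quoted above is not stated here, and the counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score confidence interval for a proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: the same observed specificity (about 87%) in a small
# and a large sample of nonfallers.
small_lo, small_hi = wilson_interval(13, 15)
large_lo, large_hi = wilson_interval(130, 150)

print(f"n = 15:  {small_lo:.2f} to {small_hi:.2f}")
print(f"n = 150: {large_lo:.2f} to {large_hi:.2f}")
```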
| Likelihood Ratios |
|---|
Jaeschke and colleagues20 proposed the following guide to interpreting likelihood ratios. Likelihood ratios greater than 10 or less than 0.1 generate large and often conclusive changes from pretest to posttest probability. Likelihood ratios between 5 and 10 or between 0.2 and 0.1 generate moderate changes from pretest to posttest probability. Likelihood ratios from 2 to 5 and from 0.5 to 0.2 result in small (but sometimes important) shifts in probability, and likelihood ratios from 0.5 to 2 result in small and rarely important changes in probability.
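The guide can be written as a small lookup. This is a sketch only: the function name and labels are ours, and because the published ranges share their endpoints (for example, 5 appears in two ranges), the assignment of exact boundary values to a category is an assumption.

```python
def interpret_likelihood_ratio(lr: float) -> str:
    """Classify a likelihood ratio using the ranges attributed to Jaeschke et al.
    Boundary handling (which band owns 2, 5, 10, 0.5, 0.2, 0.1) is an assumption."""
    if lr <= 0:
        raise ValueError("a likelihood ratio must be positive")
    if lr > 10 or lr < 0.1:
        return "large, often conclusive change"
    if lr >= 5 or lr <= 0.2:
        return "moderate change"
    if lr >= 2 or lr <= 0.5:
        return "small, sometimes important change"
    return "small, rarely important change"


for value in (11.7, 2.8, 0.6):
    print(value, "->", interpret_likelihood_ratio(value))
```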
Because likelihood ratios can be applied to score intervals for tests with continuous measures, we believe they are more useful than sensitivity, specificity, and predictive values, which are limited to data presented in a dichotomous format. For example, the positive likelihood ratio for the score interval of 40 to 44 (a test score considered positive based on recommendations of Berg and colleagues8) is 2.8 (Tab. 5). This likelihood ratio indicates that a patient with a BBT score between 40 and 44 is 2.8 times more likely to be a faller than a nonfaller. The 95% CI ranges from 0.9 to 8.5. That is, the 95% CI overlaps 1 (no change in the probability of the disorder); therefore, a clinician cannot be very confident that a score between 40 and 44 increases the probability of identifying a patient at risk for falls. If a patient scores below 40 on the BBT, however, the likelihood ratio increases to 11.7 (95% CI=3.6–37.6). A patient with a BBT score below 40 is at greater risk for falls as compared with patients with scores between 40 and 44. On average, patients with BBT scores less than 40 are almost 12 times more likely to be a faller than a nonfaller.
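For a test dichotomized at a single cutoff point, the positive and negative likelihood ratios follow directly from sensitivity and specificity using the standard definitions. The sketch below applies them to the rounded cutoff-40 values quoted earlier in this article (sensitivity 45%, specificity 96%); the function name is our own. The result is close to, but not exactly, the 11.7 reported for scores below 40, presumably because the published percentages are rounded.

```python
def likelihood_ratios(sens: float, spec: float):
    """Return (LR+, LR-) for a dichotomous test.

    LR+ = P(positive test | condition) / P(positive test | no condition)
    LR- = P(negative test | condition) / P(negative test | no condition)
    """
    lr_positive = sens / (1 - spec)
    lr_negative = (1 - sens) / spec
    return lr_positive, lr_negative


# Rounded sensitivity and specificity for a BBT cutoff point of 40.
lr_pos, lr_neg = likelihood_ratios(0.45, 0.96)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.1f}")
```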
| Applications of Likelihood Ratios to Clinical Practice |
|---|
≥40) is 0.6 times as likely to be a faller as a nonfaller. When using a cutoff point of 40, for a negative score (score of ≥40), a patient is more likely to be a nonfaller than a faller. Based on the data summarized in Table 4, lower cutoffs will usually increase the magnitude of the positive likelihood ratio (a desirable trait), but they will also increase the magnitude of the negative likelihood ratio (an undesirable trait).
Another advantage of the use of likelihood ratios is that, along with the use of a nomogram (Figure), a clinician can determine the probability of a disorder, given the result of the test (also called "posttest probability").21 Because likelihood ratios do not vary when disorder prevalence varies, likelihood ratios can be generalized to other patients. To use the nomogram, the clinician must first estimate the pretest probability of the disorder. The pretest probability of the disorder (likelihood of the presence of the disorder prior to doing the test) is estimated, as mentioned earlier, by the clinician's own clinical training and experience with similar types of patients in the specific setting in which the patients are seen.2 The constellation of signs and symptoms also influences the clinician's judgment of the pretest probability of the disorder. If we knew the likelihood ratios for each of the medical history items and signs and symptoms of patients, we could repeatedly recalculate the pretest and posttest probability of the disorder of interest and come up with a very accurate estimate of the final posttest probability.20 Most of these data, unfortunately, are unavailable, so clinicians typically must rely on training, experience, and knowledge of the literature to estimate the pretest probability of the disorder.
To use the nomogram, the clinician simply estimates the pretest probability of the disorder and identifies this value in the left-hand column of the nomogram (Figure). A straightedge is then anchored on the left column of the Figure at the pretest probability estimate and aligned on the middle column at the likelihood ratio. The right column indicates the posttest probability.
Our second hypothetical example is about a 75-year-old man who was diagnosed with congestive heart failure approximately 5 years previously and requires assistance with some activities of daily living. He reports losing his balance occasionally and remembers falling once in the past few years. Based on the patient's medical history and functional status, the pretest probability for falls would be fairly high (ie, on the order of 50%). A BBT was done, and a score of 38 (a positive test, using a cutoff point of 40) was obtained. Using the data in Table 4, the positive likelihood ratio for a score of less than 40 is 11.7. That is, this patient is 11.7 times more likely to be a faller than a nonfaller. Using the nomogram shown in the Figure, the posttest probability for current fall risk is approximately 92%, an increase of 42 percentage points above the pretest probability. If we believe our data are correct and our estimates are appropriate, we can theoretically be confident that we have identified a patient who has a very high probability of falling. We again appear to have substantially increased our level of certainty about the patient's risk of falling.
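The nomogram reading in this example can be checked arithmetically: convert the pretest probability to odds, multiply by the likelihood ratio, and convert the result back to a probability. The sketch below does this for the pretest probability of 50% and likelihood ratio of 11.7 used above; the function name is our own.

```python
def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Posttest probability from a pretest probability and a likelihood ratio."""
    if not 0 < pretest_prob < 1:
        raise ValueError("pretest probability must be strictly between 0 and 1")
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# Second hypothetical patient: pretest probability 50%, BBT score of 38 (LR = 11.7).
post = posttest_probability(0.50, 11.7)
print(f"posttest probability = {post:.0%}")
```

The calculation agrees with the approximately 92% read from the nomogram, with the advantage that it can be repeated for any pretest estimate without a straightedge.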
| Summary |
|---|
| Acknowledgments |
|---|
| Footnotes |
|---|
* Likelihood ratios should not be confused with odds ratios. Odds ratios are an estimate of risk often expressed in case-control studies designed to investigate causation of a disease.
| References |
|---|