The urodynamic tracing, urodynamic parameters documented in the medical record, and urodynamic findings were extracted from the record and included in the subject's packet (Table 3). Comments from the nurse administering the test, documented in the medical record or on the urodynamic tracing, were also included. Patient name and other specific identifying information were omitted from study forms and rendered unreadable on the urodynamic tracings in the study packets.
Physician reviewers assigned both International Continence Society diagnoses8 and clinical diagnoses (on which to base management decisions). International Continence Society filling (storage) phase diagnoses included abnormalities of sensation (choices: normal, increased, reduced or absent), detrusor function (normal, overactive), and urethral closure mechanism (normal, urodynamic stress incontinence). Reviewers also indicated presence or absence of urgency. International Continence Society voiding phase diagnoses included abnormalities of detrusor function (normal, detrusor underactivity, acontractile detrusor) and urethral function (normal, bladder outlet obstruction, dysfunctional voiding, detrusor sphincter dyssynergia, nonrelaxing urethral sphincter obstruction). “Unsure or cannot be determined” could also be chosen for each International Continence Society diagnosis. Reviewers chose among yes, no, or unsure for possible clinical diagnoses of stress urinary incontinence, bladder overactivity without incontinence, bladder overactivity with incontinence, and voiding dysfunction. When the latter was selected, reviewers further considered more specific diagnoses of bladder outlet obstruction, hypoactive detrusor, or another they could specify. If mixed incontinence was felt to be present, reviewers were instructed to select both stress incontinence and bladder overactivity.
Each of the six physician reviewers completed the diagnosis data sheets for each of the 100 subjects. Physician reviewers were instructed not to discuss study packet details or their diagnostic responses with each other during the course of the study. Missing or unclear responses in the data sheet were clarified with the physician nondirectively, and with masking preserved.
Multichannel urodynamic testing of all patients in this study was performed by a group of three urologic nurses, urodynamics instructors with at least 5 years of urodynamics experience, whose full-time responsibility is urodynamic testing. Tests are performed using a Life-tech Janus model (Stafford, TX) in accordance with the established guidelines of the International Continence Society.9 A noninstrumented uroflow was first performed, with voided volumes less than 50 mL and maximum flow rates more than 50 mL/sec discarded. Subsequently, patients were catheterized with a 12-F catheter for residual urine and checked for signs of bladder infection. A dual-channel 6-F catheter was then inserted into the bladder, followed by insertion of a 9-F balloon catheter into the rectum for abdominal pressure measurement. In women with pelvic organ prolapse, the prolapse was reduced before the urodynamic study with a rolled sponge placed in the vagina if the prolapse extended beyond the hymen or otherwise interfered with placement of the urethral or rectal catheters. Testing was performed in the sitting position. The pressure transducer was placed at the level of the symphysis pubis and zeroed to atmospheric pressure. Tubing and transducers were initially adjusted until vesical pressure (Pves) and abdominal pressure (Pabd) were within 20%. Room temperature sterile water was infused into the bladder at a continuous 80 mL/min until the patient indicated a strong desire to void, with the infusion rate adjustable downward to as low as 30 mL/min if the patient reported sensory urgency. Threshold volumes were recorded for first desire to void, strong desire to void or cystometric capacity, and detrusor contraction. Valsalva maneuvers were performed beginning at 150 mL and continuing at 50 mL increments up to 300 mL to elicit incontinence. Patients were then asked to void with the catheter in place.
For patients reporting stress incontinence not elicited by the test, a second fill was performed and the urinary catheter removed, after which the patient was stressed at bladder capacity, both sitting and standing, in an attempt to reproduce incontinence. Measures recorded during the storage phase were filling rate, pressure and volume at first sensation and first desire to void, maximum bladder capacity, Valsalva leak point pressure, presence or absence of leak with stress maneuver, presence or absence of urgency, presence or absence of leakage with urgency, and maximum filling pressure (Pdet), and during the voiding phase were time to void, voided time, voided volume, average flow, maximum flow, average Pdet, Pdet at maximal void, and postvoid residual urine volume. Subjective results (leaks, urgency, pain) were noted on the data sheet by the nurse.
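As a small illustration of the acquisition rules described above, the following Python sketch encodes the uroflow discard criteria, the Valsalva volume schedule, and the standard subtraction of abdominal from vesical pressure to obtain detrusor pressure. The function and variable names are hypothetical and are not part of the study's software, and the handling of values exactly at the 50 mL and 50 mL/sec limits is an assumption.

```python
def uroflow_is_usable(voided_volume_ml, max_flow_ml_per_sec):
    """Keep a noninstrumented uroflow unless volume < 50 mL or peak flow > 50 mL/sec.

    Treating values exactly at the limits as usable is an assumption.
    """
    return voided_volume_ml >= 50 and max_flow_ml_per_sec <= 50


# Bladder volumes (mL) at which Valsalva maneuvers were performed:
# starting at 150 mL, in 50 mL increments, up to 300 mL.
VALSALVA_VOLUMES_ML = list(range(150, 301, 50))


def detrusor_pressure(p_ves, p_abd):
    """Detrusor pressure is vesical pressure minus abdominal pressure (same units)."""
    return p_ves - p_abd
```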
To describe the patient spectrum, diagnoses were pooled across physicians and times. We used weighted κ chance-corrected agreement statistics, which compare an overall agreement score to its average under random reassignments to patients, to summarize both reliability and interobserver agreement.10 To determine κ, an agreement score is obtained as the mean across response pairs of weights quantifying the extent to which the pair agree, from complete disagreement (weight of 0) to perfect agreement (weight of 1). Weighted κ is then the difference between observed and chance-based agreement scores, divided by the difference when agreement is maximized without changing each physician's diagnostic distribution. Values of κ of 0.8–1.0, 0.6–0.8, 0.4–0.6, 0.2–0.4, and less than 0.2 are widely interpreted as denoting almost perfect, substantial, moderate, fair, and slight or poor agreement, respectively.11
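The computation described above can be sketched in Python as follows. This is a minimal illustration and not the study's actual code: it uses the standard Cohen form in which the maximum attainable agreement score is taken to be 1, and the function names, the example ratings, and the treatment of values exactly at the interpretive cut points are assumptions.

```python
from collections import Counter


def pure_weight(x, y):
    """Pure agreement weights: identical responses score 1, all others 0."""
    return 1.0 if x == y else 0.0


def weighted_kappa(ratings_a, ratings_b, weight):
    """Return (kappa, observed, chance) for two equal-length lists of responses.

    weight(x, y) is the agreement credit in [0, 1] for a response pair.
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("ratings must be non-empty and of equal length")
    # Observed score: mean weight over the actual response pairs.
    observed = sum(weight(a, b) for a, b in zip(ratings_a, ratings_b)) / n
    # Chance score: mean weight over all pairings of the two sets of responses,
    # i.e. the expected score under random reassignment to patients.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    chance = sum(
        count_a[x] * count_b[y] * weight(x, y) for x in count_a for y in count_b
    ) / (n * n)
    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - chance) / (1.0 - chance), observed, chance


def agreement_label(kappa):
    """Conventional verbal label for kappa (cut points inclusive above; an assumption)."""
    if kappa < 0.0:
        return "poor"
    for upper, label in ((0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical example: two reviews of four tracings.
first = ["normal", "overactive", "normal", "overactive"]
second = ["normal", "overactive", "overactive", "overactive"]
kappa, observed, chance = weighted_kappa(first, second, pure_weight)
```

In this made-up example the observed score is 0.75 and the chance score is 0.5, so κ is 0.5, which would be labeled moderate.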
Each physician provided replicate responses per patient to each diagnostic question, yielding 600 pairs for reliability assessment and 6,000 across-physician comparisons for assessing interobserver agreement. These were analyzed using 1) pure agreement weights (identical response pairs scored 1 and all others scored 0), and 2) half-credit weights (response pairs including one “unsure or cannot be determined” scored 0.5). For urodynamic diagnosis of bladder sensation, half-credit weights were refined to 0.5 for (Reduced, Absent), 0.25 for (Reduced, Normal), and 0.2 for (Increased, Normal), and (Absent, Unsure or cannot be determined).
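The weighting schemes and comparison counts above can be written out as a short Python sketch. The category labels and function names are hypothetical. Two points are assumptions rather than statements from the study: that sensation pairs not listed above fall back to the general half-credit rule, and that the 6,000 comparisons decompose as 15 physician pairs times 4 review combinations (first-first, second-second, and the two first-second crossings) times 100 patients, which is one decomposition consistent with the stated totals.

```python
from math import comb

UNSURE = "unsure"  # stands for "unsure or cannot be determined"


def half_credit_weight(x, y):
    """Identical responses score 1; a pair including one 'unsure' scores 0.5; else 0."""
    if x == y:
        return 1.0
    if UNSURE in (x, y):
        return 0.5
    return 0.0


# Refined partial credits for bladder sensation, keyed by unordered response pair.
SENSATION_PARTIAL = {
    frozenset(("reduced", "absent")): 0.5,
    frozenset(("reduced", "normal")): 0.25,
    frozenset(("increased", "normal")): 0.2,
    frozenset(("absent", UNSURE)): 0.2,
}


def sensation_weight(x, y):
    """Refined weights for bladder sensation; unlisted pairs use half-credit (assumption)."""
    if x == y:
        return 1.0
    pair = frozenset((x, y))
    if pair in SENSATION_PARTIAL:
        return SENSATION_PARTIAL[pair]
    return half_credit_weight(x, y)


n_physicians, n_patients, n_reviews = 6, 100, 2

# One replicate pair per physician per patient.
reliability_pairs = n_physicians * n_patients

# Assumed decomposition: physician pairs x review combinations x patients.
physician_pairs = comb(n_physicians, 2)
review_combinations = n_reviews ** 2
interobserver_comparisons = physician_pairs * review_combinations * n_patients
```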
Weighted κ-statistics for reliability were computed as ratios of the summed numerators to summed denominators of the weighted κ's for individual reviewers. Similarly, weighted κ's for interobserver agreement were ratios of the summed numerators to summed denominators of weighted κ's between reviewer pairs compared on first, second, or between first and second chart reviews. Reliability and interobserver agreement were also examined separately among urogynecologists, among urologists, and between the subspecialties, using both sets of weights. To study whether agreement might be higher in patients sharing important symptoms, signs, or chief complaints, patients were partitioned into subgroups with and without these features, or by count on presentation of several related symptoms. Reliability and interobserver agreement were then compared between subgroups using half-credit weights. Table 4 shows the elements considered. Hypothesis testing was not used for these comparisons, due to their multiplicity and exploratory nature.
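A pooled κ of this ratio form can be sketched as below. The sketch assumes each reviewer (or reviewer pair) contributes a numerator equal to its observed minus chance agreement score and a denominator equal to 1 minus its chance score; the function name and the example values are hypothetical.

```python
def pooled_kappa(components):
    """Pool kappas as summed numerators over summed denominators.

    components: iterable of (observed, chance) agreement scores,
    one tuple per reviewer or per reviewer pair.
    """
    components = list(components)
    numerator = sum(observed - chance for observed, chance in components)
    denominator = sum(1.0 - chance for _, chance in components)
    if denominator == 0.0:
        raise ValueError("pooled kappa is undefined when every chance score equals 1")
    return numerator / denominator


# Hypothetical scores for two reviewers: individual kappas are 0.8 and 0.25.
example = [(0.9, 0.5), (0.7, 0.6)]
pooled = pooled_kappa(example)  # (0.4 + 0.1) / (0.5 + 0.4)
```

Unlike a simple mean of the individual κ's (0.525 in this example), the ratio form gives more influence to reviewers whose chance agreement is lower, that is, whose denominators are larger.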
The clinical impact of disagreement depends upon its frequency without chance correction; hence, we report κ and raw agreement score jointly. Standard errors and bias-corrected 95% confidence intervals were obtained from 200 bootstrap replications. These reflect variability from sampling 100 patients to estimate reliability and interobserver agreement across similar patients of participating clinicians. Because these clinicians are neither a random sample nor clearly representative of any broader group, formal inference beyond them is unjustified. However, the external relevance of our results is reviewed below under Discussion.
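The resampling approach can be illustrated with a short Python sketch that resamples patients with replacement and applies Efron's bias-corrected percentile method. The text does not specify which bias-corrected variant or software was used, so the method, the function names, the seed, and the example data are all assumptions.

```python
import random
from statistics import NormalDist, stdev


def raw_agreement(pairs):
    """Proportion of patients whose two responses are identical."""
    return sum(a == b for a, b in pairs) / len(pairs)


def bootstrap_bias_corrected(patients, statistic, n_boot=200, alpha=0.05, seed=0):
    """Return (estimate, standard_error, (lower, upper)) by resampling patients."""
    rng = random.Random(seed)
    n = len(patients)
    estimate = statistic(patients)
    replicates = sorted(
        statistic([patients[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    std_error = stdev(replicates)

    # Bias correction: where the original estimate sits among the replicates.
    below = sum(r < estimate for r in replicates)
    # Clamp away from 0 and 1 so the normal quantile stays finite.
    fraction = min(max(below / n_boot, 0.5 / n_boot), 1.0 - 0.5 / n_boot)
    normal = NormalDist()
    z0 = normal.inv_cdf(fraction)

    def replicate_at(probability):
        index = min(n_boot - 1, max(0, int(probability * n_boot)))
        return replicates[index]

    lower = replicate_at(normal.cdf(2 * z0 + normal.inv_cdf(alpha / 2)))
    upper = replicate_at(normal.cdf(2 * z0 + normal.inv_cdf(1 - alpha / 2)))
    return estimate, std_error, (lower, upper)


# Hypothetical data: 100 patients, 85 with identical replicate responses.
patients = [("yes", "yes")] * 85 + [("yes", "no")] * 15
estimate, std_error, (lower, upper) = bootstrap_bias_corrected(patients, raw_agreement)
```

The same function could be applied to κ by passing a statistic that computes κ from the resampled patients, since the patient is the resampling unit in either case.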
One hundred patients were chosen to 1) estimate each physician's raw agreement with standard error of 0.05 or less, and aggregate agreement with standard error of 0.02 or less assuming no cluster design effect on agreement; and 2) produce 95% confidence intervals for reliability κ of width approximately 0.1 when κ = 0.8, and 0.15 or less for κ of 0.35 or more, for uniform dichotomies and approximately so with the addition of a small “unsure” category.12
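The first criterion can be checked with the usual binomial standard error of a proportion, which is largest when the true agreement is 0.5. The sketch below assumes independent response pairs (no cluster design effect, as stated above) and treats the aggregate as 600 replicate pairs; the function name is hypothetical.

```python
from math import sqrt


def worst_case_standard_error(n_pairs):
    """Largest binomial standard error of a proportion, reached at p = 0.5."""
    return sqrt(0.25 / n_pairs)


per_physician_se = worst_case_standard_error(100)  # 100 replicate pairs per physician
aggregate_se = worst_case_standard_error(600)      # 6 physicians x 100 patients
```

Under these assumptions the worst-case standard error is 0.05 for a single physician and about 0.02 for the aggregate; both are smaller when raw agreement is well above 0.5, as it was for most diagnoses.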
Table 1 shows basic characteristics of 100 females who underwent urodynamic investigation, and Table 2 the proportions of study patients exhibiting International Continence Society symptoms and signs. Not surprisingly, nocturia, increased daytime frequency, urgency, stress and urge incontinence, and a feeling of incomplete emptying were commonly reported, whereas prolapse and urethral hypermobility were commonly observed on clinical examination.
We confine our reporting to results using half-credit weights. Implications of results from the two weighting systems were similar, and it is clinically appropriate to acknowledge the distinction between clear disagreement and diagnostic uncertainty. Table 5A shows pooled diagnoses, raw agreement, and κ with standard error and 95% confidence interval, for each International Continence Society urodynamic filling phase item. Reliabilities of bladder sensation, urgency, and detrusor function were moderate to substantial, with κ's of 0.59 to 0.67 and raw agreement proportions of 80–85%. Reliability of urodynamic stress incontinence was almost perfect, with κ = 0.81 and raw agreement of 91%. Interobserver agreement was moderate to substantial for urgency, detrusor function, and urodynamic stress incontinence, with κ's of 0.59 for urgency to 0.74 for stress incontinence, and raw agreement from 78–87%. However, raw agreement on bladder sensation was only 64%, with κ a slight 0.18.
Table 5B shows corresponding measures for International Continence Society urodynamic voiding phase diagnoses. Raw agreement levels for replicate evaluations by the same clinician were 81% for detrusor function and 88% for urethral function, comparable to levels seen for filling phase diagnoses. However, the respective chance-adjusted values of κ=0.54 and 0.49 reflect only moderate agreement. Interobserver agreement for these diagnoses was only fair, with κ=0.36 and 0.25.
Results for clinical diagnoses are shown in Table 6. For 12.8% of reviews, reviewers diagnosed bladder overactivity both with and without incontinence, indicating confidence in overactivity but uncertainty about incontinence. We thus also considered the diagnosis of bladder overactivity regardless of the presence of incontinence by combining responses to these two items. With κ=0.78 and raw agreement of 89%, the clinical diagnosis of stress urinary incontinence was comparably reliable to its urodynamic diagnosis. The other clinical diagnoses were less reliable, with κ's ranging from 0.40 to 0.58 and raw agreements of 78–88%. Agreement between observers on stress incontinence was substantial, with κ=0.68 and raw agreement of 84%. Between-observer agreement was disappointing for voiding dysfunction and for bladder overactivity with, without, or regardless of incontinence. For these diagnoses, κ and raw agreement ranged from 0.13 and 58%, respectively, for overactivity without incontinence to 0.38 and 72% for overactivity with incontinence.
Reliabilities of female urologists and urogynecologists were similar, and interobserver agreement between the urologists and urogynecologists was similar to that within either specialty. The combinations of clinical presentation features in Table 4 and the diagnostic variables shown in Tables 5 and 6 produced numerous comparisons, some involving small patient subgroups. The great majority of differences were neither clinically substantial nor large in relation to chance variation. Contrary to expectation, observed agreement was usually lower when a stipulated symptom, sign, or chief complaint was present than when absent. (This occurred because raw agreement either was not increased, or increased less than its benchmark chance expectation, which rose with patient homogeneity.) Such a tendency, if real, would suggest that absence of symptoms and signs was used more consistently to rule out related diagnoses than their presence was to rule them in. Such diagnostic behavior is logical for disease markers with good sensitivity but less than desirable specificity.
To assess the sensitivity of our results to case complexity generally, including the inclusion in our cohort of patients under anticholinergic therapy and patients with prolapse requiring packing during testing, we replicated the weighted κ analysis on the subset of 31 patients without prolapse and who had never been treated with anticholinergics. The results did not materially change from those on the full group.
We evaluated reliability and interobserver agreement in interpretations of multichannel urodynamic testing of female patients. Not surprisingly, reliability (self-agreement) was better than interobserver agreement, due to interphysician differences in interpretive styles and criteria. Diagnosis of stress urinary incontinence from urodynamic investigation, both by International Continence Society definition and clinically, appeared reliable and consistent across our physician group. International Continence Society filling (storage) phase diagnoses of detrusor overactivity with or without incontinence proved moderately reliable, but interobserver agreement on these clinical diagnoses was fair to poor. International Continence Society voiding phase diagnoses showed moderate reliability and fair interobserver agreement.
These data portray one female pelvic medicine referral center, not a broad sampling of urodynamic practice. Our setting features high specialization, collaboration, and urodynamic usage, all expected to promote agreement. However, we see many referrals after diagnostic or treatment failures. These patients likely present unusual diagnostic challenges, promoting greater ambiguity. Twenty-six percent of patients otherwise eligible for this study were excluded because of incomplete charts. Whether incomplete charting is related to diagnostic difficulty, and potentially to agreement on urodynamic interpretation, is unknown.
Our tests are performed by nurses who instruct in urodynamic technique. We thus compared interpretations of urodynamic tracings by clinicians who were not present during the actual studies. Although direct observation is usually recommended,9 nationally most urodynamic tests are performed with physicians absent, or present only during a critical portion of the test. Studying agreement with multiple physicians present during testing would have posed serious methodologic, logistic, and even ethical challenges. Separation of diagnosticians from the test environment may have prevented bias from their interactions during the tests. As is common practice, nurses annotated urodynamic tracings with filling phase events. These annotated tracings were seen by all reviewers, uniformly conveying some clinically important information from direct test observation, and probably increasing agreement.
Although replication is certainly needed, our data show that urodynamic interpretation can be quite consistent for diagnosing stress urinary incontinence. In contrast, interobserver agreement for bladder sensation, voiding phase detrusor and urethral function, and clinical diagnoses of detrusor instability and voiding dysfunction was poor to fair. Unless agreement is dramatically higher in other clinical settings, this is cause for concern. Observer agreement studies in other practice environments are needed.
The variability of any diagnostic test arises from short-term fluctuations in the patient's condition(s), variability in circumstances and procedures of test administration and, as we examined, in test interpretation between and within clinicians over time. Each of these components has, itself, multiple contributing factors. Moderate uncontrolled variability in several test components may produce poor performance even for a test with a compelling rationale. Uroflowmetry shows considerable variability in maximal flow rates13, with the interpretation of maximal flow rates varying typically 1 mL/sec or more.14 Chou et al15 found only moderate reliability and fair to moderate interobserver agreement in classification of uroflow tracings into six diagnostically indicative curve types. Cystometrogram findings are also imprecise, with maximal bladder volume and the presence of spontaneous bladder contractions known to vary between test administrations. One study in men found that the proportion demonstrating detrusor overactivity declined with successive bladder fills (Tubaro A, Renzetti R, Ranieri M, et al. Variability in filling cystometry results in men with LUTS and the impact on the diagnosis of an overactive bladder [abstract]. Neurourol Urodyn 2000;19:425). Gupta et al16 examined the precision of standard urodynamic parameters in 20 asymptomatic women. Uroflowmetry, postvoid volume, medium-fill cystometry, and pressure flow studies were performed, immediately repeated, and urodynamics again repeated in 16 women after 1 to 5 months. Reproducibility was low when results from the three series of measurements were compared. Our data add to the literature documenting the fallibility of urodynamic testing by suggesting that interpretive differences may contribute to diagnostic errors.
Comparison of urodynamic diagnoses with highest and lowest interobserver agreement is instructive. Best agreement (κ=0.74) was achieved for urodynamic stress incontinence, which is defined unambiguously as leakage without contraction.8 Both leakage and contraction are readily observable during urodynamics. Lowest agreement was seen for bladder sensation (κ=0.18) and urethral function during voiding (κ=0.25). International Continence Society definitions of these diagnoses are not anchored to quantitative referents for “normal” and hence depend almost exclusively on subjective terms to distinguish diagnostic categories. Further, the definition of urethral function incorporates terms acknowledged to be confusing.8 In this context, reviewers expressed uncertainty about detrusor and urethral functions during voiding for 19% and 33% of cases, respectively, and uncertainty about clinical diagnosis of voiding dysfunction for 26% of cases. It may be possible to improve interpretive consistency through greater standardization and quantification of terms and definitions by the International Continence Society. Careful studies would be required to support such changes and confirm their effects.
The value and role of urodynamic testing have been debated for decades. Based on data from 19 articles spanning 17 years, Jensen et al2 concluded that patient history alone is inadequate for diagnosing stress incontinence or detrusor overactivity. However, other studies suggest clinical history and physical examination can adequately distinguish between these conditions.17–19 Weber et al3 suggested that a standardized basic office assessment is as cost-effective as urodynamic testing for the routine work-up of stress urinary incontinence. However, only 7.8% of patients studied by Weidner et al20 met Agency for Health Care Policy and Research (now Agency for Healthcare Research and Quality) criteria for bypassing urodynamics.
Ideally, a test can be judged by its effect on patient outcomes, but the sometimes conflicting outcome domains for lower urinary tract disorders complicate such judgments for urodynamics. In a large cohort study of surgical effect on stress incontinence morbidity, preoperative urodynamic testing was not associated with improved outcomes.21 An outcome-based decision analysis of Walters et al22 suggests urodynamics are unnecessary for surgical management of stress incontinence when detrusor overactivity is unusual in the patient population. However, observational studies have been too limited to determine the benefit of urodynamic testing to patient outcomes, and no clinical trials have been done.
Tests of moderate accuracy can still provide substantial clinical benefit. However, without an external standard or comparative test, the clinical benefit of urodynamic testing remains unknown. Improved standardization of both urodynamics and the initial assessment of women with lower urinary tract dysfunction may be necessary to provide a consensus framework for controlled evaluation. Development of such a framework, followed by randomized trials comparing outcomes between women receiving urodynamics and those receiving other office-based testing, seems the most promising way forward. Meanwhile our data suggest that, at least in similar centers, urodynamics is consistent within and across observers for diagnosis of stress urinary incontinence in females, but may lack such consistency for diagnosing other lower urinary tract dysfunctions.
1. Wagner TH, Hu TW. Economic costs of urinary incontinence in 1995. Urology 1998;51:355–61.
2. Jensen JK, Nielsen FR Jr, Ostergard DR. The role of patient history in the diagnosis of urinary incontinence. Obstet Gynecol 1994;83:904–10.
3. Weber AM, Taylor RJ, Wei JT, Lemack G, Piedmonte MR, Walters MD. The cost-effectiveness of preoperative testing (basic office assessment vs. urodynamics) for stress urinary incontinence in women. BJU Int 2002;89:356–63.
4. Norton PA, Zinner NR, Yalcin I, Bump RC; Duloxetine Urinary Incontinence Study Group. Duloxetine versus placebo in the treatment of stress urinary incontinence. Am J Obstet Gynecol 2002;187:40–8.
5. Dmochowski RR, Miklos JR, Norton PA, Zinner NR, Yalcin I, Bump RC, et al. Duloxetine versus placebo for the treatment of North American women with stress urinary incontinence [published erratum appears in J Urol 2004;171:360]. J Urol 2003;170:1259–63.
6. Homma Y, Batista J, Bauer S, Griffiths D, Hilton P, Kramer G, et al. Urodynamics. Plymouth (UK): Health Publications; 2002.
7. Glazener CM, Lapitan MC. Urodynamic investigations for management of urinary incontinence in adults. Cochrane Database Syst Rev 2002 (3):CD003195.
8. Abrams P, Cardozo L, Fall M, Griffiths D, Rosier P, Ulmsten U, et al. The standardisation of terminology of lower urinary tract function: report from the Standardisation Sub-committee of the International Continence Society. Am J Obstet Gynecol 2002;187:116–26.
9. Schafer W, Abrams P, Liao L, Mattiasson A, Pesce F, Spangberg A, et al. Good urodynamic practices: uroflowmetry, filling cystometry, and pressure-flow studies. Neurourol Urodyn 2002;21:261–74.
10. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213–20.
11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
12. Shoukri MM. Measures of interobserver agreement. New York (NY): Chapman & Hall/CRC; 2004.
13. Golomb J, Lindner A, Siegel Y, Korczak D. Variability and circadian changes in home uroflowmetry in patients with benign prostatic hyperplasia compared to normal controls. J Urol 1992;147:1044–7.
14. Witjes WP, de la Rosette JJ, Zerbib M, Vignoli GC, Geffriaud C, Debruyne FM, et al. Computerized artifact detection and correction of uroflow curves: towards a more consistent quantitative assessment of maximum flow. Eur Urol 1998;33:54–63.
15. Chou TP, Gorton E, Stanton SL, Atherton M, Baessler K, Rienhardt G. Can uroflowmetry patterns in women be reliably interpreted? Int Urogynecol J Pelvic Floor Dysfunct 2000;11:142–7.
16. Gupta A, Defreitas G, Lemack GE. The reproducibility of urodynamic findings in healthy female volunteers: results of repeated studies in the same setting and after short-term follow-up. Neurourol Urodyn 2004;23:311–6.
17. Farrar DJ, Whiteside CG, Osborne JL, Turner-Warwick RT. A urodynamic analysis of micturition symptoms in the female. Surg Gynecol Obstet 1975;141:871–81.
18. Hastie KJ, Moisey CU. Are urodynamics necessary in female patients presenting with stress incontinence? Br J Urol 1989;63:155–6.
19. Lagro-Janssen AL, Debruyne FM, van Weel C. Value of the patient's case history in diagnosing urinary incontinence in general practice. Br J Urol 1991;67:569–72.
20. Weidner AC, Myers ER, Visco AG, Cundiff GW, Bump RC. Which women with stress incontinence require urodynamic evaluation? Am J Obstet Gynecol 2001;184:20–7.
21. Black N, Griffiths J, Pope C, Bowling A, Abel P. Impact of surgery for stress incontinence on morbidity: cohort study. BMJ 1997;315:1493–8.
22. Walters MD, Shields LE. The diagnostic value of history, physical examination, and the Q-tip cotton swab test in women with urinary incontinence. Am J Obstet Gynecol 1988;159:145–9.
© 2006 by The American College of Obstetricians and Gynecologists. Published by Wolters Kluwer Health, Inc. All rights reserved.