Hot flushes have been defined as subjective and transient sensations of heat, most commonly felt in the upper body and head.1 Hot flushes are a sentinel symptom of menopause, with some estimates suggesting that up to 80% of women may experience this bothersome symptom.2,3 Growing interest in hot flushes stimulates questions about how best to measure this phenomenon. Accurately and precisely assessing hot flushes is critical for examining physiological mechanisms, temporally relating hot flushes to other phenomena, and quantifying intervention effects. Although self-reports of symptoms are generally considered to be acceptable for assessment, this is only the case when a suitable objective measure is not available. For example, objective measures of pain are not available, and therefore, pain is considered to be an inherently subjective phenomenon.4 Hot flushes, on the other hand, can be assessed both subjectively and objectively.5,6
Published comparisons between subjective and objective assessment methods of hot flushes are limited to 2 studies.5,6 Because these studies were limited to a single 12- or 24-hour assessment period and included only small groups of women (n = 11–17), additional information is needed to further evaluate the potential discord between subjective and objective assessment methods. Therefore, the purpose of this report is to compare 2 subjective and 1 objective method for assessing hot flush frequency: prospective paper hot flush diaries, electronic event markers, and an ambulatory sternal skin conductance monitor. Using 2 weeks of baseline data from an intervention study, we report on sensitivity, specificity, positive predictive value, and negative predictive value of the subjective measures.
MATERIALS AND METHODS
Participants were recruited from outpatient breast care centers as part of a larger hot flush intervention study. The study was approved by the Vanderbilt University Institutional Review Board and the Indiana University Institutional Review Board. Potentially eligible women were contacted in person in the clinic or by telephone, were informed of the purpose and nature of the study, and were invited to participate. Of 140 eligible women identified, 59 consented to the study (40%); 55 women completed week 1 of baseline and 53 completed week 2, and their data are included here. The women were at least 21 years of age; willing and able to provide informed consent; able to read, write, and speak English; peri- or postmenopausal; experiencing daily hot flushes; in good general health; not using any hot flush treatments (eg, no antidepressants, hormone replacement therapy, high-dose vitamin E, herbs); and nondepressed. All women had been diagnosed with breast cancer, were considered disease free at time of study enrollment, and were at least 4 weeks postcompletion of surgery, radiation, and/or chemotherapy. Women who were taking tamoxifen had to have been taking the drug for at least 6 weeks.
Participants had a mean age of 50 years (SD = 11), were a mean of 41 months postdiagnosis of breast cancer (SD = 39), and had a mean of 15 years of education (SD = 2). The majority were Caucasian (91%), married or partnered (82%), working full time (58%) or part-time (16%), and postmenopausal (64%). Slightly less than half were using tamoxifen (47%). About half (51%) reported the presence of non–breast cancer–related comorbidities such as arthritis. The median yearly household income was over $60,000.
After providing written informed consent, women completed questionnaires (eg, age, race, education, menopausal status) and then participated in two 24-hour ambulatory recording sessions separated in time by 1 week. During these sessions, women were instructed to maintain a paper hot flush diary and use an electronic event marker to record their hot flushes while wearing an ambulatory sternal skin conductance monitor. Participants were compensated with a payment of $50 for their time and effort in participating in the 2-week baseline before being randomized. Women were randomized to receive either the intervention (venlafaxine 37.5–75 mg orally every day) or placebo for 6 weeks before crossing over to the opposite study arm. Dates of breast cancer diagnosis and treatment were abstracted by research nurses from paper and electronic medical records.
For the diaries, women were asked to record the date and time of every hot flush, rate the intensity of each hot flush, and rate how bothered they were by each hot flush. Subjective intensity was measured by using an 11-point numeric rating scale (0, not at all intense, to 10, extremely intense). Subjective bother was also measured using an 11-point numeric rating scale (0, not at all, to 10, extremely bothersome). Each woman was instructed to carry her diary with her at all times throughout the 2 baseline weeks and to record information for each hot flush prospectively as it occurred, rather than retrospectively.
For the event marker, women were instructed to depress 2 red buttons on the face of the sternal skin conductance monitor when they felt a hot flush occurring. This event marker then time-stamped the skin conductance data, indicating the exact time a hot flush was perceived, and produced an audible beep to let the participant know that the hot flush had been recorded.
Sternal skin conductance levels were monitored using Medi-Trace silver/silver chloride electrodes (S'offset, Graphic Controls, Buffalo, NY) and a 0.5-V constant voltage circuit7 built into the front end of a single channel of a Biolog ambulatory recorder (Model 3991 SCL; UFI, Morro Bay, CA). Electrodes were 1.5 cm in diameter and filled with 0.05 mol/L potassium chloride Velvachol (Healthpoint, Fort Worth, TX)/glycol gel.8 Electrodes were attached 1.5 inches below the collarbones and 2 inches on either side of the sternal midline. The Biolog monitor contains a microprocessor and 4 megabytes of memory. It is powered by a standard 9-volt battery, is programmed to sample 12-bit skin conductance data at 1 Hz (once per second), measures 1.3 × 2.8 × 5 inches, and weighs 8 ounces. The monitor was placed in a bag and worn around the waist or across the shoulders. At the conclusion of the monitoring session, the monitor was connected to a personal computer through the Biolog Interface Box (UFI), and data were downloaded. Customized downloading and plotting software (DPS 4.1; UFI) was used to display the data graphically, and standardized procedures were used to evaluate them. Based on prior research,5,6 hot flushes were defined as an increase in skin conductance of at least 2 μmho within a 30-second period.
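The detection rule above can be sketched in Python. This is a minimal illustration and not the algorithm used by the DPS software: the function name `detect_hot_flushes` and the `lockout_s` parameter (a hold-off period that keeps one sustained rise from being counted more than once) are assumptions introduced here. Only the 2 μmho rise, the 30-second window, and the 1 Hz sampling rate come from the text.

```python
def detect_hot_flushes(scl, fs_hz=1.0, window_s=30.0, min_rise=2.0, lockout_s=600.0):
    """Return sample indices at which the rise criterion is first met.

    scl       : sequence of skin conductance samples in micromho
    fs_hz     : sampling rate (1 Hz in the text)
    window_s  : period over which the rise is evaluated (30 s in the text)
    min_rise  : required increase in micromho (2 in the text)
    lockout_s : ASSUMPTION, not from the text; hold-off after each detection
    """
    window = int(round(window_s * fs_hz))
    lockout = max(1, int(round(lockout_s * fs_hz)))
    detections = []
    j = 0
    while j < len(scl):
        start = max(0, j - window)
        # Rise from the lowest level in the preceding window to the current sample
        if scl[j] - min(scl[start:j + 1]) >= min_rise:
            detections.append(j)
            j += lockout  # skip ahead so the same rise is not counted again
        else:
            j += 1
    return detections


# Synthetic trace: flat baseline, a 12-second ramp of 3 micromho, then a plateau
trace = [1.0] * 100 + [1.0 + 0.25 * k for k in range(1, 13)] + [4.0] * 100
flushes = detect_hot_flushes(trace)
```

A slow drift of the same total size spread over several minutes would not meet the criterion, because the rise within any single 30-second window stays below 2 μmho.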
The accuracy of each subjective measure (diary, event marker) was assessed by estimating 4 indexes: sensitivity, specificity, positive predictive value, and negative predictive value. These 4 indexes were calculated using all available hot flushes for each week and also using only waking or only nighttime (sleeping) hot flushes each week. Monitor results were considered to be the reference standard.5,6 In this study, the binary data (hot flush/no hot flush) were clustered because each patient was measured continuously for 24 hours, with hot flushes occurring at multiple times during each weekly monitoring session. The patients constituted the clusters, and the events during the 24-hour ambulatory sessions during each baseline week constituted the units within the cluster. Even though point estimates of diagnostic accuracy are unaffected by clustering, the standard errors of these indexes are usually larger when data are clustered rather than nonclustered. Therefore, ignoring the clustered effect when estimating standard errors yields liberal (overly narrow) estimates of confidence intervals. The confidence intervals for specificity and sensitivity were computed as described by Zhou et al9 by using the ratio estimator for the variance, which appropriately accounts for the clustered effect.10,11 To test whether week 1 versus week 2 and wake versus asleep times differed on each of the 4 computed indexes, we used mixed linear models, with the SAS Mixed Procedure (SAS 8.2 System for Windows; SAS Institute, Cary, NC). The mixed linear models appropriately account for the correlations that exist within a single patient for 3 factors of repeated measures (multiple measures during a 24-hour period, two 24-hour periods, and awake/asleep time).
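The effect of clustering on the standard error can be sketched as follows. The function implements a commonly used form of the ratio estimator for a proportion from clustered binary data, with each woman treated as one cluster. Whether it matches the cited computation in every detail is an assumption, and the function name, the Wald-type interval, and the example counts are illustrative only.

```python
import math


def clustered_proportion_ci(successes, totals, z=1.96):
    """Pooled proportion with a ratio-estimator variance and Wald-type CI.

    successes[i] : events counted as positive for cluster i
                   (eg, monitor-detected flushes that woman i also reported)
    totals[i]    : events at risk for cluster i
                   (eg, all monitor-detected flushes for woman i)
    """
    m = len(totals)
    if m < 2:
        raise ValueError("at least two clusters are needed")
    n_total = sum(totals)
    p = sum(successes) / n_total
    # Residual of each cluster's count from what the pooled proportion predicts
    ss = sum((x - n * p) ** 2 for x, n in zip(successes, totals))
    variance = (m / (m - 1)) * ss / n_total ** 2
    se = math.sqrt(variance)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical counts for four women (not study data)
sens, lower, upper = clustered_proportion_ci([2, 5, 1, 4], [5, 10, 4, 8])
```

The point estimate is the simple pooled proportion, so it is the same as it would be without clustering. Only the variance changes, growing as women differ more from one another in how often they report.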
True negatives (no objective or subjective hot flush) were identified in the following manner. Each weekly 24-hour monitoring period was divided into 96 consecutive 15-minute periods of data. The total number of true negatives for each woman at week 1 was calculated as the total number of the 96 data segments that contained neither an objective flush nor a subjective flush (no diary, no event marker). Week 2 true negatives were similarly calculated. Because the majority of the time during each monitoring period was spent as non–hot flush time, in general, the majority of the 96 data segments were true negatives.
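A minimal sketch of the 15-minute segment approach follows. The text specifies the segmentation only for counting true negatives. Classifying true positives, false positives, and false negatives at the same segment level, and handling one subjective measure at a time, are simplifying assumptions made here, as are all function names and the example event times.

```python
def to_epochs(event_times_s, epoch_s=900, n_epochs=96):
    """Indices of the 15-minute segments that contain at least one event."""
    return {int(t // epoch_s) for t in event_times_s if 0 <= t < epoch_s * n_epochs}


def _ratio(num, den):
    # Undefined when the denominator is zero (eg, no monitor-detected flushes)
    return num / den if den else float("nan")


def accuracy_indexes(monitor_times_s, report_times_s, epoch_s=900, n_epochs=96):
    """Segment-level 2 x 2 counts and the four accuracy indexes."""
    objective = to_epochs(monitor_times_s, epoch_s, n_epochs)
    subjective = to_epochs(report_times_s, epoch_s, n_epochs)
    tp = len(objective & subjective)   # monitor flush and report
    fn = len(objective - subjective)   # monitor flush, no report
    fp = len(subjective - objective)   # report, no monitor flush
    tn = n_epochs - tp - fn - fp       # neither
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Hypothetical event times in seconds from the start of one 24-hour session
result = accuracy_indexes([100, 2000, 5000, 40000], [150, 5100, 70000])
```

Because most of the 96 segments contain no event of either kind, the true-negative count dominates the table, which is why specificity and negative predictive value are driven mainly by non-flushing time.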
RESULTS
Results indicated that sensitivity was very low for both the paper diary and electronic event marker (Table 1, column 2). Sensitivity for each week, and for waking and sleeping hot flushes, was uniformly poor, never exceeding 50%. Although one would expect that the sensitivity for both the diary and event marker would be lower during sleeping than waking times, this was only true for the event marker at week 1 (P = .02). For the diary at week 2, results were nonsignificant (P = .09). In other words, at week 2, sensitivity was as poor for nighttime (sleeping) hot flushes as it was for the daytime (waking) hot flushes. Using 95% confidence intervals, even the upper limit of the confidence intervals for sensitivity never exceeded 60%, suggesting that the true population sensitivity is far lower than desired for both of the subjective measures (Table 1, column 3). When a woman had a true hot flush (as measured by the gold standard), the probability that she would record the hot flush in a paper diary or by pushing an electronic event marker was not very high: between 36% and 50% of the time if she was awake and between 22% and 42% of the time if she was asleep. Because intensity and bother ratings were linked to subjective hot flush frequency (eg, a hot flush must first be reported for intensity and bother ratings to be provided by a subject), the probability of reporting intensity and bother ratings was also low. Intensity and bother ratings were absent between 50% and 64% of the time if a woman was awake and 58% to 78% of the time if a woman was asleep.
Specificity was high (96–98%) for both the diary and event marker, at both weeks and for both waking and sleeping times (Table 1, column 4). The lower limit of the 95% confidence intervals for specificity was never less than 95% (Table 1, column 5).
The positive predictive value was low (34–52%; Table 1, column 6) and negative predictive value was high (94–97%; Table 1, column 8).
DISCUSSION
Findings from this study indicate that instead of using diaries and event markers to overreport hot flushes when they did not exist, women tended to underreport hot flushes when they did exist. Theoretically, it would have been possible for women to erroneously report diary and event marker hot flushes throughout the day, resulting in overreporting subjectively when the gold standard was not true. This would have resulted in low specificity; however, specificity in this study was high, indicating that overreporting was not occurring. In addition, low sensitivity indicated that, in general, women were not very likely to subjectively report hot flushes when they occurred. Thus, low sensitivity and high specificity in this study suggest that women tended to underreport true hot flushes rather than overreport false hot flushes.
The low sensitivity for diary hot flush frequency also suggests that there may be significant missing data and measurement error associated with diary intensity and bother ratings. When diaries are used, intensity and bother ratings are inherently tied to frequency ratings. In other words, hot flushes that are not reported within a diary, by their very nature, do not have intensity and bother ratings. Although one might assume that hot flushes that were not reported were less intense or less bothersome, this hypothesis is not testable because of the lack of intensity and bother ratings for those hot flushes that were not reported. Hot flushes that are not reported may be just as intense or bothersome as those that are reported. Given the large number of hot flushes that did not have intensity and bother ratings (ie, no subjective report), it may be more appropriate in future studies to obtain a single daily rating of overall hot flush intensity and bother rather than ask participants to provide ratings for multiple hot flushes.
Our findings suggest that clinicians should use caution when interpreting research findings from studies using only subjective hot flush assessments, and researchers should be aware of measurement issues when designing future studies. When evaluating interventions, reliance on subjective hot flush reports may lead to the false conclusion that an intervention was effective, when in fact no differences exist (type I error) or the false conclusion that an intervention was not effective, when in fact it was (type II error). For example, if sensitivity is affected by patient characteristics that do not occur with equal frequency within intervention and control groups, or if sensitivity is affected by characteristics of the intervention, variations in reporting rather than in the symptom itself may be mistaken for intervention effectiveness (type I error) or may dilute the true impact of the intervention (type II error).12 If, on the other hand, sensitivity is similar for intervention and control groups (as might be expected for baseline measures of a randomized trial), the main consequence of errors in reporting is type II error.12 As another example, although sensitivity was uniformly low at weeks 1 and 2 (no significant differences between weeks 1 and 2), it is possible that sensitivity may decrease over time, particularly if compliance with subjective reporting methods also decreases over time (eg, over 4, 6, or 8 weeks). Decreasing compliance over time would result in lower sensitivity for diary hot flush frequency and similarly missing data for diary frequency, intensity, and bother. Thus, decreasing sensitivity of diaries over time could be mistaken for intervention effectiveness (eg, fewer hot flushes reported over time, fewer intensity and bother ratings reported) in a single-arm study or a controlled study if sensitivity differentially decreased for intervention and control groups.
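The reasoning about reporting sensitivity and apparent treatment effects can be illustrated with simple expected-value arithmetic. This sketch assumes that reported counts equal true counts multiplied by sensitivity and that false-positive reports are negligible. The function name and every number are hypothetical and are not study data.

```python
def reported_counts(true_control, true_treated, sens_control, sens_treated):
    """Expected reported hot flushes per day in each arm, ignoring false positives."""
    rep_control = true_control * sens_control
    rep_treated = true_treated * sens_treated
    difference = rep_control - rep_treated
    return {
        "reported_control": rep_control,
        "reported_treated": rep_treated,
        "absolute_difference": difference,
        # Undefined if nothing is reported in the control arm
        "relative_reduction": difference / rep_control if rep_control else float("nan"),
    }


# No true effect, but the intervention arm reports a smaller share of its flushes
no_true_effect = reported_counts(10, 10, 0.45, 0.30)

# A real effect (10 versus 6 flushes per day) with equally low sensitivity in both arms
real_effect = reported_counts(10, 6, 0.40, 0.40)
```

In the first case the reports show a reduction of about one third although the true count is unchanged. In the second, a true difference of 4 flushes per day appears as a difference of about 1.6, leaving a smaller reported difference to detect against the same day-to-day variability.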
Findings should be considered in light of study limitations. First, all of the participants were diagnosed with breast cancer. Because women with breast cancer may be more or less attuned to physical sensations, findings bear replication in samples of healthy women without breast cancer. Second, the high specificity is somewhat artificial because we measured objective hot flushes in 15-minute increments for the 24-hour period; therefore, there were many occurrences of truly no gold standard hot flushes (ie, the majority of the monitoring time was spent nonflushing). Thus, specificity would have been high even if diaries and event markers had not been used at all by participants because the majority of true negatives would indeed be correctly identified as true negatives. Third, unlike sensitivity and specificity, which can be considered characteristics of a test, the positive predictive value is affected by the prevalence of the outcome (hot flush).12 Therefore, all other things equal, we might expect low positive predictive value simply because the prevalence of true hot flushes was low.
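The dependence of predictive values on prevalence follows from Bayes' rule and can be sketched as below. The sensitivity, specificity, and prevalence values are hypothetical inputs chosen to lie near the ranges reported above. They are not taken from Table 1, and the function names are illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(true flush | reported), from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total else float("nan")


def negative_predictive_value(sensitivity, specificity, prevalence):
    """P(no true flush | not reported), from Bayes' rule."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    total = true_neg + false_neg
    return true_neg / total if total else float("nan")


# Same hypothetical test characteristics at two prevalences of flushing segments
ppv_low_prev = positive_predictive_value(0.45, 0.97, 0.06)
ppv_high_prev = positive_predictive_value(0.45, 0.97, 0.30)
```

With these illustrative inputs, the positive predictive value is roughly 0.49 when 6% of segments contain a true hot flush and roughly 0.87 when 30% do, even though sensitivity and specificity are unchanged.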
In summary, our results indicate that use of subjective, prospective paper hot flush diaries and electronic event markers may result in serious underestimation of the true number of hot flushes and missing intensity and bother ratings. Sternal skin conductance monitoring should be used for accuracy of measurement when conducting clinical trials of hot flush interventions.