Intraesophageal multiple intraluminal impedance (MII), a method developed by Silny (1), detects bolus movement in the esophagus by changes in electric resistance/impedance between 2 electrodes. Retrograde bolus movements can be distinguished from antegrade bolus movements. An additional pH-sensitive electrode allows the differentiation between acid and nonacid reflux. According to the new European Society for Pediatric Gastroenterology, Hepatology, and Nutrition/North American Society for Pediatric Gastroenterology, Hepatology, and Nutrition guideline, this method is superior to pH monitoring alone in the diagnosis of gastroesophageal reflux disease (GERD) in children (2).
After data recording with a portable device, software-aided analysis of the MII/pH tracing can be performed on a personal computer. The analysis software is continuously improving, but visual validation of the results is still necessary.
There are as yet no pediatric standards published on how data analysis should be performed. In 2005, the German Pediatric Impedance Group was formed by 4 German children's hospitals using MII-pH monitoring in reflux diagnostics: the Children's University Hospital Aachen, the Children's University Hospital Bochum, the Children's University Hospital Erlangen, and the Children's Hospital Worms. One aim was to develop standards and improve the quality of data analysis and interpretation. A first step was the assessment of inter- and intraobserver agreement among the 4 centers.
MII/pH studies are routinely performed in each of the above-named centers on children with suspected GERD. Each center uses the ambulatory Sleuth MII-pH system (Sandhill Scientific, Denver, CO). The system works with 3 different age-appropriate catheters depending on the patient's height, each mounted with 7 impedance electrodes and 1 pH-sensitive antimony electrode. Twenty-four measurements were randomly selected from the data pool of the 4 hospitals, 6 from each institution. The requirements were a minimum study duration of 18 hours and no artifactitious tracings resulting from catheter dysfunction. A catheter dysfunction was assumed when, for example, ≥1 impedance channel showed continuously high impedance without oscillation over several hours. The measurements and the analysis were performed using a predefined protocol.
The recorded data were analyzed automatically with the BioView AutoScan (Sandhill Scientific) analysis software accompanying the Sleuth system. After software-aided analysis, every measurement was validated visually for retrograde bolus movements (RBM) by 1 investigator from the institution where the measurement was originally performed. As instructed by the protocol, each analyst visually validated each tracing on a personal computer using a time window of 2 minutes per screen. A retrograde bolus movement was defined as a consecutive retrograde drop in impedance in at least the distal 2 channels (1).
Every RBM event was marked for beginning, end, and proximal extent as judged by the investigator. Only liquid and mixed air-liquid bolus movements were considered; gas-only episodes and meal periods were excluded. No further instructions were given for analysis in the protocol.
All 4 participating investigators had more than 3 years of experience with the method in daily clinical practice and had each performed more than 100 visual analyses. Because of spatial separation, each investigator acquired analysis skills mainly through self-education, independently of the other centers. The pH channel was disregarded during analysis because it was not the focus of this study.
The analyzed measurements were collected by the coordinating center. A copy with raw measurement data, that is, with deleted analysis results, was prepared, and copies of 6 measurements were sent to each of the 4 study hospitals for a second analysis, each institution receiving 2 randomly selected measurements from each of the other centers. Again, data analysis was performed as described above, with the second observer blinded to the results of the first investigator (because these had been deleted), and the results were sent back to the coordinating center.
The 2 analysis results for each of the 24 measurements were compared regarding marked RBM. The following criteria were defined to decide on agreement between 2 marked RBM events:
1. If an event in analysis 1 started within 15 seconds before or after the beginning of an event in analysis 2, it was considered as congruent.
2. If an event in analysis 1 included beginning and end of an event in analysis 2, it was considered as congruent.
3. If there were <10 seconds between the end of an event and the beginning of a new event in the same measurement, they were considered as 1 event.
Because the true number of RBM events cannot be objectively determined, the union of the 2 sets of events detected by observers 1 and 2 was taken as the basis for the comparison of agreement between the 2 analysis results. The agreement was expressed as the percentage of RBM that were marked as an event by both observers from the sum of all marked events.
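The 3 matching criteria and the union-based percentage can be illustrated with a short sketch. It is not the procedure used by the coordinating center; it assumes that events are given as (start, end) pairs in seconds, that "within 15 seconds" includes exactly 15 seconds, that criterion 2 applies in either direction, and that each event is matched at most once in chronological order. All function names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_s, end_s), seconds from start of recording


def merge_close_events(events: List[Event], max_gap_s: float = 10.0) -> List[Event]:
    """Criterion 3: events less than max_gap_s apart count as 1 event."""
    merged: List[Event] = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] < max_gap_s:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def is_congruent(a: Event, b: Event, tol_s: float = 15.0) -> bool:
    """Criterion 1 (starts within tol_s) or criterion 2 (one event contains the other)."""
    starts_close = abs(a[0] - b[0]) <= tol_s  # assumption: boundary value counts
    a_contains_b = a[0] <= b[0] and b[1] <= a[1]
    b_contains_a = b[0] <= a[0] and a[1] <= b[1]  # assumption: symmetric rule
    return starts_close or a_contains_b or b_contains_a


def percent_agreement(obs1: List[Event], obs2: List[Event]) -> float:
    """Events marked by both observers as a percentage of the union of all events."""
    e1, e2 = merge_close_events(obs1), merge_close_events(obs2)
    used = set()  # indices of observer-2 events that are already matched
    agreed = 0
    for a in e1:
        for j, b in enumerate(e2):
            if j not in used and is_congruent(a, b):
                used.add(j)
                agreed += 1
                break
    union = len(e1) + len(e2) - agreed
    return 100.0 * agreed / union if union else float("nan")


# Hypothetical event lists, not study data
observer_1 = [(100, 130), (500, 520), (525, 540), (900, 920)]
observer_2 = [(110, 125), (505, 545), (2000, 2010)]
print(percent_agreement(observer_1, observer_2))  # 50.0
```

In this made-up example the second and third events of observer 1 are merged by criterion 3, 2 of the resulting events are congruent, and the union contains 4 events, giving 50% agreement.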
Additionally, agreement was calculated with the Cohen kappa test for categorical variables. For the Cohen kappa test, the 2-minute time windows without marked events by both observers 1 and 2 were counted and taken as “number of episodes judged negative by both observers.” A kappa coefficient ≤0.1 was considered as no agreement, 0.11 to 0.4 as fair, 0.41 to 0.6 as moderate, 0.61 to 0.8 as substantial, and 0.81 to 1 as perfect agreement.
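As an illustration of this calculation, the following sketch computes the Cohen kappa coefficient from the 4 cells of a 2 x 2 table and applies the classification given above. The function names and the example counts are hypothetical, and the kappa value is assumed to be rounded to 2 decimals before classification.

```python
def cohen_kappa(both_pos: int, only_obs1: int, only_obs2: int, both_neg: int) -> float:
    """Cohen kappa for 2 observers and a yes/no judgment per episode."""
    n = both_pos + only_obs1 + only_obs2 + both_neg
    p_observed = (both_pos + both_neg) / n
    p1_pos = (both_pos + only_obs1) / n  # share of episodes marked by observer 1
    p2_pos = (both_pos + only_obs2) / n  # share of episodes marked by observer 2
    p_expected = p1_pos * p2_pos + (1 - p1_pos) * (1 - p2_pos)
    if p_expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is complete
    return (p_observed - p_expected) / (1 - p_expected)


def classify(kappa: float) -> str:
    """Agreement categories as defined in the text (kappa rounded to 2 decimals)."""
    kappa = round(kappa, 2)
    if kappa <= 0.1:
        return "no agreement"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "perfect"


# Hypothetical counts: 20 events marked by both observers, 5 only by observer 1,
# 10 only by observer 2, and 15 two-minute windows without any marked event.
k = cohen_kappa(20, 5, 10, 15)
print(round(k, 2), classify(k))  # 0.4 fair
```

Here the observed agreement is 0.7 and the agreement expected by chance is 0.5, so kappa is (0.7 - 0.5)/(1 - 0.5) = 0.4.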
For intraobserver agreement 6 measurements were randomly selected by the coordinating center without knowledge of the results of the first analysis and sent back to the center the measurements originated from for second analysis. These tracings were analyzed twice by the same observer (2 measurements by 3 different observers from 3 different centers) using the same approach as described for interobserver agreement. There was a time period of at least 3 months between the first and second analyses. The study was approved by the local ethics committees of the medical faculties of University Bochum and Aachen, Germany.
The results for interobserver agreement for each of the 24 measurements are presented in Table 1. Thirteen of the 24 measurements showed a kappa coefficient >0.8, judged as perfect agreement; 9 measurements showed substantial agreement (kappa coefficient 0.61–0.8), 1 moderate agreement (kappa coefficient 0.41–0.6), and 1 fair agreement (kappa coefficient 0.11–0.4). The sum of all measurements taken together showed a kappa coefficient of 0.80, classified as substantial agreement. The median kappa coefficient was 0.83. The percentage of agreed events between the 2 observers from the sum of all marked events is shown graphically in Figure 1.
The results for intraobserver agreement for each of the 6 measurements are presented in Table 2. Five of the 6 measurements showed a kappa coefficient >0.8, judged as perfect agreement, and 1 measurement showed substantial agreement (kappa coefficient 0.72). The sum of all measurements showed a kappa coefficient of 0.88, classified as perfect agreement. The median kappa coefficient was 0.88. The percentage of agreed events between the 2 analyses of the same observer from the sum of all marked events is shown graphically in Figure 2.
Reflux diagnosis with MII is more difficult than with pH monitoring; an investigator experienced in this special pattern recognition is mandatory. Depending on signal quality, this can sometimes be a challenging task. To use this method in clinical practice and also to obtain reproducible study results, it is important to be aware of the agreement between different independent investigators.
To our knowledge, this is the first study on inter- and intraobserver agreement for MII performed between different centers. The investigators from the different centers acquired analysis skills independently of each other.
Two other published studies, 1 by Peter et al (3) and 1 by Dalby et al (4), addressing a similar issue worked with observers from the same institution, presumably being trained by the same person, which may lead to the same analysis “style” and therefore better agreement results.
Looking at the Cohen kappa values we found substantial (9 of 24) to perfect (13 of 24) agreement in most of the measurements of our study. Only 1 measurement showed a moderate and only 1 a fair agreement, reflected by a median kappa value of 0.83.
The time frame of 2 minutes for the “negative event” box in the Cohen kappa calculation was chosen arbitrarily, which may influence the kappa value. However, the kappa values did not change when a different time window, for example, 1 or 4 minutes, was used for the “number of episodes judged negative by both observers,” as long as the number exceeded 100 episodes. We therefore chose the 2-minute time window, which was also the window used in the visual analysis.
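The influence of the window length can be explored with a small sketch. The event counts are those of measurement A1 as given in the text (37 events by observer 1, 39 by observer 2, 32 congruent, hence 5 and 7 marked by only 1 observer); the numbers of negative windows are hypothetical values roughly corresponding to 4-, 2-, and 1-minute windows in a recording of about 20 hours, and are not taken from the study data. The closed form used here is algebraically equivalent to the usual definition of the Cohen kappa for a 2 x 2 table.

```python
def kappa_2x2(a: int, b: int, c: int, d: int) -> float:
    """Cohen kappa for a 2 x 2 table.

    a = marked by both, b = only observer 1, c = only observer 2, d = marked by neither.
    """
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))


both, only_1, only_2 = 32, 5, 7  # measurement A1, derived from the counts in the text

# Hypothetical numbers of windows judged negative by both observers
negative_windows = {"4-minute": 300, "2-minute": 600, "1-minute": 1200}

kappas = {label: kappa_2x2(both, only_1, only_2, d) for label, d in negative_windows.items()}
for label, value in kappas.items():
    print(f"{label} window: kappa = {value:.3f}")

# With ever more negative windows, kappa approaches 2a / (2a + b + c)
upper_limit = 2 * both / (2 * both + only_1 + only_2)
print(f"limit for very many negative windows: {upper_limit:.3f}")
```

Under these assumptions kappa rises only from about 0.82 to about 0.84 between the 4-minute and the 1-minute window and remains below the limit of about 0.84, which is in line with the observation that the window length had little influence once the number of negative episodes was large.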
Peter et al (3) compared the interobserver agreement in twenty 3- to 6-hour recordings between 3 investigators from the same institution. They found median kappa values of 0.79, 0.83, and 0.83 for the 3 pairs of investigators. In contrast to our study, they did not have outliers with only fair to moderate agreement. Because we compared the results of investigators from different centers and used longer recording times, our study gives a more valid picture of interobserver agreement.
Dalby et al (4) compared the analysis results of 2 investigators in thirty 24-hour MII/pH measurements using the Bland-Altman plot. They found a low variability between investigators. The Bland-Altman plot takes the total number of reflux episodes into account, but not whether the events seen by both investigators are congruent. For example, our patient A1 (Table 1) had 37 reflux episodes judged by observer 1 and 39 by observer 2. The Bland-Altman plot would have indicated a low variability between the investigators. However, as shown in Table 1, column 4, only 32 episodes were congruent (73% of all marked events). Therefore, using the Cohen kappa coefficient gives a more precise picture of interobserver agreement.
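The arithmetic behind the A1 example can be retraced in a few lines. The 3 counts are taken from the text; the variable names are illustrative only.

```python
events_observer_1 = 37
events_observer_2 = 39
congruent_events = 32

# A comparison of totals only sees the difference in the number of events
count_difference = abs(events_observer_1 - events_observer_2)

# Union of all marked events: congruent events are counted once
all_marked_events = events_observer_1 + events_observer_2 - congruent_events
percent_agreed = 100 * congruent_events / all_marked_events

print(count_difference)       # 2
print(all_marked_events)      # 44
print(round(percent_agreed))  # 73
```

The totals differ by only 2 events, although 12 of the 44 marked events (5 from observer 1 and 7 from observer 2) had no congruent counterpart.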
With the Cohen kappa, the number of no-reflux events is also taken into account. Because these episodes outnumber the reflux events by 6- to 12-fold, this may lead to better kappa values and hence to a supposedly better agreement between investigators. This is the reason why results are also presented as percentage of agreed events between the 2 analyses from the sum of all marked events (Fig. 1). Here, we found a more heterogeneous picture. Most measurements showed good to perfect agreement (7 measurements 70%–79% agreement, 7 measurements 80%–99% agreement), but there also were some measurements with only fair to moderate agreement (4 measurements with 50%–60% agreement, 3 with <40% agreement).
The problem with this approach may be that it does not take the total number of events into account. For example, in measurement B2 there were only 4 reflux events detected by 1 investigator. The other investigator did not find any event. Although the difference between the 2 investigators represents only 4 episodes, the percentage of agreement is 0%. In this case, the kappa value gives a more valid picture of the true agreement, which is why both approaches were chosen and are presented here.
The kappa coefficient of the intraobserver agreement (0.88) was, as one may have expected, slightly higher than that of the interobserver agreement (kappa coefficient 0.83). All except 1 measurement showed perfect results. Looking at the percentage (Fig. 2) we again found a more heterogeneous picture (agreement between 62% and 100%).
Both statistical approaches only judge interrater agreement. The problem with 24-hour pH impedance monitoring is that there is no way to calculate the true number of retrograde bolus movements in the measurement. This is the reason why we cannot comment on the quality of the observer's measurement interpretation.
One question that arises is why some analyses show only fair agreement, whereas other measurements show good to perfect agreement. This does not seem to depend on the investigator. For example, measurements A5, A6, B1, and B2 were analyzed by the same observer pair, but led to totally different results. The heterogeneous intraobserver results also suggest that the difference is the result of variable signal quality.
Peter et al (3) also found consistently low agreement in some measurements and consistently high agreement in others. They also presumed that this may have been the result of signal quality. As with most physiological signals, it is conceivable that this may be a particular problem in children, especially in infants. Additionally, patients with GERD in the pediatric population are often patients with other underlying conditions, for example, neurologically impaired children with impaired esophageal motility, which may have led to a different signal quality with more oscillations in the impedance channels.
Fröhlich et al (5) performed a swallowing test with liquid or viscous fluid under MII recording in 5 patients with surgically corrected esophageal atresia and in a cohort of 6 patients with GER symptoms but without any previous surgery. They found that patients with esophageal atresia showed mostly uncoordinated and often hardly recognizable patterns of bolus entry and exit at the different impedance channels during the swallows, whereas almost all of the individuals from the nonoperated reference group showed a normal complete bolus transit. These data support the presumption that in some patient groups a poorer tracing quality can be expected.
Although good to excellent agreement was achieved in most measurements, excellent inter- and intraobserver agreement in all of the measurements should be the goal, especially in tracings with lower signal quality. Therefore, analysis standards need to be developed. A continuous exchange and consensus finding between experienced investigators from different institutions is needed, which is 1 major aim of the German Pediatric Impedance Group. Less-experienced and new users of the method should be trained by experienced investigators. In addition, sufficient analysis quality of a new investigator should be confirmed on a regular basis.
In a study by van Wijk et al (6), an esophageal MII-manometry catheter was combined with videofluoroscopic images and used for characterization of intraluminal impedance patterns associated with gas reflux. This method could also be used to learn more about bolus movement patterns in low-quality tracings.
The identification of patients with low-quality tracings before beginning the analysis would also help the investigator in the interpretation of measurement results. An approach could be performing a standardized swallow test as used by Fröhlich et al (5). This may help identify patient groups in which a low signal quality must be expected beforehand, for example, patients with esophageal atresia.
A fairly new software tool displaying bolus movement in spatiotemporal color plots (ContourVIEW, Sandhill Scientific) simplifies the recognition of reflux patterns in MII and could also be helpful (7). It was not used in this study but may enhance inter- and intraobserver agreement.
In this study we found good to perfect intra- and interobserver agreement in most measurements; however, in a few tracings there was only fair to moderate agreement. We assume that these heterogeneous results are the result of variable tracing quality. An improvement of analysis results may be achieved by developing a standard analysis protocol and a standardized method for judging tracing quality. More exchange between experienced analysts and better training options for new users of the method, with validation of analysis quality, should be promoted, which is what the German Pediatric Impedance Group stands for (8).
The authors thank Dipl-Math Thorsten Reineke for statistical advice.