The large majority of the estimated 590 000 radionuclide renal studies performed annually in the USA are conducted at sites that perform fewer than three studies per week [1,2]. The relative infrequency of these studies may lead to a lack of expertise and increase the likelihood of incorrect scan interpretation. Lack of training and expertise may also be an issue for general radiologists, especially among those with nuclear medicine training limited to 3–4 months.
To enhance the care of patients referred for diuretic renography and to develop a framework for a more general approach, software (RENEX) was developed to interpret 99mTc-mercaptoacetyltriglycine (MAG3) diuretic renal scans using the quantitative data derived from the diuretic renography study [3,4]. iRENEX is an expanded version of RENEX that incorporates clinical information into its diagnostic decision algorithm. Both RENEX and iRENEX are heuristic or rule-based decision support systems designed to interpret 99mTc-MAG3 diuretic renal studies and provide the rationale (rules) supporting that interpretation [4–8]. iRENEX was fine-tuned in a retrospective pilot study and showed good agreement with expert interpretations. The goals of the present study were to compare the interpretations of iRENEX with those of expert readers in a prospective study and determine if training by iRENEX could enhance the diagnostic accuracy of experienced resident physicians.
Subjects and methods
This study was approved by the Institutional Review Board and was Health Insurance Portability and Accountability Act compliant; informed consent was waived because the study utilized archived data. Patients were selected from an archived database that contained the raw data of 578 99mTc-MAG3 renal scans dating from 5 January 1994 to 7 April 2015. Of these 578 scans, 282 were categorized as ‘suspected obstruction with furosemide’ and had separate baseline and furosemide acquisitions. The 50 subjects in our study were blindly selected from the group of patients who received furosemide. There were 20 males with ages ranging from 28 years to 82 years (mean age 65.5 ± 15.5 years) and 30 females with ages ranging from 18 years to 74 years (mean age 54.3 ± 15.0 years). Two patients had only a single kidney; consequently, 98 kidneys could be evaluated for suspected obstruction. On the basis of the majority opinion of experts, 19 kidneys were obstructed, nine were indeterminate and 70 were not obstructed.
Acquisition and processing
Patients were hydrated with 500 ml of water and imaged supine. Our imaging protocol was a minor modification of the 1996 Santa Fe Consensus protocol on diuresis renography and consisted of a baseline acquisition followed by a diuretic acquisition if the baseline study was abnormal. If the baseline study was normal, the diuretic acquisition was unnecessary because a normal baseline study excludes obstruction.
A 24-min baseline acquisition was obtained followed by anterior pre-void and post-void images over the bladder; a posterior post-void image over the kidneys was also obtained with the patient in the supine position. All patients received an intravenous injection of approximately 40 mg of furosemide as recommended by recent consensus reports followed immediately by a 20-min acquisition [9,10]. Studies were acquired in a 128 × 128 matrix on an Infinia camera/Xeleris computer (GE Healthcare, Milwaukee, Wisconsin, USA) and ported to a research computer where they were processed using an updated non-commercial in-house version of the original QuantEM renal quantification program (GE Healthcare) [3,11]. The in-house software includes a camera-based 99mTc-MAG3 clearance measurement, incorporates new quality control procedures to improve reproducibility and ports the demographic, quantitative and clinical parameters to the iRENEX software for analysis [11–15].
The processing software automatically defines the whole kidney, background and cortical regions of interest (ROIs) using a static image summed from frames obtained 2–3 min after injection, and background-subtracted curves are generated. The user has the option to replace an automatic ROI with a manual ROI. Calculated quantitative parameters include time to peak counts, the 20 min to maximum count ratio for whole kidney and cortical ROIs, the post-void to pre-void and post-void to maximum count ratios, and the 99mTc-MAG3 clearance. Relative uptake is calculated based on the integral of counts from 60 to 150 s after the bolus reaches the kidneys.
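As a rough illustration of the relative uptake calculation described above, the following sketch integrates the background-subtracted counts of each kidney from 60 to 150 s and expresses each integral as a percentage of the total. The function names, the use of the trapezoidal rule and the convention that time zero is bolus arrival at the kidneys are assumptions made for this example; they are not details taken from the in-house processing software.

```python
from typing import Sequence, Tuple


def window_integral(times_s: Sequence[float], counts: Sequence[float],
                    start_s: float = 60.0, stop_s: float = 150.0) -> float:
    """Trapezoidal integral of a renogram curve over [start_s, stop_s].

    Assumes times_s is increasing and measured from bolus arrival at the
    kidneys (an assumption of this sketch). Only samples that fall inside
    the window are used; no interpolation is done at the window edges.
    """
    if len(times_s) != len(counts):
        raise ValueError("times_s and counts must have the same length")
    pts = [(t, c) for t, c in zip(times_s, counts) if start_s <= t <= stop_s]
    if len(pts) < 2:
        raise ValueError("need at least two samples inside the window")
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for (t0, c0), (t1, c1) in zip(pts, pts[1:]))


def relative_uptake(times_s: Sequence[float],
                    left_counts: Sequence[float],
                    right_counts: Sequence[float]) -> Tuple[float, float]:
    """Return (left %, right %) of the summed 60-150 s integrals."""
    left = window_integral(times_s, left_counts)
    right = window_integral(times_s, right_counts)
    total = left + right
    if total <= 0:
        raise ValueError("total integrated counts must be positive")
    return 100.0 * left / total, 100.0 * right / total
```

With hypothetical curves sampled every 30 s, `relative_uptake(times, left, right)` returns the percentage contribution of each kidney; a solitary kidney would need separate handling.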
Quantitative data derived from the furosemide acquisition include the T1/2 for whole kidney and pelvic ROIs and the ratio of the renal counts in the final frame of the furosemide acquisition to the maximum renal counts in the baseline acquisition. The baseline scan display includes a series of 2-min frames and renogram curves derived from whole kidney and parenchymal ROIs; the post-furosemide display includes a series of 2-min frames and renogram curves derived from whole kidney and pelvic ROIs. The display also includes patient demographics (height, weight, age and sex).
Study design and scan interpretation
We restricted our study population to patients who had an equivocal or abnormal baseline study and who subsequently received furosemide. The utility of a computer-assisted algorithm such as iRENEX lies in the subset of patients who have an equivocal or abnormal kidney on the baseline acquisition, not in the identification of a normal study.
The study design mirrored the anticipated use of iRENEX in the clinical environment. To avoid biasing the interpretation by the iRENEX results, the interpreting physician will first review the diuretic renal study and available clinical data and record a decision (score) regarding the presence of obstruction; only afterwards will the physician receive access to the iRENEX output.
All 50 studies were independently interpreted by three reference readers (experts), each with over 30 years of experience and multiple publications in renal nuclear medicine. The studies were also interpreted by four nuclear medicine residents who had all completed at least 1 year of nuclear medicine training; one had also completed radiology residency training. Physicians had access to the images, data derived from the renogram, and clinical information as a text file from an archived database (see below); the information in the text file was also ported to iRENEX for incorporation into its diagnostic algorithm.
The four experienced residents and the three reference readers independently interpreted each study and scored each kidney on a scale from +1.0 to −1.0; scores ≥0.20 indicated a diagnosis of obstruction with higher scores approaching 1.0 indicating greater confidence in the diagnosis of obstruction. Scores from +0.19 to −0.19 were indeterminate and scores ≤−0.2 indicated non-obstructed kidneys with lower scores (those approaching −1.0) indicating greater confidence in the absence of obstruction. Residents were blinded to the expert scores and scores of the other residents. After an interval of several months, each resident was given access to the scores and rationale provided by iRENEX as well as his/her original scores and asked to reinterpret the 50 studies. iRENEX interpreted the studies using the same scale.
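The scoring scale can be summarized in a few lines of code. This is a minimal sketch of the thresholds stated above; the function name and category labels are hypothetical, and the treatment of values falling strictly between 0.19 and 0.20 (assigned here to the indeterminate band) is an assumption, since the text gives the bands to two decimal places.

```python
def classify_score(score: float) -> str:
    """Map a reader or iRENEX score in [-1.0, +1.0] to a diagnostic category.

    >= +0.20 -> obstructed; <= -0.20 -> not obstructed; otherwise
    indeterminate. The magnitude of the score reflects reader confidence.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie between -1.0 and +1.0")
    if score >= 0.20:
        return "obstructed"
    if score <= -0.20:
        return "not obstructed"
    return "indeterminate"
```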
The RENEX software applies a knowledge base of 60 rules to the quantitative data derived from the baseline and diuretic renal acquisitions to reach a conclusion regarding the presence of renal obstruction. iRENEX modifies the RENEX output by applying a new knowledge base of 56 clinical rules to incorporate the available clinical information [3–6]. Consequently, iRENEX contains a knowledge base with 116 (60 + 56) heuristic interpretation rules, a forward chaining inference engine to determine obstruction and a justification engine to provide the rationale for the interpretation. Examples of the quantitative and clinical rules used by RENEX and iRENEX are provided in previous publications [3,5].
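For readers unfamiliar with forward chaining, the sketch below shows the general technique: rules whose premises are all known facts fire and add their conclusions until nothing new can be derived, and the list of fired rules serves as a simple justification trace. The two rules and all fact names are invented purely for illustration; they are not taken from the RENEX or iRENEX knowledge base, which is described in the cited publications.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    premises: FrozenSet[str]   # facts that must all be known
    conclusion: str            # fact asserted when the rule fires


def forward_chain(facts: Iterable[str],
                  rules: List[Rule]) -> Tuple[Set[str], List[Rule]]:
    """Fire rules repeatedly until no rule adds a new fact."""
    known = set(facts)
    fired: List[Rule] = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conclusion not in known and rule.premises <= known:
                known.add(rule.conclusion)
                fired.append(rule)
                changed = True
    return known, fired


def justify(fired: List[Rule]) -> List[str]:
    """Render the fired rules as a human-readable rationale."""
    return [f"{r.name}: {' AND '.join(sorted(r.premises))} -> {r.conclusion}"
            for r in fired]


# Hypothetical rules, listed out of order to show that chaining
# needs more than one pass over the rule list.
RULES = [
    Rule("R2", frozenset({"poor_drainage", "hydronephrosis_reported"}),
         "obstruction_favored"),
    Rule("R1", frozenset({"prolonged_t_half", "high_postvoid_ratio"}),
         "poor_drainage"),
]
```

Calling `forward_chain({"prolonged_t_half", "high_postvoid_ratio", "hydronephrosis_reported"}, RULES)` derives the intermediate fact first and the final conclusion on the next pass; `justify` then lists the rules in the order they fired.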
The clinical database consisted of 342 possible data entry fields comprising demographic information, clinical history and findings from up to 11 radiological procedures. Variables extracted from the clinical records included the presence of a nephrostomy tube, ureteral stent, renal calculus, ureteropelvic junction calculus, obstructive calculus, ureteral calculus, hydronephrosis, hydroureter, ureteral stricture, flank pain on arrival for the 99mTc-MAG3 study, flank pain after diuretic administration, renal artery stenosis, pyeloplasty and serum creatinine. The dates and results of recent imaging studies (ultrasound, computed tomography, MRI, magnetic resonance angiography, angiography, kidney, ureter and bladder X-ray and retrograde contrast studies) that included the kidney in the field of view were entered into the database.
A power calculation was performed to determine the number of kidneys needed for the study and for testing the hypothesis that the concordance correlation coefficients (CCCs) were the same between the two groups. Comparison 1 evaluated experienced residents versus experts; comparison 2 evaluated experienced residents with access to iRENEX versus experts. From a prior study, we estimated the CCC of experts to be in the range of 0.8–0.9 and the SD of the difference in CCC to be 0.03. With a conservative choice of the SD of the difference in CCC in the range of 0.05–0.10, a sample size of 50 subjects provided power of 88–99% to detect a minimum difference of 0.05 in CCC, at a significance level of 0.025 for two comparisons with multiple adjustments.
The CCC was used to quantify the agreement among experts, residents and iRENEX. To determine if the iRENEX interpretations were comparable to expert interpretations, each expert was replaced by iRENEX and the CCC of each combination of iRENEX and two experts was calculated. If iRENEX were performing at the level of experts, there would be no difference in the CCC of the three experts and the CCC of the (iRENEX + 2 expert) combinations. Bootstrap methods were used to estimate the standard error and compare the agreement among experts, residents and iRENEX. Results reported in the tables include the mean and confidence intervals of each of the CCC calculations.
To create the binary data set required for receiver operating characteristic (ROC) analysis, kidneys with scores in the indeterminate range were combined with kidneys in the obstructed group because scores >−0.2 convey to the referring physician that the kidney is obstructed or might be obstructed (indeterminate).
Expert scores served as the gold standard; obstruction status was determined by majority decision when there was disagreement among experts. Resident scores were analyzed as a continuous variable and the P value was determined using bootstrap methodology. All hypothesis tests were two-tailed and a P value of <0.05 indicated statistical significance. Analyses were performed using SAS (version 9.4, Cary, North Carolina, USA) and R routines.
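The ROC setup described above can be sketched as follows: the expert reference score is dichotomized at −0.20 (indeterminate kidneys grouped with obstructed kidneys), the resident score is kept continuous, and the area under the curve is obtained from its equivalence to the Mann-Whitney statistic. The function names are hypothetical, the example takes a single reference score per kidney as input (the majority-decision step among experts is not shown), and this is not the SAS or R code used in the study.

```python
from typing import Sequence


def gold_standard_obstructed(expert_score: float) -> int:
    """1 if the reference score is > -0.20 (obstructed or indeterminate), else 0."""
    return 1 if expert_score > -0.20 else 0


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive kidney outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([gold_standard_obstructed(s) for s in expert_scores], resident_scores)` gives the AUC of one resident against the dichotomized reference, where both score lists are hypothetical inputs ordered by kidney.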
In the study population of 50 subjects (98 kidneys), the mean CCC for the (iRENEX + 2 expert) combinations was not significantly different from the CCC of the three experts, 0.81 versus 0.84 respectively, P = 0.16 (Table 1). If iRENEX were performing worse than the expert readers, the CCC of iRENEX + 2 experts would be lower than the CCC of the three experts. Additional analyses confirmed that the iRENEX scores were in the same direction as the scores of experts, further supporting the conclusion that iRENEX performs similarly to an expert.
Table 1 - Comparison of the concordance correlation coefficient among experts versus the iRENEX + 2 expert combination
CCC among 3 raters (experts)
CCC of iRENEX + 2 expert combination**
P = 0.14
CCC, concordance correlation coefficient; CI, confidence interval; SE, standard error.
**The iRENEX + 2 experts groupings evaluate the potential of iRENEX to provide ratings similar to those of an expert.
The CCC of the four residents was 0.39, substantially lower than the 0.84 CCC of the three experts, P < 0.001, indicating a high degree of interobserver variability among residents compared to experts (Table 2). The residents later scored each kidney again with access to the iRENEX scores, the iRENEX rationale and their original scores. The CCC of the four residents with iRENEX improved to 0.73 and was significantly greater than the 0.39 CCC they obtained prior to iRENEX, P < 0.001 (Table 2). The increase in the CCC demonstrated that exposure to iRENEX substantially enhanced interobserver agreement among residents.
Table 2 - Comparison of the concordance correlation coefficient among experts, residents prior to iRENEX and residents with access to iRENEX
CCC among 3 experts
CCC among 4 residents prior to iRENEX
CCC among 4 residents with iRENEX
Experts versus residents prior to iRENEX: P < 0.001
Residents prior to iRENEX versus residents with iRENEX: P < 0.001
CCC, concordance correlation coefficient; CI, confidence interval.
To further evaluate resident performance, the CCC of the four residents was grouped with the CCC of the three expert readers. If residents were performing at the level of experts, the combined CCC of the residents and experts should be the same as that of the experts. In fact, it was significantly less, 0.53 versus 0.84, P < 0.001. Following access to iRENEX, the combined CCC of the four residents and three experts was significantly greater than that of the three experts and four residents prior to iRENEX, 0.75 versus 0.53, P < 0.0001 (Table 3).
Table 3 - Comparison of concordance correlation coefficient among experts, the combined concordance correlation coefficient among experts and residents prior to use of iRENEX, and the combined concordance correlation coefficient of experts and residents with access to iRENEX
CCC among 3 experts
CCC among 7 raters (3 experts + 4 residents pre-iRENEX)
CCC among 7 raters (3 experts + 4 residents with iRENEX)
7 raters pre-iRENEX versus 7 raters with iRENEX: P < 0.0001
CCC, concordance correlation coefficient; CI, confidence interval.
The area under the ROC curve for the initial scores of the four residents was 0.87; after the residents had access to iRENEX, it improved to 0.93, P = 0.034 (Table 4). When the residents were analyzed individually, one poorly performing resident did not improve with iRENEX. A second resident did not exhibit significant improvement, but that resident had demonstrated superior performance prior to iRENEX. The two remaining residents were performing at an intermediate level and both demonstrated significant improvement (Table 4, Fig. 1).
Table 4 - Receiver operating characteristic analysis showing the area under the curve comparing the performance of expert readers with resident performance pre-RENEX and with RENEX
AUC with RENEX
AUC, area under the curve.
*For the ROC analysis a binary gold standard was created by classifying expert readings >−0.20 as obstructed and treating the resident scores as continuous.
The iRENEX diagnostic algorithm was optimized to match the performance of experienced physicians in a pilot study. The current study validates the pilot results in a randomized sample by showing the diagnostic performance of iRENEX to be comparable to that of experts. The CCC analysis showed the diagnostic performance of iRENEX was superior to that of resident physicians with at least one year of nuclear medicine training. In addition, interobserver agreement and diagnostic performance of the residents improved with access to iRENEX.
ROC analysis showed that overall resident performance improved with access to RENEX; however, the individual ROC results were more nuanced. The area under the curve (AUC) results suggest that Resident 1, the poorest performing resident, probably paid little attention to iRENEX. Resident 2 was already performing well; that resident’s AUC was higher following iRENEX, but the degree of improvement did not reach significance. Residents 3 and 4 were performing at an intermediate level and their AUC did significantly improve with iRENEX. These results demonstrate the potential of iRENEX to monitor and quantify resident performance, identify residents in need of more focused supervision and serve as an educational tool to enhance diagnostic performance.
All the residents in the study had completed a minimum of 1 year of nuclear medicine training and one had completed diagnostic radiology training. With this degree of training, their skill in interpreting diuretic renography studies should at least be equal to that of many radiologists, particularly those who had limited exposure to diuretic renography studies during their training or limited exposure in their practices. Assuming the diagnostic accuracy of nuclear medicine residents with a year of experience is no worse than that of general radiologists, there is likely to be widespread interobserver variability among radiologists in the interpretation of diuretic renography studies and their interpretations will often fail to correspond with those of experts. A test of this hypothesis could serve as a basis for future studies.
Interobserver variability in image interpretation is a major clinical concern because the report becomes increasingly dependent on which physician interpreted the study rather than on the underlying disease process. Disagreement in interpretations may compromise patient management and undermine trust in the procedure and the radiologist. For congruence and consistency, interobserver variability should approach zero. With decision support systems such as iRENEX, the interpretation of a given study will always be the same. If the interpretation is shown to contain a systematic error, adjustments can be made in the interpretative algorithm.
Despite broadly accepted guidelines, there is still substantial variability in the performance and reporting of diuretic renography studies [14,21,22]. This wide variance in practice and reporting may result in incomplete or inaccurate results and illustrates the need for additional approaches such as RENEX and iRENEX to educate the community and help standardize practices.
There are several limitations. RENEX and iRENEX are not yet available outside of our institutional research protocols, but we are developing an educational website (https://MAG3renography.net) that reviews the 99mTc-MAG3 diuretic display format, summarizes the most relevant quantitative parameters, uses a simplified scoring mechanism and provides 10 cases for review so that a user can compare his/her interpretations with those of RENEX and expert readers.
All the participating residents received their nuclear medicine training in a single program. Residents from other programs may have performed better or worse. Although the overall performance of the residents in this study was inferior to that of expert readers, the residents had not yet completed a 3-year nuclear medicine residency when they might reasonably be expected to perform at the level of experts.
Our study design incorporated a potential bias because residents had access to their original scores when they were given access to iRENEX. This type of bias occurs in clinical settings such as the interpretation of myocardial perfusion studies when physicians form an initial interpretation and only afterwards review the interpretation provided by their cardiac software. The concern regarding bias stems from the possibility that residents might be anchored to their original scores. If present, such an anchoring bias would have diminished rather than augmented the impact of iRENEX. Despite this potential bias, the use of iRENEX significantly improved the resident scores.
It is possible that residents learned from their initial review of the 99mTc-MAG3 diuretic studies although they received no feedback regarding their interpretations or how well their overall scores agreed with those of experts or other residents. A final limitation is that our results are not generalizable to all diuretic renography studies but apply specifically to adult patients evaluated with a two-stage protocol. iRENEX has not been tested in a pediatric population and was not designed to interpret studies when the diuretic is given before or simultaneously with the radioactive tracer.
The interpretative software for diuretic scintigraphy (iRENEX) has a diagnostic accuracy for 99mTc-MAG3 diuretic renography comparable to expert readers; its use by experienced resident physicians reduced interobserver variability and resulted in significantly better agreement with expert interpretations. These results demonstrate the potential of iRENEX to monitor and quantify resident performance, identify residents in need of more focused supervision and serve as an educational tool to enhance diagnostic performance for both residents and practicing physicians.
We are grateful to Daya Manatunga, MS for his assistance in this project. This work was supported by grants from the US National Institute of Biomedical Imaging and Bioengineering, and the National Institute of Diabetes and Digestive and Kidney Diseases, R01EB008838 and R01DK108070-01A1.
Conflicts of interest
A.T.T., R.D.F., R.K.H. and E.V.G. are entitled to a share of the royalties for the use of QuantEM software for processing Tc-99m MAG3 renal scans that was licensed by Emory University to GE Healthcare in 1993. An in-house, non-commercial upgrade to QuantEM in a different programming language was used in this study and could affect the financial status of the developers. The terms of this arrangement have been reviewed and approved by Emory University in accordance with its conflict-of-interest policies.
1. IMV Medical information division. 2003 Nuclear Medicine Census Market Summary Report. IMV, Limited, IV. 2003. pp. 7–11.
2. Archer KD, Bolus NE. Survey on the use of nuclear renal imaging in the United States. J Nucl Med Technol 2016; 44:223–226.
3. Garcia EV, Taylor A, Halkar R, Folks R, Krishnan M, Cooke CD, et al. RENEX: an expert system for the interpretation of Tc-99m MAG3 scans to detect renal obstruction. J Nucl Med 2006; 47:320–329.
4. Taylor A, Garcia EV, Binongo JN, Manatunga A, Halkar R, Folks RD, et al. Diagnostic performance of an expert system for interpretation of Tc-99m MAG3 scans in suspected obstruction. J Nucl Med 2008; 49:216–224.
5. Garcia EV, Taylor A, Folks R, Manatunga D, Halkar R, Savir-Baruch B, et al. iRENEX: a clinically-informed decision support system for the interpretation of Tc-99m MAG3 scans to detect renal obstruction. Eur J Nucl Med Mol Imaging 2012; 39:1483–1494.
6. Garcia EV, Taylor A, Manatunga D, Folks R. A software engine to justify the conclusions of an expert system for detecting renal obstruction on Tc-99m MAG3 scans. J Nucl Med 2007; 48:463–470.
7. Taylor AT, Garcia EV. Computer-assisted diagnosis in renal nuclear medicine: rationale, methodology and interpretative criteria for diuretic renography. Semin Nucl Med 2014; 44:146–158.
8. Porenta G. Being right for the right reason: better than just being right? J Nucl Med 2007; 48:335–336.
9. O’Reilly P, Aurell M, Britton K, Kletter K, Rosenthal L, Testa T. Consensus on diuresis renography for investigating the dilated upper urinary tract. J Nucl Med 1996; 37:1872–1876.
10. O’Reilly PH; Consensus Committee of the Society of Radionuclides in Nephrourology. Standardization of the renogram technique for investigating the dilated upper urinary tract and assessing the results of surgery. BJU Int 2003; 91:239–243.
11. Taylor A, Manatunga A, Morton K, Reese L, Prato FS, Greenberg E, et al. Multicenter trial validation of a camera-based method to measure Tc-99m mercaptoacetyltriglycine, or Tc-99m MAG3, clearance. Radiology 1997; 204:47–54.
12. Folks RD, Savir-Baruch B, Garcia EV, Verdes L, Taylor AT. Development of a relational database to capture and merge clinical history with the quantitative results of radionuclide renography. J Nucl Med Tech 2012; 40:236–243.
13. Folks RD, Manatunga D, Garcia EV, Taylor AT. Automated patient motion detection and correction in dynamic renal scintigraphy. J Nucl Med Tech 2011; 39:131–139.
14. Taylor AT, Blaufox MD, De Palma D, Dubovsky EV, Erbas B, Eskild-Jensen A, et al. Guidance document for structured reporting of diuresis renography. Semin Nucl Med 2012; 42:41–48.
15. Bao J, Manatunga A, Binongo JNG, Taylor A. Key variables for interpreting MAG3 diuretic scans: development and validation of a predictive model. AJR Am J Roentgenol 2011; 197:325–333.
16. Lin H, Williamson JM. A simple approach for sample size calculation for comparing two concordance correlation coefficients estimated on the same subjects. J Biopharm Stat 2015; 25:1145–1160.
17. Taylor A, Rahman AKM, Folks R, Sethi I, Carter S, Behr-Ventura D, et al. Can RENEX, a decision support system for Tc-99m MAG3 diuretic renography, teach residents better diagnostic skills? (abstract). J Nucl Med 2016; 57:540.
18. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45:255–268.
19. Efron B. Better bootstrap confidence intervals. J Am Stat Assoc 1987; 82:171–185.
20. Di Leo G. Measurements in radiology: the need for high reproducibility. Pediatr Radiol 2015; 45:32–34.
21. Taylor AT, Brandon DC, de Palma D, Blaufox MD, Durand E, Erbas B, et al. SNMMI procedure standard/EANM practice guideline for diuretic renal scintigraphy in adults with suspected upper urinary tract obstruction 1.0. Semin Nucl Med 2018; 48:377–390.
22. Banks KP, Farrell MB, Peacock JG, Costello M, Gordon LL. Diuretic renal scintigraphy: the state of practice and a potential opportunity for standardization. J Am Coll Radiol 2021; 18:1556–1563.
23. Taylor A, Manatunga D, Halkar R, Dubovsky E. An interactive teaching website for Tc-99m MAG3 diuretic renography. J Nucl Med 2022; 63:2659.