Assessment of treatment response in luminal gastrointestinal (GI) disorders depends on the accurate measurement of clinical signs and symptoms. In the case of functional GI disorders, patient-reported outcomes (PROs) are commonly used metrics for measuring treatment effects and adverse events in interventional trials. However, current PROs are subjective and appear to have significant intraindividual and interindividual variability (1–3). Moreover, the subjectivity of current PROs, such as abdominal pain, bloating, and nausea, further influences treatment/placebo response rates (4).
In the case of irritable bowel syndrome (IBS), the US Food and Drug Administration (FDA) provides guidance for the reporting of, and outcomes for, stool consistency before and after an intervention (5). A common tool designed to guide and standardize the reporting of stool consistency is the Bristol Stool Scale (BSS) (6), a 7-point scale ranking stool form from hard lumps of stool (1 on the scale) to liquid stool (7 on the scale). In addition, the US FDA guidance suggests daily rather than weekly reporting of symptoms and bowel form, which can reduce recall bias. However, even daily reporting can be inaccurate because subjects are asked to report their average stool form for the day, yet bowel movements may vary in form throughout the day. Moreover, subjects' interpretation of the BSS may vary, leading to inaccurate reporting of stool form.
Integration of artificial intelligence (AI) in health care is enabling research in new and objective ways. The use of AI algorithms to systematically assess digital images of individual bowel movements can potentially increase the accuracy and reliability of study outcomes. Recently, a HIPAA-compliant smartphone application (app) has been developed to allow the user to capture an image of every bowel movement. Subsequently, the app uses AI to measure 5 distinct visual stool characteristics (BSS, consistency, edge fuzziness, fragmentation, and volume).
In this study, we aimed to validate AI determinations for stool images on the basis of 5 visual stool characteristics, using a group of subjects participating in a randomized clinical trial for IBS with diarrhea (IBS-D). Next, a separate group of subjects participating in the same trial were used to assess the implementation of the app and compare the app findings with the subjects' self-reported BSS scores. Finally, subject-determined BSS and all AI-determined stool characteristics scores were compared with diarrhea severity scores.
Subjects (aged 18–75 years) participating in a randomized controlled trial of new drug therapy for IBS-D (ClinicalTrials.gov number NCT04557215) were included. Diagnosis of IBS-D was based on the Rome IV criteria (7). Subjects were excluded from the trial if they were pregnant or had known luminal GI diseases, recent antibiotic use (within the past 2 months), human immunodeficiency virus, unstable thyroid disease, poorly controlled diabetes, cirrhosis, or history of abdominal surgery (excluding appendectomy or cholecystectomy). Subjects were given the option of using a smartphone app (Dieta Health, Oak Park, CA) to capture images of all stool evacuations during the 2-week screening phase of the trial. Those who agreed were included in this study. The methods followed a plan for determining whether there were added benefits of using AI. All subjects signed an informed consent form, and the trial was approved by the institutional review board.
Stool characteristics and AI training
Dieta (Dieta Health) developed its stool image recognition technology by building software to capture, quality assure, annotate, and analyze images of human stool. The app was designed to capture images of stool that are of equal distance from the toilet by guiding users using a toilet seat mask (Figure 1). To adjust for toilet size and color variability, a stool localization model was run to isolate just the stool from the pictures to provide a high-resolution, close-up view of the stool. On capturing more than 30,000 high-quality images acquired through the publicly available app, an interactive web application (Dieta Stool Annotation Portal) allowed users to view the images from different angles, lighting (brightness, contrast, and saturation), and resolution while comparing with high-fidelity guides and manually annotate the images. Each annotation guide showed multiple stool images on a 0–100 scale in increments of 10 to train the human annotator for proper calibration.
Each image was annotated with 5 visual stool characteristics:
- Consistency: How liquid or solid is the stool?
- Fragmentation: How divided is the stool into separate pieces?
- Edge fuzziness: How clear or fuzzy are the edges of the stool?
- Volume: How large is the stool?
- BSS: the current global standard scale for measuring stool form.
After a process of manual annotation and quality assurance, a regression model was trained to predict the 5 clinically relevant stool characteristics from stool images taken by users within the app. The model was developed by transfer learning on a canonical convolutional image classification model with the classification layer replaced by a regression layer. The model was trained with a novel focal mean square error (MSE) loss (see Supplementary Figure 1, Supplementary Digital Content 1, https://links.lww.com/AJG/C470). To estimate confidence, multiple models were trained with the same hyperparameters; the average of their predictions was used as the final prediction, and their agreement was used as a proxy for confidence.
Models trained on Dieta's data with MSE were biased against predictions near extrema for 2 reasons:
- The training data label distribution of all metrics resembled a Gaussian centered near the range midpoint of each metric, resulting in a distribution imbalance that biases models toward central prediction.
- In a bounded regression, particularly with a centrally biased training set, as mentioned above, central predictions were safer because they resulted in a smaller maximal squared residual.
Focal loss reduced central prediction bias by increasing the significance of the loss on images with more extreme labels, using a loss of the form

ℓ(ŷ, y) = (1 + |2y − 1|)^γ · (ŷ − y)²,

where ℓ is the loss for a given prediction, ŷ is the model prediction, y is the image label (scaled to [0, 1]), and γ is the focal factor, which determined the strength of the focal weighting of the loss; γ = 0 yielded standard MSE.
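For illustration, the weighting described here can be sketched in a few lines of NumPy. This is not the study's actual training code: the midpoint-distance weight and the 0–100 scale bounds are illustrative assumptions, and `focal_mse` is a hypothetical function name.

```python
import numpy as np

def focal_mse(y_pred, y_true, gamma=2.0, lo=0.0, hi=100.0):
    """Focal mean squared error on a bounded [lo, hi] label scale.

    Residuals on images whose labels sit far from the scale midpoint are
    up-weighted, countering the central-prediction bias of plain MSE.
    gamma = 0 recovers standard MSE.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    mid = (lo + hi) / 2.0
    # extremity in [0, 1]: 0 at the midpoint, 1 at either end of the scale
    extremity = 2.0 * np.abs(y_true - mid) / (hi - lo)
    weights = (1.0 + extremity) ** gamma  # >= 1, growing toward the extremes
    return np.mean(weights * (y_pred - y_true) ** 2)
```

With gamma = 0 every weight equals 1 and the function reduces to ordinary MSE; larger gamma values penalize errors on very loose or very hard stools more heavily, which is the behavior the focal factor is meant to tune.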
Validation of AI
During the 14-day screening phase of the randomized controlled trial, subjects reported the daily average of their stool form using the BSS (6), and the results were entered into REDCap, a HIPAA-compliant, web-based portal designed to capture data for research studies (8,9). Moreover, using the app, subjects captured an image of every bowel movement before flushing or placing tissue paper in the toilet. The app recorded the image and determined the 5 visual stool characteristics and frequency of the bowel movements digitally.
For the AI validation, images from the first one-third of subjects were used to compare the outcomes with expert evaluations. Images from these subjects were collated and blindly graded by 2 expert gastroenterologists (M.P. and A.R.), who characterized the images based on the same 5 visual stool characteristics (BSS, consistency, fragmentation, edge fuzziness, and volume) according to Figure 1. The determination by the experts was considered to be the gold standard. AI and expert grades for each image were then compared. Images annotated by AI, subjects, and experts were further categorized into 3 groups: BSS <3, 3 ≤ BSS ≤ 5, and BSS >5. Sensitivity, specificity, accuracy, and diagnostic odds ratios of self-reported and AI-graded BSS scores were calculated by comparing with experts as the gold standard.
Implementation of the AI
Stool images collected from the remaining two-thirds of the subjects were graded on all 5 visual stool characteristics exclusively by AI. Note that subjects were only asked to self-report the average daily BSS and were not asked to report the other 4 visual characteristics. Self-reported and AI-graded BSS scores were compared. In an additional analysis, the daily average BSS scores from subject reports and the AI assessments were categorized as “constipation” (average BSS <3), “normal” (average BSS ≥3 but ≤5), or “diarrhea” (BSS >5). The agreement in the rating of constipation, normal, or diarrhea was also compared between the AI and the subject reports.
Correlations between subjects' self-reported BSS and AI-graded scores and diarrhea severity scores
Diarrhea severity was reported by subjects on a visual analog scale from 0 to 100 (0 = no symptom and 100 = severe symptom severity). Subjects' self-reported BSS scores and AI-graded BSS scores were correlated with the subjects' diarrhea severity scores.
Subject survey of the stool imaging procedure using a smartphone
At the end of the study, subjects were asked to respond to 2 questions in the app. The first question was “Overall, how was your experience using the application in this trial?” with potential responses being “Very easy and pleasant experience,” “Somewhat easy and pleasant experience,” or “It was difficult to use.” The second question was “How do you like tracking your stools in general?” with potential responses being “Love it,” “Like it,” “Doesn't matter,” “Don't like it” or “Hate it.”
The demographic information and the symptom severity scores were summarized as the mean ± SD for continuous variables and count and frequency for categorical variables. To compare these variables for the validation group and the implementation group, the Student t test was used for continuous variables and the χ2 test for categorical variables.
Daily average subject-reported BSS scores and expert-assessed BSS scores were compared with the same-day average AI-graded BSS scores. The agreement analyses for BSS scores were based on intraclass correlation coefficients (ICCs), which account for heterogeneous variances across a measurement scale (10). The ICC estimates and their 95% confidence intervals were calculated using SAS 9.4 (SAS Institute, Cary, NC) based on a mean-rating, absolute-agreement, 2-way random-effects model. ICC values <0.5 indicate poor agreement, values 0.5–0.74 moderate agreement, values 0.75–0.9 good agreement, and values >0.9 excellent agreement. The sensitivity, specificity, and accuracy rates and the overall statistics were calculated based on a fixed-effects model. To consider sensitivity and specificity simultaneously, a bivariate model was analyzed using linear mixed model techniques. The parameters of the bivariate model were estimated in a single model to incorporate the possible correlation between sensitivity and specificity and to calculate diagnostic odds ratios and accuracy rates (11), using the metafor package in R version 4.1.2.
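For readers unfamiliar with this ICC variant, the mean-rating, absolute-agreement, 2-way random-effects ICC can be computed from the 2-way ANOVA mean squares. This NumPy sketch assumes a complete subjects-by-raters matrix and is only an illustration of the formula, not the SAS procedure used in the study.

```python
import numpy as np

def icc_2k(ratings):
    """ICC(2,k): mean-rating, absolute-agreement, 2-way random effects.

    `ratings` is an (n subjects x k raters) array, e.g. daily average
    BSS scores graded by the AI and reported by the subject.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    rows = x.mean(axis=1, keepdims=True)   # per-subject means
    cols = x.mean(axis=0, keepdims=True)   # per-rater means
    msr = k * np.sum((rows - grand) ** 2) / (n - 1)          # between subjects
    msc = n * np.sum((cols - grand) ** 2) / (k - 1)          # between raters
    mse = np.sum((x - rows - cols + grand) ** 2) / ((n - 1) * (k - 1))  # residual
    return (msr - mse) / (msr + (msc - mse) / n)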
Finally, based on the data from the entire cohort, the correlations between the subjects' self-reported BSS scores and the 5 visual AI stool characteristics and the other IBS-D symptom severity scores reported by the subjects were calculated using the Spearman correlation coefficient.
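The Spearman coefficient used here is simply the Pearson correlation of the ranked data; any statistics package provides it directly, and the minimal implementation below (with average ranks for ties) is shown only to make the computation concrete.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks,
    assigning average ranks to tied values."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):       # average ranks over ties
            mask = v == val
            r[mask] = r[mask].mean()
        return r
    return np.corrcoef(ranks(x), ranks(y))[0, 1]
```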
To define a 1–100 scale for each of the 4 new stool characteristics (i.e., all except BSS), the Dieta team reviewed tens of thousands of stool images and identified the images of stool that represented the extremes and increments of 10 for each characteristic. AI training was performed by comparing with human reads of the images. The machine learning model was trained with a novel focal MSE loss, which gave extra training importance to images near the edges of the label distribution using a tunable focal factor parameter. Increasing the focal factor progressively reduced the central prediction bias of standard MSE loss, resulting in an increasingly diagonal confusion matrix (see Supplementary Figure 1, Supplementary Digital Content 1, https://links.lww.com/AJG/C470). Focal MSE thus avoided the central prediction bias of standard MSE without losing accuracy, yielding better performance on nonaverage stools.
In total, 45 IBS-D subjects were enrolled in the parent randomized controlled trial. Of these, 39 subjects agreed to participate in this study to assess a new smartphone app designed to characterize stool form based on 5 distinct visual stool characteristics (Figure 1). All 39 subjects participated in the entire 2 weeks of the study. For the validation phase of this study, 219 stool images from the first 14 subjects were used. For the implementation phase, stool images from the remaining 25 subjects were used. At baseline, there were no significant differences between the validation cohort and the implementation cohort (Table 1).
AI had good agreement with experts based on stool characterization parameters
Based on both the AI and assessment by 2 expert gastroenterologists, a wide range of stool characteristics scores were represented, including BSS scores from 1 to 7, i.e., from constipation to diarrhea. The distribution of stool scores is shown in Supplementary Figure 2 (see Supplementary Digital Content 1, https://links.lww.com/AJG/C470). There was good agreement between the 2 experts (Table 2) for all 5 visual stool characteristics (BSS, consistency, fragmentation, edge fuzziness, and volume).
Each expert was then considered separately as the gold standard and compared with the trained AI for the 5 visual stool characteristics. Overall, when compared with either expert, there was good agreement between the expert and the AI BSS, consistency, fragmentation, and edge fuzziness scores, and moderate-to-good agreement for volume (Table 2).
As compared with subjects, AI exhibited higher sensitivity and specificity in categorizing BSS scores
Using the experts' stool grading as the gold standard, AI-graded BSS score categorization (as constipation, normal, or diarrhea) had higher test characteristics when compared with subject-reported scores (Table 3). Average sensitivity and specificity rates were 16% and 11% higher, respectively. As expected, given the lower false-positive and false-negative rates of AI, a higher average diagnostic odds ratio (30.64 vs 3.67) and accuracy rate (95% vs 89%) were seen with AI when compared with subject-reported scores. In addition, the distributions of daily average BSS scores as determined by the 2 expert gastroenterologists, using images of subject stools, were compared with the daily average BSS scores reported by subjects (see Supplementary Figure 2, Supplementary Digital Content 1, https://links.lww.com/AJG/C470). As seen in this figure, when actual images were graded, the experts assigned a broader range of BSS scores than the subjects reported.
Implementation phase comparing the AI with subjects
Once the AI stool image characterization was validated by the 2 experts, subject self-reported and AI-graded BSS scores were compared. For the 14 subjects in the validation cohort, the agreement between AI-graded and subject-reported BSS scores was poor at 0.31 (95% confidence interval 0.06–0.49). The mean visual stool characteristics determined by AI between the validation and the implementation cohorts were similar (Table 4). The average daily AI-graded BSS scores were then compared with the subjects' self-reported average BSS scores for the implementation cohort. The agreement between the AI and the implementation cohort subjects' self-reported scores was moderate, at 0.61 (95% confidence interval 0.51–0.69).
Stool characteristics determined by AI correlated with diarrhea severity in IBS-D
To assess the correlation between AI and diarrhea severity in the entire cohort, the average scores for each of the 5 visual stool characteristics from the AI (BSS, consistency, fragmentation, edge fuzziness, and volume) and subject self-reported BSS scores were correlated with the average symptom severity scores for diarrhea. Again, subjects reported their average BSS scores daily but did not report the other 4 visual stool characteristics. There was a good correlation between the AI-graded daily average BSS scores and diarrhea severity scores in IBS-D subjects (r = 0.71; P < 0.001) (Table 5). There was a lesser correlation between subject-reported daily average BSS scores and diarrhea severity scores (r = 0.46, P < 0.05). Most of the other 4 visual stool characteristics captured and reported by the app also correlated significantly with diarrhea severity scores (Table 5).
At the end of the study, subjects were given the option of completing 2 questions in the app that rated their satisfaction with tracking stool form using the app: (i) “Overall, how was your experience using the application in this trial?” and (ii) “How do you like tracking your stools in general?” Because this was optional, only 20 of the 45 subjects responded. Of those who did respond, 50% described their experience as “very pleasant and easy” and the remaining 50% as “somewhat easy and pleasant” (see Supplementary Figure 3A, Supplementary Digital Content 1, https://links.lww.com/AJG/C470). No subjects found the app difficult to use. For the second question, most of the subjects responded with “doesn't matter to me” (40%), “like it” (30%), or “love it” (10%), and the remainder selected “don't like it” (5%) or “hate it” (15%).
In this study, AI demonstrates accurate stool characterization when compared with the 2 expert gastroenterologists. AI also seems to characterize stool better than subject self-reporting and correlates better with diarrhea severity. The use of this application in the assessment of subjects during a clinical trial for IBS-D demonstrates that current patient-reported outcomes for assessing stool form are inadequate compared with AI.
The study of GI, and particularly luminal, conditions is challenging because of the need to quantify subjects' self-reported measures. This is most evident in the study of functional GI disorders. In functional disorders, such as IBS, subjects must self-report abdominal pain, bloating, and stool form and frequency. These symptoms are highly subjective. However, because IBS is defined by either diarrhea (in the case of IBS-D) or constipation (in the case of IBS-C), assessments of stool form and frequency are critical to determining the success or benefits of interventions in these conditions.
Primary endpoints of drug trials in IBS have evolved over the years due to concerns about the subjectivity of endpoints. In fact, early drug development programs in IBS often relied on weekly assessments of GI symptoms (12–14). Subjects were asked to describe their stool consistency in a fashion that was described as “on average in the last week,” for example. In 2008, in an attempt to make outcome measures more objective, the US FDA developed a guideline for IBS drug development and described outcome measures (5). These new guidelines required sponsors to ask subjects to report and characterize symptoms on a daily basis in the hopes of improving the accuracy of response reporting.
In the case of stool form, clinical trials still rely heavily on the BSS (6) to guide subject self-reporting of stool form. The BSS has been validated in several studies and found to be useful in the evaluation of stool hardness (15) and in rating individual stool type (16) but was less useful in determining Rome III stool form categories (16), and some users have shown difficulties classifying certain stool types correctly (17). However, many challenges remain with asking the subject to use the BSS. The BSS, although intuitive, requires subject education and instruction to avoid confusion. Most challenging is the daily reporting of stool BSS: subjects with diarrhea are asked to report a single daily average BSS, which can be difficult for subjects with multiple and varied bowel movements throughout a given day. Our findings support this and indicate that subjects' daily scores differed from BSS assignments by the 2 expert gastroenterologists. It is well described that regardless of whether a subject has IBS-D or IBS-C, stool varies from one bowel movement to another in an irregularly irregular fashion (18). In fact, even the Rome criteria allow that subjects with IBS-D can have constipation (i.e., BSS 1–2) up to 25% of the time (7,19). In a subject who has 2 bowel movements in 1 day, one BSS 7 and the other BSS 2, what would the subject report as the daily average? These problems continue to challenge study design and outcomes.
In this study, we show that AI is able to catalog and characterize stool form in an objective way. Although the app is still dependent on having a subject to photodocument each bowel movement, the photos document the “true” stool characteristics. This allows an objective measurement of stool form. Moreover, the photos allow the stool to be assessed with additional characteristics beyond the BSS.
The AI demonstrated mostly good agreement with the 2 expert gastroenterologists across the visual stool characteristics, including the BSS. The application also outputs 4 other stool characteristics: consistency, edge fuzziness, fragmentation, and volume. These novel outputs also seem to be important for stool characterization. By capturing all subject bowel movements in this way, the stool form suggested by the US FDA guidance document can be calculated objectively. In fact, the app would not change the US FDA guidance but rather make the measurement objective to better fulfill the requirement. Alternatively, each bowel movement can be considered individually, avoiding the need for daily averages reported for the sake of subject convenience.
In the clinical trial noted here, the app demonstrated that the standard patient-reported outcome technique did not represent the true stool form when compared with real-time images. Although there was some overlap in the confidence intervals (with the exception of accuracy), the overall pattern of test characteristics suggested that AI was superior. Subjects often mischaracterized their BSS. We found that subjects appeared reluctant to use the full BSS scale and tended to gravitate toward the diarrhea end of the scale. This inaccuracy could stem from a number of issues, such as inadequate subject education or understanding, as well as recall bias. These findings raise significant concerns and call into question the use of self-reporting to assess stool form.
The app may also provide additional benefits. Stool image capture allows for the full assessment of a drug effect. For example, a drug may benefit constipation but instead cause diarrhea. Stool capture would objectively quantify stool side effects of therapies for bowel disorders. Another potential benefit is trial costs. Owing to the wide variability and inaccuracy of subject reporting, as demonstrated in this study, trials need to be designed with large numbers of subjects to overcome the effect of variability. The use of an objective measure for stool form would likely mean a lower number of study subjects could be used to assess therapies because the image reading is standardized.
There are some limitations to this study. Although lapses in capture were uncommon, stool image capture using the application depends on a modest amount of user skill and dedication to capturing every stool. However, we note that most subjects found using the app to be an easy and pleasant experience, and none described the app as difficult to use. Another limitation is that toilet colors other than white might affect the ability of the app, or even an expert, to assess stool characteristics. The study is also limited by the fact that most images were from subjects with IBS. Although use of the app in broader GI disease may be intuitive, it may be important to specifically study the app in these other conditions.
In conclusion, a new easy-to-use smartphone application capturing real-time stool images is both accurate and objective. In addition, the app demonstrates that subject-reported stool characteristics (BSS) do not accurately describe the features of stool. Additional stool features such as stool consistency, edge fuzziness, fragmentation, and volume may provide added objective benefits in the study of GI disorders. This app may become a new and valuable tool for use in trials of luminal GI diseases involving changes in stool form.
CONFLICTS OF INTEREST
Guarantor of the article: Mark Pimentel, MD, FRCP(C).
Specific author contributions: A.K., A.R., and M.P.: conceptualization. A.K., M.P., A.R., and J.W.: data curation and formal analysis. A.K.: methodology and software. M.P., A.R., and R.M.: project administration. A.K., M.P., A.R., N.P., B.B., L.T., and B.C.: resources. C.C., A.H., A.F., M.R., G.L., and J.W.: investigation. M.P., W.M., S.W., M.R., C.C., and R.M.: supervision. M.P., A.R., A.K., and J.W.: writing—original draft. M.P., A.K., A.R., J.W., and R.M.: writing—review and editing.
Financial support: None to report.
Potential competing interests: A.K. is an employee and board member of Dieta Health. The remaining authors have no conflicts of interest.
WHAT IS KNOWN
- ✓ Outcome measures in gastrointestinal trials commonly rely on patients to self-report stool form using the Bristol Stool Scale (BSS).
- ✓ These characterizations are subjective and can result in inconsistent and variable reports.
WHAT IS NEW HERE
- ✓ A novel smartphone application (app) has been developed that characterizes stool form using artificial intelligence.
- ✓ Stool characterizations by this app were highly consistent with assessments by expert gastroenterologists.
- ✓ App characterizations outperformed subject self-reports in categorizing daily average BSS scores and correlated better with diarrhea severity in subjects with diarrhea-predominant irritable bowel syndrome.
- ✓ This novel app may be a valuable tool in clinical trials for functional gastrointestinal disorders.
1. Hsiao CJ, Dymek C, Kim B, et al. Advancing the use of patient-reported outcomes in practice: Understanding challenges, opportunities, and the potential of health information technology. Qual Life Res 2019;28:1575–83.
2. Waljee JF, Ghaferi A, Finks JF, et al. Variation in patient-reported outcomes across hospitals following surgery. Med Care 2015;53:960–6.
3. Dietz N, Sharma M, Alhourani A, et al. Variability in the utility of predictive models in predicting patient-reported outcomes following spine surgery for degenerative conditions: A systematic review. Neurosurg Focus 2018;45:E10.
4. Bosman M, Elsenbruch S, Corsetti M, et al. The placebo response rate in pharmacological trials in patients with irritable bowel syndrome: A systematic review and meta-analysis. Lancet Gastroenterol Hepatol 2021;6:459–73.
5. U.S. Department of Health and Human Services Food and Drug Administration, Center for Drug Evaluation and Research (CDER). Guidance for Industry. Irritable Bowel Syndrome—Clinical Evaluation of Drugs for Treatment. Food and Drug Administration: Silver Spring, MD, 2012 (https://www.fda.gov/media/78622/download). Accessed November 12, 2021.
6. Lewis SJ, Heaton KW. Stool form scale as a useful guide to intestinal transit time. Scand J Gastroenterol 1997;32:920–4.
7. Mearin F, Lacy BE, Chang L, et al. Bowel disorders. Gastroenterology 2016;150:1393–407.
8. Harris PA, Taylor R, Thielke R, et al. Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 2009;42:377–81.
9. Harris PA, Taylor R, Minor BL, et al. The REDCap consortium: Building an international community of software platform partners. J Biomed Inform 2019;95:103208.
10. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med 2016;15:155–63.
11. Lee J, Kim KW, Choi SH, et al. Systematic review and meta-analysis of studies evaluating diagnostic test accuracy: A practical review for clinical researchers-part II. Statistical methods of meta-analysis. Korean J Radiol 2015;16:1188–96.
12. Jones RH, Holtmann G, Rodrigo L, et al. Alosetron relieves pain and improves bowel function compared with mebeverine in female nonconstipated irritable bowel syndrome patients. Aliment Pharmacol Ther 1999;13:1419–27.
13. Nyhlin H, Bang C, Elsborg L, et al. A double-blind, placebo-controlled, randomized study to evaluate the efficacy, safety and tolerability of tegaserod in patients with irritable bowel syndrome. Scand J Gastroenterol 2004;39:119–26.
14. Drossman DA, Chey WD, Johanson JF, et al. Clinical trial: Lubiprostone in patients with constipation-associated irritable bowel syndrome—Results of two randomized, placebo-controlled studies. Aliment Pharmacol Ther 2009;29:329–41.
15. Seppanen S, Niittynen L, Poussa T, et al. Removing lactose from milk does not delay bowel function or harden stool consistency in lactose-tolerant women. Eur J Clin Nutr 2008;62:727–32.
16. Chumpitazi BP, Self MM, Czyzewski DI, et al. Bristol Stool Form Scale reliability and agreement decreases when determining Rome III stool form designations. Neurogastroenterol Motil 2016;28:443–8.
17. Blake MR, Raker JM, Whelan K. Validity and reliability of the Bristol Stool Form Scale in healthy adults and patients with diarrhoea-predominant irritable bowel syndrome. Aliment Pharmacol Ther 2016;44:693–703.
18. Pimentel M, Hwang L, Melmed GY, et al. New clinical method for distinguishing D-IBS from other gastrointestinal conditions causing diarrhea: The LA/IBS diagnostic strategy. Dig Dis Sci 2010;55:145–9.
19. Lacy BE, Patel NK. Rome criteria and a diagnostic approach to irritable bowel syndrome. J Clin Med Res 2017;6:99.