In recent years, global excitement for artificial intelligence (AI) has grown seemingly exponentially, from advances in driverless cars to advanced robotics (1). AI has the attention of mainstream media as well as many ‘big thinkers’ in science and technology, from Elon Musk to the late Stephen Hawking. There is also seemingly as much negativity as positivity; for instance, concerns of ‘robots’ replacing workforces, not least the end of mankind, as Professor Hawking hypothesized in 2014 (2). Gartner reports that AI technology sits atop the ‘Peak of Inflated Expectations’, where novelty and excitement escalate into ‘hype’, the state of over-exaggerated excitement that precedes the ‘Trough of Disillusionment’, when the idea does not quite live up to reality (3). The big question for healthcare, however, is whether AI is friend or foe. More specifically, will it revolutionize gastroenterology practice or damage it? In this review, we dissect the perplexing terminology of AI, describing its subcategories and prospective applications in gastroenterology practice. We describe its promises, but also its pitfalls.
DEBUNKING ARTIFICIAL INTELLIGENCE VOCABULARY FOR CLINICIANS
In understanding the AI-healthcare drive, it is important to appreciate the wide variety of possible approaches, which together form the field’s ‘vocabulary’. Understanding this ‘AI-vocabulary’ becomes a necessity, not least if clinicians/researchers aim to develop predictive models for use in clinical practice.
What is ‘Artificial Intelligence’?
Firstly, “artificial intelligence” refers to the ability of a computer to perform tasks associated with intelligent beings, including ‘cognitive’ functions that might mimic the human mind, such as the ability to ‘learn’ (4). Although the widespread interest in AI seems relatively recent, its conceptual roots are in fact far older, spanning Aristotle’s study of logic (384–322 BC) to Alan Turing’s (1912–1954) extensive advances in computer science (4).
What do I need to know?
Currently, ‘AI’ has become an expansive subject field, in the same way that ‘gastroenterology’ forms an all-encompassing title for many sub-fields (e.g., inflammatory bowel diseases, nutrition, endoscopy, etc.). These gastroenterological sub-specialties all interact in many ways, and AI sub-fields are no different (Fig. 1) (5). We shall briefly categorize some of these AI sub-fields, with examples of how they might find a niche in gastroenterology practice, or indeed academia.
Machine learning refers to the area of computer science whereby a system can develop the ability to ‘learn’ from data without explicit programming; the term was originally coined by Arthur Samuel, a computer game pioneer, in 1959 whilst working for IBM (6). Machine learning stems from computational learning theory in pattern recognition, whereby a computer system can develop an algorithm to make a prediction on data by recognizing patterns, or ‘features’, that relate the data to the specific prediction (4). The predictive algorithm, or ‘model’, can further adapt to new circumstances (such as improving in accuracy with more practice/time). An example of machine learning in gastroenterology is the prediction of colorectal cancer, whereby Kinar et al. trained a model to detect the presence (or absence) of the disease from the complete blood count, with an area under the curve of 0.82 (7). This particular machine-learning approach used ensembles of decision trees as a binary classifier, i.e., disease positive (+1) or disease negative (0). There are a large number of machine-learning methods for training a classification model: decision trees, ensemble trees, support vector machines and k-nearest neighbors, to name but a few. For greater discussion of the variety of ‘types of learning’, we would direct the reader to the following article (8). There remains some uncertainty as to the ‘gold standard’ method to choose when developing a model. Indeed, many machine-learning toolboxes automatically attempt to train a model using a myriad of methods, subsequently depicting the receiver operating characteristic (ROC) curve for each model to illustrate which performs best (9).
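To make this concrete, the workflow above can be sketched in a few lines of Python using scikit-learn. All data here are synthetic stand-ins, not the blood count data of Kinar et al., and the features are purely illustrative; the sketch simply shows a decision-tree ensemble trained as a binary classifier and scored by the area under the ROC curve.

```python
# Illustrative sketch only: a decision-tree-ensemble binary classifier,
# loosely analogous to the approach described above, trained on
# SYNTHETIC blood-count-like data (not real patient data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Four hypothetical blood-count features per 'patient'
X = rng.normal(size=(n, 4))
# Synthetic label: disease positive (1) or negative (0), weakly driven
# by the first feature so the model has a signal to learn
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Area under the ROC curve on held-out data
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.2f}")
```

On this fabricated signal the ensemble discriminates well above chance; with real clinical data, performance would of course depend entirely on the data and validation strategy.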
How do I train a model, and test it on new data?
In predictive model development, the central elements are the partitioning of data into ‘training’, ‘validation’ and ‘testing’ sets. The rationale is simple: if a predictive model is developed on a specific data set, how do we then test it on never-before-seen samples? It is commonplace to randomly partition the data into an initial training set, typically 70% of samples, used to develop the model. The remaining 30% of samples are allocated to (i) model validation and (ii) model testing. While the former functions to ensure the model is not ‘overfitting’ to the training data set, the latter tests the performance of the developed model on new data not yet seen. To clarify, overfitting refers to a process whereby the model is trained to correspond ‘too closely’ to the training data set, such that when it is applied to other examples (in the validation and testing data sets) it performs poorly.
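A minimal sketch of this partitioning, using synthetic record indices in place of real patient data. The 70/15/15 split below is one common convention; how the remaining 30% is divided between validation and testing varies between studies.

```python
# Sketch of randomized train/validation/test partitioning, with
# integers standing in for patient records.
import random

random.seed(42)
samples = list(range(1000))           # stand-ins for patient records
random.shuffle(samples)

n_train = int(0.70 * len(samples))    # 70% to develop the model
n_val = int(0.15 * len(samples))      # 15% to check for overfitting

train_set = samples[:n_train]
val_set = samples[n_train:n_train + n_val]
test_set = samples[n_train + n_val:]  # never-before-seen data

print(len(train_set), len(val_set), len(test_set))
```

The essential property is that the three sets are disjoint, so the final test score reflects performance on genuinely unseen samples.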
Deep learning refers to machine-learning algorithms that use layers of nonlinear processing for ‘feature extraction’ and transformation (10). Feature extraction is the selection of variables that seemingly have power in predicting a target, while transformation refers to changing the data into a form that can be used more effectively for building the model. As an analogy, clinicians intrinsically perform feature extraction on a regular basis, for instance taking a full medical history and then ‘extracting’ the key data that point to a diagnosis. For deep learning, however, feature extraction can be automated computationally. Referring back to the aforementioned colorectal cancer example: by inputting complete blood count data (7), a deep-learning feature-extraction step can select the aspects of the complete blood count that hold the most ‘predictive power’ in determining the presence or absence of colorectal cancer, and assign weights to the variables depending on their usefulness. In this particular example, hemoglobin, mean corpuscular hemoglobin, mean cell volume and hematocrit held higher predictive power, whilst other metrics such as monocytes or red blood cells made lesser contributions to predictive accuracy (7). Furthermore, a deep-learning algorithm also has the ability to learn and utilize interaction factors between the inputted data in predicting its target.
Deep Learning in Clinical Practice
For example, say a hepatologist wanted to develop a model to predict alcoholic liver disease using liver function tests. In this instance, the liver function test values are the inputs, whilst the presence or absence of alcoholic liver disease would form the ‘target’ to predict, in the form of a binary classifier. When these data are input to a deep-learning model, the model would, during feature extraction, assign particular weights for predicting the target; for instance, aspartate transaminase (AST) having more predictive power than, say, lactate dehydrogenase (LDH). Furthermore, the model would likely find that the interaction of AST and alanine transaminase (ALT) also holds predictive power, in the same way that clinicians might currently calculate the AST/ALT ratio in alcoholic liver disease (11,12).
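As a hedged illustration of this hypothetical example, the sketch below trains a simple classifier on synthetic AST/ALT values with the AST/ALT ratio supplied as an explicit interaction feature. A plain logistic regression stands in for a deep-learning model for brevity, and the labels are generated from the ratio itself, so the example is purely didactic.

```python
# Didactic sketch of the alcoholic liver disease example: a classifier
# given AST, ALT, and an explicit AST/ALT interaction (ratio) feature.
# All values and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
ast = rng.uniform(10, 300, n)   # synthetic AST values (IU/L)
alt = rng.uniform(10, 300, n)   # synthetic ALT values (IU/L)
ratio = ast / alt               # the AST/ALT interaction term

# Synthetic target: label 'disease' when the ratio exceeds 2, echoing
# the clinical rule of thumb referenced in the text
y = (ratio > 2).astype(int)

X = np.column_stack([ast, alt, ratio])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Inspect the learned coefficient for each feature, ratio included
print(dict(zip(["AST", "ALT", "AST/ALT"], model.coef_[0].round(3))))
```

Because the synthetic label is constructed from the ratio, the model recovers the rule easily; real deep-learning models would have to discover such interactions from noisy clinical data.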
Neural networks are a particular area within machine learning that has gained much focus recently. Named for their deliberate similarity to the human brain, these methods in essence simulate densely interconnected neurons in order to recognize patterns, extract features or ‘learn’ things about the data, so as to predict a result (see (10)) (Fig. 2). Many current AI healthcare models use the neural network paradigm (13,14). As with classification models, the term ‘neural network’ is somewhat all-encompassing, covering a variety of different model training paradigms: scaled conjugate gradient, Levenberg-Marquardt and Bayesian regularization, to name but a few (4,8).
Computer vision refers to the field of using computer systems to process images or videos, and how a computer might gain information or some form of understanding from them. Indeed, a core focus of the AI-healthcare drive pertains to artificial intelligence in computer vision: radiology. What has historically been a specialty of humans undertaking pattern recognition in images to predict a response, for example identifying polyps on a CT colonography, is presently a hot topic of debate as to how AI might be used to improve accuracy and results (15). Furthermore, many computer vision approaches utilize neural networks (10,16,17). Moreover, a key headline from the 2017 meeting of the Radiological Society of North America (RSNA) was that “Rads [Radiologists] who use AI will replace rads who don’t” (18). Could the stakes be similar for the gastroenterologist?
TOWARDS APPLYING ARTIFICIAL INTELLIGENCE TO HEALTHCARE
Having clarified some of the basic terminology, we shall now overview some exciting prospects, or ‘promises’, that the advent of machine learning could spell for gastroenterology practice. Similarly, we reason through foreseeable limitations, or ‘pitfalls’. For clarity, these concepts of AI in gastroenterology are suggestions: examples intended to illustrate a forward-thinking process that might improve gastroenterology practice, and to encourage gastroenterologists to ‘think outside the box’ about how artificial intelligence approaches might be used to advance healthcare.
Personalized medicine, for real this time
The idea of ‘personalized medicine’, whereby one moves away from the ‘one size fits all’ approach to healthcare, began to gain momentum approximately a decade ago, largely with the advances in genomics. To date, it remains a key ‘buzzword’, although it could be argued that it has remained little more than that, still awaiting implementation on the ground for many common diseases. Yet, could the technological advances in artificial intelligence yield genuine personalization of gastroenterology healthcare provision?
A pillar of medical practice is the use of treatment pathways or algorithms, determined by a panel of experts, which should be followed for a given condition, regardless of the patient in question. For example, in upper gastrointestinal (GI) bleeds clinicians follow a sequence of management steps, X, Y, Z, determined to be ‘best practice’ by the literature and experts in the GI-bleed field. But could it be that this is not always best practice for all patients? Considering the management of upper GI bleeds, one point of controversy is the use of proton pump inhibitors (PPI) prior to endoscopy (19). With the plethora of upper GI bleed research undertaken over recent decades, there are considerable data from clinical trials (be they demographic, comorbidity-related, GI-bleed scoring systems, or otherwise), both before intervention and patient outcome-related. These data serve as a perfect example of stratifying into ‘inputs’, the data acquired from a patient having a GI bleed, and ‘target’, their outcome, depending on whether they received pre-endoscopy PPI or not. Given that contemporaneous research has failed to conclusively identify whether pre-endoscopic PPI is of benefit (19), it is possible that it could be for some patients, but not all. Indeed, perhaps a neural network-based approach could identify the individuals who would or would not benefit, hence providing a personalized medicine approach?
Another example of personalizing gastroenterology healthcare could be the choice of treatment for a given disorder. Whilst disease-specific treatment algorithms are commonplace, the choice of which therapy to give is often not ‘black or white’ but more ‘gray’, often confounded by the idea that some patients would benefit from one drug over another, and vice versa. Adding to that is the proposition that a response to a treatment can be quantified, be that improvement in quality of life, parameters of inflammation, pain relief, etc. Indeed, the concept of quality-adjusted life years (QALY) in health and economic outcomes is a central aspect of healthcare governance, certainly in the United Kingdom (20). One example is that of irritable bowel syndrome, an enigmatic disorder which has been reclassified and its treatment pathways altered seemingly endlessly (21,22). Consider the choice of whether to prescribe X or Y: it might not be clear cut, and can seem more like going with one’s ‘gut feeling’ than evidence-based medicine. However, similar to our aforementioned example in upper GI bleed management, these treatments have largely all gone through rigorous clinical trials, and thus there will be data pertaining to the characteristics of a patient and their condition, along with the improvement (if any) gained from therapy. Moreover, most commonly a treatment outcome will not be binary (improved or did not improve), but rather quantified on some form of linear scale (‘on a scale of 1–10, how much improvement did you feel?’). This is where AI to personalize practice becomes an exciting prospect. Whilst machine-learning models can indeed be trained to predict binary classes, they can also fit linear-scale data (such as an improvement percentage following a treatment); one such example being neural network ‘fitting’ algorithms.
This raises the exciting possibility of whether quantitative therapeutic benefit from a given treatment could be predicted for a given individual, rather than the generic quality-adjusted life year ‘one size fits all’ approach. This becomes even more appetizing when training an algorithm to predict treatment response to more than one drug. Referring back to the treatment dilemma of choosing X or Y, could machine-learning algorithms predict percentage improvement for multiple therapies, enabling the gastroenterologist sitting in outpatient clinics to make ‘data-driven’ decisions at the individual level, as opposed to blanket treatment with the exact same therapy, regardless of how disparately patients might respond? (Fig. 3).
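A minimal sketch of this idea, with entirely synthetic ‘trial’ data and plain linear regression standing in for the neural network ‘fitting’ algorithms mentioned above. Two models each predict the percentage improvement on a hypothetical drug X or Y, and the larger predicted benefit guides the choice.

```python
# Hedged sketch of data-driven treatment choice: two regression models
# predicting percentage improvement on hypothetical drugs X and Y,
# fitted to SYNTHETIC trial-like data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 400
# Synthetic patient characteristics (e.g. symptom scores, demographics)
X = rng.normal(size=(n, 3))

# Fabricated outcomes: patients with high feature 0 respond to drug X,
# while those with high feature 1 respond to drug Y
improve_x = 50 + 20 * X[:, 0] + rng.normal(0, 5, n)
improve_y = 50 + 20 * X[:, 1] + rng.normal(0, 5, n)

model_x = LinearRegression().fit(X, improve_x)
model_y = LinearRegression().fit(X, improve_y)

# For a new patient, predict improvement on each therapy and choose
# the one with the larger predicted benefit
patient = np.array([[1.5, -0.5, 0.0]])
pred_x = model_x.predict(patient)[0]
pred_y = model_y.predict(patient)[0]
choice = "X" if pred_x > pred_y else "Y"
print(choice, round(pred_x, 1), round(pred_y, 1))
```

For this fabricated patient the model favors drug X; the point is simply that a quantitative, per-patient comparison across therapies is computable once per-treatment outcome data exist.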
Computer Vision Endoscopy
A number of prospective healthcare technological advances revolve around computer vision. Consider how much data are visually processed by the clinician to make an informed choice on diagnosis or treatment, be it radiological or endoscopic. Radiology is well beyond the scope of this article, but computer vision in endoscopy is a highly relevant and likely feasible option in AI-assisted healthcare. Consider polyp surveillance in colonoscopy: in surveying the bowel wall for polyps, ‘real-time’ video is depicted on a screen for the gastroenterologist to interpret, who already faces the difficulty of navigating the mobilized colon through a two-dimensional representation of what is in fact three-dimensional. Add to this a patient in discomfort or pain, or suboptimal bowel preparation, and a polyp can be missed. Could computer vision-aided colonoscopy improve detection rates? This might function much as if a second gastroenterologist were watching the video feed, pointing out to the endoscopist any areas of concern (Fig. 4). Indeed, these prospective ideas are shared by other groups, and there are a number of promising recent studies developing machine-learning models for polyp detection (17,23–25), or even a step further: differentiating between adenomatous and hyperplastic polyps (16). Referring to two studies in particular, Hirasawa et al. developed a neural network to predict gastric cancer, utilizing a ‘training’ set of more than 13,000 endoscopic images, and reported an overall sensitivity of 92% when ‘testing’ it on a further 2,200 images (17). Meanwhile, Byrne et al. built a neural network using a training set of 223 videos (split into >60,000 image frames) and, when tested on a further 125 endoscopy videos, reported a 94% accuracy rate in differentiating diminutive adenomas from hyperplastic polyps (16). A key learning point is the large sample size required to produce robust algorithms.
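As a toy illustration of the low-level feature extraction that underlies such computer vision models, the sketch below convolves a synthetic ‘frame’ with a hand-coded edge-detecting filter. Real polyp detectors learn many thousands of such filters automatically within a deep neural network rather than relying on a single hand-built one.

```python
# Toy illustration of convolutional feature extraction: an
# edge-detecting filter applied to a synthetic 'image'.
import numpy as np

# Synthetic 8x8 grayscale frame: dark left half, bright right half
frame = np.zeros((8, 8))
frame[:, 4:] = 1.0

# Hand-coded vertical-edge filter (a Sobel-like kernel)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

def convolve2d(img, k):
    """Valid-mode 2D convolution (strictly, cross-correlation,
    as used in convolutional neural networks)."""
    kh, kw = k.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = convolve2d(frame, kernel)
# The filter responds strongly only where the dark/bright boundary lies
print(feature_map)
```

The resulting feature map is zero everywhere except the columns straddling the boundary, the kind of primitive response a network composes, layer upon layer, into detectors for complex structures such as polyps.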
Solving Diagnostic Dilemma
Another opportunity that machine learning provides is not just in predicting a result (e.g., the presence of a polyp), but also in closing existing knowledge gaps in incompletely understood GI disorders. All the examples thus far rely on the concept of ‘ground truth’, which refers to a clinician having objectively provided the correct answer for a machine to build a predictive model upon, e.g., these patients categorically have IBS, while those others have IBD. This approach is referred to as ‘supervised learning’, as we instruct a machine to base its model on the given ground truth. But what if there is no ground truth; what if one isn’t sure? Perhaps this dilemma can be solved by ‘unsupervised learning’, a process whereby a machine will try to classify or label data without ground truth, e.g., here are some patients: classify them into groups depending on the data provided, without specifying what groups are expected. This approach could improve gastroenterology practice for disorders in which a homogeneous ‘label’ is assigned to what is clearly a heterogeneous disease population, from relatively simple examples such as constipation through to functional dyspepsia.
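A brief sketch of unsupervised learning in this spirit: k-means clustering proposes groupings among synthetic ‘patients’ without ever being given diagnostic labels. The two hidden sub-populations below are fabricated for illustration.

```python
# Hedged sketch of unsupervised learning: k-means clustering discovers
# sub-groups in synthetic 'patient' data with no ground-truth labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two fabricated sub-populations hidden inside one 'diagnostic label',
# e.g. differing symptom-score profiles
group_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
group_b = rng.normal(loc=3.0, scale=0.5, size=(100, 2))
patients = np.vstack([group_a, group_b])

# No labels are supplied; the algorithm proposes its own groupings
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patients)
labels = km.labels_
print(np.bincount(labels))
```

Here the algorithm recovers the two hidden sub-populations almost exactly; with real patients the discovered clusters would themselves be hypotheses requiring clinical interpretation and validation.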
Having illustrated how AI approaches might change gastroenterology practice for the better, it surely can’t be all ‘pots of gold at the end of the rainbow’. Indeed, significant pitfalls do exist, which are vital to appreciate.
Thousands of algorithms to predict thousands of diseases vs. one algorithm to predict a thousand diseases?
As we have described throughout this review, there is enormous scope for AI in predicting a huge variety of diseases, treatment responses or otherwise. Just like that, we may have a thousand algorithms to predict a thousand different diseases. Indeed, many AI-industry leaders have proposed that an ‘AI marketplace’ will become commonplace for these algorithms to be bought by healthcare providers; notably, one such AI-app store has already gone ‘live’ in recent months for radiology (26). While this may seem a hugely exciting prospect, a significant pitfall exists here, the core of which is the differential diagnosis.
To explain, the differential diagnosis forms a pillar of clinical practice, and gastroenterology is no different. Indeed, if a gastroenterologist were to see a patient, they may conclude with a diagnostic ‘impression’ and a series of differential diagnoses. So, in this instance, the gastroenterologist needs an additional test or guide to confirm the diagnosis or determine which of the differentials is the patient’s diagnosis, their ground truth. This is where clinicians could benefit from AI technology enormously. Clinically, the gastroenterologist almost certainly does not say that ‘it’s either this disorder, or they are completely healthy’. Yet, this is how the vast majority of machine-learning models have been (and continue to be) developed, whereby the system is provided a set of healthy controls and a singular disease to compare to. With this in mind, it becomes questionable how clinically useful many of these models actually might be. Rather, models need to be developed that solve a difficult differential more effectively than a human could.
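To illustrate the distinction, the sketch below trains a single classifier across several candidate diagnoses, so its output is a probability distribution over a differential rather than a single disease-versus-healthy score. The diagnoses and features are synthetic placeholders, not clinically derived.

```python
# Sketch of a multi-class model spanning a differential diagnosis,
# rather than the usual one-disease-vs-healthy binary design.
# Diagnoses and features are entirely synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
diagnoses = ["IBS", "IBD", "celiac"]   # hypothetical differential
n_per = 150

# Each diagnosis gets its own fabricated feature signature
X = np.vstack([rng.normal(loc=i * 2.0, scale=0.7, size=(n_per, 4))
               for i in range(len(diagnoses))])
y = np.repeat(np.arange(len(diagnoses)), n_per)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The output is a probability over the WHOLE differential, not a
# single disease-versus-healthy score
probs = clf.predict_proba(X[:1])[0]
print(dict(zip(diagnoses, probs.round(2))))
```

A model of this shape answers the question the clinician actually asks, which of the differentials best fits this patient, rather than merely how this patient compares with healthy controls.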
Artificial intelligence vs. artificial humanity
The view that ‘robots’ might replace a workforce is not uncommon, and indeed forms many a headline in contemporaneous news, not to mention the topic of Hollywood films. But whilst this may be a genuine concern for some occupations, we would argue that the physician is not one of them… for now at least. A principal reason for this is that creating artificial intelligence is one thing, but artificial humanity is something else entirely. Teaching a computer to spot a polyp during endoscopy may be ‘easy’, but teaching it to behave like a compassionate human being, recognizing the patient’s discomfort, requirements for rest, analgesia/sedation, a hand to hold, empathy, is a different entity. Or consider the concept of personalized medicine and AI-assisted treatment, as discussed above. An AI algorithm may decide that a patient’s ‘best chances’ are with treatment X rather than Y, but a lack of humanity would prevent it from appreciating the ideas and concerns of the patient, whose treatment preferences, perhaps determined by symptoms or side effects rather than the quantitative outcome the model was trained on, might be vastly different. With this, it is important to appreciate that whilst AI could revolutionize healthcare, it would do so through AI-assisted practice, not AI-driven practice.
Machine Learning Stewardship
As we have described, options for training a machine-learning model include supervised and unsupervised learning, depending on whether ‘ground truth’ is provided or available. Further to that is the concept of feature extraction and transformation, whereby if a plethora of variables assigned to a given patient is provided, the model can automatically extract the most important or ‘predictive’ features. Whilst this concept is hugely enticing for researchers, it is important to realize that, while a machine-learning approach may be artificially intelligent, it is not superhuman and is vulnerable to garbage or misleading data. For example, say we were to develop a classification model to predict the presence or absence of a diagnosis of irritable bowel syndrome, using demographic, psychophysiological and symptom-questionnaire data as our inputs. Perhaps we would opt for the ‘lazy’ approach: provide a neural network with all the data and develop an algorithm to predict the response. In this instance, however, we would not necessarily appreciate the features and weights the model has assigned to these factors. Taking gender as an example, research broadly appears to illustrate that a diagnosis of IBS is more common in women (27). So, when applying the algorithm to prospective male patients, their ‘being male’ would significantly reduce the probability of being diagnosed with IBS, regardless of whether it were ground truth or not. Similarly, were the model trained on female participants only, it is not out of the question that it would then only correctly diagnose female participants. With this in mind, it is important to appreciate that, whilst the machine has the ability to learn patterns in the data that a clinician/academic might not appreciate, our stewardship in teaching it has a role to play.
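The stewardship point can be made concrete: after training, one should audit what the model actually relies upon. In the synthetic cohort below, a ‘sex’ feature is deliberately correlated with the label, and the model’s feature importances duly reveal its reliance on it.

```python
# Sketch of model stewardship: auditing feature importances after
# training. The cohort is SYNTHETIC, with 'sex' deliberately
# correlated with the label to mimic a demographic confound.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n = 1000
symptom_score = rng.normal(size=n)
sex = rng.integers(0, 2, size=n)   # 0 = male, 1 = female (fabricated)

# Fabricated labels: driven by symptoms, but also inflated among
# 'female' records, as in the IBS prevalence example above
y = ((symptom_score > 0) |
     ((sex == 1) & (rng.random(n) < 0.6))).astype(int)

X = np.column_stack([symptom_score, sex])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ reveals how heavily the model leans on sex
print(dict(zip(["symptom_score", "sex"],
               clf.feature_importances_.round(2))))
```

An audit like this does not fix the bias, but it makes the model’s dependence on a demographic feature visible, so the steward can decide whether that dependence is clinically defensible.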
CONCLUSIONS & FUTURE DIRECTIONS
The future of AI-assisted technologies in gastroenterology is an exciting prospect, with a huge number of prospective applications to better personalize healthcare. These range from personalizing medicine and predicting treatment outcomes, through computer vision approaches in imaging/endoscopy, to better understanding incompletely understood conditions through unsupervised clustering. However, while these concepts have much potential, it is important to appreciate the significant limitations that could hamper AI-assisted healthcare development, including the verbose approach of developing a vast number of different algorithms, each predicting a different diagnosis, as opposed to a genuinely useful algorithm that could assist in a differential diagnostic dilemma. In addition, how promising studies using this technology might translate to clinical practice is not yet clear. Will hospitals, governments, even individual physicians, ‘purchase’ or ‘license’ the use of these algorithms? Furthermore, who will accurately mark the delineation between AI-assisted practice and AI-driven practice, the latter being where we frankly rely on this technology and forget the clinical reasoning we all spent years training for? Legally, who would take responsibility for the patients incorrectly diagnosed by computerized diagnostics, the cancer that is missed, for instance? While technological advances have made training powerful computational models a feasible task, our inability to teach artificial humanity highlights the necessity of adopting an AI-assisted practice, as opposed to an AI-driven one.
CONFLICTS OF INTEREST
Guarantor of the article: Dr James K Ruffle.
Specific author contributions: James K Ruffle—planned and wrote the manuscript. Adam D Farmer—planned and revised the manuscript. Qasim Aziz—planned and revised the manuscript. All authors (James K Ruffle, Adam D Farmer, Qasim Aziz) approved the final draft submitted.
Financial support: None.
Potential competing interests: None.