
The Reliability and Validity of the Upper Lip Bite Test Compared with the Mallampati Classification to Predict Difficult Laryngoscopy: An External Prospective Evaluation

Eberhart, Leopold H. J. MD; Arndt, Christian MD; Cierpka, Thomas MD; Schwanekamp, Judith MD; Wulf, Hinnerk MD; Putzke, Caroline MD

doi: 10.1213/01.ANE.0000154535.33429.36
General Articles: Research Report

Recently, a new bedside screening test to predict the occurrence of a difficult laryngoscopy has been developed as a substitute for the Mallampati classification. The Upper-Lip-Bite test (ULBT) evaluates the patient’s ability to reach or completely cover the upper lip with the lower incisors. It is widely accepted that new predictive tools should undergo an external evaluation before they are used in clinical practice. Thus, we evaluated this test with respect to applicability, interobserver reliability, and discriminating power and compared it with the Mallampati score (using Samsoon and Young’s modification). The ULBT could not be applied in 12% of all patients (Mallampati score, <1%). However, the interobserver reliability was better for the ULBT (κ = 0.79 versus κ = 0.59). The discriminating power to predict a patient with difficult laryngoscopy was evaluated in 1425 consecutive patients. Both tests were assessed simultaneously in these patients by two specially trained independent observers. After the induction of anesthesia, the laryngoscopic view was assessed by the attending anesthesiologist using the classification of Cormack and Lehane. Grades I and II were called easy laryngoscopy and grades III and IV difficult laryngoscopy. The discriminating power for both tests was low (0.60 for the ULBT [95% confidence interval, 0.57–0.63] and 0.66 for the Mallampati score [95% confidence interval, 0.63–0.69]), indicating that both tests are poor predictors as single screening tests.

IMPLICATIONS: The Upper-Lip-Bite test (ULBT) is a bedside screening test to detect difficult laryngoscopy, which was evaluated and compared with the Mallampati score in 1425 patients. The ULBT did not sufficiently discriminate between patients presenting with easy and difficult laryngoscopy and was less sensitive than the Mallampati score.

Department of Anesthesiology and Critical Care Medicine, Philipps-University Marburg, Germany

Accepted for publication December 13, 2004.

Address correspondence and reprint requests to Leopold Eberhart, Department of Anesthesiology and Critical Care Medicine, Philipps-University Marburg, Baldingerstrasse 1, 35043 Marburg, Germany. Address e-mail to

Recently, a new simple bedside test to predict difficult laryngoscopy, defined as Cormack-Lehane classification grade 3 and 4, was found to be superior to the Mallampati classification with respect to positive predictive value and specificity and thus accuracy (1) (the definitions of all statistical terms are listed in the Appendix). This new test, the Upper-Lip-Bite test (ULBT), evaluates a patient’s ability to cover the mucosa of the upper lip with the lower incisors. Grade I (the lower incisors can fully cover the upper lip’s mucosa) and grade II (the lower incisors can touch the upper lip but cannot fully cover the mucosa) are thought to predict easy laryngoscopy, whereas grade III (the lower incisors fail to bite the upper lip) was found to be associated with difficult laryngoscopy. The authors of the new test propose that their simple method assesses mandibular subluxation and the presence of buckteeth at the same time.

The aim of this study was to assess the clinical usefulness of this new bedside screening test. It was compared with another simple test that is frequently used in our department, the Mallampati classification (2), revised by Samsoon and Young (3). Clinical usefulness was rated according to three major criteria. First, the ease of applicability in daily practice was evaluated (e.g., the number of patients in whom the test cannot be assessed). Second, the interobserver reliability (the degree to which two independent observers record the same ULBT or Mallampati score in the same circumstances) was judged. Third, the predictive properties (discriminating power) were determined for both methods and compared with each other.



The study was approved by an institutional ethics committee, and informed consent was obtained from each patient screened for eligibility to participate in this prospective observational trial. Consecutive patients undergoing a broad variety of surgical procedures requiring elective (nonrapid sequence induction) endotracheal intubation were included. Because we tested the clinical usefulness of the ULBT, there were no exclusion criteria. Biometric data of the patients are shown in Table 1.

Table 1


The applicability of the two scores was rated as the relative incidence of patients in whom the tests could not be applied for any reason. The statistical testing for this criterion was performed using the χ2-test. The interobserver reliability was judged using independent assessments of 45 patients by 2 observers. The κ-coefficient was calculated as a measure of reliability and interpreted as moderate (0.40 < κ ≤ 0.60) or good (0.60 < κ ≤ 0.80). The differences between the two tests were analyzed using the McNemar test.
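The κ-coefficient and the two interpretation bands named above can be sketched in a few lines. This is a minimal illustration only: it assumes an unweighted Cohen's κ (the text does not say whether a weighted variant was used), and the function names and the ten example ratings are hypothetical, not study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two observers rating the same patients."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of patients")
    n = len(ratings_a)
    # Observed agreement: fraction of patients given the same class by both observers.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the two observers' marginal class frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both observers use a single class")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Apply only the two bands given in the text."""
    if 0.40 < kappa <= 0.60:
        return "moderate"
    if 0.60 < kappa <= 0.80:
        return "good"
    return "outside the bands named in the text"


# Hypothetical ULBT classes (1-3) recorded by two observers for ten patients.
observer_1 = [1, 1, 1, 2, 2, 2, 3, 3, 1, 2]
observer_2 = [1, 1, 1, 2, 2, 3, 3, 3, 1, 1]
kappa = cohen_kappa(observer_1, observer_2)
print(round(kappa, 2), interpret_kappa(kappa))
```

With these made-up ratings the observers agree on 8 of 10 patients, but the chance-corrected value is lower than 0.8 because some agreement is expected by chance alone.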

To evaluate the discriminating power of the two tests, both the ULBT and the Mallampati score were obtained simultaneously from the patients by two specially trained observers not involved in the clinical care of the patients. The results of this evaluation were not available for clinicians performing the endotracheal intubation. To allow an optimal level of comparison between the study by Khan et al. (1) and our results, we used identical definitions and gradings of the papers describing the two predictive tests (1,3). For the Mallampati classification, the revised scoring system of Samsoon and Young (3) was used. Classification of the oropharyngeal view was assessed with the patient in the sitting position with the tongue fully protruded but without saying “ahh.” It was rated class I if the soft palate, fauces, uvula, and pillars could be seen, class II if soft palate, fauces, and uvula could be seen, class III if only the soft palate and base of the uvula could be seen, and class IV if the soft palate was not visible.

The ULBT was rated as class I if the lower incisors could bite the upper lip above the vermilion line, class II if the lower incisors could bite the upper lip below the vermilion line, and class III if the lower incisors could not bite the upper lip.

In turn, ease or difficulty of laryngoscopy was assessed while the patient was fully anesthetized. General anesthesia was induced according to our local standard operating procedures using propofol (mean dose, 2.3 mg/kg), fentanyl (mean dose, 3.0 μg/kg), and rocuronium (mean dose, 0.56 mg/kg), administered after mask ventilation. The laryngoscopic view of the first attempt was rated with the patient in the sniffing position but without applying external laryngeal pressure according to the Cormack-Lehane classification (4). A grade I (glottis fully exposed) and II (glottis partially exposed with anterior commissure not seen) was rated as an easy laryngoscopy. A Cormack-Lehane classification of III (only epiglottis seen) or IV (epiglottis not seen) was rated as difficult laryngoscopy. Twenty-five anesthesiologists with varying levels of clinical experience (at least 1 yr of continuous clinical training with >500 endotracheal intubations; range, 1–35 yr; Table 1) performed the intubations using a Macintosh laryngoscope with a blade 3 or 4 depending on personal preferences.

Using these clinical data (Mallampati score, the ULBT score, and the Cormack-Lehane classification) recorded for each patient, several measures were calculated that have been frequently used to describe the predictive properties of a scoring system and were used in the previous study of the ULBT test (1). A list of these measures is provided in the Appendix together with a short description and instructions on how to perform the calculations. Most of them can be easily calculated using the data of a 2 × 2 table. However, additional statistics were added to this list because they provide valuable additional information to appraise the predictive models.
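As a rough sketch of the measures that follow from a 2 × 2 table, the snippet below uses the standard textbook definitions, which may differ in detail from those in the Appendix. The function name and the cell counts are hypothetical and are not taken from Table 2.

```python
def two_by_two_measures(tp, fp, fn, tn):
    """Predictive measures from a 2 x 2 table.

    tp: test positive, laryngoscopy difficult
    fp: test positive, laryngoscopy easy
    fn: test negative, laryngoscopy difficult
    tn: test negative, laryngoscopy easy
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        # Likelihood ratio of a positive result: sensitivity / (1 - specificity).
        "lr_positive": sensitivity / (1.0 - specificity),
    }


# Hypothetical counts for 1000 patients, 100 of them with difficult laryngoscopy.
measures = two_by_two_measures(tp=30, fp=70, fn=70, tn=830)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the accuracy is 86% although only 30% of the difficult cases are detected, which is one reason additional statistics are worth reporting.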

One of them, the area under a receiver-operating characteristic (ROC) curve (AUC), was used as the main end-point of the study to decide whether or not the score was clinically useful. The ROC curve is constructed by plotting the true-positive rate (sensitivity) against the false-positive rate (1 − specificity) of the test. The AUC represents the probability that a random pair of test results will be ranked correctly as to their disease state (5). A ROC curve lying on the 45-degree diagonal corresponds to a score that predicts no better than a random guess; the AUC of such a random score is 0.5. Usually, tests with an AUC value of 0.7 and higher are accepted to be of clinical relevance. Differences between the AUC values of both predictive tests were calculated, and a P value of less than 0.05 was defined as statistically significant. ROC measurements were recalculated using the data given in the original study of Khan et al. (1), and values were compared descriptively using the 95% confidence limits.
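The pair-ranking interpretation of the AUC can be made concrete with a small sketch. It assumes that a higher test class predicts difficult laryngoscopy and that tied pairs count as one half, which is the usual convention; the function name and the example scores are hypothetical. The study itself used MedCalc for its ROC analyses, so this is not the authors' computation.

```python
def auc_by_pair_ranking(scores_difficult, scores_easy):
    """AUC as the fraction of (difficult, easy) patient pairs ranked correctly.

    A pair is ranked correctly when the patient with difficult laryngoscopy
    has the higher test score; ties contribute one half.
    """
    if not scores_difficult or not scores_easy:
        raise ValueError("both groups must contain at least one patient")
    correct = 0.0
    for d in scores_difficult:
        for e in scores_easy:
            if d > e:
                correct += 1.0
            elif d == e:
                correct += 0.5
    return correct / (len(scores_difficult) * len(scores_easy))


# Hypothetical ULBT classes (1-3) for four difficult and six easy laryngoscopies.
difficult = [3, 2, 2, 1]
easy = [1, 1, 2, 1, 2, 1]
print(auc_by_pair_ranking(difficult, easy))  # 18 of 24 pairs -> 0.75

# A score that gives every patient the same class cannot rank any pair.
print(auc_by_pair_ranking([1, 1], [1, 1, 1]))  # 0.5
```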

A prospective power analysis showed that, assuming an incidence of difficult laryngoscopy of 5%, 1100 patients (55 diseased patients and 1045 control patients) provide a power of more than 97% to detect an absolute improvement in discriminating power (measured by the AUC of the appropriate ROC curve) of 0.15 (e.g., from 0.50 to 0.65) with a type I error of 5% and a two-sided alternative hypothesis. Power analysis was performed using PASS 2002 (Number Cruncher Statistical Systems, Kaysville, UT). ROC analyses were performed using the software MedCalc 4.15 (MedCalc Software, Mariakerke, Belgium). All other calculations were performed using the JMP 5.1 software (SAS Inc, Cary, NC) and Excel 2002 for Windows (Microsoft Inc, Redmond, WA).



A total of 1425 patients were screened for eligibility. Of these, 162 patients (11%) were excluded because they were edentulous, and the ULBT is not designed for these patients. In another 15 patients, the ULBT (n = 9) or the Mallampati status (n = 6) could not be obtained for various reasons. This means that in 12% of patients of an unselected population, the ULBT cannot be used in our institution, whereas this was only the case in 0.4% for the Mallampati score (P < 0.0001).

However, interobserver reliability was better for the ULBT. The κ-coefficient was moderate (κ = 0.59) for the Mallampati score but good for the ULBT (κ = 0.79; P < 0.01).

For the analysis of the predictive properties of both scores, a further 141 patients had to be excluded because no endotracheal intubation was performed (n = 33) or because intubation was performed by medical students or anesthesiologists with less than 1 yr of clinical experience (n = 108). Thus, 1107 patients were included in this analysis.

Most patients (n = 685; 62%) showed a ULBT grade I (lower incisors can fully cover the upper lip), and another 336 patients (30%) had a ULBT grade II. These patients were grouped together and predicted to have an easy laryngoscopy. Only 86 patients had a ULBT grade III, predicting a difficult laryngoscopy.

During the first laryngoscopic attempt, 131 patients (12%) presented an initial Cormack-Lehane grade of 3 or 4. Of these patients with a difficult laryngoscopic view, 57 patients could be successfully tracheally intubated at the first attempt by applying external pressure to the larynx (BURP-maneuver (6)). In another 55 patients, conventional endotracheal intubation was successful after a mean number of 2.5 attempts (range, 2–6). In 19 of the patients presenting with a difficult laryngoscopic view, one (in 17 patients) or more (in 2 patients) alternative techniques were required to intubate the trachea. In most cases, an endotracheal tube was placed via a Fast-Trach® laryngeal mask airway (ACLS Recert, San Antonio, TX) or via a fiberoptic device. In three patients, hypoxemia (Sao2 <85%) occurred during intubation, but there was no case with postoperative morbidity or mortality associated with difficult intubation.
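The patient counts reported in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the variable names and the order in which the exclusions are subtracted are our reading of the description, not something the authors spell out.

```python
# Counts as stated in the text.
screened = 1425
edentulous = 162
ulbt_not_obtained = 9
mallampati_not_obtained = 6
not_intubated = 33
inexperienced_intubator = 108

# Patients in whom each test could not be applied.
ulbt_not_applicable = edentulous + ulbt_not_obtained  # 171
pct_ulbt_not_applicable = 100 * ulbt_not_applicable / screened
pct_mallampati_not_applicable = 100 * mallampati_not_obtained / screened

# Patients left for the analysis of predictive properties.
analysed = (screened - edentulous - ulbt_not_obtained - mallampati_not_obtained
            - not_intubated - inexperienced_intubator)

# ULBT grades I, II, III and the outcome of the difficult laryngoscopies.
ulbt_grades = {"I": 685, "II": 336, "III": 86}
difficult = 131
difficult_breakdown = {"BURP": 57, "repeated attempts": 55, "alternative technique": 19}

print(analysed)  # 1107
print(round(pct_ulbt_not_applicable), round(pct_mallampati_not_applicable, 1))  # 12 0.4
print(round(100 * difficult / analysed))  # 12
```

The ULBT grades add up to the 1107 analysed patients, and the three outcome groups add up to the 131 patients with an initial Cormack-Lehane grade of 3 or 4.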

The measures used to describe the predictive properties of the two tests are shown in Table 2. In our study, neither test predicted as well as described in the study by Khan et al. (1). The most striking difference is the considerably lower sensitivity of the ULBT in our trial: only 28% compared with 77% in the original trial. This means that many patients who will present with a difficult laryngoscopy will not be identified by the ULBT (a large number of patients with a false-negative test).

Table 2


The main end-point of this study, the AUC of the ROC, was lower for the ULBT (AUC = 0.60; 95% confidence interval, 0.57–0.63) than for the Mallampati score (AUC = 0.66; 95% confidence interval, 0.63–0.69), and the difference between the two ROC curves was statistically significant (P = 0.009). For both tests, the discriminating power was less than in the initial study of Khan et al. (1) and did not reach a level that is usually accepted to be clinically relevant.



Unanticipated difficult tracheal intubation is a major concern of anesthesiologists. Thus, the search continues for a predictive test that combines ease of applicability, reliability, and accuracy of prediction (discriminating power). The ULBT seems to meet all of these criteria. It is easy to perform within seconds as a bedside test because it requires no equipment. The instructions for both the observer and the patient are extremely simple, so misinterpretation is less likely than with the Mallampati test, which may be performed in different ways (e.g., with or without phonation). Increased interobserver reliability compared with the Mallampati score may be another major advantage of the ULBT. Interobserver reliability of tests used to predict difficult intubation (physician-physician (7) as well as physician-nurse (8)) is limited. For the Mallampati test, reliability does not exceed a κ coefficient (a chance-corrected measure of agreement ranging from −1 to +1) of 0.6. During a prestudy evaluation period, we investigated the interobserver variability of the ULBT ratings of different anesthesiologists in 45 patients. Here, the κ-coefficient was considerably higher for the ULBT (κ = 0.79) than for the Mallampati score (κ = 0.59).

Finally, the accuracy of prediction was high in the original report describing the ULBT. Although the differences were not always statistically significant because of the small number of patients, the ULBT predicted considerably better than the Mallampati score in Khan et al.’s (1) study. Recalculations of these data revealed that the AUC of the ROC was 0.83 for the ULBT, whereas the AUC for the corresponding Mallampati score was only 0.75. We prefer this measure of the discriminating power of a predictive test because it is independent of the incidence of difficult intubation. Other measures used to describe the accuracy of prediction are prone to be biased by differing incidences of the event that is predicted. For example, if an event is extremely rare, then a test forecasting that it will appear in no patient can have an accuracy (ratio of the number of correctly predicted patients compared with all patients) near 100%. The AUC of the ROC in this case, however, will not differ from 0.5, which is the pre-hoc probability (random guess) of correctly predicting the disease state of a sample pair of patients with one patient having the disease and the other one not. Furthermore, the likelihood ratio for a positive test result (LR+) may be a useful measure to judge the usefulness of a predictive tool in daily practice (9). This measure indicates how many times more likely a positive test result is in a patient who will present with a difficult intubation than in a patient who will not. The LR+ was 6.8 for the ULBT, whereas it was only 2.5 for the Mallampati score in the original paper of Khan et al. (1). However, an essential step to establish the usefulness of the new ULBT has not been performed: an external validation of the tool, which is indispensable before a score can be recommended for routine use (10,11).
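The rare-event argument in this paragraph can be illustrated with a toy calculation. The sketch assumes a hypothetical cohort of 1000 patients with 10 difficult intubations and a "test" that predicts easy intubation for everyone; all names and numbers are made up for illustration and are not study data.

```python
def accuracy(predicted, actual):
    """Fraction of patients whose predicted state matches the actual state."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def pairwise_auc(scores, actual):
    """Probability that a random (difficult, easy) pair is ranked correctly; ties count one half."""
    difficult = [s for s, a in zip(scores, actual) if a == 1]
    easy = [s for s, a in zip(scores, actual) if a == 0]
    correct = sum(1.0 if d > e else 0.5 if d == e else 0.0
                  for d in difficult for e in easy)
    return correct / (len(difficult) * len(easy))


# Hypothetical cohort: 1 = difficult intubation (rare), 0 = easy intubation.
actual = [1] * 10 + [0] * 990

# A "test" that forecasts a difficult intubation in no patient.
always_easy = [0] * 1000

print(accuracy(always_easy, actual))      # 0.99
print(pairwise_auc(always_easy, actual))  # 0.5
```

The accuracy looks excellent only because the event is rare, whereas the AUC shows that the test cannot separate the two groups at all.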

In our trial, we could not reproduce the high predictive properties of the ULBT, and we could only partly reproduce those of the Mallampati classification. As listed in Table 2, all measures that describe the predictive properties of a test were inferior to those reported in the original study of Khan et al. (1). The AUC of the ROC reveals that in our patients, the pre-hoc probability of a correct prediction is improved from 50% (random guess) to 60% when applying the ULBT, an improvement that is far from being of clinical relevance. This disappointing result is mainly caused by the low sensitivity of the score, which derives from the large proportion of false-negative ratings in our trial. A major reason for this failure seems to be a higher incidence of difficult laryngoscopy in our patients. We hypothesized that this was because the variability of clinical experience among the anesthesiologists performing the endotracheal intubation was probably much larger in our study than in Khan et al.’s (1) study. In the latter trial, only two experienced residents in anesthesiology in their last year of clinical education (Dr. Arash Kashfi, personal communication, November 2004) performed the intubations, whereas we restricted the analysis only to patients tracheally intubated by anesthesiologists with at least one year of clinical experience. Thus, 25 colleagues with clinical experience between 1 and 35 years participated. In our view, this heterogeneity is not a drawback of our study but an essential methodological feature in the validation of a clinical tool that was designed as a screening method for daily routine.

At first we speculated that the disappointing results of our validation might be caused by patients whose difficult laryngoscopic view (e.g., a Cormack-Lehane grade III) could easily be converted into a grade II view or better by applying external pressure to the larynx (BURP-maneuver) or other simple techniques, because these patients would normally not have been described as difficult. Thus, an explorative analysis was performed, using the same methodology, to predict difficult laryngoscopy that could not be converted into a lower Cormack-Lehane grade. This approach slightly increased the discriminating power of the ULBT to 63%, but this value is still too low to recommend the test for routine use.

The use of the ULBT in our and any other institution with a considerable number of older patients is further hampered in clinical practice by the problems of evaluating edentulous patients. In our trial, 11% of a series of 1425 consecutive patients had to be excluded because of this criterion. Although a modified version for these patients has been proposed (12), it has not been formally evaluated. However, the importance of airway assessment in edentulous patients may be questioned because these patients rarely pose an intubation problem unless mobility is severely limited by cervical spine disease or radiation-induced scarring. Summarizing our results, both tests are poor predictors for difficult laryngoscopy when used as single preoperative bedside screening tests.



1. Khan ZH, Kashfi A, Ebrahimkhani E. A comparison of the upper lip bite test (a simple new technique) with modified Mallampati classification in predicting difficulty in endotracheal intubation: a prospective blinded study. Anesth Analg 2003;96:595–9.
2. Mallampati SR, Gatt SP, Gugino LD, et al. A clinical sign to predict difficult intubation: a prospective study. Can Anaesth Soc J 1985;32:429–34.
3. Samsoon GLT, Young JRB. Difficult tracheal intubation: a retrospective study. Anaesthesia 1987;42:487–90.
4. Cormack RS, Lehane J. Difficult tracheal intubation in obstetrics. Anaesthesia 1984;39:1105–11.
5. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
6. Knill RL. Difficult laryngoscopy made easy with a “BURP”. Can J Anaesth 1993;40:279–82.
7. Karkouti K, Rose DK, Ferris LE, et al. Inter-observer reliability of ten tests used for predicting difficult intubation. Can J Anaesth 1996;43:554–9.
8. Hilditch WG, Kopka A, Crawford JM, Asbury AJ. Interobserver reliability between a nurse and anaesthetist of tests used for predicting difficult tracheal intubation. Anaesthesia 2004;59:881–4.
9. Calder I. Useless ritual? Anaesthesia 2002;57:612.
10. Yentis SM. Predicting difficult intubation: worthwhile exercise or pointless ritual? Anaesthesia 2002;57:105–9.
11. Smith RL. Observational studies and predictive models. Anesth Analg 1990;70:235–9.
12. Khan ZH, Kashfi A. Evaluating a patient’s airway. Anesth Analg 2003;97:915–6.
© 2005 International Anesthesia Research Society