Do Visual Cues Aid Comprehension of a Dialogue? : The Hearing Journal

Original Research

Keidser, Gitte PhD; With, Simon B.L. MSc; Neher, Tobias PhD; Rotger-Griful, Sergi PhD

The Hearing Journal 76(03):p 22,23,24, March 2023. | DOI: 10.1097/01.HJ.0000922292.15379.9d

It is well established that visual cues improve speech understanding for people with normal hearing as well as those with impaired hearing, especially in more challenging listening environments. 1–4 The bulk of the evidence has been obtained using traditional phoneme, word, or sentence recognition tests carried out under ideal visual and auditory conditions (i.e., the listener had a clear view of the talker’s face and was presented with clearly articulated speech). Although past studies show a consistent, significant benefit of adding visual cues to speech recognition tests, it has been suggested that the mechanisms enabling the integration of auditory and visual information differ between speech materials, and that visual gain (VG) diminishes as the material becomes more complex. 5,6

Figure 1: Schematic overview of the test setup. Green and red loudspeaker symbols denote the location of the speech and noise signals, respectively.
Figure 2: The relationship between audio-visual and audio-only correct scores for each test material and SNR.
Figure 3: Summary of the visual gain data obtained for each combination of test and SNR. The colored dots show individual data points, and the colored lines show mean values.

In the real world, we often communicate in situations where speech, talker position, and environment vary in a highly dynamic manner, and in which many objects and sound sources may compete for our visual and auditory attention, potentially compromising optimal audio-visual (AV) integration. Participating in real-life conversations also increases the demand on our cognitive resources in comparison with the demand required to simply recognize speech 7, as we need to engage higher-level speech processing skills, such as making sense of what has been heard and formulating a response while still listening. Therefore, one could speculate that the higher level of complexity of real-world listening environments and communication might reduce VG relative to what is measured with traditional speech tests.

Here, we describe a study that compared the VG obtained using a novel speech comprehension task with that obtained using a traditional speech recognition task. The speech comprehension task required participants to follow a natural conversation between two people of 10-40 seconds duration and then answer a question probing comprehension. VG was obtained from normal-hearing participants at two signal-to-noise ratios (SNRs) selected to simulate a comfortable and a challenging listening environment, respectively. We were particularly interested in finding out: 1) to what extent visual cues benefit comprehension of a natural dialogue as compared with performance on a traditional word recognition test; and 2) the effect of the interaction between task and SNR (i.e., difficulty of the listening situation) on VG.


Tests. Two tests were administered. One was a word recognition test utilizing lists of 25 monosyllabic words (“Dantale I” 8). In this test, participants verbally repeat the perceived words. Stationary speech-shaped noise was used as background noise for this test.

The other test was a comprehension test developed at the Eriksholm Research Centre. This test is centered around a dialogue in which two persons work together to solve a diapix (spot-the-difference) task. 9 During recording of this test material, the two talkers were seated at a slight angle to the camera and could freely move their heads to look at each other, their picture, or directly into the camera. As the camera captured the whole person, body movements and gestures were also present in the video. Following the recording, the only modification made to the speech signals was an overall level adjustment to equalize the levels of the two talkers. From the recorded dialogues, 52 independent segments of 10-40 seconds were extracted, with each segment disclosing a difference between two pictures. Each segment is followed by a comprehension question with three response options that probes the thematic difference disclosed in the dialogue (e.g., whether the difference concerned the color, the number, or the position of an object, or whether a word, a piece of clothing, or another item was missing). Babble noise created from the dialogue recordings was used as background noise for this test.

Test setup. The experiment was performed in a soundproof room. Audio signals were presented via an RME Fireface UC soundcard to four Genelec 8020 loudspeakers. The video was projected on a 60-inch HDTV screen. Test participants were seated 180 cm in front of and facing the screen. The four loudspeakers were placed 20 cm behind the screen in a straight line parallel with the screen. Their positions corresponded to -15°, -5°, 5°, and 15° azimuth in relation to the participant (Figure 1). The speech was presented from the two inner loudspeakers and the noise from the two outer loudspeakers. While the monosyllabic words were presented simultaneously from the two inner loudspeakers, the dialogue alternated between the two loudspeakers in accordance with the position of the active talker. For the comprehension test, the questions and response options were shown on the screen, and the participants used a wireless number pad to enter their responses. For the word recognition test, the participants’ verbal responses were noted by the experimenter.
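As a rough check on this geometry, the sideways offset of each loudspeaker from the midline can be estimated from its azimuth. The sketch below assumes (the article does not state this) that azimuth is measured from the participant's seat and that the loudspeaker line lies 180 + 20 = 200 cm in front of the participant; the function name is illustrative only.

```python
import math

def lateral_offset_cm(azimuth_deg: float, distance_cm: float = 200.0) -> float:
    """Sideways distance from the midline of a loudspeaker that sits on a
    line `distance_cm` in front of the listener and is seen at `azimuth_deg`.
    The 200 cm default is an assumption (180 cm to screen + 20 cm behind it)."""
    return distance_cm * math.tan(math.radians(azimuth_deg))

# Inner (speech) and outer (noise) loudspeakers, one side of the midline.
for azimuth in (5, 15):
    print(f"{azimuth} deg -> {lateral_offset_cm(azimuth):.1f} cm from the midline")
```

Under these assumptions, the inner loudspeakers would sit roughly 17 cm and the outer loudspeakers roughly 54 cm to either side of the midline.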

Procedure. Test participants first completed the comprehension test and then the word recognition test. In both tests, the speech was presented at 62 dB SPL and the noise level was chosen to achieve SNRs of 0 or -7 dB. These two SNRs were selected based on findings from other studies that people with normal hearing typically achieve 80%-90% correct performance in audio-only (AO) speech recognition tests at 0 dB SNR, and 40%-60% correct performance at -7 dB SNR. 10,11 The four conditions (AO and AV, each tested at two SNRs) were presented in a pseudorandomized order across participants, with the same order of conditions used for both tests. During the comprehension test, the 52 segments were presented in random order, breaking for each new test condition after every 11-15 segments to ensure listening time was roughly equalized across conditions. Before data collection, a training session was conducted to familiarize the test participants with the material and procedure of both tests. At the conclusion of the experiment, the percentage of correct answers was calculated for each test material and test condition. Statistical analyses were performed using RStudio-2022.02.3+492.
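The noise levels implied by the two SNRs follow from simple arithmetic. This minimal sketch assumes that SNR is defined as the overall speech level minus the overall noise level in dB, and takes the harder condition to be -7 dB SNR as reported in the results; the function name is hypothetical.

```python
def noise_level_db_spl(speech_db_spl: float, snr_db: float) -> float:
    """Noise level needed to reach `snr_db`, assuming SNR = speech - noise (dB)."""
    return speech_db_spl - snr_db

SPEECH_LEVEL_DB_SPL = 62.0  # speech presentation level stated in the article

for snr_db in (0.0, -7.0):
    noise = noise_level_db_spl(SPEECH_LEVEL_DB_SPL, snr_db)
    print(f"SNR {snr_db:+.0f} dB -> noise at {noise:.0f} dB SPL")
```

Under that definition, the noise would be presented at 62 dB SPL in the easier condition and 69 dB SPL in the harder one.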

Participants. In total, 25 adults (19 females) with normal hearing (i.e., threshold levels below 20 dB across the audiometric frequencies from 250 to 8,000 Hz) were recruited. Across frequencies, the average threshold was 4.5 dB HL for the right ear and 3.0 dB HL for the left ear. Eight of these participants were later excluded as they reached scores close to ceiling (defined as a score of 95% or higher) in either test during the AO condition, leaving little or no room for improvement in the AV condition. Another two participants were excluded as their scores on the comprehension test in one condition were at or below chance level. The remaining 15 participants ranged in age from 23 to 60 years with an average of 29.3 years (SD = 10.6 years).


Figure 2 shows, for each test and SNR, the relationship between AV and AO percentage correct scores, with symbols above the unity line indicating a benefit from visual cues. In most cases, the availability of visual cues led to improved scores. The exceptions were mainly seen for the comprehension test (blue symbols), which also produced greater between-participant differences. As several participants reached ceiling in the two easier (0 dB SNR) AV conditions, the percentage correct scores for each test and participant were converted to rationalized arcsine units (RAU) prior to statistical analyses, with the transformed comprehension units further corrected for guessing. 12 A repeated-measures ANOVA with the transformed scores as observations and test (word/comprehension), SNR (0 dB/-7 dB), and visual cues (off/on) as repeated factors revealed significant main effects (F(1,98) > 23.0; P < 0.001) and a significant interaction between test material and SNR (F(1,98) = 19.6; P < 0.001), but no significant interactions involving visual cues (F(1,98) < 0.50; P > 0.48). Transformed scores were generally higher for the word test (83.9 vs. 69.8 RAU), for the better SNR (90.3 vs. 63.4 RAU), and when visual cues were available (82.8 vs. 71.0 RAU).
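For readers unfamiliar with the RAU transform, the sketch below implements the commonly cited rationalized arcsine formula (Studebaker, 1985), which stretches scores near floor and ceiling. It is an illustration only: the article followed reference 12, whose supplementary formulas and correction for guessing are not reproduced here, and the function name is an assumption.

```python
import math

def rau(correct: float, total: int) -> float:
    """Rationalized arcsine units for `correct` items out of `total`,
    using the standard two-term arcsine transform."""
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    # Linear rescaling so that mid-range values resemble percentages.
    return (146.0 / math.pi) * theta - 23.0

# Example with a 25-word list, as used in the word recognition test.
for n_correct in (0, 12, 20, 25):
    print(f"{n_correct}/25 correct -> {rau(n_correct, 25):.1f} RAU")
```

The transformed scale extends below 0 and above 100, which is why scores at ceiling can still be compared after transformation.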

VG was then calculated for each test and SNR as the ratio of the transformed AV and AO scores (VG = AV/AO). This formula was chosen as it produces valid data in cases where the AO score exceeds the AV score and when ceiling is reached in the AV condition. Figure 3 shows the distribution of the calculated VG values for each test material and SNR. As one participant in one condition produced an outlying VG value that exceeded the standard deviation for that condition by a factor of four, the figure shows data for only 14 participants. A repeated-measures ANOVA on this data set using test (word/comprehension) and SNR (0 dB/-7 dB) as repeated factors showed no significant main effects (F(1,39) < 3.1; P > 0.08) and no significant interaction between test and SNR (F(1,39) = 0.6; P = 0.44). That is, the availability of visual cues was similarly beneficial across the two SNRs and tests. Finally, the correlations between the VG values obtained across tests and across SNRs were very low (r = 0.30 and r = 0.15, respectively), suggesting large variability within the participants in terms of benefiting from visual cues across the different conditions.
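The VG ratio can be sketched in a few lines. The example applies the formula to the grand-mean RAU values reported for the AV and AO conditions purely for illustration; the study computed VG per participant and condition, so this number is not a reported result. The function name and the guard against non-positive AO scores are assumptions.

```python
def visual_gain(av_rau: float, ao_rau: float) -> float:
    """Visual gain as the ratio of audio-visual to audio-only scores (VG = AV/AO).
    Values above 1 indicate a benefit from visual cues."""
    if ao_rau <= 0:
        # A ratio is not meaningful for zero or negative transformed scores.
        raise ValueError("AO score must be positive to form a ratio")
    return av_rau / ao_rau

# Illustration using the grand means reported in the text (82.8 vs. 71.0 RAU).
print(f"VG = {visual_gain(82.8, 71.0):.2f}")
```

A ratio stays defined when the AO score exceeds the AV score (VG simply falls below 1), which matches the rationale given for choosing this formula.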


The findings obtained in this study with the word test agree with the extensive literature on visual benefit as measured with traditional speech tests. 1–4 That is, reducing noise and adding visual cues resulted in significantly higher recognition scores, with all individuals performing at least as well in the AV condition as in the AO condition (see Figure 2, red symbols).

When presented with a novel comprehension test in which the task was to follow a natural dialogue between two talkers, the participants as a group also showed a significantly higher score when noise was reduced and when visual cues were added. VG was comparable across the two tests as well as the two SNRs. However, in contrast to the word recognition test, a handful of participants performed noticeably better in the AO condition than in the AV condition on the comprehension test (see Figure 2, blue symbols). The comprehension test used here has not been validated, and it is likely that for some participants the randomly selected segments varied in difficulty, with those presented in the AO condition occasionally being much easier to comprehend than those presented in the corresponding AV condition, or vice versa. This phenomenon could also explain the VG outlier and the greater between-participant variation seen in the comprehension scores. However, larger variation, relative to a standardized speech test, has also been reported in normal-hearing people for a comprehension test with equalized sets and is possibly due to individual variation in cognition associated with higher-level speech processing. 13,14 The fact that the two tests (word and comprehension) were similarly sensitive to changes in SNR and visual cues on a group level lends some credibility to the newly developed comprehension test.

While adding visual cues to traditional speech tests seems like a logical step towards obtaining more realistic audiometric speech tests, there is no evidence in our data that combining visual cues with a more realistic test paradigm (i.e., comprehending a natural dialogue) will produce different and potentially more ecologically valid outcomes 15, at least not in a normal-hearing population. The low correlations observed between the VG values obtained for the two tests are curious and warrant further investigation into why some individuals show more VG on a word recognition task than on a comprehension task, or vice versa. Future research should also aim to confirm these observations in a hearing-impaired population, and to examine whether advancing the speech test to include the participant in an interactive conversation will result in different and more ecologically valid outcomes.


1. Sumby WH, Pollack I. Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America. 1954;26:212-215.
2. Tye-Murray N, Sommers MS, Spehar B. Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing. Ear and Hearing. 2007;28:656-668.
3. Bernstein JGW, Grant KW. Auditory and auditory-visual intelligibility of speech in fluctuating maskers for normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America. 2009;125:3358-3372.
4. Jaha N, Shen S, Kerlin JR, Shahin AJ. Visual enhancement of relevant speech in a “cocktail party”. Multisensory Research. 2020;33:277-294.
5. Grant KW, Seitz PF. Measures of auditory–visual integration in nonsense syllables and sentences. The Journal of the Acoustical Society of America. 1998;104:2438-2450.
6. Sommers MS, Tye-Murray N, Spehar B. Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults. Ear and Hearing. 2005;26:263-275.
7. Kiessling J, Pichora-Fuller MK, Gatehouse S. Candidature for and delivery of audiological services: special needs of older people. International Journal of Audiology. 2003;42(Suppl 2):S92-S101.
8. Elberling C, Ludvigsen C, Lyregaard PE. Dantale: A new Danish speech material. Scandinavian Audiology. 1989;18:169-175.
9. Baker R, Hazan V. DiapixUK: Task materials for the elicitation of multiple spontaneous speech dialogs. Behavior Research Methods. 2011;43:761-770.
10. Gifford RH, Bacon SP, Williams EJ. An examination of speech recognition in a modulated background and of forward masking in younger and older listeners. Journal of Speech, Language, and Hearing Research. 2007;50:857-864.
11. Holder JT, Levin LM, Gifford RH. Speech recognition in noise for adults with normal hearing: age-normative performance for AzBio, BKB-SIN, and QuickSIN. Otology & Neurotology. 2018;39:972-978.
12. Sherbecoe RL, Studebaker GA. Supplementary formulas and tables for calculating and interconverting speech recognition scores in transformed arcsine units. International Journal of Audiology. 2004;43:442-448.
13. Best V, Keidser G, Freeston K, Buchholz J. A dynamic speech comprehension test for assessing real-world listening ability. Journal of the American Academy of Audiology. 2016;27:515-526.
14. Best V, Keidser G, Freeston K, Buchholz J. Evaluation of a dynamic speech comprehension test in older listeners with hearing loss. International Journal of Audiology. 2017;57:221-229.
15. Keidser G, Naylor G, Brungart D. The quest for ecological validity in hearing science: what it is, why it matters, and how to advance it. Ear and Hearing. 2020;41(Suppl 1):5S-19S.
Copyright © 2023 Wolters Kluwer Health, Inc. All rights reserved.