Listening in noise poses a challenge for many people with varying degrees of hearing loss. The widespread use of masks during the COVID-19 pandemic was a potent reminder of just how valuable visual cues can be to those with hearing loss trying to decipher speech, and how much more difficult listening becomes when those cues are hidden, such as behind a mask. Many previous studies have shown that listeners achieve much better speech comprehension when looking directly at the talker’s face.
The authors of a new study recently published in Trends in Hearing developed a deep neural network (DNN)-based system capable of generating movies of a talking face from speech audio and a single face image. Their study sought to quantify the benefit such a DNN could provide for speech comprehension in noise.
DEEP NEURAL NETWORK TRIALS
DNNs are machine learning models that process data through many layers of computation; here, one was used to provide a synthetic visual listening aid. In this study, the authors masked target speech audio with noise at signal-to-noise ratios (SNRs) of −9, −6, −3, and 0 dB and presented this audio to the study participants in three audio-visual (AV) stimulus conditions: 1) synthesized AV: audio with the synthesized talking-face movie; 2) natural AV: audio with the original movie from the corpus; and 3) audio-only: audio with a static image of the talker. The study involved 10 participants aged 19 to 40. Participants listened to the audio and typed the sentences they heard in each trial. There were 15 trials within each combination of AV condition and SNR, for a total of 180 trials. The study authors scored the participants’ responses by keyword recognition.
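The two measurement steps described above — mixing speech with noise at a fixed SNR, and scoring a typed response by keyword recognition — are standard techniques that can be sketched briefly. The following is a minimal illustration, not the study’s actual code; the function names and the exact scoring rule (case-insensitive keyword matching) are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix.

    A generic sketch of SNR-controlled masking; not the study's actual pipeline.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power needed so that 10*log10(speech_power / noise_power) == snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise_scaled

def keyword_score(response, keywords):
    """Fraction of target keywords present in the typed response (case-insensitive).

    The study's exact scoring rule is not described; this matching is an assumption.
    """
    words = set(response.lower().split())
    return sum(kw.lower() in words for kw in keywords) / len(keywords)
```

For example, mixing a sentence with babble noise at −6 dB would call `mix_at_snr(speech, noise, -6.0)`, and a response of “the cat sat” against keywords `["cat", "mat"]` would score 0.5.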
RESULTS AND CONCLUSIONS
Test results showed that participants’ speech comprehension was strongest in the natural AV condition, second highest in the synthesized AV condition, and, as expected, lowest in the audio-only condition. Although the degree of benefit varied among subjects, every participant benefited to some degree from the synthetic AV stimulus.
“Listeners’ performance on the experimental task showed that the synthesized face significantly improves speech comprehension compared to when only acoustical signals are present,” wrote the study authors. “The benefit was greatest when listening in low SNRs, with performance in the synthetic face condition falling approximately halfway between the audio-only and natural AV conditions.”
Results from the Trends in Hearing study support the hypothesis that a talking face synthesized by a DNN-based model can meaningfully enhance listening comprehension in noisy settings and could potentially serve as a visual hearing aid.
“This study demonstrates that in the absence of a natural visual stimulus, speech comprehension can be enhanced by a synthesized, realistic talking face that is generated purely from the acoustical signal using a DNN-based model,” concluded the authors.