Hearing loss is a major health problem, with 5.3 percent of the global population living with debilitating hearing impairment.1 Given the gradual increase in life expectancy, the number of people with a debilitating hearing impairment is expected to nearly double over the next 30 years. Hearing loss has important perceptual consequences and can accelerate cognitive decline. Additionally, loss of communication impacts social skills and promotes isolation and loss of confidence, particularly among the elderly.
The most common complaint of listeners with hearing loss is difficulty understanding a conversation in noisy environments, such as in a restaurant or at a cocktail party. These listeners usually have difficulty hearing a speaker's voice amidst competing sound sources (the problem of a low signal-to-noise ratio).2 Hearing aids try to correct the user's frequency-specific loss of sensitivity by amplifying the affected frequencies. Despite the noted benefits of hearing aids, only a small proportion of people who need these devices actually use them. One major factor that reduces the enthusiasm for hearing aids is their failure to restore the ability to selectively perceive a speaker, because the devices amplify the background noise together with the target speech. While signal processing algorithms can suppress simple background noises, the enhancement of the target speaker fails when the noise and speech are not acoustically different, such as when the noise is coming from another speaker. In such scenarios, no speech enhancement method can help without first knowing which speaker the listener wants to focus on. The hearing aid system therefore needs an additional control signal that tells it which speaker is the target and which speakers are interferers. Possible examples of such a control signal include head and gaze direction and manual selection. These solutions, however, are neither natural nor satisfactory: users might want to attend to sources to the side or behind them, users might not want to constantly operate a hand-held manual-selection device, or the target and interfering speakers may be close to each other.
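To make the signal-to-noise problem concrete, the short Python sketch below computes the signal-to-noise ratio in decibels and shows that applying the same gain to the target and the interfering sound leaves that ratio unchanged. The toy signals and the function names (`power`, `snr_db`) are assumptions chosen for illustration; they do not describe the processing in any actual hearing aid.

```python
import math

def power(signal):
    """Mean squared amplitude of a sampled signal."""
    return sum(x * x for x in signal) / len(signal)

def snr_db(target, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_target / P_noise)."""
    return 10.0 * math.log10(power(target) / power(noise))

# Toy signals (assumed for illustration): a target tone and a weaker interfering tone.
n = 1000
target = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
noise = [0.5 * math.sin(2 * math.pi * 13 * t / n) for t in range(n)]

# The same amplification applied to everything the microphone picks up.
gain = 4.0
snr_before = snr_db(target, noise)
snr_after = snr_db([gain * x for x in target], [gain * x for x in noise])

print(round(snr_before, 2), round(snr_after, 2))
```

Both values come out the same: the gain multiplies the target and noise power by the same factor, so it cancels in the ratio. Improving the ratio requires treating the target differently from the interferers, which is why the device has to know who the target is.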
POSSIBILITIES WITH BRAIN CONTROL
We previously showed that the human auditory cortex selectively represents the attended speaker relative to unattended sources.3 When a listener focuses on a specific speaker in a crowded environment, the brainwaves of the listener track the voice of the target speaker. This finding has motivated the prospect of a brain-controlled hearing aid that constantly monitors the brainwaves of a listener and compares them with sound sources in the environment to determine the talker to whom the listener is attending. The device would then amplify the attended speaker relative to others to facilitate hearing that speaker in a crowd (Fig. 1). This process is called auditory attention decoding (AAD), a research area that has grown considerably in recent years. Multiple problems must be resolved to make a brain-controlled hearing aid feasible, including developing noninvasive and nonintrusive methods to measure the neural signals4-6 and designing effective decoding algorithms for accurate and rapid detection of attentional focus.7-12 Another major challenge is the lack of individual speaker audio. In realistic situations, we only have the mixed audio of speakers recorded from one or more microphones. Therefore, the first step in AAD is to automatically separate the speakers in the mixed audio. However, speaker-independent speech separation (meaning separation with no prior knowledge of the specific speakers) is a very difficult problem that has only recently seen progress towards a solution.13-15
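The comparison step in AAD can be sketched in a few lines of Python. The sketch assumes that a speech envelope has already been estimated from the neural recordings by some decoder (not shown) and that an envelope is available for each candidate speaker; it then picks the speaker whose envelope correlates best with the neural estimate. Pearson correlation is used here as one plausible similarity measure, and the names `pearson` and `decode_attended` and the toy envelopes are hypothetical, not the published method.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def decode_attended(neural_envelope, source_envelopes):
    """Return (index of best-matching source, correlation score per source)."""
    scores = [pearson(neural_envelope, env) for env in source_envelopes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy envelopes (assumed): two speakers with different amplitude rhythms.
n = 1000
speaker_a = [abs(math.sin(2 * math.pi * 3 * t / n)) for t in range(n)]
speaker_b = [abs(math.sin(2 * math.pi * 7 * t / n + 1.0)) for t in range(n)]

# Hypothetical neural estimate: dominated by speaker B, with some leakage
# from speaker A and a small unrelated fluctuation.
neural = [
    0.8 * b + 0.2 * a + 0.1 * math.sin(2 * math.pi * 50 * t / n)
    for t, (a, b) in enumerate(zip(speaker_a, speaker_b))
]

attended, scores = decode_attended(neural, [speaker_a, speaker_b])
print(attended, [round(s, 2) for s in scores])
```

In this toy case the decoder selects index 1 (speaker B). A practical system would make this decision repeatedly over short time windows so that it can follow shifts of attention, which trades decoding accuracy against speed.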
PROPOSED AAD FRAMEWORK
We recently proposed a framework that incorporates speaker-independent speech separation into AAD without needing the individual speaker audio.16 A critical component of this system is a real-time, low-latency speech separation algorithm based on deep neural network models. These models loosely approximate the computation performed by biological neurons and have proven to be extremely effective in many machine learning tasks.17 Because this system can generalize to new speakers, it overcomes a major limitation of our previous AAD approach, which required training on the target speakers.18 To test the feasibility of this brain-controlled hearing device, we used invasive electrophysiology to measure neural activity from three neurosurgical patients undergoing treatment for epilepsy. These patients had clinically implanted electrodes in their superior temporal gyrus (STG), a brain area that we had previously shown to selectively represent the attended speaker.3 Each subject was presented with a mixture of simultaneous speech stories and instructed to focus his or her attention on one speaker and ignore the others. The listeners’ brainwaves were then compared to the sound sources separated by the neural networks, and the source that most closely matched the brainwaves was amplified relative to the other speakers to facilitate listening (Figs. 2A, B).
To test whether the proposed system reduces the difficulty of attending to the target speaker, we performed a psychoacoustic experiment comparing the perceived quality of the original mixed audio to that of the audio in which AAD was used to detect the target speaker and amplify it by 12 dB. Twenty listeners with normal hearing participated, each hearing 20 randomized sentences in each of two experimental conditions: (1) the original mixture and (2) the enhanced target speech produced from the output of the AAD system. Subjects rated the difficulty of attending to the target speaker on a scale of one to five using the mean opinion score (MOS).19 The bar plots in Figure 2C show the median MOS ± standard error (SE) for the two conditions. The average subjective score when using the AAD system showed a significant improvement over the mixture (100% improvement, paired t-test, p < 0.001), demonstrating that the listeners had a stronger preference for the modified audio than for the original mixture (for a demo, visit naplab.ee.columbia.edu/nnaad).
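For readers unfamiliar with decibel gains, the Python sketch below shows what amplifying the target by 12 dB amounts to when the separated sources are mixed back together. It is a minimal illustration under stated assumptions: the function names (`db_to_amplitude`, `remix`) and the toy sources are hypothetical, and the sketch omits any level normalization or limiting that a real device would need.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude factor: 10^(dB / 20)."""
    return 10.0 ** (gain_db / 20.0)

def remix(sources, attended_index, gain_db=12.0):
    """Sum the separated sources, boosting the attended one by gain_db."""
    gain = db_to_amplitude(gain_db)
    n = len(sources[0])
    mixed = []
    for t in range(n):
        sample = 0.0
        for i, src in enumerate(sources):
            # Only the attended source is scaled; the others pass through unchanged.
            weight = gain if i == attended_index else 1.0
            sample += weight * src[t]
        mixed.append(sample)
    return mixed

# Toy separated sources (assumed): two speakers at equal level.
speaker_a = [0.1, -0.2, 0.3, -0.1]
speaker_b = [0.2, 0.1, -0.1, 0.05]

factor = db_to_amplitude(12.0)
enhanced = remix([speaker_a, speaker_b], attended_index=0, gain_db=12.0)
print(round(factor, 3))
```

A 12 dB gain corresponds to an amplitude factor of about 4 (roughly 16 times the power), so the attended speaker clearly dominates the remixed signal while the other speakers remain audible in the background.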
Our ongoing research focuses on advancing our understanding of auditory attention and its neural markers in the human auditory cortex, and on removing the technological barriers to establishing the feasibility and efficacy of AAD for improving speech intelligibility and reducing listening effort in people with hearing loss. We expect this research to yield a better understanding of the neural mechanisms that enable a listener to focus on one speaker in multitalker conditions, bringing brain-controlled hearing aid technologies a significant step closer to reality.
1. WHO. Deafness and hearing loss. March 2019. http://bit.ly/2Jwy5rm
2. M.C. Killion. (2002) New thinking on hearing in noise: A generalized articulation index. In: Semin. Hear. pp. 57-76.
3. N. Mesgarani, E.F. Chang. (2012) Selective cortical representation of attended speaker in multi-talker speech perception. Nature. 485 (7397): 233-236.
4. B. Mirkovic, M.G. Bleichner, M. De Vos, S. Debener. (2016) Target speaker detection with concealed EEG around the ear. Front Neurosci. 10: 349.
5. B. Mirkovic, S. Debener, M. Jaeger, M. De Vos. (2015) Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications. J Neural Eng. 12 (4): 46007.
6. L. Fiedler, M. Wöstmann, C. Graversen, A. Brandmeyer, T. Lunner, J. Obleser. (2017) Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech. J Neural Eng. 14 (3): 36020.
7. A. de Cheveigné, D.D.E. Wong, G.M. Di Liberto, J. Hjortkjær, M. Slaney, E. Lalor. (2018) Decoding the auditory brain with canonical component analysis. Neuroimage. 172: 206-216.
8. S. Akram, A. Presacco, J.Z. Simon, S.A. Shamma, B. Babadi. (2016) Robust decoding of selective auditory attention from MEG in a competing-speaker environment via state-space modeling. Neuroimage. 124: 906-917.
9. S. Miran, S. Akram, A. Sheikhattar, J.Z. Simon, T. Zhang, B. Babadi. (2018) Real-time tracking of selective auditory attention from M/EEG: A bayesian filtering approach. Front Neurosci.
10. N. Das, S. Van Eyndhoven, T. Francart, A. Bertrand. (2016) Adaptive attention-driven speech enhancement for EEG-informed hearing prostheses In: Eng. Med. Biol. Soc. (EMBC), 2016 IEEE 38th Annu. Int. Conf. IEEE. pp. 77-80.
11. S. Van Eyndhoven, T. Francart, A. Bertrand. (2017) EEG-Informed Attended Speaker Extraction From Recorded Speech Mixtures With Application in Neuro-Steered Hearing Prostheses. IEEE Trans Biomed Eng. 64 (5): 1045-1056.
12. A. Aroudi, D. Marquardt, S. Doclo. (2018) EEG-based auditory attention decoding using steerable binaural superdirective beamformer In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Calgary, Canada. pp. 851-855.
13. Z. Chen, Y. Luo, N. Mesgarani. (2017) Deep attractor network for single-microphone speaker separation In: Acoust. Speech Signal Process. (ICASSP), 2017 IEEE Int. Conf. IEEE. pp. 246-250.
14. Y. Luo, Z. Chen, N. Mesgarani. (2018) Speaker-Independent Speech Separation With Deep Attractor Network. IEEE/ACM Trans Audio, Speech, Lang Process. 26 (4): 787-796.
15. J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe. (2016) Deep clustering: Discriminative embeddings for segmentation and separation. IEEE Int Conf Acoust Speech Signal Process. 31-35.
16. C. Han, J. O'Sullivan, Y. Luo, J. Herrero, A.D. Mehta, N. Mesgarani. (2019) Speaker-independent auditory attention decoding without access to clean speech sources. Sci Adv. 5 (5): eaav6134.
17. Y. LeCun, Y. Bengio, G. Hinton. (2015) Deep learning. Nature. 521 (7553): 436.
18. J. O'Sullivan, Z. Chen, J. Herrero, G.M. McKhann, S.A. Sheth, A.D. Mehta, et al. (2017) Neural decoding of attentional selection in multi-speaker environments without access to clean sources. J Neural Eng. 14 (5).
19. ITU-T. (2006) Vocabulary for performance and quality of service. ITU-T Rec. P.10.