CURRENT SNR MODIFICATION METHODS
Current methods to improve speech intelligibility in noise include modifying the signal-to-noise ratio (SNR) of time-frequency cells (i.e., the frequency bands of the Fourier transform of time frames) before clean speech is delivered to the near-end listener. With these methods, the speech energy is redistributed among time-frequency cells under a power constraint according to the energy of the background noise, which is captured with an extra microphone on the near-end side. For example, Tang and Cooke have proposed an SNR modification method in which the energy of each time-frequency cell is modified so that its local SNR equals the global SNR (ISCA. 2011;354). Sauert, et al., presented two strategies to modify the SNR of clean speech in background noise. The first strategy is based on “equal SNR.” The second, in contrast, is based on a simple model of hearing that cuts down the speech power at noisy frequencies and distributes it among clean frequency channels (IWAENC, 2006).
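The "equal SNR" idea can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: the function name and the band energies are hypothetical, and the energies are assumed to be positive linear values for a single frame. Each band's speech energy is reset to the noise energy times the global SNR, which leaves the total speech energy unchanged.

```python
def equal_snr_gains(speech_energy, noise_energy):
    """Per-band energy gains that make every band's SNR equal the global SNR.

    Inputs are lists of positive band energies (linear scale) for one frame.
    The total speech energy is preserved, which is the power constraint.
    """
    global_snr = sum(speech_energy) / sum(noise_energy)
    # Target speech energy in band j is noise_energy[j] * global_snr.
    return [global_snr * z / s for s, z in zip(speech_energy, noise_energy)]


speech = [4.0, 1.0, 1.0]  # hypothetical clean-speech band energies
noise = [1.0, 1.0, 4.0]   # hypothetical noise band energies
gains = equal_snr_gains(speech, noise)
modified = [g * s for g, s in zip(gains, speech)]
```

The returned values are energy gains; the corresponding amplitude gain for the DFT bins of a band would be the square root of each value.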
Wang, et al., introduced an SNR modification method based on ideal binary time-frequency masking, in which the speech energy is retained in time-frequency cells where the local SNR exceeds a certain threshold but is rejected in other time-frequency cells (ISCA, 2014). Two other ways to modify the SNR are using neighborhood density to increase the intelligibility of synthetic speech in noise and discarding low-SNR speech components for binaural speech (Schoenmaker & van de Par. In: van Dijk, et al., eds. Springer, 2016; ISCA. 2013;113). However, these methods modify the SNRs without considering the relative contribution of each frequency band to speech intelligibility. Because the frequency bands do not contribute equally to speech intelligibility, blind redistribution of speech energy may reduce the influence of the important bands. For example, speech energy could be removed from the band that contributes most to intelligibility and transferred to a band that contributes less, purely on the basis of the SNRs.
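The binary masking rule attributed to Wang, et al., above can be sketched in a few lines of Python. This is only an illustration of the thresholding step: the function name, the 0 dB default threshold, and the band energies are assumptions, not values taken from the cited work.

```python
import math


def binary_mask(speech_energy, noise_energy, threshold_db=0.0):
    """Return 1 for cells whose local SNR exceeds threshold_db, otherwise 0."""
    mask = []
    for s, z in zip(speech_energy, noise_energy):
        local_snr_db = 10.0 * math.log10(s / z)  # energies must be positive
        mask.append(1 if local_snr_db > threshold_db else 0)
    return mask


speech = [4.0, 1.0, 0.5]  # hypothetical cell energies of clean speech
noise = [1.0, 2.0, 0.5]   # hypothetical cell energies of the noise
mask = binary_mask(speech, noise)
```

Note that the mask depends only on the local SNR, so a band that matters a great deal for intelligibility is discarded as readily as one that matters little.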
Optimization-based intelligibility improvement methods, on the other hand, usually rely on a complex equation to determine the optimal gains. Goli and Karami introduced a cost function based on the energy correlation and then optimized it to improve speech intelligibility, and Taal, et al., optimized a cost function based on a perceptual distortion measure (Digital Signal Processing. 2017;52[C]:238; Computer Speech & Language. 2014;28:858). Despite the high performance of these optimization-based methods, the algorithms are usually very complicated and involve many statistical quantities that need to be estimated from clean speech and background noise. As such, this study introduces a simple method to improve speech intelligibility in background noise based on SNR modification and the contribution of each frequency band to speech intelligibility.
SPEECH INTELLIGIBILITY INDEX (SII) MEASURE
The American National Standards Institute (ANSI) provided the standard speech intelligibility index (SII) to evaluate speech intelligibility in background noise. The measure inputs are the clean speech signal and the noise, while the output is a scalar number that specifies the amount of speech intelligibility. A key component of the SII measure is the band importance function, which determines the contribution of each frequency band (i.e., the one-third octave bands) to speech intelligibility. The SII is calculated from the SNR of each one-third octave band, weighted according to that band's contribution to speech intelligibility. The SII has a bell-shaped band importance function, with a value of 0.0083 for the one-third octave frequency band centered at 160 Hz, 0.0898 for the band centered at 2,000 Hz, and 0.0185 for the band centered at 8,000 Hz.
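A heavily simplified version of this weighting can be sketched in Python. Only the importance values at 160, 2,000, and 8,000 Hz come from the text above; the other 15 values are quoted from memory of the one-third octave table in ANSI S3.5-1997 and should be checked against the standard before use. The sketch also assumes the usual linear mapping of band SNR to audibility over the range from -15 dB to +15 dB and omits the standard's masking, hearing-threshold, and level-distortion terms, so it is an illustration rather than a conforming SII implementation.

```python
# One-third octave band centers (Hz) and band importance values.
# Assumption: all values except those at 160, 2000, and 8000 Hz are recalled
# from ANSI S3.5-1997 and must be verified against the standard.
BAND_CENTERS_HZ = [160, 200, 250, 315, 400, 500, 630, 800, 1000,
                   1250, 1600, 2000, 2500, 3150, 4000, 5000, 6300, 8000]
BAND_IMPORTANCE = [0.0083, 0.0095, 0.0150, 0.0289, 0.0440, 0.0578,
                   0.0653, 0.0711, 0.0818, 0.0844, 0.0882, 0.0898,
                   0.0868, 0.0844, 0.0771, 0.0527, 0.0364, 0.0185]


def band_audibility(snr_db):
    """Map a band SNR in dB to [0, 1]: 0 at -15 dB or below, 1 at +15 dB or above."""
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))


def simplified_sii(band_snr_db):
    """Importance-weighted sum of band audibilities (simplified SII)."""
    return sum(w * band_audibility(snr)
               for w, snr in zip(BAND_IMPORTANCE, band_snr_db))


# Usage: the same 0 dB SNR in all 18 bands gives an index of about 0.5.
index_at_0_db = simplified_sii([0.0] * 18)
```

Because the weights sum to one, the index stays between 0 and 1, and raising the SNR in the band centered at 2,000 Hz changes it almost five times as much as the same increase in the band centered at 8,000 Hz.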
PROPOSED METHOD FOR SPEECH INTELLIGIBILITY IMPROVEMENT
The proposed algorithm improves speech intelligibility in noisy environments by modifying the SNR in each time-frequency cell of speech according to both the global SNR and the contribution of each frequency band to speech intelligibility. Under a constant energy constraint, reallocating energy to cells with a high local SNR is wasteful and unlikely to improve intelligibility, whereas cells with a very low local SNR need too much energy to make a significant contribution to intelligibility. The proposed algorithm therefore reallocates energy to cells with moderate SNRs, weighting the reallocation by the contribution of each frequency band to speech intelligibility.
The clean speech signal and the background noise signal are indicated by s(n) and z(n), respectively. The speech signal degraded by the background noise is denoted by y(n) = s(n) + z(n) and the modified speech signal by d(n). The clean speech is processed and modified before it is delivered to the near-end listener, who hears the modified speech signal, d(n), in the background noise.
The sampling frequency is fs = 16 kHz. The clean speech and noise signals are segmented by Hann windows with a length of 256 samples and 50 percent overlap. Then the DFT of each frame is calculated and a one-third octave band analysis is performed by grouping the DFT bins. The energies of the jth frequency band in the mth frame of the clean speech and the noise are represented by Sj,m and Zj,m, respectively, and calculated as,
where S(k, m) and Z(k, m) are, respectively, the DFT of the clean speech signal and the noise in the mth frame at the kth frequency index, and Kh(j) and Kl(j) are, respectively, the highest and lowest frequency indices of the jth band. The local SNR in the jth band of the mth frame can be calculated as,
and the global SNR of the mth frame is determined as,
where J is the number of one-third octave bands in each frame of speech. Given that the sampling frequency is fs = 16 kHz, the frequency spectrum of a frame is divided into J = 18 one-third octave bands.
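The analysis stage described above can be sketched in Python. The displayed equations are not reproduced in this text, so the sketch fills them in under stated assumptions: the band energy is taken as the sum of squared DFT magnitudes over bins Kl(j) to Kh(j), the local SNR as the ratio of speech to noise band energy, and the global SNR as the ratio of the summed band energies of the frame (one plausible reading). The periodic Hann window, the function names, and the band-edge lists `k_low` and `k_high` are likewise hypothetical, and the naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math


def hann(n_len):
    """Periodic Hann window of length n_len (assumed window variant)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]


def frames(x, n_len=256, hop=128):
    """Split x into Hann-windowed frames of n_len samples with 50% overlap."""
    w = hann(n_len)
    return [[w[n] * x[start + n] for n in range(n_len)]
            for start in range(0, len(x) - n_len + 1, hop)]


def dft(frame):
    """Naive O(N^2) DFT for bins 0..N/2; an FFT would be used in practice."""
    n_len = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                for n in range(n_len))
            for k in range(n_len // 2 + 1)]


def band_energies(spectrum, k_low, k_high):
    """Energy of band j: sum of |X(k)|^2 for k_low[j] <= k <= k_high[j]."""
    return [sum(abs(spectrum[k]) ** 2 for k in range(lo, hi + 1))
            for lo, hi in zip(k_low, k_high)]


def frame_snrs(speech_frame, noise_frame, k_low, k_high):
    """Return (local SNR per band, global SNR) as linear ratios for one frame."""
    s = band_energies(dft(speech_frame), k_low, k_high)
    z = band_energies(dft(noise_frame), k_low, k_high)
    local = [sj / zj for sj, zj in zip(s, z)]  # noise band energies must be > 0
    return local, sum(s) / sum(z)
```

With 256-sample frames at 16 kHz the DFT bins are 62.5 Hz apart, so the lowest one-third octave bands cover very few bins; how the band edges Kl(j) and Kh(j) are chosen there is not stated in the text and is left to the caller in this sketch.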
PROPOSED SNR MODIFICATION ALGORITHM BASED ON BAND IMPORTANCE FUNCTION
In the proposed method, the SNR of each cell is modified under a power constraint, according to the global SNR in the corresponding frame and the contribution of each frequency band to speech intelligibility. Therefore, it is necessary to find the contribution of each band to speech intelligibility to determine the gain of the jth frequency band of the mth frame, which is represented by Gj,m. For this purpose, we apply the band importance function used in the SII measure. An exponential equation is used to calculate the gain based on the importance function as follows,
where Ij is the value of the band importance function in the jth band, and α is an experimental constant. The coefficient γj,m is determined to modify the local SNR of the jth band of the mth frame based on the global SNR of the frame. The fundamental idea behind γj,m is that reallocating energy to cells with a high local SNR or to cells with a very low local SNR is wasteful. Therefore, energy is reallocated to cells that have moderate SNRs. Accordingly, the following equation is proposed to calculate γj,m,
According to the above equation, γj,m does not change the energy of cells with a high SNR (e.g., higher than 30 dB), and transfers the energy of cells with a very low SNR (e.g., lower than −30 dB) to cells with a moderate SNR. After calculating the coefficient of each time-frequency cell, the DFT of the modified signal can be determined as follows,
where α(m) normalizes the energy of each modified speech frame to that of the corresponding clean speech frame, and is obtained from the following equation,
The energy is transferred from the frequency bands with a high SNR to bands with a moderate SNR by performing the above normalization. Finally, the modified signal is reconstructed by inverse DFT and windowing.
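The frame-level energy normalization can be illustrated with a short Python sketch. The per-bin gains below are arbitrary placeholders, not the Gj,m and γj,m values of the proposed method, and the sketch assumes that α(m) is an amplitude factor applied to the DFT bins; the function name and sample spectrum are hypothetical.

```python
import math


def normalize_frame(spectrum, gains):
    """Apply per-bin gains, then rescale so the frame keeps its original energy."""
    scaled = [g * x for g, x in zip(gains, spectrum)]
    energy_in = sum(abs(x) ** 2 for x in spectrum)
    energy_out = sum(abs(y) ** 2 for y in scaled)
    alpha_m = math.sqrt(energy_in / energy_out)  # frame normalization factor
    return [alpha_m * y for y in scaled], alpha_m


spectrum = [3 + 0j, 4j, 1 + 0j]  # hypothetical DFT bins of one clean frame
gains = [1.0, 2.0, 0.0]          # placeholder gains, not the paper's formula
modified, alpha_m = normalize_frame(spectrum, gains)
```

In this example the normalization factor is below one, so the bin whose gain was 1 ends up with less energy than before: energy removed from it and from the suppressed bin is what pays for the boosted bin.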
PROPOSED ALGORITHM IMPLEMENTATION
The proposed algorithm was implemented for a speech signal of approximately 4 s in traffic noise. Figure 2 shows the time-domain waveforms and the spectrograms of the clean speech signal, the unprocessed noisy speech, and the modified speech in noise for an SNR of 0 dB.
As shown in the spectrograms, most of the speech frequencies are covered by noise in the unprocessed noisy speech, whereas many of the frequency components of the clean speech remain visible in the modified speech.
To evaluate the performance of the proposed algorithm, 30 sentences from the TIMIT database with a length of approximately three to five seconds were used. In addition, four common noises were used: traffic, babble, factory, and chainsaw noises from the NOISEX-92 database at SNRs of −20, −15, …, 0 dB.
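Test material at a given SNR is typically produced by scaling the noise against the speech. The Python sketch below shows one common way to do this; it is not taken from the study, the function name and sample values are hypothetical, and the SNR list assumes that the elided values in the text continue in 5 dB steps.

```python
import math


def scale_noise_for_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    p_speech = sum(v * v for v in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    factor = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [factor * v for v in noise]


speech = [1.0, -1.0, 1.0, -1.0]  # hypothetical speech samples
noise = [0.5, 0.5, -0.5, -0.5]   # hypothetical noise samples

# Assumed SNR grid: -20, -15, -10, -5, 0 dB.
noisy_by_snr = {}
for snr_db in range(-20, 1, 5):
    scaled = scale_noise_for_snr(speech, noise, snr_db)
    noisy_by_snr[snr_db] = [s + z for s, z in zip(speech, scaled)]
```

Scaling the noise rather than the speech keeps the speech level fixed across conditions, which matches a setting where the speech energy is constrained.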
Two objective measures, the SII and the short-time objective intelligibility measure (STOI), were used to evaluate the performance of the proposed method. The SII is an objective speech intelligibility measure based on the SNR. STOI evaluates the intelligibility of noisy speech based on the correlation between the noisy speech energy and the clean speech energy in one-third octave frequency bands. The inputs of this measure are clean speech and noisy speech, and the output, a value between 0 and 1, indicates the degree of speech intelligibility.
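The correlation at the core of STOI can be illustrated with a plain correlation coefficient between two band-energy sequences. The Python sketch below is not STOI itself: the full measure also segments the signals into short-time blocks, normalizes and clips the noisy envelope, and averages over bands and blocks, none of which is shown here. The function name and the sample sequences are hypothetical.

```python
import math


def correlation(clean_env, noisy_env):
    """Pearson correlation between two equal-length band-energy sequences."""
    n = len(clean_env)
    mean_c = sum(clean_env) / n
    mean_d = sum(noisy_env) / n
    num = sum((c - mean_c) * (d - mean_d) for c, d in zip(clean_env, noisy_env))
    den = math.sqrt(sum((c - mean_c) ** 2 for c in clean_env) *
                    sum((d - mean_d) ** 2 for d in noisy_env))
    return num / den  # undefined if either sequence is constant


clean = [1.0, 2.0, 3.0, 4.0]     # hypothetical clean band energies over frames
degraded = [2.0, 4.0, 6.0, 8.0]  # hypothetical noisy band energies
score = correlation(clean, degraded)
```

A raw correlation coefficient can be negative, so a value of this kind on its own is not guaranteed to fall between 0 and 1.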
The proposed method was compared with a reference method of improving speech intelligibility called “maximum power transfer,” an SNR modification method based on a simple human hearing model (IWAENC, 2006). In this model, the speech signal is processed in the ear after passing through the auditory filters. It is also assumed that frequency bands with a low SNR are eliminated by the auditory filters. Therefore, in the Sauert method, bands with low intelligibility are initially eliminated and their energy is transferred to other bands so that the maximum speech power reaches the ear. Figure 3 shows the results of the performance evaluation of the proposed algorithm.
Tested in all noise conditions, the proposed method performed better than unprocessed noisy speech and the maximum power transfer method in the SII measure, as expected. In addition, the SII scores for the maximum power transfer method were close to those of the proposed method at the high SNRs of the babble and factory noises (Fig. 3).
Figure 4 shows that the proposed method significantly improves speech intelligibility in almost all noise conditions compared with unprocessed noisy speech in the STOI measure. The proposed method did not considerably improve intelligibility in factory noise, but it did in chainsaw noise. The STOI was notably increased in traffic and babble noises, especially at low SNRs. The proposed method also outperformed the maximum power transfer method in traffic, chainsaw, and babble noises.
The proposed algorithm features a simple structure, which is very important in online speech processing. The study findings show that this method improves speech intelligibility in noisy settings, offering a promising alternative to existing methods.
Copyright © 2017 Wolters Kluwer Health, Inc. All rights reserved.