The temporal information carried in speech can be classified into two major components: temporal envelope and periodicity components (TEPC), the slow-varying temporal information with fluctuation rates below 500 Hz, and fine structure components (FSC), the fast-varying temporal information with fluctuation rates from above 600 Hz up to 10 kHz (Rosen, 1992). TEPC and FSC carry different linguistic contrasts of speech in the prosodic and segmental domains. TEPC carry all the contrasts in the prosodic domain, including tempo, rhythm, syllabicity, stress and intonation, as well as the contrasts for manner of articulation and voicing in the segmental domain; FSC carry contrasts only in the segmental domain, mainly place of articulation and voice quality. For a detailed discussion of the framework for temporal information in speech, refer to Rosen (1992).
Cantonese and Mandarin are the two major Chinese tonal languages, spoken by over 800 million people around the world (SIL International, 2004a, 2004b). A tonal language is defined as "a language with tone in which an indication of pitch enters into the lexical realization of at least some morphemes" (Hyman, 2001). In tonal languages, different pitch levels and movements encode different lexical and grammatical meanings of words, even though the phonetic composition of the words remains unchanged (Bauer & Benedict, 1997). These different pitch patterns are known as lexical tones. For example, in Cantonese, changing the lexical tone of a lexeme in a compound can totally alter the meaning of the compound: SYMBOL lou6/ means 'walk' but SYMBOL lou2/ means 'run away'. The only difference between these two compound expressions is the lexical tone of the second syllable (lexical tone 6 and lexical tone 2, respectively). Lexical tones thus carry a significant amount of linguistic information in the speech signal, a phenomenon found only in tonal languages and not in nontonal languages. In Cantonese, lexical tones are essentially represented by the fundamental frequency (F0) and low-order harmonics of the speech signal (Fok, 1974; Vance, 1977). In Mandarin, the duration and amplitude of the speech signal also contribute to lexical tone recognition (Fu, Zeng, Shannon, et al., 1998; Whalen & Xu, 1992; Xu, Tsai, & Pfingst, 2002).
TEPC have been found to be important for carrying lexical tone information. In Fu et al. (1998), FSC were removed from the speech signal by substituting white noise while TEPC were preserved. Lexical tone identification performance was consistently high irrespective of the number of channels used, even in the one-band condition where no spectral information was available, although in subsequent studies identification of TEPC-only lexical tones was found to improve with the number of channels (Xu et al., 2002; Wei, Cao, & Zeng, 2004). In addition, Fu et al. (1998) found that when only TEPC were available in the speech signal, open-set sentence recognition was higher in Mandarin (11.0%) than in English (3.9%). These results suggest that TEPC are more important for speech recognition in tonal languages than in nontonal languages. Due to reduced frequency selectivity, listeners with moderate to severe cochlear hearing loss have limited ability to resolve individual components from FSC for speech recognition (Buss, Hall, & Grose, 2004). Nevertheless, their ability to utilize TEPC for speech recognition is not significantly impaired compared with the normal-hearing population (Edwards, 2004). From the above discussion, we believe that for the hearing-impaired population, especially tonal language speakers, TEPC encompass important acoustic elements for speech recognition.
A number of studies have investigated the relative contribution of TEPC from different frequency regions to speech recognition in English. For consonant identification in normal-hearing subjects, TEPC from all frequency regions were found to carry equal weight in quiet (Apoux & Bacon, 2004a; Kasturi, Loizou, Dorman, et al., 2002). However, TEPC from the high rather than the low frequency regions were found to carry heavier weight when testing was conducted in noise with normal-hearing subjects (Apoux & Bacon, 2004a). The study by Shannon, Galvin, & Baskent (2001) revealed that TEPC from the high frequency regions also carry heavier weight for consonant, vowel and sentence recognition, even in quiet, in both normal-hearing and cochlear-implanted subjects. These studies suggest that TEPC from the high frequency regions are more critical for speech recognition than TEPC from the mid and low frequency regions, due to their different temporal envelope structures (Greenberg, Arai, & Silipo, 1998). Other findings show that TEPC from different frequency regions, not necessarily the high frequency regions, carry different weight for vowel identification in normal-hearing subjects (Kasturi et al., 2002).
The above discussion shows that the relative importance of TEPC from specific frequency regions for speech recognition may vary with the test materials (consonants versus vowels) and test conditions (in quiet versus in noise). Lexical tone is a unique acoustic characteristic of tonal languages that is important for speech recognition and is absent in nontonal languages. The relative contribution of TEPC across different frequency regions in tonal languages, which has not yet been investigated, may therefore deviate from that in nontonal languages.
This study aimed to investigate the contributions of TEPC to lexical tone identification in Cantonese, and whether those contributions vary among different frequency regions. The results would reveal whether there are frequency-specific TEPC that are important for lexical tone identification.
The results are expected to have significant implications for the design of signal processing algorithms in hearing prostheses tailored for the hearing-impaired population speaking tonal languages. With the identification of critical TEPC in the speech signal, and with effective processing methods that preserve and enhance these critical acoustic elements, the communication abilities of hearing-impaired tonal language speakers may be improved. Current signal processing algorithms in commercially available hearing prostheses have paid little attention to the possible speech recognition benefits of manipulating the TEPC of the speech signal. The outcomes of this project may open new directions for hearing aid research.
Materials and Methods
Eighteen subjects aged 19 to 24 yr participated in this study. All subjects had normal hearing sensitivity, with pure-tone air-conduction thresholds of 25 dB HL or better in both ears at octave frequencies from 250 Hz to 8 kHz. All subjects were native Cantonese speakers with no reported history of ear disease or hearing difficulties.
Three sets of Cantonese lexical tones were selected, each with a different phonetic context, using the following three syllabic structures: /fu/, /ji/ and SYMBOL. The three phonetic contexts were chosen to sample the widest range of first and second formant combinations in the Cantonese vowel system. The syllabic structures were chosen because they contain lexical entries for all six lexical tones; those entries occur frequently in everyday communication, and each represents exclusively or essentially one lexical tone. The written characters of the three sets of lexical tones are listed in Table 1.
Digital recordings of the lexical tone stimuli were prepared from two native Cantonese speakers (one male and one female). The recordings were collected in a sound treated booth. The Cool Edit Pro 2.0 software (Syntrillium Software Corporation, 2002) was used to perform the recording, with a sampling rate of 44,100 Hz and 16-bit resolution.
During recording, the speaker spoke a carrier phrase "this word is X" in Cantonese, where X was the lexical tone stimulus, to standardize the overall pitch level of the speaker's voice across the recordings. There was a pause between the carrier phrase and the lexical tone stimulus. Five tokens were recorded for each lexical tone; only one token per lexical tone was selected and used throughout the study.
For each set of the six contrastive lexical tones of each phonetic context of each speaker, the durations of the syllables carrying those tones were equalized to the mean duration of the six syllables in the set. The Praat v.4.1.2 software (Boersma & Weenink, 1992–2003) was used to perform the equalization. The quality and naturalness of the duration-equalized stimuli were affirmed by two native Cantonese-speaking speech therapists. Figure 1 shows the average change of F0 over time across the three phonetic contexts for the six lexical tones, separately for the male and female speakers.
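The duration-equalization step above can be sketched in a few lines. This is only an illustration of the bookkeeping under stated assumptions: the study used Praat, which can time-scale while preserving pitch (e.g., via PSOLA), whereas the plain interpolation shown here shifts pitch and would not be used for actual stimulus preparation.

```python
import numpy as np

def equalize_durations(syllables):
    """Stretch or compress each syllable waveform to the mean length of the set.

    NOTE: simple resampling by interpolation shifts the pitch of the signal.
    The study used Praat, which can equalize duration while preserving F0;
    this sketch only illustrates the duration bookkeeping, not that algorithm.
    """
    mean_len = int(round(np.mean([len(s) for s in syllables])))
    equalized = []
    for s in syllables:
        t_old = np.linspace(0.0, 1.0, num=len(s))   # original time axis
        t_new = np.linspace(0.0, 1.0, num=mean_len) # target time axis
        equalized.append(np.interp(t_new, t_old, s))
    return equalized
```

For example, three syllables of 100, 200 and 300 samples would all come out at the 200-sample mean duration.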
All lexical tone stimuli were low-pass filtered at 10 kHz and each was saved as an individual wave file. Their root-mean-square (RMS) intensity levels were equalized with A-weighting correction applied. The stimuli then underwent four-channel noise-excited envelope vocoder processing, similar to the procedures described in many previous studies (e.g., Dorman, Loizou, & Rainey, 1997; Fu et al., 1998; Shannon et al., 1995), to extract the TEPC.
The frequency boundaries of the four bands were 60–500 Hz, 500–1000 Hz, 1–2 kHz and 2–4 kHz. TEPC were extracted from each band by full-wave rectification and low-pass filtering. The low-pass filter cutoff frequency was set at 500 Hz, so that the filtered TEPC should encompass at least the F0, and possibly some low-order F0 harmonics, depending on the F0 level of the speaker and of the particular lexical tone. TEPC extracted from each band were used to modulate a speech-spectrum-shaped noise (with the average spectrum of all stimuli from both speakers), which was then spectrally limited by the same band-pass filter used for the original analysis. Thus TEPC were preserved in each band while the FSC were removed. Each TEPC-modulated noise band was adjusted to the same RMS level as the corresponding band of the original speech signal, so the relative amplitudes of TEPC across frequency bands in the TEPC-only stimuli were maintained as in the original unprocessed stimuli. Different combinations of the four modulated noise bands defined the following experimental conditions: (1) ALL, with all four modulated noise bands; (2) LOW, with the two bands of 60–500 Hz and 500–1000 Hz; (3) MID, with the two bands of 500–1000 Hz and 1–2 kHz; and (4) HIGH, with the two bands of 1–2 kHz and 2–4 kHz. Together with the original unprocessed signal (ORIG), there were a total of five experimental conditions.
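The vocoder processing described above can be sketched as follows. The band boundaries, 500 Hz envelope cutoff, rectification and RMS matching follow the text; the filter type and order (4th-order Butterworth) are assumptions, as the study does not state them, and a white-noise carrier stands in for the speech-spectrum-shaped noise.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 44100                                            # sampling rate (Hz)
BANDS = [(60, 500), (500, 1000), (1000, 2000), (2000, 4000)]
ENV_CUTOFF = 500.0                                    # envelope low-pass cutoff (Hz)

def bandpass(x, lo, hi, fs=FS):
    # Assumed 4th-order Butterworth; the study's filter design is not specified.
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, x)

def extract_tepc(x, lo, hi, fs=FS):
    """Full-wave rectification followed by 500 Hz low-pass of one analysis band."""
    band = bandpass(x, lo, hi, fs)
    rectified = np.abs(band)
    sos = butter(4, ENV_CUTOFF, btype="lowpass", fs=fs, output="sos")
    return sosfilt(sos, rectified)

def vocode(x, noise, bands=BANDS, fs=FS):
    """Noise-excited envelope vocoder: each band's TEPC modulates a noise
    carrier, which is re-filtered into the same band and RMS-matched to the
    corresponding band of the original signal, then all bands are summed."""
    out = np.zeros_like(x)
    for lo, hi in bands:
        env = extract_tepc(x, lo, hi, fs)
        carrier = bandpass(noise * env, lo, hi, fs)   # modulate, then band-limit
        target_rms = np.sqrt(np.mean(bandpass(x, lo, hi, fs) ** 2))
        carrier_rms = np.sqrt(np.mean(carrier ** 2))
        if carrier_rms > 0:
            carrier *= target_rms / carrier_rms       # preserve cross-band levels
        out += carrier
    return out
```

Summing only the two highest or two lowest carriers instead of all four would yield the HIGH and LOW conditions, respectively.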
Each subject attended two 1.5-hr test sessions, with at least a one-day interval between them. In the first session, research consent was obtained, followed by air-conduction pure-tone audiometry. All stimuli from one speaker were presented in the first session and all stimuli from the other speaker in the second session; the order of speakers was randomized.
Before actual testing, the test administrator introduced the six contrastive lexical tones by presenting a response plate containing the six corresponding written characters, horizontally aligned with the character for lexical tone 1 placed leftmost, the character for lexical tone 2 on its right, and so on, with the character for lexical tone 6 placed rightmost. The lexical tone numbers 1 to 6 were printed above the six characters. The test administrator first spoke disyllabic words, each containing the syllable of one of the six target lexical tones. The subject was asked to verbally repeat those disyllabic words and then the syllables of the six lexical tones. The tester then spoke two lexical tones in sequence, and the subject had to identify the characters for the two lexical tones by saying the corresponding numbers in the order of presentation. The training was repeated until the subject was familiar with the test items and the test procedure.
During actual testing, the stimuli were presented in the sound field with the subject seated one meter from a loudspeaker at 0° azimuth. A Dell personal computer running a custom test interface delivered the test stimuli to the loudspeaker via a GSI 10 audiometer. A calibration noise (the speech-spectrum-shaped noise described above) with the same A-weighted RMS level as the lexical tone stimuli was used to set a presentation level of 65 dBA at the subject's location. All test sessions were conducted in the same sound treated booth used to record the test stimuli.
During each test session, the order of the three phonetic context blocks was randomized. Within each phonetic context block, the ORIG condition was always presented first, and the other four conditions (ALL, LOW, MID and HIGH) were presented in randomized order. After each phonetic context block, the test administrator repeated the training for the six contrastive tones of the next phonetic context.
Within the test block of each experimental condition, there were a total of 36 presentations, each consisting of two lexical tone stimuli presented sequentially with a 500-msec silence in between. The 36 presentations exhausted all possible pairing combinations of the six lexical tones (6 lexical tones × 6 lexical tones), in which every lexical tone was paired with every lexical tone, including itself. There were therefore a total of 72 lexical tone items presented, with an equal number of trials (12) for each lexical tone. The 36 presentations were given in random order. Subjects had to choose one of the six lexical tone choices for each stimulus in each presentation. The subject's responses were entered into the test interface and stored for later analysis. The subject was allowed to listen twice for the first five presentations but only once for the remaining 31 presentations. No unfilled response was allowed; subjects were encouraged to guess when uncertain. All responses from the 36 presentations were included for scoring. The correct lexical tone identification scores were calculated for each experimental condition, with results broken down by the two speakers, the three phonetic contexts, and the six lexical tones. Since the response format was a six-alternative forced-choice task, chance-level performance was 16.7%.
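The counting in the design above can be verified with a short sketch: enumerating all ordered pairs of the six tones reproduces the 36 presentations, the 72 items, the 12 trials per tone, and the 16.7% chance level of a six-alternative forced-choice task.

```python
from collections import Counter
from itertools import product

tones = range(1, 7)                                   # the six Cantonese lexical tones
pairs = list(product(tones, repeat=2))                # every ordered pairing, incl. identical
assert len(pairs) == 36                               # 6 x 6 presentations per test block

items = [tone for pair in pairs for tone in pair]
assert len(items) == 72                               # two stimuli per presentation
assert all(n == 12 for n in Counter(items).values())  # 12 trials per tone

chance = 1 / 6                                        # six-alternative forced choice
print(f"chance level = {chance:.1%}")                 # prints "chance level = 16.7%"
```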
A three-way repeated measures analysis of variance (ANOVA) was conducted to investigate the main effects for the four TEPC experimental conditions (CONDITION): ALL, LOW, MID and HIGH, phonetic context (CONTEXT), and speaker (SPEAKER). The results revealed that there were significant main effects for CONDITION [F (3,51) = 68.2, p < 0.000001] and SPEAKER [F (1,17) = 142.6, p < 0.000001], but not for CONTEXT [F (2,34) = 1.8, p = 0.2]; and significant effects for all interactions at the p < 0.005 level.
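The analysis above can be sketched with a standard repeated-measures ANOVA routine. The scores below are random placeholders, not the study's data, and the factor labels are hypothetical; the sketch only shows the 18 × 4 × 3 × 2 fully crossed within-subject design, assuming the `statsmodels` `AnovaRM` implementation.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Placeholder scores standing in for the study's data: 18 subjects x
# 4 TEPC conditions x 3 phonetic contexts x 2 speakers, fully crossed,
# one observation per subject per cell (as AnovaRM requires).
rng = np.random.default_rng(0)
rows = []
for subj in range(18):
    for cond in ["ALL", "LOW", "MID", "HIGH"]:
        for ctx in ["ctx1", "ctx2", "ctx3"]:          # hypothetical context labels
            for spk in ["male", "female"]:
                rows.append({"subject": subj, "CONDITION": cond,
                             "CONTEXT": ctx, "SPEAKER": spk,
                             "score": rng.uniform(20.0, 90.0)})
df = pd.DataFrame(rows)

# Three-way repeated-measures ANOVA with all factors within-subject.
res = AnovaRM(df, depvar="score", subject="subject",
              within=["CONDITION", "CONTEXT", "SPEAKER"]).fit()
print(res.anova_table)
```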
In Figure 2, the bottom right panel shows the overall mean percent-correct lexical tone identification from the two speakers for all five experimental conditions, whereas the other three panels show the results for the individual phonetic contexts. The error bars show the 95% confidence intervals of the means. Figure 2 shows that performance with the male speaker's stimuli was better than with the female speaker's stimuli at p < 0.05 by post hoc Tukey Honest Significant Difference (HSD) test, both for the combined and the individual phonetic context analyses.
Figure 3 shows the homogeneous subsets obtained by post hoc Tukey HSD tests at p < 0.05 for the overall results, for the breakdown by speaker, and for the breakdown by phonetic context within each speaker data set. Within each homogeneous subset, the four TEPC experimental conditions are ordered with the best-performing condition at the top of the list and the worst-performing at the bottom. Among the four TEPC conditions, the overall results, with data collapsed across speaker and phonetic context, show that performance in HIGH was significantly better than in ALL, MID and LOW; performance in ALL was significantly different from that in LOW; and there was no significant difference between ALL and MID, or between MID and LOW. The same pattern holds for the male speaker data set. For the female speaker data set, HIGH outperformed MID and LOW, ALL was not significantly different from the other three conditions, and MID and LOW were not significantly different from each other.
In the breakdown by phonetic context within each speaker data set, HIGH outperformed LOW in nearly all comparisons. The only exception was the phonetic context /fu/ data set of the female speaker, in which results across the four conditions were consistently low (close to chance level) and did not differ from one another. In addition, HIGH outperformed ALL in the /ji/ and /wSYMBOLi/ phonetic contexts for the male speaker, and in the /ji/ phonetic context for the female speaker.
The mean lexical tone identification scores for LOW and HIGH were respectively 33.6% and 57.1% from the overall data set; 42.9% and 74.5% from the male speaker data set; and 24.3% and 39.7% from the female speaker data set. The results clearly indicate that TEPC in the high frequency bands (1–2 kHz and 2–4 kHz) are more important for lexical tone identification than those in the low frequency bands (60–500 Hz and 500–1000 Hz).
It is worth noting that the results from HIGH outperformed those from ALL in the phonetic contexts /ji/ and /wSYMBOLi/ for the male speaker, and in /ji/ for the female speaker. Recall that TEPC in the two high frequency bands are available in both the HIGH and ALL conditions, whereas TEPC in the two low frequency bands are available only in ALL. The results suggest not only that TEPC in the low frequency regions do not contribute effectively to lexical tone identification, but also that they might have interfered with the contribution of the important TEPC in the high frequency regions.
The lowest frequency noise band has a frequency range of 60–500 Hz, which overlaps the frequency range of the extracted TEPC (up to 500 Hz). In other words, the noise carrier might have masked the TEPC within this band through an aliasing effect, causing the poor performance in the LOW condition. However, the noise bands in the MID condition (500–1000 Hz and 1–2 kHz) do not share the frequency range of the extracted TEPC, and performance in the MID condition was still significantly worse than in the HIGH condition. The aliasing explanation therefore seems weak.
The results of this study agree with those of Shannon et al. (2001) and Apoux & Bacon (2004a) in showing that TEPC in the high frequency regions are critical for speech recognition when FSC are not available. This high-frequency specificity of TEPC has implications for designing signal processing strategies in hearing prostheses for hearing-impaired listeners. As noted above, listeners with moderate to severe cochlear hearing loss have limited ability to resolve individual components from FSC (Buss et al., 2004), whereas their ability to utilize TEPC is not significantly impaired compared with the normal-hearing population (Edwards, 2004). Plomp (1988) suggested that speech recognition can be improved by preserving and enhancing TEPC.
Lexical tone identification with the female speaker's stimuli was consistently worse than with the male speaker's stimuli for all TEPC conditions (p < 0.05, post hoc Tukey HSD test). Comparing performance in the two TEPC-only conditions (HIGH and LOW) with the TEPC-plus-FSC condition (ORIG), TEPC from the male speaker contributed much more to lexical tone identification than TEPC from the female speaker. The results suggest that the higher F0 modulation frequencies in the TEPC of the female stimuli might not be represented as well as the lower F0 modulation frequencies in the TEPC of the male stimuli.
Qin and Oxenham (2005) reported that pitch perception from TEPC was poorer at higher than at lower F0 modulations. In their study, F0 difference limens for synthetic stimuli were worse for higher F0 (220 Hz) than for lower F0 (130 Hz), especially when there were a limited number of channels in the noise-band vocoder. Kohlrausch and Fassel (2000) also showed that modulation detection thresholds worsen with increasing modulation rate, and suggested that the auditory system acts like a low-pass filter in processing temporal information.
In this study, only one token per lexical tone per phonetic context was used for each speaker. Although the quality and naturalness of the selected tokens were affirmed, the specific waveform idiosyncrasies of the single tokens might limit the generalization of the test results. More tokens are recommended in future investigations.
Temporal envelope and periodicity components (TEPC) in the high frequency bands of the speech signal are important for lexical tone identification in Cantonese when fine structure components (FSC) are not available. This finding has implications for designing signal processing strategies in hearing prostheses for hearing-impaired listeners who speak tonal languages. By designing signal processing strategies that preserve or enhance the critical TEPC in the high frequency bands of the speech signal, lexical tone identification and overall speech recognition may be improved.
We are very grateful to the subjects for their participation in this project. The study was conducted with approvals by the Clinical Research Ethics Committee of The Chinese University of Hong Kong and Hospital Authority New Territories East Cluster (Reference No. CRE-2004.212 & CRE-2005.054). The authors would like to thank the two anonymous reviewers whose comments significantly improved the paper. Part of this study was presented at the 5th Asia Pacific Symposium on Cochlear Implant and Related Sciences, 26–28 Nov 2005, Hong Kong.
ANSI. (1996). S3.6 Specifications for Audiometers. New York: American National Standards Institute, Inc.
Apoux, F., & Bacon, S. (2004a). Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise. Journal of the Acoustical Society of America.
Bauer, R. S., & Benedict, P. K. (1997). Modern Cantonese Phonology, Vol. 103. New York: Mouton de Gruyter.
Boersma, P., & Weenink, D. (1992–2003). PRAAT: Doing Phonetics by Computer, Version 4.1.2.
Buss, E., Hall, J. W. R., & Grose, J. H. (2004). Temporal fine-structure cues to speech and pure tone modulation in observers with sensorineural hearing loss. Ear & Hearing.
Dorman, M. F., Loizou, P. C., & Rainey, D. (1997). Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs. Journal of the Acoustical Society of America.
Edwards, B. (2004). Hearing aids and hearing impairment. In S. Greenberg, W. Ainsworth, A. N. Popper, & R. R. Fay (Eds.), Speech Processing in the Auditory System. New York: Springer.
Fok, A. C. Y. Y. (1974). A perceptual study of tones in Cantonese. Hong Kong: Centre of Asian Studies, University of Hong Kong.
Fu, Q. J., Zeng, F. G., Shannon, R. V., & Soli, S. D. (1998). Importance of tonal envelope cues in Chinese speech recognition. Journal of the Acoustical Society of America.
Greenberg, S., Arai, T., & Silipo, R. (1998). Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, Australia, December 1998.
Hyman, L. (2001). Tone systems. In M. Haspelmath, E. König, W. Oesterreicher, & W. Raible (Eds.), Language Typology and Language Universals: An International Handbook, Vol. 2, 1367–1380. Berlin & New York: Walter de Gruyter.
Kasturi, K., Loizou, P. C., Dorman, M., & Spahr, T. (2002). The intelligibility of speech with "holes" in the spectrum. Journal of the Acoustical Society of America.
Kohlrausch, A., & Fassel, R. (2000). The influence of carrier level and frequency on modulation and beat-detection thresholds for sinusoidal carriers. Journal of the Acoustical Society of America.
Langhans, T., & Strube, H. W. (1982). Speech enhancement by nonlinear multiband envelope expansion. Proceedings of the IEEE ICASSP, 1982.
Plomp, R. (1988). The negative effect of amplitude compression in multichannel hearing aids in the light of the modulation-transfer function. Journal of the Acoustical Society of America.
Qin, M. K., & Oxenham, A. J. (2005). Effects of envelope-vocoder processing on F0 discrimination and concurrent vowel identification. Ear & Hearing.
Rosen, S. (1992). Temporal information in speech: acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London, Series B.
Shannon, R. V., Galvin, J. J., & Baskent, D. (2001). Holes in hearing. Journal of the Association for Research in Otolaryngology.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science.
SIL International. (2004a). Chinese, Mandarin: A language of China. Available at: http://www.ethnologue.com/show_language.asp?code=CHN. Accessed June 2, 2004.
SIL International. (2004b). Chinese, Yue: A language of China. Available at: http://www.ethnologue.com/show_language.asp?code=YUH. Accessed June 2, 2004.
Vance, T. J. (1977). Tonal distinctions in Cantonese. Phonetica.
Wei, C. G., Cao, K., & Zeng, F. G. (2004). Mandarin tone recognition in cochlear-implant subjects. Hearing Research, 197.
Whalen, D. H., & Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica.
Xu, L., Tsai, Y., & Pfingst, B. E. (2002). Features of stimulation affecting tonal-speech perception: implications for cochlear prosthesis. Journal of the Acoustical Society of America.