Speech intelligibility in noise

Speech can be modified to promote intelligibility in noise, but the potential benefits for non-native listeners are difficult to predict due to the additional presence of distortion introduced by speech alteration. The current study compared native and non-native listeners’ keyword scores for simple sentences, unmodified and with six forms of modification. Both groups showed similar patterns of intelligibility change across conditions, with the native cohort benefiting slightly more in stationary noise. This outcome suggests that the change in masked audibility rather than distortion is the dominant factor governing listeners’ responses to speech modification.
1. Introduction

Listeners are frequently required to understand recorded or synthetic speech output under less-than-ideal conditions. One approach to maintaining intelligibility in such environments is to modify the clean speech prior to output (e.g., Skowronski and Harris, 2006; Taal et al., 2013). Large-scale evaluations have demonstrated gains equivalent to an increase in speech level of more than 5 dB for participants listening in their first language, at least for English (Cooke et al., 2013). It is of interest to ask whether non-native listeners (NNLs) benefit from speech modifications to the same extent as native listeners (NLs). While the effect of noise on speech perception in NNLs has been researched extensively (see review in García Lecumberri et al., 2010), most studies to date have employed unaltered forms of speech. Far less is known about the impact of modified speech on NNLs.
Many speech modification algorithms aim to improve the masked audibility of speech. For instance, Taal et al. (2013) sought the optimal linear filter maximizing an approximation to the Speech Intelligibility Index (ANSI, 1997). If masking release is the main effect of speech modification, previous studies of the effect of noise on NNLs (e.g., Cutler et al., 2004) lead to the prediction that this group of listeners will benefit by a similar amount to NLs for speech material with a predictable syntactic structure and limited lexicon. However, a known side-effect of modification is some degree of distortion, and it is also possible that NLs are able to use their richer experience with the phonology of the target language to extract a larger benefit than NNLs.
Earlier studies with altered speech styles provide a mixed picture of their effects on NNLs. Hazan and Simpson (2000) examined the degree of benefit produced by selective amplification of perceptually-salient regions of vowel-consonant-vowel material. Two groups of NNLs with different first languages showed intelligibility gains over unprocessed speech similar to those of an NL cohort. However, a study using synthetic speech (Reynolds et al., 1996) demonstrated that NNLs suffer larger deficits than NLs for this form of non-standard speech material. Likewise, Lombard speech has been shown to be somewhat less beneficial for NNLs (Cooke and García Lecumberri, 2012).
The current study measured the effect of speech modification on NNLs using a range of algorithms tested in Tang and Cooke (2011). The six modification techniques tested differ both in their effect on intelligibility and in their degree of disruption to speech quality as predicted by an objective measure. NNLs identified keywords in simple unmodified and modified English sentences presented in stationary and fluctuating maskers. Results are compared with those from an NL cohort of 24 British English participants tested in Tang and Cooke (2011).

2. Methods

2.1 Listeners
A group of 71 young adult listeners participated in the experiment. All were either monolingual native speakers of Spanish or bilingual in Spanish and Basque, and all were in their second year of studies for the degree of English Philology at the University of the Basque Country, Spain. Of these, six failed to complete some of the conditions and were excluded from subsequent analysis, leaving 65 listeners.
2.2 Speech and noise material
Sentences were drawn from the GRID Corpus (Cooke et al., 2006) and consisted of six-word sequences with spoken letter and digit keywords in the fourth and fifth positions, e.g., “lay red at K 4 now,” spoken by 1 of 34 male or female talkers. These so-called “matrix” sentences were chosen in this preliminary study to avoid the involvement of higher-level knowledge, which is known to produce larger NL benefits in noise (García Lecumberri et al., 2010). Sentences were drawn at random from the corpus and presented in stationary (speech shaped noise; SSN) or fluctuating (speech modulated noise; SMN) maskers. The SSN sample approximated the long-term spectrum of the unmodified speech corpus. SMN was derived by modulating the SSN signal with the short-term temporal envelope of randomly-concatenated sequences of utterances from the corpus.
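The construction of the fluctuating masker can be illustrated with a minimal sketch. It assumes that the short-term envelope is obtained by smoothing the rectified speech with a moving average; the text does not state how the envelope was extracted, so `temporal_envelope`, the window length, and the white-noise stand-in for SSN are all hypothetical choices made for illustration only.

```python
import random

def temporal_envelope(signal, window):
    """Short-term envelope as a moving average of the rectified signal.

    The smoothing method and window length are assumptions; the source
    does not specify how the envelope was computed.
    """
    rectified = [abs(x) for x in signal]
    envelope = []
    running = 0.0
    for i, value in enumerate(rectified):
        running += value
        if i >= window:
            running -= rectified[i - window]
        envelope.append(running / min(i + 1, window))
    return envelope

def modulate(noise, envelope):
    """Impose the speech envelope on the noise, sample by sample."""
    return [n * e for n, e in zip(noise, envelope)]

rng = random.Random(0)
# Toy stand-in for concatenated utterances: a burst, a silence, a burst.
speech = ([rng.gauss(0, 1) for _ in range(400)]
          + [0.0] * 400
          + [rng.gauss(0, 1) for _ in range(400)])
# White noise used here only as a placeholder for the speech-shaped noise.
ssn = [rng.gauss(0, 1) for _ in range(len(speech))]
smn = modulate(ssn, temporal_envelope(speech, window=50))
```

In this toy case the modulated noise falls silent wherever the concatenated speech does, which is the property that distinguishes the fluctuating masker from the stationary one.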
2.3 Processing conditions
Speech material was processed by six different modification techniques described in Tang and Cooke (2011): “SegSNR,” “ChanSNR,” and “LocalSNR” equalized the signal-to-noise ratio (SNR) in each frame, frequency channel, and time-frequency location, respectively; “SelectBoost” amplified masked channels in the frequency range 1800–7500 Hz; “Pausing” introduced a 300 ms pause preceding a word boundary in such a way as to avoid the most intense noise epoch; and “Combined” consisted of Pausing and SelectBoost in sequence. Modifications were applied to clean speech prior to mixing with noise.
The overall root-mean-square (rms) energy was equalized following the modification, and since the Pausing and Combined techniques introduced pauses, the duration of the remaining speech sections was linearly compressed by an equivalent amount.
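As a rough illustration of frame-wise SNR equalization followed by rms equalization, the sketch below scales each speech frame so that all frames share the same speech-to-noise energy ratio, then restores the original overall rms. This is not the SegSNR implementation of Tang and Cooke (2011), whose details are not given here; the non-overlapping frames, the frame length, and the function names are assumptions.

```python
import math
import random

def energy(samples):
    return sum(x * x for x in samples)

def rms(samples):
    return math.sqrt(energy(samples) / len(samples))

def equalize_frame_snr(speech, noise, frame_len):
    """Scale each speech frame so every frame has the same SNR (assumed scheme)."""
    out = []
    for start in range(0, len(speech), frame_len):
        s = speech[start:start + frame_len]
        n = noise[start:start + frame_len]
        e_s, e_n = energy(s), energy(n)
        # Gain that makes this frame's speech energy equal to its noise energy.
        gain = math.sqrt(e_n / e_s) if e_s > 0 else 0.0
        out.extend(gain * x for x in s)
    return out

def match_rms(modified, reference):
    """Rescale the modified signal to the overall rms of the reference."""
    gain = rms(reference) / rms(modified)
    return [gain * x for x in modified]

def frame_snrs_db(speech, noise, frame_len):
    return [10 * math.log10(energy(speech[i:i + frame_len])
                            / energy(noise[i:i + frame_len]))
            for i in range(0, len(speech), frame_len)]

rng = random.Random(1)
frame_len = 100
# Toy signals whose levels change from frame to frame.
speech = [rng.gauss(0, lvl) for lvl in (1.0, 0.2, 0.6, 0.1) for _ in range(frame_len)]
noise = [rng.gauss(0, lvl) for lvl in (0.3, 0.9, 0.3, 0.9) for _ in range(frame_len)]

modified = match_rms(equalize_frame_snr(speech, noise, frame_len), speech)
snrs = frame_snrs_db(modified, noise, frame_len)
```

Because the final rms correction is a single global gain, it shifts every frame's SNR by the same amount and leaves the equalization across frames intact.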
Figure 1 shows waveforms and spectrograms for unprocessed and modified speech for an example utterance. It is evident that the modification techniques differ in the degree of alteration to the signal and its spectro-temporal characteristics. For example, while ChanSNR is equivalent to a constant spectral filter and has little effect on speech quality, both SegSNR and LocalSNR impose rapid variations across time frames and result in significant audible distortions. Table 1 provides an estimate of distortion using the objective speech quality measure PESQ (Rix et al., 2001). For the modifications tested here, values cover the entire PESQ range, from 1 (poor quality) to 4.5 (undistorted speech) relative to the reference unmodified speech signal.

Fig. 1.
Original and modified waveforms and spectrograms for the utterance “Set red by O 2 soon.”

Table 1.
Mean PESQ values across 50 sentences in each modified speech condition. Standard deviations are given in parentheses.

2.4 Procedure
In Tang and Cooke (2011), NLs were tested at SNRs of −6 and −9 dB, apart from the modification method LocalSNR, which was mixed at SNRs of 0 and 3 dB due to reduced intelligibility at lower SNRs. In the current study, NNLs were tested at −6 and 0 dB for all conditions apart from LocalSNR, which was presented at 3 and 6 dB. Results are given here for the SNRs that the two listener groups had in common, namely, −6 dB (3 dB for LocalSNR). SNRs were computed over the region where speech was present.
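Computing the SNR over the speech-present region only can be sketched as follows. The example assumes that the boundaries of the speech region are known and that the noise, rather than the speech, is rescaled to reach the target SNR; neither detail is stated in the text, and `mix_at_snr` and its parameters are hypothetical names.

```python
import math
import random

def power(samples):
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db, speech_start, speech_end):
    """Scale the noise so the SNR over [speech_start, speech_end) equals snr_db."""
    p_speech = power(speech[speech_start:speech_end])
    p_noise = power(noise[speech_start:speech_end])
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

rng = random.Random(2)
# Toy utterance with leading and trailing silence; speech occupies samples 200-799.
speech = [0.0] * 200 + [rng.gauss(0, 1) for _ in range(600)] + [0.0] * 200
noise = [rng.gauss(0, 1) for _ in range(len(speech))]

mixture, scaled_noise = mix_at_snr(speech, noise, -6.0, 200, 800)
achieved_db = 10 * math.log10(power(speech[200:800]) / power(scaled_noise[200:800]))
```

Restricting the power estimates to the speech region keeps leading and trailing silence from lowering the measured speech level, which would otherwise make the effective SNR during speech higher than the nominal value.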
Listeners heard speech in noise in 28 conditions made up of all combinations of the 2 masker types, 2 SNRs, and 7 sentence processing conditions (i.e., 6 modifications plus unmodified speech). Sentences were blocked by condition: within each block the SNR, masker, and modification were constant. Each block consisted of 50 utterances. To avoid sentence subset effects, 28 sets of 50 sentences were generated for each condition (i.e., 784 sets in total) and listeners were assigned to sentence sets using a balanced design which ensured that no listener heard the same sentence more than once, and that each listener heard the same number of sentences in each of the 28 conditions. Condition order was also balanced across listeners, and the order of stimulus presentation within each condition was randomized.
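The counts in this design can be checked with a short sketch. The rotation scheme below (a Latin-square style assignment over a hypothetical pool of 1400 sentence indices) is only one way to satisfy the stated constraints; the balancing procedure actually used is not described in the text.

```python
from itertools import product

MASKERS = ["SSN", "SMN"]
PROCESSING = ["Unmodified", "SegSNR", "ChanSNR", "LocalSNR",
              "SelectBoost", "Pausing", "Combined"]
BLOCK = 50  # utterances per block

def snrs_for(processing):
    """SNRs (dB) used for non-native listeners, as given in the procedure."""
    return (3, 6) if processing == "LocalSNR" else (-6, 0)

conditions = [(masker, snr, proc)
              for masker, proc in product(MASKERS, PROCESSING)
              for snr in snrs_for(proc)]
n_conditions = len(conditions)  # 2 maskers x 2 SNRs x 7 processing types

def sentence_set(rotation, condition_index):
    """Sentence indices for one condition under one rotation (assumed scheme).

    The pool of n_conditions * BLOCK sentences is split into disjoint subsets,
    and each rotation shifts which subset is paired with which condition.
    """
    subset = (condition_index + rotation) % n_conditions
    start = subset * BLOCK
    return list(range(start, start + BLOCK))

# Across all rotations, every condition is paired with every subset once.
pairings = {(c, (c + r) % n_conditions)
            for r in range(n_conditions) for c in range(n_conditions)}
```

Under this scheme each listener hears 28 × 50 = 1400 distinct sentences, and the 28 rotations yield the 784 condition-specific sentence sets mentioned above.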
The experiment took place in a quiet laboratory. Stimuli were delivered under computer control via Plantronics Audio-90 headphones (Plantronics, Santa Cruz, CA). Participants entered letter and number keywords using a computer keyboard. Listeners were familiarized with the task via a short practice session and undertook the main experiment, which required approximately 90 min to complete, over 2 sessions separated by a break.

3. Results

In the unmodified speech condition, NLs (from Tang and Cooke, 2011) identified 63.8% of keywords correctly in stationary noise and 81.1% in fluctuating noise, while NNLs obtained scores of 52.8% and 67.7%, respectively, representing NL benefits of 11.0 and 13.4 percentage points. Figure 2 plots mean percentage keywords correct for the two listener groups for all conditions. It is evident that NL and NNL scores are highly correlated [r = 0.97, p < 0.001], with the best linear fit having a slope close to unity and showing a mean NNL deficit of just over 12 percentage points.

Fig. 2.
Mean keyword correct scores for NLs and NNLs in stationary noise (filled symbols) and fluctuating noise (unfilled symbols). Points have been shifted randomly by up to ±0.5 percentage points to avoid overlap. Native data come from Tang and Cooke (2011).

The upper panel of Fig. 3 presents changes in keyword scores, expressed in percentage points, for the six processed speech conditions for both listener groups relative to their respective unmodified speech baselines. Overall, NLs and NNLs show a very similar pattern of gain for each masker. The additional NL gain, averaged across modifications, was 5.1 percentage points in stationary noise and 0.8 percentage points in fluctuating noise. Separate two-factor (modification by listener group) repeated-measures analyses of variance were computed for each masker type. For the SSN masker, gains differ across modifications [F(5, 435) = 363, p < 0.001, η² = 0.66] and between listener groups [F(1, 87) = 10.2, p < 0.01, η² = 0.06], but the interaction between these factors is not statistically significant [p = 0.22]. For the SMN masker, the effect of modification is again significant [F(5, 435) = 250, p < 0.001, η² = 0.62]. However, the two listener groups have equivalent overall gains [p = 0.48]. The modification by listener group interaction is significant [F(5, 435) = 3.61, p < 0.01, η² = 0.023]. Post hoc comparisons based on a Fisher’s Least Significant Difference value of 2.6 percentage points indicate that the interaction is due to different gains for the LocalSNR modification technique.
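The degrees of freedom reported above can be cross-checked against the listener counts given in the Methods. The sketch assumes a standard mixed design with modification as the within-listener factor and listener group as the between-listener factor; the variable names are illustrative.

```python
n_native = 24            # NL cohort from Tang and Cooke (2011)
n_nonnative = 71 - 6     # NNLs recruited minus those excluded
n_groups = 2
n_modifications = 6      # gains are computed for the six modified conditions

# Between-listener error term: listeners minus groups.
df_group_error = n_native + n_nonnative - n_groups

# Within-listener factor and its error term.
df_modification = n_modifications - 1
df_modification_error = df_modification * df_group_error

print(df_group_error, df_modification, df_modification_error)
```

Under these assumptions the values match the reported F(1, 87) for listener group and F(5, 435) for modification and for the interaction.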

Fig. 3.
NL and NNL keyword score gains in percentage points (pps; upper) and changes in RTs (lower) over unmodified speech in SSN (left) and SMN (right). Error bars represent ±1 standard error. Native data come from Tang and Cooke (2011).

Figure 3 (lower) plots changes in response times (RTs) relative to unmodified speech. The median RT (measured from stimulus onset) per listener in each condition was used to avoid the influence of very long or short RTs. In the baseline unmodified speech condition, NLs required 2.8 and 2.7 s for the SSN and SMN maskers, while NNLs responded in 3.4 and 3.1 s, respectively. For both maskers there is a significant interaction between nativeness and modification technique [SSN: F(5, 435) = 2.8, p < 0.05, η² = 0.01; SMN: F(5, 435) = 4.9, p < 0.001, η² = 0.03]. The pattern of RT change is complex, and varies both with modification technique and masker type. For NNLs, most of the RT changes across modification methods represent an amplified version of those seen for NLs.
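The use of a per-listener median to limit the influence of extreme RTs can be sketched as below. The trial records are invented purely for illustration and are not data from the study; `median_rts` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

def median_rts(trials):
    """Median RT for each (listener, condition) pair.

    `trials` is an iterable of (listener, condition, rt_seconds) tuples.
    """
    grouped = defaultdict(list)
    for listener, condition, rt in trials:
        grouped[(listener, condition)].append(rt)
    return {key: median(values) for key, values in grouped.items()}

# Invented example trials, including one very long response.
trials = [
    ("L01", "SSN/Unmodified", 2.9),
    ("L01", "SSN/Unmodified", 3.1),
    ("L01", "SSN/Unmodified", 14.0),
    ("L01", "SSN/SelectBoost", 2.5),
    ("L01", "SSN/SelectBoost", 2.6),
    ("L01", "SSN/SelectBoost", 2.8),
]

medians = median_rts(trials)
baseline = medians[("L01", "SSN/Unmodified")]
change = medians[("L01", "SSN/SelectBoost")] - baseline
```

In the invented baseline block, the single 14 s response would pull the mean well above 6 s, whereas the median stays at 3.1 s.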

4. Discussion

In common with most previous studies which compared speech-in-noise intelligibility of NLs and NNLs (see review in García Lecumberri et al., 2010), the non-native group identified fewer keywords correctly in noise than the native cohort. However, both listener groups showed a strikingly similar pattern of intelligibility changes when confronted by modified speech relative to an unmodified speech baseline. This finding is in line with Hazan and Simpson (2000), whose two NNL groups benefited from speech enhancements to a similar degree to that of a native control group. Unlike Hazan and Simpson (2000), whose modifications involved selective amplification of regions of phonetic importance, the algorithms tested in the current study were designed to promote masked audibility without regard for speech content, since a wider range of modification strategies is available if the need to identify salient phonetic information is removed. The present study supports the notion that differences in masked audibility across modification techniques affect NLs and NNLs in a similar way. We found little evidence for the hypothesis that NLs are better able to handle distortions to the expected speech pattern resulting from speech modification. While NLs did benefit more (or suffer less) from modifications in the stationary masker, this additional NL benefit of around 5 percentage points was similar for all modifications regardless of the amount of objective distortion each one introduced. For the modulated masker, NNLs were more adversely affected in the LocalSNR condition, where it might be argued that distortion played some part. However, the two conditions containing pauses had lower objective speech quality but exhibited no NNL disadvantage. One possibility is that spectro-temporal and pause-based modifications have differential effects on NLs and NNLs.
As expected, listeners responded more rapidly in conditions which produced high intelligibility. For instance, RTs decreased in stationary noise for the LocalSNR and SelectBoost modifications. Here, though, non-native RTs showed larger decreases over their baseline. This may be a ceiling effect: it is possible that at around 2.6 s for SelectBoost NLs were already responding as rapidly as possible. In spite of their larger decrease in RT, NNLs in the same condition remained slower at around 2.85 s. It is less clear why RTs for NNLs were more adversely affected than those for NLs in conditions which exhibited intelligibility reductions in the presence of fluctuating noise. The largest differential effect is seen for SegSNR. This modification redistributes energy across time frames to ensure that each has an equivalent SNR. For fluctuating maskers this has the side-effect of coupling speech modulations to those of the masker. The possibility that NNLs require more processing resources to perform speech separation under these conditions merits further study.
Finally, we note that the aim of this initial study was to establish the effect of masked audibility and distortion in sentences where the value of higher-level linguistic information is minimized. It remains to be seen whether modifications to more complex speech material interact with a listener’s native language status.

5. Conclusions

Changes in intelligibility resulting from modified speech show a similar pattern for NLs and NNLs despite differences in the degree of objective speech distortion across modifications. This outcome encourages the deployment of algorithmically-altered forms of speech in applications such as public transport interchanges, where they promise to benefit listeners regardless of whether they are listening in their native language.

Acknowledgments

This work has received funding from the European Union 7th Framework Programme under Grant Agreement No. FP7-PEOPLE-2011-290000 (INSPIRE) and the Basque Government under grant Language and Speech (IT311-10).
