TU Berlin

Assessment of Speech Intelligibility


Speech intelligibility is the psychoacoustic metric that quantifies the proportion of an uttered signal correctly understood by a given listener. Recognition tasks range from phones, syllables, and words up to entire sentences. A listener's ability to retrieve speech features depends on external factors, such as competing acoustic sources, their spatial distribution, and the presence of reverberant surfaces, as well as internal factors, such as prior knowledge of the message, hearing loss, and attention. The study of this problem, described as the “cocktail party effect” by Cherry in 1953, has motivated a large body of research.

Formerly known as the Articulation Index of French and Steinberg (1947), itself the outcome of Fletcher’s lifelong discoveries and intuition, the Speech Intelligibility Index (SII, ANSI 1997) quantifies the amount of speech information that remains available after frequency filtering or masking of speech by stationary noise. It is correlated with intelligibility, and mapping functions to the latter have been established for different recognition tasks and speech materials. Similarly, Steeneken and Houtgast (1980) developed the Speech Transmission Index, which predicts the impact of reverberation on intelligibility from the speech envelope. In 1963, Durlach proposed the Equalization and Cancellation theory, which models the advantage of binaural over monaural listening when acoustic sources are spatially distributed. The variability of the experimental methods in use inspired Boothroyd and Nittrouer, who in 1988 introduced an approach to quantify the predictability of a message: they established the relation between the recognition probability of an element and that of the whole it composes.
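The relation between element and whole is commonly summarized as a power law, p_whole = p_part ** j, where j can be read as the effective number of independent parts in the whole. The following is a minimal sketch under that assumption only; the function names and the example values are illustrative and do not come from the original study.

```python
import math


def whole_from_parts(p_part: float, j: float) -> float:
    """Predict the probability of recognizing a whole (e.g. a word)
    from the probability of recognizing one of its parts (e.g. a phoneme),
    assuming the power law p_whole = p_part ** j."""
    if not 0.0 <= p_part <= 1.0:
        raise ValueError("p_part must lie in [0, 1]")
    if j <= 0:
        raise ValueError("j must be positive")
    return p_part ** j


def estimate_j(p_part: float, p_whole: float) -> float:
    """Estimate j from measured part and whole recognition scores.
    Both scores must lie strictly between 0 and 1 for the logarithms
    to be defined and non-zero."""
    if not (0.0 < p_part < 1.0 and 0.0 < p_whole < 1.0):
        raise ValueError("scores must lie strictly between 0 and 1")
    return math.log(p_whole) / math.log(p_part)


if __name__ == "__main__":
    # Hypothetical scores: 80 % of phonemes and 60 % of words recognized.
    j = estimate_j(0.8, 0.6)
    print(f"effective number of independent parts: {j:.2f}")
```

Under this reading, a value of j close to the actual number of parts suggests that the parts are recognized independently, while a smaller value suggests that context makes the whole easier to recover than its parts alone would predict.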

However accurate these methods have proven to be, they apply only to maskers with stationary properties. The very common case in which the competing acoustic source is itself speech cannot be handled by these methods, as speech is non-stationary by nature. Meanwhile, communication among multiple speakers is bound to increase, and non-stationary sources severely impair listeners with hearing loss, for whom the cocktail party problem is all the more acute.

If one aims at predicting situations that vary, time must be included as a variable in the models, which should consequently become progressively signal-based. In 2005, Rhebergen and Versfeld proposed a conclusive method for the case of time-fluctuating noise. However, the question of speech in competition with speech remains open. Voice similarity, utterance rate, and cross semantics are some of the features that, through their effect on attention, add variability to the listener's recognition performance. To isolate their impact, it is now of prime importance to develop blind, signal-based models that quantify the weight of what could be called the energetic masking of speech by speech. This can be achieved, for example, by measuring the performance of an artificial speech recognizer with minimal knowledge of language, so as to extract the weight of central cues in message retrieval by humans.
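The idea of bringing time into an SII-type model can be sketched as follows: compute an index per short time frame from the momentary noise level in each frequency band, then average over frames. This is a heavily simplified illustration, not the published procedure. It assumes that band levels in dB are already available, uses the linear mapping of band signal-to-noise ratio from -15 dB to +15 dB onto an audibility between 0 and 1, and ignores hearing thresholds, spread of masking, and level distortion. The band-importance weights and all function names are hypothetical.

```python
from typing import Sequence


def band_audibility(snr_db: float) -> float:
    """Map a band SNR in dB onto [0, 1]: -15 dB gives 0, +15 dB gives 1."""
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))


def sii_like(speech_db: Sequence[float],
             noise_db: Sequence[float],
             importance: Sequence[float]) -> float:
    """Importance-weighted sum of band audibilities for one frame.
    The importance weights are assumed to sum to 1."""
    if not (len(speech_db) == len(noise_db) == len(importance)):
        raise ValueError("band vectors must have equal length")
    return sum(w * band_audibility(s - n)
               for s, n, w in zip(speech_db, noise_db, importance))


def frame_averaged_sii(speech_db: Sequence[float],
                       noise_frames_db: Sequence[Sequence[float]],
                       importance: Sequence[float]) -> float:
    """Average the per-frame index over time. The speech level is the
    long-term band level; only the noise varies from frame to frame."""
    if not noise_frames_db:
        raise ValueError("at least one noise frame is required")
    values = [sii_like(speech_db, frame, importance)
              for frame in noise_frames_db]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical four-band example with uniform importance weights.
    speech = [60.0, 60.0, 60.0, 60.0]
    weights = [0.25, 0.25, 0.25, 0.25]
    stationary = [[66.0] * 4, [66.0] * 4]    # noise constant over time
    fluctuating = [[66.0] * 4, [30.0] * 4]   # noise with a quiet gap
    print(frame_averaged_sii(speech, stationary, weights))
    print(frame_averaged_sii(speech, fluctuating, weights))
```

In this toy setting the fluctuating masker yields a higher averaged index than the stationary one, because the quiet frame contributes full audibility. This mirrors the benefit listeners obtain from gaps in a fluctuating noise. It does not address speech-on-speech masking, where the factors listed above come into play.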

A better understanding of the complex mechanisms of the cocktail party effect at the central level is key to improving multi-speaker conversation scenarios, listening conditions for the hearing impaired, and, more generally, human performance and attentional capacities.
