Spoken Language Processing Group at LIMSI | Perception and automatic processing of variation in speech

Perception and automatic processing of variation in speech

Philippe Boula de Mareuil, Ioana Vasilescu, Martine Adda-Decker, Lori Lamel, Sophie Rosset

The very large corpora used for training statistical models are exploited for linguistic studies of spoken language, covering acoustic phonetics, pronunciation variation and diachronic evolution. Automatic alignment enables studies on hundreds to thousands of hours of data, permitting the validation of hypotheses and models. This topic also covers human and machine transcription errors, studied via perception experiments.

In speech communication, considerable variability comes into play, raising issues for both humans and machines. The aim is to increase our knowledge in this area and to improve the performance of automatic speech processing systems. As a research topic, variation in speech (socially constructed, correlating with speaker groups or situations) can also benefit from technological advances, since its study requires large-scale phonetic analyses.

In addition to diachronic change, the sociolinguistic literature differentiates three types of variation in speech: diatopic (regional), diastratic (social) and diaphasic (stylistic, within-speaker). This complex reality is routinely referred to in layman's terms as accents and speaking styles. Our work, which focuses on these two issues, combines perceptual and acoustic approaches to account for variation due to speakers' geographic and linguistic backgrounds (accents) as well as the communicative situation (styles). It is based on large amounts of data and uses measurement tools derived from automatic speech processing techniques to quantify certain trends, especially in the French language. The evolution of non-standard variants in French broadcast news data was studied in collaboration with Maria Candea (Univ. Paris 3, sabbatical leave). Regional and foreign accents in Italian were also addressed by two Italian PhD students, Ilaria Margherita (Univ. Pisa) and Marilisa Vitale (Univ. Naples "L'Orientale").

A first research axis is concerned with the modelling, identification and characterisation of regional and foreign accents in French. Perceptual experiments and acoustic analyses were carried out using automatic phoneme alignment, which could include pronunciation variants corresponding to Southern, Belgian, West-African, Maghrebi, English, German, Spanish and Portuguese accents, among others. In total, over 100 hours of regional- or foreign-accented French were analysed. Some of the most discriminating pronunciation features, such as the realisation of nasal vowels in Southern French or the realisation of certain schwas (backed and closed) in Portuguese-accented French, were ranked using machine learning techniques. Word-initial stress followed by a falling pitch contour was also observed in Senegalese-accented French and interpreted as a possible prosodic transfer from Wolof (the dominant language in Senegal) to French.

Since speech conveys both phonemic and prosodic information, we examined the contribution of prosody to the perception of regional or foreign accents (Belgian, Corsican, Italian, Polish, among others), of the so-called banlieue accent and of broadcasters' style. The latter was studied from a diachronic perspective through newscast archives dating back to the 1940s. The methodology included prosody transplantation and modification/resynthesis. The contribution of prosody was highlighted especially for Belgian French, with peculiar vowel lengthening phenomena; Polish-accented French, with more marked word segmentation; the banlieue accent, with sharp word-final pitch falls; and the news announcer style of the 1940s and 1950s, with a more marked tendency to initial stress than in the following decades. In addition, falling yes/no questions were found in both Corsican French and Corsican; they may be attributed to prosodic transfer.

Another research axis is concerned with acoustic-prosodic investigations of frequent homophone (or near-homophone) word pairs in French, more recently extended to describe spontaneous speech in general, as found in conversations or interviews. Prosodic specificities have been studied on a 30-hour dialogue corpus and compared to a 100-hour broadcast corpus of prepared (at least partially read) speech. Compared to broadcast news, spontaneous speech exhibits flatter melodic contours and less marked word segmentation: prosodic cues to word boundaries are weaker in spontaneous face-to-face speech, where speakers and interlocutors may interrupt their conversation at any point to clarify the subject, should some otherwise unresolvable ambiguity arise. Speaking style may also affect F0 profiles, especially across determiners and nouns, where F0 values are lower for spontaneous speech than for prepared speech. However, for both speaking styles, F0 values in noun phrases start at a relatively low level and rise as soon as the first syllable of the following noun is produced.

Spontaneous speech is also characterised by the presence of a number of disfluencies (hesitations, repetitions, false starts) and discourse markers. These phenomena were studied with a view to better understanding human strategies in verbal interaction management. The properties of some classical discourse markers (bon, ben, alors, donc, enfin/menfin etc.) and of the French hesitation euh were then investigated in interactive man-machine and man-man question-answering dialogues. The research was based on the working hypothesis that the automatic extraction of truly informative words from utterances may benefit from a smart handling of items specific to spontaneous speech, such as vocalic hesitations and discourse markers. Preliminary results indicate that these devices help start speaker turns and initiate rephrasing at two utterance levels: global (utterance rewording) and local (utterance-internal). Utterance-internal rephrasing may concern a word, a phrase or a whole sentence within the ongoing utterance. Classical discourse markers and the hesitation euh also help frame relevant information, in particular when they occur in utterance-internal position. In the long run, the modelling of such items may improve natural language generation and understanding.
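As a rough illustration of the working hypothesis above, the following minimal sketch separates candidate discourse markers and hesitations from the remaining words of a transcribed utterance and records whether each one occurs before any content word (utterance-initial) or after one (utterance-internal). Only the marker list is taken from the text; the function name `split_markers`, the whitespace tokenisation and the position labels are illustrative assumptions and do not describe the group's actual tools. Words such as alors or donc are not always used as discourse markers, so a real system would need disambiguation that this sketch does not attempt.

```python
# Hypothetical sketch: the marker list comes from the text above; the
# tokenisation and the position labels are illustrative assumptions.
MARKERS = {"bon", "ben", "alors", "donc", "enfin", "menfin", "euh"}


def split_markers(utterance):
    """Separate candidate markers/hesitations from the remaining words.

    Returns (content_words, found), where found is a list of
    (token, position) pairs and position is "initial" (no content word
    seen yet) or "internal" (at least one content word seen before it).
    """
    tokens = utterance.lower().split()  # naive whitespace tokenisation
    content_words = []
    found = []
    seen_content = False
    for token in tokens:
        if token in MARKERS:
            position = "internal" if seen_content else "initial"
            found.append((token, position))
        else:
            content_words.append(token)
            seen_content = True
    return content_words, found


if __name__ == "__main__":
    words, markers = split_markers("euh bon je voudrais donc un billet")
    print(words)
    print(markers)
```

For the invented utterance in the example, the remaining words are je, voudrais, un, billet; euh and bon are labelled utterance-initial and donc utterance-internal.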

Modelling the production and perception of variation in speech is important for understanding possible sources of errors in automatic speech recognition systems. This holds in particular for reduced speech phenomena, which have been extensively investigated in large spoken corpora in French and American English. The long-term aim of this research axis is to improve the modelling of ambiguous items so as to reduce automatic transcription errors. It is well known that human listeners significantly outperform machines when transcribing speech. Perceptual paradigms have been developed in which human listeners are asked to transcribe broadcast speech segments containing words that are frequently misrecognised by the system. In particular, we seek to determine how much additional context helps humans disambiguate problematic lexical items, typically homophone or near-homophone words. Speech transcription errors are currently being investigated in relation to Named Entity detection, Spoken Language Understanding, Spoken Question-Answering and Dialog Systems. Furthermore, the same methodology can be applied to other languages, accents and styles. Future work will try to take better account of social factors as well as accent, speaking style and expressive speech.

Some publications

Last modified: Monday, 02-December-13 04:24:15 CET
