A Multilingual Corpus for Language Identification

(from LIMSI 1996 Scientific Report)

G. Adda, J.J. Gangolf, L.F. Lamel, M. Adda-Decker, J.L. Gauvain, F. Connerade, C. Corredor, S. Foukia, C. Ulrich, H. Visser

Object

The object of this work is to record a large, multilingual (French, English, German and Spanish) corpus of telephone speech for research in automatic language identification. Issues in designing comparable corpora in different languages are addressed, including how to interact with callers so as to obtain the desired responses.

Description

The multilingual corpus will contain speech from 250 speakers of each language calling the LIMSI data collection system from their home country via a toll-free number. An additional 50 native speakers of each language will call from within France (or from a foreign country for native French speakers). The speakers are recruited by a marketing survey company, which ensures that they are balanced for sex and age (4 groups between 18 and 65 years of age). In order to represent different regional accents, subjects are recruited from 10 subareas in each country.

Each participant receives (via the marketing survey company) a set of general calling instructions and a script corresponding to the call. Each caller has a unique script, identified by a code. The scripts contain 3 types of data: general questions concerning the call and caller (code, sex, age, city name, postal code, etc.); a series of items containing pre-defined texts to read (phonetically rich sentences, dates, times, spoken and spelled names) together with a set of prompts to elicit responses ("what time is it now?"); and a set of questions aimed at obtaining spontaneous speech. The scripts were slightly modified to fit each language and country. To facilitate the generation of scripts, a language-independent program was written which makes use of language-specific files. The program randomly selects items from the prespecified files, and generates a variety of presentation formats for dates and times according to usage in the given language. The program also keeps all the information needed to easily verify and transcribe each call.

The data collection system consists of an SGI Indy (R4400, 175 MHz) and an ELAN BT8 telephone interface. The telephone interface is controlled by the workstation, and the telephone inputs and outputs are directly connected to the audio channels of the Indy. In this way one workstation is able to handle 4 telephone lines simultaneously.
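The script generation step can be illustrated with a minimal sketch. It is not the LIMSI program: the function names, the in-memory LANGUAGE_DATA table (standing in for the language-specific files), the placeholder sentences and names, and the date and time templates are all assumptions made for illustration. The sketch only shows the general idea of a language-independent generator that draws items at random from per-language resources, varies the presentation of dates and times, and returns a record of what the caller was asked to read.

```python
import random

# Hypothetical stand-ins for the language-specific files. The sentences,
# names and format templates are placeholders, not items from the corpus.
LANGUAGE_DATA = {
    "en": {
        "sentences": ["The quick brown fox jumps over the lazy dog.",
                      "She sells sea shells by the sea shore.",
                      "A large size in stockings is hard to sell."],
        "names": ["Smith", "Johnson", "Taylor"],
        "months": ["January", "February", "March", "April", "May", "June",
                   "July", "August", "September", "October", "November",
                   "December"],
        "date_formats": ["{month_name} {day}, {year}",
                         "{day:02d}/{month:02d}/{year}"],
        "time_formats": ["{hour:02d}:{minute:02d}",
                         "{hour12}:{minute:02d} {ampm}"],
    },
    "fr": {
        "sentences": ["Le petit chat boit du lait.",
                      "Il fait beau ce matin.",
                      "Nous partons demain pour Lyon."],
        "names": ["Martin", "Durand", "Bernard"],
        "months": ["janvier", "février", "mars", "avril", "mai", "juin",
                   "juillet", "août", "septembre", "octobre", "novembre",
                   "décembre"],
        "date_formats": ["{day} {month_name} {year}",
                         "{day:02d}/{month:02d}/{year}"],
        "time_formats": ["{hour:02d}h{minute:02d}",
                         "{hour:02d}:{minute:02d}"],
    },
}


def format_date(data, rng):
    """Draw a random date and render it in one of the language's formats."""
    month = rng.randint(1, 12)
    day = rng.randint(1, 28)  # 28 keeps the day valid for every month
    year = rng.randint(1990, 1999)
    template = rng.choice(data["date_formats"])
    return template.format(day=day, month=month, year=year,
                           month_name=data["months"][month - 1])


def format_time(data, rng):
    """Draw a random time and render it in one of the language's formats."""
    hour = rng.randint(0, 23)
    minute = rng.randint(0, 59)
    template = rng.choice(data["time_formats"])
    return template.format(hour=hour, minute=minute,
                           hour12=hour % 12 or 12,
                           ampm="am" if hour < 12 else "pm")


def generate_script(language, code, seed=None, n_sentences=2):
    """Build one caller script; the returned record doubles as the
    reference used later to verify and transcribe the call."""
    rng = random.Random(seed)
    data = LANGUAGE_DATA[language]
    items = [("read_sentence", s)
             for s in rng.sample(data["sentences"], n_sentences)]
    items.append(("read_date", format_date(data, rng)))
    items.append(("read_time", format_time(data, rng)))
    items.append(("spell_name", rng.choice(data["names"])))
    return {"code": code, "language": language, "items": items}


if __name__ == "__main__":
    script = generate_script("fr", "FR0001", seed=1)
    for kind, text in script["items"]:
        print(kind, "->", text)
```

Keeping the generator code free of any language-specific logic means that adding a language only requires a new resource entry, and seeding the random generator with a per-caller value makes each script reproducible from its code.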

Situation

We have completed the recording of over 250 calls in each of the four languages, and native speakers of each language are orthographically transcribing the corpus. Common protocols have been used for the transcriptions with regard to the marking of spontaneous speech effects (such as hesitations, word fragments and laughter) and non-speech events. We have found that it is essential that the calls be transcribed by a native speaker of the language who has recently lived in the country, and who therefore has up-to-date linguistic and pragmatic knowledge of the country and culture. The transcribers also participated in the definition of the scripts and questions, ensuring their naturalness. We are now recording cross-language/country calls from 50 callers for each language (native French speakers calling from Germany, Great Britain and Spain, and native British, German and Spanish speakers calling from France). These latter calls will be used for testing purposes, to ensure that the language and not the telephone channel is being identified. An analysis of the calls will help us define future data collection scenarios. For example, we have found that Spanish and German callers are relatively verbose in their responses to questions, whereas English callers typically respond in single words or short phrases; shorter scripts in Spanish or German should therefore yield the same amount of recorded speech.
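The reasoning about verbosity and script length can be made concrete with a small sketch. It assumes that transcriptions are available as (language, text) pairs and that the number of whitespace-separated words is an acceptable proxy for the amount of speech; both are assumptions for illustration, as are the function names. The example strings are invented toy responses, not data from the corpus.

```python
import math
from collections import defaultdict


def mean_words_per_response(responses):
    """Average word count per response for each language.

    responses: iterable of (language, transcribed_text) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # language -> [words, responses]
    for language, text in responses:
        totals[language][0] += len(text.split())
        totals[language][1] += 1
    return {lang: words / count for lang, (words, count) in totals.items()}


def questions_needed(target_words, mean_words):
    """Number of questions expected to yield at least target_words words."""
    return math.ceil(target_words / mean_words)


if __name__ == "__main__":
    # Invented toy responses, used only to exercise the functions.
    toy = [
        ("en", "yes"),
        ("en", "in London"),
        ("es", "pues vivo en Madrid desde hace diez años"),
    ]
    means = mean_words_per_response(toy)
    for lang, mean in means.items():
        print(lang, mean, questions_needed(24, mean))
```

With per-language averages of this kind, the number of spontaneous-speech questions in a script could be scaled so that each language contributes a comparable amount of speech per call.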

References

[1] "Identification Automatique de la Langue à travers le réseau téléphonique," Contract report CNET no. 94 1B 089, no. 1-6.

Last modified: Sunday,11-December-05 06:13:33 CET

Spoken Language Processing Group (TLP)
