Spoken Language Processing Group (TLP)
A Multilingual Corpus for Language Identification
(from LIMSI 1996 Scientific Report)
G. Adda, J.J. Gangolf, L.F. Lamel, M. Adda-Decker, J.L. Gauvain,
F. Connerade, C. Corredor, S. Foukia, C. Ulrich, H. Visser
Object
The object of this work is to record a large, multilingual (French,
English, German and Spanish) corpus of telephone speech for research
in automatic language identification. Issues in
designing comparable corpora in different languages are addressed,
including how to
interact with callers so as to obtain the desired responses.
Situation
The multilingual corpus will contain speech from 250 speakers of each
language calling the LIMSI data collection system from their home
country via a toll-free number. An additional 50 native speakers of
each language will call from within France (or from a foreign country
for native French speakers). The speakers are recruited by a marketing
survey company, which ensures that the speakers are balanced by sex and
age (4 groups between 18 and 65 years of age). To represent different
regional accents, subjects are recruited from 10 subareas in each
country. Each participant receives (via the
marketing survey company) a set of general calling instructions and a
script corresponding to the call. Each caller has a unique script,
identified by a code.
The scripts contain 3 types of data: general questions concerning the
call and caller (code, sex, age, city name, postal code, etc.); a
series of items containing pre-defined texts to read (phonetically
rich sentences, dates, times, spoken and spelled names) and a set of
prompts to elicit responses ("what time is it now?"); and a set of
questions aimed at obtaining spontaneous speech. The scripts were
slightly modified to fit each language and country.
To facilitate the generation of scripts, a language-independent
program was written which makes use of language-specific files. The
program randomly selects items from the prespecified files, and
generates a variety of presentation formats for dates and times
according to usage in the given language. The program also keeps all
the information needed to easily verify and transcribe each call.
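As a rough illustration of this kind of generator, the following minimal sketch keeps the selection logic language-independent and puts everything language-dependent in a data table. All names here (`LANGUAGE_DATA`, `generate_script`, the sample sentences and format strings) are hypothetical, since the report does not describe the actual file formats or program; the returned record stands in for the information kept for later verification and transcription.

```python
import random

# Hypothetical per-language resources. The report only says the program reads
# "language-specific files"; the contents and layout here are illustrative.
LANGUAGE_DATA = {
    "en": {
        "sentences": ["Please call back later.", "The train leaves at noon."],
        "months": ["January", "February", "March", "April", "May", "June",
                   "July", "August", "September", "October", "November",
                   "December"],
        "date_formats": ["{month} {day}, {year}", "{day}/{month_num}/{year}"],
        "time_formats": ["{hour12}:{minute:02d} {ampm}",
                         "{hour:02d}:{minute:02d}"],
    },
    "fr": {
        "sentences": ["Rappelez plus tard.", "Le train part à midi."],
        "months": ["janvier", "février", "mars", "avril", "mai", "juin",
                   "juillet", "août", "septembre", "octobre", "novembre",
                   "décembre"],
        "date_formats": ["{day} {month} {year}", "{day}/{month_num}/{year}"],
        "time_formats": ["{hour}h{minute:02d}", "{hour:02d}:{minute:02d}"],
    },
}


def format_date(data, rng):
    """Render a random date in one of the language's presentation formats."""
    month_num = rng.randint(1, 12)
    fields = {
        "day": rng.randint(1, 28),  # 28 keeps the day valid in every month
        "month_num": month_num,
        "month": data["months"][month_num - 1],
        "year": rng.randint(1990, 1999),
    }
    # str.format ignores fields that a given format string does not use.
    return rng.choice(data["date_formats"]).format(**fields)


def format_time(data, rng):
    """Render a random time in one of the language's presentation formats."""
    hour = rng.randint(0, 23)
    fields = {
        "hour": hour,
        "hour12": hour % 12 or 12,
        "minute": rng.randint(0, 59),
        "ampm": "am" if hour < 12 else "pm",
    }
    return rng.choice(data["time_formats"]).format(**fields)


def generate_script(lang, code, seed=None):
    """Build one caller script; the returned record doubles as the reference
    used later to verify and transcribe the call."""
    rng = random.Random(seed)
    data = LANGUAGE_DATA[lang]
    return {
        "code": code,
        "language": lang,
        "items": [
            ("sentence", rng.choice(data["sentences"])),
            ("date", format_date(data, rng)),
            ("time", format_time(data, rng)),
        ],
    }
```

Calling `generate_script("fr", "FR0001", seed=1)` returns a record with one read sentence, one date and one time in French conventions. Adding a language under this design only requires a new entry in the data table, not a change to the selection code.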
The data collection system consists of an SGI Indy (R4400, 175 MHz) and
an ELAN BT8 telephone interface. The telephone interface is controlled
by the workstation and the telephone inputs and outputs are directly
connected to the audio channels of the Indy. In this way one
workstation is able to simultaneously handle 4 telephone lines.
Situation
We have completed the recording of over 250 calls in each of the four
languages, and native speakers of each language are orthographically
transcribing the corpus. Common protocols have been used for the
transcriptions, covering the marking of spontaneous speech effects
such as hesitations, word fragments and laughter, as well as non-speech
events.
We have found it essential that the calls be transcribed by a
native speaker of the language who has recently lived in the country,
and thus has up-to-date linguistic and pragmatic knowledge of the
country and culture. The transcribers also participated in the
definition of the scripts and questions, ensuring their naturalness.
We are now recording cross-language/country calls from 50
callers for each language (native French speakers calling from
Germany, Great Britain and Spain, and native British, German and
Spanish speakers calling from France). These latter calls will be used
for testing purposes, to ensure that it is the language and not the
telephone channel that is being identified.
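One way such a check could be carried out, sketched here under assumptions, is to score identification results separately for calls placed from the home country and calls placed from abroad. The function name `accuracy_by_condition`, the origin labels "home" and "abroad", and the sample results are hypothetical; the report does not describe the evaluation procedure.

```python
from collections import defaultdict


def accuracy_by_condition(results):
    """Compute identification accuracy per (language, call origin).

    `results` is an iterable of (true_language, origin, predicted_language)
    tuples, where origin is a hypothetical label such as "home" or "abroad".
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [correct, total]
    for true_lang, origin, predicted in results:
        cell = counts[(true_lang, origin)]
        cell[0] += int(predicted == true_lang)
        cell[1] += 1
    return {cond: correct / total
            for cond, (correct, total) in counts.items()}


# Made-up results for illustration only; these are not corpus data.
sample = [
    ("fr", "home", "fr"), ("fr", "home", "fr"),
    ("fr", "abroad", "fr"), ("fr", "abroad", "de"),
]
scores = accuracy_by_condition(sample)
```

A marked drop in accuracy for a language when its speakers call from another country would suggest that a system is relying on channel characteristics rather than on the language itself.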
An analysis of the calls will help us define future data collection
scenarios. For example, we have found that Spanish and German callers
are relatively verbose in their responses to questions, whereas
English callers typically respond in single words or short phrases.
More concise scripts in Spanish or German should therefore yield
the same amount of recorded speech.
References
[1] "Identification Automatique de la Langue à travers
le réseau téléphonique," Contract report CNET no. 94 1B 089,
no. 1-6.
Last modified: Sunday,11-December-05 06:13:33 CET