Language Identification

(from LIMSI 1994 Scientific Report)

L.F. Lamel and J.L. Gauvain

Object

Automatic language identification has been a research topic for many years. With the increasing possibility of incorporating voice in information retrieval systems, the ability to automatically identify the language being spoken, and to respond appropriately, is of renewed interest. Automatic language identification avoids having to ask the user to select the language before beginning to interrogate the system. Language identification has many other potential uses, including emergency situations, travel services, translation services, information services, and the well-known national security applications.

Content

Our recent work in language identification makes use of phone-based acoustic likelihoods. A set of large phone-based ergodic hidden Markov models (HMMs) is trained for each of the languages to be identified. Language identification on the incoming signal x is then performed by computing the acoustic likelihoods f(x|\lambda_i) for all the models \lambda_i of the set. The language associated with the model having the highest likelihood is hypothesized. This decoding procedure has been efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy. This approach has the following advantages:

It can perform text-independent feature recognition. (Text-dependent feature recognition can also be performed.)

It is more precise than methods based on long-term statistics such as long-term spectra, VQ codebooks, or probabilistic acoustic maps.

It can easily take advantage of phonotactic constraints.

It can easily be integrated in recognizers which are based on phone models as all the components already exist.
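The maximum-likelihood decision described above (score the signal x against every language model \lambda_i and hypothesize the language whose model gives the highest f(x|\lambda_i)) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy models are small discrete-output HMMs with invented probabilities, the names `log_likelihood`, `identify`, `lang_a`, and `lang_b` are hypothetical, and the likelihood is computed with the plain forward algorithm, one model at a time. The system described here instead uses large phone-based ergodic HMMs decoded in parallel with a time-synchronous beam search, which this sketch does not reproduce.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(obs, model):
    """Forward algorithm: log f(x | lambda) for a discrete-output HMM.

    model is (log_init, log_trans, log_emit), all in the log domain.
    """
    log_init, log_trans, log_emit = model
    n = len(log_init)
    # Initialise with the first observation.
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    # Propagate through the remaining observations.
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[r] + log_trans[r][s] for r in range(n)])
            + log_emit[s][o]
            for s in range(n)
        ]
    return log_sum_exp(alpha)


def identify(obs, models):
    """Return the language whose model scores highest, plus all scores."""
    scores = {lang: log_likelihood(obs, m) for lang, m in models.items()}
    return max(scores, key=scores.get), scores


def to_log(matrix):
    """Convert a matrix of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in matrix]


# Toy two-state, two-symbol models (invented numbers, not from the report).
uniform_init = [math.log(0.5), math.log(0.5)]
uniform_trans = to_log([[0.5, 0.5], [0.5, 0.5]])
models = {
    "lang_a": (uniform_init, uniform_trans, to_log([[0.9, 0.1], [0.8, 0.2]])),
    "lang_b": (uniform_init, uniform_trans, to_log([[0.2, 0.8], [0.1, 0.9]])),
}

best, scores = identify([0, 0, 1, 0], models)
print(best, scores)
```

Working in the log domain keeps the scores from underflowing on long signals; a beam search would additionally prune unlikely states at each frame, trading exactness for speed.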

Situation

This approach has been evaluated for French/English identification using comparable corpora of high quality speech and on the 10-language OGI corpus of telephone speech. The 2-way French/English language identification accuracies are given in Table 1 (see 1994 LIMSI Scientific Report) with phonotactic constraints provided by a phone bigram. Results are given for 4 test corpora, WSJ and TIMIT for English, and BREF and BDSONS for French, as a function of the duration of the speech signal, which includes approximately 100ms of silence. The performance indicates that language identification is relatively task-independent, and with 2s of speech, the language identification error rate is less than 1%.

Language identification on the OGI 10-language telephone-speech corpus is shown in Table 2 (see 1994 LIMSI Scientific Report) as a function of signal duration. The overall 10-language identification rate is 59.7% with 10s of signal (including silence). There is a wide variation in identification accuracy across languages, ranging from 42% for Japanese to 82% for Tamil.

Two-way French/English language identification was evaluated on the OGI corpus so as to provide a measure of the degradation observed due to the use of spontaneous speech over the telephone. Language identification accuracy was 82% at 10s (79% for French and 84% for English) for the 135 10s-chunks. This can be compared to the results with the laboratory read speech, where French/English language identification accuracy is better than 99% with only 2s of speech.

References

[1] L. Lamel, J.L. Gauvain, "Identifying Non-Linguistic Speech Features," EUROSPEECH-93.

[2] L. Lamel, J.L. Gauvain, "Language Identification using Phone-Based Acoustic Likelihoods," ICASSP-94.

Last modified: Sunday,11-December-05 06:13:33 CET

Spoken Language Processing Group (TLP)
