The MASK Spoken Language System

(from LIMSI 1996 Scientific Report)

J.L. Gauvain, S. Bennacef, L. Devillers, J.J. Gangolf, L. Lamel, S. Rosset, C. d'Alessandro, B. Doval

Object

Our aim is to develop underlying spoken language technology that is speaker, task and language independent. This research is carried out in the context of the Esprit MASK (Multimodal-Multimedia Automated Service Kiosk) project, where our role is to develop the spoken language component to be integrated in a service kiosk [1]. The goal of the MASK project is to develop a multimodal, multimedia service kiosk to be located in train stations, so as to improve the user-friendliness of automated services.

Content

The main components of the spoken language system are the speech recognizer, the natural language component, which includes a semantic analyzer and a dialog manager, and an information retrieval component that includes database access and response generation. The speech recognizer is a software-only system (written in ANSI C) that runs in real time on a standard RISC processor. Statistical models are used at the acoustic and word levels. Speaker independence is achieved by using acoustic models trained on speech data from a large number of representative speakers, covering a wide variety of accents and voice qualities. Context-dependent phone models are used to account for the allophonic variation observed in different contextual environments. N-gram backoff language models are estimated on the orthographic transcriptions of the training set of spoken queries, with word classes for cities, dates and numbers providing more robust estimates of the N-gram probabilities. Sub-language models are being developed for use with system-directed dialogs so as to reduce the search space, improve accuracy and reduce computational needs. The recognition lexicon has 1500 entries, including 600 station/city names, and is represented phonemically. The output of the recognizer is passed to the natural language component, which extracts the meaning of the spoken query using a caseframe analysis [2]. The major work in developing the understanding component is writing the rules for the caseframe grammar, which includes defining the concepts that are meaningful for the task and their appropriate keywords. The goal of the dialog manager is to guide the interaction with the user so as to obtain the information needed for database access. Natural language responses are automatically generated from the semantic frame and the information obtained from database access.
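As an illustration of the word-class idea in the language model described above, here is a minimal Python sketch of a class-based bigram with backoff. It is not the MASK implementation: the class inventory, the training queries, the absolute-discounting scheme and all names (`WORD_CLASSES`, `BackoffBigram`) are assumptions made for the example, and only class-level probabilities are modeled (the within-class term P(word | class) is omitted).

```python
from collections import Counter

# Hypothetical word classes; the report only says that cities, dates and
# numbers are grouped into classes, not which words belong to them.
WORD_CLASSES = {
    "paris": "CITY", "lyon": "CITY", "marseille": "CITY", "bordeaux": "CITY",
    "today": "DATE", "tomorrow": "DATE",
    "two": "NUMBER", "three": "NUMBER",
}


def to_classes(words):
    """Replace each class member by its class label; keep other words as-is."""
    return [WORD_CLASSES.get(w, w) for w in words]


class BackoffBigram:
    """Bigram over class-mapped tokens, absolute discounting, unigram backoff."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount
        self.unigrams = Counter()   # counts of predicted tokens
        self.bigrams = Counter()    # counts of (history, token) pairs
        self.history = Counter()    # counts of tokens used as a history
        for words in sentences:
            tokens = ["<s>"] + to_classes(words) + ["</s>"]
            self.unigrams.update(tokens[1:])  # "<s>" is never predicted
            for h, w in zip(tokens, tokens[1:]):
                self.bigrams[(h, w)] += 1
                self.history[h] += 1
        self.total = sum(self.unigrams.values())

    def p_unigram(self, w):
        return self.unigrams[w] / self.total

    def prob(self, h, w):
        """Class-level P(w | h); both arguments are mapped to classes first."""
        h, w = to_classes([h, w])
        n_h = self.history[h]
        if n_h == 0:                       # unknown history: use the unigram
            return self.p_unigram(w)
        count = self.bigrams[(h, w)]
        if count > 0:                      # seen bigram: discounted estimate
            return (count - self.discount) / n_h
        # Unseen bigram: share the discounted mass among unseen followers,
        # in proportion to their unigram probabilities.
        seen = [v for (g, v) in self.bigrams if g == h]
        freed = self.discount * len(seen) / n_h
        unseen_mass = 1.0 - sum(self.p_unigram(v) for v in seen)
        if unseen_mass <= 0.0:
            return 0.0
        return freed * self.p_unigram(w) / unseen_mass


# Toy training queries (invented for the example).
TRAINING_QUERIES = [
    "a ticket to paris".split(),
    "a ticket to lyon".split(),
    "trains to marseille tomorrow".split(),
]
model = BackoffBigram(TRAINING_QUERIES)

# "bordeaux" never occurs in the training queries, yet it receives the same
# class-level probability after "to" as the cities that do occur.
print(model.prob("to", "bordeaux"), model.prob("to", "paris"))
```

Because every city is counted under a single `CITY` token, a station name that never appears in the training queries still receives a usable estimate, which is the robustness benefit the word classes are meant to provide.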
Vocal feedback is provided by concatenation of speech units stored in a dictionary according to the automatically generated response text.
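The path from recognized words to vocal feedback described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MASK caseframe grammar: the keywords, slot names, prompts, response template, unit dictionary and file names are all hypothetical, the database lookup is replaced by a fixed departure time, and the speech units are assumed to be whole words or phrases.

```python
# Hypothetical caseframe rules: a marker keyword assigns the next word to a slot.
CASEFRAME_RULES = {"from": "departure_city", "to": "arrival_city"}
KNOWN_CITIES = {"paris", "lyon", "marseille"}

# Hypothetical prompts used by the dialog manager to request missing slots.
PROMPTS = {
    "departure_city": "Which station are you leaving from?",
    "arrival_city": "Where would you like to go?",
}

# Hypothetical dictionary of stored speech units (phrase -> audio file).
UNIT_DICTIONARY = {
    "the train from": "u001.wav",
    "to": "u002.wav",
    "leaves at": "u003.wav",
    "paris": "city_paris.wav",
    "lyon": "city_lyon.wav",
    "10:15": "time_1015.wav",
}


def analyze(words):
    """Fill a semantic frame from keyword/value pairs found in the query."""
    frame = {}
    for marker, value in zip(words, words[1:]):
        slot = CASEFRAME_RULES.get(marker)
        if slot and value in KNOWN_CITIES:
            frame[slot] = value
    return frame


def next_prompt(frame):
    """Return a question for the first missing slot, or None if complete."""
    for slot, prompt in PROMPTS.items():
        if slot not in frame:
            return prompt
    return None


def generate_response(frame, departure_time):
    """Build the response text from the frame and the retrieved information."""
    return (f"the train from {frame['departure_city']} "
            f"to {frame['arrival_city']} leaves at {departure_time}")


def select_units(text, unit_dictionary):
    """Cover the response text with stored units, longest match first."""
    words = text.split()
    units, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in unit_dictionary:
                units.append(unit_dictionary[phrase])
                i = j
                break
        else:
            raise KeyError(f"no stored unit covers {words[i]!r}")
    return units


frame = analyze("i want to go from paris to lyon".split())
if next_prompt(frame) is None:
    # "10:15" stands in for the result of the database access.
    response = generate_response(frame, "10:15")
    playlist = select_units(response, UNIT_DICTIONARY)
    print(response)
    print(playlist)
```

The longest-match choice in `select_units` is one plausible design: preferring longer stored phrases reduces the number of concatenation points in the spoken output.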

Situation

The MASK data collection system has been used to record over 18,500 queries from 322 subjects. These data are used for system development and to evaluate the system components. Overall system performance is assessed by a questionnaire on ease of use, reliability, and user satisfaction with the MASK system [1]. Field trials with the MASK prototype kiosk will be carried out at a Parisian train station in the fall of 1996, which will enable us to evaluate the system under conditions of real use.

References

[1] J.L. Gauvain, S. Bennacef, L. Devillers, L. Lamel, S. Rosset, "The Spoken Language Component of the Mask Kiosk," Proc. Human Comfort & Security Workshop, Brussels, Oct. 26, 1995.

[2] S.K. Bennacef, H. Bonneau-Maynard, J.L. Gauvain, L. Lamel, W. Minker, "A Spoken Language System For Information Retrieval," Proc. ICSLP-94, Yokohama, September 1994.
