Ritel : Spoken Dialog for Information Extraction

Objective

This project aims to integrate a spoken language dialog system with an open-domain information retrieval system, so that a person can ask a general question (e.g. "Who is currently presiding over the Senate?" or "How has the price of gas changed over the last ten years?") and refine the search interactively. The project sits at the junction of several distinct research communities (information retrieval, spoken language dialog systems, natural language generation). It is not only about integrating existing tools but also, and mainly, about studying a newly emerging object: a new kind of human-computer dialog. In particular, it includes collaborative information search and the dynamic co-construction of the semantics and of the interaction domain.

Description

An overview of the spoken language system architecture is shown in Figure 1. The main components are the speech recognizer, the entity tagger, the dialog manager, the question-answering (QA) system, the natural language generation system and the text-to-speech synthesizer. They communicate through a message-passing infrastructure. The dialog manager controls and organizes the interaction. It manages the entity tagger and the information passed through the QA system.

Fig. 1: Overview of the Ritel platform

The Speech Activity Detection component is derived from LIMSI's Conversational Telephonic Speech segmenter: GMMs are used to model speech and silence. A standard Viterbi decoding produces the segmentation, outputting results as the beam closes. This gives good detection quality with a mean detection delay of about 0.4 s. There are some false positives, on loud breath noises in particular, but no false negatives to speak of.
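The segmentation step can be illustrated with a minimal two-state Viterbi sketch. It assumes that per-frame log-likelihoods for the silence and speech models are already available (in the real system they would come from the GMMs), and it decodes offline over the whole signal rather than emitting results as the beam closes. The function name and the `switch_penalty` parameter are illustrative assumptions, not part of the Ritel platform.

```python
def viterbi_segment(loglik_sil, loglik_sp, switch_penalty=2.0):
    """Label each frame 0 (silence) or 1 (speech).

    loglik_sil / loglik_sp: per-frame log-likelihoods under the silence
    and speech models (assumed to be given). switch_penalty is a
    hypothetical cost charged on every change of state, which smooths
    out isolated frames.
    """
    n = len(loglik_sil)
    if n == 0:
        return []
    emit = list(zip(loglik_sil, loglik_sp))
    score = [emit[0][0], emit[0][1]]   # best path score ending in each state
    back = []                          # back-pointers for frames 1..n-1
    for t in range(1, n):
        new_score = [0.0, 0.0]
        ptr = [0, 0]
        for s in (0, 1):
            stay = score[s]
            move = score[1 - s] - switch_penalty
            if stay >= move:
                new_score[s], ptr[s] = stay + emit[t][s], s
            else:
                new_score[s], ptr[s] = move + emit[t][s], 1 - s
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# A three-frame speech burst surrounded by silence.
labels = viterbi_segment([0, 0, -5, -5, -5, 0, 0], [-5, -5, 0, 0, 0, -5, -5])
```

The switching penalty plays the role of the transition probabilities: a single weakly speech-like frame is not enough to open a segment, which is one simple way to limit false positives.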
The ASR system is a single-pass lattice decoder using triphone acoustic models and a trigram language model, followed by a 4-gram consensus rescoring. The acoustic models are the ones created for the ARISE [1] dialog system.
The language models were created by interpolating component models trained on various text sources: the Ritel training corpus; newspaper texts from 1988-2005; manual transcriptions of radio and TV news shows; and questions and answers from the Internet (e.g. Madwin.fr). The recognition vocabulary contains 65K words, selected as the most probable words in unigram language models estimated on each text source. Its Out-Of-Vocabulary (OOV) rate is 1.3% on the development set and 1.7% on the test set, which is reasonable for an open-domain system in French.
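As a rough illustration of vocabulary selection and of the OOV measure, the sketch below estimates one unigram distribution per text source, pools them, keeps the most probable words, and then counts the fraction of running words that fall outside the vocabulary. The uniform weighting of the sources is an assumption made for the example; the text does not say how the per-source unigram models were combined.

```python
from collections import Counter


def select_vocabulary(sources, size):
    """Keep the `size` most probable words across several text sources.

    sources: list of token lists, one per text source.
    Assumption: each source's unigram model gets the same weight.
    """
    sources = [tokens for tokens in sources if tokens]
    pooled = Counter()
    for tokens in sources:
        counts = Counter(tokens)
        total = sum(counts.values())
        for word, count in counts.items():
            pooled[word] += count / total / len(sources)
    # Sort by decreasing probability, then alphabetically for stable ties.
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:size]}


def oov_rate(tokens, vocab):
    """Fraction of running words that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


vocab = select_vocabulary(
    [["le", "president", "le", "senat"], ["le", "prix", "prix", "essence"]],
    size=2,
)
rate = oov_rate(["le", "prix", "de", "essence"], vocab)
```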
The Specific Entity Detection System takes an utterance transcript as input and outputs the same utterance with the interesting parts typed and tagged. The specific entities fall into three categories: (i) classical named entities, such as people, products, titles and commercial names, time markers, organizations and places, and extended named entities, such as lexical units and citations; (ii) syntactic request markers (who, where, when, how, how much...); and (iii) semantic request markers, giving topics (literature, geography, history, social, theater, politics, economy...) and subtopics (author, president, spelling, population count...). All entities are extracted using a set of rules which take the form of regular expressions on words. Macros (local sub-rules) and classes (lists) can be used in the rule definitions. A first evaluation of the specific entity tagging resulted in an F-measure of 70% on classical named entities. Most of the errors are due to incorrect boundaries (15.5%) and deletions (17.1%).
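The idea of rules written as regular expressions, with macros and classes expanded into them, can be sketched as follows. The `<NAME>` placeholder syntax, the rule set, the word lists and the tag names are all hypothetical; the actual Ritel rule formalism is not described here and works on words rather than on raw character strings.

```python
import re

# Hypothetical classes (word lists) and macros (local sub-rules).
CLASSES = {
    "COUNTRY": ["France", "Bulgaria", "Romania"],
    "ROLE": ["president", "capital"],
}
MACROS = {
    "DET": r"(?:the|its|a)",
}

# Hypothetical rules: (tag, pattern). <NAME> refers to a class or a macro.
RULES = [
    ("place", r"\b<COUNTRY>\b"),
    ("subtopic", r"\b(?:<DET>\s+)?<ROLE>\b"),
    ("request_marker", r"\b(?:who|where|when|how much|how)\b"),
]


def expand(pattern):
    """Replace every <NAME> placeholder by its macro or class definition."""
    def repl(match):
        name = match.group(1)
        if name in MACROS:
            return MACROS[name]
        # Longest alternatives first, so that longer entries win.
        words = sorted(CLASSES[name], key=len, reverse=True)
        return "(?:" + "|".join(re.escape(w) for w in words) + ")"
    return re.sub(r"<([A-Z]+)>", repl, pattern)


COMPILED = [(tag, re.compile(expand(p), re.IGNORECASE)) for tag, p in RULES]


def tag_entities(utterance):
    """Return (start, end, tag, text) spans, sorted by position."""
    spans = []
    for tag, regex in COMPILED:
        for m in regex.finditer(utterance):
            spans.append((m.start(), m.end(), tag, m.group(0)))
    return sorted(spans)


spans = tag_entities("who is the president of Bulgaria")
```

Keeping word lists in classes and shared fragments in macros means that a rule stays readable and that adding a new country or role does not require touching the rules themselves.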
The dialog manager was designed to encourage people to talk as much as possible, to reformulate their requests in many ways, and to refine their questions, while keeping the interaction reasonably natural. It can hence be considered an Eliza variation [2]. Moreover, it allows information to be searched in databases as appropriate. Each semantic frame sent by the specific entity detector is analyzed in context, i.e. taking into account the history of the interaction. The new, in-context frame is sent to the decision module, which rewrites it again, this time using both a dialog model (how interactions go in general, whatever the subject) and a task model (how the specific request for information and the refinement of the request occur). If, according to these models, the current request is of a kind that can be answered factually by searching in one of the available databases, the search module extracts the relevant keys and performs the search. Otherwise, the incitation module isolates the topic of the request in order to generate an answer which, while not actually answering the question, shows that the system has understood something and urges the user to refine or reformulate the question. These two modules generate new semantic frames that are sent to the natural language generation (NLG) module. Searches can currently only be done in fixed databases, but a full-blown QA system is in the process of being connected to the dialog manager.
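The in-context analysis and the choice between the search module and the incitation module can be sketched as below. The slot names (`subtopic`, `place`, `topic`), the toy database and the merging strategy (missing slots inherit their most recent value) are assumptions made for illustration; the actual dialog and task models are richer than this.

```python
# Hypothetical fixed database, keyed by (subtopic, place).
DATABASE = {
    ("capital", "France"): "Paris",
    ("president", "France"): "Jacques Chirac",
    ("president", "Bulgaria"): "Georgi Parvanov",
}


def contextualize(frame, history):
    """Fill the slots missing from the new frame with their latest values."""
    merged = dict(frame)
    for past in reversed(history):          # most recent turn first
        for slot, value in past.items():
            merged.setdefault(slot, value)
    return merged


def decide(frame, history):
    """Return an output frame: a factual answer or an incitation."""
    in_context = contextualize(frame, history)
    history.append(in_context)
    key = (in_context.get("subtopic"), in_context.get("place"))
    if key in DATABASE:
        # Search module: the request can be answered from the database.
        return {"act": "answer", "value": DATABASE[key], **in_context}
    # Incitation module: ask the user to refine or reformulate.
    return {"act": "incite", **in_context}


history = []
first = decide({"subtopic": "capital", "place": "France"}, history)
second = decide({"subtopic": "president"}, history)   # "who is its president"
third = decide({"place": "Bulgaria"}, history)        # "the one for Bulgaria"
```

With this merging rule, an elliptical follow-up such as "and the one for Bulgaria" inherits the subtopic of the previous turn, which mirrors the behavior seen in the collected dialogs.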
Part of the generation is currently done by the dialog manager, which generates a semantic frame sent to the NLG module. Two strategies are followed, depending on whether the search module or the incitation module is activated. The NLG module itself is based on a set of template sentences. They allow generation of answers at different levels depending on the dialog state and history as sent by the dialog manager. The audio synthesis is a simple in-house diphone concatenation system.
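Template-based generation from a semantic frame can be sketched in a few lines. The frame layout, the template wording and the selection logic below are hypothetical; they only illustrate how one strategy can serve frames coming from the search module and another those coming from the incitation module.

```python
# Hypothetical template sentences, selected from the incoming frame.
TEMPLATES = {
    "answer": "{value} is the {subtopic} of {place}.",
    "incite_topic": "Your request is about {topic}. But more specifically?",
    "incite_generic": "Could you be more specific?",
}


def generate(frame):
    """Turn a semantic frame into a sentence using fixed templates."""
    if frame.get("act") == "answer":
        return TEMPLATES["answer"].format(**frame)
    if frame.get("topic"):
        return TEMPLATES["incite_topic"].format(**frame)
    return TEMPLATES["incite_generic"]


sentence = generate(
    {"act": "answer", "value": "Paris", "subtopic": "capital", "place": "France"}
)
```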

Results

At this point in time, the Ritel platform [3] is being used to collect a human-computer dialog corpus. Figures 2 and 3 show examples of collected dialogs.

S: Hello and welcome to Limsi's Ritel service. Do you want the help message?
U: err no thanks I'd like the capital of France please
S: The capital of France is Paris. Do you want another information?
U: err who is its president
S: Jacques-Chirac is the president of France. Do you want another information?
U: and err the one for Bulgaria
S: Georgi-Parvanov is the president of Bulgaria. Do you want another information?
U: hm and which country Ion Iliescu is the president then?
S: Ion-Iliescu is the president of Romania. Do you want another information?
U: since when
S: Ion-Iliescu has been the president of Romania since December 2000. Do you want another information?
U: none thanks goodbye
S: Thanks for having used Limsi's Ritel service. We thank you for your call. Goodbye.

Fig. 2: Translated sample of a collected dialog with factual answers

S: Hello and welcome to Limsi's Ritel service. Do you want the help message?
U: I'd like some information on the ONU
S: You're looking for information about the ONU. Could you be more specific?
U: which countries are in it
S: Your request is about geography. But more specifically? Are you interested in a specific country? A town?
U: who wrote to be or not to be
S: I do not know who is the author of to be or not to be. I've lost my connection with the information search server. Do you want another information?

Fig. 3: Translated sample of a collected dialog with incitations

The corpus was collected between September 2004 and February 2005. Thirteen people called the Ritel system. Each subject had received a list of 300 possible questions, and was told to feel free to ask the system whatever they wanted, however they wanted. Table 1 shows a summary of the data collected. Of the 6 hours of user speech, one hour has been set aside for development (dev) and one hour for testing (test) of the modules, especially the speech recognition and the specific entity detection.

# Dialogs	Duration of user speech	# User queries	# User words
582	6h	5360	60k

# Distinct user words	# User queries per dialog	Mean duration of user speech per dialog
2876	9	33s

# Topics & subtopics	# Linguistic entities	# Named entities
1120	6860	7000

Table 1: Summary of the Ritel corpus

Perspectives

Such a project faces several challenges. The most obvious ones are: speech recognition, which must be real-time, streamed and have a very large vocabulary; dialog management in an open domain; communication and result exchange between an information retrieval system and a dialog system; and natural-sounding answer generation. We plan to split our efforts along three main axes:

User utterance analysis. We will use light syntactico-semantic analysis to build a parse tree useful for both dialog management and information retrieval. Perplexity measures on utterance segments will be used to detect and take into account errors induced by the automatic speech recognition process.
Communication between the information retrieval and dialog management systems. These two systems need to exchange data on the context of the user request and on the returned documents. Since the search is interactive, the information retrieval system needs the pertinent parts of the dialog history. Conversely, the dialog manager needs information about the possible answers and the clusters of documents found in order to refine the request interactively.
Answer generation. In most current dialog systems, which are all closed-domain, predefined answer templates are used and work well. In our system this method will not always be usable, since the answer may have to be built from multiple sources or from large documents. We will therefore investigate multi-source automatic summarization methods in order to build candidate answers. We will also put particular emphasis on maintaining a collaborative dialog with the user, for example by suggesting approximate answers when a given question returns no answer. Furthermore, this work will be geared towards minimizing the time needed to produce the answer (involving the use of heuristics to utter sentence starts while information extraction is still going on).
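To make the first axis more concrete, the following sketch shows one possible way to use perplexity on utterance segments to flag likely recognition errors. It relies on an add-one smoothed unigram model, which is a deliberate simplification chosen for brevity; the model order, the window width and the threshold are all assumptions and not the method planned for Ritel.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return word counts and the total token count of a training corpus."""
    counts = Counter(corpus_tokens)
    return counts, sum(counts.values())


def segment_perplexity(segment, counts, total):
    """Perplexity of a word segment under an add-one smoothed unigram model."""
    if not segment:
        return 0.0
    vocab_size = len(counts) + 1            # one extra slot for unseen words
    log_prob = 0.0
    for word in segment:
        log_prob += math.log((counts[word] + 1) / (total + vocab_size))
    return math.exp(-log_prob / len(segment))


def suspect_segments(tokens, counts, total, width=3, threshold=15.0):
    """Return start indices of sliding windows whose perplexity is high.

    width and threshold are hypothetical tuning parameters.
    """
    flagged = []
    for start in range(0, max(len(tokens) - width + 1, 1)):
        window = tokens[start:start + width]
        if segment_perplexity(window, counts, total) > threshold:
            flagged.append(start)
    return flagged


counts, total = train_unigram(
    "who is the president of france who is the president of bulgaria".split()
)
likely = segment_perplexity(["the", "president", "of"], counts, total)
unlikely = segment_perplexity(["zebra", "quantum", "xylophone"], counts, total)
```

Segments made of words that the model has rarely or never seen get a higher perplexity, so they can be treated with more caution by the downstream analysis.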

References

[1] L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet, and B. Prouts (2000). The LIMSI ARISE System, Speech Communication, 31(4), 339-354.

[2] J. Weizenbaum (1966). ELIZA--A Computer Program For the Study of Natural Language Communication Between Man and Machine, Communications of the ACM, 9(1), 36-45.

[3] O. Galibert, G. Illouz, S. Rosset (2005). Ritel: An Open-Domain, Human-Computer Dialog System, Proc. of Interspeech'05, 2789-2792.