Objective
This project aims to integrate a spoken language dialog system with
an open-domain information retrieval system, so that a person can ask a
general question (e.g. "Who is currently presiding over the Senate?" or
"How has the price of gas changed over the last ten years?") and then
refine the search interactively. The project sits at the junction of
several distinct research communities (information retrieval, spoken
language dialog systems, natural language generation). It is not only
about integrating existing tools but also, and mainly, about studying
a newly emerging object: a new kind of human-computer dialog.
In particular, it involves collaborative information search and the
dynamic co-construction of the semantics and of the interaction domain.
Description
An overview of the spoken language system architecture is shown in
Figure 1. The main components are the speech recognizer, the entity
tagger, the dialog manager, the question-answering (QA) system, the
natural language generation system and the text-to-speech synthesizer.
They communicate through a message-passing infrastructure. The dialog
manager controls and organizes the interaction. It manages the entity
tagger and the information passed through the QA system.
Fig. 1: Overview of the Ritel platform
The Speech Activity Detection component is derived from LIMSI's
Conversational Telephone Speech segmenter: Gaussian mixture models
(GMMs) are used to model speech and silence. A standard Viterbi
decoding produces the segmentation, outputting results as the beam
closes. This gives good detection quality with a mean detection delay
of about 0.4 s. There are some false positives, on loud breath noises
in particular, but no false negatives to speak of.
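To illustrate the kind of decoding described above, here is a minimal sketch of a two-state (silence/speech) Viterbi pass. It assumes the per-frame log-likelihoods have already been computed by the two GMMs and are simply passed in. The function name and the `switch_penalty` parameter are hypothetical, and the sketch decodes a whole utterance offline rather than emitting results as the beam closes.

```python
def viterbi_segment(frame_loglik, switch_penalty=5.0):
    """Label each frame 'silence' or 'speech' with a two-state Viterbi pass.

    frame_loglik: list of (loglik_silence, loglik_speech) pairs, one per
                  frame (assumed to come from two GMMs, not computed here).
    switch_penalty: log-domain cost paid when changing state; an assumed
                    tuning knob, not a value from the Ritel system.
    """
    states = ("silence", "speech")
    # best[s] is the score of the best path ending in state s.
    best = [frame_loglik[0][0], frame_loglik[0][1]]
    backptr = []
    for loglik in frame_loglik[1:]:
        new_best, ptr = [], []
        for s in (0, 1):
            stay = best[s]
            switch = best[1 - s] - switch_penalty
            if stay >= switch:
                new_best.append(stay + loglik[s])
                ptr.append(s)
            else:
                new_best.append(switch + loglik[s])
                ptr.append(1 - s)
        best = new_best
        backptr.append(ptr)
    # Trace back from the best final state.
    s = 0 if best[0] >= best[1] else 1
    path = [s]
    for ptr in reversed(backptr):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return [states[s] for s in path]


# Toy input: 3 silence-like frames, 4 speech-like frames, 3 silence-like frames.
frames = [(-1.0, -5.0)] * 3 + [(-5.0, -1.0)] * 4 + [(-1.0, -5.0)] * 3
print(viterbi_segment(frames, switch_penalty=2.0))
```

The switching penalty is what keeps a single noisy frame from opening a spurious speech segment; raising it trades detection delay for fewer false positives.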
The ASR system is a single-pass lattice decoder using triphone
acoustic models and a trigram language model, followed by 4-gram
consensus rescoring. The acoustic models are the ones created for the
ARISE [1] dialog system.
The language models were created by interpolating component models
trained on various text sources: the Ritel training corpus; newspaper
texts from 1988-2005; manual transcriptions of radio and TV news shows;
and questions and answers collected from the Internet (e.g. Madwin.fr).
The recognition vocabulary contains 65K words, selected as the most
probable words in unigram language models estimated on each text
source. It has an out-of-vocabulary (OOV) rate of 1.3% on the dev set
and 1.7% on the test set, which is reasonable for an open-domain system
in French.
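The vocabulary selection and OOV measurement can be sketched as follows. This is only an illustration under stated assumptions: it supposes that the per-source unigram models are linearly interpolated before ranking words, and the interpolation weights, function names and toy texts are all hypothetical.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities for one text source."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(models, weights):
    """Linear interpolation: p(w) = sum_i weight_i * p_i(w)."""
    mixed = Counter()
    for model, weight in zip(models, weights):
        for word, prob in model.items():
            mixed[word] += weight * prob
    return dict(mixed)


def select_vocabulary(mixed, size):
    """Keep the `size` most probable words (ties broken alphabetically)."""
    ranked = sorted(mixed, key=lambda w: (-mixed[w], w))
    return set(ranked[:size])


def oov_rate(tokens, vocabulary):
    """Fraction of running words that are not in the vocabulary."""
    misses = sum(1 for token in tokens if token not in vocabulary)
    return misses / len(tokens)


# Toy sources standing in for newspaper text and collected questions.
news = "the president of france the senate".split()
questions = "who is the president".split()
mixed = interpolate([unigram_model(news), unigram_model(questions)], [0.5, 0.5])
vocab = select_vocabulary(mixed, size=4)
print(sorted(vocab))
print(oov_rate("who is the president of bulgaria".split(), vocab))
```

In practice the weights would be tuned on held-out data such as the dev set, and the OOV rate would be measured on running words, as in the last line.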
The Specific Entity Detection System takes an utterance transcript as
input and outputs the same utterance with the relevant parts typed and
tagged. The specific entities fall into three categories: (i) classical
named entities, such as people, products, titles and commercial names,
time markers, organizations and places, together with extended named
entities such as lexical units and citations; (ii) syntactic request
markers (who, where, when, how, how much...); and (iii) semantic
request markers, giving topics (literature, geography, history, social,
theater, politics, economy...) and subtopics (author, president,
spelling, population count...). All entities are extracted using a set
of rules that take the form of regular expressions on words. Macros
(local sub-rules) and classes (lists) can be used in the rule
definitions. A first evaluation of the specific entity tagging gave an
F-measure of 70% on classical named entities. Most of the errors are
due to incorrect boundaries (15.5%) and deletions (17.1%).
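A minimal sketch of rule-based tagging with macros and classes is given below. The rule syntax, the tag names (`Qwho`, `subtopic`, `place`) and the word lists are hypothetical and are not the Ritel rule language; the sketch only shows how classes (lists) and macros (sub-rules) can be expanded into ordinary regular expressions.

```python
import re

# Classes: word lists (toy examples, not the system's lexicons).
CLASSES = {
    "COUNTRY": ["France", "Bulgaria", "Romania"],
    "ROLE": ["president", "capital"],
}
# Macros: reusable sub-patterns.
MACROS = {
    "DET": r"\b(?:the|a|an)\b",
}
# Rules: (tag, pattern); {NAME} placeholders refer to classes or macros.
RULES = [
    ("Qwho", r"\bwho\b"),
    ("subtopic", r"(?:{DET}\s+)?{ROLE}"),
    ("place", r"{COUNTRY}"),
]


def compile_rules(rules, classes, macros):
    """Expand class and macro placeholders, then compile each rule."""
    expansions = dict(macros)
    for name, words in classes.items():
        alternatives = "|".join(re.escape(word) for word in words)
        expansions[name] = r"\b(?:" + alternatives + r")\b"
    return [
        (tag, re.compile(pattern.format(**expansions), re.IGNORECASE))
        for tag, pattern in rules
    ]


def tag_utterance(text, compiled):
    """Wrap every non-overlapping match in <tag> ... </tag> markers."""
    spans = []
    for tag, regex in compiled:
        for match in regex.finditer(text):
            spans.append((match.start(), match.end(), tag))
    # Leftmost match first; at equal start, prefer the longest match.
    spans.sort(key=lambda span: (span[0], -(span[1] - span[0])))
    pieces, pos = [], 0
    for start, end, tag in spans:
        if start < pos:
            continue  # overlaps a span that was already emitted
        pieces.append(text[pos:start])
        pieces.append(f"<{tag}> {text[start:end]} </{tag}>")
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


compiled = compile_rules(RULES, CLASSES, MACROS)
print(tag_utterance("who is the president of France", compiled))
```

Because placeholders are expanded with `str.format`, a rule that needs a literal regex quantifier such as `{2}` would have to escape its braces; a real rule compiler would use its own placeholder syntax.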
The dialog manager was designed to encourage people to talk as much
as possible, to reformulate their requests in many ways, and to refine
their questions while keeping the interaction reasonably natural. It
can hence be considered an Eliza variation [2]. In addition, it
searches for information in databases when appropriate. Each
semantic frame sent by the specific entity detector is analyzed in
context, i.e. taking into account the history of the interaction.
The new, in-context frame is sent to the decision module, which
rewrites it again, this time using both a dialog model (how
interactions go in general, whatever the subject) and a task model
(how a specific request for information and its refinement proceed).
If, according to these models, the current request can be answered
factually by searching one of the available databases, the search
module extracts the relevant keys and performs the search. Otherwise
the incitation module isolates the topic of the request in order to
generate an answer which, while not actually answering the question,
shows that the system has understood something and urges the user to
refine or reformulate the question. Both modules generate new semantic
frames that are sent to the natural language generation (NLG) module.
Searches can currently only be done in fixed databases, but a
full-blown QA system is being connected to the dialog manager.
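The flow from contextual analysis to the search or incitation module can be sketched as below. This is a simplified illustration, not the actual dialog and task models: the slot names (`topic`, `subtopic`, `place`), the `act` values and the lookup table are assumptions, with the table entries borrowed from the sample dialog in Figure 2.

```python
# Toy stand-in for the fixed databases; entries come from the sample
# dialog in Figure 2.
DATABASE = {
    ("capital", "France"): "Paris",
    ("president", "France"): "Jacques-Chirac",
    ("president", "Bulgaria"): "Georgi-Parvanov",
}


def contextualize(frame, history):
    """Fill slots missing from the new frame with values from the history."""
    merged = dict(history)
    merged.update({slot: value for slot, value in frame.items() if value is not None})
    return merged


def decide(frame):
    """Route an in-context frame to the search or the incitation branch."""
    key = (frame.get("subtopic"), frame.get("place"))
    if key in DATABASE:
        # Search branch: the request can be answered factually.
        return {
            "act": "answer",
            "subtopic": key[0],
            "place": key[1],
            "value": DATABASE[key],
        }
    # Incitation branch: echo what was understood and ask for more.
    return {"act": "incite", "topic": frame.get("topic") or frame.get("subtopic")}


# Replays the first three user turns of Figure 2 as hand-written frames.
history = {}
for frame in (
    {"subtopic": "capital", "place": "France"},
    {"subtopic": "president"},  # "who is its president"
    {"place": "Bulgaria"},      # "and the one for Bulgaria"
):
    history = contextualize(frame, history)
    print(decide(history))
```

Carrying slots over from the history is what lets elliptical follow-ups such as "and the one for Bulgaria" resolve to a complete database key.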
Part of the generation is currently done by the dialog manager, which
generates a semantic frame sent to the NLG module. Two strategies are
followed, depending on whether the search module or the incitation
module is activated. The NLG module itself is based on a set of
template sentences. These allow answers to be generated at different
levels depending on the dialog state and history as sent by the dialog
manager. The audio synthesis is a simple in-house diphone concatenation
system.
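Template-based generation from a semantic frame can be sketched in a few lines. The frame layout (`act`, `value`, `subtopic`, `place`, `topic`) and the template wording are assumptions modeled on the sample dialogs, not the system's actual templates.

```python
# One template per dialog act; keys and wording are hypothetical.
TEMPLATES = {
    "answer": "{value} is the {subtopic} of {place}.",
    "incite": "Your request is about {topic}. Could you be more specific?",
}


def generate(frame):
    """Pick the template for the frame's act and fill in its slots."""
    template = TEMPLATES[frame["act"]]
    # str.format ignores unused keyword arguments such as "act".
    return template.format(**frame)


print(generate({"act": "answer", "value": "Paris", "subtopic": "capital", "place": "France"}))
print(generate({"act": "incite", "topic": "geography"}))
```

Supporting several answer levels would amount to keying the table on more than the act, for example on the act together with a level chosen from the dialog state.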
Results
At this point in time the Ritel platform [3] is being used to
collect a human-computer dialog corpus. Figures 2 and 3 show
examples of collected dialogs.
S: Hello and welcome to Limsi's Ritel service. Do you want the help message?
U: err no thanks I'd like the capital of France please
S: The capital of France is Paris. Do you want any other information?
U: err who is its president
S: Jacques-Chirac is the president of France. Do you want any other information?
U: and err the one for Bulgaria
S: Georgi-Parvanov is the president of Bulgaria. Do you want any other information?
U: hm and which country Ion Iliescu is the president then?
S: Ion-Iliescu is the president of Romania. Do you want any other information?
U: since when
S: Ion-Iliescu has been the president of Romania since December 2000. Do you want any other information?
U: none thanks goodbye
S: Thanks for having used Limsi's Ritel service. We thank you for your call. Goodbye.
Fig. 2: Translated sample of dialog collected with factual answers
S: Hello and welcome to Limsi's Ritel service. Do you want the help message?
U: I'd like some information on the UN
S: You're looking for information about the UN. Could you be more specific?
U: which countries are in it
S: Your request is about geography. But more specifically? Are you interested in a specific country? A town?
U: who wrote to be or not to be
S: I do not know who the author of to be or not to be is. I've lost my connection with the information search server. Do you want any other information?
Fig. 3: Translated sample of dialog collected with incitations
The corpus was collected between September 2004 and February
2005. Thirteen people called the Ritel system. Each subject had
received a list of 300 possible questions and was told to feel free
to ask the system whatever they wanted, however they wanted. Table 1
summarizes the data collected. Of the 6 hours of user speech, one
hour has been set aside for development (dev) and one hour for testing
(test) of the modules, especially the speech recognition and the
specific entity detection.
# Dialogs | Duration of user speech | # User queries | # User words
582 | 6h | 5360 | 60k

# Distinct user words | # User queries per dialog | Mean duration of user speech per dialog
2876 | 9 | 33s

# Topics & sub-topics | # Ling. entities | # Named entities
1120 | 6860 | 7000

Table 1: Summary of the Ritel corpus
Perspectives
Such a project faces several challenges. The most obvious ones are:
speech recognition, which must be real-time, streamed and have a very
large vocabulary; dialog management in an open domain; communication
and result exchange between an information retrieval system and a
dialog system; and natural-sounding answer generation. We plan to split
our efforts along three main axes:
- User utterance analysis. We will use light syntactico-semantic
analysis to build a parse tree useful for both dialog management and
information retrieval. Perplexity measures on utterance segments will
be used to detect and take into account errors induced by the automatic
speech recognition process.
- Information retrieval and dialog management systems
communication. These two systems need to exchange data on the user
request context and the returned documents. Since the search is
interactive, the information retrieval system needs the pertinent parts
of the dialog history. Conversely, the dialog manager needs information
about the possible answers and clusters of documents found in order to
interactively refine the request.
- Answer generation. In most current dialog systems, which are all
closed-domain, predefined answer templates are used and work well. In
our system this method will not always be usable, since the answer may
have to be built from multiple sources or from large documents. We will
therefore investigate methods from multi-source automatic summarization
in order to build candidate answers. We will also put particular
emphasis on maintaining a collaborative dialog with the user, for
example by suggesting approximate answers when a given question returns
no answers. Furthermore, this work will be geared towards minimizing
the time needed to produce the answer (involving the use of heuristics
to utter sentence starts while information extraction is still going
on).
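As an illustration of the first axis, the sketch below flags utterance segments whose perplexity is unusually high, which may indicate a recognition error. The text does not specify which language model, segment length or threshold will be used; the unigram model, the sliding window of three words and the threshold value here are all assumptions chosen to keep the example small.

```python
import math


def segment_perplexity(words, unigram, floor=1e-6):
    """Perplexity of a word segment under a unigram model.

    Unknown words receive the probability `floor` (an assumed constant).
    """
    log_sum = sum(math.log(unigram.get(word, floor)) for word in words)
    return math.exp(-log_sum / len(words))


def flag_suspect_segments(words, unigram, width=3, threshold=50.0):
    """Return (start, end) index pairs of windows above the threshold."""
    flagged = []
    for start in range(len(words) - width + 1):
        window = words[start:start + width]
        if segment_perplexity(window, unigram) > threshold:
            flagged.append((start, start + width))
    return flagged


# Toy model: probabilities are made up for the example.
unigram = {"who": 0.2, "is": 0.2, "the": 0.3, "president": 0.2, "of": 0.1}
print(flag_suspect_segments("who is the president".split(), unigram))
print(flag_suspect_segments("who is xyzzy president".split(), unigram))
```

A higher-order model would give a sharper signal than the unigram used here, since recognition errors often produce individually common words in unlikely sequences.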
References
[1] L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet,
and B. Prouts (2000). The LIMSI ARISE System. Speech Communication,
31(4), 339-354.
[2] J. Weizenbaum (1966). ELIZA--A Computer Program for the Study of
Natural Language Communication Between Man and Machine. Communications
of the ACM, 9(1), 36-45.
[3] O. Galibert, G. Illouz, and S. Rosset (2005). Ritel: An
Open-Domain, Human-Computer Dialog System. Proc. of Interspeech'05,
2789-2792.