IS160180.PDF [InterSpeech 2016, San Francisco, CA, Sept 2016] --------- A Divide-and-Conquer Approach for Language Identification Based on Recurrent Neural Networks Gregory Gelly, Jean-Luc Gauvain, Viet-Bac Le, Abdelkhalek Messaoudi ------------------------ This paper describes the design of an acoustic language recognition system based on BLSTM that can discriminate closely related languages and dialects of the same language. We introduce a Divide-and-Conquer (D&C) method to quickly and successfully train an RNN-based multi-language classifier. Experiments compare this approach to the straightforward training of the same RNN, as well as to two widely used LID techniques: a phonotactic system using DNN acoustic models and an i-vector system. Results are reported on two different data sets: the 14 languages of NIST LRE07 and the 20 closely related languages and dialects of NIST OpenLRE15. In addition to reporting the NIST Cavg metric, which served as the primary metric for the LRE07 and OpenLRE15 evaluations, the EER and LER are provided. When used with BLSTM, the D&C training scheme significantly outperformed the classical training method for multi-class RNNs. On the OpenLRE15 data set, this method also outperforms classical LID techniques and combines very well with a phonotactic system. hlt12_lamel.pdf [HLT 2012 The Fifth International Conference: Human Language Technologies - The Baltic Perspective] --------- Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data Lori Lamel ------------------------ Spoken language processing technologies are principal components in most of the applications being developed as part of the Quaero program. Quaero is a large research and industrial innovation program focusing on the development of technologies for automatic analysis and classification of multimedia and multilingual documents.
Concerning speech processing, research aims to substantially improve the state-of-the-art in speech-to-text transcription, speaker diarization and recognition, language recognition, and speech translation. Oparin_ICASSP_2012.pdf [IEEE ICASSP 2012, Kyoto, Japan, March 2012] --------- Performance Analysis of Neural Networks in Combination with N-Gram Language Models Ilya Oparin, Martin Sundermeyer, Hermann Ney, Jean-Luc Gauvain ------------------------ Neural Network language models (NNLMs) have recently become an important complement to conventional n-gram language models (LMs) in speech-to-text systems. However, little is known about the behavior of NNLMs. The analysis presented in this paper aims to understand which types of events are better modeled by NNLMs as compared to n-gram LMs, in what cases improvements are most substantial, and why this is the case. Such an analysis is important to derive further benefit from NNLMs used in combination with conventional n-gram models. The analysis is carried out for different types of neural network (feed-forward and recurrent) LMs. The results showing for which types of events NNLMs provide better probability estimates are validated on two setups that differ in their size and degree of data homogeneity. odyssey12_mfb.pdf [Odyssey 2012 The Speaker and Language Recognition Workshop, Singapore, June 2012] --------- Fusing Language Information from Diverse Data Sources for Phonotactic Language Recognition Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain and Lori Lamel ------------------------ The baseline approach in building phonotactic language recognition systems is to characterize each language by a single phonotactic model generated from all the available language-specific training data. When several data sources are available for a given target language, system performance can be improved using language source-dependent phonotactic models.
In this case, the common practice is to fuse language source information (i.e., the phonotactic scores for each language/source) early (at the input) to the backend. This paper proposes to postpone the fusion to the end (at the output) of the backend. In this case, the language recognition score can be estimated from well-calibrated language source scores. Experiments were conducted using the NIST LRE 2007 and the NIST LRE 2009 evaluation data sets with the 30s condition. On the NIST LRE 2007 eval data, a Cavg of 0.9% is obtained for the closed-set task and 2.5% for the open-set task. Compared to the common practice of early fusion, these results represent relative improvements of 18% and 11% for the closed-set and open-set tasks, respectively. Initial tests on the NIST LRE 2009 eval data gave no improvement on the closed-set task. Moreover, the Cllr measure indicates that language recognition scores estimated by the proposed approach are better calibrated than those of the common early-fusion practice. is12_1398_jk.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Development and Evaluation of Automatic Punctuation for French and English Speech-to-Text Jachym Kolar and Lori Lamel ------------------------ Automatic punctuation of speech is important to make speech-to-text output more readable and to facilitate downstream language processing. This paper describes the development of an automatic punctuation system for French and English. The punctuation model uses both textual information and acoustic (prosodic) information and is based on adaptive boosting. The system is evaluated on a challenging speech corpus under real-application conditions using output from a state-of-the-art speech-to-text system and automatic audio segmentation and speaker diarization. Unlike previous work, automatic punctuation is scored on two independent manual references.
Comparisons are made for the two languages, and the performance of the automatic system is compared with inter-annotator agreement. is12_1063_mfb.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Phonotactic Language Recognition Using MLP Features Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain, Lori Lamel ------------------------ This paper describes a Parallel Phone Recognizers followed by Language Modeling (PPRLM) system that is very efficient in terms of both performance and processing speed. The system uses context-independent phone recognizers trained on MLP features concatenated with the conventional PLP and pitch features. MLP features have several interesting properties that make them suitable for speech processing, in particular the temporal context provided to the MLP inputs and the discriminative criterion used to learn the MLP parameters. Results of preliminary experiments conducted on the NIST LRE 2005 closed-set task show significant improvements obtained by the proposed system compared with a PPRLM system using context-independent phone models trained on PLP features. Moreover, the proposed system performs as well as a PPRLM system using context-dependent phone models, while running 6 times faster. is12_1398_jk.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Discriminatively trained phoneme confusion model for keyword spotting Panagiota Karanasou, Lukas Burget, Dimitra Vergyri, Murat Akbacak, Arindam Mandal ------------------------ Keyword Spotting (KWS) aims at detecting speech segments that contain a given query within large amounts of audio data. Typically, a speech recognizer is involved in a first indexing step. One of the challenges of KWS is how to handle recognition errors and out-of-vocabulary (OOV) terms. This work proposes the use of discriminative training to construct a phoneme confusion model, which expands the phonemic index of a KWS system by adding phonemic variation to handle the above-mentioned problems.
The objective function that is optimized is the Figure of Merit (FOM), which is directly related to KWS performance. The experiments conducted on English data sets show some improvement in the FOM and are promising for the use of such a technique. is12_473_hb.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast Johann Poignant, Herve Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, Georges Quenot ------------------------ We propose an approach for unsupervised speaker identification in TV broadcast videos, combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.5% F-measure when considering all the speakers and 45.7% without anchor speakers. sltu12_fraga.pdf [SLTU 2012, Cape Town, South Africa, May 2012] --------- Incorporating MLP features in the unsupervised training process Thiago Fraga Silva, Viet-Bac Le, Lori Lamel and Jean-Luc Gauvain ------------------------ The combined use of multi-layer perceptron (MLP) and perceptual linear prediction (PLP) features has been reported to improve the performance of automatic speech recognition systems for many different languages and domains.
However, MLP features have not yet been used in unsupervised acoustic model training. This approach is introduced in this paper with encouraging results. In addition, unsupervised language model training was also investigated for a Portuguese broadcast speech recognition task, leading to a slight improvement in performance. The joint use of the unsupervised techniques presented here leads to an absolute WER reduction of up to 3.2% over a baseline unsupervised system. sltu12_lamel.pdf [SLTU 2012, Cape Town, South Africa, May 2012] --------- Transcription of Russian conversational speech Lori Lamel, Sandrine Courcinous, Jean-Luc Gauvain, Yvan Josse and Viet Bac Le ------------------------ This paper presents initial work in transcribing conversational telephone speech in Russian. Acoustic seed models were derived from other languages. The initial studies are carried out with 9 hours of transcribed data, and explore the choice of the phone set and the use of other data types to improve transcription performance. Discriminant features produced by a Multi-Layer Perceptron trained on a few hours of Russian conversational data are contrasted with those derived from well-trained networks for English telephone speech and from Russian broadcast data. Acoustic models trained on broadcast data filtered to match the telephone band achieve results comparable to those obtained with models trained on the small conversational telephone speech corpus. lrec12_300_iv.pdf [LREC 2012, Istanbul, May 2012] --------- Cross-lingual studies of ASR errors: paradigms for perceptual evaluations Ioana Vasilescu, Martine Adda-Decker and Lori Lamel ------------------------ It is well-known that human listeners significantly outperform machines when it comes to transcribing speech.
This paper presents a progress report on the joint research at LIMSI comparing automatic and human speech transcription, and on the perceptual experiments developed to increase our understanding of automatic speech recognition errors. Two paradigms are described here in which human listeners are asked to transcribe speech segments containing words that are frequently misrecognized by the system. In particular, we sought to gain information about the impact of increased context in helping humans disambiguate problematic lexical items, typically homophone or near-homophone words. The long-term aim of this research is to improve the modeling of ambiguous contexts so as to reduce automatic transcription errors. 0004866.pdf [IEEE ICASSP 2010, Dallas, Texas, March 2010] --------- Multi-style MLP Features for BN Transcription Viet-Bac Le and Lori Lamel and Jean-Luc Gauvain ------------------------ It has become common practice to adapt acoustic models to specific conditions (gender, accent, bandwidth) in order to improve the performance of speech-to-text (STT) transcription systems. With the growing interest in the use of discriminative features produced by a multi-layer perceptron (MLP) in such systems, the question arises of whether it is necessary to specialize the MLP to particular conditions, and if so, how to incorporate the condition-specific MLP features in the system. This paper explores three approaches (adaptation, full training, and feature merging) to using condition-specific MLP features in a state-of-the-art BN STT system for French. The third approach, without condition-specific adaptation, was found to outperform the original models with condition-specific adaptation, and to perform almost as well as full training of multiple condition-specific HMMs.
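The "feature merging" approach above can be sketched very simply: per-frame features from several condition-specific MLPs are concatenated with the base PLP features so that a single set of models can be trained. The sketch below is purely illustrative (the function name, conditions, dimensions and values are invented, not taken from the paper):

```python
# Minimal sketch of feature merging: concatenate base PLP features with the
# outputs of every condition-specific MLP, frame by frame. All names and
# numbers are hypothetical placeholders for illustration only.

def merge_features(plp, mlp_by_condition):
    """Concatenate PLP features with the outputs of each
    condition-specific MLP, frame by frame."""
    merged = []
    for t in range(len(plp)):
        frame = list(plp[t])
        # fixed condition order so every frame has the same layout
        for condition in sorted(mlp_by_condition):
            frame.extend(mlp_by_condition[condition][t])
        merged.append(frame)
    return merged

plp = [[0.1, 0.2], [0.3, 0.4]]                      # 2 frames of 2-dim PLP
mlps = {"studio": [[0.9], [0.8]], "telephone": [[0.5], [0.6]]}
merged = merge_features(plp, mlps)
print(merged[0])  # first frame: PLP + studio MLP + telephone MLP features
```

The resulting vectors feed a single HMM training pass, avoiding separate condition-specific model sets.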
0004325.pdf [IEEE ICASSP 2009, Taipei, Taiwan, April 2009] --------- Modeling characters versus words for Mandarin speech recognition Jun Luo and Lori Lamel and Jean-Luc Gauvain ------------------------ Word-based models are widely used in speech recognition since they typically perform well. However, the question of whether it is better to use a word-based or a character-based model warrants investigation for the Mandarin Chinese language. Since Chinese is written without any spaces or word delimiters, a word segmentation algorithm is applied in a pre-processing step prior to training a word-based language model. Chinese characters carry meaning, and speakers are free to combine characters to construct new words. This suggests that character information can also be useful in communication. This paper explores both word-based and character-based models, and their complementarity. Although word-based modeling is found to outperform character-based modeling, increasing the vocabulary size from 56k to 160k words did not lead to a gain in performance. Results are reported for the GALE Mandarin speech-to-text task. 0004537.pdf [IEEE ICASSP 2009, Taipei, Taiwan, April 2009] --------- Lattice-based MLLR for speaker recognition Marc Ferras and Claude Barras and Jean-Luc Gauvain ------------------------ Maximum-Likelihood Linear Regression (MLLR) transform coefficients have been shown to be useful features for text-independent speaker recognition systems. These systems use MLLR coefficients computed with a Large Vocabulary Continuous Speech Recognition (LVCSR) system as features, together with Support Vector Machine (SVM) classification. However, performance is limited by the transcripts, which are often erroneous, with high word error rates (WER) for spontaneous telephone speech applications. In this paper, we propose using lattice-based MLLR to overcome this issue.
Using word lattices instead of 1-best hypotheses, more hypotheses can be considered for MLLR estimation and, thus, better models are more likely to be used. As opposed to standard MLLR, language model probabilities are taken into account as well. We show how systems using lattice MLLR outperform standard MLLR systems on the Speaker Recognition Evaluation (SRE) 2006. A comparison to other standard acoustic systems is provided as well. Langlais08translating.pdf [Workshop on Intelligent Data Analysis in bioMedicine and Pharmacology (IDAMAP) 2008] --------- Translating Medical Words by Analogy Philippe Langlais and François Yvon and Pierre Zweigenbaum ------------------------ Term translation has become a recurring need in medical informatics. This creates an interest in robust methods which can translate medical words in various languages. We propose a novel, analogy-based method to generate word translations. It relies on a partial bilingual lexicon and solves bilingual analogical equations to create candidate translations. To evaluate the potential of this method, we tested it by translating a set of 1,306 French medical words into English. It could propose a correct translation for up to 67.9% of the input words, with a precision of up to 80.2% depending on the number of selected candidates. We compare it to previous methods, including word alignment in parallel corpora and edit distance to a list of words, and show how these methods can complement each other. Rosset07_et_al2.pdf [ASRU 2007, Kyoto, December 2007] --------- The LIMSI QAst systems: comparison between human and automatic rules generation for question-answering on speech transcriptions S. Rosset and O. Galibert and G. Adda and E. Bilinski ------------------------ In this paper, we present two different question-answering systems on speech transcripts. These two systems are based on a complete and multi-level analysis of both queries and documents.
The first system uses handcrafted rules for small text fragment (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The preliminary results obtained on QAst (QA on speech transcripts) development data are promising, ranging from 72% correct answers at first rank on manually transcribed meeting data to 94% on manually transcribed lecture data. final_srsl07.pdf [Workshop on the semantic representation of spoken language - SRSL07, Salamanca, November 2007] --------- Semantic relations for an oral and interactive question-answering system J. Villaneau and S. Rosset and O. Galibert ------------------------ RITEL is an oral QA dialogue system which aims to enable a human to refine a search interactively. Analysis is based on the typing of chunks, according to several categories: named, linguistic or specific entities. Since the system currently obtains promising results using a search based on the common presence of the typed chunks in the documents and in the query, we expect better results by detecting semantic relations between the chunks. After a short description of the RITEL system, this paper describes the first step of our research to define and detect generic semantic relations in the user's spoken utterances, coping with the difficulties of such an analysis: the need for speed, recognition errors, and asyntactical utterances. We also give the first results we have obtained, and some perspectives for future work to extend the number of detected relations and to extend this detection to the documents. turmoCLEF2007-QASToverview.pdf [CLEF 2007 Workshop, Budapest, September 2007] --------- Overview of the QAST 2007 J.
Turmo and P. Comas and C. Ayache and D. Mostefa and S. Rosset and L. Lamel ------------------------ This paper describes QAST, a pilot track of CLEF 2007 aimed at evaluating the task of Question Answering in Speech Transcripts. The paper summarizes the evaluation framework, the systems that participated and the results achieved. These results have shown that question answering technology can be useful for dealing with spontaneous speech transcripts, both for manually transcribed and for automatically recognized speech. The loss in accuracy when moving from manual transcripts to automatic ones implies that there is room for future research in this area. rossetCLEF2007.pdf [CLEF 2007 Workshop, Budapest, September 2007] --------- The LIMSI participation to the QAst track S. Rosset and O. Galibert and G. Adda and E. Bilinski ------------------------ In this paper, we present the two different question-answering systems on speech transcripts which participated in the QAst 2007 evaluation. These two systems are based on a complete and multi-level analysis of both queries and documents. The first system uses handcrafted rules for small text fragment (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The evaluation results range from 17% to 39% accuracy depending on the task.
toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals Björn Schuller, Anton Batliner, Dino Seppi, Stefan Steidl, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous and Vered Aharonson ------------------------ In this paper, we report on classification results for emotional user states (4 classes, German database of children interacting with a pet robot). Six sites computed acoustic and linguistic features independently from each other, following in part different strategies. A total of 4244 features were pooled together and grouped into 12 low-level descriptor types and 6 functional types. For each of these groups, classification results using Support Vector Machines and Random Forests are reported for the full set of features, and for the 150 features per group with the highest individual Information Gain Ratio. The performance for the different groups varies mostly between approx. 50% and approx. 60%. [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Speech fundamental frequency estimation using the Alternate Comb Jean-Sylvain Lienard, Francois Signol and Claude Barras ------------------------ Reliable estimation of speech fundamental frequency is crucial from the perspective of speech separation. We show that gross errors in F0 measurement occur for particular configurations of the periodic structure to be estimated and the other periodic structure used to achieve this estimation. The error families are characterized by a set of two positive integers. The Alternate Comb method uses this knowledge to cancel most of the erroneous solutions. Its efficiency is assessed by an evaluation on a classical pitch database.
toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Using phonetic features in unsupervised word decompounding for ASR with application to a less-represented language Thomas Pellegrini and Lori Lamel ------------------------ In this paper, a data-driven word decompounding algorithm is described and applied to a broadcast news corpus in Amharic. The baseline algorithm has been enhanced in order to address the problem of increased phonetic confusability arising from word decompounding, by incorporating phonetic properties and some constraints on recognition units derived from prior forced alignment experiments. Speech recognition experiments have been carried out to validate the approach. Out-of-vocabulary (OOV) word rates can be reduced by 30% to 40%, and an absolute Word Error Rate (WER) reduction of 0.4% has been achieved. The algorithm is relatively language independent and requires minimal adaptation to be applied to other languages. toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Improved Acoustic Modeling for Transcribing Arabic Broadcast Data Lori Lamel, Abdel. Messaoudi and Jean-Luc Gauvain ------------------------ This paper summarizes our recent progress in improving the automatic transcription of Arabic broadcast audio data, and some efforts to address the challenges of broadcast conversational speech. Our efforts are aimed at improving the acoustic, pronunciation and language models, taking into account specificities of the Arabic language. In previous work we demonstrated that explicit modeling of short vowels improved recognition performance, even when producing non-vocalized hypotheses. In addition to modeling short vowels, consonant gemination and nunation are now explicitly modeled, alternative pronunciations have been introduced to better represent dialectal variants, and a duration model has been integrated.
In order to facilitate training on Arabic audio data with non-vocalized transcripts, a generic vowel model has been introduced. Compared with the previous system (used in the 2006 GALE evaluation), the word error rate has been reduced by over 10% relative. toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Improved Machine Translation of Speech-to-Text outputs Daniel Déchelotte, Holger Schwenk, Gilles Adda and Jean-Luc Gauvain ------------------------ Combining automatic speech recognition and machine translation is frequent in current research programs. This paper first presents several pre-processing steps to limit the performance degradation observed when translating an automatic transcription (as opposed to a manual transcription). Indeed, automatically transcribed speech often differs significantly from the machine translation system's training material with respect to casing, punctuation and word normalization. The proposed system outperforms the best system at the 2007 TC-STAR evaluation by almost 2 BLEU points. The paper then attempts to determine a criterion characterizing how well an STT system's output can be translated, but the current experiments could only confirm that lower word error rates lead to better translations. ritel-final-20070530.pdf [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Handling speech input in the Ritel QA dialogue system B. van Schooten and S. Rosset and O. Galibert and A. Max and R. op den Akker and G. Illouz ------------------------ The Ritel system aims to provide open-domain question answering to casual users by means of a telephone dialogue. Providing a sufficiently natural speech dialogue in a QA system poses some unique challenges, such as very fast overall performance and large-vocabulary speech recognition. Speech QA is an error-prone process, but errors or problems that occur may be resolved with the help of user feedback, as part of a natural dialogue.
This paper reports on the latest version of Ritel, which includes search term confirmation and improved follow-up question handling, as well as various improvements to the other system components. We collected a small dialogue corpus with the new system, which we use to evaluate and discuss it. toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Comparing Praat and Snack formant measurements on two large corpora of northern and southern French C. Woehrling & P. Boula de Mareüil ------------------------ We compare formant frequency measurements between two authoritative tools (Praat and Snack), two large corpora (of face-to-face and telephone speech) and two French varieties (northern and southern). We present both an evaluation of formant tracking (as well as related filtering techniques) and an application to identifying salient pronunciation traits. Despite differences between Praat and Snack with regard to telephone speech (Praat yielding greater F1 values), results seem to converge to suggest that northern and southern French varieties mainly differ in the second formant of the open /O/. /O/ fronting in northern French (with F2 values greater than 1100 Hz for males and 1200 Hz for females) is by far the most discriminating feature provided by decision trees applied to oral vowel formants. allauzen_is07.pdf [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Error detection in confusion network Alexandre Allauzen ------------------------ In this article, error detection for a broadcast news transcription system is addressed in a post-processing stage. We investigate a logistic regression model based on features extracted from confusion networks. This model aims to estimate a confidence score for each confusion set and detect errors. Different kinds of knowledge sources are explored, such as the confusion set itself, a statistical language model, and lexical properties.
The impact of the different features is assessed, showing the importance of those extracted from the confusion network alone. To enrich our modeling with information about the neighborhood, features of adjacent confusion sets are also added to the feature vector. Finally, a distinct processing of confusion sets is also explored depending on the value of their best posterior probability. Compared with the standard ASR output, our best system yields a significant improvement in the classification error rate, from 17.2% to 12.3%. VieruBoulaMadda_ParaLing07.pdf [ParaLing07, August 2007] --------- Identification of foreign-accented French using data-mining techniques B. Vieru-Dimulescu, P. Boula de Mareüil & M. Adda-Decker ------------------------ The goal of this paralinguistic study is to automatically differentiate foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of the automatic alignment into phonemes of non-native French recordings to measure vowel formants, consonant duration and voicing, and prosodic cues, as well as pronunciation variants combining French and foreign acoustic units (e.g. a rolled /r/). We retained 62 features which we used to train a classifier to discriminate speakers' foreign accents. We obtained over 50% correct identification using a cross-validation method including unseen data (a few sentences from 36 speakers). The best automatically selected features are robust to corpus change and make sense with regard to linguistic knowledge. toaddpdffile [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Impact of duration and vowel inventory size on formant values of oral vowels: an automated formant analysis from eight languages.
Cédric Gendrot and Martine Adda-Decker ------------------------ Eight languages (Arabic, English, French, German, Italian, Mandarin Chinese, Portuguese, Spanish) with 6 differently sized vowel inventories were analysed in terms of vowel formants. A tendency to phonetic reduction for vowels of short acoustic duration clearly emerges for all languages. The data did not provide evidence for an effect of inventory size on the global acoustic space; only the acoustic stability of the quantal vowel /i/ is greater than that of other vowels in many cases. [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Analysis of oral and nasal vowel realisation in northern and southern French varieties P. Boula de Mareüil, M. Adda-Decker & C. Woehrling ------------------------ We present data on the pronunciation of oral and nasal vowels in northern and southern French varieties. In particular, a sharp contrast exists in the fronting of the open /O/ towards [oe] in the North and the denasalisation of nasal vowels in the South. We examine how linguistic changes in progress may affect these vowels, which are governed by the left/right context, and bring to light differences between read and spontaneous speech. This study was made possible by automatic phoneme alignment on a large corpus of over 100 speakers. VieruBoulaMadda_Icphs07.pdf [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Characterizing non-native French accents using automatic alignment B. Vieru-Dimulescu, P. Boula de Mareüil & M. Adda-Decker ------------------------ The goal of this study is to investigate the most relevant cues that differentiate 6 foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of the automatic alignment into phonemes of non-native French recordings.
Starting from standard acoustic models, we introduced pronunciation variants reminiscent of foreign-accented speech: first allowing alternations between French phonemes (e.g. [s]~[z]), then combining them with foreign acoustic units (e.g. a rolled /r/). Results reveal discriminating accent-specific pronunciations which, to a large extent, confirm both linguistic predictions and human listeners' judgments. toaddpdffile [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Bayesian framework for voicing alternation & assimilation studies on large corpora in French Martine Adda-Decker, Pierre Hallé ------------------------ The presented work aims at exploring voicing alternation and assimilation on very large corpora using a Bayesian framework. A voice feature (VF) variable has been introduced, whose value is determined using statistical acoustic phoneme models corresponding to 3-state Gaussian mixture Hidden Markov Models. For all relevant consonants, i.e. oral plosives and fricatives, the surface-form voice feature is determined by maximising the acoustic likelihood of the competing phoneme models. A voicing alternation (VA) measure counts the number of changes between underlying and surface-form voice features. Using a corpus of 90h of French journalistic speech, an overall voicing alternation rate of 2.7% has been measured, thus calibrating the method's accuracy. The VA rate remains below 2% word-internally and at word onsets, and rises to 9% at lexical word endings. In assimilation contexts the rates grow significantly (>20%), highlighting regressive voicing assimilation. Results also exhibit a weak tendency towards progressive devoicing.
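The voicing-alternation measure described in this abstract can be sketched in a few lines (a minimal illustration under assumed inputs, not the authors' implementation: each token carries its underlying voicing and hypothetical log-likelihoods under the voiced and voiceless acoustic models):

```python
# Sketch of the voicing-alternation (VA) measure: the surface voice feature
# is chosen by maximizing the acoustic likelihood of the competing phoneme
# models, then compared with the underlying (lexical) voice feature.
def voicing_alternation_rate(tokens):
    changes = 0
    for underlying_voiced, ll_voiced, ll_voiceless in tokens:
        surface_voiced = ll_voiced > ll_voiceless  # maximize acoustic likelihood
        if surface_voiced != underlying_voiced:
            changes += 1
    return changes / len(tokens)

# Hypothetical tokens: (underlying_voiced, loglik voiced model, loglik voiceless model)
tokens = [
    (True, -42.0, -45.5),   # /b/ realized voiced: no alternation
    (False, -40.1, -43.2),  # /p/ realized voiced: assimilation candidate
    (True, -50.3, -48.0),   # /z/ realized voiceless: devoicing
]
rate = voicing_alternation_rate(tokens)  # 2 alternations out of 3 tokens
```

On real data the rate is computed separately per position (word-internal, word onset, word ending) and per context, which is how the 2%/9%/>20% figures above are obtained.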
IV07nato.pdf [to appear in NATO Security Through Science Program, IOS Press BV, ''The Fundamentals of Verbal and Non-verbal Communication and the Biometrical Issue''] --------- A Cross-Language Study of Acoustic and Prosodic Characteristics of Vocalic Hesitations Ioana Vasilescu and Martine Adda-Decker ------------------------ This contribution provides a cross-language study of the acoustic and prosodic characteristics of vocalic hesitations. One aim of the presented work is to use large corpora to investigate whether some language universals can be found. A complementary point of view is to determine whether vocalic hesitations can be considered as bearing language-specific information. An additional point of interest concerns the link between vocalic hesitations and the vowels in the phonemic inventory of each language. Finally, the gained insights are of interest to research in acoustic modeling for automatic speech, speaker and language recognition. Hesitations have been automatically extracted from large corpora of journalistic broadcast speech and parliamentary debates in three languages (French, American English and European Spanish). Duration, fundamental frequency and formant values were measured and compared. Results confirm that vocalic hesitations share (potentially universal) properties across languages, characterized by longer durations and lower fundamental frequency than those observed for intra-lexical vowels in the three languages investigated here. The results on vocalic timbre show that while the measures on hesitations are close to existing vowels of the language, they do not necessarily coincide with them. The measured average timbre of vocalic hesitations in French is slightly more open than its closest neighbor (/oe/). For American English, the average F1 and F2 formant values position the vocalic hesitation as a mid-open vowel somewhere between /^/ and /ae/.
The Spanish vocalic hesitation almost completely overlaps with the mid-closed front vowel /e/. icphs2007ivrnmafinal.pdf [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Vocalic Hesitations vs Vocalic Systems: A Cross-Language Comparison Ioana Vasilescu, Rena Nemoto and Martine Adda-Decker ------------------------ This paper deals with the acoustic characteristics of vocalic hesitations in a cross-language perspective. The underlying questions concern the ``neutral'' vs. language-dependent timbre of vocalic hesitations and the link between their vocalic quality and the phonemic system of the language. An additional point of interest concerns the duration effect on vocalic hesitations compared to intra-lexical vowels. Acoustic measurements have been carried out in American English, French and Spanish. Results on vocalic timbre show that hesitations (i) carry language-specific information and (ii), while often close to measurements of existing vowels, do not necessarily coincide with them. Finally, (iii) duration variation affects the timbre of vocalic hesitations, and a centralization towards a ``neutral'' realization is observed for decreasing durations. toaddpdffile [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Voicing assimilation in journalistic speech Pierre Hallé, Martine Adda-Decker ------------------------ We used a corpus of radio and television speech to run a quantitative study of voicing assimilation in French. The results suggest that, although voicing itself can be incomplete, voice assimilation is essentially categorical. The amount of voicing assimilation depends little on underlying voicing but clearly varies with segment duration and also with consonant manner of articulation. The results also suggest that voicing assimilation, though largely regressive, is not purely unidirectional.
jel2007PBM.pdf [5èmes Journées d'Études Linguistiques de Nantes, June 2007] --------- Traitement du schwa : de la synthèse à l'alignement automatique de la parole P. Boula de Mareüil ------------------------ The aim of this talk is to provide a better picture of schwa in French through automatic speech processing. After a general presentation of grapheme-to-phoneme conversion for speech synthesis, the treatment of schwa is considered and the question of evaluation is raised. To move beyond the deterministic framework of synthesis, in which only one way of reading a given sentence is allowed, automatic alignment techniques on large speech corpora are put to use. Automatic processing makes it possible to quantify stylistic and regional differences between French varieties. The confirmation of expected results on the pronunciation of schwa supports the validity of the approach. EMNLP-CoNLL200745.pdf [Empirical Methods in Natural Language Processing] --------- Smooth Bilingual N-gram Translation Holger Schwenk and Daniel Déchelotte and Hélène Bonneau-Maynard and Alexandre Allauzen ------------------------ WMT25.pdf [ACL Workshop on Statistical Machine Translation] --------- Building a statistical machine translation system for French using the Europarl corpus Holger Schwenk ------------------------ toaddpdffile [Computer Speech and Language, 21:492-518, 2007] --------- Continuous space language models. Holger Schwenk ------------------------ This paper describes the use of a neural network language model for large vocabulary continuous speech recognition. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. Very efficient learning algorithms are described that enable the use of training corpora of several hundred million words.
It is also shown that this approach can be incorporated into a large vocabulary continuous speech recognizer using a lattice rescoring framework at a very low additional processing time. The neural network language model has been thoroughly evaluated in a state-of-the-art large vocabulary continuous speech recognizer for several international benchmark tasks, in particular the NIST evaluations on broadcast news and conversational speech recognition. The new approach is compared to 4-gram back-off language models trained with modified Kneser-Ney smoothing, which has often been reported to be the best known smoothing method. The neural network language model achieved consistent word error rate reductions for all considered tasks and languages, ranging from 0.5% up to 1.6% absolute. toaddpdffile [5èmes Journées d'Études Linguistiques de Nantes, June 2007] --------- Problèmes posés par le schwa en reconnaissance et en alignement automatiques de la parole Martine Adda-Decker ------------------------ This contribution addresses the following questions: What problems does schwa pose for automatic speech recognition, and for automatic alignment, which already serves and will increasingly serve many phonetic and phonological analyses? Answering them requires going back to upstream issues concerning schwa and acoustic modeling. What links exist between pronunciation dictionaries, pronunciation variants, acoustic modeling and automatic segmentation? What can we learn about schwa from the automatic processing of large corpora? What can schwa teach us about the other vowels of the system? toaddpdffile [7èmes Journées Jeunes Chercheurs en Parole, Paris, July 2007] --------- Comparaison entre l'extraction de formants par Praat et Snack sur deux grands corpus de français du nord et du sud C. Woehrling-Dimulescu & P.
Boula de Mareüil ------------------------ We compare formant frequency measurements between two authoritative tools (Praat and Snack), two large corpora (of face-to-face and telephone speech) and two French varieties (northern and southern). The study includes both an evaluation of formant tracking (as well as related filtering techniques) and an application to finding salient pronunciation traits. Despite differences between Praat and Snack with regard to telephone speech (Praat yielding greater F1 values), results seem to converge, suggesting that northern and southern French varieties mainly differ in the second formant of the open /O/. /O/ fronting in northern French (with F2 values greater than 1100 Hz for males and 1200 Hz for females) is by far the most discriminating feature provided by decision trees applied to oral vowel formants. Vieru_rjcp07.pdf [7èmes Journées Jeunes Chercheurs en Parole, Paris, July 2007] --------- Identification de 6 accents étrangers en français utilisant des techniques de fouilles de données B. Vieru-Dimulescu & P. Boula de Mareüil ------------------------ The goal of this study is to automatically differentiate foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of automatic alignment into phonemes of non-native French recordings to measure vowel formants, consonant duration and voicing, prosodic cues as well as pronunciation variants combining French and foreign acoustic units (e.g. a rolled 'r'). We retained 62 features which we used for training a classifier to discriminate speakers' foreign accents. We obtained over 50% correct identification by using a cross-validation method including unseen data (a few sentences from 36 speakers). The best automatically selected features are robust to corpus change and make sense with regard to linguistic knowledge. toaddpdffile [Revue Française de Linguistique Appliquée, pp. 71-84, vol.
XII-1/juin 2007] --------- Corpus pour la transcription automatique de l'oral Martine Adda-Decker ------------------------ This contribution illustrates the production and use of corpora for research in automatic speech transcription. Since this research relies heavily on statistical modeling, it naturally goes hand in hand with the production of written corpora and of transcribed oral corpora, as well as of tools facilitating manual transcription. The methods and techniques that have been developed now allow large-scale automatic spoken language processing, while contributing to an emerging interdisciplinary research field: spoken corpus linguistics. LIMSI_rt07asr_final.pdf [to appear in Springer Lecture Notes in Computer Science] --------- The LIMSI RT07 Lecture Transcription System Lori Lamel and Eric Bilinski and Jean-Luc Gauvain and Gilles Adda and Claude Barras and Xuan Zhu ------------------------ A system to automatically transcribe lectures and presentations has been developed in the context of the FP6 Integrated Project CHIL. In addition to the seminar data recorded by the CHIL partners, widely available corpora were used to train both the acoustic and language models. Acoustic model training made use of the transcribed portion of the TED corpus of Eurospeech recordings, as well as the ICSI, ISL, and NIST meeting corpora. For language model training, text materials were extracted from a variety of on-line conference proceedings. Experimental results are reported for close-talking and far-field microphones on development and evaluation data. LIMSI_rt07diar.pdf [to appear in Springer Lecture Notes in Computer Science] --------- Multi-Stage Speaker Diarization for Conference and Lecture Meetings X. Zhu and C. Barras and L. Lamel and J.L.
Gauvain ------------------------ The LIMSI RT-07S speaker diarization system for conference and lecture meetings is presented in this paper. This system builds upon the RT-06S diarization system designed for lecture data. The baseline system combines agglomerative clustering based on the Bayesian information criterion (BIC) with a second clustering stage using state-of-the-art speaker identification (SID) techniques. Since the baseline system yields a high speech activity detection (SAD) error of around 10% on lecture data, different acoustic representations with various normalization techniques are investigated within the framework of a log-likelihood ratio (LLR) based speech activity detector. UBMs trained on the different types of acoustic features are also examined in the SID clustering stage. All SAD acoustic models and UBMs are trained on the forced-alignment segmentations of the conference data. The diarization system integrating the new SAD models and UBM gives comparable results on both the RT-07S conference and lecture evaluation data for the multiple distant microphone (MDM) condition. toaddpdffile [HLT/NAACL workshop on Syntax and Structure in Statistical Translation, April 2007] --------- Combining morphosyntactic enriched representation with n-best reranking in statistical translation Hélène Bonneau-Maynard, Alexandre Allauzen, Daniel Déchelotte, and Holger Schwenk ------------------------ The purpose of this work is to explore the integration of morphosyntactic information into the translation model itself, by enriching words with their morphosyntactic categories. We investigate word disambiguation using morphosyntactic categories, n-best hypotheses reranking, and the combination of both methods with word or morphosyntactic n-gram language model reranking. Experiments are carried out on the English-to-Spanish translation task.
Using the morphosyntactic language model alone does not result in any improvement in performance. However, combining morphosyntactic word disambiguation with a word-based 4-gram language model results in an improvement in the BLEU score of 0.6% on the development set and 0.3% on the test set. 0400021.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Detection and Analysis of Abnormal Situations Through Fear-Type Acoustic Manifestations C. Clavel and L. Devillers and G. Richard and I. Vasilescu and T. Ehrette ------------------------ Recent work on emotional speech processing has demonstrated the interest of considering the information conveyed by the emotional component in speech to enhance the understanding of human behaviors. But to date, there has been little integration of emotion detection systems into effective applications. The present research focuses on the development of a fear-type emotion recognition system to detect and analyze abnormal situations for surveillance applications. The Fear vs. Neutral classification achieves a mean accuracy rate of 70.3%. This is quite an encouraging result given the diversity of fear manifestations illustrated in the data. More specific acoustic models are built inside the fear class by considering the context of emergence of the emotional manifestations, i.e. the type of threat during which they occur, which has a strong influence on fear acoustic manifestations. The potential use of these models for a threat type recognition system is also investigated. Such information about the situation can indeed be useful for surveillance systems.
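The Fear vs. Neutral decision described in the abstract above rests on class-conditional acoustic models; a toy version can be sketched as follows (a deliberately simplified single-Gaussian sketch with made-up feature values, not the paper's GMM system):

```python
import math

# Toy two-class acoustic classifier: each class is modeled by a diagonal
# Gaussian over a small feature vector (here, hypothetically, pitch mean in Hz
# and normalized energy); the class with the higher log-likelihood wins.

def fit_gaussian(vectors):
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dim)]  # floor the variance for numerical safety
    return mean, var

def loglik(model, x):
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var[d])
                       + (x[d] - mean[d]) ** 2 / var[d])
               for d in range(len(x)))

def classify(models, x):
    return max(models, key=lambda c: loglik(models[c], x))

# Invented training vectors, for illustration only.
models = {
    "fear":    fit_gaussian([[310.0, 0.9], [295.0, 0.8], [320.0, 0.95]]),
    "neutral": fit_gaussian([[190.0, 0.4], [205.0, 0.5], [185.0, 0.45]]),
}
label = classify(models, [300.0, 0.85])  # high pitch, high energy
```

A real system would use Gaussian mixture models trained on many spectral and prosodic features, but the decision rule (maximum class-conditional likelihood) is the same.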
0400053.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Constrained MLLR for Speaker Recognition Marc Ferras, Cheung-Chi Leung, Claude Barras and Jean-Luc Gauvain ------------------------ Maximum-Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) are two widely-used techniques for speaker adaptation in large-vocabulary speech recognition systems. Recently, using MLLR transforms as features for speaker recognition tasks has been proposed, achieving performance comparable to that obtained with cepstral features. This paper describes a new feature extraction technique for speaker recognition based on CMLLR speaker adaptation which avoids the use of transcripts. Modeling is carried out with Support Vector Machines (SVM). Results on the NIST Speaker Recognition Evaluation 2005 dataset are provided, alone as well as in combination with two cepstral approaches, MFCC-GMM and MFCC-SVM, for which system performance is improved by 10% relative in Equal Error Rate. 0401277.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Speech Recognition System Combination for Machine Translation M.J.F. Gales and X. Liu and R. Sinha and P.C. Woodland and K. Yu and S. Matsoukas and T. Ng and K. Nguyen and L. Nguyen and J-L Gauvain and L. Lamel and A. Messaoudi ------------------------ The majority of state-of-the-art speech recognition systems make use of system combination. The combination approaches adopted have traditionally been tuned to minimising Word Error Rates (WERs). In recent years there has been growing interest in taking the output from speech recognition systems in one language and translating it into another. This paper investigates the use of cross-site combination approaches in terms of both WER and impact on translation performance. In addition, the stages involved in modifying the output from a Speech-to-Text (STT) system to be suitable for translation are described.
Two source languages, Mandarin and Arabic, are recognised and then translated into English using a phrase-based statistical machine translation system. The performance of the individual systems and of cross-site combination using cross-adaptation and ROVER is given. Results show that the best STT combination scheme in terms of WER is not necessarily the most appropriate when translating speech. 0400997.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- The LIMSI 2006 TC-STAR EPPS Transcription Systems Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski, Olivier Galibert, Agusti Pujol, Holger Schwenk, Xuan Zhu ------------------------ This paper describes the speech recognizers developed to transcribe European Parliament Plenary Sessions (EPPS) in English and Spanish in the 2nd TC-STAR Evaluation Campaign. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 EPPS systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. Experiments with cross-site adaptation and system combination are also described. 0400641.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Modeling Duration via Lattice Rescoring Nicolas Jennequin and Jean-Luc Gauvain ------------------------ It is often acknowledged that HMMs do not properly model phone and word durations. In this paper phone and word duration models are used to improve the accuracy of state-of-the-art large vocabulary speech recognition systems. The duration information is integrated into the systems by rescoring word lattices that include phone-level segmentations.
Experimental results are given for a conversational telephone speech (CTS) task in French and for the TC-Star EPPS transcription task in Spanish and English. An absolute word error rate reduction of about 0.5% is observed for the CTS task, and smaller but consistent gains are observed for the EPPS task in both languages. Tal_ritel.pdf [Traitement Automatique des Langues, vol 46 (3)] --------- Interaction et recherche d'information : le projet Ritel S. Rosset and O. Galibert and G. Illouz and A. Max ------------------------ The Ritel project aims at integrating spoken language dialog and open-domain information retrieval to allow a human to ask general questions and refine her search interactively. In this paper, we describe the current platform and more specifically the analysis, information extraction and natural language generation modules. We present some preliminary results of the different modules. parole06.pdf [Revue Parole, 37: 25-65] --------- Identification d'accents régionaux en français : perception et analyse C. Woehrling and P. Boula de Mareüil ------------------------ This article is dedicated to the perceptual identification of French varieties by listeners from the surroundings of Paris and Marseilles. It is based on recordings of about forty speakers from six Francophone regions: Normandy, Vendee, Romand Switzerland, Languedoc and the Basque Country.
Contrary to the speech type (read or spontaneous speech) and the listeners' region of origin (Paris or Marseilles), the speakers' degree of accentedness has a major effect and interacts with the speakers' age. The origin of the oldest speakers, who have the strongest accent according to the Paris listeners' judgements, is better recognised than the origin of the youngest speakers. However, confusions are frequent among the Southern varieties. On the whole, three accents can be distinguished (by clustering and multidimensional scaling techniques): Northern French, Southern French and Romand Swiss. Boula_Vieru_phonetica06.pdf [Phonetica 63 (pp. 247-267)] --------- The contribution of prosody to the perception of foreign accent P. Boula de Mareüil and B. Vieru-Dimulescu ------------------------ The general goal of this study is to expand our understanding of what is meant by foreign /accent/ in people's parlance. More specifically, we insist on the role of prosody (timing and melody), which has rarely been examined. New technologies, including diphone speech synthesis (Experiment 1) and speech manipulation (Experiment 2), are used to study the relative importance of prosody in what is perceived as a foreign /accent/. The methodology we propose, based on the prosody transplantation paradigm, can be applied to different languages or language varieties. Here, it is applied to Spanish and Italian. We built up a dozen sentences which are spoken in almost the same way in both languages (e.g. /ha visto la casa //del// presidente americano/ "you/(s)he saw the American president's house"). Spanish/Italian monolinguals and bilinguals were recorded. We then studied what is perceived when the segmental specification of an utterance is combined with suprasegmental features belonging to a different language.
Results obtained with Italian, Spanish and French listeners resemble one another, and suggest that prosody is at least as important as the articulation of phonemes and voice quality in identifying Spanish-accented Italian and Italian-accented Spanish. IS061776.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Investigating Automatic Decomposition for ASR in Less Represented Languages Thomas Pellegrini and Lori Lamel ------------------------ This paper addresses the use of an automatic decomposition method to reduce lexical variety and thereby improve speech recognition of less well-represented languages. The Amharic language has been selected for these experiments since only a small quantity of resources is available compared to well-covered languages. Inspired by the Harris algorithm, the method automatically generates plausible affixes that, combined with decompounding, can reduce the size of the lexicon and the OOV rate.
Recognition experiments are carried out for four different configurations (full-word and decompounded) using supervised training with a corpus containing only two hours of manually transcribed data. IS061636.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Real-life emotions detection with lexical and paralinguistic cues on Human-Human call center dialogs Laurence Devillers and Laurence Vidrascu ------------------------ The emotion detection work reported here is part of a larger study aiming to model user behavior in real interactions. We have already studied emotions in a real-life corpus of human-human dialogs on a financial task. We now make use of another corpus of real agent-caller spoken dialogs from a medical emergency call center, in which emotion manifestations are much more complex and extreme emotions are common. Our global aims are to define appropriate verbal labels for annotating real-life emotions, to annotate the dialogs, to validate the presence of emotions via perceptual tests, and to find robust cues for emotion detection. Annotations have been done by two experts with twenty verbal classes organized in eight macro-classes. For the experiments in this paper we retained four macro-classes: Relief, Anger, Fear and Sadness. The relevant cues for detecting natural emotions are paralinguistic and linguistic. Two studies are reported in this paper: the first investigates automatic emotion detection using linguistic information, whereas the second investigates emotion detection with paralinguistic cues. On the medical corpus, preliminary experiments using lexical cues detect about 78% of the four labels, with very good detection of the Relief (about 90%) and Fear (about 86%) emotions. Experiments using paralinguistic cues show about 60% correct detection, Fear being best detected.
IS061251.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Perceptual identification and phonetic analysis of 6 foreign accents in French Bianca Vieru-Dimulescu and Philippe Boula de Mareüil ------------------------ A perceptual experiment was designed to determine to what extent naïve French listeners are able to identify foreign accents in French: Arabic, English, German, Italian, Portuguese and Spanish. They succeed in recognizing the speaker's mother tongue in more than 50% of cases (while rating the degree of accentedness as average). They perform best with Arabic speakers and worst with Portuguese speakers. The Spanish/Italian and English/German accents are the most often confused. Phonetic analyses were conducted; clustering and scaling techniques were applied to the results and related to the listeners' reactions recorded during the test. They support the idea that differences in vowel realization (especially concerning the phoneme /y/) seem to outweigh rhythmic cues. Index Terms: perception, foreign accent, non-native French. IS061261.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Identification of regional accents in French: perception and categorization Cécile Woehrling and Philippe Boula de Mareüil ------------------------ This article is dedicated to the perceptual identification of French varieties by listeners from the surroundings of Paris and Marseilles. It is based on the geographical localization of about forty speakers from 6 Francophone regions: Normandy, Vendee, Romand Switzerland, Languedoc and the Basque Country. Contrary to the speech type (read or spontaneous speech) and the listeners' region of origin, the speakers' degree of accentedness has a major effect and interacts with the speakers' age. The origin of the oldest speakers (who have the strongest accent according to the Paris listeners' judgments) is better recognized than the origin of the youngest speakers.
However, confusions are frequent among the Southern varieties. On the whole, three accents can be distinguished (by clustering and multidimensional scaling techniques): Northern French, Southern French and Romand Swiss. zhu_odyssey.pdf [IEEE Odyssey The Speaker and Language Recognition Workshop, San Juan, June 2006] --------- Dong Zhu and Martine Adda-Decker ------------------------ Recent results have shown that phone lattices may significantly improve language recognition in a PPRLM (parallel phone recognition followed by language dependent modeling) framework. In this contribution we investigate the effectiveness of lattices for language identification (LID) when using different sets of multilingual phone and syllable inventories and corresponding multilingual acoustic phone models. The LID system that achieves the best results is a PPRLM structure composed of four acoustic recognizers using both phone- and syllable-based decoders. A 7-language broadcast news corpus of approximately 150 hours of speech is used for training, development and test of our LID systems. Experimental results show that the use of lattices consistently improves LID results for all multilingual acoustic model sets and for both phonotactic and syllabotactic approaches. tcstar06_dechelotte.pdf [TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, June 2006] --------- The 2006 LIMSI Statistical Machine Translation System for TC-STAR Daniel Déchelotte, Holger Schwenk and Jean-Luc Gauvain ------------------------ This paper presents the LIMSI statistical machine translation system developed for the 2006 TC-STAR evaluation campaign. We describe an A*-decoder that generates translation lattices using a word-based translation model. A lattice is a rich and compact representation of alternative translations that includes the probability scores of all the involved sub-models.
These lattices are then used in subsequent processing steps, in particular to perform sentence splitting and joining, maximum BLEU training, and to apply improved statistical target language models. tcstar06_lamel.pdf [TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, June 2006] --------- The LIMSI 2006 TC-STAR Transcription Systems Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski, Olivier Galibert, Agusti Pujol, Holger Schwenk, Xuan Zhu ------------------------ This paper describes the speech recognizers evaluated in the TC-STAR Second Evaluation Campaign held in January-February 2006. Systems were developed to transcribe parliamentary speeches in English and Spanish, as well as broadcast news in Mandarin Chinese. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 European Parliament Plenary Sessions (EPPS) systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. The character error rate for Mandarin for a joint system submission with the University of Karlsruhe was 9.8%. Experiments with cross-site adaptation and system combination are also described. tcstar06_jennequin.pdf [TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, June 2006] --------- Lattice Rescoring Experiments with Duration Models Nicolas Jennequin and Jean-Luc Gauvain ------------------------ This paper reports on experiments using phone and word duration models to improve speech recognition accuracy. The duration information is integrated into state-of-the-art large vocabulary speech recognition systems by rescoring word lattices that include phone-level segmentations.
Experimental results are given for a conversational telephone speech (CTS) task in French and for the TC-STAR EPPS transcription task in Spanish and English. An absolute word error rate reduction of about 0.5% is observed for the CTS task, and smaller but consistent gains are observed for the EPPS task. jep06madda.pdf [JEP 2006, Dinard, June 2006] --------- De la reconnaissance automatique de la parole à l'analyse linguistique de corpus oraux Martine Adda-Decker ------------------------ This contribution aims at giving an overview of present automatic speech recognition in French, highlighting typical transcription problems for this language. Explanations for errors can be partially obtained by examining the acoustics of the speech data. Such investigations, however, do not only inform about system and/or speech modeling limitations; they also contribute to discover, describe and quantify specificities of spoken language as opposed to written language and speakers' speech performance. To automatically transcribe different speech genres (e.g. broadcast news vs conversations) specific acoustic corpora are used for training, suggesting that frequency of occurrence and acoustic realisations of phonemes vary significantly across genres. Some examples of corpus studies are presented describing phoneme frequencies and segment durations on different corpora. In the future, large-scale corpus studies may contribute to increase our knowledge of spoken language as well as the performance of automatic processing. jep06bv.pdf [JEP 2006, Dinard, June 2006] --------- Identification perceptive d'accents étrangers en français de corpus oraux B. Vieru-Dimulescu and P. Boula de Mareüil ------------------------ A perceptual experiment was designed to determine to what extent naïve French listeners are able to identify foreign accents in French: Arabic, English, German, Italian, Portuguese and Spanish.
They succeed in recognising the speaker's mother tongue in more than 50% of cases (while rating their degree of accentedness as average). They perform best with Arabic speakers and worst with Portuguese speakers. The Spanish and Italian accents on the one hand, and the English and German accents on the other, are the most often confused. Phonetic analyses were conducted; clustering and scaling techniques were applied to the results, and were related to the listeners' reactions that were recorded during the test. Emphasis was laid on differences in vowel realisation (especially concerning the phoneme /y/). jep2006_zhu_madda_versionfinale.pdf [JEP 2006, Dinard, June 2006] --------- Identification automatique des langues : combinaison d'approches phonotactiques à base de treillis de phones et de syllabes Dong Zhu and Martine Adda-Decker ------------------------ This paper investigates the use of phone and syllable lattices for automatic language identification (LID) based on multilingual phone sets (73, 50 and 35 phones). We focus on efficient combinations of both phonotactic and syllabotactic approaches. The LID structure used to achieve the best performance within this framework is similar to PPRLM (parallel phone recognition followed by language dependent modeling): several acoustic recognizers based on either multilingual phone or syllable inventories, followed by language-specific n-gram language models. A seven-language broadcast news corpus is used for the development and the test of the LID systems. Our experiments show that the use of the lattice information significantly improves results over all system configurations and all test durations. Multiple system combinations achieve further improvements.
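The phonotactic/PPRLM recipe shared by the LID abstracts above — decode speech into phone (or syllable) strings with parallel recognizers, then score each string with language-specific n-gram models and pick the best-scoring language — can be sketched in a few lines. This is an illustrative toy only: the phone strings, the two-language setup and the add-one-smoothed bigram models are my own assumptions, not the lattice-based systems described in these papers.

```python
import math
from collections import defaultdict

def train_bigram(phone_strings):
    """Phonotactic model: phone-bigram counts with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    inventory = set()
    for s in phone_strings:
        phones = s.split()
        inventory.update(phones)
        for a, b in zip(phones, phones[1:]):
            counts[a][b] += 1

    def logprob(seq):
        """Log-probability of a decoded phone string under this language's bigrams."""
        phones = seq.split()
        v = len(inventory) + 1  # inventory size + 1 unseen slot for smoothing
        lp = 0.0
        for a, b in zip(phones, phones[1:]):
            lp += math.log((counts[a][b] + 1) / (sum(counts[a].values()) + v))
        return lp

    return logprob

# Hypothetical training data: phone strings produced by one tokenizer per language.
models = {
    "en": train_bigram(["th e k a t", "th e d o g"]),
    "fr": train_bigram(["l e ch a", "l e ch i e n"]),
}

def identify(utterance_phones):
    """Final stage: the language whose phonotactic model scores highest wins."""
    return max(models, key=lambda lang: models[lang](utterance_phones))
```

A real system would score phone lattices rather than 1-best strings (the point of the lattice papers above), run several parallel tokenizers, and calibrate the combined scores in a back-end.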
jep06pellegrini.pdf [JEP 2006, Dinard, June 2006] --------- Expériences de transcription automatique d'une langue rare Thomas Pellegrini and Lori Lamel ------------------------ This work investigates automatic transcription of rare languages, where rare means that there are limited resources available in electronic form. In particular, some experiments on word decompounding for Amharic, as a means of compensating for the lack of textual data, are described. A corpus-based decompounding algorithm has been applied to a 4.6M word corpus. Compounding in Amharic was found to result from the addition of prefixes and suffixes. Using seven frequent affixes reduces the out-of-vocabulary rate from 7.0% to 4.8% and the total number of lexemes from 133k to 119k. Preliminary attempts at recombining the morphemes into words result in a slight decrease in word error rate relative to that obtained with a full word representation. jep06lefevre.pdf [JEP 2006, Dinard, June 2006] --------- Transformation linéaire discriminante pour l'apprentissage des HMM à analyse factorielle Fabrice Lefevre and Jean-Luc Gauvain ------------------------ Factor analysis has recently been used to model the covariance of the feature vector in speech recognition systems. Maximum likelihood estimation of the parameters of factor analyzed HMMs (FAHMMs) is usually done via the EM algorithm. The initial estimates of the model parameters are then a key issue for their correct training. In this paper we report on experiments showing some evidence that the use of a discriminative criterion to initialize the FAHMM maximum likelihood parameter estimation can be effective. The proposed approach relies on the estimation of a discriminant linear transformation to provide initial values for the factor loading matrices. Solutions for the appropriate initialization of the other model parameters are presented as well.
Speech recognition experiments were carried out on the Wall Street Journal LVCSR task with a 65k vocabulary. Contrastive results are reported with various model sizes using discriminant and non-discriminant initialization. lrec06SR.pdf [LREC 2006, Genoa, May 2006] --------- The Ritel Corpus - An annotated Human-Machine open-domain question answering spoken dialog corpus S. Rosset, S. Petel ------------------------ In this paper we present a real (as opposed to Wizard-of-Oz) Human-Computer QA-oriented spoken dialog corpus collected with our Ritel platform. This corpus has been orthographically transcribed and annotated in terms of Specific Entities and Topics. Twelve main topics have been chosen. They are refined into 22 sub-topics. The Specific Entities are from five categories and cover Named Entities, linguistic entities, topic-defining entities, general entities and extended entities. The corpus contains 582 dialogs for 6 hours of user speech. lrec06media.pdf [LREC 2006, Genoa, May 2006] --------- Results of the French Evalda-Media evaluation campaign for literal understanding H. Bonneau-Maynard, C. Ayache, F. Bechet, A. Denis, A. Kuhn, F. Lefevre, D. Mostefa, M. Quignard, S. Rosset, C. Servan, J. Villaneau ------------------------ The aim of the Media-Evalda project is to evaluate the understanding capabilities of dialog systems. This paper presents the Media protocol for speech understanding evaluation and describes the results of the June 2005 literal evaluation campaign. Five systems, both symbolic and corpus-based, participated in the evaluation, which is based on a common semantic representation. Different scorings have been performed on the system results. The understanding error rate for the Full scoring ranges, depending on the system, from 29% to 41.3%. A diagnostic analysis of these results is proposed.
lrec06evaSy.pdf [LREC 2006, Genoa, May 2006] --------- A joint intelligibility evaluation of French text-to-speech synthesis systems: the EvaSy SUS/ACR campaign P. Mareüil, C. D'Alessandro, A. Raake, G. Bailly, M. Garcia, M. Morel ------------------------ The EVALDA/EvaSy project is dedicated to the evaluation of text-to-speech synthesis systems for the French language. It is subdivided into four components: evaluation of the grapheme-to-phoneme conversion module (Boula de Mareüil et al., 2005), evaluation of prosody (Garcia et al., 2006), evaluation of intelligibility, and global evaluation of the quality of the synthesised speech. This paper reports on the key results of the intelligibility and global evaluation of the synthesised speech. It focuses on intelligibility, assessed on the basis of semantically unpredictable sentences, but a comparison with absolute category rating in terms of e.g. pleasantness and naturalness is also provided. Three diphone systems and three selection systems have been evaluated. It turns out that the most intelligible system (diphone-based) is far from being the one which obtains the best mean opinion score. lrec06prosody.pdf [LREC 2006, Genoa, May 2006] --------- A joint prosody evaluation of French text-to-speech synthesis systems M. Garcia, C. d'Alessandro, G. Bailly, P. Mareüil, M. Morel ------------------------ This paper reports on prosodic evaluation in the framework of the EVALDA/EvaSy project for text-to-speech (TTS) evaluation for the French language. Prosody is evaluated using a prosodic transplantation paradigm. Intonation contours generated by the synthesis systems are transplanted on a common segmental content. Both diphone based synthesis and natural speech are used. Five TTS systems are tested along with natural voice. The test is a paired preference test (with 19 subjects), using 7 sentences. 
The results indicate that natural speech consistently obtains the first rank (with an average preference rate of 80%), followed by a selection-based system (72%) and a diphone-based system (58%). However, rather large variations in judgements are observed among subjects and sentences, and in some cases synthetic speech is preferred to natural speech. These results show the remarkable improvement achieved by the best selection-based synthesis systems in terms of prosody. In this way, a new paradigm for evaluation of the prosodic component of TTS systems has been successfully demonstrated. lrec06Abrilian.pdf [LREC 2006, Genoa, May 2006] --------- Annotation of Emotions in Real-Life Video Interviews: Variability between Coders S. Abrilian, L. Devillers, J.C. Martin ------------------------ Research on emotional real-life data has to tackle the problem of their annotation. The annotation of emotional corpora raises the issue of how different coders perceive the same multimodal emotional behaviour. The long-term goal of this paper is to produce a guideline for the selection of annotators. The LIMSI team is working towards the definition of a coding scheme integrating emotion, context and multimodal annotations. We present the currently defined coding scheme for emotion annotation, and the use of soft vectors for representing a mixture of emotions. This paper describes a perceptive test of emotion annotations and the results obtained with 40 different coders on a subset of complex real-life emotional segments selected from the EmoTV Corpus collected at LIMSI. The results of this first study validate previous annotations of emotion mixtures and highlight the differences in annotation between male and female coders. lrec06clavel.pdf [LREC 2006, Genoa, May 2006] --------- Fear-type emotions of the SAFE Corpus: annotation issues C. Clavel, I. Vasilescu, L. Devillers, T. Ehrette, G.
Richard ------------------------ The present research focuses on annotation issues in the context of the acoustic detection of fear-type emotions for surveillance applications. The emotional speech material used for this study comes from the previously collected SAFE Database (Situation Analysis in a Fictional and Emotional Database), which consists of audio-visual sequences extracted from movie fictions. A generic annotation scheme was developed to annotate the various emotional manifestations contained in the corpus. The annotation was carried out by two labellers and the two annotation strategies are confronted. It emerges that the borderline between emotion and neutral varies according to the labeller. An acoustic validation by a third labeller allows an analysis of the two strategies. Two human strategies are then observed: a first one, context-oriented, which mixes audio and contextual (video) information in emotion categorization; and a second one, based mainly on audio information. The k-means clustering confirms the role of audio cues in human annotation strategies. It particularly helps in evaluating those strategies from the point of view of a detection system based on audio cues. lrec06JCM.pdf [LREC 2006, Genoa, May 2006] --------- Manual Annotation and Automatic Image Processing of Multimodal Emotional Behaviours: Validating the Annotation of TV Interviews J. Martin, G. Caridakis, L. Devillers, K. Karpouzis, S. Abrilian ------------------------ There has been a lot of psychological research on emotion and nonverbal communication. Yet, these studies were based mostly on acted basic emotions. This paper explores how manual annotation and image processing can cooperate towards the representation of spontaneous emotional behaviour in low resolution videos from TV. We describe a corpus of TV interviews and the manual annotations that have been defined.
We explain the image processing algorithms that have been designed for the automatic estimation of movement quantity. Finally, we explore how image processing can be used for the validation of manual annotations. lrec06devil.pdf [LREC 2006, Genoa, May 2006] --------- Real life emotions in French and English TV video clips: an integrated annotation protocol combining continuous and discrete approaches L. Devillers, R. Cowie, J. Martin, E. Douglas-Cowie, S. Abrilian, M. Mcrorie ------------------------ A major barrier to the development of accurate and realistic models of human emotions is the absence of multi-cultural / multilingual databases of real-life behaviours and of a federative and reliable annotation protocol. The QUB and LIMSI teams are working towards the definition of an integrated coding scheme combining their complementary approaches. This multilevel integrated scheme combines the dimensions that appear to be useful for the study of real-life emotions: verbal labels, abstract dimensions and contextual (appraisal based) annotations. This paper describes this integrated coding scheme, a protocol that was set up for annotating French and English video clips of emotional interviews, and the results (e.g. inter-coder agreement measures and subjective evaluation of the scheme). lrec06TP.pdf [LREC 2006, Genoa, May 2006] --------- Experimental detection of vowel pronunciation variants in Amharic Thomas Pellegrini and Lori Lamel ------------------------ The pronunciation lexicon is a fundamental element in an automatic speech transcription system. It associates each lexical entry (usually a grapheme) with one or more phonemic or phone-like forms, the pronunciation variants. Thorough knowledge of the target language is a priori necessary to establish the pronunciation baseforms and variants. The reliance on human expertise can pose difficulties in developing a system for a language where such knowledge may not be readily available.
In this article a speech recognizer is used to help select pronunciation variants in Amharic, the official language of Ethiopia, focusing on alternate choices for vowels. This study is carried out using an audio corpus composed of 37 hours of speech from radio broadcasts which were orthographically transcribed by native speakers. Since the corpus is relatively small for estimating pronunciation variants, a first set of studies was carried out at a syllabic level. Word lexica were then constructed based on the observed syllable occurrences. Automatic alignments were compared for lexica containing different vowel variants, with both context-independent and context-dependent acoustic model sets. The variant2+ measure proposed in (Adda-Decker and Lamel, 1999) is used to assess the potential need for pronunciation variants. icassp06FL.pdf [ICASSP 2006, Toulouse, May 2006] --------- Discriminant Initialization for Factor Analyzed HMM Training Fabrice Lefevre and Jean-Luc Gauvain ------------------------ Factor analysis has recently been used to model the covariance of the feature vector in speech recognition systems. Maximum likelihood estimation of the parameters of factor analyzed HMMs (FAHMMs) is usually done via the EM algorithm, meaning that initial estimates of the model parameters are a key issue. In this paper we report on experiments showing some evidence that the use of a discriminative criterion to initialize the FAHMM maximum likelihood parameter estimation can be effective. The proposed approach relies on the estimation of a discriminant linear transformation to provide initial values for the factor loading matrices, as well as appropriate initializations for the other model parameters. Speech recognition experiments were carried out on the Wall Street Journal LVCSR task with a 65k vocabulary. Contrastive results are reported with various model sizes using discriminant and non-discriminant initialization.
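Both FAHMM abstracts above rest on the same structural idea: instead of a full or diagonal covariance, each Gaussian's covariance is factor-analyzed, Sigma = Lam Lam^T + Psi, with a low-rank loading matrix Lam and a diagonal Psi. A minimal numpy sketch of the resulting log-likelihood (variable names are mine, not from the papers; this is the covariance structure only, not the EM training or the discriminant initialization they describe):

```python
import numpy as np

def fa_loglik(x, mu, lam, psi):
    """Gaussian log-likelihood with a factor-analyzed covariance:
    Sigma = lam @ lam.T + diag(psi), where lam is d x k (k << d) and psi > 0."""
    d = len(mu)
    sigma = lam @ lam.T + np.diag(psi)  # low-rank plus diagonal covariance
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(sigma, diff))

# With a zero loading matrix the model degenerates to a diagonal-covariance Gaussian.
ll_diag = fa_loglik(np.zeros(2), np.zeros(2), np.zeros((2, 1)), np.ones(2))
```

In practice the appeal is parameter count: a d x k loading matrix plus d diagonal terms, instead of d(d+1)/2 full-covariance terms per state.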
icassp06LID.pdf [ICASSP 2006, Toulouse, May 2006] --------- Discriminative Classifiers for Language Recognition Christopher White, Izhak Shafran and Jean-Luc Gauvain ------------------------ Most language recognition systems consist of a cascade of three stages: (1) tokenizers that produce parallel phone streams, (2) phonotactic models that score the match between each phone stream and the phonotactic constraints in the target language, and (3) a final stage that combines the scores from the parallel streams appropriately [1]. This paper reports a series of contrastive experiments to assess the impact of replacing the second and third stages with large-margin discriminative classifiers. In addition, it investigates how sounds that are not represented in the tokenizers of the first stage can be approximated with composite units that utilize cross-stream dependencies obtained via multi-string alignments. This leads to a discriminative framework that can potentially incorporate a richer set of features such as prosodic and lexical cues. Experiments are reported on the NIST LRE 1996 and 2003 tasks and the results show that the new techniques give substantial gains over a competitive PPRLM baseline. icassp06arabic.pdf [ICASSP 2006, Toulouse, May 2006] --------- Arabic Broadcast News Transcription Using a One Million Word Vocalized Vocabulary Abdel. Messaoudi, Jean-Luc Gauvain and Lori Lamel ------------------------ Recently it has been shown that modeling short vowels in Arabic can significantly improve performance even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vocalized, the training methods have to overcome this missing data problem. For the acoustic models the procedure was bootstrapped with manually vocalized data and extended with semi-automatically vocalized data.
In order to also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vocalized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vocalized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved. LIMSI_rt06s_asr.pdf [Lecture Notes in Computer Science, Vol. 4299 - Proc. 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2006) Rich Transcription 2006 Spring Meeting Recognition Evaluation Workshop, May 2006] --------- The LIMSI RT06s Lecture Transcription System L. Lamel, E. Bilinski, G. Adda, J.L. Gauvain, H. Schwenk ------------------------ This paper describes recent research carried out in the context of the FP6 Integrated Project CHIL in developing a system to automatically transcribe lectures and presentations. Widely available corpora were used to train both the acoustic and language models, since only a small amount of CHIL data was available for system development. Acoustic model training made use of the transcribed portion of the TED corpus of Eurospeech recordings, as well as the ICSI, ISL, and NIST meeting corpora. For language model training, text materials were extracted from a variety of on-line conference proceedings.
Experimental results are reported for close-talking and far-field microphones on development and evaluation data. rt06s_limsi_diarization.pdf [Lecture Notes in Computer Science, Proc. CLEAR'06 Evaluation Campaign and Workshop - Classification of Events, Activities and Relationships, May 2006] --------- Speaker Diarization: from Broadcast News to Lectures X. Zhu, C. Barras, L. Lamel, J.L. Gauvain ------------------------ This paper presents the LIMSI speaker diarization system for lecture data, in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation. This system builds upon the baseline diarization system designed for broadcast news data. The baseline system combines agglomerative clustering based on the Bayesian information criterion with a second clustering using state-of-the-art speaker identification techniques. In the RT-04F evaluation, the baseline system provided an overall diarization error of 8.5% on broadcast news data. However, since it had a high missed speech error rate on lecture data, a different speech activity detection approach based on the log-likelihood ratio between speech and non-speech models trained on the seminar data was explored. The new speaker diarization system integrating this module provides an overall diarization error of 20.2% on the RT-06S Multiple Distant Microphone (MDM) data. CansecoASRU05.pdf [IEEE ASRU 2005, Cancun, November 2005] --------- A Comparative Study Using Manual and Automatic Transcriptions for Diarization Leonardo Canseco, Lori Lamel, Jean-Luc Gauvain ------------------------ This paper describes recent studies on speaker diarization from automatic broadcast news transcripts. Linguistic information revealing the true names of who speaks during a broadcast (the next, the previous and the current speaker) is detected by means of linguistic patterns. In order to associate the true speaker names with the speech segments, a set of rules is defined for each pattern.
Since the effectiveness of linguistic patterns for diarization depends on the quality of the transcription, the performance using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions. On about 150 hours of broadcast news data (295 shows) the global ratio of false identity association is about 13% for both the automatic and the manual transcripts. DechelotteASRU05.pdf [IEEE ASRU 2005, Cancun, November 2005] --------- Investigating Translation of Parliament Speeches D. Dechelotte, H. Schwenk, J.-L. Gauvain, O. Galibert, and L. Lamel ------------------------ This paper reports on recent experiments for speech to text (STT) translation of European Parliamentary speeches. A Spanish speech to English text translation system has been built using data from the TC-STAR European project. The speech recognizer is a state-of-the-art multipass system trained for the Spanish EPPS task and the statistical translation system relies on the IBM-4 model. First, MT results are compared using manual transcriptions and 1-best ASR hypotheses with different word error rates. Then, an n-best interface between the ASR and MT components is investigated to improve the STT process. Derivation of the fundamental equation for machine translation suggests that the source language model is not necessary for STT. This was investigated by using weak source language models and by n-best rescoring adding the acoustic model score only. A significant loss in the BLEU score was observed, suggesting that the source language model is needed given the insufficiencies of the translation model. Adding the source language model score in the n-best rescoring process recovers the loss and slightly improves the BLEU score over the 1-best ASR hypothesis. The system achieves a BLEU score of 37.3 with an ASR word error rate of 10% and a BLEU score of 40.5 using the manual transcripts.
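The n-best rescoring experiments just described amount to re-ranking hypotheses under a weighted sum of per-model log scores, with the source language model either included or dropped. A hedged sketch of that selection step (the score names, values and weights below are invented for illustration, not the LIMSI system's):

```python
def rescore_nbest(nbest, weights):
    """Log-linear n-best rescoring: each hypothesis carries per-model log scores
    (e.g. acoustic, source LM, translation model); return the best weighted sum."""
    return max(nbest, key=lambda h: sum(w * h[k] for k, w in weights.items()))

# Toy 2-best list with made-up component log scores.
nbest = [
    {"text": "hypothesis A", "acoustic": -120.0, "src_lm": -35.0, "tm": -50.0},
    {"text": "hypothesis B", "acoustic": -118.0, "src_lm": -42.0, "tm": -49.0},
]

# With the source LM included, A wins (-187.5 vs -188.0).
best = rescore_nbest(nbest, {"acoustic": 1.0, "src_lm": 0.5, "tm": 1.0})
```

Setting the `src_lm` weight to zero can flip the ranking, which mirrors the abstract's finding that dropping the source language model hurts and restoring its score recovers the loss.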
IS051735.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Different Size Multilingual Phone Inventories and Context-Dependent Acoustic Models for Language Identification Dong Zhu, Martine Adda-Decker, Fabien Antoine ------------------------ Experimental work using phonotactic and syllabotactic approaches for automatic language identification (LID) is presented. Various questions have motivated this research: what is the best choice for a multilingual phone inventory? Can a syllabic unit be of interest to extend the scope of the modeling unit? Are context-dependent (CD) acoustic models, widely used for speech recognition, able to improve LID accuracy? Can the multilingual acoustic models efficiently process additional languages which are different from the training languages? The LID system is experimentally studied using different sizes of multilingual phone sets: 73, 50 and 35 phones. Experiments are carried out on broadcast news in seven languages (German, English, Arabic, Mandarin, Spanish, French, and Italian) with 140 hours of audio data for training and 7 hours for testing. It is shown that smaller phone inventories achieve higher LID accuracy and that CD models outperform CI models. Further experiments have been conducted to test the generality of both the multilingual acoustic model and phonotactic methods on another 11+10 language corpus (11 known + 10 unknown languages). IS052391.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Do Speech Recognizers Prefer Female Speakers? Martine Adda-Decker, Lori Lamel ------------------------ In this contribution we examine large speech corpora of prepared broadcast and spontaneous telephone speech in American English and in French. Starting with the question whether ASR systems behave differently on male and female speech, we then try to find evidence on acoustic-phonetic, lexical and idiomatic levels to explain the observed differences.
Recognition results have been analysed on 3-7 hours of speech in each language and speech type condition (totaling 20 hours). Results consistently show a lower word error rate on female speech, ranging from 0.7 to 7% depending on the condition. An analysis of automatically produced pronunciations in speech training corpora (totaling 4000 hours of speech) revealed that female speakers tend to stick more consistently to standard pronunciations than male speakers. Concerning speech disfluencies, male speakers show larger proportions of filled pauses and repetitions, as compared to females. IS051976.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Perceptual Salience of Language-Specific Acoustic Differences in Autonomous Fillers Across Eight Languages Ioana Vasilescu, Maria Candea, Martine Adda-Decker ------------------------ Are acoustic differences in autonomous fillers salient for human perception? Acoustic measurements have been carried out on autonomous fillers from eight languages (Arabic, Mandarin Chinese, French, German, Italian, European Portuguese, American English and Latin American Spanish). They exhibit timbre differences of the support vowel of autonomous fillers across languages. In order to evaluate their salience for human perception, two discrimination experiments have been conducted, French/L2 and Portuguese/L2. These experiments test the capacity to discriminate languages by listening to isolated autonomous fillers without any lexical support and without any context. As the listeners were native French speakers, the Portuguese/L2 experiments aim at evaluating a potential mother tongue bias. Results show that the perceptual discrimination performance depends on the language pair identity, and they are consistent with the acoustic measures. The present study aims at providing some insight for acoustic filler models used in automatic speech processing in a multilingual context.
IS052010.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Semantic Annotation of the French Media Dialog Corpus H. Bonneau-Maynard, S. Rosset, C. Ayache, A. Kuhn, D. Mostefa ------------------------ The French Technolangue MEDIA-EVALDA project aims to evaluate spoken understanding approaches. This paper describes the semantic annotation scheme of a common dialog corpus which will be used for developing and evaluating spoken understanding models and for linguistic studies. A common semantic representation has been formalized and agreed upon by the consortium. Each utterance is divided into semantic segments and each segment is annotated with a 5-tuple containing the mode, the attribute name representing the underlying concept, the normalized form of the attribute, a list of related segments, and an optional comment about the annotation. Periodic inter-annotator agreement studies demonstrate that the annotations are of good quality, with an agreement of almost 90% on mode and attribute identification. An analysis of the semantic content of 12292 annotated client utterances shows that only 14.1% of the observed attributes are domain-dependent and that the semantic dictionary ensures a good coverage of the task. IS052349.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Multi-Level Information and Automatic Dialog Acts Detection in Human-Human Spoken Dialogs Sophie Rosset, Delphine Tribout ------------------------ This paper reports on our experience in annotating and automatically detecting dialog acts in human-human spoken dialog. Our work is based on three hypotheses: first, the dialog act succession is strongly constrained; second, the initial word and the semantic class of a word are more important than the exact word in identifying the dialog act; third, information is encoded in specific entities. We also used historical information in order to account for the dialogical structure. A memory-based learning approach is used to detect dialog acts.
Experiments have been conducted using different kinds of information levels. In order to verify our hypotheses, the model trained on a French corpus was tested on an English corpus for a similar task and on a French corpus from a different domain. A correct dialog act detection rate of about 87% is obtained for the same domain/language condition and 80% for the cross-language and cross-domain conditions. IS052393.PDF [InterSpeech 2005, Lisbon, September 2005] --------- An Open-Domain, Human-Computer Dialog System Olivier Galibert, Gabriel Illouz, Sophie Rosset ------------------------ The Ritel project aims at integrating a spoken language dialog system and an open-domain question answering system to allow a human to ask general questions (``Who is currently presiding over the Senate?'') and refine the search interactively. At this point in time the Ritel platform is being used to collect a human-computer dialog corpus. The user can receive factual answers to some questions (Q: Who is the president of France? A: Jacques Chirac has been the president of France since May 1995). This paper briefly presents the current system, the collected corpus, the problems encountered by such a system and our first answers to these problems. When the system is more advanced, it will allow measuring the benefit of integrating a dialog system into a QA system. Does allowing such a dialog really enable the user to reach the ``right'' answer faster and more precisely? IS052175.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Detection of Real-Life Emotions in Call Centers Laurence Vidrascu, Laurence Devillers ------------------------ In contrast to most previous studies, conducted on artificial data with archetypal emotions, this paper addresses some of the challenges faced when studying real-life non-basic emotions. A 10-hour dialog corpus recorded in a French medical emergency call center has been studied.
In this paper, our aim is twofold: to study emotion mixtures in natural data and to achieve a high level of performance in real-life emotion detection. An annotation scheme using two labels per segment has been used for representing emotion mixtures. A closer study of these mixtures has been carried out, revealing the presence of emotions with conflicting valence. A correct detection rate of about 82% was obtained between Negative and Positive emotions using paralinguistic cues, without taking into account these conflicting blended emotions. A perceptual study and an analysis of the confusions obtained with automatic detection validate the annotation scheme with two emotion labels per segment. IS052018.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Multimodal Databases of Everyday Emotion: Facing up to Complexity Ellen Douglas-Cowie, Laurence Devillers, Jean-Claude Martin, Roddy Cowie, Suzie Savvidou, Sarkis Abrilian, Cate Cox ----------------------- In everyday life, speech is part of a multichannel system involved in conveying emotion. Understanding how it operates in that context requires suitable data, consisting of multimodal records of emotion drawn from everyday life. This paper reflects the experience of two teams active in collecting and labelling data of this type. It sets out the core reasons for pursuing a multimodal approach, reviews issues and problems for developing relevant databases, and indicates how we can move forward both in terms of data collection and approaches to labelling. IS051589.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Transcribing Lectures and Seminars Lori Lamel, Gilles Adda, Eric Bilinski, Jean-Luc Gauvain ----------------------- This paper describes recent research carried out in the context of the FP6 Integrated Project CHIL in developing a system to automatically transcribe lectures and seminars.
We made use of widely available corpora to train both the acoustic and language models, since only a small amount of CHIL data was available for system development. For acoustic model training, we made use of the transcribed portion of the TED corpus of Eurospeech recordings, as well as the ICSI, ISL, and NIST meeting corpora. For language model training, text materials were extracted from a variety of on-line conference proceedings. Word error rates of about 25% are obtained on test data extracted from 12 seminars. IS052002.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Diachronic Vocabulary Adaptation for Broadcast News Transcription Alexandre Allauzen, Jean-Luc Gauvain ----------------------- This article investigates the use of Internet news sources to automatically adapt the vocabulary of a French and an English broadcast news transcription system. A specific method is developed to gather training, development and test corpora from selected websites, normalizing them for further use. A vectorial vocabulary adaptation algorithm is described which interpolates word frequencies estimated on adaptation corpora to directly maximize lexical coverage on a development corpus. To test the generality of this approach, experiments were carried out simultaneously in French and in English (UK) on a daily basis for the month of May 2004. In both languages, the OOV rate is reduced by more than half. IS052366.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Building Continuous Space Language Models for Transcribing European Languages Holger Schwenk, Jean-Luc Gauvain ----------------------- Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor in this success is the availability of large amounts of acoustic and language model training data.
In this paper the recognition of French Broadcast News and of English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, thereby allowing smooth interpolation. Word error reductions of up to 0.9% absolute are reported with respect to a carefully tuned backoff language model trained on the same data. IS051895.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Impact of Duration on F1/F2 Formant Values of Oral Vowels: An Automatic Analysis of Large Broadcast News Corpora in French and German Cedric Gendrot, Martine Adda-Decker ----------------------- Formant values of oral vowels are automatically measured in a total of 50,000 segments from four hours of journalistic broadcast speech in French and German. After automatic segmentation using the LIMSI speech alignment system, formant values are automatically extracted using PRAAT. Vowels with unlikely formant values are discarded (4%). The measured values are analyzed with respect to segment duration and language identity. The gap between the measured mean Fi values and the reference Fi values is inversely proportional to vowel duration: a tendency toward reduction for vowels of short duration clearly emerges in both languages. These results are briefly discussed in terms of centralisation and coarticulation. IS051821.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Combining Speaker Identification and BIC for Speaker Diarization Xuan Zhu, Claude Barras, Sylvain Meignier, Jean-Luc Gauvain ----------------------- This paper describes recent advances in speaker diarization obtained by incorporating a speaker identification step. This system builds upon the LIMSI baseline data partitioner used in the broadcast news transcription system.
This partitioner provides high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. Several improvements to the baseline system have been made. Firstly, a standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering. Then a second clustering stage has been added, using a speaker identification method with MAP-adapted GMMs. A final post-processing stage refines the segment boundaries using the output of the transcription system. On the RT-04F and ESTER evaluation data, the improved multi-stage system provides between 40% and 50% reduction of the speaker error, relative to a standard BIC clustering system. IS052392.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Where Are We in Transcribing French Broadcast News? Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Alexandre Allauzen, Veronique Gendner, Lori Lamel, Holger Schwenk ----------------------- Given the highly inflectional nature of the French language, transcribing French broadcast news (BN) is more challenging than English BN. This is in part due to the large number of homophones among the inflected forms. This paper describes advances in automatic processing of broadcast news speech in French based on recent improvements to the LIMSI English system. The main differences between the English and French BN systems are: a 200k vocabulary to overcome the lower lexical coverage in French (including contextual pronunciations to model liaisons), a case-sensitive language model, and the use of a POS-based language model to lower the impact of homophonic gender and number disagreement. The resulting system was evaluated in the first French TECHNOLANGUE-ESTER ASR benchmark test. This system achieved the lowest word error rate in this evaluation by a significant margin. We also report on a 1xRT version of this system.
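The BIC agglomerative clustering used in the diarization work above decides whether to merge two clusters by comparing single full-covariance Gaussian models. As an illustration only, here is a minimal sketch of the standard local ΔBIC merge criterion; the function names and the penalty weight `lam` are assumptions for this sketch, not details taken from the papers:

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Local BIC criterion for two clusters of feature frames (n x d arrays).

    Each cluster is modeled by a single Gaussian with full covariance;
    a positive delta favors keeping the clusters separate.
    """
    n1, d = x1.shape
    n2, _ = x2.shape
    pooled = np.vstack([x1, x2])

    def logdet_cov(frames):
        # Log-determinant of the sample covariance matrix
        _, logdet = np.linalg.slogdet(np.cov(frames, rowvar=False))
        return logdet

    # Log-likelihood gain from modeling the two clusters separately
    gain = 0.5 * ((n1 + n2) * logdet_cov(pooled)
                  - n1 * logdet_cov(x1) - n2 * logdet_cov(x2))
    # Complexity penalty: extra parameters of one full-covariance Gaussian
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return gain - penalty  # merge the clusters when delta_bic(...) <= 0
```

In an agglomerative loop, the pair with the lowest ΔBIC is merged repeatedly until no pair has a value below zero.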
IS051574.PDF [InterSpeech 2005, Lisbon, September 2005] --------- The 2004 BBN/LIMSI 20xRT English Conversational Telephone Speech Recognition System R. Prasad, S. Matsoukas, C.-L. Kao, J.Z. Ma, D.-X. Xu, T. Colthurst, O. Kimball, R. Schwartz, J.L. Gauvain, L. Lamel, H. Schwenk, G. Adda, F. Lefevre ----------------------- In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on a Pentium 4 Xeon 3.4 GHz processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites. IS051588.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Modeling Vowels for Arabic BN Transcription Abdel Messaoudi, Lori Lamel, Jean-Luc Gauvain ----------------------- This paper presents a method for improving acoustic model precision by incorporating wide phonetic context units in speech recognition. The wide phonetic context model is constructed from several narrower context-dependent models within a Bayesian framework. Such a composition is performed in order to avoid the crucial problem of limited availability of training data and to reduce the model complexity. To enhance model reliability in the face of unseen contexts and limited training data, flooring and deleted interpolation techniques are used. Experimental results show that this method improves word accuracy with respect to the standard triphone model.
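The abstract above mentions flooring and deleted interpolation as remedies for unseen contexts and sparse training data. A toy sketch of both ideas follows; the function names, the count-based weight, and the constant `k` are illustrative assumptions, not the paper's actual formulation:

```python
def deleted_interpolation(cd_prob, ci_prob, count, k=10.0):
    """Smooth a context-dependent estimate with a context-independent backoff.

    The interpolation weight grows with the amount of training data behind
    the context-dependent estimate; k (illustrative) controls how fast.
    """
    lam = count / (count + k)
    return lam * cd_prob + (1 - lam) * ci_prob

def floor_probs(probs, floor=1e-4):
    """Flooring: keep every probability above a small minimum, then renormalize."""
    floored = [max(p, floor) for p in probs]
    total = sum(floored)
    return [p / total for p in floored]
```

With zero observations the smoothed estimate falls back entirely to the context-independent probability, and with abundant data it approaches the context-dependent one.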
Allauzen_ICASSP05.pdf [ICASSP 2005, Philadelphia, March 2005] --------- Open Vocabulary ASR for Audiovisual Document Indexation ----------------------- This paper reports on an investigation of an open vocabulary recognizer that allows new words to be introduced into the recognition vocabulary without the need to retrain or adapt the language model. This method uses special word classes, whose n-gram probabilities are estimated during the training process by discounting a mass of probability from the out-of-vocabulary words. A part-of-speech tagger is used to determine the word classes during language model training and for vocabulary adaptation. Metadata information provided by the French audiovisual archive institute is used to identify important document-specific missing words, which are added to the appropriate word class in the system vocabulary. Pronunciations for the new words are derived by grapheme-to-phoneme conversion. On over 3 hours of broadcast news data, this approach leads to a reduction of 0.35% in the OOV rate and of 0.6% in the word error rate, with 80% of the occurrences of the newly introduced words being correctly recognized. Lamel_ICASSP05.pdf [ICASSP 2005, Philadelphia, March 2005] --------- Alternate Phone Models for Conversational Speech Lori Lamel and Jean-Luc Gauvain ----------------------- This paper investigates the use of alternate phone models for the transcription of conversational telephone speech. The challenges of transcribing conversational speech are many, and recent research has focused mainly on the acoustic and linguistic levels. The focus of this work is to explore alternative ways of modeling different manners of speaking so as to better cover the observed articulatory styles and pronunciation variants. Four alternate phone sets are compared, ranging from 38 to 129 units. Two of the phone sets make use of syllable-position dependent phone models.
The acoustic models were trained on 2300 hours of conversational telephone speech data from the Switchboard and Fisher corpora, and experimental results are reported on the EARS Dev04 test set, which contains 3 hours of speech from 36 Fisher conversations. While no one particular phone set was found to outperform the others for a majority of speakers, the best overall performance was obtained with the original 48 phone set and a reduced 38 phone set; however, combining the hypotheses of the individual models reduces the word error rate from 17.5% (original phone set) to 16.8%. rt04f_diarization.pdf [RT'04, Palisades, November 2004] --------- Improving Speaker Diarization Claude Barras, Xuan Zhu, Sylvain Meignier, and Jean-Luc Gauvain ----------------------- This paper describes the LIMSI speaker diarization system used in the RT-04F evaluation. The RT-04F system builds upon the LIMSI baseline data partitioner, which is used in the broadcast news transcription system. This partitioner provides high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. In the RT-03S evaluation the baseline partitioner had a 24.5% diarization error rate. Several improvements to the baseline diarization system have been made. A standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering; a local BIC criterion is used for comparing single Gaussians with full covariance matrices. A second clustering stage has been added, making use of a speaker identification method: maximum a posteriori adaptation of a reference GMM with 128 Gaussians. A final post-processing stage refines the segment boundaries using the output of the transcription system. Compared to the best-configuration baseline system for this task, the improved system reduces the speaker error time by over 75% on the development data.
On evaluation data, an 8.5% overall diarization error rate was obtained, a 60% reduction in error compared to the baseline. rt04f_limsi_lm.pdf [RT'04, Palisades, November 2004] --------- Using neural network language models for LVCSR Holger Schwenk and Jean-Luc Gauvain ----------------------- In this paper we describe how we used a neural network language model for the BN and CTS tasks in the RT04 evaluation. This new language modeling approach performs the estimation of the n-gram probabilities in a continuous space, thereby allowing probability interpolation for any context, without the need to back off to shorter contexts. Details are given on training data selection, parameter estimation, and fast training and decoding algorithms. The neural network language model achieved word error reductions of 0.5% for the CTS task and of 0.3% for the BN task with an additional decoding cost of 0.05xRT. rt04f_bbn_limsi_eng_bn10xrt.pdf [RT'04, Palisades, November 2004] --------- The 2004 BBN/LIMSI 10xRT English Broadcast News Transcription System Long Nguyen, Sherif Abdou, Mohamed Afify, John Makhoul, Spyros Matsoukas, Richard Schwartz, Bing Xiang, Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Holger Schwenk, Fabrice Lefevre ----------------------- This paper describes the 2004 BBN/LIMSI 10xRT English Broadcast News (BN) transcription system, which uses a tightly integrated combination of components from the BBN and LIMSI speech recognition systems. The integrated system uses both cross-site adaptation and system combination via ROVER, obtaining a word hypothesis that is better than that produced by either system alone, while remaining within the allotted time limit. The system configuration used for the evaluation has two components from each site and two ROVER combinations, and achieved a word error rate (WER) of 13.9% on the Dev04f set and 9.3% on the Dev04 set selected to match the progress set.
Compared to last year's system, this represents a relative WER reduction of around 30%. rt04f_bbn_limsi_eng_cts20xrt.pdf [RT'04, Palisades, November 2004] --------- The 2004 BBN/LIMSI 20xRT English Conversational Telephone Speech System R. Prasad, S. Matsoukas, C.-L. Kao, J. Ma, D.-X. Xu, T. Colthurst, G. Thattai, O. Kimball, R. Schwartz, J.-L. Gauvain, L. Lamel, H. Schwenk, G. Adda, and F. Lefevre ----------------------- In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on a Pentium 4 Xeon 3.4 GHz processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites. rt04f_limsi_altphone.pdf [RT'04, Palisades, November 2004] --------- Alternate phone models for CTS Lori Lamel and Jean-Luc Gauvain ----------------------- This paper investigates the use of alternate phone models for the transcription of conversational telephone speech. The challenges of transcribing conversational speech are many, and recent research has focused mainly on the acoustic and linguistic levels. The focus of this work is to explore alternative ways of modeling different manners of speaking so as to better cover the observed articulatory styles and pronunciation variants. Four alternate phone sets are compared, ranging from 38 to 129 units. Two of the phone sets make use of syllable-position dependent phone models.
The acoustic models were trained on 2300 hours of conversational telephone speech data from the Switchboard and Fisher corpora, and experimental results are reported on the EARS Dev04 test set, which contains 3 hours of speech from 36 Fisher conversations. While no one particular phone set was found to outperform the others for a majority of speakers, the best overall performance was obtained with the original 48 phone set and a reduced 38 phone set; however, combining the hypotheses of the individual models reduces the word error rate from 17.5% (original phone set) to 16.8%. rt04f_limsi_sttspkr.pdf [RT'04, Palisades, November 2004] --------- Towards using STT for Broadcast News Speaker Diarization Leonardo Canseco-Rodriguez, Lori Lamel, and Jean-Luc Gauvain ----------------------- The aim of this study is to investigate the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments. While speaker recognition has been an active area of research for many years, addressing the problem of identifying speakers in huge audio corpora is relatively recent and has mainly been concerned with speaker tracking. The speech transcriptions contain a wealth of linguistic information that is useful for speaker diarization. Patterns which can be used to identify the current, previous or next speaker have been developed based on the analysis of 150 hours of manually transcribed broadcast news data. Each pattern is associated with one or more rules to assign speaker identities. After validation on the training transcripts, these patterns and rules were tested on an independent data set containing transcripts of 9 hours of broadcasts, and a speaker diarization error rate of about 11% was obtained. Future work will validate the approach on automatically generated transcripts and also combine the linguistic information with information derived from the acoustic level.
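The transcript-based diarization work above associates patterns in the transcripts (naming the current, previous or next speaker) with rules that assign speaker identities. A minimal sketch of the idea follows; the specific regular expressions and rule set here are invented examples, not the patterns actually derived from the 150 hours of broadcast news:

```python
import re

# Hypothetical linguistic patterns; the real ones were learned from
# manually transcribed broadcast news data.
NEXT_SPEAKER_PATTERNS = [
    re.compile(r"over to (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) reports"),
]
SELF_PATTERNS = [
    re.compile(r"I'm (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
]

def label_segments(transcripts):
    """Assign speaker identities to transcript segments via pattern rules."""
    labels = [None] * len(transcripts)
    for i, text in enumerate(transcripts):
        for pat in SELF_PATTERNS:
            m = pat.search(text)
            if m:
                labels[i] = m.group("name")  # rule: names the current speaker
        for pat in NEXT_SPEAKER_PATTERNS:
            m = pat.search(text)
            if m and i + 1 < len(transcripts):
                labels[i + 1] = m.group("name")  # rule: names the next speaker
    return labels
```

For example, a segment ending "Now over to John Smith at the scene." would label the following segment with the identity "John Smith".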
odyssey04_barras.pdf [Odyssey'04, Toledo, May-June 2004] --------- Unsupervised Online Adaptation for Speaker Verification over the Telephone Claude Barras, Sylvain Meignier and Jean-Luc Gauvain ----------------------- This paper presents experiments on unsupervised adaptation for a speaker detection system. The system used is a standard speaker verification system based on cepstral features and Gaussian mixture models. Experiments were performed on cellular speech data taken from the NIST 2002 speaker detection evaluation. There was a total of about 30,000 trials involving 330 target speakers, with more than 90% of them impostor trials. Unsupervised adaptation significantly increases the system accuracy, with a reduction of the minimal detection cost function (DCF) from 0.33 for the baseline system to 0.25 with unsupervised online adaptation. Two incremental adaptation modes were tested: either using a fixed decision threshold for adaptation, or using the a posteriori probability of the true target to weight the adaptation. Both methods provide similar results in their best configurations, but the latter is less sensitive to the actual threshold value. lrec04_barras.pdf [LREC'04, Lisbon, May 2004] --------- Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts Claude Barras, Gilles Adda, Martine Adda-Decker, Benoit Habert, Philippe Boula de Mareuil and Patrick Paroubek ----------------------- The present study focuses on the automatic processing of sibling resources of audio and written documents, such as are available in audio archives or for parliamentary debates: written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material and they yield low-cost resources for acoustic model training.
When automatically transcribing the audio data, regions of agreement between automatic transcripts and written sources allow time-codes to be transferred to the written documents: this may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus was then transcribed using the LIMSI speech recognizer, resulting in automatic transcripts exhibiting an average word error rate of 12%. 80% of the text corpus (with word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data. The residual word error rate on these 80% is less than 1%. lrec04_barras.pdf [LREC'04, Lisbon, May 2004] --------- Reliability of Lexical and Prosodic Cues in two Real-life Spoken Dialog Corpora Laurence Devillers and Laurence Vidrascu ----------------------- The present research focuses on analyzing and detecting emotions in speech as revealed by task-dependent spoken dialog corpora. Previously, we conducted several experiments on a real-life corpus in order to develop a reliable annotation method and to detect lexical and prosodic cues correlated with the main emotion classes. In this paper we evaluate the robustness of both the annotation scheme and the lexical and prosodic cues by testing them on a new corpus. This work is carried out in the context of the Amities project, in which spoken dialog systems for call center services are being developed. icassp04_chen.pdf [ICASSP'04, Montreal, May 2004] --------- Lightly supervised acoustic model training using consensus networks Langzhou Chen, Lori Lamel and Jean-Luc Gauvain ----------------------- This paper presents some recent work on using consensus networks to improve lightly supervised acoustic model training for the LIMSI Mandarin BN system.
Lightly supervised acoustic model training has been attracting growing interest, since it can help to substantially reduce the development costs for speech recognition systems. Compared to supervised training with accurate transcriptions, the key problem in lightly supervised training is getting the approximate transcripts to be as close as possible to manually produced detailed ones, i.e. finding a proper way to provide the information for supervision. Previous work using a language model to provide supervision has been quite successful. This paper extends the original method, presenting a new way to get the information needed for supervision during training. Studies are carried out using the TDT4 Mandarin audio corpus and associated closed-captions. After automatically recognizing the training data, the closed-captions are aligned with a consensus network derived from the hypothesized lattices. As is the case with closed-caption filtering, this method can remove speech segments whose automatic transcripts contain errors, but it can also recover errors in the hypothesis if the information is present in the lattice. Experimental results show that, compared with simply training on all of the data, consensus-network-based lightly supervised acoustic model training results in a small reduction in the character error rate on the DARPA/NIST RT 03 development and evaluation data. icassp04_jlg.pdf [ICASSP'04, Montreal, May 2004] --------- Speech recognition in multiple languages and domains: The 2003 BBN/LIMSI EARS system Richard Schwartz, Thomas Colthurst, Nicolae Duta, Herb Gish, Rukmini Iyer, Chia-Lin Kao, Daben Liu, Owen Kimball, J. Ma, John Makhoul, Spyros Matsoukas, Long Nguyen, Mohamed Noamany, Rohit Prasad, Bing Xiang, Dongxin Xu, Jean-Luc Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda and Langzhou Chen ----------------------- We report on the results of the first evaluations of the BBN/LIMSI system under the new DARPA EARS Program.
The evaluations were carried out for conversational telephone speech (CTS) and broadcast news (BN) in three languages: English, Mandarin, and Arabic. In addition to providing system descriptions and evaluation results, the paper highlights methods that worked well across the two domains and the few that worked well in one domain but not the other. For the BN evaluations, which had to be run in under 10 times real-time, we demonstrated that a joint BBN/LIMSI system under that time constraint achieved better results than either system alone. icassp04_ll.pdf [ICASSP'04, Montreal, May 2004] --------- Speech Transcription in Multiple Languages Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Leonard Canseco, Langzhou Chen, Olivier Galibert, Abdel Messaoudi and Holger Schwenk ----------------------- This paper summarizes recent work underway at LIMSI on speech-to-text transcription in multiple languages. The research has been oriented towards the processing of broadcast audio and conversational speech for information access. Broadcast news transcription systems have been developed for seven languages, and it is planned to address several other languages in the near term. Research on conversational speech has mainly focused on the English language, with initial work on French, Arabic and Spanish. Automatic processing must take into account the characteristics of the audio data, such as the need to deal with a continuous data stream, the specificities of the language, and the use of an imperfect word transcription for accessing the information content. Our experience thus far indicates that at today's word error rates, the techniques used in one language can be successfully ported to other languages, and most of the language specificities concern lexical and pronunciation modeling.
ThC3202o.5_p1215.pdf [ICSLP'04, Jeju Island, October 2004] --------- Neural Network Language Models for Conversational Speech Recognition Holger Schwenk and Jean-Luc Gauvain ----------------------- Recently there has been growing interest in using neural networks for language modeling. In contrast to the well-known backoff n-gram language models (LM), the neural network approach tries to limit problems arising from data sparseness by performing the estimation in a continuous space, thereby allowing smooth interpolation. This type of LM is therefore interesting for tasks for which only a very limited amount of in-domain training data is available, such as the modeling of conversational speech. In this paper we analyze the generalization behavior of the neural network LM for in-domain training corpora varying from 7M to over 21M words. In all cases, significant word error reductions were observed compared to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the NIST rich transcription evaluations. We also apply ensemble learning methods and discuss their connections with LM interpolation. TuA2301o.1_p1283.pdf [ICSLP'04, Jeju Island, October 2004] --------- Language Recognition Using Phone Lattices Jean-Luc Gauvain, Abdel Messaoudi and Holger Schwenk ----------------------- This paper proposes a new phone-lattice-based method for automatic language recognition from speech data. By using phone lattices, some approximations usually made by language identification (LID) systems relying on phonotactic constraints to simplify the training and decoding processes can be avoided. We demonstrate that using phone lattices in both training and testing significantly improves the accuracy of a phonotactically based LID system. Performance is further enhanced by using a neural network to combine the results of multiple phone recognizers.
Using three phone recognizers with context-independent phone models, the system achieves an equal error rate of 2.7% on the Eval03 NIST detection test (30s segments, primary condition) with an overall decoding process that runs faster than real-time (0.5xRT). TuB401o.2_p540.pdf [ICSLP'04, Jeju Island, October 2004] --------- Automatic Detection of Dialog Acts Based on Multi-level Information Sophie Rosset and Lori Lamel ----------------------- Recently there has been growing interest in using dialog acts to characterize human-human and human-machine dialogs. This paper reports on our experience in the annotation and automatic detection of dialog acts in human-human spoken dialog corpora. Our work is based on two hypotheses: first, word position is more important than the exact word in identifying the dialog act; and second, there is a strong grammar constraining the sequence of dialog acts. A memory-based learning approach has been used to detect dialog acts. In a first set of experiments the number of utterances per turn is known, and in a second set, the number of utterances is hypothesized using a language model for utterance boundary detection. In order to verify our first hypothesis, the model trained on a French corpus was tested on a corpus for a similar task in English and on a second French corpus from a different domain. A correct dialog act detection rate of about 84% is obtained for the same domain and language condition and about 75% for the cross-language and cross-domain conditions. TuC2105o.3_p1272.pdf [ICSLP'04, Jeju Island, October 2004] --------- Speaker Diarization from Speech Transcripts Leonardo Canseco-Rodriguez, Lori Lamel and Jean-Luc Gauvain ----------------------- The aim of this study is to investigate the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments.
While speaker recognition has been an active area of research for many years, addressing the problem of identifying speakers in huge audio corpora is relatively recent and has mainly been concerned with speaker tracking. The speech transcriptions contain a wealth of linguistic information that is useful for speaker diarization. Patterns which can be used to identify the current, previous or next speaker have been developed based on the analysis of 150 hours of manually transcribed broadcast news data. Each pattern is associated with one or more rules. After validation on the training transcripts, these patterns and rules were tested on an independent data set containing transcripts of 10 hours of broadcasts. WeA3203p.2_p1281.pdf [ICSLP'04, Jeju Island, October 2004] --------- Dynamic Language Modeling for Broadcast News Langzhou Chen, Jean-Luc Gauvain, Lori Lamel and Gilles Adda ----------------------- This paper describes some recent experiments on unsupervised language model adaptation for the transcription of broadcast news data. In previous work, a framework for automatically selecting adaptation data using information retrieval techniques was proposed. This work extends the method and presents experimental results with unsupervised language model adaptation. Three primary aspects are considered: (1) the performance of 5 widely used LM adaptation methods using the same adaptation data is compared; (2) the influence of the temporal distance between the training and test data epochs on the adaptation efficiency is assessed; and (3) show-based language model adaptation is compared with story-based language model adaptation. Experiments have been carried out for broadcast news transcription in English and Mandarin Chinese. A relative word error rate reduction of 4.7% was obtained in English and a 5.6% relative character error rate reduction in Mandarin with story-based MDI adaptation.
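Among the widely used LM adaptation methods compared in the abstract above, the simplest is linear interpolation of a background model with a model estimated on the adaptation data, with the weight tuned on development text. A toy unigram sketch follows (the real systems use n-gram models, and MDI instead reweights the background model multiplicatively; all names and constants here are illustrative assumptions):

```python
import math
from collections import Counter

def unigram_probs(words, vocab, alpha=0.5):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(words)
    total = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(background, adapted, lam):
    """Static linear interpolation of two word-probability models."""
    return {w: lam * adapted[w] + (1 - lam) * background[w] for w in background}

def perplexity(model, words):
    return math.exp(-sum(math.log(model[w]) for w in words) / len(words))

def tune_lambda(background, adapted, dev_words,
                grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the interpolation weight minimizing development-set perplexity."""
    return min(grid,
               key=lambda lam: perplexity(interpolate(background, adapted, lam),
                                          dev_words))
```

When the development text matches the adaptation data, the tuned weight shifts toward the adapted model; when it matches the background training epoch, it shifts back.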
ThA2001o.2_p521.pdf [ICSLP'04, Jeju Island, October 2004] --------- Transcription of Arabic Broadcast News Abdel Messaoudi, Lori Lamel and Jean-Luc Gauvain ----------------------- This paper describes recent research on transcribing Modern Standard Arabic broadcast news data. The Arabic language presents a number of challenges for speech recognition, arising in part from the significant differences between the spoken and written forms, in particular that texts are conventionally non-vowelized. Arabic is a highly inflected language where articles and affixes are added to roots in order to change the word's meaning. A corpus of 50 hours of audio data from 7 television and radio sources and 200 M words of newspaper texts were used to train the acoustic and language models. The transcription system based on these models and a vowelized dictionary obtains an average word error rate of about 18% on a test set comprised of 12 hours of data from 8 sources. ThC2701o.5_p1105.pdf [ICSLP'04, Jeju Island, October 2004] --------- Fiction Database for Emotion Detection in Abnormal Situations Chloe Clavel and Ioan Vasilescu and Laurence Devillers and Thibault Ehrette ----------------------- The present research focuses on the acquisition and annotation of vocal resources for emotion detection. We are interested in detecting emotions occurring in abnormal situations and particularly in detecting fear. The present study considers a preliminary database of audiovisual sequences extracted from fiction films. The selected sequences provide various manifestations of the target emotions and are described with a multimodal annotation tool. We focus on audio cues in the annotation strategy and use the video as support for validating the audio labels. The present article describes the methodology of data acquisition and annotation. The annotation is validated via two perceptual paradigms in which the +/-video condition in stimuli presentation varies.
We show the perceptual significance of the audio cues and the presence of target emotions. asru03ma.pdf [IEEE ASRU 2003, St Thomas, November 2003] --------- Transcribing Mandarin Broadcast News Langzhou Chen, Lori Lamel and Jean-Luc Gauvain ----------------------- This paper describes improvements to the LIMSI broadcast news transcription system for the Mandarin language in preparation for the DARPA/NIST Rich Transcription 2003 (RT'03) evaluation. The transcription system has been substantially updated to deal with the varied acoustic and linguistic characteristics of the RT'03 test conditions. The major improvements come from the use of lightly supervised acoustic model training in order to benefit from unannotated audio data, the use of source specific language models, and MDI adaptation to tune the language models for sources with limited amounts of training data. The character error rate on the development data has been reduced from 34.5% with the baseline system to 22.6% with the evaluation system. ES031038.PDF [ISCA Eurospeech'03, Geneva, September 2003] --------- The 300k LIMSI German Broadcast News Transcription System Kevin McTait and Martine Adda-Decker ----------------------- This paper describes important improvements to the existing LIMSI German broadcast news transcription system, especially its extension from a 65k vocabulary to 300k words. Automatic speech recognition for German is more problematic than for a language such as English in that the inflectional morphology of German and its highly generative process of compounding lead to many more out of vocabulary words for a given vocabulary size. Experiments undertaken to tackle this problem and reduce the transcription error rate include bringing the language models up to date, improved pronunciation models, semi-automatically constructed pronunciation lexicons and increasing the size of the system's vocabulary. 
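The out-of-vocabulary problem that motivates the 300k vocabulary extension above is straightforward to quantify: count the running words of a corpus not covered by a given word list. A minimal sketch, with an invented mini-corpus standing in for the German data (note how the compound "hausboot" is OOV until it is added to the word list):

```python
def oov_rate(corpus_tokens, word_list):
    """Fraction of running words in the corpus not covered by the word list."""
    vocab = set(word_list)
    misses = sum(1 for w in corpus_tokens if w not in vocab)
    return misses / len(corpus_tokens)

corpus = ["das", "haus", "hausboot", "boot", "hausboot", "tür"]
small_vocab = ["das", "haus", "boot", "tür"]
large_vocab = small_vocab + ["hausboot"]

print(oov_rate(corpus, small_vocab))  # compounds like "hausboot" are OOV
print(oov_rate(corpus, large_vocab))  # enlarging the word list reduces the OOV rate
```

This is the metric behind the paper's observation that German's compounding yields many more OOV words than English for a given vocabulary size.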
ES030250.PDF [ISCA Eurospeech'03, Geneva, September 2003] --------- A corpus-based decompounding algorithm for German lexical modeling in LVCSR Martine Adda-Decker ----------------------- In this paper a corpus-based decompounding algorithm is described and applied to German LVCSR. The decompounding algorithm addresses two major problems for LVCSR: lexical coverage and letter-to-sound conversion. The idea of the algorithm is simple: given a word start of length k, only a few different characters can continue an admissible word in the language; but if the word start of length k reaches a constituent word boundary of a compound, the set of successor characters can in principle include any character. The algorithm has been applied to a 300M word corpus with 2.6M distinct words. 800k decomposition rules have been extracted automatically. OOV (out of vocabulary) word reductions of 25% to 50% relative have been achieved using word lists from 65k to 600k words. Pronunciation dictionaries have been developed for the LIMSI 300k German recognition system. As no language-specific knowledge is required beyond the text corpus, the algorithm can be applied more generally to any compounding language. diss03_disfluencies.ps.Z [ISCA DiSS '03 - Disfluency in Spontaneous Speech Gothenburg University, Sweden, September 5-8, 2003] --------- A disfluency study for cleaning spontaneous speech automatic transcripts and improving speech language models M. Adda-Decker, B. Habert, C. Barras, G. Adda, Ph. Boula de Mareuil, P. Paroubek ----------------------- The aim of this study is to develop a model of disfluent speech by comparing different types of audio transcripts. The study makes use of 10 hours of French radio interview archives, involving journalists and personalities from political or civil society. A first type of transcript is press-oriented, where most disfluencies are discarded.
For 10% of the corpus, we produced exact audio transcripts: all audible phenomena and overlapping speech segments are transcribed manually. In these transcripts about 14% of the words correspond to disfluencies and discourse markers. The audio corpus was then transcribed using the LIMSI speech recognizer. Disfluency words, which make up 8% of the corpus, account for 12% of the overall error rate. This shows that disfluencies have no major effect on neighboring speech segments. Restarts are the most error prone, with a 36.9% within-class error rate. euro03HBMSR.ps.Z [ISCA Eurospeech'03, Geneva, September 2003] --------- A Semantic representation for spoken dialogs H. Bonneau-Maynard and S. Rosset ----------------------- This paper describes a semantic annotation scheme for spoken dialog corpora. Manual semantic annotation of large corpora is tedious, expensive, and subject to inconsistencies. Consistency is necessary to increase the usefulness of a corpus for developing and evaluating spoken understanding models and for linguistic studies. A semantic representation, based on the definition of a concept dictionary, has been formalized and is described. Each utterance is divided into semantic segments, and each segment is assigned a 5-tuple containing a mode, the underlying concept, the normalized form of the concept, the list of related segments, and an optional comment about the annotation. Based on this scheme, a tool was developed which ensures that the provided annotations respect the semantic representation. The tool includes interfaces for both the formal definition of the hierarchical concept dictionary and the annotation process. An experiment was conducted to assess inter-annotator agreement using both a human-human dialog corpus and a human-machine dialog corpus. For human-human dialogs, the agreement rate, computed on the triplets (mode, concept, value), is 61%, and the agreement rate on the concepts alone is 74%.
For the human-machine dialogs, the percentage of agreement on the triplet is 83% and the correct concept identification rate is 93%. AnnotationHH032803.pdf [ISCA Eurospeech'03, Geneva, September 2003] --------- Semantic and Dialogic Annotation for Automated Multilingual Customer Service Hilda Hardy, Kirk Baker, Helene Bonneau-Maynard, Laurence Devillers, Sophie Rosset, Tomek Strzalkowski ----------------------- One central goal of the AMITIES multilingual human-computer dialogue project is to create a dialogue management system capable of engaging the user in human-like conversation in a specific domain. To that end, we have developed new methods for the manual annotation of spoken dialogue transcriptions from European financial call centers. We have modified the DAMSL dialogic schema to create a dialogue act taxonomy appropriate for customer services. To capture the semantics, we use a domain-independent framework populated with domain-specific lists. We have designed a new flexible, platform-independent annotation tool, XDMLTool, and annotated several hundred dialogues in French and English. Inter-annotator agreement was moderate to high. We are using these data to design our dialogue system, and we hope that they will help us to derive appropriate dialogue strategies for novel situations. 1025anav.pdf [ISCA Eurospeech'03, Geneva, September 2003] --------- Prosodic cues for emotion characterization in real-life spoken dialogs Laurence Devillers and Ioana Vasilescu ----------------------- This paper reports on an analysis of prosodic cues for emotion characterization in 100 natural spoken dialogs recorded at a telephone customer service center. The corpus was annotated with task-dependent emotion tags, which were validated by a perceptual test. Two F0 range parameters, one at the sentence level and the other at the subsegment level, emerge as the most salient cues for emotion classification.
These parameters can differentiate between negative emotions (irritation/anger, anxiety/fear) and a neutral attitude, and confirm trends illustrated by the perceptual experiment. 0572anav.pdf [ICPhS 2003, Barcelona, August 2003] --------- Prosodic cues for perceptual emotion detection in task-oriented Human-Human corpus Laurence Devillers, Ioana Vasilescu and Catherine Mathon ----------------------- This paper addresses the question of perceptual detection and prosodic cue analysis of emotional behavior in a spontaneous speech corpus of real Human-Human dialogs. Detecting real emotions should be a clear focus for research on modeling human dialog, as it could help with analyzing the evolution of the dialog. Our aims are to define appropriate emotions for call center services, to validate the presence of emotions via perceptual tests and to find robust cues for emotion detection. Most research has focused mainly on artificial data in which predefined emotions were simulated by actors. For real-life corpora a set of appropriate emotion labels must be determined. To this purpose, we conducted a perceptual test exploring 2 experimental conditions: with and without the capacity of listening to the audio signal. Perceived emotions reflect the presence of shaded and mixed emotions/attitudes. We report correlations between objective values and both perceived prosodic parameters and emotion labels. ICPhSliaison.pdf [ICPhS 2003, Barcelona, August 2003] --------- Liaisons in French: a corpus based study using morpho-syntactic information P. Boula de Mareuil, M. Adda-Decker, V. Gendner ----------------------- French liaison consists in producing a normally mute consonant before a word starting with a vowel. Whereas the general context for liaison is relatively straightforward to describe, actual occurrences are difficult to predict.
In the present work, we quantitatively examine the productivity of the 20 or so liaison rules described in the literature, such as the rule which states that after a determiner, liaison is compulsory. To do so, we used the French BREF corpus of read newspaper speech (100 hours), automatically tagged with morpho-syntactic information. The speech corpus has been automatically aligned using a pronunciation dictionary which includes liaisons. There are 90k liaison contexts in the corpus, about half of which (45%) are realised. A better knowledge of liaison production is of particular interest for pronunciation modelling in automatic speech recognition, for text-to-speech synthesis, descriptive phonetics and second language acquisition. 0ICPhSlid.pdf [ICPhS 2003, Barcelona, August 2003] --------- Phonetic knowledge, phonotactics and perceptual validation for automatic language identification Martine Adda-Decker, Fabien Antoine, Philippe Boula de Mareuil, Ioana Vasilescu, Lori Lamel, Jacqueline Vaissiere, Edouard Geoffrois, Jean-Sylvain Liénard ----------------------- This study explores a multilingual phonotactic approach to automatic language identification using Broadcast News data. The definition of a multilingual phone set is discussed and an upper limit on the performance of the phonotactic approach is estimated by eliminating any degradation due to recognition errors. This upper bound is compared to automatic language identification based on a phonotactic approach. The eight languages of interest are: Arabic, Mandarin, English, French, German, Italian, Portuguese and Spanish. A perceptual test has been carried out to compare human and machine performance in similar configurations. Different phone set classes have been experimented with, ranging from a binary C/V distinction to a shared phone set of 70 phones. Experiments show that phonotactic constraints are in theory able to identify a language (among 8) with close to 100% accuracy on very short sequences of 1-2 seconds.
Automatic and human performance on very short sequences both remain below this theoretical performance. ICME2003_4pages.pdf [ICME 2003, Baltimore, July 2003] --------- Emotion Detection in Task-Oriented Spoken Dialogs Laurence Devillers, Lori Lamel, Ioana Vasilescu ----------------------- Detecting emotions in the context of automated call center services can be helpful for following the evolution of human-computer dialogs, enabling dynamic modification of the dialog strategies and influencing the final outcome. The emotion detection work reported here is part of a larger study aiming to model user behavior in real interactions. We make use of a corpus of real agent-client spoken dialogs in which the manifestation of emotion is quite complex, and it is common to have shaded emotions since the interlocutors attempt to control the expression of their internal attitude. Our aims are to define appropriate emotions for call center services, to annotate the dialogs, to validate the presence of emotions via perceptual tests, and to find robust cues for emotion detection. In contrast to research carried out with artificial data with simulated emotions, for real-life corpora the set of appropriate emotion labels must be determined. Two studies are reported: the first investigates automatic emotion detection using linguistic information, whereas the second concerns perceptual tests for identifying emotions as well as the prosodic and textual cues which signal them. About 11% of the utterances are annotated with non-neutral emotion labels. Preliminary experiments using lexical cues detect about 70% of these labels.
eacl2003peace.ps.Z [EACL'03, Budapest, April 2003] --------- The PEACE SLDS understanding evaluation paradigm of the French MEDIA campaign Laurence Devillers, Helene Maynard, Patrick Paroubek, Sophie Rosset ----------------------- This paper presents a paradigm for evaluating the context-sensitive understanding capability of any spoken language dialog system: PEACE (French acronym for Paradigme d'Evaluation Automatique de la Comprehension hors et En-contexte). This paradigm will be the basis of the French Technolangue MEDIA project, in which dialog systems from various academic and industrial sites will be tested in an evaluation campaign coordinated by ELRA/ELDA (over the next two years). Despite previous efforts such as Eagles, Disc, Aupelf ArcB2 or the ongoing American DARPA Communicator project, the spoken dialog community still lacks common reference tasks and widely agreed upon methods for comparing and diagnosing systems and techniques. Automatic solutions are nowadays being sought both to make possible the comparison of different approaches by means of reliable indicators with generic evaluation methodologies and also to reduce system development costs. However, achieving independence from both the dialog system and the task performed increasingly appears to be utopian. Most evaluations have up to now either tackled the system as a whole, or based the measurements on dialog-context-free information. PEACE aims to bypass some of these shortcomings by extracting, from real dialog corpora, test sets that synthesize contextual information. sspr03_ftp.pdf [IEEE Workshop on Spontaneous Speech, Tokyo, April 2003] --------- Using Continuous Space Language Models for Conversational Speech Recognition Holger Schwenk and Jean-Luc Gauvain ----------------------- Language modeling for conversational speech suffers from the limited amount of available adequate training data.
This paper describes a new approach that performs the estimation of the language model probabilities in a continuous space, thereby allowing smooth interpolation of unobserved n-grams. This continuous space language model is used during the last decoding pass of a state-of-the-art conversational telephone speech recognizer to rescore word lattices. For this type of speech data, it achieves consistent word error reductions of more than 0.4% compared to a carefully tuned backoff n-gram language model. ica03cts.pdf [ICASSP'03, Hong Kong, April 2003] --------- Conversational Telephone Speech Recognition J.L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, F. Lefèvre ----------------------- This paper describes the development of a speech recognition system for the processing of telephone conversations, starting with a state-of-the-art broadcast news transcription system. We identify major changes and improvements in acoustic and language modeling, as well as decoding, which are required to achieve state-of-the-art performance on conversational speech. Some major changes on the acoustic side include the use of speaker normalization (VTLN), the need to cope with channel variability, and the need for efficient speaker adaptation and better pronunciation modeling. On the linguistic side the primary challenge is to cope with the limited amount of language model training data. To address this issue we make use of a data selection technique, and a smoothing technique based on a neural network language model. At the decoding level lattice rescoring and minimum word error decoding are applied. On the development data, the improvements yield an overall word error rate of 24.9% whereas the original BN transcription system had a word error rate of about 50% on the same data. ica03CLZ.pdf [ICASSP'03, Hong Kong, April 2003] --------- Unsupervised Language Model Adaptation for Broadcast News L. Chen, J.L. Gauvain, L. Lamel, G.
Adda ----------------------- Unsupervised language model adaptation for speech recognition is challenging, particularly for complicated tasks such as the transcription of broadcast news (BN) data. This paper presents an unsupervised adaptation method for language modeling based on information retrieval techniques. The method is designed for the broadcast news transcription task, where the topics of the audio data cannot be predicted in advance. Experiments are carried out using the LIMSI American English BN transcription system and the NIST 1999 BN evaluation sets. The unsupervised adaptation method reduces the perplexity by 7% relative to the baseline LM and yields a 2% relative improvement for a 10xRT system. ica03CB.pdf [ICASSP'03, Hong Kong, April 2003] --------- Feature and Score Normalization for Speaker Verification of Cellular Data C. Barras and J.L. Gauvain ----------------------- This paper presents some experiments with feature and score normalization for text-independent speaker verification of cellular data. The speaker verification system is based on cepstral features and Gaussian mixture models with 1024 components. The following methods, which have been proposed for feature and score normalization, are reviewed and evaluated on cellular data: cepstral mean subtraction (CMS), variance normalization, feature warping, T-norm, Z-norm and the cohort method. We found that the combination of feature warping and T-norm gives the best results on the NIST 2002 test data (for the one-speaker detection task). Compared to a baseline system using both CMS and variance normalization and achieving a 0.410 minimal decision cost function (DCF), feature warping and T-norm respectively bring 8% and 12% relative reductions, whereas the combination of both techniques yields a 22% relative reduction, reaching a DCF of 0.320. This result approaches the state-of-the-art performance level obtained for speaker verification with land-line telephone speech.
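Two of the normalization steps evaluated above are simple enough to sketch: cepstral mean subtraction with variance normalization on the feature side, and T-norm on the score side. This is a sketch with synthetic numbers; the feature array and cohort scores are invented for illustration, and feature warping and the GMM scoring itself are not shown:

```python
import numpy as np

def cms_var_norm(cepstra):
    """Cepstral mean subtraction plus variance normalization,
    applied per utterance to a (frames x coefficients) array."""
    return (cepstra - cepstra.mean(axis=0)) / cepstra.std(axis=0)

def t_norm(raw_score, cohort_scores):
    """T-norm: normalize a trial score by the mean and standard deviation of
    the same test utterance scored against a cohort of impostor models."""
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()

rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(200, 13))  # fake cepstral frames
normed = cms_var_norm(feats)
print(normed.mean(axis=0))  # ~0 for every coefficient
print(normed.std(axis=0))   # ~1 for every coefficient

print(t_norm(5.0, [1.0, 2.0, 3.0]))  # trial score in cohort standard deviations
```

The design rationale is the same for both steps: removing per-utterance (or per-trial) offsets and scales makes scores from different channels comparable against a single decision threshold.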
isca2003-yylo.ps.Z [ISCA Workshop on Multilingual Spoken Document Retrieval, April 2003] --------- Tracking Topics in Broadcast News Data Yuen-Yee Lo and Jean-Luc Gauvain ----------------------- This paper describes a topic tracking system and its ability to cope with sparse training data for broadcast news tracking. The baseline tracker relies on a unigram topic model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the initial topic model, and unsupervised model adaptation is carried out after processing each test story. A new technique of variable-weight unsupervised online adaptation has been developed and was found to outperform traditional fixed-weight online adaptation. Combining both document expansion and adaptation resulted in a 37% cost reduction, tested on both English and machine-translated Mandarin broadcast news data transcribed by an ASR system, with manual story boundaries. Another challenging condition is one in which the story boundaries are not known for the broadcast news data. A window-based automatic story boundary detector has been developed for the tracking system. The tracking results with the window-based tracking system are comparable to those obtained with state-of-the-art automatic story segmentation on the TDT3 corpus. isle02em.ps.Z [ISLE workshop, Edinburgh Dec 16-17, 2002 http://www.research.att.com/~walker/isle-dtag-wrk/] --------- Annotation and Detection of Emotion in a Task-oriented Human-Human Dialog Corpus Laurence Devillers, Ioana Vasilescu (ENST), Lori Lamel ----------------------- Automatic emotion detection is potentially important for customer care in the context of call center services. A major difficulty is that the expression of emotion is quite complex in the context of agent-client dialogs.
Due to unstated rules of politeness it is common to have shaded emotions, in which the interlocutors attempt to limit their internal attitude (frustration, anger, disappointment). Furthermore, the telephone-quality speech complicates the tasks of detecting perceptual cues and of extracting relevant measures from the signal. Detecting real emotions can be helpful for following the evolution of human-computer dialogs, enabling dynamic modification of the dialog strategies and influencing the final outcome. This paper reports on three studies: the first concerns the use of multi-level annotations including emotion tags for diagnosis of the dialog state; the second investigates automatic emotion detection using linguistic information; and the third reports on two perceptual tests for identifying emotions as well as the prosodic and textual cues which signal them. Two experimental conditions were considered -- with and without the capability of listening to the audio signal. Finally, we propose a new set of emotion/attitude tags suggested by the previous experiments and discuss perspectives. isle02axes.ps.Z [ISLE workshop, Edinburgh Dec 16-17, 2002 http://www.research.att.com/~walker/isle-dtag-wrk/] --------- Representing Dialog Progression for Dynamic State Assessment Sophie Rosset and Lori Lamel ----------------------- While developing several spoken language dialog systems for information retrieval tasks, we found that representing the dialog progression along an axis was useful to facilitate dynamic evaluation of the dialog state. Since dialogs evolve from the first exchange until the end, it is of interest to assess whether an ongoing dialog is running smoothly or if it is encountering problems.
We consider that the dialog progression can be represented on two axes: a Progression axis, which represents the "good" progression of the dialog, and an Accidental axis, which represents the accidents that occur, corresponding to misunderstandings between the system and the user. The time (in number of turns) used by the system to repair an accident is represented by the Residual Error, which is incremented when an accident occurs and decremented when the dialog progresses. This error reflects the difference between a perfect (i.e., theoretical) dialog (e.g. without errors, miscommunication...) and the real ongoing dialog. One particularly interesting use of the dialog axis progression annotation is to extract problematic dialogs from large data collections for analysis, with the aim of improving the dialog system. In the context of the IST Amities project we have extended this representation to the annotation of human-human dialogs recorded at a stock exchange call center and intend to use the extended representation in an automated natural dialog system for call routing. tdt02.ps.Z [TDT'02, Gaithersburg, November 2002] --------- The LIMSI Topic Tracking System for TDT2002 Yuen-Yee Lo and Jean-Luc Gauvain ----------------------- In this paper we describe the LIMSI topic tracking system used for the DARPA 2002 Topic Detection and Tracking evaluation (TDT2002). The system relies on a unigram topic model, where the score for an incoming document is the normalized likelihood ratio of the topic model and a general English model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the topic model, which is then adapted in an unsupervised manner after each incoming document is processed. Experimental results demonstrate the effectiveness of these two techniques for the primary evaluation condition on both the TDT3 development corpus and the official TDT2002 test data.
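The scoring rule described above (a length-normalized likelihood ratio between a unigram topic model and a general English model) can be sketched as follows. The vocabulary, training snippets and smoothing constant are invented for illustration; document expansion and online adaptation are not shown:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=0.5):
    """Add-alpha smoothed unigram model over a fixed vocabulary
    (tokens are assumed to be drawn from vocab)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def tracking_score(doc_tokens, topic_lm, general_lm):
    """Length-normalized log-likelihood ratio of topic vs. general model;
    positive scores favor the topic hypothesis."""
    llr = sum(math.log(topic_lm[w] / general_lm[w]) for w in doc_tokens)
    return llr / len(doc_tokens)

vocab = ["stocks", "market", "rain", "sunshine"]
topic_train = ["stocks", "market", "stocks"]  # on-topic training story
general_train = ["rain", "sunshine", "stocks", "market", "rain", "sunshine"]

topic_lm = unigram_lm(topic_train, vocab)
general_lm = unigram_lm(general_train, vocab)

print(tracking_score(["stocks", "market"], topic_lm, general_lm))   # > 0: on topic
print(tracking_score(["rain", "sunshine"], topic_lm, general_lm))   # < 0: off topic
```

Thresholding this score gives the tracking decision; the normalization by document length keeps scores of short and long stories on the same scale.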
Another challenge is that story boundaries are not known for the broadcast news data. A window-based automatic boundary detector has been developed for the tracking system. The tracking results with the window-based tracking system are comparable to those obtained with state-of-the-art automatic story segmentation on the TDT3 corpus. icslp02_tcs.ps.Z [ICSLP 2002, Denver, Sep 2002] --------- A phonetic study of Vietnamese tones: acoustic and electrographic measurements Vu Ngoc Tuan and Christophe d'Alessandro and Sophie Rosset ----------------------- **tbc** icslp02hbm.ps.Z [ICSLP 2002, Denver, Sep 2002] --------- Issues in the Development of a Stochastic Speech Understanding System Fabrice Lefevre and Helene Bonneau-Maynard ----------------------- In the development of a speech understanding system, recourse to stochastic techniques can greatly reduce the need for human expertise. A known disadvantage is that stochastic models require large annotated training corpora in order to reliably estimate model parameters. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. In order to decrease the development cost, this work investigates the performance of stochastic understanding models along two parameters: the use of automatically segmented data and the use of automatically learned lexical normalisation rules. eusipco02.ps.Z [Eusipco'02, Toulouse, Sep. 2002] --------- Automatic Processing of Broadcast Audio in Multiple Languages Lori Lamel and Jean-Luc Gauvain ----------------------- This paper addresses recent progress in LVCSR in multiple languages which has enabled the processing of broadcast audio for information access. At LIMSI, broadcast news transcription systems have been developed for seven languages: English, French, German, Mandarin, Portuguese, Spanish and Arabic.
Automatic processing to access the content must take into account the specificities of audio data, such as the need to deal with a continuous data stream and an imperfect word transcription, as well as the specificities of the language. Some near-term applications are audio data mining, structuring of audiovisual archives, selective dissemination of information and media monitoring. spc02mask.ps.Z [Speech Communication, Vol. 38, (1-2):131-139 Sep 2002] --------- User Evaluation of the MASK Kiosk L. Lamel, S. Bennacef, J.L. Gauvain, H. Dartigues, J.N. Temem ----------------------- In this paper we report on a series of user trials carried out to assess the performance and usability of the Multimodal Multimedia Service Kiosk (MASK) prototype kiosk. The aim of the ESPRIT MASK project was to pave the way for advanced public service applications with user interfaces employing multimodal, multi-media input and output. The prototype kiosk was developed after analyzing the technological requirements in the context of users performing travel enquiry tasks, in close collaboration with the French Railways (SNCF) and the Ergonomics group at University College London (UCL). The time to complete the transaction with the MASK kiosk is reduced by about 30% compared to that required for the standard kiosk, and the success rate is 85% for novices and 94% once familiar with the system. In addition to meeting or exceeding the performance goals set at the project onset in terms of success rate, transaction time, and user satisfaction, the MASK kiosk was judged to be user-friendly and simple to use. In this article we present the results of user trials of the MASK kiosk (Multimodal Multimedia Service Kiosk). The goal of the ESPRIT MASK project was to develop an information and ticketing kiosk with an innovative, user-friendly interface combining tactile and vocal modalities.
The prototype was developed after a requirements analysis in the rail travel context, in collaboration with the SNCF and the ergonomics group at University College London (UCL). The transaction time with the MASK kiosk is reduced by 30% compared with existing kiosks. The success rate is 85% for novice users and 94% for those who have already used the system (more than 3 uses). All the objectives set by the SNCF at the start of the project were met by the prototype, whether regarding success rate, transaction time, or user satisfaction. The MASK kiosk was judged user-friendly and simple to use. lrec02_eval_defi.pdf [LREC'02, Las Palmas, June 2002] --------- Predictive and objective evaluation of speech understanding: the "challenge" evaluation campaign of the I3 speech workgroup of the French CNRS J.Y. Antoine, C. Bousquet-Vernhettes, J. Goulian, M. Zakaria Kurdi, S. Rosset, N. Vigouroux, J. Villaneau ----------------------- The recent development of spoken language processing has gone along with large-scale evaluation programmes that concern spoken language dialogue systems as well as their components (speech recognition, speech understanding, dialogue management). This paper deals with the evaluation of Spoken Language Understanding (SLU) systems in the general framework of spoken Man-Machine Communication. lrec02port.ps.Z [ISCA SALTMIL SIG workshop at LREC'02 on Portability Issues in Human Language Technologies, Las Palmas, June 2002] --------- Some Issues in Speech Recognizer Portability Lori Lamel ----------------------- Speech recognition technology has greatly evolved over the last decade. However, one of the remaining challenges is reducing the development cost. Most recognition systems are tuned to a particular task, and porting the system to a new task (or language) requires a substantial investment of time and money, as well as human expertise.
Today's state-of-the-art systems rely on the availability of large amounts of manually transcribed data for acoustic model training and large normalized text corpora for language model training. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision. This paper addresses some of the main issues in porting a recognizer to another task or language, and highlights some recent research activities aimed at reducing the porting cost and at developing generic core speech recognition technology. spcH4_limsi.ps.Z [Speech Communication, Vol. 37, (1-2):89-108, May 2002] --------- The LIMSI Broadcast News Transcription System Jean-Luc Gauvain, Lori Lamel and Gilles Adda ----------------------- This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or `found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The problem of partitioning the continuous stream of data is addressed using an iterative segmentation and clustering algorithm with Gaussian mixtures. The speech recognizer makes use of continuous density HMMs with Gaussian mixture for acoustic modeling and 4-gram statistics estimated on large text corpora. Word recognition is performed in multiple passes, where initial hypotheses are used for cluster-based acoustic model adaptation to improve word graph generation.
The overall word transcription error rates of the LIMSI evaluation systems were 27.1% (Nov96, partitioned test data), 18.3% (Nov97, unpartitioned data), 13.6% (Nov98, unpartitioned data) and 17.1% (Fall99, unpartitioned data with computation time under 10x real-time). jep2002hbm.ps.Z [JEP-TALN 2002, Nancy, June 2002] --------- Issues in the Development of a Stochastic Speech Understanding System Helene Bonneau-Maynard and Fabrice Lefevre ----------------------- The need for human expertise in the development of a speech understanding system can be greatly reduced by the use of stochastic techniques. However corpus-based techniques require the annotation of large amounts of training data. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. In order to decrease the development cost, this work investigates the performance of the understanding module with two parameters: the influence of the training corpus size and the use of automatically annotated data. ica02cb.ps.Z [IEEE ICASSP'02, Orlando, May 2002] --------- Transcribing Audio-Video Archives Claude Barras, Alexandre Allauzen, Lori Lamel, Jean-Luc Gauvain ----------------------- This paper addresses the automatic transcription of audio-video archives using a state-of-the-art broadcast news speech transcription system. A 9-hour corpus spanning the latter half of the 20th century (1945-1995) has been transcribed and an analysis of the transcription quality carried out. In addition to the challenges of transcribing heterogeneous broadcast news data, we are faced with changing properties of the archive over time, such as the audio quality, the speaking style, vocabulary items and manner of expression. After assessing the performance of the transcription system, several paths are explored in an attempt to reduce the mismatch between the acoustic and language models and the archived data.
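The audio partitioning described in the broadcast news abstracts above (segmentation and clustering with Gaussian mixtures, speech vs. non-speech labeling) can be illustrated with a toy sketch. This is not the LIMSI algorithm: it classifies frames of a single scalar feature under one Gaussian per class, and the min_run smoothing parameter is a hypothetical stand-in for the iterative segmentation/clustering over full cepstral features.

```python
import math

def gauss_loglik(x, mean, var):
    """Log-likelihood of a scalar feature x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def partition(frames, models, min_run=3):
    """Label each frame with the best-scoring class, then absorb runs
    shorter than min_run into the preceding segment (crude smoothing)."""
    labels = [max(models, key=lambda m: gauss_loglik(x, *models[m]))
              for x in frames]
    segments = []          # run-length encode the frame labels
    for lab in labels:
        if segments and segments[-1][0] == lab:
            segments[-1][1] += 1
        else:
            segments.append([lab, 1])
    cleaned = []           # merge spurious short runs
    for lab, n in segments:
        if n < min_run and cleaned:
            cleaned[-1][1] += n
        else:
            cleaned.append([lab, n])
    return cleaned

# (mean, var) per class; the feature could be, e.g., frame log-energy
models = {"speech": (10.0, 4.0), "music": (2.0, 4.0)}
frames = [9.5, 10.2, 11.0, 2.1, 9.8, 10.5, 1.9, 2.4, 2.2, 2.0]
print(partition(frames, models))  # → [['speech', 6], ['music', 4]]
```

The single mislabeled frame in the middle is absorbed by the smoothing step, which is the intuition behind favoring contiguous homogeneous segments before recognition.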
ica02hs.ps.Z [IEEE ICASSP'02, Orlando, May 2002] --------- Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition Holger Schwenk and Jean-Luc Gauvain ----------------------- This paper describes ongoing work on a new approach for language modeling for large vocabulary continuous speech recognition. Almost all state-of-the-art systems use statistical n-gram language models estimated on text corpora. One principal problem with such language models is the fact that many of the n-grams are never observed even in very large training corpora, and therefore it is common to back off to a lower-order model. In this paper we propose to address this problem by carrying out the estimation task in a continuous space, enabling a smooth interpolation of the probabilities. A neural network is used to learn the projection of the words onto a continuous space and to estimate the n-gram probabilities. The connectionist language model is being evaluated on the DARPA Hub5 conversational telephone speech recognition task and preliminary results show consistent improvements in both perplexity and word error rate. ica02light.ps.Z [IEEE ICASSP'02, Orlando, May 2002] --------- Unsupervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain and Gilles Adda ----------------------- This paper describes some recent experiments using unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated raw broadcast news data. The hypothesized transcription is used to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data.
These experiments demonstrate that unsupervised training is a viable training scheme and can dramatically reduce the cost of building acoustic models. csl01.ps.Z [Computer Speech and Language, Vol. 16, Jan 2002] --------- Lightly supervised and unsupervised acoustic model training Lori Lamel and Jean-Luc Gauvain and Gilles Adda ----------------------- The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the Darpa TDT-2 corpus. The hypothesized transcription is optionally aligned with closed captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data. These experiments demonstrate that lightly supervised or unsupervised training can dramatically reduce the cost of building acoustic models. mtap00.ps.Z [Journal on Multimedia Tools and Applications Vol 14 (2): 187-200, June 2001] --------- Audio Partitioning and Transcription for Broadcast Data Indexation J.L. Gauvain, L. Lamel and G.
Adda ----------------------- This work addresses automatic transcription of television and radio broadcasts in multiple languages. Transcription of such types of data is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Radio and television broadcasts consist of a continuous data stream made up of segments of different linguistic and acoustic natures, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. Word recognition is carried out with a speaker-independent, large vocabulary, continuous speech recognizer which makes use of n-gram statistics for language modeling and of continuous density HMMs with Gaussian mixtures for acoustic modeling. This system has consistently obtained top-level performance in DARPA evaluations. Over 500 hours of unpartitioned unrestricted American English broadcast data have been partitioned, transcribed and indexed, with an average word error of about 20%. With current IR technology there is essentially no degradation in information retrieval performance for automatic and manual transcriptions on this data set. asru01hbmfl.ps.Z [IEEE ASRU'01, Madonna di Campiglio, December 2001] --------- Investigating Stochastic Speech Understanding Helene Bonneau-Maynard, Fabrice Lefevre ----------------------- The need for human expertise in the development of a speech understanding system can be greatly reduced by the use of stochastic techniques. However corpus-based techniques require the annotation of large amounts of training data. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. This work investigates the influence of the training corpus size on the performance of the understanding module.
The use of automatically annotated data is also investigated as a means to increase the corpus size at a very low cost. First, a stochastic speech understanding model developed using data collected with the LIMSI ARISE dialog system is presented. Its performance is shown to be comparable to that of the rule-based caseframe grammar currently used in the system. In a second step, two ways of reducing the development cost are pursued: (1) reducing the amount of manually annotated data used to train the stochastic models and (2) using automatically annotated data in the training process. limsi_trk01.ps.Z [TDT'01, Gaithersburg, November 2001] --------- The LIMSI Topic Tracking System for TDT2001 Yuen-Yee Lo and Jean-Luc Gauvain ----------------------- In this paper we describe the LIMSI topic tracking system used for the DARPA 2001 Topic Detection and Tracking evaluation (TDT2001). The system relies on a unigram topic model, where the score for an incoming document is the normalized likelihood ratio of the topic model and a general English model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the initial topic model, and unsupervised model adaptation is carried out after each test story is processed. Experimental results demonstrate the effectiveness of these two techniques for the primary evaluation condition on both the TDT3 development corpus and the official TDT2001 test data. euro01fab.ps.Z [ISCA Eurospeech'01, Aalborg, August 2001] --------- Improving Genericity for Task-Independent Speech Recognition Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel ----------------------- Although there have been regular improvements in speech recognition technology over the past decade, speech recognition is far from being a solved problem. Recognition systems are usually tuned to a particular task and porting the system to a new task (or language) is both time-consuming and expensive.
In this paper, issues in speech recognizer portability are addressed through the development of generic core speech recognition technology. First, the genericity of wide domain models is assessed by evaluating performance on several tasks. Then, the use of transparent methods for adapting generic models to a specific task is explored. Finally, further techniques aimed at enhancing the genericity of the wide domain models are evaluated. We show that unsupervised acoustic model adaptation and multi-source training can reduce the performance gap between task-independent and task-dependent acoustic models, and for some tasks even outperform task-dependent acoustic models. euro01chen.ps.Z [ISCA Eurospeech'01, Aalborg, August 2001] --------- Using Information Retrieval Methods for Language Model Adaptation Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, Gilles Adda, Martine Adda-Decker ----------------------- In this paper we report experiments on language model adaptation using information retrieval methods, drawing upon recent developments in information extraction and topic tracking. One of the problems is extracting reliable topic information with high confidence from the audio signal in the presence of recognition errors. The work in the information retrieval domain on information extraction and topic tracking suggested a new way to solve this problem. In this work, we make use of information retrieval methods to extract topic information from the word recognizer hypotheses, which are then used to automatically select adaptation data from a very large general text corpus. Two adaptive language models, a mixture-based model and a MAP-based model, have been investigated using the adaptation data. Experiments carried out with the LIMSI Mandarin broadcast news transcription system give a relative character error rate reduction of 4.3% with this adaptation method.
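The mixture-based adaptive language model mentioned in the abstract above interpolates a topic-specific model with a general one. A minimal unigram sketch, under simplifying assumptions: the models here are plain word-probability dicts and a single mixture weight is estimated by EM on held-out text, whereas the actual systems work with full n-gram models.

```python
def interpolate(p_general, p_topic, lam):
    """Mixture LM: P(w) = lam * P_topic(w) + (1 - lam) * P_general(w)."""
    words = set(p_general) | set(p_topic)
    return {w: lam * p_topic.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in words}

def em_weight(p_general, p_topic, heldout, iters=20):
    """Estimate the mixture weight by EM, maximizing held-out likelihood."""
    lam = 0.5
    for _ in range(iters):
        posts = []
        for w in heldout:
            t = lam * p_topic.get(w, 1e-9)
            g = (1 - lam) * p_general.get(w, 1e-9)
            posts.append(t / (t + g))   # E-step: responsibility of the topic LM
        lam = sum(posts) / len(posts)   # M-step: average responsibility
    return lam

# Toy distributions and held-out words (illustrative only)
p_general = {"the": 0.5, "stock": 0.1, "game": 0.4}
p_topic = {"the": 0.4, "stock": 0.5, "game": 0.1}
lam = em_weight(p_general, p_topic, ["the", "stock", "game", "the", "stock"])
adapted = interpolate(p_general, p_topic, lam)
```

With held-out text leaning toward the topic vocabulary, EM pushes the weight above 0.5 toward the topic model, which is the mechanism by which the adaptation data tilts the combined LM.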
itrw01fabfinal.ps.Z [ISCA ITRW'01, Sophia-Antipolis, August 2001] --------- Genericity and Adaptability Issues for Task-Independent Speech Recognition Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel --------------------- The last decade has witnessed major advances in core speech recognition technology, with today's systems able to recognize continuous speech from many speakers without the need for an explicit enrollment procedure. Despite these improvements, speech recognition is far from being a solved problem. Most recognition systems are tuned to a particular task and porting the system to another task or language is both time-consuming and expensive. Our recent work addresses issues in speech recognizer portability, with the goal of developing generic core speech recognition technology. In this paper, we first assess the genericity of wide domain models by evaluating performance on several tasks. Then, transparent methods are used to adapt generic acoustic and language models to a specific task. Unsupervised acoustic model adaptation is contrasted with supervised adaptation, and a system-in-loop scheme for incremental unsupervised acoustic and linguistic model adaptation is investigated. Experiments on a spontaneous dialog task show that with the proposed scheme, a transparently adapted generic system can perform nearly as well (about a 1% absolute gap in word error rates) as a task-specific system trained on several tens of hours of manually transcribed data. itrw01chen.ps.Z [ISCA ITRW'01, Sophia-Antipolis, August 2001] --------- Language Model Adaptation for Broadcast News Transcription Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, Gilles Adda, Martine Adda-Decker --------------------- This paper reports on language model adaptation for the broadcast news transcription task. Language model adaptation for this task is challenging in that the subject of any particular show or portion thereof is unknown in advance and is often related to more than one topic.
One of the problems in language model adaptation is the extraction of reliable topic information from the audio signal, particularly in the presence of recognition errors. In this work, we draw upon techniques used in information retrieval to extract topic information from the word recognizer hypotheses, which are then used to automatically select adaptation data from a large general text corpus. Two adaptive language models, a mixture-based model and a MAP-based model, have been investigated using the adaptation data. Experiments carried out with the LIMSI Mandarin broadcast news transcription system give a relative character error rate reduction of 4.3% by combining both adaptation methods. taln01sr.ps.Z [TALN'01, Tours, July 2001] --------- Gestionnaire de dialogue pour un système d'informations à reconnaissance vocale Sophie Rosset, Lori Lamel --------------------- In this paper we describe the dialog manager of a spoken language dialog system for information retrieval. The dialog manager maintains both static and dynamic knowledge sources, which are used according to the dialog strategies. These strategies have been developed so as to obtain the overall system objectives while taking into consideration the desired ergonomic choices. Dialog management is based on a dialog model which divides the interaction into phases and a dynamic model of the task. This allows the dialog strategy to be adapted as a function of the history and the dialog state. The dialog manager was evaluated in the context of the Le3-Arise project, where in the last user tests the dialog success improved from 53% to 85%. In this article we present a dialog manager for a spoken-language information retrieval system. The dialog manager has several knowledge sources, both static and dynamic. This knowledge is managed and used by the dialog manager via strategies.
These strategies are implemented and organized according to the objectives set for the dialog system and the ergonomic choices we adopted. The dialog manager uses a dialog model based on the identification of dialog phases and a dynamic task model, which increases the ability to adapt the strategy according to the dialog history and state. This dialog manager, implemented and evaluated during the last evaluation campaign of the Le-3 Arise project, yielded an improvement in the dialog success rate (from 53% to 85%). acl01.ps.Z [ACL'01, Toulouse, July 2001] --------- Processing Broadcast Audio for Information Access Jean-Luc Gauvain, Lori Lamel, Gilles Adda, Martine Adda-Decker, Claude Barras, Langzhou Chen, and Yannick de Kercadio This paper addresses recent progress in speaker-independent, large vocabulary, continuous speech recognition, which has opened up a wide range of near and mid-term applications. One rapidly expanding application area is the processing of broadcast audio for information access. At LIMSI, broadcast news transcription systems have been developed for English, French, German, Mandarin and Portuguese, and systems for other languages are under development. Audio indexation must take into account the specificities of audio data, such as needing to deal with the continuous data stream and an imperfect word transcription. Some near-term application areas are audio data mining, selective dissemination of information and media monitoring. ica01light.ps.Z [ICASSP'01, Salt Lake City, May 2001] --------- Investigating Lightly Supervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain, Gilles Adda --------------------- The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe broadcast audio data with a word error of about 20%.
However, acoustic model development for the recognizers requires large corpora of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision. In this paper we describe some recent experiments using different levels of supervision for acoustic model training in order to reduce the system development cost. The experiments have been carried out using the DARPA TDT-2 corpus (also used in the SDR99 and SDR00 evaluations). Our experiments demonstrate that light supervision is sufficient for acoustic model development, drastically reducing the development cost. ica01fl.ps.Z [ICASSP'01, Salt Lake City, May 2001] --------- Towards Task-Independent Speech Recognition Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel --------------------- Despite the considerable progress made in the last decade, speech recognition is far from a solved problem. For instance, porting a recognition system to a new task (or language) still requires substantial investment of time and money, as well as expertise in speech recognition. This paper takes a first step at evaluating to what extent a generic state-of-the-art speech recognizer can reduce the manual effort required for system development. We demonstrate the genericity of wide domain models, such as broadcast news acoustic and language models, and techniques to achieve a higher degree of genericity, such as transparent methods to adapt such models to a specific task. This work targets three tasks using commonly available corpora: small vocabulary recognition (TI-digits), text dictation (WSJ), and goal-oriented spoken dialog (ATIS). 
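The lightly supervised recipe in the abstracts above (recognize unannotated audio, then use closed captions as a weak check on the hypotheses) can be caricatured as a filtering step. A rough sketch under simplifying assumptions: the actual systems align hypotheses against the caption text, whereas here a segment is kept only if enough of its hypothesized words occur anywhere in the captions, with a hypothetical min_match threshold.

```python
def keep_segment(hyp_words, caption_words, min_match=0.7):
    """Accept a recognized segment as training data when at least
    min_match of its hypothesized words also appear in the captions."""
    captions = set(caption_words)
    matched = sum(1 for w in hyp_words if w in captions)
    return matched / max(len(hyp_words), 1) >= min_match

def select_training(segments, caption_words, min_match=0.7):
    """Return the hypothesized segments that pass the caption filter."""
    return [seg for seg in segments
            if keep_segment(seg, caption_words, min_match)]

captions = "the president spoke to reporters on tuesday".split()
segments = [
    "the president spoke to reporters".split(),  # agrees with captions: kept
    "uh the present spoke reporters".split(),    # recognition errors: dropped
]
kept = select_training(segments, captions)
```

Segments whose hypotheses disagree with the captions are simply discarded rather than corrected, which is what makes the supervision "light": no one ever transcribes the audio by hand.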
ica01cb.ps.Z [ICASSP'01, Salt Lake City, May 2001] --------- Automatic Transcription of Compressed Broadcast Audio Claude Barras, Lori Lamel, Jean-Luc Gauvain --------------------- With increasing volumes of audio and video data broadcast over the web, it is of interest to assess the performance of state-of-the-art automatic transcription systems on compressed audio data for media indexation applications. In this paper the performance of the LIMSI 10x French broadcast news transcription system is measured on a two-hour audio set for a range of MP3 and RealAudio codecs at various bitrates and the GSM codec used for European cellular phone communications. The word error rates are compared with those obtained on high quality PCM recordings prior to compression. For a 6.5 kbps audio bit rate (the most commonly used on the web), word error rates under 40% can be achieved, which makes automatic media monitoring systems over the web a realistic task. hlt01.ps.Z [HLT'01, San Diego, March 2001] --------- Portability Issues for Speech Recognition Technologies Lori Lamel, Fabrice Lefevre, Jean-Luc Gauvain, Gilles Adda --------------------- Although there has been regular improvement in speech recognition technology over the past decade, speech recognition is far from being a solved problem. Most recognition systems are tuned to a particular task and porting the system to a new task (or language) still requires substantial investment of time and money, as well as expertise. Today's state-of-the-art systems rely on the availability of large amounts of manually transcribed data for acoustic model training and large normalized text corpora for language model training. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision.
In this paper we address issues in speech recognizer portability and activities aimed at developing generic core speech recognition technology, in order to reduce the manual effort required for system development. Three main axes are pursued: assessing the genericity of wide domain models by evaluating performance under several tasks; investigating techniques for lightly supervised acoustic model training; and exploring transparent methods for adapting generic models to a specific task so as to achieve a higher degree of genericity. sdr00.ps.Z [TREC-9, Gaithersburg, Nov 2000] --------- The LIMSI SDR System for TREC-9 Jean-Luc Gauvain, Lori Lamel, Claude Barras, Gilles Adda, and Yannick de Kercadio --------------------- In this paper we describe the LIMSI Spoken Document Retrieval system used in the TREC-9 evaluation. This system combines an adapted version of the LIMSI 1999 Hub-4E transcription system for speech recognition with text-based IR methods. Compared with the LIMSI TREC-8 system, this year's system is able to index the audio data without knowledge of the story boundaries using a double windowing approach. The query expansion procedure of the information retrieval component has been revised and makes use of contemporaneous text sources. Experimental results are reported in terms of mean average precision for both the TREC SDR'99 and SDR'00 queries using the same 557h data set. The mean average precision of this year's system is 0.5250 for SDR'99 and 0.3706 for SDR'00 for the focus unknown story boundary condition with a 20% word error rate. nle99.ps.Z [Natural Language Engineering special issue on Best Practice in Spoken Language Dialogue Systems Engineering, Vol. 6 (3), Oct 2000] --------- Towards Best Practice in the Development and Evaluation of Speech Recognition Components of a Spoken Language Dialogue System L. Lamel, W. Minker, P.
Paroubek --------------------- Spoken Language Dialog Systems (SLDSs) aim to use natural spoken input for performing an information processing task such as automated standards, call routing or travel planning and reservations. The main functionalities of an SLDS are speech recognition, natural language understanding, dialog management, database access and interpretation, response generation and speech synthesis. Speech recognition, which transforms the acoustic signal into a string of words, is a key technology in any SLDS. This article summarizes the main aspects of current practice in the design, implementation and evaluation of speech recognition components for spoken language dialogue systems. It is based on the framework used in the European project DISC. icslp00_h4ger.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Investigating text normalization and pronunciation variants for German broadcast transcription Martine Adda-Decker, Gilles Adda, Lori Lamel --------------------- In this paper we describe our ongoing work concerning lexical modeling in the LIMSI broadcast transcription system for German. Lexical decomposition is investigated with a twofold goal: lexical coverage optimization and improved letter-to-sound conversion. A set of about 450 decompounding rules, developed using statistics from a 300M word corpus, reduces the OOV rate from 4.5% to 4.0% on a 30k development text set. Adding partial inflection stripping, the OOV rate drops to 2.9%. For letter-to-sound conversion, decompounding reduces cross-lexeme ambiguities and thus contributes to more consistent pronunciation dictionaries. Another point of interest concerns reduced pronunciation modeling. Word error rates, measured on 1.3 hours of ARTE TV broadcast, vary between 18 and 24% depending on the show and the system configuration. Our experiments indicate that using reduced pronunciations slightly decreases word error rates.
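The decompounding idea above (splitting German compounds into in-vocabulary parts to cut the OOV rate) can be sketched as a greedy recursive splitter. This toy version is not the LIMSI rule set: it only matches whole lexicon entries, knows a single linking-'s' rule, and the min_part bound is an assumed guard against spurious short splits.

```python
def decompound(word, lexicon, min_part=3):
    """Split an out-of-vocabulary compound into in-lexicon parts, allowing
    a German linking 's' between parts (toy rule). Returns None if no
    complete split into known parts exists."""
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        candidates = [rest]
        if rest.startswith("s"):
            candidates.append(rest[1:])  # drop the linking element
        for r in candidates:
            tail = decompound(r, lexicon, min_part)
            if tail:
                return [head] + tail
    return None

lex = {"haupt", "bahn", "hof", "arbeit", "amt"}
print(decompound("hauptbahnhof", lex))  # → ['haupt', 'bahn', 'hof']
print(decompound("arbeitsamt", lex))    # → ['arbeit', 'amt']
```

Each successful split turns one OOV token into several in-vocabulary tokens, which is how a compact rule set can lower the OOV rate without enlarging the lexicon.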
icslp00_h4m.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Broadcast News Transcription in Mandarin Langzhou Chen, Lori Lamel, Gilles Adda and Jean-Luc Gauvain --------------------- In this paper, our work in developing a Mandarin broadcast news transcription system is described. The main focus of this work is a port of the LIMSI American English broadcast news transcription system to the Chinese Mandarin language. The system consists of an audio partitioner and an HMM-based continuous speech recognizer. The acoustic models were trained on about 24 hours of data from the 1997 Hub4 Mandarin corpus available via LDC. In addition to the transcripts, the language models were trained on the Mandarin Chinese News Corpus containing about 186 million characters. We investigate recognition performance as a function of lexicon size, with and without tone in the lexicon, and with a topic-dependent language model. The transcription character error rate on the DARPA 1997 test set is 18.1% using a lexicon with 3 tone levels and a topic-based language model. icslp00_slds.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Considerations in the Design and Evaluation of Spoken Language Dialog Systems Lori Lamel, Sophie Rosset and Jean-Luc Gauvain --------------------- In this paper we summarize our experience at LIMSI in the design, development and evaluation of spoken language dialog systems for information retrieval tasks. This work has been for the most part carried out in the context of several European and international projects. Evaluation plays an integral role in the development of spoken language dialog systems. While there are commonly used measures and methodologies for evaluating speech recognizers, the evaluation of spoken dialog systems is considerably more complicated due to the interactive nature and the human perception of performance.
It is therefore important to assess not only the individual system components, but the overall system performance using objective and subjective measures. icslp00_10x.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Fast Decoding for Indexation of Broadcast Data Jean-Luc Gauvain and Lori Lamel --------------------- Processing time is an important factor in making a speech transcription system viable for automatic indexation of radio and television broadcasts. When only concerned with the word error rate, it is common to design systems that run in 100 times real-time or more. This paper addresses issues in reducing the speech recognition time for automatic indexation of radio and TV broadcasts with the aim of obtaining reasonable performance for close to real-time operation. We investigated computational resources in the range of 1 to 10xRT on commonly available platforms. Constraints on the computational resources led us to reconsider design issues, particularly those concerning the acoustic models and the decoding strategy. A new decoder was implemented which transcribes broadcast data in a few times real-time with only a slight increase in word error rate when compared to our best system. Experiments with spoken document retrieval show that comparable IR results are obtained with a 10xRT automatic transcription or with manual transcription, and that reasonable performance is still obtained with a 1.4xRT transcription system. TLPreco00.ps.Z [France-China Speech Processing Workshop, Beijing, Oct 2000] --------- An Overview of Speech Recognition Activities at LIMSI Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Claude Barras, Langzhou Chen, Michele Jardino, Lori Lamel, Holger Schwenk --------------------- This paper provides an overview of recent activities at LIMSI in multilingual speech recognition and its applications. The main goal of speech recognition is to provide a transcription of the speech signal as a sequence of words.
Speech recognition is a core technology for most applications involving voice technology. The two main classes of applications currently addressed are transcription and indexation of broadcast data and spoken language dialog systems for information access. Speaker-independent, large vocabulary, continuous speech recognition systems for different European languages (French, German and British English) and for American English and Mandarin Chinese have been developed. These systems rely on supporting research in acoustic-phonetic modeling, lexical modeling and language modeling. icslp00_holger.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Combining Multiple Speech Recognizers using Voting and Language Model Information Holger Schwenk and Jean-Luc Gauvain --------------------- In 1997, NIST introduced a voting scheme called ROVER for combining word scripts produced by different speech recognizers. This approach has achieved a relative word error reduction of up to 20% when used to combine the systems' outputs from the 1998 and 1999 Broadcast News evaluations. Recently, there has been increasing interest in using this technique. This paper provides an analysis of several modifications of the original algorithm. Topics addressed are the order of combination, normalization/filtering of the systems' outputs prior to combining them, treatment of ties during voting and the incorporation of language model information. The modified ROVER achieves an additional 5% relative word error reduction on the 1998 and 1999 Broadcast News evaluation test sets. Links with recent theoretical work on alternative error measures are also discussed. 
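The ROVER-style combination in the abstract above works by aligning the recognizers' outputs and voting word by word. A stripped-down sketch, under two assumptions not in the original: the outputs here arrive already aligned slot by slot (the real algorithm builds the word transition network itself), and an optional language-model score is used only to break ties, loosely in the spirit of the LM-informed variant.

```python
from collections import Counter

def rover_vote(aligned_hyps, lm_score=None):
    """Slot-by-slot majority vote over pre-aligned recognizer outputs.
    None in a slot marks a deletion; ties are broken by lm_score if given."""
    result = []
    for slot in zip(*aligned_hyps):
        counts = Counter(slot)
        best = max(counts.values())
        cands = [w for w, c in counts.items() if c == best]
        if len(cands) > 1 and lm_score is not None:
            cands.sort(key=lm_score, reverse=True)  # LM breaks the tie
        winner = cands[0]
        if winner is not None:
            result.append(winner)
    return result

hyps = [["the", "cat", "sat"],
        ["the", "hat", "sat"],
        ["a", "cat", "sat"]]
print(rover_vote(hyps))  # → ['the', 'cat', 'sat']
```

With only two systems every disagreement is a tie, which is why the abstracts report that tie treatment and LM information matter most when combining two or three recognizers.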
icslp00_holger.ps.Z [ASR'2000, Paris, Sep 2000] --------- Improved ROVER using Language Model Information Holger Schwenk and Jean-Luc Gauvain --------------------- In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observation. Speech recognizers, however, are usually evaluated by measuring the word error, so that there is a mismatch between the training and the evaluation criterion. Recently, algorithms for directly minimizing the word error and other task-specific error criteria have been proposed. This paper presents an extension of the ROVER algorithm for combining outputs of multiple speech recognizers using both a word error criterion and a sentence error criterion. The algorithm has been evaluated on the 1998 and 1999 broadcast news evaluation test sets, as well as the SDR 1999 speech recognition 10 hour subset, and consistently outperformed the standard ROVER algorithm. The approach seems to be of particular interest for improving recognition performance by combining only two or three speech recognizers, achieving relative performance improvements of up to 20% compared to the best single recognizer. asr2000_light.ps.Z [ASR'2000, Paris, Sep 2000] --------- Lightly Supervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain, Gilles Adda --------------------- Although tremendous progress has been made in speech recognition technology, with the capability of today's state-of-the-art systems to transcribe unrestricted continuous speech from broadcast data, these systems rely on the availability of large amounts of manually transcribed acoustic training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision.
In this paper we describe some recent experiments using lightly supervised techniques for acoustic model training in order to reduce the system development cost. The strategy we investigate uses a speech recognizer to transcribe unannotated broadcast news data, and optionally combines the hypothesized transcription with associated, but unaligned, closed captions or transcripts to create labeled training data. We show that this approach can dramatically reduce the cost of building acoustic models. phon00.ps.Z [Workshop on Phonetics and Phonology in ASR, Saarbrücken 2000] --------- Modeling Reduced Pronunciations in German Martine Adda-Decker and Lori Lamel --------------------- This paper deals with pronunciation modeling for automatic speech recognition in German with a special focus on reduced pronunciations. Starting with our 65k full form pronunciation dictionary we have experimented with different phone sets for pronunciation modeling. For each phone set, different lexica have been derived using mapping rules for unstressed syllables, where /schwa-vowel+[lnm]/ are replaced by syllabic /[lnm]/. The different pronunciation dictionaries are used both for acoustic model training and during recognition. The speech corpora correspond to TV broadcast shows, which contain signal segments of various acoustic and linguistic natures. The speech is produced by a wide variety of speakers with linguistic styles ranging from prepared to spontaneous speech with changing background and channel conditions. Experiments were carried out using 4 shows of news and documentaries lasting more than 15 minutes each (a total of 1h20min). Word error rates obtained vary between 19 and 29% depending on the show and the system configuration. Only small differences in recognition rates were measured for the different experimental setups, with slightly better results obtained by the reduced lexica.
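The lightly supervised training strategy described in the asr2000_light entry above can be caricatured in a few lines: run the recognizer on unannotated audio, align its hypothesis with the unaligned closed captions, and keep only the regions where the two agree as training labels. A minimal sketch, assuming word-level sequences and using a longest-matching-subsequence alignment as a stand-in for the real time-aligned matching:

```python
from difflib import SequenceMatcher

def filter_training_words(hypothesis, captions):
    """Keep only hypothesized words confirmed by the closed captions.

    Both inputs are word lists; agreement is found with a
    longest-matching-subsequence alignment (a simplification of the
    alignment used in practice).
    """
    matcher = SequenceMatcher(a=hypothesis, b=captions, autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        kept.extend(hypothesis[block.a:block.a + block.size])
    return kept

hyp = "the senate voted today on the new budget bill".split()
cc = "senate voted on the budget bill".split()
print(filter_training_words(hyp, cc))
# ['senate', 'voted', 'on', 'the', 'budget', 'bill']
```

Recognizer insertions such as "today" and "new" that the captions do not confirm are discarded rather than used as (possibly wrong) training labels.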
spc00arise.ps.Z [Speech Communication, Vol 31, (4):339-354, Aug 2000] --------- The LIMSI ARISE System L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet, B. Prouts --------------------- The LIMSI ARISE system provides vocal access by telephone to rail travel information for main French intercity connections, including timetables, simulated fares and reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open interaction, where the user is free to ask any question or to provide any information at any point in time. In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. ieee00.ps.Z [Proceedings of the IEEE, Special issue on LVCSR: Advances and Applications. Vol 88(8):1181-1200, Aug 2000] --------- Large Vocabulary Continuous Speech Recognition: Advances and Applications Jean-Luc Gauvain and Lori Lamel --------------------- The last decade has witnessed substantial advances in speech recognition technology, which, combined with the increase in computational power and storage capacity, have resulted in a variety of commercial products already or soon to be on the market. In this paper we review the state of the art in core technology for large vocabulary continuous speech recognition, with a view towards highlighting recent advances. We then highlight issues in moving towards applications, discussing system efficiency, portability across languages and tasks, and enhancing the system output by adding tags and non-linguistic information. Current performance in speech recognition and outstanding challenges for three classes of applications are discussed: dictation, audio indexation and spoken language dialog systems. lrec00hbm.ps.Z [LREC'00, Athens, May 2000] --------- Predictive performance of dialog systems H. Bonneau-Maynard, L. Devillers, S. Rosset --------------------- This paper relates some of our experiments on predictive performance measures for dialog systems. Evaluating dialog systems is often very costly because of the need to carry out user trials, so it is clearly advantageous when evaluation can be carried out automatically. It would be helpful if for each application we were able to measure the system's performance with an objective cost function.
This performance function can then be used to make predictions about future evolutions of the system without user interaction. Using the PARADISE paradigm, a performance function derived from the relative contribution of various factors is first obtained for one system developed at LIMSI: PARIS-SITI (a kiosk for tourist information retrieval in Paris). A second experiment with PARIS-SITI on a new test population confirms that the most important predictors of user satisfaction are understanding accuracy, recognition accuracy and the number of user repetitions. Furthermore, similar spoken dialog features appear as important features for the Arise system (a train timetable telephone information system). We also explore different ways of measuring user satisfaction. We then discuss the introduction of subjective factors in the predictive coefficients. ica00h4.ps.Z [ICASSP'2000, Istanbul, Jun 2000] --------- Transcription and Indexation of Broadcast Data Jean-Luc Gauvain, Lori Lamel, Yannick de Kercadio, and Gilles Adda --------------------- In this paper we report on recent research on transcribing and indexing broadcast news data for information retrieval purposes. The system described here combines an adapted version of the LIMSI 1998 Hub-4E transcription system for speech recognition with text-based IR methods. Experimental results are reported in terms of recognition word error rate and mean average precision for both the TREC SDR98 (100h) and SDR99 (600h) data sets. With query expansion using commercial transcripts, comparable mean average precisions are obtained on manual reference transcriptions and on automatic transcriptions with a word error rate of 21.5% measured on a 10 hour data subset. spc00sv.ps.Z [Speech Communication, Vol 31 (2-3):141-154, June 2000] --------- Speaker Verification over the Telephone L.F. Lamel and J.L.
Gauvain --------------------- Speaker verification has been the subject of active research for many years, yet despite these efforts and promising results on laboratory data, speaker verification performance over the telephone remains below that required for many applications. This experimental study aimed to quantify speaker recognition performance outside the context of any specific application, as a function of factors more or less acknowledged to affect accuracy. Some of the issues addressed are: the speaker model (Gaussian mixture models are compared with phone-based models); the influence of the amount and content of training and test data on performance; performance degradation due to model ageing and how it can be counteracted using adaptation techniques; and achievable performance levels using text-dependent and text-independent recognition modes. These and other factors were addressed using a large corpus of read and spontaneous speech in French (over 250 hours collected from 100 target speakers and 1000 impostors) designed and recorded for the purpose of this study. On this data, the lowest equal error rate is 1% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 1.5s of speech per trial. hub4_00.ps.Z [DARPA Speech Transcription Workshop, May 2000] --------- The LIMSI 1999 Hub-4E Transcription System Jean-Luc Gauvain, Lori Lamel and Gilles Adda --------------------- In this paper we report on the LIMSI 1999 Hub-4E system for broadcast news transcription. The main difference from our previous broadcast news transcription system is that a new decoder was implemented to meet the 10xRT requirement. This single pass 4-gram dynamic network decoder is based on a time-synchronous Viterbi search with dynamic expansion of LM-state conditioned lexical trees, and with acoustic and language model lookaheads. The decoder can handle position-dependent, cross-word triphones and lexicons with contextual pronunciations. Faster than real-time decoding can be obtained using this decoder with a word error under 30%, running in less than 100Mb of memory on widely available platforms such as Pentium III or Alpha machines.
The same basic models (lexicon, acoustic models, language models) and partitioning procedure used in past systems have been used for this evaluation. The acoustic models were trained on about 150 hours of transcribed speech material. 65K word language models were obtained by interpolation of backoff n-gram language models trained on different text data sets. Prior to word decoding, a maximum likelihood partitioning algorithm segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Word decoding is carried out in three steps, integrating cluster-based MLLR acoustic model adaptation. The final decoding step uses a 4-gram language model interpolated with a category trigram model. The overall word transcription error on the 1999 evaluation test data was 17.1% for the baseline 10X system. riao00.ps.Z [RIAO'00, Paris, May 2000] An Audio Transcriber for Broadcast Document Indexation Contacts: Bernard Prouts (Vecsys), Jean-Luc Gauvain (LIMSI) --------- Demonstration of the AudioSurf automatic transcription system. The demonstrator shows state-of-the-art automatic transcription and indexing capabilities for broadcast data in three languages: American English, French and German. sdr99.ps.Z [Proc. of the Text Retrieval Conference, TREC-8, Gaithersburg, Nov 1999] --------- The LIMSI SDR system for TREC-8 J.L. Gauvain, Y. Kercadio, L. Lamel, and G. Adda -------------------- In this paper we report on our TREC-8 SDR system, which combines an adapted version of the LIMSI 1998 Hub-4E transcription system for speech recognition with an IR system based on the Okapi term weighting function. Experimental results are given in terms of word error rate and average precision for both the SDR'98 and SDR'99 data sets. In addition to the Okapi approach, we also investigated a Markovian approach, which, although not used in the TREC-8 evaluation, yields comparable results.
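The Okapi term weighting mentioned in the TREC-8 entry above is, in its common BM25 form (the exact variant used in the system is not spelled out here, so treat this as an illustrative assumption), a product of a saturating term-frequency component and an inverse document frequency:

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 score contribution of one query term in one document.

    tf: term frequency in the document; df: number of documents
    containing the term; n_docs: collection size. k1 and b are the
    usual saturation and length-normalization constants.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# A rare term in an average-length document outweighs a common one:
rare = bm25_weight(tf=2, df=5, n_docs=1000, doc_len=300, avg_doc_len=300)
common = bm25_weight(tf=2, df=800, n_docs=1000, doc_len=300, avg_doc_len=300)
print(rare > common)  # True
```

A document's retrieval score for a query is the sum of these per-term weights, which is what makes the measure robust to the recognition errors discussed in the abstract: a few misrecognized words only perturb a few terms of the sum.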
The evaluation system obtained an average precision of 0.5411 on the reference transcriptions and of 0.5072 on the automatic transcriptions. The word error rate measured on a 10 hour subset is 21.5%. spc99pron.ps.Z [Speech Communication, Special Issue on Pronunciation Variation Modeling, Vol 29 (2-4): 83-98, Nov 1999] --------- Pronunciation Variants Across Systems, Languages and Speaking Style M. Adda-Decker, L. Lamel -------------------- This contribution aims at evaluating the use of pronunciation variants for different recognition system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription of the utterance and a phonemically represented lexicon, and is thus focused on the modeling capabilities of the acoustic word models. To measure the need for variants we have defined the variant2+ rate, which is the percentage of words in the corpus not aligned with the most common phonemic transcription. This measure may be indicative of the possible need for pronunciation variants in the recognition system. Pronunciation lexica have been automatically created so as to include a large number of variants (overgeneration). In particular, lexica with parallel and sequential variants were automatically generated in order to assess the spectral and temporal modeling accuracy. We first investigated the dependence of the aligned variants on the recognizer configuration. Then a cross-lingual study was carried out for read speech in French and American English using the BREF and WSJ corpora. A comparison between read and spontaneous speech was made for French based on alignments of BREF (read) and MASK (spontaneous) data. Comparative alignment results using different acoustic model sets demonstrate the dependency between acoustic model accuracy and the need for pronunciation variants.
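The variant2+ rate defined in this abstract is simple to compute once each corpus token has been aligned to one of its lexicon pronunciations; a small sketch with hypothetical alignment data:

```python
def variant2plus_rate(aligned_tokens, most_common):
    """Percentage of corpus tokens not aligned with the word's most
    common phonemic transcription.

    aligned_tokens: list of (word, chosen_pronunciation) pairs;
    most_common: dict mapping each word to its most common pronunciation.
    """
    off = sum(1 for word, pron in aligned_tokens
              if pron != most_common[word])
    return 100.0 * off / len(aligned_tokens)

# Hypothetical alignment of four tokens of two words:
tokens = [("going", "g ow ih ng"), ("going", "g ow n"),
          ("the", "dh ah"), ("the", "dh ah")]
base = {"going": "g ow ih ng", "the": "dh ah"}
print(variant2plus_rate(tokens, base))  # 25.0
```

The words, pronunciations and alignment here are invented for illustration; in the study the pairs come from forced alignment against the overgenerated lexica.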
The alignment results obtained with the above lexica have been used to study the link between word frequencies and variants using different acoustic model sets.
cbmi99.ps.Z [Content-Based Multimedia Indexing, CBMI'99 Toulouse, Oct 1999 http://www.eurecom.fr/~bmgroup/cbmi99/] --------- Audio partitioning and transcription for broadcast data indexation J.L. Gauvain, L. Lamel, and G. Adda -------------------- This work addresses automatic transcription of television and radio broadcasts. Transcription of such types of data is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Radio and television broadcasts consist of a continuous stream of data comprised of segments of different linguistic and acoustic natures, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. The speaker-independent large vocabulary, continuous speech recognizer makes use of n-gram statistics for language modeling and of continuous density HMMs with Gaussian mixtures for acoustic modeling. The system has consistently obtained top-level performance in DARPA evaluations. An average word error of about 20% has been obtained on 700 hours of unpartitioned unrestricted American English broadcast data. cbmi99-olive.ps.Z [Content-Based Multimedia Indexing, CBMI'99 Toulouse, Oct 1999 http://www.eurecom.fr/~bmgroup/cbmi99/] --------- Olive: Speech Based Video Retrieval Franciska de Jong (TNO/University of Twente), Jean-Luc Gauvain, Jurgen den Hartog (TNO), Klaus Netter (DFKI) -------------------- This paper describes the OLIVE project which aims to support automated indexing of video material by use of human language technologies. OLIVE is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which serve as the basis for text-based retrieval functionality. 
The retrieval demonstrator builds on and extends the architecture of the POP-EYE project, a system applying human language technology to subtitles for the disclosure of video fragments. SFC99-article.ps.Z [Mathematique Informatique et Sciences Humaines, numero 147 (speciale classification), septembre 1999] --------- Classification de mots non étiquetés par des méthodes statistiques Christel Beaujard et Michèle Jardino -------------------- Our goal is to develop robust language models for speech recognition. These models have to predict a word knowing its history.
Despite the increasing size of electronic text data, all the possible word sequences of a language cannot be observed. A way to generate these unencountered word sequences is to map words into classes. Class-based language models have a better coverage of the language with a reduced number of parameters, which helps speed up the speech recognition systems they are used in. Two types of automatic word classification are described. They are trained on word statistics estimated on texts derived from newspapers and transcribed speech. These classifications do not require any tagging; words are classified according to the local context in which they occur. The first is a mapping of the vocabulary words into a fixed number of classes according to a Kullback-Leibler measure. In the second, similar words are clustered into classes whose number is not fixed in advance. This work has been performed with French training data coming from two domains, differing in size and vocabulary. esca99-minker.ps.Z [ESCA Tutorial and Research Workshop on Interactive Dialogue in Multi-Modal Systems, Kloster Irsee, Germany, June 1999] --------- Hidden Understanding Models for Machine Translation W. Minker and M. Gavalda and A. Waibel --------------------- We demonstrate the portability of a stochastic method for understanding natural language from a setting of human-machine interactions (ATIS - Air Travel Information Services and MASK - Multimodal Multimedia Automated Service Kiosk) to the more open one of human-to-human interactions. The application we use is the English Spontaneous Speech Task (ESST) for multilingual appointment scheduling. Spoken language systems developed for this task translate spontaneous conversational speech among different languages. voiceglottalope.ps.Z [Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, Firenze] --------- Glottal open quotient estimation using linear prediction N. Henrich, B. Doval and C.
d'Alessandro --------------------- A new method for the estimation of the voice open quotient is presented. Assuming abrupt glottal closures, the glottal flow waveform is considered as the impulse response of an anticausal two-pole filter. It is defined by four parameters: T0, Av, Oq and αm. The last three are estimated by a second-order linear prediction of the inverse filtered speech. Results on synthetic and natural speech signals are reported and compared with measurements on the corresponding electroglottographic signals. euro99_tuan.ps.Z [EuroSpeech'99 Budapest, Sep'99] --------- Robust glottal closure detection using the wavelet transform Tuan Vu Ngoc and Christophe d'Alessandro --------------------- In this work, a time-scale framework for the analysis of glottal closure instants is proposed. As glottal closure can be soft or sharp, depending on the type of vocal activity, the analysis method should be able to deal with both wide-band and low-pass signals; a multi-scale analysis is thus well suited. The analysis is based on a dyadic wavelet filterbank. The amplitude maxima of the wavelet transform are computed at each scale. These maxima are organized into lines of maximal amplitude (LOMA) using a dynamic programming algorithm. These lines form ``trees'' in the time-scale domain. Glottal closure instants are then interpreted as the top of the strongest branch, or trunk, of these trees. An interesting feature of the LOMA is their amplitude: the LOMA are strong and well organized for voiced speech, and rather weak and widespread for unvoiced speech. The accumulated amplitude along the LOMA gives a very good measure of the degree of voicing.
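The multi-scale idea behind the LOMA method above can be illustrated with a much-simplified sketch: it replaces the dyadic wavelet filterbank and the dynamic-programming linking with derivative magnitudes at dyadic smoothing scales, summed across scales as a crude stand-in for accumulating amplitude along the lines of maximal amplitude; the strongest local maxima of the sum are the candidate closure instants.

```python
import numpy as np

def gci_candidates(signal, scales=(1, 2, 4, 8), top=3):
    """Candidate glottal closure instants from multi-scale analysis.

    At each dyadic scale the signal is smoothed by a moving average
    and differentiated; the derivative magnitudes are summed across
    scales, and the strongest local maxima of the sum are returned.
    """
    acc = np.zeros(len(signal))
    for s in scales:
        kernel = np.ones(s) / s
        smooth = np.convolve(signal, kernel, mode="same")
        acc += np.abs(np.diff(smooth, prepend=smooth[0]))
    # local maxima of the accumulated amplitude, strongest first
    peaks = [i for i in range(1, len(acc) - 1)
             if acc[i] >= acc[i - 1] and acc[i] > acc[i + 1]]
    peaks.sort(key=lambda i: acc[i], reverse=True)
    return sorted(peaks[:top])

# Synthetic signal with abrupt amplitude steps at known instants:
x = np.zeros(300)
for t in (50, 130, 210):
    x[t:] += (-1.0) ** (t // 80)  # steps of alternating sign
print(gci_candidates(x))
```

On real speech the abrupt closures are far less clean, which is exactly why the paper links maxima across scales with dynamic programming instead of summing them pointwise.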
euro99_dial.ps.Z [EuroSpeech'99 Budapest, Sep'99] --------- Design Strategies for Spoken Language Dialog Systems Sophie Rosset, Samir Bennacef and Lori Lamel --------------------- The development of task-oriented spoken language dialog systems requires expertise in multiple domains, including speech recognition, natural spoken language understanding and generation, dialog management and speech synthesis. The dialog manager is the core of a spoken language dialog system, and makes use of multiple knowledge sources. In this contribution we report on our methodology for developing and testing different strategies for dialog management, drawing upon our experience with several travel information tasks. In the LIMSI Arise system for train travel information we have implemented a two-level mixed-initiative dialog strategy, where the user has maximum freedom when all is going well, and the system takes the initiative if problems are detected. The revised dialog strategy and error recovery mechanisms have resulted in a 5-10% increase in dialog success depending upon the word error rate. euro99_hub4.ps.Z [EuroSpeech'99 Budapest, Sep'99] --------- Recent Advances in Transcribing Television and Radio Broadcasts Jean-Luc Gauvain and Lori Lamel and Gilles Adda and Michele Jardino --------------------- Transcription of broadcast news shows (radio and television) is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Broadcast shows are challenging to transcribe as they consist of a continuous data stream with segments of different linguistic and acoustic natures. Transcribing such data requires addressing two main problems: those related to the varied acoustic properties of the signal, and those related to the linguistic properties of the speech. Prior to word transcription, the data is partitioned into homogeneous acoustic segments.
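Partitioning into homogeneous segments, as described above, can be illustrated with a single-feature cartoon of likelihood-based change detection: compare the likelihood of two adjacent windows modeled separately versus jointly (the actual maximum likelihood partitioning works on full cepstral feature vectors with Gaussian mixtures; this one-dimensional Gaussian version is only a sketch).

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of xs under its own maximum-likelihood Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def change_score(left, right):
    """Gain in log-likelihood from modeling the two windows separately."""
    return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(left + right)

# A quiet/loud boundary scores far higher than a quiet/quiet one:
quiet = [0.1, -0.2, 0.05, 0.0, -0.1] * 4
loud = [3.1, 2.8, 3.3, 2.9, 3.0] * 4
print(change_score(quiet, loud) > change_score(quiet[:10], quiet[10:]))  # True
```

Sliding this comparison over the stream and cutting where the score peaks yields candidate segment boundaries, which are then clustered and labeled by gender and bandwidth.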
Non-speech segments are identified and rejected, and the speech segments are clustered and labeled according to bandwidth and gender. The speaker-independent large vocabulary, continuous speech recognizer makes use of n-gram statistics for language modeling and of continuous density HMMs with Gaussian mixtures for acoustic modeling. The LIMSI system has consistently obtained top-level performance in DARPA evaluations, with an overall word transcription error on the Nov98 evaluation test data of 13.6%. The average word error on unrestricted American English broadcast news data is under 20%. ica99arise.ps.Z [ICASSP'99 Phoenix] --------- The LIMSI ARISE System for Train Travel Information Lori Lamel, Sophie Rosset, Jean-Luc Gauvain and Samir Bennacef --------------------- In the context of the LE-3 ARISE project we have been developing a dialog system for vocal access to rail travel information. The system provides schedule information for the main French intercity connections, as well as simulated fares and reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open dialog structure, where the user is free to ask any question or to provide any information at any point in time. In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. In addition to our own assessment, the prototype system undergoes periodic user evaluations carried out by our partners at the French Railways. icassp99_holger.ps.Z [ICASSP'99, Phoenix, March 1999] --------- Using boosting to improve a hybrid HMM/neural network speech recognizer Holger Schwenk --------------------- "Boosting" is a general method for improving the performance of almost any learning algorithm. A recently proposed and very promising boosting algorithm is AdaBoost.
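AdaBoost maintains a weight per training example, raising the weights of examples the current weak classifier gets wrong so that the next one concentrates on them; a generic sketch with one-feature threshold stumps on toy data (not the paper's HMM/neural-network setting):

```python
import math

def train_stump(xs, ys, w):
    """Best threshold/polarity stump on 1-D data under weights w."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds=5):
    """Returns a list of (alpha, threshold, polarity) weak learners."""
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, pol))
        # raise the weight of misclassified examples, lower the rest
        w = [wi * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def predict(model, x):
    s = sum(a * (p if x >= t else -p) for a, t, p in model)
    return 1 if s >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, 1, 1, 1, -1, -1, -1]  # an interval: not separable by one stump
print([predict(adaboost(xs, ys), x) for x in xs])
# [-1, -1, 1, 1, 1, -1, -1, -1]
```

No single stump can classify the interval-shaped labels, but the weighted combination of a few stumps can, which is the effect the paper exploits with neural-network weak learners.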
In this paper we investigate whether AdaBoost can be used to improve a hybrid HMM/neural network continuous speech recognizer. Boosting significantly improves the word error rate from 6.3% to 5.3% on a test set of the OGI Numbers95 corpus, a medium-size continuous numbers recognition task. These results compare favorably with other combining techniques using several different feature representations or additional information from longer time spans. icassp99bref.ps.Z [ICASSP'99, Phoenix, March 1999] --------- Large Vocabulary Speech Recognition in French Martine Adda-Decker, Gilles Adda, Jean-Luc Gauvain, Lori Lamel --------------------- In this contribution we present some design considerations concerning our large vocabulary continuous speech recognition system for French. The impact of the epoch of the text training material on lexical coverage, language model perplexity and recognition performance on newspaper texts is demonstrated. The effectiveness of larger vocabulary sizes and larger text training corpora for language modeling is investigated. As French is a highly inflected language producing large lexical variety and a high homophone rate, the language model contribution is of major importance. About 30% of recognition errors are shown to be due to substitutions between inflected forms of a given root form. When word error rates are analysed as a function of word frequency, a significant increase in the error rate can be measured for frequency ranks above 5000. hub4_99.ps.Z [DARPA'99 Broadcast News Workshop, Herndon, VA, Feb'99] --------- The LIMSI 1998 Hub-4E Transcription System Jean-Luc Gauvain, Lori Lamel, Gilles Adda and Michele Jardino --------------------- In this paper we report on our Nov98 Hub-4E system, which is an extension of our Nov97 system. The LIMSI system for the November 1998 Hub-4E evaluation is a continuous mixture density, tied-state, cross-word, context-dependent HMM system.
The acoustic models were trained on the 1995, 1996 and 1997 official Hub-4E training data containing about 150 hours of transcribed speech material. 65K word language models were obtained by interpolation of backoff n-gram language models trained on different text data sets. Prior to word decoding a maximum likelihood partitioning algorithm segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Word decoding is carried out in three steps, integrating cluster-based MLLR acoustic model adaptation. The final decoding step uses a 4-gram language model interpolated with a category trigram model. The main differences compared to last year's system arise from the use of additional acoustic and language model training data, the use of divisive decision tree clustering instead of agglomerative clustering for state-tying, word graph generation using adapted acoustic models, the use of interpolated LMs trained on different data sets instead of training a single model on weighted texts, and a 4-gram LM interpolated with a category model. The overall word transcription error on the Nov98 evaluation test data was 13.6%. icslp98lid.ps.Z [ICSLP'98, Sydney, Dec'98] --------- Language Identification Incorporating Lexical Information D. Matrouf, M. Adda-Decker, L.F. Lamel, J.L. Gauvain --------------------- In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus.
Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection. icslp98dial.ps.Z [ICSLP'98, Sydney, Dec'98] --------- Evaluation of Dialog Strategies for a Tourist Information Retrieval System L. Devillers, H. Bonneau-Maynard --------------------- In this paper, we describe the evaluation of the dialog management and response generation strategies being developed for retrieval of touristic information, selected as a common domain for the ARC AUPELF-B2 action. Two dialog strategy versions within the same general platform have been implemented. We have investigated qualitative and quantitative criteria for evaluation of these dialog control strategies, in particular by testing the efficiency of our system with and without automatic mechanisms for guiding the user via suggestive prompts. An evaluation phase with 32 subjects has been carried out to assess the utility of guiding the user. We report performance comparisons for naive and experienced subjects and also describe how experimental results can be placed in the PARADISE framework for evaluating dialog systems. The experiments show that user guidance is appropriate for novices and appreciated by all users. icslp98h4.ps.Z [ICSLP'98, Sydney, Dec'98] --------- Partitioning and Transcription of Broadcast News Data Jean-Luc Gauvain, Lori Lamel, Gilles Adda --------------------- Radio and television broadcasts consist of a continuous stream of data comprised of segments of different linguistic and acoustic natures, which poses challenges for transcription. In this paper we report on our recent work in transcribing broadcast news data, including the problem of partitioning the data into homogeneous segments prior to word recognition.
Gaussian mixture models are used to identify speech and non-speech segments. A maximum-likelihood segmentation/clustering process is then applied to the speech segments using GMMs and an agglomerative clustering algorithm. The clustered segments are then labeled according to bandwidth and gender. The recognizer is a continuous mixture density, tied-state cross-word context-dependent HMM system with a 65k trigram language model. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error on the Nov'97 unpartitioned evaluation test data was 18.5%. icslp98mask.ps.Z [ICSLP'98, Sydney, Dec'98] --------- User Evaluation of the MASK Kiosk L. Lamel, S. Bennacef, J.L. Gauvain, H. Dartigues, J.N. Temem --------------------- In this paper we report on a series of user trials carried out to assess the performance and usability of the Mask prototype kiosk. The aim of the ESPRIT Multimodal Multimedia Service Kiosk (MASK) project was to pave the way for more advanced public service applications with user interfaces employing multimodal, multi-media input and output. The prototype kiosk was developed in close collaboration with the French Railways (SNCF) and the Ergonomics group at UCL, after analysis of the technological requirements in the context of users and the tasks they perform in carrying out travel enquiries. The time to complete the transaction with the Mask kiosk is reduced by about 30% compared to that required for the standard kiosk, and the success rate is 85% for novices and 94% once familiar with the system. In addition to meeting or exceeding the performance goals set at the project outset in terms of success rate, transaction time, and user satisfaction, the Mask kiosk was judged to be user-friendly and simple to use.
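The partitioning described in the broadcast news abstracts above rests on comparing model likelihoods frame by frame. A minimal sketch, assuming made-up single-Gaussian models over toy 2-dimensional features (real systems use mixtures of many Gaussians over cepstral coefficients, with trained parameters):

```python
import math

# Hedged sketch: frame-level speech/non-speech labeling by comparing
# log-likelihoods under two diagonal Gaussian models, in the spirit of
# the GMM-based partitioning described above. The means and variances
# below are illustrative values, not trained parameters.

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of a feature vector under a diagonal Gaussian."""
    ll = 0.0
    for xi, mi, vi in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll

def label_frames(frames, speech_model, nonspeech_model):
    """Return 'speech'/'nonspeech' per frame; non-speech is rejected."""
    labels = []
    for x in frames:
        ll_s = diag_gauss_loglik(x, *speech_model)
        ll_n = diag_gauss_loglik(x, *nonspeech_model)
        labels.append("speech" if ll_s > ll_n else "nonspeech")
    return labels

# Illustrative 2-D "features": high energy ~ speech, low energy ~ silence.
speech_model = ([5.0, 1.0], [1.0, 1.0])      # (mean, variance) - made up
nonspeech_model = ([0.0, 0.0], [1.0, 1.0])
frames = [[5.2, 0.9], [0.1, -0.2], [4.8, 1.1]]
print(label_frames(frames, speech_model, nonspeech_model))
# -> ['speech', 'nonspeech', 'speech']
```

In the systems above this decision is made over segments rather than single frames, and the speech segments are then further clustered and labeled by bandwidth and gender.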
issd98.ps.Z [ISSD'98, Sydney, Nov'98] --------- Spoken Language Dialog System Development and Evaluation at LIMSI Lori Lamel --------------------- The development of natural spoken language dialog systems requires expertise in multiple domains, including speech recognition, natural spoken language understanding and generation, dialog management and speech synthesis. In this paper I report on our experience at LIMSI in the design, development and evaluation of spoken language dialog systems for information retrieval tasks. Drawing upon our experience in this area, I attempt to highlight some aspects of the design process, such as the use of general and task-specific knowledge sources, the need for an iterative development cycle, and some of the difficulties related to evaluation of development progress. tts98atr.ps.Z [3rd Int'l Workshop on Speech Synthesis, Jenolan Caves, Nov'98] --------- Voice quality modification using periodic-aperiodic decomposition and spectral processing of the voice source signal C. d'Alessandro and B. Doval --------------------- Voice quality is currently a key issue in speech synthesis research. The lack of realistic intra-speaker voice quality variation is an important source of concern for concatenation-based synthesis methods. A challenging problem is to reproduce the voice quality changes that occur in natural speech when the vocal effort varies. A new method for voice quality modification is presented. It takes advantage of a spectral theory for voice source signal representation. An algorithm based on periodic-aperiodic decomposition and spectral processing (using the short-term Fourier transform) is described. The use of adaptive inverse filtering in this framework is also discussed. Applications of this algorithm may include: pre-processing of speech corpora, modification of voice quality parameters together with intonation in synthesis, and voice transformation.
Some experiments are reported, showing convincing voice quality modifications for various speakers. tts98b3.ps.Z [3rd Int'l Workshop on Speech Synthesis, Jenolan Caves, Nov'98] --------- Joint evaluation of Text-To-Speech synthesis in French within the AUPELF ARC-B3 project C. d'Alessandro and B3 Partners --------------------- A joint international evaluation of Text-To-Speech synthesis (TTS) systems is being conducted for the French language. This project involves eight laboratories of French-speaking countries (Belgium, Canada, France and Switzerland), and is funded by AUPELF (Association of French Speaking Universities). The results obtained after 2 years of work are presented in this paper. The project is split into 4 tasks: 1/ evaluation of grapheme-to-phoneme conversion; 2/ evaluation of prosody; 3/ evaluation of segment concatenation/modification; 4/ global system evaluation. Grapheme-phoneme conversion evaluation has now been completed, and both methodological issues and results for the eight systems are presented at the workshop. For prosody evaluation, the problem is to study several systems that incorporate different linguistic analyses, different prosodic systems, and different intonation and rhythmic models. Using the same phonemic input and the same concatenation/modification system with the same diphones, it will be possible to assess prosodic quality independently of the other modules. Perceptual evaluation is used at this stage. The degradation introduced by concatenation/modification systems and by the quality of segment databases will be studied using perceptual tests. Evaluation of the global systems will also be performed, using both intelligibility and agreement measures. Finally, one of the aims of this project is to make available corpora and evaluation paradigms that may be reused in future research. This will enable a quantitative analysis of the results obtained, and a measurement of the progress achieved for each specific system.
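Grapheme-to-phoneme evaluation of the kind described above is typically scored by aligning the hypothesized phoneme string against a reference with minimum edit distance. A generic illustration (not the ARC-B3 scoring tool), using a hypothetical SAMPA-like transcription:

```python
# Hedged sketch: phoneme error rate via Levenshtein alignment
# (substitutions, insertions, deletions), a standard way to score
# grapheme-to-phoneme conversion. The transcription below is an
# illustrative example, not from the evaluation corpora.

def edit_distance(ref, hyp):
    """Minimum edit distance between two phoneme sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)]

def phoneme_error_rate(ref, hyp):
    """Errors normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# "bonjour" /b o~ Z u R/ against a hypothesis with one substitution:
ref = ["b", "o~", "Z", "u", "R"]
hyp = ["b", "o~", "z", "u", "R"]
print(phoneme_error_rate(ref, hyp))  # 1 error over 5 phonemes -> 0.2
```

The same alignment underlies the word and phone error rates quoted throughout these abstracts.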
tts98BdM.ps.Z [3rd Int'l Workshop on Speech Synthesis, Jenolan Caves, Nov'98] --------- Text chunking for prosodic phrasing in French P. Boula de Mareuil and C. d'Alessandro --------------------- In this paper, we describe experiments in text chunking for prosodic phrasing and generation in French. We present a quick, robust and deterministic parser which uses part-of-speech information and a set of rules to consistently assign prosodic boundaries in Text-To-Speech synthesis. The syntactic phrasing, consisting of segmenting sentences into non-recursive sequences, is defined in terms of sets of possible categories. The syntax-prosody interface is presented: the sequences enable the location of potential prosodic boundaries (minor, major or intermediate). The first results are given, including a listening test which demonstrated the advantage of our chunk grammar over a simpler approach based on function words and punctuation. Quantitative measures are made on the chunks defined between boundaries. Our model, which prefers structural criteria to probabilities and an approach by intension rather than by extension, is also compared with other models. specom98.ps.Z [SPECOM'98, St. Petersburg] --------- Dialog Strategies in a Tourist Information Spoken Dialog System H. Bonneau-Maynard, L. Devillers --------------------- We describe in this paper the dialog control strategy being developed for the tourist information retrieval system, PARIS-SITI, in the context of the ARC AUPELF-B2 action on spoken dialog system evaluation. The dialog control strategy has been evaluated by testing the efficiency of two systems, with and without automatic mechanisms for guiding the user via suggestive prompts, with both novice and experienced subjects on a set of 4 scenarios performed by each speaker with both systems.
Qualitative and quantitative criteria such as user satisfaction, understanding and recognition error rates, percentage of followed prompts and scenario success are discussed, demonstrating the benefit of the proposed strategy. ivtta98.ps.Z [IVTTA'98, Torino, Sep'98] --------- The LIMSI ARISE System L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet, B. Prouts --------------------- The LIMSI Arise system provides vocal access to rail travel information for main French intercity connections, including timetables, simulated fares, reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open structure, where the user is free to ask any question or to provide any information at any point in time. In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. The same system architecture is being used to develop a French/English prototype timetable service for the high speed trains between Paris and London. asaica98.ps.Z [ASA/ICA'98, Seattle, June'98] --------- Recent Activities in Spoken Language Processing at LIMSI J.L. Gauvain and L. Lamel --------------------- This paper summarizes recent activities at LIMSI in multilingual speech recognition and its applications. While the main goal of speech recognition is to provide a transcription of the speech signal as a sequence of words, the same basic technology serves as the first step in other application areas, such as automatic systems for information access and automatic indexation of audiovisual data.
sfc98.ps.Z [SFC'98, Montpellier] --------- Classification de mots non étiquetés par des méthodes statistiques Christel Beaujard and Michèle Jardino -------------------- We describe two types of word classification, learned statistically from written newspaper texts and speech transcriptions. These classifications do not require word tagging; they are carried out according to the local contexts in which the words are observed. One is based on the Kullback-Leibler distance and distributes all words into a number of classes fixed in advance. The second groups words considered similar into a number of classes that is not predefined. This study was carried out on training data differing in domain, size and vocabulary. tagworkshop98.ps.Z [TAG+ Workshop, Philadelphia, July'98] --------- An improved Earley parser with LTAG Y. de Kercadio --------------------- This paper presents an adaptation of the Earley algorithm to enable parsing with lexicalized tree-adjoining grammars (LTAGs). Each elementary tree is represented by a rewrite rule, so that the adapted version of the Earley algorithm can be used for parsing. This algorithm preserves the valid prefix property. Taking advantage of the top-down strategy and the valid prefix property, improvements in parsing are obtained by using the local span of the trees and some linguistic properties. lrec98minker.ps.Z [LREC'98, Granada, May'98] --------- Evaluation Methodologies for Interactive Speech Systems W. Minker --------------------- In this paper, several criteria and paradigms are described to measure the performance of spoken language systems. The focus is on the evaluation of natural language understanding components. These evaluations are carried out in the domain of spontaneous human-human interaction as supported by automatic translation systems.
They are also applied in the domain of spontaneous human-machine interaction typically used in information retrieval applications. Some system response evaluation paradigms for different applications and domains are discussed in more detail. It is also shown that official performance tests and site-specific evaluation paradigms are complementary in use. lrec98W_minker.ps.Z [LREC'98 Workshop on "The Evaluation of Parsing Systems", Granada, May'98] --------- Evaluating Parses for Spoken Language Dialogue Systems W. Minker and L. Chase --------------------- lrec98ideal.ps.Z [LREC'98, Granada, May'98] --------- A Multilingual Corpus for Language Identification L. Lamel, G. Adda, M. Adda-Decker, C. Corredor-Ardoy, J.J. Gangolf, J.L. Gauvain --------------------- In this paper we describe the design, recording, and transcription of a large, multilingual (French, English, German and Spanish) corpus of telephone speech for research in automatic language identification. The corpus contains over 250 calls from native speakers of each language from their home country, and an additional 50 calls per language from another country. Although the same recording protocol was used for all languages, slight modifications were necessary to account for language or country specificities.
Issues in designing comparable corpora in different languages are addressed, including how to interact with callers so as to obtain the desired responses. pron98.ps.Z [ESCA ETRW on Modeling Pronunciation Variants for ASR, Rolduc, May'98] --------- Pronunciation Variants Across Systems, Languages and Speaking Style Martine Adda-Decker and Lori Lamel --------------------- This contribution aims at evaluating the use of pronunciation variants across different system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription and a phonemically represented lexicon, thus focusing on the modeling abilities of the acoustic word models. Parallel and sequential variants are tested in order to measure the spectral and temporal modeling accuracy. As a preliminary step we investigated the dependence of the aligned variants on the recognizer configuration. A cross-lingual study was carried out for read speech in French and American English using the BREF and the WSJ corpora. A comparison between read and spontaneous speech is presented for French based on alignments from BREF (read) and MASK (spontaneous) data. ica98phrec.ps.Z [ICASSP'98, Seattle, WA, May'98] --------- Multi-lingual Phone Recognition on Telephone Spontaneous Speech C. Corredor-Ardoy, L. Lamel, M. Adda-Decker, J.L. Gauvain --------------------- In this paper we report phone accuracies on telephone spontaneous speech. Language-dependent phone recognizers were trained and assessed on IDEAL, a multi-lingual corpus containing telephone speech in French, British English, German and Castilian Spanish. We discuss phone recognition using context-independent Hidden Markov Models (CIHMM) and back-off phonotactic bigram models built on corpora with different sizes and speech materials.
Although the training corpus containing spontaneous speech includes only 14% of all available data, it represents the best choice for the task. Phone recognition was improved using context-dependent (CD) phone units: average error rates on the 4 languages of 51.9% with CDHMM compared with 57.4% with CIHMM. We also suggest a straightforward way to detect non-speech phenomena. The basic idea is to remove sequences of any consonants between two silence labels from the recognized phone strings prior to scoring. This simple technique reduces the phone error rate by 5.4% relative (average on the 4 languages). rla2c98.ps.Z [RLA2C'98, Avignon, April'98] --------- Speaker Verification Over the Telephone Lori Lamel and Jean-Luc Gauvain --------------------- In this paper we present a study on speaker verification using telephone speech and for two operational modes, i.e. text-dependent and text-independent speaker verification. A statistical modeling approach is taken, where for text-independent verification the talker is viewed as a source of phones, modeled by a fully connected Markov chain, and for text-dependent verification a left-to-right HMM is built by concatenating the phone models corresponding to the transcription. A series of experiments was carried out on a large telephone corpus recorded specifically for speaker verification algorithm development, assessing performance as a function of the type and amount of data used for training and for verification. Experimental results are presented for both read and spontaneous speech. On this data, the lowest equal error rate is 1% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 1.5s of speech per trial.
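The non-speech filtering heuristic mentioned in the phone recognition abstract above can be sketched directly. The phone symbols and the consonant set below are illustrative, not the actual IDEAL phone inventory:

```python
# Hedged sketch: drop any run consisting only of consonants when it is
# bracketed by two silence labels in the recognized phone string, keeping
# a single silence, as described in the ica98phrec abstract.

CONSONANTS = {"p", "t", "k", "b", "d", "g", "f", "s", "S", "m", "n", "l", "R"}
SIL = "sil"

def filter_nonspeech(phones):
    """Remove consonant-only runs bracketed by silence labels."""
    out = []
    i = 0
    while i < len(phones):
        if phones[i] == SIL:
            j = i + 1
            while j < len(phones) and phones[j] in CONSONANTS:
                j += 1
            if j < len(phones) and phones[j] == SIL and j > i + 1:
                # drop this silence and the consonant run; the closing
                # silence is handled on the next iteration, so chained
                # runs collapse to a single silence
                i = j
                continue
        out.append(phones[i])
        i += 1
    return out

# "sil t k sil b a sil" -> the spurious "t k" run is removed:
print(filter_nonspeech(["sil", "t", "k", "sil", "b", "a", "sil"]))
# -> ['sil', 'b', 'a', 'sil']
```

Consonant runs followed by a vowel (real speech) are left untouched; only runs sandwiched between silences, which typically correspond to noise, are discarded before scoring.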
hub4_98.ps.Z [DARPA'98 Broadcast News Transcription and Understanding, Landsdowne, VA, Feb'98] --------- The LIMSI 1997 Hub-4E Transcription System Jean-Luc Gauvain, Lori Lamel, Gilles Adda --------------------- In this paper we report on the LIMSI system used in the Nov'97 Hub-4E benchmark test on transcription of American English broadcast news shows. There are two main differences from the LIMSI system developed for the Nov'96 evaluation. The first concerns the preprocessing stages for partitioning the data, and the second concerns a reduction in the number of acoustic model sets used to deal with the various acoustic signal characteristics. The LIMSI system for the November 1997 Hub-4E evaluation is a continuous mixture density, tied-state cross-word context-dependent HMM system. The acoustic models were trained on the 1995 and 1996 official Hub-4E training data containing about 80 hours of transcribed speech material. The 65K word trigram language models are trained on 155 million words of newspaper texts and 132 million words of broadcast news transcriptions. The test data is segmented and labeled using Gaussian mixture models, and non-speech segments are rejected. The speech segments are classified as telephone or wide-band, and according to gender. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error of the Nov'97 unpartitioned evaluation test data was 18.5%. 
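Several of the broadcast news systems above combine language models estimated on different text sources (e.g. newspaper texts and broadcast news transcriptions) by linear interpolation. A minimal sketch, with made-up probabilities and weights rather than trained values:

```python
# Hedged sketch: linear interpolation of n-gram probabilities from
# several component language models. The component probabilities and
# the interpolation weights below are illustrative, not estimated.

def interpolate(p_models, weights):
    """Weighted sum of the same n-gram probability under several LMs."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * p for w, p in zip(weights, p_models))

# P(word | history) under two hypothetical component models:
p_news = 0.012   # newspaper-text LM (made-up value)
p_bn = 0.020     # broadcast-news-transcription LM (made-up value)
p = interpolate([p_news, p_bn], [0.6, 0.4])
print(round(p, 4))  # 0.6*0.012 + 0.4*0.020 = 0.0152
```

In practice the weights are chosen to minimize perplexity on held-out data; interpolating separately trained models is the alternative to training a single model on weighted texts mentioned in the Hub-4E abstracts.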
convsys98.ps.Z [DARPA'98 Broadcast News Transcription and Understanding, Landsdowne, VA, Feb'98] --------- An Overview of EU Programs Related to Conversational/Interactive Systems Joseph Mariani and Lori Lamel --------------------- Since 1984, the European Commission has supported projects related to spoken language processing within various programs. The projects address various aspects of spoken language processing including basic research, technology development, infrastructure (language resources and evaluation) and applications in different areas (telecommunications, language training, office systems, games, cars, information access and retrieval, banking, aids for the handicapped, etc.). The next program, the 5th Framework Program, is now being prepared, and should unite the activities in the area of Human Language Technology and its components in a single sector. The three sub-areas which have been identified are "Multilinguality", "Natural Interactivity" and "Active Content". Several on-going projects are addressing the design of conversational systems. Three example applications are: the ESPRIT Mask project, which aims to develop a Multimodal-Multimedia Automated Service Kiosk providing access to rail travel information; the LE Arise project, which aims to provide similar information via the telephone for the Dutch, French, and Italian railways; and the LE Vodis project, whose goal is to develop a vocal interface to enhance the usability and functionality of in-car driver assistance and information services. spc97intro.ps.Z [Speech Communication, Vol 23, (1-2):63-65, Oct 1997] --------- RailTel: Railway Telephone Services R. Billi (CSELT), L.F. Lamel --------------------- The aim of the LE-MLAP project RailTel (Railway Telephone Information Service) was to evaluate the technical adequacy of available speech technology for interactive telephone services, taking into account the requirements of the service providers and the users.
In particular the consortium evaluated the specifications and potential for vocal access to rail travel information. The project, coordinated by CSELT (Italy), included two user organizations: British Rail/British Systems (U.K.) and FS (Italy). The French railways (SNCF) was a partner in the MAIS project (the RailTel project worked in close cooperation with the LE-MLAP project MAIS, which addressed the same task for a different geographical area), but contributed to the RailTel specification of user and application requirements, and the evaluation methodology. Three research centers, CCIR (U.K.), CSELT and LIMSI-CNRS (France), provided the spoken language technologies, and service providers Saritel and LTV (Italy) contributed expertise concerning telephone services and management of railway timetable databases. spc97ivtta.ps.Z [Speech Communication, Vol 23, (1-2):67-82, Oct 1997] --------- The LIMSI RailTel System: Field Trials of a Telephone Service for Rail Travel Information L. Lamel, S.K. Bennacef, S. Rosset, L. Devillers, S. Foukia, J.J. Gangolf, J.L. Gauvain --------------------- This paper describes the RailTel system developed at LIMSI to provide vocal access to static train timetable information in French, and a field trial carried out to assess the technical adequacy of available speech technology for interactive services. The data collection system used to carry out the field trials is based on the LIMSI Mask spoken language system and runs on a Unix workstation with a high quality telephone interface. The spoken language system allows a mixed-initiative dialog where the user can provide any information at any point in time. Experienced users are thus able to provide all the information needed for database access in a single sentence, whereas less experienced users tend to provide shorter responses, allowing the system to guide them. The RailTel field trial was carried out using a common methodology defined by the consortium.
100 naive subjects participated in the field trials, each calling the system once and completing a user questionnaire. 72% of the callers successfully completed their scenario. The subjective assessment of the prototype was for the most part favorable, with subjects expressing an interest in using such a service. larynx97.ps.Z [ESCA Workshop Larynx 97, Marseille June'97] --------- Spectral representation and modelling of glottal flow signals C. d'Alessandro and B. Doval --------------------- We present at this workshop our recent work on modeling the glottal flow signal. A signal model comprising a periodic component (related to the quasi-periodic vibration of the vocal folds) and an aperiodic component (related to aspiration noise, frication noise and irregularities in the vibration of the vocal folds) is considered. A periodic-aperiodic decomposition algorithm is presented, together with its evaluation. A linear spectral model of the periodic glottal flow is proposed, and its usefulness for estimating the perceptually relevant source parameters is shown. This work, originally conceived for speech synthesis, can also be applied to other domains, such as the study of voice pathologies or of the singing voice. [The article is in English.]
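The periodic-aperiodic decomposition underlying the glottal-flow work above can be illustrated very crudely: given a known pitch period, average the waveform over periods to estimate the periodic component and take the residual as the aperiodic component. This is only a toy illustration of the idea, not the algorithm of the paper:

```python
# Hedged sketch: split a signal into periodic and aperiodic parts by
# period-synchronous averaging, assuming a known, constant pitch period
# T in samples. Real decomposition algorithms handle varying pitch and
# work in the spectral domain; this is illustrative only.

def periodic_aperiodic(signal, T):
    """Return (periodic, aperiodic) components for period length T."""
    n_periods = len(signal) // T
    # average waveform over the complete periods
    template = [
        sum(signal[k * T + t] for k in range(n_periods)) / n_periods
        for t in range(T)
    ]
    periodic = [template[t % T] for t in range(len(signal))]
    aperiodic = [s - p for s, p in zip(signal, periodic)]
    return periodic, aperiodic

# A perfectly periodic toy signal leaves a zero aperiodic residual:
sig = [0.0, 1.0, 0.0, -1.0] * 3
periodic, aperiodic = periodic_aperiodic(sig, 4)
print(max(abs(a) for a in aperiodic))  # -> 0.0
```

For a real voice signal the residual would capture aspiration and frication noise and vibration irregularities, which is exactly the aperiodic component these papers model separately.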
specom97.ps.Z [SPECOM'97, Cluj-Napoca, Roumanie] --------- Evaluation of a class-based language model in a speech recognizer Christel Beaujard, Michèle Jardino and Hélène Maynard -------------------- We present a novel corpus-based clustering algorithm which takes into account a word-context similarity criterion between words of a training text. This work has been done with a database recorded at LIMSI, consisting of rail travel information requests. The induced class-based model has been evaluated and compared with prior models. Each model has been integrated in the LIMSI speech recognizer. The new similarity class-based model gives the best performance in terms of both perplexity and recognition rates. euro97doval.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Spectral methods for voice source parameter estimation B. Doval, C. d'Alessandro, B. Diard --------------------- A spectral approach is proposed for voice source parameter representation and estimation. Parameter estimation is based on decomposition of the periodic and the aperiodic components of the speech signal, and on spectral modelling of the periodic component. The paper focuses on parameter estimation for the periodic component of the glottal flow. A new anticausal all-pole model of the glottal flow is derived. Glottal flow is seen as an anticausal 2-pole filter followed by a spectral tilt filter. The anticausal filter has complex poles, instead of the real poles that are usually assumed. Time-domain and frequency-domain parameters are linked by analytic formulas. Two spectral-domain algorithms are proposed for estimation of the open quotient. The first one is based on measurement of the first harmonics, and the second one is based on spectral modelling. Experimental results demonstrate the accuracy of the estimation procedures. euro97h4.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Transcription of Broadcast News J.L. Gauvain, L. Lamel, G. Adda, M.
Adda-Decker --------------------- In this paper we report on our recent work in transcribing broadcast news shows. Radio and television broadcasts contain signal segments of various linguistic and acoustic natures. The shows contain both prepared and spontaneous speech. The signal may be studio quality or have been transmitted over a telephone or other noisy channel (i.e., corrupted by additive noise and nonlinear distortions), or may contain speech over music. Transcription of this type of data poses challenges in dealing with the continuous stream of data under varying conditions. Our approach to this problem is to segment the data into a set of categories, which are then processed with category-specific acoustic models. We describe our 65k speech recognizer and experiments using different sets of acoustic models for transcription of broadcast news data. The use of prior knowledge of the segment boundaries and types is shown not to crucially affect performance. euro97lid.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Language Identification with Language-Independent Acoustic Models C. Corredor Ardoy, J.L. Gauvain, M. Adda-Decker, L. Lamel --------------------- In this paper we explore the use of language-independent acoustic models for language identification (LID). The phone sequence output by a single language-independent phone recognizer is rescored with language-dependent phonotactic models approximated by phone bigrams. The language-independent phoneme inventory was obtained by Agglomerative Hierarchical Clustering, using a measure of similarity between phones. This system is compared with a parallel language-dependent phone architecture, which makes optimal use of the acoustic log-likelihood and the phonotactic score for language identification. Experiments were carried out on the 4-language telephone speech corpus IDEAL, containing calls in British English, Spanish, French and German.
Results show that the language-independent approach performs as well as the language-dependent one: a 9% versus 10% error rate on 10-second chunks for the 4-language task. euro97bref.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Text Normalization and Speech Recognition in French G. Adda, M. Adda-Decker, J.L. Gauvain, L. Lamel --------------------- In this paper we present a quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French. The text normalization process defines what is considered to be a word by the recognition system. Depending on this definition we can measure different lexical coverages and language model perplexities, both of which are closely related to the speech recognition accuracies obtained on read newspaper texts. Different text normalizations of up to 185M words of newspaper texts are presented along with corresponding lexical coverage and perplexity measures. Some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard. The choice of normalization to create language models for use in the recognition experiments with read newspaper texts was based on these findings. Our best system configuration obtained an 11.2% word error rate in the AUPELF `French-speaking' speech recognizer evaluation test held in February 1997. euro97minker.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Stochastically-Based Natural Language Understanding Across Tasks and Languages W. Minker --------------------- A stochastically-based method for natural language understanding has been ported from the American ATIS (Air Travel Information Services) task to the French MASK (Multimodal-Multimedia Automated Service Kiosk) task. The porting was carried out by designing and annotating a corpus of semantic representations via semi-automatic iterative labeling.
The study shows that domain and language porting is rather flexible, since it is sufficient to train the system on data sets specific to the application and language. A limiting factor of the current implementation is the quality of the semantic representation and the use of query preprocessing strategies, both of which depend strongly on human intervention. The performance of the stochastically-based method is compared with that of a rule-based method on both tasks. darpa97hub4.ps.Z [ARPA Speech Recognition Workshop '97, Chantilly, VA, Feb'97] --------- Transcribing Broadcast News: The LIMSI Nov96 Hub4 System J.L. Gauvain, G. Adda, L. Lamel, M. Adda-Decker --------------------- In this paper we report on the LIMSI Nov96 Hub4 system for transcription of broadcast news shows. We describe the development work in moving from laboratory read speech data to real-world speech data in order to build a system for the ARPA Nov96 evaluation. Two main problems were addressed to deal with the continuous flow of inhomogeneous data: the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and the different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The speech recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on large text corpora. The base acoustic models were trained on the WSJ0/WSJ1 corpus, and adapted using MAP estimation with 35 hours of transcribed task-specific training data. The 65k language models are trained on 160 million words of newspaper texts and 132 million words of broadcast news transcriptions. The problem of segmenting the continuous stream of data was investigated using 10 MarketPlace shows. The overall word transcription error on the Nov96 partitioned evaluation test data was 27.1%.
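The phonotactic rescoring described in the euro97lid abstract above can be sketched in a few lines: a single phone recognizer produces one phone string, per-language phone-bigram models score it, and the best-scoring language is hypothesized. This is a minimal illustration under invented data, not the paper's implementation: the phone strings, inventories, and add-alpha smoothing here are all made up, and a real system would first decode the phones with language-independent acoustic models.

```python
import math
from collections import defaultdict

def train_phone_bigram(phone_sequences, alpha=1.0):
    """Estimate an add-alpha smoothed phone-bigram model; returns a log-prob function."""
    bigram = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        seq = ["<s>"] + seq
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigram[prev][cur] += 1.0
    n_vocab = len(vocab)

    def logprob(prev, cur):
        counts = bigram[prev]
        total = sum(counts.values())
        return math.log((counts[cur] + alpha) / (total + alpha * n_vocab))
    return logprob

def identify_language(phone_seq, models):
    """Rescore one decoded phone string with each language's phonotactic model."""
    seq = ["<s>"] + phone_seq
    scores = {lang: sum(lp(p, c) for p, c in zip(seq, seq[1:]))
              for lang, lp in models.items()}
    return max(scores, key=scores.get)

# toy training strings (invented pseudo-phone transcriptions)
models = {
    "french":  train_phone_bigram([["b", "o~", "Z", "u", "R"], ["s", "a", "v", "a"]]),
    "english": train_phone_bigram([["h", "@", "l", "oU"], ["g", "U", "d"]]),
}
print(identify_language(["b", "o~", "Z", "u", "R"], models))
```

The argmax over summed bigram log-probabilities plays the role of the phonotactic score; the paper's parallel architecture additionally weighs in the acoustic log likelihood.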
ica97doval.ps.Z [ICASSP'97, Munich, April'97] --------- Spectral Correlates of Glottal Waveform Models: An Analytic Study B. Doval and C. d'Alessandro --------------------- This paper deals with spectral representation of the glottal flow. The LF and the KLGLOTT88 models of the glottal flow are studied. In a first part, we compute analytically the spectrum of the LF-model. Formulas are then given for computing spectral tilt and the amplitudes of the first harmonics as functions of the LF-model parameters. In a second part we consider the spectrum of the KLGLOTT88 model. It is shown that this model can be represented in the spectral domain by an all-pole third-order linear filter. Moreover, the anticausal impulse response of this filter is a good approximation of the glottal flow model. Parameter estimation seems easier in the spectral domain. Therefore our results can be used for modification of the (hidden) glottal flow characteristics of natural speech signals, by processing the spectrum directly, without needing time-domain parameter estimation. ica97hub4.ps.Z [ICASSP'97, Munich, April'97] --------- Transcribing Broadcast News Shows J.L. Gauvain, G. Adda, L. Lamel, M. Adda-Decker --------------------- While significant improvements have been made over the last 5 years in large vocabulary continuous speech recognition of large read-speech corpora such as the ARPA Wall Street Journal-based CSR corpus (WSJ) for American English and the BREF corpus for French, these tasks remain relatively artificial. In this paper we report on our development work in moving from laboratory read speech data to real-world speech data in order to build a system for the new ARPA broadcast news transcription task. The LIMSI Nov96 speech recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts. The acoustic models are trained on the WSJ0/WSJ1 corpora, and adapted using MAP estimation with task-specific training data.
The overall word error on the Nov96 partitioned evaluation test was 27.1%. ica97sv.ps.Z [ICASSP'97, Munich, April'97] --------- Speaker Recognition with the Switchboard corpus Lori Lamel and Jean-Luc Gauvain --------------------- In this paper we present our development work carried out in preparation for the March'96 speaker recognition test on the Switchboard corpus organized by NIST. The speaker verification system evaluated was a Gaussian mixture model. We provide experimental results on the development test and evaluation test data, and some experiments carried out since the evaluation comparing the GMM with a phone-based approach. Better performance is obtained by training on data from multiple sessions and with different handsets. High error rates are obtained even using a phone-based approach, both with and without the use of orthographic transcriptions of the training data. We also describe a human perceptual test carried out on a subset of the development data, which demonstrates the difficulty human listeners had with this task. ica97matrouf.ps.Z [ICASSP'97, Munich, April'97] --------- Model Compensation for Noises in Training and Test Data Driss Matrouf and Jean-Luc Gauvain --------------------- It is well known that the performance of speech recognition systems degrades rapidly as the mismatch between the training and test conditions increases. Approaches to compensate for this mismatch generally assume that the training data is noise-free and the test data is noisy. In practice, this assumption is seldom correct. In this paper, we propose an iterative technique to compensate for noise in both the training and test data. The adopted approach compensates the speech model parameters using the noise present in the test data, and compensates the test data frames using the noise present in the training data. The training and test data are assumed to come from different and unknown microphones and acoustic environments.
The benefit of such a compensation scheme has been assessed on the MASK task using a continuous density HMM-based speech recognizer. Experimental results show the advantage of compensating for both test and training noises. jst97corpusB2.ps.Z [JST'97, Avignon, April'97] --------- A Spoken Corpus of Tourist Information S. Rosset, L. Lamel, S. Bennacef, L. Devillers, J.L. Gauvain --------------------- This paper describes LIMSI's activities in acquiring human-machine dialog corpora for a tourist information task in the context of the AUPELF B2 action: Oral Dialog. Our experience in developing understanding systems for train and air travel fare and timetable information tasks (ATIS and Mask) has taught us that it is advantageous to put subjects in front of a system as early as possible. This makes it possible to iteratively improve the system's performance and functionality based on the interactions between the subjects and the system. A first spoken language understanding module has been developed (case-grammar analysis, with restricted dialog and response generation components for the first phase). This module was used to acquire a first corpus of written utterances. These utterances were used to build an initial lexicon and language model for a complete dialog system, which is currently being used to acquire a spoken corpus. ieice96.ps.Z [Institute of Electronics, Information and Communication Engineers, J79-D-II:2005-2021, Dec. 1999] --------- Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications J.L. Gauvain and L.
Lamel --------------------- This paper provides an overview of the state of the art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited-domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research. arpa96nab.ps.Z [ARPA CSR'96, Arden House, Harriman, NY, Feb'96] --------- The LIMSI 1995 Hub3 System J.L. Gauvain, L. Lamel, G. Adda, D. Matrouf --------------------- In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News Hub 3 benchmark test. The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs.
In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. In contrast to previous evaluations, the new Hub 3 test aimed at improving basic SI, CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). On the Sennheiser microphone (average SNR 29dB) a word error of 9.1% was obtained, which can be compared to 17.5% on the secondary microphone data (average SNR 15dB) using the same recognition system. ica96nab.ps.Z [ICASSP'96, Atlanta, May'96] --------- Developments in Continuous Speech Dictation using the 1995 ARPA NAB News Task J.L. Gauvain, L. Lamel, G. Adda, D. Matrouf --------------------- In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News benchmark test. In contrast to previous evaluations, the new Hub 3 test aims at improving basic SI, CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method.
On the Sennheiser microphone (average SNR 29dB) a word error of 9.1% was obtained, which can be compared to 17.5% on the secondary microphone data (average SNR 15dB) using the same recognition system. ica96ger.ps.Z [ICASSP'96, Atlanta, May'96] --------- Developments in Large Vocabulary, Continuous Speech Recognition of German M. Adda-Decker, G. Adda, L. Lamel, J.L. Gauvain --------------------- In this paper we describe our large vocabulary continuous speech recognition system for the German language, the development of which was partly carried out within the context of the European LRE project 62-058 SQALE. The recognition system is the LIMSI recognizer originally developed for French and American English, which has been adapted to German. Specificities of German, as relevant to the recognition system, are presented. These specificities have been accounted for during the recognizer's adaptation process. We present experimental results on a first test set ger-dev95 to measure progress in system development. Results are given with the final system using different acoustic model sets on two test sets, ger-dev95 and ger-eval95. This system achieved a word error rate of 17.3% (official word error rate of 16.1% after the SQALE adjudication process) on the ger-eval95 test set. ica96jardino.ps.Z [ICASSP'96, Atlanta, May'96] --------- Multilingual stochastic n-gram class language models Michele Jardino --------------------- Stochastic language models are widely used in continuous speech recognition systems where a priori probabilities of word sequences are needed. These probabilities are usually given by n-gram word models, estimated on very large training texts. When n increases, it becomes harder to find reliable statistics, even with huge texts. Grouping words into classes is a way to overcome this problem.
We have developed an automatic, language-independent classification procedure, which is able to optimize the classification of tens of millions of untagged words in less than a few hours on a Unix workstation. With this language-independent approach, three corpora each containing about 30 million words of newspaper texts, in French, German and English, have been mapped into different numbers of classes. From these classifications, bigram and trigram class language models have been built. The perplexities of held-out test texts have been assessed, showing that trigram class models give lower values than those obtained with trigram word models, for all three languages. ivtta96.ps.Z [IVTTA'96, BaskingRidge, Sep'96] --------- Field Trials of a Telephone Service for Rail Travel Information L.F. Lamel, J.L. Gauvain, S.K. Bennacef, L. Devillers, S. Foukia, J.J. Gangolf, S. Rosset -------------------- This paper reports on the RAILTEL field trial carried out by LIMSI to assess the technical adequacy of available speech technology for interactive vocal access to static train timetable information. The data collection system used to carry out the field trials is based on the LIMSI MASK spoken language system and runs on a Unix workstation with a high quality telephone interface. The spoken language system allows a mixed-initiative dialog where the user can provide any information at any point in time. Experienced users are thus able to provide all the information needed for database access in a single sentence, whereas less experienced users tend to provide shorter responses, allowing the system to guide them. The RAILTEL field trial was carried out using a commonly agreed-upon methodology. 100 naive subjects participated in the field trials, each contributing one call and completing a user questionnaire. 72% of the callers successfully completed their scenarios.
The subjective assessment of the service was for the most part favorable, with subjects expressing interest in using such a service. icslp96lex.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- On Designing Pronunciation Lexicons for Large Vocabulary, Continuous Speech Recognition Lori Lamel and Gilles Adda --------------------- Creation of pronunciation lexicons for speech recognition is widely acknowledged to be an important, but labor-intensive, aspect of system development. Lexicons are often manually created and make use of knowledge and expertise that is difficult to codify. In this paper we describe our American English lexicon developed primarily for the ARPA WSJ/NAB tasks. The lexicon is phonemically represented, and contains alternate pronunciations for about 10% of the words. Tools have been developed to add new lexical items, as well as to help ensure consistency of the pronunciations. Our experience in large vocabulary, continuous speech recognition is that systematic lexical design can improve system performance. Some comparative results with commonly available lexicons are given. icslp96maskreco.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Speech Recognition for an Information Kiosk J.L. Gauvain, J.J. Gangolf, L. Lamel --------------------- In the context of the ESPRIT MASK project we face the problem of adapting a ``state-of-the-art'' laboratory speech recognizer for use in the real world with naive users. The speech recognizer is a software-only system that runs in real-time on a standard Risc processor. All aspects of the speech recognizer have been reconsidered from signal capture to adaptive acoustic models and language models. The resulting system includes such features as microphone selection, response cancellation, noise compensation, query rejection capability and decoding strategies for real-time recognition. icslp96maskwoz.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Data Collection for the MASK Kiosk: WOz vs Prototype System A. 
Life, I. Salter, J.N. Temem, F. Bernard, S. Rosset, S. Bennacef, L. Lamel --------------------- The MASK consortium is developing a prototype multimodal multimedia service kiosk for train travel information and reservation, exploiting state-of-the-art speech technology. In this paper we report on our efforts aimed at evaluating alternative user interface designs and at obtaining acoustic and language modeling data for the spoken language component of the overall system. Simulation methods with increasing degrees of complexity have been used to test the interface design, and to experiment with alternate operating modes. The majority of the speech data has been collected using successive versions of the spoken language system, whose capabilities are incrementally expanded after analysis of the most frequent problems encountered by users of the preceding version. The different data requirements of user interface design and speech corpus acquisition are discussed in light of the experience of the MASK project. icslp96minker.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- A Stochastic Case Frame Approach for Natural Language Understanding Wolfgang Minker, Samir Bennacef, Jean-Luc Gauvain --------------------- A stochastically based approach for the semantic analysis component of a natural spoken language system for the ATIS task has been developed. The semantic analyzer of the spoken language system already in use at LIMSI makes use of a rule-based case grammar. In this work, the system of rules for the semantic analysis is replaced with a relatively simple, first order Hidden Markov Model. The performance of the two approaches can be compared because they use identical semantic representations despite their rather different methods for meaning extraction. We use an evaluation methodology that assesses performance at different semantic levels, including the database response comparison used in the ARPA ATIS paradigm. 
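The stochastic case-frame abstract above describes replacing hand-written semantic rules with a first-order Hidden Markov Model: hidden states are semantic labels, observations are words, and decoding recovers the most likely label sequence. The Viterbi sketch below illustrates that decoding step only; the labels ("CITY", "FILLER"), vocabulary, and probabilities are invented for the example and are not the LIMSI system's actual inventory.

```python
import math

def viterbi(words, states, log_init, log_trans, log_emit):
    """Most likely hidden label sequence for an observed word sequence."""
    # best[s] = best log-prob of any path ending in state s; backs for traceback
    best = {s: log_init[s] + log_emit[s].get(words[0], -1e9) for s in states}
    backs = []
    for w in words[1:]:
        back, new = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            back[s] = prev
            new[s] = best[prev] + log_trans[prev][s] + log_emit[s].get(w, -1e9)
        backs.append(back)
        best = new
    last = max(states, key=best.get)
    path = [last]
    for back in reversed(backs):
        path.append(back[path[-1]])
    return list(reversed(path))

# toy model: two semantic labels for an air-travel query fragment
ln = math.log
states = ["CITY", "FILLER"]
log_init = {"CITY": ln(0.3), "FILLER": ln(0.7)}
log_trans = {"CITY": {"CITY": ln(0.4), "FILLER": ln(0.6)},
             "FILLER": {"CITY": ln(0.5), "FILLER": ln(0.5)}}
log_emit = {"CITY": {"boston": ln(0.5), "denver": ln(0.5)},
            "FILLER": {"flights": ln(0.4), "to": ln(0.4), "from": ln(0.2)}}
print(viterbi(["flights", "to", "boston"], states, log_init, log_trans, log_emit))
```

In a real system the transition and emission probabilities would be estimated from the annotated corpus of semantic representations, which is what makes the annotation effort described in the abstract worthwhile.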
icslp96ml.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Spoken Language Processing in a Multilingual Context L.F. Lamel, M. Adda-Decker, J.L. Gauvain, G. Adda --------------------- In this paper we overview the spoken language processing activities at LIMSI, which are carried out in a multilingual framework. These activities include speech-to-text conversion, spoken language systems for information retrieval, speaker and language recognition, and speech response. The Spoken Language Processing Group has also been actively involved in corpora development and evaluation. The group has regularly participated in evaluations organized by ARPA, in the LE-SQALE project, and in the AUPELF-UREF program for provision of linguistic resources and evaluation tests for French. icslp96rtel.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Dialog in the RAILTEL Telephone-Based System S. Bennacef, L. Devillers, S. Rosset, L. Lamel --------------------- Dialog management is of particular importance in telephone-based services. In this paper we describe our recent activities in dialog management and natural language generation in the LIMSI RAILTEL system for access to rail travel information. The aim of the LE-MLAP project RAILTEL was to assess the capabilities of spoken language technology for interactive telephone information services. Because all interaction is over the telephone, oral dialog management and response generation are very important aspects of the overall system design and usability. Each dialog is analysed to determine the source of any errors (speech recognition, understanding, information retrieval, processing, or dialog management). An analysis is provided for 100 dialogs taken from the RAILTEL field trials with naive subjects accessing timetable information. asru95mask.ps.Z [1995 IEEE SPS Workshop on Automatic Speech Recognition Snowbird, Utah, Dec'95] --------- Spoken Language System Development for the MASK Kiosk J.L. Gauvain, S. Bennacef, L. Devillers, L.
Lamel --------------------- Spoken language systems aim to provide a natural interface between humans and computers through the use of simple and natural dialogues, so that users can access stored information. In this paper we present our recent activities in developing such a system for the ESPRIT MASK (Multimodal-Multimedia Automated Service Kiosk) project. The goal of the MASK project is to develop a multimodal, multimedia service kiosk to be located in train stations, so as to improve the user-friendliness of the service. The service kiosk will allow the user to speak to the system, as well as to use a touch screen and keypad. The role of LIMSI in the project is to develop the spoken language system which will be incorporated in the kiosk. High quality speech synthesis, graphics, animation and video will be used to provide feedback to the user. We have developed a complete spoken language data collection system for this task. In the actual service kiosk the spoken language system has to be modified so that the multimodal level transaction manager can control the overall dialog. The main information provided by the system is access to rail travel information such as timetables, tickets and reservations, as well as services offered on the trains, and fare-related restrictions and supplements. Other important travel information such as up-to-date departure and arrival times and track information will also be provided. Eventual extensions to the system will enable the user to obtain additional information about the train station and local tourist information, such as restaurants, hotels, and attractions in the surrounding area.
asru95ml.ps.Z [1995 IEEE SPS Workshop on Automatic Speech Recognition Snowbird, Utah, Dec'95] --------- Speech Recognition of European Languages Lori Lamel and Renato DeMori --------------------- In this paper we aim to provide a basic overview of the main ongoing efforts in large vocabulary, continuous speech recognition for European languages. We address issues in acoustic modeling, lexical representation, and language modeling. In addition to presenting a snapshot of speech recognition in different European languages, we try to highlight language-specific characteristics that must be taken into account in developing a recognition system for a given language. Other issues touched upon are the availability of training and testing data in the different languages, the choice of recognition units, lexical coverage, language specificities such as the frequencies of homophones, monophone words, compounding, liaison, and phonological effects (reduction), and multilingual evaluation. csl95-nl.ps.Z [Computer Speech & Language, 9(1), pp. 87-103, Jan. 1995] --------- A Phone-based Approach to Non-Linguistic Speech Feature Identification L. Lamel and J.L. Gauvain --------------------- In this paper we present a general approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal by feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective for text-independent gender, speaker, and language identification. Text-independent speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora.
Experiments in which speaker-specific models were estimated without using the phonetic transcriptions for the TIMIT speakers yielded the same identification accuracies as those obtained with the transcriptions. French/English language identification is better than 99% with 2s of read, laboratory speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy with 10s of speech. Ten-language identification accuracy using the OGI corpus is 59.7% with 10s of signal. slt95.ps.Z [ARPA Workshop Spoken Language Technology, Austin, TX, Jan 1995] --------- Developments in Large Vocabulary Dictation: The LIMSI Nov94 NAB System J.L. Gauvain, L. Lamel, M. Adda-Decker --------------------- In this paper we report on our development work in large vocabulary, American English continuous speech dictation on the ARPA NAB task in preparation for the November 1994 evaluation. We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary of 65k words so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modification of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke Benchmark test. Experimental results on development and evaluation test data are given, as well as an analysis of the errors on the development data. ica95lv.ps.Z [ICASSP'95, Detroit, MI, May 1995] --------- Developments in Continuous Speech Dictation using the ARPA WSJ Task J.L. Gauvain, L. Lamel, M. Adda-Decker --------------------- In this paper we report on our recent development work in large vocabulary, American English continuous speech dictation.
We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modification of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke Benchmark test. Experimental results for development and evaluation test data are given, as well as an analysis of the errors on the development data. esca-sls95.ps.Z [ESCA ETRW Spoken Dialog Systems, Vigsø, Denmark, May 1995] --------- Recent Developments in Spoken Language Systems for Information Retrieval L. Lamel, S. Bennacef, H. Bonneau-Maynard, S. Rosset, J.L. Gauvain --------------------- In this paper we present our recent activities in developing systems for vocal access to a database information retrieval system, for three applications in a travel domain: l'Atis, Mask and RailTel. The aim of this work is to use spoken language to provide a user-friendly interface with the computer. The spoken language system integrates a speech recognizer (based on HMMs with statistical language models), a natural language understanding component (based on a caseframe semantic analyzer and including a dialog manager) and an information retrieval and response generation component. We present the development status of prototype spoken language systems for l'Atis and Mask being used for data collection. euro95slc.ps.Z [Eurospeech'95, Madrid, Sep'95] --------- Development of Spoken Language Corpora for Travel Information L. Lamel, S. Rosset, S. Bennacef, H. Bonneau-Maynard, L. Devillers, J.L. Gauvain --------------------- In this paper we report on our ongoing work in developing spoken language corpora in the context of information access in two travel domain tasks, l'Atis and Mask.
The collection of spoken language corpora remains an important research area and represents a significant portion of the work in developing spoken language systems. The use of additional acoustic and language model training data has been shown to almost systematically improve performance in continuous speech recognition. Similarly, progress in spoken language understanding is closely linked to the availability of spoken language corpora. We record subjects on a regular basis using development versions of the spoken language systems for both tasks, obtaining over 1000 queries/month from 20 subjects. To help assess our progress in system development, since March'95 each subject completes a questionnaire addressing the user-friendliness, reliability, and ease of use of the Mask data collection system. euro95sv.ps.Z [Eurospeech'95, Madrid, Sep'95] --------- Experiments with Speaker Verification over the Telephone J.L. Gauvain, L.F. Lamel, B. Prouts --------------------- In this paper we present a study on speaker verification showing achievable performance levels for both high quality speech and telephone speech, and for two operational modes, i.e. text-dependent and text-independent speaker verification. A statistical modeling approach is taken, where for text-independent verification the talker is viewed as a source of phones, modeled by a fully connected Markov chain in which the lexical and syntactic structures of the language are approximated by local phonotactic constraints. A first series of experiments was carried out on high quality speech from the BREF corpus to validate this approach and resulted in an a posteriori equal error rate of 0.3% in text-dependent as well as text-independent mode. A second series of experiments was carried out on a telephone corpus recorded specifically for speaker verification algorithm development.
On this data, the lowest equal error rate is 2.9% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 2s of speech per trial. euro95sqale.ps.Z [Eurospeech'95, Madrid, Sep'95] --------- Issues in Large Vocabulary, Multilingual Speech Recognition L. Lamel, M. Adda-Decker, J.L. Gauvain --------------------- In this paper we report on our activities in multilingual, speaker-independent, large vocabulary continuous speech recognition. The multilingual aspect of this work is of particular importance in Europe, where each country has its own national language. Our existing recognizer for American English and French has been ported to British English and German. It has been assessed in the context of the LRE Sqale project, whose objective was to experiment with establishing a multilingual evaluation paradigm in Europe for the assessment of large vocabulary, continuous speech recognition systems. The recognizer makes use of phone-based continuous density HMM for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. The system has been evaluated on a dictation task with read, newspaper-based corpora: the ARPA Wall Street Journal corpus of American English, the WSJCAM0 corpus of British English, the BREF-LeMonde corpus of French and the PHONDAT-Frankfurter Rundschau corpus of German. Under closely matched conditions, the average word accuracy across all 4 languages is 85%, obtained with an open-vocabulary test and 20k trigram systems (64k system for German). mask95hcs.ps.Z [Human Comfort & Security, Brussels, Oct'95] --------- The Spoken Language Component of the MASK Kiosk J.L. Gauvain, S. Bennacef, L. Devillers, L.F. Lamel, S. Rosset --------------------- The aim of the Multimodal-Multimedia Automated Service Kiosk (MASK) project is to pave the way for more advanced public service applications with user interfaces employing multimodal, multi-media input and output.
The project has analyzed the technological requirements in the context of users and the tasks they perform in carrying out travel enquiries, and has developed a prototype information kiosk that will be installed in the Gare St. Lazare in Paris. The kiosk will improve the effectiveness of such services by enabling interaction through the coordinated use of multimodal inputs (speech and touch) and multimedia output (sound, video, text, and graphics), and in doing so create the opportunity for new public services. Vocal input is managed by a spoken language system, which aims to provide a natural interface between the user and the computer through the use of simple and natural dialogs. In this paper the architecture and the capabilities of the spoken language system are described, with emphasis on the speaker-independent, large vocabulary continuous speech recognizer, the natural language component (including semantic analysis and dialog management), and the response generator. We also describe our data collection and evaluation activities, which are crucial to system development. lim9512.ps.Z [LIMSI technical report no. 9512] --------- An English Version of the LIMSI L'ATIS System Wolfgang Minker --------------------- Spoken language systems aim to provide a natural interface between humans and computers with which users can access stored information. Such systems should be sufficiently robust to cope with naturally occurring spontaneous speech effects that arise in applications with human-machine interaction. The LIMSI spoken language research is oriented around several tasks in the travel domain. The L'ATIS system is a French spoken language system for vocal access to the Air Travel Information Services (ATIS) task database, a designated common task for data collection and evaluation within the ARPA Speech and Natural Language (NL) program. The system includes a speech recognizer, an understanding component and a component handling database query and response generation.
A case frame approach is used for the NL understanding component. This unit determines the meaning of the query and builds an appropriate semantic frame representation. The semantic frame is then used to generate a database request to the database management system and a response is returned to the user. This report reviews the main characteristics of the LIMSI spoken language information retrieval system and demonstrates the flexibility of the case frame approach by describing the process of porting the French NL component to American English. Understanding results are presented in accordance with the methods used in the ARPA spoken and natural language evaluation paradigm. forwiss94.ps.Z [CRIM/FORWISS Workshop on Progress and Prospects of Speech Research and Technology, Munich, Sep. 1994] --------- Continuous Speech Dictation at LIMSI J.L. Gauvain, L.F. Lamel, M. Adda-Decker --------------------- One of our major research activities at LIMSI is multilingual, speaker-independent, large vocabulary speech dictation. The multilingual aspect of this work is of particular importance in Europe, where each country has its own national language. Speaker-independence and large vocabulary are characteristics necessary to envision real world applications of such technology. The recognizer makes use of phone-based continuous density HMM for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. The system has been evaluated on two dictation tasks developed with read, newspaper-based corpora: the ARPA Wall Street Journal corpus of American English and the BREF LeMonde corpus of French. Experimental results under closely matched conditions are reported. For both languages an average word accuracy of 95% is obtained for a 5k vocabulary test. For a 20,000 word lexicon with an unrestricted vocabulary test, the word error for WSJ is 10% and for BREF is 16%.
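The word accuracy and word error figures quoted throughout these abstracts are conventionally computed from a minimum edit-distance (Levenshtein) alignment of the recognizer hypothesis against the reference transcription. A minimal sketch of that scoring, as an illustration only (not the ARPA/NIST scoring software used in the evaluations):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard dynamic-programming edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Word accuracy as reported above is simply 100% minus the word error rate (when no insertions push the error above 100%).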
slt94.ps.Z [ARPA Workshop on Spoken Language Technology, Princeton, NJ, March'94] --------- The LIMSI Nov93 WSJ System J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report on the LIMSI Wall Street Journal system which was evaluated in the November 1993 test. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The decoding is carried out in two forward acoustic passes. The first pass is a time-synchronous graph-search, which is shown to still be viable with vocabularies of up to 20k words when used with bigram back-off language models. The second pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. The official Nov93 evaluation results are given for vocabularies of up to 64,000 words, as well as results on the Nov92 5k and 20k test material. slt94sqale.ps.Z [ARPA Workshop on Spoken Language Technology, Princeton, NJ, March'94] --------- SQALE: Speech Recognizer Quality Assessment for Linguistic Engineering Herman J.M. Steeneken and Lori Lamel --------------------- The objective of the SQALE project (Speech recognizer Quality Assessment for Linguistic Engineering) is to experiment with establishing an evaluation paradigm in Europe for the assessment of large-vocabulary, continuous speech recognition systems in a multilingual environment. hlt94.ps.Z [ARPA Workshop on Human Language Technology, Princeton, NJ, March'94] --------- The LIMSI Continuous Speech Dictation System J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation.
In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported. For both corpora word recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. map93.ps.Z [IEEE Trans. on Speech and Audio, April 1994] --------- Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains J.L. Gauvain and Chin-Hui Lee --------------------- In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of prior densities and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded and MAP estimation formulas are developed. 
Prior density estimation issues are discussed for two classes of applications, parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications. ica94lid.ps.Z [ICASSP'94 Adelaide, Australia, May 1994] --------- Language Identification Using Phone-based Acoustic Likelihoods L.F. Lamel and J.L. Gauvain --------------------- In this paper we apply the technique of phone-based acoustic likelihoods to the problem of language identification. The basic idea is to process the unknown speech signal by language-specific phone model sets in parallel, and to hypothesize the language associated with the model set having the highest likelihood. Using laboratory quality speech the language can be identified as French or English with better than 99% accuracy with as little as 2s of speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy with 10s of speech. The 10-language identification rate using the OGI corpus is 59.7% with 10s of signal. ica94lv.ps.Z [ICASSP'94 Adelaide, Australia, May 1994] --------- The LIMSI Continuous Speech Dictation System: Evaluation on the ARPA Wall Street Journal Task J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using the ARPA Wall Street Journal-based CSR corpus. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with vocabularies of up to 20K words when used with bigram back-off language models.
A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. The recognizer has been evaluated in the Nov92 and Nov93 ARPA tests for vocabularies of up to 20,000 words. icslp94latis.ps.Z [ICSLP'94, Yokohama, Japan, Sept'94] --------- A Spoken Language System For Information Retrieval S.K. Bennacef, H. Bonneau-Maynard, J.L. Gauvain, L. Lamel, W. Minker --------------------- Spoken language systems aim to provide a natural interface between humans and computers by using simple and natural dialogues to enable the user to access stored information. The LIMSI spoken language work is being pursued in several task domains. In this paper we present a system for vocal access to a database for a French version of the Air Travel Information Services (ATIS) task. The ATIS task is a designated common task for data collection and evaluation within the ARPA Speech and Natural Language program. A complete spoken language system including a speech recognizer, a natural language component, a database query generator and a natural language response generator is described. The speaker independent continuous speech recognizer makes use of task-independent acoustic models trained on the BREF corpus and a task-specific language model. A case-frame approach is used for the natural language component. This component determines the meaning of the query and builds an appropriate semantic frame representation. The semantic frame is used to generate a database request to the database management system and the returned information is used to generate a response. First evaluation results for the ATIS task are given for the recognition and understanding components, as well as for the combined system.
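The case-frame analysis described in the ATIS entries above (marker words fill slots of a semantic frame, which is then turned into a database request) can be illustrated with a toy slot-filler. The marker words, slot names, and table layout below are invented for illustration; they are not those of the LIMSI system:

```python
# Toy case-frame analyzer for travel queries: each case marker word
# assigns the word that follows it to a slot of the semantic frame.
MARKERS = {
    "from": "departure_city",
    "to": "arrival_city",
    "on": "day",
}

def fill_frame(query):
    """Scan the query for case markers and fill the following word
    into the corresponding slot of a semantic frame (a dict)."""
    words = query.lower().split()
    frame = {}
    for i, w in enumerate(words[:-1]):
        slot = MARKERS.get(w)
        if slot:
            frame[slot] = words[i + 1]
    return frame

def to_request(frame):
    """Turn the filled frame into a (toy) database request string."""
    conds = " AND ".join(f"{k}='{v}'" for k, v in sorted(frame.items()))
    return f"SELECT * FROM flights WHERE {conds}"
```

For a query such as "flights from boston to denver on tuesday", the frame captures the departure city, arrival city and day, and the request generator emits a selection over those slots; a real analyzer also handles dialog context, ellipsis and ill-formed input.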
icslp94lv.ps.Z [ICSLP'94, Yokohama, Japan, Sept'94] --------- Continuous Speech Dictation in French J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- A major research activity at LIMSI is multilingual, speaker-independent, large vocabulary speech dictation. In this paper we report on efforts in large vocabulary, speaker-independent continuous speech recognition of French using the BREF corpus. Recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on 38 million words of newspaper text from LeMonde for language modeling. The recognizer uses a time-synchronous graph-search strategy. When a bigram language model is used, recognition is carried out in a single forward pass. A second forward pass, which makes use of a word graph generated with the bigram language model, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models and phone duration models. An average phone accuracy of 86% was achieved. A word accuracy of 84% has been obtained for an unrestricted vocabulary test and 95% for a 5k vocabulary test. icslp94ted.ps.Z [ICSLP'94, Yokohama, Japan, Sep'94] --------- The Translanguage English Database (TED) L.F. Lamel, F. Schiel, A. Fourcin, J. Mariani, H. Tillmann --------------------- The Translanguage English Database is a corpus of recordings made of oral presentations at Eurospeech93 in Berlin. The corpus name derives from the high percentage of presentations given in English by non-native speakers of English. 224 oral presentations at the conference were successfully recorded, providing a total of about 75 hours of speech material. 
These recordings provide a relatively large number of speakers speaking a variant of the same language (English) over a relatively large amount of time (15 min each + 5 min discussion) on a specific topic. A subset of speakers were recorded with a laryngograph in addition to the standard microphone. A set of Polyphone-like recordings were made, for which a subset also had a laryngograph signal recorded. These recordings were made in English and in the speaker's mother language. In addition to the spoken material, associated text materials are being collected. These include written versions of the proceedings papers and any oral preparation texts which were made available. The text materials will provide vocabulary items and data for language modeling. Speakers were also asked to complete a short questionnaire regarding their mother language, any other languages they speak, as well as their knowledge of English. lre94sqale.ps.Z [LRE Journees du Genie Linguistique, Paris, July'94] --------- Speech Recognizer Quality Assessment for Linguistic Engineering (SQALE) Lori F. Lamel --------------------- The aim of the LRE-SQALE project (Speech recognizer Quality Assessment for Linguistic Engineering) is to experiment with establishing an evaluation paradigm in Europe for the assessment of large-vocabulary, continuous speech recognition systems in a multilingual environment. This 18-month project will define and carry out the assessment experiments, paving the way for future projects with a larger scope and wider participation of European sites. The SQALE Consortium consists of a coordinator, the Institute for Human Factors at TNO, and three laboratories (CUED, LIMSI-CNRS, PHILIPS) who will evaluate their recognition systems using commonly agreed upon protocols, with the evaluation organized by the coordinating laboratory.
Multiple sites will test their algorithms on the same database, so as to compare the merits of different methods, and each site will evaluate on at least two languages, so as to compare the relative difficulties of the languages and the degree to which the algorithms are independent of a given language. spcom94.ps.Z [Speech Communication, 15, pp. 21-37, Sep'94] --------- Speaker-Independent Continuous Speech Dictation J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report on progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper-based speech corpora in English and French. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. For English the ARPA Wall Street Journal-based CSR corpus is used and for French the BREF corpus containing recordings of texts from the French newspaper LeMonde is used. Experiments were carried out with both these corpora at the phone level and at the word level with vocabularies containing up to 20,000 words. Word recognition experiments are also described for the ARPA RM task which has been widely used to evaluate and compare systems. iprai94.ps.Z [International Journal of Pattern Recognition and Artificial Intelligence, special issue on Speech Recognition for different languages 8(1):99-131, 1994.]
--------- Speech-to-Text Conversion in French J.L. Gauvain, L. F. Lamel, G. Adda, and J. Mariani --------------------- Speech-to-text conversion of French necessitates that both the acoustic level recognition and language modeling be tailored to the French language. Work in this area was initiated at LIMSI over 10 years ago. In this paper a summary of the ongoing research in this direction is presented. Included are studies on distributional properties of French text materials; problems specific to speech-to-text conversion particular to French; studies in phoneme-to-grapheme conversion, for continuous, error-free phonemic strings; past work on isolated-word speech-to-text conversion; and more recent work on continuous-speech speech-to-text conversion. Also demonstrated is the use of phone recognition for both language and speaker identification. The continuous speech-to-text conversion for French is based on a speaker-independent, vocabulary-independent recognizer. In this paper phone recognition and word recognition results are reported evaluating this recognizer on read speech taken from the BREF corpus. The recognizer was trained on over 4 hours of speech from 57 speakers, and tested on sentences from an independent set of 19 speakers. A phone accuracy of 78.7% was obtained using a set of 35 phones. The word accuracy was 88% for a 1139 word lexicon and 86% for a 2716 word lexicon, with a word pair grammar with respective perplexities of 100 and 160. Using a bigram grammar word accuracies of 85.5% and 81.7% were obtained with 5K and 20K word vocabularies, with respective perplexities of 122 and 205. hlt93.ps.Z [ARPA Human Language Technology Workshop, Plainsboro, NJ, pp. 96-101, March'93] --------- Identification of Non-Linguistic Speech Features Jean-Luc Gauvain and Lori F. Lamel --------------------- Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. 
It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction. With 2s of speech, the language can be identified with better than 99% accuracy. Error in sex-identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers yielded the same identification accuracies as supervised adaptation. csl93.ps.Z [Computer, Speech & Language, Vol 7 (2):169-191, April 1993] --------- A knowledge-based system for stop consonant identification based on speech spectrogram reading Lori F. Lamel --------------------- In order to formalize the information used in spectrogram reading, a knowledge-based system for identifying spoken stop consonants was developed. Speech spectrogram reading involves interpreting the acoustic patterns in the image to determine the spoken utterance.
One must selectively attend to many different acoustic cues, interpret their significance in light of other evidence, and make inferences based on information from multiple sources. The evidence, obtained from both spectrogram reading experiments and from teaching spectrogram reading, indicates that the process can be modeled with rules. Formalizing spectrogram reading entails refining the language used to describe acoustic events in the spectrogram, selecting a set of relevant acoustic events that distinguish among phonemes, and developing rules which map these acoustic attributes into phonemes. One way to assess how well the knowledge used by experts has been captured is by embedding the rules in a computer program. A knowledge-based system was selected because the expression and use of knowledge are explicit. The emphasis was on capturing the acoustic descriptions and modeling the reasoning used by human spectrogram readers. In this paper, the knowledge acquisition and knowledge representation, in terms of descriptions and rules, are described. A performance evaluation and error analysis are also provided, and the performance of an HMM-based phone recognizer on the same test data is given for comparison. ica93.ps.Z [ICASSP'93 Minneapolis, MN, April'93] --------- Cross-Lingual Experiments with Phone Recognition Lori F. Lamel and Jean-Luc Gauvain --------------------- This paper presents some of the recent research on speaker-independent continuous phone recognition for both French and English. The phone accuracy is assessed on the BREF corpus for French, and on the Wall Street Journal and TIMIT corpora for English. Cross-language differences concerning language properties are presented. It was found that French is easier to recognize at the phone level (the phone error for BREF is 23.6% vs. 30.1% for WSJ), but harder to recognize at the lexical level due to the larger number of homophones.
Experiments with signal analysis indicate that a 4kHz signal bandwidth is sufficient for French, whereas 8kHz is needed for English. Phone recognition is a powerful technique for language, sex, and speaker identification. With 2s of speech, the language can be identified with better than 99% accuracy. Sex-identification for BREF and WSJ is error-free. Speaker identification accuracies of 98.2% on TIMIT (462 speakers) and 99.1% on BREF (57 speakers), were obtained with one utterance per speaker, and 100% with 2 utterances. ast93.ps.Z [ESCA ETRW on Applications of Speech Technology, Lautrach, Sep'93] --------- Generation and Synthesis of Broadcast Messages L.F. Lamel, J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch --------------------- In this paper we present a system designed by VECSYS, in collaboration with LIMSI, to generate and synthesize high quality broadcast messages. The system is currently undergoing evaluation tests to automatically generate and broadcast messages about weather and airport conditions. The generated messages are alternately broadcast in French and in English. This paper focuses on the synthesis aspects of the system which synthesizes messages by concatenation of speech units stored in the form of a dictionary. The texts corresponding to the messages to be synthesized and broadcast are automatically generated by a server, and transmitted to the synthesis component. Should the system be unable to generate the message from existing entries in the dictionary, the system permits a human operator to record the message which is subsequently broadcast. euro93nl.ps.Z [Eurospeech'93, Berlin, Sept'93] --------- Identifying Non-Linguistic Speech Features L.F. Lamel and J.L. Gauvain --------------------- Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. 
It is possible to foresee applications, for example, information centers in public places such as train stations and airports, where the spoken query is to be recognized without even prior knowledge of the language being spoken. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. In this paper we present a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal by feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective for text-independent sex, speaker, and language identification and can enable better and more friendly human-machine interaction. Text-independent speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. Experiments estimating speaker-specific models without use of the phonetic transcription for the TIMIT speakers yielded the same identification accuracies as were obtained with the transcriptions. French/English language identification is better than 99% with 2s of read, laboratory speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy with 10s of speech. The 10-language identification rate using the OGI corpus is 59.7% with 10s of signal. euro93lv.ps.Z [Eurospeech'93, Berlin, Sept'93] --------- Speaker-Independent Continuous Speech Dictation J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper speech corpora.
The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. Two corpora of read speech have been used to carry out the experiments: the DARPA Wall Street Journal-based CSR corpus and the BREF corpus containing recordings of texts from the French newspaper LeMonde. For both corpora experiments were carried out with up to 20K word lexicons. Experimental results are also given for the DARPA RM task which has been widely used to evaluate and compare systems. euro93ph.ps.Z [Eurospeech'93, Berlin, Sep'93] --------- High Performance Speaker-Independent Phone Recognition Using CDHMM L.F. Lamel and J.L. Gauvain --------------------- In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46 phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39 phone set.
Making reference to our work on large vocabulary CSR, we show that it is worthwhile to perform phone recognition experiments as opposed to focusing attention only on word recognition results.

asru93lv.ps.Z [1993 IEEE SPS Workshop on Automatic Speech Recognition, Snowbird, Utah, Dec'93 (invited)]
---------
Large Vocabulary Speech Recognition in English and French
J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker
---------------------
In this paper we report efforts at LIMSI in speaker-independent large vocabulary speech recognition in French and in English. The recognizer makes use of continuous density HMM (CDHMM) with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on text material for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and interword), phone duration models, and sex-dependent models. Evaluation of the system has been based on read speech corpora: the DARPA Wall Street Journal-based CSR corpus and the BREF corpus containing recordings of texts from the French newspaper LeMonde. For both corpora, experiments were carried out with up to 20K word lexicons. Experimental results on the WSJ and BREF corpora with similar size lexicons and task complexities are given, with attention to language-specific properties and problems.

asru93lid.ps.Z [1993 IEEE SPS Workshop on Automatic Speech Recognition, Snowbird, Utah, Dec'93]
---------
Language Identification Using Phone-based Acoustic Likelihoods
Lori F. Lamel and Jean-Luc Gauvain
---------------------
This paper presents our recent work in language identification using phone-based acoustic likelihoods. The basic idea is to process the unknown utterance with language-dependent phone models, identifying the language as the one associated with the phone model set having the highest likelihood.
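The decision rule just described reduces to an argmax over per-language log-likelihoods. A minimal sketch, where the scores are hypothetical stand-ins for the outputs of the language-dependent phone-model decoders run in parallel:

```python
def identify_language(loglikelihoods):
    """Pick the language whose phone model set scored the utterance highest.

    loglikelihoods: dict mapping language name -> total log-likelihood of
    the utterance under that language's phone models.
    """
    return max(loglikelihoods, key=loglikelihoods.get)

# Hypothetical scores for one utterance; in the systems described above
# each value comes from decoding the signal with one phone model set.
scores = {"French": -5210.4, "English": -5234.9}
print(identify_language(scores))  # -> French
```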
This approach has been evaluated for French/English language identification in laboratory conditions, and for 10 languages using the OGI Multilingual telephone corpus. Phone-based acoustic likelihoods have also been shown to be effective for sex and speaker identification.

rmsep92.ps.Z [Final review of the DARPA Artificial Neural Network Technology Speech Program, Stanford, CA, pp. 59-64, Sep'92]
---------
Continuous Speech Recognition at LIMSI
Lori F. Lamel and Jean-Luc Gauvain
---------------------
This paper presents some of the recent research on speaker-independent continuous speech recognition at LIMSI, including efforts in phone and word recognition for both French and English. Evaluation of an HMM-based phone recognizer on a subset of the BREF corpus gives a phone accuracy of 67.1% with 35 context-independent phone models and 74.2% with 428 context-dependent phone models. The word accuracy is 88% for a 1139 word lexicon and 86% for a 2716 word lexicon, using a word pair grammar with respective perplexities of 101 and 160. Phone recognition is also shown to be effective for language, sex, and speaker identification. The second part of the paper describes the recognizer used for the September-92 Resource Management evaluation test. The HMM-based word recognizer is built by concatenation of the phone models for each word, where each phone model is a 3-state left-to-right HMM with Gaussian mixture observation densities. Separate male and female models are run in parallel. The lexicon is represented with a reduced set of 36 phones so as to permit additional sharing of contexts. Intra- and inter-word phonological rules are optionally applied during training and recognition. These rules attempt to account for some of the phonological variations observed in fluent speech. The speaker-independent word accuracy on the Sep92 test data was 95.6%.
On the previous test materials which were used for development, the word accuracies are: 96.7% (Jun88), 97.5% (Feb89), 96.7% (Oct89) and 97.4% (Feb91).

spc92map.ps.Z [Speech Communication, Vol. 11:205-213, June 1992]
---------
Bayesian Learning for Hidden Markov Model with Gaussian Mixture State Observation Densities
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------
An investigation into the use of Bayesian learning of the parameters of a multivariate Gaussian mixture density has been carried out. In the framework of continuous density hidden Markov models (CDHMM), Bayesian learning serves as a unified approach for parameter smoothing, speaker adaptation, speaker clustering and corrective training. The goal is to enhance model robustness in a CDHMM-based speech recognition system so as to improve performance. Our approach is to use Bayesian learning to incorporate prior knowledge into the training process in the form of prior densities of the HMM parameters. The theoretical basis for this procedure is presented and results applying it to parameter smoothing, speaker adaptation, speaker clustering, and corrective training are given.

ica92map.ps.Z (missing) [ICASSP'92, San Francisco, CA, March'92]
---------
Improved Acoustic Modeling with Bayesian Learning
Jean-Luc Gauvain and Chin-Hui Lee
---------------------

darpa92.ps.Z [DARPA Speech & Natural Language Workshop, Arden House, NY, Feb'92]
---------
Speaker-Independent Phone Recognition Using BREF
Jean-Luc Gauvain and Lori F. Lamel
---------------------
A series of experiments on speaker-independent phone recognition of continuous speech has been carried out using the recently recorded BREF corpus. These experiments are the first to use this large corpus, and are meant to provide a baseline performance evaluation for vocabulary-independent phone recognition of French. The HMM-based recognizer was trained with hand-verified data from 43 speakers.
Using 35 context-independent phone models, a baseline phone accuracy of 60% (no phone grammar) was obtained on an independent test set of 7635 phone segments from 19 new speakers. Including phone bigram probabilities as phonotactic constraints resulted in a performance of 63.5%. A phone accuracy of 68.6% was obtained with 428 context-dependent models and the bigram phone language model. Vocabulary-independent word recognition results with no grammar are also reported for the same test data.

darfeb92.ps.Z [DARPA Speech & Nat. Lang. Wshp, Arden House, pp. 185-190, Feb'92]
---------
MAP Estimation of Continuous Density HMM: Theory and Applications
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------
We discuss maximum a posteriori estimation of continuous density hidden Markov models (CDHMM). The classical MLE reestimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded and reestimation formulas are given for HMM with Gaussian mixture observation densities. Because of its adaptive nature, Bayesian learning serves as a unified approach for the following four speech recognition applications: parameter smoothing, speaker adaptation, speaker group modeling and corrective training. New experimental results on all four applications are provided to show the effectiveness of the MAP estimation approach.

euro91map.ps.Z (missing) [Eurospeech'91, Genoa, Italy, pp. 505-508, Sep'91]
---------
Bayesian Learning for Hidden Markov Model with Gaussian Mixture State Observation Densities
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------

euro91.ps.Z [Eurospeech'91, Genoa, Italy, pp. 505-508, Sep'91]
---------
BREF, a Large Vocabulary Spoken Corpus for French
Lori F. Lamel, Jean-Luc Gauvain, and Maxine Eskenazi
---------------------
This paper presents some of the design considerations of BREF, a large read-speech corpus for French.
BREF was designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and for the study of phonological variations. The texts to be read were selected from 5 million words of the French newspaper LeMonde. In total, 11,000 texts were selected, with selection criteria that emphasized maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. Ninety speakers have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min.) of speech.

darpa91.ps.Z [DARPA Speech & Nat. Lang., Morgan Kaufmann, pp. 272-277, Feb'91]
---------
Bayesian Learning of Gaussian Mixture Densities for Hidden Markov Models
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------
An investigation into the use of Bayesian learning of the parameters of a multivariate Gaussian mixture density has been carried out. In a continuous density hidden Markov model (CDHMM) framework, Bayesian learning serves as a unified approach for parameter smoothing, speaker adaptation, speaker clustering, and corrective training. The goal of this study is to enhance model robustness in a CDHMM-based speech recognition system so as to improve performance. Our approach is to use Bayesian learning to incorporate prior knowledge into the CDHMM training process in the form of prior densities of the HMM parameters. The theoretical basis for this procedure is presented and preliminary results applying it to HMM parameter smoothing, speaker adaptation, and speaker clustering are given. Performance improvements were observed on tests using the DARPA RM task. For speaker adaptation, under a supervised learning mode with 2 minutes of speaker-specific training data, a 31% reduction in word error rate was obtained compared to speaker-independent results.
Using Bayesian learning for HMM parameter smoothing and sex-dependent modeling, a 21% error reduction was observed on the FEB91 test.

kobe90.ps.Z [ICSLP'90, Kobe, Japan, Oct'90]
---------
Design Considerations and Text Selection for BREF, a large French read-speech corpus
Jean-Luc Gauvain, Lori F. Lamel, and Maxine Eskenazi
---------------------
BREF, a large read-speech corpus in French, has been designed with several aims: to provide enough speech data to develop dictation machines, to provide data for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a corpus of continuous speech for the study of phonological variations. This paper presents some of the design considerations of BREF, focusing on the text analysis and the selection of text materials. The texts to be read were selected from 4.6 million words of the French newspaper LeMonde. In total, 11,000 texts were selected, with an emphasis on maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. The goal is to obtain about 10,000 words (approximately 60-70 min.) of speech from each of 100 speakers representing different French dialects.
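The Bayesian/MAP learning scheme recurring in the abstracts above (parameter smoothing, speaker adaptation) amounts to interpolating a prior estimate with the estimate from new data. A one-dimensional sketch for a Gaussian mean with a conjugate prior; the pseudo-count tau and the data values are illustrative, not taken from the papers:

```python
def map_mean(prior_mean, tau, data):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    tau acts as a pseudo-count: the prior mean is weighted as if it had
    been estimated from tau observations. With no data the estimate stays
    at the prior mean; with abundant data it approaches the sample mean.
    """
    n = len(data)
    if n == 0:
        return prior_mean
    sample_mean = sum(data) / n
    return (tau * prior_mean + n * sample_mean) / (tau + n)

# Speaker-adaptation flavour: start from a speaker-independent mean of 0.0;
# a little speaker-specific data pulls the estimate part of the way toward
# the speaker's sample mean.
adapted = map_mean(prior_mean=0.0, tau=8.0, data=[1.0, 1.2, 0.8, 1.0])
```

The full CDHMM case estimates mixture weights, means, and covariances jointly within the reestimation algorithms named above, but each update has this same prior-plus-data interpolation form.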