IS160180.PDF [InterSpeech 2016, San Francisco, CA, Sept 2016] --------- A Divide-and-Conquer Approach for Language Identification Based on Recurrent Neural Networks Gregory Gelly, Jean-Luc Gauvain, Viet-Bac Le, Abdelkhalek Messaoudi ------------------------ This paper describes the design of an acoustic language recognition system based on BLSTM that can discriminate closely related languages and dialects of the same language. We introduce a Divide-and-Conquer (D&C) method to quickly and successfully train an RNN-based multi-language classifier. Experiments compare this approach to the straightforward training of the same RNN, as well as to two widely used LID techniques: a phonotactic system using DNN acoustic models and an i-vector system. Results are reported on two different data sets: the 14 languages of NIST LRE07 and the 20 closely related languages and dialects of NIST OpenLRE15. In addition to reporting the NIST Cavg metric, which served as the primary metric for the LRE07 and OpenLRE15 evaluations, the EER and LER are provided. When used with BLSTM, the D&C training scheme significantly outperformed the classical training method for multi-class RNNs. On the OpenLRE15 data set, this method also outperforms classical LID techniques and combines very well with a phonotactic system. hlt12_lamel.pdf [HLT 2012 The Fifth International Conference: Human Language Technologies - The Baltic Perspective] --------- Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data Lori Lamel ------------------------ Spoken language processing technologies are principal components in most of the applications being developed as part of the Quaero program. Quaero is a large research and industrial innovation program focusing on the development of technologies for automatic analysis and classification of multimedia and multilingual documents.
Concerning speech processing, research aims to substantially improve the state-of-the-art in speech-to-text transcription, speaker diarization and recognition, language recognition, and speech translation. Oparin_ICASSP_2012.pdf [IEEE ICASSP 2012, Kyoto, Japan, March 2012] --------- Performance Analysis of Neural Networks in Combination with N-Gram Language Models Ilya Oparin, Martin Sundermeyer, Hermann Ney, Jean-Luc Gauvain ------------------------ Neural Network language models (NNLMs) have recently become an important complement to conventional n-gram language models (LMs) in speech-to-text systems. However, little is known about the behavior of NNLMs. The analysis presented in this paper aims to understand which types of events are better modeled by NNLMs as compared to n-gram LMs, in what cases improvements are most substantial, and why this is the case. Such an analysis is important to derive further benefit from NNLMs used in combination with conventional n-gram models. The analysis is carried out for different types of neural network (feed-forward and recurrent) LMs. The results showing for which types of events NNLMs provide better probability estimates are validated on two setups that differ in their size and degree of data homogeneity. odyssey12_mfb.pdf [Odyssey 2012 The Speaker and Language Recognition Workshop, Singapore, June 2012] --------- Fusing Language Information from Diverse Data Sources for Phonotactic Language Recognition Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain and Lori Lamel ------------------------ The baseline approach in building phonotactic language recognition systems is to characterize each language by a single phonotactic model generated from all the available language-specific training data. When several data sources are available for a given target language, system performance can be improved using language source-dependent phonotactic models.
In this case, the common practice is to fuse language source information (i.e., the phonotactic scores for each language/source) early (at the input) to the backend. This paper proposes to postpone the fusion to the end (at the output) of the backend. In this case, the language recognition score can be estimated from well-calibrated language source scores. Experiments were conducted using the NIST LRE 2007 and the NIST LRE 2009 evaluation data sets with the 30s condition. On the NIST LRE 2007 eval data, a Cavg of 0.9% is obtained for the closed-set task and 2.5% for the open-set task. Compared to the common practice of early fusion, these results represent relative improvements of 18% and 11% for the closed-set and open-set tasks, respectively. Initial tests on the NIST LRE 2009 eval data gave no improvement on the closed-set task. Moreover, the Cllr measure indicates that language recognition scores estimated by the proposed approach are better calibrated than those of the common early-fusion practice. is12_1398_jk.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Development and Evaluation of Automatic Punctuation for French and English Speech-to-Text Jachym Kolar and Lori Lamel ------------------------ Automatic punctuation of speech is important to make speech-to-text output more readable and to facilitate downstream language processing. This paper describes the development of an automatic punctuation system for French and English. The punctuation model uses both textual information and acoustic (prosodic) information and is based on adaptive boosting. The system is evaluated on a challenging speech corpus under real-application conditions using output from a state-of-the-art speech-to-text system and automatic audio segmentation and speaker diarization. Unlike previous work, automatic punctuation is scored on two independent manual references.
Comparisons are made for the two languages, and the performance of the automatic system is compared with inter-annotator agreement. is12_1063_mfb.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Phonotactic Language Recognition Using MLP Features Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain, Lori Lamel ------------------------ This paper describes a Parallel Phone Recognizers followed by Language Modeling (PPRLM) system that is very efficient in terms of both performance and processing speed. The system uses context-independent phone recognizers trained on MLP features concatenated with the conventional PLP and pitch features. MLP features have several interesting properties that make them suitable for speech processing, in particular the temporal context provided to the MLP inputs and the discriminative criterion used to learn the MLP parameters. Results of preliminary experiments conducted on the NIST LRE 2005 closed-set task show significant improvements obtained by the proposed system compared with a PPRLM system using context-independent phone models trained on PLP features. Moreover, the proposed system performs as well as a PPRLM system using context-dependent phone models, while running 6 times faster. is12_1398_jk.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Discriminatively trained phoneme confusion model for keyword spotting Panagiota Karanasou, Lukas Burget, Dimitra Vergyri, Murat Akbacak, Arindam Mandal ------------------------ Keyword Spotting (KWS) aims at detecting speech segments that contain a given query within large amounts of audio data. Typically, a speech recognizer is involved in a first indexing step. One of the challenges of KWS is how to handle recognition errors and out-of-vocabulary (OOV) terms. This work proposes the use of discriminative training to construct a phoneme confusion model, which expands the phonemic index of a KWS system by adding phonemic variation to handle the above-mentioned problems.
The objective function that is optimized is the Figure of Merit (FOM), which is directly related to KWS performance. The experiments conducted on English data sets show some improvement in the FOM and are promising for the use of such a technique. is12_473_hb.pdf [InterSpeech 2012, Portland, Oregon, Sept 2012] --------- Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast Johann Poignant, Herve Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, Georges Quenot ------------------------ We propose an approach for unsupervised speaker identification in TV broadcast videos, combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.5% F-measure when considering all the speakers and 45.7% without anchor speakers. sltu12_fraga.pdf [SLTU 2012, Cape Town, South Africa, May 2012] --------- Incorporating MLP features in the unsupervised training process Thiago Fraga Silva, Viet-Bac Le, Lori Lamel and Jean-Luc Gauvain ------------------------ The combined use of multi-layer perceptron (MLP) and perceptual linear prediction (PLP) features has been reported to improve the performance of automatic speech recognition systems for many different languages and domains.
However, MLP features have not yet been used in unsupervised acoustic model training. This approach is introduced in this paper with encouraging results. In addition, unsupervised language model training was also investigated for a Portuguese broadcast speech recognition task, leading to a slight improvement in performance. The joint use of the unsupervised techniques presented here leads to an absolute WER reduction of up to 3.2% over a baseline unsupervised system. sltu12_lamel.pdf [SLTU 2012, Cape Town, South Africa, May 2012] --------- Transcription of Russian conversational speech Lori Lamel, Sandrine Courcinous, Jean-Luc Gauvain, Yvan Josse and Viet Bac Le ------------------------ This paper presents initial work in transcribing conversational telephone speech in Russian. Acoustic seed models were derived from other languages. The initial studies are carried out with 9 hours of transcribed data, and explore the choice of the phone set and the use of other data types to improve transcription performance. Discriminant features produced by a Multi-Layer Perceptron trained on a few hours of Russian conversational data are contrasted with those derived from well-trained networks for English telephone speech and from Russian broadcast data. Acoustic models trained on broadcast data filtered to match the telephone band achieve results comparable to those obtained with models trained on the small conversational telephone speech corpus. lrec12_300_iv.pdf [LREC 2012, Istanbul, May 2012] --------- Cross-lingual studies of ASR errors: paradigms for perceptual evaluations Ioana Vasilescu, Martine Adda-Decker and Lori Lamel ------------------------ It is well-known that human listeners significantly outperform machines when it comes to transcribing speech.
This paper presents a progress report on the joint research at LIMSI comparing automatic and human speech transcription, and on the perceptual experiments developed to increase our understanding of automatic speech recognition errors. Two paradigms are described here in which human listeners are asked to transcribe speech segments containing words that are frequently misrecognized by the system. In particular, we sought to gain information about the impact of increased context in helping humans disambiguate problematic lexical items, typically homophone or near-homophone words. The long-term aim of this research is to improve the modeling of ambiguous contexts so as to reduce automatic transcription errors. 0004866.pdf [IEEE ICASSP 2010, Dallas, Texas, March 2010] --------- Multi-style MLP Features for BN Transcription Viet-Bac Le and Lori Lamel and Jean-Luc Gauvain ------------------------ It has become common practice to adapt acoustic models to specific conditions (gender, accent, bandwidth) in order to improve the performance of speech-to-text (STT) transcription systems. With the growing interest in the use of discriminative features produced by a multi-layer perceptron (MLP) in such systems, the question arises of whether it is necessary to specialize the MLP to particular conditions, and if so, how to incorporate the condition-specific MLP features in the system. This paper explores three approaches (adaptation, full training, and feature merging) to using condition-specific MLP features in a state-of-the-art BN STT system for French. The third approach, without condition-specific adaptation, was found to outperform the original models with condition-specific adaptation, and to perform almost as well as full training of multiple condition-specific HMMs.
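The "feature merging" approach above can be sketched very simply: per-frame features from several condition-specific MLPs are concatenated with the base PLP features so that a single set of models can be trained. The sketch below is purely illustrative (the function name, conditions, dimensions and values are invented, not taken from the paper):

```python
# Minimal sketch of feature merging: concatenate base PLP features with the
# outputs of every condition-specific MLP, frame by frame. All names and
# numbers are hypothetical placeholders for illustration only.

def merge_features(plp, mlp_by_condition):
    """Concatenate PLP features with the outputs of each
    condition-specific MLP, frame by frame."""
    merged = []
    for t in range(len(plp)):
        frame = list(plp[t])
        # fixed condition order so every frame has the same layout
        for condition in sorted(mlp_by_condition):
            frame.extend(mlp_by_condition[condition][t])
        merged.append(frame)
    return merged

plp = [[0.1, 0.2], [0.3, 0.4]]                      # 2 frames of 2-dim PLP
mlps = {"studio": [[0.9], [0.8]], "telephone": [[0.5], [0.6]]}
merged = merge_features(plp, mlps)
print(merged[0])  # first frame: PLP + studio MLP + telephone MLP features
```

The resulting vectors feed a single HMM training pass, avoiding separate condition-specific model sets.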
0004325.pdf [IEEE ICASSP 2009, Taipei, Taiwan, April 2009] --------- Modeling characters versus words for Mandarin speech recognition Jun Luo and Lori Lamel and Jean-Luc Gauvain ------------------------ Word-based models are widely used in speech recognition since they typically perform well. However, the question of whether it is better to use a word-based or a character-based model warrants investigation for the Mandarin Chinese language. Since Chinese is written without any spaces or word delimiters, a word segmentation algorithm is applied in a pre-processing step prior to training a word-based language model. Chinese characters carry meaning, and speakers are free to combine characters to construct new words. This suggests that character information can also be useful in communication. This paper explores both word-based and character-based models, and their complementarity. Although word-based modeling is found to outperform character-based modeling, increasing the vocabulary size from 56k to 160k words did not lead to a gain in performance. Results are reported for the GALE Mandarin speech-to-text task. 0004537.pdf [IEEE ICASSP 2009, Taipei, Taiwan, April 2009] --------- Lattice-based MLLR for speaker recognition Marc Ferras and Claude Barras and Jean-Luc Gauvain ------------------------ Maximum-Likelihood Linear Regression (MLLR) transform coefficients have been shown to be useful features for text-independent speaker recognition systems. These systems use MLLR coefficients computed with a Large Vocabulary Continuous Speech Recognition (LVCSR) system as features, together with Support Vector Machine (SVM) classification. However, performance is limited by the transcripts, which are often erroneous, with high word error rates (WER) for spontaneous telephone speech applications. In this paper, we propose using lattice-based MLLR to overcome this issue.
Using word lattices instead of 1-best hypotheses, more hypotheses can be considered for MLLR estimation and, thus, better models are more likely to be used. As opposed to standard MLLR, language model probabilities are taken into account as well. We show how systems using lattice MLLR outperform standard MLLR systems on the Speaker Recognition Evaluation (SRE) 2006. A comparison to other standard acoustic systems is provided as well. Langlais08translating.pdf [Workshop on Intelligent Data Analysis in bioMedicine and Pharmacology (IDAMAP) 2008] --------- Translating Medical Words by Analogy Philippe Langlais and François Yvon and Pierre Zweigenbaum ------------------------ Term translation has become a recurring need in medical informatics. This creates an interest in robust methods which can translate medical words in various languages. We propose a novel, analogy-based method to generate word translations. It relies on a partial bilingual lexicon and solves bilingual analogical equations to create candidate translations. To evaluate the potential of this method, we tested it by translating a set of 1,306 French medical words into English. It could propose a correct translation for up to 67.9% of the input words, with a precision of up to 80.2% depending on the number of selected candidates. We compare it to previous methods, including word alignment in parallel corpora and edit distance to a list of words, and show how these methods can complement each other. Rosset07_et_al2.pdf [ASRU 2007, Kyoto, December 2007] --------- The LIMSI QAst systems: comparison between human and automatic rules generation for question-answering on speech transcriptions S. Rosset and O. Galibert and G. Adda and E. Bilinski ------------------------ In this paper, we present two different question-answering systems on speech transcripts. These two systems are based on a complete and multi-level analysis of both queries and documents.
The first system uses handcrafted rules for small text fragment (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The preliminary results obtained on QAst (QA on speech transcripts) development data are promising, ranging from 72% correct answers at first rank on manually transcribed meeting data to 94% on manually transcribed lecture data. final_srsl07.pdf [Workshop on the semantic representation of spoken language - SRSL07, Salamanca, November 2007] --------- Semantic relations for an oral and interactive question-answering system J. Villaneau and S. Rosset and O. Galibert ------------------------ RITEL is an oral QA dialogue system which aims to enable a human to refine a search interactively. Analysis is based on the typing of chunks, according to several categories: named, linguistic or specific entities. Since the system currently obtains promising results using a search based on the common presence of the typed chunks in the documents and in the query, we expect better results by detecting semantic relations between the chunks. After a short description of the RITEL system, this paper describes the first step of our research to define and detect generic semantic relations in the user's spoken utterances, coping with the difficulties of such an analysis: the need for speed, recognition errors, and asyntactical utterances. We also give the first results we have obtained, and some perspectives for future work to extend the number of detected relations and to extend this detection to the documents. turmoCLEF2007-QASToverview.pdf [CLEF 2007 Workshop, Budapest, September 2007] --------- Overview of the QAST 2007 J.
Turmo and P. Comas and C. Ayache and D. Mostefa and S. Rosset and L. Lamel ------------------------ This paper describes QAST, a pilot track of CLEF 2007 aimed at evaluating the task of Question Answering in Speech Transcripts. The paper summarizes the evaluation framework, the systems that participated and the results achieved. These results have shown that question answering technology can be useful for dealing with spontaneous speech transcripts, both for manually transcribed and for automatically recognized speech. The loss in accuracy when moving from manual transcripts to automatic ones implies that there is room for future research in this area. rossetCLEF2007.pdf [CLEF 2007 Workshop, Budapest, September 2007] --------- The LIMSI participation to the QAst track S. Rosset and O. Galibert and G. Adda and E. Bilinski ------------------------ In this paper, we present the two different question-answering systems on speech transcripts which participated in the QAst 2007 evaluation. These two systems are based on a complete and multi-level analysis of both queries and documents. The first system uses handcrafted rules for small text fragment (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The evaluation results range from 17% to 39% accuracy depending on the task.
toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals Björn Schuller, Anton Batliner, Dino Seppi, Stefan Steidl, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous and Vered Aharonson ------------------------ In this paper, we report on classification results for emotional user states (4 classes, German database of children interacting with a pet robot). Six sites computed acoustic and linguistic features independently from each other, following in part different strategies. A total of 4244 features were pooled together and grouped into 12 low-level descriptor types and 6 functional types. For each of these groups, classification results using Support Vector Machines and Random Forests are reported for the full set of features, and for the 150 features per group with the highest individual Information Gain Ratio. The performance for the different groups varies mostly between approx. 50% and approx. 60%. [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Speech fundamental frequency estimation using the Alternate Comb Jean-Sylvain Lienard, Francois Signol and Claude Barras ------------------------ Reliable estimation of speech fundamental frequency is crucial from the perspective of speech separation. We show that gross errors in F0 measurement occur for particular configurations of the periodic structure to be estimated and the other periodic structure used to achieve this estimation. The error families are characterized by a set of two positive integers. The Alternate Comb method uses this knowledge to cancel most of the erroneous solutions. Its efficiency is assessed by an evaluation on a classical pitch database.
toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Using phonetic features in unsupervised word decompounding for ASR with application to a less-represented language Thomas Pellegrini and Lori Lamel ------------------------ In this paper, a data-driven word decompounding algorithm is described and applied to a broadcast news corpus in Amharic. The baseline algorithm has been enhanced in order to address the problem of increased phonetic confusability arising from word decompounding, by incorporating phonetic properties and some constraints on recognition units derived from prior forced alignment experiments. Speech recognition experiments have been carried out to validate the approach. Out-of-vocabulary (OOV) word rates can be reduced by 30% to 40%, and an absolute Word Error Rate (WER) reduction of 0.4% has been achieved. The algorithm is relatively language independent and requires minimal adaptation to be applied to other languages. toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Improved Acoustic Modeling for Transcribing Arabic Broadcast Data Lori Lamel, Abdel. Messaoudi and Jean-Luc Gauvain ------------------------ This paper summarizes our recent progress in improving the automatic transcription of Arabic broadcast audio data, and some efforts to address the challenges of broadcast conversational speech. Our efforts are aimed at improving the acoustic, pronunciation and language models, taking into account specificities of the Arabic language. In previous work we demonstrated that explicit modeling of short vowels improved recognition performance, even when producing non-vocalized hypotheses. In addition to modeling short vowels, consonant gemination and nunation are now explicitly modeled, alternative pronunciations have been introduced to better represent dialectal variants, and a duration model has been integrated.
In order to facilitate training on Arabic audio data with non-vocalized transcripts, a generic vowel model has been introduced. Compared with the previous system (used in the 2006 GALE evaluation), the word error rate has been reduced by over 10% relative. toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Improved Machine Translation of Speech-to-Text outputs Daniel Déchelotte, Holger Schwenk, Gilles Adda and Jean-Luc Gauvain ------------------------ Combining automatic speech recognition and machine translation is frequent in current research programs. This paper first presents several pre-processing steps to limit the performance degradation observed when translating an automatic transcription (as opposed to a manual transcription). Indeed, automatically transcribed speech often differs significantly from the machine translation system's training material with respect to casing, punctuation and word normalization. The proposed system outperforms the best system at the 2007 TC-STAR evaluation by almost 2 BLEU points. The paper then attempts to determine a criterion characterizing how well an STT system's output can be translated, but the current experiments could only confirm that lower word error rates lead to better translations. ritel-final-20070530.pdf [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Handling speech input in the Ritel QA dialogue system B. van Schooten and S. Rosset and O. Galibert and A. Max and R. op den Akker and G. Illouz ------------------------ The Ritel system aims to provide open-domain question answering to casual users by means of a telephone dialogue. Providing a sufficiently natural speech dialogue in a QA system poses some unique challenges, such as very fast overall performance and large-vocabulary speech recognition. Speech QA is an error-prone process, but errors or problems that occur may be resolved with the help of user feedback, as part of a natural dialogue.
This paper reports on the latest version of Ritel, which includes search term confirmation and improved follow-up question handling, as well as various improvements to the other system components. We collected a small dialogue corpus with the new system, which we use to evaluate and discuss it. toaddpdffile [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Comparing Praat and Snack formant measurements on two large corpora of northern and southern French C. Woehrling & P. Boula de Mareüil ------------------------ We compare formant frequency measurements between two authoritative tools (Praat and Snack), two large corpora (of face-to-face and telephone speech) and two French varieties (northern and southern). We present both an evaluation of formant tracking (as well as related filtering techniques) and an application to identifying salient pronunciation traits. Despite differences between Praat and Snack with regard to telephone speech (Praat yielding greater F1 values), results seem to converge to suggest that northern and southern French varieties mainly differ in the second formant of the open /O/. /O/ fronting in northern French (with F2 values greater than 1100 Hz for males and 1200 Hz for females) is by far the most discriminating feature provided by decision trees applied to oral vowel formants. allauzen_is07.pdf [InterSpeech Eurospeech 2007, Antwerp, August 2007] --------- Error detection in confusion network Alexandre Allauzen ------------------------ In this article, error detection for a broadcast news transcription system is addressed in a post-processing stage. We investigate a logistic regression model based on features extracted from confusion networks. This model aims to estimate a confidence score for each confusion set and detect errors. Different kinds of knowledge sources are explored, such as the confusion set itself, a statistical language model, and lexical properties.
The impact of the different features is assessed, showing the importance of those extracted from the confusion network alone. To enrich our modeling with information about the neighborhood, features of adjacent confusion sets are also added to the feature vector. Finally, a distinct processing of confusion sets is also explored depending on the value of their best posterior probability. Compared with the standard ASR output, our best system yields a significant improvement in the classification error rate, from 17.2% to 12.3%. VieruBoulaMadda_ParaLing07.pdf [ParaLing07, August 2007] --------- Identification of foreign-accented French using data-mining techniques B. Vieru-Dimulescu, P. Boula de Mareüil & M. Adda-Decker ------------------------ The goal of this paralinguistic study is to automatically differentiate foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of the automatic alignment into phonemes of non-native French recordings to measure vowel formants, consonant duration and voicing, and prosodic cues, as well as pronunciation variants combining French and foreign acoustic units (e.g. a rolled /r/). We retained 62 features which we used to train a classifier to discriminate speakers' foreign accents. We obtained over 50% correct identification using a cross-validation method including unseen data (a few sentences from 36 speakers). The best automatically selected features are robust to corpus change and make sense with regard to linguistic knowledge. toaddpdffile [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Impact of duration and vowel inventory size on formant values of oral vowels: an automated formant analysis from eight languages.
Cédric Gendrot and Martine Adda-Decker ------------------------ Eight languages (Arabic, English, French, German, Italian, Mandarin Chinese, Portuguese, Spanish) with 6 differently sized vowel inventories were analysed in terms of vowel formants. A tendency to phonetic reduction for vowels of short acoustic duration clearly emerges for all languages. The data did not provide evidence for an effect of inventory size on the global acoustic space; only the acoustic stability of the quantal vowel /i/ is greater than that of other vowels in many cases. [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Analysis of oral and nasal vowel realisation in northern and southern French varieties P. Boula de Mareüil, M. Adda-Decker & C. Woehrling ------------------------ We present data on the pronunciation of oral and nasal vowels in northern and southern French varieties. In particular, a sharp contrast exists in the fronting of the open /O/ towards [oe] in the North and the denasalisation of nasal vowels in the South. We examine how linguistic changes in progress may affect these vowels, which are governed by the left/right context, and bring to light differences between read and spontaneous speech. This study was made possible by automatic phoneme alignment on a large corpus of over 100 speakers. VieruBoulaMadda_Icphs07.pdf [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Characterizing non-native French accents using automatic alignment B. Vieru-Dimulescu, P. Boula de Mareüil & M. Adda-Decker ------------------------ The goal of this study is to investigate the most relevant cues that differentiate 6 foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of the automatic alignment into phonemes of non-native French recordings.
Starting from standard acoustic models, we introduced pronunciation variants reminiscent of foreign-accented speech: first allowing alternations between French phonemes (e.g. [s]~[z]), then combining them with foreign acoustic units (e.g. a rolled /r/). Results reveal discriminating accent-specific pronunciations which, to a large extent, confirm both linguistic predictions and human listeners' judgments. toaddpdffile [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Bayesian framework for voicing alternation & assimilation studies on large corpora in French Martine Adda-Decker, Pierre Hallé ------------------------ The presented work aims at exploring voicing alternation and assimilation on very large corpora using a Bayesian framework. A voice feature (VF) variable has been introduced, whose value is determined using statistical acoustic phoneme models corresponding to 3-state Gaussian mixture Hidden Markov Models. For all relevant consonants, i.e. oral plosives and fricatives, the surface-form voice feature is determined by maximising the acoustic likelihood of the competing phoneme models. A voicing alternation (VA) measure counts the number of changes between underlying and surface-form voice features. Using a corpus of 90h of French journalistic speech, an overall voicing alternation rate of 2.7% has been measured, thus calibrating the method's accuracy. The VA rate remains below 2% word-internally and at word onsets, and rises to 9% at lexical word endings. In assimilation contexts the rates grow significantly (>20%), highlighting regressive voicing assimilation. Results also exhibit a weak tendency towards progressive devoicing.
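The voicing-alternation measure described in this abstract can be sketched in a few lines (a minimal illustration under assumed inputs, not the authors' implementation: each token carries its underlying voicing and hypothetical log-likelihoods under the voiced and voiceless acoustic models):

```python
# Sketch of the voicing-alternation (VA) measure: the surface voice feature
# is chosen by maximizing the acoustic likelihood of the competing phoneme
# models, then compared with the underlying (lexical) voice feature.
def voicing_alternation_rate(tokens):
    changes = 0
    for underlying_voiced, ll_voiced, ll_voiceless in tokens:
        surface_voiced = ll_voiced > ll_voiceless  # maximize acoustic likelihood
        if surface_voiced != underlying_voiced:
            changes += 1
    return changes / len(tokens)

# Hypothetical tokens: (underlying_voiced, loglik voiced model, loglik voiceless model)
tokens = [
    (True, -42.0, -45.5),   # /b/ realized voiced: no alternation
    (False, -40.1, -43.2),  # /p/ realized voiced: assimilation candidate
    (True, -50.3, -48.0),   # /z/ realized voiceless: devoicing
]
rate = voicing_alternation_rate(tokens)  # 2 alternations out of 3 tokens
```

On real data the rate is computed separately per position (word-internal, word onset, word ending) and per context, which is how the 2%/9%/>20% figures above are obtained.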
IV07nato.pdf [to appear in NATO Security Through Science Program, IOS Press BV, ''The Fundamentals of Verbal and Non-verbal Communication and the Biometrical Issue''] --------- A Cross-Language Study of Acoustic and Prosodic Characteristics of Vocalic Hesitations Ioana Vasilescu and Martine Adda-Decker ------------------------ This contribution provides a cross-language study of the acoustic and prosodic characteristics of vocalic hesitations. One aim of the presented work is to use large corpora to investigate whether some language universals can be found. A complementary point of view is to determine whether vocalic hesitations can be considered as bearing language-specific information. An additional point of interest concerns the link between vocalic hesitations and the vowels in the phonemic inventory of each language. Finally, the gained insights are of interest to research in acoustic modeling for automatic speech, speaker and language recognition. Hesitations have been automatically extracted from large corpora of journalistic broadcast speech and parliamentary debates in three languages (French, American English and European Spanish). Duration, fundamental frequency and formant values were measured and compared. Results confirm that vocalic hesitations share (potentially universal) properties across languages, characterized by longer durations and lower fundamental frequency than those observed for intra-lexical vowels in the three languages investigated here. The results on vocalic timbre show that while the measures on hesitations are close to existing vowels of the language, they do not necessarily coincide with them. The measured average timbre of vocalic hesitations in French is slightly more open than its closest neighbor (/oe/). For American English, the average F1 and F2 formant values position the vocalic hesitation as a mid-open vowel somewhere between /^/ and /ae/.
The Spanish vocalic hesitation almost completely overlaps with the mid-closed front vowel /e/. icphs2007ivrnmafinal.pdf [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Vocalic Hesitations vs Vocalic Systems: A Cross-Language Comparison Ioana Vasilescu, Rena Nemoto and Martine Adda-Decker ------------------------ This paper deals with the acoustic characteristics of vocalic hesitations in a cross-language perspective. The underlying questions concern the ``neutral'' vs. language-dependent timbre of vocalic hesitations and the link between their vocalic quality and the phonemic system of the language. An additional point of interest concerns the duration effect on vocalic hesitations compared to intra-lexical vowels. Acoustic measurements have been carried out in American English, French and Spanish. Results on vocalic timbre show that hesitations (i) carry language-specific information and (ii), while often close to measurements of existing vowels, do not necessarily coincide with them. Finally, (iii) duration variation affects the timbre of vocalic hesitations, and a centralization towards a ``neutral'' realization is observed for decreasing durations. toaddpdffile [XVIth International Conference on Phonetic Sciences ICPhS Saarbrücken, August 2007] --------- Voicing assimilation in journalistic speech Pierre Hallé, Martine Adda-Decker ------------------------ We used a corpus of radio and television speech to run a quantitative study of voicing assimilation in French. The results suggest that, although voicing itself can be incomplete, voice assimilation is essentially categorical. The amount of voicing assimilation depends little on underlying voicing but clearly varies with segment duration and also with consonant manner of articulation. The results also suggest that voicing assimilation, though largely regressive, is not purely unidirectional.
jel2007PBM.pdf [5èmes Journées d'Études Linguistiques de Nantes, June 2007] --------- Traitement du schwa : de la synthèse à l'alignement automatique de la parole P. Boula de Mareüil ------------------------ The aim of this talk is to provide a better picture of schwa in French through automatic speech processing. After a general presentation of grapheme-to-phoneme conversion for speech synthesis, the treatment of schwa is considered and the question of evaluation is raised. To move beyond the deterministic framework of synthesis, in which only one way of reading a given sentence is allowed, automatic alignment techniques on large speech corpora are put to use. Automatic processing makes it possible to quantify stylistic and regional differences between French varieties. The confirmation of expected results on the pronunciation of schwa supports the validity of the approach. EMNLP-CoNLL200745.pdf [Empirical Methods in Natural Language Processing] --------- Smooth Bilingual N-gram Translation Holger Schwenk and Daniel Déchelotte and Hélène Bonneau-Maynard and Alexandre Allauzen ------------------------ WMT25.pdf [ACL Workshop on Statistical Machine Translation] --------- Building a statistical machine translation system for French using the Europarl corpus Holger Schwenk ------------------------ toaddpdffile [Computer Speech and Language, 21:492-518, 2007] --------- Continuous space language models. Holger Schwenk ------------------------ This paper describes the use of a neural network language model for large vocabulary continuous speech recognition. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. Very efficient learning algorithms are described that enable the use of training corpora of several hundred million words.
It is also shown that this approach can be incorporated into a large vocabulary continuous speech recognizer using a lattice rescoring framework at a very low additional processing time. The neural network language model has been thoroughly evaluated in a state-of-the-art large vocabulary continuous speech recognizer for several international benchmark tasks, in particular the NIST evaluations on broadcast news and conversational speech recognition. The new approach is compared to 4-gram back-off language models trained with modified Kneser-Ney smoothing, which has often been reported to be the best known smoothing method. The neural network language model achieved consistent word error rate reductions for all considered tasks and languages, ranging from 0.5% up to 1.6% absolute. toaddpdffile [5èmes Journées d'Études Linguistiques de Nantes, June 2007] --------- Problèmes posés par le schwa en reconnaissance et en alignement automatiques de la parole Martine Adda-Decker ------------------------ This contribution addresses the following questions: What problems does schwa pose for automatic speech recognition, and for automatic alignment, which already serves and will increasingly serve many phonetic and phonological analyses? Answering them requires going back to upstream issues concerning schwa and acoustic modeling. What links exist between pronunciation dictionaries, pronunciation variants, acoustic modeling and automatic segmentation? What can we learn about schwa from the automatic processing of large corpora? What can schwa teach us about the other vowels of the system? toaddpdffile [7èmes Journées Jeunes Chercheurs en Parole, Paris, July 2007] --------- Comparaison entre l'extraction de formants par Praat et Snack sur deux grands corpus de français du nord et du sud C. Woehrling-Dimulescu & P.
Boula de Mareüil ------------------------ We compare formant frequency measurements between two authoritative tools (Praat and Snack), two large corpora (of face-to-face and telephone speech) and two French varieties (northern and southern). The study includes both an evaluation of formant tracking (as well as related filtering techniques) and an application to finding salient pronunciation traits. Despite differences between Praat and Snack with regard to telephone speech (Praat yielding greater F1 values), results seem to converge, suggesting that northern and southern French varieties mainly differ in the second formant of the open /O/. /O/ fronting in northern French (with F2 values greater than 1100 Hz for males and 1200 Hz for females) is by far the most discriminating feature provided by decision trees applied to oral vowel formants. Vieru_rjcp07.pdf [7èmes Journées Jeunes Chercheurs en Parole, Paris, July 2007] --------- Identification de 6 accents étrangers en français utilisant des techniques de fouilles de données B. Vieru-Dimulescu & P. Boula de Mareüil ------------------------ The goal of this study is to automatically differentiate foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of automatic alignment into phonemes of non-native French recordings to measure vowel formants, consonant duration and voicing, prosodic cues as well as pronunciation variants combining French and foreign acoustic units (e.g. a rolled 'r'). We retained 62 features which we used for training a classifier to discriminate speakers' foreign accents. We obtained over 50% correct identification by using a cross-validation method including unseen data (a few sentences from 36 speakers). The best automatically selected features are robust to corpus change and make sense with regard to linguistic knowledge. toaddpdffile [Revue Française de Linguistique Appliquée, pp. 71-84, vol.
XII-1/juin 2007] --------- Corpus pour la transcription automatique de l'oral Martine Adda-Decker ------------------------ This contribution illustrates the production and use of corpora for research in automatic speech transcription. Since this research relies heavily on statistical modeling, it naturally goes hand in hand with the production of written corpora and of transcribed oral corpora, as well as of tools facilitating manual transcription. The methods and techniques that have been developed now allow large-scale automatic spoken language processing, while contributing to an emerging interdisciplinary research field: spoken corpus linguistics. LIMSI_rt07asr_final.pdf [to appear in Springer Lecture Notes in Computer Science] --------- The LIMSI RT07 Lecture Transcription System Lori Lamel and Eric Bilinski and Jean-Luc Gauvain and Gilles Adda and Claude Barras and Xuan Zhu ------------------------ A system to automatically transcribe lectures and presentations has been developed in the context of the FP6 Integrated Project CHIL. In addition to the seminar data recorded by the CHIL partners, widely available corpora were used to train both the acoustic and language models. Acoustic model training made use of the transcribed portion of the TED corpus of Eurospeech recordings, as well as the ICSI, ISL, and NIST meeting corpora. For language model training, text materials were extracted from a variety of on-line conference proceedings. Experimental results are reported for close-talking and far-field microphones on development and evaluation data. LIMSI_rt07diar.pdf [to appear in Springer Lecture Notes in Computer Science] --------- Multi-Stage Speaker Diarization for Conference and Lecture Meetings X. Zhu and C. Barras and L. Lamel and J.L.
Gauvain ------------------------ The LIMSI RT-07S speaker diarization system for conference and lecture meetings is presented in this paper. This system builds upon the RT-06S diarization system designed for lecture data. The baseline system combines agglomerative clustering based on the Bayesian information criterion (BIC) with a second clustering stage using state-of-the-art speaker identification (SID) techniques. Since the baseline system yields a high speech activity detection (SAD) error of around 10% on lecture data, different acoustic representations with various normalization techniques are investigated within the framework of a log-likelihood ratio (LLR) based speech activity detector. UBMs trained on the different types of acoustic features are also examined in the SID clustering stage. All SAD acoustic models and UBMs are trained on the forced-alignment segmentations of the conference data. The diarization system integrating the new SAD models and UBM gives comparable results on both the RT-07S conference and lecture evaluation data for the multiple distant microphone (MDM) condition. toaddpdffile [HLT/NAACL workshop on Syntax and Structure in Statistical Translation, April 2007] --------- Combining morphosyntactic enriched representation with n-best reranking in statistical translation Hélène Bonneau-Maynard, Alexandre Allauzen, Daniel Déchelotte, and Holger Schwenk ------------------------ The purpose of this work is to explore the integration of morphosyntactic information into the translation model itself, by enriching words with their morphosyntactic categories. We investigate word disambiguation using morphosyntactic categories, n-best hypotheses reranking, and the combination of both methods with word or morphosyntactic n-gram language model reranking. Experiments are carried out on the English-to-Spanish translation task.
Using the morphosyntactic language model alone does not result in any improvement in performance. However, combining morphosyntactic word disambiguation with a word-based 4-gram language model results in an improvement in the BLEU score of 0.6% on the development set and 0.3% on the test set. 0400021.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Detection and Analysis of Abnormal Situations Through Fear-Type Acoustic Manifestations C. Clavel and L. Devillers and G. Richard and I. Vasilescu and T. Ehrette ------------------------ Recent work on emotional speech processing has demonstrated the interest of considering the information conveyed by the emotional component in speech to enhance the understanding of human behaviors. But to date, there has been little integration of emotion detection systems into effective applications. The present research focuses on the development of a fear-type emotion recognition system to detect and analyze abnormal situations for surveillance applications. The Fear vs. Neutral classification achieves a mean accuracy rate of 70.3%. This is quite an encouraging result given the diversity of fear manifestations illustrated in the data. More specific acoustic models are built inside the fear class by considering the context of emergence of the emotional manifestations, i.e. the type of threat during which they occur, which has a strong influence on fear acoustic manifestations. The potential use of these models for a threat type recognition system is also investigated. Such information about the situation can indeed be useful for surveillance systems.
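The Fear vs. Neutral decision described in the abstract above rests on class-conditional acoustic models; a toy version can be sketched as follows (a deliberately simplified single-Gaussian sketch with made-up feature values, not the paper's GMM system):

```python
import math

# Toy two-class acoustic classifier: each class is modeled by a diagonal
# Gaussian over a small feature vector (here, hypothetically, pitch mean in Hz
# and normalized energy); the class with the higher log-likelihood wins.

def fit_gaussian(vectors):
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dim)]  # floor the variance for numerical safety
    return mean, var

def loglik(model, x):
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var[d])
                       + (x[d] - mean[d]) ** 2 / var[d])
               for d in range(len(x)))

def classify(models, x):
    return max(models, key=lambda c: loglik(models[c], x))

# Invented training vectors, for illustration only.
models = {
    "fear":    fit_gaussian([[310.0, 0.9], [295.0, 0.8], [320.0, 0.95]]),
    "neutral": fit_gaussian([[190.0, 0.4], [205.0, 0.5], [185.0, 0.45]]),
}
label = classify(models, [300.0, 0.85])  # high pitch, high energy
```

A real system would use Gaussian mixture models trained on many spectral and prosodic features, but the decision rule (maximum class-conditional likelihood) is the same.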
0400053.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Constrained MLLR for Speaker Recognition Marc Ferras, Cheung-Chi Leung, Claude Barras and Jean-Luc Gauvain ------------------------ Maximum-Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) are two widely-used techniques for speaker adaptation in large-vocabulary speech recognition systems. Recently, using MLLR transforms as features for speaker recognition tasks has been proposed, achieving performance comparable to that obtained with cepstral features. This paper describes a new feature extraction technique for speaker recognition based on CMLLR speaker adaptation which avoids the use of transcripts. Modeling is carried out with Support Vector Machines (SVM). Results on the NIST Speaker Recognition Evaluation 2005 dataset are provided, alone as well as in combination with two cepstral approaches, MFCC-GMM and MFCC-SVM, for which system performance is improved by 10% relative in Equal Error Rate. 0401277.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Speech Recognition System Combination for Machine Translation M.J.F. Gales and X. Liu and R. Sinha and P.C. Woodland and K. Yu and S. Matsoukas and T. Ng and K. Nguyen and L. Nguyen and J-L Gauvain and L. Lamel and A. Messaoudi ------------------------ The majority of state-of-the-art speech recognition systems make use of system combination. The combination approaches adopted have traditionally been tuned to minimising Word Error Rates (WERs). In recent years there has been growing interest in taking the output from speech recognition systems in one language and translating it into another. This paper investigates the use of cross-site combination approaches in terms of both WER and impact on translation performance. In addition, the stages involved in modifying the output from a Speech-to-Text (STT) system to be suitable for translation are described.
Two source languages, Mandarin and Arabic, are recognised and then translated into English using a phrase-based statistical machine translation system. The performance of the individual systems and of cross-site combination using cross-adaptation and ROVER is given. Results show that the best STT combination scheme in terms of WER is not necessarily the most appropriate when translating speech. 0400997.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- The LIMSI 2006 TC-STAR EPPS Transcription Systems Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski, Olivier Galibert, Agusti Pujol, Holger Schwenk, Xuan Zhu ------------------------ This paper describes the speech recognizers developed to transcribe European Parliament Plenary Sessions (EPPS) in English and Spanish in the 2nd TC-STAR Evaluation Campaign. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 EPPS systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. Experiments with cross-site adaptation and system combination are also described. 0400641.pdf [IEEE ICASSP 2007, Honolulu, Hawaii, April 2007] --------- Modeling Duration via Lattice Rescoring Nicolas Jennequin and Jean-Luc Gauvain ------------------------ It is often acknowledged that HMMs do not properly model phone and word durations. In this paper phone and word duration models are used to improve the accuracy of state-of-the-art large vocabulary speech recognition systems. The duration information is integrated into the systems by rescoring word lattices that include phone-level segmentations.
Experimental results are given for a conversational telephone speech (CTS) task in French and for the TC-Star EPPS transcription task in Spanish and English. An absolute word error rate reduction of about 0.5% is observed for the CTS task, and smaller but consistent gains are observed for the EPPS task in both languages. Tal_ritel.pdf [Traitement Automatique des Langues, vol 46 (3)] --------- Interaction et recherche d'information : le projet Ritel S. Rosset and O. Galibert and G. Illouz and A. Max ------------------------ The Ritel project aims at integrating spoken language dialog and open-domain information retrieval to allow a human to ask general questions and refine her search interactively. In this paper, we describe the current platform and more specifically the analysis, information extraction and natural language generation modules. We present some preliminary results of the different modules. parole06.pdf [Revue Parole, 37: 25-65] --------- Identification d'accents régionaux en français : perception et analyse C. Woehrling and P. Boula de Mareüil ------------------------ This article is dedicated to the perceptual identification of French varieties by listeners from the surroundings of Paris and Marseilles. It is based on recordings of about forty speakers from six Francophone regions: Normandy, Vendee, Romand Switzerland, Languedoc and the Basque Country.
Contrary to the speech type (read or spontaneous speech) and the listeners' region of origin (Paris or Marseilles), the speakers' degree of accentedness has a major effect and interacts with the speakers' age. The origin of the oldest speakers, who have the strongest accent according to the Paris listeners' judgements, is better recognised than the origin of the youngest speakers. However, confusions are frequent among the Southern varieties. On the whole, three accents can be distinguished (by clustering and multidimensional scaling techniques): Northern French, Southern French and Romand Swiss. Boula_Vieru_phonetica06.pdf [Phonetica 63 (pp. 247-267)] --------- The contribution of prosody to the perception of foreign accent P. Boula de Mareüil and B. Vieru-Dimulescu ------------------------ The general goal of this study is to expand our understanding of what is meant by foreign /accent/ in people's parlance. More specifically, we insist on the role of prosody (timing and melody), which has rarely been examined. New technologies, including diphone speech synthesis (Experiment 1) and speech manipulation (Experiment 2), are used to study the relative importance of prosody in what is perceived as a foreign /accent/. The methodology we propose, based on the prosody transplantation paradigm, can be applied to different languages or language varieties. Here, it is applied to Spanish and Italian. We built up a dozen sentences which are spoken in almost the same way in both languages (e.g. /ha visto la casa //del// presidente americano/ "you/(s)he saw the American president's house"). Spanish/Italian monolinguals and bilinguals were recorded. We then studied what is perceived when the segmental specification of an utterance is combined with suprasegmental features belonging to a different language.
Results obtained with Italian, Spanish and French listeners resemble one another, and suggest that prosody is at least as important as the articulation of phonemes and voice quality in identifying Spanish-accented Italian and Italian-accented Spanish. IS061776.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Investigating Automatic Decomposition for ASR in Less Represented Languages Thomas Pellegrini and Lori Lamel ------------------------ This paper addresses the use of an automatic decomposition method to reduce lexical variety and thereby improve speech recognition of less well-represented languages. The Amharic language has been selected for these experiments since only a small quantity of resources is available compared to well-covered languages. Inspired by the Harris algorithm, the method automatically generates plausible affixes that, combined with decompounding, can reduce the size of the lexicon and the OOV rate.
Recognition experiments are carried out for four different configurations (full-word and decompounded) using supervised training with a corpus containing only two hours of manually transcribed data. IS061636.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Real-life emotions detection with lexical and paralinguistic cues on Human-Human call center dialogs Laurence Devillers and Laurence Vidrascu ------------------------ The emotion detection work reported here is part of a larger study aiming to model user behavior in real interactions. We have already studied emotions in a real-life corpus of human-human dialogs on a financial task. We now make use of another corpus of real agent-caller spoken dialogs from a medical emergency call center, in which emotion manifestations are much more complex and extreme emotions are common. Our global aims are to define appropriate verbal labels for annotating real-life emotions, to annotate the dialogs, to validate the presence of emotions via perceptual tests, and to find robust cues for emotion detection. Annotations have been done by two experts with twenty verbal classes organized in eight macro-classes. For the experiments in this paper we retained four macro-classes: Relief, Anger, Fear and Sadness. The relevant cues for detecting natural emotions are paralinguistic and linguistic. Two studies are reported in this paper: the first investigates automatic emotion detection using linguistic information, whereas the second investigates emotion detection with paralinguistic cues. On the medical corpus, preliminary experiments using lexical cues detect about 78% of the four labels, with very good detection of the Relief (about 90%) and Fear (about 86%) emotions. Experiments using paralinguistic cues show about 60% correct detection, Fear being best detected.
IS061251.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Perceptual identification and phonetic analysis of 6 foreign accents in French Bianca Vieru-Dimulescu and Philippe Boula de Mareüil ------------------------ A perceptual experiment was designed to determine to what extent naïve French listeners are able to identify foreign accents in French: Arabic, English, German, Italian, Portuguese and Spanish. They succeed in recognizing the speaker's mother tongue in more than 50% of cases (while rating the degree of accentedness as average). They perform best with Arabic speakers and worst with Portuguese speakers. The Spanish/Italian and English/German accents are the most often confused. Phonetic analyses were conducted; clustering and scaling techniques were applied to the results and related to the listeners' reactions recorded during the test. They support the idea that differences in vowel realization (especially concerning the phoneme /y/) seem to outweigh rhythmic cues. Index Terms: perception, foreign accent, non-native French. IS061261.PDF [InterSpeech ICSLP 2006, Pittsburgh, September 2006] --------- Identification of regional accents in French: perception and categorization Cécile Woehrling and Philippe Boula de Mareüil ------------------------ This article is dedicated to the perceptual identification of French varieties by listeners from the surroundings of Paris and Marseilles. It is based on the geographical localization of about forty speakers from 6 Francophone regions: Normandy, Vendee, Romand Switzerland, Languedoc and the Basque Country. Contrary to the speech type (read or spontaneous speech) and the listeners' region of origin, the speakers' degree of accentedness has a major effect and interacts with the speakers' age. The origin of the oldest speakers (who have the strongest accent according to the Paris listeners' judgments) is better recognized than the origin of the youngest speakers.
However, confusions are frequent among the Southern varieties. On the whole, three accents can be distinguished (by clustering and multidimensional scaling techniques): Northern French, Southern French and Romand Swiss. zhu_odyssey.pdf [IEEE Odyssey The Speaker and Language Recognition Workshop, San Juan, June 2006] --------- Dong Zhu and Martine Adda-Decker ------------------------ Recent results have shown that phone lattices may significantly improve language recognition in a PPRLM (parallel phone recognition followed by language dependent modeling) framework. In this contribution we investigate the effectiveness of lattices for language identification (LID) when using different sets of multilingual phone and syllable inventories and corresponding multilingual acoustic phone models. The LID system that achieves the best results is a PPRLM structure composed of four acoustic recognizers using both phone- and syllable-based decoders. A 7-language broadcast news corpus of approximately 150 hours of speech is used for training, development and test of our LID systems. Experimental results show that the use of lattices consistently improves LID results for all multilingual acoustic model sets and for both phonotactic and syllabotactic approaches. tcstar06_dechelotte.pdf [TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, June 2006] --------- The 2006 LIMSI Statistical Machine Translation System for TC-STAR Daniel Déchelotte, Holger Schwenk and Jean-Luc Gauvain ------------------------ This paper presents the LIMSI statistical machine translation system developed for the 2006 TC-STAR evaluation campaign. We describe an A*-decoder that generates translation lattices using a word-based translation model. A lattice is a rich and compact representation of alternative translations that includes the probability scores of all the involved sub-models.
These lattices are then used in subsequent processing steps, in particular to perform sentence splitting and joining, maximum BLEU training, and to apply improved statistical target language models. tcstar06_lamel.pdf [TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, June 2006] --------- The LIMSI 2006 TC-STAR Transcription Systems Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski, Olivier Galibert, Agusti Pujol, Holger Schwenk, Xuan Zhu ------------------------ This paper describes the speech recognizers evaluated in the TC-STAR Second Evaluation Campaign held in January-February 2006. Systems were developed to transcribe parliamentary speeches in English and Spanish, as well as broadcast news in Mandarin Chinese. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 European Parliament Plenary Sessions (EPPS) systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. The character error rate for Mandarin for a joint system submission with the University of Karlsruhe was 9.8%. Experiments with cross-site adaptation and system combination are also described. tcstar06_jennequin.pdf [TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, June 2006] --------- Lattice Rescoring Experiments with Duration Models Nicolas Jennequin and Jean-Luc Gauvain ------------------------ This paper reports on experiments using phone and word duration models to improve speech recognition accuracy. The duration information is integrated into state-of-the-art large vocabulary speech recognition systems by rescoring word lattices that include phone-level segmentations.
Experimental results are given for a conversational telephone speech (CTS) task in French and for the TC-STAR EPPS transcription task in Spanish and English. An absolute word error rate reduction of about 0.5% is observed for the CTS task, and smaller but consistent gains are observed for the EPPS task. jep06madda.pdf [JEP 2006, Dinard, June 2006] --------- De la reconnaissance automatique de la parole à l'analyse linguistique de corpus oraux Martine Adda-Decker ------------------------ This contribution aims at giving an overview of present automatic speech recognition in French, highlighting typical transcription problems for this language. Explanations for errors can be partially obtained by examining the acoustics of the speech data. Such investigations, however, do not only inform about system and/or speech modeling limitations; they also contribute to discover, describe and quantify specificities of spoken language as opposed to written language and speakers' speech performance. To automatically transcribe different speech genres (e.g. broadcast news vs conversations) specific acoustic corpora are used for training, suggesting that frequency of occurrence and acoustic realisations of phonemes vary significantly across genres. Some examples of corpus studies are presented describing phoneme frequencies and segment durations on different corpora. In the future, large-scale corpus studies may contribute to increase our knowledge of spoken language as well as the performance of automatic processing. jep06bv.pdf [JEP 2006, Dinard, June 2006] --------- Identification perceptive d'accents étrangers en français de corpus oraux B. Vieru-Dimulescu and P. Boula de Mareüil ------------------------ A perceptual experiment was designed to determine to what extent naïve French listeners are able to identify foreign accents in French: Arabic, English, German, Italian, Portuguese and Spanish.
They succeed in recognising the speaker's mother tongue in more than 50% of cases (while rating their degree of accentedness as average). They perform best with Arabic speakers and worst with Portuguese speakers. The Spanish and Italian accents on the one hand, and the English and German accents on the other, are the most often confused. Phonetic analyses were conducted; clustering and scaling techniques were applied to the results, and were related to the listeners' reactions that were recorded during the test. Emphasis was laid on differences in vowel realisation (especially concerning the phoneme /y/). jep2006_zhu_madda_versionfinale.pdf [JEP 2006, Dinard, June 2006] --------- Identification automatique des langues : combinaison d'approches phonotactiques à base de treillis de phones et de syllabes Dong Zhu and Martine Adda-Decker ------------------------ This paper investigates the use of phone and syllable lattices for automatic language identification (LID) based on multilingual phone sets (73, 50 and 35 phones). We focus on efficient combinations of both phonotactic and syllabotactic approaches. The LID structure used to achieve the best performance within this framework is similar to PPRLM (parallel phone recognition followed by language dependent modeling): several acoustic recognizers based on either multilingual phone or syllable inventories, followed by language-specific n-gram language models. A seven-language broadcast news corpus is used for the development and the test of the LID systems. Our experiments show that the use of the lattice information significantly improves results over all system configurations and all test durations. Multiple system combinations achieve further improvements.
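The phonotactic/PPRLM recipe shared by the LID abstracts above — decode speech into phone (or syllable) strings with parallel recognizers, then score each string with language-specific n-gram models and pick the best-scoring language — can be sketched in a few lines. This is an illustrative toy only: the phone strings, the two-language setup and the add-one-smoothed bigram models are my own assumptions, not the lattice-based systems described in these papers.

```python
import math
from collections import defaultdict

def train_bigram(phone_strings):
    """Phonotactic model: phone-bigram counts with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    inventory = set()
    for s in phone_strings:
        phones = s.split()
        inventory.update(phones)
        for a, b in zip(phones, phones[1:]):
            counts[a][b] += 1

    def logprob(seq):
        """Log-probability of a decoded phone string under this language's bigrams."""
        phones = seq.split()
        v = len(inventory) + 1  # inventory size + 1 unseen slot for smoothing
        lp = 0.0
        for a, b in zip(phones, phones[1:]):
            lp += math.log((counts[a][b] + 1) / (sum(counts[a].values()) + v))
        return lp

    return logprob

# Hypothetical training data: phone strings produced by one tokenizer per language.
models = {
    "en": train_bigram(["th e k a t", "th e d o g"]),
    "fr": train_bigram(["l e ch a", "l e ch i e n"]),
}

def identify(utterance_phones):
    """Final stage: the language whose phonotactic model scores highest wins."""
    return max(models, key=lambda lang: models[lang](utterance_phones))
```

A real system would score phone lattices rather than 1-best strings (the point of the lattice papers above), run several parallel tokenizers, and calibrate the combined scores in a back-end.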
jep06pellegrini.pdf [JEP 2006, Dinard, June 2006] --------- Expériences de transcription automatique d'une langue rare Thomas Pellegrini and Lori Lamel ------------------------ This work investigates automatic transcription of rare languages, where rare means that there are limited resources available in electronic form. In particular, some experiments on word decompounding for Amharic, as a means of compensating for the lack of textual data, are described. A corpus-based decompounding algorithm has been applied to a 4.6M word corpus. Compounding in Amharic was found to result from the addition of prefixes and suffixes. Using seven frequent affixes reduces the out-of-vocabulary rate from 7.0% to 4.8% and the total number of lexemes from 133k to 119k. Preliminary attempts at recombining the morphemes into words result in a slight decrease in word error rate relative to that obtained with a full word representation. jep06lefevre.pdf [JEP 2006, Dinard, June 2006] --------- Transformation linéaire discriminante pour l'apprentissage des HMM à analyse factorielle Fabrice Lefevre and Jean-Luc Gauvain ------------------------ Factor analysis has recently been used to model the covariance of the feature vector in speech recognition systems. Maximum likelihood estimation of the parameters of factor analyzed HMMs (FAHMMs) is usually done via the EM algorithm. The initial estimates of the model parameters are then a key issue for their correct training. In this paper we report on experiments showing some evidence that the use of a discriminative criterion to initialize the FAHMM maximum likelihood parameter estimation can be effective. The proposed approach relies on the estimation of a discriminant linear transformation to provide initial values for the factor loading matrices. Solutions for the appropriate initialization of the other model parameters are presented as well.
Speech recognition experiments were carried out on the Wall Street Journal LVCSR task with a 65k vocabulary. Contrastive results are reported with various model sizes using discriminant and non-discriminant initialization. lrec06SR.pdf [LREC 2006, Genoa, May 2006] --------- The Ritel Corpus - An annotated Human-Machine open-domain question answering spoken dialog corpus S. Rosset, S. Petel ------------------------ In this paper we present a real (as opposed to Wizard-of-Oz) Human-Computer QA-oriented spoken dialog corpus collected with our Ritel platform. This corpus has been orthographically transcribed and annotated in terms of Specific Entities and Topics. Twelve main topics have been chosen. They are refined into 22 sub-topics. The Specific Entities are from five categories and cover Named Entities, linguistic entities, topic-defining entities, general entities and extended entities. The corpus contains 582 dialogs for 6 hours of user speech. lrec06media.pdf [LREC 2006, Genoa, May 2006] --------- Results of the French Evalda-Media evaluation campaign for literal understanding H. Bonneau-Maynard, C. Ayache, F. Bechet, A. Denis, A. Kuhn, F. Lefevre, D. Mostefa, M. Quignard, S. Rosset, C. Servan, J. Villaneau ------------------------ The aim of the Media-Evalda project is to evaluate the understanding capabilities of dialog systems. This paper presents the Media protocol for speech understanding evaluation and describes the results of the June 2005 literal evaluation campaign. Five systems, both symbolic and corpus-based, participated in the evaluation, which is based on a common semantic representation. Different scorings have been performed on the system results. The understanding error rate for the Full scoring ranges, depending on the system, from 29% to 41.3%. A diagnostic analysis of these results is proposed.
lrec06evaSy.pdf [LREC 2006, Genoa, May 2006] --------- A joint intelligibility evaluation of French text-to-speech synthesis systems: the EvaSy SUS/ACR campaign P. Mareüil, C. D'Alessandro, A. Raake, G. Bailly, M. Garcia, M. Morel ------------------------ The EVALDA/EvaSy project is dedicated to the evaluation of text-to-speech synthesis systems for the French language. It is subdivided into four components: evaluation of the grapheme-to-phoneme conversion module (Boula de Mareüil et al., 2005), evaluation of prosody (Garcia et al., 2006), evaluation of intelligibility, and global evaluation of the quality of the synthesised speech. This paper reports on the key results of the intelligibility and global evaluation of the synthesised speech. It focuses on intelligibility, assessed on the basis of semantically unpredictable sentences, but a comparison with absolute category rating in terms of e.g. pleasantness and naturalness is also provided. Three diphone systems and three selection systems have been evaluated. It turns out that the most intelligible system (diphone-based) is far from being the one which obtains the best mean opinion score. lrec06prosody.pdf [LREC 2006, Genoa, May 2006] --------- A joint prosody evaluation of French text-to-speech synthesis systems M. Garcia, C. d'Alessandro, G. Bailly, P. Mareüil, M. Morel ------------------------ This paper reports on prosodic evaluation in the framework of the EVALDA/EvaSy project for text-to-speech (TTS) evaluation for the French language. Prosody is evaluated using a prosodic transplantation paradigm. Intonation contours generated by the synthesis systems are transplanted on a common segmental content. Both diphone based synthesis and natural speech are used. Five TTS systems are tested along with natural voice. The test is a paired preference test (with 19 subjects), using 7 sentences. 
The results indicate that natural speech consistently obtains the first rank (with an average preference rate of 80%), followed by a selection-based system (72%) and a diphone-based system (58%). However, rather large variations in judgements are observed among subjects and sentences, and in some cases synthetic speech is preferred to natural speech. These results show the remarkable improvement achieved by the best selection-based synthesis systems in terms of prosody. In this way, a new paradigm for evaluation of the prosodic component of TTS systems has been successfully demonstrated. lrec06Abrilian.pdf [LREC 2006, Genoa, May 2006] --------- Annotation of Emotions in Real-Life Video Interviews: Variability between Coders S. Abrilian, L. Devillers, J.C. Martin ------------------------ Research on emotional real-life data has to tackle the problem of their annotation. The annotation of emotional corpora raises the issue of how different coders perceive the same multimodal emotional behaviour. The long-term goal of this paper is to produce a guideline for the selection of annotators. The LIMSI team is working towards the definition of a coding scheme integrating emotion, context and multimodal annotations. We present the currently defined coding scheme for emotion annotation, and the use of soft vectors for representing a mixture of emotions. This paper describes a perceptive test of emotion annotations and the results obtained with 40 different coders on a subset of complex real-life emotional segments selected from the EmoTV Corpus collected at LIMSI. The results of this first study validate previous annotations of emotion mixtures and highlight the differences in annotation between male and female coders. lrec06clavel.pdf [LREC 2006, Genoa, May 2006] --------- Fear-type emotions of the SAFE Corpus: annotation issues C. Clavel, I. Vasilescu, L. Devillers, T. Ehrette, G.
Richard ------------------------ The present research focuses on annotation issues in the context of the acoustic detection of fear-type emotions for surveillance applications. The emotional speech material used for this study comes from the previously collected SAFE Database (Situation Analysis in a Fictional and Emotional Database), which consists of audio-visual sequences extracted from movie fictions. A generic annotation scheme was developed to annotate the various emotional manifestations contained in the corpus. The annotation was carried out by two labellers and the two annotation strategies are confronted. It emerges that the borderline between emotion and neutral varies according to the labeller. An acoustic validation by a third labeller allows an analysis of the two strategies. Two human strategies are then observed: a first one, context-oriented, which mixes audio and contextual (video) information in emotion categorization; and a second one, based mainly on audio information. The k-means clustering confirms the role of audio cues in human annotation strategies. It particularly helps in evaluating those strategies from the point of view of a detection system based on audio cues. lrec06JCM.pdf [LREC 2006, Genoa, May 2006] --------- Manual Annotation and Automatic Image Processing of Multimodal Emotional Behaviours: Validating the Annotation of TV Interviews J. Martin, G. Caridakis, L. Devillers, K. Karpouzis, S. Abrilian ------------------------ There has been a lot of psychological research on emotion and nonverbal communication. Yet, these studies were based mostly on acted basic emotions. This paper explores how manual annotation and image processing can cooperate towards the representation of spontaneous emotional behaviour in low resolution videos from TV. We describe a corpus of TV interviews and the manual annotations that have been defined.
We explain the image processing algorithms that have been designed for the automatic estimation of movement quantity. Finally, we explore how image processing can be used for the validation of manual annotations. lrec06devil.pdf [LREC 2006, Genoa, May 2006] --------- Real life emotions in French and English TV video clips: an integrated annotation protocol combining continuous and discrete approaches L. Devillers, R. Cowie, J. Martin, E. Douglas-Cowie, S. Abrilian, M. Mcrorie ------------------------ A major barrier to the development of accurate and realistic models of human emotions is the absence of multi-cultural / multilingual databases of real-life behaviours and of a federative and reliable annotation protocol. The QUB and LIMSI teams are working towards the definition of an integrated coding scheme combining their complementary approaches. This multilevel integrated scheme combines the dimensions that appear to be useful for the study of real-life emotions: verbal labels, abstract dimensions and contextual (appraisal based) annotations. This paper describes this integrated coding scheme, a protocol that was set up for annotating French and English video clips of emotional interviews, and the results (e.g. inter-coder agreement measures and subjective evaluation of the scheme). lrec06TP.pdf [LREC 2006, Genoa, May 2006] --------- Experimental detection of vowel pronunciation variants in Amharic Thomas Pellegrini and Lori Lamel ------------------------ The pronunciation lexicon is a fundamental element in an automatic speech transcription system. It associates each lexical entry (usually a grapheme) with one or more phonemic or phone-like forms, the pronunciation variants. Thorough knowledge of the target language is a priori necessary to establish the pronunciation baseforms and variants. The reliance on human expertise can pose difficulties in developing a system for a language where such knowledge may not be readily available.
In this article a speech recognizer is used to help select pronunciation variants in Amharic, the official language of Ethiopia, focusing on alternate choices for vowels. This study is carried out using an audio corpus composed of 37 hours of speech from radio broadcasts which were orthographically transcribed by native speakers. Since the corpus is relatively small for estimating pronunciation variants, a first set of studies was carried out at a syllabic level. Word lexica were then constructed based on the observed syllable occurrences. Automatic alignments were compared for lexica containing different vowel variants, with both context-independent and context-dependent acoustic model sets. The variant2+ measure proposed in (Adda-Decker and Lamel, 1999) is used to assess the potential need for pronunciation variants. icassp06FL.pdf [ICASSP 2006, Toulouse, May 2006] --------- Discriminant Initialization for Factor Analyzed HMM Training Fabrice Lefevre and Jean-Luc Gauvain ------------------------ Factor analysis has recently been used to model the covariance of the feature vector in speech recognition systems. Maximum likelihood estimation of the parameters of factor analyzed HMMs (FAHMMs) is usually done via the EM algorithm, meaning that initial estimates of the model parameters are a key issue. In this paper we report on experiments showing some evidence that the use of a discriminative criterion to initialize the FAHMM maximum likelihood parameter estimation can be effective. The proposed approach relies on the estimation of a discriminant linear transformation to provide initial values for the factor loading matrices, as well as appropriate initializations for the other model parameters. Speech recognition experiments were carried out on the Wall Street Journal LVCSR task with a 65k vocabulary. Contrastive results are reported with various model sizes using discriminant and non-discriminant initialization.
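Both FAHMM abstracts above rest on the same structural idea: instead of a full or diagonal covariance, each Gaussian's covariance is factor-analyzed, Sigma = Lam Lam^T + Psi, with a low-rank loading matrix Lam and a diagonal Psi. A minimal numpy sketch of the resulting log-likelihood (variable names are mine, not from the papers; this is the covariance structure only, not the EM training or the discriminant initialization they describe):

```python
import numpy as np

def fa_loglik(x, mu, lam, psi):
    """Gaussian log-likelihood with a factor-analyzed covariance:
    Sigma = lam @ lam.T + diag(psi), where lam is d x k (k << d) and psi > 0."""
    d = len(mu)
    sigma = lam @ lam.T + np.diag(psi)  # low-rank plus diagonal covariance
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(sigma, diff))

# With a zero loading matrix the model degenerates to a diagonal-covariance Gaussian.
ll_diag = fa_loglik(np.zeros(2), np.zeros(2), np.zeros((2, 1)), np.ones(2))
```

In practice the appeal is parameter count: a d x k loading matrix plus d diagonal terms, instead of d(d+1)/2 full-covariance terms per state.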
icassp06LID.pdf [ICASSP 2006, Toulouse, May 2006] --------- Discriminative Classifiers for Language Recognition Christopher White, Izhak Shafran and Jean-Luc Gauvain ------------------------ Most language recognition systems consist of a cascade of three stages: (1) tokenizers that produce parallel phone streams, (2) phonotactic models that score the match between each phone stream and the phonotactic constraints in the target language, and (3) a final stage that combines the scores from the parallel streams appropriately [1]. This paper reports a series of contrastive experiments to assess the impact of replacing the second and third stages with large-margin discriminative classifiers. In addition, it investigates how sounds that are not represented in the tokenizers of the first stage can be approximated with composite units that utilize cross-stream dependencies obtained via multi-string alignments. This leads to a discriminative framework that can potentially incorporate a richer set of features such as prosodic and lexical cues. Experiments are reported on the NIST LRE 1996 and 2003 tasks and the results show that the new techniques give substantial gains over a competitive PPRLM baseline. icassp06arabic.pdf [ICASSP 2006, Toulouse, May 2006] --------- Arabic Broadcast News Transcription Using a One Million Word Vocalized Vocabulary Abdel. Messaoudi, Jean-Luc Gauvain and Lori Lamel ------------------------ Recently it has been shown that modeling short vowels in Arabic can significantly improve performance even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vocalized, the training methods have to overcome this missing data problem. For the acoustic models the procedure was bootstrapped with manually vocalized data and extended with semi-automatically vocalized data.
In order to also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vocalized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vocalized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved. LIMSI_rt06s_asr.pdf [Lecture Notes in Computer Science, Vol. 4299 - Proc. 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2006) Rich Transcription 2006 Spring Meeting Recognition Evaluation Workshop, May 2006] --------- The LIMSI RT06s Lecture Transcription System L. Lamel, E. Bilinski, G. Adda, J.L. Gauvain, H. Schwenk ------------------------ This paper describes recent research carried out in the context of the FP6 Integrated Project CHIL in developing a system to automatically transcribe lectures and presentations. Widely available corpora were used to train both the acoustic and language models, since only a small amount of CHIL data was available for system development. Acoustic model training made use of the transcribed portion of the TED corpus of Eurospeech recordings, as well as the ICSI, ISL, and NIST meeting corpora. For language model training, text materials were extracted from a variety of on-line conference proceedings.
Experimental results are reported for close-talking and far-field microphones on development and evaluation data. rt06s_limsi_diarization.pdf [Lecture Notes in Computer Science, Proc. CLEAR'06 Evaluation Campaign and Workshop - Classification of Events, Activities and Relationships, May 2006] --------- Speaker Diarization: from Broadcast News to Lectures X. Zhu, C. Barras, L. Lamel, J.L. Gauvain ------------------------ This paper presents the LIMSI speaker diarization system for lecture data, in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation. This system builds upon the baseline diarization system designed for broadcast news data. The baseline system combines agglomerative clustering based on the Bayesian information criterion with a second clustering using state-of-the-art speaker identification techniques. In the RT-04F evaluation, the baseline system provided an overall diarization error of 8.5% on broadcast news data. However, since it had a high missed speech error rate on lecture data, a different speech activity detection approach based on the log-likelihood ratio between speech and non-speech models trained on the seminar data was explored. The new speaker diarization system integrating this module provides an overall diarization error of 20.2% on the RT-06S Multiple Distant Microphone (MDM) data. CansecoASRU05.pdf [IEEE ASRU 2005, Cancun, November 2005] --------- A Comparative Study Using Manual and Automatic Transcriptions for Diarization Leonardo Canseco, Lori Lamel, Jean-Luc Gauvain ------------------------ This paper describes recent studies on speaker diarization from automatic broadcast news transcripts. Linguistic information revealing the true names of who speaks during a broadcast (the next, the previous and the current speaker) is detected by means of linguistic patterns. In order to associate the true speaker names with the speech segments, a set of rules is defined for each pattern.
Since the effectiveness of linguistic patterns for diarization depends on the quality of the transcription, the performance using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions. On about 150 hours of broadcast news data (295 shows) the global ratio of false identity association is about 13% for both the automatic and the manual transcripts. DechelotteASRU05.pdf [IEEE ASRU 2005, Cancun, November 2005] --------- Investigating Translation of Parliament Speeches D. Dechelotte, H. Schwenk, J.-L. Gauvain, O. Galibert, and L. Lamel ------------------------ This paper reports on recent experiments for speech to text (STT) translation of European Parliamentary speeches. A Spanish speech to English text translation system has been built using data from the TC-STAR European project. The speech recognizer is a state-of-the-art multipass system trained for the Spanish EPPS task and the statistical translation system relies on the IBM-4 model. First, MT results are compared using manual transcriptions and 1-best ASR hypotheses with different word error rates. Then, an n-best interface between the ASR and MT components is investigated to improve the STT process. Derivation of the fundamental equation for machine translation suggests that the source language model is not necessary for STT. This was investigated by using weak source language models and by n-best rescoring adding the acoustic model score only. A significant loss in the BLEU score was observed, suggesting that the source language model is needed given the insufficiencies of the translation model. Adding the source language model score in the n-best rescoring process recovers the loss and slightly improves the BLEU score over the 1-best ASR hypothesis. The system achieves a BLEU score of 37.3 with an ASR word error rate of 10% and a BLEU score of 40.5 using the manual transcripts.
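The n-best rescoring experiments just described amount to re-ranking hypotheses under a weighted sum of per-model log scores, with the source language model either included or dropped. A hedged sketch of that selection step (the score names, values and weights below are invented for illustration, not the LIMSI system's):

```python
def rescore_nbest(nbest, weights):
    """Log-linear n-best rescoring: each hypothesis carries per-model log scores
    (e.g. acoustic, source LM, translation model); return the best weighted sum."""
    return max(nbest, key=lambda h: sum(w * h[k] for k, w in weights.items()))

# Toy 2-best list with made-up component log scores.
nbest = [
    {"text": "hypothesis A", "acoustic": -120.0, "src_lm": -35.0, "tm": -50.0},
    {"text": "hypothesis B", "acoustic": -118.0, "src_lm": -42.0, "tm": -49.0},
]

# With the source LM included, A wins (-187.5 vs -188.0).
best = rescore_nbest(nbest, {"acoustic": 1.0, "src_lm": 0.5, "tm": 1.0})
```

Setting the `src_lm` weight to zero can flip the ranking, which mirrors the abstract's finding that dropping the source language model hurts and restoring its score recovers the loss.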
IS051735.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Different Size Multilingual Phone Inventories and Context-Dependent Acoustic Models for Language Identification Dong Zhu, Martine Adda-Decker, Fabien Antoine ------------------------ Experimental work using phonotactic and syllabotactic approaches for automatic language identification (LID) is presented. Various questions have motivated this research: what is the best choice for a multilingual phone inventory? Can a syllabic unit be of interest to extend the scope of the modeling unit? Are context-dependent (CD) acoustic models, widely used for speech recognition, able to improve LID accuracy? Can the multilingual acoustic models efficiently process additional languages which are different from the training languages? The LID system is experimentally studied using different sizes of multilingual phone sets: 73, 50 and 35 phones. Experiments are carried out on broadcast news in seven languages (German, English, Arabic, Mandarin, Spanish, French, and Italian) with 140 hours of audio data for training and 7 hours for testing. It is shown that smaller phone inventories achieve higher LID accuracy and that CD models outperform CI models. Further experiments have been conducted to test the generality of both the multilingual acoustic model and phonotactic methods on another 11+10 language corpus (11 known + 10 unknown languages). IS052391.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Do Speech Recognizers Prefer Female Speakers? Martine Adda-Decker, Lori Lamel ------------------------ In this contribution we examine large speech corpora of prepared broadcast and spontaneous telephone speech in American English and in French. Starting with the question whether ASR systems behave differently on male and female speech, we then try to find evidence on acoustic-phonetic, lexical and idiomatic levels to explain the observed differences.
Recognition results have been analysed on 3-7 hours of speech in each language and speech type condition (totaling 20 hours). Results consistently show a lower word error rate on female speech, ranging from 0.7 to 7% depending on the condition. An analysis of automatically produced pronunciations in speech training corpora (totaling 4000 hours of speech) revealed that female speakers tend to stick more consistently to standard pronunciations than male speakers. Concerning speech disfluencies, male speakers show larger proportions of filled pauses and repetitions, as compared to females. IS051976.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Perceptual Salience of Language-Specific Acoustic Differences in Autonomous Fillers Across Eight Languages Ioana Vasilescu, Maria Candea, Martine Adda-Decker ------------------------ Are acoustic differences in autonomous fillers salient for human perception? Acoustic measurements have been carried out on autonomous fillers from eight languages (Arabic, Mandarin Chinese, French, German, Italian, European Portuguese, American English and Latin American Spanish). They exhibit timbre differences of the support vowel of autonomous fillers across languages. In order to evaluate their salience for human perception, two discrimination experiments have been conducted, French/L2 and Portuguese/L2. These experiments test the capacity to discriminate languages by listening to isolated autonomous fillers without any lexical support and without any context. As the listeners were native French speakers, the Portuguese/L2 experiments aim at evaluating a potential mother tongue bias. Results show that the perceptual discrimination performance depends on the language pair identity, and they are consistent with the acoustic measures. The present study aims at providing some insight for acoustic filler models used in automatic speech processing in a multilingual context.
IS052010.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Semantic Annotation of the French Media Dialog Corpus H. Bonneau-Maynard, S. Rosset, C. Ayache, A. Kuhn, D. Mostefa ------------------------ The French Technolangue MEDIA-EVALDA project aims to evaluate spoken understanding approaches. This paper describes the semantic annotation scheme of a common dialog corpus which will be used for developing and evaluating spoken understanding models and for linguistic studies. A common semantic representation has been formalized and agreed upon by the consortium. Each utterance is divided into semantic segments and each segment is annotated with a 5-tuple containing the mode, the attribute name representing the underlying concept, the normalized form of the attribute, a list of related segments, and an optional comment about the annotation. Periodic inter-annotator agreement studies demonstrate that the annotations are of good quality, with an agreement of almost 90% on mode and attribute identification. An analysis of the semantic content of 12292 annotated client utterances shows that only 14.1% of the observed attributes are domain-dependent and that the semantic dictionary ensures a good coverage of the task. IS052349.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Multi-Level Information and Automatic Dialog Acts Detection in Human-Human Spoken Dialogs Sophie Rosset, Delphine Tribout ------------------------ This paper reports on our experience in annotating and automatically detecting dialog acts in human-human spoken dialog. Our work is based on three hypotheses: first, the dialog act succession is strongly constrained; second, the initial word and the semantic class of a word are more important than the exact word in identifying the dialog act; third, information is encoded in specific entities. We also used historical information in order to account for the dialogical structure. A memory-based learning approach is used to detect dialog acts.
Experiments have been conducted using different kinds of information levels. In order to verify our hypotheses, the model trained on a French corpus was tested on an English corpus for a similar task and on a French corpus from a different domain. A correct dialog act detection rate of about 87% is obtained for the same domain/language condition and 80% for the cross-language and cross-domain conditions. IS052393.PDF [InterSpeech 2005, Lisbon, September 2005] --------- An Open-Domain, Human-Computer Dialog System Olivier Galibert, Gabriel Illouz, Sophie Rosset ------------------------ The Ritel project aims at integrating a spoken language dialog system and an open-domain question answering system to allow a human to ask general questions (``Who is currently presiding over the Senate?'') and refine the search interactively. At this point in time the Ritel platform is being used to collect a human-computer dialog corpus. The user can receive factual answers to some questions (Q: Who is the president of France? A: Jacques Chirac has been the president of France since May 1995). This paper briefly presents the current system, the collected corpus, the problems encountered by such a system and our first answers to these problems. When the system is more advanced, it will allow measuring the benefit of integrating a dialog system into a QA system. Does allowing such a dialog really enable the user to reach the ``right'' answer faster and more precisely? IS052175.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Detection of Real-Life Emotions in Call Centers Laurence Vidrascu, Laurence Devillers ------------------------ In contrast to most previous studies, conducted on artificial data with archetypal emotions, this paper addresses some of the challenges faced when studying real-life non-basic emotions. A 10-hour dialog corpus recorded in a French medical emergency call center has been studied.
In this paper, our aim is twofold: to study emotion mixtures in natural data and to achieve a high level of performance in real-life emotion detection. An annotation scheme using two labels per segment has been used for representing emotion mixtures. A closer study of these mixtures has been carried out, revealing the presence of emotions with conflicting valence. A correct detection rate of about 82% was obtained between Negative and Positive emotions using paralinguistic cues, without taking into account these conflicting blended emotions. A perceptual study and an analysis of the confusions obtained with automatic detection validate the annotation scheme with two emotion labels per segment. IS052018.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Multimodal Databases of Everyday Emotion: Facing up to Complexity Ellen Douglas-Cowie, Laurence Devillers, Jean-Claude Martin, Roddy Cowie, Suzie Savvidou, Sarkis Abrilian, Cate Cox ----------------------- In everyday life, speech is part of a multichannel system involved in conveying emotion. Understanding how it operates in that context requires suitable data, consisting of multimodal records of emotion drawn from everyday life. This paper reflects the experience of two teams active in collecting and labelling data of this type. It sets out the core reasons for pursuing a multimodal approach, reviews issues and problems for developing relevant databases, and indicates how we can move forward both in terms of data collection and approaches to labelling. IS051589.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Transcribing Lectures and Seminars Lori Lamel, Gilles Adda, Eric Bilinski, Jean-Luc Gauvain ----------------------- This paper describes recent research carried out in the context of the FP6 Integrated Project CHIL in developing a system to automatically transcribe lectures and seminars.
We made use of widely available corpora to train both the acoustic and language models, since only a small amount of CHIL data was available for system development. For acoustic model training, we made use of the transcribed portion of the TED corpus of Eurospeech recordings, as well as the ICSI, ISL, and NIST meeting corpora. For language model training, text materials were extracted from a variety of on-line conference proceedings. Word error rates of about 25% are obtained on test data extracted from 12 seminars. IS052002.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Diachronic Vocabulary Adaptation for Broadcast News Transcription Alexandre Allauzen, Jean-Luc Gauvain ----------------------- This article investigates the use of Internet news sources to automatically adapt the vocabulary of a French and an English broadcast news transcription system. A specific method is developed to gather training, development and test corpora from selected websites, normalizing them for further use. A vectorial vocabulary adaptation algorithm is described which interpolates word frequencies estimated on adaptation corpora to directly maximize lexical coverage on a development corpus. To test the generality of this approach, experiments were carried out simultaneously in French and in English (UK) on a daily basis for the month of May 2004. In both languages, the OOV rate is reduced by more than half. IS052366.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Building Continuous Space Language Models for Transcribing European Languages Holger Schwenk, Jean-Luc Gauvain ----------------------- Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor in this success is the availability of large amounts of acoustic and language model training data.
In this paper the recognition of French Broadcast News and of English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, thereby allowing smooth interpolation. Word error reductions of up to 0.9% absolute are reported with respect to a carefully tuned backoff language model trained on the same data. IS051895.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Impact of Duration on F1/F2 Formant Values of Oral Vowels: An Automatic Analysis of Large Broadcast News Corpora in French and German Cedric Gendrot, Martine Adda-Decker ----------------------- Formant values of oral vowels are automatically measured in a total of 50,000 segments from four hours of journalistic broadcast speech in French and German. After automatic segmentation using the LIMSI speech alignment system, formant values are automatically extracted using PRAAT. Vowels with unlikely formant values are discarded (4%). The measured values are analyzed with respect to segment duration and language identity. The gap between the measured mean Fi values and the reference Fi values is inversely proportional to vowel duration: a tendency toward reduction for vowels of short duration clearly emerges in both languages. These results are briefly discussed in terms of centralisation and coarticulation. IS051821.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Combining Speaker Identification and BIC for Speaker Diarization Xuan Zhu, Claude Barras, Sylvain Meignier, Jean-Luc Gauvain ----------------------- This paper describes recent advances in speaker diarization obtained by incorporating a speaker identification step. This system builds upon the LIMSI baseline data partitioner used in the broadcast news transcription system.
This partitioner provides high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. Several improvements to the baseline system have been made. Firstly, a standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering. Then a second clustering stage has been added, using a speaker identification method with MAP-adapted GMMs. A final post-processing stage refines the segment boundaries using the output of the transcription system. On the RT-04F and ESTER evaluation data, the improved multi-stage system provides between 40% and 50% reduction of the speaker error, relative to a standard BIC clustering system. IS052392.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Where Are We in Transcribing French Broadcast News? Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Alexandre Allauzen, Veronique Gendner, Lori Lamel, Holger Schwenk ----------------------- Given the highly inflectional nature of the French language, transcribing French broadcast news (BN) is more challenging than English BN. This is in part due to the large number of homophones among the inflected forms. This paper describes advances in automatic processing of broadcast news speech in French based on recent improvements to the LIMSI English system. The main differences between the English and French BN systems are: a 200k vocabulary to overcome the lower lexical coverage in French (including contextual pronunciations to model liaisons), a case-sensitive language model, and the use of a POS-based language model to lower the impact of homophonic gender and number disagreement. The resulting system was evaluated in the first French TECHNOLANGUE-ESTER ASR benchmark test. This system achieved the lowest word error rate in this evaluation by a significant margin. We also report on a 1xRT version of this system.
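The BIC agglomerative clustering used in the diarization work above decides whether to merge two clusters by comparing single full-covariance Gaussian models. As an illustration only, here is a minimal sketch of the standard local ΔBIC merge criterion; the function names and the penalty weight `lam` are assumptions for this sketch, not details taken from the papers:

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Local BIC criterion for two clusters of feature frames (n x d arrays).

    Each cluster is modeled by a single Gaussian with full covariance;
    a positive delta favors keeping the clusters separate.
    """
    n1, d = x1.shape
    n2, _ = x2.shape
    pooled = np.vstack([x1, x2])

    def logdet_cov(frames):
        # Log-determinant of the sample covariance matrix
        _, logdet = np.linalg.slogdet(np.cov(frames, rowvar=False))
        return logdet

    # Log-likelihood gain from modeling the two clusters separately
    gain = 0.5 * ((n1 + n2) * logdet_cov(pooled)
                  - n1 * logdet_cov(x1) - n2 * logdet_cov(x2))
    # Complexity penalty: extra parameters of one full-covariance Gaussian
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return gain - penalty  # merge the clusters when delta_bic(...) <= 0
```

In an agglomerative loop, the pair with the lowest ΔBIC is merged repeatedly until no pair has a value below zero.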
IS051574.PDF [InterSpeech 2005, Lisbon, September 2005] --------- The 2004 BBN/LIMSI 20xRT English Conversational Telephone Speech Recognition System R. Prasad, S. Matsoukas, C.-L. Kao, J.Z. Ma, D.-X. Xu, T. Colthurst, O. Kimball, R. Schwartz, J.L. Gauvain, L. Lamel, H. Schwenk, G. Adda, F. Lefevre ----------------------- In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on a Pentium 4 Xeon 3.4 GHz processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites. IS051588.PDF [InterSpeech 2005, Lisbon, September 2005] --------- Modeling Vowels for Arabic BN Transcription Abdel Messaoudi, Lori Lamel, Jean-Luc Gauvain ----------------------- This paper presents a method for improving acoustic model precision by incorporating wide phonetic context units in speech recognition. The wide phonetic context model is constructed from several narrower context-dependent models within a Bayesian framework. Such a composition is performed in order to avoid the crucial problem of limited availability of training data and to reduce the model complexity. To enhance model reliability in the face of unseen contexts and limited training data, flooring and deleted interpolation techniques are used. Experimental results show that this method improves word accuracy with respect to the standard triphone model.
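The abstract above mentions flooring and deleted interpolation as remedies for unseen contexts and sparse training data. A toy sketch of both ideas follows; the function names, the count-based weight, and the constant `k` are illustrative assumptions, not the paper's actual formulation:

```python
def deleted_interpolation(cd_prob, ci_prob, count, k=10.0):
    """Smooth a context-dependent estimate with a context-independent backoff.

    The interpolation weight grows with the amount of training data behind
    the context-dependent estimate; k (illustrative) controls how fast.
    """
    lam = count / (count + k)
    return lam * cd_prob + (1 - lam) * ci_prob

def floor_probs(probs, floor=1e-4):
    """Flooring: keep every probability above a small minimum, then renormalize."""
    floored = [max(p, floor) for p in probs]
    total = sum(floored)
    return [p / total for p in floored]
```

With zero observations the smoothed estimate falls back entirely to the context-independent probability, and with abundant data it approaches the context-dependent one.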
Allauzen_ICASSP05.pdf [ICASSP 2005, Philadelphia, March 2005] --------- Open Vocabulary ASR for Audiovisual Document Indexation ----------------------- This paper reports on an investigation of an open vocabulary recognizer that allows new words to be introduced into the recognition vocabulary without the need to retrain or adapt the language model. This method uses special word classes, whose n-gram probabilities are estimated during the training process by discounting a mass of probability from the out-of-vocabulary words. A part-of-speech tagger is used to determine the word classes during language model training and for vocabulary adaptation. Metadata information provided by the French audiovisual archive institute is used to identify important document-specific missing words, which are added to the appropriate word class in the system vocabulary. Pronunciations for the new words are derived by grapheme-to-phoneme conversion. On over 3 hours of broadcast news data, this approach leads to a reduction of 0.35% in the OOV rate and of 0.6% in the word error rate, with 80% of the occurrences of the newly introduced words being correctly recognized. Lamel_ICASSP05.pdf [ICASSP 2005, Philadelphia, March 2005] --------- Alternate Phone Models for Conversational Speech Lori Lamel and Jean-Luc Gauvain ----------------------- This paper investigates the use of alternate phone models for the transcription of conversational telephone speech. The challenges of transcribing conversational speech are many, and recent research has focused mainly on the acoustic and linguistic levels. The focus of this work is to explore alternative ways of modeling different manners of speaking so as to better cover the observed articulatory styles and pronunciation variants. Four alternate phone sets are compared, ranging from 38 to 129 units. Two of the phone sets make use of syllable-position dependent phone models.
The acoustic models were trained on 2300 hours of conversational telephone speech data from the Switchboard and Fisher corpora, and experimental results are reported on the EARS Dev04 test set, which contains 3 hours of speech from 36 Fisher conversations. While no one particular phone set was found to outperform the others for a majority of speakers, the best overall performance was obtained with the original 48 phone set and a reduced 38 phone set; however, combining the hypotheses of the individual models reduces the word error rate from 17.5% (original phone set) to 16.8%. rt04f_diarization.pdf [RT'04, Palisades, November 2004] --------- Improving Speaker Diarization Claude Barras, Xuan Zhu, Sylvain Meignier, and Jean-Luc Gauvain ----------------------- This paper describes the LIMSI speaker diarization system used in the RT-04F evaluation. The RT-04F system builds upon the LIMSI baseline data partitioner, which is used in the broadcast news transcription system. This partitioner provides high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. In the RT-03S evaluation the baseline partitioner had a 24.5% diarization error rate. Several improvements to the baseline diarization system have been made. A standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering; a local BIC criterion is used for comparing single Gaussians with full covariance matrices. A second clustering stage has been added, making use of a speaker identification method: maximum a posteriori adaptation of a reference GMM with 128 Gaussians. A final post-processing stage refines the segment boundaries using the output of the transcription system. Compared to the best-configuration baseline system for this task, the improved system reduces the speaker error time by over 75% on the development data.
On evaluation data, an 8.5% overall diarization error rate was obtained, a 60% reduction in error compared to the baseline. rt04f_limsi_lm.pdf [RT'04, Palisades, November 2004] --------- Using neural network language models for LVCSR Holger Schwenk and Jean-Luc Gauvain ----------------------- In this paper we describe how we used a neural network language model for the BN and CTS tasks in the RT04 evaluation. This new language modeling approach performs the estimation of the n-gram probabilities in a continuous space, thereby allowing probability interpolation for any context, without the need to back off to shorter contexts. Details are given on training data selection, parameter estimation, and fast training and decoding algorithms. The neural network language model achieved word error reductions of 0.5% for the CTS task and of 0.3% for the BN task with an additional decoding cost of 0.05xRT. rt04f_bbn_limsi_eng_bn10xrt.pdf [RT'04, Palisades, November 2004] --------- The 2004 BBN/LIMSI 10xRT English Broadcast News Transcription System Long Nguyen, Sherif Abdou, Mohamed Afify, John Makhoul, Spyros Matsoukas, Richard Schwartz, Bing Xiang, Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Holger Schwenk, Fabrice Lefevre ----------------------- This paper describes the 2004 BBN/LIMSI 10xRT English Broadcast News (BN) transcription system, which uses a tightly integrated combination of components from the BBN and LIMSI speech recognition systems. The integrated system uses both cross-site adaptation and system combination via ROVER, obtaining a word hypothesis that is better than that produced by either system alone, while remaining within the allotted time limit. The system configuration used for the evaluation has two components from each site and two ROVER combinations, and achieved a word error rate (WER) of 13.9% on the Dev04f set and 9.3% on the Dev04 set selected to match the progress set.
Compared to last year's system, this represents a relative WER reduction of around 30%. rt04f_bbn_limsi_eng_cts20xrt.pdf [RT'04, Palisades, November 2004] --------- The 2004 BBN/LIMSI 20xRT English Conversational Telephone Speech System R. Prasad, S. Matsoukas, C.-L. Kao, J. Ma, D.-X. Xu, T. Colthurst, G. Thattai, O. Kimball, R. Schwartz, J.-L. Gauvain, L. Lamel, H. Schwenk, G. Adda, and F. Lefevre ----------------------- In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on a Pentium 4 Xeon 3.4 GHz processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites. rt04f_limsi_altphone.pdf [RT'04, Palisades, November 2004] --------- Alternate phone models for CTS Lori Lamel and Jean-Luc Gauvain ----------------------- This paper investigates the use of alternate phone models for the transcription of conversational telephone speech. The challenges of transcribing conversational speech are many, and recent research has focused mainly on the acoustic and linguistic levels. The focus of this work is to explore alternative ways of modeling different manners of speaking so as to better cover the observed articulatory styles and pronunciation variants. Four alternate phone sets are compared, ranging from 38 to 129 units. Two of the phone sets make use of syllable-position dependent phone models.
The acoustic models were trained on 2300 hours of conversational telephone speech data from the Switchboard and Fisher corpora, and experimental results are reported on the EARS Dev04 test set, which contains 3 hours of speech from 36 Fisher conversations. While no one particular phone set was found to outperform the others for a majority of speakers, the best overall performance was obtained with the original 48 phone set and a reduced 38 phone set; however, combining the hypotheses of the individual models reduces the word error rate from 17.5% (original phone set) to 16.8%. rt04f_limsi_sttspkr.pdf [RT'04, Palisades, November 2004] --------- Towards using STT for Broadcast News Speaker Diarization Leonardo Canseco-Rodriguez, Lori Lamel, and Jean-Luc Gauvain ----------------------- The aim of this study is to investigate the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments. While speaker recognition has been an active area of research for many years, addressing the problem of identifying speakers in huge audio corpora is relatively recent and has mainly been concerned with speaker tracking. The speech transcriptions contain a wealth of linguistic information that is useful for speaker diarization. Patterns which can be used to identify the current, previous or next speaker have been developed based on the analysis of 150 hours of manually transcribed broadcast news data. Each pattern is associated with one or more rules to assign speaker identities. After validation on the training transcripts, these patterns and rules were tested on an independent data set containing transcripts of 9 hours of broadcasts, and a speaker diarization error rate of about 11% was obtained. Future work will validate the approach on automatically generated transcripts and also combine the linguistic information with information derived from the acoustic level.
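The transcript-based diarization work above associates patterns in the transcripts (naming the current, previous or next speaker) with rules that assign speaker identities. A minimal sketch of the idea follows; the specific regular expressions and rule set here are invented examples, not the patterns actually derived from the 150 hours of broadcast news:

```python
import re

# Hypothetical linguistic patterns; the real ones were learned from
# manually transcribed broadcast news data.
NEXT_SPEAKER_PATTERNS = [
    re.compile(r"over to (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) reports"),
]
SELF_PATTERNS = [
    re.compile(r"I'm (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
]

def label_segments(transcripts):
    """Assign speaker identities to transcript segments via pattern rules."""
    labels = [None] * len(transcripts)
    for i, text in enumerate(transcripts):
        for pat in SELF_PATTERNS:
            m = pat.search(text)
            if m:
                labels[i] = m.group("name")  # rule: names the current speaker
        for pat in NEXT_SPEAKER_PATTERNS:
            m = pat.search(text)
            if m and i + 1 < len(transcripts):
                labels[i + 1] = m.group("name")  # rule: names the next speaker
    return labels
```

For example, a segment ending "Now over to John Smith at the scene." would label the following segment with the identity "John Smith".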
odyssey04_barras.pdf [Odyssey'04, Toledo, May-June 2004] --------- Unsupervised Online Adaptation for Speaker Verification over the Telephone Claude Barras, Sylvain Meignier and Jean-Luc Gauvain ----------------------- This paper presents experiments on unsupervised adaptation for a speaker detection system. The system used is a standard speaker verification system based on cepstral features and Gaussian mixture models. Experiments were performed on cellular speech data taken from the NIST 2002 speaker detection evaluation. There was a total of about 30,000 trials involving 330 target speakers, with more than 90% of them impostor trials. Unsupervised adaptation significantly increases the system accuracy, with a reduction of the minimal detection cost function (DCF) from 0.33 for the baseline system to 0.25 with unsupervised online adaptation. Two incremental adaptation modes were tested: either using a fixed decision threshold for adaptation, or using the a posteriori probability of the true target to weight the adaptation. Both methods provide similar results in their best configurations, but the latter is less sensitive to the actual threshold value. lrec04_barras.pdf [LREC'04, Lisbon, May 2004] --------- Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts Claude Barras, Gilles Adda, Martine Adda-Decker, Benoit Habert, Philippe Boula de Mareuil and Patrick Paroubek ----------------------- The present study focuses on the automatic processing of sibling resources of audio and written documents, such as are available in audio archives or for parliamentary debates: written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material and they yield low-cost resources for acoustic model training.
When automatically transcribing the audio data, regions of agreement between automatic transcripts and written sources allow time-codes to be transferred to the written documents: this may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus was then transcribed using the LIMSI speech recognizer, resulting in automatic transcripts exhibiting an average word error rate of 12%. 80% of the text corpus (with word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data. The residual word error rate on these 80% is less than 1%. lrec04_barras.pdf [LREC'04, Lisbon, May 2004] --------- Reliability of Lexical and Prosodic Cues in two Real-life Spoken Dialog Corpora Laurence Devillers and Laurence Vidrascu ----------------------- The present research focuses on analyzing and detecting emotions in speech as revealed by task-dependent spoken dialog corpora. Previously, we conducted several experiments on a real-life corpus in order to develop a reliable annotation method and to detect lexical and prosodic cues correlated with the main emotion classes. In this paper we evaluate the robustness of both the annotation scheme and the lexical and prosodic cues by testing them on a new corpus. This work is carried out in the context of the Amities project, in which spoken dialog systems for call center services are being developed. icassp04_chen.pdf [ICASSP'04, Montreal, May 2004] --------- Lightly supervised acoustic model training using consensus networks Langzhou Chen, Lori Lamel and Jean-Luc Gauvain ----------------------- This paper presents some recent work on using consensus networks to improve lightly supervised acoustic model training for the LIMSI Mandarin BN system.
Lightly supervised acoustic model training has been attracting growing interest, since it can help to substantially reduce the development costs for speech recognition systems. Compared to supervised training with accurate transcriptions, the key problem in lightly supervised training is getting the approximate transcripts to be as close as possible to manually produced detailed ones, i.e. finding a proper way to provide the information for supervision. Previous work using a language model to provide supervision has been quite successful. This paper extends the original method, presenting a new way to get the information needed for supervision during training. Studies are carried out using the TDT4 Mandarin audio corpus and associated closed-captions. After automatically recognizing the training data, the closed-captions are aligned with a consensus network derived from the hypothesized lattices. As is the case with closed-caption filtering, this method can remove speech segments whose automatic transcripts contain errors, but it can also recover errors in the hypothesis if the information is present in the lattice. Experimental results show that, compared with simply training on all of the data, consensus-network-based lightly supervised acoustic model training results in a small reduction in the character error rate on the DARPA/NIST RT 03 development and evaluation data. icassp04_jlg.pdf [ICASSP'04, Montreal, May 2004] --------- Speech recognition in multiple languages and domains: The 2003 BBN/LIMSI EARS system Richard Schwartz, Thomas Colthurst, Nicolae Duta, Herb Gish, Rukmini Iyer, Chia-Lin Kao, Daben Liu, Owen Kimball, J. Ma, John Makhoul, Spyros Matsoukas, Long Nguyen, Mohamed Noamany, Rohit Prasad, Bing Xiang, Dongxin Xu, Jean-Luc Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda and Langzhou Chen ----------------------- We report on the results of the first evaluations of the BBN/LIMSI system under the new DARPA EARS Program.
The evaluations were carried out for conversational telephone speech (CTS) and broadcast news (BN) in three languages: English, Mandarin, and Arabic. In addition to providing system descriptions and evaluation results, the paper highlights methods that worked well across the two domains and the few that worked well in one domain but not the other. For the BN evaluations, which had to be run in under 10 times real-time, we demonstrated that a joint BBN/LIMSI system under that time constraint achieved better results than either system alone. icassp04_ll.pdf [ICASSP'04, Montreal, May 2004] --------- Speech Transcription in Multiple Languages Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Leonard Canseco, Langzhou Chen, Olivier Galibert, Abdel Messaoudi and Holger Schwenk ----------------------- This paper summarizes recent work underway at LIMSI on speech-to-text transcription in multiple languages. The research has been oriented towards the processing of broadcast audio and conversational speech for information access. Broadcast news transcription systems have been developed for seven languages, and it is planned to address several other languages in the near term. Research on conversational speech has mainly focused on the English language, with initial work on French, Arabic and Spanish. Automatic processing must take into account the characteristics of the audio data, such as the need to deal with a continuous data stream, the specificities of the language, and the use of an imperfect word transcription for accessing the information content. Our experience thus far indicates that at today's word error rates, the techniques used in one language can be successfully ported to other languages, and most of the language specificities concern lexical and pronunciation modeling.
ThC3202o.5_p1215.pdf [ICSLP'04, Jeju Island, October 2004] --------- Neural Network Language Models for Conversational Speech Recognition Holger Schwenk and Jean-Luc Gauvain ----------------------- Recently there has been growing interest in using neural networks for language modeling. In contrast to the well-known backoff n-gram language models (LM), the neural network approach tries to limit problems arising from data sparseness by performing the estimation in a continuous space, thereby allowing smooth interpolation. This type of LM is therefore interesting for tasks for which only a very limited amount of in-domain training data is available, such as the modeling of conversational speech. In this paper we analyze the generalization behavior of the neural network LM for in-domain training corpora varying from 7M to over 21M words. In all cases, significant word error reductions were observed compared to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the NIST rich transcription evaluations. We also apply ensemble learning methods and discuss their connections with LM interpolation. TuA2301o.1_p1283.pdf [ICSLP'04, Jeju Island, October 2004] --------- Language Recognition Using Phone Lattices Jean-Luc Gauvain, Abdel Messaoudi and Holger Schwenk ----------------------- This paper proposes a new phone-lattice-based method for automatic language recognition from speech data. By using phone lattices, some approximations usually made by language identification (LID) systems relying on phonotactic constraints to simplify the training and decoding processes can be avoided. We demonstrate that using phone lattices in both training and testing significantly improves the accuracy of a phonotactically based LID system. Performance is further enhanced by using a neural network to combine the results of multiple phone recognizers.
Using three phone recognizers with context-independent phone models, the system achieves an equal error rate of 2.7% on the Eval03 NIST detection test (30s segments, primary condition) with an overall decoding process that runs faster than real-time (0.5xRT). TuB401o.2_p540.pdf [ICSLP'04, Jeju Island, October 2004] --------- Automatic Detection of Dialog Acts Based on Multi-level Information Sophie Rosset and Lori Lamel ----------------------- Recently there has been growing interest in using dialog acts to characterize human-human and human-machine dialogs. This paper reports on our experience in the annotation and automatic detection of dialog acts in human-human spoken dialog corpora. Our work is based on two hypotheses: first, word position is more important than the exact word in identifying the dialog act; and second, there is a strong grammar constraining the sequence of dialog acts. A memory-based learning approach has been used to detect dialog acts. In a first set of experiments the number of utterances per turn is known, and in a second set, the number of utterances is hypothesized using a language model for utterance boundary detection. In order to verify our first hypothesis, the model trained on a French corpus was tested on a corpus for a similar task in English and on a second French corpus from a different domain. A correct dialog act detection rate of about 84% is obtained for the same domain and language condition and about 75% for the cross-language and cross-domain conditions. TuC2105o.3_p1272.pdf [ICSLP'04, Jeju Island, October 2004] --------- Speaker Diarization from Speech Transcripts Leonardo Canseco-Rodriguez, Lori Lamel and Jean-Luc Gauvain ----------------------- The aim of this study is to investigate the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments.
While speaker recognition has been an active area of research for many years, addressing the problem of identifying speakers in huge audio corpora is relatively recent and has mainly been concerned with speaker tracking. The speech transcriptions contain a wealth of linguistic information that is useful for speaker diarization. Patterns which can be used to identify the current, previous or next speaker have been developed based on the analysis of 150 hours of manually transcribed broadcast news data. Each pattern is associated with one or more rules. After validation on the training transcripts, these patterns and rules were tested on an independent data set containing transcripts of 10 hours of broadcasts. WeA3203p.2_p1281.pdf [ICSLP'04, Jeju Island, October 2004] --------- Dynamic Language Modeling for Broadcast News Langzhou Chen, Jean-Luc Gauvain, Lori Lamel and Gilles Adda ----------------------- This paper describes some recent experiments on unsupervised language model adaptation for the transcription of broadcast news data. In previous work, a framework for automatically selecting adaptation data using information retrieval techniques was proposed. This work extends the method and presents experimental results with unsupervised language model adaptation. Three primary aspects are considered: (1) the performance of 5 widely used LM adaptation methods using the same adaptation data is compared; (2) the influence of the temporal distance between the training and test data epochs on the adaptation efficiency is assessed; and (3) show-based language model adaptation is compared with story-based language model adaptation. Experiments have been carried out for broadcast news transcription in English and Mandarin Chinese. A relative word error rate reduction of 4.7% was obtained in English and a 5.6% relative character error rate reduction in Mandarin with story-based MDI adaptation.
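Among the widely used LM adaptation methods compared in the abstract above, the simplest is linear interpolation of a background model with a model estimated on the adaptation data, with the weight tuned on development text. A toy unigram sketch follows (the real systems use n-gram models, and MDI instead reweights the background model multiplicatively; all names and constants here are illustrative assumptions):

```python
import math
from collections import Counter

def unigram_probs(words, vocab, alpha=0.5):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(words)
    total = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(background, adapted, lam):
    """Static linear interpolation of two word-probability models."""
    return {w: lam * adapted[w] + (1 - lam) * background[w] for w in background}

def perplexity(model, words):
    return math.exp(-sum(math.log(model[w]) for w in words) / len(words))

def tune_lambda(background, adapted, dev_words,
                grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the interpolation weight minimizing development-set perplexity."""
    return min(grid,
               key=lambda lam: perplexity(interpolate(background, adapted, lam),
                                          dev_words))
```

When the development text matches the adaptation data, the tuned weight shifts toward the adapted model; when it matches the background training epoch, it shifts back.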
ThA2001o.2_p521.pdf [ICSLP'04, Jeju Island, October 2004] --------- Transcription of Arabic Broadcast News Abdel Messaoudi, Lori Lamel and Jean-Luc Gauvain ----------------------- This paper describes recent research on transcribing Modern Standard Arabic broadcast news data. The Arabic language presents a number of challenges for speech recognition, arising in part from the significant differences between the spoken and written forms, in particular that texts are conventionally non-vowelized. Arabic is a highly inflected language where articles and affixes are added to roots in order to change the word's meaning. A corpus of 50 hours of audio data from 7 television and radio sources and 200 M words of newspaper texts were used to train the acoustic and language models. The transcription system based on these models and a vowelized dictionary obtains an average word error rate of about 18% on a test set comprised of 12 hours of data from 8 sources. ThC2701o.5_p1105.pdf [ICSLP'04, Jeju Island, October 2004] --------- Fiction Database for Emotion Detection in Abnormal Situations Chloe Clavel and Ioan Vasilescu and Laurence Devillers and Thibault Ehrette ----------------------- The present research focuses on the acquisition and annotation of vocal resources for emotion detection. We are interested in detecting emotions occurring in abnormal situations and particularly in detecting fear. The present study considers a preliminary database of audiovisual sequences extracted from fiction films. The selected sequences provide various manifestations of the target emotions and are described with a multimodal annotation tool. We focus on audio cues in the annotation strategy and use the video as support for validating the audio labels. The present article describes the methodology of data acquisition and annotation. The annotation is validated via two perceptual paradigms in which the +/-video condition in stimuli presentation varies.
We show the perceptual significance of the audio cues and the presence of target emotions. asru03ma.pdf [IEEE ASRU 2003, St Thomas, November 2003] --------- Transcribing Mandarin Broadcast News Langzhou Chen, Lori Lamel and Jean-Luc Gauvain ----------------------- This paper describes improvements to the LIMSI broadcast news transcription system for the Mandarin language in preparation for the DARPA/NIST Rich Transcription 2003 (RT'03) evaluation. The transcription system has been substantially updated to deal with the varied acoustic and linguistic characteristics of the RT'03 test conditions. The major improvements come from the use of lightly supervised acoustic model training in order to benefit from unannotated audio data, the use of source specific language models, and MDI adaptation to tune the language models for sources with limited amounts of training data. The character error rate on the development data has been reduced from 34.5% with the baseline system to 22.6% with the evaluation system. ES031038.PDF [ISCA Eurospeech'03, Geneva, September 2003] --------- The 300k LIMSI German Broadcast News Transcription System Kevin McTait and Martine Adda-Decker ----------------------- This paper describes important improvements to the existing LIMSI German broadcast news transcription system, especially its extension from a 65k vocabulary to 300k words. Automatic speech recognition for German is more problematic than for a language such as English in that the inflectional morphology of German and its highly generative process of compounding lead to many more out of vocabulary words for a given vocabulary size. Experiments undertaken to tackle this problem and reduce the transcription error rate include bringing the language models up to date, improved pronunciation models, semi-automatically constructed pronunciation lexicons and increasing the size of the system's vocabulary. 
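The out-of-vocabulary problem that motivates the 300k vocabulary extension above is straightforward to quantify: count the running words of a corpus not covered by a given word list. A minimal sketch, with an invented mini-corpus standing in for the German data (note how the compound "hausboot" is OOV until it is added to the word list):

```python
def oov_rate(corpus_tokens, word_list):
    """Fraction of running words in the corpus not covered by the word list."""
    vocab = set(word_list)
    misses = sum(1 for w in corpus_tokens if w not in vocab)
    return misses / len(corpus_tokens)

corpus = ["das", "haus", "hausboot", "boot", "hausboot", "tür"]
small_vocab = ["das", "haus", "boot", "tür"]
large_vocab = small_vocab + ["hausboot"]

print(oov_rate(corpus, small_vocab))  # compounds like "hausboot" are OOV
print(oov_rate(corpus, large_vocab))  # enlarging the word list reduces the OOV rate
```

This is the metric behind the paper's observation that German's compounding yields many more OOV words than English for a given vocabulary size.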
ES030250.PDF [ISCA Eurospeech'03, Geneva, September 2003] --------- A corpus-based decompounding algorithm for German lexical modeling in LVCSR Martine Adda-Decker ----------------------- In this paper a corpus-based decompounding algorithm is described and applied to German LVCSR. The decompounding algorithm addresses two major problems for LVCSR: lexical coverage and letter-to-sound conversion. The idea of the algorithm is simple: given a word start of length k, only a few different characters can continue an admissible word in the language; but if the word start of length k reaches a constituent word boundary of a compound, the set of successor characters can in principle include any character. The algorithm has been applied to a 300M word corpus with 2.6M distinct words. 800k decomposition rules have been extracted automatically. OOV (out of vocabulary) word reductions of 25% to 50% relative have been achieved using word lists from 65k to 600k words. Pronunciation dictionaries have been developed for the LIMSI 300k German recognition system. As no language-specific knowledge is required beyond the text corpus, the algorithm can be applied more generally to any compounding language. diss03_disfluencies.ps.Z [ISCA DiSS '03 - Disfluency in Spontaneous Speech Gothenburg University, Sweden, September 5-8, 2003] --------- A disfluency study for cleaning spontaneous speech automatic transcripts and improving speech language models M. Adda-Decker, B. Habert, C. Barras, G. Adda, Ph. Boula de Mareuil, P. Paroubek ----------------------- The aim of this study is to develop a model of disfluent speech by comparing different types of audio transcripts. The study makes use of 10 hours of French radio interview archives, involving journalists and personalities from political or civil society. A first type of transcript is press-oriented, where most disfluencies are discarded.
For 10% of the corpus, we produced exact audio transcripts: all audible phenomena and overlapping speech segments are transcribed manually. In these transcripts about 14% of the words correspond to disfluencies and discourse markers. The audio corpus was then transcribed using the LIMSI speech recognizer. Disfluency words, which make up 8% of the corpus, account for 12% of the overall error rate. This shows that disfluencies have no major effect on neighboring speech segments. Restarts are the most error prone, with a 36.9% within-class error rate. euro03HBMSR.ps.Z [ISCA Eurospeech'03, Geneva, September 2003] --------- A Semantic representation for spoken dialogs H. Bonneau-Maynard and S. Rosset ----------------------- This paper describes a semantic annotation scheme for spoken dialog corpora. Manual semantic annotation of large corpora is tedious, expensive, and subject to inconsistencies. Consistency is necessary to increase the usefulness of a corpus for developing and evaluating spoken understanding models and for linguistic studies. A semantic representation, based on the definition of a concept dictionary, has been formalized and is described. Each utterance is divided into semantic segments, and each segment is assigned a 5-tuple containing a mode, the underlying concept, the normalized form of the concept, the list of related segments, and an optional comment about the annotation. Based on this scheme, a tool was developed which ensures that the provided annotations respect the semantic representation. The tool includes interfaces for both the formal definition of the hierarchical concept dictionary and the annotation process. An experiment was conducted to assess inter-annotator agreement using both a human-human dialog corpus and a human-machine dialog corpus. For human-human dialogs, the agreement rate, computed on the triplets (mode, concept, value), is 61%, and the agreement rate on the concepts alone is 74%.
For the human-machine dialogs, the percentage of agreement on the triplet is 83% and the correct concept identification rate is 93%. AnnotationHH032803.pdf [ISCA Eurospeech'03, Geneva, September 2003] --------- Semantic and Dialogic Annotation for Automated Multilingual Customer Service Hilda Hardy, Kirk Baker, Helene Bonneau-Maynard, Laurence Devillers, Sophie Rosset, Tomek Strzalkowski ----------------------- One central goal of the AMITIES multilingual human-computer dialogue project is to create a dialogue management system capable of engaging the user in human-like conversation in a specific domain. To that end, we have developed new methods for the manual annotation of spoken dialogue transcriptions from European financial call centers. We have modified the DAMSL dialogic schema to create a dialogue act taxonomy appropriate for customer services. To capture the semantics, we use a domain-independent framework populated with domain-specific lists. We have designed a new flexible, platform-independent annotation tool, XDMLTool, and annotated several hundred dialogues in French and English. Inter-annotator agreement was moderate to high. We are using these data to design our dialogue system, and we hope that they will help us to derive appropriate dialogue strategies for novel situations. 1025anav.pdf [ISCA Eurospeech'03, Geneva, September 2003] --------- Prosodic cues for emotion characterization in real-life spoken dialogs Laurence Devillers and Ioana Vasilescu ----------------------- This paper reports on an analysis of prosodic cues for emotion characterization in 100 natural spoken dialogs recorded at a telephone customer service center. The corpus was annotated with task-dependent emotion tags, which were validated by a perceptual test. Two F0 range parameters, one at the sentence level and the other at the subsegment level, emerge as the most salient cues for emotion classification.
These parameters can differentiate between negative emotions (irritation/anger, anxiety/fear) and a neutral attitude, and confirm trends illustrated by the perceptual experiment. 0572anav.pdf [ICPhS 2003, Barcelona, August 2003] --------- Prosodic cues for perceptual emotion detection in task-oriented Human-Human corpus Laurence Devillers, Ioana Vasilescu and Catherine Mathon ----------------------- This paper addresses the question of perceptual detection and prosodic cue analysis of emotional behavior in a spontaneous speech corpus of real Human-Human dialogs. Detecting real emotions should be a clear focus for research on modeling human dialog, as it could help with analyzing the evolution of the dialog. Our aims are to define appropriate emotions for call center services, to validate the presence of emotions via perceptual tests and to find robust cues for emotion detection. Most research has focused mainly on artificial data in which predefined emotions were simulated by actors. For real-life corpora a set of appropriate emotion labels must be determined. To this purpose, we conducted a perceptual test exploring 2 experimental conditions: with and without the capacity of listening to the audio signal. Perceived emotions reflect the presence of shaded and mixed emotions/attitudes. We report correlations between objective values and both perceived prosodic parameters and emotion labels. ICPhSliaison.pdf [ICPhS 2003, Barcelona, August 2003] --------- Liaisons in French: a corpus based study using morpho-syntactic information P. Boula de Mareuil, M. Adda-Decker, V. Gendner ----------------------- French liaison consists in producing a normally mute consonant before a word starting with a vowel. Whereas the general context for liaison is relatively straightforward to describe, actual occurrences are difficult to predict.
In the present work, we quantitatively examine the productivity of the 20 or so liaison rules described in the literature, such as the rule which states that after a determiner, liaison is compulsory. To do so, we used the French BREF corpus of read newspaper speech (100 hours), automatically tagged with morpho-syntactic information. The speech corpus has been automatically aligned using a pronunciation dictionary which includes liaisons. There are 90k liaison contexts in the corpus, about half of which (45%) are realised. A better knowledge of liaison production is of particular interest for pronunciation modelling in automatic speech recognition, for text-to-speech synthesis, descriptive phonetics and second language acquisition. 0ICPhSlid.pdf [ICPhS 2003, Barcelona, August 2003] --------- Phonetic knowledge, phonotactics and perceptual validation for automatic language identification Martine Adda-Decker, Fabien Antoine, Philippe Boula de Mareuil, Ioana Vasilescu, Lori Lamel, Jacqueline Vaissiere, Edouard Geoffrois, Jean-Sylvain Liénard ----------------------- This study explores a multilingual phonotactic approach to automatic language identification using Broadcast News data. The definition of a multilingual phone set is discussed and an upper limit on the performance of the phonotactic approach is estimated by eliminating any degradation due to recognition errors. This upper bound is compared to automatic language identification based on a phonotactic approach. The eight languages of interest are: Arabic, Mandarin, English, French, German, Italian, Portuguese and Spanish. A perceptual test has been carried out to compare human and machine performance in similar configurations. Different phone set classes have been experimented with, ranging from a binary C/V distinction to a shared phone set of 70 phones. Experiments show that phonotactic constraints are in theory able to identify a language (among 8) with close to 100% accuracy on very short sequences of 1-2 seconds.
Automatic and human performance on very short sequences both remain below this theoretical performance. ICME2003_4pages.pdf [ICME 2003, Baltimore, July 2003] --------- Emotion Detection in Task-Oriented Spoken Dialogs Laurence Devillers, Lori Lamel, Ioana Vasilescu ----------------------- Detecting emotions in the context of automated call center services can be helpful for following the evolution of human-computer dialogs, enabling dynamic modification of the dialog strategies and influencing the final outcome. The emotion detection work reported here is part of a larger study aiming to model user behavior in real interactions. We make use of a corpus of real agent-client spoken dialogs in which the manifestation of emotion is quite complex, and it is common to have shaded emotions since the interlocutors attempt to control the expression of their internal attitude. Our aims are to define appropriate emotions for call center services, to annotate the dialogs, to validate the presence of emotions via perceptual tests, and to find robust cues for emotion detection. In contrast to research carried out with artificial data with simulated emotions, for real-life corpora the set of appropriate emotion labels must be determined. Two studies are reported: the first investigates automatic emotion detection using linguistic information, whereas the second concerns perceptual tests for identifying emotions as well as the prosodic and textual cues which signal them. About 11% of the utterances are annotated with non-neutral emotion labels. Preliminary experiments using lexical cues detect about 70% of these labels.
eacl2003peace.ps.Z [EACL'03, Budapest, April 2003] --------- The PEACE SLDS understanding evaluation paradigm of the French MEDIA campaign Laurence Devillers, Helene Maynard, Patrick Paroubek, Sophie Rosset ----------------------- This paper presents a paradigm for evaluating the context-sensitive understanding capability of any spoken language dialog system: PEACE (French acronym for Paradigme d'Evaluation Automatique de la Comprehension hors et En-contexte). This paradigm will be the basis of the French Technolangue MEDIA project, in which dialog systems from various academic and industrial sites will be tested in an evaluation campaign coordinated by ELRA/ELDA (over the next two years). Despite previous efforts such as Eagles, Disc, Aupelf ArcB2 or the ongoing American DARPA Communicator project, the spoken dialog community still lacks common reference tasks and widely agreed upon methods for comparing and diagnosing systems and techniques. Automatic solutions are nowadays being sought both to make possible the comparison of different approaches by means of reliable indicators with generic evaluation methodologies and also to reduce system development costs. However, achieving independence from both the dialog system and the task performed increasingly appears to be utopian. Most evaluations have up to now either tackled the system as a whole, or based the measurements on dialog-context-free information. PEACE aims to bypass some of these shortcomings by extracting, from real dialog corpora, test sets that synthesize contextual information. sspr03_ftp.pdf [IEEE Workshop on Spontaneous Speech, Tokyo, April 2003] --------- Using Continuous Space Language Models for Conversational Speech Recognition Holger Schwenk and Jean-Luc Gauvain ----------------------- Language modeling for conversational speech suffers from the limited amount of available adequate training data.
This paper describes a new approach that performs the estimation of the language model probabilities in a continuous space, thereby allowing smooth interpolation of unobserved n-grams. This continuous space language model is used during the last decoding pass of a state-of-the-art conversational telephone speech recognizer to rescore word lattices. For this type of speech data, it achieves consistent word error reductions of more than 0.4% compared to a carefully tuned backoff n-gram language model. ica03cts.pdf [ICASSP'03, Hong Kong, April 2003] --------- Conversational Telephone Speech Recognition J.L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, F. Lefèvre ----------------------- This paper describes the development of a speech recognition system for the processing of telephone conversations, starting with a state-of-the-art broadcast news transcription system. We identify major changes and improvements in acoustic and language modeling, as well as decoding, which are required to achieve state-of-the-art performance on conversational speech. Some major changes on the acoustic side include the use of speaker normalization (VTLN), the need to cope with channel variability, and the need for efficient speaker adaptation and better pronunciation modeling. On the linguistic side the primary challenge is to cope with the limited amount of language model training data. To address this issue we make use of a data selection technique, and a smoothing technique based on a neural network language model. At the decoding level lattice rescoring and minimum word error decoding are applied. On the development data, the improvements yield an overall word error rate of 24.9% whereas the original BN transcription system had a word error rate of about 50% on the same data. ica03CLZ.pdf [ICASSP'03, Hong Kong, April 2003] --------- Unsupervised Language Model Adaptation for Broadcast News L. Chen, J.L. Gauvain, L. Lamel, G.
Adda ----------------------- Unsupervised language model adaptation for speech recognition is challenging, particularly for complicated tasks such as the transcription of broadcast news (BN) data. This paper presents an unsupervised adaptation method for language modeling based on information retrieval techniques. The method is designed for the broadcast news transcription task, where the topics of the audio data cannot be predicted in advance. Experiments are carried out using the LIMSI American English BN transcription system and the NIST 1999 BN evaluation sets. The unsupervised adaptation method reduces the perplexity by 7% relative to the baseline LM and yields a 2% relative improvement for a 10xRT system. ica03CB.pdf [ICASSP'03, Hong Kong, April 2003] --------- Feature and Score Normalization for Speaker Verification of Cellular Data C. Barras and J.L. Gauvain ----------------------- This paper presents some experiments with feature and score normalization for text-independent speaker verification of cellular data. The speaker verification system is based on cepstral features and Gaussian mixture models with 1024 components. The following methods, which have been proposed for feature and score normalization, are reviewed and evaluated on cellular data: cepstral mean subtraction (CMS), variance normalization, feature warping, T-norm, Z-norm and the cohort method. We found that the combination of feature warping and T-norm gives the best results on the NIST 2002 test data (for the one-speaker detection task). Compared to a baseline system using both CMS and variance normalization and achieving a 0.410 minimal decision cost function (DCF), feature warping and T-norm respectively bring 8% and 12% relative reductions, whereas the combination of both techniques yields a 22% relative reduction, reaching a DCF of 0.320. This result approaches the state-of-the-art performance level obtained for speaker verification with land-line telephone speech.
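Two of the normalization steps evaluated above are simple enough to sketch: cepstral mean subtraction with variance normalization on the feature side, and T-norm on the score side. This is a sketch with synthetic numbers; the feature array and cohort scores are invented for illustration, and feature warping and the GMM scoring itself are not shown:

```python
import numpy as np

def cms_var_norm(cepstra):
    """Cepstral mean subtraction plus variance normalization,
    applied per utterance to a (frames x coefficients) array."""
    return (cepstra - cepstra.mean(axis=0)) / cepstra.std(axis=0)

def t_norm(raw_score, cohort_scores):
    """T-norm: normalize a trial score by the mean and standard deviation of
    the same test utterance scored against a cohort of impostor models."""
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()

rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(200, 13))  # fake cepstral frames
normed = cms_var_norm(feats)
print(normed.mean(axis=0))  # ~0 for every coefficient
print(normed.std(axis=0))   # ~1 for every coefficient

print(t_norm(5.0, [1.0, 2.0, 3.0]))  # trial score in cohort standard deviations
```

The design rationale is the same for both steps: removing per-utterance (or per-trial) offsets and scales makes scores from different channels comparable against a single decision threshold.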
isca2003-yylo.ps.Z [ISCA Workshop on Multilingual Spoken Document Retrieval, April 2003] --------- Tracking Topics in Broadcast News Data Yuen-Yee Lo and Jean-Luc Gauvain ----------------------- This paper describes a topic tracking system and its ability to cope with sparse training data for broadcast news tracking. The baseline tracker relies on a unigram topic model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the initial topic model, and unsupervised model adaptation is carried out after processing each test story. A new technique of variable-weight unsupervised online adaptation has been developed and was found to outperform traditional fixed-weight online adaptation. Combining both document expansion and adaptation resulted in a 37% cost reduction, tested on both English and machine-translated Mandarin broadcast news data transcribed by an ASR system, with manual story boundaries. Another challenging condition is one in which the story boundaries are not known for the broadcast news data. A window-based automatic story boundary detector has been developed for the tracking system. The tracking results with the window-based tracking system are comparable to those obtained with state-of-the-art automatic story segmentation on the TDT3 corpus. isle02em.ps.Z [ISLE workshop, Edinburgh Dec 16-17, 2002 http://www.research.att.com/~walker/isle-dtag-wrk/] --------- Annotation and Detection of Emotion in a Task-oriented Human-Human Dialog Corpus Laurence Devillers, Ioana Vasilescu (ENST), Lori Lamel ----------------------- Automatic emotion detection is potentially important for customer care in the context of call center services. A major difficulty is that the expression of emotion is quite complex in the context of agent-client dialogs.
Due to unstated rules of politeness it is common to have shaded emotions, in which the interlocutors attempt to limit their internal attitude (frustration, anger, disappointment). Furthermore, the telephone-quality speech complicates the tasks of detecting perceptual cues and of extracting relevant measures from the signal. Detecting real emotions can be helpful for following the evolution of human-computer dialogs, enabling dynamic modification of the dialog strategies and influencing the final outcome. This paper reports on three studies: the first concerns the use of multi-level annotations including emotion tags for diagnosis of the dialog state; the second investigates automatic emotion detection using linguistic information; and the third reports on two perceptual tests for identifying emotions as well as the prosodic and textual cues which signal them. Two experimental conditions were considered -- with and without the capability of listening to the audio signal. Finally, we propose a new set of emotion/attitude tags suggested by the previous experiments and discuss perspectives. isle02axes.ps.Z [ISLE workshop, Edinburgh Dec 16-17, 2002 http://www.research.att.com/~walker/isle-dtag-wrk/] --------- Representing Dialog Progression for Dynamic State Assessment Sophie Rosset and Lori Lamel ----------------------- While developing several spoken language dialog systems for information retrieval tasks, we found that representing the dialog progression along an axis was useful to facilitate dynamic evaluation of the dialog state. Since dialogs evolve from the first exchange until the end, it is of interest to assess whether an ongoing dialog is running smoothly or if it is encountering problems.
We consider that the dialog progression can be represented on two axes: a Progression axis, which represents the "good" progression of the dialog, and an Accidental axis, which represents the accidents that occur, corresponding to misunderstandings between the system and the user. The time (in number of turns) used by the system to repair an accident is represented by the Residual Error, which is incremented when an accident occurs and decremented when the dialog progresses. This error reflects the difference between a perfect (i.e., theoretical) dialog (e.g. without errors, miscommunication...) and the real ongoing dialog. One particularly interesting use of the dialog axis progression annotation is to extract problematic dialogs from large data collections for analysis, with the aim of improving the dialog system. In the context of the IST Amities project we have extended this representation to the annotation of human-human dialogs recorded at a stock exchange call center and intend to use the extended representation in an automated natural dialog system for call routing. tdt02.ps.Z [TDT'02, Gaithersburg, November 2002] --------- The LIMSI Topic Tracking System for TDT2002 Yuen-Yee Lo and Jean-Luc Gauvain ----------------------- In this paper we describe the LIMSI topic tracking system used for the DARPA 2002 Topic Detection and Tracking evaluation (TDT2002). The system relies on a unigram topic model, where the score for an incoming document is the normalized likelihood ratio of the topic model and a general English model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the topic model, which is then adapted in an unsupervised manner after each incoming document is processed. Experimental results demonstrate the effectiveness of these two techniques for the primary evaluation condition on both the TDT3 development corpus and the official TDT2002 test data.
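The scoring rule described above (a length-normalized likelihood ratio between a unigram topic model and a general English model) can be sketched as follows. The vocabulary, training snippets and smoothing constant are invented for illustration; document expansion and online adaptation are not shown:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=0.5):
    """Add-alpha smoothed unigram model over a fixed vocabulary
    (tokens are assumed to be drawn from vocab)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def tracking_score(doc_tokens, topic_lm, general_lm):
    """Length-normalized log-likelihood ratio of topic vs. general model;
    positive scores favor the topic hypothesis."""
    llr = sum(math.log(topic_lm[w] / general_lm[w]) for w in doc_tokens)
    return llr / len(doc_tokens)

vocab = ["stocks", "market", "rain", "sunshine"]
topic_train = ["stocks", "market", "stocks"]  # on-topic training story
general_train = ["rain", "sunshine", "stocks", "market", "rain", "sunshine"]

topic_lm = unigram_lm(topic_train, vocab)
general_lm = unigram_lm(general_train, vocab)

print(tracking_score(["stocks", "market"], topic_lm, general_lm))   # > 0: on topic
print(tracking_score(["rain", "sunshine"], topic_lm, general_lm))   # < 0: off topic
```

Thresholding this score gives the tracking decision; the normalization by document length keeps scores of short and long stories on the same scale.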
Another challenge is that story boundaries are not known for the broadcast news data. A window-based automatic boundary detector has been developed for the tracking system. The tracking results with the window-based tracking system are comparable to those obtained with state-of-the-art automatic story segmentation on the TDT3 corpus. icslp02_tcs.ps.Z [ICSLP 2002, Denver, Sep 2002] --------- A phonetic study of Vietnamese tones: acoustic and electrographic measurements Vu Ngoc Tuan and Christophe d'Alessandro and Sophie Rosset ----------------------- **tbc** icslp02hbm.ps.Z [ICSLP 2002, Denver, Sep 2002] --------- Issues in the Development of a Stochastic Speech Understanding System Fabrice Lefevre and Helene Bonneau-Maynard ----------------------- In the development of a speech understanding system, recourse to stochastic techniques can greatly reduce the need for human expertise. A known disadvantage is that stochastic models require large annotated training corpora in order to reliably estimate model parameters. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. In order to decrease the development cost, this work investigates the performance of stochastic understanding models along two parameters: the use of automatically segmented data and the use of automatically learned lexical normalisation rules. eusipco02.ps.Z [Eusipco'02, Toulouse, Sep. 2002] --------- Automatic Processing of Broadcast Audio in Multiple Languages Lori Lamel and Jean-Luc Gauvain ----------------------- This paper addresses recent progress in LVCSR in multiple languages which has enabled the processing of broadcast audio for information access. At LIMSI, broadcast news transcription systems have been developed for seven languages: English, French, German, Mandarin, Portuguese, Spanish and Arabic.
Automatic processing to access the content must take into account the specificities of audio data, such as the need to deal with a continuous data stream and an imperfect word transcription, as well as the specificities of the language. Some near-term applications are audio data mining, structuring of audiovisual archives, selective dissemination of information and media monitoring. spc02mask.ps.Z [Speech Communication, Vol. 38, (1-2):131-139 Sep 2002] --------- User Evaluation of the MASK Kiosk L. Lamel, S. Bennacef, J.L. Gauvain, H. Dartigues, J.N. Temem ----------------------- In this paper we report on a series of user trials carried out to assess the performance and usability of the Multimodal Multimedia Service Kiosk (MASK) prototype kiosk. The aim of the ESPRIT MASK project was to pave the way for advanced public service applications with user interfaces employing multimodal, multi-media input and output. The prototype kiosk was developed after analyzing the technological requirements in the context of users performing travel enquiry tasks, in close collaboration with the French Railways (SNCF) and the Ergonomics group at University College London (UCL). The time to complete the transaction with the MASK kiosk is reduced by about 30% compared to that required for the standard kiosk, and the success rate is 85% for novices and 94% once familiar with the system. In addition to meeting or exceeding the performance goals set at the project onset in terms of success rate, transaction time, and user satisfaction, the MASK kiosk was judged to be user-friendly and simple to use. In this article we present the results of user trials of the MASK kiosk (Multimodal Multimedia Service Kiosk). The goal of the ESPRIT MASK project was to develop an information and ticketing kiosk with an innovative, user-friendly interface combining tactile and vocal modalities.
The prototype was developed after a requirements analysis in the rail travel context, in collaboration with the SNCF and the ergonomics group at University College London (UCL). The transaction time with the MASK kiosk is reduced by 30% compared with existing kiosks. The success rate is 85% for novice users and 94% for those who have already used the system (more than 3 uses). All the objectives set by the SNCF at the start of the project were met by the prototype, whether regarding success rate, transaction time, or user satisfaction. The MASK kiosk was judged user-friendly and simple to use. lrec02_eval_defi.pdf [LREC'02, Las Palmas, June 2002] --------- Predictive and objective evaluation of speech understanding: the "challenge" evaluation campaign of the I3 speech workgroup of the French CNRS J.Y. Antoine, C. Bousquet-Vernhettes, J. Goulian, M. Zakaria Kurdi, S. Rosset, N. Vigouroux, J. Villaneau ----------------------- The recent development of spoken language processing has gone along with large-scale evaluation programmes that concern spoken language dialogue systems as well as their components (speech recognition, speech understanding, dialogue management). This paper deals with the evaluation of Spoken Language Understanding (SLU) systems in the general framework of spoken Man-Machine Communication. lrec02port.ps.Z [ISCA SALTMIL SIG workshop at LREC'02 on Portability Issues in Human Language Technologies, Las Palmas, June 2002] --------- Some Issues in Speech Recognizer Portability Lori Lamel ----------------------- Speech recognition technology has greatly evolved over the last decade. However, one of the remaining challenges is reducing the development cost. Most recognition systems are tuned to a particular task, and porting the system to a new task (or language) requires a substantial investment of time and money, as well as human expertise.
Today's state-of-the-art systems rely on the availability of large amounts of manually transcribed data for acoustic model training and large normalized text corpora for language model training. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision. This paper addresses some of the main issues in porting a recognizer to another task or language, and highlights some recent research activities aimed at reducing the porting cost and at developing generic core speech recognition technology. spcH4_limsi.ps.Z [Speech Communication, Vol. 37, (1-2):89-108, May 2002] --------- The LIMSI Broadcast News Transcription System Jean-Luc Gauvain, Lori Lamel and Gilles Adda ----------------------- This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or `found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The problem of partitioning the continuous stream of data is addressed using an iterative segmentation and clustering algorithm with Gaussian mixtures. The speech recognizer makes use of continuous density HMMs with Gaussian mixture for acoustic modeling and 4-gram statistics estimated on large text corpora. Word recognition is performed in multiple passes, where initial hypotheses are used for cluster-based acoustic model adaptation to improve word graph generation.
The overall word transcription error rates of the LIMSI evaluation systems were 27.1% (Nov96, partitioned test data), 18.3% (Nov97, unpartitioned data), 13.6% (Nov98, unpartitioned data) and 17.1% (Fall99, unpartitioned data with computation time under 10x real-time). jep2002hbm.ps.Z [JEP-TALN 2002, Nancy, June 2002] --------- Issues in the Development of a Stochastic Speech Understanding System Helene Bonneau-Maynard and Fabrice Lefevre ----------------------- The need for human expertise in the development of a speech understanding system can be greatly reduced by the use of stochastic techniques. However corpus-based techniques require the annotation of large amounts of training data. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. In order to decrease the development cost, this work investigates the performance of the understanding module with two parameters: the influence of the training corpus size and the use of automatically annotated data. ica02cb.ps.Z [IEEE ICASSP'02, Orlando, May 2002] --------- Transcribing Audio-Video Archives Claude Barras, Alexandre Allauzen, Lori Lamel, Jean-Luc Gauvain ----------------------- This paper addresses the automatic transcription of audio-video archives using a state-of-the-art broadcast news speech transcription system. A 9-hour corpus spanning the latter half of the 20th century (1945-1995) has been transcribed and an analysis of the transcription quality carried out. In addition to the challenges of transcribing heterogeneous broadcast news data, we are faced with changing properties of the archive over time, such as the audio quality, the speaking style, vocabulary items and manner of expression. After assessing the performance of the transcription system, several paths are explored in an attempt to reduce the mismatch between the acoustic and language models and the archived data.
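The audio partitioning described in the broadcast news abstracts above (segmentation and clustering with Gaussian mixtures, speech vs. non-speech labeling) can be illustrated with a toy sketch. This is not the LIMSI algorithm: it classifies frames of a single scalar feature under one Gaussian per class, and the min_run smoothing parameter is a hypothetical stand-in for the iterative segmentation/clustering over full cepstral features.

```python
import math

def gauss_loglik(x, mean, var):
    """Log-likelihood of a scalar feature x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def partition(frames, models, min_run=3):
    """Label each frame with the best-scoring class, then absorb runs
    shorter than min_run into the preceding segment (crude smoothing)."""
    labels = [max(models, key=lambda m: gauss_loglik(x, *models[m]))
              for x in frames]
    segments = []          # run-length encode the frame labels
    for lab in labels:
        if segments and segments[-1][0] == lab:
            segments[-1][1] += 1
        else:
            segments.append([lab, 1])
    cleaned = []           # merge spurious short runs
    for lab, n in segments:
        if n < min_run and cleaned:
            cleaned[-1][1] += n
        else:
            cleaned.append([lab, n])
    return cleaned

# (mean, var) per class; the feature could be, e.g., frame log-energy
models = {"speech": (10.0, 4.0), "music": (2.0, 4.0)}
frames = [9.5, 10.2, 11.0, 2.1, 9.8, 10.5, 1.9, 2.4, 2.2, 2.0]
print(partition(frames, models))  # → [['speech', 6], ['music', 4]]
```

The single mislabeled frame in the middle is absorbed by the smoothing step, which is the intuition behind favoring contiguous homogeneous segments before recognition.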
ica02hs.ps.Z [IEEE ICASSP'02, Orlando, May 2002] --------- Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition Holger Schwenk and Jean-Luc Gauvain ----------------------- This paper describes ongoing work on a new approach for language modeling for large vocabulary continuous speech recognition. Almost all state-of-the-art systems use statistical n-gram language models estimated on text corpora. One principal problem with such language models is the fact that many of the n-grams are never observed even in very large training corpora, and therefore it is common to back off to a lower-order model. In this paper we propose to address this problem by carrying out the estimation task in a continuous space, enabling a smooth interpolation of the probabilities. A neural network is used to learn the projection of the words onto a continuous space and to estimate the n-gram probabilities. The connectionist language model is being evaluated on the DARPA Hub5 conversational telephone speech recognition task and preliminary results show consistent improvements in both perplexity and word error rate. ica02light.ps.Z [IEEE ICASSP'02, Orlando, May 2002] --------- Unsupervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain and Gilles Adda ----------------------- This paper describes some recent experiments using unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated raw broadcast news data. The hypothesized transcription is used to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data.
These experiments demonstrate that unsupervised training is a viable training scheme and can dramatically reduce the cost of building acoustic models. csl01.ps.Z [Computer Speech and Language, Vol. 16, Jan 2002] --------- Lightly supervised and unsupervised acoustic model training Lori Lamel and Jean-Luc Gauvain and Gilles Adda ----------------------- The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the Darpa TDT-2 corpus. The hypothesized transcription is optionally aligned with closed captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data. These experiments demonstrate that lightly supervised or unsupervised training can dramatically reduce the cost of building acoustic models. mtap00.ps.Z [Journal on Multimedia Tools and Applications Vol 14 (2): 187-200, June 2001] --------- Audio Partitioning and Transcription for Broadcast Data Indexation J.L. Gauvain, L. Lamel and G.
Adda ----------------------- This work addresses automatic transcription of television and radio broadcasts in multiple languages. Transcription of such types of data is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Radio and television broadcasts consist of a continuous data stream made up of segments of different linguistic and acoustic natures, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. Word recognition is carried out with a speaker-independent, large vocabulary, continuous speech recognizer which makes use of n-gram statistics for language modeling and of continuous density HMMs with Gaussian mixtures for acoustic modeling. This system has consistently obtained top-level performance in DARPA evaluations. Over 500 hours of unpartitioned unrestricted American English broadcast data have been partitioned, transcribed and indexed, with an average word error of about 20%. With current IR technology there is essentially no degradation in information retrieval performance for automatic and manual transcriptions on this data set. asru01hbmfl.ps.Z [IEEE ASRU'01, Madonna di Campiglio, December 2001] --------- Investigating Stochastic Speech Understanding Helene Bonneau-Maynard, Fabrice Lefevre ----------------------- The need for human expertise in the development of a speech understanding system can be greatly reduced by the use of stochastic techniques. However corpus-based techniques require the annotation of large amounts of training data. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. This work investigates the influence of the training corpus size on the performance of the understanding module.
The use of automatically annotated data is also investigated as a means to increase the corpus size at a very low cost. First, a stochastic speech understanding model developed using data collected with the LIMSI ARISE dialog system is presented. Its performance is shown to be comparable to that of the rule-based caseframe grammar currently used in the system. In a second step, two ways of reducing the development cost are pursued: (1) reducing the amount of manually annotated data used to train the stochastic models and (2) using automatically annotated data in the training process. limsi_trk01.ps.Z [TDT'01, Gaithersburg, November 2001] --------- The LIMSI Topic Tracking System for TDT2001 Yuen-Yee Lo and Jean-Luc Gauvain ----------------------- In this paper we describe the LIMSI topic tracking system used for the DARPA 2001 Topic Detection and Tracking evaluation (TDT2001). The system relies on a unigram topic model, where the score for an incoming document is the normalized likelihood ratio of the topic model and a general English model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the initial topic model, and unsupervised model adaptation is carried out after each test story is processed. Experimental results demonstrate the effectiveness of these two techniques for the primary evaluation condition on both the TDT3 development corpus and the official TDT2001 test data. euro01fab.ps.Z [ISCA Eurospeech'01, Aalborg, August 2001] --------- Improving Genericity for Task-Independent Speech Recognition Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel ----------------------- Although there have been regular improvements in speech recognition technology over the past decade, speech recognition is far from being a solved problem. Recognition systems are usually tuned to a particular task and porting the system to a new task (or language) is both time-consuming and expensive.
In this paper, issues in speech recognizer portability are addressed through the development of generic core speech recognition technology. First, the genericity of wide domain models is assessed by evaluating performance on several tasks. Then, the use of transparent methods for adapting generic models to a specific task is explored. Finally, further techniques aimed at enhancing the genericity of the wide domain models are evaluated. We show that unsupervised acoustic model adaptation and multi-source training can reduce the performance gap between task-independent and task-dependent acoustic models, and for some tasks even outperform task-dependent acoustic models. euro01chen.ps.Z [ISCA Eurospeech'01, Aalborg, August 2001] --------- Using Information Retrieval Methods for Language Model Adaptation Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, Gilles Adda, Martine Adda-Decker ----------------------- In this paper we report experiments on language model adaptation using information retrieval methods, drawing upon recent developments in information extraction and topic tracking. One of the problems is extracting reliable topic information with high confidence from the audio signal in the presence of recognition errors. The work in the information retrieval domain on information extraction and topic tracking suggested a new way to solve this problem. In this work, we make use of information retrieval methods to extract topic information from the word recognizer hypotheses, which are then used to automatically select adaptation data from a very large general text corpus. Two adaptive language models, a mixture-based model and a MAP-based model, have been investigated using the adaptation data. Experiments carried out with the LIMSI Mandarin broadcast news transcription system give a relative character error rate reduction of 4.3% with this adaptation method.
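The mixture-based adaptive language model mentioned in the abstract above interpolates a topic-specific model with a general one. A minimal unigram sketch, under simplifying assumptions: the models here are plain word-probability dicts and a single mixture weight is estimated by EM on held-out text, whereas the actual systems work with full n-gram models.

```python
def interpolate(p_general, p_topic, lam):
    """Mixture LM: P(w) = lam * P_topic(w) + (1 - lam) * P_general(w)."""
    words = set(p_general) | set(p_topic)
    return {w: lam * p_topic.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in words}

def em_weight(p_general, p_topic, heldout, iters=20):
    """Estimate the mixture weight by EM, maximizing held-out likelihood."""
    lam = 0.5
    for _ in range(iters):
        posts = []
        for w in heldout:
            t = lam * p_topic.get(w, 1e-9)
            g = (1 - lam) * p_general.get(w, 1e-9)
            posts.append(t / (t + g))   # E-step: responsibility of the topic LM
        lam = sum(posts) / len(posts)   # M-step: average responsibility
    return lam

# Toy distributions and held-out words (illustrative only)
p_general = {"the": 0.5, "stock": 0.1, "game": 0.4}
p_topic = {"the": 0.4, "stock": 0.5, "game": 0.1}
lam = em_weight(p_general, p_topic, ["the", "stock", "game", "the", "stock"])
adapted = interpolate(p_general, p_topic, lam)
```

With held-out text leaning toward the topic vocabulary, EM pushes the weight above 0.5 toward the topic model, which is the mechanism by which the adaptation data tilts the combined LM.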
itrw01fabfinal.ps.Z [ISCA ITRW'01, Sophia-Antipolis, August 2001] --------- Genericity and Adaptability Issues for Task-Independent Speech Recognition Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel --------------------- The last decade has witnessed major advances in core speech recognition technology, with today's systems able to recognize continuous speech from many speakers without the need for an explicit enrollment procedure. Despite these improvements, speech recognition is far from being a solved problem. Most recognition systems are tuned to a particular task and porting the system to another task or language is both time-consuming and expensive. Our recent work addresses issues in speech recognizer portability, with the goal of developing generic core speech recognition technology. In this paper, we first assess the genericity of wide domain models by evaluating performance on several tasks. Then, transparent methods are used to adapt generic acoustic and language models to a specific task. Unsupervised acoustic model adaptation is contrasted with supervised adaptation, and a system-in-loop scheme for incremental unsupervised acoustic and linguistic model adaptation is investigated. Experiments on a spontaneous dialog task show that with the proposed scheme, a transparently adapted generic system can perform nearly as well (about a 1% absolute gap in word error rates) as a task-specific system trained on several tens of hours of manually transcribed data. itrw01chen.ps.Z [ISCA ITRW'01, Sophia-Antipolis, August 2001] --------- Language Model Adaptation for Broadcast News Transcription Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, Gilles Adda, Martine Adda-Decker --------------------- This paper reports on language model adaptation for the broadcast news transcription task. Language model adaptation for this task is challenging in that the subject of any particular show or portion thereof is unknown in advance and is often related to more than one topic.
One of the problems in language model adaptation is the extraction of reliable topic information from the audio signal, particularly in the presence of recognition errors. In this work, we draw upon techniques used in information retrieval to extract topic information from the word recognizer hypotheses, which are then used to automatically select adaptation data from a large general text corpus. Two adaptive language models, a mixture-based model and a MAP-based model, have been investigated using the adaptation data. Experiments carried out with the LIMSI Mandarin broadcast news transcription system give a relative character error rate reduction of 4.3% by combining both adaptation methods. taln01sr.ps.Z [TALN'01, Tours, July 2001] --------- Gestionnaire de dialogue pour un système d'informations à reconnaissance vocale Sophie Rosset, Lori Lamel --------------------- In this paper we describe the dialog manager of a spoken language dialog system for information retrieval. The dialog manager maintains both static and dynamic knowledge sources, which are used according to the dialog strategies. These strategies have been developed so as to obtain the overall system objectives while taking into consideration the desired ergonomic choices. Dialog management is based on a dialog model which divides the interaction into phases and a dynamic model of the task. This allows the dialog strategy to be adapted as a function of the history and the dialog state. The dialog manager was evaluated in the context of the Le3-Arise project, where in the last user tests the dialog success improved from 53% to 85%. In this article we present a dialog manager for a spoken-language information retrieval system. The dialog manager has several knowledge sources, both static and dynamic. This knowledge is managed and used by the dialog manager via strategies.
These strategies are implemented and organized according to the objectives set for the dialog system and the ergonomic choices we adopted. The dialog manager uses a dialog model based on the identification of dialog phases and a dynamic task model, which increases the ability to adapt the strategy according to the dialog history and state. This dialog manager, implemented and evaluated during the last evaluation campaign of the Le-3 Arise project, yielded an improvement in the dialog success rate (from 53% to 85%). acl01.ps.Z [ACL'01, Toulouse, July 2001] --------- Processing Broadcast Audio for Information Access Jean-Luc Gauvain, Lori Lamel, Gilles Adda, Martine Adda-Decker, Claude Barras, Langzhou Chen, and Yannick de Kercadio This paper addresses recent progress in speaker-independent, large vocabulary, continuous speech recognition, which has opened up a wide range of near and mid-term applications. One rapidly expanding application area is the processing of broadcast audio for information access. At LIMSI, broadcast news transcription systems have been developed for English, French, German, Mandarin and Portuguese, and systems for other languages are under development. Audio indexation must take into account the specificities of audio data, such as needing to deal with the continuous data stream and an imperfect word transcription. Some near-term application areas are audio data mining, selective dissemination of information and media monitoring. ica01light.ps.Z [ICASSP'01, Salt Lake City, May 2001] --------- Investigating Lightly Supervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain, Gilles Adda --------------------- The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe broadcast audio data with a word error of about 20%.
However, acoustic model development for the recognizers requires large corpora of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision. In this paper we describe some recent experiments using different levels of supervision for acoustic model training in order to reduce the system development cost. The experiments have been carried out using the DARPA TDT-2 corpus (also used in the SDR99 and SDR00 evaluations). Our experiments demonstrate that light supervision is sufficient for acoustic model development, drastically reducing the development cost. ica01fl.ps.Z [ICASSP'01, Salt Lake City, May 2001] --------- Towards Task-Independent Speech Recognition Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel --------------------- Despite the considerable progress made in the last decade, speech recognition is far from a solved problem. For instance, porting a recognition system to a new task (or language) still requires substantial investment of time and money, as well as expertise in speech recognition. This paper takes a first step at evaluating to what extent a generic state-of-the-art speech recognizer can reduce the manual effort required for system development. We demonstrate the genericity of wide domain models, such as broadcast news acoustic and language models, and techniques to achieve a higher degree of genericity, such as transparent methods to adapt such models to a specific task. This work targets three tasks using commonly available corpora: small vocabulary recognition (TI-digits), text dictation (WSJ), and goal-oriented spoken dialog (ATIS). 
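The lightly supervised recipe in the abstracts above (recognize unannotated audio, then use closed captions as a weak check on the hypotheses) can be caricatured as a filtering step. A rough sketch under simplifying assumptions: the actual systems align hypotheses against the caption text, whereas here a segment is kept only if enough of its hypothesized words occur anywhere in the captions, with a hypothetical min_match threshold.

```python
def keep_segment(hyp_words, caption_words, min_match=0.7):
    """Accept a recognized segment as training data when at least
    min_match of its hypothesized words also appear in the captions."""
    captions = set(caption_words)
    matched = sum(1 for w in hyp_words if w in captions)
    return matched / max(len(hyp_words), 1) >= min_match

def select_training(segments, caption_words, min_match=0.7):
    """Return the hypothesized segments that pass the caption filter."""
    return [seg for seg in segments
            if keep_segment(seg, caption_words, min_match)]

captions = "the president spoke to reporters on tuesday".split()
segments = [
    "the president spoke to reporters".split(),  # agrees with captions: kept
    "uh the present spoke reporters".split(),    # recognition errors: dropped
]
kept = select_training(segments, captions)
```

Segments whose hypotheses disagree with the captions are simply discarded rather than corrected, which is what makes the supervision "light": no one ever transcribes the audio by hand.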
ica01cb.ps.Z [ICASSP'01, Salt Lake City, May 2001] --------- Automatic Transcription of Compressed Broadcast Audio Claude Barras, Lori Lamel, Jean-Luc Gauvain --------------------- With increasing volumes of audio and video data broadcast over the web, it is of interest to assess the performance of state-of-the-art automatic transcription systems on compressed audio data for media indexation applications. In this paper the performance of the LIMSI 10x French broadcast news transcription system is measured on a two-hour audio set for a range of MP3 and RealAudio codecs at various bitrates and the GSM codec used for European cellular phone communications. The word error rates are compared with those obtained on high quality PCM recordings prior to compression. For a 6.5 kbps audio bit rate (the most commonly used on the web), word error rates under 40% can be achieved, which makes automatic media monitoring systems over the web a realistic task. hlt01.ps.Z [HLT'01, San Diego, March 2001] --------- Portability Issues for Speech Recognition Technologies Lori Lamel, Fabrice Lefevre, Jean-Luc Gauvain, Gilles Adda --------------------- Although there has been regular improvement in speech recognition technology over the past decade, speech recognition is far from being a solved problem. Most recognition systems are tuned to a particular task and porting the system to a new task (or language) still requires substantial investment of time and money, as well as expertise. Today's state-of-the-art systems rely on the availability of large amounts of manually transcribed data for acoustic model training and large normalized text corpora for language model training. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision.
In this paper we address issues in speech recognizer portability and activities aimed at developing generic core speech recognition technology, in order to reduce the manual effort required for system development. Three main axes are pursued: assessing the genericity of wide domain models by evaluating performance under several tasks; investigating techniques for lightly supervised acoustic model training; and exploring transparent methods for adapting generic models to a specific task so as to achieve a higher degree of genericity. sdr00.ps.Z [TREC-9, Gaithersburg, Nov 2000] --------- The LIMSI SDR System for TREC-9 Jean-Luc Gauvain, Lori Lamel, Claude Barras, Gilles Adda, and Yannick de Kercadio --------------------- In this paper we describe the LIMSI Spoken Document Retrieval system used in the TREC-9 evaluation. This system combines an adapted version of the LIMSI 1999 Hub-4E transcription system for speech recognition with text-based IR methods. Compared with the LIMSI TREC-8 system, this year's system is able to index the audio data without knowledge of the story boundaries using a double windowing approach. The query expansion procedure of the information retrieval component has been revised and makes use of contemporaneous text sources. Experimental results are reported in terms of mean average precision for both the TREC SDR'99 and SDR'00 queries using the same 557h data set. The mean average precision of this year's system is 0.5250 for SDR'99 and 0.3706 for SDR'00 for the focus unknown story boundary condition with a 20% word error rate. nle99.ps.Z [Natural Language Engineering special issue on Best Practice in Spoken Language Dialogue Systems Engineering, Vol. 6 (3), Oct 2000] --------- Towards Best Practice in the Development and Evaluation of Speech Recognition Components of a Spoken Language Dialogue System L. Lamel, W. Minker, P.
Paroubek --------------------- Spoken Language Dialog Systems (SLDSs) aim to use natural spoken input for performing an information processing task such as automated standards, call routing or travel planning and reservations. The main functionalities of an SLDS are speech recognition, natural language understanding, dialog management, database access and interpretation, response generation and speech synthesis. Speech recognition, which transforms the acoustic signal into a string of words, is a key technology in any SLDS. This article summarizes the main aspects of current practice in the design, implementation and evaluation of speech recognition components for spoken language dialogue systems. It is based on the framework used in the European project DISC. icslp00_h4ger.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Investigating text normalization and pronunciation variants for German broadcast transcription Martine Adda-Decker, Gilles Adda, Lori Lamel --------------------- In this paper we describe our ongoing work concerning lexical modeling in the LIMSI broadcast transcription system for German. Lexical decomposition is investigated with a twofold goal: lexical coverage optimization and improved letter-to-sound conversion. A set of about 450 decompounding rules, developed using statistics from a 300M word corpus, reduces the OOV rate from 4.5% to 4.0% on a 30k development text set. Adding partial inflection stripping, the OOV rate drops to 2.9%. For letter-to-sound conversion, decompounding reduces cross-lexeme ambiguities and thus contributes to more consistent pronunciation dictionaries. Another point of interest concerns reduced pronunciation modeling. Word error rates, measured on 1.3 hours of ARTE TV broadcast, vary between 18 and 24% depending on the show and the system configuration. Our experiments indicate that using reduced pronunciations slightly decreases word error rates.
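The decompounding idea above (splitting German compounds into in-vocabulary parts to cut the OOV rate) can be sketched as a greedy recursive splitter. This toy version is not the LIMSI rule set: it only matches whole lexicon entries, knows a single linking-'s' rule, and the min_part bound is an assumed guard against spurious short splits.

```python
def decompound(word, lexicon, min_part=3):
    """Split an out-of-vocabulary compound into in-lexicon parts, allowing
    a German linking 's' between parts (toy rule). Returns None if no
    complete split into known parts exists."""
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        candidates = [rest]
        if rest.startswith("s"):
            candidates.append(rest[1:])  # drop the linking element
        for r in candidates:
            tail = decompound(r, lexicon, min_part)
            if tail:
                return [head] + tail
    return None

lex = {"haupt", "bahn", "hof", "arbeit", "amt"}
print(decompound("hauptbahnhof", lex))  # → ['haupt', 'bahn', 'hof']
print(decompound("arbeitsamt", lex))    # → ['arbeit', 'amt']
```

Each successful split turns one OOV token into several in-vocabulary tokens, which is how a compact rule set can lower the OOV rate without enlarging the lexicon.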
icslp00_h4m.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Broadcast News Transcription in Mandarin Langzhou Chen, Lori Lamel, Gilles Adda and Jean-Luc Gauvain --------------------- In this paper, our work in developing a Mandarin broadcast news transcription system is described. The main focus of this work is a port of the LIMSI American English broadcast news transcription system to the Chinese Mandarin language. The system consists of an audio partitioner and an HMM-based continuous speech recognizer. The acoustic models were trained on about 24 hours of data from the 1997 Hub4 Mandarin corpus available via LDC. In addition to the transcripts, the language models were trained on the Mandarin Chinese News Corpus containing about 186 million characters. We investigate recognition performance as a function of lexicon size, with and without tone in the lexicon, and with a topic-dependent language model. The transcription character error rate on the DARPA 1997 test set is 18.1% using a lexicon with 3 tone levels and a topic-based language model. icslp00_slds.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Considerations in the Design and Evaluation of Spoken Language Dialog Systems Lori Lamel, Sophie Rosset and Jean-Luc Gauvain --------------------- In this paper we summarize our experience at LIMSI in the design, development and evaluation of spoken language dialog systems for information retrieval tasks. This work has been for the most part carried out in the context of several European and international projects. Evaluation plays an integral role in the development of spoken language dialog systems. While there are commonly used measures and methodologies for evaluating speech recognizers, the evaluation of spoken dialog systems is considerably more complicated due to the interactive nature and the human perception of performance.
It is therefore important to assess not only the individual system components, but the overall system performance using objective and subjective measures. icslp00_10x.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Fast Decoding for Indexation of Broadcast Data Jean-Luc Gauvain and Lori Lamel --------------------- Processing time is an important factor in making a speech transcription system viable for automatic indexation of radio and television broadcasts. When only concerned with the word error rate, it is common to design systems that run in 100 times real-time or more. This paper addresses issues in reducing the speech recognition time for automatic indexation of radio and TV broadcasts with the aim of obtaining reasonable performance for close to real-time operation. We investigated computational resources in the range of 1 to 10xRT on commonly available platforms. Constraints on the computational resources led us to reconsider design issues, particularly those concerning the acoustic models and the decoding strategy. A new decoder was implemented which transcribes broadcast data in a few times real-time with only a slight increase in word error rate when compared to our best system. Experiments with spoken document retrieval show that comparable IR results are obtained with a 10xRT automatic transcription or with manual transcription, and that reasonable performance is still obtained with a 1.4xRT transcription system. TLPreco00.ps.Z [France-China Speech Processing Workshop, Beijing, Oct 2000] --------- An Overview of Speech Recognition Activities at LIMSI Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Claude Barras, Langzhou Chen, Michele Jardino, Lori Lamel, Holger Schwenk --------------------- This paper provides an overview of recent activities at LIMSI in multilingual speech recognition and its applications. The main goal of speech recognition is to provide a transcription of the speech signal as a sequence of words.
Speech recognition is a core technology for most applications involving voice technology. The two main classes of applications currently addressed are transcription and indexation of broadcast data and spoken language dialog systems for information access. Speaker-independent, large vocabulary, continuous speech recognition systems for different European languages (French, German and British English) and for American English and Mandarin Chinese have been developed. These systems rely on supporting research in acoustic-phonetic modeling, lexical modeling and language modeling. icslp00_holger.ps.Z [ICSLP'2000, Beijing, Oct 2000] --------- Combining Multiple Speech Recognizers using Voting and Language Model Information Holger Schwenk and Jean-Luc Gauvain --------------------- In 1997, NIST introduced a voting scheme called ROVER for combining word scripts produced by different speech recognizers. This approach has achieved a relative word error reduction of up to 20% when used to combine the systems' outputs from the 1998 and 1999 Broadcast News evaluations. Recently, there has been increasing interest in using this technique. This paper provides an analysis of several modifications of the original algorithm. Topics addressed are the order of combination, normalization/filtering of the systems' outputs prior to combining them, treatment of ties during voting and the incorporation of language model information. The modified ROVER achieves an additional 5% relative word error reduction on the 1998 and 1999 Broadcast News evaluation test sets. Links with recent theoretical work on alternative error measures are also discussed. 
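The ROVER-style combination in the abstract above works by aligning the recognizers' outputs and voting word by word. A stripped-down sketch, under two assumptions not in the original: the outputs here arrive already aligned slot by slot (the real algorithm builds the word transition network itself), and an optional language-model score is used only to break ties, loosely in the spirit of the LM-informed variant.

```python
from collections import Counter

def rover_vote(aligned_hyps, lm_score=None):
    """Slot-by-slot majority vote over pre-aligned recognizer outputs.
    None in a slot marks a deletion; ties are broken by lm_score if given."""
    result = []
    for slot in zip(*aligned_hyps):
        counts = Counter(slot)
        best = max(counts.values())
        cands = [w for w, c in counts.items() if c == best]
        if len(cands) > 1 and lm_score is not None:
            cands.sort(key=lm_score, reverse=True)  # LM breaks the tie
        winner = cands[0]
        if winner is not None:
            result.append(winner)
    return result

hyps = [["the", "cat", "sat"],
        ["the", "hat", "sat"],
        ["a", "cat", "sat"]]
print(rover_vote(hyps))  # → ['the', 'cat', 'sat']
```

With only two systems every disagreement is a tie, which is why the abstracts report that tie treatment and LM information matter most when combining two or three recognizers.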
icslp00_holger.ps.Z [ASR'2000, Paris, Sep 2000] --------- Improved ROVER using Language Model Information Holger Schwenk and Jean-Luc Gauvain --------------------- In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observation. Speech recognizers, however, are usually evaluated by measuring the word error, so that there is a mismatch between the training and the evaluation criterion. Recently, algorithms for directly minimizing the word error and other task-specific error criteria have been proposed. This paper presents an extension of the ROVER algorithm for combining outputs of multiple speech recognizers using both a word error criterion and a sentence error criterion. The algorithm has been evaluated on the 1998 and 1999 broadcast news evaluation test sets, as well as the SDR 1999 speech recognition 10 hour subset, and consistently outperformed the standard ROVER algorithm. The approach seems to be of particular interest for improving recognition performance by combining only two or three speech recognizers, achieving relative performance improvements of up to 20% compared to the best single recognizer. asr2000_light.ps.Z [ASR'2000, Paris, Sep 2000] --------- Lightly Supervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain, Gilles Adda --------------------- Although tremendous progress has been made in speech recognition technology, with the capability of today's state-of-the-art systems to transcribe unrestricted continuous speech from broadcast data, these systems rely on the availability of large amounts of manually transcribed acoustic training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision.
In this paper we describe some recent experiments using lightly supervised techniques for acoustic model training in order to reduce the system development cost. The strategy we investigate uses a speech recognizer to transcribe unannotated broadcast news data, and optionally combines the hypothesized transcription with associated, but unaligned, closed captions or transcripts to create labeled training data. We show that this approach can dramatically reduce the cost of building acoustic models. phon00.ps.Z [Workshop on Phonetics and Phonology in ASR, Saarbrücken 2000] --------- Modeling Reduced Pronunciations in German Martine Adda-Decker and Lori Lamel --------------------- This paper deals with pronunciation modeling for automatic speech recognition in German with a special focus on reduced pronunciations. Starting with our 65k full form pronunciation dictionary we have experimented with different phone sets for pronunciation modeling. For each phone set, different lexica have been derived using mapping rules for unstressed syllables, where /schwa-vowel+[lnm]/ are replaced by syllabic /[lnm]/. The different pronunciation dictionaries are used both for acoustic model training and during recognition. The speech corpora correspond to TV broadcast shows, which contain signal segments of various acoustic and linguistic natures. The speech is produced by a wide variety of speakers with linguistic styles ranging from prepared to spontaneous speech with changing background and channel conditions. Experiments were carried out using 4 shows of news and documentaries lasting more than 15 minutes each (a total of 1h20min). Word error rates obtained vary between 19 and 29% depending on the show and the system configuration. Only small differences in recognition rates were measured for the different experimental setups, with slightly better results obtained by the reduced lexica.
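The lightly supervised training strategy described in the asr2000_light entry above can be caricatured in a few lines: run the recognizer on unannotated audio, align its hypothesis with the unaligned closed captions, and keep only the regions where the two agree as training labels. A minimal sketch, assuming word-level sequences and using a longest-matching-subsequence alignment as a stand-in for the real time-aligned matching:

```python
from difflib import SequenceMatcher

def filter_training_words(hypothesis, captions):
    """Keep only hypothesized words confirmed by the closed captions.

    Both inputs are word lists; agreement is found with a
    longest-matching-subsequence alignment (a simplification of the
    alignment used in practice).
    """
    matcher = SequenceMatcher(a=hypothesis, b=captions, autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        kept.extend(hypothesis[block.a:block.a + block.size])
    return kept

hyp = "the senate voted today on the new budget bill".split()
cc = "senate voted on the budget bill".split()
print(filter_training_words(hyp, cc))
# ['senate', 'voted', 'on', 'the', 'budget', 'bill']
```

Recognizer insertions such as "today" and "new" that the captions do not confirm are discarded rather than used as (possibly wrong) training labels.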
spc00arise.ps.Z [Speech Communication, Vol 31, (4):339-354, Aug 2000] --------- The LIMSI ARISE System L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet, B. Prouts --------------------- The LIMSI ARISE system provides vocal access by telephone to rail travel information for main French intercity connections, including timetables, simulated fares and reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open interaction, where the user is free to ask any question or to provide any information at any point in time. In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. ieee00.ps.Z [Proceedings of the IEEE, Special issue on LVCSR: Advances and Applications. Vol 88(8):1181-1200, Aug 2000] --------- Large Vocabulary Continuous Speech Recognition: Advances and Applications Jean-Luc Gauvain and Lori Lamel --------------------- The last decade has witnessed substantial advances in speech recognition technology, which, combined with the increase in computational power and storage capacity, have resulted in a variety of commercial products already or soon to be on the market. In this paper we review the state of the art in core technology for large vocabulary continuous speech recognition, with a view towards highlighting recent advances. We then highlight issues in moving towards applications, discussing system efficiency, portability across languages and tasks, and enhancing the system output by adding tags and non-linguistic information. Current performance in speech recognition and outstanding challenges for three classes of applications are discussed: dictation, audio indexation and spoken language dialog systems. lrec00hbm.ps.Z [LREC'00, Athens, May 2000] --------- Predictive performance of dialog systems H. Bonneau-Maynard, L. Devillers, S. Rosset --------------------- This paper relates some of our experiments on predictive performance measures for dialog systems. Evaluating dialog systems is often very costly because of the need to carry out user trials, so it is clearly advantageous when evaluation can be carried out automatically. It would be helpful if for each application we were able to measure the system's performance with an objective cost function.
This performance function can then be used to make predictions about future evolutions of the system without user interaction. Using the PARADISE paradigm, a performance function derived from the relative contribution of various factors is first obtained for one system developed at LIMSI: PARIS-SITI (a kiosk for tourist information retrieval in Paris). A second experiment with PARIS-SITI on a new test population confirms that the most important predictors of user satisfaction are understanding accuracy, recognition accuracy and the number of user repetitions. Furthermore, similar spoken dialog features appear as important features for the Arise system (a train timetable telephone information system). We also explore different ways of measuring user satisfaction. We then discuss the introduction of subjective factors in the predictive coefficients. ica00h4.ps.Z [ICASSP'2000, Istanbul, Jun 2000] --------- Transcription and Indexation of Broadcast Data Jean-Luc Gauvain, Lori Lamel, Yannick de Kercadio, and Gilles Adda --------------------- In this paper we report on recent research on transcribing and indexing broadcast news data for information retrieval purposes. The system described here combines an adapted version of the LIMSI 1998 Hub-4E transcription system for speech recognition with text-based IR methods. Experimental results are reported in terms of recognition word error rate and mean average precision for both the TREC SDR98 (100h) and SDR99 (600h) data sets. With query expansion using commercial transcripts, comparable mean average precisions are obtained on manual reference transcriptions and on automatic transcriptions with a word error rate of 21.5% measured on a 10 hour data subset. spc00sv.ps.Z [Speech Communication, Vol 31 (2-3):141-154, June 2000] --------- Speaker Verification over the Telephone L.F. Lamel and J.L.
Gauvain --------------------- Speaker verification has been the subject of active research for many years, yet despite these efforts and promising results on laboratory data, speaker verification performance over the telephone remains below that required for many applications. This experimental study aimed to quantify speaker recognition performance outside the context of any specific application, as a function of factors more or less acknowledged to affect accuracy. Some of the issues addressed are: the speaker model (Gaussian mixture models are compared with phone-based models); the influence of the amount and content of training and test data on performance; performance degradation due to model ageing and how it can be counteracted using adaptation techniques; and achievable performance levels using text-dependent and text-independent recognition modes. These and other factors were addressed using a large corpus of read and spontaneous speech in French (over 250 hours collected from 100 target speakers and 1000 impostors) designed and recorded for the purpose of this study. On this data, the lowest equal error rate is 1% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 1.5s of speech per trial. hub4_00.ps.Z [DARPA Speech Transcription Workshop, May 2000] --------- The LIMSI 1999 Hub-4E Transcription System Jean-Luc Gauvain, Lori Lamel and Gilles Adda --------------------- In this paper we report on the LIMSI 1999 Hub-4E system for broadcast news transcription. The main difference from our previous broadcast news transcription system is that a new decoder was implemented to meet the 10xRT requirement. This single pass 4-gram dynamic network decoder is based on a time-synchronous Viterbi search with dynamic expansion of LM-state conditioned lexical trees, and with acoustic and language model lookaheads. The decoder can handle position-dependent, cross-word triphones and lexicons with contextual pronunciations. Faster than real-time decoding can be obtained using this decoder with a word error under 30%, running in less than 100Mb of memory on widely available platforms such as Pentium III or Alpha machines.
The same basic models (lexicon, acoustic models, language models) and partitioning procedure used in past systems have been used for this evaluation. The acoustic models were trained on about 150 hours of transcribed speech material. 65K word language models were obtained by interpolation of backoff n-gram language models trained on different text data sets. Prior to word decoding, a maximum likelihood partitioning algorithm segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Word decoding is carried out in three steps, integrating cluster-based MLLR acoustic model adaptation. The final decoding step uses a 4-gram language model interpolated with a category trigram model. The overall word transcription error on the 1999 evaluation test data was 17.1% for the baseline 10X system. riao00.ps.Z [RIAO'00, Paris, May 2000] An Audio Transcriber for Broadcast Document Indexation Contacts: Bernard Prouts (Vecsys), Jean-Luc Gauvain (LIMSI) --------- Demonstration of the AudioSurf automatic transcription system. The demonstrator shows state-of-the-art automatic transcription and indexing capabilities for broadcast data in three languages: American English, French and German. sdr99.ps.Z [Proc. of the Text Retrieval Conference, TREC-8, Gaithersburg, Nov 1999] --------- The LIMSI SDR system for TREC-8 J.L. Gauvain, Y. Kercadio, L. Lamel, and G. Adda -------------------- In this paper we report on our TREC-8 SDR system, which combines an adapted version of the LIMSI 1998 Hub-4E transcription system for speech recognition with an IR system based on the Okapi term weighting function. Experimental results are given in terms of word error rate and average precision for both the SDR'98 and SDR'99 data sets. In addition to the Okapi approach, we also investigated a Markovian approach, which, although not used in the TREC-8 evaluation, yields comparable results.
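The Okapi term weighting mentioned in the TREC-8 entry above is, in its common BM25 form (the exact variant used in the system is not spelled out here, so treat this as an illustrative assumption), a product of a saturating term-frequency component and an inverse document frequency:

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 score contribution of one query term in one document.

    tf: term frequency in the document; df: number of documents
    containing the term; n_docs: collection size. k1 and b are the
    usual saturation and length-normalization constants.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# A rare term in an average-length document outweighs a common one:
rare = bm25_weight(tf=2, df=5, n_docs=1000, doc_len=300, avg_doc_len=300)
common = bm25_weight(tf=2, df=800, n_docs=1000, doc_len=300, avg_doc_len=300)
print(rare > common)  # True
```

A document's retrieval score for a query is the sum of these per-term weights, which is what makes the measure robust to the recognition errors discussed in the abstract: a few misrecognized words only perturb a few terms of the sum.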
The evaluation system obtained an average precision of 0.5411 on the reference transcriptions and of 0.5072 on the automatic transcriptions. The word error rate measured on a 10 hour subset is 21.5%. spc99pron.ps.Z [Speech Communication, Special Issue on Pronunciation Variation Modeling, Vol 29 (2-4): 83-98, Nov 1999] --------- Pronunciation Variants Across Systems, Languages and Speaking Style M. Adda-Decker, L. Lamel -------------------- This contribution aims at evaluating the use of pronunciation variants for different recognition system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription of the utterance and a phonemically represented lexicon, and is thus focused on the modeling capabilities of the acoustic word models. To measure the need for variants we have defined the variant2+ rate, which is the percentage of words in the corpus not aligned with the most common phonemic transcription. This measure may be indicative of the possible need for pronunciation variants in the recognition system. Pronunciation lexica have been automatically created so as to include a large number of variants (overgeneration). In particular, lexica with parallel and sequential variants were automatically generated in order to assess the spectral and temporal modeling accuracy. We first investigated the dependence of the aligned variants on the recognizer configuration. Then a cross-lingual study was carried out for read speech in French and American English using the BREF and WSJ corpora. A comparison between read and spontaneous speech was made for French based on alignments of BREF (read) and MASK (spontaneous) data. Comparative alignment results using different acoustic model sets demonstrate the dependency between acoustic model accuracy and the need for pronunciation variants.
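The variant2+ rate defined in this abstract is simple to compute once each corpus token has been aligned to one of its lexicon pronunciations; a small sketch with hypothetical alignment data:

```python
def variant2plus_rate(aligned_tokens, most_common):
    """Percentage of corpus tokens not aligned with the word's most
    common phonemic transcription.

    aligned_tokens: list of (word, chosen_pronunciation) pairs;
    most_common: dict mapping each word to its most common pronunciation.
    """
    off = sum(1 for word, pron in aligned_tokens
              if pron != most_common[word])
    return 100.0 * off / len(aligned_tokens)

# Hypothetical alignment of four tokens of two words:
tokens = [("going", "g ow ih ng"), ("going", "g ow n"),
          ("the", "dh ah"), ("the", "dh ah")]
base = {"going": "g ow ih ng", "the": "dh ah"}
print(variant2plus_rate(tokens, base))  # 25.0
```

The words, pronunciations and alignment here are invented for illustration; in the study the pairs come from forced alignment against the overgenerated lexica.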
The alignment results obtained with the above lexica have been used to study the link between word frequencies and variants using different acoustic model sets.
cbmi99.ps.Z [Content-Based Multimedia Indexing, CBMI'99 Toulouse, Oct 1999 http://www.eurecom.fr/~bmgroup/cbmi99/] --------- Audio partitioning and transcription for broadcast data indexation J.L. Gauvain, L. Lamel, and G. Adda -------------------- This work addresses automatic transcription of television and radio broadcasts. Transcription of such types of data is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Radio and television broadcasts consist of a continuous stream of data comprised of segments of different linguistic and acoustic natures, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. The speaker-independent large vocabulary, continuous speech recognizer makes use of n-gram statistics for language modeling and of continuous density HMMs with Gaussian mixtures for acoustic modeling. The system has consistently obtained top-level performance in DARPA evaluations. An average word error of about 20% has been obtained on 700 hours of unpartitioned unrestricted American English broadcast data. cbmi99-olive.ps.Z [Content-Based Multimedia Indexing, CBMI'99 Toulouse, Oct 1999 http://www.eurecom.fr/~bmgroup/cbmi99/] --------- Olive: Speech Based Video Retrieval Franciska de Jong (TNO/University of Twente), Jean-Luc Gauvain, Jurgen den Hartog (TNO), Klaus Netter (DFKI) -------------------- This paper describes the OLIVE project which aims to support automated indexing of video material by use of human language technologies. OLIVE is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which serve as the basis for text-based retrieval functionality. 
The retrieval demonstrator builds on and extends the architecture of the POP-EYE project, a system applying human language technology to subtitles for the disclosure of video fragments. SFC99-article.ps.Z [Mathematique Informatique et Sciences Humaines, numero 147 (speciale classification), septembre 1999] --------- Classification de mots non étiquetés par des méthodes statistiques Christel Beaujard et Michèle Jardino -------------------- Our goal is to develop robust language models for speech recognition. These models have to predict a word knowing its history.
Despite the increasing size of electronic text data, all the possible word sequences of a language cannot be observed. A way to generate these unencountered word sequences is to map words into classes. Class-based language models have a better coverage of the language with a reduced number of parameters, which helps speed up the speech recognition systems they are used in. Two types of automatic word classification are described. They are trained on word statistics estimated on texts derived from newspapers and transcribed speech. These classifications do not require any tagging; words are classified according to the local context in which they occur. The first is a mapping of the vocabulary words into a fixed number of classes according to a Kullback-Leibler measure. In the second, similar words are clustered into classes whose number is not fixed in advance. This work has been performed with French training data coming from two domains, differing in size and vocabulary. esca99-minker.ps.Z [ESCA Tutorial and Research Workshop on Interactive Dialogue in Multi-Modal Systems, Kloster Irsee, Germany, June 1999] --------- Hidden Understanding Models for Machine Translation W. Minker and M. Gavalda and A. Waibel --------------------- We demonstrate the portability of a stochastic method for understanding natural language from a setting of human-machine interactions (ATIS - Air Travel Information Services and MASK - Multimodal Multimedia Automated Service Kiosk) to the more open one of human-to-human interactions. The application we use is the English Spontaneous Speech Task (ESST) for multilingual appointment scheduling. Spoken language systems developed for this task translate spontaneous conversational speech among different languages. voiceglottalope.ps.Z [Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, Firenze] --------- Glottal open quotient estimation using linear prediction N. Henrich, B. Doval and C.
d'Alessandro --------------------- A new method for the estimation of the voice open quotient is presented. Assuming abrupt glottal closures, the glottal flow waveform is considered as the impulse response of an anticausal two-pole filter. It is defined by four parameters: T0, Av, Oq and αm. The last three are estimated by a second-order linear prediction of the inverse filtered speech. Results on synthetic and natural speech signals are reported and compared with measurements on the corresponding electroglottographic signals. euro99_tuan.ps.Z [EuroSpeech'99 Budapest, Sep'99] --------- Robust glottal closure detection using the wavelet transform Tuan Vu Ngoc and Christophe d'Alessandro --------------------- In this work, a time-scale framework for the analysis of glottal closure instants is proposed. As glottal closure can be soft or sharp, depending on the type of vocal activity, the analysis method should be able to deal with both wide-band and low-pass signals; a multi-scale analysis is thus well suited. The analysis is based on a dyadic wavelet filterbank. The amplitude maxima of the wavelet transform are computed at each scale. These maxima are organized into lines of maximal amplitude (LOMA) using a dynamic programming algorithm. These lines form ``trees'' in the time-scale domain. Glottal closure instants are then interpreted as the top of the strongest branch, or trunk, of these trees. An interesting feature of the LOMA is their amplitude: the LOMA are strong and well organized for voiced speech, and rather weak and widespread for unvoiced speech. The accumulated amplitude along the LOMA gives a very good measure of the degree of voicing.
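The multi-scale idea behind the LOMA method above can be illustrated with a much-simplified sketch: it replaces the dyadic wavelet filterbank and the dynamic-programming linking with derivative magnitudes at dyadic smoothing scales, summed across scales as a crude stand-in for accumulating amplitude along the lines of maximal amplitude; the strongest local maxima of the sum are the candidate closure instants.

```python
import numpy as np

def gci_candidates(signal, scales=(1, 2, 4, 8), top=3):
    """Candidate glottal closure instants from multi-scale analysis.

    At each dyadic scale the signal is smoothed by a moving average
    and differentiated; the derivative magnitudes are summed across
    scales, and the strongest local maxima of the sum are returned.
    """
    acc = np.zeros(len(signal))
    for s in scales:
        kernel = np.ones(s) / s
        smooth = np.convolve(signal, kernel, mode="same")
        acc += np.abs(np.diff(smooth, prepend=smooth[0]))
    # local maxima of the accumulated amplitude, strongest first
    peaks = [i for i in range(1, len(acc) - 1)
             if acc[i] >= acc[i - 1] and acc[i] > acc[i + 1]]
    peaks.sort(key=lambda i: acc[i], reverse=True)
    return sorted(peaks[:top])

# Synthetic signal with abrupt amplitude steps at known instants:
x = np.zeros(300)
for t in (50, 130, 210):
    x[t:] += (-1.0) ** (t // 80)  # steps of alternating sign
print(gci_candidates(x))
```

On real speech the abrupt closures are far less clean, which is exactly why the paper links maxima across scales with dynamic programming instead of summing them pointwise.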
euro99_dial.ps.Z [EuroSpeech'99 Budapest, Sep'99] --------- Design Strategies for Spoken Language Dialog Systems Sophie Rosset, Samir Bennacef and Lori Lamel --------------------- The development of task-oriented spoken language dialog systems requires expertise in multiple domains, including speech recognition, natural spoken language understanding and generation, dialog management and speech synthesis. The dialog manager is the core of a spoken language dialog system, and makes use of multiple knowledge sources. In this contribution we report on our methodology for developing and testing different strategies for dialog management, drawing upon our experience with several travel information tasks. In the LIMSI Arise system for train travel information we have implemented a two-level mixed-initiative dialog strategy, where the user has maximum freedom when all is going well, and the system takes the initiative if problems are detected. The revised dialog strategy and error recovery mechanisms have resulted in a 5-10% increase in dialog success depending upon the word error rate. euro99_hub4.ps.Z [EuroSpeech'99 Budapest, Sep'99] --------- Recent Advances in Transcribing Television and Radio Broadcasts Jean-Luc Gauvain and Lori Lamel and Gilles Adda and Michele Jardino --------------------- Transcription of broadcast news shows (radio and television) is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Broadcast shows are challenging to transcribe as they consist of a continuous data stream with segments of different linguistic and acoustic natures. Transcribing such data requires addressing two main problems: those related to the varied acoustic properties of the signal, and those related to the linguistic properties of the speech. Prior to word transcription, the data is partitioned into homogeneous acoustic segments.
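Partitioning into homogeneous segments, as described above, can be illustrated with a single-feature cartoon of likelihood-based change detection: compare the likelihood of two adjacent windows modeled separately versus jointly (the actual maximum likelihood partitioning works on full cepstral feature vectors with Gaussian mixtures; this one-dimensional Gaussian version is only a sketch).

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of xs under its own maximum-likelihood Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def change_score(left, right):
    """Gain in log-likelihood from modeling the two windows separately."""
    return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(left + right)

# A quiet/loud boundary scores far higher than a quiet/quiet one:
quiet = [0.1, -0.2, 0.05, 0.0, -0.1] * 4
loud = [3.1, 2.8, 3.3, 2.9, 3.0] * 4
print(change_score(quiet, loud) > change_score(quiet[:10], quiet[10:]))  # True
```

Sliding this comparison over the stream and cutting where the score peaks yields candidate segment boundaries, which are then clustered and labeled by gender and bandwidth.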
Non-speech segments are identified and rejected, and the speech segments are clustered and labeled according to bandwidth and gender. The speaker-independent large vocabulary, continuous speech recognizer makes use of n-gram statistics for language modeling and of continuous density HMMs with Gaussian mixtures for acoustic modeling. The LIMSI system has consistently obtained top-level performance in DARPA evaluations, with an overall word transcription error on the Nov98 evaluation test data of 13.6%. The average word error on unrestricted American English broadcast news data is under 20%. ica99arise.ps.Z [ICASSP'99 Phoenix] --------- The LIMSI ARISE System for Train Travel Information Lori Lamel, Sophie Rosset, Jean-Luc Gauvain and Samir Bennacef --------------------- In the context of the LE-3 ARISE project we have been developing a dialog system for vocal access to rail travel information. The system provides schedule information for the main French intercity connections, as well as simulated fares and reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open dialog structure, where the user is free to ask any question or to provide any information at any point in time. In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. In addition to our own assessment, the prototype system undergoes periodic user evaluations carried out by our partners at the French Railways. icassp99_holger.ps.Z [ICASSP'99, Phoenix, March 1999] --------- Using boosting to improve a hybrid HMM/neural network speech recognizer Holger Schwenk --------------------- "Boosting" is a general method for improving the performance of almost any learning algorithm. A recently proposed and very promising boosting algorithm is AdaBoost.
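AdaBoost maintains a weight per training example, raising the weights of examples the current weak classifier gets wrong so that the next one concentrates on them; a generic sketch with one-feature threshold stumps on toy data (not the paper's HMM/neural-network setting):

```python
import math

def train_stump(xs, ys, w):
    """Best threshold/polarity stump on 1-D data under weights w."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds=5):
    """Returns a list of (alpha, threshold, polarity) weak learners."""
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, pol))
        # raise the weight of misclassified examples, lower the rest
        w = [wi * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def predict(model, x):
    s = sum(a * (p if x >= t else -p) for a, t, p in model)
    return 1 if s >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, 1, 1, 1, -1, -1, -1]  # an interval: not separable by one stump
print([predict(adaboost(xs, ys), x) for x in xs])
# [-1, -1, 1, 1, 1, -1, -1, -1]
```

No single stump can classify the interval-shaped labels, but the weighted combination of a few stumps can, which is the effect the paper exploits with neural-network weak learners.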
In this paper we investigate whether AdaBoost can be used to improve a hybrid HMM/neural network continuous speech recognizer. Boosting significantly improves the word error rate from 6.3% to 5.3% on a test set of the OGI Numbers95 corpus, a medium-size continuous numbers recognition task. These results compare favorably with other combining techniques using several different feature representations or additional information from longer time spans. icassp99bref.ps.Z [ICASSP'99, Phoenix, March 1999] --------- Large Vocabulary Speech Recognition in French Martine Adda-Decker, Gilles Adda, Jean-Luc Gauvain, Lori Lamel --------------------- In this contribution we present some design considerations concerning our large vocabulary continuous speech recognition system for French. The impact of the epoch of the text training material on lexical coverage, language model perplexity and recognition performance on newspaper texts is demonstrated. The effectiveness of larger vocabulary sizes and larger text training corpora for language modeling is investigated. As French is a highly inflected language producing large lexical variety and a high homophone rate, the language model contribution is of major importance. About 30% of recognition errors are shown to be due to substitutions between inflected forms of a given root form. When word error rates are analysed as a function of word frequency, a significant increase in the error rate can be measured for frequency ranks above 5000. hub4_99.ps.Z [DARPA'99 Broadcast News Workshop, Herndon, VA, Feb'99] --------- The LIMSI 1998 Hub-4E Transcription System Jean-Luc Gauvain, Lori Lamel, Gilles Adda and Michele Jardino --------------------- In this paper we report on our Nov98 Hub-4E system, which is an extension of our Nov97 system. The LIMSI system for the November 1998 Hub-4E evaluation is a continuous mixture density, tied-state, cross-word, context-dependent HMM system.
The acoustic models were trained on the 1995, 1996 and 1997 official Hub-4E training data containing about 150 hours of transcribed speech material. 65K word language models were obtained by interpolation of backoff n-gram language models trained on different text data sets. Prior to word decoding a maximum likelihood partitioning algorithm segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Word decoding is carried out in three steps, integrating cluster-based MLLR acoustic model adaptation. The final decoding step uses a 4-gram language model interpolated with a category trigram model. The main differences compared to last year's system arise from the use of additional acoustic and language model training data, the use of divisive decision tree clustering instead of agglomerative clustering for state-tying, word graph generation using adapted acoustic models, the use of interpolated LMs trained on different data sets instead of training a single model on weighted texts, and a 4-gram LM interpolated with a category model. The overall word transcription error on the Nov98 evaluation test data was 13.6%. icslp98lid.ps.Z [ICSLP'98, Sydney, Dec'98] --------- Language Identification Incorporating Lexical Information D. Matrouf, M. Adda-Decker, L.F. Lamel, J.L. Gauvain --------------------- In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus.
Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection. icslp98dial.ps.Z [ICSLP'98, Sydney, Dec'98] --------- Evaluation of Dialog Strategies for a Tourist Information Retrieval System L. Devillers, H. Bonneau-Maynard --------------------- In this paper, we describe the evaluation of the dialog management and response generation strategies being developed for retrieval of touristic information, selected as a common domain for the ARC AUPELF-B2 action. Two dialog strategy versions within the same general platform have been implemented. We have investigated qualitative and quantitative criteria for evaluation of these dialog control strategies, in particular by testing the efficiency of our system with and without automatic mechanisms for guiding the user via suggestive prompts. An evaluation phase with 32 subjects has been carried out to assess the utility of guiding the user. We report performance comparisons for naive and experienced subjects and also describe how experimental results can be placed in the PARADISE framework for evaluating dialog systems. The experiments show that user guidance is appropriate for novices and appreciated by all users. icslp98h4.ps.Z [ICSLP'98, Sydney, Dec'98] --------- Partitioning and Transcription of Broadcast News Data Jean-Luc Gauvain, Lori Lamel, Gilles Adda --------------------- Radio and television broadcasts consist of a continuous stream of data comprised of segments of different linguistic and acoustic natures, which poses challenges for transcription. In this paper we report on our recent work in transcribing broadcast news data, including the problem of partitioning the data into homogeneous segments prior to word recognition.
Gaussian mixture models are used to identify speech and non-speech segments. A maximum-likelihood segmentation/clustering process is then applied to the speech segments using GMMs and an agglomerative clustering algorithm. The clustered segments are then labeled according to bandwidth and gender. The recognizer is a continuous mixture density, tied-state cross-word context-dependent HMM system with a 65k trigram language model. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error on the Nov'97 unpartitioned evaluation test data was 18.5%. icslp98mask.ps.Z [ICSLP'98, Sydney, Dec'98] --------- User Evaluation of the MASK Kiosk L. Lamel, S. Bennacef, J.L. Gauvain, H. Dartigues, J.N. Temem --------------------- In this paper we report on a series of user trials carried out to assess the performance and usability of the Mask prototype kiosk. The aim of the ESPRIT Multimodal Multimedia Service Kiosk (MASK) project was to pave the way for more advanced public service applications with user interfaces employing multimodal, multi-media input and output. The prototype kiosk was developed in close collaboration with the French Railways (SNCF) and the Ergonomics group at UCL, after analysis of the technological requirements in the context of users and the tasks they perform in carrying out travel enquiries. The time to complete the transaction with the Mask kiosk is reduced by about 30% compared to that required for the standard kiosk, and the success rate is 85% for novices and 94% once familiar with the system. In addition to meeting or exceeding the performance goals set at the project outset in terms of success rate, transaction time, and user satisfaction, the Mask kiosk was judged to be user-friendly and simple to use.
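The partitioning described in the broadcast news abstracts above rests on comparing model likelihoods frame by frame. A minimal sketch, assuming made-up single-Gaussian models over toy 2-dimensional features (real systems use mixtures of many Gaussians over cepstral coefficients, with trained parameters):

```python
import math

# Hedged sketch: frame-level speech/non-speech labeling by comparing
# log-likelihoods under two diagonal Gaussian models, in the spirit of
# the GMM-based partitioning described above. The means and variances
# below are illustrative values, not trained parameters.

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of a feature vector under a diagonal Gaussian."""
    ll = 0.0
    for xi, mi, vi in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll

def label_frames(frames, speech_model, nonspeech_model):
    """Return 'speech'/'nonspeech' per frame; non-speech is rejected."""
    labels = []
    for x in frames:
        ll_s = diag_gauss_loglik(x, *speech_model)
        ll_n = diag_gauss_loglik(x, *nonspeech_model)
        labels.append("speech" if ll_s > ll_n else "nonspeech")
    return labels

# Illustrative 2-D "features": high energy ~ speech, low energy ~ silence.
speech_model = ([5.0, 1.0], [1.0, 1.0])      # (mean, variance) - made up
nonspeech_model = ([0.0, 0.0], [1.0, 1.0])
frames = [[5.2, 0.9], [0.1, -0.2], [4.8, 1.1]]
print(label_frames(frames, speech_model, nonspeech_model))
# -> ['speech', 'nonspeech', 'speech']
```

In the systems above this decision is made over segments rather than single frames, and the speech segments are then further clustered and labeled by bandwidth and gender.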
issd98.ps.Z [ISSD'98, Sydney, Nov'98] --------- Spoken Language Dialog System Development and Evaluation at LIMSI Lori Lamel --------------------- The development of natural spoken language dialog systems requires expertise in multiple domains, including speech recognition, natural spoken language understanding and generation, dialog management and speech synthesis. In this paper I report on our experience at LIMSI in the design, development and evaluation of spoken language dialog systems for information retrieval tasks. Drawing upon our experience in this area, I attempt to highlight some aspects of the design process, such as the use of general and task-specific knowledge sources, the need for an iterative development cycle, and some of the difficulties related to evaluation of development progress. tts98atr.ps.Z [3rd Int'l Workshop on Speech Synthesis, Jenolan Caves, Nov'98] --------- Voice quality modification using periodic-aperiodic decomposition and spectral processing of the voice source signal C. d'Alessandro and B. Doval --------------------- Voice quality is currently a key issue in speech synthesis research. The lack of realistic intra-speaker voice quality variation is an important source of concern for concatenation-based synthesis methods. A challenging problem is to reproduce the voice quality changes that occur in natural speech when the vocal effort varies. A new method for voice quality modification is presented. It takes advantage of a spectral theory for voice source signal representation. An algorithm based on periodic-aperiodic decomposition and spectral processing (using the short-term Fourier transform) is described. The use of adaptive inverse filtering in this framework is also discussed. Applications of this algorithm may include: pre-processing of speech corpora, modification of voice quality parameters together with intonation in synthesis, and voice transformation.
Some experiments are reported, showing convincing voice quality modifications for various speakers. tts98b3.ps.Z [3rd Int'l Workshop on Speech Synthesis, Jenolan Caves, Nov'98] --------- Joint evaluation of Text-To-Speech synthesis in French within the AUPELF ARC-B3 project C. d'Alessandro and B3 Partners --------------------- A joint international evaluation of Text-To-Speech synthesis (TTS) systems is being conducted for the French language. This project involves eight laboratories of French-speaking countries (Belgium, Canada, France and Switzerland), and is funded by AUPELF (Association of French Speaking Universities). The results obtained after 2 years of work are presented in this paper. The project is split into 4 tasks: 1/ evaluation of grapheme-to-phoneme conversion; 2/ evaluation of prosody; 3/ evaluation of segment concatenation/modification; 4/ global system evaluation. Grapheme-phoneme conversion evaluation has now been completed, and both methodological issues and results for the eight systems are presented at the workshop. For prosody evaluation, the problem is to study several systems that incorporate different linguistic analyses, different prosodic systems, and different intonation and rhythmic models. Using the same phonemic input and the same concatenation/modification system with the same diphones, it will be possible to assess prosodic quality independently of the other modules. Perceptual evaluation is used at this stage. The degradation introduced by concatenation/modification systems and by the quality of segment databases will be studied using perceptual tests. Evaluation of the global systems will also be performed, using both intelligibility and agreement measures. Finally, one of the aims of this project is to make available corpora and evaluation paradigms that may be reused in future research. This will enable a quantitative analysis of the results obtained, and a measurement of the progress achieved for each specific system.
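Grapheme-to-phoneme evaluation of the kind described above is typically scored by aligning the hypothesized phoneme string against a reference with minimum edit distance. A generic illustration (not the ARC-B3 scoring tool), using a hypothetical SAMPA-like transcription:

```python
# Hedged sketch: phoneme error rate via Levenshtein alignment
# (substitutions, insertions, deletions), a standard way to score
# grapheme-to-phoneme conversion. The transcription below is an
# illustrative example, not from the evaluation corpora.

def edit_distance(ref, hyp):
    """Minimum edit distance between two phoneme sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)]

def phoneme_error_rate(ref, hyp):
    """Errors normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# "bonjour" /b o~ Z u R/ against a hypothesis with one substitution:
ref = ["b", "o~", "Z", "u", "R"]
hyp = ["b", "o~", "z", "u", "R"]
print(phoneme_error_rate(ref, hyp))  # 1 error over 5 phonemes -> 0.2
```

The same alignment underlies the word and phone error rates quoted throughout these abstracts.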
tts98BdM.ps.Z [3rd Int'l Workshop on Speech Synthesis, Jenolan Caves, Nov'98] --------- Text chunking for prosodic phrasing in French P. Boula de Mareuil and C. d'Alessandro --------------------- In this paper, we describe experiments in text chunking for prosodic phrasing and generation in French. We present a quick, robust and deterministic parser which uses part-of-speech information and a set of rules to consistently assign prosodic boundaries in Text-To-Speech synthesis. The syntactic phrasing, consisting of segmenting sentences into non-recursive sequences, is defined in terms of sets of possible categories. The syntax-prosody interface is presented: the sequences enable the location of potential prosodic boundaries (minor, major or intermediate). The first results are given, including a listening test which demonstrated the advantage of our chunk grammar over a simpler approach based on function words and punctuation. Quantitative measures are made on the chunks defined between boundaries. Our model, which prefers structural criteria to probabilities and an approach by intension rather than by extension, is also compared with other models. specom98.ps.Z [SPECOM'98, St. Petersburg] --------- Dialog Strategies in a Tourist Information Spoken Dialog System H. Bonneau-Maynard, L. Devillers --------------------- We describe in this paper the dialog control strategy being developed for the tourist information retrieval system, PARIS-SITI, in the context of the ARC AUPELF-B2 action on spoken dialog system evaluation. The dialog control strategy has been evaluated by testing the efficiency of two systems, with and without automatic mechanisms for guiding the user via suggestive prompts, with both novice and experienced subjects on a set of 4 scenarios performed by each speaker with both systems.
Qualitative and quantitative criteria such as user satisfaction, understanding and recognition error rates, percentage of followed prompts and scenario success are discussed, demonstrating the benefit of the proposed strategy. ivtta98.ps.Z [IVTTA'98, Torino, Sep'98] --------- The LIMSI ARISE System L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet, B. Prouts --------------------- The LIMSI Arise system provides vocal access to rail travel information for main French intercity connections, including timetables, simulated fares, reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open structure, where the user is free to ask any question or to provide any information at any point in time. In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. The same system architecture is being used to develop a French/English prototype timetable service for the high speed trains between Paris and London. asaica98.ps.Z [ASA/ICA'98, Seattle, June'98] --------- Recent Activities in Spoken Language Processing at LIMSI J.L. Gauvain and L. Lamel --------------------- This paper summarizes recent activities at LIMSI in multilingual speech recognition and its applications. While the main goal of speech recognition is to provide a transcription of the speech signal as a sequence of words, the same basic technology serves as the first step in other application areas, such as automatic systems for information access and automatic indexation of audiovisual data.
sfc98.ps.Z [SFC'98, Montpellier] --------- Classification de mots non étiquetés par des méthodes statistiques Christel Beaujard and Michèle Jardino -------------------- We describe two types of word classification, learned statistically from written newspaper texts and speech transcriptions. These classifications do not require word tagging; they are carried out according to the local contexts in which the words are observed. One is based on the Kullback-Leibler distance and distributes all words into a number of classes fixed in advance. The second groups words considered similar into a number of classes that is not predefined. This study was carried out on training data differing in domain, size and vocabulary. tagworkshop98.ps.Z [TAG+ Workshop, Philadelphia, July'98] --------- An improved Earley parser with LTAG Y. de Kercadio --------------------- This paper presents an adaptation of the Earley algorithm to enable parsing with lexicalized tree-adjoining grammars (LTAGs). Each elementary tree is represented by a rewrite rule, so that the adapted version of the Earley algorithm can be used for parsing. This algorithm preserves the valid prefix property. Taking advantage of the top-down strategy and the valid prefix property, improvements in parsing are obtained by using the local span of the trees and some linguistic properties. lrec98minker.ps.Z [LREC'98, Granada, May'98] --------- Evaluation Methodologies for Interactive Speech Systems W. Minker --------------------- In this paper, several criteria and paradigms are described to measure the performance of spoken language systems. The focus is on the evaluation of natural language understanding components. These evaluations are carried out in the domain of spontaneous human-human interaction as supported by automatic translation systems.
They are also applied in the domain of spontaneous human-machine interaction typically used in information retrieval applications. Some system response evaluation paradigms for different applications and domains are discussed in more detail. It is also shown that official performance tests and site-specific evaluation paradigms are complementary in use. lrec98W_minker.ps.Z [LREC'98 Workshop on "The Evaluation of Parsing Systems", Granada, May'98] --------- Evaluating Parses for Spoken Language Dialogue Systems W. Minker and L. Chase --------------------- lrec98ideal.ps.Z [LREC'98, Granada, May'98] --------- A Multilingual Corpus for Language Identification L. Lamel, G. Adda, M. Adda-Decker, C. Corredor-Ardoy, J.J. Gangolf, J.L. Gauvain --------------------- In this paper we describe the design, recording, and transcription of a large, multilingual (French, English, German and Spanish) corpus of telephone speech for research in automatic language identification. The corpus contains over 250 calls from native speakers of each language from their home country, and an additional 50 calls per language from another country. Although the same recording protocol was used for all languages, slight modifications were necessary to account for language or country specificities.
Issues in designing comparable corpora in different languages are addressed, including how to interact with callers so as to obtain the desired responses. pron98.ps.Z [ESCA ETRW on Modeling Pronunciation Variants for ASR, Rolduc, May'98] --------- Pronunciation Variants Across Systems, Languages and Speaking Style Martine Adda-Decker and Lori Lamel --------------------- This contribution aims at evaluating the use of pronunciation variants across different system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription and a phonemically represented lexicon, thus focusing on the modeling abilities of the acoustic word models. Parallel and sequential variants are tested in order to measure the spectral and temporal modeling accuracy. As a preliminary step we investigated the dependence of the aligned variants on the recognizer configuration. A cross-lingual study was carried out for read speech in French and American English using the BREF and the WSJ corpora. A comparison between read and spontaneous speech is presented for French based on alignments from BREF (read) and MASK (spontaneous) data. ica98phrec.ps.Z [ICASSP'98, Seattle, WA, May'98] --------- Multi-lingual Phone Recognition on Telephone Spontaneous Speech C. Corredor-Ardoy, L. Lamel, M. Adda-Decker, J.L. Gauvain --------------------- In this paper we report phone accuracies on telephone spontaneous speech. Language-dependent phone recognizers were trained and assessed on IDEAL, a multi-lingual corpus containing telephone speech in French, British English, German and Castilian Spanish. We discuss phone recognition using context-independent Hidden Markov Models (CIHMM) and back-off phonotactic bigram models built on corpora with different sizes and speech materials.
Although the training corpus containing spontaneous speech includes only 14% of all available data, it represents the best choice for the task. Phone recognition was improved using context-dependent (CD) phone units: average error rates on the 4 languages of 51.9% with CDHMM compared with 57.4% with CIHMM. We also suggest a straightforward way to detect non-speech phenomena. The basic idea is to remove sequences of any consonants between two silence labels from the recognized phone strings prior to scoring. This simple technique reduces the phone error rate by 5.4% relative (average on the 4 languages). rla2c98.ps.Z [RLA2C'98, Avignon, April'98] --------- Speaker Verification Over the Telephone Lori Lamel and Jean-Luc Gauvain --------------------- In this paper we present a study on speaker verification using telephone speech and for two operational modes, i.e. text-dependent and text-independent speaker verification. A statistical modeling approach is taken, where for text-independent verification the talker is viewed as a source of phones, modeled by a fully connected Markov chain, and for text-dependent verification a left-to-right HMM is built by concatenating the phone models corresponding to the transcription. A series of experiments was carried out on a large telephone corpus recorded specifically for speaker verification algorithm development, assessing performance as a function of the type and amount of data used for training and for verification. Experimental results are presented for both read and spontaneous speech. On this data, the lowest equal error rate is 1% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 1.5s of speech per trial.
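The non-speech filtering heuristic mentioned in the phone recognition abstract above can be sketched directly. The phone symbols and the consonant set below are illustrative, not the actual IDEAL phone inventory:

```python
# Hedged sketch: drop any run consisting only of consonants when it is
# bracketed by two silence labels in the recognized phone string, keeping
# a single silence, as described in the ica98phrec abstract.

CONSONANTS = {"p", "t", "k", "b", "d", "g", "f", "s", "S", "m", "n", "l", "R"}
SIL = "sil"

def filter_nonspeech(phones):
    """Remove consonant-only runs bracketed by silence labels."""
    out = []
    i = 0
    while i < len(phones):
        if phones[i] == SIL:
            j = i + 1
            while j < len(phones) and phones[j] in CONSONANTS:
                j += 1
            if j < len(phones) and phones[j] == SIL and j > i + 1:
                # drop this silence and the consonant run; the closing
                # silence is handled on the next iteration, so chained
                # runs collapse to a single silence
                i = j
                continue
        out.append(phones[i])
        i += 1
    return out

# "sil t k sil b a sil" -> the spurious "t k" run is removed:
print(filter_nonspeech(["sil", "t", "k", "sil", "b", "a", "sil"]))
# -> ['sil', 'b', 'a', 'sil']
```

Consonant runs followed by a vowel (real speech) are left untouched; only runs sandwiched between silences, which typically correspond to noise, are discarded before scoring.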
hub4_98.ps.Z [DARPA'98 Broadcast News Transcription and Understanding, Landsdowne, VA, Feb'98] --------- The LIMSI 1997 Hub-4E Transcription System Jean-Luc Gauvain, Lori Lamel, Gilles Adda --------------------- In this paper we report on the LIMSI system used in the Nov'97 Hub-4E benchmark test on transcription of American English broadcast news shows. There are two main differences from the LIMSI system developed for the Nov'96 evaluation. The first concerns the preprocessing stages for partitioning the data, and the second concerns a reduction in the number of acoustic model sets used to deal with the various acoustic signal characteristics. The LIMSI system for the November 1997 Hub-4E evaluation is a continuous mixture density, tied-state cross-word context-dependent HMM system. The acoustic models were trained on the 1995 and 1996 official Hub-4E training data containing about 80 hours of transcribed speech material. The 65K word trigram language models are trained on 155 million words of newspaper texts and 132 million words of broadcast news transcriptions. The test data is segmented and labeled using Gaussian mixture models, and non-speech segments are rejected. The speech segments are classified as telephone or wide-band, and according to gender. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error of the Nov'97 unpartitioned evaluation test data was 18.5%. 
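Several of the broadcast news systems above combine language models estimated on different text sources (e.g. newspaper texts and broadcast news transcriptions) by linear interpolation. A minimal sketch, with made-up probabilities and weights rather than trained values:

```python
# Hedged sketch: linear interpolation of n-gram probabilities from
# several component language models. The component probabilities and
# the interpolation weights below are illustrative, not estimated.

def interpolate(p_models, weights):
    """Weighted sum of the same n-gram probability under several LMs."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * p for w, p in zip(weights, p_models))

# P(word | history) under two hypothetical component models:
p_news = 0.012   # newspaper-text LM (made-up value)
p_bn = 0.020     # broadcast-news-transcription LM (made-up value)
p = interpolate([p_news, p_bn], [0.6, 0.4])
print(round(p, 4))  # 0.6*0.012 + 0.4*0.020 = 0.0152
```

In practice the weights are chosen to minimize perplexity on held-out data; interpolating separately trained models is the alternative to training a single model on weighted texts mentioned in the Hub-4E abstracts.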
convsys98.ps.Z [DARPA'98 Broadcast News Transcription and Understanding, Landsdowne, VA, Feb'98] --------- An Overview of EU Programs Related to Conversational/Interactive Systems Joseph Mariani and Lori Lamel --------------------- Since 1984, the European Commission has supported projects related to spoken language processing within various programs. The projects address various aspects of spoken language processing including basic research, technology development, infrastructure (language resources and evaluation) and applications in different areas (telecommunications, language training, office systems, games, cars, information access and retrieval, banking, aids for the handicapped, etc.). The next program, the 5th Framework Program, is now being prepared, and should unite the activities in the area of Human Language Technology and its components in a single sector. The three sub-areas which have been identified are "Multilinguality", "Natural Interactivity" and "Active Content". Several on-going projects are addressing the design of conversational systems. Three example applications are: the ESPRIT Mask project, which aims to develop a Multimodal-Multimedia Automated Service Kiosk providing access to rail travel information; the LE Arise project, which aims to provide similar information via the telephone for the Dutch, French, and Italian railways; and the LE Vodis project, whose goal is to develop a vocal interface to enhance the usability and functionality of in-car driver assistance and information services. spc97intro.ps.Z [Speech Communication, Vol 23, (1-2):63-65, Oct 1997] --------- RailTel: Railway Telephone Services R. Billi (CSELT), L.F. Lamel --------------------- The aim of the LE-MLAP project RailTel (Railway Telephone Information Service) was to evaluate the technical adequacy of available speech technology for interactive telephone services, taking into account the requirements of the service providers and the users.
In particular the consortium evaluated the specifications and potential for vocal access to rail travel information. The project, coordinated by CSELT (Italy), included two user organizations: British Rail/British Systems (U.K.) and FS (Italy). The French railways (SNCF) was a partner in the MAIS project (the RailTel project worked in close cooperation with the LE-MLAP project MAIS, which addressed the same task for a different geographical area), but contributed to the RailTel specification of user and application requirements, and the evaluation methodology. Three research centers, CCIR (U.K.), CSELT and LIMSI-CNRS (France), provided the spoken language technologies, and service providers Saritel and LTV (Italy) contributed expertise concerning telephone services and management of railway timetable databases. spc97ivtta.ps.Z [Speech Communication, Vol 23, (1-2):67-82, Oct 1997] --------- The LIMSI RailTel System: Field Trials of a Telephone Service for Rail Travel Information L. Lamel, S.K. Bennacef, S. Rosset, L. Devillers, S. Foukia, J.J. Gangolf, J.L. Gauvain --------------------- This paper describes the RailTel system developed at LIMSI to provide vocal access to static train timetable information in French, and a field trial carried out to assess the technical adequacy of available speech technology for interactive services. The data collection system used to carry out the field trials is based on the LIMSI Mask spoken language system and runs on a Unix workstation with a high quality telephone interface. The spoken language system allows a mixed-initiative dialog where the user can provide any information at any point in time. Experienced users are thus able to provide all the information needed for database access in a single sentence, whereas less experienced users tend to provide shorter responses, allowing the system to guide them. The RailTel field trial was carried out using a common methodology defined by the consortium.
100 naive subjects participated in the field trials, each calling the system once and completing a user questionnaire. 72% of the callers successfully completed their scenario. The subjective assessment of the prototype was for the most part favorable, with subjects expressing an interest in using such a service. larynx97.ps.Z [ESCA Workshop Larynx 97, Marseille June'97] --------- Spectral representation and modelling of glottal flow signals C. d'Alessandro and B. Doval --------------------- We present at this workshop our recent work on modeling the glottal flow signal. A signal model comprising a periodic component (related to the quasi-periodic vibration of the vocal folds) and an aperiodic component (related to aspiration noise, frication noise and irregularities in the vibration of the vocal folds) is considered. A periodic-aperiodic decomposition algorithm is presented, together with its evaluation. A linear spectral model of the periodic glottal flow is proposed, and its usefulness for estimating the perceptually relevant source parameters is shown. This work, originally conceived for speech synthesis, can also be applied to other domains, such as the study of voice pathologies or of the singing voice. [The article is in English.]
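The periodic-aperiodic decomposition underlying the glottal-flow work above can be illustrated very crudely: given a known pitch period, average the waveform over periods to estimate the periodic component and take the residual as the aperiodic component. This is only a toy illustration of the idea, not the algorithm of the paper:

```python
# Hedged sketch: split a signal into periodic and aperiodic parts by
# period-synchronous averaging, assuming a known, constant pitch period
# T in samples. Real decomposition algorithms handle varying pitch and
# work in the spectral domain; this is illustrative only.

def periodic_aperiodic(signal, T):
    """Return (periodic, aperiodic) components for period length T."""
    n_periods = len(signal) // T
    # average waveform over the complete periods
    template = [
        sum(signal[k * T + t] for k in range(n_periods)) / n_periods
        for t in range(T)
    ]
    periodic = [template[t % T] for t in range(len(signal))]
    aperiodic = [s - p for s, p in zip(signal, periodic)]
    return periodic, aperiodic

# A perfectly periodic toy signal leaves a zero aperiodic residual:
sig = [0.0, 1.0, 0.0, -1.0] * 3
periodic, aperiodic = periodic_aperiodic(sig, 4)
print(max(abs(a) for a in aperiodic))  # -> 0.0
```

For a real voice signal the residual would capture aspiration and frication noise and vibration irregularities, which is exactly the aperiodic component these papers model separately.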
specom97.ps.Z [SPECOM'97, Cluj-Napoca, Roumanie] --------- Evaluation of a class-based language model in a speech recognizer Christel Beaujard, Michèle Jardino and Hélène Maynard -------------------- We present a novel corpus-based clustering algorithm which takes into account a word-context similarity criterion between words of a training text. This work has been done with a database recorded at LIMSI, consisting of rail travel information requests. The induced class-based model has been evaluated and compared with prior models. Each model has been integrated in the LIMSI speech recognizer. The new similarity class-based model gives the best performance in terms of both perplexity and recognition rates. euro97doval.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Spectral methods for voice source parameter estimation B. Doval, C. d'Alessandro, B. Diard --------------------- A spectral approach is proposed for voice source parameter representation and estimation. Parameter estimation is based on decomposition of the periodic and the aperiodic components of the speech signal, and on spectral modelling of the periodic component. The paper focuses on parameter estimation for the periodic component of the glottal flow. A new anticausal all-pole model of the glottal flow is derived. Glottal flow is seen as an anticausal 2-pole filter followed by a spectral tilt filter. The anticausal filter has complex poles, instead of the real poles that are usually assumed. Time-domain and frequency-domain parameters are linked by analytic formulas. Two spectral-domain algorithms are proposed for estimation of the open quotient. The first one is based on measurement of the first harmonics, and the second one is based on spectral modelling. Experimental results demonstrate the accuracy of the estimation procedures. euro97h4.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Transcription of Broadcast News J.L. Gauvain, L. Lamel, G. Adda, M.
Adda-Decker --------------------- In this paper we report on our recent work in transcribing broadcast news shows. Radio and television broadcasts contain signal segments of various linguistic and acoustic natures. The shows contain both prepared and spontaneous speech. The signal may be studio quality or have been transmitted over a telephone or other noisy channel (i.e., corrupted by additive noise and nonlinear distortions), or may contain speech over music. Transcription of this type of data poses challenges in dealing with the continuous stream of data under varying conditions. Our approach to this problem is to segment the data into a set of categories, which are then processed with category-specific acoustic models. We describe our 65k speech recognizer and experiments using different sets of acoustic models for transcription of broadcast news data. The use of prior knowledge of the segment boundaries and types is shown not to crucially affect performance. euro97lid.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Language Identification with Language-Independent Acoustic Models C. Corredor Ardoy, J.L. Gauvain, M. Adda-Decker, L. Lamel --------------------- In this paper we explore the use of language-independent acoustic models for language identification (LID). The phone sequence output by a single language-independent phone recognizer is rescored with language-dependent phonotactic models approximated by phone bigrams. The language-independent phoneme inventory was obtained by Agglomerative Hierarchical Clustering, using a measure of similarity between phones. This system is compared with a parallel language-dependent phone architecture, which makes optimal use of the acoustic log-likelihood and the phonotactic score for language identification. Experiments were carried out on the 4-language telephone speech corpus IDEAL, containing calls in British English, Spanish, French and German.
Results show that the language-independent approach performs as well as the language-dependent one: a 9% versus 10% error rate on 10-second chunks for the 4-language task. euro97bref.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Text Normalization and Speech Recognition in French G. Adda, M. Adda-Decker, J.L. Gauvain, L. Lamel --------------------- In this paper we present a quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French. The text normalization process defines what is considered to be a word by the recognition system. Depending on this definition we can measure different lexical coverages and language model perplexities, both of which are closely related to the speech recognition accuracies obtained on read newspaper texts. Different text normalizations of up to 185M words of newspaper texts are presented along with corresponding lexical coverage and perplexity measures. Some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard. The choice of normalization to create language models for use in the recognition experiments with read newspaper texts was based on these findings. Our best system configuration obtained an 11.2% word error rate in the AUPELF `French-speaking' speech recognizer evaluation test held in February 1997. euro97minker.ps.Z [EUROSPEECH'97, Rhodes, Sep'97] --------- Stochastically-Based Natural Language Understanding Across Tasks and Languages W. Minker --------------------- A stochastically-based method for natural language understanding has been ported from the American ATIS (Air Travel Information Services) task to the French MASK (Multimodal-Multimedia Automated Service Kiosk) task. The porting was carried out by designing and annotating a corpus of semantic representations via semi-automatic iterative labeling.
The study shows that domain and language porting is rather flexible, since it is sufficient to train the system on data sets specific to the application and language. A limiting factor of the current implementation is the quality of the semantic representation and the use of query preprocessing strategies, both of which depend strongly on human intervention. The performance of the stochastically-based method is compared with that of a rule-based method on both tasks. darpa97hub4.ps.Z [ARPA Speech Recognition Workshop '97, Chantilly, VA, Feb'97] --------- Transcribing Broadcast News: The LIMSI Nov96 Hub4 System J.L. Gauvain, G. Adda, L. Lamel, M. Adda-Decker --------------------- In this paper we report on the LIMSI Nov96 Hub4 system for transcription of broadcast news shows. We describe the development work in moving from laboratory read speech data to real-world speech data in order to build a system for the ARPA Nov96 evaluation. Two main problems were addressed to deal with the continuous flow of inhomogeneous data: the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and the different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The speech recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on large text corpora. The base acoustic models were trained on the WSJ0/WSJ1 corpus, and adapted using MAP estimation with 35 hours of transcribed task-specific training data. The 65k language models are trained on 160 million words of newspaper texts and 132 million words of broadcast news transcriptions. The problem of segmenting the continuous stream of data was investigated using 10 MarketPlace shows. The overall word transcription error on the Nov96 partitioned evaluation test data was 27.1%.
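The phonotactic rescoring described in the euro97lid abstract above can be sketched in a few lines: a single phone recognizer produces one phone string, per-language phone-bigram models score it, and the best-scoring language is hypothesized. This is a minimal illustration under invented data, not the paper's implementation: the phone strings, inventories, and add-alpha smoothing here are all made up, and a real system would first decode the phones with language-independent acoustic models.

```python
import math
from collections import defaultdict

def train_phone_bigram(phone_sequences, alpha=1.0):
    """Estimate an add-alpha smoothed phone-bigram model; returns a log-prob function."""
    bigram = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        seq = ["<s>"] + seq
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigram[prev][cur] += 1.0
    n_vocab = len(vocab)

    def logprob(prev, cur):
        counts = bigram[prev]
        total = sum(counts.values())
        return math.log((counts[cur] + alpha) / (total + alpha * n_vocab))
    return logprob

def identify_language(phone_seq, models):
    """Rescore one decoded phone string with each language's phonotactic model."""
    seq = ["<s>"] + phone_seq
    scores = {lang: sum(lp(p, c) for p, c in zip(seq, seq[1:]))
              for lang, lp in models.items()}
    return max(scores, key=scores.get)

# toy training strings (invented pseudo-phone transcriptions)
models = {
    "french":  train_phone_bigram([["b", "o~", "Z", "u", "R"], ["s", "a", "v", "a"]]),
    "english": train_phone_bigram([["h", "@", "l", "oU"], ["g", "U", "d"]]),
}
print(identify_language(["b", "o~", "Z", "u", "R"], models))
```

The argmax over summed bigram log-probabilities plays the role of the phonotactic score; the paper's parallel architecture additionally weighs in the acoustic log likelihood.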
ica97doval.ps.Z [ICASSP'97, Munich, April'97] --------- Spectral Correlates of Glottal Waveform Models: An Analytic Study B. Doval and C. d'Alessandro --------------------- This paper deals with spectral representation of the glottal flow. The LF and the KLGLOTT88 models of the glottal flow are studied. In a first part, we compute analytically the spectrum of the LF-model. Formulas are then given for computing spectral tilt and the amplitudes of the first harmonics as functions of the LF-model parameters. In a second part we consider the spectrum of the KLGLOTT88 model. It is shown that this model can be represented in the spectral domain by an all-pole third-order linear filter. Moreover, the anticausal impulse response of this filter is a good approximation of the glottal flow model. Parameter estimation seems easier in the spectral domain. Therefore our results can be used for modification of the (hidden) glottal flow characteristics of natural speech signals, by processing the spectrum directly, without needing time-domain parameter estimation. ica97hub4.ps.Z [ICASSP'97, Munich, April'97] --------- Transcribing Broadcast News Shows J.L. Gauvain, G. Adda, L. Lamel, M. Adda-Decker --------------------- While significant improvements have been made over the last 5 years in large vocabulary continuous speech recognition of large read-speech corpora such as the ARPA Wall Street Journal-based CSR corpus (WSJ) for American English and the BREF corpus for French, these tasks remain relatively artificial. In this paper we report on our development work in moving from laboratory read speech data to real-world speech data in order to build a system for the new ARPA broadcast news transcription task. The LIMSI Nov96 speech recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts. The acoustic models are trained on the WSJ0/WSJ1 corpora, and adapted using MAP estimation with task-specific training data.
The overall word error on the Nov96 partitioned evaluation test was 27.1%. ica97sv.ps.Z [ICASSP'97, Munich, April'97] --------- Speaker Recognition with the Switchboard corpus Lori Lamel and Jean-Luc Gauvain --------------------- In this paper we present our development work carried out in preparation for the March'96 speaker recognition test on the Switchboard corpus organized by NIST. The speaker verification system evaluated was a Gaussian mixture model. We provide experimental results on the development test and evaluation test data, and some experiments carried out since the evaluation comparing the GMM with a phone-based approach. Better performance is obtained by training on data from multiple sessions and with different handsets. High error rates are obtained even using a phone-based approach, both with and without the use of orthographic transcriptions of the training data. We also describe a human perceptual test carried out on a subset of the development data, which demonstrates the difficulty human listeners had with this task. ica97matrouf.ps.Z [ICASSP'97, Munich, April'97] --------- Model Compensation for Noises in Training and Test Data Driss Matrouf and Jean-Luc Gauvain --------------------- It is well known that the performance of speech recognition systems degrades rapidly as the mismatch between the training and test conditions increases. Approaches to compensate for this mismatch generally assume that the training data is noise-free and the test data is noisy. In practice, this assumption is seldom correct. In this paper, we propose an iterative technique to compensate for noise in both the training and test data. The adopted approach compensates the speech model parameters using the noise present in the test data, and compensates the test data frames using the noise present in the training data. The training and test data are assumed to come from different and unknown microphones and acoustic environments.
The benefit of such a compensation scheme has been assessed on the MASK task using a continuous density HMM-based speech recognizer. Experimental results show the advantage of compensating for both test and training noises. jst97corpusB2.ps.Z [JST'97, Avignon, April'97] --------- A Spoken Corpus of Tourist Information S. Rosset, L. Lamel, S. Bennacef, L. Devillers, J.L. Gauvain --------------------- This paper describes LIMSI's activities in acquiring human-machine dialog corpora for a tourist information task in the context of the AUPELF B2 action: Oral Dialog. Our experience in developing understanding systems for train and air travel fare and timetable information tasks (ATIS and Mask) has taught us that it is advantageous to put subjects in front of a system as early as possible. This makes it possible to iteratively improve the system's performance and functionality based on the interactions between the subjects and the system. A first spoken language understanding module has been developed (case-grammar analysis, with restricted dialog and response generation components for the first phase). This module was used to acquire a first corpus of written utterances. These utterances were used to build an initial lexicon and language model for a complete dialog system, which is currently being used to acquire a spoken corpus. ieice96.ps.Z [Institute of Electronics, Information and Communication Engineers, J79-D-II:2005-2021, Dec. 1999] --------- Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications J.L. Gauvain and L.
Lamel --------------------- This paper provides an overview of the state of the art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited-domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research. arpa96nab.ps.Z [ARPA CSR'96, Arden House, Harriman, NY, Feb'96] --------- The LIMSI 1995 Hub3 System J.L. Gauvain, L. Lamel, G. Adda, D. Matrouf --------------------- In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News Hub 3 benchmark test. The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs.
In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. In contrast to previous evaluations, the new Hub 3 test aimed at improving basic SI, CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). On the Sennheiser microphone (average SNR 29dB) a word error of 9.1% was obtained, which can be compared to 17.5% on the secondary microphone data (average SNR 15dB) using the same recognition system. ica96nab.ps.Z [ICASSP'96, Atlanta, May'96] --------- Developments in Continuous Speech Dictation using the 1995 ARPA NAB News Task J.L. Gauvain, L. Lamel, G. Adda, D. Matrouf --------------------- In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News benchmark test. In contrast to previous evaluations, the new Hub 3 test aims at improving basic SI, CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method.
On the Sennheiser microphone (average SNR 29dB) a word error of 9.1% was obtained, which can be compared to 17.5% on the secondary microphone data (average SNR 15dB) using the same recognition system. ica96ger.ps.Z [ICASSP'96, Atlanta, May'96] --------- Developments in Large Vocabulary, Continuous Speech Recognition of German M. Adda-Decker, G. Adda, L. Lamel, J.L. Gauvain --------------------- In this paper we describe our large vocabulary continuous speech recognition system for the German language, the development of which was partly carried out within the context of the European LRE project 62-058 SQALE. The recognition system is the LIMSI recognizer originally developed for French and American English, which has been adapted to German. Specificities of German, as relevant to the recognition system, are presented. These specificities have been accounted for during the recognizer's adaptation process. We present experimental results on a first test set ger-dev95 to measure progress in system development. Results are given with the final system using different acoustic model sets on two test sets, ger-dev95 and ger-eval95. This system achieved a word error rate of 17.3% (official word error rate of 16.1% after the SQALE adjudication process) on the ger-eval95 test set. ica96jardino.ps.Z [ICASSP'96, Atlanta, May'96] --------- Multilingual stochastic n-gram class language models Michele Jardino --------------------- Stochastic language models are widely used in continuous speech recognition systems where a priori probabilities of word sequences are needed. These probabilities are usually given by n-gram word models, estimated on very large training texts. When n increases, it becomes harder to find reliable statistics, even with huge texts. Grouping words into classes is a way to overcome this problem.
We have developed an automatic, language-independent classification procedure, which is able to optimize the classification of tens of millions of untagged words in less than a few hours on a Unix workstation. With this language-independent approach, three corpora each containing about 30 million words of newspaper texts, in French, German and English, have been mapped into different numbers of classes. From these classifications, bigram and trigram class language models have been built. The perplexities of held-out test texts have been assessed, showing that trigram class models give lower values than those obtained with trigram word models, for all three languages. ivtta96.ps.Z [IVTTA'96, BaskingRidge, Sep'96] --------- Field Trials of a Telephone Service for Rail Travel Information L.F. Lamel, J.L. Gauvain, S.K. Bennacef, L. Devillers, S. Foukia, J.J. Gangolf, S. Rosset -------------------- This paper reports on the RAILTEL field trial carried out by LIMSI to assess the technical adequacy of available speech technology for interactive vocal access to static train timetable information. The data collection system used to carry out the field trials is based on the LIMSI MASK spoken language system and runs on a Unix workstation with a high quality telephone interface. The spoken language system allows a mixed-initiative dialog where the user can provide any information at any point in time. Experienced users are thus able to provide all the information needed for database access in a single sentence, whereas less experienced users tend to provide shorter responses, allowing the system to guide them. The RAILTEL field trial was carried out using a commonly agreed-upon methodology. 100 naive subjects participated in the field trials, each contributing one call and completing a user questionnaire. 72% of the callers successfully completed their scenarios.
The subjective assessment of the service was for the most part favorable, with subjects expressing interest in using such a service. icslp96lex.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- On Designing Pronunciation Lexicons for Large Vocabulary, Continuous Speech Recognition Lori Lamel and Gilles Adda --------------------- Creation of pronunciation lexicons for speech recognition is widely acknowledged to be an important, but labor-intensive, aspect of system development. Lexicons are often manually created and make use of knowledge and expertise that is difficult to codify. In this paper we describe our American English lexicon developed primarily for the ARPA WSJ/NAB tasks. The lexicon is phonemically represented, and contains alternate pronunciations for about 10% of the words. Tools have been developed to add new lexical items, as well as to help ensure consistency of the pronunciations. Our experience in large vocabulary, continuous speech recognition is that systematic lexical design can improve system performance. Some comparative results with commonly available lexicons are given. icslp96maskreco.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Speech Recognition for an Information Kiosk J.L. Gauvain, J.J. Gangolf, L. Lamel --------------------- In the context of the ESPRIT MASK project we face the problem of adapting a ``state-of-the-art'' laboratory speech recognizer for use in the real world with naive users. The speech recognizer is a software-only system that runs in real-time on a standard Risc processor. All aspects of the speech recognizer have been reconsidered from signal capture to adaptive acoustic models and language models. The resulting system includes such features as microphone selection, response cancellation, noise compensation, query rejection capability and decoding strategies for real-time recognition. icslp96maskwoz.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Data Collection for the MASK Kiosk: WOz vs Prototype System A. 
Life, I. Salter, J.N. Temem, F. Bernard, S. Rosset, S. Bennacef, L. Lamel --------------------- The MASK consortium is developing a prototype multimodal multimedia service kiosk for train travel information and reservation, exploiting state-of-the-art speech technology. In this paper we report on our efforts aimed at evaluating alternative user interface designs and at obtaining acoustic and language modeling data for the spoken language component of the overall system. Simulation methods with increasing degrees of complexity have been used to test the interface design, and to experiment with alternate operating modes. The majority of the speech data has been collected using successive versions of the spoken language system, whose capabilities are incrementally expanded after analysis of the most frequent problems encountered by users of the preceding version. The different data requirements of user interface design and speech corpus acquisition are discussed in light of the experience of the MASK project. icslp96minker.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- A Stochastic Case Frame Approach for Natural Language Understanding Wolfgang Minker, Samir Bennacef, Jean-Luc Gauvain --------------------- A stochastically based approach for the semantic analysis component of a natural spoken language system for the ATIS task has been developed. The semantic analyzer of the spoken language system already in use at LIMSI makes use of a rule-based case grammar. In this work, the system of rules for the semantic analysis is replaced with a relatively simple, first order Hidden Markov Model. The performance of the two approaches can be compared because they use identical semantic representations despite their rather different methods for meaning extraction. We use an evaluation methodology that assesses performance at different semantic levels, including the database response comparison used in the ARPA ATIS paradigm. 
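The stochastic case-frame abstract above describes replacing hand-written semantic rules with a first-order Hidden Markov Model: hidden states are semantic labels, observations are words, and decoding recovers the most likely label sequence. The Viterbi sketch below illustrates that decoding step only; the labels ("CITY", "FILLER"), vocabulary, and probabilities are invented for the example and are not the LIMSI system's actual inventory.

```python
import math

def viterbi(words, states, log_init, log_trans, log_emit):
    """Most likely hidden label sequence for an observed word sequence."""
    # best[s] = best log-prob of any path ending in state s; backs for traceback
    best = {s: log_init[s] + log_emit[s].get(words[0], -1e9) for s in states}
    backs = []
    for w in words[1:]:
        back, new = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            back[s] = prev
            new[s] = best[prev] + log_trans[prev][s] + log_emit[s].get(w, -1e9)
        backs.append(back)
        best = new
    last = max(states, key=best.get)
    path = [last]
    for back in reversed(backs):
        path.append(back[path[-1]])
    return list(reversed(path))

# toy model: two semantic labels for an air-travel query fragment
ln = math.log
states = ["CITY", "FILLER"]
log_init = {"CITY": ln(0.3), "FILLER": ln(0.7)}
log_trans = {"CITY": {"CITY": ln(0.4), "FILLER": ln(0.6)},
             "FILLER": {"CITY": ln(0.5), "FILLER": ln(0.5)}}
log_emit = {"CITY": {"boston": ln(0.5), "denver": ln(0.5)},
            "FILLER": {"flights": ln(0.4), "to": ln(0.4), "from": ln(0.2)}}
print(viterbi(["flights", "to", "boston"], states, log_init, log_trans, log_emit))
```

In a real system the transition and emission probabilities would be estimated from the annotated corpus of semantic representations, which is what makes the annotation effort described in the abstract worthwhile.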
icslp96ml.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Spoken Language Processing in a Multilingual Context L.F. Lamel, M. Adda-Decker, J.L. Gauvain, G. Adda --------------------- In this paper we overview the spoken language processing activities at LIMSI, which are carried out in a multilingual framework. These activities include speech-to-text conversion, spoken language systems for information retrieval, speaker and language recognition, and speech response. The Spoken Language Processing Group has also been actively involved in corpora development and evaluation. The group has regularly participated in evaluations organized by ARPA, in the LE-SQALE project, and in the AUPELF-UREF program for provision of linguistic resources and evaluation tests for French. icslp96rtel.ps.Z [ICSLP'96, Philadelphia, Oct'96] --------- Dialog in the RAILTEL Telephone-Based System S. Bennacef, L. Devillers, S. Rosset, L. Lamel --------------------- Dialog management is of particular importance in telephone-based services. In this paper we describe our recent activities in dialog management and natural language generation in the LIMSI RAILTEL system for access to rail travel information. The aim of the LE-MLAP project RAILTEL was to assess the capabilities of spoken language technology for interactive telephone information services. Because all interaction is over the telephone, oral dialog management and response generation are very important aspects of the overall system design and usability. Each dialog is analysed to determine the source of any errors (speech recognition, understanding, information retrieval, processing, or dialog management). An analysis is provided for 100 dialogs taken from the RAILTEL field trials with naive subjects accessing timetable information. asru95mask.ps.Z [1995 IEEE SPS Workshop on Automatic Speech Recognition Snowbird, Utah, Dec'95] --------- Spoken Language System Development for the MASK Kiosk J.L. Gauvain, S. Bennacef, L. Devillers, L.
Lamel --------------------- Spoken language systems aim to provide a natural interface between humans and computers through the use of simple and natural dialogues, so that users can access stored information. In this paper we present our recent activities in developing such a system for the ESPRIT MASK (Multimodal-Multimedia Automated Service Kiosk) project. The goal of the MASK project is to develop a multimodal, multimedia service kiosk to be located in train stations, so as to improve the user-friendliness of the service. The service kiosk will allow the user to speak to the system, as well as to use a touch screen and keypad. The role of LIMSI in the project is to develop the spoken language system which will be incorporated in the kiosk. High quality speech synthesis, graphics, animation and video will be used to provide feedback to the user. We have developed a complete spoken language data collection system for this task. In the actual service kiosk the spoken language system has to be modified so that the multimodal level transaction manager can control the overall dialog. The main information provided by the system is access to rail travel information such as timetables, tickets and reservations, as well as services offered on the trains, and fare-related restrictions and supplements. Other important travel information such as up-to-date departure and arrival times and track information will also be provided. Eventual extensions to the system will enable the user to obtain additional information about the train station and local tourist information, such as restaurants, hotels, and attractions in the surrounding area.
asru95ml.ps.Z [1995 IEEE SPS Workshop on Automatic Speech Recognition Snowbird, Utah, Dec'95] --------- Speech Recognition of European Languages Lori Lamel and Renato DeMori --------------------- In this paper we aim to provide a basic overview of the main ongoing efforts in large vocabulary, continuous speech recognition for European languages. We address issues in acoustic modeling, lexical representation, and language modeling. In addition to presenting a snapshot of speech recognition in different European languages, we try to highlight language-specific characteristics that must be taken into account in developing a recognition system for a given language. Other issues touched upon are the availability of training and testing data in the different languages, the choice of recognition units, lexical coverage, language specificities such as the frequencies of homophones, monophone words, compounding, liaison, and phonological effects (reduction), and multilingual evaluation. csl95-nl.ps.Z [Computer Speech & Language, 9(1), pp. 87-103, Jan. 1995] --------- A Phone-based Approach to Non-Linguistic Speech Feature Identification L. Lamel and J.L. Gauvain --------------------- In this paper we present a general approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal by feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective for text-independent gender, speaker, and language identification. Text-independent speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora.
Experiments in which speaker-specific models were estimated without using the phonetic transcriptions for the TIMIT speakers yielded the same identification accuracies as those obtained with the transcriptions. French/English language identification is better than 99% with 2s of read, laboratory speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy with 10s of speech. Ten-language identification accuracy using the OGI corpus is 59.7% with 10s of signal. slt95.ps.Z [ARPA Workshop Spoken Language Technology, Austin, TX, Jan 1995] --------- Developments in Large Vocabulary Dictation: The LIMSI Nov94 NAB System J.L. Gauvain, L. Lamel, M. Adda-Decker --------------------- In this paper we report on our development work in large vocabulary, American English continuous speech dictation on the ARPA NAB task in preparation for the November 1994 evaluation. We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary of 65k words so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modification of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke Benchmark test. Experimental results on development and evaluation test data are given, as well as an analysis of the errors on the development data. ica95lv.ps.Z [ICASSP'95, Detroit, MI, May 1995] --------- Developments in Continuous Speech Dictation using the ARPA WSJ Task J.L. Gauvain, L. Lamel, M. Adda-Decker --------------------- In this paper we report on our recent development work in large vocabulary, American English continuous speech dictation.
We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modification of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke Benchmark test. Experimental results for development and evaluation test data are given, as well as an analysis of the errors on the development data. esca-sls95.ps.Z [ESCA ETRW Spoken Dialog Systems, Vigsø, Denmark, May 1995] --------- Recent Developments in Spoken Language Systems for Information Retrieval L. Lamel, S. Bennacef, H. Bonneau-Maynard, S. Rosset, J.L. Gauvain --------------------- In this paper we present our recent activities in developing systems for vocal access to a database information retrieval system, for three applications in a travel domain: l'Atis, Mask and RailTel. The aim of this work is to use spoken language to provide a user-friendly interface with the computer. The spoken language system integrates a speech recognizer (based on HMMs with statistical language models), a natural language understanding component (based on a caseframe semantic analyzer and including a dialog manager) and an information retrieval and response generation component. We present the development status of prototype spoken language systems for l'Atis and Mask being used for data collection. euro95slc.ps.Z [Eurospeech'95, Madrid, Sep'95] --------- Development of Spoken Language Corpora for Travel Information L. Lamel, S. Rosset, S. Bennacef, H. Bonneau-Maynard, L. Devillers, J.L. Gauvain --------------------- In this paper we report on our ongoing work in developing spoken language corpora in the context of information access in two travel domain tasks, l'Atis and Mask.
The collection of spoken language corpora remains an important research area and represents a significant portion of the work in developing spoken language systems. The use of additional acoustic and language model training data has been shown to almost systematically improve performance in continuous speech recognition. Similarly, progress in spoken language understanding is closely linked to the availability of spoken language corpora. We record subjects on a regular basis using development versions of the spoken language systems for both tasks, obtaining over 1000 queries/month from 20 subjects. To help assess our progress in system development, since March'95 each subject completes a questionnaire addressing the user-friendliness, reliability, and ease of use of the Mask data collection system. euro95sv.ps.Z [Eurospeech'95, Madrid, Sep'95] --------- Experiments with Speaker Verification over the Telephone J.L. Gauvain, L.F. Lamel, B. Prouts --------------------- In this paper we present a study on speaker verification showing achievable performance levels for both high quality speech and telephone speech, and for two operational modes, i.e. text-dependent and text-independent speaker verification. A statistical modeling approach is taken, where for text-independent verification the talker is viewed as a source of phones, modeled by a fully connected Markov chain in which the lexical and syntactic structures of the language are approximated by local phonotactic constraints. A first series of experiments was carried out on high quality speech from the BREF corpus to validate this approach and resulted in an a posteriori equal error rate of 0.3% in text-dependent as well as text-independent mode. A second series of experiments was carried out on a telephone corpus recorded specifically for speaker verification algorithm development.
On this data, the lowest equal error rate is 2.9% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 2s of speech per trial. euro95sqale.ps.Z [Eurospeech'95, Madrid, Sep'95] --------- Issues in Large Vocabulary, Multilingual Speech Recognition L. Lamel, M. Adda-Decker, J.L. Gauvain --------------------- In this paper we report on our activities in multilingual, speaker-independent, large vocabulary continuous speech recognition. The multilingual aspect of this work is of particular importance in Europe, where each country has its own national language. Our existing recognizer for American English and French has been ported to British English and German. It has been assessed in the context of the LRE Sqale project, whose objective was to experiment with establishing a multilingual evaluation paradigm in Europe for the assessment of large vocabulary, continuous speech recognition systems. The recognizer makes use of phone-based continuous density HMM for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. The system has been evaluated on a dictation task with read, newspaper-based corpora: the ARPA Wall Street Journal corpus of American English, the WSJCAM0 corpus of British English, the BREF-LeMonde corpus of French and the PHONDAT-Frankfurter Rundschau corpus of German. Under closely matched conditions, the average word accuracy across all 4 languages is 85%, obtained with an open-vocabulary test and 20k trigram systems (64k system for German). mask95hcs.ps.Z [Human Comfort & Security, Brussels, Oct'95] --------- The Spoken Language Component of the MASK Kiosk J.L. Gauvain, S. Bennacef, L. Devillers, L.F. Lamel, S. Rosset --------------------- The aim of the Multimodal-Multimedia Automated Service Kiosk (MASK) project is to pave the way for more advanced public service applications with user interfaces employing multimodal, multi-media input and output.
The project has analyzed the technological requirements in the context of users and the tasks they perform in carrying out travel enquiries, and has developed a prototype information kiosk that will be installed in the Gare St. Lazare in Paris. The kiosk will improve the effectiveness of such services by enabling interaction through the coordinated use of multimodal inputs (speech and touch) and multimedia output (sound, video, text, and graphics), and in doing so create the opportunity for new public services. Vocal input is managed by a spoken language system, which aims to provide a natural interface between the user and the computer through the use of simple and natural dialogs. In this paper the architecture and the capabilities of the spoken language system are described, with emphasis on the speaker-independent, large vocabulary continuous speech recognizer, the natural language component (including semantic analysis and dialog management), and the response generator. We also describe our data collection and evaluation activities, which are crucial to system development. lim9512.ps.Z [LIMSI technical report no. 9512] --------- An English Version of the LIMSI L'ATIS System Wolfgang Minker --------------------- Spoken language systems aim to provide a natural interface between humans and computers with which users can access stored information. Such systems should be sufficiently robust to cope with naturally occurring spontaneous speech effects that arise in applications with human-machine interaction. The LIMSI spoken language research is oriented around several tasks in the travel domain. The L'ATIS system is a French spoken language system for vocal access to the Air Travel Information Services (ATIS) task database, a designated common task for data collection and evaluation within the ARPA Speech and Natural Language (NL) program. The system includes a speech recognizer, an understanding component and a component handling database query and response generation.
A case frame approach is used for the NL understanding component. This unit determines the meaning of the query and builds an appropriate semantic frame representation. The semantic frame is then used to generate a database request to the database management system and a response is returned to the user. This report reviews the main characteristics of the LIMSI spoken language information retrieval system and demonstrates the flexibility of the case frame approach by describing the process of porting the French NL component to American English. Understanding results are presented in accordance with the methods used in the ARPA spoken and natural language evaluation paradigm. forwiss94.ps.Z [CRIM/FORWISS Workshop on Progress and Prospects of Speech Research and Technology, Munich, Sep. 1994] --------- Continuous Speech Dictation at LIMSI J.L. Gauvain, L.F. Lamel, M. Adda-Decker --------------------- One of our major research activities at LIMSI is multilingual, speaker-independent, large vocabulary speech dictation. The multilingual aspect of this work is of particular importance in Europe, where each country has its own national language. Speaker-independence and large vocabulary are characteristics necessary to envision real world applications of such technology. The recognizer makes use of phone-based continuous density HMM for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. The system has been evaluated on two dictation tasks developed with read, newspaper-based corpora: the ARPA Wall Street Journal corpus of American English and the BREF LeMonde corpus of French. Experimental results under closely matched conditions are reported. For both languages an average word accuracy of 95% is obtained for a 5k vocabulary test. For a 20,000 word lexicon with an unrestricted vocabulary test, the word error for WSJ is 10% and for BREF is 16%.
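The word accuracy and word error figures quoted throughout these abstracts are conventionally computed from a minimum edit-distance (Levenshtein) alignment of the recognizer hypothesis against the reference transcription. A minimal sketch of that scoring, as an illustration only (not the ARPA/NIST scoring software used in the evaluations):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard dynamic-programming edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Word accuracy as reported above is simply 100% minus the word error rate (when no insertions push the error above 100%).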
slt94.ps.Z [ARPA Workshop on Spoken Language Technology, Princeton, NJ, March'94] --------- The LIMSI Nov93 WSJ System J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report on the LIMSI Wall Street Journal system which was evaluated in the November 1993 test. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The decoding is carried out in two forward acoustic passes. The first pass is a time-synchronous graph-search, which is shown to still be viable with vocabularies of up to 20k words when used with bigram back-off language models. The second pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. The official Nov93 evaluation results are given for vocabularies of up to 64,000 words, as well as results on the Nov92 5k and 20k test material. slt94sqale.ps.Z [ARPA Workshop on Spoken Language Technology, Princeton, NJ, March'94] --------- SQALE: Speech Recognizer Quality Assessment for Linguistic Engineering Herman J.M. Steeneken and Lori Lamel --------------------- The objective of the SQALE project (Speech recognizer Quality Assessment for Linguistic Engineering) is to experiment with establishing an evaluation paradigm in Europe for the assessment of large-vocabulary, continuous speech recognition systems in a multilingual environment. hlt94.ps.Z [ARPA Workshop on Human Language Technology, Princeton, NJ, March'94] --------- The LIMSI Continuous Speech Dictation System J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation.
In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported. For both corpora word recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. map93.ps.Z [IEEE Trans. on Speech and Audio, April 1994] --------- Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains J.L. Gauvain and Chin-Hui Lee --------------------- In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of prior densities and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded and MAP estimation formulas are developed. 
Prior density estimation issues are discussed for two classes of applications, parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications. ica94lid.ps.Z [ICASSP'94 Adelaide, Australia, May 1994] --------- Language Identification Using Phone-based Acoustic Likelihoods L.F. Lamel and J.L. Gauvain --------------------- In this paper we apply the technique of phone-based acoustic likelihoods to the problem of language identification. The basic idea is to process the unknown speech signal by language-specific phone model sets in parallel, and to hypothesize the language associated with the model set having the highest likelihood. Using laboratory quality speech the language can be identified as French or English with better than 99% accuracy with as little as 2s of speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy with 10s of speech. The 10-language identification rate using the OGI corpus is 59.7% with 10s of signal. ica94lv.ps.Z [ICASSP'94 Adelaide, Australia, May 1994] --------- The LIMSI Continuous Speech Dictation System: Evaluation on the ARPA Wall Street Journal Task J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using the ARPA Wall Street Journal-based CSR corpus. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with vocabularies of up to 20K words when used with bigram back-off language models.
A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. The recognizer has been evaluated in the Nov92 and Nov93 ARPA tests for vocabularies of up to 20,000 words. icslp94latis.ps.Z [ICSLP'94, Yokohama, Japan, Sept'94] --------- A Spoken Language System For Information Retrieval S.K. Bennacef, H. Bonneau-Maynard, J.L. Gauvain, L. Lamel, W. Minker --------------------- Spoken language systems aim to provide a natural interface between humans and computers by using simple and natural dialogues to enable the user to access stored information. The LIMSI spoken language work is being pursued in several task domains. In this paper we present a system for vocal access to a database for a French version of the Air Travel Information Services (ATIS) task. The ATIS task is a designated common task for data collection and evaluation within the ARPA Speech and Natural Language program. A complete spoken language system including a speech recognizer, a natural language component, a database query generator and a natural language response generator is described. The speaker independent continuous speech recognizer makes use of task-independent acoustic models trained on the BREF corpus and a task-specific language model. A case-frame approach is used for the natural language component. This component determines the meaning of the query and builds an appropriate semantic frame representation. The semantic frame is used to generate a database request to the database management system and the returned information is used to generate a response. First evaluation results for the ATIS task are given for the recognition and understanding components, as well as for the combined system.
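The case-frame analysis described in the ATIS entries above (marker words fill slots of a semantic frame, which is then turned into a database request) can be illustrated with a toy slot-filler. The marker words, slot names, and table layout below are invented for illustration; they are not those of the LIMSI system:

```python
# Toy case-frame analyzer for travel queries: each case marker word
# assigns the word that follows it to a slot of the semantic frame.
MARKERS = {
    "from": "departure_city",
    "to": "arrival_city",
    "on": "day",
}

def fill_frame(query):
    """Scan the query for case markers and fill the following word
    into the corresponding slot of a semantic frame (a dict)."""
    words = query.lower().split()
    frame = {}
    for i, w in enumerate(words[:-1]):
        slot = MARKERS.get(w)
        if slot:
            frame[slot] = words[i + 1]
    return frame

def to_request(frame):
    """Turn the filled frame into a (toy) database request string."""
    conds = " AND ".join(f"{k}='{v}'" for k, v in sorted(frame.items()))
    return f"SELECT * FROM flights WHERE {conds}"
```

For a query such as "flights from boston to denver on tuesday", the frame captures the departure city, arrival city and day, and the request generator emits a selection over those slots; a real analyzer also handles dialog context, ellipsis and ill-formed input.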
icslp94lv.ps.Z [ICSLP'94, Yokohama, Japan, Sept'94] --------- Continuous Speech Dictation in French J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- A major research activity at LIMSI is multilingual, speaker-independent, large vocabulary speech dictation. In this paper we report on efforts in large vocabulary, speaker-independent continuous speech recognition of French using the BREF corpus. Recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on 38 million words of newspaper text from LeMonde for language modeling. The recognizer uses a time-synchronous graph-search strategy. When a bigram language model is used, recognition is carried out in a single forward pass. A second forward pass, which makes use of a word graph generated with the bigram language model, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models and phone duration models. An average phone accuracy of 86% was achieved. A word accuracy of 84% has been obtained for an unrestricted vocabulary test and 95% for a 5k vocabulary test. icslp94ted.ps.Z [ICSLP'94, Yokohama, Japan, Sep'94] --------- The Translanguage English Database (TED) L.F. Lamel, F. Schiel, A. Fourcin, J. Mariani, H. Tillmann --------------------- The Translanguage English Database is a corpus of recordings made of oral presentations at Eurospeech93 in Berlin. The corpus name derives from the high percentage of presentations given in English by non-native speakers of English. 224 oral presentations at the conference were successfully recorded, providing a total of about 75 hours of speech material. 
These recordings provide a relatively large number of speakers speaking a variant of the same language (English) over a relatively large amount of time (15 min each + 5 min discussion) on a specific topic. A subset of speakers were recorded with a laryngograph in addition to the standard microphone. A set of Polyphone-like recordings were made, for which a subset also had a laryngograph signal recorded. These recordings were made in English and in the speaker's mother language. In addition to the spoken material, associated text materials are being collected. These include written versions of the proceedings papers and any oral preparation texts which were made available. The text materials will provide vocabulary items and data for language modeling. Speakers were also asked to complete a short questionnaire regarding their mother language, any other languages they speak, as well as their knowledge of English. lre94sqale.ps.Z [LRE Journees du Genie Linguistique, Paris, July'94] --------- Speech Recognizer Quality Assessment for Linguistic Engineering (SQALE) Lori F. Lamel --------------------- The aim of the LRE-SQALE project (Speech recognizer Quality Assessment for Linguistic Engineering) is to experiment with establishing an evaluation paradigm in Europe for the assessment of large-vocabulary, continuous speech recognition systems in a multilingual environment. This 18-month project will define and carry out the assessment experiments, paving the way for future projects with a larger scope and wider participation of European sites. The SQALE Consortium consists of a coordinator, the Institute for Human Factors at TNO, and three laboratories (CUED, LIMSI-CNRS, PHILIPS) who will evaluate their recognition systems using commonly agreed upon protocols, with the evaluation organized by the coordinating laboratory.
Multiple sites will test their algorithms on the same database, so as to compare the merits of different methods, and each site will evaluate on at least two languages, so as to compare the relative difficulties of the languages and the degree to which the algorithms are independent of a given language. spcom94.ps.Z [Speech Communication, 15, pp. 21-37, Sep'94] --------- Speaker-Independent Continuous Speech Dictation J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report on progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper-based speech corpora in English and French. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. For English the ARPA Wall Street Journal-based CSR corpus is used and for French the BREF corpus containing recordings of texts from the French newspaper LeMonde is used. Experiments were carried out with both these corpora at the phone level and at the word level with vocabularies containing up to 20,000 words. Word recognition experiments are also described for the ARPA RM task which has been widely used to evaluate and compare systems. iprai94.ps.Z [International Journal of Pattern Recognition and Artificial Intelligence, special issue on Speech Recognition for different languages 8(1):99-131, 1994.]
--------- Speech-to-Text Conversion in French J.L. Gauvain, L. F. Lamel, G. Adda, and J. Mariani --------------------- Speech-to-text conversion of French necessitates that both the acoustic level recognition and language modeling be tailored to the French language. Work in this area was initiated at LIMSI over 10 years ago. In this paper a summary of the ongoing research in this direction is presented. Included are studies on distributional properties of French text materials; problems specific to speech-to-text conversion particular to French; studies in phoneme-to-grapheme conversion, for continuous, error-free phonemic strings; past work on isolated-word speech-to-text conversion; and more recent work on continuous-speech speech-to-text conversion. Also demonstrated is the use of phone recognition for both language and speaker identification. The continuous speech-to-text conversion for French is based on a speaker-independent, vocabulary-independent recognizer. In this paper phone recognition and word recognition results are reported evaluating this recognizer on read speech taken from the BREF corpus. The recognizer was trained on over 4 hours of speech from 57 speakers, and tested on sentences from an independent set of 19 speakers. A phone accuracy of 78.7% was obtained using a set of 35 phones. The word accuracy was 88% for a 1139 word lexicon and 86% for a 2716 word lexicon, with a word pair grammar with respective perplexities of 100 and 160. Using a bigram grammar word accuracies of 85.5% and 81.7% were obtained with 5K and 20K word vocabularies, with respective perplexities of 122 and 205. hlt93.ps.Z [ARPA Human Language Technology Workshop, Plainsboro, NJ, pp. 96-101, March'93] --------- Identification of Non-Linguistic Speech Features Jean-Luc Gauvain and Lori F. Lamel --------------------- Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. 
It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction. With 2s of speech, the language can be identified with better than 99% accuracy. Error in sex-identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers yielded the same identification accuracies as supervised adaptation. csl93.ps.Z [Computer, Speech & Language, Vol 7 (2):169-191, April 1993] --------- A knowledge-based system for stop consonant identification based on speech spectrogram reading Lori F. Lamel --------------------- In order to formalize the information used in spectrogram reading, a knowledge-based system for identifying spoken stop consonants was developed. Speech spectrogram reading involves interpreting the acoustic patterns in the image to determine the spoken utterance.
One must selectively attend to many different acoustic cues, interpret their significance in light of other evidence, and make inferences based on information from multiple sources. The evidence, obtained from both spectrogram reading experiments and from teaching spectrogram reading, indicates that the process can be modeled with rules. Formalizing spectrogram reading entails refining the language used to describe acoustic events in the spectrogram, selecting a set of relevant acoustic events that distinguish among phonemes, and developing rules which map these acoustic attributes into phonemes. One way to assess how well the knowledge used by experts has been captured is by embedding the rules in a computer program. A knowledge-based system was selected because the expression and use of knowledge are explicit. The emphasis was on capturing the acoustic descriptions and modeling the reasoning used by human spectrogram readers. In this paper, the knowledge acquisition and knowledge representation, in terms of descriptions and rules, are described. A performance evaluation and error analysis are also provided, and the performance of an HMM-based phone recognizer on the same test data is given for comparison. ica93.ps.Z [ICASSP'93 Minneapolis, MN, April'93] --------- Cross-Lingual Experiments with Phone Recognition Lori F. Lamel and Jean-Luc Gauvain --------------------- This paper presents some of the recent research on speaker-independent continuous phone recognition for both French and English. The phone accuracy is assessed on the BREF corpus for French, and on the Wall Street Journal and TIMIT corpora for English. Cross-language differences concerning language properties are presented. It was found that French is easier to recognize at the phone level (the phone error for BREF is 23.6% vs. 30.1% for WSJ), but harder to recognize at the lexical level due to the larger number of homophones.
Experiments with signal analysis indicate that a 4kHz signal bandwidth is sufficient for French, whereas 8kHz is needed for English. Phone recognition is a powerful technique for language, sex, and speaker identification. With 2s of speech, the language can be identified with better than 99% accuracy. Sex-identification for BREF and WSJ is error-free. Speaker identification accuracies of 98.2% on TIMIT (462 speakers) and 99.1% on BREF (57 speakers), were obtained with one utterance per speaker, and 100% with 2 utterances. ast93.ps.Z [ESCA ETRW on Applications of Speech Technology, Lautrach, Sep'93] --------- Generation and Synthesis of Broadcast Messages L.F. Lamel, J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch --------------------- In this paper we present a system designed by VECSYS, in collaboration with LIMSI, to generate and synthesize high quality broadcast messages. The system is currently undergoing evaluation tests to automatically generate and broadcast messages about weather and airport conditions. The generated messages are alternately broadcast in French and in English. This paper focuses on the synthesis aspects of the system which synthesizes messages by concatenation of speech units stored in the form of a dictionary. The texts corresponding to the messages to be synthesized and broadcast are automatically generated by a server, and transmitted to the synthesis component. Should the system be unable to generate the message from existing entries in the dictionary, the system permits a human operator to record the message which is subsequently broadcast. euro93nl.ps.Z [Eurospeech'93, Berlin, Sept'93] --------- Identifying Non-Linguistic Speech Features L.F. Lamel and J.L. Gauvain --------------------- Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. 
It is possible to foresee applications, for example, information centers in public places such as train stations and airports, where the spoken query is to be recognized without even prior knowledge of the language being spoken. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. In this paper we present a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal by feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective for text-independent sex, speaker, and language identification and can enable better and more friendly human-machine interaction. Text-independent speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. Experiments estimating speaker-specific models without use of the phonetic transcription for the TIMIT speakers yielded the same identification accuracies as were obtained with the transcriptions. French/English language identification is better than 99% with 2s of read, laboratory speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy with 10s of speech. The 10-language identification rate using the OGI corpus is 59.7% with 10s of signal. euro93lv.ps.Z [Eurospeech'93, Berlin, Sept'93] --------- Speaker-Independent Continuous Speech Dictation J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker --------------------- In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper speech corpora.
The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models. Two corpora of read speech have been used to carry out the experiments: the DARPA Wall Street Journal-based CSR corpus and the BREF corpus containing recordings of texts from the French newspaper LeMonde. For both corpora experiments were carried out with up to 20K word lexicons. Experimental results are also given for the DARPA RM task which has been widely used to evaluate and compare systems. euro93ph.ps.Z [Eurospeech'93, Berlin, Sep'93] --------- High Performance Speaker-Independent Phone Recognition Using CDHMM L.F. Lamel and J.L. Gauvain --------------------- In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46 phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39 phone set.
Making reference to our work on large vocabulary CSR, we show that it is worthwhile to perform phone recognition experiments as opposed to focusing attention only on word recognition results.

asru93lv.ps.Z [1993 IEEE SPS Workshop on Automatic Speech Recognition, Snowbird, Utah, Dec'93 (invited)]
---------
Large Vocabulary Speech Recognition in English and French
J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker
---------------------
In this paper we report efforts at LIMSI in speaker-independent large vocabulary speech recognition in French and in English. The recognizer makes use of continuous density HMM (CDHMM) with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on text material for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and interword), phone duration models, and sex-dependent models. Evaluation of the system has been based on read speech corpora: the DARPA Wall Street Journal-based CSR corpus and the BREF corpus containing recordings of texts from the French newspaper LeMonde. For both corpora, experiments were carried out with up to 20K word lexicons. Experimental results on the WSJ and BREF corpora with similar size lexicons and task complexities are given, with attention to language-specific properties and problems.

asru93lid.ps.Z [1993 IEEE SPS Workshop on Automatic Speech Recognition, Snowbird, Utah, Dec'93]
---------
Language Identification Using Phone-based Acoustic Likelihoods
Lori F. Lamel and Jean-Luc Gauvain
---------------------
This paper presents our recent work in language identification using phone-based acoustic likelihoods. The basic idea is to process the unknown utterance with language-dependent phone models, identifying the language as the one associated with the phone model set having the highest likelihood.
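The decision rule just described reduces to an argmax over per-language log-likelihoods. A minimal sketch, where the scores are hypothetical stand-ins for the outputs of the language-dependent phone-model decoders run in parallel:

```python
def identify_language(loglikelihoods):
    """Pick the language whose phone model set scored the utterance highest.

    loglikelihoods: dict mapping language name -> total log-likelihood of
    the utterance under that language's phone models.
    """
    return max(loglikelihoods, key=loglikelihoods.get)

# Hypothetical scores for one utterance; in the systems described above
# each value comes from decoding the signal with one phone model set.
scores = {"French": -5210.4, "English": -5234.9}
print(identify_language(scores))  # -> French
```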
This approach has been evaluated for French/English language identification in laboratory conditions, and for 10 languages using the OGI Multilingual telephone corpus. Phone-based acoustic likelihoods have also been shown to be effective for sex and speaker identification.

rmsep92.ps.Z [Final review of the DARPA Artificial Neural Network Technology Speech Program, Stanford, CA, pp. 59-64, Sep'92]
---------
Continuous Speech Recognition at LIMSI
Lori F. Lamel and Jean-Luc Gauvain
---------------------
This paper presents some of the recent research on speaker-independent continuous speech recognition at LIMSI, including efforts in phone and word recognition for both French and English. Evaluation of an HMM-based phone recognizer on a subset of the BREF corpus gives a phone accuracy of 67.1% with 35 context-independent phone models and 74.2% with 428 context-dependent phone models. The word accuracy is 88% for a 1139 word lexicon and 86% for a 2716 word lexicon, using a word pair grammar with respective perplexities of 101 and 160. Phone recognition is also shown to be effective for language, sex, and speaker identification. The second part of the paper describes the recognizer used for the September-92 Resource Management evaluation test. The HMM-based word recognizer is built by concatenation of the phone models for each word, where each phone model is a 3-state left-to-right HMM with Gaussian mixture observation densities. Separate male and female models are run in parallel. The lexicon is represented with a reduced set of 36 phones so as to permit additional sharing of contexts. Intra- and inter-word phonological rules are optionally applied during training and recognition. These rules attempt to account for some of the phonological variations observed in fluent speech. The speaker-independent word accuracy on the Sep92 test data was 95.6%.
On the previous test materials which were used for development, the word accuracies are: 96.7% (Jun88), 97.5% (Feb89), 96.7% (Oct89) and 97.4% (Feb91).

spc92map.ps.Z [Speech Communication, Vol. 11:205-213, June 1992]
---------
Bayesian Learning for Hidden Markov Model with Gaussian Mixture State Observation Densities
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------
An investigation into the use of Bayesian learning of the parameters of a multivariate Gaussian mixture density has been carried out. In the framework of continuous density hidden Markov models (CDHMM), Bayesian learning serves as a unified approach for parameter smoothing, speaker adaptation, speaker clustering and corrective training. The goal is to enhance model robustness in a CDHMM-based speech recognition system so as to improve performance. Our approach is to use Bayesian learning to incorporate prior knowledge into the training process in the form of prior densities of the HMM parameters. The theoretical basis for this procedure is presented and results applying it to parameter smoothing, speaker adaptation, speaker clustering, and corrective training are given.

ica92map.ps.Z (missing) [ICASSP'92, San Francisco, CA, March'92]
---------
Improved Acoustic Modeling with Bayesian Learning
Jean-Luc Gauvain and Chin-Hui Lee
---------------------

darpa92.ps.Z [DARPA Speech & Natural Language Workshop, Arden House, NY, Feb'92]
---------
Speaker-Independent Phone Recognition Using BREF
Jean-Luc Gauvain and Lori F. Lamel
---------------------
A series of experiments on speaker-independent phone recognition of continuous speech has been carried out using the recently recorded BREF corpus. These experiments are the first to use this large corpus, and are meant to provide a baseline performance evaluation for vocabulary-independent phone recognition of French. The HMM-based recognizer was trained with hand-verified data from 43 speakers.
Using 35 context-independent phone models, a baseline phone accuracy of 60% (no phone grammar) was obtained on an independent test set of 7635 phone segments from 19 new speakers. Including phone bigram probabilities as phonotactic constraints resulted in a performance of 63.5%. A phone accuracy of 68.6% was obtained with 428 context-dependent models and the bigram phone language model. Vocabulary-independent word recognition results with no grammar are also reported for the same test data.

darfeb92.ps.Z [DARPA Speech & Nat. Lang. Wshp, Arden House, pp. 185-190, Feb'92]
---------
MAP Estimation of Continuous Density HMM: Theory and Applications
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------
We discuss maximum a posteriori estimation of continuous density hidden Markov models (CDHMM). The classical MLE reestimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded and reestimation formulas are given for HMM with Gaussian mixture observation densities. Because of its adaptive nature, Bayesian learning serves as a unified approach for the following four speech recognition applications: parameter smoothing, speaker adaptation, speaker group modeling and corrective training. New experimental results on all four applications are provided to show the effectiveness of the MAP estimation approach.

euro91map.ps.Z (missing) [Eurospeech'91, Genoa, Italy, pp. 505-508, Sep'91]
---------
Bayesian Learning for Hidden Markov Model with Gaussian Mixture State Observation Densities
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------

euro91.ps.Z [Eurospeech'91, Genoa, Italy, pp. 505-508, Sep'91]
---------
BREF, a Large Vocabulary Spoken Corpus for French
Lori F. Lamel, Jean-Luc Gauvain, and Maxine Eskenazi
---------------------
This paper presents some of the design considerations of BREF, a large read-speech corpus for French.
BREF was designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and for the study of phonological variations. The texts to be read were selected from 5 million words of the French newspaper LeMonde. In total, 11,000 texts were selected, with selection criteria that emphasized maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. Ninety speakers have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min.) of speech.

darpa91.ps.Z [DARPA Speech & Nat. Lang., Morgan Kaufmann, pp. 272-277, Feb'91]
---------
Bayesian Learning of Gaussian Mixture Densities for Hidden Markov Models
Jean-Luc Gauvain and Chin-Hui Lee* [*AT&T Bell Laboratories]
---------------------
An investigation into the use of Bayesian learning of the parameters of a multivariate Gaussian mixture density has been carried out. In a continuous density hidden Markov model (CDHMM) framework, Bayesian learning serves as a unified approach for parameter smoothing, speaker adaptation, speaker clustering, and corrective training. The goal of this study is to enhance model robustness in a CDHMM-based speech recognition system so as to improve performance. Our approach is to use Bayesian learning to incorporate prior knowledge into the CDHMM training process in the form of prior densities of the HMM parameters. The theoretical basis for this procedure is presented and preliminary results applying it to HMM parameter smoothing, speaker adaptation, and speaker clustering are given. Performance improvements were observed on tests using the DARPA RM task. For speaker adaptation, under a supervised learning mode with 2 minutes of speaker-specific training data, a 31% reduction in word error rate was obtained compared to speaker-independent results.
Using Bayesian learning for HMM parameter smoothing and sex-dependent modeling, a 21% error reduction was observed on the FEB91 test.

kobe90.ps.Z [ICSLP'90, Kobe, Japan, Oct'90]
---------
Design Considerations and Text Selection for BREF, a large French read-speech corpus
Jean-Luc Gauvain, Lori F. Lamel, and Maxine Eskenazi
---------------------
BREF, a large read-speech corpus in French, has been designed with several aims: to provide enough speech data to develop dictation machines, to provide data for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a corpus of continuous speech for the study of phonological variations. This paper presents some of the design considerations of BREF, focusing on the text analysis and the selection of text materials. The texts to be read were selected from 4.6 million words of the French newspaper LeMonde. In total, 11,000 texts were selected, with an emphasis on maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. The goal is to obtain about 10,000 words (approximately 60-70 min.) of speech from each of 100 speakers representing different French dialects.
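The Bayesian/MAP learning scheme recurring in the abstracts above (parameter smoothing, speaker adaptation) amounts to interpolating a prior estimate with the estimate from new data. A one-dimensional sketch for a Gaussian mean with a conjugate prior; the pseudo-count tau and the data values are illustrative, not taken from the papers:

```python
def map_mean(prior_mean, tau, data):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    tau acts as a pseudo-count: the prior mean is weighted as if it had
    been estimated from tau observations. With no data the estimate stays
    at the prior mean; with abundant data it approaches the sample mean.
    """
    n = len(data)
    if n == 0:
        return prior_mean
    sample_mean = sum(data) / n
    return (tau * prior_mean + n * sample_mean) / (tau + n)

# Speaker-adaptation flavour: start from a speaker-independent mean of 0.0;
# a little speaker-specific data pulls the estimate part of the way toward
# the speaker's sample mean.
adapted = map_mean(prior_mean=0.0, tau=8.0, data=[1.0, 1.2, 0.8, 1.0])
```

The full CDHMM case estimates mixture weights, means, and covariances jointly within the reestimation algorithms named above, but each update has this same prior-plus-data interpolation form.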