Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
 

Spoken Language Processing Group (TLP)

TLP Group - Speech Recognition Evaluation


  • MediaEval2012
  • For the third year, Vocapia and LIMSI have teamed up to provide automatic transcripts of audio/video data for MediaEval. In 2012, updated language identification and speech-to-text transcription systems were used to process a multilingual collection of 3500 hours of Internet video. All videos were shared by their owners under a Creative Commons license. MediaEval2012, speech recognition by Vocapia.

  • Italian Evalita'11
  • Vocapia Research and LIMSI were ranked first in the Evalita 2011 ASR benchmark test.

  • Dutch NBEST'08
  • Vocapia Research and LIMSI were ranked first in the 2008 Dutch NBEST ASR benchmark test.

  • Technolangue ESTER'05
  • Vocapia Research and LIMSI technologies were ranked first in the French Technolangue ESTER ASR benchmark tests in May 2005.
    See the (D)ARPA evaluation table at the end of this page.

  • DARPA-BN-10X, Dec99
  • The main difference from our previous broadcast news transcription system is that a new decoder was implemented to meet the 10xRT requirement. This single-pass 4-gram dynamic network decoder is based on a time-synchronous Viterbi search with dynamic expansion of LM-state conditioned lexical trees, and with acoustic and language model lookaheads. The decoder can handle position-dependent, cross-word triphones and lexicons with contextual pronunciations. Faster than real-time decoding can be obtained with this decoder with a word error under 30%, running in less than 100 MB of memory on widely available platforms such as Pentium III or Alpha machines. The same basic models (lexicon, acoustic models, language models) and partitioning procedure used in past systems were used for this evaluation. The acoustic models were trained on about 150 hours of transcribed speech material. 65K-word language models were obtained by interpolation of backoff n-gram language models trained on different text data sets. Prior to word decoding, a maximum likelihood partitioning algorithm segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Word decoding is carried out in three steps, integrating cluster-based MLLR acoustic model adaptation. The final decoding step uses a 4-gram language model interpolated with a category trigram model. The overall word transcription error on the 1999 evaluation test data was 17.1% for the baseline 10X system. (See the 2000 DARPA Speech Transcription proceedings.)
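    The heart of such a decoder is a time-synchronous Viterbi search with beam pruning. The minimal Python sketch below illustrates the idea on a toy HMM with hypothetical log-probabilities; it omits the lexical trees, lookaheads and cross-word contexts of the real system.

      import math

      def viterbi_beam(log_trans, log_obs, beam=10.0):
          """log_trans[i][j]: log transition probability from state i to j.
          log_obs[t][j]: log likelihood of frame t in state j.
          At each frame, hypotheses falling more than `beam` below the
          best current score are pruned."""
          n_states = len(log_trans)
          active = {0: 0.0}                 # assume decoding starts in state 0
          for frame in log_obs:
              new = {}
              for s, score in active.items():
                  for s2 in range(n_states):
                      if log_trans[s][s2] == -math.inf:
                          continue          # forbidden transition
                      cand = score + log_trans[s][s2] + frame[s2]
                      if cand > new.get(s2, -math.inf):
                          new[s2] = cand
              best = max(new.values())
              # beam pruning keeps the search fast enough for the 10xRT budget
              active = {s: v for s, v in new.items() if v > best - beam}
          return max(active.values())       # best log score at the last frame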

  • DARPA-BN, Nov98
  • The LIMSI system for the November 1998 Hub-4E evaluation is a continuous mixture density, tied-state, cross-word, context-dependent HMM system. The acoustic models were trained on the 1995, 1996 and 1997 official Hub-4E training data, containing about 150 hours of transcribed speech material. 65K-word language models were obtained by interpolation of backoff n-gram language models trained on different data sets. Prior to word decoding, a maximum likelihood partitioning algorithm segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Decoding is carried out in three steps, integrating cluster-based MLLR acoustic model adaptation. The final decoding step uses a 4-gram language model interpolated with a category trigram model. The main differences compared to the previous year's system arise from the use of additional acoustic and language model training data, the use of divisive decision tree clustering instead of agglomerative clustering for state-tying, the generation of word graphs using adapted acoustic models, the use of interpolated LMs trained on different data sets instead of training a single model on weighted texts, and a 4-gram LM interpolated with a category model. The overall word transcription error on the Nov98 evaluation test data was 13.6%. (See the 1999 DARPA BN proceedings or hub4_99.ps.Z.)
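    As an illustration of the interpolated LMs, the sketch below combines component n-gram models by linear interpolation. The component models and weights are hypothetical stand-ins; in practice the components would be backoff n-grams trained on the different text sets, with weights tuned on held-out data.

      def interpolated_prob(word, history, models, weights):
          """p(word | history) = sum_i w_i * p_i(word | history),
          with the weights summing to 1."""
          return sum(w * m(word, history) for m, w in zip(models, weights))

      # toy component models standing in for backoff n-grams trained on
      # newspaper texts and broadcast news transcriptions
      news = lambda w, h: 0.012 if w == "market" else 0.001
      bn = lambda w, h: 0.004 if w == "market" else 0.002

      p = interpolated_prob("market", ("the", "stock"), [news, bn], [0.7, 0.3])
      print(p)  # 0.7 * 0.012 + 0.3 * 0.004 = 0.0096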

  • DARPA-BN, Nov97
  • The LIMSI system used in the Nov97 Hub4 benchmark test on transcription of American English broadcast news shows has two main differences from the system developed for the Nov96 evaluation. The first concerns the preprocessing stages for partitioning the data, and the second concerns a reduction in the number of acoustic model sets used to deal with the various acoustic signal characteristics. The LIMSI system for the November 1997 Hub-4E evaluation is a continuous mixture density, tied-state, cross-word, context-dependent HMM system. The acoustic models were trained on the 1995 and 1996 official Hub4 training data, containing about 80 hours of transcribed speech material. The 65K trigram language models were trained on 155 million words of newspaper texts and 132 million words of broadcast news transcriptions. The test data is segmented and labeled using Gaussian mixture models, and non-speech segments are rejected. The speech segments are classified as telephone or wide-band, and according to gender. Decoding is carried out in three passes, with a final pass incorporating cluster-based test-set MLLR adaptation. The overall word transcription error on the Nov97 unpartitioned evaluation test data was 18.3%. (See the 1998 Broadcast News Transcription Proceedings or hub4_98.ps.Z.)
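    The segment labeling step can be pictured as follows: each class (male/female, telephone/wide-band) has its own Gaussian mixture, and a segment is assigned to the class whose mixture gives the highest total log-likelihood over its frames. The models and one-dimensional features in this minimal sketch are toy stand-ins.

      import math

      def log_gauss(x, mean, var):
          return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

      def gmm_loglik(frames, gmm):
          """gmm: list of (weight, mean, var) components."""
          return sum(math.log(sum(w * math.exp(log_gauss(x, m, v))
                                  for w, m, v in gmm))
                     for x in frames)

      def classify(frames, class_gmms):
          # pick the class whose GMM best explains the segment
          return max(class_gmms, key=lambda c: gmm_loglik(frames, class_gmms[c]))

      gmms = {"wide-band": [(1.0, 0.0, 1.0)], "telephone": [(1.0, 2.0, 1.0)]}
      print(classify([1.8, 2.2, 1.9], gmms))  # -> telephone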

  • DARPA-BN, Nov96
  • The task of transcribing broadcast news is a move toward real-world applications for speech technology. Broadcast news shows contain signal segments of various acoustic and linguistic natures, with abrupt or gradual transitions between segments. Two main problems in dealing with the continuous flow of inhomogeneous data concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and the different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The LIMSI Nov96 speech recognizer for the ARPA Broadcast News task makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts (160M words) and broadcast news transcriptions (132M words). The acoustic models are trained on WSJ0/WSJ1 and adapted using MAP estimation with 35 hours of broadcast news data. The overall word error on the Nov96 partitioned evaluation test was 27.1%. (See the 1997 Speech Recognition Workshop Proceedings or ica97hub4.ps.Z, and all Nov96 results at NIST.)
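    For the MAP adaptation step, the sketch below shows the standard MAP update of a Gaussian mean: with little adaptation data the estimate stays near the WSJ-trained prior, and with more data it approaches the maximum likelihood estimate. The prior weight tau and the statistics are hypothetical.

      def map_mean(prior_mean, frames, gammas, tau=10.0):
          """mu_map = (tau * mu_prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t),
          where gamma_t is the Gaussian's occupation count for frame x_t."""
          num = tau * prior_mean + sum(g * x for g, x in zip(gammas, frames))
          den = tau + sum(gammas)
          return num / den

      print(map_mean(0.0, frames=[1.0, 1.2, 0.8], gammas=[1.0, 1.0, 1.0]))
      # -> 3.0 / 13.0 ~ 0.23: pulled only slightly toward the sparse new data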

  • DARPA-NAB-MUM, Nov95
  • The LIMSI recognizer was evaluated in the ARPA 1995 North American Business (NAB) News Hub 3 benchmark test. This recognizer is an HMM-based system with Gaussian mixture observation densities. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. In contrast to previous evaluations, the new Hub 3 test aimed at improving basic speaker-independent CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). On the Sennheiser microphone (average SNR 29 dB) a word error of 9.1% was obtained, compared to 17.5% on the secondary microphone data (average SNR 15 dB) using the same recognition system. (See arpa96nab.ps.Z.)
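    The MLLR step estimates an affine transform of the Gaussian means from the test data itself. The sketch below is a simplification: a weighted least-squares fit over the Gaussians, ignoring the covariances that enter the full maximum likelihood solution, with toy statistics throughout.

      import numpy as np

      def estimate_mllr(means, obs_means, gammas):
          """means: (G, D) speaker-independent Gaussian means.
          obs_means: (G, D) means of frames aligned to each Gaussian in the
          unsupervised first pass. gammas: (G,) occupation counts.
          Returns W of shape (D, D+1) so that adapted_mean = W @ [1, mu]."""
          xi = np.hstack([np.ones((len(means), 1)), means])  # extended means
          sw = np.sqrt(gammas)[:, None]                      # occupancy weighting
          W, *_ = np.linalg.lstsq(sw * xi, sw * obs_means, rcond=None)
          return W.T

      means = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
      obs = 0.9 * means + 0.3              # toy "speaker": scaled and shifted
      W = estimate_mllr(means, obs, np.array([10.0, 5.0, 8.0]))
      xi = np.hstack([np.ones((3, 1)), means])
      print(np.round((W @ xi.T).T, 3))     # adapted means recover obs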

  • ARPA-NAB, Nov94
  • In our development work for the Nov94 ARPA evaluation, we experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modifications of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke benchmark test. A word error of 9.2% was obtained on the Hub 1 Sennheiser data and 24.6% on the Hub 2 telephone data. (See slt95.ps.Z and ica95lv.ps.Z.)
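    Point (2) is motivated by the fact that every out-of-vocabulary (OOV) word in the test data is a guaranteed recognition error, so the OOV rate lower-bounds the word error attributable to the vocabulary. A minimal sketch with a toy vocabulary and text:

      def oov_rate(words, vocab):
          # fraction of test tokens absent from the recognition vocabulary
          return sum(1 for w in words if w not in vocab) / len(words)

      vocab = {"the", "market", "rose", "sharply"}
      words = "the market rose sharply despite the forecast".split()
      print(f"OOV rate: {oov_rate(words, vocab):.1%}")  # 2/7 ~ 28.6%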

  • ARPA-WSJ, Nov93
  • The LIMSI 20K Wall Street Journal system was evaluated in the November 1993 test. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy, which was shown to still be viable with vocabularies of up to 20K words when used with bigram backoff language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. The trigram pass gives an error rate reduction of 20% to 30% relative to the bigram system. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and inter-word), phone duration models, and sex-dependent models. Increasing the number of training utterances from 7k to 37k reduced the word error by about 30%. The word error obtained on the Nov93 test data was 11.8%. (See slt94.ps.Z and ica94lv.ps.Z.)
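    The two-pass strategy can be sketched as follows: the bigram pass produces a word graph, and the trigram pass searches it, keeping one hypothesis per (node, last-two-words) state so that trigram context is exact. The lattice, acoustic scores and trigram function below are toy stand-ins; nodes are assumed numbered in topological order.

      from math import inf, log

      def rescore(edges, n_nodes, trigram, lm_weight=1.0):
          """edges: (src, dst, word, acoustic_logprob) tuples.
          trigram(w1, w2, w3) -> log p(w3 | w1, w2).
          Returns the best total log-score from node 0 to node n_nodes-1."""
          best = {(0, ("<s>", "<s>")): 0.0}
          for src, dst, word, ac in sorted(edges):   # topological edge order
              for (node, (w1, w2)), score in list(best.items()):
                  if node != src:
                      continue
                  cand = score + ac + lm_weight * trigram(w1, w2, word)
                  key = (dst, (w2, word))
                  if cand > best.get(key, -inf):
                      best[key] = cand
          return max(v for (node, _), v in best.items() if node == n_nodes - 1)

      lattice = [(0, 1, "the", -1.0), (1, 2, "stock", -2.0), (1, 2, "stack", -1.8)]
      tg = lambda w1, w2, w3: log(0.1) if w3 == "stock" else log(0.01)
      print(round(rescore(lattice, 3, tg), 2))
      # "stock" wins despite its worse acoustic score, thanks to the stronger LM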

  • DARPA-WSJ, Nov92
  • The first LIMSI 5K WSJ system was evaluated in Nov92. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and inter-word), phone duration models, and sex-dependent models. The official word error obtained in the evaluation was 9.7%. (See euro93lv.ps.Z.)
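    For reference, the word error figures reported throughout this page follow the standard scoring recipe: a minimum edit-distance alignment between reference and hypothesis, with %error = (substitutions + deletions + insertions) / number of reference words. A minimal sketch:

      def word_error_rate(ref, hyp):
          r, h = ref.split(), hyp.split()
          # d[i][j]: minimum edits aligning first i ref words to first j hyp words
          d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
          for i in range(len(r) + 1):
              d[i][0] = i                       # deletions
          for j in range(len(h) + 1):
              d[0][j] = j                       # insertions
          for i in range(1, len(r) + 1):
              for j in range(1, len(h) + 1):
                  sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                  d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
          return d[len(r)][len(h)] / len(r)

      print(f"{word_error_rate('the stock rose', 'a stock rose today'):.1%}")
      # 1 substitution + 1 insertion over 3 reference words -> 66.7%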

  • DARPA-RM, Sep92 (Artificial Neural Network Technology Speech Program)
  • The HMM-based word recognizer used for the September 1992 Resource Management evaluation was built by concatenating the phone models for each word, where each phone model is a 3-state left-to-right HMM with Gaussian mixture observation densities. Separate male and female models were run in parallel. The lexicon was represented with a reduced set of 36 phones so as to permit additional sharing of contexts. Intra- and inter-word phonological rules are optionally applied during training and recognition; these rules attempt to account for some of the phonological variations observed in fluent speech. The speaker-independent word accuracy on the Sep92 test data was 95.6%. On the previous test materials, which were used for development, the word accuracies were: 96.7% (Jun88), 97.5% (Feb89), 96.7% (Oct89) and 97.4% (Feb91). (See rmsep92.ps.Z.)
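    A minimal sketch of the word-model construction, with a hypothetical pronunciation entry: each word HMM is the left-to-right chain of the 3-state phone models given by the lexicon.

      def word_hmm_states(pronunciation):
          """Return the word's state sequence as (phone, state) pairs;
          transitions are left-to-right with self-loops (not shown)."""
          return [(phone, s) for phone in pronunciation for s in range(3)]

      lexicon = {"speech": ["s", "p", "iy", "ch"]}   # hypothetical entry
      print(word_hmm_states(lexicon["speech"]))
      # 4 phones x 3 states = 12 states chained left to right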

  • Summary of LIMSI participation in ARPA evaluations
    ARPA Test       Nb. of         LIMSI   Vocab.    LIMSI            Range of
                    participants   rank¹   (words)   %error (words)   %error (words)
    -------------------------------------------------------------------------------
    Sep92 RM        10 (6 US)      1       1K        4.4              4.4 - 11.7
    Nov92 WSJ       6 (5 US)       3       5K        9.7              6.9 - 15.0
    Nov93 WSJ       8 (5 US)       1       20K       11.8             11.8 - 19.1
    Nov94 NAB       13 (9 US)      3       65K       9.2              7.2 - 17.4
    Nov95 MUM       8 (5 US)       2       65K       17.5             13.5 - 55.5
      (Sennheiser)                                   9.1              6.6 - 20.2
    Nov96 BN        8 (5 US)       1       65K       27.1             27.1 - 53.8
    Nov97 BN        10 (6 US)      3       65K       18.3             16.4 - 39.0
    Nov98 BN        9 (5 US)       2       65K       13.6             13.5 - 25.7
    Dec99 BN-10X    5 (4 US)       1       65K       17.1             17.1 - 26.3
    (¹ Differences between the best-performing systems were often not significant.)

    ARPA-related publications

    Related links
    NIST (Spoken Natural Language Processing)
    Participating sites



