Spoken Language Processing Group (TLP) - Speech Recognition Evaluation
MediaEval2012
For the third year, Vocapia and LIMSI have teamed up to provide automatic
transcripts of audio/video data for MediaEval. In 2012, updated language
identification and speech-to-text transcription systems were used to
process a multilingual collection of 3500 hours of Internet video. All videos
were shared by their owners under a Creative Commons license.
(see MediaEval 2012, speech recognition by Vocapia)
Italian Evalita'11
Vocapia Research and LIMSI were ranked first in the Evalita 2011 ASR
benchmark test.
Dutch NBEST'08
Vocapia Research and LIMSI were ranked first in the Dutch NBEST ASR
benchmark test in 2008.
Technolangue ESTER'05
Vocapia Research and LIMSI technologies were ranked first in the French
Technolangue ESTER ASR benchmark tests in May 2005.
See the (D)ARPA evaluation summary table at the end of this page.
DARPA-BN-10X, Dec99
The main difference from our previous broadcast news transcription
system is that a new decoder was implemented to meet the 10xRT
requirement. This single pass 4-gram dynamic network decoder is based
on a time-synchronous Viterbi search with dynamic expansion of
LM-state conditioned lexical trees, and with acoustic and language
model lookaheads. The decoder can handle position-dependent,
cross-word triphones and lexicons with contextual
pronunciations. Faster-than-real-time decoding can be obtained using
this decoder with a word error under 30%, running in less than 100 MB
of memory on widely available platforms such as Pentium III or Alpha
machines. The same basic models (lexicon, acoustic models, language
models) and partitioning procedure used in past systems have been used
for this evaluation. The acoustic models were trained on about 150
hours of transcribed speech material. 65K word language models were
obtained by interpolation of backoff n-gram language models trained on
different text data sets. Prior to word decoding a maximum likelihood
partitioning algorithm segments the data into homogeneous regions and
assigns gender, bandwidth and cluster labels to the speech segments.
Word decoding is carried out in three steps, integrating cluster-based
MLLR acoustic model adaptation. The final decoding step uses a 4-gram
language model interpolated with a category trigram model. The
overall word transcription error on the 1999 evaluation test data was
17.1% for the baseline 10X system.
(see the 2000 DARPA Speech Transcription proceedings)
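To make the decoding strategy concrete, the following toy sketch illustrates
the principle of a time-synchronous Viterbi search with beam pruning. It is
only an illustration of the idea, not the LIMSI decoder: the real system
dynamically expands LM-state conditioned lexical trees with acoustic and LM
lookaheads, whereas this sketch searches a small flat state space, and all
names and parameters here are illustrative.

    import math
    import random

    def viterbi_beam(frames, states, log_init, log_trans, log_obs, beam=10.0):
        # One hypothesis per state: (log score, state path so far).
        active = {s: (log_init(s) + log_obs(s, frames[0]), [s]) for s in states}
        for frame in frames[1:]:
            scores = {}
            for s, (score, path) in active.items():
                for t in states:
                    cand = score + log_trans(s, t) + log_obs(t, frame)
                    if t not in scores or cand > scores[t][0]:
                        scores[t] = (cand, path + [t])
            best = max(v[0] for v in scores.values())
            # Beam pruning: keep hypotheses within `beam` of the frame best.
            active = {t: v for t, v in scores.items() if v[0] >= best - beam}
        return max(active.values(), key=lambda v: v[0])

    # Toy demo: a quiet state and a speech-like state with Gaussian scores.
    states = ["sil", "speech"]
    means = {"sil": 0.0, "speech": 3.0}
    log_obs = lambda s, x: -0.5 * (x - means[s]) ** 2
    log_trans = lambda s, t: math.log(0.9 if s == t else 0.1)
    log_init = lambda s: math.log(0.5)
    frames = [random.gauss(0, 1) for _ in range(5)] + \
             [random.gauss(3, 1) for _ in range(5)]
    score, path = viterbi_beam(frames, states, log_init, log_trans, log_obs)
    print(path)

Pruning to hypotheses within a fixed log-likelihood beam of the frame-best
score is what bounds the search cost and makes faster-than-real-time
decoding feasible.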
DARPA-BN, Nov98
The LIMSI system for the November 1998 Hub-4E evaluation is a
continuous mixture density, tied-state cross-word context-dependent
HMM system. The acoustic models were trained on the 1995, 1996 and
1997 official Hub-4E training data containing about 150 hours of
transcribed speech material. 65K word language models were obtained
by interpolation of backoff n-gram language models trained on
different data sets.
Prior to word decoding a maximum likelihood partitioning algorithm
segments the data into homogeneous regions and assigns gender,
bandwidth and cluster labels to the speech segments. Decoding is
carried out in 3 steps, integrating cluster-based MLLR acoustic model
adaptation. The final decoding step uses a 4-gram language model
interpolated with a category trigram model.
The main differences compared to last year's system arise from the
use of additional acoustic and language model training data, the use
of divisive decision tree clustering instead of agglomerative
clustering for state-tying, the generation of word graphs using
adapted acoustic models, the use of interpolated LMs trained on
different data sets instead of training a single model on weighted
texts, and a 4-gram LM interpolated with a category model.
The overall word transcription error on the Nov98 evaluation test
data was 13.6% (see the 1999 DARPA BN proceedings or hub4_99.ps.Z).
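Linear interpolation of language models, used above both to combine backoff
n-gram models trained on different corpora and to mix the 4-gram with the
category trigram, takes the form P(w|h) = sum_i lambda_i * P_i(w|h), with
the weights typically re-estimated by EM on held-out data. A minimal sketch,
assuming each component model is available as a function returning
conditional probabilities (function names are illustrative):

    def interpolate_lm(models, weights):
        # P(w | h) = sum_i lambda_i * P_i(w | h)
        assert abs(sum(weights) - 1.0) < 1e-9
        def prob(word, history):
            return sum(lam * m(word, history) for lam, m in zip(weights, models))
        return prob

    def estimate_weights(models, heldout, iters=20):
        # EM re-estimation of interpolation weights on held-out
        # (word, history) pairs: the E-step computes each component's
        # posterior responsibility for each event, the M-step averages them.
        k = len(models)
        lam = [1.0 / k] * k
        for _ in range(iters):
            counts = [0.0] * k
            for word, history in heldout:
                post = [l * m(word, history) for l, m in zip(lam, models)]
                z = sum(post)
                for i in range(k):
                    counts[i] += post[i] / z
            lam = [c / len(heldout) for c in counts]
        return lam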
DARPA-BN, Nov97
The LIMSI system used in the Nov97 Hub4 benchmark test on
transcription of American English broadcast news shows has two main
differences from the system developed for the Nov96 evaluation. The
first concerns the preprocessing stages for partitioning the data, and
the second concerns a reduction in the number of acoustic model sets
used to deal with the various acoustic signal characteristics.
The LIMSI system for the November 1997 Hub-4E evaluation is a
continuous mixture density, tied-state cross-word context-dependent
HMM system. The acoustic models were trained on the 1995 and 1996
official Hub4 training data containing about 80 hours of transcribed
speech material. The 65k trigram language models are trained on 155
million words of newspaper texts and 132 million words of broadcast
news transcriptions. The test data is segmented and labeled using
Gaussian mixture models, and non-speech segments are rejected. The
speech segments are classified as telephone or wide-band, and
according to gender. Decoding is carried out in three passes, with a
final pass incorporating cluster-based test-set MLLR adaptation. The
overall word transcription error on the Nov97 unpartitioned evaluation
test data was 18.3%.
(see 1998
Broadcast News Transcription Proceedings or
hub4_98.ps.Z)
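The segment labeling step can be pictured as maximum-likelihood
classification with one Gaussian mixture model per condition (e.g.
male/female, telephone/wide-band). A sketch using scikit-learn, which is a
convenience assumption rather than the original implementation:

    from sklearn.mixture import GaussianMixture

    def train_gmms(labelled_frames, n_components=16):
        # labelled_frames: dict mapping a label such as "female/wide-band"
        # to an (n_frames, n_dims) array of acoustic feature vectors.
        return {label: GaussianMixture(n_components=n_components).fit(X)
                for label, X in labelled_frames.items()}

    def classify_segment(gmms, X):
        # Label a segment by the GMM with the highest average per-frame
        # log-likelihood over the segment's frames.
        return max(gmms, key=lambda label: gmms[label].score(X))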
DARPA-BN, Nov96
The task of transcribing broadcast news is a move toward real-world
applications for speech technology. Broadcast news shows contain signal
segments of various acoustic and linguistic nature, with abrupt or
gradual transitions between segments. Two main problems in dealing with
the continuous flow of inhomogeneous data concern the varied acoustic
nature of the signal (signal quality, environmental and transmission
noise, music) and different linguistic styles (prepared and spontaneous
speech on a wide range of topics, spoken by a large variety of
speakers).
The LIMSI Nov96 speech recognizer for the ARPA Broadcast News task
makes use of continuous density HMMs with Gaussian mixture for acoustic
modeling and n-gram statistics estimated on newspaper texts (160M words)
and broadcast news transcriptions (132M words). The acoustic models are
trained on the WSJ0/WSJ1 corpora, and adapted using MAP estimation with 35 hours
of broadcast news data. The overall word error on the Nov96 partitioned
evaluation test was 27.1%.
(see the 1997 Speech Recognition Workshop proceedings, ica97hub4.ps.Z,
and all Nov96 results at NIST).
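MAP adaptation interpolates each prior (WSJ-trained) Gaussian mean with
statistics accumulated on the broadcast news data, weighted by a relevance
factor. A minimal sketch of the standard mean update; the relevance factor
value is an assumption, not the setting used in the evaluation:

    import numpy as np

    def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
        # prior_mean : (d,) mean of one Gaussian in the prior model
        # frames     : (n, d) adaptation frames
        # posteriors : (n,) occupation probabilities of this Gaussian
        # tau        : relevance factor weighting the prior (assumed value)
        gamma = posteriors.sum()                                 # soft count
        obs_mean = (posteriors[:, None] * frames).sum(axis=0) / max(gamma, 1e-10)
        # With little data (small gamma) the prior dominates; with much
        # data the estimate moves toward the observed mean.
        return (tau * prior_mean + gamma * obs_mean) / (tau + gamma)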
DARPA-NAB-MUM, Nov95
The LIMSI recognizer has been evaluated in the ARPA
1995 North American Business (NAB) News Hub 3 benchmark test. This
recognizer is an HMM-based system with Gaussian mixture observation densities.
Decoding is carried out in multiple forward acoustic passes, where
more refined acoustic and language models are used in successive
passes and information is transmitted via word graphs. In order to
deal with the varied acoustic conditions, channel compensation is
performed iteratively, refining the noise estimates before the first
three decoding passes. The final decoding pass is carried out with
speaker-adapted models obtained via unsupervised adaptation using the
MLLR method. In contrast to previous evaluations, the new Hub 3 test
aimed at improving basic speaker-independent (SI) CSR performance on
unlimited-vocabulary read speech recorded under more varied acoustic
conditions (background environmental noise and unknown microphones). On the
Sennheiser microphone (average SNR 29dB) a word error of 9.1% was
obtained, which can be compared to 17.5% on the secondary microphone
data (average SNR 15dB) using the same recognition system.
(see arpa96nab.ps.Z)
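MLLR adapts the Gaussian means through a shared affine transform estimated
on the test speaker's data. The sketch below estimates the transform in a
simplified least-squares sense; proper MLLR maximizes likelihood using
occupation counts and covariances, which this simplification omits:

    import numpy as np

    def mllr_mean_transform(means, targets):
        # means   : (k, d) original Gaussian means
        # targets : (k, d) per-Gaussian means observed in adaptation data
        k, d = means.shape
        X = np.hstack([means, np.ones((k, 1))])    # extended means [mu; 1]
        # Solve min ||X @ A - targets||^2; A stacks [W^T; b^T].
        A, *_ = np.linalg.lstsq(X, targets, rcond=None)
        W, b = A[:d].T, A[d]
        return W, b                                # adapted mean: W @ mu + b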
ARPA-NAB, Nov94
In our development work for the Nov94 ARPA evaluation we experimented
with (1) alternative analyses for the acoustic front end, (2) the use of
an enlarged vocabulary so as to reduce the number of errors due to
out-of-vocabulary words, (3) extensions to the lexical representation,
(4) the use of additional acoustic training data, and (5) modification
of the acoustic models for telephone speech. The recognizer was
evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke
Benchmark test. A word error of 9.2% was obtained on the Hub1 Sennheiser
data and 24.6% on the Hub2 telephone data.
(see slt95.ps.Z and
ica95lv.ps.Z)
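The motivation for the enlarged vocabulary is that every out-of-vocabulary
(OOV) token causes at least one recognition error, so the OOV rate is a
floor on performance. A small sketch for measuring it on development text
(function name illustrative):

    def oov_rate(text_tokens, vocabulary):
        # Fraction of running words not covered by the recognition
        # vocabulary; each such token is guaranteed to be misrecognized.
        vocab = set(vocabulary)
        return sum(1 for w in text_tokens if w not in vocab) / len(text_tokens)

Comparing, say, a 20k and a 65k word list on the same development text shows
how much of the error floor the larger lexicon removes.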
ARPA-WSJ, Nov93
The LIMSI 20k Wall Street Journal system was evaluated in the November
1993 test. The recognizer makes use of continuous density HMM with
Gaussian mixture for acoustic modeling and n-gram statistics estimated
on the newspaper texts for language modeling. The recognizer uses a
time-synchronous graph-search strategy which was shown to still be
viable with vocabularies of up to 20K words when used with bigram
back-off language models. A second forward pass, which makes use of a
word graph generated with the bigram, incorporates a trigram language
model. The trigram pass gives an error rate reduction of 20% to 30%
relative to the bigram system. Acoustic modeling uses cepstrum-based
features, context-dependent phone models (intra and interword), phone
duration models, and sex-dependent models. Increasing the number of
training utterances from 7k to 37k reduced the word error by about
30%. The word error obtained on the Nov93 test data was 11.8%.
(see slt94.ps.Z
and ica94lv.ps.Z)
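The two-pass strategy can be sketched as dynamic programming over the
bigram-generated word graph, with the search state extended to the last two
words so a trigram can score each edge. A toy version, assuming the lattice
is given as topologically ordered (source, destination, word, acoustic
log-probability) edges; this data layout is an assumption:

    from collections import defaultdict

    def rescore_lattice(edges, n_nodes, lm_logprob, lm_weight=1.0):
        # edges      : (src, dst, word, acoustic log-prob) tuples; node 0 is
        #              the start, node n_nodes - 1 the end, and node ids are
        #              in topological order
        # lm_logprob : function (w1, w2, w3) -> log P(w3 | w1, w2)
        by_src = defaultdict(list)
        for e in edges:
            by_src[e[0]].append(e)
        # Best hypothesis per (node, last two words) search state.
        best = {(0, ("<s>", "<s>")): (0.0, [])}
        for node in range(n_nodes):
            for (n, hist), (score, words) in list(best.items()):
                if n != node:
                    continue
                for _, dst, word, ac in by_src[node]:
                    s = score + ac + lm_weight * lm_logprob(hist[0], hist[1], word)
                    key = (dst, (hist[1], word))
                    if key not in best or s > best[key][0]:
                        best[key] = (s, words + [word])
        finals = [v for (n, _), v in best.items() if n == n_nodes - 1]
        return max(finals, key=lambda v: v[0])  # (score, best word sequence)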
DARPA-WSJ, Nov92
The first LIMSI 5k WSJ system was evaluated in Nov92.
The recognizer makes use of continuous density HMM with Gaussian mixture
for acoustic modeling and n-gram statistics estimated on the newspaper
texts for language modeling. Acoustic modeling uses cepstrum-based
features, context-dependent phone models (intra and interword), phone
duration models, and sex-dependent models. The official word error
obtained in the evaluation was 9.7%.
(see euro93lv.ps.Z)
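A cepstrum-based front end of this kind can be approximated with standard
MFCC extraction. A sketch using librosa on a synthetic signal; the analysis
parameters (13 coefficients, 10 ms frame shift, appended deltas) are common
defaults assumed for illustration, not the exact LIMSI settings:

    import numpy as np
    import librosa

    # Synthetic one-second 16 kHz signal, just to keep the example runnable.
    sr = 16000
    y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

    # 13 cepstral coefficients every 10 ms, with appended delta features.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)
    delta = librosa.feature.delta(mfcc)
    features = np.vstack([mfcc, delta])      # (26, n_frames)
    print(features.shape)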
DARPA-RM, Sep92 (Artificial Neural Network Technology Speech Program)
The HMM-based word recognizer used for the September-92 Resource
Management evaluation was built by concatenation of the phone models for
each word, where each phone model is a 3-state left-to-right HMM with
Gaussian mixture observation densities. Separate male and female models
were run in parallel. The lexicon was represented with a reduced set of
36 phones so as to permit additional sharing of contexts. Intra- and
inter-word phonological rules are optionally applied during training and
recognition. These rules attempt to account for some of the
phonological variations observed in fluent speech. The
speaker-independent word accuracy on the Sep92 test data was 95.6%. On
the previous test materials used for development, the word
accuracies were 96.7% (Jun88), 97.5% (Feb89), 96.7% (Oct89) and 97.4%
(Feb91). (see rmsep92.ps.Z)
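Building word models by concatenating 3-state left-to-right phone HMMs can
be sketched at the level of the transition structure: each phone contributes
a small left-to-right transition matrix with an exit column, and the word
model chains them so one phone's exit feeds the next phone's first state. A
toy sketch (observation densities and the 36-phone set are omitted; the
probabilities are illustrative):

    import numpy as np

    def left_to_right_hmm(n_states=3, p_stay=0.6):
        # Transition matrix of a 3-state left-to-right phone HMM with an
        # extra exit column so models can be chained.
        A = np.zeros((n_states, n_states + 1))
        for i in range(n_states):
            A[i, i] = p_stay          # self-loop
            A[i, i + 1] = 1 - p_stay  # advance to next state (or exit)
        return A

    def concat_word_model(phones, hmms):
        # Concatenate phone HMMs into one word-level transition matrix;
        # each phone's exit column lines up with the next phone's entry.
        sizes = [hmms[p].shape[0] for p in phones]
        n = sum(sizes)
        W = np.zeros((n, n + 1))
        offset = 0
        for p, size in zip(phones, sizes):
            W[offset:offset + size, offset:offset + size + 1] = hmms[p]
            offset += size
        return W

    # e.g. a toy lexicon entry: word "sea" -> phones /s/ /iy/
    hmms = {"s": left_to_right_hmm(), "iy": left_to_right_hmm()}
    word_A = concat_word_model(["s", "iy"], hmms)
    print(word_A.shape)   # (6, 7): six emitting states plus an exit column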
Summary of LIMSI participation in ARPA evaluations
ARPA Test        | No. of participants | LIMSI rank¹ | Vocab. (words) | LIMSI % word error | Range of % word error
Sep92 RM         | 10 (6 US)           | 1           | 1K             | 4.4                | 4.4 - 11.7
Nov92 WSJ        | 6 (5 US)            | 3           | 5K             | 9.7                | 6.9 - 15.0
Nov93 WSJ        | 8 (5 US)            | 1           | 20K            | 11.8               | 11.8 - 19.1
Nov94 NAB        | 13 (9 US)           | 3           | 65K            | 9.2                | 7.2 - 17.4
Nov95 MUM        | 8 (5 US)            | 2           | 65K            | 17.5               | 13.5 - 55.5
Nov95 Sennheiser | 8 (5 US)            | 2           | 65K            | 9.1                | 6.6 - 20.2
Nov96 BN         | 8 (5 US)            | 1           | 65K            | 27.1               | 27.1 - 53.8
Nov97 BN         | 10 (6 US)           | 3           | 65K            | 18.3               | 16.4 - 39.0
Nov98 BN         | 9 (5 US)            | 2           | 65K            | 13.6               | 13.5 - 25.7
Dec99 BN-10X     | 5 (4 US)            | 1           | 65K            | 17.1               | 17.1 - 26.3
(¹Differences between the best performing systems were often not significant.)
ARPA-related publications
Related links
NIST (Spoken Natural Language Processing)
Participating sites