Language models play an important role in many applications such as character and speech recognition, translation, and information retrieval. The dominant approach, at least in large-vocabulary continuous speech recognition, is the n-gram back-off language model (LM). In this research a new approach is presented that is based on estimating the LM probabilities in a continuous space. This technique was evaluated in a large-vocabulary continuous speech recognizer on several tasks and languages and consistently achieved significant reductions in word error rate of up to 1.6% absolute with respect to a carefully optimized back-off 4-gram LM.
We describe here the application of a new approach that uses a neural
network to estimate the LM posterior probabilities [1].
The basic
idea is to project the word indices onto a continuous space and to use
a probability estimator operating on this space. Since the
resulting probability functions are smooth functions of the word
representation, better generalization to unknown n-grams can be
expected. A neural network is used to simultaneously learn the
projection of the words onto the continuous space and the n-gram
probability estimation. This is still an n-gram approach, but the LM posterior probabilities are "interpolated" for any possible context of length n-1 instead of backing off to shorter contexts.
The architecture of the neural network n-gram LM is shown in Figure 1.
A standard fully-connected multi-layer perceptron is used. The
inputs to the neural network are the indices of the n-1 previous words
in the vocabulary
and the outputs are the posterior probabilities of all words of the vocabulary:

P(w_t = i | h_t) for all i in [1, N], with h_t = w_{t-n+1}, ..., w_{t-1},

where N is the size of the vocabulary and h_t denotes the n-1 word context. The input uses the
so-called 1-of-n coding, i.e., the i-th
word of the vocabulary is coded by setting the i-th element of the vector to 1 and
all the other elements to 0. The i-th row of the N x P dimensional projection matrix corresponds to the continuous representation of the i-th word. The hidden layer
uses hyperbolic tangent activation functions and the outputs are
softmax normalized. Training is performed with stochastic back-propagation using cross-entropy as the error function and a weight-decay regularization term. The neural network minimizes
the perplexity on the training data and learns the projection of the
words onto the continuous space that is best for
the probability estimation task.
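To make this concrete, the following is a minimal NumPy sketch of the forward pass and training criterion just described. The dimensions N, P and H, the variable names, and the random initialization are illustrative assumptions, not the values or code of the original system.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the values used in the paper)
N = 2000   # vocabulary (or shortlist) size
P = 50     # dimension of the continuous word representation
H = 500    # hidden layer size
n = 4      # order of the language model (4-gram: 3 words of context)

rng = np.random.default_rng(0)
R  = rng.normal(0, 0.01, (N, P))            # N x P projection matrix (row i = word i)
W1 = rng.normal(0, 0.01, ((n - 1) * P, H))  # projection layer -> hidden layer
b1 = np.zeros(H)
W2 = rng.normal(0, 0.01, (H, N))            # hidden layer -> output layer
b2 = np.zeros(N)

def forward(context):
    """P(w_t = i | h_t) for all i in [1, N]; `context` holds the n-1 previous word indices."""
    # The 1-of-n coding followed by the projection is just a row lookup in R.
    x = np.concatenate([R[w] for w in context])   # continuous representation of the context
    h = np.tanh(x @ W1 + b1)                      # hidden layer, hyperbolic tangent activations
    o = h @ W2 + b2
    o -= o.max()                                  # numerical stability
    p = np.exp(o)
    return p / p.sum()                            # softmax-normalized posterior probabilities

def loss(context, target, decay=1e-5):
    """Cross-entropy error for one example plus a weight-decay regularization term,
    as would be minimized by stochastic back-propagation."""
    ce = -np.log(forward(context)[target])
    wd = decay * sum((M ** 2).sum() for M in (R, W1, W2))
    return ce + wd

print(loss([12, 7, 345], target=42))
```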
Several improvements of this basic architecture have been developed in order to achieve fast training and recognition [2]: 1) Lattice rescoring: decoding is done with a standard back-off LM and a lattice is generated; the neural network LM is then used to rescore the lattice. 2) Shortlists: the neural network is only used to predict the LM probabilities of a subset of the whole vocabulary (the 2k-12k most frequent words, depending on the language). 3) Regrouping: all LM probability requests in one lattice are collected and sorted, so that all requests with the same context h_t lead to only one forward pass through the neural network. 4) Block mode: several examples are propagated at once through the neural network, allowing the use of faster matrix/matrix operations.
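The sketch below illustrates how shortlists, regrouping and block mode fit together during lattice rescoring. The function names (`forward_batch`, `backoff_prob`), the data layout, and the simplified handling of out-of-shortlist words are assumptions for illustration, not the original implementation.

```python
from collections import defaultdict
import numpy as np

def rescore_lattice_requests(requests, forward_batch, backoff_prob, shortlist_size=2000):
    """Return P(word | context) for every request collected from one lattice.

    requests      : list of (context, word) pairs, contexts as tuples of word indices
    forward_batch : function mapping an array of contexts (one row per context)
                    to softmax posteriors over the shortlist (block mode)
    backoff_prob  : fallback LM probability function for words outside the shortlist
    """
    # Regrouping: collect and sort the requests so that each distinct context h_t
    # triggers exactly one forward pass through the neural network.
    by_context = defaultdict(list)
    for context, word in requests:
        by_context[tuple(context)].append(word)

    contexts = sorted(by_context)
    # Block mode: propagate all distinct contexts through the network at once,
    # so the computation reduces to fast matrix/matrix operations.
    posteriors = forward_batch(np.array(contexts))   # shape: (#contexts, shortlist_size)

    probs = {}
    for ctx, row in zip(contexts, posteriors):
        for word in by_context[ctx]:
            if word < shortlist_size:
                # Shortlist: the network predicts only the most frequent words.
                probs[(ctx, word)] = row[word]
            else:
                # Out-of-shortlist word: fall back to the standard back-off LM.
                probs[(ctx, word)] = backoff_prob(ctx, word)
    return probs
```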
Training of a typical network with a hidden layer with 500 nodes and
a shortlist of length 2000 (about one million parameters) takes less
than one hour for one epoch through four million examples on a standard
PC. When large amounts of training data are available a
resampling algorithm is used [3]. Lattice
rescoring with the neural network LM is also very fast, typically less than 0.1xRT, which allows the new approach to be used even in a fast real-time speech recognition system [4].
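As an illustration of resampled training on large corpora, the sketch below draws a different random subset of the training examples for each epoch. The function name, the uniform sampling scheme and the fixed subset size are assumptions, not the exact algorithm of [3].

```python
import numpy as np

def resampled_epochs(corpus_ngrams, n_epochs, sample_size, seed=0):
    """Yield a different random subset of the training examples for each epoch
    (simplified illustration of resampled training on very large corpora)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.choice(len(corpus_ngrams), size=sample_size, replace=False)
        yield [corpus_ngrams[i] for i in idx]

# Usage sketch: run one epoch of stochastic back-propagation on each subset.
# for subset in resampled_epochs(all_ngrams, n_epochs=10, sample_size=4_000_000):
#     train_one_epoch(subset)   # hypothetical training routine
```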
The neural network language model has been thoroughly evaluated on several speech recognition tasks and languages: English broadcast news and conversations in the context of the NIST/DARPA evaluations, French broadcast news in the context of the Technolangue/ESTER evaluation, Spanish and English speeches of the European Parliament, and English lectures and seminars in the context of the European projects TC-STAR and CHIL, respectively. These cover different speaking styles, ranging from rather well-formed speech with few errors (broadcast news) to very relaxed speaking with many errors in syntax and semantics (meetings and conversations). The three languages also exhibit particular properties; French, for instance, has a large number of homophones among inflected forms that can only be distinguished by a language model.
Table 1 summarizes the performance of the new approach. In all cases the neural network LM achieved significant reductions in word error rate of up to 1.6% absolute with respect to a carefully optimized back-off 4-gram LM. It is worth recalling that the baseline speech recognizers were all state-of-the-art, as demonstrated by their good ranking in international evaluations. It therefore seems reasonable to consider the neural network LM a serious alternative to the commonly used back-off language models.
| Task and Language | #words (back-off LM) | Werr (back-off LM) | #words (neural LM) | Werr (neural LM) |
|---|---|---|---|---|
| English conversations | 948M | 16.0% | 27.3M | 15.5% |
| English broadcast news | 1878M | 9.6% | 27M | 9.2% |
| French broadcast news | 614M | 10.7% | 600M | 10.2% |
| English Parliament speeches | 33M | 12.1% | 33M | 11.3% |
| Spanish Parliament speeches | 34M | 10.6% | 34M | 10.0% |
| English lectures and seminars | 36M | 26.0% | 36M | 24.4% |
Table 1: Performance of the neural network language model in comparison to a 4-gram back-off LM (modified Kneser-Ney smoothing) for different tasks and languages (#words gives the size of the LM training corpus, Werr is the word error rate).
Despite these good results, we believe that the full potential of language modeling in the continuous domain is far from being attained. Future work will concentrate on (word) error corrective training algorithms such as AdaBoost and on training criteria that directly minimize the estimated word error when the LM is used in a speech recognizer.
We are also interested in unsupervised language model adaptation. The continuous representation of the words in the neural network language model offers new ways to perform constrained LM adaptation. For instance, the continuous representation of the words can be changed so that the LM predictions improve on some adaptation data, e.g., by moving words that frequently appear in similar contexts closer together. The idea is to apply a transformation to the continuous representation of the words by adding an adaptation layer between the projection layer and the hidden layer.
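A minimal sketch of this adaptation idea, reusing the notation of the earlier forward-pass sketch: a linear adaptation layer (here the matrix A and bias a, initialized to the identity transform) is inserted between the projection layer and the hidden layer, and only this transform would be re-estimated on the adaptation data. The names and the exact form of the transform are illustrative assumptions.

```python
import numpy as np

P = 50                 # dimension of the continuous word representation (as in the earlier sketch)
A = np.eye(P)          # adaptation matrix, initialized to the identity (no change before adaptation)
a = np.zeros(P)        # adaptation bias

def adapted_forward(context, R, W1, b1, W2, b2):
    """Forward pass with a linear adaptation layer between the projection layer
    and the hidden layer. Only A and a would be re-estimated on adaptation data;
    the rest of the network stays fixed."""
    # Transform each continuous word representation before the hidden layer.
    x = np.concatenate([R[w] @ A + a for w in context])
    h = np.tanh(x @ W1 + b1)
    o = h @ W2 + b2
    o -= o.max()                      # numerical stability
    p = np.exp(o)
    return p / p.sum()                # softmax-normalized posterior probabilities
```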