Continuous Space Language Models

Holger Schwenk, Jean-Luc Gauvain

Objective

Language models play an important role in many applications such as character and speech recognition, machine translation, and information retrieval. The dominant approach, at least in large vocabulary continuous speech recognition, is the n-gram back-off language model (LM). In this research a new approach is presented that is based on the estimation of the LM probabilities in a continuous space. The technique was evaluated in a large vocabulary continuous speech recognizer on several tasks and languages and consistently achieved significant reductions in word error rate, up to 1.6% absolute, with respect to a carefully optimized back-off 4-gram LM.

Description

We describe here the application of a new approach that uses a neural network to estimate the LM posterior probabilities [1].  The basic idea is to project the word indices onto a continuous space and to use a probability estimator operating on this space.  Since the resulting probability functions are smooth functions of the word representation, better generalization to unseen n-grams can be expected.  A neural network is used to simultaneously learn the projection of the words onto the continuous space and the n-gram probability estimation.  This is still an n-gram approach, but the LM posterior probabilities are "interpolated" for any possible context of length n-1 instead of backing off to shorter contexts.
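
For comparison, a back-off LM of the kind used as baseline here estimates the probability of an n-gram that was not observed in the training data by recursively falling back on shorter contexts (standard formulation; $P_d$ denotes a discounted relative-frequency estimate and $\alpha$ the back-off weight):

\begin{displaymath}
P(w_j \vert h_j) = \left\{ \begin{array}{ll}
P_d(w_j \vert w_{j-n+1}, ..., w_{j-1}) & \mbox{if the $n$-gram was observed} \\
\alpha(w_{j-n+1}, ..., w_{j-1}) \; P(w_j \vert w_{j-n+2}, ..., w_{j-1}) & \mbox{otherwise}
\end{array} \right.
\end{displaymath}

The neural network LM, in contrast, always evaluates a smooth function of the full context of n-1 words, which is what "interpolation" refers to above.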

Architecture of the neural network LM

FIGURE 1: Architecture of the neural network language model.


The architecture of the neural network n-gram LM is shown in Figure 1.  A standard fully-connected multi-layer perceptron is used.  The inputs to the neural network are the indices of the n-1 previous words in the vocabulary $h_j=w_{j-n+1}, ..., w_{j-2}, w_{j-1}$ and the outputs are the posterior probabilities of all words of the vocabulary:

\begin{displaymath}
P(w_j = i \vert h_j) \hspace{1cm}\forall i \in [1,N]
\end{displaymath}

where N is the size of the vocabulary.  The input uses the so-called 1-of-N coding, i.e., the i-th word of the vocabulary is coded by setting the i-th element of the vector to 1 and all other elements to 0.  The i-th row of the $N \times P$ projection matrix corresponds to the continuous representation of the i-th word.  The hidden layer uses hyperbolic tangent activation functions and the outputs are softmax normalized.  Training is performed with the stochastic back-propagation algorithm, using a cross-entropy error function and a weight-decay regularization term.  The neural network minimizes the perplexity of the training data and learns the projection of the words onto the continuous space that is best suited for the probability estimation task.
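
The following sketch is our own illustration in plain NumPy, not the original implementation; all dimensions and parameter names (R, W1, b1, W2, b2) are placeholders.  It computes one forward pass of the architecture described above: a table look-up in the N x P projection matrix for each of the n-1 context words, a tanh hidden layer, and a softmax output over the vocabulary.

\begin{verbatim}
import numpy as np

# Placeholder dimensions: vocabulary, projection, hidden sizes, n-gram order.
N, P, H, n = 2000, 100, 500, 4
rng = np.random.default_rng(0)

R  = rng.normal(scale=0.01, size=(N, P))        # projection matrix, one row per word
W1 = rng.normal(scale=0.01, size=((n-1)*P, H))  # projection -> hidden weights
b1 = np.zeros(H)
W2 = rng.normal(scale=0.01, size=(H, N))        # hidden -> output weights
b2 = np.zeros(N)

def forward(context):
    """context: indices of the n-1 previous words; returns P(w_j = i | h_j) for all i."""
    # 1-of-N coding followed by multiplication with R amounts to a row look-up.
    x = np.concatenate([R[w] for w in context])  # continuous representation of h_j
    h = np.tanh(x @ W1 + b1)                     # hidden layer
    o = h @ W2 + b2
    o -= o.max()                                 # numerical stability
    p = np.exp(o)
    return p / p.sum()                           # softmax: posterior over the vocabulary

probs = forward([12, 7, 42])      # arbitrary word indices for a 4-gram context
print(probs.shape, probs.sum())   # (2000,) 1.0
\end{verbatim}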

Several improvements of this basic architecture have been developed in order to achieve fast training and recognition [2]: 1) Lattice rescoring: decoding is done with a standard back-off LM and a lattice is generated; the neural network LM is then used to rescore the lattice.  2) Shortlists: the neural network is only used to predict the LM probabilities of a subset of the whole vocabulary (the 2k-12k most frequent words, depending on the language).  3) Regrouping: all LM probability requests in one lattice are collected and sorted, so that all requests with the same context $h_j$ lead to only one forward pass through the neural network.  4) Block mode: several examples are propagated at once through the neural network, allowing the use of faster matrix/matrix operations.
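
To make points 3) and 4) concrete, the sketch below (again our own illustration, reusing R, W1, b1, W2, b2 and the dimensions from the previous example; the function names are hypothetical) groups the probability requests collected from a lattice by their context, propagates each distinct context only once, and processes the distinct contexts in blocks with matrix/matrix operations.

\begin{verbatim}
def forward_block(contexts):
    """Block mode: propagate several contexts at once, one matrix/matrix product per layer."""
    X = np.stack([np.concatenate([R[w] for w in ctx]) for ctx in contexts])  # (B, (n-1)*P)
    Hh = np.tanh(X @ W1 + b1)                                                # (B, H)
    O = Hh @ W2 + b2
    O -= O.max(axis=1, keepdims=True)
    Pm = np.exp(O)
    return Pm / Pm.sum(axis=1, keepdims=True)                                # (B, N)

def rescore_requests(requests, block_size=128):
    """requests: list of (context, word) pairs collected from the whole lattice.
    Word indices are assumed to lie within the shortlist of size N."""
    by_context = {}                                # regrouping: one forward pass per context
    for ctx, w in requests:
        by_context.setdefault(tuple(ctx), []).append(w)
    contexts = list(by_context)
    probs = {}
    for i in range(0, len(contexts), block_size):  # block mode: several contexts at once
        block = contexts[i:i + block_size]
        P_block = forward_block(block)
        for ctx, row in zip(block, P_block):
            for w in by_context[ctx]:
                probs[(ctx, w)] = row[w]
    return probs
\end{verbatim}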

Training a typical network with 500 hidden nodes and a shortlist of length 2000 (about one million parameters) takes less than one hour per epoch through four million examples on a standard PC.  When large amounts of training data are available, a resampling algorithm is used [3].  Lattice rescoring with the neural network LM is also very fast, typically less than 0.1xRT, allowing the new approach to be used even in a real-time speech recognition system [4].
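
The resampling idea can be outlined schematically as follows: at each epoch the network is trained on the in-domain data together with a freshly drawn random subset of the much larger corpora.  The sketch below is only such an outline; the corpus split, subset size and sampling scheme shown are placeholders, not the exact procedure of [3].

\begin{verbatim}
import random

def training_stream(in_domain, large_corpus, subset_size, num_epochs, seed=0):
    """Schematic epoch loop: each epoch uses all in-domain examples plus a freshly
    resampled random subset of the much larger corpus (cf. [3])."""
    rnd = random.Random(seed)
    for epoch in range(num_epochs):
        epoch_data = list(in_domain) + rnd.sample(list(large_corpus), subset_size)
        rnd.shuffle(epoch_data)        # examples are (context, target word) pairs
        yield epoch, epoch_data
\end{verbatim}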

Results and prospects

The neural network language model has been thoroughly evaluated on several speech recognition tasks and languages: English broadcast news and conversations in the context of the NIST/DARPA evaluations, French broadcast news in the context of the Technolangue/ESTER evaluation, Spanish and English speeches of the European Parliament, and English lectures and seminars in the context of the European projects TC-STAR and CHIL, respectively. These tasks cover different speaking styles, ranging from rather well-formed speech with few errors (broadcast news) to very relaxed speech with many errors in syntax and semantics (meetings and conversations). The three languages also have particular properties; French, for instance, has a large number of homophones among its inflected forms that can only be distinguished by a language model.

Table 1 summarizes the performance of the new approach.  In all cases the neural network LM achieved significant reductions in word error rate, up to 1.6% absolute, with respect to a carefully optimized back-off 4-gram LM. It is worth recalling that the baseline speech recognizers were all state of the art, as demonstrated by their good rankings in international evaluations. It therefore seems reasonable to say that the neural network LM can be considered a serious alternative to the commonly used back-off language models.



                                 Back-off LM            Neural LM
  Task and Language              #words    Werr         #words    Werr
  English conversations           948M     16.0%         27.3M    15.5%
  English broadcast news         1878M      9.6%           27M     9.2%
  French broadcast news           614M     10.7%          600M    10.2%
  English Parliament speeches      33M     12.1%           33M    11.3%
  Spanish Parliament speeches      34M     10.6%           34M    10.0%
  English lectures and seminars    36M     26.0%           36M    24.4%

Table 1: Performance of the neural network language model compared to a 4-gram back-off LM (modified Kneser-Ney smoothing) for different tasks and languages (#words gives the size of the LM training corpus, Werr is the word error rate).


Despite these good results, we believe that the full potential of language modeling in the continuous domain has not yet been reached.  Future work will concentrate on error-corrective training algorithms such as AdaBoost and on training criteria that directly minimize the estimated word error rate when the LM is used in a speech recognizer.

We are also interested in unsupervised language model adaptation.  The continuous representation of the words in the neural network language model offers new ways to perform constrained LM adaptation.  For instance, the continuous representation of the words can be changed so that the LM predictions improve on some adaptation data, e.g., by moving closer together words that often appear in similar contexts.  The idea is to apply a transformation to the continuous representation of the words by adding an adaptation layer between the projection layer and the hidden layer.
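
One way such an adaptation layer could look, reusing the parameters of the first code sketch, is shown below; the linear transform A, its identity initialization and the bias bA are our assumptions about one natural realization, not a description of an implemented system.

\begin{verbatim}
# Adaptation layer: a linear transform applied to each projected word vector, inserted
# between the projection layer and the hidden layer.  Only A and bA would be updated
# on the adaptation data; R, W1, b1, W2, b2 stay fixed.
A  = np.eye(P)          # identity initialization, i.e. no change before adaptation
bA = np.zeros(P)

def forward_adapted(context):
    x = np.concatenate([R[w] @ A + bA for w in context])  # transformed word representations
    h = np.tanh(x @ W1 + b1)
    o = h @ W2 + b2
    o -= o.max()
    p = np.exp(o)
    return p / p.sum()
\end{verbatim}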

References



[1] Y. Bengio and R. Ducharme. A neural probabilistic language model. In Neural Information Processing Systems, volume 13, pages 932-938, 2001.

[2] H. Schwenk. Efficient training of large neural networks for language modeling. In International Joint Conference on Neural Networks, pages 3059-3062, 2004.

[3] H. Schwenk and J.-L. Gauvain. Training neural network language models on very large corpora. In Joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 201-208, 2005.

[4] H. Schwenk and J.-L. Gauvain. Building continuous space language models for transcribing European languages. In Interspeech, pages 737-740, 2005.