Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
 

Spoken Language Processing Group (TLP)

Translation and machine learning


François Yvon, Gilles Adda, Alexandre Allauzen, Marianna Apidianaki, Hélène Bonneau-Maynard, Souhir Gahbiche-Braham, Quoc Khanh Do, Li Gong, Nicolas Pécheux, Guillaume Wisniewski, Yong Xu
Research activities in this theme focus on testing, adapting and specializing proven statistical Machine Learning algorithms for the peculiarities of Natural Language and Speech Processing data and problems. The main testbed is a complete application, Machine Translation, which involves many intermediate sub-tasks (part-of-speech (POS) tagging, chunking, named-entity recognition (NER), etc.) that can also be approached with ML tools. Besides their intrinsic complexity, these problems require dealing with (i) very large and (ii) heterogeneous datasets, containing both (iii) annotated and non-annotated data; furthermore, linguistic data is often (iv) structured and can be described by (v) myriads of linguistic features, involving (vi) complex statistical dependencies. These are the six main scientific challenges being addressed. However, contrary to many teams working in this lively domain, improving the current state of the art in Machine Translation, as measured in international evaluation campaigns, is also a major objective; hence the need to develop and maintain our own Machine Translation software.

Statistical Machine Translation (SMT)

Statistical Machine Translation systems rely on the statistical analysis of large bilingual corpora to train stochastic models describing the mapping between a source language (SL) and a target language (TL). In their simplest form, these models express probabilistic associations between source and target strings of words, as initially formulated in the famous IBM models in the early nineties. More recently, these models have been extended to more complex representations (e.g. chunks, trees, or dependency structures) and to probabilistic mappings between these representations. Translation models are typically trained on parallel corpora containing examples of source texts aligned with their translation(s), where the alignment is defined at a sub-sentential level.
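
To make the word-based approach concrete, here is a minimal sketch of IBM Model 1 training with expectation-maximization, the simplest of the IBM models mentioned above. The toy corpus, the number of iterations and all variable names are illustrative assumptions, not LIMSI code.

```python
from collections import defaultdict

# Toy parallel corpus: (source, target) sentence pairs.
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["une", "maison"], ["a", "house"]),
]

src_vocab = {w for s, _ in corpus for w in s}
t = defaultdict(lambda: 1.0 / len(src_vocab))  # lexical table t(f | e), uniform start

for _ in range(10):  # EM iterations (illustrative count)
    counts = defaultdict(float)  # expected counts c(f, e)
    totals = defaultdict(float)  # marginals per target word e
    # E-step: distribute each source word's mass over candidate target words.
    for src, tgt in corpus:
        for f in src:
            norm = sum(t[(f, e)] for e in tgt)
            for e in tgt:
                p = t[(f, e)] / norm
                counts[(f, e)] += p
                totals[e] += p
    # M-step: re-estimate the lexical table from the expected counts.
    for (f, e), c in counts.items():
        t[(f, e)] = c / totals[e]

print(round(t[("maison", "house")], 3))  # converges toward 1.0
```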

In this context, LIMSI is developing its research activities in several directions: from the design of word and phrase alignment models to the conception of novel translation and language models, and from the exploration of new training and tuning methodologies to the development of new decoding strategies. All these innovations need to be properly evaluated and diagnosed, and significant effort is devoted to the vexing issue of measuring the quality of MT output. These research activities have been published in a number of international conferences and journals. Finally, LIMSI is involved in a number of national and international projects.

Regarding alignment models, the most recent work deals with the design and training of discriminative alignment techniques (Allauzen & Wisniewski, 2009; Tomeh et al, 2010, 2011a, 2011b, 2012) in order to improve both word alignment and phrase extraction. Alternative alignment techniques, based on statistical association measures between phrases, are explored in (Lardilleux et al, 2011a, 2011b, 2012, 2013).
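
As an illustration of association-based alignment, the following sketch scores word pairs with the Dice coefficient, one simple statistical association measure. The cited work operates on phrases and uses its own measures, so the corpus, the measure and all names here are assumptions for illustration only.

```python
from collections import Counter
from itertools import product

corpus = [
    (["le", "chat", "dort"], ["the", "cat", "sleeps"]),
    (["le", "chien", "dort"], ["the", "dog", "sleeps"]),
]

src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
for src, tgt in corpus:
    src_count.update(set(src))
    tgt_count.update(set(tgt))
    pair_count.update(product(set(src), set(tgt)))  # sentence-level co-occurrence

def dice(f, e):
    """Dice coefficient: 2 * cooc(f, e) / (count(f) + count(e))."""
    return 2 * pair_count[(f, e)] / (src_count[f] + tgt_count[e])

print(dice("chat", "cat"))     # 1.0: the pair always co-occurs
print(dice("chat", "sleeps"))  # ~0.67: 'sleeps' also occurs elsewhere
```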

LIMSI's decoder, N-code, belongs to the class of n-gram based systems. In this approach, translation is a two-step process: an input SL sentence is first non-deterministically reordered, yielding a large word lattice of possible reorderings; this lattice is then translated monotonically using a bilingual n-gram model. As in the more standard phrase-based approach, hypotheses are scored by several probabilistic models, the weights of which are discriminatively optimized with minimum error rate training. Recent evolutions of this approach are described in (Crego & Yvon, 2009, 2010a, 2010b). This system is now released as open source software (Crego & Yvon, 2011); an online demo is also available. As an alternative training strategy, a CRF-based translation model (Lavergne et al, 2011) has recently been proposed, which builds on our in-house CRF toolkit (Lavergne et al, 2010). We are also exploring more agile and adaptive approaches to training for MT, in which the model parameters are computed on the fly (Li et al, 2012).
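
The bilingual n-gram idea can be illustrated as follows: a sentence pair is decomposed into a sequence of "tuples" (source segment, target segment), and the sequence is scored with an ordinary n-gram model over tuple units. This is a hedged toy sketch, not N-code itself; all probabilities and names are made up for illustration.

```python
import math

# A (reordered) source sentence decomposed into bilingual tuples.
tuples = [("la", "the"), ("maison", "house"), ("bleue", "blue")]

# Illustrative bigram probabilities over tuple units, P(u_i | u_{i-1}).
bigram = {
    (("<s>", "<s>"), ("la", "the")): 0.4,
    (("la", "the"), ("maison", "house")): 0.3,
    (("maison", "house"), ("bleue", "blue")): 0.2,
}

def score(seq, lm):
    """Log-probability of a tuple sequence under a bigram tuple model."""
    logp, prev = 0.0, ("<s>", "<s>")
    for u in seq:
        logp += math.log(lm.get((prev, u), 1e-6))  # crude floor for unseen bigrams
        prev = u
    return logp

print(score(tuples, bigram))  # log(0.4) + log(0.3) + log(0.2)
```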

LIMSI's activities are not restricted to these core modules, and many other aspects of SMT are also investigated, such as "tuning" (Sokolov & Yvon, 2011), multi-source machine translation (Crego et al, 2010a, 2010b), diagnostic evaluation of MT, notably via the computation of oracle scores (Max et al, 2010; Wisniewski et al, 2010, 2013; Sokolov et al, 2012), confidence estimation (Zhang et al, 2012), word sense disambiguation for MT (Sokolov et al, 2012; Apidianaki et al, 2012), extraction of parallel sentences from comparable corpora (Gahbiche-Braham et al, 2011), and sentence alignment (Yu et al, 2012a, 2012b).
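
For instance, a basic oracle computation searches an n-best list for the hypothesis closest to the reference under some sentence-level metric, which bounds what any rescoring of that list could achieve. In this sketch a simple unigram F-measure stands in for the metrics actually used in the cited work, and the data is illustrative.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Sentence-level unigram F-measure between hypothesis and reference."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

nbest = ["the blue house", "a blue house", "the house blue"]
reference = "the blue house"

# The oracle hypothesis bounds what rescoring this n-best list can achieve.
oracle = max(nbest, key=lambda h: unigram_f1(h, reference))
print(oracle, unigram_f1(oracle, reference))
```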

LIMSI's MT systems have taken part in several international MT evaluation campaigns. This includes yearly participation in the WMT evaluation series (2006-2013), where LIMSI has consistently ranked amongst the top systems, especially when translating from and into French. We have also taken part in the 2009 NIST MT evaluation on the Arabic-English task, as well as in the 2010 and 2011 IWSLT evaluations on speech translation.

Machine Learning

LIMSI's activities in the area of Machine Learning bridge the gap between Machine Translation and Machine Learning: on the one hand, MT is a difficult application which provides a realistic testbed for many ML innovations; on the other hand, the development of efficient, large-scale MT systems poses problems whose solutions can also be used in other contexts or give rise to generic solutions.

A major achievement is the development of Wapiti, an open source package for linear-chain Conditional Random Fields (CRFs) tailored to very large scale tasks (Lavergne et al, 2010). Owing to a very careful implementation of the core routines (gradient computation and optimization procedure) and to the selection of very sparse models through l1 regularization, allied with a very expressive language for describing feature patterns, this package is able to handle very large feature sets (up to billions of features), very large label sets (up to hundreds of labels), and very large datasets (up to millions of instances). This software has achieved state-of-the-art performance on many NLP tasks (grapheme-to-phoneme conversion, POS tagging, named-entity recognition, etc.) in a variety of languages.
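
The role of l1 regularization in producing sparse models can be seen in the proximal (soft-thresholding) step sketched below, which drives small weights exactly to zero. This is a generic one-step illustration, not Wapiti's actual optimizer; the arrays and hyper-parameters are assumptions.

```python
import numpy as np

def prox_l1(w, step, rho1):
    """Soft-thresholding: proximal operator of step * rho1 * ||w||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - step * rho1, 0.0)

# One proximal-gradient step on weights w with loss gradient g.
w = np.array([0.80, -0.02, 0.01, -0.50])
g = np.array([0.10, 0.01, -0.02, 0.05])
step, rho1 = 0.1, 0.3

w = prox_l1(w - step * g, step, rho1)
print(w)  # near-zero coordinates are pruned exactly to 0.0
```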

Another recent achievement is the development of original architectures for training and using very large neural networks with millions of neurons on their output layer. These are especially useful in the context of Neural Network Language Models (NNLMs), a theme to which LIMSI has been contributing since (Gauvain & Schwenk, 2002) and which has provided consistent performance improvements in many tasks. The recent work of (Le et al, 2010, 2011, 2013) has led to the development of the first NNLMs capable of predicting very large output vocabularies and of taking advantage of large contexts (up to 10-grams). These models have been successfully used to rescore n-best lists for speech recognition and machine translation. This work has been generalized to Neural Network Translation Models (Lavergne et al, 2011; Le et al, 2012), which are even more demanding in terms of their output vocabulary (bilingual segments).
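
The key trick that makes very large output vocabularies tractable is to factor the softmax over word classes, so that P(w | h) = P(class(w) | h) * P(w | class(w), h): only one class distribution and one within-class distribution must be normalized per prediction, instead of the full vocabulary. The sketch below shows a simplified one-level version of this factorization; the structured output layer of the cited work is hierarchical, and all shapes and the class assignment here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, C, H = 100_000, 100, 128          # vocabulary, classes, hidden size (assumptions)
word2class = rng.integers(0, C, V)   # fixed word -> class assignment

W_class = rng.normal(size=(C, H)) * 0.01   # class-layer weights
W_word = rng.normal(size=(V, H)) * 0.01    # within-class word-layer weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def log_prob(word, h):
    """log P(word | h) = log P(class | h) + log P(word | class, h)."""
    c = word2class[word]
    p_class = softmax(W_class @ h)[c]
    members = np.flatnonzero(word2class == c)   # normalize only within the class
    p_word = softmax(W_word[members] @ h)[np.searchsorted(members, word)]
    return np.log(p_class) + np.log(p_word)

h = rng.normal(size=H)  # context representation produced by the network
print(log_prob(42, h))  # cost: C + |class| scores instead of V
```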
