Open Source

An explicit goal of TC-STAR is the development and release of open source software for Spoken Language Translation and Text-to-Speech Synthesis. This page presents open source software that was developed and made available by TC-STAR partners.

Confusion Network Decoder for Moses

The effective integration of ASR and MT has been a major research goal of TC-STAR, and the major challenge was "the effective extension of current statistical MT models to account for multiple sentence hypotheses produced by the ASR algorithm." During the first year of the project, ITC-irst developed a novel decoding algorithm able to efficiently translate word graphs. To make this result publicly available, a more efficient version was jointly developed by ITC-irst and RWTH and integrated into an open source toolkit called Moses. The development of Moses started during the 2006 JHU Summer Workshop named Open Source Toolkit for Statistical Machine Translation (10 July - 19 August).

Moses is a full-fledged statistical MT package featuring a phrase-based beam-search decoder and a log-linear model combining lexicon models, language models, (lexicalized) reordering models, and word and phrase penalties. Moses readily supports a factored representation of words (surface forms, lemma, part-of-speech, morphology, word classes, ...) and the processing of ambiguous input represented by confusion networks.
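As an illustration of the log-linear combination mentioned above, the following is a minimal sketch in Python and not code from Moses. The feature names, feature values and weights are hypothetical; the only point is that each hypothesis is scored by a weighted sum of log-domain feature values and the highest-scoring hypothesis wins.

```python
def log_linear_score(features, weights):
    """Weighted sum of log-domain feature values for one translation hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical weights and feature values (log-probabilities and a word count).
weights = {"tm": 1.0, "lm": 0.5, "reordering": 0.3, "word_penalty": -0.2}
hypotheses = {
    "the house is small": {"tm": -2.0, "lm": -4.0, "reordering": -0.5, "word_penalty": 4},
    "the home is little": {"tm": -2.5, "lm": -6.0, "reordering": -0.5, "word_penalty": 4},
}

# The decoder would keep the hypothesis with the highest combined score.
best = max(hypotheses, key=lambda h: log_linear_score(hypotheses[h], weights))
best_score = log_linear_score(hypotheses[best], weights)
```

In a real system the weights are tuned on held-out data and the search runs over partial hypotheses inside the beam, which this sketch leaves out.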

TC-STAR researchers contributed by extending the translation process to confusion networks and by providing an efficient language model library (see IRST LM Toolkit). Since the JHU Workshop, significant work has been carried out (and still continues) to finalize the toolkit, test and debug the algorithms, prepare user manuals, and organize the maintenance and distribution of the toolkit.
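For readers unfamiliar with confusion networks, a minimal Python sketch of the data structure follows. It assumes the usual picture of a confusion network as a sequence of columns, each holding alternative words with posterior probabilities, with the empty string standing for the empty word. The words and probabilities are invented for illustration, and this is not the input format used by Moses.

```python
from math import prod

# Each column lists alternative words with their posterior probabilities.
# The empty string "" stands for the empty word (the slot may be skipped).
confusion_network = [
    [("i", 0.9), ("eye", 0.1)],
    [("have", 0.6), ("had", 0.3), ("", 0.1)],
    [("a", 0.7), ("", 0.3)],
    [("cat", 0.8), ("cut", 0.2)],
]


def best_path(cn):
    """Pick the most probable entry of every column and drop empty words."""
    picks = [max(column, key=lambda entry: entry[1]) for column in cn]
    words = [word for word, _ in picks if word]
    return words, prod(prob for _, prob in picks)


def path_count(cn):
    """Number of distinct paths, i.e. one choice per column."""
    return prod(len(column) for column in cn)


words, probability = best_path(confusion_network)
```

Even this tiny network encodes 24 paths, which is why a decoder that translates the network directly is preferable to translating each path separately.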

Moses can be downloaded for free under a GNU Lesser General Public Licence from the project homepage.

MARIE Decoder

This package consists of an n-gram-based statistical machine translation decoder, intended to be helpful to the research community in the field of statistical MT. It has been developed at the TALP Research Center of the Universitat Politecnica de Catalunya. The decoder follows a beam search strategy implementing a log-linear combination of different feature functions. It allows for a tight coupling between reordering and decoding through the use of input word graphs, which enables reordering at a very low efficiency cost, and it provides n-best translations by means of output graphs, to be used in further re-scoring work. The decoder's key feature is a translation model estimated as a standard n-gram language model, which introduces word context into the translation model. In addition to the standard translation and target language models, it allows applying models that suffer less from data sparseness (lemma, part-of-speech, morphology, word classes, ...) in order to gain generalization power.

MARIE can be downloaded free of charge under the GNU General Public License from the project homepage.

Lingua-AlignmentSet Tool

Most statistical machine translation systems feature lexicon models, whose estimation usually relies on word-aligned parallel corpora. Improving word alignment is therefore fundamental to improving overall translation quality, and software for handling and evaluating alignment sets is crucial for the community. The Lingua-AlignmentSet distribution is a Perl tools library (with command-line utilities) developed at TALP, a Research Center of the Universitat Politecnica de Catalunya, to handle an Alignment Set, i.e. a set of sentences aligned at the word (or phrase) level. It provides methods to display the links, to apply a function to each alignment of the set, to evaluate the alignments against a reference, and more. One of the objectives of the module is to allow the user to perform all these operations without having to deal with the particular physical format of the Alignment Set. It also provides format conversion methods.
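To make "evaluating the alignments against a reference" concrete, here is a minimal Python sketch of precision, recall and Alignment Error Rate (AER), a measure commonly used in the word alignment literature. Whether Lingua-AlignmentSet computes exactly these quantities is an assumption, and the function name and link representation below are hypothetical; the library itself is written in Perl.

```python
def alignment_scores(hypothesis, sure, possible):
    """Precision, recall and AER for sets of (source_index, target_index) links.

    `sure` holds unambiguous reference links, `possible` holds optional ones.
    """
    possible = possible | sure  # every sure link also counts as possible
    hits_sure = len(hypothesis & sure)
    hits_possible = len(hypothesis & possible)
    precision = hits_possible / len(hypothesis)
    recall = hits_sure / len(sure)
    aer = 1 - (hits_sure + hits_possible) / (len(hypothesis) + len(sure))
    return precision, recall, aer


# Invented example: four hypothesized links scored against a small reference.
sure_links = {(0, 0), (1, 1), (2, 2)}
possible_links = {(2, 3)}
hypothesis_links = {(0, 0), (1, 1), (2, 3), (3, 2)}

precision, recall, aer = alignment_scores(hypothesis_links, sure_links, possible_links)
```

Representing links as index pairs keeps the evaluation independent of the file format in which the alignments are stored, which matches the stated goal of the module.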

Lingua-AlignmentSet can be downloaded free of charge under the GNU General Public License from the project homepage.

IRST LM Toolkit

Statistical machine translation, as well as other areas of human language processing, has recently pushed toward the use of huge n-gram language models. The availability of larger and larger text corpora is stressing the need for efficient data structures and algorithms to estimate, store and access LMs. Unfortunately, the rate of progress in computer technology currently seems to lag behind the space requirements of such huge LMs, at least when considering standard lab equipment and standard software such as the well-known SRILM Toolkit. To overcome this issue, a new software toolkit has been developed at ITC-irst. The IRST LM toolkit consists of a C++ library and several scripts to handle huge LMs.

The toolkit includes the following features: collection of n-grams and their frequency counters; estimation of smoothing parameters for each n-gram level; pruning of infrequent n-grams; estimation of probabilities (and back-off weights) of n-grams. LMs can be stored in both text and binary format. Memory usage and access time are optimized by means of quantization of n-gram probabilities; parallel training; caching of probabilities; and on-demand access to probabilities on disk through the memory-mapping service offered by several operating systems. Memory mapping makes it possible to share the same address space among multiple processes, so that the same LM binary file can be accessed by several decoding processes. The LM toolkit is integrated into the popular open source SMT system Moses.
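The role of the back-off weights mentioned above can be sketched in a few lines of Python. This is a toy illustration of standard back-off lookup, not the IRST LM API: the dictionaries, the log10 values and the function name are hypothetical, and the real toolkit stores the model in compact, memory-mapped structures rather than Python dictionaries.

```python
# Hypothetical toy model: log10 probabilities and back-off weights keyed by n-gram.
LOGPROB = {
    ("the",): -1.0,
    ("cat",): -2.0,
    ("sat",): -2.5,
    ("<unk>",): -4.0,
    ("the", "cat"): -0.5,
}
BACKOFF = {("the",): -0.3, ("cat",): -0.2}


def logprob(word, history):
    """Return log10 p(word | history), backing off to shorter histories."""
    ngram = tuple(history) + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not history:
        return LOGPROB[("<unk>",)]  # unseen word: fall back to the unknown token
    # Add the back-off weight of the history (0.0 if absent) and drop its oldest word.
    return BACKOFF.get(tuple(history), 0.0) + logprob(word, tuple(history)[1:])


seen_bigram = logprob("cat", ("the",))
backed_off = logprob("sat", ("the", "cat"))
```

Every lookup may touch several n-gram levels, which is one reason why caching and fast on-disk access matter for very large models.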

The IRST LM toolkit is distributed for free under the GNU Lesser General Public Licence and can be downloaded from the project homepage.

Graph Error Rate Tool

Current SLT systems are designed to process only one input hypothesis, making them vulnerable to errors in the input. Recently, approaches have been proposed for improving translation quality through the processing of multiple input hypotheses, usually represented as a word graph. It is widely accepted that better input quality yields better translation quality. A software package has been developed at RWTH which computes the Graph Error Rate (GER). GER is defined as the Word Error Rate (WER) of the best path in the graph. Multiple references are taken into account.
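The definition of GER can be illustrated with a short dynamic program in Python. This is a sketch under stated assumptions and not the RWTH tool: the graph is given as a list of (from_node, to_node, word) edges with nodes numbered in topological order from 0 to the final node, only a single reference is handled, and the function name is hypothetical.

```python
def graph_error_rate(edges, num_nodes, reference):
    """Lowest WER of any path from node 0 to node num_nodes - 1.

    Assumes every edge goes from a lower-numbered to a higher-numbered node.
    """
    inf = float("inf")
    n = len(reference)
    # dist[node][j]: fewest edits to reach `node` having covered j reference words.
    dist = [[inf] * (n + 1) for _ in range(num_nodes)]
    dist[0][0] = 0
    outgoing = [[] for _ in range(num_nodes)]
    for u, v, word in edges:
        outgoing[u].append((v, word))
    for u in range(num_nodes):
        row = dist[u]
        for j in range(n):  # deletion: a reference word has no graph counterpart
            row[j + 1] = min(row[j + 1], row[j] + 1)
        for v, word in outgoing[u]:
            nxt = dist[v]
            for j in range(n + 1):
                if row[j] == inf:
                    continue
                nxt[j] = min(nxt[j], row[j] + 1)  # insertion: extra graph word
                if j < n:  # match (cost 0) or substitution (cost 1)
                    nxt[j + 1] = min(nxt[j + 1], row[j] + (word != reference[j]))
    return dist[num_nodes - 1][n] / n


# Invented word graph with two alternatives per position.
edges = [
    (0, 1, "i"), (0, 1, "eye"),
    (1, 2, "have"), (1, 2, "had"),
    (2, 3, "a"), (2, 3, "the"),
    (3, 4, "cut"), (3, 4, "cat"),
]
ger_exact = graph_error_rate(edges, 5, "i have a cat".split())
ger_missing = graph_error_rate(edges, 5, "i have a black cat".split())
```

The first reference is contained in the graph, so its GER is zero; the second needs one deletion out of five reference words.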

This scoring tool can be downloaded for free from here.

Translation Re-segmentation Tool

A correct end-to-end evaluation of speech translation systems means that no human interaction is involved between ASR and MT. Therefore, the sentence segmentation which is passed on to the translation system has to be determined automatically. This means that the segmentation of the ASR output and, consequently, of the translation hypotheses will be different from the sentence boundaries defined by the reference translations. However, all current objective MT evaluation measures make use of the fact that each hypothesized sentence uniquely corresponds to a reference segment. To overcome this problem, a novel automatic sentence re-segmentation method was developed by RWTH for evaluating machine translation output with possibly different/erroneous sentence boundaries or without any boundaries. The evaluation procedure takes advantage of the edit distance algorithm and is designed to handle multiple reference translations. The algorithm efficiently performs an optimal automatic re-segmentation of the hypotheses that matches the reference segmentation. This makes the application of existing well-established evaluation measures possible.
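The idea of re-segmenting with the edit distance algorithm can be sketched as follows. This simplified Python example is not the RWTH implementation: it handles a single reference, treats all edit operations as equally costly, and resolves ties in the backtrace arbitrarily. The function name is hypothetical. It aligns the unsegmented hypothesis to the concatenated reference and cuts the hypothesis wherever the alignment crosses a reference boundary.

```python
def resegment(hyp_words, ref_segments):
    """Split hyp_words into len(ref_segments) pieces following the reference."""
    ref_words = [w for seg in ref_segments for w in seg]
    h, r = len(hyp_words), len(ref_words)
    # Standard Levenshtein table: d[i][j] = edits between hyp[:i] and ref[:j].
    d = [[0] * (r + 1) for _ in range(h + 1)]
    for i in range(1, h + 1):
        d[i][0] = i
    for j in range(1, r + 1):
        d[0][j] = j
    for i in range(1, h + 1):
        for j in range(1, r + 1):
            sub = d[i - 1][j - 1] + (hyp_words[i - 1] != ref_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace; cut[j] is the hypothesis position aligned with reference position j.
    cut = [None] * (r + 1)
    i, j = h, r
    while True:
        if cut[j] is None:
            cut[j] = i
        if i == 0 and j == 0:
            break
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp_words[i - 1] != ref_words[j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    # Cut the hypothesis at the positions aligned with reference segment ends.
    segments, start, boundary = [], 0, 0
    for seg in ref_segments:
        boundary += len(seg)
        segments.append(hyp_words[start:cut[boundary]])
        start = cut[boundary]
    return segments


hypothesis = "hello world how are you today".split()
reference = [["hello", "world"], ["how", "are", "you", "doing", "today"]]
segments = resegment(hypothesis, reference)
```

After this step every hypothesis segment corresponds to exactly one reference segment, so sentence-level measures can be applied as usual.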

This segmentation tool can be downloaded from here.

Translation Error Analysis Tool

A user-friendly tool for the analysis of translation errors has been developed by LIMSI. A graphical user interface displays the source text, translation hypotheses and translation references, together with a graphical interpretation of several error measures (BLEU, NIST, WER and PER). The translations can be annotated by an operator, for instance to identify difficult sentences or common types of errors. Sorting and search facilities are also implemented. The tool is written in Java, so it can be used on various platforms, in particular Linux and Windows machines.

The Translation Error Analysis Tool can be downloaded from the project homepage.

UPC Intonation Toolkit (mCART)

mCART is a complete intonation model training package developed at the TALP Research Center of the Universitat Politecnica de Catalunya.

This software eases the generation of an intonation model for the prosody module of Text-to-Speech systems. It generates a fundamental frequency contour specific to the input text that is to be synthesized. The generation process uses information provided by upstream components, such as syllabification, stress, phonetic transcription, part-of-speech tagging, syntactic analysis and prosodic boundaries.

Three different mathematical formulations are implemented: Bezier, Fujisaki and Tilt. Each formulation can be trained by means of the two available procedures: sentence-by-sentence (SbS) and Joint Estimation and Modeling Approach (JEMA). Several modes are available: train and test, n-fold cross-validation and full training.
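As a rough picture of what a Bezier formulation of a contour looks like, the Python sketch below evaluates a Bezier curve whose control points are fundamental frequency values in Hz. The curve order, the control values and the function name are hypothetical and are not taken from mCART; the sketch only shows how a handful of coefficients can describe a smooth F0 movement.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    points = list(control_points)
    while len(points) > 1:
        # Repeated linear interpolation between neighbouring points.
        points = [(1 - t) * a + t * b for a, b in zip(points, points[1:])]
    return points[0]


# Hypothetical cubic contour over one unit: four F0 control values in Hz.
f0_control = [120.0, 180.0, 150.0, 100.0]
contour = [bezier_point(f0_control, k / 10) for k in range(11)]
```

The curve starts at the first control value and ends at the last one, while the inner control values shape the rise and fall in between.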

The UPC Intonation Toolkit can be downloaded from the project homepage.

UPC Voice Conversion Toolkit

This toolkit provides two different methods for performing Voice Conversion. It has been developed at the TALP Research Center of the Universitat Politecnica de Catalunya.

The first method is a C/C++ tool based on the Linear Prediction model (LPC). CARTs are used to split the acoustic space into several classes based on phonetic features. For each class, a linear regression is applied to transform the LSF coefficients using GMMs. Then, the appropriate residual is selected from the residuals found in the training data, based on the similarity between the associated LSF and the transformed LSF. Code is provided to perform all the aforementioned steps, and sample scripts are provided to help automate the whole process.

The second method is a Matlab tool based on the harmonic/stochastic model (HSM). This model is used to analyze, modify and synthesize the speech signals. The voice conversion method is based on Gaussian mixture models (GMM), which can be trained from parallel and non-parallel corpora. The non-parallel training procedure is suitable for cross-lingual applications because it handles only acoustic parameters. The harmonic component of the signals is converted using the trained transformation function, and the stochastic component is predicted from the converted harmonic component. The unvoiced frames are not modified. The pitch is also adapted to the target speaker by means of a linear transformation of the means and variances of the log-f0.
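The pitch adaptation step can be sketched in Python as the commonly used mean and variance normalization in the log-f0 domain. This is an assumption about the general form of the transformation, not the toolkit's Matlab code: the function names are hypothetical, unvoiced frames are assumed to be marked with f0 = 0, and standard deviations (square roots of the variances) are used for scaling.

```python
import math
import statistics


def logf0_stats(f0_values):
    """Mean and standard deviation of log-f0 over voiced frames (f0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    return statistics.mean(logs), statistics.pstdev(logs)


def convert_f0(f0, source_stats, target_stats):
    """Map a source f0 value in Hz onto the target speaker's log-f0 distribution."""
    if f0 <= 0:
        return 0.0  # unvoiced frame: left unmodified
    mu_s, sd_s = source_stats
    mu_t, sd_t = target_stats
    return math.exp(mu_t + (sd_t / sd_s) * (math.log(f0) - mu_s))


# Hypothetical speaker statistics: the target speaks about an octave higher.
source_stats = (math.log(100.0), 0.2)
target_stats = (math.log(200.0), 0.4)
converted_mean = convert_f0(100.0, source_stats, target_stats)
converted_high = convert_f0(100.0 * math.exp(0.2), source_stats, target_stats)
```

A source value at the source mean lands on the target mean, and a value one standard deviation above the source mean lands one target standard deviation above the target mean.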

The UPC Voice Conversion Toolkit can be downloaded from the project homepage.