The TC-STAR project, financed by the European Commission within the Sixth Framework Programme, is envisaged as a long-term effort to advance research in all core technologies for Speech-to-Speech Translation (SST). SST technology combines Automatic Speech Recognition (ASR), Spoken Language Translation (SLT), and Text-to-Speech (TTS) synthesis. The objectives of the project are ambitious: to make a breakthrough in SST that significantly reduces the gap between human and machine translation performance.
The project targets a selection of unconstrained conversational speech domains (speeches and broadcast news) and three languages: European English, European Spanish, and Mandarin Chinese. Accurate translation of unrestricted speech is well beyond the capability of today's state-of-the-art research systems. Advances are therefore needed in state-of-the-art speech recognition and speech translation technologies.
Long-term research goals of the project are:
- Effective SLT of unrestricted conversational speech on large domains of discourse.
- Speech recognition able to perform reliably under varying speaking styles, recording conditions, and for different user communities, and able to adapt in a transparent manner to the particular conditions.
- Effective integration of speech recognition and translation into a unique statistically sound framework. A major challenge will be the effective extension of current statistical machine translation models to account for multiple sentence hypotheses produced by the speech recognition algorithm.
- General expressive speech synthesis imitating the human voice. In order to overcome the barriers of reading and talking style and languages, a breakthrough in speech synthesis requires the development of new models for prosody, emotions and for expressive speech in general.
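The integration goal in the list above can be illustrated with a minimal sketch. It assumes, hypothetically, that the recognizer emits an n-best list with log-domain scores and that a translation model returns a scored translation for each hypothesis. A loosely coupled system translates only the single best recognition hypothesis, whereas a more tightly coupled one picks the translation that maximizes the combined score. The names `Hypothesis`, `toy_translate`, and `asr_weight`, and all scores, are illustrative and do not come from the project.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Hypothesis:
    text: str
    asr_logprob: float  # recognizer score in the log domain (hypothetical)


# A translator maps source text to (translation, translation log-score).
Translator = Callable[[str], Tuple[str, float]]


def loosely_coupled(nbest: List[Hypothesis], translate: Translator) -> str:
    """Translate only the top-scoring recognition hypothesis."""
    best = max(nbest, key=lambda h: h.asr_logprob)
    return translate(best.text)[0]


def jointly_rescored(
    nbest: List[Hypothesis], translate: Translator, asr_weight: float = 1.0
) -> str:
    """Pick the translation maximizing the combined ASR + translation score."""
    best_score = float("-inf")
    best_translation = ""
    for hyp in nbest:
        translation, mt_logprob = translate(hyp.text)
        score = asr_weight * hyp.asr_logprob + mt_logprob
        if score > best_score:
            best_score, best_translation = score, translation
    return best_translation


def toy_translate(text: str) -> Tuple[str, float]:
    """Toy lookup table standing in for a statistical translation model."""
    table = {
        "they boat today": ("ellos barco hoy", -9.0),
        "they vote today": ("votan hoy", -1.5),
    }
    return table[text]


nbest = [
    Hypothesis("they boat today", asr_logprob=-2.0),  # misrecognized, ranked first
    Hypothesis("they vote today", asr_logprob=-2.5),
]
print(loosely_coupled(nbest, toy_translate))
print(jointly_rescored(nbest, toy_translate))
```

With these toy scores, the loosely coupled call returns the translation of the misrecognized top hypothesis, while the jointly rescored call recovers "votan hoy" because the translation score outweighs the small difference in recognition score.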
A measure of success of the project will be the progress achieved in each component of SST technology. Key actions to meet these grand challenges are:
- The implementation of an evaluation infrastructure based on competitive evaluations.
- The creation of a technological infrastructure to foster effective delivery and assessment of scientific results.
- The acquisition of appropriate language resources.
- Supporting the dissemination of scientific results within the consortium and the research community.
To foster significant advances in all SST technologies, periodic competitive evaluations are planned. A measure of success of the project will be the involvement of external participants in the evaluation campaigns. Results will be presented and discussed in a series of TC-STAR evaluation workshops. The ambition is to turn these workshops into public events that draw the attention of the scientific community, industry, and in particular companies active in the area of technology transfer and services.
The project brings together key SST actors to form a critical mass of researchers. The project participants are:
- Istituto Trentino di Cultura - Centro per la Ricerca Scientifica e Tecnologica (ITC)
- Rheinisch-Westfaelische Technische Hochschule Aachen (RWTH-AACHEN)
- Centre National de la Recherche Scientifique (CNRS-LIMSI)
- Universitat Politècnica de Catalunya (UPC)
- Universitaet Karlsruhe (TH) (UKA)
- IBM Deutschland Entwicklung GmbH (IBM)
- Siemens Aktiengesellschaft (SIEMENS)
- Siemens Reseaux Informatique et Telecommunications SAS (SRIT)
- Nokia Corporation (NOKIA)
- Sony International (Europe) GmbH (SONY)
- Evaluations and Language resources Distribution Agency (ELDA)
- Stichting Katholieke Universiteit / Speech Processing Expertise Centre (KUN-SPEX)
ITC-irst is the project coordinator. Project partners have strong expertise in different areas of the project: automatic speech recognition, spoken language translation, text-to-speech, and technological implementation. The consortium is balanced between research and technology partners (industrial leaders in SST), and includes centres for language resources distribution and validation. SRIT withdrew from the project at the end of 2004 (month nine).
In addition to the exchange of new knowledge through the evaluation workshops, scientific achievements are disseminated to the scientific community through major international conferences and journals covering all research areas involved in SST.
The most significant results of the project, in terms of advances made in SST technology, are expected in the mid to long term. Because of its inadequate current performance, SLT technology is not ready for widespread market deployment. The purpose of TC-STAR is to advance SLT, together with the required ASR and TTS functionality, in order to prepare for market adoption. This should improve the functionality of existing products based on ASR and TTS and enable the launch of new translation products for face-to-face and over-the-phone use, speeches, documents (or web sites), cross-lingual retrieval in audio streams, etc.
Currently, the main market segments of voice-driven interfaces are network-based servers, mobile terminals, and automotive applications. Network-based servers represent the largest market segment, dominated by interactive voice response (IVR) systems. Voice-driven mobile phones are the largest market segment in mobile terminals, where new services based on speech technology are most visible.
Even though overestimation of the capabilities of these technologies in past years caused a negative perception in the user market, a more mature and positive phase seems to be ahead. This phase is characterized by a more realistic view of the usability of such technologies.
Concerning the first year of activity of the project, the main objectives were:
- Establishing baseline systems for ASR and SLT in order to measure the capabilities of available speech recognition and translation technologies and to provide references for evaluating progress made during the project.
- Identifying a task suited for automatic speech recognition, speech translation, and speech synthesis. Specifying and starting the production of task-specific language resources needed for developing and evaluating components for SLT and ASR.
- Implementing an evaluation infrastructure through the organization of an evaluation campaign addressing performance measures of ASR and SLT technologies.
- Specifying language resources and evaluation criteria for TTS. Establishing reference baselines according to defined evaluation modules. Establishing baseline algorithms for voice conversion and starting an investigation of voice conversion and prosody.
- Producing specifications for a technological infrastructure aimed at supporting the effective delivery of scientific results. In particular, the infrastructure will be used to evaluate single components as well as end-to-end systems, and to showcase project results.
At an early stage of the project, a suitable and challenging reference task was identified in the translation of speeches delivered during the European Parliament Plenary Sessions (EPPS). This makes TC-STAR the first European project on spoken language translation to work on an unrestricted real-life task. Two translation directions were considered: from English to Spanish and from Spanish to English. Appropriate language resources were specified for these languages, and their production and validation started immediately. In the meantime, baseline systems for speech translation and automatic speech recognition were developed using publicly available language resources. The specifications of language resources for speech synthesis were also defined, both to create the baseline voices and to support the research tasks.
An evaluation of baseline systems for ASR and SLT took place during months six and seven. The first evaluation campaign of the project was carried out in month twelve (March 2005). The aim was to measure progress made in automatic speech recognition and spoken language translation. Training, development, and evaluation sets for the EPPS task were made available to the participants in the evaluation. Components for speech recognition and translation were also evaluated on a Broadcast News (BN) transcription task with translation from Mandarin Chinese into English. In addition to evaluating single components for ASR and SLT, full systems performing speech recognition and translation were evaluated.
In addition to TC-STAR partners, two external sites took part in the evaluation campaign. Results of the first evaluation campaign were presented and discussed in the first TC-STAR Evaluation Workshop held in Trento (Italy) in April 2005.
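The text does not spell out which metrics the campaign used, but the project's targets are expressed in word error rate (WER). As a minimal sketch, assuming the conventional definition (word-level edit distance divided by the number of reference words), WER can be computed as follows. The function name and example sentences are illustrative, and real scoring tools also apply text normalization that is omitted here.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the first i-1 reference words
    # and the first j hypothesis words (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the session is resumed after the break"
hypothesis = "the session was resumed after break"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In this example the hypothesis contains one substitution ("is" recognized as "was") and one deletion (the second "the") against seven reference words, so the WER is 2/7.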
With respect to speech synthesis, the evaluation criteria were defined. Their aim is to evaluate the speech synthesis component as a whole and also its single modules. Furthermore, some evaluation tests were defined for specific research tasks.
The objectives of the TC-STAR project are ambitious: to make a breakthrough in SST research that significantly reduces the gap between human and machine performance. The operational goals and corresponding measures of success are specified below (from the Technical Annex):
- Effective SLT of unrestricted conversational speech on large domains of discourse. We expect to significantly reduce the gap between human and machine performance. Breakthroughs are foreseen in five to six years' time. Our target is a 15% relative word-error-rate reduction every 18 months over established baseline systems (milestones M4 month 23, and M6 month 34).
- Speech recognition able to perform reliably under varying speaking styles, recording conditions, and for different user communities, and able to adapt in a transparent manner to the particular conditions. Breakthroughs are expected in all aspects of speech recognition technology. Improvements with respect to the reference baselines should lead to relative word-error-rate reductions of 15% every 18 months (milestones M4 month 23, and M6 month 34).
- Effective integration of speech recognition and translation into a unique statistically sound framework. A major challenge will be the effective extension of current statistical machine translation models to account for multiple sentence hypotheses produced by the speech recognition algorithm. A measure of success will be the achievement, at the end of the project, of better performance than with the best available loosely coupled system, disregarding efficiency (milestone M6 month 34).
- General expressive speech synthesis imitating the human voice. In order to overcome the barriers of reading and talking style and languages, a breakthrough in speech synthesis requires the development of new models for prosody, emotions and for expressive speech in general. Formal and subjective evaluation should demonstrate the improvements by the end of the project (milestone M6 month 34).
- Creation of a technological infrastructure aimed at supporting the effective delivery of scientific results. In particular, the infrastructure will be used to evaluate single components as well as end-to-end systems, and to showcase project results. A first baseline supporting evaluation campaign functionality will be ready by month 24 (M5). A showcase of the infrastructure and of progress in SST achieved by the project will be demonstrated by month 36 (M7). Finally, the infrastructure will be made accessible for benchmarking of scientific results by external labs for the third evaluation campaign at month 34 (M6), and open source code of some of the SST modules will also be released (M6).
- Implementation of an evaluation infrastructure through the organization of evaluation campaigns addressing performance measures of single technologies as well as end-to-end systems. The research of all partners will be systematically evaluated and compared in a competitive framework. Workshops will be organized after each evaluation (M2 month 11, M4 month 23, M6 month 34). The third evaluation will be open to external participants. A measure of success of this goal will be the progress achieved in each component technology and the involvement of external participants.
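To make the word-error-rate targets in the list above concrete: a 15% relative reduction every 18 months compounds multiplicatively. The sketch below assumes a hypothetical baseline WER of 20% (the text gives no baseline figure) and assumes that each 18-month period is measured against the result of the previous one, which is only one possible reading of the target.

```python
def projected_wer(
    baseline_wer: float, relative_reduction: float, periods: int
) -> float:
    """WER after applying the same relative reduction for `periods` periods."""
    if not 0.0 <= relative_reduction < 1.0:
        raise ValueError("relative_reduction must be in [0, 1)")
    return baseline_wer * (1.0 - relative_reduction) ** periods


BASELINE_WER = 0.20  # hypothetical; not a figure from the project
REDUCTION = 0.15     # 15% relative reduction per 18-month period

for periods in range(3):
    wer = projected_wer(BASELINE_WER, REDUCTION, periods)
    print(f"after {periods * 18:2d} months: WER = {wer:.2%}")
```

Under these assumptions the WER falls from 20% to 17% after one period and to about 14.45% after two, that is, a 15% relative reduction corresponds here to an absolute reduction of 3 points in the first period and slightly less in the second.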