WeBiText: building large heterogeneous translation memories from parallel Web content

Bibliographic Details
Main Authors: Désilets, Alain, Farley, Benoit, Stojanovic, Marta, Patenaude, Geneviève
Format: Article in Journal/Newspaper
Language: English
Published: ASLIB 2008
Online Access: https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
Description
Summary: This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built from very large amounts of heterogeneous parallel text mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available parallel corpora, such as the Canadian Hansard. In the case of Canadian translators working with English and French, we show that the answer to both questions is a resounding yes. Using field data collected through contextualized observation and interviews with translators at their workplace, we show that this concept is well grounded in the existing work practices of translators, especially Canadian ones. We also show that a TM based on 10 million pairs of pages from Government of Canada Web sites is able to cover 90% of the translation problems observed among our interview subjects. This is significantly better than the coverage of a general purpose TM built from a smaller corpus, namely the Canadian Hansard. The difference is most notable for the harder problems, such as specialized terminology. We also evaluate the approach on Web parallel corpora for other languages (European Commission Web sites, and 5000 Inuktitut-English pages harvested from the Nunavut domain), and find that it is less advantageous there. We conclude that, while the concept of building TMs from Web corpora holds great promise, more research may be needed to make it work for language pairs other than English-French.
Peer reviewed: Yes
NRC publication: Yes