WeBiText: building large heterogeneous translation memories from parallel Web content

This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built based on very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available p...

Bibliographic Details
Main Authors: Désilets, Alain, Farley, Benoit, Stojanovic, Marta, Patenaude, Geneviève
Format: Article in Journal/Newspaper
Language: English
Published: ASLIB 2008
Subjects:
Online Access: https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
id ftnrccanada:oai:cisti-icist.nrc-cnrc.ca:cistinparc:a05f4b93-c0e8-4383-97d2-728e08e458e5
record_format openpolar
spelling ftnrccanada:oai:cisti-icist.nrc-cnrc.ca:cistinparc:a05f4b93-c0e8-4383-97d2-728e08e458e5 2023-05-15T16:55:36+02:00 WeBiText: building large heterogeneous translation memories from parallel Web content Désilets, Alain Farley, Benoit Stojanovic, Marta Patenaude, Geneviève 2008-11 text https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 eng eng ASLIB Proceedings of Translating and the Computer 30, Translating and the Computer 30: Conference and Exhibition, November 27-28, 2008, London, United Kingdom, ISBN: 0851424864, Publication date: 2008-11 Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/) Attribution - Pas d’Utilisation Commerciale - Partage dans les Mêmes Conditions 3.0 non transposé (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/deed.fr) CC-BY-NC-SA article 2008 ftnrccanada 2021-09-25T23:00:14Z This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built based on very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available parallel corpora, such as the Canadian Hansard. In the case of Canadian translators working with English and French, we show that the answer to both questions is a resounding yes. Using field data collected through contextualized observation and interviews with translators at their workplace, we show how this concept is well grounded in existing work practices of translators, especially Canadian ones.
We also show that a TM based on 10 million pairs of pages from Government of Canada Web sites is able to cover 90% of the translation problems observed in our interview subjects. This turns out to be significantly better than coverage of a general purpose TM built from a smaller corpus, namely, the Canadian Hansard. The difference is most notable for the harder problems, such as specialized terminology. We also evaluate the approach on Web parallel corpora for other languages (European Commission Web sites, and 5000 Inuktitut-English pages harvested from the Nunavut domain), and find that the approach is less advantageous there. We conclude that, while the concept of building TMs from Web corpora holds great promise, more research may be needed to make it work for language pairs other than English-French. Peer reviewed: Yes NRC publication: Yes Article in Journal/Newspaper inuktitut Nunavut National Research Council Canada: NRC Publications Archive Canada Nunavut
institution Open Polar
collection National Research Council Canada: NRC Publications Archive
op_collection_id ftnrccanada
language English
description This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built based on very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available parallel corpora, such as the Canadian Hansard. In the case of Canadian translators working with English and French, we show that the answer to both questions is a resounding yes. Using field data collected through contextualized observation and interviews with translators at their workplace, we show how this concept is well grounded in existing work practices of translators, especially Canadian ones. We also show that a TM based on 10 million pairs of pages from Government of Canada Web sites is able to cover 90% of the translation problems observed in our interview subjects. This turns out to be significantly better than coverage of a general purpose TM built from a smaller corpus, namely, the Canadian Hansard. The difference is most notable for the harder problems, such as specialized terminology. We also evaluate the approach on Web parallel corpora for other languages (European Commission Web sites, and 5000 Inuktitut-English pages harvested from the Nunavut domain), and find that the approach is less advantageous there. We conclude that, while the concept of building TMs from Web corpora holds great promise, more research may be needed to make it work for language pairs other than English-French. Peer reviewed: Yes NRC publication: Yes
format Article in Journal/Newspaper
author Désilets, Alain
Farley, Benoit
Stojanovic, Marta
Patenaude, Geneviève
spellingShingle Désilets, Alain
Farley, Benoit
Stojanovic, Marta
Patenaude, Geneviève
WeBiText: building large heterogeneous translation memories from parallel Web content
author_facet Désilets, Alain
Farley, Benoit
Stojanovic, Marta
Patenaude, Geneviève
author_sort Désilets, Alain
title WeBiText: building large heterogeneous translation memories from parallel Web content
title_short WeBiText: building large heterogeneous translation memories from parallel Web content
title_full WeBiText: building large heterogeneous translation memories from parallel Web content
title_fullStr WeBiText: building large heterogeneous translation memories from parallel Web content
title_full_unstemmed WeBiText: building large heterogeneous translation memories from parallel Web content
title_sort webitext: building large heterogeneous translation memories from parallel web content
publisher ASLIB
publishDate 2008
url https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
geographic Canada
Nunavut
geographic_facet Canada
Nunavut
genre inuktitut
Nunavut
genre_facet inuktitut
Nunavut
op_relation Proceedings of Translating and the Computer 30, Translating and the Computer 30: Conference and Exhibition, November 27-28, 2008, London, United Kingdom, ISBN: 0851424864, Publication date: 2008-11
op_rights Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/)
Attribution - Pas d’Utilisation Commerciale - Partage dans les Mêmes Conditions 3.0 non transposé (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/deed.fr)
op_rightsnorm CC-BY-NC-SA
_version_ 1766046591833079808