WeBiText: building large heterogeneous translation memories from parallel Web content

This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built based on very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available p...


Bibliographic Details
Main Authors:	Désilets, Alain; Farley, Benoit; Stojanovic, Marta; Patenaude, Geneviève
Format:	Article in Journal/Newspaper
Language:	English
Published:	ASLIB 2008
Subjects:	Canada Nunavut inuktitut
Online Access:	https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5

id	ftnrccanada:oai:cisti-icist.nrc-cnrc.ca:cistinparc:a05f4b93-c0e8-4383-97d2-728e08e458e5
record_format	openpolar
institution	Open Polar
collection	National Research Council Canada: NRC Publications Archive
op_collection_id	ftnrccanada
language	English
description	This paper investigates the extent to which a useful general-purpose Translation Memory (TM) can be built from very large amounts of heterogeneous parallel text mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available parallel corpora, such as the Canadian Hansard. In the case of Canadian translators working with English and French, we show that the answer to both questions is a resounding yes. Using field data collected through contextualized observation and interviews with translators at their workplace, we show that this concept is well grounded in the existing work practices of translators, especially Canadian ones. We also show that a TM based on 10 million pairs of pages from Government of Canada Web sites covers 90% of the translation problems observed among our interview subjects. This is significantly better than the coverage of a general-purpose TM built from a smaller corpus, namely the Canadian Hansard. The difference is most notable for the harder problems, such as specialized terminology. We also evaluate the approach on Web parallel corpora for other languages (European Commission Web sites, and 5000 Inuktitut-English pages harvested from the Nunavut domain) and find it less advantageous there. We conclude that, while the concept of building TMs from Web corpora holds great promise, more research may be needed to make it work for language pairs other than English-French. Peer reviewed: Yes NRC publication: Yes
format	Article in Journal/Newspaper
author	Désilets, Alain; Farley, Benoit; Stojanovic, Marta; Patenaude, Geneviève
spellingShingle	Désilets, Alain Farley, Benoit Stojanovic, Marta Patenaude, Geneviève WeBiText: building large heterogeneous translation memories from parallel Web content
author_facet	Désilets, Alain; Farley, Benoit; Stojanovic, Marta; Patenaude, Geneviève
author_sort	Désilets, Alain
title	WeBiText: building large heterogeneous translation memories from parallel Web content
title_short	WeBiText: building large heterogeneous translation memories from parallel Web content
title_full	WeBiText: building large heterogeneous translation memories from parallel Web content
title_fullStr	WeBiText: building large heterogeneous translation memories from parallel Web content
title_full_unstemmed	WeBiText: building large heterogeneous translation memories from parallel Web content
title_sort	webitext: building large heterogeneous translation memories from parallel web content
publisher	ASLIB
publishDate	2008
url	https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5
geographic	Canada Nunavut
geographic_facet	Canada Nunavut
genre	inuktitut Nunavut
genre_facet	inuktitut Nunavut
op_relation	Proceedings of Translating and the Computer 30, Translating and the Computer 30: Conference and Exhibition, November 27-28, 2008, London, United Kingdom, ISBN: 0851424864, Publication date: 2008-11
op_rights	Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/) Attribution - Pas d’Utilisation Commerciale - Partage dans les Mêmes Conditions 3.0 non transposé (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/deed.fr)
op_rightsnorm	CC-BY-NC-SA
_version_	1766046591833079808


Similar Items