WeBiText: building large heterogeneous translation memories from parallel Web content
This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built based on very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available p...
Main Authors: | Désilets, Alain; Farley, Benoit; Stojanovic, Marta; Patenaude, Geneviève |
---|---|
Format: | Article in Journal/Newspaper |
Language: | English |
Published: | ASLIB, 2008 |
Subjects: | |
Online Access: | https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 |
id |
ftnrccanada:oai:cisti-icist.nrc-cnrc.ca:cistinparc:a05f4b93-c0e8-4383-97d2-728e08e458e5 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
National Research Council Canada: NRC Publications Archive |
op_collection_id |
ftnrccanada |
language |
English |
description |
This paper investigates the extent to which a useful general-purpose Translation Memory (TM) can be built from very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available parallel corpora, such as the Canadian Hansard. In the case of Canadian translators working with English and French, we show that the answer to both questions is a resounding yes. Using field data collected through contextualized observation and interviews with translators at their workplace, we show that this concept is well grounded in the existing work practices of translators, especially Canadian ones. We also show that a TM based on 10 million pairs of pages from Government of Canada Web sites is able to cover 90% of the translation problems observed among our interview subjects. This is significantly better than the coverage of a general-purpose TM built from a smaller corpus, namely the Canadian Hansard. The difference is most notable for the harder problems, such as specialized terminology. We also evaluate the approach on Web parallel corpora for other languages (European Commission Web sites, and 5000 Inuktitut-English pages harvested from the Nunavut domain), and find it less advantageous there. We conclude that, while the concept of building TMs from Web corpora holds great promise, more research may be needed to make it work for language pairs other than English-French. Peer reviewed: Yes. NRC publication: Yes. |
format |
Article in Journal/Newspaper |
author |
Désilets, Alain; Farley, Benoit; Stojanovic, Marta; Patenaude, Geneviève |
spellingShingle |
Désilets, Alain Farley, Benoit Stojanovic, Marta Patenaude, Geneviève WeBiText: building large heterogeneous translation memories from parallel Web content |
author_facet |
Désilets, Alain; Farley, Benoit; Stojanovic, Marta; Patenaude, Geneviève |
author_sort |
Désilets, Alain |
title |
WeBiText: building large heterogeneous translation memories from parallel Web content |
title_short |
WeBiText: building large heterogeneous translation memories from parallel Web content |
title_full |
WeBiText: building large heterogeneous translation memories from parallel Web content |
title_fullStr |
WeBiText: building large heterogeneous translation memories from parallel Web content |
title_full_unstemmed |
WeBiText: building large heterogeneous translation memories from parallel Web content |
title_sort |
webitext: building large heterogeneous translation memories from parallel web content |
publisher |
ASLIB |
publishDate |
2008 |
url |
https://nrc-publications.canada.ca/eng/view/ft/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/eng/view/object/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 https://nrc-publications.canada.ca/fra/voir/objet/?id=a05f4b93-c0e8-4383-97d2-728e08e458e5 |
geographic |
Canada Nunavut |
geographic_facet |
Canada Nunavut |
genre |
inuktitut Nunavut |
genre_facet |
inuktitut Nunavut |
op_relation |
Proceedings of Translating and the Computer 30, Translating and the Computer 30: Conference and Exhibition, November 27-28, 2008, London, United Kingdom, ISBN: 0851424864, Publication date: 2008-11 |
op_rights |
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/) Attribution - Pas d’Utilisation Commerciale - Partage dans les Mêmes Conditions 3.0 non transposé (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/deed.fr) |
op_rightsnorm |
CC-BY-NC-SA |
_version_ |
1766046591833079808 |