Aligning and Using an English-Inuktitut Parallel Corpus

Bibliographic Details
Published in: Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond
Main Authors: Martin, Joel; Johnson, Howard; Farley, Benoît; Maclachlan, Anna
Format: Article in Journal/Newspaper
Language: English
Published: 2003
Online Access: https://doi.org/10.3115/1118905.1118925
https://nrc-publications.canada.ca/eng/view/object/?id=bce8df0d-20c8-4b42-a200-223ed4fb92b3
https://nrc-publications.canada.ca/fra/voir/objet/?id=bce8df0d-20c8-4b42-a200-223ed4fb92b3
Description
Summary: A parallel corpus of texts in English and in Inuktitut, an Inuit language, is presented. These texts are from the Nunavut Hansards. The parallel texts are processed in two phases: the sentence alignment phase and the word correspondence phase. Our sentence alignment technique achieves a precision of 91.4% and a recall of 92.3%. Our word correspondence technique is aimed at providing the broadest-coverage collection of reliable pairs of Inuktitut and English morphemes for dictionary expansion. For an agglutinative language like Inuktitut, this entails considering substrings, not simply whole words. We employ a Pointwise Mutual Information (PMI) method and attain a coverage of 72.3% of English words and a precision of 87%.
NRC publication: Yes
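
Illustration: the summary mentions scoring English words against Inuktitut substrings with Pointwise Mutual Information. This record does not describe the authors' actual procedure, so the following is only a minimal sketch of the generic PMI definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated as the fraction of aligned sentence pairs in which an item occurs. The function names `substrings` and `pmi_table`, the `min_len` parameter, and the toy data are all hypothetical; the "Inuktitut" side is made of invented letter strings, not real words.

```python
import math
from collections import Counter


def substrings(word, min_len=2):
    """Return all contiguous substrings of `word` with at least `min_len` characters."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    }


def pmi_table(aligned_pairs, min_len=2):
    """Map (english_word, target_substring) to its PMI over aligned sentence pairs.

    Counts are per sentence pair: an item is counted once per pair in which it occurs.
    """
    n = len(aligned_pairs)
    en_count, sub_count, joint = Counter(), Counter(), Counter()
    for en_sent, target_sent in aligned_pairs:
        en_items = set(en_sent.lower().split())
        sub_items = set()
        for word in target_sent.lower().split():
            sub_items |= substrings(word, min_len)
        en_count.update(en_items)
        sub_count.update(sub_items)
        joint.update((e, s) for e in en_items for s in sub_items)
    # PMI = log2( (c/n) / ((count_e/n) * (count_s/n)) ) = log2( c*n / (count_e*count_s) )
    return {
        (e, s): math.log2(c * n / (en_count[e] * sub_count[s]))
        for (e, s), c in joint.items()
    }


# Invented toy data (assumption): placeholder strings stand in for target-language words.
toy_pairs = [
    ("house", "abcde"),
    ("house", "abcxy"),
    ("water", "pqrxy"),
    ("water", "pqrde"),
]
table = pmi_table(toy_pairs)
```

In this toy data the substring "abc" occurs only alongside "house", so that pair receives a higher score than ("house", "xy"), whose substring is shared with "water". A real system would also need frequency thresholds and a way to pick the best substring among overlapping candidates, which this sketch omits.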