Extremely low-resource machine translation for closely related languages
An effective method to improve extremely low-resource neural machine translation is multilingual training, which can be improved by leveraging monolingual data to create synthetic bilingual corpora using the back-translation method. This work focuses on closely related languages from the Uralic language family: from Estonian and Finnish geographical regions. We find that multilingual learning and synthetic corpora increase the translation quality in every language pair for which we have data. We show that transfer learning and fine-tuning are very effective for low-resource machine translation and achieve the best results. We collected new parallel data for Võro, North Saami, and South Saami, and present the first neural machine translation results for these languages.
Main Authors: Tars, Maali; Tättar, Andre; Fišel, Mark
Format: Article in Journal/Newspaper
Language: unknown
Published: arXiv, 2021
Subjects: Computation and Language (cs.CL); FOS: Computer and information sciences
Online Access: https://dx.doi.org/10.48550/arxiv.2105.13065 | https://arxiv.org/abs/2105.13065
id: ftdatacite:10.48550/arxiv.2105.13065
record_format: openpolar
institution: Open Polar
collection: DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id: ftdatacite
language: unknown
topic: Computation and Language (cs.CL); FOS: Computer and information sciences
description: An effective method to improve extremely low-resource neural machine translation is multilingual training, which can be improved by leveraging monolingual data to create synthetic bilingual corpora using the back-translation method. This work focuses on closely related languages from the Uralic language family: from Estonian and Finnish geographical regions. We find that multilingual learning and synthetic corpora increase the translation quality in every language pair for which we have data. We show that transfer learning and fine-tuning are very effective for low-resource machine translation and achieve the best results. We collected new parallel data for Võro, North Saami, and South Saami, and present the first neural machine translation results for these languages. Accepted at NoDaLiDa 2021.
format: Article in Journal/Newspaper
author: Tars, Maali; Tättar, Andre; Fišel, Mark
author_sort: Tars, Maali
title: Extremely low-resource machine translation for closely related languages
publisher: arXiv
publishDate: 2021
url: https://dx.doi.org/10.48550/arxiv.2105.13065 | https://arxiv.org/abs/2105.13065
genre: saami
op_rights: Creative Commons Attribution Share Alike 4.0 International (cc-by-sa-4.0), https://creativecommons.org/licenses/by-sa/4.0/legalcode
op_rightsnorm: CC-BY-SA
op_doi: https://doi.org/10.48550/arxiv.2105.13065
_version_: 1766180524392448000