When Word Embeddings Become Endangered
Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most commo...
Main Author: | Alnajjar, Khalid |
---|---|
Format: | Article in Journal/Newspaper |
Language: | unknown |
Published: | arXiv 2021 |
Subjects: | Computation and Language cs.CL FOS Computer and information sciences |
Online Access: | https://dx.doi.org/10.48550/arxiv.2103.13275 https://arxiv.org/abs/2103.13275 |
id | ftdatacite:10.48550/arxiv.2103.13275 |
---|---|
record_format | openpolar |
institution | Open Polar |
collection | DataCite Metadata Store (German National Library of Science and Technology) |
op_collection_id | ftdatacite |
language | unknown |
topic | Computation and Language cs.CL FOS Computer and information sciences |
spellingShingle | Computation and Language cs.CL FOS Computer and information sciences Alnajjar, Khalid When Word Embeddings Become Endangered |
topic_facet | Computation and Language cs.CL FOS Computer and information sciences |
description | Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages, for which such resources are scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and Universal Dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the Universal Dependencies and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation shows that our word embeddings for endangered languages are well aligned with the resource-rich languages and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis model, which achieved high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library. |
format | Article in Journal/Newspaper |
author | Alnajjar, Khalid |
author_facet | Alnajjar, Khalid |
author_sort | Alnajjar, Khalid |
title | When Word Embeddings Become Endangered |
title_short | When Word Embeddings Become Endangered |
title_full | When Word Embeddings Become Endangered |
title_fullStr | When Word Embeddings Become Endangered |
title_full_unstemmed | When Word Embeddings Become Endangered |
title_sort | when word embeddings become endangered |
publisher | arXiv |
publishDate | 2021 |
url | https://dx.doi.org/10.48550/arxiv.2103.13275 https://arxiv.org/abs/2103.13275 |
genre | sami |
genre_facet | sami |
op_relation | https://dx.doi.org/10.31885/9789515150257.24 |
op_rights | Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 |
op_rightsnorm | CC-BY |
op_doi | https://doi.org/10.48550/arxiv.2103.13275 https://doi.org/10.31885/9789515150257.24 |
_version_ | 1766185143944347648 |
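
The description in this record outlines building word embeddings for a resource-poor language from translation dictionaries and the existing embeddings of resource-rich languages. The following is a minimal sketch of how such a dictionary-based step might look. It assumes, without confirmation from the record, that each source word is assigned the mean of the vectors of its dictionary translations. The toy vectors, the placeholder dictionary entries, and the function name `build_embeddings` are all hypothetical, and this is not the released library's API.

```python
from statistics import fmean

# Hypothetical toy embeddings for a resource-rich language (not real data).
RICH_VECTORS = {
    "house": [1.0, 0.0, 2.0],
    "home": [3.0, 2.0, 0.0],
    "water": [0.0, 4.0, 4.0],
}

# Hypothetical translation dictionary: placeholder source words -> translations.
DICTIONARY = {
    "word_a": ["house", "home"],
    "word_b": ["water"],
    "word_c": ["unknown"],  # translation missing from RICH_VECTORS
}


def build_embeddings(dictionary, rich_vectors):
    """Assign each source word the mean vector of its known translations.

    Assumption for illustration only: averaging translation vectors.
    """
    embeddings = {}
    for word, translations in dictionary.items():
        known = [rich_vectors[t] for t in translations if t in rich_vectors]
        if not known:
            continue  # no usable translation, so no vector can be derived
        # Average component-wise across all translation vectors.
        embeddings[word] = [fmean(column) for column in zip(*known)]
    return embeddings


EMBEDDINGS = build_embeddings(DICTIONARY, RICH_VECTORS)
print(EMBEDDINGS)
```

Because the derived vectors live in the same space as the resource-rich vectors they were averaged from, a sketch like this gives a starting point only; the fine-tuning and alignment steps named in the description are not shown here.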