When Word Embeddings Become Endangered

Bibliographic Details
Main Author: Alnajjar, Khalid
Format: Text
Language: unknown
Published: 2021
Subjects: Computer Science - Computation and Language
Online Access: http://arxiv.org/abs/2103.13275
https://doi.org/10.31885/9789515150257.24
id ftarxivpreprints:oai:arXiv.org:2103.13275
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2103.13275 2023-09-05T13:22:56+02:00 When Word Embeddings Become Endangered Alnajjar, Khalid 2021-03-24 http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24 unknown http://arxiv.org/abs/2103.13275 doi:10.31885/9789515150257.24 Computer Science - Computation and Language text 2021 ftarxivpreprints https://doi.org/10.31885/9789515150257.24 2023-08-16T16:24:35Z Text sami ArXiv.org (Cornell University Library) 275 288
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Computation and Language
description Resource-rich languages such as English and Finnish have many natural language processing (NLP) resources and models. Low-resource and endangered languages do not: such resources remain scarce despite the benefits they would bring to the language communities. The most common types of resources available for low-resource and endangered languages are translation dictionaries and Universal Dependencies treebanks. In this paper, we present a method for constructing word embeddings for endangered languages from the existing word embeddings of resource-rich languages and the translation dictionaries of resource-poor languages. The embeddings are then fine-tuned on the sentences in the Universal Dependencies treebanks and aligned to the semantic spaces of the resource-rich languages, which yields cross-lingual embeddings. The endangered languages we work with are Erzya, Moksha, Komi-Zyrian and Skolt Sami. We also use the cross-lingual word embeddings to build a single sentiment analysis model that covers all the languages in this study, endangered or not. Our evaluation shows that the word embeddings for the endangered languages are well aligned with those of the resource-rich languages and are suitable for training task-specific models, as demonstrated by our sentiment analysis model, which achieved high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
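The abstract does not spell out how the dictionary-based construction works, so the following is only a minimal sketch of one plausible reading: each endangered-language word receives the average of the vectors of its dictionary translations in a resource-rich language. The function names, the toy Finnish vectors, and the placeholder headwords are all hypothetical and are not taken from the paper or its released library; the fine-tuning and alignment steps are not shown.

```python
import numpy as np


def build_embeddings(dictionary, source_vectors):
    """Assumed approach: average the source-language vectors of each
    headword's translations. Headwords whose translations all lack a
    vector are skipped."""
    embeddings = {}
    for word, translations in dictionary.items():
        vectors = [source_vectors[t] for t in translations if t in source_vectors]
        if vectors:
            embeddings[word] = np.mean(vectors, axis=0)
    return embeddings


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional "Finnish" vectors (made up for illustration).
finnish_vectors = {
    "koira": np.array([1.0, 0.0, 0.0]),
    "hauva": np.array([0.8, 0.2, 0.0]),
    "talo": np.array([0.0, 1.0, 0.0]),
}

# Hypothetical translation dictionary; the keys stand in for
# endangered-language headwords.
toy_dictionary = {
    "word_a": ["koira", "hauva"],
    "word_b": ["talo"],
    "word_c": ["tuntematon"],  # no source vector available, so it is skipped
}

toy_embeddings = build_embeddings(toy_dictionary, finnish_vectors)
```

Because the new vectors are built directly from the source-language vectors, they start out in the same space as that language, which is one reason a dictionary-based construction is a natural fit for cross-lingual use.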
format Text
author Alnajjar, Khalid
title When Word Embeddings Become Endangered
publishDate 2021
url http://arxiv.org/abs/2103.13275
https://doi.org/10.31885/9789515150257.24
genre sami
genre_facet sami
op_relation http://arxiv.org/abs/2103.13275
doi:10.31885/9789515150257.24
op_doi https://doi.org/10.31885/9789515150257.24
container_start_page 275
op_container_end_page 288
_version_ 1776203500994166784