When Word Embeddings Become Endangered

Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but low-resourced and endangered languages do not: such resources are scarce despite the great advantages they would provide for the language communities. The most commo...

Bibliographic Details
Main Author:	Alnajjar, Khalid
Format:	Text
Language:	unknown
Published:	2021
Subjects:	Computer Science - Computation and Language sami
Online Access:	http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24

id	ftarxivpreprints:oai:arXiv.org:2103.13275
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2103.13275 2023-09-05T13:22:56+02:00 When Word Embeddings Become Endangered Alnajjar, Khalid 2021-03-24 http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24 unknown http://arxiv.org/abs/2103.13275 doi:10.31885/9789515150257.24 Computer Science - Computation and Language text 2021 ftarxivpreprints https://doi.org/10.31885/9789515150257.24 2023-08-16T16:24:35Z Text sami ArXiv.org (Cornell University Library) 275 288
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Computation and Language
spellingShingle	Computer Science - Computation and Language Alnajjar, Khalid When Word Embeddings Become Endangered
topic_facet	Computer Science - Computation and Language
description	Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but low-resourced and endangered languages do not: such resources are scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation shows that our word embeddings for endangered languages are well aligned with those of the resource-rich languages and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis model, which achieved high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
format	Text
author	Alnajjar, Khalid
author_facet	Alnajjar, Khalid
author_sort	Alnajjar, Khalid
title	When Word Embeddings Become Endangered
title_short	When Word Embeddings Become Endangered
title_full	When Word Embeddings Become Endangered
title_fullStr	When Word Embeddings Become Endangered
title_full_unstemmed	When Word Embeddings Become Endangered
title_sort	when word embeddings become endangered
publishDate	2021
url	http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24
genre	sami
genre_facet	sami
op_relation	http://arxiv.org/abs/2103.13275 doi:10.31885/9789515150257.24
op_doi	https://doi.org/10.31885/9789515150257.24
container_start_page	275
op_container_end_page	288
_version_	1776203500994166784

Similar Items