When Word Embeddings Become Endangered
Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. ...
Main Author: | Alnajjar, Khalid |
---|---|
Format: | Text |
Language: | unknown |
Published: | 2021 |
Subjects: | Computer Science - Computation and Language |
Online Access: | http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24 |
id | ftarxivpreprints:oai:arXiv.org:2103.13275 |
---|---|
record_format | openpolar |
spelling | ftarxivpreprints:oai:arXiv.org:2103.13275 2023-09-05T13:22:56+02:00 When Word Embeddings Become Endangered Alnajjar, Khalid 2021-03-24 http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24 unknown http://arxiv.org/abs/2103.13275 doi:10.31885/9789515150257.24 Computer Science - Computation and Language text 2021 ftarxivpreprints https://doi.org/10.31885/9789515150257.24 2023-08-16T16:24:35Z Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages; resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis model which achieved a high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library. Text sami ArXiv.org (Cornell University Library) 275 288 |
institution | Open Polar |
collection | ArXiv.org (Cornell University Library) |
op_collection_id | ftarxivpreprints |
language | unknown |
topic | Computer Science - Computation and Language |
spellingShingle | Computer Science - Computation and Language Alnajjar, Khalid When Word Embeddings Become Endangered |
topic_facet | Computer Science - Computation and Language |
description | Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages; resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis model which achieved a high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library. |
format | Text |
author | Alnajjar, Khalid |
author_facet | Alnajjar, Khalid |
author_sort | Alnajjar, Khalid |
title | When Word Embeddings Become Endangered |
title_short | When Word Embeddings Become Endangered |
title_full | When Word Embeddings Become Endangered |
title_fullStr | When Word Embeddings Become Endangered |
title_full_unstemmed | When Word Embeddings Become Endangered |
title_sort | when word embeddings become endangered |
publishDate | 2021 |
url | http://arxiv.org/abs/2103.13275 https://doi.org/10.31885/9789515150257.24 |
genre | sami |
genre_facet | sami |
op_relation | http://arxiv.org/abs/2103.13275 doi:10.31885/9789515150257.24 |
op_doi | https://doi.org/10.31885/9789515150257.24 |
container_start_page | 275 |
op_container_end_page | 288 |
_version_ | 1776203500994166784 |
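
The record's description mentions constructing word embeddings for a resource-poor language from a translation dictionary and the existing embeddings of a resource-rich language. The record does not say how the vectors are combined, so the following is only a minimal sketch under one assumption: each source word receives the mean of the vectors of its dictionary translations. The function names, the toy vectors, and the placeholder words (`word_a`, `word_b`, `word_c`) are hypothetical and are not taken from the paper or its Python library.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Return the element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_from_dictionary(
    dictionary: Dict[str, List[str]],
    rich_embeddings: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Assign each source word the mean vector of its known translations.

    Assumption (not stated in the record): averaging is used to combine
    translations. Words with no translation in the embedding vocabulary
    are skipped.
    """
    result: Dict[str, Vector] = {}
    for word, translations in dictionary.items():
        known = [rich_embeddings[t] for t in translations if t in rich_embeddings]
        if known:
            result[word] = mean_vector(known)
    return result


# Invented toy data: a two-dimensional "resource-rich" embedding table
# and a placeholder translation dictionary.
rich = {"house": [1.0, 0.0], "home": [0.0, 1.0], "water": [2.0, 2.0]}
toy_dictionary = {
    "word_a": ["house", "home"],   # two known translations
    "word_b": ["water"],           # one known translation
    "word_c": ["unlisted"],        # no known translation, so it is skipped
}

projected = embed_from_dictionary(toy_dictionary, rich)
print(projected)
```

The fine-tuning on universal dependencies sentences and the cross-lingual alignment that the description also mentions are not covered by this sketch.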