Multilingual word embeddings and their utility in cross-lingual learning

Word embeddings, dense vector representations of a word’s distributional semantics, are an indispensable component of contemporary natural language processing (NLP). Bilingual embeddings, in particular, have attracted much attention in recent years, given their inherent applicability to cross-lingual NLP tasks such as part-of-speech tagging and dependency parsing. However, despite recent advances in bilingual embedding mapping, very little research has been dedicated to aligning embeddings multilingually, where word embeddings for a variable number of languages are oriented to a single vector space. Given a proper alignment, one potential use case for multilingual embeddings is cross-lingual transfer learning, where a machine learning model trained on resource-rich languages (e.g. Finnish and Estonian) can “transfer” its salient features to a related language for which annotated resources are scarce (e.g. North Sami). However, the effect of the quality of this alignment on downstream cross-lingual NLP tasks has been left largely unexplored.

With this in mind, our work is motivated by two goals. First, we aim to leverage existing supervised and unsupervised methods in bilingual embedding mapping toward inducing high-quality multilingual embeddings. To this end, we propose three algorithms (one supervised, two unsupervised) and evaluate them against a completely supervised bilingual system and a commonly employed baseline approach. Second, we investigate the utility of multilingual embeddings in two common cross-lingual transfer learning scenarios: POS tagging and dependency parsing. To do so, we train a joint POS tagger/dependency parser on Universal Dependencies treebanks for a variety of Indo-European languages and evaluate it on other, closely related languages. Although we ultimately observe that, in most settings, multilingual word embeddings themselves do not induce a cross-lingual signal, our experimental framework and results offer many insights for future cross-lingual learning experiments.

Bibliographic Details
Main Author: Kulmizev, Artur
Other Authors: Agirre Bengoa, Eneko, van Noord, Gertjan
Format: Master Thesis
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10810/29083
institution Open Polar
collection ADDI: Repositorio Institucional de la Universidad del País Vasco (UPV)
op_collection_id ftunivpaisvasco
language English
description Word embeddings, dense vector representations of a word’s distributional semantics, are an indispensable component of contemporary natural language processing (NLP). Bilingual embeddings, in particular, have attracted much attention in recent years, given their inherent applicability to cross-lingual NLP tasks such as part-of-speech tagging and dependency parsing. However, despite recent advances in bilingual embedding mapping, very little research has been dedicated to aligning embeddings multilingually, where word embeddings for a variable number of languages are oriented to a single vector space. Given a proper alignment, one potential use case for multilingual embeddings is cross-lingual transfer learning, where a machine learning model trained on resource-rich languages (e.g. Finnish and Estonian) can “transfer” its salient features to a related language for which annotated resources are scarce (e.g. North Sami). However, the effect of the quality of this alignment on downstream cross-lingual NLP tasks has been left largely unexplored. With this in mind, our work is motivated by two goals. First, we aim to leverage existing supervised and unsupervised methods in bilingual embedding mapping toward inducing high-quality multilingual embeddings. To this end, we propose three algorithms (one supervised, two unsupervised) and evaluate them against a completely supervised bilingual system and a commonly employed baseline approach. Second, we investigate the utility of multilingual embeddings in two common cross-lingual transfer learning scenarios: POS tagging and dependency parsing. To do so, we train a joint POS tagger/dependency parser on Universal Dependencies treebanks for a variety of Indo-European languages and evaluate it on other, closely related languages. Although we ultimately observe that, in most settings, multilingual word embeddings themselves do not induce a cross-lingual signal, our experimental framework and results offer many insights for future cross-lingual learning experiments.
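The supervised bilingual mapping the abstract refers to is commonly framed as an orthogonal Procrustes problem: given seed-dictionary word pairs, find the orthogonal matrix that best rotates the source-language vectors onto their target-language counterparts. The following is a minimal sketch of that standard approach (not the thesis's specific algorithms), using synthetic stand-ins for the two languages' embedding matrices:

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimizing ||X W - Y||_F (Procrustes solution).

    X, Y: (n, d) matrices of embeddings for n seed-dictionary word pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
d = 5
W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal rotation
X = rng.normal(size=(100, d))                      # "source language" vectors
Y = X @ W_true                                     # perfectly rotated "targets"

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # prints True: the mapping aligns the two spaces
```

With real embeddings the rotated source vectors only approximate the targets, and translation is done by nearest-neighbor search in the shared space; aligning several languages at once (the multilingual setting the thesis addresses) additionally requires choosing how to orient every language toward one common space, e.g. via a pivot language.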
author2 Agirre Bengoa, Eneko
van Noord, Gertjan
format Master Thesis
author Kulmizev, Artur
title Multilingual word embeddings and their utility in cross-lingual learning
publishDate 2018
url http://hdl.handle.net/10810/29083
genre sami
op_relation http://hdl.handle.net/10810/29083
op_rights info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/3.0/es/
Attribution-NonCommercial-ShareAlike 3.0 Spain
op_rightsnorm CC-BY-NC-SA