Multilingual word embeddings and their utility in cross-lingual learning

Word embeddings, dense vector representations of a word’s distributional semantics, are an indispensable component of contemporary natural language processing (NLP). Bilingual embeddings, in particular, have attracted much attention in recent years, given their inherent applicability to cross-lingual NLP tasks such as part-of-speech tagging and dependency parsing. However, despite recent advances in bilingual embedding mapping, very little research has been dedicated to aligning embeddings multilingually, where word embeddings for a variable number of languages are oriented to a single vector space. Given a proper alignment, one potential use case for multilingual embeddings is cross-lingual transfer learning, where a machine learning model trained on resource-rich languages (e.g. Finnish and Estonian) can “transfer” its salient features to a related language for which annotated resources are scarce (e.g. North Sami). However, the effect of the quality of this alignment on downstream cross-lingual NLP tasks has been left largely unexplored.

With this in mind, our work is motivated by two goals. First, we aim to leverage existing supervised and unsupervised methods in bilingual embedding mapping toward inducing high-quality multilingual embeddings. To this end, we propose three algorithms (one supervised, two unsupervised) and evaluate them against a completely supervised bilingual system and a commonly employed baseline approach. Second, we investigate the utility of multilingual embeddings in two common cross-lingual transfer learning scenarios: POS tagging and dependency parsing. To do so, we train a joint POS tagger/dependency parser on Universal Dependencies treebanks for a variety of Indo-European languages and evaluate it on other, closely related languages. Although we ultimately observe that, in most settings, multilingual word embeddings themselves do not induce a cross-lingual signal, our experimental framework and results offer many insights for future cross-lingual learning experiments.

Bibliographic Details
Main Author: Kulmizev, Artur
Other Authors: Agirre Bengoa, Eneko, van Noord, Gertjan
Format: Master Thesis
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10810/29083
institution Open Polar
collection ADDI: Repositorio Institucional de la Universidad del País Vasco (UPV)
op_collection_id ftunivpaisvasco
language English
description Word embeddings, dense vector representations of a word’s distributional semantics, are an indispensable component of contemporary natural language processing (NLP). Bilingual embeddings, in particular, have attracted much attention in recent years, given their inherent applicability to cross-lingual NLP tasks such as part-of-speech tagging and dependency parsing. However, despite recent advances in bilingual embedding mapping, very little research has been dedicated to aligning embeddings multilingually, where word embeddings for a variable number of languages are oriented to a single vector space. Given a proper alignment, one potential use case for multilingual embeddings is cross-lingual transfer learning, where a machine learning model trained on resource-rich languages (e.g. Finnish and Estonian) can “transfer” its salient features to a related language for which annotated resources are scarce (e.g. North Sami). However, the effect of the quality of this alignment on downstream cross-lingual NLP tasks has been left largely unexplored. With this in mind, our work is motivated by two goals. First, we aim to leverage existing supervised and unsupervised methods in bilingual embedding mapping toward inducing high-quality multilingual embeddings. To this end, we propose three algorithms (one supervised, two unsupervised) and evaluate them against a completely supervised bilingual system and a commonly employed baseline approach. Second, we investigate the utility of multilingual embeddings in two common cross-lingual transfer learning scenarios: POS tagging and dependency parsing. To do so, we train a joint POS tagger/dependency parser on Universal Dependencies treebanks for a variety of Indo-European languages and evaluate it on other, closely related languages. Although we ultimately observe that, in most settings, multilingual word embeddings themselves do not induce a cross-lingual signal, our experimental framework and results offer many insights for future cross-lingual learning experiments.
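The supervised bilingual mapping the abstract refers to is commonly framed as an orthogonal Procrustes problem: given seed-dictionary word pairs, find the orthogonal matrix that best rotates the source-language vectors onto their target-language counterparts. The following is a minimal sketch of that standard approach (not the thesis's specific algorithms), using synthetic stand-ins for the two languages' embedding matrices:

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimizing ||X W - Y||_F (Procrustes solution).

    X, Y: (n, d) matrices of embeddings for n seed-dictionary word pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
d = 5
W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal rotation
X = rng.normal(size=(100, d))                      # "source language" vectors
Y = X @ W_true                                     # perfectly rotated "targets"

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # prints True: the mapping aligns the two spaces
```

With real embeddings the rotated source vectors only approximate the targets, and translation is done by nearest-neighbor search in the shared space; aligning several languages at once (the multilingual setting the thesis addresses) additionally requires choosing how to orient every language toward one common space, e.g. via a pivot language.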
author2 Agirre Bengoa, Eneko
van Noord, Gertjan
format Master Thesis
author Kulmizev, Artur
title Multilingual word embeddings and their utility in cross-lingual learning
publishDate 2018
url http://hdl.handle.net/10810/29083
genre sami
op_relation http://hdl.handle.net/10810/29083
op_rights info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/3.0/es/
Attribution-NonCommercial-ShareAlike 3.0 Spain
op_rightsnorm CC-BY-NC-SA