A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available.
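The abstract's task-based alignment measure, bitext retrieval, can be illustrated with a minimal sketch: given parallel sentence embeddings, score a language pair by how often each source sentence's nearest target neighbour is its gold translation. The code below uses plain cosine nearest-neighbour search over random toy vectors standing in for LaBSE/LASER outputs; it is an assumption-laden illustration, not necessarily the exact retrieval criterion (e.g. margin-based scoring) used in the paper.

```python
import numpy as np

def bitext_retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target sentence (by
    cosine similarity) is the gold translation, assuming row i of
    src_emb and row i of tgt_emb encode parallel sentences."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n, n) pairwise similarity matrix
    nearest = sims.argmax(axis=1)      # index of nearest target per source
    return float((nearest == np.arange(src.shape[0])).mean())

# Toy demo with random vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
en = rng.normal(size=(200, 64))                 # "source-language" embeddings
aligned = en + 0.1 * rng.normal(size=en.shape)  # a well-aligned language pair
unrelated = rng.normal(size=en.shape)           # a poorly aligned language pair

print(bitext_retrieval_accuracy(en, aligned))    # high, near 1.0
print(bitext_retrieval_accuracy(en, unrelated))  # near chance (1/200)
```

A well-aligned pair scores near 1.0 and an unrelated one near chance, which is what makes the metric usable as a per-pair alignment score across the 5,050 language pairs studied.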
Main Authors: | Jones, Alex; Wang, William Yang; Mahowald, Kyle |
---|---|
Format: | Article in Journal/Newspaper |
Language: | unknown |
Published: | arXiv, 2021 |
Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) |
Online Access: | https://dx.doi.org/10.48550/arxiv.2109.06324 https://arxiv.org/abs/2109.06324 |
id |
ftdatacite:10.48550/arxiv.2109.06324 |
record_format |
openpolar |
institution |
Open Polar |
collection |
DataCite Metadata Store (German National Library of Science and Technology) |
topic |
Computation and Language cs.CL Artificial Intelligence cs.AI Machine Learning cs.LG FOS Computer and information sciences I.2.7 |
description |
In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available. : 15 pages, 8 figures, EMNLP 2021 |
author |
Jones, Alex Wang, William Yang Mahowald, Kyle |
genre |
inuktitut |
op_rights |
Creative Commons Attribution Non Commercial Share Alike 4.0 International https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode cc-by-nc-sa-4.0 |
op_doi |
https://doi.org/10.48550/arxiv.2109.06324 |