A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space

Anthology paper link: https://aclanthology.org/2021.emnlp-main.471/ Abstract: In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pret...

Bibliographic Details
Main Authors: Jones, Alex; Mahowald, Kyle; Wang, William Yang
Conference: The 2021 Conference on Empirical Methods in Natural Language Processing
Format: Article in Journal/Newspaper
Language: unknown
Published: Underline Science Inc. 2021
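Subjects: Natural Language Processing; Machine Learning; Machine Learning and Data Mining; Computational Linguistics; Language Models; Machine translation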
Online Access: https://dx.doi.org/10.48448/5bth-3135
https://underline.io/lecture/37745-a-massively-multilingual-analysis-of-cross-linguality-in-shared-embedding-space
id ftdatacite:10.48448/5bth-3135
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic Natural Language Processing
Machine Learning
Machine Learning and Data Mining
Computational Linguistics
Language Models
Machine translation
description Anthology paper link: https://aclanthology.org/2021.emnlp-main.471/ Abstract: In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus.
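The description mentions 5,050 language pairs for 101 languages and a task-based alignment measure based on bitext retrieval. The following is a minimal sketch of both ideas, not the authors' implementation: it assumes plain cosine nearest-neighbor retrieval (the paper may use a different scoring scheme), and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target sentence
    (by cosine) is the aligned translation, i.e. the one at the same index."""
    hits = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        hits += int(best == i)
    return hits / len(src_vecs)


# Unordered pairs among 101 languages: 101 * 100 / 2.
num_pairs = len(list(combinations(range(101), 2)))

# Toy sentence embeddings (hypothetical); row i in src is aligned with row i in tgt.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
accuracy = retrieval_accuracy(src, tgt)
```

In a real setting the vectors would come from a sentence encoder such as LaBSE or LASER applied to aligned verses, and the score would be computed once per language pair.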
format Article in Journal/Newspaper
author The 2021 Conference on Empirical Methods in Natural Language Processing 2021
Jones, Alex
Mahowald, Kyle
Wang, William Yang
title A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
publisher Underline Science Inc.
publishDate 2021
url https://dx.doi.org/10.48448/5bth-3135
https://underline.io/lecture/37745-a-massively-multilingual-analysis-of-cross-linguality-in-shared-embedding-space
genre inuktitut
op_doi https://doi.org/10.48448/5bth-3135
_version_ 1766046590365073408