Extending the Subwording Model of Multilingual Pretrained Models for New Languages

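Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.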

Bibliographic Details
Main Authors: Imamura, Kenji, Sumita, Eiichiro
Format: Text
Language: unknown
Published: 2022
Subjects: Computer Science - Computation and Language
Online Access: http://arxiv.org/abs/2211.15965
id ftarxivpreprints:oai:arXiv.org:2211.15965
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2211.15965 2023-09-05T13:20:41+02:00 Extending the Subwording Model of Multilingual Pretrained Models for New Languages Imamura, Kenji Sumita, Eiichiro 2022-11-29 http://arxiv.org/abs/2211.15965 Computer Science - Computation and Language text 2022 ftarxivpreprints 2023-08-16T17:24:57Z Text inuktitut ArXiv.org (Cornell University Library)
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Computation and Language
description Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation. Comment: Code: https://github.com/kenji-imamura/sentpiece_mimic
format Text
author Imamura, Kenji
Sumita, Eiichiro
author_sort Imamura, Kenji
title Extending the Subwording Model of Multilingual Pretrained Models for New Languages
title_sort extending the subwording model of multilingual pretrained models for new languages
publishDate 2022
url http://arxiv.org/abs/2211.15965
genre inuktitut
op_relation http://arxiv.org/abs/2211.15965
_version_ 1776201329201381376