Extending the Subwording Model of Multilingual Pretrained Models for New Languages

Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend...

Bibliographic Details
Main Authors:	Imamura, Kenji; Sumita, Eiichiro
Format:	Text
Language:	unknown
Published:	2022
Subjects:	Computer Science - Computation and Language; Inuktitut
Online Access:	http://arxiv.org/abs/2211.15965

id	ftarxivpreprints:oai:arXiv.org:2211.15965
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2211.15965 2023-09-05T13:20:41+02:00 Extending the Subwording Model of Multilingual Pretrained Models for New Languages Imamura, Kenji Sumita, Eiichiro 2022-11-29 http://arxiv.org/abs/2211.15965 unknown http://arxiv.org/abs/2211.15965 Computer Science - Computation and Language text 2022 ftarxivpreprints 2023-08-16T17:24:57Z Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation. Comment: Code: https://github.com/kenji-imamura/sentpiece_mimic Text inuktitut ArXiv.org (Cornell University Library)
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Computation and Language
spellingShingle	Computer Science - Computation and Language Imamura, Kenji Sumita, Eiichiro Extending the Subwording Model of Multilingual Pretrained Models for New Languages
topic_facet	Computer Science - Computation and Language
description	Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation. Comment: Code: https://github.com/kenji-imamura/sentpiece_mimic
format	Text
author	Imamura, Kenji Sumita, Eiichiro
author_facet	Imamura, Kenji Sumita, Eiichiro
author_sort	Imamura, Kenji
title	Extending the Subwording Model of Multilingual Pretrained Models for New Languages
title_sort	extending the subwording model of multilingual pretrained models for new languages
publishDate	2022
url	http://arxiv.org/abs/2211.15965
genre	inuktitut
genre_facet	inuktitut
op_relation	http://arxiv.org/abs/2211.15965
_version_	1776201329201381376

Similar Items