Extending the Subwording Model of Multilingual Pretrained Models for New Languages
Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend...
Main Authors: | Imamura, Kenji; Sumita, Eiichiro |
---|---|
Format: | Text |
Language: | unknown |
Published: | 2022 |
Subjects: | Computer Science - Computation and Language |
Online Access: | http://arxiv.org/abs/2211.15965 |
id |
ftarxivpreprints:oai:arXiv.org:2211.15965 |
---|---|
record_format |
openpolar |
spelling |
ftarxivpreprints:oai:arXiv.org:2211.15965 2023-09-05T13:20:41+02:00 Extending the Subwording Model of Multilingual Pretrained Models for New Languages Imamura, Kenji Sumita, Eiichiro 2022-11-29 http://arxiv.org/abs/2211.15965 unknown http://arxiv.org/abs/2211.15965 Computer Science - Computation and Language text 2022 ftarxivpreprints 2023-08-16T17:24:57Z Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation. Comment: Code: https://github.com/kenji-imamura/sentpiece_mimic Text inuktitut ArXiv.org (Cornell University Library) |
institution |
Open Polar |
collection |
ArXiv.org (Cornell University Library) |
op_collection_id |
ftarxivpreprints |
language |
unknown |
topic |
Computer Science - Computation and Language |
description |
Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation. Comment: Code: https://github.com/kenji-imamura/sentpiece_mimic |
format |
Text |
author |
Imamura, Kenji; Sumita, Eiichiro |
title |
Extending the Subwording Model of Multilingual Pretrained Models for New Languages |
publishDate |
2022 |
url |
http://arxiv.org/abs/2211.15965 |
genre |
inuktitut |
op_relation |
http://arxiv.org/abs/2211.15965 |