Automatic Speech Recognition of Low-Resource Languages Based on Chukchi
This paper presents a project on the research and development of a new automatic speech recognition (ASR) system for the Chukchi language. No single complete corpus of Chukchi exists, so most of the work consisted of collecting audio and texts in the Chukchi language from ope...
Main Authors: | Safonova, Anastasia; Yudina, Tatiana; Nadimanov, Emil; Davenport, Cydnie |
---|---|
Format: | Text |
Language: | unknown |
Published: | 2022 |
Subjects: | Computer Science - Computation and Language |
Online Access: | http://arxiv.org/abs/2210.05726 |
id |
ftarxivpreprints:oai:arXiv.org:2210.05726 |
---|---|
record_format |
openpolar |
spelling |
ftarxivpreprints:oai:arXiv.org:2210.05726 2023-09-05T13:18:51+02:00 Automatic Speech Recognition of Low-Resource Languages Based on Chukchi Safonova, Anastasia Yudina, Tatiana Nadimanov, Emil Davenport, Cydnie 2022-10-11 http://arxiv.org/abs/2210.05726 unknown http://arxiv.org/abs/2210.05726 Computer Science - Computation and Language text 2022 ftarxivpreprints 2023-08-16T17:19:46Z The following paper presents a project focused on the research and creation of a new Automatic Speech Recognition (ASR) based in the Chukchi language. There is no one complete corpus of the Chukchi language, so most of the work consisted in collecting audio and texts in the Chukchi language from open sources and processing them. We managed to collect 21:34:23 hours of audio recordings and 112,719 sentences (or 2,068,273 words) of text in the Chukchi language. The XLSR model was trained on the obtained data, which showed good results even with a small amount of data. Besides the fact that the Chukchi language is a low-resource language, it is also polysynthetic, which significantly complicates any automatic processing. Thus, the usual WER metric for evaluating ASR becomes less indicative for a polysynthetic language. However, the CER metric showed good results. The question of metrics for polysynthetic languages remains open. Text Chukchi ArXiv.org (Cornell University Library) |
institution |
Open Polar |
collection |
ArXiv.org (Cornell University Library) |
op_collection_id |
ftarxivpreprints |
language |
unknown |
topic |
Computer Science - Computation and Language |
description |
This paper presents a project on the research and development of a new automatic speech recognition (ASR) system for the Chukchi language. No single complete corpus of Chukchi exists, so most of the work consisted of collecting audio and text in Chukchi from open sources and processing them. We collected 21:34:23 (hours:minutes:seconds) of audio recordings and 112,719 sentences (2,068,273 words) of text in Chukchi. An XLSR model was trained on the collected data and showed good results even with this small amount of data. Chukchi is not only low-resource but also polysynthetic, which significantly complicates automatic processing. As a result, the usual word error rate (WER) metric is less informative for evaluating ASR on a polysynthetic language, whereas the character error rate (CER) showed good results. The question of which metrics suit polysynthetic languages remains open. |
format |
Text |
author |
Safonova, Anastasia; Yudina, Tatiana; Nadimanov, Emil; Davenport, Cydnie |
author_sort |
Safonova, Anastasia |
title |
Automatic Speech Recognition of Low-Resource Languages Based on Chukchi |
publishDate |
2022 |
url |
http://arxiv.org/abs/2210.05726 |
genre |
Chukchi |
op_relation |
http://arxiv.org/abs/2210.05726 |
_version_ |
1776199702425894912 |
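
The description's remark that WER is less indicative than CER for a polysynthetic language can be illustrated with a minimal sketch. The code below is not taken from the paper: the function names are hypothetical, and the strings are invented placeholders rather than Chukchi text. It assumes the standard definitions of both metrics, namely Levenshtein distance over words (WER) or characters (CER), divided by the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Invented placeholder strings (NOT Chukchi): one long "word" with a single
# wrong character stands in for a long polysynthetic word form.
reference = "abcdefghijklmnop qrstuvwx"
hypothesis = "abcdefghijklmnoq qrstuvwx"

print(f"WER = {wer(reference, hypothesis):.2f}")
print(f"CER = {cer(reference, hypothesis):.2f}")
```

Under these assumptions, a single wrong character makes the whole long word count as an error, so WER penalizes the hypothesis far more heavily than CER does. This is one plausible reading of why word-level scores look worse for languages whose words carry many morphemes.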