Speech Recognition for Endangered and Extinct Samoyedic languages

Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. With the Kamas language, we achieve a Label...


Bibliographic Details
Main Authors: Partanen, Niko, Hämäläinen, Mika, Klooster, Tiina
Format: Text
Language: unknown
Published: 2020-12-09
Subjects: Computer Science - Computation and Language
Online Access: http://arxiv.org/abs/2012.05331
id ftarxivpreprints:oai:arXiv.org:2012.05331
record_format openpolar
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Computation and Language
description Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. With the Kamas language, we achieve a Label Error Rate of 15%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with the related Nganasan language are more modest, with the best model having an error rate of 33%. We show, however, through experiments in which the Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected under the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to ensure further work on this important topic. Comment: the 34th Pacific Asia Conference on Language, Information and Computation
format Text
author Partanen, Niko
Hämäläinen, Mika
Klooster, Tiina
title Speech Recognition for Endangered and Extinct Samoyedic languages
publishDate 2020
url http://arxiv.org/abs/2012.05331
geographic Pacific
genre Nganasan*
samoyed*
Siberia
op_relation http://arxiv.org/abs/2012.05331
_version_ 1776201827939778560
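
The description above reports results as a Label Error Rate (15% for Kamas, 33% for Nganasan) but this record does not define the metric. The sketch below assumes the common definition: the edit distance between the predicted and reference label sequences, divided by the reference length. The function names and the toy strings are hypothetical and are not taken from the authors' published scripts.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j labels of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example with characters as labels (not real transcription data).
reference = list("kamas")
hypothesis = list("kamos")
print(label_error_rate(reference, hypothesis))
```

Under this assumed definition, an error rate of 15% would correspond to roughly one label in seven needing correction by a human transcriber.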