CommonLanguage

CommonLanguage Dataset. This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset is already balanced and split into train,...

Bibliographic Details
Main Authors: Sinisetty, Ganesh, Ruban, Pavlo, Dymov, Oleksandr, Ravanelli, Mirco
Format: Dataset
Language: unknown
Published: Zenodo 2021
Subjects:
Online Access: https://dx.doi.org/10.5281/zenodo.5036977
https://zenodo.org/record/5036977
id ftdatacite:10.5281/zenodo.5036977
record_format openpolar
spelling ftdatacite:10.5281/zenodo.5036977 2023-05-15T18:08:23+02:00 2021-11-05T12:55:41Z
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic speechbrain
language identification
CommonVoice
speech recognition
multiple languages
speaker recognition
language recognition
open-source
description CommonLanguage Dataset

This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset is already balanced and split into train, dev (validation) and test sets. It was extracted from CommonVoice to make it easy to train language-ID systems. The baselines for language ID are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

Statistics of CommonLanguage:

| Name | Train | Dev | Test |
|:---------------------------------:|:------:|:------:|:-----:|
| **# of utterances** | 177552 | 47104 | 47704 |
| **# unique speakers** | 11189 | 1297 | 1322 |
| **Total duration, hr** | 30.04 | 7.53 | 7.53 |
| **Min duration, sec** | 0.86 | 0.98 | 0.89 |
| **Mean duration, sec** | 4.87 | 4.61 | 4.55 |
| **Max duration, sec** | 21.72 | 105.67 | 29.83 |
| **Duration per language, min** | ~40 | ~10 | ~10 |

## List of languages:

* Arabic
* Basque
* Breton
* Catalan
* Chinese_China
* Chinese_Hongkong
* Chinese_Taiwan
* Chuvash
* Czech
* Dhivehi
* Dutch
* English
* Esperanto
* Estonian
* French
* Frisian
* Georgian
* German
* Greek
* Hakha_Chin
* Indonesian
* Interlingua
* Italian
* Japanese
* Kabyle
* Kinyarwanda
* Kyrgyz
* Latvian
* Maltese
* Mangolian
* Persian
* Polish
* Portuguese
* Romanian
* Romansh_Sursilvan
* Russian
* Sakha
* Slovenian
* Spanish
* Swedish
* Tamil
* Tatar
* Turkish
* Ukranian
* Welsh

## Other information

In addition to the language label, each utterance comes with `age`, `gender` and `utterance transcription` fields.
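The split durations in the statistics table can be cross-checked against the headline figures with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the table, the language count of 45 is taken from the length of the language list, and the names `SPLIT_HOURS`, `total_hours` and `minutes_per_language` are hypothetical helpers, not part of the dataset or of SpeechBrain.

```python
# Total duration per split in hours, copied from the statistics table.
SPLIT_HOURS = {"train": 30.04, "dev": 7.53, "test": 7.53}

# Assumption: 45 languages, i.e. the number of entries in the language list.
NUM_LANGUAGES = 45


def total_hours(split_hours):
    """Sum the per-split durations, rounded to two decimals."""
    return round(sum(split_hours.values()), 2)


def minutes_per_language(hours, num_languages=NUM_LANGUAGES):
    """Average minutes of audio per language for one split."""
    return hours * 60 / num_languages


if __name__ == "__main__":
    print(f"total duration: {total_hours(SPLIT_HOURS)} h")
    for split, hours in SPLIT_HOURS.items():
        print(f"{split}: {minutes_per_language(hours):.1f} min per language")
```

Under these assumptions the three splits add up to the stated 45.1 hours, and the per-language averages land close to the ~40 / ~10 / ~10 minutes given in the last row of the table.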
References: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori and Yoshua Bengio, "SpeechBrain: A General-Purpose Speech Toolkit", 2021, arXiv.
format Dataset
author Sinisetty, Ganesh
Ruban, Pavlo
Dymov, Oleksandr
Ravanelli, Mirco
title CommonLanguage
publisher Zenodo
publishDate 2021
url https://dx.doi.org/10.5281/zenodo.5036977
https://zenodo.org/record/5036977
op_relation https://dx.doi.org/10.5281/zenodo.5036976
op_rights Open Access
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
cc-by-4.0
info:eu-repo/semantics/openAccess
op_rightsnorm CC-BY
op_doi https://doi.org/10.5281/zenodo.5036977
https://doi.org/10.5281/zenodo.5036976
_version_ 1766180665640878080