CommonLanguage

CommonLanguage Dataset. This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of audio recordings is 45.1 hours (i.e., 1 hour of material for each language). The dataset is already balanced and split into train,...


Bibliographic Details
Main Authors: Ganesh Sinisetty, Pavlo Ruban, Oleksandr Dymov, Mirco Ravanelli
Format: Other/Unknown Material
Language: unknown
Published: Zenodo 2021
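Subjects: speechbrain; language identification; CommonVoice; speech recognition; multiple languages; speaker recognition; language recognition; open-source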
Online Access: https://doi.org/10.5281/zenodo.5036977
id ftzenodo:oai:zenodo.org:5036977
record_format openpolar
spelling ftzenodo:oai:zenodo.org:5036977 2024-09-15T18:32:39+00:00 CommonLanguage Ganesh Sinisetty Pavlo Ruban Oleksandr Dymov Mirco Ravanelli 2021-06-28 https://doi.org/10.5281/zenodo.5036977 unknown Zenodo 2024-07-27T08:10:46Z
institution Open Polar
collection Zenodo
op_collection_id ftzenodo
language unknown
topic speechbrain
language identification
CommonVoice
speech recognition
multiple languages
speaker recognition
language recognition
open-source
description CommonLanguage Dataset

This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of audio recordings is 45.1 hours (i.e., 1 hour of material for each language). The dataset is already balanced and split into train, dev (validation) and test sets. It was extracted from CommonVoice to make it easy to train language-id systems.

The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

Statistics of CommonLanguage:

| Name | Train | Dev | Test |
|:---:|:---:|:---:|:---:|
| **# of utterances** | 177552 | 47104 | 47704 |
| **# unique speakers** | 11189 | 1297 | 1322 |
| **Total duration, hr** | 30.04 | 7.53 | 7.53 |
| **Min duration, sec** | 0.86 | 0.98 | 0.89 |
| **Mean duration, sec** | 4.87 | 4.61 | 4.55 |
| **Max duration, sec** | 21.72 | 105.67 | 29.83 |
| **Duration per language, min** | ~40 | ~10 | ~10 |

## List of languages:

* Arabic
* Basque
* Breton
* Catalan
* Chinese_China
* Chinese_Hongkong
* Chinese_Taiwan
* Chuvash
* Czech
* Dhivehi
* Dutch
* English
* Esperanto
* Estonian
* French
* Frisian
* Georgian
* German
* Greek
* Hakha_Chin
* Indonesian
* Interlingua
* Italian
* Japanese
* Kabyle
* Kinyarwanda
* Kyrgyz
* Latvian
* Maltese
* Mangolian
* Persian
* Polish
* Portuguese
* Romanian
* Romansh_Sursilvan
* Russian
* Sakha
* Slovenian
* Spanish
* Swedish
* Tamil
* Tatar
* Turkish
* Ukranian
* Welsh

## Other information

In addition to the language label, each utterance has an associated `age`, `gender` and `utterance transcription`.
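
The figures in the description can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the dataset or of SpeechBrain: the names `SPLITS`, `LANGUAGES`, `total_hours`, `hours_per_language` and `label_to_index` are hypothetical, the numbers are copied from the statistics table, and the language labels are kept exactly as spelled in the list. The alphabetical label-to-index mapping is only an assumption about how one might encode the classes; the SpeechBrain recipe may encode them differently.

```python
# Per-split figures copied from the statistics table in the description.
SPLITS = {
    "train": {"utterances": 177552, "speakers": 11189, "hours": 30.04},
    "dev": {"utterances": 47104, "speakers": 1297, "hours": 7.53},
    "test": {"utterances": 47704, "speakers": 1322, "hours": 7.53},
}

# Language labels, spelled exactly as in the description's list.
LANGUAGES = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China",
    "Chinese_Hongkong", "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi",
    "Dutch", "English", "Esperanto", "Estonian", "French", "Frisian",
    "Georgian", "German", "Greek", "Hakha_Chin", "Indonesian",
    "Interlingua", "Italian", "Japanese", "Kabyle", "Kinyarwanda",
    "Kyrgyz", "Latvian", "Maltese", "Mangolian", "Persian", "Polish",
    "Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha",
    "Slovenian", "Spanish", "Swedish", "Tamil", "Tatar", "Turkish",
    "Ukranian", "Welsh",
]


def total_hours(splits):
    """Sum the 'Total duration, hr' row over all splits."""
    return round(sum(split["hours"] for split in splits.values()), 2)


def hours_per_language(splits, languages):
    """Average amount of audio per language, in hours."""
    return total_hours(splits) / len(languages)


def label_to_index(languages):
    """Hypothetical class encoding: alphabetical order, starting at 0."""
    return {name: index for index, name in enumerate(sorted(languages))}


if __name__ == "__main__":
    print("languages:", len(LANGUAGES))
    print("total hours:", total_hours(SPLITS))
    print("hours per language:", round(hours_per_language(SPLITS, LANGUAGES), 3))
```

The three split durations add up to the stated 45.1 hours, and dividing by the 45 listed languages gives roughly one hour each, in line with the "Duration per language" row (about 40 + 10 + 10 minutes).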
format Other/Unknown Material
author Ganesh Sinisetty
Pavlo Ruban
Oleksandr Dymov
Mirco Ravanelli
title CommonLanguage
publisher Zenodo
publishDate 2021
url https://doi.org/10.5281/zenodo.5036977
genre Sakha
genre_facet Sakha
op_relation https://doi.org/10.5281/zenodo.5036976
https://doi.org/10.5281/zenodo.5036977
oai:zenodo.org:5036977
op_rights info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
op_doi https://doi.org/10.5281/zenodo.5036977
10.5281/zenodo.5036976
_version_ 1810474390188457984