CommonLanguage
CommonLanguage Dataset. This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset is already balanced and split into train,...
Main Authors: | Sinisetty, Ganesh; Ruban, Pavlo; Dymov, Oleksandr; Ravanelli, Mirco |
---|---|
Format: | Dataset |
Language: | unknown |
Published: | Zenodo 2021 |
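Subjects: | speechbrain; language identification; CommonVoice; speech recognition; multiple languages; speaker recognition; language recognition; open-source |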
Online Access: | https://dx.doi.org/10.5281/zenodo.5036977 https://zenodo.org/record/5036977 |
id |
ftdatacite:10.5281/zenodo.5036977 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
DataCite Metadata Store (German National Library of Science and Technology) |
op_collection_id |
ftdatacite |
language |
unknown |
topic |
speechbrain; language identification; CommonVoice; speech recognition; multiple languages; speaker recognition; language recognition; open-source |
description |
CommonLanguage Dataset

This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset is already balanced and split into train, dev (validation) and test sets. It was extracted from CommonVoice to make it easy to train language-id systems.

The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

Statistics of CommonLanguage:

| Name | Train | Dev | Test |
|:---------------------------------:|:------:|:------:|:-----:|
| **# of utterances** | 177552 | 47104 | 47704 |
| **# unique speakers** | 11189 | 1297 | 1322 |
| **Total duration, hr** | 30.04 | 7.53 | 7.53 |
| **Min duration, sec** | 0.86 | 0.98 | 0.89 |
| **Mean duration, sec** | 4.87 | 4.61 | 4.55 |
| **Max duration, sec** | 21.72 | 105.67 | 29.83 |
| **Duration per language, min** | ~40 | ~10 | ~10 |

## List of languages:

* Arabic
* Basque
* Breton
* Catalan
* Chinese_China
* Chinese_Hongkong
* Chinese_Taiwan
* Chuvash
* Czech
* Dhivehi
* Dutch
* English
* Esperanto
* Estonian
* French
* Frisian
* Georgian
* German
* Greek
* Hakha_Chin
* Indonesian
* Interlingua
* Italian
* Japanese
* Kabyle
* Kinyarwanda
* Kyrgyz
* Latvian
* Maltese
* Mangolian
* Persian
* Polish
* Portuguese
* Romanian
* Romansh_Sursilvan
* Russian
* Sakha
* Slovenian
* Spanish
* Swedish
* Tamil
* Tatar
* Turkish
* Ukranian
* Welsh

## Other information

In addition to the language label, each utterance is annotated with `age`, `gender` and `utterance transcription`.

## References

* Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori and Yoshua Bengio, "SpeechBrain: A General-Purpose Speech Toolkit", 2021, arXiv.
|
format |
Dataset |
author |
Sinisetty, Ganesh; Ruban, Pavlo; Dymov, Oleksandr; Ravanelli, Mirco |
author_sort |
Sinisetty, Ganesh |
title |
CommonLanguage |
title_sort |
commonlanguage |
publisher |
Zenodo |
publishDate |
2021 |
url |
https://dx.doi.org/10.5281/zenodo.5036977 https://zenodo.org/record/5036977 |
long_lat |
ENVELOPE(-61.400,-61.400,-70.633,-70.633) ENVELOPE(-171.669,-171.669,65.509,65.509) |
geographic |
Aris Loren Sakha |
genre |
Sakha |
op_relation |
https://dx.doi.org/10.5281/zenodo.5036976 |
op_rights |
Open Access Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 info:eu-repo/semantics/openAccess |
op_rightsnorm |
CC-BY |
op_doi |
https://doi.org/10.5281/zenodo.5036977 https://doi.org/10.5281/zenodo.5036976 |
_version_ |
1766180665640878080 |
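
The duration figures quoted in the description field can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not part of the dataset or of SpeechBrain: the hour totals are copied by hand from the statistics table, 45 is the number of entries in the language list, and the names `TOTAL_HOURS`, `NUM_LANGUAGES` and `minutes_per_language` are hypothetical. It assumes the splits are perfectly balanced across languages, which the description only states approximately.

```python
# Total duration per split in hours, copied from the "Statistics of CommonLanguage" table.
TOTAL_HOURS = {"train": 30.04, "dev": 7.53, "test": 7.53}

# Number of entries in the "List of languages" section (assumption: one label per language).
NUM_LANGUAGES = 45


def minutes_per_language(hours: float, num_languages: int = NUM_LANGUAGES) -> float:
    """Average minutes of audio per language, assuming a perfectly balanced split."""
    return hours * 60 / num_languages


# Sum of the three splits; the description quotes 45.1 hours overall.
total_hours = sum(TOTAL_HOURS.values())

if __name__ == "__main__":
    print(f"total duration: {total_hours:.2f} h")
    print(f"hours per language: {total_hours / NUM_LANGUAGES:.2f} h")
    for split, hours in TOTAL_HOURS.items():
        print(f"{split}: {minutes_per_language(hours):.1f} min per language")
```

Under these assumptions the three splits sum to the quoted 45.1 hours, and the per-language averages come out at roughly 40, 10 and 10 minutes, in line with the "Duration per language" row.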