CommonLanguage

CommonLanguage Dataset. This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset is already balanced and split into train,...

Bibliographic Details
Main Authors:	Sinisetty, Ganesh, Ruban, Pavlo, Dymov, Oleksandr, Ravanelli, Mirco
Format:	Dataset
Language:	unknown
Published:	Zenodo 2021
Subjects:	speechbrain language identification CommonVoice speech recognition multiple languages speaker recognition language recognition open-source
Online Access:	https://dx.doi.org/10.5281/zenodo.5036977 https://zenodo.org/record/5036977

id	ftdatacite:10.5281/zenodo.5036977
record_format	openpolar
spelling	ftdatacite:10.5281/zenodo.5036977 2023-05-15T18:08:23+02:00 CommonLanguage Sinisetty, Ganesh Ruban, Pavlo Dymov, Oleksandr Ravanelli, Mirco 2021 https://dx.doi.org/10.5281/zenodo.5036977 https://zenodo.org/record/5036977 unknown Zenodo https://dx.doi.org/10.5281/zenodo.5036976 Open Access Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 info:eu-repo/semantics/openAccess CC-BY speechbrain language identification CommonVoice speech recognition multiple languages speaker recognition language recognition open-source dataset Dataset 2021 ftdatacite https://doi.org/10.5281/zenodo.5036977 https://doi.org/10.5281/zenodo.5036976 2021-11-05T12:55:41Z
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	unknown
topic	speechbrain language identification CommonVoice speech recognition multiple languages speaker recognition language recognition open-source
spellingShingle	speechbrain language identification CommonVoice speech recognition multiple languages speaker recognition language recognition open-source Sinisetty, Ganesh Ruban, Pavlo Dymov, Oleksandr Ravanelli, Mirco CommonLanguage
topic_facet	speechbrain language identification CommonVoice speech recognition multiple languages speaker recognition language recognition open-source
description	CommonLanguage Dataset

This dataset is composed of speech recordings from 45 languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset is already balanced and split into train, dev (validation) and test sets. It was extracted from CommonVoice to make it easy to train language-id systems.

The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

Statistics of CommonLanguage:

| Name | Train | Dev | Test |
|:---------------------------------:|:------:|:------:|:-----:|
| # of utterances | 177552 | 47104 | 47704 |
| # unique speakers | 11189 | 1297 | 1322 |
| Total duration, hr | 30.04 | 7.53 | 7.53 |
| Min duration, sec | 0.86 | 0.98 | 0.89 |
| Mean duration, sec | 4.87 | 4.61 | 4.55 |
| Max duration, sec | 21.72 | 105.67 | 29.83 |
| Duration per language, min | ~40 | ~10 | ~10 |

## List of languages:

* Arabic
* Basque
* Breton
* Catalan
* Chinese_China
* Chinese_Hongkong
* Chinese_Taiwan
* Chuvash
* Czech
* Dhivehi
* Dutch
* English
* Esperanto
* Estonian
* French
* Frisian
* Georgian
* German
* Greek
* Hakha_Chin
* Indonesian
* Interlingua
* Italian
* Japanese
* Kabyle
* Kinyarwanda
* Kyrgyz
* Latvian
* Maltese
* Mangolian
* Persian
* Polish
* Portuguese
* Romanian
* Romansh_Sursilvan
* Russian
* Sakha
* Slovenian
* Spanish
* Swedish
* Tamil
* Tatar
* Turkish
* Ukranian
* Welsh

## Other information

In addition to the language label, each utterance has `age`, `gender` and `utterance transcription` fields associated with it.

References: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori and Yoshua Bengio, "SpeechBrain: A General-Purpose Speech Toolkit", 2021, arXiv.
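The figures in the statistics table above can be cross-checked with simple arithmetic. The sketch below is not part of the dataset or of the SpeechBrain recipe; the names `STATS`, `implied_utterances` and `summarize` are made up for illustration, and the numbers are copied from the table as listed. It checks three things: that the split durations add up to the stated 45.1 hours, that the per-language minutes times 45 languages roughly reproduce each split's total, and how the listed utterance counts compare with the count implied by total duration divided by mean duration. On the listed figures, that last ratio comes out close to 8 for every split rather than close to 1. The record does not explain why, so the utterance counts are probably best treated with some caution.

```python
# Figures copied from the "Statistics of CommonLanguage" table in the description.
# All names in this sketch are hypothetical; nothing here comes from the dataset files.
STATS = {
    "train": {"utterances": 177552, "total_hr": 30.04, "mean_sec": 4.87, "per_lang_min": 40},
    "dev": {"utterances": 47104, "total_hr": 7.53, "mean_sec": 4.61, "per_lang_min": 10},
    "test": {"utterances": 47704, "total_hr": 7.53, "mean_sec": 4.55, "per_lang_min": 10},
}
NUM_LANGUAGES = 45  # number of entries in the language list


def implied_utterances(total_hr: float, mean_sec: float) -> float:
    """Utterance count implied by total duration divided by mean duration."""
    return total_hr * 3600.0 / mean_sec


def summarize(stats=STATS, num_languages=NUM_LANGUAGES):
    """Return per-split consistency figures derived from the listed statistics."""
    rows = {}
    for split, s in stats.items():
        implied = implied_utterances(s["total_hr"], s["mean_sec"])
        rows[split] = {
            # How many utterances the durations alone would suggest.
            "implied_utterances": implied,
            # Listed count relative to the implied count (1.0 would mean agreement).
            "listed_to_implied_ratio": s["utterances"] / implied,
            # Split duration reconstructed from the approximate per-language minutes.
            "hr_from_per_language": s["per_lang_min"] * num_languages / 60.0,
        }
    return rows


if __name__ == "__main__":
    total_hr = sum(s["total_hr"] for s in STATS.values())
    print(f"sum of split durations: {total_hr:.2f} hr")
    for split, row in summarize().items():
        print(
            f"{split}: implied utterances {row['implied_utterances']:.0f}, "
            f"listed/implied ratio {row['listed_to_implied_ratio']:.2f}, "
            f"duration from per-language minutes {row['hr_from_per_language']:.2f} hr"
        )
```

Running the script prints one line per split. The duration checks agree with the description, while the utterance counts are the one row that does not follow from the others.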
format	Dataset
author	Sinisetty, Ganesh Ruban, Pavlo Dymov, Oleksandr Ravanelli, Mirco
author_facet	Sinisetty, Ganesh Ruban, Pavlo Dymov, Oleksandr Ravanelli, Mirco
author_sort	Sinisetty, Ganesh
title	CommonLanguage
title_short	CommonLanguage
title_full	CommonLanguage
title_fullStr	CommonLanguage
title_full_unstemmed	CommonLanguage
title_sort	commonlanguage
publisher	Zenodo
publishDate	2021
url	https://dx.doi.org/10.5281/zenodo.5036977 https://zenodo.org/record/5036977
op_relation	https://dx.doi.org/10.5281/zenodo.5036976
op_rights	Open Access Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 info:eu-repo/semantics/openAccess
op_rightsnorm	CC-BY
op_doi	https://doi.org/10.5281/zenodo.5036977 https://doi.org/10.5281/zenodo.5036976
_version_	1766180665640878080

Similar Items