TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision

Bibliographic Details
Main Authors:	A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King
Other Authors:	The Pennsylvania State University CiteSeerX Archives
Format:	Text
Language:	English
Published:	2013
Subjects:	Tundra
Online Access:	http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.396.250 http://www.cstr.ed.ac.uk/downloads/publications/2013/IS131055.pdf

Description
Summary:	Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly-supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, which are also briefly presented in the paper.
Index Terms:	multilingual corpus, light supervision, imperfect data, found data, text-to-speech, audiobook data

Similar Items