Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

Recent neural networks such as WaveNet and SampleRNN, which learn directly from speech waveform samples, have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. Such networks are used as an alternative to conventional vocoders and are therefore often called neural vocoders. A neural vocoder takes acoustic features as local conditioning parameters, and these parameters must be accurately predicted by a separate acoustic model. However, it is not yet clear how best to train this acoustic model, which is problematic because the final quality of the synthetic speech depends heavily on its performance; significant degradation occurs especially when the predicted acoustic features have characteristics mismatched with those of natural ones. To reduce this mismatch between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, the Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis with a WaveNet vocoder. We also extend these GAN frameworks by using the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to the mean squared error and adversarial losses, as part of the objective function. Experimental results show that acoustic models trained with the WGAN-GP framework and back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
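
To make the training objective in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of that setup: an acoustic model (generator) trained with a mean-squared-error term plus a WGAN-GP adversarial term, with the back-propagated frozen-WaveNet DML term indicated as a commented-out placeholder. The network architectures, the feature dimensionality `FEAT_DIM`, the loss weighting, and the `wavenet_dml_loss` helper are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (assumed shapes and toy networks, not the paper's models).
import torch
import torch.nn as nn

FEAT_DIM, HID = 80, 256          # assumed acoustic-feature dimensionality

generator = nn.Sequential(       # acoustic model: conditioning -> acoustic features
    nn.Linear(FEAT_DIM, HID), nn.ReLU(), nn.Linear(HID, FEAT_DIM))
critic = nn.Sequential(          # WGAN-GP critic over acoustic features
    nn.Linear(FEAT_DIM, HID), nn.ReLU(), nn.Linear(HID, 1))

def gradient_penalty(critic, real, fake, lam=10.0):
    """Standard WGAN-GP penalty: (||grad D(x_hat)||_2 - 1)^2 at interpolates."""
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# --- one toy training step on random stand-in data ---
ling = torch.randn(8, FEAT_DIM)   # stand-in linguistic/conditioning features
real = torch.randn(8, FEAT_DIM)   # stand-in natural acoustic features

# 1) Critic step: minimize D(fake) - D(real), plus the gradient penalty.
fake = generator(ling).detach()
d_loss = critic(fake).mean() - critic(real).mean() \
         + gradient_penalty(critic, real, fake)
d_loss.backward()

# 2) Generator step: MSE + adversarial term (+ optional frozen-WaveNet DML term).
fake = generator(ling)
g_loss = nn.functional.mse_loss(fake, real) - critic(fake).mean()
# g_loss += beta * wavenet_dml_loss(fake, waveform)  # hypothetical placeholder for
#                                                    # the paper's back-propagated
#                                                    # DML loss of a frozen WaveNet
g_loss.backward()
```

The gradient penalty here is the standard WGAN-GP formulation: it penalizes the critic's gradient norm for deviating from 1 at random interpolates between natural and predicted features, which is what stabilizes critic training without weight clipping.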

Bibliographic Details
Main Authors: Zhao, Yi; Takaki, Shinji; Luong, Hieu-Thi; Yamagishi, Junichi; Saito, Daisuke; Minematsu, Nobuaki
Format: Report
Language: unknown
Published: arXiv 2018
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD); Machine Learning (stat.ML); FOS: Electrical engineering, electronic engineering, information engineering; FOS: Computer and information sciences
Online Access:https://dx.doi.org/10.48550/arxiv.1807.11679
https://arxiv.org/abs/1807.11679
Rights: arXiv.org perpetual, non-exclusive license (http://arxiv.org/licenses/nonexclusive-distrib/1.0/)