Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

WaveNet, which learns directly from speech waveform samples, has been used as an alternative to vocoders and has achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. However, the WaveNet vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. It is not yet clear how best to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by its performance. Significant degradation occurs especially when the predicted acoustic features have characteristics that are mismatched with those of natural ones. In order to reduce this mismatch between natural and generated acoustic features, we propose new frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis using the WaveNet vocoder. The GAN generator acts as the acoustic model, and its outputs are used as the local condition parameters of the WaveNet. We also extend the GAN frameworks by using the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to mean squared error and adversarial losses, as part of the objective function. Experimental results show that acoustic models trained with the WGAN-GP framework and the back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
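The combined objective described in the abstract lends itself to a compact sketch. The Python (PyTorch-style) snippet below illustrates, under stated assumptions, how the generator loss might combine the three terms: frame-level mean squared error against natural acoustic features, a WGAN-style adversarial term from a critic, and the DML negative log-likelihood of the natural waveform computed by a frozen, pre-trained WaveNet conditioned on the generated features. The module interfaces (acoustic_model, critic, and a wavenet object exposing a dml_nll method) and the loss weights lambda_adv and lambda_dml are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def generator_loss(acoustic_model, critic, wavenet,
                   linguistic_feats, speaker_code,
                   natural_feats, natural_waveform,
                   lambda_adv=1.0, lambda_dml=1.0):
    # The generator (acoustic model) predicts acoustic features from
    # linguistic features and a speaker code (multi-speaker setting).
    generated_feats = acoustic_model(linguistic_feats, speaker_code)

    # (1) Frame-level mean squared error against natural acoustic features.
    loss_mse = F.mse_loss(generated_feats, natural_feats)

    # (2) WGAN adversarial term: the generator is updated to raise the
    #     critic score of generated features. The gradient penalty of
    #     WGAN-GP applies only to the critic update and is not shown here.
    loss_adv = -critic(generated_feats).mean()

    # (3) Waveform-level DML loss: the generated features are fed to a
    #     frozen, pre-trained WaveNet as local conditions, and its
    #     discretized mixture-of-logistics negative log-likelihood of the
    #     natural waveform is back-propagated to the generator.
    #     (dml_nll is a hypothetical method name for that likelihood.)
    for p in wavenet.parameters():
        p.requires_grad_(False)
    loss_dml = wavenet.dml_nll(natural_waveform,
                               local_condition=generated_feats)

    return loss_mse + lambda_adv * loss_adv + lambda_dml * loss_dml

In this sketch the WaveNet weights stay fixed, so gradients flow only through the generated features; this reflects the paper's use of a well-trained WaveNet purely as a waveform-level loss rather than as a module being trained.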

Bibliographic Details
Published in: IEEE Access, Vol. 6, pp. 60478-60488 (2018); ISSN 2169-3536
Main Authors: Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu
Format: Article in Journal/Newspaper
Language: English
Published: IEEE, 2018
Subjects: Generative adversarial network; multi-speaker modeling; speech synthesis; WaveNet; Electrical engineering. Electronics. Nuclear engineering (TK1-9971)
Online Access: https://doi.org/10.1109/ACCESS.2018.2872060
https://ieeexplore.ieee.org/document/8471179/
https://doaj.org/article/940663e0a8f74dbfbb1aa673ff64b3a5