Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion
In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syl...
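The frame-level mean and variance normalization named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: working in the log-F0 domain, marking unvoiced frames with 0, and the function names `log_f0_stats` and `convert_f0` are all assumptions made for illustration. Each speaker's statistics are estimated separately, which is why no parallel data is needed.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0_values):
    """Return (mean, std) of log-F0 over voiced frames (F0 > 0).

    Assumption: unvoiced frames are marked with F0 == 0.
    """
    log_f0 = [math.log(f) for f in f0_values if f > 0]
    return mean(log_f0), pstdev(log_f0)


def convert_f0(source_f0, src_stats, tgt_stats):
    """Shift and scale each voiced frame's log-F0 to the target statistics.

    Assumption: normalization is done in the log domain; unvoiced
    frames are passed through as 0.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    converted = []
    for f in source_f0:
        if f <= 0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
        else:
            z = (math.log(f) - src_mean) / src_std
            converted.append(math.exp(z * tgt_std + tgt_mean))
    return converted


# Statistics come from unrelated utterances of each speaker.
src_stats = log_f0_stats([100.0, 120.0, 140.0])
tgt_stats = log_f0_stats([200.0, 240.0, 280.0])
converted = convert_f0([0.0, 100.0, 120.0, 0.0, 140.0], src_stats, tgt_stats)
```

Because every frame is mapped independently, this baseline ignores the shape of the pitch contour, which is the limitation the abstract attributes to frame-level conversion.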
Main Authors: | Zhi-zheng Wu, Tomi Kinnunen, Eng Siong Chng, Haizhou Li |
---|---|
Other Authors: | The Pennsylvania State University CiteSeerX Archives |
Format: | Text |
Language: | English |
Subjects: | Voice conversion, F0 transformation, GMM, histogram |
Online Access: | http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.6294 http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf |
id |
ftciteseerx:oai:CiteSeerX.psu:10.1.1.178.6294 |
---|---|
record_format |
openpolar |
spelling |
ftciteseerx:oai:CiteSeerX.psu:10.1.1.178.6294 2023-05-15T15:02:38+02:00 Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion Zhi-zheng Wu Tomi Kinnunen Eng Siong Chng Haizhou Li The Pennsylvania State University CiteSeerX Archives application/pdf http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.6294 http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf en eng http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.6294 http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf Metadata may be used without restrictions as long as the oai identifier remains attached to it. http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf Index Terms Voice conversion F0 transformation GMM histogram text ftciteseerx 2016-01-07T16:21:01Z In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of the frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 into Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both the mean/variance normalization and the baseline GMM conversion. Text Arctic Unknown Arctic |
institution |
Open Polar |
collection |
Unknown |
op_collection_id |
ftciteseerx |
language |
English |
topic |
Index Terms: Voice conversion, F0 transformation, GMM, histogram |
topic_facet |
Index Terms: Voice conversion, F0 transformation, GMM, histogram |
description |
In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of the frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 into Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both the mean/variance normalization and the baseline GMM conversion. |
author2 |
The Pennsylvania State University CiteSeerX Archives |
format |
Text |
author |
Zhi-zheng Wu Tomi Kinnunen Eng Siong Chng Haizhou Li |
author_facet |
Zhi-zheng Wu Tomi Kinnunen Eng Siong Chng Haizhou Li |
author_sort |
Zhi-zheng Wu |
title |
Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion |
title_short |
Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion |
title_full |
Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion |
title_fullStr |
Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion |
title_full_unstemmed |
Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion |
title_sort |
text-independent f0 transformation with non-parallel data for voice conversion |
url |
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.6294 http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf |
geographic |
Arctic |
geographic_facet |
Arctic |
genre |
Arctic |
genre_facet |
Arctic |
op_source |
http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf |
op_relation |
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.6294 http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf |
op_rights |
Metadata may be used without restrictions as long as the oai identifier remains attached to it. |
_version_ |
1766334558071947264 |
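
The description field above says the method reduces GMM oversmoothing by F0 histogram equalization, but this record does not say how the equalization is carried out. The sketch below shows generic empirical CDF matching as one plausible reading. The function name `equalize_histogram`, the use of 0 for unvoiced frames, and the nearest-rank inverse CDF are assumptions, not the authors' procedure.

```python
from bisect import bisect_right


def equalize_histogram(converted_f0, target_f0):
    """Remap voiced F0 values so their distribution matches the target's.

    Each voiced value is replaced by the target value that has the same
    empirical cumulative probability (nearest-rank inverse CDF).
    Assumption: unvoiced frames are marked with F0 == 0 and left as 0.
    """
    reference = sorted(f for f in converted_f0 if f > 0)
    target = sorted(f for f in target_f0 if f > 0)
    if not reference or not target:
        return list(converted_f0)  # nothing to equalize

    result = []
    for f in converted_f0:
        if f <= 0:
            result.append(0.0)
            continue
        # Number of converted values <= f; at least 1 because f is in reference.
        rank = bisect_right(reference, f)
        # Integer ceiling of rank * len(target) / len(reference), minus 1,
        # gives the index of the target value at the same CDF level.
        index = -(-rank * len(target) // len(reference)) - 1
        result.append(target[index])
    return result


# An oversmoothed contour with a narrow range is stretched to the target range.
equalized = equalize_histogram([0.0, 150.0, 160.0, 170.0, 0.0],
                               [100.0, 200.0, 300.0])
```

The mapping is monotonic, so the ordering of frames by pitch is preserved while the narrow range typical of an oversmoothed output is widened to the spread observed in the target speaker's data.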