Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion

Bibliographic Details
Main Authors: Zhi-zheng Wu, Tomi Kinnunen, Eng Siong Chng, Haizhou Li
Other Authors: The Pennsylvania State University CiteSeerX Archives
Format: Text
Language: English
Subjects: GMM
Online Access: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.6294
http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/IS2010_ProsodyConversion.pdf
Description
Summary: In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of the frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 into Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both the mean/variance normalization and the baseline GMM conversion.
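
For orientation, the frame-level mean and variance normalization that the summary names as the typical baseline can be sketched as below. This is a minimal illustration, not the authors' implementation and not the proposed tri-frame/GMM method. The function names (voiced_stats, convert_f0), the use of the log-F0 domain, and the convention that an F0 value of 0 marks an unvoiced frame are all assumptions made for the example; the record itself does not specify them.

```python
import math
from statistics import fmean, pstdev


def voiced_stats(f0, log_domain=True):
    """Mean and standard deviation of F0 over voiced frames only.

    Assumption: frames with F0 <= 0 are unvoiced and are skipped.
    """
    voiced = [math.log(v) if log_domain else v for v in f0 if v > 0]
    return fmean(voiced), pstdev(voiced)


def convert_f0(src_f0, src_stats, tgt_stats, log_domain=True):
    """Map each voiced source frame so that its statistics match the target.

    y = (x - mu_src) / sigma_src * sigma_tgt + mu_tgt, applied frame by frame.
    No alignment between source and target utterances is needed, which is
    why this baseline works with non-parallel data.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    converted = []
    for v in src_f0:
        if v <= 0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
            continue
        x = math.log(v) if log_domain else v
        y = (x - mu_s) / sd_s * sd_t + mu_t
        converted.append(math.exp(y) if log_domain else y)
    return converted
```

In use, the statistics for each speaker would be estimated once from that speaker's own training utterances with voiced_stats and then passed to convert_f0 for every source utterance. Because each frame is mapped independently, the shape of the pitch contour is only shifted and scaled, which is the limitation the summary's proposed method aims to address.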