Voice Conversion Using a Perceptual Criterion

Bibliographic Details
Published in: Applied Sciences
Main Author: Ki-Seung Lee
Format: Article in Journal/Newspaper
Language: English
Published: MDPI AG 2020
Subjects: Technology (T)
Online Access: https://doi.org/10.3390/app10082884
https://doaj.org/article/fb1f585af195451bbe205f19aedbe8d6
Journal Issue: Vol. 10, No. 8, Article 2884
ISSN: 2076-3417
DOI: 10.3390/app10082884
Publisher Page: https://www.mdpi.com/2076-3417/10/8/2884
Keywords: voice conversion; joint conversion; perceptual distance measure

Abstract

In voice conversion (VC), it is highly desirable to obtain transformed speech signals that are perceptually close to a target speaker’s voice. To this end, the proposed VC scheme adopted a perceptually meaningful criterion that took the human auditory system into account when measuring the distance between the converted and target voices. The conversion rules for the spectral-envelope features and the pitch modification factor were constructed jointly so as to minimize this perceptual distance. The minimization problem was solved within a deep neural network (DNN) framework, in which the input features were derived from the source speech signals and the target features from time-aligned versions of the target speech signals. Validation tests were carried out on the CMU ARCTIC database to evaluate the effectiveness of the proposed method, especially in terms of perceptual quality. The experimental results showed that the proposed method yielded perceptually preferred results compared with independent conversion under the conventional mean-square error (MSE) criterion. The maximum improvement in perceptual evaluation of speech quality (PESQ) score over the conventional VC method was 0.312.
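The abstract says the DNN's training targets come from time-aligned target speech, but it does not name the alignment method. The following is a minimal sketch, assuming the common choice of dynamic time warping (DTW) over per-frame feature vectors with a Euclidean frame distance. `dtw_path`, `aligned_pairs`, and `frame_distance` are hypothetical names. The paper's perceptual criterion and its DNN are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one frame of spectral-envelope features (assumed)


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames (an assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(source: List[Frame], target: List[Frame]) -> List[Tuple[int, int]]:
    """Minimum-cost monotonic alignment between two frame sequences (hypothetical helper)."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: best accumulated cost of aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            # Allowed steps: diagonal match, or repeat a source/target frame
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the final cell, preferring the diagonal on ties
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        predecessors = {
            (i - 1, j - 1): cost[i - 1][j - 1],
            (i - 1, j): cost[i - 1][j],
            (i, j - 1): cost[i][j - 1],
        }
        i, j = min(predecessors, key=predecessors.get)
    path.reverse()
    return path


def aligned_pairs(source: List[Frame], target: List[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair source frames (network inputs) with DTW-matched target frames (training targets)."""
    return [(source[i], target[j]) for i, j in dtw_path(source, target)]


if __name__ == "__main__":
    src = [[0.0], [1.0], [2.0]]
    tgt = [[0.0], [0.0], [1.0], [2.0]]  # same contour, one extra initial frame
    print(dtw_path(src, tgt))  # → [(0, 0), (0, 1), (1, 2), (2, 3)]
```

The full cost table is quadratic in the number of frames. That is acceptable for utterance-length sequences, but longer inputs are usually restricted to a band around the diagonal.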