Gender domain adaptation for automatic speech recognition task

This paper focuses on fine-tuning acoustic models to adapt to speakers of a given gender. We pretrained a Transformer baseline model on Librispeech-960 and conducted fine-tuning experiments on the gender-specific test subsets. In general, we do not obtain essential WER reduc...

Bibliographic Details
Main Authors: Artem, Sokolov; Savchenko, Andrey V.
Format: Article in Journal/Newspaper
Language: unknown
Published: arXiv 2020
Subjects:
Online Access: https://dx.doi.org/10.48550/arxiv.2010.04224
https://arxiv.org/abs/2010.04224
id ftdatacite:10.48550/arxiv.2010.04224
record_format openpolar
spelling ftdatacite:10.48550/arxiv.2010.04224 2023-05-15T14:57:16+02:00 Gender domain adaptation for automatic speech recognition task Artem, Sokolov Savchenko, Andrey V. 2020 https://dx.doi.org/10.48550/arxiv.2010.04224 https://arxiv.org/abs/2010.04224 unknown arXiv arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ Audio and Speech Processing eess.AS Sound cs.SD FOS Electrical engineering, electronic engineering, information engineering FOS Computer and information sciences Article CreativeWork article Preprint 2020 ftdatacite https://doi.org/10.48550/arxiv.2010.04224 2022-03-10T15:15:57Z This paper focuses on fine-tuning acoustic models to adapt to speakers of a given gender. We pretrained a Transformer baseline model on Librispeech-960 and conducted fine-tuning experiments on the gender-specific test subsets. In general, fine-tuning with this approach does not yield a substantial WER reduction. We achieved up to ~5% lower word error rate on the male subset and 3% on the female subset when the encoder and decoder layers are not frozen and tuning starts from the last checkpoints. Moreover, we adapted our base model on the full L2 Arctic dataset of accented speech and fine-tuned it for particular speakers and for male and female genders separately. The models trained on the gender subsets obtained 1-2% higher accuracy than the model tuned on the whole L2 Arctic dataset. Finally, we tested concatenating pretrained x-vector voice embeddings with embeddings from a conventional encoder, but the gain in accuracy is not significant. : Draft of paper for SAMI conference Article in Journal/Newspaper Arctic sami DataCite Metadata Store (German National Library of Science and Technology) Arctic
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic Audio and Speech Processing eess.AS
Sound cs.SD
FOS Electrical engineering, electronic engineering, information engineering
FOS Computer and information sciences
spellingShingle Audio and Speech Processing eess.AS
Sound cs.SD
FOS Electrical engineering, electronic engineering, information engineering
FOS Computer and information sciences
Artem, Sokolov
Savchenko, Andrey V.
Gender domain adaptation for automatic speech recognition task
topic_facet Audio and Speech Processing eess.AS
Sound cs.SD
FOS Electrical engineering, electronic engineering, information engineering
FOS Computer and information sciences
description This paper focuses on fine-tuning acoustic models to adapt to speakers of a given gender. We pretrained a Transformer baseline model on Librispeech-960 and conducted fine-tuning experiments on the gender-specific test subsets. In general, fine-tuning with this approach does not yield a substantial WER reduction. We achieved up to ~5% lower word error rate on the male subset and 3% on the female subset when the encoder and decoder layers are not frozen and tuning starts from the last checkpoints. Moreover, we adapted our base model on the full L2 Arctic dataset of accented speech and fine-tuned it for particular speakers and for male and female genders separately. The models trained on the gender subsets obtained 1-2% higher accuracy than the model tuned on the whole L2 Arctic dataset. Finally, we tested concatenating pretrained x-vector voice embeddings with embeddings from a conventional encoder, but the gain in accuracy is not significant. : Draft of paper for SAMI conference
format Article in Journal/Newspaper
author Artem, Sokolov
Savchenko, Andrey V.
author_facet Artem, Sokolov
Savchenko, Andrey V.
author_sort Artem, Sokolov
title Gender domain adaptation for automatic speech recognition task
title_short Gender domain adaptation for automatic speech recognition task
title_full Gender domain adaptation for automatic speech recognition task
title_fullStr Gender domain adaptation for automatic speech recognition task
title_full_unstemmed Gender domain adaptation for automatic speech recognition task
title_sort gender domain adaptation for automatic speech recognition task
publisher arXiv
publishDate 2020
url https://dx.doi.org/10.48550/arxiv.2010.04224
https://arxiv.org/abs/2010.04224
geographic Arctic
geographic_facet Arctic
genre Arctic
sami
genre_facet Arctic
sami
op_rights arXiv.org perpetual, non-exclusive license
http://arxiv.org/licenses/nonexclusive-distrib/1.0/
op_doi https://doi.org/10.48550/arxiv.2010.04224
_version_ 1766329346350383104