Gender domain adaptation for automatic speech recognition task

This paper focuses on fine-tuning acoustic models for speaker adaptation to a given gender. We pretrained a Transformer baseline model on Librispeech-960 and conducted fine-tuning experiments on the gender-specific test subsets. In general, we do not obtain a substantial WER reduc...


Bibliographic Details
Main Authors:	Sokolov, Artem; Savchenko, Andrey V.
Format:	Article in Journal/Newspaper
Language:	unknown
Published:	arXiv 2020
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD); FOS Electrical engineering, electronic engineering, information engineering; FOS Computer and information sciences; Arctic; sami
Online Access:	https://dx.doi.org/10.48550/arxiv.2010.04224 https://arxiv.org/abs/2010.04224

id	ftdatacite:10.48550/arxiv.2010.04224
record_format	openpolar
spelling	ftdatacite:10.48550/arxiv.2010.04224 2023-05-15T14:57:16+02:00 Gender domain adaptation for automatic speech recognition task Sokolov, Artem Savchenko, Andrey V. 2020 https://dx.doi.org/10.48550/arxiv.2010.04224 https://arxiv.org/abs/2010.04224 unknown arXiv arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ Audio and Speech Processing eess.AS Sound cs.SD FOS Electrical engineering, electronic engineering, information engineering FOS Computer and information sciences Article CreativeWork article Preprint 2020 ftdatacite https://doi.org/10.48550/arxiv.2010.04224 2022-03-10T15:15:57Z This paper focuses on fine-tuning acoustic models for speaker adaptation to a given gender. We pretrained a Transformer baseline model on Librispeech-960 and conducted fine-tuning experiments on the gender-specific test subsets. In general, this approach does not yield a substantial WER reduction. We achieved up to ~5% lower word error rate on the male subset and 3% on the female subset when the layers in the encoder and decoder are not frozen and the tuning is started from the last checkpoints. Moreover, we adapted our base model on the full L2 Arctic dataset of accented speech and fine-tuned it for particular speakers and for male and female genders separately. The models trained on the gender subsets obtained 1-2% higher accuracy than the model tuned on the whole L2 Arctic dataset. Finally, we tested concatenating pretrained x-vector voice embeddings with embeddings from a conventional encoder, but the gain in accuracy is not significant. : Draft of paper for SAMI conference Article in Journal/Newspaper Arctic sami DataCite Metadata Store (German National Library of Science and Technology) Arctic
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	unknown
topic	Audio and Speech Processing eess.AS Sound cs.SD FOS Electrical engineering, electronic engineering, information engineering FOS Computer and information sciences
spellingShingle	Audio and Speech Processing eess.AS Sound cs.SD FOS Electrical engineering, electronic engineering, information engineering FOS Computer and information sciences Sokolov, Artem Savchenko, Andrey V. Gender domain adaptation for automatic speech recognition task
topic_facet	Audio and Speech Processing eess.AS Sound cs.SD FOS Electrical engineering, electronic engineering, information engineering FOS Computer and information sciences
description	This paper focuses on fine-tuning acoustic models for speaker adaptation to a given gender. We pretrained a Transformer baseline model on Librispeech-960 and conducted fine-tuning experiments on the gender-specific test subsets. In general, this approach does not yield a substantial WER reduction. We achieved up to ~5% lower word error rate on the male subset and 3% on the female subset when the layers in the encoder and decoder are not frozen and the tuning is started from the last checkpoints. Moreover, we adapted our base model on the full L2 Arctic dataset of accented speech and fine-tuned it for particular speakers and for male and female genders separately. The models trained on the gender subsets obtained 1-2% higher accuracy than the model tuned on the whole L2 Arctic dataset. Finally, we tested concatenating pretrained x-vector voice embeddings with embeddings from a conventional encoder, but the gain in accuracy is not significant. : Draft of paper for SAMI conference
format	Article in Journal/Newspaper
author	Sokolov, Artem Savchenko, Andrey V.
author_facet	Sokolov, Artem Savchenko, Andrey V.
author_sort	Sokolov, Artem
title	Gender domain adaptation for automatic speech recognition task
title_short	Gender domain adaptation for automatic speech recognition task
title_full	Gender domain adaptation for automatic speech recognition task
title_fullStr	Gender domain adaptation for automatic speech recognition task
title_full_unstemmed	Gender domain adaptation for automatic speech recognition task
title_sort	gender domain adaptation for automatic speech recognition task
publisher	arXiv
publishDate	2020
url	https://dx.doi.org/10.48550/arxiv.2010.04224 https://arxiv.org/abs/2010.04224
geographic	Arctic
geographic_facet	Arctic
genre	Arctic sami
genre_facet	Arctic sami
op_rights	arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/
op_doi	https://doi.org/10.48550/arxiv.2010.04224
_version_	1766329346350383104


Similar Items