Text-Conditioned Transformer for Automatic Pronunciation Error Detection

Automatic pronunciation error detection (APED) plays an important role in language learning. In previous ASR-based APED methods, the decoded results must be aligned with the target text so that errors can be located. However, because the decoding and alignment processes are independent, the prior knowledge of the target text is not fully utilized. In this paper, we propose using the target text as an extra condition for the Transformer backbone to handle the APED task. The proposed method outputs the error states in a fully end-to-end fashion, taking the relationship between the input speech and the target text into account. Meanwhile, because the prior target text is used as the condition for the decoder input, the Transformer works in a feed-forward manner rather than autoregressively at inference time, which significantly boosts speed in actual deployment. We set an ASR-based Transformer as the baseline APED model and conduct several experiments on the L2-Arctic dataset. The results demonstrate that our approach obtains an 8.4% relative improvement on the F1 score metric.
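The setup the abstract describes can be sketched roughly as follows. This is an illustrative guess, not the authors' code: the module names, dimensions, and the binary correct/mispronounced head are assumptions. The key idea shown is that the decoder input is the embedded target phoneme sequence, so error states for all phonemes come out in a single feed-forward pass, with no autoregressive decoding loop.

```python
import torch
import torch.nn as nn

class TextConditionedAPED(nn.Module):
    """Sketch of a text-conditioned Transformer for APED (assumed architecture)."""

    def __init__(self, n_phonemes=40, feat_dim=80, d_model=64):
        super().__init__()
        self.speech_proj = nn.Linear(feat_dim, d_model)       # acoustic features -> model dim
        self.phone_embed = nn.Embedding(n_phonemes, d_model)  # target-text condition
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.error_head = nn.Linear(d_model, 2)  # per-phoneme: correct vs. mispronounced

    def forward(self, speech_feats, target_phones):
        # No causal mask and no step-by-step decoding: the full target phoneme
        # sequence conditions the decoder at once, so inference is feed-forward.
        memory_in = self.speech_proj(speech_feats)
        tgt_in = self.phone_embed(target_phones)
        out = self.transformer(memory_in, tgt_in)
        return self.error_head(out)  # (batch, n_target_phones, 2)

model = TextConditionedAPED()
speech = torch.randn(1, 120, 80)        # 120 frames of 80-dim acoustic features
phones = torch.randint(0, 40, (1, 15))  # 15 target phonemes from the prompt text
logits = model(speech, phones)
print(logits.shape)  # torch.Size([1, 15, 2])
```

Because the output length equals the target-text length, each position's logits can be read directly as an error state for that phoneme, which is what removes the separate decode-then-align step of ASR-based APED.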

Bibliographic Details
Main Authors: Zhang, Zhan; Wang, Yuehai; Yang, Jianyi
Format: Article in Journal/Newspaper
Language: English
Published: arXiv, 2020
Subjects: Audio and Speech Processing (eess.AS)
Online Access: https://dx.doi.org/10.48550/arxiv.2008.12424
https://arxiv.org/abs/2008.12424
Published in the Speech Communication journal. Indexed in the DataCite Metadata Store (German National Library of Science and Technology).
Related DOI (journal version): https://doi.org/10.1016/j.specom.2021.04.004
Rights: arXiv.org perpetual, non-exclusive license (http://arxiv.org/licenses/nonexclusive-distrib/1.0/)