CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significan...

Bibliographic Details
Main Authors:	Zheng, Nianzu; Deng, Liqun; Huang, Wenyong; Yeung, Yu Ting; Xu, Baohua; Guo, Yuanyuan; Wang, Yasheng; Chen, Xiao; Jiang, Xin; Liu, Qun
Format:	Text
Language:	unknown
Published:	2021
Subjects:	Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing; Arctic
Online Access:	http://arxiv.org/abs/2111.08191

id	ftarxivpreprints:oai:arXiv.org:2111.08191
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2111.08191 2023-09-05T13:17:27+02:00 CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis Zheng, Nianzu Deng, Liqun Huang, Wenyong Yeung, Yu Ting Xu, Baohua Guo, Yuanyuan Wang, Yasheng Chen, Xiao Jiang, Xin Liu, Qun 2021-11-15 http://arxiv.org/abs/2111.08191 unknown http://arxiv.org/abs/2111.08191 Computer Science - Computation and Language Computer Science - Sound Electrical Engineering and Systems Science - Audio and Speech Processing text 2021 ftarxivpreprints 2023-08-16T16:47:26Z Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant time latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We utilize a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on publicly available corpora. CoCA-MDD achieves F1 scores of 57.03% and 60.78% for streaming and fusion modes, respectively, on L2-ARCTIC. For phone-level pronunciation scoring, CoCA-MDD achieves a Pearson correlation coefficient (PCC) of 0.58 on SpeechOcean762. Comment: 5 pages, 4 figures, Accepted by INTERSPEECH 2022 Text Arctic ArXiv.org (Cornell University Library) Arctic
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
spellingShingle	Computer Science - Computation and Language Computer Science - Sound Electrical Engineering and Systems Science - Audio and Speech Processing Zheng, Nianzu Deng, Liqun Huang, Wenyong Yeung, Yu Ting Xu, Baohua Guo, Yuanyuan Wang, Yasheng Chen, Xiao Jiang, Xin Liu, Qun CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
topic_facet	Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
description	Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant time latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We utilize a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on publicly available corpora. CoCA-MDD achieves F1 scores of 57.03% and 60.78% for streaming and fusion modes, respectively, on L2-ARCTIC. For phone-level pronunciation scoring, CoCA-MDD achieves a Pearson correlation coefficient (PCC) of 0.58 on SpeechOcean762. Comment: 5 pages, 4 figures, Accepted by INTERSPEECH 2022
format	Text
author	Zheng, Nianzu; Deng, Liqun; Huang, Wenyong; Yeung, Yu Ting; Xu, Baohua; Guo, Yuanyuan; Wang, Yasheng; Chen, Xiao; Jiang, Xin; Liu, Qun
author_facet	Zheng, Nianzu; Deng, Liqun; Huang, Wenyong; Yeung, Yu Ting; Xu, Baohua; Guo, Yuanyuan; Wang, Yasheng; Chen, Xiao; Jiang, Xin; Liu, Qun
author_sort	Zheng, Nianzu
title	CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
title_short	CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
title_full	CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
title_fullStr	CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
title_full_unstemmed	CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
title_sort	coca-mdd: a coupled cross-attention based framework for streaming mispronunciation detection and diagnosis
publishDate	2021
url	http://arxiv.org/abs/2111.08191
geographic	Arctic
geographic_facet	Arctic
genre	Arctic
genre_facet	Arctic
op_relation	http://arxiv.org/abs/2111.08191
_version_	1776198619698823168
