CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significan...
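Main Authors: | Zheng, Nianzu; Deng, Liqun; Huang, Wenyong; Yeung, Yu Ting; Xu, Baohua; Guo, Yuanyuan; Wang, Yasheng; Chen, Xiao; Jiang, Xin; Liu, Qun |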
---|---|
Format: | Text |
Language: | unknown |
Published: | 2021 |
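Subjects: | Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing |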
Online Access: | http://arxiv.org/abs/2111.08191 |
id |
ftarxivpreprints:oai:arXiv.org:2111.08191 |
---|---|
record_format |
openpolar |
spelling |
ftarxivpreprints:oai:arXiv.org:2111.08191 2023-09-05T13:17:27+02:00 2021-11-15 2023-08-16T16:47:26Z |
institution |
Open Polar |
collection |
ArXiv.org (Cornell University Library) |
op_collection_id |
ftarxivpreprints |
language |
unknown |
topic |
Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing |
description |
Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We use a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and the mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on publicly available corpora. CoCA-MDD achieves F1 scores of 57.03% and 60.78% for streaming and fusion modes, respectively, on L2-ARCTIC. For phone-level pronunciation scoring, CoCA-MDD achieves a Pearson correlation coefficient (PCC) of 0.58 on SpeechOcean762. Comment: 5 pages, 4 figures, Accepted by INTERSPEECH 2022 |
format |
Text |
author |
Zheng, Nianzu; Deng, Liqun; Huang, Wenyong; Yeung, Yu Ting; Xu, Baohua; Guo, Yuanyuan; Wang, Yasheng; Chen, Xiao; Jiang, Xin; Liu, Qun |
title |
CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis |
publishDate |
2021 |
url |
http://arxiv.org/abs/2111.08191 |
op_relation |
http://arxiv.org/abs/2111.08191 |
_version_ |
1776198619698823168 |
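
The description above says that frame-level acoustic features are integrated with encoded reference linguistic features through cross-attention. As a rough illustration only, the sketch below shows plain scaled dot-product cross-attention, with acoustic frames as queries and reference phone embeddings as keys and values. It is not the paper's coupled (CoCA) mechanism, whose details this record does not give. The function names, the toy vectors, and the final concatenation step are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (not the paper's CoCA).

    queries: one vector per acoustic frame
    keys, values: one vector per reference phone
    Returns (contexts, weights), each with one row per frame.
    """
    scale = math.sqrt(len(keys[0]))
    contexts, weights = [], []
    for q in queries:
        # Similarity of this frame to every reference phone.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Weighted sum of the phone value vectors.
        ctx = [
            sum(wi * v[d] for wi, v in zip(w, values))
            for d in range(len(values[0]))
        ]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


# Hypothetical toy data: 3 acoustic frames, 2 reference phone embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
phones = [[1.0, 0.0], [0.0, 1.0]]

contexts, weights = cross_attention(frames, phones, phones)

# One simple way to combine the two streams: concatenate each frame
# with its attended linguistic context (an assumption for this sketch).
fused = [f + c for f, c in zip(frames, contexts)]
```

Each row of `weights` is a distribution over the reference phones for one frame, so a frame that resembles a given phone embedding puts more weight on it. Because the attention for a frame depends only on that frame and the reference text, a computation of this shape can run frame by frame as audio arrives.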