CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significan...

Bibliographic Details
Main Authors: Zheng, Nianzu, Deng, Liqun, Huang, Wenyong, Yeung, Yu Ting, Xu, Baohua, Guo, Yuanyuan, Wang, Yasheng, Chen, Xiao, Jiang, Xin, Liu, Qun
Format: Text
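Language: unknown
Published: 2021
Subjects: Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
Online Access: http://arxiv.org/abs/2111.08191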
id ftarxivpreprints:oai:arXiv.org:2111.08191
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2111.08191 2023-09-05T13:17:27+02:00 2021-11-15 http://arxiv.org/abs/2111.08191 unknown text 2021 ftarxivpreprints 2023-08-16T16:47:26Z
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Computation and Language
Computer Science - Sound
Electrical Engineering and Systems Science - Audio and Speech Processing
description Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We use a conv-transformer structure to encode input speech in a streaming manner. A coupled cross-attention (CoCA) mechanism is proposed to integrate frame-level acoustic features with encoded reference linguistic features. CoCA also enables our model to perform mispronunciation classification with whole utterances. The proposed model allows system fusion between the streaming output and the mispronunciation classification output for further performance enhancement. We evaluate CoCA-MDD on publicly available corpora. CoCA-MDD achieves F1 scores of 57.03% and 60.78% for streaming and fusion modes, respectively, on L2-ARCTIC. For phone-level pronunciation scoring, CoCA-MDD achieves a Pearson correlation coefficient (PCC) of 0.58 on SpeechOcean762. Comment: 5 pages, 4 figures. Accepted by INTERSPEECH 2022.
format Text
author Zheng, Nianzu
Deng, Liqun
Huang, Wenyong
Yeung, Yu Ting
Xu, Baohua
Guo, Yuanyuan
Wang, Yasheng
Chen, Xiao
Jiang, Xin
Liu, Qun
author_sort Zheng, Nianzu
title CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
title_sort coca-mdd: a coupled cross-attention based framework for streaming mispronunciation detection and diagnosis
publishDate 2021
url http://arxiv.org/abs/2111.08191
op_relation http://arxiv.org/abs/2111.08191
_version_ 1776198619698823168
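
Illustrative note on the description field: the abstract says that cross-attention integrates frame-level acoustic features with encoded reference linguistic features. The record gives no architectural detail, so the following is only a minimal sketch of plain single-head scaled dot-product cross-attention, not the authors' coupled (CoCA) mechanism. It assumes acoustic frames act as queries and reference phone embeddings act as keys and values; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(frames, phones):
    """Attend from each acoustic frame (query) to every reference phone.

    The phone embeddings serve as both keys and values, which is an
    assumption made for brevity; a real model would use learned projections.
    Returns (weights, contexts), one row per frame.
    """
    dim = len(phones[0])
    scale = 1.0 / math.sqrt(dim)
    weights, contexts = [], []
    for frame in frames:
        # Scaled dot product between this frame and each phone embedding.
        scores = [scale * sum(f * p for f, p in zip(frame, phone))
                  for phone in phones]
        w = softmax(scores)
        # Weighted sum of phone embeddings: the linguistic context for this frame.
        context = [sum(w[j] * phones[j][d] for j in range(len(phones)))
                   for d in range(dim)]
        weights.append(w)
        contexts.append(context)
    return weights, contexts


# Toy data (hypothetical): 3 acoustic frames, 2 reference phones, dimension 2.
frames = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
phones = [[1.0, 0.0], [0.0, 1.0]]
weights, contexts = cross_attention(frames, phones)
```

In this sketch each frame's output depends only on that frame and on the reference text, which is known before the learner speaks, so the loop body could run as frames arrive. Whether CoCA-MDD achieves streaming in this way is not stated in the record.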