Multi-View Multi-Task Representation Learning for Mispronunciation Detection

Bibliographic Details
Main Authors:	Kheir, Yassine El; Chowdhury, Shammur Absar; Ali, Ahmed
Format:	Text
Language:	unknown
Published:	2023
Subjects:	Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing; Arctic
Online Access:	http://arxiv.org/abs/2306.01845

The disparity in phonology between a learner's native (L1) and target (L2) languages poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple 'views' of the same input data, assisted by auxiliary tasks, to learn more distinctive phonetic representations in a low-resource setting. Using monolingual and multilingual encoders, the model learns multiple views of the input and captures sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our results on the L2-ARCTIC data outperform the SOTA models, with phoneme error rate reductions of 11.13% and 8.60% and absolute F1 score increases of 5.89% and 2.49% compared with the single-view monolingual and multilingual systems, using a limited L2 dataset.

Comment: 5 pages, accepted at SLaTE23
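The abstract reports its gains as phoneme error rate (PER) reductions and absolute F1 increases. As a rough illustration of how these two metrics are commonly computed in MDD evaluations, here is a minimal Python sketch. It is not the authors' scoring code and rests on stated assumptions: PER is taken as corpus-level Levenshtein distance between recognized phones and the annotated (perceived) phones, and F1 follows the widely used true/false acceptance/rejection scheme, assuming the canonical, perceived, and recognized phone sequences are already aligned one-to-one. All function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit-cost sub/del/ins)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER: total edit operations over total reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total if total else 0.0


@dataclass
class MDDCounts:
    ta: int = 0  # true acceptance: correctly pronounced, recognized as canonical
    fr: int = 0  # false rejection: correctly pronounced, recognized as something else
    fa: int = 0  # false acceptance: mispronounced, recognized as canonical
    tr: int = 0  # true rejection: mispronounced, recognized as something else


def count_mdd(canonical: Sequence[str], perceived: Sequence[str],
              recognized: Sequence[str]) -> MDDCounts:
    """Tally MDD outcomes. ASSUMPTION: the three sequences are pre-aligned position by position."""
    if not (len(canonical) == len(perceived) == len(recognized)):
        raise ValueError("sequences must be pre-aligned to equal length")
    counts = MDDCounts()
    for can, per, rec in zip(canonical, perceived, recognized):
        mispronounced = per != can  # annotator heard something other than the canonical phone
        rejected = rec != can       # system output disagrees with the canonical phone
        if mispronounced and rejected:
            counts.tr += 1
        elif mispronounced:
            counts.fa += 1
        elif rejected:
            counts.fr += 1
        else:
            counts.ta += 1
    return counts


def precision_recall_f1(c: MDDCounts) -> Tuple[float, float, float]:
    """Detection precision, recall, and F1 computed from the rejection counts."""
    precision = c.tr / (c.tr + c.fr) if c.tr + c.fr else 0.0
    recall = c.tr / (c.tr + c.fa) if c.tr + c.fa else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy ARPAbet-style example (made-up data, not taken from L2-ARCTIC).
    print(phoneme_error_rate([["k", "ae", "t"]], [["k", "ah", "t"]]))
    counts = count_mdd(["k", "ae", "t", "s"], ["k", "ah", "t", "s"], ["k", "ah", "d", "s"])
    print(counts, precision_recall_f1(counts))
```

The one-to-one alignment assumption keeps the sketch short; real evaluations must first align sequences that can differ in length, which this sketch does not do.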
