Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification

Bibliographic Details
Main Authors:	Selvakumar, Anith; Fashandi, Homa
Format:	Text
Language:	unknown
Published:	2023
Subjects:	Computer Science - Sound; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Computer Science - Multimedia; Electrical Engineering and Systems Science - Audio and Speech Processing; DML
Online Access:	http://arxiv.org/abs/2309.07115

_version_	1821499230154391552
author	Selvakumar, Anith; Fashandi, Homa
author_facet	Selvakumar, Anith; Fashandi, Homa
author_sort	Selvakumar, Anith
collection	ArXiv.org (Cornell University Library)
description	In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to their strong performance on new and unseen classes. In our work, we explore multitask learning techniques to further boost the performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized End-to-End (GE2E) loss to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a random non-synchronous audio-visual sampling strategy during training that has been shown to improve generalization. Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, respectively, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.
format	Text
genre	DML
genre_facet	DML
id	ftarxivpreprints:oai:arXiv.org:2309.07115
institution	Open Polar
language	unknown
op_collection_id	ftarxivpreprints
op_relation	http://arxiv.org/abs/2309.07115
publishDate	2023
record_format	openpolar
topic	Computer Science - Sound; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Computer Science - Multimedia; Electrical Engineering and Systems Science - Audio and Speech Processing
topic_facet	Computer Science - Sound; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Computer Science - Multimedia; Electrical Engineering and Systems Science - Audio and Speech Processing
url	http://arxiv.org/abs/2309.07115

Similar Items