Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification

In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to strong performance on new and unseen classes. In ou...


Bibliographic Details
Main Authors: Selvakumar, Anith, Fashandi, Homa
Format: Text
Language: unknown
Published: 2023
Subjects:
DML
Online Access: http://arxiv.org/abs/2309.07115
id ftarxivpreprints:oai:arXiv.org:2309.07115
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2309.07115 2023-10-09T21:51:03+02:00 Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification Selvakumar, Anith Fashandi, Homa 2023-09-13 http://arxiv.org/abs/2309.07115 unknown http://arxiv.org/abs/2309.07115 Computer Science - Sound Computer Science - Computer Vision and Pattern Recognition Computer Science - Machine Learning Computer Science - Multimedia Electrical Engineering and Systems Science - Audio and Speech Processing text 2023 ftarxivpreprints 2023-09-17T01:06:44Z In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to their strong performance on new and unseen classes. In our work, we explore multitask learning techniques to further boost the performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized End-to-End (GE2E) loss to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a random non-synchronous audio-visual sampling strategy at training time that is shown to improve generalization. Our network achieves state-of-the-art performance for speaker verification, reporting Equal Error Rates (EER) of 0.244%, 0.252%, and 0.441% on the three official trial lists of VoxCeleb1-O/E/H, respectively, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H. Text DML ArXiv.org (Cornell University Library)
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Sound
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Machine Learning
Computer Science - Multimedia
Electrical Engineering and Systems Science - Audio and Speech Processing
spellingShingle Computer Science - Sound
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Machine Learning
Computer Science - Multimedia
Electrical Engineering and Systems Science - Audio and Speech Processing
Selvakumar, Anith
Fashandi, Homa
Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
topic_facet Computer Science - Sound
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Machine Learning
Computer Science - Multimedia
Electrical Engineering and Systems Science - Audio and Speech Processing
description In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to their strong performance on new and unseen classes. In our work, we explore multitask learning techniques to further boost the performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized End-to-End (GE2E) loss to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a random non-synchronous audio-visual sampling strategy at training time that is shown to improve generalization. Our network achieves state-of-the-art performance for speaker verification, reporting Equal Error Rates (EER) of 0.244%, 0.252%, and 0.441% on the three official trial lists of VoxCeleb1-O/E/H, respectively, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.
format Text
author Selvakumar, Anith
Fashandi, Homa
author_facet Selvakumar, Anith
Fashandi, Homa
author_sort Selvakumar, Anith
title Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
title_short Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
title_full Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
title_fullStr Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
title_full_unstemmed Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
title_sort weakly-supervised multi-task learning for audio-visual speaker verification
publishDate 2023
url http://arxiv.org/abs/2309.07115
genre DML
genre_facet DML
op_relation http://arxiv.org/abs/2309.07115
_version_ 1779314142764597248
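
The description field in this record reports results as Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of how that metric can be computed from verification trial scores, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all assumptions made for this example, and the scores are not taken from the paper.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    target_scores: Sequence[float],
    impostor_scores: Sequence[float],
) -> Tuple[float, float]:
    """Approximate the EER by sweeping every observed score as a threshold.

    Hypothetical helper for illustration only. A trial is accepted when its
    score is >= the threshold. Returns (eer, threshold), where eer is the
    mean of FAR and FRR at the threshold where the two rates are closest.
    """
    if len(target_scores) == 0 or len(impostor_scores) == 0:
        raise ValueError("both score lists must be non-empty")

    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = 0.0

    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: target trials wrongly rejected.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)

        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold

    return best_eer, best_threshold


# Toy similarity scores (assumed values, not results from the paper).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(same_speaker, different_speaker)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores, one of four impostor trials is accepted and one of four target trials is rejected at a threshold of 0.6, so the sketch reports an EER of 25%. Evaluation toolkits typically interpolate between thresholds rather than picking the nearest observed score, so their values can differ slightly from this approximation on real trial lists.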