Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to strong performance on new and unseen classes. In ou...
Main Authors: | Selvakumar, Anith; Fashandi, Homa |
---|---|
Format: | Text |
Language: | unknown |
Published: | 2023 |
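Subjects: | Computer Science - Sound; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Computer Science - Multimedia; Electrical Engineering and Systems Science - Audio and Speech Processing |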
Online Access: | http://arxiv.org/abs/2309.07115 |
id | ftarxivpreprints:oai:arXiv.org:2309.07115 |
---|---|
record_format | openpolar |
spelling | ftarxivpreprints:oai:arXiv.org:2309.07115 2023-10-09T21:51:03+02:00 Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification Selvakumar, Anith Fashandi, Homa 2023-09-13 http://arxiv.org/abs/2309.07115 unknown http://arxiv.org/abs/2309.07115 Computer Science - Sound Computer Science - Computer Vision and Pattern Recognition Computer Science - Machine Learning Computer Science - Multimedia Electrical Engineering and Systems Science - Audio and Speech Processing text 2023 ftarxivpreprints 2023-09-17T01:06:44Z In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to strong performance on new and unseen classes. In our work, we explore multitask learning techniques to further boost performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized End-to-End (GE2E) loss to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a random non-synchronous audio-visual sampling strategy during training that has been shown to improve generalization. Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H. Text DML ArXiv.org (Cornell University Library) |
institution | Open Polar |
collection | ArXiv.org (Cornell University Library) |
op_collection_id | ftarxivpreprints |
language | unknown |
topic | Computer Science - Sound Computer Science - Computer Vision and Pattern Recognition Computer Science - Machine Learning Computer Science - Multimedia Electrical Engineering and Systems Science - Audio and Speech Processing |
spellingShingle | Computer Science - Sound Computer Science - Computer Vision and Pattern Recognition Computer Science - Machine Learning Computer Science - Multimedia Electrical Engineering and Systems Science - Audio and Speech Processing Selvakumar, Anith Fashandi, Homa Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification |
topic_facet | Computer Science - Sound Computer Science - Computer Vision and Pattern Recognition Computer Science - Machine Learning Computer Science - Multimedia Electrical Engineering and Systems Science - Audio and Speech Processing |
description | In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to strong performance on new and unseen classes. In our work, we explore multitask learning techniques to further boost performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized End-to-End (GE2E) loss to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a random non-synchronous audio-visual sampling strategy during training that has been shown to improve generalization. Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H. |
format | Text |
author | Selvakumar, Anith Fashandi, Homa |
author_facet | Selvakumar, Anith Fashandi, Homa |
author_sort | Selvakumar, Anith |
title | Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification |
title_short | Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification |
title_full | Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification |
title_fullStr | Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification |
title_full_unstemmed | Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification |
title_sort | weakly-supervised multi-task learning for audio-visual speaker verification |
publishDate | 2023 |
url | http://arxiv.org/abs/2309.07115 |
genre | DML |
genre_facet | DML |
op_relation | http://arxiv.org/abs/2309.07115 |
_version_ | 1779314142764597248 |
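
The description in this record reports results as Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of that metric only, and not the authors' evaluation code, the sketch below estimates EER from two lists of verification scores. The function name `equal_error_rate`, the "accept when score >= threshold" convention, and the toy scores are all assumptions made for this example; they are not taken from the paper or from VoxCeleb.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping every observed score as a threshold.

    genuine:  scores for same-speaker trials (higher means more similar)
    impostor: scores for different-speaker trials
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Assumed convention: a trial is accepted when its score is >= t.
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one genuine trial scores below one impostor trial.
toy_genuine = [0.9, 0.8, 0.7, 0.4]
toy_impostor = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_genuine, toy_impostor)
print(f"EER = {toy_eer:.2%}")
```

With only four trials per class the estimate is coarse. Evaluations on real trial lists typically use far more trials and interpolate between thresholds, so this sketch should be read as a definition of the metric rather than a reproduction of the reported numbers.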