End to end raw audio deep learning of transients, application to bioacoustics

Bibliographic Details
Main Authors: Ferrari, Maxence, Glotin, Hervé, Marxer, Ricard
Other Authors: Laboratoire Amiénois de Mathématique Fondamentale et Appliquée - UMR CNRS 7352 (LAMFA), Université de Picardie Jules Verne (UPJV)-Centre National de la Recherche Scientifique (CNRS), DYNamiques de l’Information (DYNI), Laboratoire d'Informatique et Systèmes (LIS), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)
Format: Conference Object
Language: English
Published: HAL CCSD 2020
Online Access: https://hal.archives-ouvertes.fr/hal-03230842
https://hal.archives-ouvertes.fr/hal-03230842/document
https://hal.archives-ouvertes.fr/hal-03230842/file/001096.pdf
https://doi.org/10.48465/fa.2020.1096
Description
Summary: In this paper, we propose deep learning of clicks directly from raw audio, building specific high-dimensional convolution filters to construct a complex time-frequency (TF) representation. The CNN has 12 layers, takes several thousand audio bins as input, and outputs about a dozen classes. We test this model on the international DCLDE challenge, which comprises 3 TB of clicks (http://sabiod.org/DCLDE). The challenge opened in 2018, but no team had submitted results before this work. To our knowledge, our model is the first raw audio click classifier, with nearly 70% accuracy on about a dozen classes. We discuss the model's class confusions and possible improvements using data augmentation and regularization.
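
The kind of model the summary describes can be illustrated with a minimal pure-Python sketch: a stack of 1D convolutions applied directly to a raw waveform, followed by a classifier over a dozen classes. Everything specific here is an assumption for illustration and is not taken from the paper: the split into 11 convolutional layers plus 1 dense layer, the kernel size, strides, channel count, the 4096-sample input, the random untrained weights, and the helper names (`conv1d`, `build_model`, `forward`, `synthetic_click`).

```python
import math
import random


def conv1d(x, weights, biases, stride):
    """Valid 1D convolution followed by ReLU.

    x: list of input channels, each a list of samples.
    weights: [c_out][c_in][kernel] nested lists; biases: [c_out].
    """
    k = len(weights[0][0])
    n_out = (len(x[0]) - k) // stride + 1
    out = []
    for w_out, b in zip(weights, biases):
        row = []
        for t in range(n_out):
            start = t * stride
            acc = b
            for c_in, w in enumerate(w_out):
                seg = x[c_in]
                for j in range(k):
                    acc += w[j] * seg[start + j]
            row.append(max(acc, 0.0))  # ReLU non-linearity
        out.append(row)
    return out


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def build_model(rng, n_conv=11, channels=4, kernel=5, n_classes=12):
    """Random, untrained weights: 11 conv layers + 1 dense layer (assumed split)."""
    convs = []
    c_in = 1  # raw audio enters as a single channel
    for i in range(n_conv):
        stride = 2 if i < 6 else 1  # assumed: downsample in the first 6 layers
        scale = math.sqrt(2.0 / (c_in * kernel))
        w = [[[rng.gauss(0.0, scale) for _ in range(kernel)]
              for _ in range(c_in)] for _ in range(channels)]
        convs.append((w, [0.0] * channels, stride))
        c_in = channels
    dense = [[rng.gauss(0.0, 0.1) for _ in range(channels)]
             for _ in range(n_classes)]
    return convs, dense


def forward(model, waveform):
    """Map a raw waveform to class probabilities."""
    convs, dense = model
    x = [list(waveform)]
    for w, b, stride in convs:
        x = conv1d(x, w, b, stride)
    pooled = [sum(row) / len(row) for row in x]  # global average pooling over time
    logits = [sum(wi * p for wi, p in zip(w_row, pooled)) for w_row in dense]
    return softmax(logits)


def synthetic_click(n=4096, onset=2000, freq=0.2, decay=0.02):
    """Toy transient: a damped sinusoid starting at `onset` (not real data)."""
    return [
        math.exp(-decay * (t - onset)) * math.sin(2 * math.pi * freq * (t - onset))
        if t >= onset else 0.0
        for t in range(n)
    ]


rng = random.Random(0)
model = build_model(rng)
probs = forward(model, synthetic_click())
print(len(probs), "class probabilities, argmax =", probs.index(max(probs)))
```

Because the weights are untrained, the predicted class is meaningless; the sketch only shows how a raw waveform of several thousand samples can be reduced by successive convolutions to one probability per class, without a precomputed spectrogram.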