A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions

Bibliographic Details
Main Authors:	Morchid, Mohamed, Dufour, Richard, Linarès, Georges
Other Authors:	Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Format:	Conference Object
Language:	English
Published:	HAL CCSD 2014
Subjects:	Speech analytics; Topic identification; Latent Dirichlet Allocation; [INFO] Computer Science [cs]; Iceland
Online Access:	https://hal.archives-ouvertes.fr/hal-01319771

Description
Summary:	Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this paper, we propose to identify the themes of telephone conversation services with the classical Term Frequency-Inverse Document Frequency method using the Gini purity criterion (TF-IDF-Gini) and with a Latent Dirichlet Allocation (LDA) approach. Both approaches are coupled with a Support Vector Machine (SVM) classifier to solve the theme identification problem. Results show the effectiveness of the proposed LDA-based method compared to the classical TF-IDF-Gini approach in the context of highly imperfect automatic transcriptions. Finally, we discuss the impact of discriminative and non-discriminative words extracted by both methods in terms of transcription accuracy.
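
As an illustration of the TF-IDF-Gini weighting named in the summary, the following is a minimal sketch only. This record does not give the paper's exact formula, so the score used here (corpus term frequency times inverse document frequency times a Gini purity term, taken as the sum over themes of the squared share of a word's occurrences) is an assumed formulation. The function name `tfidf_gini_scores`, the toy conversations, and the theme labels are hypothetical. The LDA features and the SVM classifier mentioned in the summary are not covered.

```python
import math
from collections import Counter, defaultdict


def tfidf_gini_scores(docs, themes):
    """Score each word as TF * IDF * Gini purity (an assumed formulation)."""
    n_docs = len(docs)
    term_freq = Counter()                # corpus-wide count of each word
    doc_freq = Counter()                 # number of documents containing each word
    theme_counts = defaultdict(Counter)  # word -> theme -> occurrence count

    for tokens, theme in zip(docs, themes):
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
        for token in tokens:
            theme_counts[token][theme] += 1

    total_tokens = sum(term_freq.values())
    scores = {}
    for word, count in term_freq.items():
        tf = count / total_tokens
        idf = math.log(n_docs / doc_freq[word])
        word_total = sum(theme_counts[word].values())
        # Purity is 1.0 when every occurrence of the word falls in one theme
        # and decreases as occurrences spread across themes.
        gini = sum((c / word_total) ** 2 for c in theme_counts[word].values())
        scores[word] = tf * idf * gini
    return scores


# Hypothetical toy transcriptions, already tokenized, with one theme each.
docs = [
    ["bus", "late", "strike"],
    ["bus", "card", "lost"],
    ["card", "price", "refund"],
    ["card", "refund", "late"],
]
themes = ["traffic", "traffic", "fares", "fares"]

scores = tfidf_gini_scores(docs, themes)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.4f}")
```

In this toy data, "bus" occurs only in "traffic" conversations and "late" is split evenly between the two themes, so under this assumed score the purity term ranks "bus" above "late" even though both have the same frequency and document frequency.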


Similar Items