A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions

International audience. Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this pape...

Bibliographic Details
Main Authors: Morchid, Mohamed; Dufour, Richard; Linarès, Georges
Other Authors: Laboratoire Informatique d'Avignon (LIA); Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Subjects: Speech analytics; Topic identification; Latent Dirichlet Allocation; [INFO]Computer Science [cs]
Online Access: https://hal.archives-ouvertes.fr/hal-01319771
id ftunivavignon:oai:HAL:hal-01319771v1
record_format openpolar
institution Open Polar
collection Université d'Avignon et des Pays de Vaucluse: HAL
op_collection_id ftunivavignon
language English
topic Speech analytics
Topic identification
Latent Dirichlet Allocation
[INFO]Computer Science [cs]
description International audience. Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this paper, we propose to identify the themes of telephone conversation services with the classical Term Frequency-Inverse Document Frequency method using the Gini purity criterion (TF-IDF-Gini) and with a Latent Dirichlet Allocation (LDA) approach. Both approaches are coupled with a Support Vector Machine (SVM) classifier to solve the theme identification problem. Results show the effectiveness of the proposed LDA-based method compared to the classical TF-IDF-Gini approach in the context of highly imperfect automatic transcriptions. Finally, we discuss the impact of the discriminative and non-discriminative words extracted by both methods in terms of transcription accuracy.
author2 Laboratoire Informatique d'Avignon (LIA)
Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
format Conference Object
author Morchid, Mohamed
Dufour, Richard
Linarès, Georges
title A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions
publisher HAL CCSD
publishDate 2014
url https://hal.archives-ouvertes.fr/hal-01319771
op_coverage Reykjavik, Iceland
genre Iceland
op_source LREC
https://hal.archives-ouvertes.fr/hal-01319771
LREC, May 2014, Reykjavik, Iceland
op_relation hal-01319771
https://hal.archives-ouvertes.fr/hal-01319771
_version_ 1766039531114463232
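
The description above names a TF-IDF-Gini weighting but this record does not give its formula. The following is only a minimal sketch under stated assumptions: the Gini purity of a word w is taken to be the sum over themes t of P(t|w)^2, and each TF-IDF weight is multiplied by that purity. The function names, the toy transcripts, and the theme labels are hypothetical and are not taken from the paper; the LDA and SVM stages are not shown.

```python
import math
from collections import Counter, defaultdict


def gini_purity(docs, labels):
    """Assumed purity: G(w) = sum over themes t of P(t|w)^2.

    P(t|w) is estimated from how often word w occurs in documents of theme t.
    G(w) is 1.0 when w occurs in a single theme and falls toward
    1/len(themes) as w spreads evenly over all themes.
    """
    word_theme = defaultdict(Counter)
    for tokens, theme in zip(docs, labels):
        for word in tokens:
            word_theme[word][theme] += 1
    purity = {}
    for word, counts in word_theme.items():
        total = sum(counts.values())
        purity[word] = sum((c / total) ** 2 for c in counts.values())
    return purity


def tfidf_gini(docs, labels):
    """Return one {word: weight} dict per document, plus the purity table.

    weight = relative term frequency * log(N / document frequency) * G(w)
    """
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    purity = gini_purity(docs, labels)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            word: (tf[word] / len(tokens))
            * math.log(n_docs / doc_freq[word])
            * purity[word]
            for word in tf
        })
    return vectors, purity


# Hypothetical toy transcripts (already tokenized) with hypothetical themes.
docs = [
    ["card", "lost", "hello"],
    ["card", "stolen", "hello"],
    ["bus", "schedule", "hello"],
    ["bus", "late", "hello"],
]
labels = ["card", "card", "transport", "transport"]

vectors, purity = tfidf_gini(docs, labels)
```

In this toy data, "hello" occurs in every document and in both themes, so its inverse document frequency is zero and its purity is low, while "card" occurs in one theme only and keeps its full TF-IDF weight. The resulting per-document vectors are the kind of feature that could then be passed to a classifier.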