A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions
Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this pape...
Main Authors: | Mohamed Morchid, Richard Dufour, Georges Linarès |
---|---|
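Other Authors: | Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI |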
Format: | Conference Object |
Language: | English |
Published: | HAL CCSD 2014 |
Subjects: | Speech analytics, Topic identification, Latent Dirichlet Allocation, [INFO]Computer Science [cs] |
Online Access: | https://hal.archives-ouvertes.fr/hal-01319771 |
_version_ | 1821553522005508096 |
---|---|
author | Morchid, Mohamed Dufour, Richard Linarès, Georges |
author2 | Laboratoire Informatique d'Avignon (LIA) Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI |
author_facet | Morchid, Mohamed Dufour, Richard Linarès, Georges |
author_sort | Morchid, Mohamed |
collection | Université d'Avignon et des Pays de Vaucluse: HAL |
description | Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this paper, we propose to identify the themes of telephone conversation services with the classical Term Frequency-Inverse Document Frequency method using the Gini purity criterion (TF-IDF-Gini) and with a Latent Dirichlet Allocation (LDA) approach. Both approaches are coupled with a Support Vector Machine (SVM) classifier to solve the theme identification problem. Results show the effectiveness of the proposed LDA-based method compared to the classical TF-IDF-Gini approach in the context of highly imperfect automatic transcriptions. Finally, we discuss the impact of the discriminative and non-discriminative words extracted by both methods in terms of transcription accuracy. |
format | Conference Object |
genre | Iceland |
genre_facet | Iceland |
id | ftunivavignon:oai:HAL:hal-01319771v1 |
institution | Open Polar |
language | English |
op_collection_id | ftunivavignon |
op_coverage | Reykjavik, Iceland |
op_relation | hal-01319771 https://hal.archives-ouvertes.fr/hal-01319771 |
op_source | LREC https://hal.archives-ouvertes.fr/hal-01319771 LREC, May 2014, Reykjavik, Iceland |
publishDate | 2014 |
publisher | HAL CCSD |
record_format | openpolar |
spellingShingle | Speech analytics Topic identification Latent Dirichlet Allocation [INFO]Computer Science [cs] Morchid, Mohamed Dufour, Richard Linarès, Georges A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
title | A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
title_full | A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
title_fullStr | A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
title_full_unstemmed | A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
title_short | A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
title_sort | lda-based topic classification approach from highly imperfect automatic transcriptions |
topic | Speech analytics Topic identification Latent Dirichlet Allocation [INFO]Computer Science [cs] |
topic_facet | Speech analytics Topic identification Latent Dirichlet Allocation [INFO]Computer Science [cs] |
url | https://hal.archives-ouvertes.fr/hal-01319771 |
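
The description field names a TF-IDF weighting combined with a Gini purity criterion but this record does not give the formula. As a rough illustration only, the sketch below assumes the score of a term is the product of its corpus term frequency, its inverse document frequency, and a Gini purity defined as the sum of squared P(theme | term). That formulation, the function names `gini_purity` and `tfidf_gini`, and the toy documents are all assumptions made for this example, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def gini_purity(theme_counts):
    """Sum of squared P(theme | term).

    Equals 1.0 when the term occurs in a single theme and decreases
    as its occurrences spread over several themes (assumed definition).
    """
    total = sum(theme_counts.values())
    return sum((count / total) ** 2 for count in theme_counts.values())


def tfidf_gini(docs):
    """Score each term as tf * idf * gini (assumed combination).

    docs is a list of (tokens, theme) pairs.
    """
    n_docs = len(docs)
    tf = Counter()                   # corpus-wide term frequency
    df = Counter()                   # number of documents containing the term
    by_theme = defaultdict(Counter)  # term -> occurrences per theme
    for tokens, theme in docs:
        tf.update(tokens)
        df.update(set(tokens))
        for token in tokens:
            by_theme[token][theme] += 1
    scores = {}
    for term in tf:
        idf = math.log(n_docs / df[term])
        scores[term] = tf[term] * idf * gini_purity(by_theme[term])
    return scores


# Hypothetical toy transcriptions, each labelled with a theme.
docs = [
    ("lost card card".split(), "cards"),
    ("bus late".split(), "traffic"),
    ("bus card".split(), "traffic"),
]
scores = tfidf_gini(docs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.3f}")
```

In this toy setting, a term that occurs in only one theme keeps its full TF-IDF weight, while a term shared across themes is scaled down. The highest-scoring terms could then serve as features for a classifier such as the SVM mentioned in the description.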