A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions
Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. The transcription quality has a direct impact on classification tasks that use text features. In this pape...
Main Authors: | Morchid, Mohamed; Dufour, Richard; Linarès, Georges |
---|---|
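Other Authors: | Laboratoire Informatique d'Avignon (LIA); Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI |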
Format: | Conference Object |
Language: | English |
Published: | HAL CCSD, 2014 |
Subjects: | Speech analytics; Topic identification; Latent Dirichlet Allocation; [INFO]Computer Science [cs] |
Online Access: | https://hal.archives-ouvertes.fr/hal-01319771 |
id | ftccsdartic:oai:HAL:hal-01319771v1 |
---|---|
record_format | openpolar |
institution | Open Polar |
collection | Archive ouverte HAL (Hyper Article en Ligne, CCSD - Centre pour la Communication Scientifique Directe) |
op_collection_id | ftccsdartic |
language | English |
topic | Speech analytics; Topic identification; Latent Dirichlet Allocation; [INFO]Computer Science [cs] |
description | Although current transcription systems can achieve high recognition performance, they still have great difficulty transcribing speech in very noisy environments. The transcription quality has a direct impact on classification tasks that use text features. In this paper, we propose to identify the themes of telephone conversation services with the classical Term Frequency-Inverse Document Frequency method using the Gini purity criterion (TF-IDF-Gini) and with a Latent Dirichlet Allocation (LDA) approach. Both approaches are coupled with a Support Vector Machine (SVM) classifier to solve the theme identification problem. Results show the effectiveness of the proposed LDA-based method compared to the classical TF-IDF-Gini approach in the context of highly imperfect automatic transcriptions. Finally, we discuss the impact of the discriminative and non-discriminative words extracted by both methods in terms of transcription accuracy. |
author2 | Laboratoire Informatique d'Avignon (LIA); Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI |
format | Conference Object |
author | Morchid, Mohamed; Dufour, Richard; Linarès, Georges |
author_sort | Morchid, Mohamed |
title | A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions |
publisher | HAL CCSD |
publishDate | 2014 |
url | https://hal.archives-ouvertes.fr/hal-01319771 |
op_coverage | Reykjavik, Iceland |
genre | Iceland |
op_source | LREC, May 2014, Reykjavik, Iceland; https://hal.archives-ouvertes.fr/hal-01319771 |
op_relation | hal-01319771; https://hal.archives-ouvertes.fr/hal-01319771 |
_version_ | 1766039635247497216 |