Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish
Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe's under-described languages. This is because written production remains relatively low, and linguistic knowledge and resources, su...
Main Authors: | Lavergne, Thomas; Adda, Gilles; Adda-Decker, Martine; Lamel, Lori |
---|---|
Other Authors: | , , |
Format: | Conference Object |
Language: | English |
Published: | HAL CCSD, 2014 |
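Subjects: | under-resourced language; language identification; corpus of Luxembourgish; Computer Science [cs]; Computation and Language [cs.CL] |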
Online Access: | https://hal.archives-ouvertes.fr/hal-01843401 |
id |
ftccsdartic:oai:HAL:hal-01843401v1 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
Archive ouverte HAL (Hyper Article en Ligne, CCSD - Centre pour la Communication Scientifique Directe) |
op_collection_id |
ftccsdartic |
language |
English |
topic |
under-resourced language; language identification; corpus of Luxembourgish; [INFO] Computer Science [cs]; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] |
topic_facet |
under-resourced language; language identification; corpus of Luxembourgish; [INFO] Computer Science [cs]; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] |
description |
Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe's under-described languages. This is because written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. Speakers and writers frequently switch between Luxembourgish, German, and French, both from sentence to sentence and within a sentence. To build resources such as lexicons (especially pronunciation lexicons) or the language models needed for natural language processing tasks such as automatic speech recognition, the language used in text corpora must first be identified. In this paper, we present the design of a manually annotated corpus of mixed-language sentences, as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. The language identification system was then used to select textual data extracted from the web in order to build a lexicon and language models. This lexicon and these language models were used in an automatic speech recognition system for Luxembourgish, which obtains a 25% word error rate (WER) on the Quaero development data. |
author2 |
Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI) Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919) Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11) |
format |
Conference Object |
author |
Lavergne, Thomas; Adda, Gilles; Adda-Decker, Martine; Lamel, Lori |
author_facet |
Lavergne, Thomas; Adda, Gilles; Adda-Decker, Martine; Lamel, Lori |
author_sort |
Lavergne, Thomas |
title |
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish |
title_short |
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish |
title_full |
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish |
title_fullStr |
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish |
title_full_unstemmed |
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish |
title_sort |
automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on luxembourgish |
publisher |
HAL CCSD |
publishDate |
2014 |
url |
https://hal.archives-ouvertes.fr/hal-01843401 |
op_coverage |
Reykjavik, Iceland |
genre |
Iceland |
genre_facet |
Iceland |
op_source |
International Conference on Language Resources and Evaluation, May 2014, Reykjavik, Iceland; https://hal.archives-ouvertes.fr/hal-01843401 |
op_relation |
hal-01843401 https://hal.archives-ouvertes.fr/hal-01843401 |
_version_ |
1766040961253638144 |