Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish

Bibliographic Details
Main Authors: Lavergne, Thomas, Adda, Gilles, Adda-Decker, Martine, Lamel, Lori
Other Authors: Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), Institut des Technologies Multilingues et Multimédias de l'Information (IMMI), Centre National de la Recherche Scientifique (CNRS), LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), European Language Resources Association (ELRA), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis, ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Subjects:
Online Access: https://hal.archives-ouvertes.fr/hal-01134776
id ftccsdartic:oai:HAL:hal-01134776v1
record_format openpolar
institution Open Polar
collection Archive ouverte HAL (Hyper Article en Ligne, CCSD - Centre pour la Communication Scientifique Directe)
op_collection_id ftccsdartic
language English
topic corpus of Luxembourgish
language identification
under-resourced language
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
[SHS.LANGUE]Humanities and Social Sciences/Linguistics
description Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe's under-described languages. Written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. Speakers and writers frequently switch between Luxembourgish, German, and French, both from one sentence to the next and within a sentence. To build resources such as lexicons, especially pronunciation lexicons, or the language models needed for natural language processing tasks such as automatic speech recognition, the language used in text corpora should be identified. In this paper, we present the design of a manually annotated corpus of mixed-language sentences as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. The same system was then used to select textual data extracted from the web in order to build a lexicon and language models. The lexicon and language models were used in an automatic speech recognition system for Luxembourgish, which obtains a 25% word error rate (WER) on the Quaero development data.
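As an illustration of what word-level and sentence-level language tagging can look like, the following is a minimal sketch only. It is not the system described in the paper: the lexicon-lookup approach, the tiny word lists, the labels `amb` and `unk`, and the function names `tag_words` and `tag_sentence` are all assumptions made for this example.

```python
from collections import Counter

# Toy lexicons invented for this illustration; a real system would use
# large word lists or statistical models, none of which are shown here.
LEXICONS = {
    "lb": {"ech", "sinn", "net", "mir", "haut", "an", "et", "ass"},
    "de": {"ich", "bin", "nicht", "wir", "heute", "und", "es", "ist"},
    "fr": {"je", "suis", "pas", "nous", "et", "est", "le", "demain"},
}


def tag_words(sentence):
    """Return (word, label) pairs for one sentence.

    The label is a language code when exactly one lexicon contains the
    word, 'amb' when several do, and 'unk' when none does.
    """
    tagged = []
    for raw in sentence.lower().split():
        word = raw.strip(".,;:!?\"'")
        if not word:
            continue
        langs = [lang for lang, lexicon in LEXICONS.items() if word in lexicon]
        if len(langs) == 1:
            label = langs[0]
        elif langs:
            label = "amb"
        else:
            label = "unk"
        tagged.append((word, label))
    return tagged


def tag_sentence(tagged):
    """Label a sentence by majority vote over its unambiguous word labels."""
    counts = Counter(label for _, label in tagged if label in LEXICONS)
    if not counts:
        return "unk"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    words = tag_words("Ech sinn haut net do, et je suis")
    print(words)
    print(tag_sentence(words))
```

A sentence whose words carry more than one language label is a candidate mixed-language sentence, which is the kind of material the annotated corpus targets. Ties in the vote are not handled specially in this sketch.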
author2 Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI)
Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919)
Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11)
Institut des Technologies Multilingues et Multimédias de l'Information (IMMI)
Centre National de la Recherche Scientifique (CNRS)
LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP)
Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS)
European Language Resources Association (ELRA)
Khalid Choukri
Thierry Declerck
Hrafn Loftsson
Bente Maegaard
Joseph Mariani
Asuncion Moreno
Jan Odijk
Stelios Piperidis
ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011)
format Conference Object
author Lavergne, Thomas
Adda, Gilles
Adda-Decker, Martine
Lamel, Lori
title Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish
publisher HAL CCSD
publishDate 2014
url https://hal.archives-ouvertes.fr/hal-01134776
op_coverage Reykjavik, Iceland
op_source Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Ninth International Conference on Language Resources and Evaluation (LREC'14)
https://hal.archives-ouvertes.fr/hal-01134776
Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), May 2014, Reykjavik, Iceland. pp.3300-3304
http://lrec2014.lrec-conf.org/en/
op_relation hal-01134776
https://hal.archives-ouvertes.fr/hal-01134776
_version_ 1766040966376980480