Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish

Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe's under-described languages. This is due to the fact that the written production remains relatively low, and linguistic knowledge and resources, su...

Bibliographic Details
Main Authors:	Lavergne, Thomas; Adda, Gilles; Adda-Decker, Martine; Lamel, Lori
Other Authors:	Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris-Sud - Paris 11 (UP11)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Saclay (COmUE), Institut des Technologies Multilingues et Multimédias de l'Information (IMMI), Centre National de la Recherche Scientifique (CNRS), LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), European Language Resources Association (ELRA), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis, ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011)
Format:	Other/Unknown Material
Language:	English
Published:	HAL CCSD 2014
Subjects:	corpus of Luxembourguish; language identification; under-resourced language; lang; litt; Iceland
Online Access:	https://hal.archives-ouvertes.fr/hal-01134776

id	fttriple:oai:gotriple.eu:10670/1.q0guna
record_format	openpolar
institution	Open Polar
collection	Unknown
op_collection_id	fttriple
language	English
topic	corpus of Luxembourguish language identification under-resourced language lang litt
topic_facet	corpus of Luxembourguish language identification under-resourced language lang litt
description	Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe's under-described languages. This is because written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. Speakers and writers frequently switch between Luxembourgish, German, and French, on a per-sentence basis as well as at the sub-sentence level. In order to build resources such as lexicons, especially pronunciation lexicons, or the language models needed for natural language processing tasks such as automatic speech recognition, the language used in text corpora must be identified. In this paper, we present the design of a manually annotated corpus of mixed-language sentences as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. This language identification system was used to select textual data extracted from the web, in order to build a lexicon and language models. This lexicon and these language models were used in an Automatic Speech Recognition system for the Luxembourgish language, which obtains a 25% WER on the Quaero development data.
author2	Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI) Université Paris-Sud - Paris 11 (UP11)-Sorbonne Université - UFR d'Ingénierie (UFR 919) Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Saclay (COmUE) Institut des Technologies Multilingues et Multimédias de l'Information (IMMI) Centre National de la Recherche Scientifique (CNRS) LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP) Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS) European Language Resources Association (ELRA) Khalid Choukri Thierry Declerck Hrafn Loftsson Bente Maegaard Joseph Mariani Asuncion Moreno Jan Odijk Stelios Piperidis ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011)
format	Other/Unknown Material
author	Lavergne, Thomas Adda, Gilles Adda-Decker, Martine Lamel, Lori
author_facet	Lavergne, Thomas Adda, Gilles Adda-Decker, Martine Lamel, Lori
author_sort	Lavergne, Thomas
title	Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish
title_short	Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish
title_full	Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish
title_fullStr	Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish
title_full_unstemmed	Automatic Language Identity Tagging on Word and Sentence-Level in Multilingual Text Sources: a Case-Study on Luxembourgish
title_sort	automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on luxembourgish
publisher	HAL CCSD
publishDate	2014
url	https://hal.archives-ouvertes.fr/hal-01134776
op_coverage	Reykjavik, Iceland
genre	Iceland
genre_facet	Iceland
op_source	Hyper Article en Ligne - Sciences de l'Homme et de la Société Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) Ninth International Conference on Language Resources and Evaluation (LREC'14) Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), May 2014, Reykjavik, Iceland. pp.3300-3304
op_relation	hal-01134776 10670/1.q0guna https://hal.archives-ouvertes.fr/hal-01134776
op_rights	undefined
_version_	1766040797086482432
