A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages


Bibliographic Details
Main Authors:	Scherrer, Yves, Sagot, Benoît
Other Authors:	Laboratoire d'Analyse et de Technologie du Langage (LATL-CUI), Université de Genève (UNIGE); Analyse Linguistique Profonde à Grande Echelle / Large-scale deep linguistic processing (ALPAGE), Inria Paris-Rocquencourt, Institut National de Recherche en Informatique et en Automatique (Inria), Université Paris Diderot - Paris 7 (UPD7); European Language Resources Association; ANR-11-IDEX-0005, EFL, Empirical Foundations of Linguistics: data, methods, models (2011)
Format:	Conference Object
Language:	English
Published:	HAL CCSD 2014
Subjects:	ACM: J.: Computer Applications/J.5: ARTS AND HUMANITIES/J.5.4: Linguistics; ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]; Iceland
Online Access:	https://hal.inria.fr/hal-01022298 https://hal.inria.fr/hal-01022298/document https://hal.inria.fr/hal-01022298/file/lrec14cll.pdf

Description
Summary:	We describe a generic approach for transferring part-of-speech annotations from a resourced language to an etymologically closely related non-resourced language without using any bilingual (i.e., parallel) data. First, we induce a translation lexicon from monolingual corpora, using cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used to tag the corpus. We evaluate our methods on three language families: five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.
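
The two-step idea in the summary (induce a lexicon from word lists, then pass POS tags along the translation pairs) can be illustrated with a toy sketch. This is not the authors' implementation: normalized edit-distance similarity is swapped in here as a simple stand-in for cognate detection, the contextual-similarity step is left out, and the function names, the 0.6 threshold and the Spanish/Catalan word lists are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def induce_lexicon(
    target_words: List[str], source_words: List[str], threshold: float = 0.6
) -> Dict[str, str]:
    """Pair each target word with its most similar source word.

    Hypothetical stand-in for cognate detection: pairs scoring below
    the (assumed) threshold are dropped.
    """
    lexicon: Dict[str, str] = {}
    if not source_words:
        return lexicon
    for t in target_words:
        best = max(source_words, key=lambda s: similarity(t, s))
        if similarity(t, best) >= threshold:
            lexicon[t] = best
    return lexicon


def transfer_tags(
    tokens: List[str], lexicon: Dict[str, str], source_tags: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Tag each target token with the POS of its source-side translation."""
    tagged = []
    for tok in tokens:
        src = lexicon.get(tok)
        tagged.append((tok, source_tags.get(src) if src is not None else None))
    return tagged


# Toy data (invented): Spanish as the resourced language, Catalan as the target.
source_tags = {"el": "DET", "gato": "NOUN", "blanco": "ADJ"}
target_tokens = ["el", "gat", "blanc", "xyz"]

lexicon = induce_lexicon(target_tokens, list(source_tags))
tagged = transfer_tags(target_tokens, lexicon, source_tags)
```

Tokens with no sufficiently similar source word (here the made-up "xyz") stay untagged with `None` rather than receiving a guessed tag.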

Similar Items