A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages

In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from...
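
The two-step pipeline summarized in this record (lexicon induction, then POS transfer along translation pairs) can be illustrated with a minimal sketch. It is not the authors' implementation: cognate detection is approximated here by a normalized Levenshtein similarity, the cross-lingual contextual similarity step is left out entirely, and all function names, the 0.5 threshold, the "UNK" fallback tag, and the toy Spanish/Portuguese word lists are assumptions made only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def induce_lexicon(target_words, source_words, threshold=0.5):
    """Pair each target word with its most similar source word.

    Stand-in for cognate detection; the threshold is an assumed value.
    """
    lexicon = {}
    for t in target_words:
        score, best = max((similarity(t, s), s) for s in source_words)
        if score >= threshold:
            lexicon[t] = best
    return lexicon


def transfer_tags(lexicon, source_tags):
    """Copy the POS tag of each source word onto its target-side partner."""
    return {t: source_tags[s] for t, s in lexicon.items() if s in source_tags}


def tag(tokens, target_tags, fallback="UNK"):
    """Tag target-language tokens by lookup; unknown words get the fallback."""
    return [(tok, target_tags.get(tok, fallback)) for tok in tokens]


# Toy data (hypothetical): Spanish plays the resourced language,
# Portuguese the language to be tagged.
source_tags = {"libro": "NOUN", "cantar": "VERB", "nuevo": "ADJ"}
lexicon = induce_lexicon(["livro", "cantar", "novo", "o"], list(source_tags))
target_tags = transfer_tags(lexicon, source_tags)
tagged = tag(["o", "livro", "novo"], target_tags)
```

In this toy run, "o" finds no sufficiently similar source word and keeps the fallback tag. A fuller pipeline would presumably use corpus context to resolve such words and to disambiguate words with several candidate tags.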


Bibliographic Details
Main Authors: Scherrer, Yves, Sagot, Benoît
Other Authors: LATL-CUI, Laboratoire d'Analyse et de Technologie du Langage (LATL), Université de Genève = University of Geneva (UNIGE); Analyse Linguistique Profonde à Grande Echelle, Large-scale deep linguistic processing (ALPAGE), Inria Paris-Rocquencourt, Institut National de Recherche en Informatique et en Automatique (Inria), Université Paris Diderot - Paris 7 (UPD7); European Language Resources Association; ANR-11-IDEX-0005, USPC, Université Sorbonne Paris Cité (2011)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Subjects:
Online Access: https://inria.hal.science/hal-01022298
https://inria.hal.science/hal-01022298/document
https://inria.hal.science/hal-01022298/file/lrec14cll.pdf
id ftanrparis:oai:HAL:hal-01022298v1
record_format openpolar
spelling ftanrparis:oai:HAL:hal-01022298v1 2023-06-11T04:13:08+02:00 A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages Scherrer, Yves Sagot, Benoît LATL-CUI Laboratoire d'Analyse et de Technologie du Langage (LATL) Université de Genève = University of Geneva (UNIGE)-Université de Genève = University of Geneva (UNIGE) Analyse Linguistique Profonde à Grande Echelle Large-scale deep linguistic processing (ALPAGE) Inria Paris-Rocquencourt Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris Diderot - Paris 7 (UPD7) European Language Resources Association ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011) Reykjavik, Iceland 2014-05-26 https://inria.hal.science/hal-01022298 https://inria.hal.science/hal-01022298/document https://inria.hal.science/hal-01022298/file/lrec14cll.pdf en eng HAL CCSD hal-01022298 https://inria.hal.science/hal-01022298 https://inria.hal.science/hal-01022298/document https://inria.hal.science/hal-01022298/file/lrec14cll.pdf info:eu-repo/semantics/OpenAccess Language Resources and Evaluation Conference https://inria.hal.science/hal-01022298 Language Resources and Evaluation Conference, European Language Resources Association, May 2014, Reykjavik, Iceland ACM: J.: Computer Applications/J.5: ARTS AND HUMANITIES/J.5.4: Linguistics ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] info:eu-repo/semantics/conferenceObject Conference papers 2014 ftanrparis 2023-05-29T00:18:33Z International audience In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. 
We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%. Conference Object Iceland Portail HAL-ANR (Agence Nationale de la Recherche)
institution Open Polar
collection Portail HAL-ANR (Agence Nationale de la Recherche)
op_collection_id ftanrparis
language English
topic ACM: J.: Computer Applications/J.5: ARTS AND HUMANITIES/J.5.4: Linguistics
ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
description In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.
author2 LATL-CUI
Laboratoire d'Analyse et de Technologie du Langage (LATL)
Université de Genève = University of Geneva (UNIGE)-Université de Genève = University of Geneva (UNIGE)
Analyse Linguistique Profonde à Grande Echelle
Large-scale deep linguistic processing (ALPAGE)
Inria Paris-Rocquencourt
Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris Diderot - Paris 7 (UPD7)
European Language Resources Association
ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011)
format Conference Object
author Scherrer, Yves
Sagot, Benoît
author_sort Scherrer, Yves
title A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages
title_sort language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages
publisher HAL CCSD
publishDate 2014
url https://inria.hal.science/hal-01022298
https://inria.hal.science/hal-01022298/document
https://inria.hal.science/hal-01022298/file/lrec14cll.pdf
op_coverage Reykjavik, Iceland
genre Iceland
genre_facet Iceland
op_source Language Resources and Evaluation Conference
https://inria.hal.science/hal-01022298
Language Resources and Evaluation Conference, European Language Resources Association, May 2014, Reykjavik, Iceland
op_relation hal-01022298
https://inria.hal.science/hal-01022298
https://inria.hal.science/hal-01022298/document
https://inria.hal.science/hal-01022298/file/lrec14cll.pdf
op_rights info:eu-repo/semantics/OpenAccess
_version_ 1768389799061749760