Joint Annotation of Morphology and Syntax in Dependency Treebanks

In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank. We propose new formats we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize mSUD rather th...

Bibliographic Details
Main Authors: Guillaume, Bruno; Gerdes, Kim; Guiller, Kirian; Kahane, Sylvain; Li, Yixuan
Other Authors: Semantic Analysis of Natural Language (SEMAGRAMME); Inria Nancy - Grand Est; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS); Laboratoire Interdisciplinaire des Sciences du Numérique (LISN); Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS); Modèles, Dynamiques, Corpus (MoDyCo); Université Paris Nanterre (UPN)-Centre National de la Recherche Scientifique (CNRS); LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP); Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS); ANR-21-CE38-0017,Autogramm,Induction de grammaires descriptives à partir de corpus(2021)
Format: Conference Object
Language: English
Published: HAL CCSD 2024
Subjects:
Online Access: https://inria.hal.science/hal-04550108
https://inria.hal.science/hal-04550108/document
https://inria.hal.science/hal-04550108/file/mSUD.pdf
id ftanrparis:oai:HAL:hal-04550108v1
record_format openpolar
institution Open Polar
collection Portail HAL-ANR (Agence Nationale de la Recherche)
op_collection_id ftanrparis
language English
topic Morph
Morpheme
Morph-based treebank
Derivational affix
Derivational path
Compound
Word structure
Universal Dependencies
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
[SHS.LANGUE]Humanities and Social Sciences/Linguistics
description In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank. We propose two new formats, mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We focus on mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which allows us to clearly encode the internal structure of a word, that is, its derivational path. We investigate the problems posed by a morph-based annotation: tokenization, the choice of the head of a morph combination, the relations between morphs, and the additional features needed, such as a token type that distinguishes roots, derivational affixes, and inflectional affixes. We show how our annotation schema can be applied to different languages, from polysynthetic languages such as Yupik to isolating languages such as Chinese.
author2 Semantic Analysis of Natural Language (SEMAGRAMME)
Inria Nancy - Grand Est
Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD)
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Laboratoire Interdisciplinaire des Sciences du Numérique (LISN)
Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
Modèles, Dynamiques, Corpus (MoDyCo)
Université Paris Nanterre (UPN)-Centre National de la Recherche Scientifique (CNRS)
LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP)
Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS)
ANR-21-CE38-0017,Autogramm,Induction de grammaires descriptives à partir de corpus(2021)
format Conference Object
author Guillaume, Bruno
Gerdes, Kim
Guiller, Kirian
Kahane, Sylvain
Li, Yixuan
author_sort Guillaume, Bruno
title Joint Annotation of Morphology and Syntax in Dependency Treebanks
publisher HAL CCSD
publishDate 2024
url https://inria.hal.science/hal-04550108
https://inria.hal.science/hal-04550108/document
https://inria.hal.science/hal-04550108/file/mSUD.pdf
op_coverage Torino, Italy
genre Yupik
op_source The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)
https://inria.hal.science/hal-04550108
The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), May 2024, Torino, Italy
op_relation hal-04550108
https://inria.hal.science/hal-04550108
https://inria.hal.science/hal-04550108/document
https://inria.hal.science/hal-04550108/file/mSUD.pdf
op_rights http://creativecommons.org/licenses/by/
info:eu-repo/semantics/OpenAccess