Joint Annotation of Morphology and Syntax in Dependency Treebanks
In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank. We propose new formats, which we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize mSUD rather th...
Main Authors: | Bruno Guillaume, Kim Gerdes, Kirian Guiller, Sylvain Kahane, Yixuan Li |
---|---|
Other Authors: | , , , , , , , , , , , , |
Format: | Conference Object |
Language: | English |
Published: | HAL CCSD 2024 |
Subjects: | |
Online Access: | https://inria.hal.science/hal-04550108 https://inria.hal.science/hal-04550108/document https://inria.hal.science/hal-04550108/file/mSUD.pdf |
id |
ftunivparis10:oai:HAL:hal-04550108v1 |
---|---|
record_format |
openpolar |
spelling |
ftunivparis10:oai:HAL:hal-04550108v1 2024-05-19T07:49:59+00:00 Joint Annotation of Morphology and Syntax in Dependency Treebanks Guillaume, Bruno Gerdes, Kim Guiller, Kirian Kahane, Sylvain Li, Yixuan Semantic Analysis of Natural Language (SEMAGRAMME) Inria Nancy - Grand Est Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD) Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA) Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA) Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS) Laboratoire Interdisciplinaire des Sciences du Numérique (LISN) Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS) Modèles, Dynamiques, Corpus (MoDyCo) Université Paris Nanterre (UPN)-Centre National de la Recherche Scientifique (CNRS) LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP) Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS) ANR-21-CE38-0017,Autogramm,Induction de grammaires descriptives à partir de corpus(2021) Turino, Italy 2024-05-20 https://inria.hal.science/hal-04550108 https://inria.hal.science/hal-04550108/document https://inria.hal.science/hal-04550108/file/mSUD.pdf en eng HAL CCSD hal-04550108 
https://inria.hal.science/hal-04550108 https://inria.hal.science/hal-04550108/document https://inria.hal.science/hal-04550108/file/mSUD.pdf http://creativecommons.org/licenses/by/ info:eu-repo/semantics/OpenAccess The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING) https://inria.hal.science/hal-04550108 The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), May 2024, Turino, Italy Morph Morpheme Morph-based treebank Derivational affix Derivational path Compound Word structure Universal Dependencies [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] [SHS.LANGUE]Humanities and Social Sciences/Linguistics info:eu-repo/semantics/conferenceObject Conference papers 2024 ftunivparis10 2024-04-22T00:05:30Z International audience In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank. We propose new formats we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize on mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which allows us to clearly encode the internal structure of a word, that is, the derivational path. We investigate different problems posed by a morph-based annotation, concerning tokenization, choice of the head of a morph combination, relations between morphs, additional features needed, such as the token type differentiating roots and derivational and inflectional affixes. We show how our annotation schema can be applied to different languages from polysynthetic languages such as Yupik to isolating languages such as Chinese. Conference Object Yupik Université Paris Nanterre: HAL |
institution |
Open Polar |
collection |
Université Paris Nanterre: HAL |
op_collection_id |
ftunivparis10 |
language |
English |
topic |
Morph; Morpheme; Morph-based treebank; Derivational affix; Derivational path; Compound; Word structure; Universal Dependencies; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [SHS.LANGUE]Humanities and Social Sciences/Linguistics |
description |
In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank. We propose new formats, which we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which allows us to clearly encode the internal structure of a word, that is, its derivational path. We investigate several problems posed by morph-based annotation: tokenization, the choice of the head of a morph combination, the relations between morphs, and the additional features needed, such as a token type differentiating roots, derivational affixes, and inflectional affixes. We show how our annotation schema can be applied to a range of languages, from polysynthetic languages such as Yupik to isolating languages such as Chinese. |
author2 |
Semantic Analysis of Natural Language (SEMAGRAMME) Inria Nancy - Grand Est Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD) Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA) Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA) Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS) Laboratoire Interdisciplinaire des Sciences du Numérique (LISN) Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS) Modèles, Dynamiques, Corpus (MoDyCo) Université Paris Nanterre (UPN)-Centre National de la Recherche Scientifique (CNRS) LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP) Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS) ANR-21-CE38-0017,Autogramm,Induction de grammaires descriptives à partir de corpus(2021) |
format |
Conference Object |
author |
Guillaume, Bruno; Gerdes, Kim; Guiller, Kirian; Kahane, Sylvain; Li, Yixuan |
author_sort |
Guillaume, Bruno |
title |
Joint Annotation of Morphology and Syntax in Dependency Treebanks |
title_sort |
joint annotation of morphology and syntax in dependency treebanks |
publisher |
HAL CCSD |
publishDate |
2024 |
url |
https://inria.hal.science/hal-04550108 https://inria.hal.science/hal-04550108/document https://inria.hal.science/hal-04550108/file/mSUD.pdf |
op_coverage |
Turin, Italy |
genre |
Yupik |
op_source |
The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), May 2024, Turin, Italy; https://inria.hal.science/hal-04550108 |
op_relation |
hal-04550108 https://inria.hal.science/hal-04550108 https://inria.hal.science/hal-04550108/document https://inria.hal.science/hal-04550108/file/mSUD.pdf |
op_rights |
http://creativecommons.org/licenses/by/ info:eu-repo/semantics/OpenAccess |
_version_ |
1799468569533087744 |