Variations on quantitative comparability measures and their evaluations on synthetic French-English comparable corpora

International audience. Following the pioneering work of \cite{Li-Gaussier-10}, this paper analyzes a family of quantitative comparability measures dedicated to the construction and evaluation of topical comparable corpora. After recalling the definition of the quantitative compa...

Bibliographic Details
Main Authors: Ke, Guiyao, Marteau, Pierre-François, Ménier, Gildas
Other Authors: Expressiveness in Human Centered Data/Media (EXPRESSION), Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Subjects:
Online Access: https://hal.science/hal-00995294
id ftecolecentrpar:oai:HAL:hal-00995294v1
record_format openpolar
spelling ftecolecentrpar:oai:HAL:hal-00995294v1 2023-08-15T12:41:52+02:00 Reykjavik, Iceland 2014-05-26 2023-07-25T20:08:41Z
institution Open Polar
collection École Centrale Paris: HAL-ECP
op_collection_id ftecolecentrpar
language English
topic Comparable corpora
Comparability measures
Evaluation
[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
description International audience. Following the pioneering work of \cite{Li-Gaussier-10}, this paper analyzes a family of quantitative comparability measures dedicated to the construction and evaluation of topical comparable corpora. After recalling the definition of the quantitative comparability measure proposed by \cite{Li-Gaussier-10}, we develop variants of this measure that take into account the occurrence frequencies of lexical entries and the number of their translations. We compare the respective advantages and disadvantages of these variants within an evaluation framework based on the progressive degradation of the Europarl parallel corpus. The degradation is obtained by replacing, either deterministically or randomly, a varying number of lines in the blocks that partition the initial Europarl corpus. The impact of bilingual dictionary coverage on these measures is also discussed, and perspectives are presented.
author2 Expressiveness in Human Centered Data/Media (EXPRESSION)
Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6)
Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)
Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
format Conference Object
author Ke, Guiyao
Marteau, Pierre-François
Ménier, Gildas
title Variations on quantitative comparability measures and their evaluations on synthetic French-English comparable corpora
publisher HAL CCSD
publishDate 2014
url https://hal.science/hal-00995294
op_coverage Reykjavik, Iceland
genre Iceland
op_source LREC 2014, the 9th edition of the Language Resources and Evaluation Conference, May 2014, Reykjavik, Iceland
https://hal.science/hal-00995294
op_relation hal-00995294
https://hal.science/hal-00995294