Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

International audience. This paper addresses the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering appr...

Bibliographic Details
Main Authors: Ke, Guiyao, Marteau, Pierre-François
Other Authors: Expressiveness in Human Centered Data/Media (EXPRESSION), Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Subjects: Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment
Online Access: https://hal.archives-ouvertes.fr/hal-00995297
id ftccsdartic:oai:HAL:hal-00995297v1
record_format openpolar
spelling ftccsdartic:oai:HAL:hal-00995297v1 2023-05-15T16:50:13+02:00 Reykjavik, Iceland 2014-05-26 en eng 2021-10-24T13:25:59Z
institution Open Polar
collection Archive ouverte HAL (Hyper Article en Ligne, CCSD - Centre pour la Communication Scientifique Directe)
op_collection_id ftccsdartic
language English
topic Thematic comparable corpora
Comparability measure
Co-clustering
Cluster alignment
[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]
description International audience. This paper addresses the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach relies on a quantitative comparability measure and a co-clustering procedure that together make it possible to combine the similarity measures available within each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide "thematic" comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
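The alignment step mentioned in the description (pairing co-medoids across languages and keeping only pairs above a comparability threshold) can be illustrated with a minimal sketch. The record does not define the comparability measure, so the dictionary-overlap score below is a hypothetical stand-in, and the function names `comparability` and `align_medoids`, the toy dictionary, and the threshold value are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple


def comparability(src_doc: Set[str], tgt_doc: Set[str],
                  dictionary: Dict[str, Set[str]]) -> float:
    """Hypothetical stand-in measure: the share of dictionary-covered
    source words that have at least one translation in the target document."""
    covered = [w for w in src_doc if w in dictionary]
    if not covered:
        return 0.0
    hits = sum(1 for w in covered if dictionary[w] & tgt_doc)
    return hits / len(covered)


def align_medoids(src_medoids: List[Set[str]], tgt_medoids: List[Set[str]],
                  dictionary: Dict[str, Set[str]],
                  threshold: float) -> List[Tuple[int, int, float]]:
    """Pair each source medoid with its most comparable target medoid and
    keep the pair only if its score reaches the threshold."""
    pairs = []
    for i, src in enumerate(src_medoids):
        scores = [comparability(src, tgt, dictionary) for tgt in tgt_medoids]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((i, best, scores[best]))
    return pairs


# Toy data (assumed): medoid documents reduced to sets of words.
dictionary = {"chat": {"cat"}, "chien": {"dog"},
              "banque": {"bank"}, "argent": {"money"}}
fr_medoids = [{"chat", "chien"}, {"banque", "argent"}, {"météo"}]
en_medoids = [{"bank", "money", "loan"}, {"cat", "dog", "pet"}]

pairs = align_medoids(fr_medoids, en_medoids, dictionary, threshold=0.5)
print(pairs)
```

In this sketch the third French medoid has no dictionary coverage, scores 0.0 against every English medoid, and is therefore left unaligned. The surviving pairs are the candidates that the manual verification described above would then review.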
author2 Expressiveness in Human Centered Data/Media (EXPRESSION)
Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6)
Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)
CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1)
Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1)
Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)
Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)
format Conference Object
author Ke, Guiyao
Marteau, Pierre-François
title Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora
publisher HAL CCSD
publishDate 2014
url https://hal.archives-ouvertes.fr/hal-00995297
op_coverage Reykjavik, Iceland
genre Iceland
op_source The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland
https://hal.archives-ouvertes.fr/hal-00995297
op_relation hal-00995297
https://hal.archives-ouvertes.fr/hal-00995297
_version_ 1766040387119480832