Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

We address in this paper the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering appr...

Bibliographic Details
Main Authors: Ke, Guiyao, Marteau, Pierre-François
Other Authors: Expressiveness in Human Centered Data/Media (EXPRESSION), Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
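Subjects: Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]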
Online Access: https://hal.science/hal-00995297
id ftecolecentrpar:oai:HAL:hal-00995297v1
record_format openpolar
spelling ftecolecentrpar:oai:HAL:hal-00995297v1 2024-09-15T18:14:04+00:00 Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora Ke, Guiyao Marteau, Pierre-François Expressiveness in Human Centered Data/Media (EXPRESSION) Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6) Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA) Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes) Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes) Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA) Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS) Reykjavik, Iceland 2014-05-26 https://hal.science/hal-00995297 en eng HAL CCSD hal-00995297 https://hal.science/hal-00995297 The 9th edition of the Language Resources and Evaluation Conference, LREC 2014 https://hal.science/hal-00995297 The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland 
Thematic comparable corpora Comparability measure Co-clustering Cluster alignment [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR] info:eu-repo/semantics/conferenceObject Conference papers 2014 ftecolecentrpar 2024-08-28T23:59:45Z We address in this paper the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering approach, which together make it possible to combine the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide "thematic" comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality. Conference Object Iceland École Centrale Paris: HAL-ECP
institution Open Polar
collection École Centrale Paris: HAL-ECP
op_collection_id ftecolecentrpar
language English
topic Thematic comparable corpora
Comparability measure
Co-clustering
Cluster alignment
[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]
description We address in this paper the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering approach, which together make it possible to combine the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide "thematic" comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
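The threshold-based alignment of co-medoids mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `align_medoids`, the greedy best-match rule, the threshold value, and the toy comparability scores are all assumptions made for illustration, and the manual verification step is left out.

```python
def align_medoids(comparability, src_medoids, tgt_medoids, threshold):
    """Pair each source-language medoid with its most comparable
    target-language medoid, keeping only pairs at or above the threshold.

    comparability[s][t] is a (hypothetical) comparability score between
    source document s and target document t.
    """
    pairs = []
    for s in src_medoids:
        # Pick the target medoid with the highest comparability to s.
        best = max(tgt_medoids, key=lambda t: comparability[s][t])
        score = comparability[s][best]
        # Discard weak alignments; these would not be trusted as co-medoids.
        if score >= threshold:
            pairs.append((s, best, score))
    return pairs


# Toy scores between 3 source and 3 target documents (made-up values).
toy = [
    [0.9, 0.1, 0.2],
    [0.2, 0.3, 0.1],
    [0.1, 0.2, 0.8],
]
aligned = align_medoids(toy, src_medoids=[0, 1, 2], tgt_medoids=[0, 1, 2], threshold=0.5)
print(aligned)
```

In this toy run the second source medoid has no target medoid scoring at least 0.5, so it is left unaligned; in the workflow the description outlines, the retained pairs would then go to manual verification.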
author2 Expressiveness in Human Centered Data/Media (EXPRESSION)
Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6)
Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)
Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
format Conference Object
author Ke, Guiyao
Marteau, Pierre-François
title Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora
publisher HAL CCSD
publishDate 2014
url https://hal.science/hal-00995297
op_coverage Reykjavik, Iceland
genre Iceland
op_source The 9th edition of the Language Resources and Evaluation Conference, LREC 2014
https://hal.science/hal-00995297
The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland
op_relation hal-00995297
https://hal.science/hal-00995297
_version_ 1810451849825746944