Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

International audience. In this paper we address the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering appr...

Bibliographic Details
Main Authors:	Ke, Guiyao, Marteau, Pierre-François
Other Authors:	Expressiveness in Human Centered Data/Media (EXPRESSION), Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
Format:	Conference Object
Language:	English
Published:	HAL CCSD 2014
Subjects:	Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment; [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR]; Iceland
Online Access:	https://hal.science/hal-00995297

id	ftecolecentrpar:oai:HAL:hal-00995297v1
record_format	openpolar
spelling	ftecolecentrpar:oai:HAL:hal-00995297v1 2024-09-15T18:14:04+00:00 Reykjavik, Iceland 2014-05-26 en eng info:eu-repo/semantics/conferenceObject Conference papers 2024-08-28T23:59:45Z
institution	Open Polar
collection	École Centrale Paris: HAL-ECP
op_collection_id	ftecolecentrpar
language	English
topic	Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment; [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR]
description	International audience. In this paper we address the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering method that together make it possible to combine the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters to provide "thematic" comparable corpora of good quality and controlled size. On a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
format	Conference Object
author	Ke, Guiyao Marteau, Pierre-François
title	Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora
publisher	HAL CCSD
publishDate	2014
url	https://hal.science/hal-00995297
op_coverage	Reykjavik, Iceland
genre	Iceland
op_source	The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland https://hal.science/hal-00995297
op_relation	hal-00995297 https://hal.science/hal-00995297
_version_	1810451849825746944

Similar Items