Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora
International audience. We address in this paper the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering appr...
Main Authors: | Ke, Guiyao; Marteau, Pierre-François |
---|---|
Other Authors: | , , , , , , |
Format: | Conference Object |
Language: | English |
Published: | HAL CCSD 2014 |
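Subjects: | Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR] |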
Online Access: | https://hal.science/hal-00995297 |
id |
ftunivrennes1hal:oai:HAL:hal-00995297v1 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
Université de Rennes 1: Publications scientifiques (HAL) |
op_collection_id |
ftunivrennes1hal |
language |
English |
topic |
Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR] |
description |
International audience. We address in this paper the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering approach that together allow mixing the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Given the improvement in co-clustering ($k$-medoids) performance that we obtain, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide "thematic" comparable corpora of good quality and controlled size. On a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality. |
author2 |
Expressiveness in Human Centered Data/Media (EXPRESSION) Université de Bretagne Sud (UBS)-MEDIA ET INTERACTIONS (IRISA-D6) Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA) Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes) Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes) Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA) Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS) |
format |
Conference Object |
author |
Ke, Guiyao; Marteau, Pierre-François |
author_sort |
Ke, Guiyao |
title |
Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora |
publisher |
HAL CCSD |
publishDate |
2014 |
url |
https://hal.science/hal-00995297 |
op_coverage |
Reykjavik, Iceland |
genre |
Iceland |
genre_facet |
Iceland |
op_source |
The 9th edition of the Language Resources and Evaluation Conference, LREC 2014 https://hal.science/hal-00995297 The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland |
op_relation |
hal-00995297 https://hal.science/hal-00995297 |
_version_ |
1766040395284742144 |
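
The description above mentions aligning co-clusters through their medoids using a comparability threshold. The record does not give the paper's actual comparability measure or alignment procedure, so the following is only a minimal sketch under stated assumptions: `comparability` is a hypothetical toy score (the share of dictionary-covered source words that have a translation in the target document), and `align_medoids`, `dictionary`, and `threshold` are illustrative names, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple


def comparability(src_doc: Set[str], tgt_doc: Set[str],
                  dictionary: Dict[str, Set[str]]) -> float:
    """Toy score (an assumption, not the paper's measure): fraction of
    dictionary-covered source words with a translation in the target doc."""
    translatable = [w for w in src_doc if w in dictionary]
    if not translatable:
        return 0.0
    hits = sum(1 for w in translatable if dictionary[w] & tgt_doc)
    return hits / len(translatable)


def align_medoids(src_medoids: List[Set[str]], tgt_medoids: List[Set[str]],
                  dictionary: Dict[str, Set[str]],
                  threshold: float) -> List[Tuple[int, int, float]]:
    """Pair each source medoid with its most comparable target medoid and
    keep the pair only if the score reaches the threshold."""
    pairs = []
    for i, src in enumerate(src_medoids):
        scores = [comparability(src, tgt, dictionary) for tgt in tgt_medoids]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((i, best, scores[best]))
    return pairs


# Hypothetical French-to-English data, invented for illustration only.
dictionary = {"chat": {"cat"}, "chien": {"dog"},
              "banque": {"bank"}, "argent": {"money"}}
src_medoids = [{"chat", "chien", "le"}, {"banque", "argent"}, {"chat", "banque"}]
tgt_medoids = [{"bank", "money", "the"}, {"cat", "dog", "the"}]

pairs = align_medoids(src_medoids, tgt_medoids, dictionary, threshold=0.6)
print(pairs)
```

In this toy run the third source medoid scores 0.5 against both target medoids, falls below the threshold, and is left unaligned; in the workflow the description outlines, pairs that pass the threshold would then go to manual verification.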