Facing the identification problem in language-related scientific data analysis

This paper describes the problems that must be addressed when studying large amounts of data over time which require entity normalization applied not to the usual genres of news or political speech, but to the genre of academic discourse about language resources, technologies...

Bibliographic Details
Main Authors: Mariani, Joseph, J, Cieri, Christopher, Francopoulo, Gil, Paroubek, Patrick, Delaborde, Marine
Other Authors: Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris-Sud - Paris 11 (UP11)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Saclay (COmUE)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Subjects: Computer Science [cs]; Computation and Language [cs.CL]
Online Access: https://hal.science/hal-01840821
id ftsorbonneuniv:oai:HAL:hal-01840821v1
record_format openpolar
institution Open Polar
collection HAL Sorbonne Université
op_collection_id ftsorbonneuniv
language English
topic [INFO]Computer Science [cs]
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
description This paper describes the problems that must be addressed when studying large amounts of data over time which require entity normalization applied not to the usual genres of news or political speech, but to the genre of academic discourse about language resources, technologies and sciences. It reports on the normalization processes that had to be applied to produce data usable for computing statistics in three past studies on the LRE Map, the ISCA Archive and the LDC Bibliography. It shows the need for human expertise during normalization and the necessity to adapt the work to the study objectives. It investigates possible improvements for reducing the workload necessary to produce comparable results. Through this paper, we show the necessity to define and agree on international persistent and unique identifiers.
author2 Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI)
Université Paris-Sud - Paris 11 (UP11)-Sorbonne Université - UFR d'Ingénierie (UFR 919)
Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Saclay (COmUE)
format Conference Object
author Mariani, Joseph, J
Cieri, Christopher
Francopoulo, Gil
Paroubek, Patrick
Delaborde, Marine
title Facing the identification problem in language-related scientific data analysis
publisher HAL CCSD
publishDate 2014
url https://hal.science/hal-01840821
op_coverage Reykjavik, Iceland
genre Iceland
genre_facet Iceland
op_source International Conference on Language Resources and Evaluation
https://hal.science/hal-01840821
International Conference on Language Resources and Evaluation, Jan 2014, Reykjavik, Iceland
op_relation hal-01840821
https://hal.science/hal-01840821
_version_ 1781700508418834432