TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French

In this paper, we present TALC-sef, a parallel literary corpus for Serbian, English and French. The corpus includes a manually revised, POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the t...

Bibliographic Details
Main Authors: Balvet, Antonio; Stosic, Dejan; Miletic, Aleksandra
Other Authors: Savoirs, Textes, Langage (STL) - UMR 8163, Université de Lille - Centre National de la Recherche Scientifique (CNRS); Cognition, Langues, Langage, Ergonomie (CLLE-ERSS), École Pratique des Hautes Études (EPHE), Université Paris Sciences et Lettres (PSL) - Université Toulouse - Jean Jaurès (UT2J), Université de Toulouse (UT) - Université Bordeaux Montaigne (UBM) - Centre National de la Recherche Scientifique (CNRS)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
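Subjects: Multilinguality; Part-of-Speech Tagging; Aligned Corpora; [SHS.LANGUE] Humanities and Social Sciences/Linguistics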
Online Access: https://shs.hal.science/halshs-01077767
id ftunivlille:oai:HAL:halshs-01077767v1
record_format openpolar
spelling ftunivlille:oai:HAL:halshs-01077767v1 2024-06-23T07:54:01+00:00 Reykjavik, Iceland 2014-05-26 https://shs.hal.science/halshs-01077767 en eng HAL CCSD info:eu-repo/semantics/conferenceObject Conference papers 2014 ftunivlille 2024-06-10T15:15:41Z
institution Open Polar
collection LillOA (HAL Lille Open Archive, Université de Lille)
op_collection_id ftunivlille
language English
topic Multilinguality
Part-of-Speech Tagging
Aligned Corpora
[SHS.LANGUE]Humanities and Social Sciences/Linguistics
description In this paper, we present TALC-sef, a parallel literary corpus for Serbian, English and French. The corpus includes a manually revised, POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to build a reference parallel corpus in the three languages for both literary and linguistic studies. The French and English sub-corpora had been POS-tagged from the outset using TreeTagger (Schmid, 1994), but until now the corpus lacked a tagged version of the Serbian sub-corpus. Here, we first present the original parallel literary corpus, then address issues related to POS-tagging a large collection of Serbian text: the design of an appropriate tagset for Serbian, the choice of an automatic POS-tagger suited to the task, and some quantitative and qualitative results. We close with a discussion of near-term perspectives for further annotation of the whole parallel corpus.
author2 Savoirs, Textes, Langage (STL) - UMR 8163
Université de Lille - Centre National de la Recherche Scientifique (CNRS)
Cognition, Langues, Langage, Ergonomie (CLLE-ERSS)
École Pratique des Hautes Études (EPHE)
Université Paris Sciences et Lettres (PSL) - Université Toulouse - Jean Jaurès (UT2J)
Université de Toulouse (UT) - Université Bordeaux Montaigne (UBM) - Centre National de la Recherche Scientifique (CNRS)
format Conference Object
author Balvet, Antonio
Stosic, Dejan
Miletic, Aleksandra
title TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French
publisher HAL CCSD
publishDate 2014
url https://shs.hal.science/halshs-01077767
op_coverage Reykjavik, Iceland
genre Iceland
op_source Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
LREC 2014
https://shs.hal.science/halshs-01077767
LREC 2014, May 2014, Reykjavik, Iceland
op_relation halshs-01077767
https://shs.hal.science/halshs-01077767