TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French

International audience. In this paper, we present a parallel literary corpus for Serbian, English and French, the TALC-sef corpus. The corpus includes a manually revised, POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the t...


Bibliographic Details
Main Authors: Balvet, Antonio; Stosic, Dejan; Miletic, Aleksandra
Other Authors: Savoirs, Textes, Langage (STL) - UMR 8163 (STL), Université de Lille-Centre National de la Recherche Scientifique (CNRS), Cognition, Langues, Langage, Ergonomie (CLLE-ERSS), École Pratique des Hautes Études (EPHE), Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J), Université de Toulouse (UT)-Université de Toulouse (UT)-Université Bordeaux Montaigne (UBM)-Centre National de la Recherche Scientifique (CNRS)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
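Subjects: Multilinguality; Part-of-Speech Tagging; Aligned Corpora; [SHS.LANGUE] Humanities and Social Sciences/Linguistics
Online Access: https://shs.hal.science/halshs-01077767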
id ftunivbordmont:oai:HAL:halshs-01077767v1
record_format openpolar
spelling ftunivbordmont:oai:HAL:halshs-01077767v1 2024-06-23T07:54:02+00:00 Reykjavik, Iceland 2014-05-26 en eng info:eu-repo/semantics/conferenceObject Conference papers 2014 ftunivbordmont 2024-06-10T14:05:21Z
institution Open Polar
collection Archive Ouverte de l'Université Bordeaux Montaigne - HAL
op_collection_id ftunivbordmont
language English
topic Multilinguality
Part-of-Speech Tagging
Aligned Corpora
[SHS.LANGUE]Humanities and Social Sciences/Linguistics
description International audience. In this paper, we present a parallel literary corpus for Serbian, English and French, the TALC-sef corpus. The corpus includes a manually revised, POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the three languages, for both literary and linguistic studies. The French and English sub-corpora had been POS-tagged from the outset, using TreeTagger (Schmid, 1994), but until now the corpus lacked a tagged version of the Serbian sub-corpus. Here, we first present the original parallel literary corpus. We then address issues related to POS-tagging a large collection of Serbian text: the design of an appropriate tagset for Serbian, the choice of an automatic POS-tagger suited to the task, and some quantitative and qualitative results. Finally, we discuss near-term prospects for further annotation of the whole parallel corpus.
author2 Savoirs, Textes, Langage (STL) - UMR 8163 (STL)
Université de Lille-Centre National de la Recherche Scientifique (CNRS)
Cognition, Langues, Langage, Ergonomie (CLLE-ERSS)
École Pratique des Hautes Études (EPHE)
Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J)
Université de Toulouse (UT)-Université de Toulouse (UT)-Université Bordeaux Montaigne (UBM)-Centre National de la Recherche Scientifique (CNRS)
format Conference Object
author Balvet, Antonio
Stosic, Dejan
Miletic, Aleksandra
title TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French
publisher HAL CCSD
publishDate 2014
url https://shs.hal.science/halshs-01077767
op_coverage Reykjavik, Iceland
genre Iceland
op_source Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
LREC 2014
https://shs.hal.science/halshs-01077767
LREC 2014, May 2014, Reykjavik, Iceland
op_relation halshs-01077767
https://shs.hal.science/halshs-01077767
_version_ 1802645966755463168