TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French

In this paper, we present a parallel literary corpus for Serbian, English and French, the TALC-sef corpus. The corpus includes a manually-revised POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the t...


Bibliographic Details
Main Authors: Balvet, Antonio; Stosic, Dejan; Miletic, Aleksandra
Other Authors: Savoirs, Textes, Langage (STL) - UMR 8163 (STL); Université de Lille-Centre National de la Recherche Scientifique (CNRS); Cognition, Langues, Langage, Ergonomie (CLLE-ERSS); École pratique des hautes études (EPHE); Université Paris sciences et lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J)-Université Bordeaux Montaigne-Centre National de la Recherche Scientifique (CNRS)
Format: Conference Object
Language: English
Published: HAL CCSD 2014
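Subjects: Multilinguality; Part-of-Speech Tagging; Aligned Corpora; [SHS.LANGUE]Humanities and Social Sciences/Linguistics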
Online Access: https://halshs.archives-ouvertes.fr/halshs-01077767
id ftccsdartic:oai:HAL:halshs-01077767v1
record_format openpolar
spelling ftccsdartic:oai:HAL:halshs-01077767v1 2023-05-15T16:50:08+02:00 Reykjavik, Iceland 2014-05-26 en eng info:eu-repo/semantics/conferenceObject Conference papers 2014 ftccsdartic 2021-11-21T03:22:09Z
institution Open Polar
collection Archive ouverte HAL (Hyper Article en Ligne, CCSD - Centre pour la Communication Scientifique Directe)
op_collection_id ftccsdartic
language English
topic Multilinguality
Part-of-Speech Tagging
Aligned Corpora
[SHS.LANGUE]Humanities and Social Sciences/Linguistics
description International audience. In this paper, we present a parallel literary corpus for Serbian, English and French, the TALC-sef corpus. The corpus includes a manually-revised POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the three languages, both for literary and linguistic studies. The French and English sub-corpora had been POS-tagged from the outset, using TreeTagger (Schmid, 1994), but the corpus lacked, until now, a tagged version of the Serbian sub-corpus. Here, we present the original parallel literary corpus, then we address issues related to POS-tagging a large collection of Serbian text: from the conception of an appropriate tagset for Serbian, to the choice of an automatic POS-tagger adapted to the task, and then to some quantitative and qualitative results. We then discuss near-term prospects for further annotation of the whole parallel corpus.
author2 Savoirs, Textes, Langage (STL) - UMR 8163 (STL)
Université de Lille-Centre National de la Recherche Scientifique (CNRS)
Cognition, Langues, Langage, Ergonomie (CLLE-ERSS)
École pratique des hautes études (EPHE)
Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J)-Université Bordeaux Montaigne-Centre National de la Recherche Scientifique (CNRS)
format Conference Object
author Balvet, Antonio
Stosic, Dejan
Miletic, Aleksandra
title TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French
publisher HAL CCSD
publishDate 2014
url https://halshs.archives-ouvertes.fr/halshs-01077767
op_coverage Reykjavik, Iceland
genre Iceland
op_source Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
LREC 2014
https://halshs.archives-ouvertes.fr/halshs-01077767
LREC 2014, May 2014, Reykjavik, Iceland
op_relation halshs-01077767
https://halshs.archives-ouvertes.fr/halshs-01077767
_version_ 1766040311605231616