Data for "SuperSim: a test set for word similarity and relatedness in Swedish"

This repository contains the data described in SuperSim: a test set for word similarity and relatedness in Swedish (Hengchen and Tahmasebi, 2021). If you use part or whole of this resource, please cite the following work or alternatively use the bibtex entry: Hengchen, Simon and Tahmasebi, Nina, 202...


Bibliographic Details
Main Authors:	Hengchen, Simon; Tahmasebi, Nina
Format:	Dataset
Language:	Swedish
Published:	Zenodo 2021
Subjects:	Iceland
Online Access:	https://dx.doi.org/10.5281/zenodo.4660083 https://zenodo.org/record/4660083

id	ftdatacite:10.5281/zenodo.4660083
record_format	openpolar
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	Swedish
description	This repository contains the data described in "SuperSim: a test set for word similarity and relatedness in Swedish" (Hengchen and Tahmasebi, 2021). If you use part or whole of this resource, please cite the following work or alternatively use the bibtex entry:
    Hengchen, Simon and Tahmasebi, Nina, 2021. SuperSim: a test set for word similarity and relatedness in Swedish. In The 23rd Nordic Conference on Computational Linguistics (NoDaLiDa’21).
    @inproceedings{hengchen-tahmasebi-2021-supersim,
      title = "{SuperSim:} a test set for word similarity and relatedness in {Swedish}",
      author = "Hengchen, Simon and Tahmasebi, Nina",
      booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
      month = may # "{--}" # jun,
      year = "2021",
      address = "Reykjavik, Iceland, and Online",
      publisher = {Link{\"o}ping University Electronic Press},
    }
    The data contained in this repository is as follows:
    The code folder contains:
      main.py
      utils.py
      train_base_models.py
      perl-clean.pl
      requirements.txt
    The data folder contains:
      gold_relatedness.tsv: all relatedness judgments from all annotators, as well as the mean
      gold_similarity.tsv: all similarity judgments from all annotators, as well as the mean
    models contains baseline models:
      Trained on the Swedish Gigaword:
        FastText: gigaword_sv.ft (and gigaword_sv.ft.trainables.syn1neg.npy, gigaword_sv.ft.trainables.vectors_ngrams_lockf.npy, gigaword_sv.ft.trainables.vectors_vocab_lockf.npy, gigaword_sv.ft.wv.vectors_ngrams.npy, gigaword_sv.ft.wv.vectors_vocab.npy, gigaword_sv.ft.wv.vectors.npy)
        Word2Vec: gigaword_sv.w2v (and gigaword_sv.w2v.trainables.syn1neg.npy, gigaword_sv.w2v.wv.vectors.npy)
        GloVe: glove_vectors_giga.txt and glove_vocab_giga.txt
      Trained on Swedish Wikipedia:
        FastText: wiki_sv.ft (and wiki_sv.ft.trainables.syn1neg.npy, wiki_sv.ft.trainables.vectors_ngrams_lockf.npy, wiki_sv.ft.trainables.vectors_vocab_lockf.npy, wiki_sv.ft.wv.vectors_ngrams.npy, wiki_sv.ft.wv.vectors.npy, wiki_sv.ft.wv.vectors_vocab.npy)
        Word2Vec: wiki_sv.w2v (and wiki_sv.w2v.trainables.syn1neg.npy, wiki_sv.w2v.wv.vectors.npy)
        GloVe: glove_vectors_WIKI.txt and glove_vocab_WIKI.txt
    corpora:
      The Swedish Gigaword corpus can be downloaded, along with code, from: https://spraakbanken.gu.se/en/resources/gigaword. We created our corpus with python extract_bw.py --mode plain outfile.txt.
      sv_wiki.gensim is a cleaned Swedish Wikipedia dump from 2020/10/20 (originally svwiki-20201020-pages-articles.xml) and one of our baseline corpora.
    Details on annotation procedures are available in the paper.
    Acknowledgments: This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection supported by the Swedish Research Council (2019-2022; dnr 2018-01184), and Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018-2024; dnr 2017-00626) and its ten partner institutions.
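The description lists gold-standard files with a mean judgment per word pair, plus baseline embedding models. As a rough illustration of how such a test set is commonly used, the sketch below correlates gold means with cosine similarities from a set of word vectors using Spearman's rank correlation. This is a general convention for word-similarity benchmarks and is not taken from the record, which does not state the evaluation procedure in main.py. The column names (word1, word2, mean), the toy word pairs, and the two-dimensional vectors are all hypothetical; the real header of gold_similarity.tsv should be checked before use.

```python
import csv
import io
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def read_gold(handle, w1="word1", w2="word2", score="mean"):
    """Read (word, word, mean score) rows from a TSV.

    The column names are assumptions, not the documented header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row[w1], row[w2], float(row[score])) for row in reader]


def evaluate(gold, vectors):
    """Return (Spearman's rho, number of pairs used).

    Pairs with a word missing from `vectors` are skipped.
    """
    kept = [(a, b, s) for a, b, s in gold if a in vectors and b in vectors]
    model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in kept]
    return spearman([s for _, _, s in kept], model_scores), len(kept)


# Toy stand-ins for the gold file and for a trained model (both invented).
toy_tsv = (
    "word1\tword2\tmean\n"
    "katt\thund\t7.5\n"
    "katt\tbil\t1.0\n"
    "bil\tbuss\t8.0\n"
    "hund\tbuss\t0.5\n"
)
toy_vectors = {
    "katt": [1.0, 0.1],
    "hund": [0.9, 0.2],
    "bil": [0.1, 1.0],
    "buss": [0.3, 0.9],
}

gold = read_gold(io.StringIO(toy_tsv))
rho, n_pairs = evaluate(gold, toy_vectors)
print(f"Spearman rho = {rho:.3f} over {n_pairs} pairs")
```

With a real model, `toy_vectors` would be replaced by a word-to-vector lookup loaded from one of the listed model files, and `io.StringIO(toy_tsv)` by the opened gold file. Reporting the number of pairs actually used alongside the correlation makes out-of-vocabulary losses visible.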
format	Dataset
author	Hengchen, Simon; Tahmasebi, Nina
spellingShingle	Hengchen, Simon Tahmasebi, Nina Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
author_facet	Hengchen, Simon; Tahmasebi, Nina
author_sort	Hengchen, Simon
title	Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
title_short	Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
title_full	Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
title_fullStr	Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
title_full_unstemmed	Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
title_sort	data for "supersim: a test set for word similarity and relatedness in swedish"
publisher	Zenodo
publishDate	2021
url	https://dx.doi.org/10.5281/zenodo.4660083 https://zenodo.org/record/4660083
genre	Iceland
genre_facet	Iceland
op_relation	https://dx.doi.org/10.5281/zenodo.4660084
op_rights	Open Access Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 info:eu-repo/semantics/openAccess
op_rightsnorm	CC-BY
op_doi	https://doi.org/10.5281/zenodo.4660083 https://doi.org/10.5281/zenodo.4660084
_version_	1766042848537346048


Similar Items