Data for "SuperSim: a test set for word similarity and relatedness in Swedish"

This repository contains the data described in SuperSim: a test set for word similarity and relatedness in Swedish (Hengchen and Tahmasebi, 2021). If you use part or whole of this resource, please cite the following work or alternatively use the bibtex entry: Hengchen, Simon and Tahmasebi, Nina, 202...


Bibliographic Details
Main Authors: Hengchen, Simon; Tahmasebi, Nina
Format: Dataset
Language: Swedish
Published: Zenodo 2021
Online Access: https://dx.doi.org/10.5281/zenodo.4660083
https://zenodo.org/record/4660083
id ftdatacite:10.5281/zenodo.4660083
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language Swedish
description This repository contains the data described in SuperSim: a test set for word similarity and relatedness in Swedish (Hengchen and Tahmasebi, 2021). If you use part or whole of this resource, please cite the following work or alternatively use the bibtex entry:

Hengchen, Simon and Tahmasebi, Nina, 2021. SuperSim: a test set for word similarity and relatedness in Swedish. In The 23rd Nordic Conference on Computational Linguistics (NoDaLiDa’21).

@inproceedings{hengchen-tahmasebi-2021-supersim,
  title = "{SuperSim:} a test set for word similarity and relatedness in {Swedish}",
  author = "Hengchen, Simon and Tahmasebi, Nina",
  booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
  month = may # "{--}" # jun,
  year = "2021",
  address = "Reykjavik, Iceland, and Online",
  publisher = {Link{\"o}ping University Electronic Press},
}

The data contained in this repository is as follows:

The code folder contains:
  main.py
  utils.py
  train_base_models.py
  perl-clean.pl
  requirements.txt

The data folder contains:
  gold_relatedness.tsv: all relatedness judgments from all annotators, as well as the mean
  gold_similarity.tsv: all similarity judgments from all annotators, as well as the mean

models contains baseline models:
  Trained on the Swedish Gigaword:
    FastText: gigaword_sv.ft (and gigaword_sv.ft.trainables.syn1neg.npy, gigaword_sv.ft.trainables.vectors_ngrams_lockf.npy, gigaword_sv.ft.trainables.vectors_vocab_lockf.npy, gigaword_sv.ft.wv.vectors_ngrams.npy, gigaword_sv.ft.wv.vectors_vocab.npy, gigaword_sv.ft.wv.vectors.npy)
    Word2Vec: gigaword_sv.w2v (and gigaword_sv.w2v.trainables.syn1neg.npy, gigaword_sv.w2v.wv.vectors.npy)
    GloVe: glove_vectors_giga.txt and glove_vocab_giga.txt
  Trained on Swedish Wikipedia:
    FastText: wiki_sv.ft (and wiki_sv.ft.trainables.syn1neg.npy, wiki_sv.ft.trainables.vectors_ngrams_lockf.npy, wiki_sv.ft.trainables.vectors_vocab_lockf.npy, wiki_sv.ft.wv.vectors_ngrams.npy, wiki_sv.ft.wv.vectors.npy, wiki_sv.ft.wv.vectors_vocab.npy)
    Word2Vec: wiki_sv.w2v (and wiki_sv.w2v.trainables.syn1neg.npy, wiki_sv.w2v.wv.vectors.npy)
    GloVe: glove_vectors_WIKI.txt and glove_vocab_WIKI.txt

corpora:
  The Swedish Gigaword corpus can be downloaded, along with code, from: https://spraakbanken.gu.se/en/resources/gigaword. We created our corpus with python extract_bw.py --mode plain outfile.txt.
  sv_wiki.gensim is a cleaned Swedish Wikipedia dump from 2020/10/20 (originally svwiki-20201020-pages-articles.xml) and one of our baseline corpora.

Details on annotation procedures are available in the paper.

Acknowledgments: This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection supported by the Swedish Research Council (2019-2022; dnr 2018-01184), and Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018-2024; dnr 2017-00626) and its ten partner institutions.
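
The gold files described above give a mean annotator judgment per word pair. Test sets of this kind are commonly used by rank-correlating those means with cosine similarities from a word-embedding model. The following is a minimal sketch of that idea, not the repository's main.py. It rests on several assumptions: the column names word1, word2 and mean are placeholders (the actual header of gold_similarity.tsv or gold_relatedness.tsv should be checked), the toy vectors and scores are invented for illustration, and Spearman correlation is computed by hand so the example needs only the standard library.

```python
import csv
import io
import math


def read_gold(handle, word1_col="word1", word2_col="word2", mean_col="mean"):
    """Read (word1, word2, mean score) triples from a tab-separated file.

    The default column names are placeholders (an assumption); check the
    real header of the gold file and pass the matching names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row[word1_col], row[word2_col], float(row[mean_col])) for row in reader]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, vectors):
    """Correlate gold means with cosine similarities, skipping pairs with
    out-of-vocabulary words. Returns (correlation, number of pairs used)."""
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, predicted), len(gold)


# Invented toy data standing in for a gold file and a trained model.
toy_tsv = (
    "word1\tword2\tmean\n"
    "katt\thund\t7.5\n"
    "katt\tbil\t1.0\n"
    "bil\tbuss\t8.0\n"
    "katt\tsaknas\t3.0\n"  # "saknas" has no vector, so this pair is skipped
)
toy_vectors = {
    "katt": [1.0, 0.1],
    "hund": [0.9, 0.2],
    "bil": [0.1, 1.0],
    "buss": [0.0, 1.0],
}

pairs = read_gold(io.StringIO(toy_tsv))
rho, n_pairs = evaluate(pairs, toy_vectors)
print(f"Spearman correlation over {n_pairs} of {len(pairs)} pairs: {rho:.3f}")
```

To run this against the real data, the io.StringIO object would be replaced by an opened gold file, and toy_vectors by a mapping from words to vectors taken from one of the baseline models. How out-of-vocabulary pairs are treated (skipped here) affects the score and should be reported alongside it.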
format Dataset
author Hengchen, Simon
Tahmasebi, Nina
title Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
publisher Zenodo
publishDate 2021
url https://dx.doi.org/10.5281/zenodo.4660083
https://zenodo.org/record/4660083
genre Iceland
genre_facet Iceland
op_relation https://dx.doi.org/10.5281/zenodo.4660084
op_rights Open Access
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
cc-by-4.0
info:eu-repo/semantics/openAccess
op_rightsnorm CC-BY
op_doi https://doi.org/10.5281/zenodo.4660083
https://doi.org/10.5281/zenodo.4660084
_version_ 1766042848537346048