Data for "SuperSim: a test set for word similarity and relatedness in Swedish"

This repository contains the data described in "SuperSim: a test set for word similarity and relatedness in Swedish" (Hengchen and Tahmasebi, 2021). If you use part or whole of this resource, please cite the following work: Hengchen, Simon and Tahmasebi, Nina, 2021. SuperSim: a test set for word similarity and relatedness in Swedish. In The 23rd Nordic Conference on Computational Linguistics (NoDaLiDa'21). A BibTeX entry is given in the full description below.

Full description

Bibliographic Details
Main Authors: Hengchen, Simon, Tahmasebi, Nina
Format: Dataset
Language: Swedish
Published: Zenodo, 2021
Online Access: https://dx.doi.org/10.5281/zenodo.4660083
https://zenodo.org/record/4660083
Description
Summary: If you use part or whole of this resource, please cite the following work: Hengchen, Simon and Tahmasebi, Nina, 2021. SuperSim: a test set for word similarity and relatedness in Swedish. In The 23rd Nordic Conference on Computational Linguistics (NoDaLiDa'21). Alternatively, use the BibTeX entry:

@inproceedings{hengchen-tahmasebi-2021-supersim,
  title = "{SuperSim}: a test set for word similarity and relatedness in {Swedish}",
  author = "Hengchen, Simon and Tahmasebi, Nina",
  booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
  month = may # "{--}" # jun,
  year = "2021",
  address = "Reykjavik, Iceland, and Online",
  publisher = {Link{\"o}ping University Electronic Press},
}

The data contained in this repository is as follows.

The code folder contains:
- main.py
- utils.py
- train_base_models.py
- perl-clean.pl
- requirements.txt

The data folder contains:
- gold_relatedness.tsv: all relatedness judgments from all annotators, as well as the mean
- gold_similarity.tsv: all similarity judgments from all annotators, as well as the mean

The models folder contains baseline models.

Trained on the Swedish Gigaword corpus:
- FastText: gigaword_sv.ft (and gigaword_sv.ft.trainables.syn1neg.npy, gigaword_sv.ft.trainables.vectors_ngrams_lockf.npy, gigaword_sv.ft.trainables.vectors_vocab_lockf.npy, gigaword_sv.ft.wv.vectors_ngrams.npy, gigaword_sv.ft.wv.vectors_vocab.npy, gigaword_sv.ft.wv.vectors.npy)
- Word2Vec: gigaword_sv.w2v (and gigaword_sv.w2v.trainables.syn1neg.npy, gigaword_sv.w2v.wv.vectors.npy)
- GloVe: glove_vectors_giga.txt and glove_vocab_giga.txt

Trained on Swedish Wikipedia:
- FastText: wiki_sv.ft (and wiki_sv.ft.trainables.syn1neg.npy, wiki_sv.ft.trainables.vectors_ngrams_lockf.npy, wiki_sv.ft.trainables.vectors_vocab_lockf.npy, wiki_sv.ft.wv.vectors_ngrams.npy, wiki_sv.ft.wv.vectors.npy, wiki_sv.ft.wv.vectors_vocab.npy)
- Word2Vec: wiki_sv.w2v (and wiki_sv.w2v.trainables.syn1neg.npy, wiki_sv.w2v.wv.vectors.npy)
- GloVe: glove_vectors_WIKI.txt and glove_vocab_WIKI.txt

The corpora folder: the Swedish Gigaword corpus can be downloaded, along with code, from https://spraakbanken.gu.se/en/resources/gigaword. We created our corpus with python extract_bw.py --mode plain outfile.txt. sv_wiki.gensim is a cleaned Swedish Wikipedia dump from 2020/10/20 (originally svwiki-20201020-pages-articles.xml) and one of our baseline corpora.

Details on annotation procedures are available in the paper.

Acknowledgments: This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection, supported by the Swedish Research Council (2019--2022; dnr 2018-01184), and by Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018--2024; dnr 2017-00626) and its ten partner institutions.
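The GloVe baselines are shipped as plain-text vector and vocabulary files (e.g. glove_vectors_giga.txt / glove_vocab_giga.txt). Assuming the common "word v1 v2 ..." one-line-per-word layout (an assumption about the file format; check the files themselves), a minimal pure-Python loader plus cosine similarity might look like this, with made-up toy vectors standing in for the real files:

```python
import math

def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical toy lines standing in for glove_vectors_giga.txt:
sample = ["hund 0.1 0.3 0.5", "katt 0.2 0.3 0.4", "bil 0.9 -0.1 0.0"]
vecs = load_glove(sample)
score = cosine(vecs["hund"], vecs["katt"])
```

In practice one would read the real vector file line by line instead of the inline sample; the gensim models (.ft, .w2v) are loaded with gensim's own load methods rather than a parser like this.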
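Test sets of this kind are typically scored by the Spearman rank correlation between the gold mean judgments and a model's similarity scores for the same word pairs. A minimal stdlib-only sketch, assuming the mean sits in the last column of the gold TSVs (an assumption about the column layout) and using made-up numbers in place of real judgments and model scores:

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical TSV rows in the assumed layout (word1, word2, ..., mean):
rows = ["hund\tkatt\t8.2", "hund\tbil\t2.1", "katt\tbil\t1.9"]
gold = [float(r.split("\t")[-1]) for r in rows]
model = [0.81, 0.20, 0.25]  # e.g. cosine similarities from a baseline model
rho = spearman(gold, model)  # -> 0.5 on this toy data
```

scipy.stats.spearmanr computes the same statistic; the hand-rolled version here only avoids the dependency.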