The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment

The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description data set. We randomly selected a subset of sentence pairs from each of these sources and we applied a 3-step genera...


Bibliographic Details
Main Authors:	Marelli, Marco; Menini, Stefano; Baroni, Marco; Bentivogli, Luisa; Bernardi, Raffaella; Zamparelli, Roberto
Format:	Dataset
Language:	English
Published:	Zenodo 2014
Subjects:	computational linguistics, entailment, sentence similarity, sentence relatedness, compositional semantics, distributional semantics, Iceland
Online Access:	https://dx.doi.org/10.5281/zenodo.2787612 https://zenodo.org/record/2787612

id	ftdatacite:10.5281/zenodo.2787612
record_format	openpolar
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	English
topic	computational linguistics, entailment, sentence similarity, sentence relatedness, compositional semantics, distributional semantics
description	The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description data set. We randomly selected a subset of sentence pairs from each of these sources and applied a 3-step generation process: first, the original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were then expanded to obtain up to three new sentences with specific characteristics suitable for CDSM (compositional distributional semantic model) evaluation; as a last step, all the sentences generated in the expansion phase were paired with the normalized sentences to obtain the final data set. Each sentence pair was annotated for relatedness and entailment by means of crowdsourcing techniques. The sentence relatedness score (on a 5-point rating scale) provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences; the categorization in terms of the entailment relation between the two sentences (with entailment, contradiction, and neutral as gold labels) is also a crucial aspect to consider, since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system. In the final set, gold scores for relatedness and entailment were distributed as follows: the relatedness scoring resulted in 923 pairs within the [1,2) range, 1373 pairs within the [2,3) range, 3872 pairs within the [3,4) range, and 3672 pairs within the [4,5] range; the entailment annotation led to 5595 neutral pairs, 1424 contradiction pairs, and 2821 entailment pairs. Files: SICK.zip (main file); SICK_Annotated.zip (a version of the data set annotated for the expansion rule used in each case); SICK_subsets.zip (indexes specifying further classifications, used in the JLRE 2016 publication). References: L. Bentivogli, R. Bernardi, M. Marelli, S. Menini, M. Baroni and R. Zamparelli (2016). SICK Through the SemEval Glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Journal of Language Resources and Evaluation, 50(1), 95-124. M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi and R. Zamparelli (2014). A SICK cure for the evaluation of compositional distributional semantic models. Proceedings of LREC 2014, Reykjavik (Iceland): ELRA, 216-223.
format	Dataset
author	Marelli, Marco; Menini, Stefano; Baroni, Marco; Bentivogli, Luisa; Bernardi, Raffaella; Zamparelli, Roberto
author_sort	Marelli, Marco
title	The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment
title_sort	sick (sentences involving compositional knowledge) dataset for relatedness and entailment
publisher	Zenodo
publishDate	2014
url	https://dx.doi.org/10.5281/zenodo.2787612 https://zenodo.org/record/2787612
genre	Iceland
op_relation	https://dx.doi.org/10.5281/zenodo.2787611
op_rights	Open Access Creative Commons Attribution Non Commercial Share Alike 3.0 Unported https://creativecommons.org/licenses/by-nc-sa/3.0/legalcode cc-by-nc-sa-3.0 info:eu-repo/semantics/openAccess
op_rightsnorm	CC-BY-NC-SA
op_doi	https://doi.org/10.5281/zenodo.2787612 https://doi.org/10.5281/zenodo.2787611
_version_	1766043569219436544
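
Note on the score ranges in the description field: the relatedness ranges are half-open except the last, so a score of exactly 4.0 falls in [4,5] and 5.0 is included there too. The following is a minimal Python sketch of that binning, together with a consistency check on the published counts. The counts are copied from the description; the names relatedness_bin, tally and shares are hypothetical helpers introduced here for illustration and are not part of the SICK distribution. Both breakdowns sum to 9840 pairs, which matches the "about 10,000" stated in the description.

```python
from collections import Counter

# Counts copied from the description field of this record.
RELATEDNESS_COUNTS = {"[1,2)": 923, "[2,3)": 1373, "[3,4)": 3872, "[4,5]": 3672}
ENTAILMENT_COUNTS = {"neutral": 5595, "contradiction": 1424, "entailment": 2821}


def relatedness_bin(score: float) -> str:
    """Map a 1-5 relatedness score to the ranges used in the description."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    if score < 2.0:
        return "[1,2)"
    if score < 3.0:
        return "[2,3)"
    if score < 4.0:
        return "[3,4)"
    return "[4,5]"  # the top range is closed, so 5.0 lands here


def tally(scores):
    """Count how many scores fall in each range (hypothetical helper)."""
    return Counter(relatedness_bin(s) for s in scores)


def shares(counts):
    """Express each count as a percentage of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Both annotations cover the same set of pairs, so the totals must agree.
total_pairs = sum(RELATEDNESS_COUNTS.values())
assert total_pairs == sum(ENTAILMENT_COUNTS.values()) == 9840

if __name__ == "__main__":
    print(total_pairs)
    print(shares(RELATEDNESS_COUNTS))
    print(shares(ENTAILMENT_COUNTS))
```

To apply tally to the actual data, the relatedness scores would first have to be read from the main file in SICK.zip; the file layout is not described in this record, so that step is left out.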
