
	STS Benchmark: Main English dataset

	Semantic Textual Similarity 2012-2017 Dataset

	http://ixa2.si.ehu.eus/stswiki


	STS Benchmark comprises a selection of the English datasets used in
	the STS tasks organized by us in the context of SemEval between 2012
	and 2017.

	To provide a standard benchmark for comparing systems, we organized
	the data into train, development and test parts. The development part
	can be used to develop systems and tune their hyperparameters, and
	the test part should be used only once, for the final system.

	The benchmark comprises 8628 sentence pairs. This is the breakdown
	according to genres and train-dev-test splits:

	          train    dev   test  total
	------------------------------------
	news       3299    500    500   4299
	caption    2000    625    625   3250
	forum       450    375    254   1079
	------------------------------------
	total      5749   1500   1379   8628

	For reference, this is the breakdown according to the original names
	and task years of the datasets:

	genre     file            years    train   dev  test
	----------------------------------------------------
	news      MSRpar          2012      1000   250   250
	news      headlines       2013-16   1999   250   250
	news      deft-news       2014       300     0     0
	captions  MSRvid          2012      1000   250   250
	captions  images          2014-15   1000   250   250
	captions  track5.en-en    2017         0   125   125
	forum     deft-forum      2014       450     0     0
	forum     answers-forums  2015         0   375     0
	forum     answer-answer   2016         0     0   254
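
	As a sanity check, the per-file counts can be summed and compared
	against the totals in the train-dev-test breakdown. The sketch below
	is illustrative only: the counts are copied by hand from the per-file
	table, and the names ROWS, split_totals and genre_totals are made up
	for this example.

```python
# Per-file pair counts copied from the per-file table:
# (genre, file, train, dev, test).
ROWS = [
    ("news", "MSRpar", 1000, 250, 250),
    ("news", "headlines", 1999, 250, 250),
    ("news", "deft-news", 300, 0, 0),
    ("captions", "MSRvid", 1000, 250, 250),
    ("captions", "images", 1000, 250, 250),
    ("captions", "track5.en-en", 0, 125, 125),
    ("forum", "deft-forum", 450, 0, 0),
    ("forum", "answers-forums", 0, 375, 0),
    ("forum", "answer-answer", 0, 0, 254),
]


def split_totals(rows):
    """Sum the train, dev and test columns over all files."""
    train = sum(r[2] for r in rows)
    dev = sum(r[3] for r in rows)
    test = sum(r[4] for r in rows)
    return train, dev, test


def genre_totals(rows):
    """Sum train + dev + test per genre."""
    totals = {}
    for genre, _file, train, dev, test in rows:
        totals[genre] = totals.get(genre, 0) + train + dev + test
    return totals


print(split_totals(ROWS))       # (5749, 1500, 1379)
print(genre_totals(ROWS))       # {'news': 4299, 'captions': 3250, 'forum': 1079}
print(sum(split_totals(ROWS)))  # 8628
```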

	In addition to the standard benchmark, we also include other datasets
	(see readme.txt in "companion" directory).


	Introduction
	------------

	Given two sentences of text, s1 and s2, the systems need to compute
	how similar s1 and s2 are, returning a similarity score between 0 and
	5. The dataset comprises naturally occurring pairs of sentences drawn
	from several domains and genres, annotated by crowdsourcing. See
	papers by Agirre et al. (2012; 2013; 2014; 2015; 2016; 2017).

	Format
	------

	Each file is encoded in utf-8 (a superset of ASCII), and has the
	following tab separated fields:

	genre filename year score sentence1 sentence2

	Optionally, there may be some license-related fields after sentence2.
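
	A minimal Python sketch of reading this layout follows. It assumes
	the field order is exactly as listed above; if your copy of the files
	has additional columns, adjust FIELDS accordingly. The names FIELDS,
	parse_line and read_pairs, and the example line, are invented for
	illustration and are not part of the distribution.

```python
from typing import Any, Dict, Iterator, List

# Field order as documented above (an assumption about your copy of the data).
FIELDS = ["genre", "filename", "year", "score", "sentence1", "sentence2"]


def parse_line(line: str, fields: List[str] = FIELDS) -> Dict[str, Any]:
    """Split one tab-separated record into a dict keyed by field name."""
    # Split on tabs directly: sentences may contain quote characters,
    # which a CSV reader with quoting enabled could misinterpret.
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < len(fields):
        raise ValueError(
            f"expected at least {len(fields)} fields, got {len(parts)}"
        )
    record: Dict[str, Any] = dict(zip(fields, parts))
    record["score"] = float(record["score"])
    # Anything after sentence2 (optional license-related fields) is kept as-is.
    record["extra"] = parts[len(fields):]
    return record


def read_pairs(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a utf-8 file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


# Invented example line, not taken from the dataset.
example = (
    "news\theadlines\t2015\t3.5\t"
    "A man plays guitar.\tA man is playing a guitar.\tsome-license"
)
record = parse_line(example)
print(record["score"], record["extra"])
```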

	NOTE: Because some sentence pairs have been reused here and
	elsewhere, the following datasets should NOT be used to develop or
	train systems (see below for more details on datasets):

	- Any of the datasets in SemEval STS competitions, including SemEval
	2014 task 1 (also known as SICK).
	- The test part of MSR-Paraphrase (development and train are fine).
	- The text of the videos in MSR-Video.


	Evaluation script
	-----------------

	The official evaluation metric is the Pearson correlation
	coefficient. Given a file called sys.txt containing the system scores
	(one per line), you can run the evaluation script as follows:

	$ perl correlation.pl sts-dev.txt sys.txt
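
	For readers who want to see the metric itself, here is a small Python
	sketch of the Pearson correlation coefficient. It is not the official
	correlation.pl script and is not guaranteed to match its output
	digit for digit; the function names pearson and read_scores are
	hypothetical.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two score pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def read_scores(path: str) -> List[float]:
    """Read one score per line, as in the sys.txt file described above."""
    with open(path, encoding="utf-8") as handle:
        return [float(line) for line in handle if line.strip()]


# Toy values, not real system output.
gold = [0.0, 2.5, 5.0]
system = [1.0, 2.0, 3.0]
print(pearson(gold, system))  # 1.0
```

	Because the coefficient is invariant to positive linear rescaling, a
	system does not need to output scores on the 0 to 5 scale to be
	evaluated this way.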


	Other
	-----

	Please check http://ixa2.si.ehu.eus/stswiki

	We recommend that interested researchers join the (low traffic)
	mailing list:

	http://groups.google.com/group/STS-semeval

	Notes on datasets and licenses
	------------------------------

	If using this data in your research please cite (Agirre et al. 2017)
	and the STS website: http://ixa2.si.ehu.eus/stswiki.

	Please see LICENSE.txt


	Organizers of tasks by year
	---------------------------

	2012 Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre

	2013 Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre,
	Weiwei Guo

	2014 Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
	Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau,
	Janyce Wiebe

	2015 Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
	Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse
	Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce
	Wiebe

	2016 Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor
	Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce
	Wiebe

	2017 Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia
	Specia


	References
	----------

	Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre. Task 6: A
	Pilot on Semantic Textual Similarity. Proceedings of SemEval 2012.

	Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei
	Guo. *SEM 2013 shared task: Semantic Textual
	Similarity. Proceedings of *SEM 2013.

	Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
	Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau,
	Janyce Wiebe. Task 10: Multilingual Semantic Textual
	Similarity. Proceedings of SemEval 2014.

	Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
	Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse
	Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce
	Wiebe. Task 2: Semantic Textual Similarity, English, Spanish and
	Pilot on Interpretability. Proceedings of SemEval 2015.

	Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor
	Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce
	Wiebe. SemEval-2016 Task 1: Semantic Textual Similarity,
	Monolingual and Cross-Lingual Evaluation. Proceedings of SemEval
	2016.

	Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia
	Specia. SemEval-2017 Task 1: Semantic Textual Similarity
	Multilingual and Crosslingual Focused Evaluation. Proceedings of
	SemEval 2017.

	Clive Best, Erik van der Goot, Ken Blackler, Teófilo Garcia, and David
	Horby. 2005. Europe media monitor - system description. In EUR
	Report 22173-En, Ispra, Italy.

	Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier.
	Collecting Image Annotations Using Amazon's Mechanical Turk. In
	Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and
	Language Data with Amazon's Mechanical Turk.