The Stanford Natural Language Inference (SNLI) Corpus

Creators:

Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D.

Publication Date:

2015

Data Category:

Dataset Description:

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. It is 0.09 GB in size and consists of training, validation, and test sets. The variables contained in each of these subsets are described below. The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind. The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it: Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Publications Citing This Dataset:

Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
https://doi.org/10.18653/v1/N19-1423
Liu, Pengfei; Yuan, Weizhe; Fu, Jinlan; Jiang, Zhengbao; Hayashi, Hiroaki; Neubig, Graham (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys, 55(9), 1-35.
https://doi.org/10.1145/3560815

Variables:

Name	Description
gold_label	This is the label chosen by the majority of annotators. Where no majority exists, this is '-', and the pair should not be included when evaluating hard classification accuracy.
sentence1_binary_parse	The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.
sentence2_binary_parse	The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.
sentence1_parse	The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.
sentence2_parse	The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.
sentence1	The premise caption that was supplied to the author of the pair.
sentence2	The hypothesis caption that was written by the author of the pair.
captionID	A unique identifier for each sentence1 from the original Flickr30k example.
pairID	A unique identifier for each sentence1--sentence2 pair.
label1	These are all of the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.
label2	These are all of the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.
label3	These are all of the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.
label4	These are all of the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.
label5	These are all of the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.
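
The relationship between gold_label and label1 through label5 can be illustrated with a short Python sketch. It is not the data providers' code. It assumes the data is read as tab-separated text with the column names listed above, and it assumes that "majority" means a strict majority of the non-blank annotator labels; the exact rule used to build the corpus is not stated here. The sample rows are made up for illustration and include only a subset of the columns.

```python
import csv
import io
from collections import Counter

LABEL_COLUMNS = ["label1", "label2", "label3", "label4", "label5"]
NO_MAJORITY = "-"


def majority_label(row):
    """Return the label held by a strict majority of non-blank annotator
    labels, or '-' when no such label exists (assumed rule, see lead-in)."""
    votes = [(row.get(col) or "").strip() for col in LABEL_COLUMNS]
    votes = [v for v in votes if v]  # blank means the annotator could not label
    if not votes:
        return NO_MAJORITY
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else NO_MAJORITY


def evaluation_rows(rows):
    """Drop pairs whose gold_label is '-' before computing hard accuracy."""
    return [row for row in rows if row["gold_label"] != NO_MAJORITY]


# Made-up sample rows (not taken from the corpus), subset of columns only.
HEADER = ["gold_label", "sentence1", "sentence2", "pairID"] + LABEL_COLUMNS
SAMPLE_ROWS = [
    ["entailment", "A man plays guitar.", "A person makes music.", "p1",
     "entailment", "entailment", "neutral", "entailment", "entailment"],
    ["-", "A dog runs.", "A dog chases a ball.", "p2",
     "neutral", "entailment", "contradiction", "neutral", "entailment"],
    ["contradiction", "A woman sleeps.", "A woman is jogging.", "p3",
     "contradiction", "", "", "", ""],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + SAMPLE_ROWS)

# QUOTE_NONE keeps quotation marks inside sentences from being treated as CSV quoting.
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t",
                           quoting=csv.QUOTE_NONE))

recomputed = [majority_label(row) for row in rows]
usable = evaluation_rows(rows)
```

In this sample the second pair has a 2-2-1 split, so it has no majority and is excluded from evaluation. The third pair has only the phase 1 author's label, which becomes its label under the assumed rule.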

Details:
