
ConceptNet

Creators: ConceptNet
Publication Date: 2021

ConceptNet aims to give computers access to common-sense knowledge, the kind of information that ordinary people know but usually leave unstated. ConceptNet is a semantic network that represents things that computers should know about the world, especially for the purpose of understanding text written by people. Its “concepts” are represented using words and phrases from many different natural languages; unlike similar projects, it is not limited to a single language such as English. It expresses over 13 million links between these concepts and makes the whole data set available under a Creative Commons license. ConceptNet is structured as a graph, where nodes represent concepts (words or phrases) and edges represent the relationships between these concepts. Each edge is labeled with a relation type, such as “IsA,” “PartOf,” or “RelatedTo,” indicating the nature of the relationship. The dataset is organized into sub-datasets based on language and relation type, allowing users to work with specific subsets relevant to their applications.
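
As a rough sketch of how these labeled edges might be consumed programmatically, the example below queries the public ConceptNet API (api.conceptnet.io) for a single concept and prints its relations. The response layout assumed here (an "edges" list whose items carry "rel", "start", and "end" objects with "label" fields) should be verified against the live API; the bulk download uses a different, tab-separated layout.

    import requests

    # Look up edges for the English concept "book" via the public ConceptNet API.
    # Assumption: the response contains an "edges" list whose items have "rel",
    # "start", and "end" objects, each with a "label" field.
    response = requests.get("https://api.conceptnet.io/c/en/book")
    response.raise_for_status()

    for edge in response.json().get("edges", []):
        rel = edge["rel"]["label"]      # e.g. IsA, PartOf, RelatedTo
        start = edge["start"]["label"]
        end = edge["end"]["label"]
        print(f"{start} --{rel}--> {end}")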

MUStARD: Multimodal Sarcasm Detection Dataset

Creators: Castro, Santiago; Hazarika, Devamanyu; Pérez-Rosas, Verónica; Zimmermann, Roger; Mihalcea, Rada; Poria, Soujanya
Publication Date: 2019

We release the MUStARD dataset, a multimodal video corpus for research in automated sarcasm discovery. The dataset is compiled from popular TV shows including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each entry in the dataset combines textual transcripts, audio signals, and visual cues, enabling comprehensive analysis of sarcasm as it manifests across different channels. Beyond isolated utterances, the dataset includes the preceding conversational context, providing insights into how prior dialogue influences the interpretation of sarcasm. The dataset was compiled and released in 2019 and is approximately 11.9 kB in size.

Key numbers:

  • Total Utterances: 690

  • Sarcastic Utterances: 345

  • Non-Sarcastic Utterances: 345

The dataset is organized in JSON format with the following fields (a short loading sketch follows the list):

  • utterance: The text of the target utterance to classify.

  • speaker: Speaker of the target utterance.

  • context: List of utterances (in chronological order) preceding the target utterance.

  • context_speakers: Respective speakers of the context utterances.

  • sarcasm: Binary label indicating sarcasm (1 for sarcastic, 0 for non-sarcastic).
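
A minimal loading sketch, assuming the annotations ship as a single JSON file ("sarcasm_data.json") that maps utterance IDs to records with the fields above; the file name and top-level structure are assumptions to check against the actual release.

    import json

    # Read the MUStARD annotation file and print one summary line per utterance.
    # Assumption: the JSON maps utterance IDs to records with the fields listed above.
    with open("sarcasm_data.json", encoding="utf-8") as f:
        data = json.load(f)

    for utt_id, entry in data.items():
        label = "sarcastic" if entry["sarcasm"] else "non-sarcastic"
        n_context = len(entry["context"])
        print(f"{utt_id}: {entry['speaker']} ({label}, {n_context} context turns)")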

Twitter US Airline Sentiment

Creators: Makone, Ashutosh
Publication Date: 2016

The Twitter US Airline Sentiment dataset is a collection of tweets aimed at analyzing public sentiment toward major U.S. airlines. Compiled in February 2015, the dataset consists of 14,640 tweets directed at several U.S. airlines. It serves as a valuable resource for sentiment analysis and natural language processing research, particularly in understanding customer satisfaction, airline service quality, and issues reported by travelers. Each tweet in the dataset is labeled with one of three sentiment categories: positive, neutral, or negative. Tweets labeled as negative are further categorized into specific negative sentiment reasons, such as late flight, customer service issue, canceled flight, and lost luggage, providing deeper insights into common complaints. The dataset also identifies the airline mentioned in each tweet, covering six major U.S. carriers: United Airlines, US Airways, American Airlines, Southwest Airlines, Delta Air Lines, and Virgin America. Additional metadata is provided for each tweet, including tweet ID, tweet text, tweet coordinates (if available), user information, and location data, allowing for further contextual analysis. The dataset is relatively small, with a total size of 8.46 MB, making it easily manageable for sentiment analysis tasks and machine learning applications. It includes 14,640 tweets from 7,700 unique users, providing a broad yet concise representation of customer interactions with airlines on Twitter. The tweets were collected over a one-month period in February 2015, offering a snapshot of public sentiment during that specific timeframe.
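
As a rough illustration of how the sentiment labels and negative-reason categories might be explored, the sketch below uses pandas. The file name ("Tweets.csv") and column names ("airline", "airline_sentiment", "negativereason") are assumptions based on common public copies of this dataset and should be checked against the actual file.

    import pandas as pd

    # Assumed columns: "airline", "airline_sentiment", "negativereason".
    tweets = pd.read_csv("Tweets.csv")

    # Sentiment distribution per airline.
    print(tweets.groupby("airline")["airline_sentiment"].value_counts())

    # Most frequent reasons attached to negative tweets.
    negative = tweets[tweets["airline_sentiment"] == "negative"]
    print(negative["negativereason"].value_counts().head(5))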

Consumer Complaint Database

Creators: Consumer Financial Protection Bureau (CFPB)
Publication Date: 2011

The database encompasses a wide array of complaints related to various financial sectors, including debt collection, credit reporting, mortgages, credit cards, and more. Each week the CFPB sends thousands of consumers’ complaints about financial products and services to companies for response. Those complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaint narratives are consumers’ descriptions of their experiences in their own words. By adding their voice, consumers help improve the financial marketplace. The database generally updates daily. Each complaint entry includes details such as the date received, product type, issue, company involved, consumer’s narrative (if consented for publication), company response, and the complaint’s current status. This dataset serves as a valuable resource for identifying trends, assessing company practices, and informing policy decisions.
As of February 22, 2025, the database contains a total of 7,867,198 complaints. The dataset is 1.36 GB in size and available for download in CSV format. The dataset spans from December 1, 2011, to the present, with regular updates to include new complaints.
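
Because the full CSV is around 1.36 GB, a chunked scan is one way to summarize it without loading everything into memory. A minimal sketch follows; the file name ("complaints.csv") and the "Product" column name are assumptions to verify against the header of the actual download.

    import pandas as pd

    # Count complaints per product over the full file in 100k-row chunks.
    # Assumed file name and column name; check the downloaded CSV's header.
    counts = {}
    for chunk in pd.read_csv("complaints.csv", chunksize=100_000, usecols=["Product"]):
        for product, n in chunk["Product"].value_counts().items():
            counts[product] = counts.get(product, 0) + n

    # Products ranked by number of complaints.
    for product, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{n:>9,}  {product}")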

The Stanford Natural Language Inference (SNLI) Corpus

Creators: Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D.
Publication Date: 2015

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. The corpus is approximately 0.09 GB in size.

It consists of a training, validation, and test set. The variables contained in each of these sub-datasets are described below.
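
As a rough illustration, the sketch below counts gold labels in one split of the official JSONL release. The file path and field names (gold_label, sentence1, sentence2) reflect the standard snli_1.0 distribution rather than this listing, so treat them as assumptions to verify against the downloaded files.

    import json

    # Tally gold labels in the SNLI training split (one JSON object per line).
    # Assumed path and field names from the standard snli_1.0 release.
    label_counts = {}
    with open("snli_1.0/snli_1.0_train.jsonl", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            label = example["gold_label"]  # entailment, contradiction, neutral, or "-" when annotators disagreed
            label_counts[label] = label_counts.get(label, 0) + 1

    print(label_counts)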

The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation-learning methods, and as a resource for developing NLP models of any kind.

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
