Resources by Stefan

Flickr30k

Creators: Young, Peter; Lai, Alice; Hodosh, Micah; Hockenmaier, Julia
Publication Date: 2014
The Flickr30k dataset consists of 31,783 images, each accompanied by five human-written captions, for a total of 158,915 captions. The images predominantly depict people engaged in everyday activities and events. The dataset serves as a benchmark for sentence-based image description tasks. It has been further enhanced by the Flickr30k Entities extension, which adds 244,000 coreference chains linking mentions of the same entities across different captions for the same image, and associates them with 276,000 manually annotated bounding boxes. This augmentation facilitates tasks such as phrase localization and grounded language understanding.
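The relationship between images and captions can be sketched as a simple grouping step. The snippet below is only an illustration: the function name `group_captions` and the `(image_id, caption)` pair layout are assumptions, not the format of the released files.

```python
from collections import defaultdict

# Figures taken from the description above.
NUM_IMAGES = 31_783
CAPTIONS_PER_IMAGE = 5
TOTAL_CAPTIONS = NUM_IMAGES * CAPTIONS_PER_IMAGE  # five captions for every image


def group_captions(pairs):
    """Group (image_id, caption) pairs into a dict: image_id -> list of captions.

    The pair layout is hypothetical; adapt it to the actual annotation file.
    """
    grouped = defaultdict(list)
    for image_id, caption in pairs:
        grouped[image_id].append(caption)
    return dict(grouped)
```

After loading the real annotations, checking that every image maps to exactly five captions is a quick sanity test for the download.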

Trip Advisor Hotel Reviews

Creators: Alam, Md. H.; Ryu, Woo-Jong; Lee, SangKeun
Publication Date: 2020

Hotels play a crucial role in travel, and with increased access to information, new ways of selecting the best ones have emerged. This dataset, consisting of 20k reviews crawled from Tripadvisor, lets you explore which factors contribute to hotel quality and can be used for sentiment analysis and other natural language processing tasks. The dataset is 0.01 GB in size and contains textual hotel reviews, each paired with a numerical rating.

Amazon Product Reviews

Creators: Ni, Jianmo; Li, Jiacheng; McAuley, Julian
Publication Date: 2018

The Amazon Product Reviews dataset encompasses a comprehensive collection of 233.1 million customer reviews from Amazon, covering the period from May 1996 to October 2018. It includes various features such as ratings, textual reviews, helpfulness votes, product metadata (descriptions, category information, price, brand, and image features), and links to related products (e.g., also viewed, also bought graphs). The dataset serves as a valuable resource for analyzing consumer behavior, product trends, and for developing recommendation systems. In total, the dataset is 34 GB in size. It is organized into the following files and subsets:

  • Complete Review Data: A comprehensive file containing all 233.1 million reviews.
  • Ratings Only: A CSV file focusing solely on ratings, excluding textual reviews and metadata.
  • 5-Core: A subset where all users and items have at least five reviews, comprising 75.26 million reviews.
  • Per-Category Data: Reviews and product metadata categorized by specific product types (e.g., Books, Electronics).
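The 5-core condition (every remaining user and item has at least five reviews) can be illustrated with an iterative pruning routine. This is a minimal sketch of one common way to obtain such a subset, not necessarily the procedure the data providers used; the function name `k_core` and the `(user_id, item_id)` pair layout are assumptions.

```python
from collections import Counter


def k_core(reviews, k=5):
    """Drop reviews until every remaining user and item has at least k reviews.

    `reviews` is a list of (user_id, item_id) pairs (hypothetical layout).
    Removing one review can push another user or item below k, so the
    filter is repeated until nothing changes.
    """
    current = list(reviews)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept
```

A single pass is not enough in general, which is why the loop runs until the set of reviews stops shrinking.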

Marketing Bias data

Creators: Wan, Mengting; Ni, Jianmo; Misra, Rishabh; McAuley, Julian
Publication Date: 2020

This dataset contains attributes of products sold on ModCloth and Amazon (in particular, attributes about how the products are marketed), which may introduce biases in recommendation systems. It is designed to facilitate research on marketing biases in product recommendations. Data also includes user/item interactions for recommendation.

The dataset amounts to 0.09 GB in size and is built upon two processed subsets:

  • ModCloth Dataset: Contains product attributes from the ModCloth platform.
  • Electronics Dataset: Comprises product attributes from Amazon’s Electronics category.

In total, the dataset includes 99,893 reviews for ModCloth and 1,292,954 reviews for the Electronics category of Amazon.

Google Local Reviews

Creators: He, Ruining; Kang, Wang-Cheng; McAuley, Julian
Publication Date: 2017

The Google Local Reviews dataset comprises 11,453,845 reviews and ratings from 4,567,431 users on 3,116,785 local businesses. It contains user reviews for local businesses, with variables such as rating, review text, business category, location details (address, GPS, phone number, opening hours), and user and business IDs. It also includes timestamps of reviews, price level, and whether the business is closed. The dataset has a size of 7 GB and spans 48,013 categories of local businesses across five continents, encompassing a diverse range of establishments from restaurants and hotels to parks and shopping malls.

Future of Business - Survey Results

Creators: Facebook; OECD; World Bank
Publication Date: 2018

The Future of Business survey is a collaboration between Facebook, the OECD and the World Bank to provide timely insights on the perceptions, challenges, and outlook of online Small and Medium Enterprises (SMEs). The Future of Business survey was first launched as a monthly survey in 17 countries in February 2016 and expanded to 42 countries in 2018. In 2019, the Future of Business survey increased coverage to 97 countries and moved to a bi-annual cadence.

The target population consists of SMEs that have an active Facebook business Page and includes both newer and longer-standing businesses across a variety of sectors. To date, more than 90 million SMEs have created a Facebook Page, and more than 700,000 of these Facebook Page owners have taken the survey. With more businesses leveraging online tools each day, the survey provides a lens into a new mobilized, digital economy and, in particular, insights on the actors: a relatively unmeasured community worthy of deeper consideration and considerable policy interest. The dataset is approximately 0.04 GB in size.

The survey includes questions about perceptions of current and future economic activity, challenges, business characteristics and strategy. Custom modules include questions related to regulation, access to finance, digital payments, and digital skills.

The Stanford Natural Language Inference (SNLI) Corpus

Creators: Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D.
Publication Date: 2015

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. The corpus is about 0.09 GB in size.

It consists of a training, a validation, and a test set. The variables contained in each of these subsets are described below.

The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation-learning methods, as well as a resource for developing NLP models of any kind.

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

IMDb Movie Reviews Dataset

Creators: Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher
Publication Date: 2011

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning. In total, the dataset amounts to 0.08 GB in size.

The dataset contains an equal number of positive and negative reviews. Only highly polarized reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
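The selection rules above (score thresholds and the per-movie cap) can be sketched in a few lines. This is an illustration of the stated rules only, not the providers' actual preprocessing code; the function names and the `(movie_id, score, text)` row layout are assumptions.

```python
def label_review(score):
    """Map a 1-10 score to a sentiment label; mid-range scores return None."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # scores of 5 and 6 are not polarized enough to be included


def select_reviews(rows, max_per_movie=30):
    """Keep polarized reviews, with at most `max_per_movie` reviews per movie.

    `rows` is an iterable of (movie_id, score, text) tuples (hypothetical layout).
    """
    counts = {}
    selected = []
    for movie_id, score, text in rows:
        label = label_review(score)
        if label is None:
            continue
        if counts.get(movie_id, 0) >= max_per_movie:
            continue
        counts[movie_id] = counts.get(movie_id, 0) + 1
        selected.append((movie_id, label, text))
    return selected
```

Capping the number of reviews per movie limits how strongly any single title can influence a model trained on the data.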

The data is split into a training set (25k reviews) and a test set (25k reviews). A preview file cannot be provided; please download the data directly from the data provider’s website.

When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
