
Goodreads Datasets

Creators: Wan, Mengting; McAuley, Julian
Publication Date: 2017

The Goodreads Datasets provide a large-scale collection of book-related data, making them valuable for analyzing reading behavior, book popularity, and recommendation systems. They consist of three primary components: book metadata, user-book interactions, and book reviews. The book metadata covers 2,360,655 books, with details such as title, author, publication date, genre, and average rating. The user-book interactions component comprises 228,648,342 interactions from 876,145 users, capturing explicit preferences such as bookshelf assignments (“read,” “to-read”) and user ratings. The book reviews component contains a vast collection of user-generated textual reviews, offering insights into reader sentiments, opinions, and book popularity. The combined datasets total approximately 11 GB, available in JSON and CSV formats.
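A minimal sketch of how the large interaction and metadata files can be processed without loading everything into memory. The file name, the gzipped line-delimited JSON layout, and the publication_year field are assumptions about the download, not a documented API:

```python
import gzip
import json
from collections import Counter

# Minimal sketch: stream the book-metadata file record by record instead of
# loading all ~2.3 million books at once. File name and line-delimited JSON
# layout are assumptions; adjust them to the actual download.
def iter_books(path="goodreads_books.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: count books per publication year (field name is an assumption).
years = Counter()
for book in iter_books():
    year = book.get("publication_year")
    if year:
        years[year] += 1
print(years.most_common(10))
```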

Social Interaction QA; Social IQA

Creators: Sap, Maarten; Rashkin, Hannah; Chen, Derek; Le Bras, Ronan; Choi, Yejin
Publication Date: 2019

Social Interaction QA (Social IQa) is a question-answering benchmark for testing social commonsense intelligence. Unlike many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like “Jesse saw a concert” and a question like “Why did Jesse do this?”, humans can easily infer that Jesse wanted “to see their favorite performer” or “to enjoy the music”, and not “to see what’s happening inside” or “to see if it works”. The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations. The dataset is relatively small, about 0.01 GB, and is available in JSON format.

The structure of the dataset consists of a set of question-answer pairs, where each entry contains:

  • A context describing a social situation.
  • A question that requires reasoning about the context.
  • Three answer choices (one correct, two incorrect).
  • A label indicating the correct answer.
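For illustration, a sketch of what one entry and its label might look like when loaded in Python. The exact field names (context, question, answerA/B/C, and a 1-indexed label) are assumptions based on the structure described above, not the verbatim schema:

```python
import json

# Hypothetical Social IQa entry following the structure described above;
# field names and the 1-indexed label convention are assumptions.
example = {
    "context": "Jesse went to a concert with friends last night.",
    "question": "Why did Jesse do this?",
    "answerA": "to see their favorite performer",
    "answerB": "to see what's happening inside",
    "answerC": "to see if it works",
}
label = 1  # 1-indexed: answerA is the correct choice in this example

choices = [example["answerA"], example["answerB"], example["answerC"]]
print(json.dumps(example, indent=2))
print("correct answer:", choices[label - 1])
```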

Instagram Influencer Marketing Dataset

Creators: Kim, Seungbae; Jiang, Jyun-Yu; Nakada, Masaki; Han, Jinyoung; Wang, Wei
Publication Date: 2020

This dataset contains 33,935 Instagram influencers classified into nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other. The dataset is 262 GB in size, including both metadata in JSON format and images in JPEG format. The creators collected 300 posts per influencer, for a total of 10,180,500 Instagram posts. The dataset includes two types of files: post metadata and image files. Post metadata files are in JSON format and contain information such as caption, usertags, hashtags, timestamp, sponsorship, likes, and comments. Image files are in JPEG format, and the dataset contains 12,933,406 image files, since a post can have more than one image. If a post has only one image, the JSON file and the corresponding image file share the same name; if a post has more than one image, the JSON file and the corresponding image files have different names. A JSON-Image_mapping file is therefore provided that lists the image files corresponding to each post’s metadata.
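A hedged sketch of resolving the images that belong to a post via the mapping file described above. The mapping file name, its internal layout (post-metadata file name mapped to a list of image file names), the images/ directory, and the example post name are all assumptions for illustration:

```python
import json
from pathlib import Path

# Minimal sketch: look up the image files for a post using the
# JSON-Image_mapping file. The mapping layout below is assumed, not documented.
def load_image_mapping(path="JSON-Image_mapping.json"):
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # assumed: {"<post>.json": ["<img1>.jpeg", ...]}

def images_for_post(post_json, mapping):
    # Multi-image posts use different file names, so consult the mapping;
    # fall back to the same-name convention for single-image posts.
    candidates = mapping.get(post_json)
    if candidates is None:
        candidates = [post_json.replace(".json", ".jpeg")]
    return [Path("images") / name for name in candidates]

mapping = load_image_mapping()
print(images_for_post("some_post.json", mapping))  # hypothetical post name
```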

If you want to use this dataset, please cite it accordingly. The data can be accessed on the respective website link below.

“Multimodal Post Attentive Profiling for Influencer Marketing,” Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han and Wei Wang. In Proceedings of The Web Conference (WWW ’20), ACM, 2020.

Consumer Complaint Database

Creators: Consumer Financial Protection Bureau (CFPB)
Publication Date: 2011
The database encompasses a wide array of complaints related to various financial sectors, including debt collection, credit reporting, mortgages, credit cards, and more. Each week we send thousands of consumers’ complaints about financial products and services to companies for response. Those complaints are published here after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaint narratives are consumers’ descriptions of their experiences in their own words. By adding their voice, consumers help improve the financial marketplace. The database generally updates daily.  Each complaint entry includes details such as the date received, product type, issue, company involved, consumer’s narrative (if consented for publication), company response, and the complaint’s current status. This dataset serves as a valuable resource for identifying trends, assessing company practices, and informing policy decisions. 
As of February 22, 2025, the database contains a total of 7,867,198 complaints. The dataset is 1.36 GB in size and available for download in CSV format. It spans from December 1, 2011, to the present, with regular updates to include new complaints.
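A short sketch of exploring the CSV export with pandas. The file name and column headers such as "Date received", "Product", and "Company" are assumptions about the export; check the header row of the downloaded file:

```python
import pandas as pd

# Minimal sketch: load a few assumed columns of the complaints CSV and
# summarize complaint volume by product category for a recent period.
usecols = ["Date received", "Product", "Issue", "Company", "State"]
complaints = pd.read_csv("complaints.csv", usecols=usecols,
                         parse_dates=["Date received"], low_memory=False)

recent = complaints[complaints["Date received"] >= "2024-01-01"]
print(recent["Product"].value_counts().head(10))
```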

ImageNet Large Scale Visual Recognition Challenge

Creators: Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; Ma, Sean; Huang, Zhiheng; Karpathy, Andrej; Khosla, Aditya; Bernstein, Michael; Berg, Alexander C.; Fei-Fei, Li
Publication Date: 2009

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. It contains data from 2012 until 2017. The dataset includes over 14 million images, while the biggest subset ImageNet Large Scale Visual Recognition Challenge (ILSVRC) covers 1,281,167 training images, 50,000 validation images, and 100,000 test images. In total, the dataset has a size of 167 GB. The data is available for free to researchers for non-commercial use on the data provider’s website.

For access to the full ImageNet dataset and other commonly used subsets, please log in or request access on the data provider’s website. In doing so, you will need to agree to ImageNet’s terms of access. Therefore, no data preview can be provided here.

When reporting results of the challenges or using the datasets, please cite:

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.

File Descriptions

1) ILSVRC/ contains the image data and ground truth for the train and validation sets, and the image data for the test set.

  • The image annotations are saved in XML files in PASCAL VOC format. Users can parse the annotations using the PASCAL Development Toolkit.
  • Annotations are organized by synset (for example, “Persian cat”, “mountain bike”, or “hot dog”), identified by WordNet IDs (wnids) that look like n00141669. Each image’s name corresponds directly to its annotation file name; for example, the bounding boxes for the image n02123394_28.JPEG are in n02123394/n02123394_28.xml.
  • You can download all the bounding boxes of a particular synset from http://www.image-net.org/api/download/imagenet.bbox.synset?wnid=[wnid]
  • The training images are under the folders with the names of their synsets. The validation images are all in the same folder. The test images are also all in the same folder.
  • The ImageSet folder contains text files specifying the lists of images for the main localization task.
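A minimal sketch of reading one PASCAL VOC-style annotation file with the Python standard library instead of the PASCAL Development Toolkit. The example path is a placeholder; the object/bndbox element names follow the VOC format described above:

```python
import xml.etree.ElementTree as ET

# Minimal sketch: extract (wnid, x_min, y_min, x_max, y_max) tuples from one
# VOC-format annotation file.
def read_boxes(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        wnid = obj.findtext("name")  # synset id, e.g. n02123394
        bb = obj.find("bndbox")
        boxes.append((wnid,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

# Placeholder path; adjust to where the ILSVRC/ annotations are unpacked.
print(read_boxes("ILSVRC/n02123394/n02123394_28.xml"))
```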

2) LOC_sample_submission.csv is the correct format of the submission file. It contains two columns:

  • ImageId: the id of the test image, for example ILSVRC2012_test_00000001
  • PredictionString: the prediction string should be a space-delimited list of 5 integers per prediction: a class label followed by the bounding-box coordinates (x_min, y_min, x_max, y_max). For example, 1000 240 170 260 240 means label 1000 with a bounding box at (240, 170, 260, 240). We accept up to 5 predictions. For example, if you submit 862 42 24 170 186 862 292 28 430 198 862 168 24 292 190 862 299 238 443 374 862 160 195 294 357 862 3 214 135 356, which contains 6 bounding boxes, we will only take the first 5 into consideration.
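A small sketch that writes rows in the submission format just described. The predictions themselves are hypothetical; only the "label x_min y_min x_max y_max" layout and the 5-prediction limit come from the description above:

```python
import csv

# Hypothetical predictions keyed by test-image id:
# each box is (class label, x_min, y_min, x_max, y_max).
predictions = {
    "ILSVRC2012_test_00000001": [(862, 42, 24, 170, 186),
                                 (862, 292, 28, 430, 198)],
}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ImageId", "PredictionString"])
    for image_id, boxes in predictions.items():
        # Only the first 5 predictions are scored; join each as
        # "label x1 y1 x2 y2" and separate boxes with spaces.
        pred = " ".join(" ".join(map(str, box)) for box in boxes[:5])
        writer.writerow([image_id, pred])
```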

3) LOC_train_solution.csv and LOC_val_solution.csv: This information is already available in ILSVRC/, but it is also provided in CSV format for consistency with LOC_sample_submission.csv. Each file contains two columns:

  • ImageId: the id of the train/val image, for example n02017213_7894 or ILSVRC2012_val_00048981
  • PredictionString: a space-delimited list in which each ground-truth box consists of a synset id followed by 4 integers, the bounding-box coordinates (x_min, y_min, x_max, y_max). For example, n01978287 240 170 260 240 means the label is n01978287 with a bounding box at (240, 170, 260, 240). Repeated entries represent multiple boxes in the same image: n04447861 248 177 417 332 n04447861 171 156 251 175 n04447861 24 133 115 254
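A short sketch parsing a ground-truth PredictionString from these solution files into (synset id, box) tuples, following the 5-tokens-per-box layout described above:

```python
# Minimal sketch: split a PredictionString into (wnid, (x_min, y_min, x_max, y_max)).
def parse_prediction_string(s):
    tokens = s.split()
    boxes = []
    for i in range(0, len(tokens), 5):  # 5 tokens per ground-truth box
        wnid = tokens[i]
        x1, y1, x2, y2 = map(int, tokens[i + 1:i + 5])
        boxes.append((wnid, (x1, y1, x2, y2)))
    return boxes

example = "n04447861 248 177 417 332 n04447861 171 156 251 175"
print(parse_prediction_string(example))
```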

4) LOC_synset_mapping.txt: The mapping between the 1000 synset ids and their descriptions. For example, line 1 reads n01440764 tench, Tinca tinca, meaning that class 1 has synset id n01440764 and contains the fish tench.
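A minimal sketch loading LOC_synset_mapping.txt into a wnid-to-description dictionary, assuming one "wnid description" pair per line as in the example above:

```python
# Minimal sketch: build {wnid: description} from the synset mapping file.
mapping = {}
with open("LOC_synset_mapping.txt", encoding="utf-8") as f:
    for line in f:
        wnid, _, description = line.strip().partition(" ")
        mapping[wnid] = description

print(mapping.get("n01440764"))  # expected: "tench, Tinca tinca"
```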

Flickr30k

Creators: Young, Peter; Lai, Alice; Hodosh, Micah; Hockenmaier, Julia
Publication Date: 2014
The Flickr30k dataset consists of 31,783 images, each accompanied by five human-generated captions, adding up to 158,915 captions. The images predominantly depict people engaged in everyday activities and events, and the dataset serves as a benchmark for sentence-based image description tasks. It has been further enhanced by the Flickr30k Entities extension, which adds 244,000 coreference chains linking mentions of the same entities across the different captions for an image and associates them with 276,000 manually annotated bounding boxes. This augmentation facilitates tasks such as phrase localization and grounded language understanding.
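A hedged sketch of grouping the five captions per image. It assumes a tab-separated captions file in which each line looks like "<image>.jpg#<index>\t<caption>"; the file name and that layout are assumptions about the distributed captions file:

```python
from collections import defaultdict

# Minimal sketch: collect the five captions belonging to each image.
captions = defaultdict(list)
with open("flickr30k_captions.token", encoding="utf-8") as f:  # assumed name
    for line in f:
        image_tag, caption = line.rstrip("\n").split("\t", 1)
        image_id = image_tag.split("#")[0]  # strip the "#<index>" suffix
        captions[image_id].append(caption)

some_image = next(iter(captions))
print(some_image, captions[some_image])
```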

Trip Advisor Hotel Reviews

Creators: Alam, Md. H.; Ryu, Woo-Jong; Lee, SangKeun
Publication Date: 2020

Hotels play a crucial role in traveling, and with increased access to information, new pathways for selecting the best ones have emerged. This dataset, consisting of 20k reviews crawled from Tripadvisor, lets you explore what makes a great hotel, and perhaps even apply what you find to your own travels. It enables the exploration of factors contributing to hotel quality and can be utilized for sentiment analysis and natural language processing tasks. The dataset is 0.01 GB in size and consists of textual hotel reviews paired with a numerical rating for each review.
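A small sketch of a rating-based sentiment split, the kind of analysis the description suggests. The CSV name and the "Review"/"Rating" column names are assumptions about the download:

```python
import pandas as pd

# Minimal sketch: bucket reviews into sentiment classes by star rating.
reviews = pd.read_csv("tripadvisor_hotel_reviews.csv")  # assumed file name

# Treat 4-5 star reviews as positive, 3 as neutral, and 1-2 as negative.
reviews["sentiment"] = pd.cut(reviews["Rating"], bins=[0, 2, 3, 5],
                              labels=["negative", "neutral", "positive"])
print(reviews["sentiment"].value_counts())
print(reviews.loc[reviews["sentiment"] == "negative", "Review"].iloc[0][:200])
```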

Amazon Product Reviews

Creators: Ni, Jianmo; Li, Jiacheng; McAuley, Julian
Publication Date: 2018

The Amazon Product Reviews dataset encompasses a comprehensive collection of 233.1 million customer reviews from Amazon, covering the period from May 1996 to October 2018. It includes various features such as ratings, textual reviews, helpfulness votes, product metadata (descriptions, category information, price, brand, and image features), and links to related products (e.g., also viewed, also bought graphs). The dataset serves as a valuable resource for analyzing consumer behavior, product trends, and for developing recommendation systems. In total, the dataset is 34 GB in size. It is organized into the following files and subsets:

  • Complete Review Data: A comprehensive file containing all 233.1 million reviews.
  • Ratings Only: A CSV file focusing solely on ratings, excluding textual reviews and metadata.
  • 5-Core: A subset where all users and items have at least five reviews, comprising 75.26 million reviews.
  • Per-Category Data: Reviews and product metadata categorized by specific product types (e.g., Books, Electronics).
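A hedged sketch of streaming one of the per-category review files. The gzipped line-delimited JSON layout, the example file name, and the "overall" rating field are assumptions about the distributed files:

```python
import gzip
import json

# Minimal sketch: iterate one per-category review file without loading it
# fully into memory. File name and field names are assumptions.
def iter_reviews(path="Electronics_5.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: average star rating over the first 100,000 reviews of the subset.
total, count = 0.0, 0
for review in iter_reviews():
    total += review.get("overall", 0)
    count += 1
    if count == 100_000:
        break
print(total / count)
```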
