Resources by Stefan

YouTube-8M Dataset

Creators: Abu-El-Haija, Sami; Kothari, Nisarg; Lee, Joonseok; Natsev, Paul; Toderici, George; Varadarajan, Balakrishnan; Vijayanarasimhan, Sudheendra
Publication Date: 2016

YouTube-8M is a large-scale labeled video dataset consisting of millions of YouTube video IDs with high-quality machine-generated and partially human-verified annotations drawn from a diverse vocabulary of 3,800+ visual entities.

It comprises two subsets:

  • 8M Segments Dataset: 230K human-verified segment labels, 1000 classes, 5 segments/video
  • 8M Dataset: May 2018 version (current): 6.1M videos, 3862 classes, 3.0 labels/video, 2.6B audio-visual features

It thus comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to train a strong baseline model on this dataset in less than a day on a single GPU! At the same time, the dataset’s scale and diversity enable deep exploration of complex audio-visual models that can take weeks to train, even in a distributed fashion.

YouTube offers the YouTube-8M dataset for download as TensorFlow Record (TFRecord) files on their website. Starter code for the dataset can be found on their GitHub page.
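
As a quick-start sketch, the video-level records can be read with TensorFlow’s tf.data API. The file name below is hypothetical, and the feature names and dimensions (“id”, “labels”, 1024-d “mean_rgb”, 128-d “mean_audio”) follow the dataset’s documentation for video-level features, so verify them against the files you actually download:

    import tensorflow as tf

    def parse_video_example(serialized):
        # Video-level feature schema as documented for YouTube-8M;
        # treat these names and shapes as assumptions to be verified.
        features = {
            "id": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.VarLenFeature(tf.int64),
            "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
            "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
        }
        return tf.io.parse_single_example(serialized, features)

    # Hypothetical file name for one downloaded shard.
    dataset = tf.data.TFRecordDataset("train0000.tfrecord").map(parse_video_example)
    for example in dataset.take(1):
        print(example["id"].numpy(), tf.sparse.to_dense(example["labels"]).numpy())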

Amazon Reviews: Unlocked Mobile Phones

Creators: PromptCloud, Inc.
Publication Date: 2019

The dataset covers more than 400,000 reviews of close to 4,400 unlocked mobile phones sold on Amazon.com, collected to surface insights about reviews, ratings, prices, and their relationships, making it a rich resource for analyzing customer sentiment and product performance. The dataset is approximately 0.13 GB in size and is available in CSV format. The creators found that most reviewers gave 4-star and 3-star ratings, and that the average review length is close to 230 characters. They also found that lengthier reviews tend to be more helpful and that there is a positive correlation between price and rating.

Structurally, each entry in the dataset includes the following variables:

  • Product Name: The name of the product (e.g., “Sprint EPIC 4G Galaxy SPH-D7”).
  • Brand Name: The manufacturer or parent company (e.g., “Samsung”).
  • Price: The listed price of the product, with values ranging from a minimum of $1.73 to a maximum of $2,598, and an average price of $226.86.
  • Rating: The user-assigned rating, ranging between 1 and 5 stars.
  • Reviews: The textual content of the user’s review, detailing their experience and opinions.
  • Review Votes: The number of helpfulness votes each review received from other users, with a minimum of 0, a maximum of 645, and an average of 1.50 votes.
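
As a sketch of how these variables can be explored with pandas, the following assumes the CSV has been downloaded; the file name is hypothetical, while the column names follow the list above:

    import pandas as pd

    df = pd.read_csv("Amazon_Unlocked_Mobile.csv")  # hypothetical file name
    print(df["Rating"].value_counts().sort_index())  # distribution of star ratings
    print(df[["Price", "Rating"]].corr())            # price-rating correlation
    df["review_length"] = df["Reviews"].str.len()
    print(df["review_length"].mean())                # average review length in characters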

Amazon question/answer data

Creators: McAuley, Julian; Yang, Alex
Publication Date: 2016

This dataset contains question-and-answer data from Amazon, totaling around 1.4 million answered questions and around 4 million answers. It offers valuable insights into consumer inquiries and the corresponding responses, facilitating research in natural language processing, question-answering systems, and e-commerce analytics. It can be combined with Amazon product review data (available here) by matching ASINs in the Q/A dataset with ASINs in the review data. The dataset is approximately 766 kB in size and is available in JSON format.

Structurally, each entry in the dataset includes the following variables:

  • asin: The Amazon Standard Identification Number (ASIN) of the product, e.g., “B000050B6Z”.

  • questionType: The type of question, either ‘yes/no’ or ‘open-ended’.

  • answerType: For yes/no questions, this indicates the type of answer: ‘Y’ for yes, ‘N’ for no, or ‘?’ if the polarity of the answer could not be determined.

  • answerTime: The raw timestamp of when the answer was provided.

  • unixTime: The answer timestamp converted to Unix time.

  • question: The text of the question asked by the consumer.

  • answer: The text of the answer provided.
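
A minimal loading sketch: the file name is hypothetical, and this assumes the common distribution format of McAuley’s Amazon datasets, one Python-style dict per line in a gzipped file (hence ast.literal_eval rather than json.loads):

    import ast
    import gzip

    def parse_qa(path):
        # Yields one Q/A entry (a dict) per line of the gzipped file.
        with gzip.open(path, "rt") as f:
            for line in f:
                yield ast.literal_eval(line)

    # Hypothetical file name for one category file.
    for entry in parse_qa("qa_Electronics.json.gz"):
        print(entry["asin"], entry["questionType"], entry["question"][:60])
        break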

Goodreads Datasets

Creators: Wan, Mengting; McAuley, Julian
Publication Date: 2017

The Goodreads Datasets provide a large-scale collection of book-related data, making them valuable for analyzing reading behavior, book popularity, and recommendation systems. They contain three primary components: book metadata, user-book interactions, and book reviews. The book metadata covers 2,360,655 books, including titles, authors, publication years, genres, and average ratings. The user-book interactions dataset comprises 228,648,342 interactions from 876,145 users, capturing explicit preferences such as bookshelf assignments (“read,” “to-read”) and user ratings. The book reviews dataset contains detailed textual reviews written by users, offering insights into reader sentiments and opinions. The total size of the combined datasets is approximately 11 GB, available in JSON and CSV formats.
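
Given the roughly 11 GB total size, streaming the JSON files line by line rather than loading them whole is advisable. A minimal sketch, assuming gzipped JSON-lines files; the file and field names here are assumptions:

    import gzip
    import json

    def stream_records(path, limit=None):
        # Yields one record (a dict) per line without loading the whole file.
        with gzip.open(path, "rt") as f:
            for i, line in enumerate(f):
                if limit is not None and i >= limit:
                    break
                yield json.loads(line)

    # Hypothetical file and field names; verify against the downloaded data.
    for book in stream_records("goodreads_books.json.gz", limit=3):
        print(book.get("title"), book.get("average_rating"))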

Social Interaction QA (Social IQa)

Creators: Sap, Maarten; Rashkin, Hannah; Chen, Derek; Le Bras, Ronan; Choi, Yejin
Publication Date: 2019

Social Interaction QA (Social IQa) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like “Jesse saw a concert” and a question like “Why did Jesse do this?”, humans can easily infer that Jesse wanted “to see their favorite performer” or “to enjoy the music”, and not “to see what’s happening inside” or “to see if it works”. The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations. The dataset is relatively small, at about 0.01 GB, and is available in JSON format.

The structure of the dataset consists of a set of question-answer pairs, where each entry contains:

  • A context describing a social situation.
  • A question that requires reasoning about the context.
  • Three answer choices (one correct, two incorrect).
  • A label indicating the correct answer.
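
To make the format concrete, here is a hypothetical entry in the shape described above, written as a Python dict using the paper’s own example; the exact field names in the released JSON files may differ:

    # Hypothetical entry; field names are assumptions, values follow
    # the example from the dataset description above.
    example = {
        "context": "Jesse saw a concert last night.",
        "question": "Why did Jesse do this?",
        "answerA": "to see their favorite performer",  # correct answer
        "answerB": "to see what's happening inside",
        "answerC": "to see if it works",
        "label": "A",  # indicates which answer choice is correct
    }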

Instagram Influencer Marketing Dataset

Creators: Kim, Seungbae; Jiang, Jyun-Yu; Nakada, Masaki; Han, Jinyoung; Wang, Wei
Publication Date: 2020

This dataset covers 33,935 Instagram influencers classified into nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other. The dataset is 262 GB in size, including both metadata in JSON format and images in JPEG format. The creators collected 300 posts per influencer, for a total of 10,180,500 Instagram posts. The dataset includes two types of files: post metadata and image files. Post metadata files are in JSON format and contain information such as caption, usertags, hashtags, timestamp, sponsorship, likes, and comments. Image files are in JPEG format; because a post can have more than one image, the dataset contains 12,933,406 image files. If a post has only one image, the JSON file and the corresponding image file share the same name; if a post has more than one image, the JSON file and the corresponding image files have different names. Therefore, the creators also provide a JSON-Image_mapping file that lists the image files corresponding to each post’s metadata.
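
A minimal sketch of how the mapping file could be used to pair a post’s metadata with its image files; the file names and the mapping’s exact schema here are assumptions, so adapt this to the downloaded files:

    import json

    # Hypothetical file names; the mapping is assumed to map a metadata
    # file name to the list of its image file names.
    with open("json_image_mapping.json") as f:
        mapping = json.load(f)

    post_file = "example_post.json"
    with open(post_file) as f:
        post = json.load(f)

    # Single-image posts share a name with their JSON file; multi-image
    # posts are resolved through the mapping.
    images = mapping.get(post_file, [post_file.replace(".json", ".jpeg")])
    print(post.get("hashtags"), images)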

If you use this dataset, please cite it accordingly. The data can be accessed via the website linked below.

“Multimodal Post Attentive Profiling for Influencer Marketing,” Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han and Wei Wang. In Proceedings of The Web Conference (WWW ’20), ACM, 2020.

Consumer Complaint Database

Creators: Consumer Financial Protection Bureau (CFPB)
Publication Date: 2011

The database encompasses a wide array of complaints related to various financial sectors, including debt collection, credit reporting, mortgages, credit cards, and more. Each week, the CFPB sends thousands of consumers’ complaints about financial products and services to companies for response. Those complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaint narratives are consumers’ descriptions of their experiences in their own words; by adding their voice, consumers help improve the financial marketplace. The database generally updates daily. Each complaint entry includes details such as the date received, product type, issue, company involved, consumer narrative (if consented for publication), company response, and the complaint’s current status. This dataset serves as a valuable resource for identifying trends, assessing company practices, and informing policy decisions.
As of February 22, 2025, the database contains a total of 7,867,198 complaints. The dataset is 1.36 GB in size and available for download in CSV format. It spans from December 1, 2011, to the present, with regular updates to include new complaints.
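
For a first look at the CSV, a short pandas sketch like the following can be used; the file name is hypothetical, and the column names (“Product”, “Company”) follow the CFPB export but should be verified against the downloaded file:

    import pandas as pd

    df = pd.read_csv("complaints.csv", low_memory=False)  # hypothetical file name
    print(df["Product"].value_counts().head(10))   # most complained-about products
    print(df["Company"].value_counts().head(10))   # companies receiving the most complaints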

ImageNet Large Scale Visual Recognition Challenge

Creators: Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; Ma, Sean; Huang, Zhiheng; Karpathy, Andrej; Khosla, Aditya; Bernstein, Michael; Berg, Alexander C.; Fei-Fei, Li
Publication Date: 2009

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds to thousands of images. The project has been instrumental in advancing computer vision and deep learning research. It contains data from 2012 until 2017. The dataset includes over 14 million images, while its biggest subset, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), covers 1,281,167 training images, 50,000 validation images, and 100,000 test images. In total, the dataset has a size of 167 GB. The data is available free of charge to researchers for non-commercial use on the data provider’s website.

For access to the full ImageNet dataset and other commonly used subsets, please log in or request access on the data provider’s website. In doing so, you will need to agree to ImageNet’s terms of access. Therefore, no data preview can be provided here.

When reporting results of the challenges or using the datasets, please cite:

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.

File Descriptions

1) ILSVRC/ contains the image data and ground truth for the train and validation sets, and the image data for the test set.

  • The image annotations are saved in XML files in PASCAL VOC format. Users can parse the annotations using the PASCAL Development Toolkit (see the parsing sketch after this list).
  • Annotations are organized by synset (for example, “Persian cat”, “mountain bike”, or “hot dog”), identified by WordNet ID (wnid); these ids look like n00141669. Each image’s name corresponds directly to its annotation file name: for example, the bounding box annotation n02123394/n02123394_28.xml belongs to the image n02123394_28.JPEG.
  • You can download all the bounding boxes of a particular synset from http://www.image-net.org/api/download/imagenet.bbox.synset?wnid=[wnid]
  • The training images are under the folders with the names of their synsets. The validation images are all in the same folder. The test images are also all in the same folder.
  • The ImageSet folder contains text files specifying the lists of images for the main localization task.
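
Since the annotations follow the PASCAL VOC XML format, they can also be read with the Python standard library alone. A minimal sketch; the path is hypothetical, and the tag names follow the PASCAL VOC convention:

    import xml.etree.ElementTree as ET

    # Hypothetical annotation file, named as described above.
    tree = ET.parse("n02123394/n02123394_28.xml")
    for obj in tree.getroot().iter("object"):
        wnid = obj.findtext("name")  # synset id, e.g. n02123394
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        print(wnid, coords)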

2) LOC_sample_submission.csv illustrates the correct format of the submission file. It contains two columns:

  • ImageId: the id of the test image, for example ILSVRC2012_test_00000001
  • PredictionString: the prediction string is a space-delimited list of 5 integers per prediction: a label followed by the bounding-box coordinates (x_min, y_min, x_max, y_max). For example, 1000 240 170 260 240 means label 1000 with a bounding box at those coordinates. Up to 5 predictions are accepted; for example, a submission of 862 42 24 170 186 862 292 28 430 198 862 168 24 292 190 862 299 238 443 374 862 160 195 294 357 862 3 214 135 356 contains 6 bounding boxes, and only the first 5 will be taken into consideration.

3) LOC_train_solution.csv and LOC_val_solution.csv: this information is already available in ILSVRC/, but it is also provided in CSV format for consistency with LOC_sample_submission.csv. Each file contains two columns:

  • ImageId: the id of the train/val image, for example n02017213_7894 or ILSVRC2012_val_00048981
  • PredictionString: the prediction string is a space-delimited list of 5 tokens per box: a label followed by the bounding-box coordinates (x_min, y_min, x_max, y_max). For example, n01978287 240 170 260 240 means label n01978287 with a bounding box at those coordinates. Repeated label/box groups represent multiple boxes in the same image: n04447861 248 177 417 332 n04447861 171 156 251 175 n04447861 24 133 115 254 (see the parsing sketch after this list).
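
A minimal sketch for splitting such a prediction string into (label, box) tuples, following the 5-tokens-per-box layout described above:

    def parse_prediction_string(s):
        # Each prediction is 5 space-separated tokens:
        # label x_min y_min x_max y_max
        tokens = s.split()
        predictions = []
        for i in range(0, len(tokens), 5):
            label = tokens[i]
            box = tuple(int(t) for t in tokens[i + 1:i + 5])
            predictions.append((label, box))
        return predictions

    print(parse_prediction_string("n04447861 248 177 417 332 n04447861 171 156 251 175"))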

4) LOC_synset_mapping.txt: the mapping between the 1000 synset ids and their descriptions. For example, line 1 reads n01440764 tench, Tinca tinca, meaning this is class 1, with synset id n01440764, containing the fish tench.
