Instagram Influencer Marketing Dataset

Creators: Kim, Seungbae; Jiang, Jyun-Yu; Nakada, Masaki; Han, Jinyoung; Wang, Wei
Publication Date: 2020

This dataset contains 33,935 Instagram influencers classified into nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other. We collected 300 posts per influencer, for a total of 10,180,500 Instagram posts. The dataset is 262 GB in size and includes two types of files: post metadata and image files. Post metadata files are in JSON format and contain the following information: caption, usertags, hashtags, timestamp, sponsorship, likes, comments, etc. Image files are in JPEG format; because a post can have more than one image, the dataset contains 12,933,406 image files. If a post has only one image file, the JSON file and the corresponding image file have the same name. If a post has more than one image, the JSON file and the corresponding image files have different names, so we also provide a JSON-Image_mapping file that lists the image files corresponding to each post metadata file.
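The naming rule above can be sketched in Python. The description does not document the layout of the JSON-Image_mapping file, so this minimal sketch assumes it has already been loaded into a dictionary from JSON file name to a list of image file names; the file names and the `.jpg` extension are hypothetical as well.

```python
from pathlib import Path

def images_for_post(json_name, mapping):
    """Return the image file names that belong to one post metadata file."""
    # Multi-image posts: names differ from the JSON file, so consult the mapping.
    # ASSUMPTION: `mapping` is a dict {json file name: [image file names]}.
    if json_name in mapping:
        return list(mapping[json_name])
    # Single-image posts: the image shares the JSON file's base name.
    # ASSUMPTION: images use the ".jpg" extension.
    return [Path(json_name).stem + ".jpg"]

# Hypothetical file names, for illustration only.
mapping = {"post_b.json": ["img_1.jpg", "img_2.jpg"]}
single = images_for_post("post_a.json", mapping)
multi = images_for_post("post_b.json", mapping)
print(single, multi)
```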

If you want to use this dataset, please cite it accordingly. The data can be accessed on the respective website link below.

“Multimodal Post Attentive Profiling for Influencer Marketing,” Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han and Wei Wang. In Proceedings of The Web Conference (WWW ’20), ACM, 2020.

Consumer Complaint Database

Creators: Consumer Financial Protection Bureau (CFPB)
Publication Date: 2011
The database encompasses a wide array of complaints related to various financial sectors, including debt collection, credit reporting, mortgages, credit cards, and more. Each week we send thousands of consumers’ complaints about financial products and services to companies for response. Those complaints are published here after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaint narratives are consumers’ descriptions of their experiences in their own words. By adding their voice, consumers help improve the financial marketplace. The database generally updates daily.  Each complaint entry includes details such as the date received, product type, issue, company involved, consumer’s narrative (if consented for publication), company response, and the complaint’s current status. This dataset serves as a valuable resource for identifying trends, assessing company practices, and informing policy decisions. 
As of February 22, 2025, the database contains a total of 7,867,198 complaints. The dataset is 1.36 GB in size and available for download in CSV format. The dataset spans from December 1, 2011, to the present, with regular updates to include new complaints.

ImageNet Large Scale Visual Recognition Challenge

Creators: Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; Ma, Sean; Huang, Zhiheng; Karpathy, Andrej; Khosla, Aditya; Bernstein, Michael; Berg, Alexander C.; Fei-Fei, Li
Publication Date: 2009

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. It contains data from 2012 until 2017. The dataset includes over 14 million images, while its biggest subset, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), covers 1,281,167 training images, 50,000 validation images, and 100,000 test images. In total, the dataset has a size of 167 GB. The data is available for free to researchers for non-commercial use on the data provider’s website.

For access to the full ImageNet dataset and other commonly used subsets, please log in or request access on the data provider’s website. In doing so, you will need to agree to ImageNet’s terms of access. Therefore, no data preview can be provided here.

When reporting results of the challenges or using the datasets, please cite:

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.

File Descriptions

1) ILSVRC/ contains the image data and ground truth for the train and validation sets, and the image data for the test set.

  • The image annotations are saved in XML files in PASCAL VOC format. Users can parse the annotations using the PASCAL Development Toolkit.
  • Annotations are organized by synset (for example, “Persian cat”, “mountain bike”, or “hot dog”), each identified by its WordNet ID (wnid). These IDs look like n00141669. Each image’s name corresponds directly to its annotation file name. For example, the bounding box for n02123394_28.JPEG is stored in n02123394/n02123394_28.xml.
  • You can download all the bounding boxes of a particular synset from http://www.image-net.org/api/download/imagenet.bbox.synset?wnid=[wnid]
  • The training images are under the folders with the names of their synsets. The validation images are all in the same folder. The test images are also all in the same folder.
  • The ImageSet folder contains text files specifying lists of images for the main localization task.
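As an alternative to the PASCAL Development Toolkit, the annotations can also be read with a general XML parser. The following is a minimal Python sketch that assumes the usual PASCAL VOC layout (`object` elements with a `name` and a `bndbox` holding `xmin`, `ymin`, `xmax`, `ymax`); the function name and the sample coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_voc_boxes(xml_text):
    """Return (wnid, xmin, ymin, xmax, ymax) tuples from a VOC-style annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        wnid = obj.findtext("name")  # synset id, e.g. n02123394
        bb = obj.find("bndbox")
        coords = tuple(
            int(bb.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((wnid, *coords))
    return boxes

# Illustrative annotation; the coordinate values are invented.
SAMPLE = """<annotation>
  <filename>n02123394_28</filename>
  <object>
    <name>n02123394</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>"""

boxes = parse_voc_boxes(SAMPLE)
print(boxes)
```

To read a file from disk instead, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.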

2) LOC_sample_submission.csv shows the correct format of the submission file. It contains two columns:

  • ImageId: the id of the test image, for example ILSVRC2012_test_00000001
  • PredictionString: a space-delimited list of integers in groups of 5. For example, 1000 240 170 260 240 means label 1000 with a bounding box of coordinates (x_min, y_min, x_max, y_max) = (240, 170, 260, 240). We accept up to 5 predictions per image. For example, if you submit 862 42 24 170 186 862 292 28 430 198 862 168 24 292 190 862 299 238 443 374 862 160 195 294 357 862 3 214 135 356, which contains 6 bounding boxes, we will only take the first 5 into consideration.
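The submission format can be illustrated with a short Python sketch that splits a PredictionString into groups of five integers and keeps only the first five predictions. The function name and the validation step are illustrative choices, not part of the official tooling.

```python
def parse_predictions(prediction_string, max_predictions=5):
    """Split a PredictionString into (label, x_min, y_min, x_max, y_max) tuples."""
    values = [int(tok) for tok in prediction_string.split()]
    if len(values) % 5 != 0:
        raise ValueError("expected groups of 5 integers")
    groups = [tuple(values[i:i + 5]) for i in range(0, len(values), 5)]
    # Only the first `max_predictions` boxes are taken into consideration.
    return groups[:max_predictions]

# The six-box example from the description above.
example = (
    "862 42 24 170 186 862 292 28 430 198 862 168 24 292 190 "
    "862 299 238 443 374 862 160 195 294 357 862 3 214 135 356"
)
kept = parse_predictions(example)
print(len(kept), kept[0])
```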

3) LOC_train_solution.csv and LOC_val_solution.csv: This information is already available in ILSVRC/, but we also provide it in CSV format to be consistent with LOC_sample_submission.csv. Each file contains two columns:

  • ImageId: the id of the train/val image, for example n02017213_7894 or ILSVRC2012_val_00048981
  • PredictionString: a space-delimited list of 5 values per box: the synset id followed by 4 integers. For example, n01978287 240 170 260 240 means label n01978287 with a bounding box of coordinates (x_min, y_min, x_max, y_max). Repeated groups represent multiple boxes in the same image: n04447861 248 177 417 332 n04447861 171 156 251 175 n04447861 24 133 115 254
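Unlike the submission file, the solution files use a synset id as the label. A minimal Python sketch for grouping the boxes of one row by label might look as follows; the function name is hypothetical.

```python
from collections import defaultdict

def parse_solution(prediction_string):
    """Group the boxes of one solution row by their synset id."""
    tokens = prediction_string.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected groups of a synset id plus 4 integers")
    boxes = defaultdict(list)
    for i in range(0, len(tokens), 5):
        wnid = tokens[i]
        # (x_min, y_min, x_max, y_max)
        boxes[wnid].append(tuple(int(t) for t in tokens[i + 1:i + 5]))
    return dict(boxes)

# The multi-box example from the description above.
row = "n04447861 248 177 417 332 n04447861 171 156 251 175 n04447861 24 133 115 254"
result = parse_solution(row)
print(result)
```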

4) LOC_synset_mapping.txt: The mapping between the 1000 synset ids and their descriptions. For example, line 1 reads n01440764 tench, Tinca tinca, which means that class 1 has the synset id n01440764 and contains the fish tench.
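Reading the mapping can be sketched in Python as below, assuming each line holds a synset id, a single space, and then the description, with the class number given by the line position. The function name is illustrative.

```python
def parse_synset_mapping(lines):
    """Map each synset id to (class number, description); class 1 is line 1."""
    rows = [line.strip() for line in lines if line.strip()]
    mapping = {}
    for class_number, row in enumerate(rows, start=1):
        # The id ends at the first space; the rest is the description.
        wnid, description = row.split(" ", 1)
        mapping[wnid] = (class_number, description)
    return mapping

synsets = parse_synset_mapping(["n01440764 tench, Tinca tinca"])
print(synsets["n01440764"])
```

With the real file, the lines could be supplied by opening LOC_synset_mapping.txt and passing the file object to the function.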

Flickr30k

Creators: Young, Peter; Lai, Alice; Hodosh, Micah; Hockenmaier, Julia
Publication Date: 2014
The Flickr30k dataset consists of 31,783 images, each accompanied by five captions written by human annotators, for a total of 158,915 captions. These images predominantly depict people engaged in everyday activities and events. The dataset serves as a benchmark for sentence-based image description tasks. It has been further enhanced by the Flickr30k Entities extension, which adds 244,000 coreference chains linking mentions of the same entities across different captions for the same image, and associates them with 276,000 manually annotated bounding boxes. This augmentation facilitates tasks such as phrase localization and grounded language understanding.

Trip Advisor Hotel Reviews

Creators: Alam, Md. H.; Ryu, Woo-Jong; Lee, SangKeun
Publication Date: 2020

Hotels play a crucial role in travel, and increased access to information has opened new ways of selecting the best ones. This dataset consists of 20k reviews crawled from Tripadvisor. It enables the exploration of factors that contribute to hotel quality and can be used for sentiment analysis and natural language processing tasks. The dataset is 0.01 GB in size and pairs each textual hotel review with its numerical rating.

Amazon Product Reviews

Creators: Ni, Jianmo; Li, Jiacheng; McAuley, Julian
Publication Date: 2018

The Amazon Product Reviews dataset encompasses a comprehensive collection of 233.1 million customer reviews from Amazon, covering the period from May 1996 to October 2018. It includes various features such as ratings, textual reviews, helpfulness votes, product metadata (descriptions, category information, price, brand, and image features), and links to related products (e.g., also viewed, also bought graphs). The dataset serves as a valuable resource for analyzing consumer behavior, product trends, and for developing recommendation systems. In total, the dataset is 34 GB in size. It is organized into the following files and subsets:

  • Complete Review Data: A comprehensive file containing all 233.1 million reviews.
  • Ratings Only: A CSV file focusing solely on ratings, excluding textual reviews and metadata.
  • 5-Core: A subset where all users and items have at least five reviews, comprising 75.26 million reviews.
  • Per-Category Data: Reviews and product metadata categorized by specific product types (e.g., Books, Electronics).
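The 5-core condition can be made concrete with a small Python sketch. It repeatedly drops reviews whose user or item has fewer than k reviews until nothing changes, since removing one review can push another user or item below the threshold. This is one common way to compute such a subset and is not necessarily the procedure the dataset authors used; the (user, item) tuple layout and the toy data are assumptions.

```python
from collections import Counter

def k_core(reviews, k=5):
    """Keep only reviews whose user and item both have at least k reviews."""
    reviews = list(reviews)
    while True:
        user_counts = Counter(user for user, _ in reviews)
        item_counts = Counter(item for _, item in reviews)
        kept = [
            (user, item)
            for user, item in reviews
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # Stop once a full pass removes nothing.
        if len(kept) == len(reviews):
            return kept
        reviews = kept

# Toy (user, item) pairs with k=2 to keep the example small.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
core = k_core(toy, k=2)
print(core)
```

In the toy data, user u3 has a single review and is removed, while every remaining user and item still has two reviews.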

Marketing Bias data

Creators: Wan, Mengting; Ni, Jianmo; Misra, Rishabh; McAuley, Julian
Publication Date: 2020

This dataset contains attributes of products sold on ModCloth and Amazon (in particular, attributes about how the products are marketed), which may introduce biases in recommendation systems. It is designed to facilitate research on marketing biases in product recommendations. The data also includes user/item interactions for recommendation.

The dataset amounts to 0.09 GB in size and is built upon two processed subsets:

  • ModCloth Dataset: Contains product attributes from the ModCloth platform.
  • Electronics Dataset: Comprises product attributes from Amazon’s Electronics category.

In total, the dataset includes 99,893 reviews for ModCloth and 1,292,954 reviews for the Electronics category of Amazon.

Google Local Reviews

Creators: He, Ruining; Kang, Wang-Cheng; McAuley, Julian
Publication Date: 2017

The Google Local Reviews dataset comprises 11,453,845 reviews and ratings from 4,567,431 users on 3,116,785 local businesses. The dataset contains user reviews for local businesses, including variables such as rating, review text, business category, location details (address, GPS, phone number, opening hours), and user and business IDs. It also includes timestamps of reviews, price level, and whether the business is closed. The dataset has a size of 7 GB and spans 48,013 categories of local businesses across five continents, encompassing a diverse range of establishments from restaurants and hotels to parks and shopping malls.

 
