Social circles: Facebook

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. The data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks, with 4,039 nodes and 88,234 edges in total. The data has been anonymized by replacing the Facebook-internal id of each user with a new value.

Instagram Influencer Marketing Dataset

Creators: Kim, Seungbae; Jiang, Jyun-Yu; Nakada, Masaki; Han, Jinyoung; Wang, Wei
Publication Date: 2020

This dataset contains 33,935 Instagram influencers classified into nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other. We collect 300 posts per influencer, giving 10,180,500 Instagram posts in the dataset. The dataset includes two types of files: post metadata and image files. Post metadata files are in JSON format and contain the following information: caption, usertags, hashtags, timestamp, sponsorship, likes, comments, etc. Image files are in JPEG format, and the dataset contains 12,933,406 image files since a post can have more than one image file. If a post has only one image file, the JSON file and the corresponding image file have the same name. However, if a post has more than one image, the JSON file and the corresponding image files have different names. Therefore, we also provide a JSON-Image_mapping file that lists the image files corresponding to each post metadata file.
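The file-naming rule above can be sketched in Python. This is only an illustration: the layout of the JSON-Image_mapping file is not described here, so the sketch assumes it has already been loaded into a dictionary from JSON file name to a list of image file names, and that image files use a `.jpg` extension. The function name `images_for_post` and the sample file names are hypothetical.

```python
from pathlib import PurePath

# Post count stated in the description: 300 posts for each of 33,935 influencers.
INFLUENCERS = 33_935
POSTS_PER_INFLUENCER = 300
TOTAL_POSTS = INFLUENCERS * POSTS_PER_INFLUENCER  # 10,180,500


def images_for_post(json_name, mapping):
    """Return the image file names that belong to one post metadata file.

    `mapping` is an ASSUMED in-memory form of the JSON-Image_mapping file:
    a dict from JSON file name to a list of image file names.
    """
    if json_name in mapping:
        # Multi-image posts: names differ, so the mapping is required.
        return list(mapping[json_name])
    # Single-image posts: the image shares the JSON file's name
    # (the ".jpg" extension is an assumption).
    return [PurePath(json_name).with_suffix(".jpg").name]


# Hypothetical file names, for illustration only.
example_mapping = {"post_b.json": ["img_1.jpg", "img_2.jpg"]}
single = images_for_post("post_a.json", example_mapping)
multi = images_for_post("post_b.json", example_mapping)
```

Check the actual mapping file in the release before relying on this structure.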

If you want to use this dataset, please cite it accordingly. The data can be accessed on the respective website link below.

“Multimodal Post Attentive Profiling for Influencer Marketing,” Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han and Wei Wang. In Proceedings of The Web Conference (WWW ’20), ACM, 2020.

Trip Advisor Hotel Reviews

Creators: Alam, Md. H.; Ryu, Woo-Jong; Lee, SangKeun
Publication Date: 2020

Hotels play a crucial role in traveling, and with increased access to information, new ways of selecting the best ones have emerged. With this dataset, consisting of 20k reviews crawled from Tripadvisor, you can explore what makes a great hotel and maybe even use what you learn in your travels!

Flickr30k

Creators: Young, Peter; Lai, Alice; Hodosh, Micah; Hockenmaier, Julia
Publication Date: 2014

To produce the denotation graph, we have created an image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images. The new images and captions (compared to an earlier, smaller Flickr dataset) focus on people involved in everyday activities and events.

Google Local Reviews

Creators: He, Ruining; Kang, Wang-Cheng; McAuley, Julian
Publication Date: 2017

We introduce a new dataset from Google which contains 11,453,845 reviews and ratings from 4,567,431 users on 3,116,785 local businesses (with detailed name, hours, phone number, address, GPS, etc.). There are as many as 48,013 categories of local businesses distributed over five continents, ranging from restaurants, hotels, parks, shopping malls, movie theaters, schools, military recruiting offices, bird control, mediation services, etc.

IMDb Movie Reviews Dataset

Creators: Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher
Publication Date: 2011

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.

The dataset contains equal numbers of positive and negative reviews. Only highly polarized reviews are considered: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
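The score thresholds above amount to a simple labeling rule, sketched here in Python. The function name `label_from_rating` is hypothetical and is not part of the dataset release; the sketch only restates the thresholds given in the description.

```python
def label_from_rating(rating):
    """Map a 1-10 rating to a sentiment label using the thresholds above.

    Returns "negative" for ratings <= 4, "positive" for ratings >= 7,
    and None for ratings of 5 or 6, which fall between the two thresholds
    and therefore match neither class.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    return None


# Label every possible rating once.
labels = {rating: label_from_rating(rating) for rating in range(1, 11)}
```

Ratings of 5 and 6 receive no label under this rule, which is why only clearly polarized reviews appear in the labeled set.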

The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided – please download the data directly from the data provider’s website.

When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Food.com Recipe & Review Data

Creators: Majumder, Bodhisattwa P.; Li, Shuyang; Ni, Jianmo; McAuley, Julian
Publication Date: 2019

This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen), an online recipe aggregator. This dataset contains three sets of data from Food.com:

Interaction splits

  • interactions_test.csv
  • interactions_validation.csv
  • interactions_train.csv

Preprocessed data for result reproduction

In this format, the recipe text metadata is tokenized via the GPT subword tokenizer, with special tokens such as start-of-step added.

  • PP_recipes.csv
  • PP_users.csv

To convert these CSV files into the pickle format required to run our code off-the-shelf, read each one with pandas.read_csv and write it back out with pandas.to_pickle.
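The conversion described above might look like the following minimal sketch. The helper name `csv_to_pickle` and the `.pkl` output naming are assumptions made for illustration; the file names the authors' code actually expects are not stated here, so check their repository before running it.

```python
from pathlib import Path

import pandas as pd


def csv_to_pickle(csv_path, pickle_path=None):
    """Read a CSV with pandas and write the resulting DataFrame as a pickle.

    If no output path is given, the pickle is written next to the CSV with
    a ".pkl" extension (an assumed naming convention).
    """
    csv_path = Path(csv_path)
    if pickle_path is None:
        pickle_path = csv_path.with_suffix(".pkl")
    frame = pd.read_csv(csv_path)
    pd.to_pickle(frame, pickle_path)
    return Path(pickle_path)


# Convert the preprocessed files if they are present in the working directory.
for name in ("PP_recipes.csv", "PP_users.csv"):
    if Path(name).exists():
        print("wrote", csv_to_pickle(name))
```

Note that pandas.read_csv loads list-valued columns as plain strings; if the downstream code expects Python lists, those columns would need to be parsed before pickling.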

 
