Social circles: Facebook

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. The data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks, which support analysis of the structure and characteristics of social connections. It comprises 4,039 nodes and 88,234 edges. The data has been anonymized by replacing the Facebook-internal ids for each user with a new value. The dataset is approximately 0.01 GB in size.

The dataset is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes (e.g., political affiliation) are replaced with generic labels (e.g., ‘anonymized feature 1’).

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.
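
As a rough illustration of the ego-network definition above, the following Python sketch extracts a user's direct friends and the connections among those friends from an undirected edge list. The function name `ego_network` and the toy edge list are hypothetical and are not taken from the dataset files.

```python
from collections import defaultdict


def ego_network(edges, ego):
    """Return (friends, friend_edges) for `ego` in an undirected edge list.

    friends      : set of nodes directly connected to the ego
    friend_edges : set of (u, v) pairs, u < v, linking two friends of the ego
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    friends = set(neighbors.get(ego, set()))
    friend_edges = {
        tuple(sorted((u, v)))
        for u, v in edges
        if u in friends and v in friends
    }
    return friends, friend_edges


# Toy data for illustration only (not from the dataset).
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
friends, friend_edges = ego_network(toy_edges, ego=0)
print(friends)       # {1, 2, 3}
print(friend_edges)  # {(1, 2)}
```

Here node 4 is excluded because it is a friend of a friend, not a direct friend of the ego.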

Instagram Influencer Marketing Dataset

Creators: Kim, Seungbae; Jiang, Jyun-Yu; Nakada, Masaki; Han, Jinyoung; Wang, Wei
Publication Date: 2020

This dataset contains 33,935 Instagram influencers classified into nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other. With 300 posts collected per influencer, the dataset contains 10,180,500 Instagram posts and is 262 GB in size. It includes two types of files: post metadata and image files. Post metadata files are in JSON format and contain the following information: caption, usertags, hashtags, timestamp, sponsorship, likes, comments, etc. Image files are in JPEG format; the dataset contains 12,933,406 of them, since a post can have more than one image. If a post has only one image, the JSON file and the corresponding image file have the same name. If a post has more than one image, the JSON file and the corresponding image files have different names. A JSON-Image_mapping file is therefore provided that lists the image files corresponding to each post metadata file.
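
The naming rule above can be sketched in Python as follows. The layout of the JSON-Image_mapping file is not described here, so this sketch assumes it has already been loaded into a dictionary from metadata file name to a list of image file names. The function name `images_for_post`, the file names, and the `.jpg` extension are all hypothetical.

```python
import os


def images_for_post(mapping, post_json_name):
    """Return the image file names for one post metadata file.

    `mapping` is ASSUMED to be a dict {metadata file name: [image file names]}.
    If the post is absent from the mapping, fall back to the single-image
    rule: the image shares the metadata file's name (extension assumed .jpg).
    """
    if post_json_name in mapping:
        return list(mapping[post_json_name])
    stem, _ = os.path.splitext(post_json_name)
    return [stem + ".jpg"]


# Hypothetical example data, not taken from the dataset.
example_mapping = {"post_b.json": ["img_1.jpg", "img_2.jpg"]}
print(images_for_post(example_mapping, "post_b.json"))  # ['img_1.jpg', 'img_2.jpg']
print(images_for_post(example_mapping, "post_a.json"))  # ['post_a.jpg']
```

The actual mapping file may list single-image posts as well, in which case the fallback branch is never used.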

If you want to use this dataset, please cite it accordingly. The data can be accessed on the respective website link below.

“Multimodal Post Attentive Profiling for Influencer Marketing,” Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han and Wei Wang. In Proceedings of The Web Conference (WWW ’20), ACM, 2020.

Trip Advisor Hotel Reviews

Creators: Alam, Md. H.; Ryu, Woo-Jong; Lee, SangKeun
Publication Date: 2020

Hotels play a crucial role in travel, and increased access to information has opened new ways of selecting the best ones. This dataset consists of 20k reviews crawled from Tripadvisor. It enables the exploration of factors contributing to hotel quality and can be used for sentiment analysis and other natural language processing tasks. The dataset is 0.01 GB in size and pairs each textual hotel review with a numerical rating.

Flickr30k

Creators: Young, Peter; Lai, Alice; Hodosh, Micah; Hockenmaier, Julia
Publication Date: 2014

The Flickr30k dataset consists of 31,783 images, each accompanied by five captions written by human annotators, for a total of 158,915 captions. The images predominantly depict people engaged in everyday activities and events. The dataset serves as a benchmark for sentence-based image description tasks. It has been further enhanced by the Flickr30k Entities extension, which adds 244,000 coreference chains linking mentions of the same entities across different captions for the same image, and associates them with 276,000 manually annotated bounding boxes. This augmentation facilitates tasks such as phrase localization and grounded language understanding.

Google Local Reviews

Creators: He, Ruining; Kang, Wang-Cheng; McAuley, Julian
Publication Date: 2017

The Google Local Reviews dataset comprises 11,453,845 reviews and ratings from 4,567,431 users on 3,116,785 local businesses. It contains user reviews for local businesses, including variables such as rating, review text, business category, location details (address, GPS, phone number, opening hours), and user and business IDs. It also includes timestamps of reviews, price level, and whether the business is closed. The dataset is 7 GB in size and spans 48,013 categories of local businesses across five continents, ranging from restaurants and hotels to parks and shopping malls.

IMDb Movie Reviews Dataset

Creators: Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher
Publication Date: 2011

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning. In total, the dataset is 0.08 GB in size.

The dataset contains an equal number of positive and negative reviews. Only highly polarized reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
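
The score thresholds above can be written as a small Python function. This is only a sketch of the stated rule; the function name `sentiment_label` is hypothetical and is not part of the dataset release.

```python
def sentiment_label(score):
    """Map a 1-10 review score to a sentiment label using the stated thresholds.

    Returns "negative" for scores <= 4, "positive" for scores >= 7,
    and None for scores in between, which fall outside both labeled classes.
    """
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None


for s in (1, 4, 5, 6, 7, 10):
    print(s, sentiment_label(s))
```

Scores of 5 and 6 return None because they meet neither threshold.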

The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided – please download the data directly from the data provider’s website.

When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Food.com Recipe & Review Data

Creators: Majumder, Bodhisattwa P.; Li, Shuyang; Ni, Jianmo; McAuley, Julian
Publication Date: 2019

This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen), an online recipe aggregator. The collection supports analysis of culinary trends, user preferences, and recipe characteristics over nearly two decades. The dataset is 0.85 GB in size and contains three sets of data from Food.com:

Interaction splits

  • interactions_test.csv
  • interactions_validation.csv
  • interactions_train.csv

Preprocessed data for result reproduction

In this format, the recipe text metadata is tokenized with the GPT subword tokenizer, with special tokens (start-of-step, etc.) added.

  • PP_recipes.csv
  • PP_users.csv

To convert these files into the pickle format required to run the authors' code off-the-shelf, you may use pandas.read_csv and pandas.to_pickle.
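
A minimal sketch of that conversion is shown below. The helper name `csv_to_pickle` and the `.pkl` output extension are assumptions; the authors' code may expect particular output file names, so check their repository before relying on these.

```python
from pathlib import Path

import pandas as pd


def csv_to_pickle(csv_path, pickle_path=None):
    """Read a CSV with pandas and write it back out as a pickle file.

    If `pickle_path` is not given, the CSV's name is reused with a
    `.pkl` extension (an assumed naming convention).
    """
    csv_path = Path(csv_path)
    if pickle_path is None:
        pickle_path = csv_path.with_suffix(".pkl")
    df = pd.read_csv(csv_path)
    pd.to_pickle(df, pickle_path)
    return Path(pickle_path)


if __name__ == "__main__":
    # Convert the preprocessed files if they are in the current directory.
    for name in ("PP_recipes.csv", "PP_users.csv"):
        if Path(name).exists():
            print("wrote", csv_to_pickle(name))
```

The pickles can be loaded again with pandas.read_pickle.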

 
