
Pinterest Fashion Compatibility

Creators: Kang, Wang-Cheng; Kim, Eric; Leskovec, Jure; Rosenberg, Charles; McAuley, Julian
Publication Date: 2019

This dataset is a structured collection of images and metadata designed to study the compatibility of fashion products within real-world scenes. It enables detailed analysis of how fashion items appear in different settings and supports applications in machine learning, recommendation systems, and virtual styling tools. One of its key features is the scene-product pairing, where fashion items in real-world images are annotated with bounding boxes and linked to corresponding product images. In total, the dataset includes 47,739 scene images, 38,111 product images, and 93,274 scene-product pairs, making it a comprehensive resource for fashion compatibility research.

The dataset is about 29 MB in size and includes:

  • Scenes: 47,739
  • Products: 38,111
  • Scene-Product Pairs: 93,274

Behance Community Art Data

Creators: He, Ruining; Fang, Chen; Wang, Zhaowen; McAuley, Julian
Publication Date: 2016

This dataset is a small, anonymized version of a larger proprietary dataset and covers likes and image data from the community art website Behance. It supports research in recommender systems, social network analysis, and the study of artistic preferences. User interactions are recorded as “appreciations” (akin to likes) on art items; each appreciation marks a user’s positive acknowledgment of an artwork and serves as a measurable indicator of engagement. The dataset also includes image features extracted from the artworks, so analyses can combine user behavior with visual content characteristics.

In total, the dataset is about 3.5 GB in size and comprises:

  • Users: 63,497
  • Items: 178,788
  • Appreciates (“likes”): 1,000,000

The dataset is structured to include:

  • User Data: Anonymized identifiers representing individual users.

  • Item Data: Identifiers for each artwork, accompanied by associated image features.

  • Appreciation Data: Records of user-item interactions, indicating which user appreciated which artwork.

Facebook URL Shares

Creators: Solomon Messing; Bogdan State; Chaya Nayak; Gary King; Nate Persily
Publication Date: 2018

The data describes web page addresses (URLs) that have been shared on Facebook starting January 1, 2017 and ending about a month before the present day. URLs are included if they were shared by at least 20 unique accounts and shared publicly at least once. We estimate the full data set will contain on the order of 2 million unique URLs shared in 300 million posts per week. The dataset thus provides insight into how web content spreads across Facebook. Researchers can use it to explore patterns in user engagement, the virality of content, and the reach of web pages within the Facebook ecosystem. Because only URLs shared by a minimum number of unique accounts are included, the data covers content with a baseline level of engagement and filters out rarely shared URLs.

The dataset is structured to include the following key components:

  • URL Information: Each entry includes the web page address (URL) that was shared on Facebook.

  • Share Metrics: Data on the number of times each URL was shared, including the count of unique accounts that shared it and the total number of posts containing the URL.

  • Engagement Metrics: Information on user interactions with the shared URLs, such as likes, comments, and shares.
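The inclusion rule described above (at least 20 unique sharing accounts and at least one public share) can be illustrated with a minimal sketch. The input layout, a sequence of `(url, account_id, is_public)` tuples, and the function name `eligible_urls` are assumptions made for illustration; they are not the dataset's actual schema or the creators' pipeline.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Set, Tuple

# Threshold taken from the dataset description.
MIN_UNIQUE_ACCOUNTS = 20

# Hypothetical share event: (url, account_id, is_public).
Share = Tuple[str, Hashable, bool]


def eligible_urls(shares: Iterable[Share]) -> Set[str]:
    """Return URLs shared by >= 20 unique accounts and publicly at least once."""
    accounts_by_url = defaultdict(set)
    publicly_shared = set()
    for url, account_id, is_public in shares:
        accounts_by_url[url].add(account_id)
        if is_public:
            publicly_shared.add(url)
    return {
        url
        for url, accounts in accounts_by_url.items()
        if len(accounts) >= MIN_UNIQUE_ACCOUNTS and url in publicly_shared
    }


if __name__ == "__main__":
    # Made-up events: "a" qualifies, "b" is never public, "c" has too few accounts.
    events = (
        [("a", i, i == 0) for i in range(20)]
        + [("b", i, False) for i in range(25)]
        + [("c", i, True) for i in range(5)]
    )
    print(sorted(eligible_urls(events)))
```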

Multi-aspect Reviews

Creators: Julian McAuley; Jure Leskovec; Dan Jurafsky
Publication Date: 2013

These datasets include reviews with multiple rated dimensions. They are particularly valuable for research in sentiment analysis, recommender systems, and user modeling, as they allow for a nuanced understanding of user opinions beyond overall ratings. The most comprehensive of these are the beer review datasets from RateBeer and BeerAdvocate, which include sensory aspects such as taste, look, feel, and smell. The data set is about 1 GB in size.

RateBeer:

  • Number of users: 40,213
  • Number of items: 110,419
  • Number of ratings/reviews: 2,855,232
  • Timespan: April, 2000 – November, 2011

BeerAdvocate:

  • Number of users: 33,387
  • Number of items: 66,051
  • Number of ratings/reviews: 1,586,259
  • Timespan: January, 1998 – November, 2011

The datasets are structured in a JSON format, with each entry representing a single review that includes:

  • Product Information: Details about the beer being reviewed.

  • User Information: Anonymized identifiers of the reviewers.

  • Review Content: Textual feedback provided by the user.

  • Ratings: Numerical scores for overall satisfaction and specific aspects (appearance, aroma, palate, taste).
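As a minimal sketch of working with per-aspect ratings, the following computes the mean score for each rated dimension. It assumes one JSON object per line and uses hypothetical key names (`appearance`, `aroma`, `palate`, `taste`, `overall`); the actual field names and file layout should be checked against the downloaded data.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical key names for the rated aspects; adjust to the real schema.
ASPECTS = ("appearance", "aroma", "palate", "taste", "overall")


def mean_aspect_ratings(lines: Iterable[str]) -> Dict[str, float]:
    """Average each aspect over all reviews that provide a value for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        for aspect in ASPECTS:
            value = review.get(aspect)
            if value is not None:
                totals[aspect] += float(value)
                counts[aspect] += 1
    return {a: totals[a] / counts[a] for a in ASPECTS if counts[a]}


if __name__ == "__main__":
    # Two made-up reviews, for illustration only.
    sample = [
        '{"appearance": 4, "taste": 3}',
        '{"appearance": 2, "taste": 5, "overall": 4}',
    ]
    print(mean_aspect_ratings(sample))
```

Aspects missing from a review are skipped rather than counted as zero, so each mean reflects only the reviews that actually rated that aspect.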

COVID-19 Twitter Chatter Dataset

Creators: Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo
Publication Date: 2024

Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions, and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The dataset is 14.2 GB in size.

Food.com Recipes and Interactions

Creators: Li, Shuyang
Publication Date: 2019

The Food.com Recipes and Interactions dataset is a large-scale collection of culinary data, comprising over 180,000 recipes and 700,000 user reviews spanning an 18-year period. Compiled by Shuyang Li and published in 2019, this dataset provides a rich source of information for studying user interactions, culinary trends, and recipe recommendation systems. It originates from Food.com (formerly GeniusKitchen), one of the largest online recipe-sharing platforms, making it a valuable resource for researchers and practitioners in food science, natural language processing, and user behavior analysis.

The data set is 0.89 GB in size and consists of two primary components: recipe data and user interaction data. The recipe data contains structured information about each recipe, including the recipe ID, name tokens (a tokenized version of the recipe title), ingredient tokens, steps tokens (instructions in tokenized form), cooking techniques, caloric level, and ingredient IDs corresponding to specific ingredients. These features support analysis of how recipes are structured, categorized, and consumed over time.

The user interaction data tracks how users engage with recipes. It includes the user ID, a list of recipes reviewed, the number of items reviewed, the ratings assigned, and the total number of ratings provided by each user. This structure enables research into user preference modeling, recipe popularity trends, and personalized recipe recommendation systems.

MovieTweetings

Creators: Dooms, Simon; De Pessemier, Toon; Martens, Luc
Publication Date: 2013

MovieTweetings is a dataset consisting of ratings on movies that were contained in well-structured tweets on Twitter. The goal of this dataset is to provide the RecSys community with a live, natural and always up-to-date movie ratings dataset. The dataset has been actively collecting ratings since February 28, 2013, and will be updated as much as possible to incorporate rating data from the newest tweets available. The dataset includes 921,398 ratings from 71,707 unique users. The ratings contained in the tweets are scaled from 0 to 10, as is the norm on the IMDb platform. In total, the dataset has a size of 26.2 MB and consists of two main files:

  • ratings.dat contains extracted ratings, structured as:
    user_id::movie_id::rating::rating_timestamp

    • user_id: Unique identifier for the user.
    • movie_id: IMDb identifier for the movie.
    • rating: User’s score on a 10-star scale.
    • rating_timestamp: Unix timestamp when the rating was extracted.
  • items.dat includes metadata about the rated movies, structured as:
    movie_id::movie_title (movie_year)::genre|genre|genre

    • movie_id: IMDb identifier for the movie.
    • movie_title: Name of the movie along with the release year.
    • genre: Pipe-separated list of genres.
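The two `::`-delimited line formats above can be parsed with a short sketch like the following. The type and function names (`Rating`, `Movie`, `parse_rating`, `parse_movie`) and the sample lines are illustrative assumptions, and the sketch assumes movie titles do not themselves contain `::`.

```python
import re
from typing import List, NamedTuple, Optional


class Rating(NamedTuple):
    user_id: str
    movie_id: str  # kept as a string to preserve any leading zeros
    rating: int
    rating_timestamp: int  # Unix timestamp


class Movie(NamedTuple):
    movie_id: str
    title: str
    year: Optional[int]
    genres: List[str]


# Matches "Title (1995)" and captures the title and the four-digit year.
_TITLE_YEAR = re.compile(r"^(.*) \((\d{4})\)$")


def parse_rating(line: str) -> Rating:
    """Parse one line of ratings.dat: user_id::movie_id::rating::rating_timestamp."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return Rating(user_id, movie_id, int(rating), int(timestamp))


def parse_movie(line: str) -> Movie:
    """Parse one line of items.dat: movie_id::movie_title (movie_year)::genre|genre."""
    movie_id, title_year, genre_field = line.rstrip("\n").split("::")
    match = _TITLE_YEAR.match(title_year)
    if match:
        title, year = match.group(1), int(match.group(2))
    else:
        title, year = title_year, None  # no recognizable year suffix
    genres = genre_field.split("|") if genre_field else []
    return Movie(movie_id, title, year, genres)


if __name__ == "__main__":
    # Illustrative sample lines, not taken from the dataset.
    print(parse_rating("1::0114508::8::1381006850"))
    print(parse_movie("0114508::Species (1995)::Horror|Sci-Fi"))
```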

Customer Support on Twitter

Creators: Axelbrooke, Stuart
Publication Date: 2017

The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies intended to aid innovation in natural language understanding and conversational models, and to support the study of modern customer support practices and their impact. Compiled by Stuart Axelbrooke in 2017, this dataset encompasses tweets and replies from prominent companies such as Apple, Amazon, Uber, Delta, and Spotify. It is a useful resource for researchers interested in automated response generation, sentiment analysis, and conversational flow modeling.

The dataset is approximately 516.53 MB in size and is designed for the analysis of conversation dynamics. Each tweet entry has a unique, anonymized tweet ID (tweet_id), an anonymized user ID (author_id), a timestamp (created_at), and the tweet text (text), where sensitive information such as phone numbers and email addresses has been masked to ensure privacy. The inbound field distinguishes inbound tweets, which are directed at companies by customers, from outbound tweets, which are responses from the companies. Additionally, the in_response_to_tweet_id and response_tweet_id fields allow entire conversation threads to be reconstructed by linking tweets to their respective responses.
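Thread reconstruction from these linking fields can be sketched as follows. The sketch walks from a tweet back to the start of its conversation using only in_response_to_tweet_id, and assumes that field holds a single tweet ID or is empty. The helper names (`index_by_id`, `thread_to_root`) and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

Tweet = Dict[str, Any]


def index_by_id(tweets: List[Tweet]) -> Dict[str, Tweet]:
    """Build a lookup table from tweet_id to the tweet record."""
    return {tweet["tweet_id"]: tweet for tweet in tweets}


def thread_to_root(tweets_by_id: Dict[str, Tweet], tweet_id: str) -> List[Tweet]:
    """Follow in_response_to_tweet_id links back to the first tweet.

    Returns the chain in chronological order (root first). The walk stops
    when a parent is missing from the data or a cycle is detected.
    """
    chain: List[Tweet] = []
    seen = set()
    current = tweet_id
    while current and current in tweets_by_id and current not in seen:
        seen.add(current)
        tweet = tweets_by_id[current]
        chain.append(tweet)
        current = tweet.get("in_response_to_tweet_id") or None
    chain.reverse()
    return chain


# Made-up rows for illustration: a customer tweet, a company reply, a follow-up.
sample_tweets = [
    {"tweet_id": "1", "author_id": "u1", "inbound": True,
     "text": "My order is late.", "in_response_to_tweet_id": ""},
    {"tweet_id": "2", "author_id": "SupportDesk", "inbound": False,
     "text": "Sorry to hear that. Can you share details?", "in_response_to_tweet_id": "1"},
    {"tweet_id": "3", "author_id": "u1", "inbound": True,
     "text": "Sent via DM.", "in_response_to_tweet_id": "2"},
]

if __name__ == "__main__":
    for tweet in thread_to_root(index_by_id(sample_tweets), "3"):
        print(tweet["author_id"], ":", tweet["text"])
```

Walking parent links recovers a single path through the conversation; collecting every branch of a thread would additionally require following response_tweet_id from the root.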
