Social Recommendation Data

Creators: Cai, Chenwei; He, Ruining; McAuley, Julian; Zhao, Tong; King, Irwin
Publication Date: 2017

These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews). The user ratings support detailed analysis of user preferences, and the social (or trust) relationships make it possible to study how social connections influence user behavior and recommendations. The dataset is approximately 660 MB in size and includes:

Number of Observations:

  • LibraryThing:

    • Users: 73,882
    • Items: 337,561
    • Ratings: 979,053
    • Social Relations: 120,536
  • Epinions:

    • Users: 116,260
    • Items: 41,269
    • Ratings/Feedback: 181,394
    • Social Relations: 181,304

The dataset is structured into:

  • User Information: Anonymized user identifiers.

  • Item Information: Identifiers for items such as books or products.

  • Ratings/Feedback: User-provided ratings or feedback scores for items.

  • Social Relations: Mappings of social or trust relationships between users.
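
As a rough illustration of how sparse these rating data are, the counts listed above can be turned into a rating-matrix density. This is a minimal sketch, not part of the dataset's tooling: the helper name `rating_density` is hypothetical, and it assumes at most one rating per user-item pair, which the listing does not confirm.

```python
# Counts copied from the listing above; names here are illustrative only.
DATASETS = {
    "LibraryThing": {"users": 73_882, "items": 337_561, "ratings": 979_053},
    "Epinions": {"users": 116_260, "items": 41_269, "ratings": 181_394},
}


def rating_density(users: int, items: int, ratings: int) -> float:
    """Fraction of all user-item pairs that carry a rating.

    Assumption: each user rates a given item at most once, so the number
    of ratings cannot exceed users * items.
    """
    return ratings / (users * items)


if __name__ == "__main__":
    for name, counts in DATASETS.items():
        print(f"{name}: density = {rating_density(**counts):.2e}")
```

Under that assumption, both matrices are extremely sparse, which is typical for recommendation data and is one reason the social relations are useful as side information.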


Twitch Livestreaming Interactions

Creators: Rappaz, Jérémie; McAuley, Julian; Aberer, Karl
Publication Date: 2021

This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes for 43 days. The dataset captures real-time interactions between users and streamers at a high temporal resolution, allowing for detailed analysis of how audiences engage with live content. The dataset is 6.47 GB in size and covers a 43-day period in July 2019, with data collected every 10 minutes, resulting in 6,148 time steps.

Overall, it includes:

  • Users: 100k
  • Streamers (items): 162.6k
  • Interactions: 3M
  • Time steps: 6,148

Structurally, the dataset encompasses the following information:

  • User ID: Anonymized identifier for each user.

  • Stream ID: Identifier for each streaming session.

  • Streamer Username: Name of the channel or streamer.

  • Time Start: The initial time step (in 10-minute intervals) when the user was observed in the chat.

  • Time Stop: The final time step (in 10-minute intervals) when the user was observed in the chat.
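
Because Time Start and Time Stop are expressed as 10-minute time steps, an interaction can be converted into an approximate observed duration. The sketch below is illustrative only: the `Interaction` class, its field names, and the choice to treat both endpoints as inclusive are assumptions, not something the dataset description specifies.

```python
from dataclasses import dataclass

STEP_MINUTES = 10  # sampling interval stated in the dataset description


@dataclass(frozen=True)
class Interaction:
    """One row of the dataset; field names are hypothetical."""
    user_id: str
    stream_id: str
    streamer: str
    time_start: int  # first time step at which the user was observed
    time_stop: int   # last time step at which the user was observed


def observed_minutes(row: Interaction) -> int:
    """Approximate time the user spent in the chat, in minutes.

    Assumption: both endpoints are inclusive, so a user seen at a single
    time step counts as one 10-minute interval.
    """
    if row.time_stop < row.time_start:
        raise ValueError("time_stop must not precede time_start")
    return (row.time_stop - row.time_start + 1) * STEP_MINUTES
```

For example, under the inclusive-endpoint assumption, a user observed from time step 5 through time step 7 would be counted as present for 30 minutes.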

One Million Comic Book Panels

Creators: Iyyer, Mohit; Manjunatha, Varun; Guha, Anupam; Vyas, Yogarshi; Boyd-Graber, Jordan; Daumé III, Hal; Davis, Larry
Publication Date: 2016

Visual narrative is often a combination of explicit information and judicious omissions, relying on the viewer to supply missing details. In comics, most movements in time and space are hidden in the “gutters” between panels. To follow the story, readers logically connect panels together by inferring unseen actions through a process called “closure”. While computers can now describe what is explicitly depicted in natural images, in this paper we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels. We construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context. Various deep neural architectures underperform human baselines on these tasks, suggesting that COMICS contains fundamental challenges for both vision and language.

Overall, the dataset is organized into three components:

  • Panel Images: Each panel is stored as an image file, capturing the visual content of the comic scenes.

  • Textbox Transcriptions: Textual content from each panel is extracted using OCR, allowing for analysis of dialogues, narratives, and other textual elements.

  • Metadata: Additional information such as panel dimensions, position within the page, and associated comic book identifiers is included to facilitate detailed analyses.

Google Restaurants

Creators: Zhankui He, Yan; Li, Jiacheng; Zhang, Tianyang; McAuley, Julian
Publication Date: 2022

This is a multi-modal dataset of restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as other metadata for each restaurant. The combination of textual reviews, numerical ratings, and visual content gives a broad view of user experiences and restaurant characteristics. This makes the dataset useful for developing and testing recommendation systems, conducting sentiment analysis, and exploring the relationships between visual content and user perceptions of dining establishments. The total size of the dataset is approximately 120 GB, and it is structured into:

  • Restaurant Metadata: Information such as restaurant names, locations, contact details, and operational hours.

  • User Reviews: Textual feedback and numerical ratings provided by users.

  • Images: Photographs uploaded by users, showcasing various aspects of the restaurants.

Behance Community Art Data

Creators: He, Ruining; Fang, Chen; Wang, Zhaowen; McAuley, Julian
Publication Date: 2016

This dataset is a small, anonymized version of a larger proprietary dataset, covering likes and image data from the community art website Behance. It provides insight into user engagement with digital art, making it a resource for research in recommender systems, social network analysis, and the study of artistic preferences. The dataset captures user interactions in the form of “appreciations” (akin to likes) on various art items. Each appreciation reflects a user’s positive acknowledgment of an artwork, offering a measurable indicator of engagement. The dataset also includes image features extracted from the artworks, facilitating analyses that combine user behavior with visual content characteristics.

In total, the dataset is about 3.5 GB in size and encompasses:

  • Users: 63,497
  • Items: 178,788
  • Appreciates (“likes”): 1,000,000

The dataset is structured to include:

  • User Data: Anonymized identifiers representing individual users.

  • Item Data: Identifiers for each artwork, accompanied by associated image features.

  • Appreciation Data: Records of user-item interactions, indicating which user appreciated which artwork.

Pinterest Fashion Compatibility

Creators: Kang, Wang-Cheng; Kim, Eric; Leskovec, Jure; Rosenberg, Charles; McAuley, Julian
Publication Date: 2019

This dataset is a structured collection of images and metadata designed to study the compatibility of fashion products within real-world scenes. It enables detailed analysis of how fashion items appear in different settings and supports applications in machine learning, recommendation systems, and virtual styling tools. One of its key features is the scene-product pairing, where fashion items in real-world images are annotated with bounding boxes and linked to corresponding product images. In total, the dataset includes 47,739 scene images, 38,111 product images, and 93,274 scene-product pairs, making it a comprehensive resource for fashion compatibility research.

The dataset is about 29 MB in size and includes:

  • Scenes: 47,739
  • Products: 38,111
  • Scene-Product Pairs: 93,274

EndoMondo Fitness Tracking Data

Creators: Ni, Jianmo; Muhlstein, Larry; McAuley, Julian
Publication Date: 2019

This is a collection of workout logs from users of EndoMondo. It contains sequential sensor data such as GPS coordinates (latitude, longitude, altitude), heart rate measurements, speed, and distance, making it valuable for studying workout patterns, performance tracking, and personalized fitness recommendations. Additionally, it includes user metadata such as anonymized user IDs, gender, and sport type, along with contextual factors like weather conditions. The dataset has a size of approximately 2.9 GB and consists of 1,104 users with 253,020 recorded workouts.

The dataset covers multiple components:

  • User Information: Anonymized user identifiers and gender.

  • Workout Details: Each workout log includes sport type, sequential data for GPS coordinates (latitude, longitude, altitude) with timestamps, heart rate measurements, and derived metrics such as speed and distance.
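
To make the relationship between the GPS sequence and a derived metric such as distance concrete, the sketch below sums great-circle (haversine) distances between consecutive points. This is only one plausible way to derive distance and is not claimed to be how the dataset's own values were computed; the function names and the `(latitude, longitude)` tuple layout are assumptions, and altitude is ignored.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a standard approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def track_distance_km(points: list[tuple[float, float]]) -> float:
    """Total path length over consecutive (latitude, longitude) samples.

    Assumption: samples are ordered by timestamp; altitude changes are ignored.
    """
    return sum(haversine_km(*p, *q) for p, q in zip(points, points[1:]))
```

Dividing the result by the elapsed time between the first and last timestamps would give an average speed under the same assumptions.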

CrowdTangle Platform and API

Creators: Garmur, Matt; King, Gary; Mukerjee, Zagreb; Persily, Nate; Silverman, Brandon
Publication Date: 2019

This document describes the CrowdTangle API and user interface being provided to researchers by Social Science One under its collaboration framework with Facebook. CrowdTangle is a content discovery and analytics platform designed to give content creators the data and insights they need to succeed. This dataset enables users to monitor public content interactions, track trends, and identify influential accounts. The CrowdTangle API surfaces stories, along with data to measure their social performance and identify influencers. This codebook describes the data’s scope, structure, and fields.

CrowdTangle’s dataset offers insights into public posts made by pages, groups, or verified profiles that have either surpassed 100,000 likes since 2014 or have been tracked by any active API user. It includes all public posts made since 2014 by accounts meeting these criteria.

Key features include:

  • Content Discovery: Access to real-time data on trending posts, facilitating the identification of viral content and emerging topics.

  • Performance Analytics: Metrics such as likes, shares, comments, and interaction rates, allowing for the assessment of content engagement.

  • Influencer Identification: Tools to pinpoint accounts with significant influence within specific niches or broader audiences.
