Resources by Stefan

Amazon product co-purchasing network metadata

Creators: Leskovec, Jure
Publication Date: 2006

The data was collected by crawling the Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes). It is valuable for analyzing product relationships, customer behavior, and the dynamics of product co-purchasing networks. For each product the following information is available:

Title
Salesrank
List of similar products (that get co-purchased with the current product)
Detailed product categorization
Product reviews: time, customer, rating, number of votes, number of people that found the review helpful.

The data was collected in summer 2006. It has a size of 201 MB and is structured into:

  • Product Metadata: Information such as product ID, ASIN, title, group, sales rank, similar products, and categories.

  • Product Reviews: Details including review time, customer ID, rating, number of votes, and helpfulness votes.
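
As a minimal sketch, one product record and the co-purchase links derived from its list of similar products might be represented as follows. The class name `Product`, its field names, and the sample ASINs are hypothetical, chosen only to mirror the fields listed above; they do not describe the dataset's actual file layout.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """One product record; field names are illustrative, not the file's own."""
    product_id: int
    asin: str
    title: str
    group: str
    salesrank: int
    similar: list[str] = field(default_factory=list)     # ASINs of co-purchased products
    categories: list[str] = field(default_factory=list)


def copurchase_edges(products):
    """Return the set of undirected co-purchase edges as sorted ASIN pairs."""
    edges = set()
    for p in products:
        for other in p.similar:
            if other != p.asin:
                # Sorting the pair makes (a, b) and (b, a) the same edge.
                edges.add(tuple(sorted((p.asin, other))))
    return edges


# Made-up sample records for illustration only.
sample = [
    Product(1, "A001", "First title", "Book", 1200, similar=["A002", "A003"]),
    Product(2, "A002", "Second title", "Music", 450, similar=["A001"]),
]
edges = copurchase_edges(sample)
print(sorted(edges))
```

Storing each pair in sorted order is one simple way to treat the "similar products" lists as an undirected network.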

Yelp Open Dataset

Creators: Yelp, Inc.
Publication Date: 2015
The Yelp dataset offers a collection of real-world data from Yelp, intended for educational and academic purposes. It encompasses information about businesses, user reviews, photos, and check-ins, providing valuable insights into local commerce and consumer behavior. In total, this dataset contains 6.9M online reviews for 150k businesses and covers 11 metropolitan areas. It also includes more than 200,000 images related to the reviews. It has a size of 4.9 GB compressed and 10.9 GB uncompressed, and is distributed as JSON files. The data consists of multiple sub-datasets:

  1. Yelp Business data: Contains business data including location data, attributes, and categories.
  2. Yelp Review data: Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
  3. Yelp User data: User data including the user’s friend mapping and all the metadata associated with the user.
  4. Yelp Checkin data: Checkins on a business.
  5. Yelp Tip data: Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
  6. Yelp Photo data: Contains photo data including the caption and classification (one of “food”, “drink”, “menu”, “inside” or “outside”).

The data is available as JSON files. You can use it to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps.
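
A rough sketch of how the review and business data could be linked through `business_id`. It assumes each file holds one JSON object per line; the in-memory sample records and the `name` and `stars` fields are made up for illustration, and only `user_id` and `business_id` are taken from the description above.

```python
import io
import json
from collections import defaultdict


def read_json_lines(handle):
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Tiny in-memory stand-ins for the business and review files (made-up data).
business_file = io.StringIO(
    '{"business_id": "b1", "name": "Cafe One"}\n'
    '{"business_id": "b2", "name": "Taco Two"}\n'
)
review_file = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 4}\n'
    '{"user_id": "u2", "business_id": "b1", "stars": 2}\n'
    '{"user_id": "u1", "business_id": "b2", "stars": 5}\n'
)

names = {b["business_id"]: b["name"] for b in read_json_lines(business_file)}

# Group review ratings by the business they were written for.
stars_by_business = defaultdict(list)
for review in read_json_lines(review_file):
    stars_by_business[review["business_id"]].append(review["stars"])

average_stars = {
    names[bid]: sum(stars) / len(stars) for bid, stars in stars_by_business.items()
}
print(average_stars)
```

Reading line by line keeps memory use low, which matters for files of several gigabytes.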

 

Twitter 2010 data set

Creators: Lerman, Kristina; Ghosh, Rumi; Surachawala, Tawan
Publication Date: 2010

The Twitter 2010 dataset contains Twitter activity data, focusing on tweets containing URLs and the follower-followee relationships among users during October 2010. It is particularly valuable for studying information diffusion, social network structures, and user interaction patterns on Twitter. The dataset has a size of 21.5 kB and includes 2,859,764 tweets that contain URLs, offering insights into content-sharing behaviors, and 736,930 users who posted these tweets. Additionally, it features 36,743,448 follower-followee relationships, allowing for the reconstruction of the social graph of active users. Each tweet record contains metadata such as tweet ID, creation date, source device, in-reply-to information, and the user’s follower and followee counts.
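
To illustrate what reconstructing the social graph from follower-followee relationships could look like, here is a minimal sketch. The function name `build_follow_graph`, the pair ordering (follower first), and the sample user IDs are assumptions for illustration; the real file layout may differ.

```python
from collections import defaultdict


def build_follow_graph(pairs):
    """Map each user to the users they follow and to the users following them."""
    following = defaultdict(set)
    followers = defaultdict(set)
    for follower, followee in pairs:
        following[follower].add(followee)
        followers[followee].add(follower)
    return following, followers


# Made-up (follower, followee) pairs.
pairs = [(1, 2), (3, 2), (2, 1), (3, 1)]
following, followers = build_follow_graph(pairs)

follower_counts = {user: len(f) for user, f in followers.items()}
print(follower_counts)
```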

Web data: Amazon Fine Foods reviews

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of reviews of fine foods from Amazon. The data is 116 MB in size and spans a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes detailed information such as the product’s unique identifier (ASIN), user ID, profile name, helpfulness rating, score, time of review (in Unix time), summary, and the full text of the review. This dataset is particularly valuable for analyzing consumer behavior, sentiment analysis, and the evolution of user expertise in online reviews.

The dataset is organized with each review capturing multiple attributes:

  • Product Information: Including the product’s unique identifier (ASIN).

  • User Information: Such as user ID and profile name.

  • Review Details: Encompassing helpfulness rating, score, time of review, summary, and the full text.
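
Because the review time is stored as Unix time, it usually needs converting before analysis. The sketch below shows one way to do that, together with a helpfulness ratio. The dictionary keys, the sample values, and the choice to represent the helpfulness rating as two vote counts are assumptions made for illustration.

```python
from datetime import datetime, timezone


def review_datetime(unix_seconds):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an aware datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def helpfulness_ratio(helpful_votes, total_votes):
    """Fraction of votes marking the review helpful, or None when nobody voted."""
    if total_votes == 0:
        return None
    return helpful_votes / total_votes


# Made-up review; key names are illustrative.
review = {"asin": "B000EXAMPLE", "score": 5, "time": 1349049600,
          "helpful_votes": 3, "total_votes": 4}

when = review_datetime(review["time"])
ratio = helpfulness_ratio(review["helpful_votes"], review["total_votes"])
print(when.isoformat())  # 2012-10-01T00:00:00+00:00
print(ratio)             # 0.75
```

Returning `None` for reviews without votes avoids a division by zero and keeps unvoted reviews distinguishable from unhelpful ones.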

 

Web data: Amazon movie reviews

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset is a collection of approximately 8 million movie reviews from Amazon, spanning over a decade up to October 2012. It is particularly valuable for analyzing consumer behavior, sentiment analysis, and the evolution of user expertise in online reviews. In total, the dataset has a size of 3.1 GB. Each review includes detailed information such as the product’s unique identifier (ASIN), user ID, profile name, helpfulness rating, score, time of review (in Unix time), summary, and the full text of the review.

The dataset is organized with each review capturing multiple attributes:

  • Product Information: Including the product’s unique identifier (ASIN).

  • User Information: Such as user ID and profile name.

  • Review Details: Encompassing helpfulness rating, score, time of review, summary, and the full text.
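
With a dataset of this size, it can help to aggregate in a single pass instead of loading every review into memory. The following sketch computes a mean score per product from any iterable of review records. The generator `sample_reviews` stands in for a hypothetical parser that streams reviews from disk, and the key names and values are made up.

```python
from collections import defaultdict


def mean_score_per_product(reviews):
    """Compute the mean score per ASIN in one pass over an iterable of reviews."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for review in reviews:
        totals[review["asin"]] += review["score"]
        counts[review["asin"]] += 1
    return {asin: totals[asin] / counts[asin] for asin in totals}


def sample_reviews():
    """Stand-in for a parser that yields one review at a time (made-up data)."""
    yield {"asin": "M1", "user_id": "u1", "score": 4.0}
    yield {"asin": "M1", "user_id": "u2", "score": 5.0}
    yield {"asin": "M2", "user_id": "u1", "score": 2.0}


means = mean_score_per_product(sample_reviews())
print(means)
```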

 

Youtube social network and ground-truth communities

Creators: Yang, Jaewon; Leskovec, Jure
Publication Date: 2012

YouTube is a video-sharing web site that includes a social network. In the YouTube social network, users form friendships and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al. We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component.

This data, collected in 2012, is particularly valuable for studying community structures, information diffusion, and network dynamics within large-scale social platforms. It has a size of 0.001 GB and comprises 1,134,890 nodes (users) and 2,987,624 edges (friendship links), reflecting the complex web of user interactions on YouTube. Additionally, it identifies 8,385 ground-truth communities, which are user-defined groups that provide insights into the natural clustering within the network.

Structurally, the dataset includes:

  • Network Data: An undirected graph where nodes represent users and edges denote mutual friendships. This graph captures the largest connected component of the YouTube user network, ensuring a cohesive representation of user interactions.

  • Community Data: Ground-truth communities derived from user-defined groups. Each connected component within these groups is considered a separate community, with only those containing at least three nodes included. For enhanced analysis, the dataset also provides the top 5,000 communities with the highest quality, as detailed in the accompanying research paper.
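
The community rule described above (each connected component within a group counts as its own community, and components with fewer than three nodes are dropped) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code; the function name and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def communities_from_group(edges, members, min_size=3):
    """Split one group into connected components, keeping those with >= min_size nodes."""
    members = set(members)
    adjacency = defaultdict(set)
    for u, v in edges:
        # Keep only edges whose endpoints are both in the group (induced subgraph).
        if u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in sorted(members):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            communities.append(component)
    return communities


# Toy friendship graph and one user-defined group (made-up data).
edges = [(1, 2), (2, 3), (4, 5), (3, 6)]
group = {1, 2, 3, 4, 5}
result = communities_from_group(edges, group)
print(result)
```

In this toy example the group splits into the components {1, 2, 3} and {4, 5}; only the first has at least three nodes, so only it is kept.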

Amazon product co-purchasing network and ground-truth communities

Creators: Yang, Jaewon; Leskovec, Jure
Publication Date: 2012

This dataset provides a comprehensive view of product relationships on Amazon, based on the “Customers Who Bought This Item Also Bought” feature. Products are represented as nodes, and an undirected edge between two products signifies frequent co-purchasing, reflecting consumer buying patterns and product associations. If a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. Each product category provided by Amazon defines a ground-truth community. We regard each connected component in a product category as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. The dataset has a size of 0.01 GB and encompasses 334,863 nodes (products) and 925,872 edges (co-purchasing relationships).

The dataset is structured into:

  • Network Data: An undirected graph where nodes represent products, and edges indicate co-purchasing relationships.

  • Ground-Truth Communities: Each product category defined by Amazon serves as a ground-truth community. Connected components within these categories are treated as separate communities, excluding those with fewer than three nodes. Additionally, the dataset provides the top 5,000 communities with the highest quality, as detailed in the associated research paper.
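
A minimal sketch of loading such an undirected network and counting its nodes, edges, and degrees. It assumes a plain-text edge list with one whitespace-separated pair of integer node IDs per line and `#` comment lines; that layout, the function name, and the sample data are assumptions, not a description of the actual files.

```python
import io
from collections import Counter


def read_undirected_edges(handle):
    """Parse whitespace-separated node pairs, skipping blank and '#' comment lines."""
    edges = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = map(int, line.split()[:2])
        if u != v:
            edges.add((min(u, v), max(u, v)))  # store each undirected edge once
    return edges


# Made-up edge list standing in for the real file.
text = io.StringIO("# FromNodeId ToNodeId\n1\t2\n2\t1\n2\t3\n")
edges = read_undirected_edges(text)

# Each undirected edge contributes one to the degree of both endpoints.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

nodes = set(degree)
print(len(nodes), len(edges), degree.most_common(1))
```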

Google web graph

Creators: Leskovec, Jure; Lang, Kevin J.; Dasgupta, Anirban; Mahoney, Michael W.
Publication Date: 2002

The Google Web Graph dataset offers a detailed representation of the web’s hyperlink structure as captured in 2002. In this dataset, nodes correspond to individual web pages, and directed edges represent hyperlinks from one page to another. This structure is particularly valuable for studying web connectivity, page importance algorithms like PageRank, and the overall topology of the internet during that period. In total, the dataset has a size of 0.02 GB and comprises 875,713 nodes and 5,105,039 directed edges, reflecting the extensive interlinking characteristic of the early 2000s web. Structurally, the dataset is presented as a single directed graph where each node represents a web page, and each directed edge denotes a hyperlink from one page to another. This format facilitates analyses of web page connectivity, identification of influential pages, and exploration of community structures within the web.
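
Since PageRank is mentioned as a typical use of this graph, here is a small, simplified power-iteration sketch on a toy directed graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page sample web are illustrative assumptions; this is not the implementation used by any search engine or library.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over directed (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Made-up four-page web: pages 1, 2 and 3 link to page 0, which links back to page 1.
edges = [(1, 0), (2, 0), (3, 0), (0, 1)]
ranks = pagerank(edges)
top_page = max(ranks, key=ranks.get)
print(top_page, ranks)
```

Page 0 receives the most incoming links in this toy graph and ends up with the highest score, while the scores of all pages sum to one.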
