Web data: Amazon movie reviews

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset is a collection of approximately 8 million movie reviews from Amazon, spanning over a decade up to October 2012. It is particularly valuable for analyzing consumer behavior, sentiment analysis, and the evolution of user expertise in online reviews. In total, the dataset has a size of 3.1 GB. Each review includes detailed information such as the product’s unique identifier (ASIN), user ID, profile name, helpfulness rating, score, time of review (in Unix time), summary, and the full text of the review.

The dataset is organized with each review capturing multiple attributes:

  • Product Information: Including the product’s unique identifier (ASIN).

  • User Information: Such as user ID and profile name.

  • Review Details: Encompassing helpfulness rating, score, time of review, summary, and the full text.
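
As a rough illustration of working with these attributes, the sketch below converts the Unix review time to a calendar date and turns a helpfulness rating into a ratio. The dictionary keys, the sample values, and the "helpful/total" encoding of the helpfulness rating are assumptions made for this example, not the dataset's documented layout.

```python
from datetime import datetime, timezone

# Hypothetical review record: the keys and values are illustrative stand-ins
# for the attributes listed above, not the dataset's actual field names.
review = {
    "asin": "B000EXAMPLE",
    "user_id": "A1EXAMPLEUSER",
    "profile_name": "example reviewer",
    "helpfulness": "7/9",   # assumed encoding: helpful votes / total votes
    "score": 4.0,
    "time": 1350000000,     # Unix time: seconds since 1970-01-01 UTC
    "summary": "Worth watching",
    "text": "Full review text goes here.",
}


def review_date(record):
    """Convert a review's Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(record["time"], tz=timezone.utc)


def helpfulness_ratio(record):
    """Return helpful/total votes as a float, or None when nobody voted."""
    helpful, total = (int(part) for part in record["helpfulness"].split("/"))
    return helpful / total if total else None


print(review_date(review).date(), helpfulness_ratio(review))
```

Keeping the timestamp in UTC avoids results that depend on the local time zone of the machine running the analysis.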

 

Web data: Amazon Fine Foods reviews

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of reviews of fine foods from Amazon. The data is 116 MB in size and spans a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes detailed information such as the product’s unique identifier (ASIN), user ID, profile name, helpfulness rating, score, time of review (in Unix time), summary, and the full text of the review. This dataset is particularly valuable for analyzing consumer behavior, sentiment analysis, and the evolution of user expertise in online reviews.

The dataset is organized with each review capturing multiple attributes:

  • Product Information: Including the product’s unique identifier (ASIN).

  • User Information: Such as user ID and profile name.

  • Review Details: Encompassing helpfulness rating, score, time of review, summary, and the full text.

 

Google web graph

Creators: Leskovec, Jure; Lang, Kevin J.; Dasgupta, Anirban; Mahoney, Michael W.
Publication Date: 2002

The Google Web Graph dataset represents the web’s hyperlink structure as captured in 2002. Nodes correspond to individual web pages, and directed edges represent hyperlinks from one page to another. The dataset is presented as a single directed graph with 875,713 nodes and 5,105,039 directed edges and has a size of 0.02 GB. This structure is particularly valuable for studying web connectivity, page importance algorithms such as PageRank, the identification of influential pages, and the community structure of the web during that period.
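
To make the directed-graph representation concrete, here is a minimal sketch that counts incoming and outgoing hyperlinks per page. The toy edge list and the function name are invented for illustration. In-degree is used here only as a crude stand-in for page importance; it is not PageRank.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1  # src links to another page
        in_deg[dst] += 1   # dst receives a link
    return in_deg, out_deg


# Toy graph: each pair (src, dst) means "page src links to page dst".
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
in_deg, out_deg = degree_counts(toy_edges)

# The page with the most incoming links, as a (page, in_degree) pair.
most_linked = in_deg.most_common(1)[0]
print(most_linked)
```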

Amazon product co-purchasing network and ground-truth communities

Creators: Yang, Jaewon; Leskovec, Jure
Publication Date: 2012

This dataset provides a view of product relationships on Amazon, based on the “Customers Who Bought This Item Also Bought” feature. Products are represented as nodes: if a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. Each product category provided by Amazon defines a ground-truth community. We regard each connected component in a product category as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. The dataset has a size of 0.01 GB and encompasses 334,863 nodes (products) and 925,872 edges (co-purchasing relationships).

The dataset is structured into:

  • Network Data: An undirected graph where nodes represent products, and edges indicate co-purchasing relationships.

  • Ground-Truth Communities: Each product category defined by Amazon serves as a ground-truth community. Connected components within these categories are treated as separate communities, excluding those with fewer than three nodes. Additionally, the dataset provides the top 5,000 communities with the highest quality, as detailed in the associated research paper.
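
The community construction described above can be sketched as follows. This is one plausible reading, assuming that "connected component in a product category" means a component of the subgraph induced by the category's products. The function name, the toy data, and the category labels are hypothetical.

```python
from collections import defaultdict


def ground_truth_communities(edges, categories, min_size=3):
    """Split each category into connected components and keep the large ones.

    edges: iterable of undirected (u, v) pairs
    categories: dict mapping a category name to the set of its member nodes
    min_size: components with fewer nodes than this are dropped
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    communities = []
    for members in categories.values():
        unseen = set(members)
        while unseen:
            start = unseen.pop()
            component, stack = {start}, [start]
            while stack:
                node = stack.pop()
                # Follow only edges that stay inside the current category.
                for neighbour in adjacency[node] & unseen:
                    unseen.discard(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
            if len(component) >= min_size:
                communities.append(component)
    return communities


toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
toy_categories = {"books": {1, 2, 3}, "music": {4, 5}, "dvd": {6, 7, 8}}
communities = ground_truth_communities(toy_edges, toy_categories)
print(communities)
```

In the toy data only the "books" component survives: "music" has two nodes, and the three "dvd" products are not connected to each other.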

YouTube social network and ground-truth communities

Creators: Yang, Jaewon; Leskovec, Jure
Publication Date: 2012

YouTube is a video-sharing website that includes a social network. In the YouTube social network, users form friendships and can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al. We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper. As for the network, we provide the largest connected component. The data, collected in 2012, is particularly valuable for studying community structures, information diffusion, and network dynamics within large-scale social platforms. It has a size of 0.001 GB and comprises 1,134,890 nodes (users) and 2,987,624 edges (friendship links). Additionally, it identifies 8,385 ground-truth communities, which are user-defined groups that provide insights into the natural clustering within the network.

Structurally, the dataset includes:

  • Network Data: An undirected graph where nodes represent users and edges denote mutual friendships. This graph captures the largest connected component of the YouTube user network, ensuring a cohesive representation of user interactions.

  • Community Data: Ground-truth communities derived from user-defined groups. Each connected component within these groups is considered a separate community, with only those containing at least three nodes included. For enhanced analysis, the dataset also provides the top 5,000 communities with the highest quality, as detailed in the accompanying research paper.

Facebook Large Page-Page Network; MUSAE

Creators: Rozemberczki, Benedek; Allen, Carl; Sarkar, Rik
Publication Date: 2019

This webgraph is a page-page graph of verified Facebook sites. It offers a snapshot of the mutual like relationships among verified Facebook pages. Nodes represent official Facebook pages while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook. These categories are: politicians, governmental organizations, television shows and companies. The task related to this dataset is multi-class node classification for the 4 site categories. The dataset has a total size of 0.001 GB and encompasses 22,470 nodes, each representing an official Facebook page, and 171,002 edges, indicating mutual like relationships between these pages. Beyond node classification, it is also suited for link prediction, community detection, and network visualization.

The components of the dataset contain:

  • Nodes: Each node corresponds to a verified Facebook page, categorized into one of four types: politicians, governmental organizations, television shows, and companies.

  • Edges: Edges represent mutual like relationships between pages, forming an undirected graph that reflects the interconnectedness of these entities.

  • Node Features: Features are extracted from the descriptions provided by page owners, summarizing the purpose of their respective page.

Graph Embedding with Self Clustering: Facebook; GEMSEC

Creators: Rozemberczki, Benedek; Davies, Ryan; Sarkar, Rik; Sutton, Charles
Publication Date: 2019

We collected data about Facebook pages (November 2017). These datasets represent blue verified Facebook page networks across eight distinct categories: Government, News Sites, Athletes, Public Figures, TV Shows, Politicians, Artists, and Companies. In this dataset, nodes represent individual Facebook pages, and edges denote mutual likes between these pages. We reindexed the nodes in order to achieve a certain level of anonymity. The csv files contain the edges; nodes are indexed from 0. For each dataset we listed the number of nodes and edges. The dataset’s size varies by category, with the largest subset (Artists) containing 50,515 nodes and 819,306 edges, and the smallest subset (TV Shows) comprising 3,892 nodes and 17,262 edges. In total, the dataset has a size of 0.005 GB and encompasses 134,833 nodes and 1,380,293 edges. Structurally, the dataset is divided into eight sub-datasets, each corresponding to a specific category of Facebook pages.
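
A CSV edge list with 0-indexed nodes can be read with the standard library alone, as in the hedged sketch below. The header names `node_1,node_2` and the in-memory sample are assumptions for illustration; the real files may use a different header or none at all. The node count shown relies on the indices being contiguous and on the highest-indexed node appearing in at least one edge.

```python
import csv
import io


def read_edge_csv(handle):
    """Read (u, v) integer pairs from a CSV edge list.

    Rows whose first cell is not a non-negative integer (for example a
    header row) and rows with fewer than two cells are skipped.
    """
    edges = []
    for row in csv.reader(handle):
        if len(row) < 2 or not row[0].strip().isdigit():
            continue
        edges.append((int(row[0]), int(row[1])))
    return edges


# In-memory stand-in for one of the csv files (assumed layout).
sample = io.StringIO("node_1,node_2\n0,1\n0,2\n1,2\n")
edges = read_edge_csv(sample)

# With 0-based contiguous indices, the node count is the largest index plus one.
num_nodes = max(max(u, v) for u, v in edges) + 1
print(num_nodes, len(edges))
```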

Wikipedia Requests for Adminship (with text)

Creators: West, Robert; Paskov, Hristo S.; Leskovec, Jure; Potts, Christopher
Publication Date: 2014

For a Wikipedia editor to become an administrator, a request for adminship (RfA) must be submitted, either by the candidate or by another community member. Subsequently, any Wikipedia member may cast a supporting, neutral, or opposing vote. We crawled and parsed all votes since the adoption of the RfA process in 2003 through May 2013. The dataset has a size of 0.01 GB and contains 11,381 users (voters and votees) forming 189,004 distinct voter/votee pairs, for a total of 198,275 votes (this is larger than the number of distinct voter/votee pairs because, if the same user ran for election several times, the same voter/votee pair may contribute several votes). This induces a directed, signed network in which nodes represent Wikipedia members and edges represent votes. In this sense, the present dataset is a more recent version of the Wikipedia adminship election data. However, there is also a rich textual component in RfAs, which was not included in the older version: each vote is typically accompanied by a short comment (median/mean: 19/34 tokens). A typical positive comment reads, “I’ve no concerns, will make an excellent addition to the admin corps”, while an example of a negative comment is, “Little evidence of collaboration with other editors and limited content creation.”
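
The difference between the vote count and the number of distinct voter/votee pairs can be seen on a toy example. The tuple layout `(voter, votee, sign)` and the user names are hypothetical; signs +1, 0, and -1 stand for supporting, neutral, and opposing votes.

```python
def summarize_votes(votes):
    """Return (number of users, distinct voter/votee pairs, total votes)."""
    users = {name for voter, votee, _ in votes for name in (voter, votee)}
    pairs = {(voter, votee) for voter, votee, _ in votes}
    return len(users), len(pairs), len(votes)


# Toy signed, directed votes: alice votes for bob in two separate elections,
# so the same voter/votee pair contributes two votes.
toy_votes = [
    ("alice", "bob", 1),
    ("carol", "bob", -1),
    ("alice", "bob", 1),
    ("bob", "carol", 0),
]
num_users, num_pairs, num_votes = summarize_votes(toy_votes)
print(num_users, num_pairs, num_votes)
```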
