Resources by Stefan

Wikipedia Requests for Adminship (with text)

Creators: West, Robert; Paskov, Hristo S.; Leskovec, Jure; Potts, Christopher
Publication Date: 2014

For a Wikipedia editor to become an administrator, a request for adminship (RfA) must be submitted, either by the candidate or by another community member. Subsequently, any Wikipedia member may cast a supporting, neutral, or opposing vote. We crawled and parsed all votes from the adoption of the RfA process in 2003 through May 2013. The dataset has a size of 0.01 GB and contains 11,381 users (voters and votees) forming 189,004 distinct voter/votee pairs, for a total of 198,275 votes (this is larger than the number of distinct voter/votee pairs because, if the same user ran for election several times, the same voter/votee pair may contribute several votes). This induces a directed, signed network in which nodes represent Wikipedia members and edges represent votes. In this sense, the present dataset is a more recent version of the Wikipedia adminship election data. However, there is also a rich textual component in RfAs, which was not included in the older version: each vote is typically accompanied by a short comment (median/mean: 19/34 tokens). A typical positive comment reads, “I’ve no concerns, will make an excellent addition to the admin corps”, while an example of a negative comment is, “Little evidence of collaboration with other editors and limited content creation.”
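The relationship between votes, voter/votee pairs, and the signed network can be sketched in a few lines. This is a minimal illustration, assuming votes are available as (voter, votee, sign) tuples with +1 for support, 0 for neutral, and -1 for oppose; the user names and the tuple layout are hypothetical and not the dataset's actual file format.

```python
from collections import defaultdict

# Hypothetical vote records: (voter, votee, sign).
# sign is +1 (support), 0 (neutral) or -1 (oppose). Names are made up.
votes = [
    ("alice", "bob", 1),
    ("carol", "bob", -1),
    ("alice", "bob", 1),   # bob ran a second time and alice voted again
    ("bob", "carol", 0),
]


def build_signed_network(records):
    """Group votes by directed (voter, votee) pair, keeping every sign."""
    edges = defaultdict(list)
    for voter, votee, sign in records:
        edges[(voter, votee)].append(sign)
    return edges


edges = build_signed_network(votes)
total_votes = sum(len(signs) for signs in edges.values())
distinct_pairs = len(edges)
users = {user for pair in edges for user in pair}

print(total_votes, distinct_pairs, len(users))
```

Because a repeated candidacy adds a second vote on an existing pair, the vote count can exceed the number of distinct pairs, which is the effect described for the full dataset.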

Graph Embedding with Self Clustering: Facebook; GEMSEC

Creators: Rozemberczki, Benedek; Davies, Ryan; Sarkar, Rik; Sutton, Charles
Publication Date: 2019

We collected data about Facebook pages in November 2017. These datasets represent blue verified Facebook page networks across eight distinct categories: Government, News Sites, Athletes, Public Figures, TV Shows, Politicians, Artists, and Companies. Nodes represent individual Facebook pages, and edges denote mutual likes between these pages. We reindexed the nodes in order to achieve a certain level of anonymity. The csv files contain the edges; nodes are indexed from 0. The dataset is divided into eight sub-datasets, one for each page category, and for each we list the number of nodes and edges. Size varies by category: the largest subset (Artists) contains 50,515 nodes and 819,306 edges, and the smallest subset (TV Shows) contains 3,892 nodes and 17,262 edges. In total, the dataset has a size of 0.005 GB and encompasses 134,833 nodes and 1,380,293 edges, offering a source for analyzing the structure of Facebook page networks.
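Reading one of the edge files and counting its nodes and edges might look like the sketch below. It assumes a two-column csv with a header row and integer node ids starting at 0; the column names `node_1` and `node_2` and the sample rows are hypothetical stand-ins, not taken from the actual files.

```python
import csv
import io

# Hypothetical stand-in for one edge csv; header names are an assumption.
sample_csv = "node_1,node_2\n0,1\n0,2\n1,2\n2,3\n"


def load_undirected_edges(handle):
    """Read an edge list and store each mutual like once as (low, high)."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    edges = set()
    for first, second in reader:
        u, v = int(first), int(second)
        edges.add((min(u, v), max(u, v)))
    return edges


edges = load_undirected_edges(io.StringIO(sample_csv))
nodes = {node for edge in edges for node in edge}

print(len(nodes), len(edges))
```

Storing each pair in sorted order treats the mutual-like relation as undirected, so a pair listed in both directions is counted once.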

Facebook Large Page-Page Network; MUSAE

Creators: Rozemberczki, Benedek; Allen, Carl; Sarkar, Rik
Publication Date: 2019

This webgraph is a page-page graph of verified Facebook sites, offering a snapshot of the mutual like relationships among verified Facebook pages. Nodes represent official Facebook pages, while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. The graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories defined by Facebook: politicians, governmental organizations, television shows, and companies. The task related to this dataset is multi-class node classification for the 4 site categories; the graph is also suited to link prediction, community detection, and network visualization. The dataset has a total size of 0.001 GB and encompasses 22,470 nodes and 171,002 edges.

The components of the dataset contain:

  • Nodes: Each node corresponds to a verified Facebook page, categorized into one of four types: politicians, governmental organizations, television shows, and companies.

  • Edges: Edges represent mutual like relationships between pages, forming an undirected graph that reflects the interconnectedness of these entities.

  • Node Features: Features are extracted from the descriptions provided by page owners, summarizing the purpose of their respective page.
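To make the node classification task concrete, the sketch below predicts the category of an unlabeled page from the categories of its neighbours. It is only a naive baseline on a hypothetical toy graph, ignores the node features entirely, and is not the method used by the dataset's creators.

```python
from collections import Counter, defaultdict

# Hypothetical toy graph: node ids, edges and labels are made up.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
labels = {0: "politician", 1: "politician", 3: "company", 4: "company"}


def build_adjacency(edge_list):
    """Undirected adjacency sets, since edges are mutual likes."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def predict_by_neighbour_majority(node, adjacency, known_labels):
    """Return the most common label among labeled neighbours, or None."""
    counts = Counter(
        known_labels[n] for n in adjacency[node] if n in known_labels
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


adjacency = build_adjacency(edges)
prediction = predict_by_neighbour_majority(2, adjacency, labels)
print(prediction)
```

A baseline like this mainly shows how much of the category signal is carried by the graph structure alone before any description-based features are used.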

Social circles: Twitter

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘lists’) from Twitter. Twitter data was crawled from public sources. It includes node features (profiles), circles, and ego networks, and covers 81,306 nodes and 1,768,149 edges. This makes the dataset useful for analyzing community structures, social influence, and network dynamics. It is approximately 0.02 GB in size and structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes are replaced with generic labels.

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.
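The ego network definition above can be sketched as follows: take the ego, its direct friends, and every edge whose endpoints both lie in that set. The example treats edges as undirected for simplicity and uses hypothetical node names; it does not reflect the dataset's file layout.

```python
from collections import defaultdict

# Hypothetical edges; node names are made up for illustration.
edges = [("ego", "a"), ("ego", "b"), ("a", "b"), ("b", "c"), ("c", "d")]


def build_adjacency(edge_list):
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def ego_network(adjacency, ego):
    """Return the ego's friends and all edges among the ego and its friends."""
    friends = set(adjacency[ego])
    members = friends | {ego}
    sub_edges = set()
    for u in members:
        for v in adjacency[u]:
            # u < v keeps each undirected edge once
            if v in members and u < v:
                sub_edges.add((u, v))
    return friends, sub_edges


adjacency = build_adjacency(edges)
friends, sub_edges = ego_network(adjacency, "ego")
print(sorted(friends), sorted(sub_edges))
```

Nodes such as `c` and `d`, which are not direct friends of the ego, are left out even though `c` is connected to one of the friends.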

Wikipedia vote network

Creators: Leskovec, Jure; Huttenlocher, Daniel; Kleinberg, Jon
Publication Date: 2010

A small part of Wikipedia contributors are administrators, who are users with access to additional technical features that aid in maintenance. In order for a user to become an administrator, a Request for adminship (RfA) is issued and the Wikipedia community decides, via a public discussion or a vote, who to promote to adminship. Using the latest complete dump of Wikipedia page edit history (from January 3, 2008), we extracted all administrator elections and vote history data. This gave us 2,794 elections with 103,663 total votes and 7,066 users participating in the elections (either casting a vote or being voted on). Out of these, 1,235 elections resulted in a successful promotion, while 1,559 elections did not result in a promotion. About half of the votes in the dataset are by existing admins, while the other half comes from ordinary Wikipedia users. The network contains all the Wikipedia voting data from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j represents that user i voted on user j. In total, the dataset has a size of 0.01 GB.
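Since an edge from i to j means that user i voted on user j, a node's out-degree counts the users it voted on and its in-degree counts the users who voted on it. The sketch below shows this on a hypothetical edge list, and also checks the election counts quoted above, which imply that roughly 44% of elections were successful.

```python
from collections import Counter

# Hypothetical directed edges (i, j): user i voted on user j.
edges = [(1, 3), (2, 3), (3, 4), (1, 4)]

out_degree = Counter(i for i, _ in edges)  # how many users i voted on
in_degree = Counter(j for _, j in edges)   # how many users voted on j

# Election counts as stated in the dataset description.
successful, unsuccessful = 1235, 1559
total_elections = successful + unsuccessful
success_rate = successful / total_elections

print(in_degree[3], out_degree[1], total_elections, round(success_rate, 3))
```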

Social circles: Google+

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ from Google+. Google+ data was collected from users who had manually shared their circles using the ‘share circle’ feature. The dataset includes node features (profiles), circles, and ego networks, and covers 107,614 nodes and 13,673,453 edges.

It has a size of 0.4 GB and is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes are replaced with generic labels.

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.
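When working with circles, a common first step is to measure how much two circles of the same user overlap. The sketch below assumes circles are available as sets of member ids and that a member may appear in more than one circle; the circle names and ids are hypothetical.

```python
from collections import Counter

# Hypothetical circles of one user: circle name -> set of member ids.
circles = {
    "family": {1, 2, 3},
    "colleagues": {3, 4, 5},
    "college": {5, 6},
}


def jaccard(first, second):
    """Jaccard similarity: shared members divided by all members."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def members_in_multiple_circles(circle_map):
    """Members that appear in more than one circle."""
    counts = Counter(m for members in circle_map.values() for m in members)
    return {member for member, count in counts.items() if count > 1}


overlap = jaccard(circles["family"], circles["colleagues"])
shared = members_in_multiple_circles(circles)
print(overlap, sorted(shared))
```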

AmazonQA

Creators: Gupta, Mansi; Kulkarni, Nitish; Chanda, Raghuveer; Rayasam, Anirudha; Lipton, Zachary C.
Publication Date: 2019

We introduce a new dataset and propose a method that combines information retrieval techniques for selecting relevant reviews (given a question) and “reading comprehension” models for synthesizing an answer (given a question and review). Our dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, we collect additional annotations, marking each question as either answerable or unanswerable based on the available reviews. The dataset is approximately 4 GB in size and is available in JSON format.

The dataset uses the following variables:

  • questionText: The text of the question posed by the consumer.

  • questionType: Indicates whether the question is ‘yes/no’ for boolean questions or ‘descriptive’ for open-ended questions.

  • review_snippets: A list of extracted review snippets relevant to the question (up to ten).

  • answerText: The text of the answer provided.

  • answerType: Specifies the type of answer.

  • helpful: A list containing two integers; the first indicates the number of users who found the answer helpful, and the second indicates the total number of responses.

  • asin: The unique Amazon Standard Identification Number (ASIN) for the product the question pertains to.

  • qid: A unique question identifier within the dataset.

  • category: The product category.

  • top_review_wilson: The review with the highest Wilson score.

  • top_review_helpful: The review voted as most helpful by users.

  • is_answerable: A boolean indicating whether the question is answerable using the review snippets, based on an answerability classifier.

  • top_sentences_IR: A list of top sentences (up to ten) based on Information Retrieval (IR) score with the question.
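A record with these variables could be parsed as in the sketch below. The flat record layout, the placeholder values, and the helper names are assumptions for illustration; the actual files may nest answers differently. The Wilson function is the standard lower bound of the Wilson score interval, shown only to illustrate the idea behind a Wilson-ranked review, not as the creators' exact computation.

```python
import json
import math

# Hypothetical record using the variable names listed above; values are made up.
raw = """{
  "qid": 1,
  "asin": "B000000000",
  "questionText": "Does it fit a 15 inch laptop?",
  "questionType": "yes/no",
  "answerText": "Yes, with room to spare.",
  "helpful": [3, 4],
  "is_answerable": true
}"""


def helpful_ratio(helpful):
    """helpful = [users who found it helpful, total responses]."""
    found_helpful, total = helpful
    return found_helpful / total if total else 0.0


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 is about 95%)."""
    if total == 0:
        return 0.0
    p = positive / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)


record = json.loads(raw)
ratio = helpful_ratio(record["helpful"])
lower = wilson_lower_bound(*record["helpful"])
print(record["qid"], record["is_answerable"], ratio, round(lower, 3))
```

The Wilson bound is lower than the raw ratio when there are few responses, which is why it is often preferred for ranking items with small vote counts.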

Social circles: Facebook

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. Facebook data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks, and covers 4,039 nodes and 88,234 edges. Facebook data has been anonymized by replacing the Facebook-internal ids for each user with a new value. The dataset is approximately 0.01 GB in size.

The dataset is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes (e.g., political affiliation) are replaced with generic labels (e.g., ‘anonymized feature 1’).

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.
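The feature anonymization described above can be illustrated with a small sketch: each distinct attribute value is mapped to a generic label, so it remains possible to tell that two users share a feature without knowing what the feature means. The profile fields, values, and function name are hypothetical, and the labeling scheme is an assumption rather than the creators' actual procedure.

```python
def anonymize_profiles(profiles):
    """Replace each (attribute, value) pair with a generic numbered label."""
    label_for = {}
    anonymized = {}
    for user, attributes in profiles.items():
        labels = set()
        for feature in sorted(attributes.items()):
            if feature not in label_for:
                label_for[feature] = f"anonymized feature {len(label_for) + 1}"
            labels.add(label_for[feature])
        anonymized[user] = labels
    return anonymized


# Hypothetical profiles; attribute names and values are made up.
profiles = {
    "u1": {"political": "party A", "school": "X"},
    "u2": {"political": "party A", "school": "Y"},
}

anonymized = anonymize_profiles(profiles)
shared = anonymized["u1"] & anonymized["u2"]
print(shared)
```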
