Stack Exchange Data
Creators:
Stack Exchange Inc.
Publication Date:
2014
Data Category:
Dataset Description:
This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is distributed as a separate archive of XML files, compressed with 7-Zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. The dataset covers detailed records of questions, answers, comments, user profiles, and related metadata from numerous Stack Exchange communities. This breadth allows for in-depth analysis of community interactions, content evolution, and knowledge dissemination patterns. The dataset is 92.3 GB in size and captures content from the inception of each Stack Exchange site up to the date of the specific data dump. For example, the September 2023 release includes data up to that month. Within each archive, the XML files represent different data tables, including:
- Posts.xml: Contains both questions and answers, with fields detailing post ID, creation date, score, body content, and related metadata.
- Users.xml: Includes user information such as user ID, reputation, creation date, and profile details.
- Comments.xml: Encompasses comments made on posts, including comment ID, post ID, user ID, and content.
- Votes.xml: Records voting data on posts, detailing vote type, user ID, and timestamps.
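
The files listed above can be large, so reading them as a stream is usually safer than loading a whole file into memory. The following is a minimal sketch in Python, assuming each record is stored as a `row` element whose fields are XML attributes (for example `Id`, `PostTypeId`, `Score`); these attribute names and the inline sample are assumptions for illustration and should be checked against the actual files. The helper names `iter_rows` and `count_post_types` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict.

    `source` may be a file path or a binary file-like object.
    Assumption: every record is a <row .../> element with its
    fields stored as attributes.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy before clearing
            elem.clear()  # drop the element's data to limit memory use


def count_post_types(source):
    """Count rows per PostTypeId (assumed attribute name)."""
    counts = Counter()
    for row in iter_rows(source):
        counts[row.get("PostTypeId", "unknown")] += 1
    return counts


# Tiny made-up sample standing in for Posts.xml; not real data.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Body="A question" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" Body="An answer" />
  <row Id="3" PostTypeId="2" ParentId="1" Score="0" Body="Another answer" />
</posts>"""

if __name__ == "__main__":
    print(dict(count_post_types(io.BytesIO(SAMPLE))))
```

To run this against a real dump, extract a site archive first and pass the path to the extracted file (for example `count_post_types("Posts.xml")`). The same `iter_rows` helper should work for Users.xml, Comments.xml, and Votes.xml if they follow the same row-per-record layout.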
Variables:
Details: