Reddit datasets

Creators:

Conversational Analysis Toolkit (ConvoKit)

Publication Date:

n.a

Data Category:

Dataset Description:

The ConvoKit Subreddit Corpus is a collection of posts and comments from Reddit, gathered to support research in conversational analysis and sociolinguistics. It covers 948,169 individual subreddits, each from its inception until October 2018. The dataset is organized into one corpus per subreddit, which allows targeted analysis of specific communities.

Each corpus includes information at four levels. At the speaker level, speakers are identified by their Reddit usernames. At the utterance level, each post or comment is treated as an utterance with attributes such as unique ID, author, conversation ID, reply relationships, timestamp, and text content. At the conversation level, each post and its corresponding comments are considered a conversation, with metadata including the post's title, number of comments, domain, subreddit, and author flair. At the corpus level, aggregate data is recorded, such as the list of subreddits included, the total number of posts and comments, and the number of unique speakers.

Variables:

Utterance-level information

Name	Description
speaker	Username of the Reddit user.
ID	Unique identifier of the utterance.
conversation_ID	ID of the conversation to which this utterance belongs.
reply-to	ID of the utterance being replied to (None if the utterance is a post).
timestamp	Time when the utterance was made.
text	Text content of the utterance.
score	Net upvotes (upvotes minus downvotes).
top_level_comment	ID of the top-level comment (None if the utterance is a post).
Retrieved_on	Unix timestamp when the data was retrieved.
gildings	Number of times the post/comment received Reddit awards.
stickied	Indicates if the post/comment is stickied.
permalink	URL to the content.
author_flair_text	Text flair associated with the author.
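
The reply-to and conversation_ID fields are enough to rebuild the thread structure of a conversation. The following is a minimal sketch in plain Python, not the ConvoKit API: the sample records, the usernames, and the helper names build_reply_index and thread_depths are hypothetical, and the records are assumed to be dictionaries keyed by the variable names in the table above, with an integer Unix timestamp.

```python
from collections import defaultdict

# Hypothetical sample records; keys follow the utterance-level variable table.
# Timestamps are assumed to be integer Unix times (an assumption for this sketch).
utterances = [
    {"ID": "p1", "speaker": "alice", "conversation_ID": "p1",
     "reply-to": None, "timestamp": 1538352000, "text": "Post body"},
    {"ID": "c1", "speaker": "bob", "conversation_ID": "p1",
     "reply-to": "p1", "timestamp": 1538352060, "text": "First reply"},
    {"ID": "c2", "speaker": "alice", "conversation_ID": "p1",
     "reply-to": "c1", "timestamp": 1538352120, "text": "Nested reply"},
    {"ID": "c3", "speaker": "carol", "conversation_ID": "p1",
     "reply-to": "p1", "timestamp": 1538352180, "text": "Second reply"},
]


def build_reply_index(records):
    """Map each parent ID (None for posts) to its child IDs, oldest first."""
    children = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        children[rec["reply-to"]].append(rec["ID"])
    return children


def thread_depths(records):
    """Return {utterance ID: depth}; a post has depth 0, a direct reply 1."""
    children = build_reply_index(records)
    depths = {}
    # Posts are the utterances whose reply-to is None.
    stack = [(post_id, 0) for post_id in children.get(None, [])]
    while stack:
        uid, depth = stack.pop()
        depths[uid] = depth
        stack.extend((child, depth + 1) for child in children.get(uid, []))
    return depths


depths = thread_depths(utterances)
for uid in sorted(depths):
    print(uid, depths[uid])
```

An iterative traversal is used instead of recursion so that very deep comment chains do not hit Python's recursion limit.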

Details:

Corpus-level information

Name	Description
Subreddit	List of subreddits included in this corpus.
num_posts	Total number of posts in the corpus.
num_comments	Total number of comments in the corpus.
num_speaker	Number of unique users in the corpus.
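
The corpus-level counts can in principle be derived from the utterance records. The sketch below is an illustration under stated assumptions, not the way ConvoKit computes them: it assumes a post is any utterance whose reply-to is None and everything else is a comment, and the sample records and the function name summarize_corpus are hypothetical.

```python
# Hypothetical sample records; keys follow the utterance-level variable table.
sample_utterances = [
    {"ID": "p1", "speaker": "alice", "reply-to": None},
    {"ID": "c1", "speaker": "bob", "reply-to": "p1"},
    {"ID": "c2", "speaker": "alice", "reply-to": "c1"},
    {"ID": "p2", "speaker": "carol", "reply-to": None},
]


def summarize_corpus(records):
    """Compute corpus-level counts from a list of utterance records.

    Assumption: an utterance with reply-to set to None is a post;
    every other utterance is a comment.
    """
    num_posts = sum(1 for rec in records if rec["reply-to"] is None)
    return {
        "num_posts": num_posts,
        "num_comments": len(records) - num_posts,
        "num_speaker": len({rec["speaker"] for rec in records}),
    }


summary = summarize_corpus(sample_utterances)
print(summary)
```

Comparing such derived counts with the stored corpus-level values is one quick consistency check after loading a subreddit corpus.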

Conversation-level information

Name	Description
Title	Title of the post.
Num_comments	Number of comments on the post.
domain	Domain of the post’s link (if applicable).
Subreddit	Subreddit from which the post is retrieved.
Gilded	Number of awards received by the post.
Gildings	Gilding information for the post.
Stickied	Indicates if the post is stickied.
Author-flair-text	Flair of the author.
