Reddit datasets
Creators:
Conversational Analysis Toolkit (ConvoKit)
Publication Date:
n.a.
Data Category:
Dataset Description:
The ConvoKit Subreddit Corpus is a collection of posts and comments from Reddit, assembled to support research in conversational analysis and sociolinguistics. It covers 948,169 individual subreddits, each from its inception until October 2018. The dataset is organized as one corpus per subreddit, so specific communities can be analyzed in isolation. Each corpus carries information at four levels. At the speaker level, speakers are identified by their Reddit usernames. At the utterance level, each post or comment is treated as an utterance with a unique ID, author, conversation ID, reply relationship, timestamp, and text content. At the conversation level, each post together with its comments forms a conversation, with metadata including the post's title, number of comments, domain, subreddit, and author flair. At the corpus level, aggregate data is recorded: the list of subreddits included, the total number of posts and comments, and the number of unique speakers.
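The four levels described above can be pictured with a small, self-contained sketch. This is not ConvoKit's API: the class names, field names, the `build_corpus` helper, and the sample records are all hypothetical, chosen only to mirror the attributes listed in the description. It assumes that a post is an utterance with no reply target and that the conversation ID is the ID of the post that starts the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Utterance:
    """Utterance level: one post or comment (hypothetical field names)."""
    id: str
    speaker: str              # speaker level: the Reddit username
    conversation_id: str      # assumed to be the ID of the originating post
    reply_to: Optional[str]   # None for the post itself
    timestamp: int            # Unix time, an assumption for this sketch
    text: str


@dataclass
class Conversation:
    """Conversation level: a post plus its comments, with post metadata."""
    id: str
    meta: Dict[str, object]
    utterances: List[Utterance] = field(default_factory=list)


def build_corpus(subreddit, utterances, conversation_meta):
    """Group utterances into conversations and compute corpus-level counts."""
    conversations: Dict[str, Conversation] = {}
    for utt in utterances:
        convo = conversations.setdefault(
            utt.conversation_id,
            Conversation(
                id=utt.conversation_id,
                meta=conversation_meta.get(utt.conversation_id, {}),
            ),
        )
        convo.utterances.append(utt)

    speakers = {utt.speaker for utt in utterances}
    num_posts = sum(1 for utt in utterances if utt.reply_to is None)
    corpus_meta = {
        "subreddit": subreddit,
        "num_posts": num_posts,
        "num_comments": len(utterances) - num_posts,
        "num_speakers": len(speakers),
    }
    return {"meta": corpus_meta, "conversations": conversations, "speakers": speakers}


# Made-up sample records, for illustration only.
example_utterances = [
    Utterance("p1", "user_a", "p1", None, 1538352000, "Example post text"),
    Utterance("c1", "user_b", "p1", "p1", 1538352600, "Example comment"),
    Utterance("c2", "user_a", "p1", "c1", 1538353200, "Example reply"),
]
example_meta = {
    "p1": {
        "title": "Example post title",
        "num_comments": 2,
        "domain": "self.example",
        "subreddit": "example",
        "author_flair_text": "",
    }
}

corpus = build_corpus("example", example_utterances, example_meta)
print(corpus["meta"])
```

Because replies point at their parent through `reply_to`, the comment tree of a conversation can be rebuilt from the flat utterance list, which is why the sketch stores utterances flat rather than nested.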
Variables:
Details: