
Fact-Checking Facebook Politics Pages

Creators: Silverman, Craig; Strapagiel, Lauren; Shaban, Hamza; Hall, Ellie; Singer-Vine, Jeremy
Publication Date: 2016

This repository contains the data and analysis for the BuzzFeed News article, “Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate,” published October 20, 2016. The dataset examines content from hyperpartisan Facebook pages, providing insight into the spread of false and misleading information within polarized political communities. It includes Facebook engagement figures (shares, reactions, and comments), offering a perspective on how users interact with content of varying accuracy. The dataset has a total size of 364.8 kB and contains over 1,000 posts from hyperpartisan political Facebook pages. Data was collected up to October 11, 2016, capturing a snapshot of political content leading up to the 2016 U.S. presidential election. Structurally, the dataset is organized as a spreadsheet with columns representing:

  • Post Content: Text of the Facebook post.

  • Fact-Check Rating: Assessment of the post’s accuracy.

  • Engagement Metrics: Counts of shares, reactions, and comments.
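A table with this shape lends itself to simple per-rating aggregation. The sketch below is a minimal illustration using assumed, simplified column names (the published CSV uses its own headers), not the article's actual analysis code:

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; column names here are assumptions, not the
# dataset's real headers.
sample = """post_text,rating,shares,reactions,comments
"Example claim",mostly false,120,300,45
"Another claim",mostly true,40,80,10
"""

# Sum total engagement (shares + reactions + comments) per fact-check rating.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    totals[row["rating"]] += (
        int(row["shares"]) + int(row["reactions"]) + int(row["comments"])
    )

print(dict(totals))
```

The same grouping, applied to the real file, would show whether posts rated false attract more engagement than accurate ones.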

Characterizing Online Discussion Using Coarse Discourse Sequences

Creators: Zhang, Amy; Culbertson, Brian; Paritosh, Praveen
Publication Date: 2017

In this work, we present a novel method for classifying comments in online discussions into a set of coarse discourse acts, toward the goal of better understanding discussions at scale. To facilitate this study, we devise a categorization of coarse discourse acts designed to encompass general online discussion and allow for easy annotation by crowd workers. We collect and release a corpus of over 9,000 threads comprising over 100,000 comments, randomly sampled from Reddit and manually annotated with discourse acts via paid crowdsourcing. Using our corpus, we demonstrate how the analysis of discourse acts can characterize different types of discussions, including discourse sequences such as Q&A pairs and chains of disagreement, as well as different communities. Finally, we conduct experiments to predict discourse acts using our corpus, finding that structured prediction models such as conditional random fields can achieve an F1 score of 75%. We also demonstrate how broadening discourse acts from simple question-and-answer to a richer set of categories can improve the recall of Q&A extraction.

Tracking Mastodon user numbers over time

Creators: Willison, Simon
Publication Date: 2022

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter. I’ve set up a new git scraper to track the number of registered user accounts on known Mastodon instances over time. The dataset collects data from numerous Mastodon instances, providing a holistic view of user distribution across the network. This approach captures the decentralized nature of Mastodon, offering insights into individual server growth and overall network expansion. By recording user numbers at regular intervals, the dataset enables the analysis of growth patterns over time, identifying trends and significant adoption milestones. The dataset includes user counts from approximately 1,830 Mastodon instances, with data points collected approximately every 20 minutes. This frequency allows for detailed temporal analysis of user growth. Data collection began on November 20, 2022, and has continued since then, capturing the rapid growth of Mastodon following significant events such as changes in other social media platforms.

The dataset is structured with each record representing a snapshot of user numbers across various Mastodon instances at a specific timestamp. Key fields include:

  • Instance Name: The domain name of the Mastodon instance.

  • User Count: The number of registered users on the instance at the time of data collection.

  • Timestamp: The date and time when the data was collected.
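Because each record is a timestamped snapshot of per-instance counts, growth can be computed by differencing consecutive snapshots. The sketch below uses made-up instance counts and assumes a simple mapping of instance name to user count; the real scraper's file layout may differ:

```python
# Two hypothetical snapshots (instance name -> registered user count),
# taken roughly 20 minutes apart. Values are illustrative.
earlier = {"mastodon.social": 1_000_000, "fosstodon.org": 40_000}
later = {"mastodon.social": 1_050_000, "fosstodon.org": 42_500}

# Per-instance growth between the two snapshots.
growth = {name: later[name] - earlier.get(name, 0) for name in later}

print(growth)
```

Summing the per-instance deltas over many snapshots yields the network-wide growth curve the dataset is designed to support.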

Stack Exchange Data

Creators: Stack Exchange Inc.
Publication Date: 2014

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. The dataset covers detailed records of questions, answers, comments, user profiles, and other related metadata from numerous Stack Exchange communities. This breadth allows for in-depth analysis of community interactions, content evolution, and knowledge dissemination patterns. The dataset has a size of 92.3 GB and captures content from the inception of each Stack Exchange site up to the date of the specific data dump. For example, the September 2023 release includes data up to that month. Structurally, the database is organized into individual archives for each Stack Exchange community. Each archive contains several XML files representing different data tables:

  • Posts.xml: Contains both questions and answers, with fields detailing post ID, creation date, score, body content, and related metadata.

  • Users.xml: Includes user information such as user ID, reputation, creation date, and profile details.

  • Comments.xml: Encompasses comments made on posts, including comment ID, post ID, user ID, and content.

  • Votes.xml: Records voting data on posts, detailing vote type, user ID, and timestamps.
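In these dumps, each record is a `<row>` element whose fields are stored as XML attributes, so the files parse cleanly with a standard XML library. A minimal sketch against a Posts.xml-shaped fragment (the fragment itself is made up; attribute names match the dump convention):

```python
import xml.etree.ElementTree as ET

# A tiny fragment shaped like Posts.xml: one <row> per post, with all
# fields as attributes. PostTypeId "1" = question, "2" = answer.
posts_xml = """<posts>
  <row Id="1" PostTypeId="1" Score="42" CreationDate="2014-01-01T00:00:00" />
  <row Id="2" PostTypeId="2" Score="17" ParentId="1" />
</posts>"""

rows = [row.attrib for row in ET.fromstring(posts_xml)]
questions = [r for r in rows if r["PostTypeId"] == "1"]

print(len(rows), len(questions))
```

For the full 92.3 GB dump, an incremental parser (e.g. `ET.iterparse`) would be preferable to loading whole files into memory.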

3 Million Russian troll tweets

Creators: FiveThirtyEight; Warren, Patrick; Linvill, Darren
Publication Date: 2018

This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian “troll factory” and a defendant in an indictment filed by the Justice Department in February 2018, as part of special counsel Robert Mueller’s Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. Each entry includes detailed information such as the tweet’s content, author handle, language, publication date, and engagement metrics (e.g., number of followers, following count). The dataset provides classifications for each account, indicating the thematic focus (e.g., Right Troll, Left Troll, News Feed), as coded by researchers Darren Linvill and Patrick Warren. It has a total size of 507.2 kB.
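Given the per-account category labels, a natural first analysis is counting tweets per category. The sketch below uses an inline sample with assumed column names (the published CSVs define their own headers, so treat these as illustrative):

```python
import csv
import io
from collections import Counter

# Illustrative rows; handles and column names are assumptions.
sample = """author,content,publish_date,account_category
HANDLE_A,example text,2016-10-01,RightTroll
HANDLE_B,example text,2016-10-02,LeftTroll
HANDLE_A,more text,2016-10-03,RightTroll
"""

# Tally tweets per account category (Right Troll, Left Troll, News Feed, ...).
counts = Counter(
    row["account_category"] for row in csv.DictReader(io.StringIO(sample))
)

print(counts.most_common())
```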

Facebook Social Connectedness Index

Creators: Meta
Publication Date: 2021

We use an anonymized snapshot of all active Facebook users and their friendship networks to measure the intensity of connectedness between locations. The Social Connectedness Index (SCI) measures the relative probability that two individuals in two different locations are friends with each other on Facebook. Each entry represents a pair of locations and the strength of social connectedness between them, offering insight into social ties across regions. The dataset has a size of 3.9 kB and reflects a specific snapshot in time, with the latest available data from October 2021. The dataset is organized into multiple sub-datasets, each detailing social connectedness at a different geographic level:

  1. Country-Country Pairs:

    • user_loc: ISO2 code of the first country.

    • fr_loc: ISO2 code of the second country.

    • scaled_sci: Scaled Social Connectedness Index between the two countries.

  2. US County-Country Pairs:

    • user_loc: 5-digit FIPS code of the U.S. county.

    • fr_loc: ISO2 code of the country.

    • scaled_sci: Scaled Social Connectedness Index between the U.S. county and the country.
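Each sub-dataset is a flat table keyed by a location pair, so lookups reduce to indexing on `(user_loc, fr_loc)`. A minimal sketch, assuming tab-separated files and using made-up index values:

```python
import csv
import io

# Field names match the listing above; the numeric values are invented
# for illustration only.
sample = """user_loc\tfr_loc\tscaled_sci
US\tCA\t250000
US\tMX\t180000
"""

# Index the table by location pair for O(1) lookups.
sci = {
    (r["user_loc"], r["fr_loc"]): int(r["scaled_sci"])
    for r in csv.DictReader(io.StringIO(sample), delimiter="\t")
}

print(sci[("US", "CA")])
```

The county-country table works the same way, with a 5-digit FIPS code in place of the first ISO2 code.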

Third Eye Data: TV News Archive chyrons

Creators: TV News Archive
Publication Date: 2017

The Third Eye: TV News Archive Chyrons dataset captures and analyzes the “lower third” text, known as chyrons, displayed during live TV news broadcasts. This dataset provides a unique look into the real-time editorial choices of major news networks, offering insights into how different media outlets frame news stories. Using Optical Character Recognition (OCR) technology, chyrons are extracted and archived continuously, making it possible to track how key topics are covered over time.

At its inception in September 2017, the dataset collected chyrons from four major news networks: BBC News, CNN, Fox News, and MSNBC. Within two weeks of its launch, over four million chyrons had already been captured, highlighting the vast amount of real-time data available. The dataset has been continuously updated since, allowing for longitudinal studies of media framing and news presentation trends. Its size is approximately 12.5 kB in TSV format.

The dataset is structured into several key components. Each chyron entry includes:

  • The exact chyron text, showing the wording used by the network.
  • Timestamps, allowing analysis of how frequently specific topics appear.
  • Channel identifiers, enabling comparisons between different networks.
  • Duration data, indicating how long a chyron remained on screen, which can suggest emphasis or prioritization of certain stories.
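Since the feed is TSV with one chyron per row, per-channel comparisons are straightforward. The sketch below filters by channel and averages on-screen duration; the sample rows and column names are assumptions, not the feed's exact headers:

```python
import csv
import io

# Illustrative TSV rows; header names and values are assumptions.
sample = """datetime\tchannel\tduration\ttext
2017-09-07 12:00:00\tCNNW\t30\tBREAKING NEWS EXAMPLE
2017-09-07 12:01:00\tFOXNEWSW\t45\tANOTHER HEADLINE
2017-09-07 12:02:00\tCNNW\t60\tTHIRD HEADLINE
"""

# Average on-screen duration (seconds) of chyrons on one channel.
cnn_rows = [
    r
    for r in csv.DictReader(io.StringIO(sample), delimiter="\t")
    if r["channel"] == "CNNW"
]
avg_duration = sum(int(r["duration"]) for r in cnn_rows) / len(cnn_rows)

print(avg_duration)
```

Comparing this average across channels is one simple proxy for how much emphasis each network gives a running story.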

By leveraging this dataset, researchers, journalists, and media analysts can examine bias in news presentation, media influence on public perception, and breaking news coverage trends. It serves as a powerful tool for studying news framing, editorial strategies, and the evolution of televised news narratives across competing networks.

Social Recommendation Data

Creators: Cai, Chenwei; He, Ruining; McAuley, Julian; Zhao, Tong; King, Irwin
Publication Date: 2017

These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews). The per-user ratings allow for detailed analysis of user preferences, and by capturing the social (or trust) relationships between users, the dataset enables the study of how social connections influence user behavior and recommendations. The dataset is approximately 660 MB in size and includes:

Number of Observations:

  • LibraryThing:

    • Users: 73,882
    • Items: 337,561
    • Ratings: 979,053
    • Social Relations: 120,536
  • Epinions:

    • Users: 116,260
    • Items: 41,269
    • Ratings/Feedback: 181,394
    • Social Relations: 181,304

The dataset is structured into:

  • User Information: Anonymized user identifiers.

  • Item Information: Identifiers for items such as books or products.

  • Ratings/Feedback: User-provided ratings or feedback scores for items.

  • Social Relations: Mappings of social or trust relationships between users.
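The point of pairing ratings with trust edges is that a user's trusted connections can inform predictions for items the user has not rated. A minimal sketch of that idea, with invented user and item identifiers (this is an illustration of social recommendation in general, not the creators' models):

```python
# Ratings keyed by (user, item); trust edges as adjacency lists.
# All identifiers and values below are invented for illustration.
ratings = {("u1", "book_a"): 4.0, ("u2", "book_a"): 5.0, ("u3", "book_a"): 2.0}
trusts = {"u0": ["u1", "u2"]}  # u0 trusts u1 and u2 (but not u3)


def trusted_average(user, item):
    """Average the ratings that `user`'s trusted connections gave `item`."""
    scores = [
        ratings[(t, item)] for t in trusts.get(user, []) if (t, item) in ratings
    ]
    return sum(scores) / len(scores) if scores else None


print(trusted_average("u0", "book_a"))
```

Models built on these data typically blend such trust-weighted signals with conventional collaborative filtering rather than using either alone.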

