Resources by Stefan

IAB-SMART-Mobility

Creators: Research Data Centre of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB)
Publication Date: 2023

The IAB-SMART Mobility data module provides smartphone-generated mobility indicators as additional variables for the Panel Labor Market and Social Security (PASS). Participants in IAB-SMART are PASS respondents from wave 11 (2017) with an Android smartphone who were invited to install the IAB-SMART app. With the user’s consent, the IAB-SMART app collected the location of the smartphone at half-hour intervals from January to August 2018. Mobility indicators such as the number of visited places or the share of location measurements at home were derived from this location information. The mobility indicators capture the entire observation period as well as weekdays and weekends separately. With IAB-SMART Mobility, research questions on mobility behaviour in relation to employment and unemployment can be analysed. The indicators from IAB-SMART Mobility complement the survey data from PASS and can also be linked with the administrative data from the IAB (PASS-ADIAB). More information can be found on the website of the FDZ-BA/IAB: DOI: 10.5164/IAB.IAB-SMART.de.en.v1
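As an illustration of how an indicator like the share of location measurements at home can be derived from half-hourly location pings, consider the sketch below. The data layout and the exact home-matching rule are illustrative assumptions, not the module’s actual derivation procedure.

```python
# Illustrative sketch: compute the share of location pings recorded at home.
# In practice, home detection would use fuzzy spatial matching, not equality.
def share_at_home(pings, home):
    """Fraction of (lat, lon) pings that match the home location."""
    at_home = sum(1 for p in pings if p == home)
    return at_home / len(pings)

# Hypothetical half-hourly pings for one participant (coordinates invented).
home = (52.52, 13.40)
pings = [(52.52, 13.40), (52.52, 13.40), (48.14, 11.58), (52.52, 13.40)]
print(share_at_home(pings, home))  # 0.75
```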

Detailed information about the data module, as well as a guide on how to link IAB-SMART Mobility with PASS data, can be found in the following publication: Zimmermann, Florian; Filser, Andreas; Haas, Georg-Christoph; Bähr, Sebastian (2023). The IAB-SMART-Mobility Module: An Innovative Research Dataset with Mobility Indicators Based on Raw Geodata. Jahrbücher für Nationalökonomie und Statistik. https://doi.org/10.1515/jbnst-2023-0051

Fact-Checking Facebook Politics Pages

Creators: Silverman, Craig; Strapagiel, Lauren; Shaban, Hamza; Hall, Ellie; Singer-Vine, Jeremy
Publication Date: 2016

This repository contains the data and analysis for the BuzzFeed News article, “Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate,” published October 20, 2016. The dataset specifically examines content from hyperpartisan Facebook pages, providing insights into the spread of false and misleading information within polarized political communities. It includes Facebook engagement figures—such as shares, reactions, and comments—offering a perspective on how users interact with content of varying accuracy. The dataset has a total size of 364.8 kB and contains over 1,000 posts from hyperpartisan political Facebook pages. Data was collected up to October 11, 2016, capturing a snapshot of political content leading up to the 2016 U.S. presidential election. Structurally, the dataset is organized into a spreadsheet with columns representing:

  • Post Content: Text of the Facebook post.

  • Fact-Check Rating: Assessment of the post’s accuracy.

  • Engagement Metrics: Counts of shares, reactions, and comments.
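A minimal sketch of how such a spreadsheet could be analysed, e.g. comparing typical share counts across fact-check ratings. The column names (post_text, rating, share_count, and so on) are assumptions for illustration; the actual headers in the repository may differ.

```python
import csv
import io
import statistics
from collections import defaultdict

# Invented sample rows mirroring the described columns.
SAMPLE = """\
post_text\trating\tshare_count\treaction_count\tcomment_count
Example post A\tmostly true\t120\t300\t45
Example post B\tmostly false\t560\t900\t210
Example post C\tmostly true\t80\t150\t30
"""

def median_shares_by_rating(tsv_text):
    """Group posts by fact-check rating and report the median share count."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    shares = defaultdict(list)
    for row in rows:
        shares[row["rating"]].append(int(row["share_count"]))
    return {rating: statistics.median(vals) for rating, vals in shares.items()}

print(median_shares_by_rating(SAMPLE))
# {'mostly true': 100.0, 'mostly false': 560}
```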

2012-2016 Facebook Posts

Creators: Martinchek, Patrick
Publication Date: 2016

This dataset comprises Facebook posts from 15 top mainstream media outlets during the years 2012 to 2016, offering insights into their social media strategies and audience engagement during a significant period in digital media evolution. The dataset is structured to include various fields such as post content, timestamps, and engagement metrics like likes, shares, and comments. Each record represents a single Facebook post, allowing for detailed analysis of individual entries. It has a size of 861.17 MB.
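With post-level records like these, per-year posting volume is straightforward to tally from the timestamps. The field names below (message, created_time, likes_count) are assumptions; the dataset’s actual keys may differ.

```python
from collections import Counter

# Invented sample records with assumed field names.
posts = [
    {"message": "Breaking news A", "created_time": "2014-06-01T12:00:00", "likes_count": 1200},
    {"message": "Breaking news B", "created_time": "2014-08-15T09:30:00", "likes_count": 300},
    {"message": "Breaking news C", "created_time": "2016-02-20T18:45:00", "likes_count": 5400},
]

def posts_per_year(records):
    """Tally posts by the year prefix of their ISO-format timestamp."""
    return Counter(rec["created_time"][:4] for rec in records)

print(posts_per_year(posts))
# Counter({'2014': 2, '2016': 1})
```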

Huge Collection of Reddit Votes

Creators: Leake, Joseph
Publication Date: 2020

This dataset covers over 44 million upvotes and downvotes cast between 2007 and 2020 by Reddit users who opted in to make their voting history public, ensuring compliance with privacy preferences. It is a tab-delimited list of votes: each row contains the submission id of the thread being voted on, the subreddit the submission was located in, the epoch timestamp of the vote, the voter’s username, and whether it was an upvote or a downvote. A separate file contains information about the submissions that were voted on. The dataset has a size of 21.9 kB. Structurally, it is organized into two main components:

  1. Votes Data: A tab-delimited file where each row represents a vote with the following fields:

    • submission_id: Identifier of the Reddit submission that received the vote.

    • subreddit: Name of the subreddit where the submission was posted.

    • created_time: Epoch timestamp indicating when the vote was cast.

    • username: Reddit username of the voter.

    • vote: Type of vote, either ‘upvote’ or ‘downvote’.

  2. Submissions Data: A separate file containing information about the submissions that received votes, including details such as submission titles, authors, and timestamps.
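Given the field list above, the votes file can be parsed with Python’s standard csv module. The sample rows below are invented for illustration; the real file has no header row in this sketch’s assumption, so field names are supplied explicitly.

```python
import csv
import io
from collections import Counter

# Field names follow the description above.
FIELDS = ["submission_id", "subreddit", "created_time", "username", "vote"]

# Invented sample rows in the described tab-delimited layout.
SAMPLE = """\
t3_abc123\taskreddit\t1577836800\talice\tupvote
t3_abc123\taskreddit\t1577836900\tbob\tdownvote
t3_def456\tpython\t1577840000\talice\tupvote
"""

def vote_totals_by_subreddit(tsv_text):
    """Count upvotes and downvotes per subreddit."""
    totals = Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        totals[(row["subreddit"], row["vote"])] += 1
    return totals

print(vote_totals_by_subreddit(SAMPLE))
```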

Tracking Mastodon user numbers over time

Creators: Willison, Simon
Publication Date: 2022

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter. I’ve set up a new git scraper to track the number of registered user accounts on known Mastodon instances over time. The dataset collects data from numerous Mastodon instances, providing a holistic view of user distribution across the network. This approach captures the decentralized nature of Mastodon, offering insights into individual server growth and overall network expansion. By recording user numbers at regular intervals, the dataset enables the analysis of growth patterns over time, identifying trends and significant adoption milestones. The dataset includes user counts from approximately 1,830 Mastodon instances, with data points collected approximately every 20 minutes. This frequency allows for detailed temporal analysis of user growth. Data collection began on November 20, 2022, and has continued since then, capturing the rapid growth of Mastodon following significant events such as changes in other social media platforms.

The dataset is structured with each record representing a snapshot of user numbers across various Mastodon instances at a specific timestamp. Key fields include:

  • Instance Name: The domain name of the Mastodon instance.

  • User Count: The number of registered users on the instance at the time of data collection.

  • Timestamp: The date and time when the data was collected.
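A small sketch of the kind of growth analysis this structure supports: computing per-instance user growth between two snapshots. The snapshot representation (a dict from instance name to user count) is an assumption about how one might load the scraped data, not the scraper’s own format, and the counts below are invented.

```python
# Sketch: given two snapshots mapping instance name -> user count,
# compute absolute user growth per instance present in both.
def user_growth(earlier, later):
    """Return per-instance user growth between two snapshots."""
    return {
        instance: later[instance] - count
        for instance, count in earlier.items()
        if instance in later
    }

# Hypothetical snapshots one week apart (numbers invented).
snap_nov20 = {"mastodon.social": 800_000, "fosstodon.org": 40_000}
snap_nov27 = {"mastodon.social": 950_000, "fosstodon.org": 52_000}
print(user_growth(snap_nov20, snap_nov27))
# {'mastodon.social': 150000, 'fosstodon.org': 12000}
```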

Characterizing Online Discussion Using Coarse Discourse Sequences

Creators: Zhang, Amy; Culbertson, Brian; Paritosh, Praveen
Publication Date: 2017

In this work, we present a novel method for classifying comments in online discussions into a set of coarse discourse acts towards the goal of better understanding discussions at scale. To facilitate this study, we devise a categorization of coarse discourse acts designed to encompass general online discussion and allow for easy annotation by crowd workers. We collect and release a corpus of over 9,000 threads comprising over 100,000 comments manually annotated via paid crowdsourcing with discourse acts and randomly sampled from the site Reddit. Using our corpus, we demonstrate how the analysis of discourse acts can characterize different types of discussions, including discourse sequences such as Q&A pairs and chains of disagreement, as well as different communities. Finally, we conduct experiments to predict discourse acts using our corpus, finding that structured prediction models such as conditional random fields can achieve an F1 score of 75%. We also demonstrate how the broadening of discourse acts from simply question and answer to a richer set of categories can improve the recall performance of Q&A extraction.

Central African Republic: Displacement Data - Baseline Assessment

Creators: International Organization for Migration (IOM)
Publication Date: 2024

This dataset covers internally displaced persons (IDPs), returnees from within CAR (former IDPs), and returnees from other countries, broken down by area of origin and period of displacement between 2013 and the date of assessment. It offers data on the distribution of IDPs and returnees, including their origins, periods of displacement, and reasons for displacement. This information is crucial for understanding the dynamics of displacement within CAR and for planning humanitarian interventions. As a sub-component of mobility tracking, the baseline assessment collects data on the presence of displaced populations in defined geographic areas. This foundational data supports the identification of needs and the coordination of assistance efforts. The assessment was conducted in 6 prefectures (admin1), 16 sub-prefectures (admin2), and 367 localities. The number of observations varies across rounds of data collection. For instance, Round 6 of the assessment identified a total displaced population of 1,074,983 individuals, comprising 580,692 IDPs and 375,684 returnees. The dataset is 16.7 kB and organized into several key components:

  • IDP Data: Information on internally displaced persons, including their locations, demographics, and displacement periods.

  • Returnee Data: Details about returnees, both former IDPs and those returning from other countries, including their areas of origin and return timelines.

  • Reasons for Displacement: Categorization of displacement causes, such as armed conflicts, inter-community tensions, or preventive measures. Notably, 67% of internal displacements are linked to armed conflicts, 26% to inter-community tensions, and 6% are preventive.

  • Geographical Distribution: Data on the distribution of displaced populations across various regions and localities within CAR.

Crowdsourced air traffic data from The OpenSky Network 2020

Creators: Olive, Xavier; Strohmeier, Martin; Lübbe, Jannis
Publication Date: 2022

The data in this dataset is derived and cleaned from the full OpenSky dataset to illustrate the development of air traffic during the COVID-19 pandemic. It spans all flights seen by the network’s more than 2,500 members since 1 January 2019, and more data will be added periodically until the end of the COVID-19 pandemic. Leveraging this network of volunteers worldwide, the dataset aggregates ADS-B signals, ensuring a rich and diverse data source. The dataset includes records of 41,900,660 flights, capturing data from 160,737 unique aircraft. Flight operations involving 13,934 airports across 127 countries are documented. In total, the dataset has a size of 7.0 GB. Each month is represented by a separate CSV file containing flight data for that specific period. Each file includes the following columns:

  • callsign: Identifier used for air traffic control communications.
  • number: Commercial flight number, if available.
  • icao24: Unique 24-bit address assigned to the aircraft’s transponder.
  • registration: Aircraft’s registration number.
  • typecode: Aircraft model code.
  • origin: ICAO code of the departure airport.
  • destination: ICAO code of the arrival airport.
  • firstseen: Timestamp of the first detection during the flight.
  • lastseen: Timestamp of the last detection during the flight.
  • day: Date of the flight.
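As a sketch, monthly files with these columns can be summarised with Python’s standard csv module, for example counting departures per origin airport. The sample rows below are invented for illustration; real files are much larger.

```python
import csv
import io
from collections import Counter

# Invented sample rows using the column names listed above.
SAMPLE = """\
callsign,number,icao24,registration,typecode,origin,destination,firstseen,lastseen,day
DLH400,LH400,3c6444,D-AIMA,A388,EDDF,KJFK,2020-03-01 08:00:00,2020-03-01 16:30:00,2020-03-01
BAW117,BA117,400409,G-XWBA,A35K,EGLL,KJFK,2020-03-01 09:15:00,2020-03-01 17:05:00,2020-03-01
DLH401,LH401,3c6445,D-AIMB,A388,KJFK,EDDF,2020-03-02 18:00:00,2020-03-03 02:10:00,2020-03-02
"""

def flights_per_origin(csv_text):
    """Count departures per origin airport (ICAO code)."""
    return Counter(row["origin"] for row in csv.DictReader(io.StringIO(csv_text)))

print(flights_per_origin(SAMPLE))
```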
