
Instagram Posts from Football Players

Publication Date: 2023
Creators: Klostermann, Jan

This dataset includes information on 334,071 Instagram posts from 1,435 male professional football players who were under contract at any of the 56 clubs in the English Premier League, the Spanish La Liga, and the German Bundesliga. The data were collected on December 31, 2019, and include the full history of Instagram posts up to that date.

The dataset provides the following information:

  • Player information: Information on each football player in the dataset was collected from http://www.transfermarkt.de and includes club, position, market value (at the time of data collection), highest market value, and the year in which the highest market value was observed. The Instagram account name is also provided.
  • Instagram post information: Information on the Instagram posts, including the shortcode (which can be used to open the post on instagram.com), date, caption text, number of likes, number of comments, and post type (image, sidecar, video).
  • Instagram post images: For each post, we analyzed the content of the image (first image for sidecar posts, first frame for video posts) using Google Vision and extracted the number of persons, their age, and their gender. Further, we extracted all tags that are included in the image, such as “soccer” or “car”.
  • Additional information: Additional information such as the images of the posts can be requested from the authors.

The dataset has been used in the following paper:

Klostermann, J., Meißner, M., Max, A., & Decker, R. (2023). Presentation of celebrities’ private life through visual social media. Journal of Business Research, 156, 113524.

Please cite the paper when using the dataset for your own research. It is recommended to read the paper for further information on the dataset.
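
The shortcode listed under the Instagram post information can be turned into a link to the post. The sketch below assumes the commonly seen URL pattern https://www.instagram.com/p/<shortcode>/, which the dataset description does not spell out; the function name and the sample shortcode are made up for illustration.

```python
def post_url(shortcode: str) -> str:
    """Build a post URL from a shortcode (URL pattern is an assumption)."""
    # Strip stray whitespace that may surround the value in a data file.
    return f"https://www.instagram.com/p/{shortcode.strip()}/"


# Hypothetical shortcode, not taken from the dataset.
example_url = post_url("B6xYz12AbCd")
```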

Huge Collection of Reddit Votes

Publication Date: 2020
Creators: Leake, Joseph

Data on over 44 million upvotes and downvotes cast between 2007 and 2020.

This is a tab-delimited list of votes cast by reddit users who have opted in to make their voting history public. Each row contains the submission id for the thread being voted on, the subreddit the submission was located in, the epoch timestamp associated with the vote, the voter’s username, and whether it was an upvote or a downvote. There’s a separate file containing information about the submissions that were voted on.
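
A minimal sketch of reading such a file, assuming the columns appear in the order the description lists them, that there is no header row, that timestamps are in seconds, and that votes are encoded as "up" or "down". None of these details are confirmed by the description, and the column names and sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Hypothetical column names, in the order the description lists the fields.
COLUMNS = ["submission_id", "subreddit", "created_time", "username", "vote"]


def parse_votes(handle):
    """Yield one dict per row of a headerless, tab-delimited votes file."""
    for row in csv.reader(handle, delimiter="\t"):
        record = dict(zip(COLUMNS, row))
        # Convert the epoch timestamp (assumed seconds) to a UTC datetime.
        record["created_time"] = datetime.fromtimestamp(
            int(record["created_time"]), tz=timezone.utc
        )
        yield record


# Made-up sample rows, not taken from the dataset.
sample = io.StringIO(
    "abc123\tpython\t1262304000\talice\tup\n"
    "abc123\tpython\t1262304060\tbob\tdown\n"
    "def456\tdatasets\t1577836800\talice\tup\n"
)
votes = list(parse_votes(sample))
votes_per_subreddit = Counter(v["subreddit"] for v in votes)
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `parse_votes`; because the function is a generator, the 44 million rows never need to sit in memory at once.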

Fact-Checking Facebook Politics Pages

Publication Date: 2016
Creators: Silverman, Craig; Strapagiel, Lauren; Shaban, Hamza; Hall, Ellie; Singer-Vine, Jeremy

This repository contains the data and analysis for the BuzzFeed News article, “Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate,” published October 20, 2016.

Characterizing Online Discussion Using Coarse Discourse Sequences

Publication Date: 2017
Creators: Zhang, Amy; Culbertson, Brian; Paritosh, Praveen

In this work, we present a novel method for classifying comments in online discussions into a set of coarse discourse acts towards the goal of better understanding discussions at scale. To facilitate this study, we devise a categorization of coarse discourse acts designed to encompass general online discussion and allow for easy annotation by crowd workers. We collect and release a corpus of over 9,000 threads comprising over 100,000 comments manually annotated via paid crowdsourcing with discourse acts and randomly sampled from the site Reddit. Using our corpus, we demonstrate how the analysis of discourse acts can characterize different types of discussions, including discourse sequences such as Q&A pairs and chains of disagreement, as well as different communities. Finally, we conduct experiments to predict discourse acts using our corpus, finding that structured prediction models such as conditional random fields can achieve an F1 score of 75%. We also demonstrate how the broadening of discourse acts from simply question and answer to a richer set of categories can improve the recall performance of Q&A extraction.

Tracking Mastodon user numbers over time

Publication Date: 2022
Creators: Willison, Simon

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter. I’ve set up a new git scraper to track the number of registered user accounts on known Mastodon instances over time.

Stack Exchange Data

Publication Date: 2014
Creators: Stack Exchange Inc.

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks.
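
Once an archive is extracted, each XML file can be streamed rather than loaded whole. The sketch below assumes every record is a `row` element whose fields are stored as attributes (for example `Id` and `Score`); the description above does not state this layout, and the sample document is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Stream <row> elements from a dump file as plain attribute dicts."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            # Drop the element's contents to keep memory use low on big files.
            elem.clear()


# Made-up sample in the assumed layout, not taken from the dump.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<posts>\n"
    b'  <row Id="1" PostTypeId="1" Score="5" />\n'
    b'  <row Id="2" PostTypeId="2" Score="3" />\n'
    b"</posts>\n"
)
rows = list(iter_rows(sample))
total_score = sum(int(r["Score"]) for r in rows)
```

`iter_rows` also accepts a file path, so an extracted file such as Posts.xml can be passed in directly.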

3 Million Russian troll tweets

Publication Date: 2018
Creators: FiveThirtyEight; Warren, Patrick; Linvill, Darren

This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian “troll factory” and a defendant in an indictment filed by the Justice Department in February 2018, as part of special counsel Robert Mueller’s Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017.
