Crowdfunding datasets

Creators: Web Robots
Publication Date: 2016-03-01

This dataset consists of large-scale web scraping projects that provide publicly available datasets of e-commerce product listings, reviews, pricing, and other related data from sources such as Kickstarter and Indiegogo. These datasets support data mining, analysis, and machine learning applications, enabling users to explore trends in product performance, customer sentiment, and pricing strategies across multiple industries. Data collection is performed monthly, keeping the datasets up to date with the latest project information. The initial release included data on approximately 91,500 Indiegogo projects, and the dataset has been updated monthly since May 2016. As of the latest update, the Kickstarter dataset covers all current and historic projects, starting in March 2016. Each dataset entry includes the following key variables:

  • Project ID: A unique identifier for each project.

  • Title: The name of the project.

  • Category: The category under which the project is listed (e.g., Technology, Art).

  • Creator: The name of the project creator.

  • Goal: The funding goal set by the creator.

  • Pledged Amount: The total amount pledged by backers.

  • Backers: The number of backers supporting the project.

  • Launch Date: The date when the project was launched.

  • Deadline: The funding deadline for the project.

  • Status: The current status of the project (e.g., active, successful, failed).
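A minimal sketch of how records with these variables might be processed, assuming a CSV layout with column names matching the list above (the sample rows and names are invented for illustration, not taken from the actual files):

```python
import csv
import io

# Invented sample rows mimicking the documented variables (not real data).
SAMPLE = """project_id,title,category,goal,pledged_amount,backers,status
101,Smart Mug,Technology,5000,7500,320,successful
102,Indie Film,Film,20000,8000,150,failed
103,Art Book,Art,1000,1200,90,successful
"""

def load_projects(text):
    """Parse the CSV and convert numeric fields to numbers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["goal"] = float(row["goal"])
        row["pledged_amount"] = float(row["pledged_amount"])
        row["backers"] = int(row["backers"])
        rows.append(row)
    return rows

projects = load_projects(SAMPLE)

# Share of projects that met or exceeded their funding goal.
success_rate = sum(p["pledged_amount"] >= p["goal"] for p in projects) / len(projects)
```

The same pattern scales to the full monthly dumps: parse once, coerce the numeric fields, then aggregate.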

Twitter datasets

Creators: GitHub
Publication Date: 2024-10-23

This collection brings together publicly available Twitter datasets for research and analysis, covering topics such as user behavior, tweets, hashtags, network structure, and sentiment. Researchers can use these datasets for tasks like sentiment analysis, trend detection, network analysis, and studying information diffusion on social platforms. The datasets are typically structured to include variables such as:

  • tweet_ID: Unique identifier for each tweet.

  • text: Content of the tweet.

  • date: Timestamp indicating when the tweet was posted.

  • user_ID: Identifier for the user who posted the tweet.

  • propositions: Extracted linguistic propositions or predicate phrases within the tweet.

  • coreference information: Identifies predicates that refer to the same entities in context.

  • source_URL: URL link to the original tweet or associated content, if available.
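For instance, trend detection over such records reduces to counting tweets per day and per hashtag. The sketch below assumes records shaped like the variable list above; the tweets themselves are invented for illustration:

```python
from collections import Counter

# Invented tweet records following the documented variables (not real data).
tweets = [
    {"tweet_ID": "1", "text": "Hello #ai", "date": "2024-10-01", "user_ID": "u1"},
    {"tweet_ID": "2", "text": "More #ai talk", "date": "2024-10-01", "user_ID": "u2"},
    {"tweet_ID": "3", "text": "Off topic", "date": "2024-10-02", "user_ID": "u1"},
]

# Tweet volume per day.
tweets_per_day = Counter(t["date"] for t in tweets)

# Hashtag frequencies, extracted naively from the text field.
hashtags = Counter(
    word.lstrip("#").lower()
    for t in tweets
    for word in t["text"].split()
    if word.startswith("#")
)
```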

Meta content library

Creators: Meta
Publication Date: 2024-10-23

The Meta Content Library and its associated API, provided by Meta, offer comprehensive access to public content from Facebook, Instagram, and Threads. These tools are designed to facilitate in-depth research and analysis of social media content by providing structured data on public posts and associated metrics. The library provides near real-time access to public content across Meta’s platforms, including posts from Pages, groups, events, business and creator accounts, and widely known individuals and organizations. Each post includes information about the creator (e.g., user or account type), the number of reactions (likes, loves, haha, wow, sad, angry), shares, and views. This detailed metadata enables nuanced analysis of user engagement and content dissemination patterns.

The dataset is structured to include the following key variables:

  • Post Creator: Information about the user or account that created the post, including the account type (e.g., individual, public figure, business).

  • Reactions: Total number of reactions the post has received, categorized by type (likes, loves, haha, wow, sad, angry).

  • Shares: Number of times the post has been shared by other users.

  • Views: Total number of views the post has received.
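A simple engagement metric can be derived directly from these variables. The field names below are illustrative placeholders mirroring the list above, not the actual Content Library API schema:

```python
# Hypothetical post record; field names are invented to mirror the
# documented variables, not Meta's actual API schema.
post = {
    "post_creator": {"name": "Example Page", "account_type": "business"},
    "reactions": {"likes": 120, "loves": 30, "haha": 5, "wow": 2, "sad": 0, "angry": 1},
    "shares": 40,
    "views": 10_000,
}

# Total reactions across all reaction types.
total_reactions = sum(post["reactions"].values())

# One possible engagement rate: interactions per view.
engagement_rate = (total_reactions + post["shares"]) / post["views"]
```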

Clubhouse data

Creators: Kaggle
Publication Date: 2024-10-23

The Clubhouse dataset on Kaggle provides data related to the social audio app Clubhouse. It contains information such as user demographics, room activity, user engagement metrics, and discussions held on the platform. The dataset is approximately 9.7 MB in size and comprises 1,300,515 user profiles, each representing an individual observation, which makes it a substantial sample for analyzing social networking patterns and engagement within the platform. Structurally, the dataset is organized as a CSV file, with each row corresponding to a user’s profile and columns representing various attributes. The key variables included are:

  • user_ID: A unique identifier for each user on Clubhouse.

  • name: The display name of the user.

  • photo_url: URL of the user’s profile photo.

  • username: The username chosen by the user on Clubhouse.

  • Twitter: The user’s Twitter handle or linked Twitter account.

  • Instagram: The user’s Instagram handle or linked Instagram account.

  • num_followers: The number of followers the user has on Clubhouse.

  • num_following: The number of accounts the user is following on Clubhouse.

  • time_created: The date and time when the user’s account was created.

  • invited_by_user_profile: Profile information of the user who invited this user to Clubhouse.
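Because each row records who invited the user, the invite graph can be reconstructed from the CSV alone. A minimal sketch, with invented rows in the documented column layout (the real file has roughly 1.3 million rows):

```python
import csv
import io

# Invented sample rows in the documented CSV layout (not real data).
SAMPLE = """user_id,name,username,num_followers,num_following,invited_by_user_profile
1,Alice,alice,1500,300,
2,Bob,bob,200,180,1
3,Carol,carol,50,60,1
"""

users = list(csv.DictReader(io.StringIO(SAMPLE)))

# Reconstruct the invite tree: inviter user_id -> list of invited user_ids.
invites = {}
for u in users:
    inviter = u["invited_by_user_profile"]
    if inviter:  # empty string means no recorded inviter
        invites.setdefault(inviter, []).append(u["user_id"])
```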

Reddit datasets

Creators: Conversational Analysis Toolkit (ConvoKit)
Publication Date: n.a.

The ConvoKit Subreddit Corpus is a collection of user comments from various subreddits on Reddit, gathered over time to facilitate research in conversational analysis and sociolinguistics. It encompasses posts and comments from 948,169 individual subreddits, each from its inception until October 2018. The dataset is organized into individual corpora for each subreddit, facilitating targeted analysis of specific communities. Each corpus includes detailed information at multiple levels:

  • Speaker level: speakers are identified by their Reddit usernames.

  • Utterance level: each post or comment is treated as an utterance with attributes such as a unique ID, author, conversation ID, reply relationships, timestamp, and text content.

  • Conversation level: each post and its corresponding comments are considered a conversation, with metadata including the post’s title, number of comments, domain, subreddit, and author flair.

  • Corpus level: aggregates such as the list of subreddits included, the total number of posts and comments, and the number of unique speakers.
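The reply relationships at the utterance level make it straightforward to rebuild conversation threads. A sketch using plain dicts shaped like ConvoKit's documented utterance attributes (the utterances themselves are invented, and real usage would go through the ConvoKit library rather than raw dicts):

```python
from collections import defaultdict

# Invented utterances mirroring ConvoKit's documented attributes: id,
# speaker, conversation_id, reply_to, and text (not real Reddit data).
utterances = [
    {"id": "p1", "speaker": "anna", "conversation_id": "p1", "reply_to": None, "text": "Original post"},
    {"id": "c1", "speaker": "ben",  "conversation_id": "p1", "reply_to": "p1", "text": "First comment"},
    {"id": "c2", "speaker": "anna", "conversation_id": "p1", "reply_to": "c1", "text": "Reply to Ben"},
]

# Group utterances into conversations and index the reply tree.
conversations = defaultdict(list)
replies_to = defaultdict(list)
for utt in utterances:
    conversations[utt["conversation_id"]].append(utt["id"])
    if utt["reply_to"] is not None:
        replies_to[utt["reply_to"]].append(utt["id"])
```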

Instagram Posts from Football Players

Creators: Klostermann, Jan
Publication Date: 2023

This dataset includes information on 334,071 Instagram posts from 1,435 male professional football players who were under contract at any of the 56 clubs in the English Premier League, the Spanish La Liga, or the German Bundesliga. The data was collected on December 31, 2019 and includes the whole history of Instagram posts up to that point in time.

The dataset provides the following information:

  • Player information: Information on each football player in the dataset is collected from http://www.transfermarkt.de and includes club, position, market value (at the time of data collection), highest market value, and the year in which the highest market value was observed. Further, the Instagram account name is provided.
  • Instagram post information: Information on the Instagram posts including the shortcode (which can be used to open the post on instagram.com), date, caption text, number of likes, number of comments, post type (image, sidecar, video).
  • Instagram post images: For each post, we analyzed the content of the image (first image for sidecar posts, first frame for video posts) using Google Vision and extracted the number of persons, their age, and their gender. Further, we extracted all tags included in the image, such as “soccer” or “car”.
  • Additional information: Additional information such as the images of the posts can be requested from the authors.
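Typical use of the post-level fields is comparing engagement across post types. A minimal sketch with invented records shaped like the variables described above (not the actual player data):

```python
from collections import defaultdict

# Invented post records with the documented fields (not real player data).
posts = [
    {"shortcode": "aaa", "post_type": "image", "likes": 1000, "comments": 50},
    {"shortcode": "bbb", "post_type": "video", "likes": 3000, "comments": 200},
    {"shortcode": "ccc", "post_type": "image", "likes": 2000, "comments": 100},
]

# Mean number of likes per post type.
totals = defaultdict(lambda: [0, 0])  # post_type -> [likes_sum, count]
for p in posts:
    totals[p["post_type"]][0] += p["likes"]
    totals[p["post_type"]][1] += 1
avg_likes = {t: s / n for t, (s, n) in totals.items()}
```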

The dataset has been used in the following paper:

Klostermann, J., Meißner, M., Max, A., & Decker, R. (2023). Presentation of celebrities’ private life through visual social media. Journal of Business Research, 156, 113524.

Please cite the paper when using the dataset for your own research. It is recommended to read the paper for further information on the dataset.

Huge Collection of Reddit Votes

Creators: Leake, Joseph
Publication Date: 2020

This dataset covers over 44 million upvotes and downvotes cast between 2007 and 2020 by Reddit users who opted in to make their voting history public, ensuring compliance with privacy preferences. The votes are provided as a tab-delimited list; each row contains the submission id of the thread being voted on, the subreddit the submission was located in, the epoch timestamp of the vote, the voter’s username, and whether it was an upvote or a downvote. The download is listed at 21.9 kB in size. Structurally, the dataset is organized into two main components:

  1. Votes Data: A tab-delimited file where each row represents a vote with the following fields:

    • submission_id: Identifier of the Reddit submission that received the vote.

    • subreddit: Name of the subreddit where the submission was posted.

    • created_time: Epoch timestamp indicating when the vote was cast.

    • username: Reddit username of the voter.

    • vote: Type of vote, either ‘upvote’ or ‘downvote’.

  2. Submissions Data: A separate file containing information about the submissions that received votes, including details such as submission titles, authors, and timestamps.
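Parsing the votes file is a matter of splitting each line on tabs in the documented field order. The lines below are invented for illustration:

```python
from collections import Counter

# Invented tab-delimited lines in the documented field order:
# submission_id, subreddit, created_time, username, vote
LINES = [
    "t3_abc\taskreddit\t1580000000\tuser1\tupvote",
    "t3_abc\taskreddit\t1580000100\tuser2\tdownvote",
    "t3_def\tpython\t1580000200\tuser1\tupvote",
]

# Net score (upvotes minus downvotes) per subreddit.
net_score = Counter()
for line in LINES:
    submission_id, subreddit, created_time, username, vote = line.split("\t")
    net_score[subreddit] += 1 if vote == "upvote" else -1
```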

2012-2016 Facebook Posts

Creators: Martinchek, Patrick
Publication Date: 2016

This dataset comprises Facebook posts from 15 mainstream media outlets during the years 2012 to 2016, offering insights into their social media strategies and audience engagement during a significant period in digital media evolution. The dataset is structured to include various fields such as post content, timestamps, and engagement metrics like likes, shares, and comments. Each record represents a single Facebook post, allowing for detailed analysis of individual entries. It has a size of 861.17 MB.
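With per-post timestamps, posting volume over the 2012-2016 window can be tallied by year. The records and field names below are invented to approximate the described fields; the real schema may differ:

```python
from collections import Counter
from datetime import datetime, timezone

# Invented records approximating the described fields (post content,
# timestamp, engagement metrics); the real schema may differ.
posts = [
    {"page": "OutletA", "created_time": 1356998400, "likes": 10, "shares": 2},
    {"page": "OutletA", "created_time": 1420070400, "likes": 25, "shares": 5},
    {"page": "OutletB", "created_time": 1420074000, "likes": 7,  "shares": 1},
]

# Count posts per calendar year (UTC).
posts_per_year = Counter(
    datetime.fromtimestamp(p["created_time"], tz=timezone.utc).year for p in posts
)
```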
