Resources by david-stemper

Million song dataset

Creators: Thierry Bertin-Mahieux and Daniel P.W. Ellis from Columbia University, along with Brian Whitman and Paul Lamere from The Echo Nest.
Publication Date: 2011

The Million Song Dataset (MSD) is a large-scale music dataset created by The Echo Nest and LabROSA to advance research in music information retrieval and recommendation systems. It contains metadata for one million contemporary music tracks, including song titles, artists, albums, release years, and genres, along with audio features such as tempo, loudness, key, time signature, and mode, facilitating in-depth musical analysis. With data on one million tracks, the MSD enables the development and evaluation of algorithms that can scale to commercial music collections. The dataset comprises approximately 280 GB of data, features 44,745 unique artists, and covers contemporary popular music up to its release in 2011. The data is organized in the HDF5 file format, offering efficient storage and access to large amounts of data. Each track’s data is stored as a separate HDF5 file, containing the following key variables:

  • Metadata:

    • Song ID: A unique identifier for each track.

    • Title: The name of the song.

    • Artist: The performing artist or band.

    • Album: The album on which the song was released.

    • Release Year: The year the song was released.

    • Genre: The musical genre classification.

  • Audio Features:

    • Tempo: The speed of the song in beats per minute (BPM).

    • Loudness: The average volume level in decibels (dB).

    • Key: The musical key of the song (e.g., C major, G minor).

    • Time Signature: The meter of the song, indicating beats per measure (e.g., 4/4, 3/4).

    • Mode: The tonal mode, typically major or minor.
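
The per-track HDF5 layout can be sketched in Python with h5py. The group paths and field names below (compound "songs" tables under metadata and analysis groups) follow the MSD's documented structure, but should be checked against an actual track file; the demo track written here is invented sample data, not real MSD content.

```python
# Sketch of reading one MSD-style per-track HDF5 file with h5py.
# Assumes compound "songs" tables under /metadata and /analysis,
# which mirrors the MSD layout but should be verified against a real file.
import numpy as np
import h5py

def write_demo_track(path):
    """Create a tiny HDF5 file mimicking one MSD track (demo data only)."""
    meta_dtype = np.dtype([("song_id", "S32"), ("title", "S64"),
                           ("artist_name", "S64")])
    ana_dtype = np.dtype([("tempo", "f8"), ("loudness", "f8"), ("key", "i4"),
                          ("mode", "i4"), ("time_signature", "i4")])
    with h5py.File(path, "w") as f:
        f.create_dataset("metadata/songs", data=np.array(
            [(b"SODEMO1", b"Demo Song", b"Demo Artist")], dtype=meta_dtype))
        f.create_dataset("analysis/songs", data=np.array(
            [(120.5, -7.3, 0, 1, 4)], dtype=ana_dtype))

def read_track(path):
    """Return a plain dict of the per-track fields described above."""
    with h5py.File(path, "r") as f:
        meta = f["metadata/songs"][0]
        ana = f["analysis/songs"][0]
        return {
            "song_id": meta["song_id"].decode(),
            "title": meta["title"].decode(),
            "artist": meta["artist_name"].decode(),
            "tempo": float(ana["tempo"]),
            "loudness": float(ana["loudness"]),
            "key": int(ana["key"]),
            "mode": int(ana["mode"]),
            "time_signature": int(ana["time_signature"]),
        }

write_demo_track("demo_track.h5")
track = read_track("demo_track.h5")
```

One file per track means analyses over the full collection typically iterate the directory tree and aggregate the per-track dicts.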

Crowdfunding datasets

Creators: Web Robots
Publication Date: 2016-03-01

Web Robots runs large-scale web scraping projects that provide publicly available datasets of crowdfunding project listings, pledges, and other related data from platforms such as Kickstarter and Indiegogo. These datasets are used for data mining, analysis, and machine learning applications, enabling users to explore trends in project performance, backer behavior, and funding strategies across multiple industries. Data collection is performed monthly, ensuring that the datasets remain up to date with the latest project information. The initial release included data on approximately 91,500 Indiegogo projects, and the datasets have been updated monthly since May 2016. As of the latest update, the Kickstarter dataset includes information on all current and historic projects, starting in March 2016. Each dataset entry includes the following key variables:

  • Project ID: A unique identifier for each project.

  • Title: The name of the project.

  • Category: The category under which the project is listed (e.g., Technology, Art).

  • Creator: The name of the project creator.

  • Goal: The funding goal set by the creator.

  • Pledged Amount: The total amount pledged by backers.

  • Backers: The number of backers supporting the project.

  • Launch Date: The date when the project was launched.

  • Deadline: The funding deadline for the project.

  • Status: The current status of the project (e.g., active, successful, failed).
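
A record with the variables above lends itself to simple aggregate summaries. The following standard-library sketch computes an overall success rate and per-category pledge totals; the three sample rows are invented for illustration, not taken from the Web Robots exports.

```python
# Summarizing crowdfunding records shaped like the variables above.
# Sample rows are invented; real exports have many more columns and rows.
from collections import defaultdict

projects = [
    {"project_id": 1, "category": "Technology", "goal": 10000,
     "pledged": 15000, "status": "successful"},
    {"project_id": 2, "category": "Art", "goal": 5000,
     "pledged": 1200, "status": "failed"},
    {"project_id": 3, "category": "Technology", "goal": 8000,
     "pledged": 9500, "status": "successful"},
]

# Overall success rate: fraction of projects with status "successful".
success_rate = sum(p["status"] == "successful" for p in projects) / len(projects)

# Total amount pledged per category.
pledged_by_category = defaultdict(float)
for p in projects:
    pledged_by_category[p["category"]] += p["pledged"]
```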

Twitter datasets

Creators: GitHub
Publication Date: 2024-10-23

This is a collection of publicly available Twitter datasets for research and analysis, covering a wide range of topics such as user behavior, tweets, hashtags, and network structure. Researchers can utilize these datasets for tasks like sentiment analysis, trend detection, network analysis, and studying information diffusion on social platforms. The datasets are typically structured to include variables such as:

  • tweet_ID: Unique identifier for each tweet.

  • text: Content of the tweet.

  • date: Timestamp indicating when the tweet was posted.

  • user_ID: Identifier for the user who posted the tweet.

  • propositions: Extracted linguistic propositions or predicate phrases within the tweet.

  • coreference_information: Identifies predicates that refer to the same entities in context.

  • source_URL: URL link to the original tweet or associated content, if available.
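
Tweet records shaped like the variables above support lightweight trend analysis directly. The sketch below counts hashtag frequencies across tweets with a regular expression; the two sample rows are invented, and only the tweet_ID, user_ID, and text fields from the list are used.

```python
# Counting hashtag frequencies across tweet records (simple trend detection).
# The rows are invented sample data; field names follow the list above.
import re
from collections import Counter

tweets = [
    {"tweet_ID": "1", "user_ID": "u1",
     "text": "Loving the new release! #python #data"},
    {"tweet_ID": "2", "user_ID": "u2",
     "text": "Conference day #data"},
]

# Extract "#word" tokens from every tweet and tally them case-insensitively.
hashtags = Counter(tag.lower()
                   for t in tweets
                   for tag in re.findall(r"#(\w+)", t["text"]))
```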

Meta content library

Creators: Meta
Publication Date: 2024-10-23

The Meta Content Library and its associated API, provided by Meta, offer comprehensive access to public content from Facebook, Instagram, and Threads. These tools are designed to facilitate in-depth research and analysis of social media content by providing structured data on public posts and associated metrics. The library provides near real-time access to public content across Meta’s platforms, including posts from Pages, groups, events, business and creator accounts, and widely known individuals and organizations. Each post includes information about the creator (e.g., user or account type), the number of reactions (likes, loves, haha, wow, sad, angry), shares, and views. This detailed metadata enables nuanced analysis of user engagement and content dissemination patterns.

The dataset is structured to include the following key variables:

  • Post Creator: Information about the user or account that created the post, including the account type (e.g., individual, public figure, business).

  • Reactions: Total number of reactions the post has received, categorized by type (likes, loves, haha, wow, sad, angry).

  • Shares: Number of times the post has been shared by other users.

  • Views: Total number of views the post has received.
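
From a record with the variables above, a basic engagement summary is straightforward to compute. The sketch below is illustrative only: the post record is invented, and the engagement-rate formula (interactions per view) is an assumed definition, not part of the Meta Content Library API.

```python
# Computing a simple engagement score from a post record shaped like the
# variables above. The record and the scoring rule are illustrative only.
post = {
    "creator_type": "public_figure",
    "reactions": {"likes": 120, "loves": 30, "haha": 5,
                  "wow": 2, "sad": 0, "angry": 1},
    "shares": 40,
    "views": 10000,
}

# Total reactions across all six reaction types.
total_reactions = sum(post["reactions"].values())

# Engagement rate: (reactions + shares) per view — an assumed definition.
engagement_rate = (total_reactions + post["shares"]) / post["views"]
```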

Wikipedia archive

Creators: DBpedia Association
Publication Date: 2024-10-23

The DBpedia individual datasets extract structured information from Wikipedia, such as labels, facts, geo-coordinates, and Wikipedia categories, with wide language coverage for data analysis. This enables users to perform complex queries across a wide range of topics, including information about people, places, organizations, and more. DBpedia systematically extracts structured data from Wikipedia’s semi-structured content, such as infoboxes, categorization information, and links, transforming unstructured text into a machine-readable format that facilitates advanced data analysis and integration. The data is organized according to a community-curated, cross-domain ontology, with mappings from Wikipedia infoboxes to ontology classes and properties; this consistent framework supports complex queries and data integration across diverse domains. As of the 2016-04 release, DBpedia describes 6.0 million entities, including 1.5 million persons, 810,000 places, 135,000 music albums, 106,000 films, 20,000 video games, 275,000 organizations, 301,000 species, and 5,000 diseases. The dataset comprises 9.5 billion RDF triples, with 1.3 billion extracted from the English edition of Wikipedia and 5.0 billion from other language editions. Each release reflects the state of Wikipedia at that time; for example, the 2016-04 release corresponds to Wikipedia’s content as of April 2016. The data is structured as RDF triples, each consisting of a subject, a predicate, and an object.
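
The subject-predicate-object triple model can be sketched with a tiny in-memory pattern matcher. Real DBpedia queries would use SPARQL against an RDF store; the triples and prefixed names below are invented examples in DBpedia's usual notation.

```python
# Minimal in-memory model of RDF triples (subject, predicate, object)
# with wildcard pattern matching. The triples are invented examples;
# production use would query a real RDF store via SPARQL.
triples = [
    ("dbr:Berlin", "rdf:type", "dbo:City"),
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Hamburg", "rdf:type", "dbo:City"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "Find all cities" as a triple pattern: (?s, rdf:type, dbo:City).
cities = match(p="rdf:type", o="dbo:City")
```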

This American Life podcast transcripts

Creators: Julian McAuley
Publication Date: 2020

This dataset contains transcripts from This American Life podcast episodes spanning 1995 to 2020, including speaker utterances and associated audio for in-depth analysis of long conversations. It is particularly valuable for research in natural language processing, automatic speech recognition, and multi-speaker diarization, as it provides real-world examples of long-form, multi-speaker conversations. In total, the dataset encompasses 663 episodes, totaling approximately 637.70 hours of audio content; each episode serves as an individual observation, offering a substantial collection for analysis. Structurally, the dataset consists of program transcripts and associated audio files. Each transcript includes metadata such as episode acts, speaker names, speaker utterances, utterance lengths, and episode audio.
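
Transcript records with the fields described (act, speaker, utterance, length) support simple diarization-style summaries. The sketch below totals talk time per speaker; the field names and sample rows are assumptions for illustration, not the dataset's actual schema.

```python
# Aggregating speaker talk time from transcript records with the fields
# described above. Field names and sample rows are illustrative assumptions.
from collections import defaultdict

utterances = [
    {"episode": 1, "act": 1, "speaker": "Ira Glass",
     "utterance": "Welcome to the show.", "length_sec": 2.0},
    {"episode": 1, "act": 1, "speaker": "Guest",
     "utterance": "Thanks for having me.", "length_sec": 1.5},
    {"episode": 1, "act": 2, "speaker": "Ira Glass",
     "utterance": "Act two.", "length_sec": 3.0},
]

# Total seconds spoken per speaker — a basic diarization-style summary.
talk_time = defaultdict(float)
for u in utterances:
    talk_time[u["speaker"]] += u["length_sec"]
```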

Clubhouse data

Creators: Kaggle
Publication Date: 2024-10-23

The Clubhouse dataset on Kaggle provides data related to the social audio app Clubhouse. It contains information such as user demographics, room activity, user engagement metrics, and discussions held on the platform. The dataset is approximately 9.7 MB in size and comprises 1,300,515 user profiles. Each profile represents an individual observation, offering a substantial sample for analysis. It is particularly valuable for analyzing user demographics, social networking patterns, and engagement metrics within the platform. Structurally, the dataset is organized as a CSV file, with each row corresponding to a user’s profile and columns representing various attributes. The key variables included are:

  • user_ID: A unique identifier for each user on Clubhouse.

  • name: The display name of the user.

  • photo_url: URL of the user’s profile photo.

  • username: The username chosen by the user on Clubhouse.

  • Twitter: The user’s Twitter handle or linked Twitter account.

  • Instagram: The user’s Instagram handle or linked Instagram account.

  • num_followers: The number of followers the user has on Clubhouse.

  • num_following: The number of accounts the user is following on Clubhouse.

  • time_created: The date and time when the user’s account was created.

  • invited_by_user_profile: Profile information of the user who invited this user to Clubhouse.
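
Since the dataset is a CSV with one row per profile, it can be loaded with the standard library alone. The sketch below parses a two-row sample (invented data) with the columns listed above and computes a follower/following ratio per user.

```python
# Loading a Clubhouse-style CSV with the columns listed above and computing
# a follower/following ratio. The two rows are invented sample data.
import csv
import io

sample = """user_id,name,username,num_followers,num_following
1,Alice,alice,1200,300
2,Bob,bob,50,400
"""

# DictReader maps each row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample)))
ratios = {r["username"]: int(r["num_followers"]) / int(r["num_following"])
          for r in rows}
```

Replacing the in-memory sample with `open("clubhouse.csv")` applies the same logic to the full 1.3-million-row file.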

TikTok dataset

Creators: Kaggle
Publication Date: 2021

The dataset contains data related to TikTok content, comprising 300 dance videos sourced from the TikTok mobile social networking application, along with video statistics such as likes, shares, and views, useful for understanding user engagement. It is particularly valuable for research in computer vision, human pose estimation, and motion analysis, as it provides real-world examples of human dance movements captured in diverse settings. The dataset totals approximately 1.4 GB; each of the 300 videos serves as an individual observation, offering a substantial collection for analysis. Structurally, the dataset consists of video files accompanied by a CSV file containing metadata for each video, including attributes such as video ID and duration. This structure facilitates various analyses, including temporal segmentation, motion tracking, and pattern recognition within the dance sequences, contributing to advancements in action recognition, pose estimation, and dance movement analysis.
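
The per-video metadata described above (video ID and duration) supports quick collection-level statistics. The sketch below computes total and mean duration; the rows and field names are invented for illustration and may differ from the actual CSV headers.

```python
# Summarizing per-video metadata rows like those described above.
# Rows and field names are invented; real headers may differ.
videos = [
    {"video_id": "v1", "duration_sec": 15.0},
    {"video_id": "v2", "duration_sec": 30.0},
    {"video_id": "v3", "duration_sec": 12.0},
]

total_duration = sum(v["duration_sec"] for v in videos)
mean_duration = total_duration / len(videos)
```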
