Resources by david-stemper

Yahoo songs ratings

Creators: Yahoo Labs
Publication Date: varies by dataset

A collection of datasets curated by Yahoo Labs for research purposes, including ratings, social data, and multimedia content, used mainly for recommender-system and machine-learning research. The song ratings reflect a wide array of musical tastes and preferences; each song is accompanied by detailed metadata (artist, album, and genre), and users are represented by randomly assigned numeric IDs, which preserves privacy while still allowing user-behavior studies. The R3 dataset in particular includes approximately 300,000 user-supplied ratings and exactly 54,000 ratings for randomly selected songs, covering 15,400 users and 1,000 songs with a total size of 1.2 MB. The data was collected between 2002 and 2006, capturing user interactions and preferences during this period. The main components are structured as follows (a loading sketch follows the list):

  • R2 Dataset: Contains user ratings for songs, with each entry linked to metadata such as artist, album, and genre. All identifiers (users, songs, artists, albums) are anonymized.

  • R3 Dataset: Comprises two subsets: one with ratings from normal user interactions and another with ratings for randomly selected songs collected via an online survey. It also includes responses to seven multiple-choice survey questions about rating behavior for a subset of users.
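
As a rough illustration of how such a ratings file might be loaded, the sketch below assumes each line holds a tab-separated user ID, song ID, and rating; the example file name and column layout are assumptions and should be checked against the specific Yahoo release.

from collections import defaultdict

def load_ratings(path):
    # Parse a ratings file where each line is assumed to hold a
    # tab-separated user ID, song ID, and rating value.
    ratings_by_user = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip malformed lines
            user_id, song_id, rating = parts[0], parts[1], float(parts[2])
            ratings_by_user[user_id].append((song_id, rating))
    return ratings_by_user

# Hypothetical usage (file name is illustrative only):
# ratings = load_ratings("ydata-ymusic-rating-study-train.txt")
# print(len(ratings), "users loaded")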

Audio books data (audio + text)

Creators: Usborne Publishing Ltd
Publication Date: 2017-xx-xx

The dataset contains audiobook recordings of a female British English speaker (Lesley Sims), released for the Blizzard Challenge to advance speech synthesis research. It comprises approximately 6.5 hours of speech drawn from 56 children’s audiobooks and is particularly valuable because it pairs high-quality, natural speech recordings with their textual content. In total, the dataset has a size of 765 MB. Its temporal coverage pertains to the period leading up to the Blizzard Challenge 2017, for which the data was released. Structurally, the dataset consists of the following components (a file-pairing sketch follows the list):

  • Audio Files: High-quality recordings of Lesley Sims narrating various texts. These files capture natural prosody and articulation, essential for developing and testing speech synthesis models.

  • Text Files: Corresponding textual content for each audio recording, facilitating alignment between spoken and written language.

  • Label Files: Sentence-level segmentation and alignment between the text and speech for a portion of the data. These labels were initially created by Toshiba’s Cambridge Research Laboratory and later re-processed by the University of Edinburgh to ensure they correspond accurately to the original audio recordings.
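
A minimal sketch of how the audio and text files might be paired by shared file stem is shown below; the .wav/.txt extensions and directory names are assumptions and may differ in the actual Blizzard 2017 release.

from pathlib import Path

def pair_audio_and_text(audio_dir, text_dir):
    # Match audio recordings to their transcripts by shared file stem.
    audio_files = {p.stem: p for p in Path(audio_dir).glob("*.wav")}
    text_files = {p.stem: p for p in Path(text_dir).glob("*.txt")}
    pairs = []
    for stem, wav_path in sorted(audio_files.items()):
        txt_path = text_files.get(stem)
        if txt_path is not None:
            text = txt_path.read_text(encoding="utf-8").strip()
            pairs.append((wav_path, text))
    return pairs

# pairs = pair_audio_and_text("wav/", "txt/")   # hypothetical directory names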

NPR (national public radio) interview dialog data

Creators: Julian McAuley
Publication Date: 2020-xx-xx

The NPR Media Dialog Transcripts dataset contains interview transcripts from National Public Radio (NPR) programs, spanning approximately 20 years. It includes over 140,000 transcripts covering more than 10,000 hours of audio, amounting to 3.2 million utterances and 126.7 million words. Each transcript provides detailed information such as the episode title, broadcast date, program name, speaker names, and the full text of the conversation between hosts and guests. These features are valuable for analyzing discourse patterns, speaker interactions, and content trends within NPR’s programming. Structurally, the dataset is organized into individual transcripts, one per NPR episode, each carrying this metadata alongside the full interview text.
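
As an illustration of how the transcript structure described above might be modeled in code, the sketch below defines simple record types for episodes and utterances; the field names mirror the metadata listed here rather than any official schema of the release.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Utterance:
    speaker: str   # host or guest name as given in the transcript
    text: str      # the spoken text of this turn

@dataclass
class Episode:
    title: str                 # episode title
    date: str                  # broadcast date, e.g. "2010-06-15"
    program: str               # NPR program name
    utterances: List[Utterance] = field(default_factory=list)

    def word_count(self) -> int:
        # Total words spoken across all turns in the episode.
        return sum(len(u.text.split()) for u in self.utterances)

# Hypothetical usage:
episode = Episode(title="Example interview", date="2010-06-15", program="All Things Considered")
episode.utterances.append(Utterance(speaker="HOST", text="Welcome to the program."))
episode.utterances.append(Utterance(speaker="GUEST", text="Thanks for having me."))
print(episode.word_count())  # prints 8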

Public Domain Music Dataset

Creators: Phillip Long, Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick
Publication Date: 2024-09-16

PDMX is a large-scale public domain MusicXML dataset for symbolic music processing, research, and analysis, consisting of over 250,000 MusicXML scores sourced from MuseScore. Each score is accompanied by metadata detailing attributes such as user interactions, ratings, and licensing information, which can be used to filter and analyze the collection for specific research needs. The number of observations corresponds to the more than 250,000 individual scores, and the dataset totals about 1.6 GB. Structurally, it is organized as a table in which each row represents one song and includes the path to its MusicRender JSON file along with associated metadata such as user status, ratings, and licensing information.
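
The sketch below illustrates how such a metadata table might be loaded and filtered with pandas; the file name (pdmx_metadata.csv) and column names (license, rating, json_path) are assumptions, not the published schema.

import pandas as pd

# Load the PDMX metadata table (file and column names are assumptions;
# consult the released dataset for the actual schema).
metadata = pd.read_csv("pdmx_metadata.csv")

# Hypothetical filter: keep permissively licensed, highly rated scores.
subset = metadata[
    metadata["license"].str.contains("CC0", na=False)
    & (metadata["rating"] >= 4.0)
]

# Each row is assumed to point at a MusicRender JSON file on disk.
for path in subset["json_path"].head(5):
    print(path)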

AI-generated marketing images (>10K) + human ratings for quality and realism

Creators: Jochen Hartmann, Yannick Exner
Publication Date: not specified

The GenImageNet dataset is a comprehensive collection of AI-generated marketing images accompanied by human evaluations assessing their quality and realism. It encompasses 10,320 synthetic marketing images created using seven state-of-the-art generative text-to-image models: DALL-E 3, Midjourney v6, Firefly 2, Imagen 2, Imagine, Realistic Vision, and Stable Diffusion XL Turbo. Each image was generated based on prompts derived from 2,400 real-world, human-made images, facilitating a direct comparison between AI-generated and human-made marketing visuals. The dataset’s total size is approximately 3.1 GB, making it substantial yet manageable for analysis, and it comprises 10,320 observations, one per AI-generated image. The temporal coverage aligns with the period during which the images were generated and evaluated, around July 2023. Structurally, the dataset pairs each AI-generated image with its corresponding human-made prompt image and the associated human evaluations. These evaluations consist of 254,400 individual assessments, with each image receiving multiple ratings to ensure reliability, and cover aspects such as quality, realism, aesthetics, creativity, adherence to the prompt, and overall effectiveness in a marketing context.
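
The sketch below shows one way the human evaluations could be aggregated per image and per generative model; the file name and column names (image_id, model, realism, quality) are hypothetical and would need to be matched to the released files.

import pandas as pd

# Load the per-assessment ratings table (schema is assumed, not official).
ratings = pd.read_csv("genimagenet_ratings.csv")

# Average ratings per image and generating model, with rater counts.
per_image = (
    ratings
    .groupby(["image_id", "model"])          # e.g. model = "DALL-E 3"
    .agg(mean_realism=("realism", "mean"),
         mean_quality=("quality", "mean"),
         n_raters=("realism", "size"))
    .reset_index()
)

# Compare generative models on average realism.
print(per_image.groupby("model")["mean_realism"].mean().sort_values(ascending=False))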

Stack Overflow Q&A

Creators: Stack Exchange
Publication Date: 2014-01-23

The Stack Exchange Data Dump is a quarterly, anonymized release of all user-contributed content from the Stack Exchange network, including posts, comments, votes, and user data, licensed under Creative Commons BY-SA 3.0. These features facilitate comprehensive analyses of user interactions, content quality, and community dynamics within the network. As of March 1, 2020, the dump contained a total of 47,931,101 posts, encompassing both questions and answers accumulated since 2008. The size of the dumps has grown with the platform: the January 2011 dump already exceeded 3 GB, and later versions reach tens of gigabytes. Each dump captures the state of the Stack Exchange network up to its release date, providing a temporal snapshot of user-generated content and activity, and the quarterly cadence offers periodic insight into the platform’s evolving dynamics. Structurally, the dataset is organized into multiple tables, each corresponding to a different aspect of the platform (a parsing sketch follows the list):

  • Posts: Contains all questions and answers, including metadata such as post ID, score, and content.

  • Users: Includes user-related information such as user ID, reputation score, and profile details.

  • Votes: Records voting activity on posts, specifying vote types and associated post IDs.

  • Comments: Holds all comments made on posts, detailing comment text, scores, and related post and user IDs.

  • Badges: Documents badges awarded to users, noting badge names, dates awarded, and badge classes (e.g., bronze, silver, gold).

  • Post History: Tracks changes to posts, recording edits, rollbacks, and other modifications along with details on the post and the user making the change.

  • Post Links: Contains links between posts, such as duplicates or related posts, along with link types and creation dates.
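
The dumps are distributed as XML, with each table stored as a file of <row> elements whose columns appear as attributes. The sketch below streams such a file and tallies questions versus answers; the Posts.xml file name and the PostTypeId convention (1 = question, 2 = answer) follow the commonly documented dump schema but should be verified against the dump version in use.

import xml.etree.ElementTree as ET

def iter_rows(xml_path):
    # Stream <row .../> elements from a Stack Exchange dump file
    # (e.g. Posts.xml) without loading the whole file into memory.
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # free memory as we go

# Count questions vs. answers in Posts.xml.
counts = {"questions": 0, "answers": 0}
for row in iter_rows("Posts.xml"):
    if row.get("PostTypeId") == "1":
        counts["questions"] += 1
    elif row.get("PostTypeId") == "2":
        counts["answers"] += 1
print(counts)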

Airbnb datasets

Creators: Inside Airbnb
Publication Date: varies by city

Inside Airbnb provides detailed data on Airbnb listings, including reviews, calendar availability, and neighborhood information, offering insight into short-term rental markets. The dataset’s size varies by city and the number of active listings at the time of collection; for instance, as of March 6, 2023, the New York City dataset contained information on over 42,000 listings. Temporal coverage reflects specific points in time, capturing the state of listings in each city as of its data collection date. Inside Airbnb publishes quarterly snapshots covering the most recent year for each region, with archived files available for research on entire countries, including Australia, Canada, France, Germany, Greece, Italy, the Netherlands, Portugal, Spain, Sweden, the United Kingdom, and the United States. Structurally, the dataset is organized in a tabular format, with each row representing an individual listing and columns detailing attributes such as listing ID, host information, location, property characteristics, pricing, review statistics, and availability.
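
A minimal sketch for summarizing one city’s listings file with pandas follows; the listings.csv file name and the id, neighbourhood, room_type, and price columns match the commonly distributed Inside Airbnb layout but should be confirmed against the actual download for a given city and snapshot date.

import pandas as pd

# Load one city's listings snapshot (column names are assumptions).
listings = pd.read_csv("listings.csv")

# Listing counts and median price per neighbourhood and room type.
summary = (
    listings
    .assign(price_num=lambda df: pd.to_numeric(
        df["price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce"))
    .groupby(["neighbourhood", "room_type"])
    .agg(n_listings=("id", "size"), median_price=("price_num", "median"))
    .reset_index()
)
print(summary.head())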

Reddit datasets

Creators: Conversational Analysis Toolkit (ConvoKit)
Publication Date: not specified

The ConvoKit Subreddit Corpus is a collection of posts and comments from Reddit, gathered to facilitate research in conversational analysis and sociolinguistics. It encompasses posts and comments from 948,169 individual subreddits, each from its inception until October 2018, and is organized into an individual corpus per subreddit to support targeted analysis of specific communities. Each corpus includes detailed information at multiple levels (a ConvoKit loading sketch follows the list):

  • Speaker level: speakers are identified by their Reddit usernames.

  • Utterance level: each post or comment is treated as an utterance, with attributes such as a unique ID, author, conversation ID, reply relationships, timestamp, and text content.

  • Conversation level: each post and its corresponding comments form a conversation, with metadata including the post’s title, number of comments, domain, subreddit, and author flair.

  • Corpus level: aggregate data such as the list of subreddits included, the total number of posts and comments, and the number of unique speakers.
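
The corpus is distributed for use with the ConvoKit Python library; a minimal loading sketch is shown below, using subreddit-Cornell as an example of the subreddit-<name> naming convention under which the per-subreddit corpora are published.

from convokit import Corpus, download

# Download and load the corpus for a single subreddit
# ("subreddit-<name>" is ConvoKit's naming convention).
corpus = Corpus(filename=download("subreddit-Cornell"))

# Corpus-level summary (speakers, utterances, conversations).
corpus.print_summary_stats()

# Utterance-level access: each post or comment is one utterance.
for utt in list(corpus.iter_utterances())[:5]:
    print(utt.id, utt.speaker.id, utt.timestamp, utt.text[:80])

# Conversation-level access: a post plus its comment tree.
conv = next(corpus.iter_conversations())
print(conv.meta.get("title"), "-", len(list(conv.iter_utterances())), "utterances")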
