
Digital Twin Dataset

Creators: Olivier Toubia , George Z. Gui , Tianyi Peng , Daniel J. Merlau , Ang Li , Haozhe Chen
Publication Date: 2025-08-20

The Twin-2K-500 dataset contains comprehensive persona information for a representative sample of 2,058 US participants, providing rich demographic and psychological data. It is specifically designed for building digital twins for LLM simulations.

Dataset Structure and Format

The Twin-2K-500 dataset is organized into three folders, each with its own format and purpose:

1. Full Persona Folder

This folder contains complete persona information for each participant, split into chunks for easier processing. Each record includes the following fields (a minimal loading sketch follows the list):

  • pid: Participant ID
  • persona_text: Complete survey responses in text format, including all questions and answers. For questions that appear in both waves 1-3 and wave 4, the wave 4 responses are used.
  • persona_summary: A concise summary of each participant’s key characteristics and responses, designed to provide a quick overview without needing to process the full survey data. This summary captures the essential traits and patterns in the participant’s responses.
  • persona_json: Complete survey responses in JSON format, following the same structure as persona_text. The JSON format is useful when a subset of questions needs to be excluded or revised.
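
Here is a minimal loading sketch in Python. The chunk file name and the CSV format are assumptions; the column names (pid, persona_summary, persona_json) come from the field list above.

    import json
    import pandas as pd

    # Load one chunk of the full persona data (file name and CSV format are assumed).
    chunk = pd.read_csv("full_persona/full_persona_chunk_0.csv")

    row = chunk.iloc[0]
    print("Participant:", row["pid"])
    print("Summary:", row["persona_summary"][:200])

    # persona_json mirrors persona_text, so individual questions can be dropped or edited.
    persona = json.loads(row["persona_json"])
    print("Number of top-level entries:", len(persona))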

2. Wave Split Folder

This folder is designed for testing and evaluating different LLM persona creation methodologies (from prompt engineering to RAG, fine-tuning, and RLHF):

  • pid: Participant ID
  • wave1_3_persona_text: Persona information from waves 1-3 in text format, including questions that did not appear in wave 4. This can be used as training data for creating personas.
  • wave1_3_persona_json: Persona information from waves 1-3 in JSON format, following the same structure as wave1_3_persona_text.
  • wave4_Q_wave1_3_A: Wave 4 questions with answers from waves 1-3, useful for human test-retest evaluation.
  • wave4_Q_wave4_A: Wave 4 questions with their actual answers from wave 4, serving as ground truth for evaluating persona prediction accuracy.

The Wave Split Folder is particularly useful for the following tasks (an evaluation sketch follows the list):

  • Training persona creation models using waves 1-3 data
  • Evaluating how well the created personas can predict wave 4 responses
  • Comparing different LLM-based approaches (prompt engineering, RAG, fine-tuning, RLHF) for persona creation
  • Testing the reliability and consistency of persona predictions across different time periods
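
As one example of this workflow, the sketch below scores wave 4 answers against the wave 4 ground truth, using the human test-retest answers as a baseline. The file name, the CSV format, and the assumption that the answer columns hold JSON mappings from question ID to answer are all hypothetical.

    import json
    import pandas as pd

    # Load the wave split data (file name and CSV format are assumed).
    split = pd.read_csv("wave_split/wave_split.csv")

    def agreement(pred_col: str, truth_col: str) -> float:
        """Fraction of wave 4 questions where pred_col matches truth_col."""
        hits, total = 0, 0
        for _, row in split.iterrows():
            preds = json.loads(row[pred_col])   # assumed layout: {question_id: answer}
            truth = json.loads(row[truth_col])
            for qid, answer in truth.items():
                total += 1
                hits += int(preds.get(qid) == answer)
        return hits / total if total else 0.0

    # Human test-retest baseline: waves 1-3 answers vs. wave 4 ground truth.
    print("test-retest agreement:", agreement("wave4_Q_wave1_3_A", "wave4_Q_wave4_A"))
    # Persona-based predictions can be scored the same way once added as a column.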

3. Raw Data Folder

This folder provides access to the raw survey response files from Qualtrics, after anonymization and removal of sensitive columns. These files are particularly useful for social scientists interested in measuring correlations across questions and analyzing heterogeneous effects for experiments.

The folder contains the following files for each wave (1-4); a minimal loading sketch follows the list:

  • Labels CSV (e.g., wave_1_labels_anonymized.csv): Contains survey answers as text.
  • Numbers CSV (e.g., wave_1_numbers_anonymized.csv): Contains survey answers as numerical codes.
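
A minimal loading sketch for these exports is shown below. The example column names are placeholders, and Qualtrics exports often carry extra header rows with question text that may need to be skipped.

    import pandas as pd

    # Labels file: answers as text; numbers file: answers as numerical codes.
    labels = pd.read_csv("raw_data/wave_1_labels_anonymized.csv")
    numbers = pd.read_csv("raw_data/wave_1_numbers_anonymized.csv")

    # Example: correlation between two numerically coded questions
    # ("Q1" and "Q2" are placeholder column names).
    print(numbers[["Q1", "Q2"]].astype(float).corr())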

Additionally,

  • Questionnaire: questionnaire files are provided in the questionnaire subfolder; they help visualize the survey structure and question flow.
  • Wave 4 Simulation Results: the wave 4 CSV simulated by GPT-4.1-mini (our default setup) is provided alongside the human CSV, facilitating in-depth analysis of LLM simulation patterns.

Artstor Museum Image Data

Creators: ARTSTOR
Publication Date: 2025

Explore Artstor’s collections of high-quality images, curated from leading museums and archives around the world. Artstor’s diverse collections are rights-cleared for education and research, and include Open Access content as well as rare materials not available elsewhere. Artstor gives access to 865,914 items in 308 collections.

Million song dataset

Creators: Thierry Bertin-Mahieux and Daniel P.W. Ellis from Columbia University, along with Brian Whitman and Paul Lamere from The Echo Nest.
Publication Date: 2011

The Million Song Dataset (MSD) is a large-scale music dataset created by The Echo Nest and LabROSA to advance research in music information retrieval and recommendation systems. It contains metadata for one million contemporary music tracks, including song titles, artists, albums, release years, and genres, as well as audio features such as tempo, loudness, key, time signature, and mode, facilitating in-depth musical analysis. With data on one million tracks, the MSD enables the development and evaluation of algorithms that scale to commercial music collections. The dataset comprises approximately 280 GB of data, features 44,745 unique artists, and covers contemporary popular music released up to 2011. The data is stored in the HDF5 file format, offering efficient storage and access to large amounts of data; each track’s data is stored as a separate HDF5 file containing the following key variables (a minimal reading sketch follows the list):

  • Metadata:

    • Song ID: A unique identifier for each track.

    • Title: The name of the song.

    • Artist: The performing artist or band.

    • Album: The album on which the song was released.

    • Release Year: The year the song was released.

    • Genre: The musical genre classification.

  • Audio Features:

    • Tempo: The speed of the song in beats per minute (BPM).

    • Loudness: The average volume level in decibels (dB).

    • Key: The musical key of the song (e.g., C major, G minor).

    • Time Signature: The meter of the song, indicating beats per measure (e.g., 4/4, 3/4).

    • Mode: The tonal mode, typically major or minor.
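
A minimal reading sketch with h5py is shown below. The group and field names follow common MSD conventions and should be verified against the actual files; the track file name is a placeholder.

    import h5py

    # Each track is a separate HDF5 file (file name below is a placeholder).
    with h5py.File("TRAXXXXX12903CXXXXX.h5", "r") as f:
        meta = f["metadata"]["songs"][0]        # compound row with song metadata
        analysis = f["analysis"]["songs"][0]    # compound row with audio features

        title = meta["title"].decode()
        artist = meta["artist_name"].decode()
        tempo = float(analysis["tempo"])        # beats per minute
        loudness = float(analysis["loudness"])  # average level in dB
        key = int(analysis["key"])              # pitch class, 0 = C
        mode = int(analysis["mode"])            # 1 = major, 0 = minor

        print(f"{artist} - {title}: {tempo:.1f} BPM, key {key}, mode {mode}, {loudness:.1f} dB")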

This American Life podcast transcripts

Creators: Julian McAuley
Publication Date: 2020

This dataset contains transcripts from This American Life podcast episodes spanning 1995 to 2020, including speaker utterances and associated audio for in-depth analysis of long conversations. It is particularly valuable for research in automatic speech recognition, speaker diarization, and natural language understanding, as it provides real-world examples of long-form, multi-speaker conversations. In total, the dataset encompasses 663 episodes, totaling approximately 637.70 hours of audio content, with each episode serving as an individual observation. Structurally, the dataset consists of program transcripts and associated audio files; each transcript includes metadata such as episode acts, speaker names, speaker utterances, utterance lengths, and episode audio.
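
A minimal parsing sketch is given below. The file name and the JSON layout (acts containing speaker-attributed utterances) are assumptions based on the description above and should be adjusted to the actual schema.

    import json

    # Load one episode transcript (file name and schema are assumed).
    with open("this_american_life/episode_001.json") as f:
        episode = json.load(f)

    # Walk acts and print each speaker turn, truncated for readability.
    for act in episode["acts"]:
        for utterance in act["utterances"]:
            print(f'{utterance["speaker"]}: {utterance["text"][:80]}')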

TikTok dataset

Creators: Kaggle
Publication Date: 2021

The dataset contains data related to TikTok content, including video statistics such as likes, shares, and views, useful for understanding user engagement. It comprises 300 dance videos sourced from the TikTok mobile social networking application. This dataset is particularly valuable for research in computer vision, human pose estimation, and motion analysis, as it provides real-world examples of human dance movements captured in diverse settings. The dataset is available in CSV format and has a total size of approximately 1.4 GB, with each of the 300 videos serving as an individual observation. Structurally, the dataset consists of video files accompanied by a CSV file containing metadata for each video, such as video ID, duration, and other attributes relevant to the content. This structure facilitates analyses including temporal segmentation, motion tracking, and pattern recognition within the dance sequences, and supports developing and testing algorithms for action recognition, pose estimation, and dance movement analysis.
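
A minimal exploration sketch is shown below; the CSV file name and the column names ("video_id", "duration", "likes") are assumptions based on the description above.

    import pandas as pd

    # Load the per-video metadata (file and column names are assumed).
    meta = pd.read_csv("tiktok_dance_videos.csv")

    # Basic duration and engagement summaries.
    print(meta["duration"].describe())
    print(meta.sort_values("likes", ascending=False)[["video_id", "likes"]].head())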

Yahoo songs ratings

Creators: Yahoo Labs
Publication Date: varies by dataset

A collection of datasets curated by Yahoo for research purposes, including ratings, social data, and multimedia content, mainly used for recommender systems and machine learning research. The datasets encompass a vast number of user ratings, reflecting a wide array of musical tastes and preferences. Each song is accompanied by detailed metadata, including artist, album, and genre information, facilitating multifaceted analyses. Users are represented by randomly assigned numeric IDs, ensuring privacy while allowing for user behavior studies. This dataset includes approximately 300,000 user-supplied ratings and exactly 54,000 ratings for randomly selected songs. It involves 15,400 users and 1,000 songs, with a total size of 1.2 MB. The data was collected between 2002 and 2006, capturing user interactions and preferences during this period. The main components of the dataset are structured as follows:

  • R2 Dataset: Contains user ratings for songs, with each entry linked to metadata such as artist, album, and genre. All identifiers (users, songs, artists, albums) are anonymized.

  • R3 Dataset: Comprises two subsets: one with ratings from normal user interactions and another with ratings for randomly selected songs collected via an online survey. It also includes responses to seven multiple-choice survey questions regarding rating behavior for a subset of users.
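
A minimal loading sketch for the R3 subsets follows; the file names, delimiter, and (user, song, rating) column layout are assumptions.

    import pandas as pd

    cols = ["user_id", "song_id", "rating"]
    # Ratings from normal user interactions vs. ratings for randomly selected songs.
    organic = pd.read_csv("r3_train.txt", sep="\t", names=cols)
    random_songs = pd.read_csv("r3_test.txt", sep="\t", names=cols)

    # Selection bias shows up as a gap between the two rating distributions.
    print("organic mean rating:     ", organic["rating"].mean())
    print("random-song mean rating: ", random_songs["rating"].mean())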

NPR (national public radio) interview dialog data

Creators: Julian McAuley
Publication Date: 2020

The NPR Media Dialog Transcripts dataset contains interview transcripts from National Public Radio (NPR) programs spanning approximately 20 years. It includes over 140,000 transcripts covering more than 10,000 hours of audio content, amounting to 3.2 million utterances and 126.7 million words. Each transcript corresponds to a specific NPR episode and includes metadata such as the episode title, broadcast date, program name, speaker identities, and the full text of the conversation between hosts and guests. These features are valuable for analyzing discourse patterns, speaker interactions, and content trends within NPR’s programming.
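
A minimal corpus-statistics sketch is shown below; the directory layout and per-file schema (a list of speaker-attributed utterances) are assumptions based on the description above.

    import json
    from pathlib import Path

    n_utterances = n_words = 0
    # Walk a folder of per-episode transcript files (layout and schema assumed).
    for path in Path("npr_transcripts").glob("*.json"):
        episode = json.loads(path.read_text())
        for turn in episode["utterances"]:
            n_utterances += 1
            n_words += len(turn["text"].split())

    print(n_utterances, "utterances,", n_words, "words")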

Public Domain Music Dataset

Creators: Phillip Long, Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick
Publication Date: 2024-09-16

PDMX is a large-scale public domain MusicXML dataset for symbolic music processing, consisting of over 250,000 MusicXML scores sourced from MuseScore and designed for symbolic music research and analysis. Each score is accompanied by metadata detailing attributes such as user interactions, ratings, and licensing information, which can be used to filter and analyze the dataset for specific research needs. Each of the over 250,000 scores constitutes one observation, and the dataset totals 1.6 GB. Structurally, each row represents a different song and includes the path to its MusicRender JSON file along with associated metadata such as user status, ratings, and licensing information.
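
A minimal filtering sketch is shown below; the metadata file name and the column names ("rating", "path") are assumptions based on the description above.

    import json
    import pandas as pd

    # One row per song, with a path to its MusicRender JSON plus metadata (names assumed).
    meta = pd.read_csv("pdmx_metadata.csv")

    # Example filter: keep only highly rated scores (threshold is arbitrary).
    top = meta[meta["rating"] >= 4.5]

    # Open one MusicRender JSON file referenced by the metadata.
    with open(top.iloc[0]["path"]) as f:
        score = json.load(f)
    print(len(top), "scores selected; first score keys:", list(score.keys())[:5])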
