Digital Twin Dataset

Creators:

Olivier Toubia, George Z. Gui, Tianyi Peng, Daniel J. Merlau, Ang Li, Haozhe Chen

Publication Date:

2025-08-20

Dataset Description:

The Twin-2K-500 dataset contains comprehensive persona information from a representative sample of 2,058 US participants, providing rich demographic and psychological data. It is designed for building digital twins for LLM simulations.

Dataset Structure and Format

The Twin-2K-500 dataset is organized into three folders, each with its own format and purpose:

1. Full Persona Folder

This folder contains complete persona information for each participant. The data is split into chunks for easier processing, and each record has the following fields:

pid: Participant ID
persona_text: Complete survey responses in text format, including all questions and answers. For questions that appear in both waves 1-3 and wave 4, the wave 4 responses are used.
persona_summary: A concise summary of each participant's key characteristics and responses, designed to provide a quick overview without needing to process the full survey data. This summary captures the essential traits and patterns in the participant's responses.
persona_json: Complete survey responses in JSON format, following the same structure as persona_text. The JSON format is useful when a subset of questions needs to be excluded or revised.
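
Excluding a subset of questions from persona_json could look roughly like the sketch below. It assumes that persona_json parses into nested lists and dictionaries and that each question is a dictionary carrying some identifier. The key name "QuestionID", the helper drop_questions, and the sample record are hypothetical and are not taken from the dataset's documented schema, so adapt them after inspecting a real record.

```python
import json


def drop_questions(node, should_drop):
    """Return a copy of a parsed JSON tree without matching question dicts.

    Only dicts that appear as list items are candidates for removal;
    everything else is copied through unchanged.
    """
    if isinstance(node, dict):
        return {key: drop_questions(value, should_drop) for key, value in node.items()}
    if isinstance(node, list):
        return [
            drop_questions(item, should_drop)
            for item in node
            if not (isinstance(item, dict) and should_drop(item))
        ]
    return node


# Hypothetical stand-in for one participant's persona_json string.
sample_json = json.dumps([
    {"QuestionID": "Q1", "QuestionText": "Age?", "Answer": "30-39"},
    {"QuestionID": "Q2", "QuestionText": "Region?", "Answer": "Midwest"},
])

excluded_ids = {"Q2"}  # assumption: questions to leave out of the persona
filtered = drop_questions(
    json.loads(sample_json),
    lambda question: question.get("QuestionID") in excluded_ids,
)
filtered_json = json.dumps(filtered)
```

The predicate is passed in as a function so the same traversal can filter on whatever key the real records turn out to use.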

2. Wave Split Folder

This folder is designed for testing and evaluating different LLM persona creation methodologies (from prompt engineering to RAG, fine-tuning, and RLHF):

pid: Participant ID
wave1_3_persona_text: Persona information from waves 1-3 in text format, including questions that did not appear in wave 4. This can be used as training data for creating personas.
wave1_3_persona_json: Persona information from waves 1-3 in JSON format, following the same structure as wave1_3_persona_text.
wave4_Q_wave1_3_A: Wave 4 questions with answers from waves 1-3, useful for human test-retest evaluation.
wave4_Q_wave4_A: Wave 4 questions with their actual answers from wave 4, serving as ground truth for evaluating persona prediction accuracy.

The Wave Split Folder is particularly useful for:

Training persona creation models using data from waves 1-3
Evaluating how well the created personas can predict wave 4 responses
Comparing different LLM-based approaches (prompt engineering, RAG, fine-tuning, RLHF) for persona creation
Testing the reliability and consistency of persona predictions across different time periods
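
The evaluation uses listed above can be sketched as follows: score a model's predicted wave 4 answers against wave4_Q_wave4_A, then score wave4_Q_wave1_3_A against the same ground truth to obtain a human test-retest baseline. The sketch assumes the answers have already been extracted into dictionaries mapping a question identifier to an answer. The identifiers and values are made up, and exact-match agreement is only one possible metric, not necessarily the one used by the dataset authors.

```python
def match_rate(predicted, truth):
    """Share of ground-truth questions answered identically.

    Questions missing from `predicted` count as misses.
    """
    if not truth:
        raise ValueError("truth must contain at least one question")
    hits = sum(1 for question, answer in truth.items() if predicted.get(question) == answer)
    return hits / len(truth)


# Hypothetical answers for one participant, keyed by question identifier.
wave4_truth = {"Q1": "A", "Q2": "B", "Q3": "C", "Q4": "D"}        # from wave4_Q_wave4_A
wave1_3_answers = {"Q1": "A", "Q2": "B", "Q3": "C", "Q4": "A"}    # from wave4_Q_wave1_3_A
model_predictions = {"Q1": "A", "Q2": "B", "Q3": "D", "Q4": "A"}  # from an LLM persona

retest_rate = match_rate(wave1_3_answers, wave4_truth)   # human test-retest baseline
model_rate = match_rate(model_predictions, wave4_truth)  # persona prediction accuracy
relative_rate = model_rate / retest_rate                 # accuracy relative to the baseline
```

Dividing by the test-retest rate puts the model's score in context, since participants themselves do not answer every repeated question identically.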

3. Raw Data Folder

This folder provides access to the raw survey response files from Qualtrics, after anonymization and removal of sensitive columns. These files are particularly useful for social scientists interested in measuring correlations across questions and analyzing heterogeneous effects for experiments. The folder contains the following files for each wave (1-4):

Labels CSV (e.g., wave_1_labels_anonymized.csv): Contains survey answers as text.
Numbers CSV (e.g., wave_1_numbers_anonymized.csv): Contains survey answers as numerical codes.

Additionally, the folder includes:

Questionnaire: Questionnaire files are provided in the questionnaire subfolder. These files can help visualize the survey structure and question flows.
Wave 4 Simulation Results: We also uploaded the Wave 4 CSV simulated by GPT-4.1-mini (our default setup) along with the human CSV. This supports analyses that aim to understand LLM simulation patterns in depth.
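
As an illustration of measuring correlations across questions with the numbers CSV files, here is a minimal sketch using only the Python standard library. The column names (pid, Q1, Q2) and the in-memory toy file are hypothetical, and the layout of the real files should be checked first. Rows in which a selected column is missing or non-numeric are skipped, which would also drop any extra descriptive header rows a Qualtrics export may carry.

```python
import csv
import io
import math


def read_numeric_columns(handle, columns):
    """Read the selected columns as floats, skipping rows that are incomplete or non-numeric."""
    rows = []
    for record in csv.DictReader(handle):
        try:
            rows.append([float(record[name]) for name in columns])
        except (KeyError, ValueError, TypeError):
            continue
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-in for a file such as wave_1_numbers_anonymized.csv (contents are made up).
toy_csv = io.StringIO(
    "pid,Q1,Q2\n"
    "1,1,2\n"
    "2,2,4\n"
    "3,3,6\n"
    "4,,5\n"  # missing Q1, so this row is skipped
)

rows = read_numeric_columns(toy_csv, ["Q1", "Q2"])
r = pearson([row[0] for row in rows], [row[1] for row in rows])
```

For a real file, replace the StringIO object with open(path, newline="") and pass the column names of the two questions of interest.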

Publications Citing This Dataset:

Olivier Toubia, George Z. Gui, Tianyi Peng, Daniel J. Merlau, Ang Li, Haozhe Chen (2025) Database Report: Twin-2K-500: A Data Set for Building Digital Twins of over 2,000 People Based on Their Answers to over 500 Questions. Marketing Science.
