Digital Twin Dataset
Creators:
Olivier Toubia, George Z. Gui, Tianyi Peng, Daniel J. Merlau, Ang Li, Haozhe Chen
Publication Date:
2025-08-20
Dataset Description:
The Twin-2K-500 dataset contains comprehensive persona information from a representative sample of 2,058 US participants, including rich demographic and psychological data. It is designed for building digital twins for LLM simulations.
Dataset Structure and Format
The Twin-2K-500 dataset is organized into three folders, each with its own format and purpose:
1. Full Persona Folder
This folder contains complete persona information for each participant. The data is split into chunks for easier processing:
- pid: Participant ID
- persona_text: Complete survey responses in text format, including all questions and answers. For questions that appear in both waves 1-3 and wave 4, the wave 4 responses are used.
- persona_summary: A concise summary of each participant's key characteristics and response patterns, giving a quick overview without the need to process the full survey data.
- persona_json: Complete survey responses in JSON format, following the same structure as persona_text. The JSON format is useful when a subset of questions needs to be excluded or revised.
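As a minimal sketch of excluding a subset of questions from persona_json, the snippet below filters question records by identifier. The schema is an assumption: it treats the JSON as a list of records that each carry a hypothetical question_id field, which may not match the actual files, so the key name and the toy record are illustrative only.

```python
import json


def drop_questions(persona_json: str, excluded_ids: set[str],
                   id_key: str = "question_id") -> str:
    """Return the persona JSON with the excluded questions removed.

    ASSUMPTION: the JSON parses to a list of question records (dicts),
    each identified by `id_key`. The real persona_json schema may differ.
    """
    records = json.loads(persona_json)
    kept = [r for r in records if r.get(id_key) not in excluded_ids]
    return json.dumps(kept)


# Toy persona, invented for illustration (not taken from the dataset).
toy_persona = json.dumps([
    {"question_id": "Q1", "question": "Age group?", "answer": "30-39"},
    {"question_id": "Q2", "question": "Region?", "answer": "Midwest"},
])

filtered = json.loads(drop_questions(toy_persona, {"Q2"}))
```

If the real structure is nested, the same idea applies after adapting the traversal to the actual schema.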
2. Wave Split Folder
This folder is designed for testing and evaluating different LLM persona creation methodologies (from prompt engineering to RAG, fine-tuning, and RLHF):
- pid: Participant ID
- wave1_3_persona_text: Persona information from waves 1-3 in text format, including questions that did not appear in wave 4. This can be used as training data for creating personas.
- wave1_3_persona_json: Persona information from waves 1-3 in JSON format, following the same structure as wave1_3_persona_text.
- wave4_Q_wave1_3_A: Wave 4 questions with answers from waves 1-3, useful for human test-retest evaluation.
- wave4_Q_wave4_A: Wave 4 questions with their actual answers from wave 4, serving as ground truth for evaluating persona prediction accuracy.
The folder supports the following uses:
- Training persona creation models using data from waves 1-3
- Evaluating how well the created personas can predict wave 4 responses
- Comparing different LLM-based approaches (prompt engineering, RAG, fine-tuning, RLHF) for persona creation
- Testing the reliability and consistency of persona predictions across different time periods
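One possible way to use the two wave 4 fields together is sketched below: score a persona's predicted answers against the wave 4 ground truth, then compare that score with the human test-retest agreement between waves 1-3 and wave 4. The answer dictionaries, question IDs, and the ratio used for normalization are illustrative assumptions, not the authors' evaluation procedure.

```python
def accuracy(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """Share of ground-truth questions that were answered identically."""
    shared = [q for q in truth if q in predicted]
    if not shared:
        raise ValueError("no overlapping questions to compare")
    return sum(predicted[q] == truth[q] for q in shared) / len(shared)


# Toy answers keyed by hypothetical question IDs (not from the dataset).
wave4_truth = {"Q1": "Agree", "Q2": "No", "Q3": "Often", "Q4": "Yes"}
wave1_3_answers = {"Q1": "Agree", "Q2": "No", "Q3": "Rarely", "Q4": "Yes"}
model_answers = {"Q1": "Agree", "Q2": "Yes", "Q3": "Rarely", "Q4": "Yes"}

# Human test-retest agreement: earlier human answers vs. wave 4 answers.
retest_acc = accuracy(wave1_3_answers, wave4_truth)

# Persona prediction accuracy: simulated answers vs. wave 4 answers.
model_acc = accuracy(model_answers, wave4_truth)

# Illustrative normalization: model accuracy relative to the human baseline.
normalized_acc = model_acc / retest_acc
```

Reading the model score against the test-retest baseline helps separate prediction error from the natural inconsistency of human answers over time.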
3. Raw Data Folder
This folder provides access to the raw survey response files from Qualtrics, after anonymization and removal of sensitive columns. These files are particularly useful for social scientists interested in measuring correlations across questions and analyzing heterogeneous effects in experiments. The folder contains the following files for each wave (1-4):
- Labels CSV (e.g., wave_1_labels_anonymized.csv): Contains survey answers as text.
- Numbers CSV (e.g., wave_1_numbers_anonymized.csv): Contains survey answers as numerical codes.
- Questionnaire: Questionnaire files are provided in the questionnaire subfolder. These files can help visualize the survey structure and question flows.
- Wave 4 Simulation Results: We also uploaded the wave 4 CSV simulated by GPT4.1-mini (our default setup) along with the human CSV, to support detailed analysis of LLM simulation patterns.
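As a rough sketch of measuring a correlation across two questions from a Numbers CSV, the snippet below reads two columns and computes a Pearson coefficient using only the standard library. The column names Q1 and Q2 and the in-memory toy file are hypothetical; the actual column names and layout of the raw files are not described here and should be checked first.

```python
import csv
import io
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def column_correlation(csv_file, col_a: str, col_b: str) -> float:
    """Correlate two columns, skipping rows that are not numeric in both."""
    xs, ys = [], []
    for row in csv.DictReader(csv_file):
        try:
            x, y = float(row[col_a]), float(row[col_b])
        except (TypeError, ValueError):
            continue  # missing or non-numeric value in this row
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


# Toy stand-in for a Numbers CSV; the last row has a missing answer.
toy_csv = io.StringIO("pid,Q1,Q2\n1,1,2\n2,2,4\n3,3,6\n4,,5\n")
r = column_correlation(toy_csv, "Q1", "Q2")
```

For a real file, the same function could be given a handle from open("wave_1_numbers_anonymized.csv", newline=""), with the column names replaced by ones that exist in that file.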