Resources by Stefan

Operational Firm Risk Dataset

Creators: Vivek Astvansh and Joseph Simpson
Publication Date: 2025

The authors score 131,920 firm-years (16,959 firms, 2005 to 2024) on eight risk factors: (1) accounting, (2) finance, (3) international, (4) legal, (5) management, (6) marketing, (7) operations, and (8) technology. They also measure each transformer’s performance on eight metrics.

Their OSF repository (https://osf.io/gz93b/files/osfstorage?view_only=9086f51f8704462d8933a8131b22da25) includes two Excel files: a data dictionary (dataset_dictionary.xlsx) and the count and probability scores of the eight risk factors for all 131,920 firm-years (dataset.xlsx).

Digital Twin Dataset

Creators: Olivier Toubia, George Z. Gui, Tianyi Peng, Daniel J. Merlau, Ang Li, Haozhe Chen
Publication Date: 2025-08-20

The Twin-2K-500 dataset contains comprehensive persona information from a representative sample of 2,058 US participants, providing rich demographic and psychological data. The dataset is specifically designed for building digital twins for LLM simulations.

Dataset Structure and Format

The Twin-2K-500 dataset is organized into three folders, each with its own format and purpose:

1. Full Persona Folder

This folder contains complete persona information for each participant. The data is split into chunks for easier processing:

  • pid: Participant ID
  • persona_text: Complete survey responses in text format, including all questions and answers. For questions that appear in both waves 1-3 and wave 4, the wave 4 responses are used.
  • persona_summary: A concise summary of each participant’s key characteristics and responses, designed to provide a quick overview without needing to process the full survey data. This summary captures the essential traits and patterns in the participant’s responses.
  • persona_json: Complete survey responses in JSON format, following the same structure as persona_text. The JSON format is useful when a subset of questions needs to be excluded or revised.

2. Wave Split Folder

This folder is designed for testing and evaluating different LLM persona creation methodologies (from prompt engineering to RAG, fine-tuning, and RLHF):

  • pid: Participant ID
  • wave1_3_persona_text: Persona information from waves 1-3 in text format, including questions that did not appear in wave 4. This can be used as training data for creating personas.
  • wave1_3_persona_json: Persona information from waves 1-3 in JSON format, following the same structure as wave1_3_persona_text.
  • wave4_Q_wave1_3_A: Wave 4 questions with answers from waves 1-3, useful for human test-retest evaluation.
  • wave4_Q_wave4_A: Wave 4 questions with their actual answers from wave 4, serving as ground truth for evaluating persona prediction accuracy.

The Wave Split Folder is particularly useful for:

  • Training persona creation models using wave1-3 data
  • Evaluating how well the created personas can predict wave 4 responses
  • Comparing different LLM-based approaches (prompt engineering, RAG, fine-tuning, RLHF) for persona creation
  • Testing the reliability and consistency of persona predictions across different time periods
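The evaluation idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: answers are represented as a flat mapping from question ID to answer (a hypothetical simplification, not the dataset's actual JSON schema), and agreement is scored by exact match, which may not suit every question type. The function name agreement_rate is likewise made up for this sketch.

```python
from typing import Hashable, Mapping

_MISSING = object()  # sentinel so a missing answer never equals a real one


def agreement_rate(
    reference: Mapping[str, Hashable],
    candidate: Mapping[str, Hashable],
) -> float:
    """Share of questions in `reference` that `candidate` answers identically.

    Questions missing from `candidate` count as disagreements.
    Exact-match scoring is an assumption of this sketch.
    """
    if not reference:
        raise ValueError("reference must contain at least one question")
    matches = sum(
        1
        for question, answer in reference.items()
        if candidate.get(question, _MISSING) == answer
    )
    return matches / len(reference)


# Toy data with hypothetical question IDs (q1..q4), not real survey items.
wave4_truth = {"q1": "A", "q2": "B", "q3": "C", "q4": "D"}   # role of wave4_Q_wave4_A
human_retest = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}  # role of wave4_Q_wave1_3_A
twin_answers = {"q1": "A", "q2": "B", "q3": "A", "q4": "A"}  # output of some persona model

human_ceiling = agreement_rate(wave4_truth, human_retest)
twin_accuracy = agreement_rate(wave4_truth, twin_answers)

# Accuracy relative to how consistently people reproduce their own answers.
normalized_accuracy = twin_accuracy / human_ceiling
print(human_ceiling, twin_accuracy, normalized_accuracy)
```

Dividing by the human test-retest rate is one possible design choice: it expresses the twin's accuracy relative to the consistency that participants themselves achieve, rather than against a perfect score.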

3. Raw Data Folder

This folder provides access to the raw survey response files from Qualtrics, after anonymization and removal of sensitive columns. These files are particularly useful for social scientists interested in measuring correlations across questions and analyzing heterogeneous effects for experiments.

The folder contains the following files for each wave (1-4):

  • Labels CSV (e.g., wave_1_labels_anonymized.csv): Contains survey answers as text.
  • Numbers CSV (e.g., wave_1_numbers_anonymized.csv): Contains survey answers as numerical codes.

Additionally,

  • Questionnaire: Questionnaire files are provided in the questionnaire subfolder. These files can help visualize the survey structure and question flows.
  • Wave 4 Simulation Results: We also uploaded the Wave 4 CSV simulated by GPT-4.1-mini (our default setup) along with the human CSV. This supports in-depth analysis of LLM simulation patterns.
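Because each wave ships both a Labels CSV and a Numbers CSV, the two can be paired to recover which numerical code stands for which answer text. The sketch below assumes that both files contain the same respondents in the same row order and share the same column headers; that alignment is an assumption, not something the dataset description states. The column name Q1 and the helper name code_to_label are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO


def code_to_label(numbers_file: TextIO, labels_file: TextIO, column: str) -> Dict[str, str]:
    """Map each numerical code in `column` to its text label.

    Assumes both CSVs are row-aligned and use identical headers.
    Raises ValueError if one code is paired with two different labels,
    which would suggest the files are not aligned after all.
    """
    numbers = csv.DictReader(numbers_file)
    labels = csv.DictReader(labels_file)
    mapping: Dict[str, str] = {}
    for number_row, label_row in zip(numbers, labels):
        code, label = number_row[column], label_row[column]
        if code == "":
            continue  # skip unanswered questions
        if mapping.setdefault(code, label) != label:
            raise ValueError(f"code {code!r} maps to more than one label")
    return mapping


# In-memory stand-ins for a Numbers CSV and a Labels CSV (made-up content).
numbers_csv = io.StringIO("pid,Q1\n1,2\n2,1\n3,\n4,2\n")
labels_csv = io.StringIO("pid,Q1\n1,Agree\n2,Disagree\n3,\n4,Agree\n")

print(code_to_label(numbers_csv, labels_csv, "Q1"))
```

With real files, the two io.StringIO objects would be replaced by open file handles, for example open("wave_1_numbers_anonymized.csv", newline="").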

Job postings of DAX40 companies (2023)

Creators: HR Forecast GmbH
Publication Date: 2022-12-31

Job posting dataset of the DAX40 companies for the year 2023, aggregated from multiple public sources. The data contains anonymized information about job advertisements, including job title, job requirements, location, and type of employment.

Real-World LLM Use Cases

Creators: Jingwen Cheng, Kshitish Ghate, Wenyue Hua, William Yang Wang, Hong Shen, Fei Fang
Publication Date: 2025-03-24

This data contains 93,259 LLM use cases collected from Reddit and news articles between June 2020 and December 2024. It captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and explores how users’ occupations relate to the types of applications they use.

If you use this dataset, please cite this paper: https://doi.org/10.48550/arXiv.2503.18792.

Artstor Museum Image Data

Creators: ARTSTOR
Publication Date: 2025

Explore Artstor’s collections of high-quality images, curated from leading museums and archives around the world. Artstor’s diverse collections are rights-cleared for education and research, and include Open Access content as well as rare materials not available elsewhere. Artstor gives access to 865,914 items in 308 collections.

Social Network Data of Student Relationships

Creators: Rebecca Mauldin
Publication Date: 2024

This dataset is from longitudinal social network analysis research that collected survey data from one class of graduate students (N=142) in a Master of Social Work (MSW) program at a large U.S. public university. The program used cohort-based learning in the first semester, after which students were integrated into the student body as a whole. The dataset contains network data about friendships, academic discussion ties, and professional influence among classmates. Student attribute data include archival data from the school (e.g., student demographics, incoming GPA, GRE scores) and survey items (e.g., sense of belonging scale, multicultural perspective, perceived stress).

DATA-SPECIFIC INFORMATION

Participation Status across all Four Waves Overview:
File name: ParticipationAcrossWaves.csv
Number of variables: 6
Number of cases/rows: 145

Wave 1 Characteristics Overview:
File name: w1_Characteristics.csv
Number of variables: 39
Number of cases/rows: 145

Wave 3 Characteristics Overview:
File name: w3_Characteristics.csv
Number of variables: 40
Number of cases/rows: 145

Wave 4 Characteristics Overview:
File name: w4_Characteristics.csv
Number of variables: 49
Number of cases/rows: 145

Wave 1 Know-of Ties:
File name: w1_KnowofEdgelist.csv
Number of variables: 2
Number of cases/rows: 169

 

Academic Ties Overview:
File name (wave 2): w2_AcademicEdgelist.csv
Number of variables: 2
Number of cases/rows: 1464

File name (wave 3): w3_AcademicEdgelist.csv
Number of variables: 2
Number of cases/rows: 1642

File name (wave 4): w4_AcademicEdgelist.csv
Number of variables: 2
Number of cases/rows: 2260

Friendship Ties Overview:
File name (wave 2): w2_FriendshipEdgelist.csv
Number of variables: 2
Number of cases/rows: 684

File name (wave 3): w3_FriendshipEdgelist.csv
Number of variables: 2
Number of cases/rows: 752

File name (wave 4): w4_FriendshipEdgelist.csv
Number of variables: 2
Number of cases/rows: 964

Professional Influence Ties Overview:
File name (wave 2): w2_ProfessionalEdgelist.csv
Number of variables: 2
Number of cases/rows: 567

File name (wave 3): w3_ProfessionalEdgelist.csv
Number of variables: 2
Number of cases/rows: 809

File name (wave 4): w4_ProfessionalEdgelist.csv
Number of variables: 2
Number of cases/rows: 981

Shared Courses Edgelist Overview:
File name: SharedCourseValuedEdgelist.csv
Number of variables: 3
Number of cases/rows: 14,714

Shared Course Affiliation Matrix Overview:
File name: SharedCourseAffiliationMatrix.csv
Number of matrix rows: 145
Number of matrix columns: 145
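The two-variable edgelist files listed above lend themselves to simple tie counts per student. The following sketch uses only the Python standard library and rests on assumptions the listing does not confirm: that each file has a header row, that the first column identifies the student naming the tie and the second the student being named, and that ties are directed. The header names ego and alter and the function name degree_counts are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import TextIO, Tuple


def degree_counts(edgelist_file: TextIO, has_header: bool = True) -> Tuple[Counter, Counter]:
    """Return (out_degree, in_degree) counters for a two-column edgelist.

    Columns are read by position, so the actual header names do not matter.
    Treating column 1 as sender and column 2 as receiver is an assumption.
    """
    reader = csv.reader(edgelist_file)
    if has_header:
        next(reader, None)  # skip the header row if present
    out_degree: Counter = Counter()
    in_degree: Counter = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        source, target = row[0].strip(), row[1].strip()
        out_degree[source] += 1
        in_degree[target] += 1
    return out_degree, in_degree


# In-memory stand-in for a file such as w2_FriendshipEdgelist.csv (made-up IDs).
sample = io.StringIO("ego,alter\n1,2\n1,3\n2,3\n")
nominations_sent, nominations_received = degree_counts(sample)
print(nominations_sent.most_common())
print(nominations_received.most_common())
```

Running the same function on the wave 2, 3, and 4 files of one tie type would give a rough view of how each student's number of ties changes over time.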

Instagram Data

Creators: Thales Bertaglia
Publication Date: 2023

Users can create the dataset themselves by running the main script, which collects and processes data from Instagram using the CrowdTangle API. A sample of the data is attached.

Instagram Posts from Football Players

Creators: Klostermann, Jan
Publication Date: 2023

This dataset includes information on 334,071 Instagram posts from 1,435 male professional football players who were under contract at any of the 56 clubs in the English Premier League, the Spanish La Liga, and the German Bundesliga. The data was collected on December 31, 2019 and includes the whole history of Instagram posts up to that point in time.

The dataset provides the following information:

  • Player information: Information on each football player in the dataset is collected from http://www.transfermarkt.de and includes club, position, market value (at the time of collecting the data), highest market value, and the year in which the highest market value was observed. Further, the Instagram account name is provided.
  • Instagram post information: Information on the Instagram posts including the shortcode (which can be used to open the post on instagram.com), date, caption text, number of likes, number of comments, post type (image, sidecar, video).
  • Instagram post images: For each post, we analyzed the content of the image (first image for sidecar posts, first frame for video posts) using Google Vision and extracted the number of persons, their age, and their gender. Further, we extracted all tags that are included in the image, such as “soccer” or “car”.
  • Additional information: Additional information such as the images of the posts can be requested from the authors.
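As noted above, a post's shortcode can be used to open the post on instagram.com. A minimal sketch of that step follows; it assumes the commonly used URL pattern https://www.instagram.com/p/<shortcode>/, which the dataset description itself does not spell out. The function name post_url and the example shortcode are made up.

```python
def post_url(shortcode: str) -> str:
    """Build a post URL from a shortcode.

    The /p/<shortcode>/ pattern is an assumption of this sketch,
    not something taken from the dataset documentation.
    """
    shortcode = shortcode.strip()
    if not shortcode:
        raise ValueError("shortcode must not be empty")
    return f"https://www.instagram.com/p/{shortcode}/"


# Hypothetical shortcode, used only to show the call.
print(post_url("B6abc123XYZ"))
```

Posts that were deleted or made private after the collection date will no longer resolve, so a generated URL is not guaranteed to open.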

The dataset has been used in the following paper:

Klostermann, J., Meißner, M., Max, A., & Decker, R. (2023). Presentation of celebrities’ private life through visual social media. Journal of Business Research, 156, 113524.

Please cite the paper when using the dataset for your own research. It is recommended to read the paper for further information on the dataset.
