Resources by Stefan

Job postings of DAX40 companies (2023)

Creators: HR Forecast GmbH
Publication Date: 2022-12-31

Job posting dataset of the DAX40 companies for the year 2023, aggregated from multiple public sources. The data contains anonymized information about job advertisements, including job title, job requirements, location, and type of employment.

Real-World LLM Use Cases

Creators: Jingwen Cheng, Kshitish Ghate, Wenyue Hua, William Yang Wang, Hong Shen, Fei Fang
Publication Date: 2025-03-24

This data contains 93,259 LLM use cases collected from Reddit and news articles between June 2020 and December 2024. It captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and explores how users’ occupations relate to the types of applications they use.

If you use this dataset, please cite this paper: https://doi.org/10.48550/arXiv.2503.18792.

Artstor Museum Image Data

Creators: ARTSTOR
Publication Date: 2025

Explore Artstor’s collections of high-quality images, curated from leading museums and archives around the world. Artstor’s diverse collections are rights-cleared for education and research, and include Open Access content as well as rare materials not available elsewhere. Artstor gives access to 865,914 items in 308 collections.

Social Network Data of Student Relationships

Creators: Rebecca Mauldin
Publication Date: 2024

This dataset is from longitudinal social network analysis research that collected survey data from one class of graduate students (N=142) in a Master of Social Work (MSW) program in a large U.S. public university. The program used cohort-based learning in the first semester after which students were integrated into the student body as a whole. The dataset contains network data about friendships, academic discussion ties, and professional influence among classmates. Student attribute data include archival data from the school (e.g., student demographics, incoming GPA, GRE scores) and survey items (e.g., sense of belonging scale, multicultural perspective, perceived stress).

DATA-SPECIFIC INFORMATION

Participation Status across all Four Waves Overview:
File name: ParticipationAcrossWaves.csv
Number of variables: 6
Number of cases/rows: 145

Wave 1 Characteristics Overview:
File name: w1_Characteristics.csv
Number of variables: 39
Number of cases/rows: 145

Wave 3 Characteristics Overview:
File name: w3_Characteristics.csv
Number of variables: 40
Number of cases/rows: 145

Wave 4 Characteristics Overview:
File name: w4_Characteristics.csv
Number of variables: 49
Number of cases/rows: 145

Wave 1 Know-of Ties:
File name: w1_KnowofEdgelist.csv
Number of variables: 2
Number of cases/rows: 169

Academic Ties Overview:
File name (wave 2): w2_AcademicEdgelist.csv
Number of variables: 2
Number of cases/rows: 1464

File name (wave 3): w3_AcademicEdgelist.csv
Number of variables: 2
Number of cases/rows: 1642

File name (wave 4): w4_AcademicEdgelist.csv
Number of variables: 2
Number of cases/rows: 2260

Friendship Ties Overview:
File name (wave 2): w2_FriendshipEdgelist.csv
Number of variables: 2
Number of cases/rows: 684

File name (wave 3): w3_FriendshipEdgelist.csv
Number of variables: 2
Number of cases/rows: 752

File name (wave 4): w4_FriendshipEdgelist.csv
Number of variables: 2
Number of cases/rows: 964

Professional Influence Ties Overview:
File name (wave 2): w2_ProfessionalEdgelist.csv
Number of variables: 2
Number of cases/rows: 567

File name (wave 3): w3_ProfessionalEdgelist.csv
Number of variables: 2
Number of cases/rows: 809

File name (wave 4): w4_ProfessionalEdgelist.csv
Number of variables: 2
Number of cases/rows: 981

Shared Courses Edgelist Overview:
File name: SharedCourseValuedEdgelist.csv
Number of variables: 3
Number of cases/rows: 14,714

Shared Course Affiliation Matrix Overview:
File name: SharedCourseAffiliationMatrix.csv
Number of matrix rows: 145
Number of matrix columns: 145
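
As a minimal sketch of how one of the two-column edgelist files above might be read, the following uses only the Python standard library. It assumes each file has a header row and that the first column names the student reporting the tie and the second the student named; neither is stated in the description above, and the column names "ego" and "alter" in the sample are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_edgelist(handle, has_header=True):
    """Read a two-column edgelist into a dict mapping source -> set of targets."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    ties = defaultdict(set)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        source, target = row[0].strip(), row[1].strip()
        ties[source].add(target)
    return ties


def out_degree(ties):
    """Number of distinct ties reported by each source node."""
    return {node: len(targets) for node, targets in ties.items()}


# Made-up sample standing in for a file such as w2_FriendshipEdgelist.csv.
sample = io.StringIO("ego,alter\n1,2\n1,3\n2,1\n")
friendship = read_edgelist(sample)
degrees = out_degree(friendship)
```

For a real file, the in-memory sample would be replaced by open("w2_FriendshipEdgelist.csv", newline=""), with has_header adjusted after inspecting the first line.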

Instagram Data

Creators: Thales Bertaglia
Publication Date: 2023

The dataset can be self-created by the user by following the main script to collect and process data from Instagram using the CrowdTangle API. An exemplary sample of the data is attached.

Instagram Posts from Football Players

Creators: Klostermann, Jan
Publication Date: 2023

This dataset includes information on 334,071 Instagram posts from 1,435 male professional football players who were under contract at any of the 56 clubs in the English Premier League, the Spanish La Liga, and the German Bundesliga. The data was collected on December 31, 2019, and includes the whole history of Instagram posts up to that point in time.

The dataset provides the following information:

  • Player information: Information on each football player in the dataset is collected from http://www.transfermarkt.de and includes club, position, market value (at the time of collecting the data), highest market value, and the year in which the highest market value was observed. Further, the Instagram account name is provided.
  • Instagram post information: Information on the Instagram posts including the shortcode (which can be used to open the post on instagram.com), date, caption text, number of likes, number of comments, post type (image, sidecar, video).
  • Instagram post images: For each post, we analyzed the content of the image (first image for sidecar posts, first frame for video posts) using Google Vision and extracted the number of persons, their age, and their gender. Further, we extracted all tags that are included in the image, such as “soccer” or “car”.
  • Additional information: Additional information such as the images of the posts can be requested from the authors.
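
The shortcode mentioned in the post information can be turned into a link with a one-line helper. This is a sketch that assumes the commonly used URL pattern https://www.instagram.com/p/<shortcode>/, which is not stated in the dataset description; the function name post_url and the example shortcode are made up for illustration.

```python
def post_url(shortcode):
    """Build a post URL from a shortcode (URL pattern is an assumption)."""
    return f"https://www.instagram.com/p/{shortcode}/"


# Hypothetical shortcode, not taken from the dataset.
example = post_url("ABC123xyz")
```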

The dataset has been used in the following paper:

Klostermann, J., Meißner, M., Max, A., & Decker, R. (2023). Presentation of celebrities’ private life through visual social media. Journal of Business Research, 156, 113524.

Please cite the paper when using the dataset for your own research. It is recommended to read the paper for further information on the dataset.

The Economist Historical Advertisements - Master Dataset

Creators: Kluge, Stefan; Gehrmann, Leonie; Stahl, Florian
Publication Date: 2023

This dataset contains metadata of 512,599 historical advertisements from all 8,840 issues of The Economist magazine, years 1843 to 2014. It is part of a series of datasets related to The Economist Historical Archive (https://www.gale.com/intl/c/the-economist-historical-archive). You will need this Master Dataset if you want to work with any of the related datasets. Each advertisement entry includes various metadata fields such as publication date, issue number, page number, and advertisement dimensions. This structured information enables detailed analyses of trends and patterns in advertising practices over time. In total, the dataset has a size of 195.4 MB.

MLW Zettelmaterial

Creators: Bayerische Akademie der Wissenschaften (BAdW)
Publication Date: 2023

General information:

The data set comprises a total of 114,653 images (18.9 GB), corresponding to 3,507 distinct lemmas.
All images are in RGB but not uniform in size, i.e., height and width differ from image to image.
Additionally, the corresponding lemma is available for each image in a separate JSON file.

Structure:

Most record cards follow the same structure, composed of three main parts.

  • The first one (1), and the one deemed most challenging, is the lemma, which is always located in the upper left corner of the record card.
  • The second part (2) is the index of the text where the lemma is found.
  • The third part (3) contains a text extract in which the word (corresponding to the lemma) occurs in context.

Character inventory:

There are 17 different first characters in total: eight letters, each occurring in both upper- and lowercase, plus one special character.
The capitalization of a word plays a crucial role since a word’s meaning changes depending on capitalization.
Since the majority of our data stems from the S-series of the dictionary, most lemmas start with the letter “s”.
Likewise, a larger number of lemmas start with “m”, “v”, “t”, “u”, “l”, and “n”.

Occurrence frequencies:

  • A total of 2,420 lemmas (69%) were found to appear on ten record cards or fewer
  • 854 lemmas (24.4%) are present on between 10 and 100 record cards
  • 233 lemmas (6.6%) can be found on more than 100 record cards
  • 1,123 lemmas (approximately 36.7%) had only one record card
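
The first three buckets above can be recomputed from per-lemma card counts with a short sketch like the one below. The bucket boundaries (up to 10, 11 to 100, more than 100) are one reading of the list, and the toy lemmas and their counts are invented for illustration; real counts would have to be derived from the JSON files. The shares are computed only from the three bucket totals reported above.

```python
from collections import Counter


def bucket(card_count):
    """Assign a lemma to a frequency bucket by its number of record cards."""
    if card_count <= 10:
        return "10 or fewer"
    if card_count <= 100:
        return "11 to 100"
    return "more than 100"


def bucket_counts(cards_per_lemma):
    """Count lemmas per bucket from a mapping lemma -> number of cards."""
    return Counter(bucket(n) for n in cards_per_lemma.values())


# Hypothetical toy input, not taken from the dataset.
toy = {"sanctus": 150, "sal": 12, "sic": 3, "sol": 1}
toy_counts = bucket_counts(toy)

# Shares of the three reported buckets relative to their sum (3,507 lemmas).
reported = {"10 or fewer": 2420, "11 to 100": 854, "more than 100": 233}
total = sum(reported.values())
shares = {name: round(100 * n / total, 1) for name, n in reported.items()}
```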

Lengths:

  • Lemma lengths range from one character up to a maximum of 19 characters.
  • The average length of the lemmas lies between five and six characters.

Availability:

Research activity:

  • Koch, P., Nuñez, G. V., Arias, E. G., Heumann, C., Schöffel, M., Häberlin, A., & Aßenmacher, M. (2023). A tailored Handwritten-Text-Recognition System for Medieval Latin. arXiv preprint arXiv:2308.09368.
