Resources by Stefan

The Economist Historical Advertisements - Master Dataset

Creators: Kluge, Stefan; Gehrmann, Leonie; Stahl, Florian
Publication Date: 2023

This dataset contains metadata for 512,599 historical advertisements from all 8,840 issues of The Economist magazine, 1843 to 2014. It is part of a series of datasets related to The Economist Historical Archive (https://www.gale.com/intl/c/the-economist-historical-archive). You will need this Master Dataset if you want to work with any of the related datasets. Each advertisement entry includes metadata fields such as publication date, issue number, page number, and advertisement dimensions. This structured information enables detailed analyses of trends and patterns in advertising practices over time. In total, the dataset has a size of 195.4 MB.

MLW Zettelmaterial

Creators: Bayerische Akademie der Wissenschaften (BAdW)
Publication Date: 2023

General information:

The data set comprises a total of 114,653 images (18.9 GB), corresponding to 3,507 distinct lemmas.
All images are in RGB but not uniform in size, i.e. height and width differ from image to image.
Additionally, the corresponding lemma for each image is available in a separate JSON file.

Structure:

Most record cards follow the same structure, consisting of three main parts.

  • The first one (1), and the one deemed most challenging, is the lemma, which is always located in the upper left corner of the record card.
  • The second part (2) is the index of the text where the lemma is found.
  • The third part (3) contains a text extract in which the word (corresponding to the lemma) occurs in context.

Character inventory:

There are 17 different first characters in total: eight letters that each occur in both upper- and lowercase, plus one special character.
Capitalization plays a crucial role, since a word’s meaning changes depending on capitalization.
Since the majority of our data stems from the S-series of the dictionary, most lemmas start with the letter “s”.
A large number of lemmas also start with “m”, “v”, “t”, “u”, “l”, and “n”.

Occurrence frequencies:

  • A total of 2,420 lemmas (69%) were found to appear on ten record cards or fewer
  • 854 lemmas (24.4%) are present on between 10 and 100 record cards
  • 233 lemmas (6.6%) can be found on more than 100 record cards
  • 1,123 lemmas (approximately 36.7%) had only one record card

Lengths:

  • Lemma lengths range from one character up to a maximum of 19 characters.
  • The average length of the lemmas lies between five and six characters.

Availability:

Research activity:

  • Koch, P., Nuñez, G. V., Arias, E. G., Heumann, C., Schöffel, M., Häberlin, A., & Aßenmacher, M. (2023). A tailored Handwritten-Text-Recognition System for Medieval Latin. arXiv preprint arXiv:2308.09368.

The Economist Historical Advertisements - Faces Dataset

Creators: Kluge, Stefan
Publication Date: 2023

This dataset contains 116,746 identified faces (bounding box location on the image, predicted age, and gender) from the historical advertisements in all 8,840 issues of The Economist magazine, 1843 to 2014. Faces were detected using the following library: https://pythonrepo.com/repo/timesler-facenet-pytorch-python-deep-learning. You will also need The Economist Historical Advertisements – Master Dataset to work with the data. In total, the dataset has a size of 20.2 MB and is organized as follows:

  • Filename: A unique identifier corresponding to each advertisement where a face has been detected. This identifier links directly to the specific advertisement within The Economist archives.
  • Bounding Box Coordinates:

    • Bounding Box relative X1 and Y1: These values represent the top-left corner coordinates of the bounding box encapsulating the detected face, expressed as proportions relative to the image dimensions.
    • Bounding Box relative X2 and Y2: These values denote the bottom-right corner coordinates of the bounding box, also as relative proportions. To determine the absolute pixel coordinates, multiply these relative values by the image’s width and height, respectively.
  • Segmentation Confidence Score: A numerical value indicating the confidence level of the neural network algorithm that the identified bounding box indeed contains a face. Higher scores reflect greater confidence in accurate face detection.

  • Size Relative: A metric indicating the proportion of the advertisement occupied by the detected face. For example, a value of 1 signifies that the face covers the entire advertisement, while 0.5 indicates it covers half.

  • Predicted Age: An estimated age of the individual based on facial analysis performed by the detection algorithm.

  • Gender Probability: A probability score representing the likelihood of the detected face being female. A value of 0 indicates male, 1 indicates female, and intermediate values (e.g., 0.4) indicate the predicted likelihood (here 40%) of the individual being female.
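
The conversion from relative to absolute bounding box coordinates described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `to_pixel_box` and its parameter names are hypothetical, and the image width and height are assumed to come from the advertisement image itself (or from the dimensions recorded in the Master Dataset).

```python
def to_pixel_box(rel_x1, rel_y1, rel_x2, rel_y2, image_width, image_height):
    """Convert a relative bounding box (values in 0..1) to pixel coordinates.

    X values are scaled by the image width, Y values by the image height,
    as described for the Bounding Box fields. Names are illustrative only.
    """
    left = round(rel_x1 * image_width)
    top = round(rel_y1 * image_height)
    right = round(rel_x2 * image_width)
    bottom = round(rel_y2 * image_height)
    return left, top, right, bottom


# Made-up example values for an 800 x 1200 pixel advertisement scan.
box = to_pixel_box(0.25, 0.1, 0.75, 0.5, image_width=800, image_height=1200)
print(box)  # (200, 120, 600, 600)
```

The resulting tuple uses the (left, top, right, bottom) order that many image libraries expect for cropping, which is a convenience choice rather than something the dataset prescribes.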

The Economist Historical Advertisements - Objects Dataset

Creators: Kluge, Stefan
Publication Date: 2023

This dataset contains 191,994 identified object locations and classes from the historical advertisements in all 8,840 issues of The Economist magazine, 1843 to 2014. We used a state-of-the-art classifier to detect the objects: https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1. You will also need The Economist Historical Advertisements – Master Dataset to work with the data. The dataset has a size of 29.8 MB.

Creators: Kluge, Stefan
This dataset is a specialized collection of metadata on 92,592 historical advertisements from the banking industry, extracted from all 8,840 issues of The Economist magazine, 1843 to 2014. It is part of a series of datasets related to The Economist Historical Archive (https://www.gale.com/intl/c/the-economist-historical-archive). In total, the dataset has a size of 136.0 MB. Each advertisement entry includes metadata fields such as publication date, issue number, page number, and advertisement dimensions. This structured information enables detailed analyses of trends and patterns in the banking industry’s advertising practices.

Raw Bay Area Craigslist Rental Housing Posts

Creators: Pennington, Kate
Publication Date: 2018

Like many cities, San Francisco doesn’t track rents. The Bay Area Craigslist Rental Housing Posts dataset comprises rental housing listings from the San Francisco Bay Area, spanning 2000 to 2018. Each entry includes attributes such as posting date, neighborhood, price, number of bedrooms and bathrooms, square footage, and geographic coordinates, facilitating in-depth analysis of housing trends. The dataset documents 200,796 individual rental listings, with a total size of 16.3 kB. It is organized into three main components:

  1. Raw Data (2000-2012): This subset includes 167,090 entries with fields for posting date, title, and neighborhood.

  2. Raw Data (2013-2018): Comprising 58,551 entries, this subset offers more detailed information, including post ID, date, neighborhood, price, square footage, number of bedrooms, address, latitude, longitude, description, title, and details.

  3. Cleaned Data (2000-2018): This consolidated and processed dataset contains 200,796 entries with variables such as post ID, date, year, neighborhood, city, county, price, number of bedrooms, number of bathrooms, square footage, room type indicator, address, latitude, longitude, title, description, and details.

Creators: Habernal, Ivan

This dataset contains transcribed court hearings sourced from official hearings of the European Court of Human Rights (https://www.echr.coe.int/webcasts-of-hearings). The hearings are 154 selected webcasts (videos) from 2012-2022 in their original language (no interpretation). With manual annotation of language labels and automatic processing of the extracted audio with pyannote and whisper-large-v2, the resulting dataset contains 4,000 speaker turns and 88,920 individual lines. The dataset has a size of 1.9 MB and contains two subsets: the transcripts and the metadata with linked documents. The transcripts are additionally available as .txt or .xml.
Languages

The transcripts are mostly in English and French.

A smaller portion is in the following languages:

Russian, Spanish, Croatian, Italian, Portuguese, Turkish, Polish, Lithuanian, German, Ukrainian, Hungarian, Dutch, Albanian, Romanian, Serbian

The collected metadata is in English.
Dataset Structure
Data Instances

Each instance in transcripts represents an entire segment of a transcript, similar to a conversation turn in a dialog.

{ 'id': 0, 'webcast_id': '1021112_29112017', 'segment_id': 0, 'speaker_name': 'UNK', 'speaker_role': 'Announcer', 'data': { 'begin': [12.479999542236328], 'end': [13.359999656677246], 'language': ['fr'], 'text': ['La Cour!'] } }
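
As a rough sketch of how such a transcript instance might be unpacked, the snippet below turns one segment into one record per line. It assumes that the lists under 'data' ('begin', 'end', 'language', 'text') are parallel, with one entry per transcribed line; the helper name `iter_lines` and the output keys are hypothetical.

```python
# The example instance from above, written as a Python dict.
instance = {
    "id": 0,
    "webcast_id": "1021112_29112017",
    "segment_id": 0,
    "speaker_name": "UNK",
    "speaker_role": "Announcer",
    "data": {
        "begin": [12.479999542236328],
        "end": [13.359999656677246],
        "language": ["fr"],
        "text": ["La Cour!"],
    },
}


def iter_lines(segment):
    """Yield one flat record per line of a segment.

    Assumes the four lists in segment['data'] have equal length and are
    aligned by position (an assumption, not documented behavior).
    """
    data = segment["data"]
    rows = zip(data["begin"], data["end"], data["language"], data["text"])
    for begin, end, language, text in rows:
        yield {
            "webcast_id": segment["webcast_id"],
            "segment_id": segment["segment_id"],
            "speaker_role": segment["speaker_role"],
            "language": language,
            "duration_seconds": end - begin,
            "text": text,
        }


lines = list(iter_lines(instance))
print(lines[0]["language"], lines[0]["text"])  # fr La Cour!
```

Flattening segments this way makes it straightforward to filter by language or to sum speaking time per speaker role.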

Each instance in documents holds information on a document in hudoc associated with a hearing, together with the metadata of that hearing. The actual document is linked and can also be found in hudoc using the case_id. Note: hearing_type states the type of the hearing, while type states the type of the document. If the hearing is a “Grand Chamber hearing”, the “CHAMBER” document refers to a different hearing.

{
  'id': 16,
  'webcast_id': '1232311_02102012',
  'hearing_title': 'Michaud v. France (nos. 12323/11)',
  'hearing_date': '2012-10-02 00:00:00',
  'hearing_type': 'Chamber hearing',
  'application_number': ['12323/11'],
  'case_id': '001-115377',
  'case_name': 'CASE OF MICHAUD v. FRANCE',
  'case_url': 'https://hudoc.echr.coe.int/eng?i=001-115377',
  'ecli': 'ECLI:CE:ECHR:2012:1206JUD001232311',
  'type': 'CHAMBER',
  'document_date': '2012-12-06 00:00:00',
  'importance': 1,
  'articles': ['8', '8-1', '8-2', '34', '35'],
  'respondent_government': ['FRA'],
  'issue': 'Decision of the National Bar Council of 12 July 2007 "adopting regulations on internal procedures for implementing the obligation to combat money laundering and terrorist financing, and an internal supervisory mechanism to guarantee compliance with those procedures" ; Article 21-1 of the Law of 31 December 1971 ; Law no. 2004-130 of 11 February 2004 ; Monetary and Financial Code',
  'strasbourg_caselaw': 'André and Other v. France, no 18603/03, 24 July 2008;Bosphorus Hava Yollari Turizm ve Ticaret Anonim Sirketi v. Ireland [GC], no 45036/98, ECHR 2005-VI;[…]',
  'external_sources': 'Directive 91/308/EEC, 10 June 1991;Article 6 of the Treaty on European Union;Charter of Fundamental Rights of the European Union;Articles 169, 170, 173, 175, 177, 184 and 189 of the Treaty establishing the European Community;Recommendations 12 and 16 of the financial action task force ("FATF") on money laundering;Council of Europe Convention on Laundering, Search, Seizure and Confiscation of the Proceeds from Crime and on the Financing of Terrorism (16 May 2005)',
  'conclusion': 'Remainder inadmissible;No violation of Article 8 – Right to respect for private and family life (Article 8-1 – Respect for correspondence;Respect for private life)',
  'separate_opinion': True
}
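
Several fields in a documents instance are plain strings that usually need light parsing before analysis. The sketch below is one possible approach under stated assumptions: the date strings follow the 'YYYY-MM-DD HH:MM:SS' pattern seen in the example, and 'strasbourg_caselaw' and 'external_sources' use ';' as a separator. The helper name `parse_document` is hypothetical, and the sample record is shortened for illustration.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # pattern observed in the example instance


def parse_document(record):
    """Return a copy of a documents record with parsed dates and split lists.

    Assumes ';' separates entries in the two list-like string fields; the
    'conclusion' field is left untouched because it can contain ';' inside
    parentheses.
    """
    parsed = dict(record)
    for key in ("hearing_date", "document_date"):
        if record.get(key):
            parsed[key] = datetime.strptime(record[key], DATE_FORMAT)
    for key in ("strasbourg_caselaw", "external_sources"):
        raw = record.get(key) or ""
        parsed[key] = [part.strip() for part in raw.split(";") if part.strip()]
    return parsed


# Shortened version of the example record (only a few fields, fewer sources).
sample = {
    "case_id": "001-115377",
    "hearing_date": "2012-10-02 00:00:00",
    "document_date": "2012-12-06 00:00:00",
    "external_sources": "Directive 91/308/EEC, 10 June 1991;"
    "Article 6 of the Treaty on European Union;"
    "Charter of Fundamental Rights of the European Union",
}

doc = parse_document(sample)
days_to_judgment = (doc["document_date"] - doc["hearing_date"]).days
print(days_to_judgment, len(doc["external_sources"]))  # 65 3
```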

SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments

Creators: Lange, Kai-Robin; Jentsch, Carsten
Publication Date: 2023

A dataset of German parliament debates covering 74 years of plenary protocols across all 16 state parliaments of Germany as well as the German Bundestag. The debates are separated into individual speeches which are enriched with meta data identifying the speaker as a member of the parliament (mp). In total, the dataset covers more than 15,000,000 speeches with a total size of 10 GB.

When using this data set, please cite the original paper “Lange, K.-R., Jentsch, C. (2023). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments. Proceedings of the 3rd Workshop on Computational Linguistics for Political Text Analysis@KONVENS 2023.”

The meta-data is separated into two types: time-specific meta-data that holds for a legislative period but can change over time (e.g. the party or constituency of an mp), and meta-data that is considered fixed, such as the birth date or the name of a speaker. The former is stored along with the speeches, as it is considered temporal information for that point in time, but is additionally stored in the file all_mps_mapping.csv in case something needs to be double-checked. The rest of the meta-data is stored in the file all_mps_meta.csv. The meta-data from this file can be matched with a speech by comparing the speaker ID variable “MPID”. The speeches of each parliament are saved in CSV format. Along with the speeches, they contain the following meta-data:

  • Period: int. The period in which the speech took place
  • Session: int. The session in which the speech took place
  • Chair: boolean. The information if the speaker was the chair of the plenary session
  • Interjection: boolean. The information if the speech is a comment or an interjection from the crowd
  • Party: list (e.g. [“cdu”] or [“cdu”, “fdp”] when having more than one speaker during an interjection). List of the party of the speaker or the parties whom the comment/interjection references
  • Consituency: string. The constituency of the speaker in the current legislative period
  • MPID: int. The ID of the speaker, which can be used to get more meta-data from the file all_mps_meta.csv

The file all_mps_meta.csv contains the following meta information:

  • MPID: int. The ID of the speaker, which can be used to match the mp with his/her speeches.
  • WikipediaLink: The link to the mp’s Wikipedia page
  • WikiDataLink: The link to the mp’s WikiData page
  • Name: string. The full name of the mp.
  • Last Name: string. The last name of the mp, found on WikiData. If no last name is given on WikiData, the full name was heuristically cut at the last space to get the information necessary for splitting the speeches.
  • Born: string, format: YYYY-MM-DD. Birth date of the mp. If an exact birth date is found on WikiData, this exact date is used. Otherwise, a day in the year of birth given on Wikipedia is used.
  • SexOrGender: string. Information on the sex or gender of the mp. Disclaimer: This information was taken from WikiData, which does not seem to differentiate between sex and gender.
  • Occupation: list. Occupation(s) of the mp.
  • Religion: string. Religious beliefs of the mp.
  • AbgeordnetenwatchID: int. ID of the mp on the website Abgeordnetenwatch
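
Matching speeches with the fixed meta-data via “MPID”, as described above, could look roughly like the following. This is a sketch using only the standard library and made-up toy rows; the speech text column name "Speech", the example names, and the helper `attach_meta` are assumptions for illustration, and the real files would be opened from disk instead of from in-memory strings.

```python
import csv
import io

# Toy stand-ins for one parliament's speech file and for all_mps_meta.csv.
# All values are invented; "Speech" is a hypothetical column name.
speeches_file = io.StringIO(
    "Period,Session,MPID,Speech\n"
    "1,1,101,First example speech\n"
    "1,1,102,Second example speech\n"
    "1,2,999,Speech by an unmatched speaker\n"
)
meta_file = io.StringIO(
    "MPID,Name,Born\n"
    "101,Example Person A,1950-01-01\n"
    "102,Example Person B,1960-06-15\n"
)

speech_rows = list(csv.DictReader(speeches_file))
meta_rows = list(csv.DictReader(meta_file))


def attach_meta(speeches, meta, fields=("Name", "Born")):
    """Left-join speeches with meta rows on the shared MPID column.

    Speeches without a matching MPID keep their row and get None
    for the requested meta fields.
    """
    meta_by_id = {row["MPID"]: row for row in meta}
    merged = []
    for speech in speeches:
        match = meta_by_id.get(speech["MPID"], {})
        extra = {field: match.get(field) for field in fields}
        merged.append({**speech, **extra})
    return merged


merged = attach_meta(speech_rows, meta_rows)
for row in merged:
    print(row["MPID"], row["Name"], row["Born"])
```

Because csv.DictReader yields every value as a string, MPID is compared as text here; if the IDs are loaded as integers elsewhere, both sides should be converted consistently before joining.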
