
MLW Zettelmaterial

Publication Date: 2023
Creators: Bayerische Akademie der Wissenschaften (BAdW)

General information:

The data set comprises a total of 114,653 images (18.9 GB), corresponding to 3,507 distinct lemmas.
All images are in RGB but are not uniform in size, i.e., height and width differ from image to image.
Additionally, the corresponding lemma for each image is available in a separate JSON file.

Structure:

Most record cards follow the same structure, consisting of three main parts.

  • The first one (1), and the one deemed most challenging, is the lemma, which is always located in the upper left corner of the record card.
  • The second part (2) is the index of the text where the lemma is found.
  • The third part (3) contains a text extract in which the word (corresponding to the lemma) occurs in context.

Character inventory:

There are 17 distinct first characters in total: eight letters that each occur in upper- and lowercase, plus one special character.
Capitalization plays a crucial role because a word’s meaning changes depending on it.
Since the majority of our data stems from the S-series of the dictionary, most lemmas start with the letter “s”.
Likewise, a large number of lemmas start with “m”, “v”, “t”, “u”, “l”, and “n”.

Occurrence frequencies:

  • A total of 2,420 lemmas (69%) appear on ten record cards or fewer.
  • 854 lemmas (24.4%) appear on between 10 and 100 record cards.
  • 233 lemmas (6.6%) appear on more than 100 record cards.
  • 1,123 lemmas (approximately 36.7%) have only one record card.
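
The frequency bands above could be recomputed from the per-image lemma labels roughly as follows. This is a minimal sketch, not the creators' own tooling: the function names `bucket_lemma_frequencies` and `shares` are hypothetical, the input is assumed to be a flat list with one lemma string per record card, and the band boundaries (1-10, 11-100, more than 100) are an assumption about how the overlapping wording is meant to be read.

```python
from collections import Counter


def bucket_lemma_frequencies(lemma_per_card):
    """Count how many lemmas fall into each record-card frequency band.

    `lemma_per_card` is assumed to hold one lemma string per record card.
    """
    cards_per_lemma = Counter(lemma_per_card)
    buckets = {"<=10": 0, "11-100": 0, ">100": 0, "exactly 1": 0}
    for count in cards_per_lemma.values():
        if count <= 10:
            buckets["<=10"] += 1
        elif count <= 100:
            buckets["11-100"] += 1
        else:
            buckets[">100"] += 1
        # Single-card lemmas are a subset of the "<=10" band.
        if count == 1:
            buckets["exactly 1"] += 1
    return buckets


def shares(buckets, total_lemmas):
    """Express each bucket as a percentage of all distinct lemmas."""
    return {name: round(100 * n / total_lemmas, 1) for name, n in buckets.items()}


# Lemma counts for the first three bands, taken from the list above.
published = {"<=10": 2420, "11-100": 854, ">100": 233}
print(shares(published, 3507))  # {'<=10': 69.0, '11-100': 24.4, '>100': 6.6}
```

The three bands are disjoint and sum to the 3,507 distinct lemmas, so their shares can be checked directly against the stated total.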

Lengths:

  • Lemma lengths range from one character up to a maximum of 19 characters.
  • The average length of the lemmas lies between five and six characters.

Availability:

Research activity:

  • Koch, P., Nuñez, G. V., Arias, E. G., Heumann, C., Schöffel, M., Häberlin, A., & Aßenmacher, M. (2023). A tailored Handwritten-Text-Recognition System for Medieval Latin. arXiv preprint arXiv:2308.09368.

Crowdsourced air traffic data from The OpenSky Network 2020

Publication Date: 2022
Creators: Olive, Xavier; Strohmeier, Martin; Lübbe, Jannis

The data in this dataset is derived and cleaned from the full OpenSky dataset to illustrate the development of air traffic during the COVID-19 pandemic. It spans all flights seen by the network’s more than 2500 members since 1 January 2019. More data will be periodically included in the dataset until the end of the COVID-19 pandemic.

U.S.-Mexico Border Surveillance Data

Publication Date: 2024
Creators: Electronic Frontier Foundation (EFF)

This dataset includes the locations of Customs & Border Patrol surveillance towers, proposed tower locations, and automated license plate readers. There is an accompanying blog post and map.

AudioSet dataset

Publication Date: 2017
Creators: Google

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

Airwar

Publication Date: 2022
Creators: Yemen Data Project

The dataset lists the date of incident, geographical location, type of target, target category and sub-category, and, where known, time of day. Each incident indicates a stated number of air raids, which in turn may comprise multiple air strikes. It is not possible to generate an average number of air strikes per air raid, as these vary greatly, from a couple of air strikes up to several dozen per air raid. YDP’s dataset records the unverified number of individual air strikes that constitute a single recorded air raid.

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don’t show up in places they shouldn’t. This repo contains a list of words that we use to filter results from our autocomplete server and recommendation engine.

News Homepage Archive

Publication Date: 2019
Creators: Jones, Nick

This project aims to provide a visual representation of how different media organizations cover various topics. Screenshots of the homepages of five different news organizations are taken once per hour and made public thereafter. Screenshots are available for every hour starting from January 1, 2019. Currently, the only websites being tracked are:
nytimes.com;
washingtonpost.com;
cnn.com;
wsj.com;
foxnews.com;
