Resources by Stefan

ConceptNet

Creators: ConceptNet
Publication Date: 2021

ConceptNet aims to give computers access to common-sense knowledge, the kind of information that ordinary people know but usually leave unstated. It is a semantic network that represents things computers should know about the world, especially for the purpose of understanding text written by people. Its “concepts” are represented using words and phrases from many different natural languages. Unlike similar projects, it is not limited to a single language such as English. It expresses over 13 million links between these concepts and makes the whole data set available under a Creative Commons license. ConceptNet is structured as a graph in which nodes represent concepts (words or phrases) and edges represent the relationships between them. Each edge is labeled with a relation type, such as “IsA,” “PartOf,” or “RelatedTo,” indicating the nature of the relationship. The dataset is organized into sub-datasets by language and relation type, so users can work with the subsets relevant to their applications.
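
The graph structure described above can be illustrated with a minimal Python sketch. It assumes edges are held as plain (start, relation, end) triples. The triples, the `build_index` and `neighbors` helpers, and the in-memory layout are all invented for this example and are not ConceptNet's own storage format or API.

```python
from collections import defaultdict

# Illustrative edges only: these triples are made up for the example,
# not copied from the ConceptNet data.
EDGES = [
    ("dog", "IsA", "animal"),
    ("wheel", "PartOf", "car"),
    ("dog", "RelatedTo", "bark"),
    ("chien", "RelatedTo", "dog"),
]


def build_index(edges):
    """Group outgoing (relation, end) pairs by their start concept."""
    index = defaultdict(list)
    for start, relation, end in edges:
        index[start].append((relation, end))
    return index


def neighbors(index, concept, relation=None):
    """Return concepts linked from `concept`, optionally filtered by relation type."""
    return [
        end
        for rel, end in index.get(concept, [])
        if relation is None or rel == relation
    ]


index = build_index(EDGES)
print(neighbors(index, "dog"))
print(neighbors(index, "dog", "IsA"))
```

Filtering on the relation label mirrors the idea of working with a sub-dataset restricted to one relation type.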

O Say Can You See: Early Washington, D.C., Law and Family Project

Creators: O Say Can You See Project
Publication Date: 2024

This site documents the challenge to slavery and the quest for freedom in early Washington, D.C., by collecting, digitizing, making accessible, and analyzing freedom suits filed between 1800 and 1862, as well as tracing the multigenerational family networks they reveal. The project encompasses hundreds of freedom suits from the Circuit Court for the District of Columbia, Maryland state courts, and the U.S. Supreme Court, providing insight into the legal battles fought by enslaved individuals seeking freedom. By exploring the web of litigants, jurists, attorneys, and community members present in the case files, the project maps the relationships of early Washington, D.C., showing how each person is connected to others in the city and beyond. The dataset has a size of 3.0 kB and is organized into several interconnected components:

  • People: A database of individuals involved in the cases, including litigants, jurists, attorneys, and community members, with detailed profiles and social connections.

  • Families: Kinship and family networks of multigenerational Black, white, and mixed families, created using information derived from court records and genealogical research.

  • Cases: A collection of hundreds of freedom cases from various courts, providing detailed accounts of each legal battle.

  • Stories: Interactive analyses of the court cases, families, attorneys, and judges, focusing on historical or legal questions raised by these cases.

Super Bowl Ads

Creators: Superbowl Ads; FiveThirtyEight
Publication Date: 2021

This dataset lists ads from the 10 brands that had the most advertisements in Super Bowls from 2000 to 2020, according to data from superbowl-ads.com, with matching videos found on YouTube. Each advertisement is evaluated on seven characteristics: humor, early product display, patriotism, celebrity presence, danger, inclusion of animals, and use of sexual content. This granular assessment supports in-depth analysis of advertising strategies, and the included YouTube links give immediate access to the commercials for qualitative analysis. The dataset documents 233 advertisements and has a total size of 38.6 kB. It is organized as a CSV file with the following columns:

  • year: Year the advertisement aired.

  • brand: Brand of the advertiser, standardized to account for variations and sub-brands.

  • superbowl_ads_dot_com_url: Link to the advertisement’s entry on superbowl-ads.com.

  • youtube_url: Link to the corresponding YouTube video of the advertisement.

  • funny: Indicates if the ad was intended to be humorous (TRUE/FALSE).

  • show_product_quickly: Indicates if the product was shown within the first 10 seconds (TRUE/FALSE).

  • patriotic: Indicates if the ad had patriotic elements (TRUE/FALSE).

  • celebrity: Indicates if a celebrity appeared in the ad (TRUE/FALSE).

  • danger: Indicates if the ad involved elements of danger (TRUE/FALSE).

  • animals: Indicates if animals were featured in the ad (TRUE/FALSE).

  • use_sex: Indicates if sexual content was used to promote the product (TRUE/FALSE).
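
As a minimal sketch of how the columns above might be read, the following Python example parses a small inline sample with the standard `csv` module and counts how often each TRUE/FALSE characteristic is set. The sample rows, brand names, and URLs are invented placeholders, and the assumption that flags are stored as the literal strings TRUE and FALSE follows the column descriptions rather than an inspection of the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the column names listed above; all values are invented.
SAMPLE = """year,brand,superbowl_ads_dot_com_url,youtube_url,funny,show_product_quickly,patriotic,celebrity,danger,animals,use_sex
2018,BrandA,https://example.com/a1,https://example.com/v1,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE
2019,BrandA,https://example.com/a2,https://example.com/v2,FALSE,TRUE,TRUE,FALSE,FALSE,TRUE,FALSE
2020,BrandB,https://example.com/b1,https://example.com/v3,TRUE,TRUE,FALSE,FALSE,TRUE,TRUE,FALSE
"""

FLAGS = [
    "funny", "show_product_quickly", "patriotic",
    "celebrity", "danger", "animals", "use_sex",
]


def load_ads(handle):
    """Read rows, converting year to int and each flag column to bool."""
    rows = []
    for row in csv.DictReader(handle):
        row["year"] = int(row["year"])
        for flag in FLAGS:
            # Assumes flags are spelled TRUE/FALSE; comparison is case-insensitive.
            row[flag] = row[flag].strip().upper() == "TRUE"
        rows.append(row)
    return rows


def flag_counts(rows):
    """Count how many ads have each characteristic set."""
    counts = Counter()
    for row in rows:
        for flag in FLAGS:
            if row[flag]:
                counts[flag] += 1
    return counts


ads = load_ads(io.StringIO(SAMPLE))
counts = flag_counts(ads)
print(counts["funny"], counts["animals"])
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open(path, newline="")`.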

Stack Exchange Data

Creators: Stack Exchange Inc.
Publication Date: 2014

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is provided as a separate archive of XML files, compressed with 7-Zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. The dataset covers detailed records of questions, answers, comments, user profiles, and related metadata from numerous Stack Exchange communities. This breadth allows for in-depth analysis of community interactions, content evolution, and knowledge dissemination patterns. The dataset has a size of 92.3 GB and captures content from the inception of each Stack Exchange site up to the date of the specific data dump. For example, the September 2023 release includes data up to that month. Each archive contains several XML files representing different data tables, including:

  • Posts.xml: Contains both questions and answers, with fields detailing post ID, creation date, score, body content, and related metadata.

  • Users.xml: Includes user information such as user ID, reputation, creation date, and profile details.

  • Comments.xml: Encompasses comments made on posts, including comment ID, post ID, user ID, and content.

  • Votes.xml: Records voting data on posts, detailing vote type, user ID, and timestamps.
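
Because the archives can be large, a streaming parser is one reasonable way to read them. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library on a small inline sample. The `<row>` element layout and the attribute names (`Id`, `PostTypeId`, `Score`, and so on) are assumptions made for illustration and should be checked against the schema notes shipped with the dump. The sample content is invented.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample; the row/attribute layout is an assumption about Posts.xml.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" CreationDate="2014-01-01T10:00:00.000" Score="5" Body="Question text" />
  <row Id="2" PostTypeId="2" CreationDate="2014-01-01T11:00:00.000" Score="3" Body="Answer text" />
  <row Id="3" PostTypeId="2" CreationDate="2014-01-02T09:30:00.000" Score="-1" Body="Another answer" />
</posts>
"""


def iter_rows(source):
    """Yield each <row> element's attributes as a dict, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # drop the element's data once it has been copied


def score_by_type(source):
    """Sum post scores per PostTypeId value."""
    totals = {}
    for row in iter_rows(source):
        kind = row["PostTypeId"]
        totals[kind] = totals.get(kind, 0) + int(row["Score"])
    return totals


totals = score_by_type(io.BytesIO(SAMPLE))
print(totals)
```

For a real archive, pass the path of the extracted XML file to `score_by_type` instead of the in-memory sample.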

U.S.-Mexico Border Surveillance Data

Creators: Electronic Frontier Foundation (EFF)
Publication Date: 2024

This dataset includes the precise locations of U.S. Customs and Border Protection (CBP) surveillance towers, proposed tower sites, automated license plate readers, aerostats (tethered surveillance balloons), and facial recognition systems at land ports of entry. There is an accompanying blog post and map. This mapping offers insight into the deployment and reach of surveillance infrastructure along the border. By making the data publicly available, the dataset facilitates research into the implications of surveillance practices for civil liberties and border communities. The dataset was last updated on November 21, 2024, reflects the state of surveillance infrastructure up to that date, and has a total size of 187.3 kB. It is organized into several categories, each detailing a specific type of surveillance technology:

  • Surveillance Towers: Locations and specifications of existing and proposed CBP surveillance towers.

  • Automated License Plate Readers (ALPRs): Positions of ALPR systems used to monitor vehicle movements across the border.

  • Aerostats: Details on tethered surveillance balloons employed for aerial monitoring.

  • Facial Recognition Systems: Information on the deployment of facial recognition technology at land ports of entry.

AudioSet dataset

Creators: Google
Publication Date: 2017

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. The dataset has a size of 19.0 kB and is divided into three primary subsets:

  • Evaluation Set: Contains 20,383 segments from distinct videos, ensuring at least 59 examples for each of the 527 sound classes used.

  • Balanced Training Set: Consists of 22,176 segments from distinct videos, selected to provide a balanced representation with at least 59 examples per class.

  • Unbalanced Training Set: Includes 2,042,985 segments from distinct videos, representing the remainder of the dataset.

Airwar

Creators: Yemen Data Project
Publication Date: 2022

The dataset details air raids conducted in Yemen, primarily focusing on the Saudi-led coalition’s operations from March 2015 to April 2022. Each entry lists the date of the incident, geographical location (down to the district level), type of target, target category and sub-category, and, where known, time of day. Each incident indicates a stated number of air raids, which in turn may comprise multiple air strikes. It is not possible to generate an average number of air strikes per air raid, as these vary greatly, from a couple of air strikes up to several dozen per air raid. YDP’s dataset records the unverified numbers of individual air strikes that constitute a single recorded air raid. This granularity allows for in-depth analysis of air raid patterns and their impacts. The dataset has a size of 17.0 kB and comprises 25,054 recorded air raids conducted by the Saudi-UAE-led coalition. It is structured with each record representing a single air raid incident. Key fields include:

  • Date of Incident: Specifies the exact date the air raid occurred.

  • Geographical Location: Details the governorate and district where the air raid took place.

  • Target Type: Describes the nature of the target, such as military sites, residential areas, or infrastructure facilities.

  • Target Category and Sub-category: Provides further classification of the target, offering more granular insight into the specific nature of the targeted site.

  • Time of Day: Indicates the time at which the air raid occurred, where such information is available.
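
The record layout above could be modeled roughly as follows. This is a sketch under stated assumptions: the `AirRaid` class, its field names, and the helper functions are hypothetical, and the sample records are placeholders rather than real entries from the dataset. The optional `time_of_day` field reflects that this value is only recorded where available.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AirRaid:
    """Hypothetical record type mirroring the key fields listed above."""
    incident_date: date
    governorate: str
    district: str
    target_type: str
    target_category: str
    target_subcategory: str
    time_of_day: Optional[str] = None  # only present where known


# Placeholder records for illustration; not real entries.
RAIDS = [
    AirRaid(date(2015, 3, 26), "Governorate A", "District 1", "Type X", "Category 1", "Sub 1", "Night"),
    AirRaid(date(2015, 3, 27), "Governorate A", "District 2", "Type Y", "Category 2", "Sub 2"),
    AirRaid(date(2016, 1, 5), "Governorate B", "District 3", "Type X", "Category 1", "Sub 3", "Morning"),
]


def raids_per_year(raids):
    """Count recorded raids per calendar year."""
    return Counter(raid.incident_date.year for raid in raids)


def missing_time_share(raids):
    """Fraction of records without a recorded time of day."""
    if not raids:
        return 0.0
    return sum(raid.time_of_day is None for raid in raids) / len(raids)


print(raids_per_year(RAIDS))
print(missing_time_share(RAIDS))
```

Checking the share of records with a missing time of day before any time-based analysis helps show how much of the data that analysis would actually cover.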

Africapolis data

Creators: OECD; SWAC
Publication Date: 2022

Africapolis has been designed to provide a much-needed standardised geospatial database on urbanisation dynamics in Africa, with the aim of making urban data in Africa comparable across countries and across time. In this version of Africapolis, the data for the 54 countries currently covered are available for the same base year, 2015, for the first time. In addition, Africapolis closes one major data gap by integrating 7,496 small towns and intermediary cities of between 10,000 and 300,000 inhabitants. Africapolis data is based on a large inventory of housing and population censuses, electoral registers and other official population sources, in some cases dating back to the beginning of the 20th century. The dataset has a size of 34.5 kB and contains the following information:

  • Spatial Data: Urban agglomerations are represented as polygon vector data, delineating the spatial extent of settlements. Each polygon’s size corresponds to the settled area, providing insights into urban sprawl and density.

  • Attribute Data: Each spatial unit is linked to demographic attributes, such as total population figures for specific years (e.g., 2015, 2020). This linkage enables analyses of population distribution and urban growth patterns.

  • Data Layers: The dataset includes multiple layers corresponding to different years, supporting temporal analyses of urbanization trends.
