
Stack Exchange Data

Creators: Stack Exchange Inc.
Publication Date: 2014

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory, and PostLinks. The dataset covers detailed records of questions, answers, comments, user profiles, and other related metadata from numerous Stack Exchange communities. This breadth allows for in-depth analysis of community interactions, content evolution, and knowledge-dissemination patterns. The dataset has a size of 92.3 GB and captures content from the inception of each Stack Exchange site up to the date of the specific data dump; the September 2023 release, for example, includes data up to that month. Structurally, the database is organized into individual archives for each Stack Exchange community. Each archive contains several XML files representing different data tables (a parsing sketch follows the list):

  • Posts.xml: Contains both questions and answers, with fields detailing post ID, creation date, score, body content, and related metadata.

  • Users.xml: Includes user information such as user ID, reputation, creation date, and profile details.

  • Comments.xml: Encompasses comments made on posts, including comment ID, post ID, user ID, and content.

  • Votes.xml: Records voting data on posts, detailing vote type, user ID, and timestamps.
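In the public dumps, every record is a single <row> element whose fields are stored as XML attributes, so each table can be streamed without loading the whole file into memory. A minimal Python sketch, assuming a site archive has already been extracted to Posts.xml (the PostTypeId convention of 1 = question, 2 = answer follows the published schema):

    # Stream records from an extracted Posts.xml; each <row> carries its
    # fields as attributes, and iterparse keeps memory flat on large files.
    import xml.etree.ElementTree as ET

    def iter_rows(path):
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "row":
                yield dict(elem.attrib)
                elem.clear()  # free the element once its attributes are copied

    for row in iter_rows("Posts.xml"):
        if row.get("PostTypeId") == "1":  # questions only
            print(row["Id"], row.get("CreationDate"), row.get("Score"))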

Super Bowl Ads

Creators: Superbowl Ads; FiveThirtyEight
Publication Date: 2021

This dataset contains a list of ads from the 10 brands that ran the most advertisements in Super Bowls from 2000 to 2020, according to data from superbowl-ads.com, with matching videos found on YouTube. Each advertisement is evaluated across seven defining characteristics: humor, early product display, patriotism, celebrity presence, danger elements, inclusion of animals, and use of sexual content. This granular assessment allows for in-depth analysis of advertising strategies. Links to the corresponding YouTube videos are also included, giving immediate access to the commercials for further qualitative analysis. There are 233 advertisements documented in the dataset, spanning 2000 to 2020, with a total size of 38.6 kB. Structurally, the dataset is organized as a CSV file with the following columns (a tallying sketch follows the list):

  • year: Year the advertisement aired.

  • brand: Brand of the advertiser, standardized to account for variations and sub-brands.

  • superbowl_ads_dot_com_url: Link to the advertisement’s entry on superbowl-ads.com.

  • youtube_url: Link to the corresponding YouTube video of the advertisement.

  • funny: Indicates if the ad was intended to be humorous (TRUE/FALSE).

  • show_product_quickly: Indicates if the product was shown within the first 10 seconds (TRUE/FALSE).

  • patriotic: Indicates if the ad had patriotic elements (TRUE/FALSE).

  • celebrity: Indicates if a celebrity appeared in the ad (TRUE/FALSE).

  • danger: Indicates if the ad involved elements of danger (TRUE/FALSE).

  • animals: Indicates if animals were featured in the ad (TRUE/FALSE).

  • use_sex: Indicates if sexual content was used to promote the product (TRUE/FALSE).
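Because the table is one small CSV with TRUE/FALSE flag columns, the characteristics can be tallied with the Python standard library alone. A minimal sketch; the file name superbowl_ads.csv is an assumption, while the column names come from the list above:

    # Tally how many of the 233 ads carry each boolean characteristic.
    import csv
    from collections import Counter

    FLAGS = ["funny", "show_product_quickly", "patriotic",
             "celebrity", "danger", "animals", "use_sex"]

    counts, total = Counter(), 0
    with open("superbowl_ads.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            for flag in FLAGS:
                if row[flag] == "TRUE":  # flags are stored as TRUE/FALSE strings
                    counts[flag] += 1

    for flag in FLAGS:
        print(f"{flag}: {counts[flag]}/{total} ads")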

Airwar

Creators: Yemen Data Project
Publication Date: 2022

The dataset collects data detailing air raids conducted in Yemen, primarily focusing on the Saudi-led coalition's operations from March 2015 to April 2022. Each entry records the date of the incident, the geographical location (down to the district level), the target type, the target category and sub-category, and, where known, the time of day. Each incident indicates a stated number of air raids, which in turn may comprise multiple air strikes. It is not possible to give an average number of air strikes per air raid, as these vary greatly, from a couple of air strikes up to several dozen per raid; YDP's dataset records the unverified numbers of individual air strikes that constitute each recorded air raid. This granularity allows for in-depth analysis of air raid patterns and their impacts. The dataset has a size of 17.0 kB and comprises 25,054 recorded air raids conducted by the Saudi-UAE-led coalition. It is structured with each record representing a single air raid incident. Key fields include (an aggregation sketch follows the list):

  • Date of Incident: Specifies the exact date the air raid occurred.

  • Geographical Location: Details the governorate and district where the air raid took place.

  • Target Type: Describes the nature of the target, such as military sites, residential areas, or infrastructure facilities.

  • Target Category and Sub-category: Provides further classification of the target, offering more granular insight into the specific nature of the targeted site.

  • Time of Day: Indicates the time at which the air raid occurred, where such information is available.
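The exact column headers are not given above, so the names in this Python sketch (a "governorate" column and a CSV export file name) are hypothetical placeholders for the fields just listed:

    # Count recorded air raids per governorate; "governorate" and the file
    # name are hypothetical placeholders for the published field names.
    import csv
    from collections import Counter

    raids = Counter()
    with open("yemen_air_raids.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            raids[row["governorate"]] += 1

    for governorate, n in raids.most_common(10):
        print(f"{governorate}: {n} air raids")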

AudioSet dataset

Creators: Google
Publication Date: 2017

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. The dataset has a size of 19.0 kB and is divided into three primary subsets (a parsing sketch follows the list):

  • Evaluation Set: Contains 20,383 segments from distinct videos, ensuring at least 59 examples for each of the 527 sound classes used.

  • Balanced Training Set: Consists of 22,176 segments from distinct videos, selected to provide a balanced representation with at least 59 examples per class.

  • Unbalanced Training Set: Includes 2,042,985 segments from distinct videos, representing the remainder of the dataset.
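In the released segment CSVs, each data line holds a YouTube video ID, the clip's start and end times in seconds, and a quoted, comma-separated list of label IDs, with header comment lines beginning with '#'. A minimal Python sketch under that assumption:

    # Parse an AudioSet segments CSV: YTID, start_seconds, end_seconds, "labels".
    import csv

    def iter_segments(path):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, skipinitialspace=True):
                if not row or row[0].startswith("#"):  # skip header comments
                    continue
                yield row[0], float(row[1]), float(row[2]), row[3].split(",")

    for ytid, start, end, labels in iter_segments("balanced_train_segments.csv"):
        print(ytid, start, end, labels[:3])
        break  # show just the first segment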

U.S.-Mexico Border Surveillance Data

Creators: Electronic Frontier Foundation (EFF)
Publication Date: 2024

This dataset includes the precise locations of Customs & Border Protection (CBP) surveillance towers, proposed tower sites, automated license plate readers, aerostats (tethered surveillance balloons), and facial recognition systems at land ports of entry; an accompanying blog post and map are available. This extensive mapping offers valuable insights into the deployment and reach of surveillance infrastructure along the border. By making this data publicly available, the database facilitates research into the implications of surveillance practices for civil liberties and border communities. The dataset was last updated on November 21, 2024; it reflects the state of surveillance infrastructure up to that date and has a total size of 187.3 kB. Structurally, the dataset is organized into several categories, each detailing a specific type of surveillance technology (a loading sketch follows the list):

  • Surveillance Towers: Locations and specifications of existing and proposed CBP surveillance towers.

  • Automated License Plate Readers (ALPRs): Positions of ALPR systems used to monitor vehicle movements across the border.

  • Aerostats: Details on tethered surveillance balloons employed for aerial monitoring.

  • Facial Recognition Systems: Information on the deployment of facial recognition technology at land ports of entry.
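The distribution format is not stated above; this Python sketch assumes a GeoJSON export in which each surveillance point is a Feature with a category property, so both the file name and the property name are hypothetical:

    # Group surveillance points by category from an assumed GeoJSON export.
    import json
    from collections import Counter

    with open("border_surveillance.geojson", encoding="utf-8") as f:
        collection = json.load(f)

    by_category = Counter(
        feature["properties"].get("category", "unknown")  # hypothetical property
        for feature in collection["features"]
    )
    for category, n in by_category.most_common():
        print(f"{category}: {n} locations")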

The Upworthy Research Archive

Creators: The Upworthy Research Archive
Publication Date: 2019

The Upworthy Research Archive is an open dataset of thousands of A/B tests of headlines conducted by Upworthy from January 2013 to April 2015. This repository includes the full data from the archive. The dataset's size is approximately 149.7 MB, and it includes 32,488 records of headline experiments, providing insight into how different headline variations affected user engagement. The dataset is structured as a time series of experiments, with each record detailing the performance metrics of a headline variation. This structure enables researchers to analyze the effectiveness of various headlines and understand user engagement patterns over time.
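A Python sketch of that per-experiment comparison, grouping headline variations by test and ranking them by click-through rate; the file name and the column names (clickability_test_id, headline, impressions, clicks) are assumptions about the published archive:

    # Compute click-through rate per headline variation, grouped by A/B test.
    import csv
    from collections import defaultdict

    tests = defaultdict(list)
    with open("upworthy_archive.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            impressions = int(row["impressions"])
            if impressions > 0:
                ctr = int(row["clicks"]) / impressions
                tests[row["clickability_test_id"]].append((ctr, row["headline"]))

    # Show the winning headline of one experiment.
    test_id, variations = next(iter(tests.items()))
    best_ctr, best_headline = max(variations)
    print(f"test {test_id}: best CTR {best_ctr:.3%} for {best_headline!r}")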

News Homepage Archive

Creators: Jones, Nick
Publication Date: 2019

This project aims to provide a visual representation of how different media organizations cover various topics. Screenshots of the homepages of five news organizations are taken once per hour and made public thereafter. For each website, this amounts to 24 screenshots per day; over a year, this results in approximately 8,760 screenshots per website. Screenshots are available at every hour starting from January 1, 2019. The size of the dataset is 1.8 MB. Currently, the only websites being tracked are:
nytimes.com;
washingtonpost.com;
cnn.com;
wsj.com;
foxnews.com.
By capturing hourly screenshots, this dataset offers a unique visual chronicle of news presentation, allowing for analysis of editorial choices, headline prominence, and the evolution of news stories across different media outlets. The dataset is organized hierarchically by website name and screenshot timestamp: each sub-dataset corresponds to a specific news website and contains a chronological collection of its homepage screenshots. This structure facilitates targeted analysis of individual news outlets over time.
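The exact directory and file naming is not specified above, so this Python sketch assumes a hypothetical screenshots/<site>/<timestamp>.png layout and lists one outlet's captures in order:

    # Walk a hypothetical screenshots/<site>/<timestamp>.png layout and list
    # one outlet's homepage captures chronologically.
    from pathlib import Path

    site = Path("screenshots") / "nytimes.com"   # one sub-dataset per website

    for shot in sorted(site.glob("*.png")):
        # File stems are assumed to be sortable timestamps, e.g. 2019-01-01T00
        print(shot.stem)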

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

Creators: Shutterstock
Publication Date: 2019

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don't show up in places they shouldn't. This repo, published in 2019, contains a list of words that we use to filter results from our autocomplete server and recommendation engine. The dataset encompasses offensive terms in multiple languages. It is open for contributions, allowing users to add or refine entries, particularly in non-English languages, enhancing its comprehensiveness and applicability across diverse cultural contexts. The exact number of entries varies by language; for instance, the English list contains 403 entries. In total, the dataset has a size of 25.7 kB. The data is organized into separate files for each language, with each file containing a list of offensive words in that particular language. For example, the English words are listed in the 'en' file, German words in the 'de' file, and so on, which allows targeted, language-specific content filtering. Each sub-dataset (language file) is a plain text file with one offensive term per line, facilitating easy integration into various text processing pipelines.
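Since each language file is plain text with one term per line, a keyword filter takes only a few lines of Python. A minimal sketch; the 'en' file name comes from the description above, everything else is illustrative:

    # Load the English list ('en', one term per line) and screen keywords.
    def load_blocklist(path="en"):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    blocklist = load_blocklist()

    def is_clean(keyword):
        # Reject a keyword if any of its tokens appears on the blocklist.
        return not any(token in blocklist for token in keyword.lower().split())

    print(is_clean("sunset beach"))  # expected: True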
