Airwar

Creators: Yemen Data Project
Publication Date: 2022

The dataset documents air raids conducted in Yemen, primarily those of the Saudi-led coalition, from March 2015 to April 2022. Each record represents a single air raid incident and lists the date of the incident, the geographical location (down to the district level), the target type, the target category and sub-category, and, where known, the time of day. Each incident indicates a stated number of air raids, each of which may comprise multiple air strikes. An average number of air strikes per air raid cannot be derived, because the number varies greatly, from a couple of air strikes up to several dozen per air raid. YDP’s dataset records the unverified number of individual air strikes that make up each recorded air raid. This granularity allows for in-depth analysis of air raid patterns and their impacts. The dataset has a size of 17.0 kB and comprises 25,054 recorded air raids conducted by the Saudi-UAE-led coalition. Key fields include:

  • Date of Incident: Specifies the exact date the air raid occurred.

  • Geographical Location: Details the governorate and district where the air raid took place.

  • Target Type: Describes the nature of the target, such as military sites, residential areas, or infrastructure facilities.

  • Target Category and Sub-category: Provides further classification of the target, offering more granular insight into the specific nature of the targeted site.

  • Time of Day: Indicates the time at which the air raid occurred, where such information is available.

U.S.-Mexico Border Surveillance Data

Creators: Electronic Frontier Foundation (EFF)
Publication Date: 2024

This dataset includes the precise locations of Customs & Border Protection (CBP) surveillance towers, proposed tower sites, automated license plate readers, aerostats (tethered surveillance balloons), and facial recognition systems at land ports of entry. There is an accompanying blog post and map. This extensive mapping offers valuable insights into the deployment and reach of surveillance infrastructure along the border. By making this data publicly available, the dataset facilitates research into the implications of surveillance practices for civil liberties and border communities. The dataset was last updated on November 21, 2024, reflects the state of surveillance infrastructure up to that date, and has a total size of 187.3 kB. Structurally, the dataset is organized into several categories, each detailing a specific type of surveillance technology:

  • Surveillance Towers: Locations and specifications of existing and proposed CBP surveillance towers.

  • Automated License Plate Readers (ALPRs): Positions of ALPR systems used to monitor vehicle movements across the border.

  • Aerostats: Details on tethered surveillance balloons employed for aerial monitoring.

  • Facial Recognition Systems: Information on the deployment of facial recognition technology at land ports of entry.

News Homepage Archive

Creators: Jones, Nick
Publication Date: 2019

This project aims to provide a visual representation of how different media organizations cover various topics. Screenshots of the homepages of five news organizations are taken once per hour and then made public. For each website, this amounts to 24 screenshots per day, or approximately 8,760 screenshots per year. Screenshots are available for every hour from January 1, 2019 onward. The size of the dataset is 1.8 MB. Currently, the only websites being tracked are:

  • nytimes.com

  • washingtonpost.com

  • cnn.com

  • wsj.com

  • foxnews.com

By capturing hourly screenshots, this dataset offers a unique visual chronicle of news presentation, allowing for analysis of editorial choices, headline prominence, and the evolution of news stories across different media outlets. The dataset is organized hierarchically by website name and by the timestamp of each screenshot. Each sub-dataset corresponds to a specific news website and contains a chronological collection of its homepage screenshots. This structure facilitates targeted analysis of individual news outlets over time.
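The per-year screenshot count quoted above follows from simple arithmetic. The sketch below is only an illustration, assuming one capture at the top of every hour and a non-leap year such as 2019; the helper name `hourly_timestamps` is hypothetical and not part of the project.

```python
from datetime import datetime, timedelta


def hourly_timestamps(start, end):
    """Yield one timestamp per hour in the half-open range [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(hours=1)


# Assumption: captures happen exactly on the hour, every hour of 2019.
stamps_2019 = list(hourly_timestamps(datetime(2019, 1, 1), datetime(2020, 1, 1)))

# 365 days x 24 captures per day
print(len(stamps_2019))
```

In a leap year the same loop would produce 8,784 timestamps, which is why the yearly figure is best read as approximate.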

Creators: Shutterstock

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don’t show up in places they shouldn’t. This repo, published in 2019, contains a list of words that we use to filter results from our autocomplete server and recommendation engine. The dataset covers offensive terms in multiple languages. It is open for contributions, so users can add or refine entries, particularly in non-English languages, which improves its coverage across diverse cultural contexts. The number of entries varies by language; the English list, for instance, contains 403 entries. In total, the dataset has a size of 25.7 kB. The data is organized into one file per language: the English words are listed in the ‘en’ file, the German words in the ‘de’ file, and so on. This allows content filters to be applied per language. Each language file is a plain text file with one offensive term per line, which makes it easy to integrate into text processing pipelines.
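Because each language file holds one term per line, a filter can be built with a few lines of code. The following is a minimal sketch, not Shutterstock's actual filtering logic: the function names are hypothetical, the sample terms are placeholders rather than real list entries, and the matching rule (whole-word match for single words, exact match for the full suggestion) is an assumption chosen for simplicity.

```python
def load_blocklist(lines):
    """Build a set of lowercase terms from an iterable of lines (one term per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_suggestions(suggestions, blocklist):
    """Keep suggestions that neither equal a blocked term nor contain one as a whole word."""
    kept = []
    for suggestion in suggestions:
        lowered = suggestion.lower()
        words = set(lowered.split())
        if lowered in blocklist or words & blocklist:
            continue  # blocked: skip this suggestion
        kept.append(suggestion)
    return kept


# Placeholder lines standing in for the contents of a language file such as 'en'.
sample_lines = ["badword\n", "worse phrase\n", "\n"]
blocklist = load_blocklist(sample_lines)

result = filter_suggestions(["sunset beach", "Badword poster", "worse phrase"], blocklist)
print(result)
```

With a real file, `load_blocklist(open(path, encoding="utf-8"))` would read the terms directly, since a text file iterates line by line. Note that this simple rule catches multi-word terms only when they make up the entire suggestion.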

Large-scale CelebFaces Attributes (CelebA) Dataset

Creators: Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou
Publication Date: 2015

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images cover large pose variations and background clutter. CelebA offers large diversity, large quantity, and rich annotations: 10,177 identities, 202,599 face images, and, per image, 5 landmark locations and 40 binary attribute annotations. The attributes describe facial features and accessories, such as eyeglasses, smiling, or bangs. The five landmark points (eyes, nose, mouth corners) provided per image facilitate tasks like facial alignment. The wide range of poses, expressions, and occlusions reflects real-world conditions and enhances the robustness of models trained on this data. The dataset has a size of 25.3 kB and is organized in three main components:

  • Images:

    • In-the-Wild Images: Original images depicting celebrities in various environments and conditions.

    • Aligned and Cropped Images: Faces have been aligned and cropped to a consistent size, facilitating standardized analysis.

  • Annotations:

    • Landmark Locations: Coordinates for five key facial points (left eye, right eye, nose, left mouth corner, right mouth corner) per image.

    • Attribute Labels: Binary labels indicating the presence or absence of 40 distinct facial attributes for each image.

    • Identity Labels: Each image is associated with an identity label, linking it to one of the 10,177 unique individuals.

  • Evaluation Partitions:

    • The dataset is divided into training, validation, and test sets, enabling standardized evaluation of algorithms.
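The binary attribute labels described above can be turned into per-image dictionaries for analysis. The sketch below is an illustration under stated assumptions: it assumes a whitespace-separated text layout with a header row of attribute names followed by one row per image, and it assumes presence is encoded as 1 and absence as -1. The function name and the sample rows are hypothetical; check the attribute file in your copy of the dataset before relying on this layout.

```python
def parse_attribute_rows(header, rows):
    """Map each image filename to a dict of {attribute name: bool}.

    Assumes each row is '<filename> <v1> <v2> ...' with values 1 (present) or -1 (absent).
    """
    names = header.split()
    parsed = {}
    for row in rows:
        filename, *values = row.split()
        if len(values) != len(names):
            raise ValueError(f"{filename}: expected {len(names)} values, got {len(values)}")
        parsed[filename] = {name: int(value) == 1 for name, value in zip(names, values)}
    return parsed


# Hypothetical three-attribute excerpt; the real dataset has 40 attributes per image.
header = "Eyeglasses Smiling Bangs"
rows = ["000001.jpg -1 1 -1", "000002.jpg 1 1 -1"]

labels = parse_attribute_rows(header, rows)
print(labels["000001.jpg"])
```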

CVE List

Creators: CVE
Publication Date: 2024

The mission of the CVE® Program is to identify, define, and catalog publicly disclosed cybersecurity vulnerabilities. This dataset covers vulnerabilities identified from 1999 through June 25, 2024, providing a historical perspective on the evolution of cybersecurity threats over a span of 25 years. The program’s primary purpose is to standardize the identification of vulnerabilities across platforms and security tools, facilitating consistent and efficient communication within the cybersecurity community. As of June 25, 2024, the CVE List comprises 269,759 records with a size of 36.3 kB.

Each CVE entry includes several key components:

  • CVE Identifier (CVE ID): A unique identifier assigned to each vulnerability, following the format “CVE-YYYY-NNNN,” where “YYYY” denotes the year of identification and “NNNN” is a sequential number of four or more digits.

  • Description: A brief summary outlining the nature of the vulnerability, including affected software or hardware, potential impacts, and any known exploits.

  • References: Links to external resources such as security advisories, vendor bulletins, or detailed analyses that provide additional context or mitigation information.
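The CVE ID format described above lends itself to validation with a regular expression. This is a minimal sketch rather than an official parser: the function name `parse_cve_id` is hypothetical, and the pattern assumes a four-digit year followed by a sequence number of at least four digits.

```python
import re

# "CVE-", a four-digit year, "-", then a sequence number of four or more digits.
CVE_ID_PATTERN = re.compile(r"CVE-(\d{4})-(\d{4,})")


def parse_cve_id(cve_id):
    """Return (year, sequence_number) for a well-formed CVE ID, otherwise None."""
    match = CVE_ID_PATTERN.fullmatch(cve_id)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_cve_id("CVE-2014-0160"))
print(parse_cve_id("CVE-2021-44228"))
print(parse_cve_id("CVE-14-0160"))
```

Using `fullmatch` rather than `match` rejects strings with trailing characters, so an ID embedded in a longer sentence has to be extracted first.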

EndoMondo Fitness Tracking Data

Creators: Ni, Jianmo; Muhlstein, Larry; McAuley, Julian
Publication Date: 2019

This is a collection of workout logs from users of EndoMondo. It contains sequential sensor data such as GPS coordinates (latitude, longitude, altitude), heart rate measurements, speed, and distance, making it valuable for studying workout patterns, performance tracking, and personalized fitness recommendations. Additionally, it includes user metadata such as anonymized user IDs, gender, and sport type, along with contextual factors like weather conditions. The dataset has a size of approximately 2.9 GB and consists of 1,104 users with 253,020 recorded workouts.

The dataset covers multiple components:

  • User Information: Anonymized user identifiers and gender.

  • Workout Details: Each workout log includes sport type, sequential data for GPS coordinates (latitude, longitude, altitude) with timestamps, heart rate measurements, and derived metrics such as speed and distance.
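Derived metrics such as distance and speed can be recomputed from the sequential GPS coordinates and timestamps. The sketch below uses the haversine great-circle formula as one common approach; it is not necessarily how the dataset's own speed and distance values were produced. The function names and the `(timestamp_seconds, latitude, longitude)` tuple layout are assumptions made for illustration, and altitude is ignored.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def track_summary(points):
    """Return (total distance in m, average speed in m/s) for a list of
    (timestamp_seconds, latitude, longitude) tuples ordered by time."""
    if len(points) < 2:
        return 0.0, 0.0
    distance = sum(
        haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(points, points[1:])
    )
    elapsed = points[-1][0] - points[0][0]
    speed = distance / elapsed if elapsed > 0 else 0.0
    return distance, speed


# Hypothetical two-point track along the equator, 100 seconds apart.
distance, speed = track_summary([(0, 0.0, 0.0), (100, 0.0, 0.001)])
print(round(distance, 1), round(speed, 2))
```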

CrowdTangle Platform and API

Creators: Garmur, Matt; King, Gary; Mukerjee, Zagreb; Persily, Nate; Silverman, Brandon
Publication Date: 2019

This document describes the CrowdTangle API and user interface being provided to researchers by Social Science One under its collaboration framework with Facebook. CrowdTangle is a content discovery and analytics platform designed to give content creators the data and insights they need to succeed. This dataset enables users to monitor public content interactions, track trends, and identify influential accounts. The CrowdTangle API surfaces stories, along with data to measure their social performance and identify influencers. This codebook describes the data’s scope, structure, and fields.

CrowdTangle’s dataset covers all public posts made since 2014 by pages, groups, or verified profiles that have either surpassed 100,000 likes or been tracked by any active API user.

Key features include:

  • Content Discovery: Access to real-time data on trending posts, facilitating the identification of viral content and emerging topics.

  • Performance Analytics: Metrics such as likes, shares, comments, and interaction rates, allowing for the assessment of content engagement.

  • Influencer Identification: Tools to pinpoint accounts with significant influence within specific niches or broader audiences.
