
AudioSet dataset

Creators: Google
Publication Date: 2017

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. The dataset has a size of 19.0 kB and is divided into three primary subsets:

  • Evaluation Set: Contains 20,383 segments from distinct videos, ensuring at least 59 examples for each of the 527 sound classes used.

  • Balanced Training Set: Consists of 22,176 segments from distinct videos, selected to provide a balanced representation with at least 59 examples per class.

  • Unbalanced Training Set: Includes 2,042,985 segments from distinct videos, representing the remainder of the dataset.
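
As a minimal sketch of working with these subsets: AudioSet's released segment lists are CSV files pairing a YouTube ID and time window with ontology label IDs. The exact header layout below is an assumption modeled on the published CSVs, so verify it against the files you download.

```python
import csv
import io

def parse_segments(csv_text):
    """Parse an AudioSet-style segment list (YTID, start, end, label IDs).

    Assumes comment/header lines start with '#' and that labels are a
    comma-separated, quoted list of ontology IDs.
    """
    segments = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip comment and header lines
        segments.append({
            "ytid": row[0],
            "start": float(row[1]),
            "end": float(row[2]),
            "labels": row[3].split(","),
        })
    return segments

# Hypothetical two-line excerpt in the assumed format
sample = (
    "# YTID, start_seconds, end_seconds, positive_labels\n"
    '--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/m/05zppz"\n'
)
print(parse_segments(sample))
```

Each parsed record then identifies one 10-second clip and the sound classes annotated for it.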

Airwar

Creators: Yemen Data Project
Publication Date: 2022

The dataset collects data detailing air raids conducted in Yemen, focusing primarily on the Saudi-led coalition’s operations from March 2015 to April 2022. Each entry records the date of the incident, the geographical location (down to the district level), the target type, the target category and sub-category, and, where known, the time of day; this granularity allows for in-depth analysis of air raid patterns and their impacts. Each incident indicates a stated number of air raids, which in turn may comprise multiple air strikes. YDP’s dataset records the unverified numbers of individual air strikes that constitute a single recorded air raid; no meaningful average number of air strikes per air raid can be derived, as the counts vary greatly, from a couple of air strikes up to several dozen per air raid. The dataset has a size of 17.0 kB and comprises 25,054 recorded air raids conducted by the Saudi-UAE-led coalition. It is structured with each record representing a single air raid incident. Key fields include:

  • Date of Incident: Specifies the exact date the air raid occurred.

  • Geographical Location: Details the governorate and district where the air raid took place.

  • Target Type: Describes the nature of the target, such as military sites, residential areas, or infrastructure facilities.

  • Target Category and Sub-category: Provides further classification of the target, offering more granular insight into the specific nature of the targeted site.

  • Time of Day: Indicates the time at which the air raid occurred, where such information is available.
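
A record structure like the one described lends itself to simple aggregation. The sketch below tallies raids per governorate; the field names and example rows are hypothetical stand-ins, not the YDP export's actual column names.

```python
from collections import Counter

# Hypothetical records mirroring the key fields described above;
# the real column names in the YDP spreadsheet may differ.
raids = [
    {"date": "2015-03-26", "governorate": "Sanaa", "target_category": "Military"},
    {"date": "2015-03-27", "governorate": "Sanaa", "target_category": "Infrastructure"},
    {"date": "2015-03-27", "governorate": "Saada", "target_category": "Military"},
]

def raids_per_governorate(records):
    """Tally recorded air raid incidents by governorate."""
    return Counter(r["governorate"] for r in records)

print(raids_per_governorate(raids))  # Counter({'Sanaa': 2, 'Saada': 1})
```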

Africapolis data

Creators: OECD; SWAC
Publication Date: 2022

Africapolis has been designed to provide a much-needed standardised and geospatial database on urbanisation dynamics in Africa, with the aim of making urban data in Africa comparable across countries and across time. This version of Africapolis is the first in which the data for the 54 countries currently covered are available for the same base year, 2015. In addition, Africapolis closes one major data gap by integrating 7,496 small towns and intermediary cities of between 10,000 and 300,000 inhabitants. Africapolis data is based on a large inventory of housing and population censuses, electoral registers and other official population sources, in some cases dating back to the beginning of the 20th century. The dataset has a size of 34.5 kB and contains the following information:

  • Spatial Data: Urban agglomerations are represented as polygon vector data, delineating the spatial extent of settlements. Each polygon’s size corresponds to the settled area, providing insights into urban sprawl and density.

  • Attribute Data: Each spatial unit is linked to demographic attributes, such as total population figures for specific years (e.g., 2015, 2020). This linkage enables analyses of population distribution and urban growth patterns.

  • Data Layers: The dataset includes multiple layers corresponding to different years, supporting temporal analyses of urbanization trends.
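
Because each polygon carries both a settled area and linked population figures, density follows directly from the attribute data. The field names below are illustrative, not the official Africapolis schema.

```python
# Hypothetical attribute records for two agglomerations; the field
# names are illustrative stand-ins for the Africapolis attribute table.
agglomerations = [
    {"name": "Agglomeration A", "area_km2": 120.0, "pop_2015": 600_000},
    {"name": "Agglomeration B", "area_km2": 35.0, "pop_2015": 280_000},
]

def density(record, year=2015):
    """Population density (inhabitants per km2) for one spatial unit."""
    return record[f"pop_{year}"] / record["area_km2"]

for a in agglomerations:
    print(a["name"], round(density(a)))  # 5000 and 8000 inhabitants/km2
```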

Creators: Shutterstock
Publication Date: 2019

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don’t show up in places they shouldn’t. This repo, published in 2019, contains a list of words that we use to filter results from our autocomplete server and recommendation engine. The dataset encompasses offensive terms in multiple languages. It is open for contributions, allowing users to add or refine entries, particularly in non-English languages, enhancing its comprehensiveness and applicability across diverse cultural contexts. The exact number of entries varies by language; for instance, the English list contains 403 entries. In total, the dataset has a size of 25.7 kB. The data is organized into separate files for each language, with each file containing a list of offensive words in that particular language. For example, the English words are listed in the ‘en’ file, German words in the ‘de’ file, and so on. This allows targeted use in language-specific content filtering systems. Each sub-dataset (language file) consists of a plain text file with one offensive term per line, facilitating easy integration into various text processing pipelines.
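
Since each language file is plain text with one term per line, integrating it into a filter is straightforward. The sketch below shows one possible token-level check; the placeholder terms stand in for a file's real contents.

```python
def load_filter_list(text):
    """Build a lookup set from a one-term-per-line file; blank lines ignored."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def contains_blocked_term(query, blocked):
    """Token-level check, as an autocomplete filter might apply it."""
    return any(token.lower() in blocked for token in query.split())

# Placeholder stand-in for the contents of a per-language file such as 'en'
en_list = "badword\nanotherbadword\n"
blocked = load_filter_list(en_list)

print(contains_blocked_term("search badword here", blocked))  # True
print(contains_blocked_term("harmless query", blocked))       # False
```

A real deployment would likely also normalize punctuation and spelling variants, which a plain set lookup does not catch.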

News Homepage Archive

Creators: Jones, Nick
Publication Date: 2019

This project aims to provide a visual representation of how different media organizations cover various topics. Screenshots of the homepages of five different news organizations are taken once per hour and made public thereafter. For each website, this amounts to 24 screenshots per day; over a year, this results in approximately 8,760 screenshots per website. Screenshots are available at every hour starting from January 1, 2019. The size of the dataset is 1.8 MB. Currently, the only websites being tracked are:
  • nytimes.com
  • washingtonpost.com
  • cnn.com
  • wsj.com
  • foxnews.com
By capturing hourly screenshots, this dataset offers a unique visual chronicle of news presentation, allowing for analysis of editorial choices, headline prominence, and the evolution of news stories across different media outlets. The dataset is organized hierarchically based on the website name and timestamp of each screenshot. Each sub-dataset corresponds to a specific news website, containing a chronological collection of its homepage screenshots. This structure facilitates targeted analysis of individual news outlets over time.
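
The website/timestamp hierarchy can be sketched as a path convention. The `site/YYYY/MM/DD/HH.png` layout below is an assumed illustration of that organization, not the archive's documented directory structure.

```python
from datetime import datetime, timedelta

def screenshot_path(site, ts):
    """Derive a storage path from website name and capture timestamp.

    The site/YYYY/MM/DD/HH.png layout is an assumed convention that
    mirrors the hierarchical organization described above.
    """
    return f"{site}/{ts:%Y/%m/%d/%H}.png"

start = datetime(2019, 1, 1)
hourly = [screenshot_path("nytimes.com", start + timedelta(hours=h))
          for h in range(24)]

print(hourly[0])    # nytimes.com/2019/01/01/00.png
print(len(hourly))  # 24 screenshots per site per day
```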

The Upworthy Research Archive

Creators: The Upworthy Research Archive
Publication Date: 2019

The Upworthy Research Archive is an open dataset of thousands of A/B tests of headlines conducted by Upworthy from January 2013 to April 2015. This repository includes the full data from the archive. The dataset’s size is approximately 149.7 MB. It includes 32,488 records of headline experiments, providing insights into how different headline variations impacted user engagement. The dataset is structured as a time series of experiments, with each record detailing the performance metrics of different headline variations. This structure enables researchers to analyze the effectiveness of various headlines and understand user engagement patterns over time.
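
A minimal sketch of comparing headline variants from such records, using click-through rate. The `impressions` and `clicks` field names are hypothetical; the archive's actual column names may differ.

```python
# Hypothetical records for one headline experiment; field names are
# illustrative, not the archive's exact schema.
experiment = [
    {"headline": "Variant A", "impressions": 1000, "clicks": 14},
    {"headline": "Variant B", "impressions": 1000, "clicks": 23},
]

def best_headline(records):
    """Pick the variant with the highest click-through rate (clicks/impressions)."""
    return max(records, key=lambda r: r["clicks"] / r["impressions"])

winner = best_headline(experiment)
print(winner["headline"])  # Variant B (CTR 0.023 vs 0.014)
```

A serious analysis would also test whether the difference is statistically significant rather than comparing raw rates.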

Steam Video Game Database

Creators: Beliaev, Volodymyr
Publication Date: 2023

This dataset aggregates information on all games available on the Steam platform, enriched with additional data from sources like Steam Spy, GameFAQs, Metacritic, IGDB, and HowLongToBeat (HLTB). It is particularly valuable for researchers, developers, and enthusiasts interested in analyzing various aspects of video games, such as pricing, ratings, and gameplay duration. Each entry provides detailed data, including game identifiers, store URLs, promotional content, user scores, release dates, descriptions, pricing, supported platforms, developers, publishers, available languages, genres, tags, and achievements. The dataset reflects the state of the Steam catalog as of 2023 and has a size of 7.1 kB.
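
With per-game fields for price, score, and playtime, a simple filter illustrates the kind of analysis the entry describes. The records and field names below are hypothetical stand-ins for the dataset's actual schema.

```python
# Hypothetical entries; field names are illustrative, not the
# dataset's exact schema.
games = [
    {"name": "Game A", "price_usd": 19.99, "user_score": 92, "hltb_main_hours": 12.0},
    {"name": "Game B", "price_usd": 59.99, "user_score": 78, "hltb_main_hours": 40.0},
    {"name": "Game C", "price_usd": 9.99, "user_score": 88, "hltb_main_hours": 6.5},
]

def value_picks(records, max_price, min_score):
    """Games at or under a price threshold with at least a given user score."""
    return [g["name"] for g in records
            if g["price_usd"] <= max_price and g["user_score"] >= min_score]

print(value_picks(games, max_price=20.0, min_score=85))  # ['Game A', 'Game C']
```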

Large-scale CelebFaces Attributes (CelebA) Dataset

Creators: Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou
Publication Date: 2015

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including 10,177 identities, 202,599 face images, 5 landmark locations per image, and 40 binary attribute annotations per image. Each image in the dataset captures various facial features and accessories, such as eyeglasses, smiling, or bangs. Additionally, five landmark points (e.g., eyes, nose, mouth corners) are provided per image, facilitating tasks like facial alignment. A wide range of poses, expressions, and occlusions is also included, reflecting real-world conditions and enhancing the robustness of models trained on this data. The dataset has a size of 25.3 kB and is organized into three main components:

  • Images:

    • In-the-Wild Images: Original images depicting celebrities in various environments and conditions.

    • Aligned and Cropped Images: Faces have been aligned and cropped to a consistent size, facilitating standardized analysis.

  • Annotations:

    • Landmark Locations: Coordinates for five key facial points (left eye, right eye, nose, left mouth corner, right mouth corner) per image.

    • Attribute Labels: Binary labels indicating the presence or absence of 40 distinct facial attributes for each image.

    • Identity Labels: Each image is associated with an identity label, linking it to one of the 10,177 unique individuals.

  • Evaluation Partitions:

    • The dataset is divided into training, validation, and test sets, enabling standardized evaluation of algorithms.
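
The binary attribute labels can be read with a small parser. The layout assumed below (an image count, a line of attribute names, then one ±1 row per image) is modeled on the released attribute file, but verify it against the copy you obtain.

```python
def parse_attr_file(text):
    """Parse a CelebA-style attribute file.

    Assumed layout: line 1 = image count, line 2 = attribute names,
    then one row per image with the filename followed by +1/-1 flags.
    """
    lines = text.strip().splitlines()
    names = lines[1].split()
    table = {}
    for row in lines[2:]:
        parts = row.split()
        flags = [int(v) for v in parts[1:]]
        table[parts[0]] = {n: f == 1 for n, f in zip(names, flags)}
    return table

# Tiny hypothetical excerpt with two of the 40 attributes
sample = (
    "2\n"
    "Eyeglasses Smiling\n"
    "000001.jpg -1 1\n"
    "000002.jpg 1 -1\n"
)
attrs = parse_attr_file(sample)
print(attrs["000001.jpg"]["Smiling"])     # True
print(attrs["000002.jpg"]["Eyeglasses"])  # True
```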
