
Africapolis data

Creators: OECD; SWAC
Publication Date: 2022

Africapolis has been designed to provide a much-needed standardised and geospatial database on urbanisation dynamics in Africa, with the aim of making urban data in Africa comparable across countries and across time. This version of Africapolis is the first in which the data for the 54 countries currently covered are available for the same base year, 2015. In addition, Africapolis closes a major data gap by integrating 7,496 small towns and intermediary cities of between 10,000 and 300,000 inhabitants. Africapolis data is based on a large inventory of housing and population censuses, electoral registers and other official population sources, in some cases dating back to the beginning of the 20th century. The dataset has a size of 34.5 kB and contains the following information (a short loading sketch in Python follows the list):

  • Spatial Data: Urban agglomerations are represented as polygon vector data, delineating the spatial extent of settlements. Each polygon’s size corresponds to the settled area, providing insights into urban sprawl and density.

  • Attribute Data: Each spatial unit is linked to demographic attributes, such as total population figures for specific years (e.g., 2015, 2020). This linkage enables analyses of population distribution and urban growth patterns.

  • Data Layers: The dataset includes multiple layers corresponding to different years, supporting temporal analyses of urbanization trends.
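Assuming the agglomeration polygons are distributed as a single vector layer (for example a GeoPackage or shapefile), a minimal geopandas sketch for loading and filtering them might look as follows; the file name and the attribute column names used here are illustrative assumptions, not the published schema:

    import geopandas as gpd

    # Load the agglomeration polygons (hypothetical file name).
    agglomerations = gpd.read_file("africapolis_2015.gpkg")
    print(agglomerations.crs)                   # coordinate reference system
    print(agglomerations.columns.tolist())      # available attribute fields

    # Keep small towns and intermediary cities (10,000-300,000 inhabitants);
    # the column name "Population_2015" is an assumption for illustration.
    small_and_intermediary = agglomerations[
        agglomerations["Population_2015"].between(10_000, 300_000)
    ]

    # Settled area per agglomeration in square km, using an equal-area projection.
    small_and_intermediary = small_and_intermediary.to_crs("ESRI:54009")
    print((small_and_intermediary.geometry.area / 1e6).describe())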

3 Million Russian troll tweets

Creators: FiveThirtyEight; Warren, Patrick; Linvill, Darren
Publication Date: 2018

This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian “troll factory” and a defendant in an indictment filed by the Justice Department in February 2018 as part of special counsel Robert Mueller’s Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. Each entry includes detailed information such as the tweet’s content, author handle, language, publication date, and engagement metrics (e.g., number of followers, following count). The dataset provides classifications for each account, indicating the thematic focus (e.g., Right Troll, Left Troll, News Feed), as coded by researchers Darren Linvill and Patrick Warren. It has a total size of 507.2 kB.
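Assuming the release is distributed as CSV files, a minimal pandas sketch for exploring it could look like the following; the file name and the column names (publish_date, account_category) mirror the description above but should be verified against the actual files:

    import pandas as pd

    # Load one of the tweet CSV files (hypothetical file name).
    tweets = pd.read_csv("IRAhandle_tweets_1.csv", parse_dates=["publish_date"])

    # Volume of tweets per thematic category (Right Troll, Left Troll, News Feed, ...).
    print(tweets["account_category"].value_counts())

    # Posting activity over time, resampled to monthly counts.
    monthly = tweets.set_index("publish_date").resample("MS").size()
    print(monthly.loc["2015":"2017"])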

Computer generated building footprints for the United States

Creators: Microsoft Bing Maps Team
Publication Date: 2018

Microsoft Maps is releasing a country-wide open building footprints dataset for the United States. This dataset contains 129,591,852 computer-generated building footprints derived using Microsoft's computer vision algorithms on satellite imagery. Building footprints were extracted using deep neural networks for semantic segmentation, followed by polygonization to convert detected building pixels into vector shapes. This data is freely available for download and use. The dataset is organized by U.S. state and provided in GeoJSON format. Each GeoJSON file contains polygon geometries representing building footprints, accompanied by metadata such as the capture date of the underlying imagery. Notably, footprints within specific regions are based on imagery from 2019-2020, accounting for approximately 73,250,745 buildings.
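Since each state is shipped as a GeoJSON file of polygons, a minimal geopandas sketch for reading one file and summarising footprint sizes could look like this; the file name "Vermont.geojson" and the exact metadata properties are assumptions to adjust to the actual release:

    import geopandas as gpd

    # Read one per-state GeoJSON file (hypothetical file name).
    footprints = gpd.read_file("Vermont.geojson")
    print(len(footprints), "building footprints loaded")

    # Project to an equal-area CRS so polygon areas are in square metres.
    footprints = footprints.to_crs("ESRI:54009")
    footprints["area_m2"] = footprints.geometry.area
    print(footprints["area_m2"].describe())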

IRS E-File Bucket

Creators: Internal Revenue Service
Publication Date: 2016

This bucket contains a mirror of the IRS e-file release as of December 31, 2016, consisting of annual information returns (Form 990) submitted by tax-exempt organizations in the United States. The data helps in understanding the financial and operational aspects of nonprofit organizations. Each Form 990 provides insights into an organization’s mission, programs, and governance structures. The forms include detailed financial data, such as revenues, expenses, assets, and liabilities, offering a clear view of an organization’s financial health. As mandated by law, these forms are publicly accessible, promoting transparency and allowing stakeholders to make informed decisions. In total, the dataset has a size of 5.3 kB and is divided into individual Form 990 filings, each corresponding to a specific tax-exempt organization (a short parsing sketch in Python follows the list). Each filing includes:

  • Organizational Details: Name, Employer Identification Number (EIN), address, and mission statement.

  • Financial Information: Detailed breakdowns of revenues (e.g., contributions, grants, program service revenue), expenses (e.g., salaries, grants, operational costs), assets, and liabilities.

  • Governance and Compliance: Information on board members, key employees, governance policies, and compliance with tax regulations.
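The filings in the e-file release are XML documents, so a minimal sketch for extracting a few header fields from one filing might look like this; the namespace and element paths reflect common IRS e-file schemas but vary across schema versions, so treat them as assumptions to verify:

    import xml.etree.ElementTree as ET

    NS = {"efile": "http://www.irs.gov/efile"}   # assumed e-file XML namespace

    # Parse a single Form 990 filing (hypothetical local copy).
    tree = ET.parse("example_990_filing.xml")
    root = tree.getroot()

    ein = root.findtext(".//efile:Filer/efile:EIN", namespaces=NS)
    name = root.findtext(".//efile:Filer/efile:BusinessName/efile:BusinessNameLine1Txt",
                         namespaces=NS)
    tax_year = root.findtext(".//efile:TaxYr", namespaces=NS)

    print(ein, name, tax_year)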

Large-scale CelebFaces Attributes (CelebA) Dataset

Creators: Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou
Publication Date: 2015

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including 10,177 identities, 202,599 face images, 5 landmark locations per image, and 40 binary attribute annotations per image. Each image in the dataset captures various facial features and accessories, such as eyeglasses, smiling, or bangs. Additionally, five landmark points (e.g., eyes, nose, mouth corners) are provided per image, facilitating tasks like facial alignment. A wide range of poses, expressions, and occlusions is also included, reflecting real-world conditions and enhancing the robustness of models trained on this data. The dataset has a size of 25.3 kB and is organized into three main components (a short annotation-loading sketch in Python follows the list):

  • Images:

    • In-the-Wild Images: Original images depicting celebrities in various environments and conditions.

    • Aligned and Cropped Images: Faces have been aligned and cropped to a consistent size, facilitating standardized analysis.

  • Annotations:

    • Landmark Locations: Coordinates for five key facial points (left eye, right eye, nose, left mouth corner, right mouth corner) per image.

    • Attribute Labels: Binary labels indicating the presence or absence of 40 distinct facial attributes for each image.

    • Identity Labels: Each image is associated with an identity label, linking it to one of the 10,177 unique individuals.

  • Evaluation Partitions:

    • The dataset is divided into training, validation, and test sets, enabling standardized evaluation of algorithms.
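Assuming the attribute annotations are shipped in the conventional "list_attr_celeba.txt" layout (first line: image count, second line: the 40 attribute names, then one row per image with +1/-1 values), a minimal loading sketch could look like this; verify the layout against the downloaded files:

    import pandas as pd

    with open("list_attr_celeba.txt") as f:
        n_images = int(f.readline())            # number of annotated images
        attribute_names = f.readline().split()  # the 40 attribute names
        rows = [line.split() for line in f]     # per image: filename + 40 labels

    attrs = pd.DataFrame(rows, columns=["image_id"] + attribute_names)
    attrs = attrs.set_index("image_id").astype(int).replace(-1, 0)  # map -1/+1 to 0/1

    print(attrs.shape)                # expected: (202599, 40)
    print(attrs["Smiling"].mean())    # fraction of images labelled as smiling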

Steam Video Game Database

Creators: Beliaev, Volodymyr
Publication Date: 2023

This dataset aggregates information on all games available on the Steam platform, enriched with additional data from sources like Steam Spy, GameFAQs, Metacritic, IGDB, and HowLongToBeat (HLTB). It is particularly valuable for researchers, developers, and enthusiasts interested in analyzing various aspects of video games, such as pricing, ratings, and gameplay duration. Each entry provides detailed data, including game identifiers, store URLs, promotional content, user scores, release dates, descriptions, pricing, supported platforms, developers, publishers, available languages, genres, tags, and achievements. The dataset reflects the state of the Steam catalog as of 2023 and has a size of 7.1 kB.
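As a rough illustration of how such a catalogue could be explored, here is a minimal pandas sketch; the file name and the field names ("genres", "user_score") are illustrative assumptions, not the dataset's actual schema:

    import pandas as pd

    # Load the catalogue (hypothetical export file name).
    games = pd.read_json("steam_games.json")

    # Explode multi-valued genre lists so each (game, genre) pair is one row,
    # then compare median user scores across genres.
    by_genre = games.explode("genres")
    print(by_genre.groupby("genres")["user_score"].median().sort_values(ascending=False))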

Amazon Brand and Exclusives

Creators: Jeffries, Adrianne; Yin, Leon
Publication Date: 2021

My co-author Adrianne Jeffries and I found that Amazon gave its own branded products an advantage over better-rated competitors in search results. This repository contains code to reproduce the findings featured in our stories “Amazon Puts Its Own ‘Brands’ First Above Better-Rated Products” and “When Amazon Takes the Buy Box, it Doesn’t Give it up” from our series Amazon’s Advantage. Each product in this dataset is identified by its unique Amazon Standard Identification Number (ASIN), facilitating precise tracking and analysis. Products are categorized based on their association with Amazon, distinguishing between Amazon-owned brands, exclusive partnerships, and proprietary electronics. Data is derived from extensive web scraping of Amazon’s product listings, ensuring a comprehensive and up-to-date collection. The data collection occurred primarily in early 2021, with search results gathered in January 2021 and product pages in February 2021. In total, the dataset comprises 137,428 products, each represented by a unique ASIN, and has a size of 221.0 kB. It is organized into several sub-datasets, each serving a specific analytical purpose (a short joining sketch in Python follows the list):

  1. Amazon Private Label Products: Contains detailed information on 137,428 products identified as Amazon brands, exclusives, or proprietary electronics.

  2. Search Results: Includes parsed search result pages from top and generic searches, totaling 187,534 product positions.

  3. Product Pages: Comprises parsed product pages corresponding to the search results, encompassing 157,405 product pages.

  4. Training Set: Provides metadata used to train and evaluate machine learning models, with feature engineering conducted in associated Jupyter notebooks.

  5. Trademarks: Contains a dataset of trademarked brands registered by Amazon, collected from USPTO.gov and Amazon.
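To connect these pieces, a minimal pandas sketch joining the search results to the private label list by ASIN could look like this; the file names and column names ("asin", "rank") are illustrative assumptions, since the repository's own notebooks define the real schema:

    import pandas as pd

    private_label = pd.read_csv("amazon_private_label.csv")   # 137,428 ASINs (hypothetical file name)
    search_results = pd.read_csv("search_results.csv")        # 187,534 product positions

    # Flag Amazon-owned, exclusive, or proprietary products among the search results.
    own_asins = set(private_label["asin"])
    search_results["is_amazon_brand"] = search_results["asin"].isin(own_asins)

    # Share of Amazon-associated products in the top search position.
    top_hits = search_results[search_results["rank"] == 1]
    print(top_hits["is_amazon_brand"].mean())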

Fact Check Data

Creators: Data Commons
Publication Date: 2019

This is a data feed of ClaimReview markups created via the Google Fact Check Markup Tool and the new ClaimReview Read/Write API. The data in the feed also follows the schema.org ClaimReview standard, namely the same schema as the data in the historical research dataset. It compiles structured metadata from fact-checking articles and serves as a valuable resource for researchers and developers aiming to analyze and combat misinformation. Each entry follows the ClaimReview schema, providing standardized fields such as the claim reviewed, the author, the date of publication, and the URL to the original fact-checking article. It aggregates fact-checks from multiple reputable organizations, offering a comprehensive view of fact-checking efforts across various domains. In total, the dataset has a size of 73.4 kB.
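Assuming the feed is packaged as a JSON file of DataFeed elements wrapping ClaimReview items, a minimal sketch for walking it could look like this; the outer "dataFeedElement"/"item" keys are assumptions about the packaging, while the inner fields follow the schema.org ClaimReview type:

    import json

    with open("fact_check_feed.json") as f:   # hypothetical local copy of the feed
        feed = json.load(f)

    for element in feed.get("dataFeedElement", []):
        for review in element.get("item", []):
            print(review.get("claimReviewed"))
            print("  reviewed by:", review.get("author", {}).get("name"))
            print("  published: ", review.get("datePublished"))
            print("  article:   ", review.get("url"))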
