
IRS E-File Bucket

Creators: Internal Revenue Service
Publication Date: 2016

This bucket contains a mirror of the IRS e-file release as of December 31, 2016. The release consists of Form 990 filings, the annual information returns submitted by tax-exempt organizations in the United States. The data helps in understanding the financial and operational aspects of nonprofit organizations. Each Form 990 provides insight into an organization’s mission, programs, and governance structures. The forms include detailed financial data, such as revenues, expenses, assets, and liabilities, giving a view of an organization’s financial health. As mandated by law, these forms are publicly accessible, which promotes transparency and allows stakeholders to make informed decisions. In total, the dataset has a size of 5.3 kB and is divided into individual Form 990 filings, each corresponding to a specific tax-exempt organization. Each filing includes:

  • Organizational Details: Name, Employer Identification Number (EIN), address, and mission statement.

  • Financial Information: Detailed breakdowns of revenues (e.g., contributions, grants, program service revenue), expenses (e.g., salaries, grants, operational costs), assets, and liabilities.

  • Governance and Compliance: Information on board members, key employees, governance policies, and compliance with tax regulations.

Computer generated building footprints for the United States

Creators: Microsoft Bing Maps Team
Publication Date: 2018

Microsoft Maps is releasing open building footprint datasets covering the entire United States. This dataset contains 129,591,852 computer-generated building footprints derived by applying our computer vision algorithms to satellite imagery. Building footprints were extracted using deep neural networks for semantic segmentation, followed by polygonization to convert the detected building pixels into vector shapes. The data is freely available for download and use. The dataset is organized by U.S. state and provided in GeoJSON format. Each GeoJSON file contains polygon geometries representing building footprints, accompanied by metadata such as the capture date of the underlying imagery. Notably, footprints within specific regions are based on imagery from 2019-2020, accounting for approximately 73,250,745 buildings.
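
As a rough illustration of the per-state GeoJSON layout described above, the following Python sketch counts polygon features in a small in-memory FeatureCollection and checks how many carry a capture date. The sample coordinates and the `capture_date` property name are assumptions made for this example and are not taken from the release itself.

```python
import json

# Hypothetical sample mimicking a tiny slice of one state's file.
# The "capture_date" property name is an assumption for illustration.
SAMPLE = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"capture_date": "2019-2020"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                                   [1.0, 0.0], [0.0, 0.0]]]}},
    {"type": "Feature",
     "properties": {"capture_date": null},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[2.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                                   [2.0, 2.0]]]}}
  ]
}
"""


def summarize_footprints(geojson_text):
    """Return (polygon count, polygons with a capture date)."""
    collection = json.loads(geojson_text)
    polygons = [
        feature for feature in collection["features"]
        if (feature.get("geometry") or {}).get("type") == "Polygon"
    ]
    dated = [
        feature for feature in polygons
        if (feature.get("properties") or {}).get("capture_date")
    ]
    return len(polygons), len(dated)


total, dated = summarize_footprints(SAMPLE)
print(total, dated)
```

Real state files are far larger than this sample, so a streaming parser may be a better fit than loading a whole file with `json.loads`.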

3 Million Russian troll tweets

Creators: FiveThirtyEight; Warren, Patrick; Linvill, Darren
Publication Date: 2018

This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian “troll factory” and a defendant in an indictment filed by the Justice Department in February 2018, as part of special counsel Robert Mueller’s Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. Each entry includes the tweet’s content, author handle, language, publication date, and account metrics (e.g., number of followers, following count). The dataset also classifies each account by thematic focus (e.g., Right Troll, Left Troll, News Feed), as coded by researchers Darren Linvill and Patrick Warren. It has a total size of 507.2 kB.

Creators: ProPublica

This free download is a database of more than 12,000 civilian complaints filed against New York City police officers. After New York state repealed the statute that kept police disciplinary records secret, known as 50-a, ProPublica filed a records request with New York City’s Civilian Complaint Review Board, which investigates complaints by the public about NYPD officers. The board provided us with records about closed cases for every police officer still on the force as of late June 2020 who had at least one substantiated allegation against them. The records span decades, from September 1985 to January 2020. Each entry includes specifics such as the nature of the allegation (e.g., use of force, abuse of authority), the outcome of the investigation, and any disciplinary actions taken. The dataset provides information on the officers involved, including their rank and assignment at the time of the complaint. Entries contain timestamps and locations of the alleged incidents, facilitating analyses of patterns over time and across different areas.

Structurally, the dataset contains the following information:

  • Complaint ID: A unique identifier for each complaint.

  • Date and Time: When the incident allegedly occurred.

  • Location: Where the incident took place.

  • Officer Details: Information about the officer(s) involved, such as badge number, rank, and assignment.

  • Allegation Details: Type of misconduct reported (e.g., excessive force, discourtesy).

  • Investigation Outcome: Findings of the investigation, including whether the allegation was substantiated, unsubstantiated, exonerated, or unfounded.

  • Disciplinary Action: Any penalties or corrective actions imposed following the investigation.
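
To make the record structure listed above more concrete, here is a minimal Python sketch that models a few of the fields and computes the share of substantiated allegations. The class name, field names, and sample records are all hypothetical and are not the column names of the actual download.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Complaint:
    # Hypothetical field names mirroring three of the listed items.
    complaint_id: str
    allegation_type: str
    outcome: str


def substantiated_share(complaints):
    """Fraction of records whose outcome is 'substantiated' (0.0 if empty)."""
    if not complaints:
        return 0.0
    counts = Counter(c.outcome.lower() for c in complaints)
    return counts["substantiated"] / len(complaints)


# Made-up records for illustration only.
records = [
    Complaint("1", "use of force", "Substantiated"),
    Complaint("2", "discourtesy", "Unsubstantiated"),
    Complaint("3", "abuse of authority", "Exonerated"),
    Complaint("4", "discourtesy", "Unfounded"),
]
print(substantiated_share(records))
```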

Facebook Social Connectedness Index

Creators: Meta
Publication Date: 2021

We use an anonymized snapshot of all active Facebook users and their friendship networks to measure the intensity of connectedness between locations. The Social Connectedness Index (SCI) is a measure of the social connectedness between different geographies. Specifically, it measures the relative probability that two individuals across two locations are friends with each other on Facebook, offering insights into social ties across regions. Each entry represents a pair of locations and gives the strength of social connectedness between them. The dataset has a size of 3.9 kB and reflects a specific snapshot in time, with the latest available data from October 2021. The dataset is organized into multiple sub-datasets, each detailing social connectedness at a different geographic level:

  1. Country-Country Pairs:

    • user_loc: ISO2 code of the first country.

    • fr_loc: ISO2 code of the second country.

    • scaled_sci: Scaled Social Connectedness Index between the two countries.

  2. US County-Country Pairs:

    • user_loc: 5-digit FIPS code of the U.S. county.

    • fr_loc: ISO2 code of the country.

    • scaled_sci: Scaled Social Connectedness Index between the U.S. county and the country.
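
The three-column layout above can be read with a few lines of Python. This is only a sketch: it assumes a tab-separated file with a header row, and the sample rows and index values are invented for illustration.

```python
import csv
import io

# Hypothetical rows in the documented column layout; values are made up,
# and the tab delimiter is an assumption about the file format.
SAMPLE_TSV = (
    "user_loc\tfr_loc\tscaled_sci\n"
    "US\tCA\t1200\n"
    "US\tMX\t800\n"
    "CA\tUS\t1200\n"
)


def strongest_connections(tsv_text, location):
    """Return (fr_loc, scaled_sci) pairs for `location`, strongest first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = [
        (row["fr_loc"], int(row["scaled_sci"]))
        for row in reader
        if row["user_loc"] == location
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


print(strongest_connections(SAMPLE_TSV, "US"))
```

The same function works for the county-country sub-dataset, since it treats `user_loc` as an opaque string, whether that is an ISO2 code or a 5-digit FIPS code.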

Fact Check Data

Creators: Data Commons
Publication Date: 2019

This is a data feed of ClaimReview markups created via the Google Fact Check Markup Tool and the new ClaimReview Read/Write API. The data in the feed follows the schema.org ClaimReview standard, the same schema used by the historical research dataset. It compiles structured metadata from fact-checking articles and serves as a resource for researchers and developers who analyze and combat misinformation. Each entry provides standardized fields such as the claim reviewed, the author, the date of publication, and the URL of the original fact-checking article. The feed aggregates fact-checks from multiple organizations, giving a broad view of fact-checking efforts across various domains. In total, the dataset has a size of 73.4 kB.
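
To show what the standardized fields look like in practice, the Python sketch below pulls the claim, author, publication date, and URL out of a single ClaimReview item. The property names (`claimReviewed`, `author`, `datePublished`, `url`) come from the schema.org ClaimReview type; the sample item itself is invented, and the exact envelope used by the feed is not shown here.

```python
import json

# Invented sample item; only the schema.org property names are real.
SAMPLE = """
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "Example claim text",
  "author": {"@type": "Organization", "name": "Example Fact Checker"},
  "datePublished": "2019-05-01",
  "url": "https://example.org/fact-check/1"
}
"""


def extract_claim_review(markup_text):
    """Return the core fields of one ClaimReview item as a flat dict."""
    item = json.loads(markup_text)
    if item.get("@type") != "ClaimReview":
        raise ValueError("not a ClaimReview item")
    author = item.get("author") or {}
    return {
        "claim": item.get("claimReviewed"),
        "author": author.get("name"),
        "date_published": item.get("datePublished"),
        "url": item.get("url"),
    }


print(extract_claim_review(SAMPLE))
```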

Amazon Brand and Exclusives

Creators: Jeffries, Adrianne; Yin, Leon
Publication Date: 2021

My co-author Adrianne Jeffries and I found Amazon gave its own branded products an advantage over better-rated competitors in search results. This repository contains code to reproduce the findings featured in our stories “Amazon Puts Its Own ‘Brands’ First Above Better-Rated Products” and “When Amazon Takes the Buy Box, it Doesn’t Give it up” from our series Amazon’s Advantage. Each product in this dataset is identified by its unique Amazon Standard Identification Number (ASIN), which allows precise tracking and analysis. Products are categorized by their association with Amazon, distinguishing between Amazon-owned brands, exclusive partnerships, and proprietary electronics. The data was collected by web scraping Amazon’s product listings, primarily in early 2021: search results were gathered in January 2021 and product pages in February 2021. In total, the dataset comprises 137,428 products, each represented by a unique ASIN, and has a size of 221.0 kB. It is organized into several sub-datasets, each serving a specific analytical purpose:

  1. Amazon Private Label Products: Contains detailed information on 137,428 products identified as Amazon brands, exclusives, or proprietary electronics.

  2. Search Results: Includes parsed search result pages from top and generic searches, totaling 187,534 product positions.

  3. Product Pages: Comprises parsed product pages corresponding to the search results, encompassing 157,405 product pages.

  4. Training Set: Provides metadata used to train and evaluate machine learning models, with feature engineering conducted in associated Jupyter notebooks.

  5. Trademarks: Contains a dataset of trademarked brands registered by Amazon, collected from USPTO.gov and Amazon.

CVE List

Creators: CVE
Publication Date: 2024

The mission of the CVE® Program is to identify, define, and catalog publicly disclosed cybersecurity vulnerabilities. This dataset covers vulnerabilities identified from 1999 through June 25, 2024, providing a historical perspective on the evolution of cybersecurity threats over a span of 25 years. The program's primary purpose is to standardize the identification of vulnerabilities across platforms and security tools, so that the cybersecurity community can communicate about them consistently and efficiently. As of June 25, 2024, the CVE List comprises 269,759 records with a size of 36.3 kB.

Each CVE entry includes several key components:

  • CVE Identifier (CVE ID): A unique alphanumeric code assigned to each vulnerability, following the format “CVE-YYYY-NNNN,” where “YYYY” denotes the year of identification, and “NNNN” is a sequential number of four or more digits.

  • Description: A brief summary outlining the nature of the vulnerability, including affected software or hardware, potential impacts, and any known exploits.

  • References: Links to external resources such as security advisories, vendor bulletins, or detailed analyses that provide additional context or mitigation information.
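
The CVE ID format described above lends itself to a simple validity check. The following Python sketch parses an identifier into its year and sequence number; the pattern accepts four or more sequence digits, and the function name is a hypothetical helper, not part of any CVE tooling.

```python
import re

# "CVE-", a four-digit year, "-", then a sequence number of 4+ digits.
CVE_ID_PATTERN = re.compile(r"CVE-(\d{4})-(\d{4,})")


def parse_cve_id(cve_id):
    """Return (year, sequence number) for a well-formed CVE ID, else None."""
    match = CVE_ID_PATTERN.fullmatch(cve_id)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_cve_id("CVE-2014-0160"))
print(parse_cve_id("CVE-2014-16"))
```

Using `fullmatch` rather than `match` rejects strings with trailing characters, which keeps the check strict when IDs are extracted from free text.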
