Resources by Stefan

Creators: Shutterstock

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don’t show up in places they shouldn’t. This repo, published in 2019, contains a list of words that we use to filter results from our autocomplete server and recommendation engine. The dataset encompasses offensive terms in multiple languages. It is open for contributions, allowing users to add or refine entries, particularly in non-English languages, which enhances its comprehensiveness and applicability across diverse cultural contexts. The exact number of entries varies by language; for instance, the English list contains 403 entries. In total, the dataset has a size of 25.7 kB. The data is organized into separate files for each language, with each file containing a list of offensive words in that particular language. For example, the English words are listed in the ‘en’ file, German words in the ‘de’ file, and so on, which enables targeted, language-specific content filtering. Each sub-dataset (language file) is a plain text file with one offensive term per line, facilitating easy integration into text processing pipelines.
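
As a rough illustration of how such per-language files can be used, the following sketch loads one word list and filters autocomplete candidates against it; the sample suggestions are hypothetical, and the whole-word matching strategy is only one of several reasonable choices:

    # Minimal sketch: filter autocomplete suggestions against a per-language blocklist.
    # The path "en" (the English file from the repo) and the sample suggestions are
    # illustrative assumptions, not part of the dataset's documentation.
    from pathlib import Path

    def load_blocklist(path: str) -> set[str]:
        """Read one term per line, skipping blank lines, and lowercase everything."""
        return {
            line.strip().lower()
            for line in Path(path).read_text(encoding="utf-8").splitlines()
            if line.strip()
        }

    def filter_suggestions(suggestions: list[str], blocklist: set[str]) -> list[str]:
        """Keep only suggestions that contain no blocklisted term as a whole word."""
        return [
            s for s in suggestions
            if not any(token in blocklist for token in s.lower().split())
        ]

    if __name__ == "__main__":
        blocklist = load_blocklist("en")  # English word list
        print(filter_suggestions(["sunset beach", "example query"], blocklist))

Multi-word entries in the list would need substring or phrase matching rather than the simple token check shown here.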

News Homepage Archive

Creators: Jones, Nick
Publication Date: 2019

This project aims to provide a visual representation of how different media organizations cover various topics. Screenshots of the homepages of five news organizations are taken once per hour and made public thereafter. For each website, this amounts to 24 screenshots per day; over a year, this results in approximately 8,760 screenshots per website. Screenshots are available for every hour starting from January 1, 2019. The size of the dataset is 1.8 MB. Currently, the only websites being tracked are:
nytimes.com;
washingtonpost.com;
cnn.com;
wsj.com;
foxnews.com.
By capturing hourly screenshots, this dataset offers a unique visual chronicle of news presentation, allowing for analysis of editorial choices, headline prominence, and the evolution of news stories across different media outlets. The dataset is organized hierarchically based on the website name and timestamp of each screenshot. Each sub-dataset corresponds to a specific news website, containing a chronological collection of its homepage screenshots. This structure facilitates targeted analysis of individual news outlets over time.
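
Assuming a directory layout of the form screenshots/<site>/<YYYY-MM-DD_HH>.png (the layout and file naming here are assumptions for illustration, not documented properties of the archive), one outlet’s screenshots could be walked chronologically like this:

    # Minimal sketch: iterate over hourly homepage screenshots for one outlet.
    # The directory layout "screenshots/<site>/<YYYY-MM-DD_HH>.png" is assumed
    # purely for illustration; adapt it to the archive's actual structure.
    from datetime import datetime
    from pathlib import Path

    def screenshots_for(site: str, root: str = "screenshots"):
        """Yield (timestamp, path) pairs for one site's screenshots in chronological order."""
        for path in sorted(Path(root, site).glob("*.png")):
            yield datetime.strptime(path.stem, "%Y-%m-%d_%H"), path

    if __name__ == "__main__":
        for taken_at, png in screenshots_for("nytimes.com"):
            print(taken_at.isoformat(), png)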

The Upworthy Research Archive

Creators: The Upworthy Research Archive
Publication Date: 2019

The Upworthy Research Archive is an open dataset of thousands of A/B tests of headlines conducted by Upworthy from January 2013 to April 2015. This repository includes the full data from the archive. The dataset’s size is approximately 149.7 MB. It includes 32,488 records of headline experiments, providing insights into how different headline variations impacted user engagement. The dataset is structured as a time series of experiments, with each record detailing the performance metrics of different headline variations. This structure enables researchers to analyze the effectiveness of various headlines and understand user engagement patterns over time.
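
As a sketch of how the experiment records might be analyzed, the snippet below computes per-headline click-through rates within each test; the file name and column names (clickability_test_id, headline, impressions, clicks) are assumptions and should be checked against the archive’s codebook:

    # Minimal sketch: rank headline variants within each A/B test by click-through rate.
    # File name and column names are assumed for illustration; verify against the archive.
    import pandas as pd

    def headline_ctr(csv_path: str) -> pd.DataFrame:
        df = pd.read_csv(csv_path)
        df["ctr"] = df["clicks"] / df["impressions"]
        return (
            df.sort_values(["clickability_test_id", "ctr"], ascending=[True, False])
              [["clickability_test_id", "headline", "impressions", "clicks", "ctr"]]
        )

    if __name__ == "__main__":
        print(headline_ctr("upworthy-archive.csv").head())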

Steam Video Game Database

Creators: Beliaev, Volodymyr
Publication Date: 2023

This dataset aggregates information on all games available on the Steam platform, enriched with additional data from sources like Steam Spy, GameFAQs, Metacritic, IGDB, and HowLongToBeat (HLTB). It is particularly valuable for researchers, developers, and enthusiasts interested in analyzing various aspects of video games, such as pricing, ratings, and gameplay duration. Each entry provides detailed data, including game identifiers, store URLs, promotional content, user scores, release dates, descriptions, pricing, supported platforms, developers, publishers, available languages, genres, tags, and achievements. The dataset reflects the state of the Steam catalog as of 2023 and has a size of 7.1 kB.
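
A hedged sketch of how such aggregated records could be queried is shown below; the file name, the assumption that the data is a JSON list of per-game dictionaries, and the field names ("name", "hltb_main", "user_score") are illustrative guesses rather than the dataset’s documented schema:

    # Minimal sketch: list the longest games by (assumed) HowLongToBeat main-story hours.
    # File name and field names are assumptions; check the dataset's actual schema.
    import json

    def longest_games(path: str, top_n: int = 10):
        with open(path, encoding="utf-8") as fh:
            games = json.load(fh)  # assumed: a list of per-game dictionaries
        with_length = [g for g in games if g.get("hltb_main")]
        with_length.sort(key=lambda g: g["hltb_main"], reverse=True)
        return [(g["name"], g["hltb_main"], g.get("user_score")) for g in with_length[:top_n]]

    if __name__ == "__main__":
        for name, hours, score in longest_games("steam_games.json"):
            print(f"{name}: {hours} h (user score: {score})")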

Large-scale CelebFaces Attributes (CelebA) Dataset

Creators: Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou
Publication Date: 2015

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA offers large diversity, large quantity, and rich annotations, including 10,177 identities, 202,599 face images, 5 landmark locations per image, and 40 binary attribute annotations per image. Each image in the dataset captures various facial features and accessories, such as eyeglasses, smiling, or bangs. Additionally, five landmark points (e.g., eyes, nose, mouth corners) are provided per image, facilitating tasks like facial alignment. Also, a wide range of poses, expressions, and occlusions are included, reflecting real-world conditions and enhancing the robustness of models trained on this data. The dataset has a size of 25.3 kB and is organized in three main components (a minimal attribute-loading sketch follows the component list):

  • Images:

    • In-the-Wild Images: Original images depicting celebrities in various environments and conditions.

    • Aligned and Cropped Images: Faces have been aligned and cropped to a consistent size, facilitating standardized analysis.

  • Annotations:

    • Landmark Locations: Coordinates for five key facial points (left eye, right eye, nose, left mouth corner, right mouth corner) per image.

    • Attribute Labels: Binary labels indicating the presence or absence of 40 distinct facial attributes for each image.

    • Identity Labels: Each image is associated with an identity label, linking it to one of the 10,177 unique individuals.

  • Evaluation Partitions:

    • The dataset is divided into training, validation, and test sets, enabling standardized evaluation of algorithms.
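
The attribute annotations are typically distributed as a single text file; the sketch below parses it into a table, assuming the common list_attr_celeba.txt layout (an image count, a header row of 40 attribute names, then one row per image with +1/-1 values). The file name and layout should be verified against the downloaded release:

    # Minimal sketch: parse CelebA attribute annotations into a pandas DataFrame.
    # Assumes the "list_attr_celeba.txt" layout described above; verify before use.
    import pandas as pd

    def load_attributes(path: str = "list_attr_celeba.txt") -> pd.DataFrame:
        with open(path, encoding="utf-8") as fh:
            n_images = int(fh.readline())            # first line: number of images
            columns = fh.readline().split()          # second line: 40 attribute names
            rows = (line.split() for line in fh if line.strip())
            df = pd.DataFrame(
                [(r[0], *map(int, r[1:])) for r in rows],
                columns=["image_id", *columns],
            )
        assert len(df) == n_images
        return df

    if __name__ == "__main__":
        attrs = load_attributes()
        print(attrs[attrs["Eyeglasses"] == 1].head())  # images labeled as wearing eyeglasses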

IRS E-File Bucket

Creators: Internal Revenue Service
Publication Date: 2016

This bucket contains a mirror of the IRS e-file release as of December 31, 2016: Form 990 filings, which are annual information returns submitted by tax-exempt organizations in the United States. The data helps in understanding the financial and operational aspects of nonprofit organizations. Each Form 990 provides insights into an organization’s mission, programs, and governance structures. The forms include detailed financial data, such as revenues, expenses, assets, and liabilities, offering a clear view of an organization’s financial health. As mandated by law, these forms are publicly accessible, promoting transparency and allowing stakeholders to make informed decisions. In total, the dataset has a size of 5.3 kB and is divided into individual Form 990 filings, each corresponding to a specific tax-exempt organization. Each filing includes the following (a minimal parsing sketch follows the list):

  • Organizational Details: Name, Employer Identification Number (EIN), address, and mission statement.

  • Financial Information: Detailed breakdowns of revenues (e.g., contributions, grants, program service revenue), expenses (e.g., salaries, grants, operational costs), assets, and liabilities.

  • Governance and Compliance: Information on board members, key employees, governance policies, and compliance with tax regulations.
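
The filings in the e-file release are XML documents; the sketch below pulls a few header fields from one filing. The namespace URI and element names are assumptions based on common Form 990 e-file schemas and must be checked against the actual documents in the bucket:

    # Minimal sketch: extract a few header fields from a single Form 990 e-file XML document.
    # Namespace and element names are assumed; verify against the real XML.
    import xml.etree.ElementTree as ET

    NS = {"efile": "http://www.irs.gov/efile"}  # assumed namespace

    def filing_summary(xml_path: str) -> dict:
        root = ET.parse(xml_path).getroot()

        def text(xpath: str):
            node = root.find(xpath, NS)
            return node.text if node is not None else None

        return {
            "ein": text(".//efile:Filer/efile:EIN"),
            "name": text(".//efile:Filer/efile:BusinessName/efile:BusinessNameLine1Txt"),
            "tax_period_end": text(".//efile:TaxPeriodEndDt"),
        }

    if __name__ == "__main__":
        print(filing_summary("example_public.xml"))  # hypothetical file name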

Computer generated building footprints for the United States

Creators: Microsoft Bing Maps Team
Publication Date: 2018

Microsoft Maps is releasing a country-wide open building footprints dataset for the United States. This dataset contains 129,591,852 computer-generated building footprints derived using our computer vision algorithms on satellite imagery. Building footprints were extracted using deep neural networks for semantic segmentation, followed by polygonization to convert detected building pixels into vector shapes. This data is freely available for download and use. The dataset is organized by U.S. state and provided in GeoJSON format. Each GeoJSON file contains polygon geometries representing building footprints, accompanied by metadata such as the capture date of the underlying imagery. Notably, footprints within specific regions are based on imagery from 2019-2020, accounting for approximately 73,250,745 buildings.
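
Since each state’s footprints ship as one GeoJSON FeatureCollection, a first look at the data can be as simple as counting polygon features; the file name below is hypothetical, and for the larger states a streaming JSON parser or a GIS library such as geopandas would be more practical:

    # Minimal sketch: count building footprints in one state's GeoJSON file.
    # The file name is an assumption; large files may require a streaming parser instead.
    import json

    def count_footprints(path: str) -> int:
        with open(path, encoding="utf-8") as fh:
            collection = json.load(fh)
        return sum(
            1 for feature in collection["features"]
            if feature["geometry"]["type"] == "Polygon"
        )

    if __name__ == "__main__":
        print(count_footprints("Alabama.geojson"))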

3 Million Russian troll tweets

Creators: FiveThirtyEight; Warren, Patrick; Linvill, Darren
Publication Date: 2018

This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian “troll factory” and a defendant in an indictment filed by the Justice Department in February 2018 as part of special counsel Robert Mueller’s Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. Each entry includes detailed information such as the tweet’s content, author handle, language, publication date, and engagement metrics (e.g., number of followers, following count). The dataset provides classifications for each account, indicating the thematic focus (e.g., Right Troll, Left Troll, News Feed), as coded by researchers Darren Linvill and Patrick Warren. It has a total size of 507.2 kB.
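
Given that each entry carries a publication date and an account classification, a simple aggregation already shows how activity is distributed over time; the CSV file name and the column names (publish_date, account_category) are assumed from the description above and should be verified against the repository’s headers:

    # Minimal sketch: tally tweets per account category and per year.
    # File and column names are assumptions; verify against the CSV headers.
    import pandas as pd

    def tweets_by_category_and_year(csv_path: str) -> pd.DataFrame:
        df = pd.read_csv(csv_path, parse_dates=["publish_date"])
        df["year"] = df["publish_date"].dt.year
        return df.groupby(["account_category", "year"]).size().unstack(fill_value=0)

    if __name__ == "__main__":
        print(tweets_by_category_and_year("IRAhandle_tweets_1.csv"))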
