
AI-generated marketing images (>10K) + human ratings for quality and realism

Creators: Jochen Hartmann, Yannick Exner
Publication Date: not specified

The GenImageNet dataset is a comprehensive collection of AI-generated marketing images accompanied by human evaluations assessing their quality and realism. It encompasses 10,320 synthetic marketing images created with seven state-of-the-art generative text-to-image models: DALL-E 3, Midjourney v6, Firefly 2, Imagen 2, Imagine, Realistic Vision, and Stable Diffusion XL Turbo. Each image was generated from prompts derived from 2,400 real-world, human-made images, enabling a direct comparison between AI-generated and human-made marketing visuals. The dataset’s total size is approximately 3.1 GB, substantial yet manageable for analysis. It comprises 10,320 observations, each corresponding to an individual AI-generated image. The temporal coverage aligns with the period during which the images were generated and evaluated, around July 2023. Structurally, the dataset pairs each AI-generated image with its corresponding human-made prompt image and the associated human evaluations. These evaluations comprise 254,400 individual assessments, roughly 25 ratings per image on average, to ensure reliability. They cover quality, realism, aesthetics, creativity, adherence to the prompt, and overall effectiveness in a marketing context.
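
Since each image carries multiple human ratings, a typical first step is to reduce the 254,400 individual assessments to per-image mean scores. A minimal sketch, assuming hypothetical field names (`image_id`, `quality`, `realism`) that are illustrative rather than the dataset’s actual column names:

```python
from statistics import mean

# Hypothetical per-rating records: each image receives multiple human ratings.
ratings = [
    {"image_id": "img_001", "quality": 5, "realism": 4},
    {"image_id": "img_001", "quality": 4, "realism": 4},
    {"image_id": "img_002", "quality": 3, "realism": 2},
    {"image_id": "img_002", "quality": 4, "realism": 3},
]

def aggregate(ratings):
    """Average each rating dimension per image."""
    by_image = {}
    for r in ratings:
        by_image.setdefault(r["image_id"], []).append(r)
    return {
        img: {
            "quality": mean(r["quality"] for r in rs),
            "realism": mean(r["realism"] for r in rs),
            "n_ratings": len(rs),
        }
        for img, rs in by_image.items()
    }

scores = aggregate(ratings)
print(scores["img_001"])
```

The same aggregation extends to the other rated dimensions (aesthetics, creativity, prompt adherence) by adding keys to the inner dictionary.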

Stack Overflow Q&A

Creators: Stack Exchange
Publication Date: 2014-01-23

The Stack Exchange Data Dump is a quarterly, anonymized release of all user-contributed content from the Stack Exchange network, including posts, comments, votes, and user data, licensed under Creative Commons BY-SA 3.0. This content facilitates comprehensive analyses of user interactions, content quality, and community dynamics across the network. As of March 1, 2020, the dump contained a total of 47,931,101 posts, encompassing both questions and answers accumulated from 2008 onward. The size of the dumps grew with the platform: the January 2011 dump exceeded 3 GB, and later versions reached tens of gigabytes. Each dump captures the state of the network up to its release date, providing a temporal snapshot of user-generated content and activity; releases were quarterly, offering periodic insights into the platform’s evolving dynamics. Structurally, the dataset is organized into multiple tables, each corresponding to a different aspect of the platform:

  • Posts: Contains all questions and answers, including metadata such as post ID, score, and content.

  • Users: Includes user-related information such as user ID, reputation score, and profile details.

  • Votes: Records voting activity on posts, specifying vote types and associated post IDs.

  • Comments: Holds all comments made on posts, detailing comment text, scores, and related post and user IDs.

  • Badges: Documents badges awarded to users, noting badge names, dates awarded, and badge classes (e.g., bronze, silver, gold).

  • Post History: Tracks changes to posts, recording edits, rollbacks, and other modifications along with details on the post and the user making the change.

  • Post Links: Contains links between posts, such as duplicates or related posts, along with link types and creation dates.
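
Each table ships as a single XML file whose records are `<row>` elements with the columns as attributes, so parsing is straightforward with the standard library. A minimal sketch over an invented, abridged `Posts.xml` excerpt:

```python
import xml.etree.ElementTree as ET

# Abridged excerpt in the dump's row-per-element layout; attribute values
# are invented for illustration.
posts_xml = """<posts>
  <row Id="1" PostTypeId="1" Score="42" Title="How do I parse XML?" />
  <row Id="2" PostTypeId="2" Score="17" ParentId="1" />
</posts>"""

root = ET.fromstring(posts_xml)
rows = [row.attrib for row in root.iter("row")]

# PostTypeId 1 marks questions, 2 marks answers.
questions = [r for r in rows if r["PostTypeId"] == "1"]
print(len(rows), questions[0]["Title"])
```

For full-size dumps, `ET.iterparse` with element clearing avoids holding the whole file in memory.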

Airbnb datasets

Creators: Inside Airbnb
Publication Date: varies by city

Inside Airbnb provides detailed data on Airbnb listings, including reviews, calendar availability, and neighborhood information to offer insights into short-term rental markets. The dataset’s size varies depending on the city and the number of active listings at the time of data collection. For instance, as of March 6, 2023, the New York City dataset contained information on over 42,000 listings. The temporal coverage of the dataset reflects specific points in time, capturing the state of Airbnb listings in various cities as of the data collection dates. Inside Airbnb provides quarterly data for the last year for each region, with archived files available for research on entire countries, including Australia, Canada, France, Germany, Greece, Italy, the Netherlands, Portugal, Spain, Sweden, the United Kingdom, and the United States. Structurally, the dataset is organized in a tabular format, with each row representing an individual listing and columns detailing various attributes such as listing ID, host information, location, property characteristics, pricing, review statistics, and availability.
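
Because each file is row-per-listing, standard tabular tools apply directly. A minimal sketch computing average price per neighbourhood, assuming the summary `listings.csv` layout (`id`, `neighbourhood`, `price`); the rows here are invented:

```python
import csv
import io
from statistics import mean

# Toy excerpt in the row-per-listing layout; values are invented.
listings_csv = """id,neighbourhood,price
101,Williamsburg,150
102,Williamsburg,210
103,Harlem,95
"""

reader = csv.DictReader(io.StringIO(listings_csv))
by_hood = {}
for row in reader:
    by_hood.setdefault(row["neighbourhood"], []).append(float(row["price"]))

# Mean nightly price per neighbourhood.
avg_price = {hood: mean(prices) for hood, prices in by_hood.items()}
print(avg_price)
```

Note that in some Inside Airbnb files the price field is a formatted string (e.g. "$150.00") and needs stripping before conversion.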

Reddit datasets

Creators: Conversational Analysis Toolkit (ConvoKit)
Publication Date: n.a.

The ConvoKit Subreddit Corpus is a collection of user comments from various subreddits on Reddit, gathered over time to facilitate research in conversational analysis and sociolinguistics. It encompasses posts and comments from 948,169 individual subreddits, each from its inception until October 2018. The dataset is organized into individual corpora for each subreddit, facilitating targeted analysis of specific communities. Each corpus includes detailed information at multiple levels:

  • Speaker-level: Speakers are identified by their Reddit usernames.

  • Utterance-level: Each post or comment is treated as an utterance with attributes such as unique ID, author, conversation ID, reply relationships, timestamp, and text content.

  • Conversation-level: Each post and its corresponding comments are considered a conversation, with metadata including the post’s title, number of comments, domain, subreddit, and author flair.

  • Corpus-level: Aggregates data such as the list of subreddits included, total number of posts and comments, and the number of unique speakers.
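
The utterance-level reply relationships described above are what let a flat list of comments be threaded back into a conversation tree. A minimal sketch with invented IDs (the real corpus uses Reddit post and comment IDs):

```python
# Each utterance points at its parent via "reply_to"; the post itself has none.
utterances = [
    {"id": "t3_post", "reply_to": None,      "speaker": "alice", "text": "Original post"},
    {"id": "c1",      "reply_to": "t3_post", "speaker": "bob",   "text": "Top-level comment"},
    {"id": "c2",      "reply_to": "c1",      "speaker": "alice", "text": "Reply to bob"},
]

def children(parent_id, utts):
    """All utterances replying directly to parent_id."""
    return [u for u in utts if u["reply_to"] == parent_id]

def thread_depth(utt_id, utts):
    """Length of the longest reply chain hanging off an utterance."""
    kids = children(utt_id, utts)
    return 1 + max((thread_depth(k["id"], utts) for k in kids), default=0)

print(thread_depth("t3_post", utterances))
```

ConvoKit itself provides Corpus, Conversation, and Utterance objects that encapsulate exactly this threading, so in practice the library handles the reconstruction.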

Top streamers on Twitch

Creators: Aayush Mishra
Publication Date: n.a.

This dataset includes information on Twitch streamers, such as follower count, channel views, watch time, stream time, peak viewers, and average viewers, providing insights into streaming trends and audience engagement on the platform. At approximately 0.1 MB, it is easily manageable for analysis. It covers the top 1,000 Twitch streamers, a substantial sample for statistical analysis and modeling. Temporal coverage typically spans the year preceding compilation; for example, a dataset published in May 2024 would cover May 2023 to May 2024. Structurally, the dataset is tabular: each row represents a streamer, and the columns record the streamer’s rank, name, total hours streamed, new followers gained, average viewers per stream, and the number of different games played.
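
The columns combine naturally into derived engagement metrics. A minimal sketch ranking streamers by follower growth per hour on air; the column names mirror the description above, but the rows and values are invented:

```python
# Invented rows in the dataset's tabular shape.
streamers = [
    {"name": "streamer_a", "hours_streamed": 1200, "followers_gained": 240_000},
    {"name": "streamer_b", "hours_streamed": 800,  "followers_gained": 300_000},
]

# One simple engagement metric: follower growth per hour streamed.
growth_rate = {
    s["name"]: s["followers_gained"] / s["hours_streamed"] for s in streamers
}
top = max(growth_rate, key=growth_rate.get)
print(top, growth_rate[top])
```

Normalizing by stream time like this separates efficient growth from sheer volume of airtime.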

NES music database

Creators: Chris Donahue
Publication Date: 2018

The NES Music Database (NES-MDB) is a dataset designed for machine learning applications in music composition for the Nintendo Entertainment System (NES). It encompasses music from 397 NES games, primarily released during the console’s peak popularity in the 1980s and early 1990s. The dataset is about 155 MB in its largest format and is available in MIDI, score, and VGM formats. It is primarily relevant to the gaming, machine learning, and music technology industries. The dataset is organized into several formats to accommodate diverse research and application needs:

  • MIDI Format: This format stores discrete musical events, including note, velocity, and timbre, with a high temporal resolution of 44.1 kHz, enabling precise reconstruction by an NES synthesizer.

  • Score Formats: These are piano roll representations sampled at a fixed rate of 24 Hz, providing dense, compact data suitable for detailed music analysis.

  • Language Modeling Format: Tailored for language modeling applications, this format facilitates the study of sequential patterns in NES music compositions.
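
The 24 Hz piano-roll idea behind the score formats can be illustrated without the dataset itself: note events (pitch, start, end) become a time-by-pitch grid sampled at a fixed rate. A minimal sketch with invented events; NES-MDB ships its score formats prebuilt:

```python
RATE_HZ = 24  # fixed sampling rate of the score formats

# Invented note events: MIDI pitch plus start/end times in seconds.
notes = [
    {"pitch": 60, "start": 0.0, "end": 0.5},  # middle C for half a second
    {"pitch": 64, "start": 0.5, "end": 1.0},
]

duration = max(n["end"] for n in notes)
n_frames = int(duration * RATE_HZ)
roll = [[0] * 128 for _ in range(n_frames)]  # 128 MIDI pitches per frame

# Mark every frame during which each note sounds.
for n in notes:
    for frame in range(int(n["start"] * RATE_HZ), int(n["end"] * RATE_HZ)):
        roll[frame][n["pitch"]] = 1

print(n_frames, sum(frame[60] for frame in roll))
```

The real score formats additionally separate the NES's individual synthesis channels, but the grid representation is the same.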

USGS Earth Explorer

Creators: U.S. Geological Survey (USGS)
Publication Date: 1972

USGS Earth Explorer is a powerful search platform that provides free access to satellite imagery, aerial photography, and remote sensing data. Its extensive archive offers data from various satellite missions, including Landsat, Sentinel, and ASTER, each contributing a significant volume of data. For instance, Landsat 9 alone collects up to 750 scenes per day; together with Landsat 8, the two satellites add nearly 1,500 new scenes daily to the USGS archive. Given that each Landsat scene is approximately 1 GB in size, the daily data addition from these two satellites alone can be estimated at around 1.5 TB. The number of observations grows continually as these satellites remain operational and consistently capture data. The temporal coverage of datasets accessible via Earth Explorer varies by mission: the Landsat program, for instance, offers a historical archive dating back to 1972, providing over five decades of Earth observation data. Structurally, Earth Explorer organizes datasets by source and type. Landsat data, for example, is categorized by satellite and processing level, including Level-1 and Level-2 products with varying spatial resolutions; other datasets are organized according to their respective missions and data characteristics, allowing users to efficiently locate the data relevant to their specific needs.
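
The archive-growth figures quoted above follow from simple arithmetic, sketched here as a back-of-the-envelope check (the inputs are the approximations given in the text, not official USGS statistics):

```python
SCENES_PER_DAY = 1500   # Landsat 8 + 9 combined, per the estimate above
SCENE_SIZE_GB = 1.0     # approximate size of one Landsat scene

daily_tb = SCENES_PER_DAY * SCENE_SIZE_GB / 1000
annual_tb = daily_tb * 365
print(daily_tb, annual_tb)
```

At these rates the two satellites alone add on the order of half a petabyte per year, which is why bulk users typically filter by path/row and date range before downloading.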

ESA’s Sentinel Data

Creators: European Space Agency (ESA)
Publication Date: 2014

The Sentinel satellite constellation provides Earth observation data under the Copernicus program, offering free and open access to a wide range of environmental information. The data covers various spectral, spatial, and temporal scales, enabling applications in environmental monitoring, land-use change, climate research, disaster management, and ocean observation. Sentinel-1 is equipped with C-band Synthetic Aperture Radar (SAR) that provides all-weather, day-and-night radar imagery, making it particularly useful for monitoring land and ocean surfaces, including applications like mapping, forestry, soil moisture estimation, and sea-ice observations. Sentinel-2 features a multispectral optical sensor capturing 13 spectral bands, offering high-resolution imagery (up to 10 meters), which is ideal for land cover classification, agricultural monitoring, and disaster management. Sentinel-3 carries multiple instruments to measure sea-surface topography, sea and land surface temperature, and ocean and land color, supporting ocean forecasting, environmental and climate monitoring, and land-use change detection.

Data volume varies by mission and product type. Sentinel-1 data files range from 1 GB to 6 GB per scene, depending on the observation mode and product type. Sentinel-2 Level-1C products, which cover a 100×100 km area, range from approximately 500 MB to 1 GB. Sentinel-3 product sizes vary widely by instrument and processing level, from several megabytes to over a gigabyte per product.

The number of observations increases continually as the satellites remain operational and capture data. Sentinel-1 revisits each region every 6 to 12 days, depending on the observation mode. Sentinel-2 provides global coverage of land surfaces every 5 days with its two satellites, Sentinel-2A and Sentinel-2B. Sentinel-3 offers a near-daily revisit for optical instruments, with shorter revisit times for altimetry measurements. Sentinel-1A was launched in April 2014, followed by Sentinel-1B in April 2016 and Sentinel-1C in December 2024. Sentinel-2A was launched in June 2015, Sentinel-2B in March 2017, and Sentinel-2C in September 2024. Sentinel-3A was launched in February 2016, and Sentinel-3B followed in April 2018.
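
The revisit and product-size figures above let users budget storage before downloading. A rough sizing sketch for a single Sentinel-2 tile over one year, using the approximations quoted in the text:

```python
REVISIT_DAYS = 5               # combined Sentinel-2A/2B coverage
PRODUCT_SIZE_GB = (0.5, 1.0)   # Level-1C product size range per tile

acquisitions_per_year = 365 // REVISIT_DAYS
low = acquisitions_per_year * PRODUCT_SIZE_GB[0]
high = acquisitions_per_year * PRODUCT_SIZE_GB[1]
print(acquisitions_per_year, low, high)
```

So a single tile runs to tens of gigabytes per year even before cloud-free filtering, which is why most workflows query by area, date, and cloud cover rather than mirroring whole archives.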
