
Reddit datasets

Creators: Conversational Analysis Toolkit (ConvoKit)
Publication Date: n.a
The ConvoKit Subreddit Corpus is a collection of user comments from various subreddits on Reddit, gathered over time to facilitate research in conversational analysis and sociolinguistics. It encompasses posts and comments from 948,169 individual subreddits, each from its inception until October 2018. This dataset is organized into individual corpora for each subreddit, facilitating targeted analysis of specific communities. Each corpus includes detailed information at multiple levels: speaker-level, where speakers are identified by their Reddit usernames; utterance-level, where each post or comment is treated as an utterance with attributes such as unique ID, author, conversation ID, reply relationships, timestamp, and text content; conversation-level, where each post and its corresponding comments are considered a conversation, with metadata including the post’s title, number of comments, domain, subreddit, and author flair; and corpus-level, which aggregates data such as the list of subreddits included, total number of posts and comments, and the number of unique speakers.
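The four levels described above can be pictured with a small standard-library sketch. This is not the ConvoKit API: the `Utterance` class, its field names, and the sample records are hypothetical and only mirror the attributes listed in the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """One post or comment (hypothetical fields mirroring the description)."""
    id: str
    speaker: str              # Reddit username
    conversation_id: str      # ID of the post that started the thread
    reply_to: Optional[str]   # parent utterance ID, None for the post itself
    timestamp: int            # Unix time in seconds
    text: str


# Made-up sample records, for illustration only.
utterances = [
    Utterance("p1", "alice", "p1", None, 1538352000, "First post"),
    Utterance("c1", "bob", "p1", "p1", 1538352060, "A reply"),
    Utterance("c2", "alice", "p1", "c1", 1538352120, "A reply to the reply"),
    Utterance("p2", "carol", "p2", None, 1538352180, "Second post"),
]

# Conversation level: group utterances by the post they belong to.
conversations = defaultdict(list)
for utt in utterances:
    conversations[utt.conversation_id].append(utt)

# Corpus level: aggregate counts across all conversations.
corpus_summary = {
    "num_conversations": len(conversations),
    "num_utterances": len(utterances),
    "num_speakers": len({utt.speaker for utt in utterances}),
}
print(corpus_summary)
```

Keeping the reply relationship on each utterance, rather than nesting comments inside posts, is one plausible way to let a flat list be regrouped by conversation or by speaker as needed.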

Top streamers on Twitch

Creators: Aayush Mishra
Publication Date: n.a

This dataset includes information on Twitch streamers, such as their follower count, channel views, watch time, stream time, peak viewers, and average viewers, providing insights into streaming trends and audience engagement on the Twitch platform. The dataset size is relatively small, approximately 0.1 MB, making it easily manageable for data analysis. It contains observations on the top 1,000 Twitch streamers, providing a substantial sample size for statistical analysis and modeling. In terms of temporal coverage, the dataset typically reflects streamer metrics from the past year relative to its compilation date. For example, if the dataset was published in May 2024, it would cover data from May 2023 to May 2024. Structurally, the dataset is organized into a tabular format where each row represents a streamer and the columns provide specific performance metrics. The key columns include the streamer’s rank based on performance metrics, their name, the total hours they have streamed, the number of new followers they have acquired, the average number of viewers per stream, and the variety or count of different games they have played.

NES music database

Creators: Chris Donahue
Publication Date: 2018-xx-xx

The NES Music Database (NES-MDB) is a dataset designed for machine learning applications in music composition for the Nintendo Entertainment System (NES). It encompasses music from 397 NES games, primarily released during the console’s peak popularity in the 1980s and early 1990s. The dataset is about 155 MB in its largest format and is available in MIDI, score, and VGM formats. It is primarily relevant to the gaming, machine learning, and music technology industries. The dataset is organized into several formats to accommodate diverse research and application needs:

  • MIDI Format: This format stores discrete musical events, including note, velocity, and timbre, with a high temporal resolution of 44.1 kHz, enabling precise reconstruction by an NES synthesizer.

  • Score Formats: These are piano roll representations sampled at a fixed rate of 24 Hz, providing dense, compact data suitable for detailed music analysis.

  • Language Modeling Format: Tailored for language modeling applications, this format facilitates the study of sequential patterns in NES music compositions.
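To relate the two rates quoted in the list above, the following sketch converts a position on the 44.1 kHz timeline into an index on the 24 Hz score grid. The function name and the floor-based alignment starting at tick 0 are assumptions made for illustration; the dataset's own tooling may align frames differently.

```python
from fractions import Fraction

AUDIO_RATE_HZ = 44_100   # temporal resolution stated for the MIDI format
SCORE_RATE_HZ = 24       # sampling rate stated for the score formats

# Exact number of 44.1 kHz ticks covered by one 24 Hz score frame.
TICKS_PER_FRAME = Fraction(AUDIO_RATE_HZ, SCORE_RATE_HZ)


def tick_to_frame(tick: int) -> int:
    """Return the index of the 24 Hz frame containing the given tick.

    Assumes frame 0 starts at tick 0 (an illustrative convention).
    """
    return tick * SCORE_RATE_HZ // AUDIO_RATE_HZ


print(float(TICKS_PER_FRAME))   # ticks per score frame
print(tick_to_frame(44_100))    # frame index reached after one second
```

Integer arithmetic is used in `tick_to_frame` because one frame spans a non-integer number of ticks, so repeated floating-point addition could drift over long tracks.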

USGS Earth Explorer

Creators: U.S. Geological Survey (USGS)
Publication Date: 1972

USGS Earth Explorer is a search platform that provides free access to satellite imagery, aerial photography, and remote sensing data, including data from missions and instruments such as Landsat, Sentinel, and ASTER. Each mission contributes a significant volume of data to the archive, and the number of observations keeps growing because satellites such as Landsat 8 and 9 are operational and capturing data continuously. Landsat 9 alone collects up to 750 scenes per day, and together with Landsat 8 it adds nearly 1,500 new scenes daily to the USGS archive. Given that each Landsat scene is approximately 1 GB in size, the daily addition from these two satellites alone can be estimated at around 1.5 TB. The temporal coverage of datasets accessible via Earth Explorer varies depending on the specific mission or dataset. For instance, the Landsat program offers a historical archive dating back to 1972, providing over five decades of Earth observation data. Structurally, Earth Explorer organizes its datasets based on the source and type of data. For example, Landsat data is categorized by satellite and processing level, including Level-1 and Level-2 products, with varying spatial resolutions. Similarly, other datasets are organized according to their respective missions and data characteristics, allowing users to efficiently locate and utilize the data relevant to their specific needs.
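The daily volume estimate above is simple arithmetic, written out here as a sketch. The scene count and scene size are the approximate figures from the description; the use of decimal units (1 TB = 1,000 GB) and the extrapolation to a full year are assumptions added for illustration.

```python
SCENES_PER_DAY = 1_500   # Landsat 8 + Landsat 9 combined, as stated above
SCENE_SIZE_GB = 1.0      # approximate size of one scene, as stated above
GB_PER_TB = 1_000        # decimal units (assumption)

# Daily growth of the archive from these two satellites.
daily_tb = SCENES_PER_DAY * SCENE_SIZE_GB / GB_PER_TB

# Rough yearly extrapolation, assuming the same rate every day.
yearly_tb = daily_tb * 365

print(f"{daily_tb:.1f} TB per day")
print(f"{yearly_tb:.1f} TB per year")
```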

ESA’s Sentinel Data

Creators: European Space Agency (ESA)
Publication Date: 2014

The Sentinel satellite constellation provides Earth observation data under the Copernicus program, offering free and open access to a wide range of environmental information. The data covers various spectral, spatial, and temporal scales, enabling applications in environmental monitoring, land-use change, climate research, disaster management, and ocean observation. Sentinel-1 is equipped with C-band Synthetic Aperture Radar (SAR) that provides all-weather, day-and-night radar imagery, making it particularly useful for monitoring land and ocean surfaces, including applications like mapping, forestry, soil moisture estimation, and sea-ice observations. Sentinel-2 features a multispectral optical sensor capturing 13 spectral bands, offering high-resolution imagery (up to 10 meters), which is ideal for land cover classification, agricultural monitoring, and disaster management. Sentinel-3 carries multiple instruments to measure sea-surface topography, sea and land surface temperature, and ocean and land color, supporting ocean forecasting, environmental and climate monitoring, and land-use change detection. The data volume varies depending on the specific mission and product type. Sentinel-1 data files range from 1 GB to 6 GB per scene, depending on the observation mode and product type. Sentinel-2 Level-1C products, which cover a 100×100 km area, range from approximately 500 MB to 1 GB. Sentinel-3 product sizes vary widely depending on the instrument and processing level, ranging from several megabytes to over a gigabyte per product. The number of observations is continually increasing as the satellites remain operational and continuously capture data. Sentinel-1 revisits each region every 6 to 12 days, depending on the observation mode. Sentinel-2 provides global coverage of land surfaces every 5 days with its two satellites, Sentinel-2A and Sentinel-2B. Sentinel-3 offers a near-daily revisit for optical instruments, with shorter revisit times for altimetry measurements. Sentinel-1A was launched in April 2014, followed by Sentinel-1B in April 2016, and Sentinel-1C in December 2024. Sentinel-2A was launched in June 2015, Sentinel-2B in March 2017, and Sentinel-2C in September 2024. Sentinel-3A was launched in February 2016, and Sentinel-3B followed in April 2018.
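As a rough planning aid, the revisit interval and product sizes quoted above can be combined to estimate how much Sentinel-2 Level-1C data a single 100×100 km area accumulates in a year. This is a sketch under simplifying assumptions: exactly one product per revisit, no extra acquisitions from overlapping orbits, and a hypothetical helper name `products_per_year`.

```python
REVISIT_DAYS = 5              # Sentinel-2 revisit with two satellites, as stated above
PRODUCT_SIZE_GB = (0.5, 1.0)  # Level-1C size range per product, as stated above


def products_per_year(revisit_days: int, days: int = 365) -> int:
    """Whole number of revisits that fit in the given number of days."""
    return days // revisit_days


n_products = products_per_year(REVISIT_DAYS)

# Lower and upper bound on yearly volume for one area, in GB.
low_gb = n_products * PRODUCT_SIZE_GB[0]
high_gb = n_products * PRODUCT_SIZE_GB[1]

print(f"{n_products} products, between {low_gb} GB and {high_gb} GB per year")
```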

NOAA (National Oceanic and Atmospheric Administration) Class

Creators: National Oceanic and Atmospheric Administration (NOAA)
Publication Date: 1970

NOAA’s CLASS (Comprehensive Large Array-data Stewardship System) is a web-based data repository that archives and distributes environmental data collected by various NOAA satellites and missions, including environmental satellite data, weather data, and other datasets. Its holdings include data from the Advanced Very High Resolution Radiometer (AVHRR), which captures global cloud cover, sea surface temperatures, and vegetation indices; the Geostationary Operational Environmental Satellites (GOES), which provide real-time weather monitoring and forecasting data; and the Joint Polar Satellite System (JPSS), which offers global environmental data for weather forecasting and climate monitoring. A key feature of CLASS is its role in long-term data preservation, ensuring that environmental information remains accessible for climate research and historical analyses. The system also offers search and retrieval capabilities that allow users to locate and download specific datasets efficiently. The temporal coverage of datasets within CLASS varies by instrument and mission. AVHRR data is available from the late 1970s to the present, GOES data coverage starts in the 1970s and continues to current operations, and JPSS data has been available since 2011.

Sentinel Hub

Creators: Sinergise
Publication Date: 1970

Sentinel Hub provides access to a wide array of satellite imagery, including data from the Copernicus Sentinel missions (e.g., Sentinel-1, Sentinel-2), the Landsat series, and commercial sources such as PlanetScope, along with tools for visualizing, analyzing, and processing that data. This extensive dataset supports diverse applications such as agriculture, forestry, land monitoring, and emergency response. The platform offers on-the-fly processing capabilities, allowing users to perform complex analyses without the need for local computational resources, which streamlines workflows and accelerates data processing tasks. Users can create custom processing scripts using Sentinel Hub’s scripting language, enabling tailored data analyses and visualization outputs to meet specific project requirements. Historical imagery is available back to the launch dates of the respective missions. For instance, Sentinel-2 data is available from 2015 onwards, enabling users to analyze changes over the past decade.

Planet Labs

Creators: Planet Labs Inc.
Publication Date: 2014

Planet Labs offers high-resolution satellite imagery from its fleet of small satellites (Dove and SkySat). The platform provides detailed and frequent imagery for monitoring and analysis applications. The satellites collect data across multiple spectral bands, including Red, Green, Blue, and Near-Infrared (NIR), facilitating diverse analyses such as vegetation health assessment and land cover classification. The constellation’s design allows for frequent revisits, with PlanetScope achieving near-daily coverage and SkySat capable of sub-daily tasking, which is essential for applications requiring up-to-date information. Planet Labs collects imagery covering up to 300 million square kilometers daily, resulting in a vast number of observations over time. PlanetScope imagery dates back to 2014, with 8-band data available from 2020. The RapidEye constellation, acquired by Planet Labs, provided imagery from 2009 until its retirement in 2020. In total, the archive exceeds 10 PB in size. It comprises several sub-datasets, each corresponding to a different satellite constellation and imaging capability:

  • PlanetScope: Consists of hundreds of small satellites capturing imagery at approximately 3.7-meter resolution across four multispectral bands (Red, Green, Blue, NIR). This dataset supports applications like agricultural monitoring, forest management, and environmental assessment.

  • SkySat: Comprises around 15 satellites providing high-resolution imagery at 50-centimeter resolution, with capabilities for sub-daily tasking. SkySat data is valuable for detailed analyses, including infrastructure monitoring and urban planning.

  • RapidEye (Retired): Included five satellites that operated from 2009 to 2020, delivering 5-meter resolution imagery across five spectral bands (Blue, Green, Red, Red Edge, NIR). RapidEye data was used for applications such as land use mapping and crop monitoring.
