Wikipedia archive

Creators: DBpedia Association
Publication Date: 2024-10-23

The DBpedia individual datasets extract structured information from Wikipedia, such as labels, facts, geo-coordinates, and Wikipedia categories, with wide language coverage for data analysis. This dataset enables users to perform complex queries across a wide range of topics, including information about people, places, organizations, and more. DBpedia systematically extracts structured data from Wikipedia’s semi-structured content, such as infoboxes, categorization information, and links, and transforms it into a machine-readable format, facilitating advanced data analysis and integration. The dataset is organized according to a cross-domain ontology, providing a consistent framework for representing diverse types of information and supporting complex queries and data integration across domains. As of the 2016-04 release, DBpedia describes 6.0 million entities, including 1.5 million persons, 810,000 places, 135,000 music albums, 106,000 films, 20,000 video games, 275,000 organizations, 301,000 species, and 5,000 diseases. The dataset comprises 9.5 billion RDF triples, with 1.3 billion extracted from the English edition of Wikipedia and 5.0 billion from other language editions. The dataset reflects the state of Wikipedia at the time of each DBpedia release. For example, the 2016-04 release corresponds to Wikipedia’s content as of April 2016. The dataset is structured as RDF triples, each consisting of a subject, predicate, and object. DBpedia utilizes a community-curated ontology to categorize information, with mappings from Wikipedia infoboxes to ontology classes and properties. This structure ensures consistency and facilitates data integration.
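The subject, predicate, object structure described above can be illustrated with a minimal Python sketch. The `Triple` type, the `subjects_of_type` helper, and the `ex:`-prefixed identifiers are hypothetical and chosen only for illustration; they are not taken from an actual DBpedia release or its tooling.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Illustrative triples only; not copied from a DBpedia dump.
TRIPLES = [
    Triple("ex:Berlin", "rdf:type", "ex:Place"),
    Triple("ex:Berlin", "rdfs:label", "Berlin"),
    Triple("ex:Ada_Lovelace", "rdf:type", "ex:Person"),
]


def subjects_of_type(triples, class_name):
    """Return the subjects declared as instances of the given class."""
    return [
        t.subject
        for t in triples
        if t.predicate == "rdf:type" and t.obj == class_name
    ]


print(subjects_of_type(TRIPLES, "ex:Place"))
```

In practice, a dataset of this size would more likely be queried through an RDF library or a SPARQL endpoint than through in-memory lists; the sketch only shows how the three-part structure supports filtering by ontology class.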

This American Life podcast transcripts

Creators: Julian McAuley
Publication Date: 2020

This dataset contains transcripts from This American Life podcast episodes, including speaker utterances and associated audio for in-depth analysis of long conversations, spanning from 1995 to 2020. It is particularly valuable for research in natural language processing, automatic speech recognition, and multi-speaker diarization, as it provides real-world examples of long-form, multi-speaker conversations for developing and testing algorithms. In total, the dataset encompasses 663 episodes, totaling approximately 637.70 hours of audio content. Each episode serves as an individual observation. Structurally, the dataset consists of program transcripts and associated audio files. Each transcript includes metadata such as episode acts, speaker names, speaker utterances, utterance lengths, and episode audio.

Clubhouse data

Creators: Kaggle
Publication Date: 2024-10-23

The Clubhouse dataset on Kaggle provides data related to the social audio app Clubhouse. It contains information such as user demographics, room activity, user engagement metrics, and discussions held on the platform. The dataset is approximately 9.7 MB in size and comprises 1,300,515 user profiles. Each profile represents an individual observation, offering a substantial sample for analysis. It is particularly valuable for analyzing user demographics, social networking patterns, and engagement metrics within the platform. Structurally, the dataset is organized as a CSV file, with each row corresponding to a user’s profile and columns representing various attributes. The key variables included are:

  • user_ID: A unique identifier for each user on Clubhouse.

  • name: The display name of the user.

  • photo_url: URL of the user’s profile photo.

  • username: The username chosen by the user on Clubhouse.

  • Twitter: The user’s Twitter handle or linked Twitter account.

  • Instagram: The user’s Instagram handle or linked Instagram account.

  • num_followers: The number of followers the user has on Clubhouse.

  • num_following: The number of accounts the user is following on Clubhouse.

  • time_created: The date and time when the user’s account was created.

  • invited_by_user_profile: Profile information of the user who invited this user to Clubhouse.
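As a rough illustration of the one-row-per-profile layout, the sketch below parses a small CSV with Python's standard `csv` module and finds the most-followed profile. The column names follow the list above and are assumed to match the actual file header; the sample rows and the `load_profiles` helper are invented for the example and are not real Clubhouse data.

```python
import csv
import io

# Hypothetical sample rows using a subset of the listed columns.
SAMPLE_CSV = """user_ID,name,username,num_followers,num_following,time_created
1,Example One,example_one,120,30,2021-01-05T10:00:00
2,Example Two,example_two,15,200,2021-02-11T08:30:00
"""


def load_profiles(text):
    """Parse CSV text into dicts, converting the count columns to int."""
    profiles = []
    for row in csv.DictReader(io.StringIO(text)):
        row["num_followers"] = int(row["num_followers"])
        row["num_following"] = int(row["num_following"])
        profiles.append(row)
    return profiles


profiles = load_profiles(SAMPLE_CSV)
most_followed = max(profiles, key=lambda p: p["num_followers"])
print(most_followed["username"])
```

With the full file of roughly 1.3 million rows, the same approach should work by passing an open file handle to `csv.DictReader` instead of an in-memory string.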

TikTok dataset

Creators: Kaggle
Publication Date: 2021

The dataset contains data related to TikTok content, including video statistics such as likes, shares, and views, useful for understanding user engagement. It comprises 300 dance videos sourced from the TikTok mobile social networking application. This dataset is particularly valuable for research in computer vision, human pose estimation, and motion analysis, as it provides real-world examples of human dance movements captured in diverse settings. The dataset is available in CSV format and has a total size of approximately 1.4 GB. Each of the 300 videos serves as an individual observation, offering a substantial collection for analysis. Structurally, the dataset consists of video files accompanied by a CSV file containing metadata for each video. The metadata includes information such as video ID, duration, and possibly other attributes relevant to the content. This structure facilitates various analyses, including temporal segmentation, motion tracking, and pattern recognition within the dance sequences. This dataset provides a rich resource for developing and testing algorithms in areas like action recognition, pose estimation, and dance movement analysis, contributing to advancements in computer vision and machine learning applications related to human motion.

Yahoo songs ratings

Creators: Yahoo labs
Publication Date: varies by dataset

A collection of datasets curated by Yahoo for research purposes, including ratings, social data, and multimedia content, mainly used for recommender systems and machine learning research. The datasets encompass a vast number of user ratings, reflecting a wide array of musical tastes and preferences. Each song is accompanied by detailed metadata, including artist, album, and genre information, facilitating multifaceted analyses. Users are represented by randomly assigned numeric IDs, ensuring privacy while allowing for user behavior studies. This dataset includes approximately 300,000 user-supplied ratings and exactly 54,000 ratings for randomly selected songs. It involves 15,400 users and 1,000 songs, with a total size of 1.2 MB. The data was collected between 2002 and 2006, capturing user interactions and preferences during this period. The main components of the dataset are structured as follows:

  • R2 Dataset: Contains user ratings for songs, with each entry linked to metadata such as artist, album, and genre. All identifiers (users, songs, artists, albums) are anonymized.

  • R3 Dataset: Comprises two subsets: one with ratings from normal user interactions and another with ratings for randomly selected songs collected via an online survey. It also includes responses to seven multiple-choice survey questions regarding rating behavior for a subset of

Audio books data (audio + text)

Creators: Usborne Publishing Ltd
Publication Date: 2017

The dataset contains audiobook recordings of a female British English speaker (Lesley Sims), used for the Blizzard Challenge to advance speech synthesis research. It comprises approximately 6.5 hours of speech data, encompassing 56 children’s audiobooks. This data is particularly valuable for advancing speech synthesis research, as it offers high-quality, natural speech recordings paired with their textual content. In total, the dataset has a size of 765 MB. The temporal coverage of the dataset pertains to the period leading up to the Blizzard Challenge 2017, with the data being released specifically for that event. Structurally, the dataset consists of:

  • Audio Files: High-quality recordings of Lesley Sims narrating various texts. These files capture natural prosody and articulation, essential for developing and testing speech synthesis models.

  • Text Files: Corresponding textual content for each audio recording, facilitating alignment between spoken and written language.

  • Label Files: Sentence-level segmentation and alignment between the text and speech for a portion of the data. These labels were initially created by Toshiba’s Cambridge Research Laboratory and later re-processed by the University of Edinburgh to ensure they correspond accurately to the original audio recordings.

NPR (national public radio) interview dialog data

Creators: Julian McAuley
Publication Date: 2020

The NPR Media Dialog Transcripts dataset contains interview transcripts from National Public Radio (NPR) programs, spanning approximately 20 years. This dataset includes over 140,000 transcripts, covering more than 10,000 hours of audio content. In total, it comprises 3.2 million utterances and 126.7 million words. These features are valuable for analyzing discourse patterns, speaker interactions, and content trends within NPR’s programming. Structurally, the dataset is organized into individual transcripts, each corresponding to a specific NPR episode. Each transcript includes metadata such as the episode title, broadcast date, program name, and speaker identities, along with the full text of the conversations between hosts and guests.

Public Domain Music Dataset

Creators: Phillip Long, Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick
Publication Date: 2024-09-16

PDMX is a large-scale public domain MusicXML dataset for symbolic music research and analysis, consisting of over 250,000 MusicXML scores sourced from MuseScore. Each score in the dataset is accompanied by metadata detailing attributes such as user interactions, ratings, and licensing information, which can be utilized to filter and analyze the dataset for specific research needs. Each of the more than 250,000 MusicXML scores serves as an individual observation. In total, the dataset has a size of 1.6 GB. Structurally, the dataset is organized with each row representing a different song, including paths to the MusicRender JSON file and associated metadata, along with various attributes such as user status, ratings, and licensing information.
