
COVID-19 Twitter Chatter Dataset

Creators: Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo
Publication Date: 2024

Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators from January 27th to March 27th to extend the longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions, and emojis together with their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The dataset is 14.2 GB in size.
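As a small illustration of working with the daily frequency files, the sketch below reads one day's hashtag counts with pandas; the file name and two-column layout are assumptions for the example, not guaranteed by the release notes above.

    import pandas as pd

    # Assumed file name and layout for one day's hashtag frequencies
    # (one such file per day inside the hashtags zip archive).
    hashtags = pd.read_csv(
        "2020-03-11_top_hashtags.csv",  # hypothetical name; adjust to the release
        names=["hashtag", "count"],
        header=0,
    )

    # The ten most frequent hashtags for that day.
    print(hashtags.nlargest(10, "count"))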

Goodreads-books

Creators: Zając, Zygmunt
Publication Date: 2019

This dataset was created to meet the need for a good, clean dataset of books. It contains important features such as book titles, authors, average ratings, ISBN identifiers, language codes, page counts, ratings counts, text review counts, publication dates, and publishers. A distinctive aspect of this dataset is its ability to support a wide range of book-related analyses, such as trends in book popularity, author influence, and reader preferences. The dataset is 1.56 MB in size and was scraped via the Goodreads API. It encompasses over 10,000 observations, each representing a unique book entry with multiple attributes. The structure is straightforward: a single CSV file with the following key columns (a minimal loading sketch follows the list):

  • bookID: A unique identification number for each book.
  • title: The official title of the book.
  • authors: Names of the authors, with multiple authors separated by a delimiter.
  • average_rating: The average user rating for the book.
  • isbn & isbn13: The 10-digit and 13-digit International Standard Book Numbers, respectively.
  • language_code: The primary language in which the book is published (e.g., ‘eng’ for English).
  • num_pages: The total number of pages in the book.
  • ratings_count: The total number of ratings the book has received from users.
  • text_reviews_count: The total number of text reviews written by users.
  • publication_date: The original publication date of the book.
  • publisher: The name of the publishing house.
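A minimal loading sketch, assuming the file is named books.csv and that multiple authors share one field separated by a slash (both are assumptions; check your copy of the dataset):

    import pandas as pd

    # A handful of rows in common copies of this CSV are malformed, hence on_bad_lines.
    books = pd.read_csv("books.csv", on_bad_lines="skip")

    # Split the multi-author field on the assumed "/" delimiter.
    books["author_list"] = books["authors"].str.split("/")

    # Example analysis: highest-rated books among those with many ratings.
    popular = books[books["ratings_count"] > 10_000]
    print(popular.nlargest(10, "average_rating")[["title", "average_rating"]])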

goodbooks-10k

Creators: Zając, Zygmunt
Publication Date: 2017

The dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). It offers a rich resource for analyzing reading habits, book popularity, and user engagement within the literary community. It also includes books marked “to read” by users, book metadata (author, year, etc.), and tags/shelves/genres.

ratings contains ratings sorted by time. Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424.

to_read provides IDs of the books marked “to read” by each user, as user_id,book_id pairs, sorted by time. There are close to a million pairs.

books has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata has been extracted from goodreads XML files.

book_tags contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.

The dataset is 68.8 MB in size.
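A sketch of joining the files, using the columns implied by the description above (user_id, book_id, rating); the file paths are assumptions:

    import pandas as pd

    ratings = pd.read_csv("ratings.csv")  # user_id, book_id, rating (1-5)
    books = pd.read_csv("books.csv")      # book_id, goodreads IDs, authors, title, ...

    # Book IDs are contiguous (1-10000), so a plain merge on book_id suffices.
    avg = ratings.groupby("book_id")["rating"].agg(["mean", "count"]).reset_index()
    top = avg.merge(books[["book_id", "title"]], on="book_id").nlargest(10, "mean")
    print(top[["title", "mean", "count"]])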

Popular Movies of TMDb

Creators: Mondal, Sankha Subhra
Publication Date: 2020

This dataset of the 10,000 most popular movies across the world was fetched through TMDb's read API. TMDb's free API lets developers programmatically fetch and use TMDb's data. The API is free to use as long as you attribute TMDb as the source of the data and/or images; note that TMDb updates its API from time to time (see the fetch sketch after the attribute list). The dataset is 3.2 MB in size and offers valuable insights into global cinematic trends and preferences.

Each movie entry in the dataset includes the following attributes:

  • title: The name of the movie.
  • overview: A brief summary of the movie’s plot.
  • original_language: The language in which the movie was originally produced.
  • vote_average: The average user rating of the movie on TMDb.
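A hedged sketch of fetching one page of popular movies: the endpoint and parameters below follow TMDb's documented v3 "popular" route, but verify them against the current API docs and supply your own key.

    import requests

    API_KEY = "YOUR_TMDB_API_KEY"  # obtain from your TMDb account

    resp = requests.get(
        "https://api.themoviedb.org/3/movie/popular",
        params={"api_key": API_KEY, "language": "en-US", "page": 1},
        timeout=10,
    )
    resp.raise_for_status()

    # Each result carries the attributes listed above, among others.
    for movie in resp.json()["results"]:
        print(f'{movie["title"]} ({movie["original_language"]}): {movie["vote_average"]}')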

World Happiness Report

Creators: Helliwell, John F.; Layard, Richard; Sachs, Jeffrey D.; De Neve, Jan-Emmanuel; Aknin, Lara B.; Wang, Shun
Publication Date: 2012

The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder, with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country whose values equal the world's lowest national averages for each of the six factors. These factors have no impact on the total score reported for each country, but they do explain why some countries rank higher than others (a numerical check of this decomposition follows below). The dataset is 80.86 kB in size.
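The decomposition can be checked numerically: the six factor columns plus the Dystopia-plus-residual term should reconstruct the happiness score. The sketch below assumes the column and file names used in common CSV releases of the report; adjust them to your copy.

    import pandas as pd

    df = pd.read_csv("world_happiness.csv")  # assumed file name

    # Assumed column names for the six factor contributions.
    factors = [
        "Economy (GDP per Capita)", "Family", "Health (Life Expectancy)",
        "Freedom", "Trust (Government Corruption)", "Generosity",
    ]

    # Factor contributions plus the Dystopia residual should sum to the score.
    reconstructed = df[factors].sum(axis=1) + df["Dystopia Residual"]
    print((reconstructed - df["Happiness Score"]).abs().max())  # ~0 up to rounding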

Video Game Sales

Creators: Smith, Gregory
Publication Date: 2016

This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com. The dataset is 1.36 MB in size and includes games released up to 2016, offering a historical perspective on video game sales over several decades. It allows for in-depth analysis of sales trends across regions, platforms, and genres, making it a valuable resource for market analysis and strategic planning within the video game industry. Each entry includes the following attributes (a short aggregation sketch follows the list):

  • Rank: Overall sales ranking of the game.
  • Name: Title of the game.
  • Platform: The platform on which the game was released (e.g., PC, PS4).
  • Year: Year of the game’s release.
  • Genre: Genre classification of the game.
  • Publisher: Company that published the game.
  • NA_Sales: Sales figures in North America (in millions).
  • EU_Sales: Sales figures in Europe (in millions).
  • JP_Sales: Sales figures in Japan (in millions).
  • Other_Sales: Sales figures in the rest of the world (in millions).
  • Global_Sales: Total worldwide sales (in millions).
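A short aggregation sketch, assuming the conventional file name vgsales.csv:

    import pandas as pd

    games = pd.read_csv("vgsales.csv")  # assumed file name

    # Global_Sales should roughly equal the four regional columns summed;
    # small gaps remain because the published figures are rounded.
    regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
    gap = (games[regions].sum(axis=1) - games["Global_Sales"]).abs()
    print(f"max rounding gap: {gap.max():.2f} million")

    # Example analysis: total global sales per genre.
    print(games.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False))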

Heart Disease Data Set

Creators: Janosi, Andras; Steinbrunn, William; Pfisterer, Matthias; Detrano, Robert
Publication Date: 1988

The Heart Disease database is a well-regarded resource in the medical research community, particularly for studies related to cardiovascular conditions. It comprises data from four distinct databases: the Cleveland Clinic Foundation, the Hungarian Institute of Cardiology in Budapest, the V.A. Medical Center in Long Beach, California, and the University Hospital in Zurich, Switzerland. Each of these databases contains patient records with various medical attributes, totaling 76 features. However, most research has focused on a subset of 14 key attributes to diagnose the presence of heart disease. The dataset is relatively small, with each database containing a few hundred records; the Cleveland database, for example, includes 303 instances. Given the number of attributes and instances, the dataset's size is minimal, making it easily manageable for analysis without requiring significant storage resources. The data was collected over several years, primarily during the 1980s.

Each patient record in the dataset includes the following 14 attributes commonly used in research (a loading sketch follows the list):

  • Age: Age of the patient in years.
  • Sex: Gender of the patient (1 = male; 0 = female).
  • Chest Pain Type (cp): Categorical variable indicating the type of chest pain experienced, with values ranging from 0 to 3.
  • Resting Blood Pressure (trestbps): Resting blood pressure in mm Hg upon hospital admission.
  • Serum Cholesterol (chol): Serum cholesterol level in mg/dl.
  • Fasting Blood Sugar (fbs): Binary variable indicating if fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false).
  • Resting Electrocardiographic Results (restecg): Categorical variable with values 0 to 2 indicating ECG results.
  • Maximum Heart Rate Achieved (thalach): Maximum heart rate achieved during exercise.
  • Exercise-Induced Angina (exang): Binary variable indicating if exercise-induced angina occurred (1 = yes; 0 = no).
  • ST Depression (oldpeak): ST depression induced by exercise relative to rest.
  • Slope of the Peak Exercise ST Segment (slope): Categorical variable with values 0 to 2.
  • Number of Major Vessels Colored by Fluoroscopy (ca): Integer value ranging from 0 to 3.
  • Thalassemia (thal): Categorical variable indicating blood disorder status (3 = normal; 6 = fixed defect; 7 = reversible defect).
  • Diagnosis of Heart Disease (target): Integer value ranging from 0 to 4, indicating the presence and severity of heart disease.
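A loading sketch for the Cleveland records, using the 14 attributes above in their standard order; the UCI distribution is comma-separated and marks missing values with "?":

    import pandas as pd

    cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

    # "processed.cleveland.data" is the standard UCI file for the Cleveland subset.
    heart = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

    # Many studies binarize the 0-4 target into absence (0) vs. presence (1-4).
    heart["disease"] = (heart["target"] > 0).astype(int)
    print(heart["disease"].value_counts())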

Credit Card Fraud Detection

Creators: Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles)
Publication Date: 2016

The Credit Card Fraud Detection dataset is a rich collection of credit card transactions made by European cardholders in September 2013. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The dataset is 0.15 GB in size.

The data has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

Each transaction record in the dataset includes several features (an imbalance-aware training sketch follows the list):

  • Time: The number of seconds elapsed between this transaction and the first transaction in the dataset.

  • V1 to V28: These are the result of a Principal Component Analysis (PCA) transformation applied to the original features to protect sensitive information.

  • Amount: The monetary value of the transaction.

  • Class: A binary indicator where ‘1’ signifies a fraudulent transaction and ‘0’ denotes a legitimate one.
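Given the 0.172% positive rate, the sketch below shows one reasonable way to train on this data with scikit-learn; the file name creditcard.csv is an assumption, and class weighting plus precision-recall AUC is a common choice, not the dataset authors' method.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("creditcard.csv")  # assumed file name
    X, y = df.drop(columns=["Class"]), df["Class"]
    print(f"fraud rate: {y.mean():.5f}")  # about 0.00172

    # Stratify so the rare positive class appears in both splits.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )

    # class_weight="balanced" reweights the loss against the imbalance;
    # area under the precision-recall curve is more informative than accuracy here.
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    clf.fit(X_tr, y_tr)
    print(average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))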
