Credit Card Fraud Detection

Creators: Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles)
Publication Date: 2016

The Credit Card Fraud Detection dataset is a collection of credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days and contains 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The dataset is 0.15 GB in size.

The data has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

Each transaction record in the dataset includes several features:

  • Time: The number of seconds elapsed between this transaction and the first transaction in the dataset.

  • V1 to V28: These are the result of a Principal Component Analysis (PCA) transformation applied to the original features to protect sensitive information.

  • Amount: The monetary value of the transaction.

  • Class: A binary indicator where ‘1’ signifies a fraudulent transaction and ‘0’ denotes a legitimate one.
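The imbalance described for this dataset can be checked with a few lines of arithmetic. The sketch below uses only the two counts given in the description (492 frauds, 284,807 transactions). The function name `class_balance` and the inverse-frequency weighting are illustrative assumptions, not part of the dataset or its documentation.

```python
def class_balance(n_positive: int, n_total: int) -> dict:
    """Summarise a binary class imbalance from raw counts."""
    n_negative = n_total - n_positive
    return {
        # Share of the positive (fraud) class, in percent.
        "positive_rate_pct": 100 * n_positive / n_total,
        # Accuracy of a trivial model that labels every transaction legitimate.
        "majority_baseline_accuracy": n_negative / n_total,
        # Inverse-frequency class weights (one common heuristic, assumed here).
        "weight_positive": n_total / (2 * n_positive),
        "weight_negative": n_total / (2 * n_negative),
    }


stats = class_balance(n_positive=492, n_total=284_807)
print(f"fraud share: {stats['positive_rate_pct']:.4f}%")
print(f"always-legitimate baseline accuracy: {stats['majority_baseline_accuracy']:.4%}")
```

Because a model that never flags fraud is already correct on more than 99.8% of transactions, plain accuracy says little here; precision, recall, or the area under the precision-recall curve are usually more informative.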

Heart Disease Data Set

Creators: Janosi, Andras; Steinbrunn, William; Pfisterer, Matthias; Detrano, Robert
Publication Date: 1988

The Heart Disease database is a well-regarded resource in the medical research community, particularly for studies of cardiovascular conditions. It comprises data from four distinct databases: the Cleveland Clinic Foundation, the Hungarian Institute of Cardiology in Budapest, the V.A. Medical Center in Long Beach, California, and the University Hospital in Zurich, Switzerland. Each of these databases contains patient records with various medical attributes, totaling 76 features. However, most research has focused on a subset of 14 key attributes to diagnose the presence of heart disease.

The dataset is relatively small, with each database containing a few hundred records. For example, the Cleveland database includes 303 instances. Its small size makes it easy to analyze without significant storage resources. The data was collected over several years, primarily during the 1980s.

Each patient record in the dataset includes the following 14 attributes commonly used in research:

  • Age: Age of the patient in years.
  • Sex: Gender of the patient (1 = male; 0 = female).
  • Chest Pain Type (cp): Categorical variable indicating the type of chest pain experienced, with values ranging from 0 to 3.
  • Resting Blood Pressure (trestbps): Resting blood pressure in mm Hg upon hospital admission.
  • Serum Cholesterol (chol): Serum cholesterol level in mg/dl.
  • Fasting Blood Sugar (fbs): Binary variable indicating if fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false).
  • Resting Electrocardiographic Results (restecg): Categorical variable with values 0 to 2 indicating ECG results.
  • Maximum Heart Rate Achieved (thalach): Maximum heart rate achieved during exercise.
  • Exercise-Induced Angina (exang): Binary variable indicating if exercise-induced angina occurred (1 = yes; 0 = no).
  • ST Depression (oldpeak): ST depression induced by exercise relative to rest.
  • Slope of the Peak Exercise ST Segment (slope): Categorical variable with values 0 to 2.
  • Number of Major Vessels Colored by Fluoroscopy (ca): Integer value ranging from 0 to 3.
  • Thalassemia (thal): Categorical variable indicating blood disorder status (3 = normal; 6 = fixed defect; 7 = reversible defect).
  • Diagnosis of Heart Disease (target): Integer value ranging from 0 to 4, indicating the presence and severity of heart disease.
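Analyses of this dataset often reduce the 0 to 4 target to a binary label (0 = no disease, 1 to 4 = disease present). A minimal sketch of that reduction follows; the function name `binarize_target` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def binarize_target(values: Iterable[int]) -> List[int]:
    """Map the 0-4 diagnosis value to 0 (absent) or 1 (present)."""
    labels = []
    for value in values:
        if not 0 <= value <= 4:
            raise ValueError(f"target must be between 0 and 4, got {value}")
        labels.append(1 if value > 0 else 0)
    return labels


# Hypothetical target values, not taken from the dataset.
sample_targets = [0, 2, 1, 0, 4, 3]
binary_labels = binarize_target(sample_targets)
print(binary_labels)  # [0, 1, 1, 0, 1, 1]
```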

FilmTV movies dataset

Creators: (Leone, Stefano)
Publication Date: 2018

The FilmTV movies dataset serves as a resource for researchers, data analysts, and movie enthusiasts interested in exploring various aspects of cinema. With data spanning over a century, the dataset provides a broad temporal view of film trends, genre popularity, and audience reception. Movie data is available on websites such as IMDb, with average votes, vote counts, reviews and descriptions. While IMDb is the most trustworthy source for data, other websites such as FilmTV can show how users from different countries rate movies compared to each other. The dataset is 0.11 GB in size.

Each row represents a movie available on FilmTV.it, with the original title, year, genre, duration, country, director, actors, average vote and votes.
The English version of the file contains 37,711 movies and 19 attributes, while the Italian version contains one extra attribute for the local title used when the movie was released in Italy.

The data set includes movies from: 1897 – 2023. Data has been scraped from the publicly available website https://www.filmtv.it as of 2023-10-21.

Food.com Recipes and Interactions

Creators: Shuyang, Li
Publication Date: 2019

The Food.com Recipes and Interactions dataset is a large-scale collection of culinary data, comprising over 180,000 recipes and 700,000 user reviews spanning an 18-year period. Compiled by Shuyang Li and published in 2019, this dataset provides a source of information for studying user interactions, culinary trends, and recipe recommendation systems. It originates from Food.com (formerly GeniusKitchen), one of the largest online recipe-sharing platforms, making it a useful resource for researchers and practitioners in food science, natural language processing, and user behavior analysis. The dataset is 0.89 GB in size and consists of two primary components: recipe data and user interaction data.

The recipe data contains structured information about each recipe, including the recipe ID, name tokens (a tokenized version of the recipe title), ingredient tokens, steps tokens (instructions in tokenized form), cooking techniques, caloric level, and ingredient IDs corresponding to specific ingredients. These features allow for analysis of how recipes are structured, categorized, and consumed over time.

The user interaction data captures how users interact with recipes. It includes the user ID, a list of recipes reviewed, the number of items reviewed, the ratings assigned, and the total number of ratings provided by each user. This structure enables research into user preference modeling, recipe popularity trends, and the development of personalized recipe recommendation systems.

Rotten Tomatoes movies and critic reviews dataset

Creators: (Leone, Stefano)
Publication Date: 2020

The Rotten Tomatoes Movies and Critic Reviews dataset is a collection of information scraped from the Rotten Tomatoes website as of October 31, 2020. It encompasses data on over 17,000 movies, including details such as movie titles, descriptions, genres, durations, directors, and actors, as well as user and critic ratings. A distinctive feature of this dataset is that it supports comparisons between audience scores (ratings from regular users) and tomatometer scores (ratings from certified critics), showing how the two groups differ in their views of a film. In the movies dataset, each record represents a movie available on Rotten Tomatoes, with the URL used for the scraping, movie title, description, genres, duration, director, actors, users’ ratings, and critics’ ratings.
In the critics dataset, each record represents a critic review published on Rotten Tomatoes, with the URL used for the scraping, critic name, review publication, date, score, and content.

Rotten Tomatoes makes it possible to compare the ratings given by regular users (audience score) with the ratings and reviews provided by critics (tomatometer) who are certified members of various writing guilds or film critic associations. The dataset is 0.23 GB in size.

The dataset is structured into two main components:

  1. Movies Dataset: Each record represents a movie available on Rotten Tomatoes, containing fields such as:

    • rotten_tomatoes_link: The specific URL from which the movie data was scraped.
    • movie_title: The title of the movie as displayed on the Rotten Tomatoes website.
    • movie_info: A brief description of the movie.
    • genres: The genres associated with the movie, separated by commas if multiple.
    • original_release_date: The date on which the movie was originally released.
    • content_rating: The category indicating the movie’s suitability for different audiences.
    • critics_consensus: Comments from Rotten Tomatoes summarizing critics’ opinions.
  2. Critics Dataset: Each record represents a critic’s review published on Rotten Tomatoes, including details such as:

    • critic_name: The name of the critic who reviewed the movie.
    • top_critic: A boolean value indicating whether the critic is classified as a top critic.
    • publisher_name: The name of the publication for which the critic works.
    • review_type: Specifies whether the review was labeled as ‘fresh’ or ‘rotten’.
    • review_score: The score provided by the critic for the movie.
    • review_date: The date when the review was published.
    • review_content: The content of the review.

US Funds dataset from Yahoo Finance

Creators: (Leone, Stefano)
Publication Date: 2018

The US Funds dataset from Yahoo Finance collects data on 24,821 mutual funds and 1,680 exchange-traded funds (ETFs). It contains detailed information on various aspects of each fund, including general characteristics, portfolio indicators, returns, and financial ratios. Because it covers both mutual funds and ETFs, it is well suited to comparative analyses and investment research. The dataset was first published in 2018 and contains data up to November 2020. In total, it is 1.7 GB in size.

The dataset includes various variables for each fund, such as:

  • fund_symbol: Symbol of the ETF.
  • price_date: Date of the price (in YYYY-MM-DD format).
  • open: Open daily price.
  • high: Highest daily price.
  • low: Lowest daily price.
  • close: Close daily price.
  • adj_close: Adjusted close daily price, which considers elements that have impacted the price such as share splits, dividends, etc.
  • volume: Daily traded volume.
  • nav_per_share: Daily Net Asset Value (NAV) per share.
  • region: Name of the region in which the fund has the domicile.
  • initial_investment: Minimum amount for initial investment.
  • subsequent_investment: Minimum amount for subsequent investments.
  • exchange_code: Code of the exchange where the fund is traded.
  • exchange_name: Name of the exchange where the fund is traded.

Twitter US Airline Sentiment

Creators: (Makone, Ashutosh)
Publication Date: 2016

The Twitter US Airline Sentiment dataset is a collection of tweets for analyzing public sentiment toward major U.S. airlines. It consists of 14,640 tweets from 7,700 unique users, collected over a one-month period in February 2015, and so offers a snapshot of public sentiment during that specific timeframe. It is a useful resource for sentiment analysis and natural language processing research, particularly for understanding customer satisfaction, airline service quality, and issues reported by travelers.

Each tweet in the dataset is labeled with one of three sentiment categories: positive, neutral, or negative. Tweets labeled as negative are further categorized by reason, such as late flight, customer service issue, canceled flight, and lost luggage, which gives more detail on common complaints. The dataset also identifies the airline mentioned in each tweet, covering six major U.S. carriers: United Airlines, US Airways, American Airlines, Southwest Airlines, Delta Air Lines, and Virgin America. Additional metadata is provided for each tweet, including tweet ID, tweet text, tweet coordinates (if available), user information, and location data, allowing for further contextual analysis.

The dataset is relatively small, with a total size of 8.46 MB, making it easy to handle in sentiment analysis tasks and machine learning applications.

MovieTweetings

Creators: Dooms, Simon; De Pessemier, Toon; Martens, Luc
Publication Date: 2013

MovieTweetings is a dataset consisting of movie ratings that were contained in well-structured tweets on Twitter. The goal of this dataset is to provide the RecSys community with a live, natural and always up-to-date movie ratings dataset. The dataset has been actively collecting ratings since February 28, 2013, and will be updated as much as possible to incorporate rating data from the newest tweets available. The dataset includes 921,398 ratings from 71,707 unique users. The ratings contained in the tweets are scaled from 0 to 10, as is the norm on the IMDb platform. In total, the dataset has a size of 26.2 MB and consists of two main files:

  • ratings.dat contains extracted ratings, structured as:
    user_id::movie_id::rating::rating_timestamp

    • user_id: Unique identifier for the user.
    • movie_id: IMDb identifier for the movie.
    • rating: User’s score on a 10-star scale.
    • rating_timestamp: Unix timestamp when the rating was extracted.
  • items.dat includes metadata about the rated movies, structured as:
    movie_id::movie_title (movie_year)::genre|genre|genre

    • movie_id: IMDb identifier for the movie.
    • movie_title: Name of the movie along with the release year.
    • genre: Pipe-separated list of genres.
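The two `::`-delimited layouts above can be read with plain string splitting. The following is a minimal sketch under stated assumptions: the names `Rating`, `Item`, `parse_rating` and `parse_item` are hypothetical, the sample lines are invented for illustration, and the movie ID is kept as text in case the IDs carry leading zeros.

```python
import re
from typing import List, NamedTuple, Optional


class Rating(NamedTuple):
    user_id: int
    movie_id: str  # kept as text so any leading zeros survive
    rating: int
    rating_timestamp: int  # Unix timestamp


class Item(NamedTuple):
    movie_id: str
    title: str
    year: Optional[int]
    genres: List[str]


# Matches "Some Title (1994)" and captures the title and the four-digit year.
_TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)$")


def parse_rating(line: str) -> Rating:
    """Parse one line of user_id::movie_id::rating::rating_timestamp."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return Rating(int(user_id), movie_id, int(rating), int(timestamp))


def parse_item(line: str) -> Item:
    """Parse one line of movie_id::movie_title (movie_year)::genre|genre."""
    movie_id, rest = line.rstrip("\n").split("::", 1)
    # Split from the right so a "::" inside a title would not break parsing.
    title_part, genre_part = rest.rsplit("::", 1)
    match = _TITLE_YEAR.match(title_part)
    if match:
        title, year = match.group("title"), int(match.group("year"))
    else:
        title, year = title_part, None
    genres = [g for g in genre_part.split("|") if g]
    return Item(movie_id, title, year, genres)


# Hypothetical sample lines in the documented layout.
rating = parse_rating("1::0111161::10::1362062838")
item = parse_item("0111161::The Shawshank Redemption (1994)::Crime|Drama")
print(rating)
print(item)
```

Joining the two parsed record types on `movie_id` gives each rating its title, year and genre list.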
