Goodreads-books

Creators: Zając, Zygmunt
Publication Date: 2019

This dataset was created primarily to meet the need for a good, clean dataset of books. It contains features such as book titles, authors, average ratings, ISBN identifiers, language codes, number of pages, ratings count, text reviews count, publication dates, and publishers. It supports a wide range of book-related analyses, such as trends in book popularity, author influence, and reader preferences. The dataset is 1.56 MB in size and was scraped via the Goodreads API. It contains over 10,000 observations, each representing a unique book entry. The structure is straightforward: a single CSV file with the following key columns:

  • bookID: A unique identification number for each book.
  • title: The official title of the book.
  • authors: Names of the authors, with multiple authors separated by a delimiter.
  • average_rating: The average user rating for the book.
  • isbn & isbn13: The 10-digit and 13-digit International Standard Book Numbers, respectively.
  • language_code: The primary language in which the book is published (e.g., ‘eng’ for English).
  • num_pages: The total number of pages in the book.
  • ratings_count: The total number of ratings the book has received from users.
  • text_reviews_count: The total number of text reviews written by users.
  • publication_date: The original publication date of the book.
  • publisher: The name of the publishing house.
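As a rough illustration of how the columns above might be read, the following sketch parses a few of them with Python's standard `csv` module. The sample rows are made up, and the `/` author delimiter is an assumption (the listing only says "a delimiter"); `load_books` is a hypothetical helper, not part of the dataset.

```python
import csv
import io


def load_books(handle, author_sep="/"):
    """Parse book rows from a CSV file handle into dictionaries with typed fields."""
    books = []
    for row in csv.DictReader(handle):
        books.append({
            "bookID": int(row["bookID"]),
            "title": row["title"],
            # Assumption: multiple authors are joined with `author_sep`
            "authors": [a.strip() for a in row["authors"].split(author_sep)],
            "average_rating": float(row["average_rating"]),
            "num_pages": int(row["num_pages"]),
            "ratings_count": int(row["ratings_count"]),
        })
    return books


# Tiny made-up sample using a subset of the listed columns
sample = io.StringIO(
    "bookID,title,authors,average_rating,num_pages,ratings_count\n"
    "1,Example Book,Author One/Author Two,4.25,320,1500\n"
    "2,Another Book,Author Three,3.80,210,40\n"
)
books = load_books(sample)

# Pick the highest-rated book in the sample
top = max(books, key=lambda b: b["average_rating"])
print(top["title"], top["authors"])
```

With the real file, the same function could be called on an opened file handle instead of the in-memory sample; the remaining columns would be parsed the same way.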

FilmTV movies dataset

Creators: Leone, Stefano
Publication Date: 2018

The FilmTV movies dataset is a resource for researchers, data analysts, and movie enthusiasts interested in exploring various aspects of cinema. With data spanning over a century, the dataset provides a broad temporal view of film trends, genre popularity, and audience reception. Movie data are available on websites such as IMDb, with average votes, vote counts, reviews, and descriptions. While IMDb is the most trustworthy source for such data, other websites such as FilmTV show how users from different countries rate the same movies. The dataset is 0.11 GB in size.

Each row represents a movie available on FilmTV.it, with the original title, year, genre, duration, country, director, actors, average vote, and number of votes.
The English version of the file contains 37,711 movies and 19 attributes, while the Italian version contains one extra attribute for the local title used when the movie was released in Italy.

The dataset includes movies from 1897 to 2023. The data were scraped from the publicly available website https://www.filmtv.it as of 2023-10-21.

Twitter US Airline Sentiment

Creators: Makone, Ashutosh
Publication Date: 2016

The Twitter US Airline Sentiment dataset is a collection of tweets for analyzing public sentiment toward major U.S. airlines. It consists of 14,640 tweets from 7,700 unique users, collected over a one-month period in February 2015, and so offers a snapshot of public sentiment during that specific timeframe. It serves as a resource for sentiment analysis and natural language processing research, particularly for understanding customer satisfaction, airline service quality, and issues reported by travelers. Each tweet is labeled with one of three sentiment categories: positive, neutral, or negative. Negative tweets are further categorized by reason, such as late flight, customer service issue, canceled flight, and lost luggage, which gives deeper insight into common complaints. The dataset also identifies the airline mentioned in each tweet, covering six major U.S. carriers: United Airlines, US Airways, American Airlines, Southwest Airlines, Delta Air Lines, and Virgin America. Additional metadata is provided for each tweet, including tweet ID, tweet text, tweet coordinates (if available), user information, and location data, allowing for further contextual analysis. The dataset is relatively small, with a total size of 8.46 MB, making it easily manageable for sentiment analysis tasks and machine learning applications.
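To make the labeling scheme concrete, here is a minimal sketch that computes the share of each sentiment class per airline. The field names `airline` and `sentiment` and the four rows are hypothetical placeholders for illustration; the actual column names in the file may differ.

```python
from collections import Counter


def sentiment_share(rows):
    """Return, per airline, the fraction of tweets in each sentiment class."""
    counts = {}
    for row in rows:
        counts.setdefault(row["airline"], Counter())[row["sentiment"]] += 1
    return {
        airline: {label: n / sum(c.values()) for label, n in c.items()}
        for airline, c in counts.items()
    }


# Made-up rows; field names are assumptions, not the dataset's schema
rows = [
    {"airline": "United", "sentiment": "negative"},
    {"airline": "United", "sentiment": "negative"},
    {"airline": "United", "sentiment": "neutral"},
    {"airline": "United", "sentiment": "positive"},
    {"airline": "Delta", "sentiment": "positive"},
]
shares = sentiment_share(rows)
print(shares["United"])
```

The same grouping could be applied to the negative-reason labels to see which complaints dominate for each carrier.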

Food.com Recipe & Review Data

Creators: Majumder, Bodhisattwa P.; Li, Shuyang; Ni, Jianmo; McAuley, Julian
Publication Date: 2019
This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen), an online recipe aggregator. This extensive collection allows for in-depth analysis of culinary trends, user preferences, and recipe characteristics over nearly two decades. The dataset is 0.85 GB in size and contains three sets of data from Food.com:

Interaction splits

  • interactions_test.csv
  • interactions_validation.csv
  • interactions_train.csv

Preprocessed data for result reproduction

In this format, the recipe text metadata is tokenized with the GPT subword tokenizer, with special tokens (start-of-step, etc.) added.

  • PP_recipes.csv
  • PP_users.csv

To run our code off-the-shelf, these files must be in pickle format; you may use pandas.read_csv and pandas.to_pickle to convert the CSVs.
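The conversion described above might look like the following sketch. It uses `pandas.read_csv` and `pandas.to_pickle` as named in the text; the helper `csv_to_pickle`, the `.pkl` output name, and the two made-up columns are assumptions, and the exact pickle layout expected by the authors' code is not specified here.

```python
import os
import tempfile

import pandas as pd


def csv_to_pickle(csv_path, pickle_path):
    """Read a CSV into a DataFrame and write it back out as a pickle file."""
    df = pd.read_csv(csv_path)
    pd.to_pickle(df, pickle_path)
    return df


# Demonstration on a tiny stand-in file with hypothetical columns
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "PP_users.csv")
    pkl_path = os.path.join(tmp, "PP_users.pkl")
    pd.DataFrame({"u": [0, 1], "n_items": [3, 5]}).to_csv(csv_path, index=False)

    df = csv_to_pickle(csv_path, pkl_path)
    restored = pd.read_pickle(pkl_path)  # round-trip check

print(restored.equals(df))
```

For the real files, the same helper would be called once for PP_recipes.csv and once for PP_users.csv.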

 

Advertisement CTR Prediction Data

Creators: Huawei
Publication Date: 2020

Advertisement click-through rate (CTR) prediction is a key problem in computational advertising. Increasing the accuracy of advertisement CTR prediction is critical to improving the effectiveness of precision marketing. In this competition, we release large, anonymized advertising datasets. Based on these datasets, contestants are required to build advertisement CTR prediction models. The aim of the event is to find talented individuals to promote the development of advertisement CTR prediction algorithms. The datasets contain advertising behavior data collected over seven consecutive days and comprise a training dataset and a testing dataset. Their total size is 6.86 GB. They contain millions of observations, with multiple variables capturing different aspects of user-ad interactions. These variables include user identifiers, ad identifiers, timestamps, user behavior features, and ad content features, allowing researchers to analyze engagement patterns and develop predictive models for ad click-through rates. The data are useful for improving advertising strategies and refining targeted marketing approaches.
