internet

Showing 1-5 of 5 results

Goodreads-books

Publication Date: 2019
Creators: Zając, Zygmunt

This dataset was created primarily to meet the need for a good, clean dataset of books. It contains book names, authors, ratings, and review counts. The dataset is 1.56 MB in size and was scraped via the Goodreads API.

FilmTV movies dataset

Publication Date: 2018
Creators: Leone, Stefano

Movie data are available on websites such as IMDb, with average votes, vote counts, reviews, and descriptions. While IMDb is the most trustworthy source for data, other websites such as FilmTV.it can show how users from different countries rate the movies compared to each other. The dataset is 0.11 GB in size.

Each row represents a movie available on FilmTV.it, with the original title, year, genre, duration, country, director, actors, average vote and votes.
The English version of the file contains 37,711 movies and 19 attributes, while the Italian version contains one extra attribute for the local title used when the movie was released in Italy.

The dataset includes movies from 1897 to 2023. The data was scraped from the publicly available website https://www.filmtv.it as of 2023-10-21.

Twitter US Airline Sentiment

Publication Date: 2016
Creators: Makone, Ashutosh

A sentiment analysis task on the problems of each major U.S. airline. Twitter data from February 2015 was scraped, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the reasons behind negative tweets (such as “late flight” or “rude service”).

Food.com Recipe & Review Data

Publication Date: 2019
Creators: Majumder, Bodhisattwa P.; Li, Shuyang; Ni, Jianmo; McAuley, Julian

This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen), an online recipe aggregator. This dataset contains three sets of data from Food.com:

Interaction splits

  • interactions_test.csv
  • interactions_validation.csv
  • interactions_train.csv

Preprocessed data for result reproduction

In this format, the recipe text metadata is tokenized with the GPT subword tokenizer, using special tokens such as start-of-step.

  • PP_recipes.csv
  • PP_users.csv

To convert these files into the pickle format required to run our code off-the-shelf, use pandas.read_csv to load each CSV and pandas.to_pickle to write it out as a pickle file.
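The conversion described above can be sketched roughly as follows. This is only a minimal illustration: the helper name csv_to_pickle and the choice of a ".pkl" suffix for the output are assumptions, since the listing does not state which pickle filenames the authors' code expects.

```python
from pathlib import Path

import pandas as pd


def csv_to_pickle(csv_path, pickle_path=None):
    """Load a CSV with pandas.read_csv and save it with pandas.to_pickle.

    Hypothetical helper. If no output path is given, the pickle is written
    next to the CSV with a ".pkl" suffix (an assumed naming convention).
    """
    csv_path = Path(csv_path)
    if pickle_path is None:
        pickle_path = csv_path.with_suffix(".pkl")
    pickle_path = Path(pickle_path)

    df = pd.read_csv(csv_path)      # parse the CSV into a DataFrame
    pd.to_pickle(df, pickle_path)   # serialize the DataFrame to disk
    return pickle_path


if __name__ == "__main__":
    # File names taken from the listing; files not present are skipped.
    for name in ("PP_recipes.csv", "PP_users.csv"):
        if Path(name).exists():
            print("wrote", csv_to_pickle(name))
```

Columns that hold tokenized lists will most likely be read back as plain strings by pandas.read_csv, so they may need to be parsed before pickling, depending on what the downstream code expects.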

Advertisement CTR Prediction Data

Publication Date: 2020
Creators: Huawei

Advertisement click-through rate (CTR) prediction is a key problem in computational advertising. Improving its accuracy is critical to the effectiveness of precision marketing. In this competition, we release large, anonymized advertising datasets, on which contestants are required to build Advertisement CTR prediction models. The aim of the event is to find talented individuals to promote the development of Advertisement CTR prediction algorithms. The datasets contain advertising behavior data collected over seven consecutive days and include a training dataset and a testing dataset.
