
Food.com Recipes and Interactions

Creators: Shuyang, Li
Publication Date: 2019

The Food.com Recipes and Interactions dataset is a large-scale collection of culinary data, comprising over 180,000 recipes and 700,000 user reviews spanning an 18-year period. Compiled by Shuyang Li and published in 2019, it supports research on user interactions, culinary trends, and recipe recommendation systems. It originates from Food.com (formerly GeniusKitchen), one of the largest online recipe-sharing platforms, which makes it useful to researchers and practitioners in food science, natural language processing, and user behavior analysis. The dataset is 0.89 GB in size and consists of two primary components: recipe data and user interaction data. The recipe data contains structured information about each recipe: the recipe ID, name tokens (a tokenized version of the recipe title), ingredient tokens, step tokens (instructions in tokenized form), cooking techniques, caloric level, and the IDs of the ingredients used. These features support analysis of how recipes are structured, categorized, and consumed over time. The user interaction data records how users engage with recipes: the user ID, the list of recipes reviewed, the number of items reviewed, the ratings assigned, and the total number of ratings provided by each user. This structure enables research into user preference modeling, recipe popularity trends, and personalized recipe recommendation systems.

Rotten Tomatoes movies and critic reviews dataset

Creators: (Leone, Stefano)
Publication Date: 2020

The Rotten Tomatoes Movies and Critic Reviews dataset is a collection of information scraped from the Rotten Tomatoes website as of October 31, 2020. It covers over 17,000 movies, with details such as movie titles, descriptions, genres, durations, directors, and actors, as well as user and critic ratings. A distinctive feature of this dataset is that it allows comparisons between audience scores (ratings given by regular users) and tomatometer scores (ratings and reviews given by critics who are certified members of various writing guilds or film critic associations), showing how the two groups’ views of a film differ. The dataset is 0.23 GB in size.

In the movies dataset, each record represents a movie available on Rotten Tomatoes, with the URL used for the scraping, movie title, description, genres, duration, director, actors, users’ ratings, and critics’ ratings.
In the critics dataset, each record represents a critic review published on Rotten Tomatoes, with the URL used for the scraping, critic name, review publication, date, score, and content.

The dataset is structured into two main components:

  1. Movies Dataset: Each record represents a movie available on Rotten Tomatoes, containing fields such as:

    • rotten_tomatoes_link: The specific URL from which the movie data was scraped.
    • movie_title: The title of the movie as displayed on the Rotten Tomatoes website.
    • movie_info: A brief description of the movie.
    • genres: The genres associated with the movie, separated by commas if multiple.
    • original_release_date: The date on which the movie was originally released.
    • content_rating: The category indicating the movie’s suitability for different audiences.
    • critics_consensus: Comments from Rotten Tomatoes summarizing critics’ opinions.
  2. Critics Dataset: Each record represents a critic’s review published on Rotten Tomatoes, including details such as:

    • critic_name: The name of the critic who reviewed the movie.
    • top_critic: A boolean value indicating whether the critic is classified as a top critic.
    • publisher_name: The name of the publication for which the critic works.
    • review_type: Specifies whether the review was labeled as ‘fresh’ or ‘rotten’.
    • review_score: The score provided by the critic for the movie.
    • review_date: The date when the review was published.
    • review_content: The content of the review.
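The field descriptions above can be turned into a small sketch. It splits the comma-separated genres value and computes, per movie, the fraction of critic reviews labeled fresh. It assumes that each critic record also carries rotten_tomatoes_link (the description mentions a scraping URL for reviews, but the field list does not name it), and the sample records are made up for illustration. The fraction computed here is a plain share of reviews and is not claimed to reproduce the site's tomatometer.

```python
from collections import defaultdict


def split_genres(genres_field):
    """Split a comma-separated `genres` value into a clean list."""
    return [g.strip() for g in genres_field.split(",") if g.strip()]


def fresh_share_by_movie(reviews):
    """Return {rotten_tomatoes_link: fraction of reviews labeled fresh}."""
    counts = defaultdict(lambda: [0, 0])  # link -> [fresh count, total count]
    for review in reviews:
        tally = counts[review["rotten_tomatoes_link"]]
        tally[1] += 1
        # Compare case-insensitively, since the exact casing is an assumption.
        if review["review_type"].strip().lower() == "fresh":
            tally[0] += 1
    return {link: fresh / total for link, (fresh, total) in counts.items()}


# Made-up records for illustration only; not taken from the dataset.
sample_reviews = [
    {"rotten_tomatoes_link": "m/example_a", "review_type": "Fresh"},
    {"rotten_tomatoes_link": "m/example_a", "review_type": "Rotten"},
    {"rotten_tomatoes_link": "m/example_b", "review_type": "fresh"},
]

if __name__ == "__main__":
    print(split_genres("Comedy, Drama"))
    print(fresh_share_by_movie(sample_reviews))
```

Grouping by the link rather than by movie_title avoids merging different movies that share a title.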

US Funds dataset from Yahoo Finance

Creators: (Leone, Stefano)
Publication Date: 2018

The US Funds dataset from Yahoo Finance collects data on 24,821 mutual funds and 1,680 exchange-traded funds (ETFs). It contains detailed information on each fund, including general characteristics, portfolio indicators, returns, and financial ratios. A notable feature is its coverage of both mutual funds and ETFs, which is useful for comparative analyses and investment research. The dataset was published in 2018 and contains data up to November 2020, so its temporal coverage spans several years leading up to that point. In total, it is 1.7 GB in size.

The dataset includes various variables for each fund, such as:

  • fund_symbol: Symbol of the ETF.
  • price_date: Date of the price (in YYYY-MM-DD format).
  • open: Open daily price.
  • high: Highest daily price.
  • low: Lowest daily price.
  • close: Close daily price.
  • adj_close: Adjusted close daily price, which considers elements that have impacted the price such as share splits, dividends, etc.
  • volume: Daily traded volume.
  • nav_per_share: Daily Net Asset Value (NAV) per share.
  • region: Name of the region in which the fund has the domicile.
  • initial_investment: Minimum amount for initial investment.
  • subsequent_investment: Minimum amount for subsequent investments.
  • exchange_code: Code of the exchange where the fund is traded.
  • exchange_name: Name of the exchange where the fund is traded.
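As an illustration of how the price variables above might be used, the following sketch computes simple day-over-day returns from adj_close for each fund_symbol. It assumes rows are available as dictionaries keyed by the variable names listed above; the symbol and prices in the sample are made up, and simple returns are one common convention rather than a method documented for this dataset.

```python
from collections import defaultdict
from datetime import date


def daily_returns(rows):
    """Return {fund_symbol: [(price_date, simple return vs. previous row)]}."""
    by_fund = defaultdict(list)
    for row in rows:
        by_fund[row["fund_symbol"]].append(
            # price_date is documented as YYYY-MM-DD, which fromisoformat parses.
            (date.fromisoformat(row["price_date"]), float(row["adj_close"]))
        )

    result = {}
    for symbol, series in by_fund.items():
        series.sort()  # chronological order within each fund
        result[symbol] = [
            (day, price / prev_price - 1.0)
            for (_, prev_price), (day, price) in zip(series, series[1:])
        ]
    return result


# Made-up rows for illustration only; "XYZ" is a hypothetical symbol.
sample_rows = [
    {"fund_symbol": "XYZ", "price_date": "2020-01-03", "adj_close": "110.0"},
    {"fund_symbol": "XYZ", "price_date": "2020-01-02", "adj_close": "100.0"},
    {"fund_symbol": "XYZ", "price_date": "2020-01-06", "adj_close": "99.0"},
]

if __name__ == "__main__":
    for day, ret in daily_returns(sample_rows)["XYZ"]:
        print(day.isoformat(), round(ret, 4))
```

Using adj_close rather than close keeps splits and dividends from showing up as artificial price jumps in the returns.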

 

Twitter US Airline Sentiment

Creators: (Makone, Ashutosh)
Publication Date: 2016

The Twitter US Airline Sentiment dataset is a collection of tweets for analyzing public sentiment toward major U.S. airlines. It consists of 14,640 tweets from 7,700 unique users, directed at several U.S. airlines and collected over a one-month period in February 2015, so it offers a snapshot of public sentiment during that specific timeframe. It is a resource for sentiment analysis and natural language processing research, particularly for understanding customer satisfaction, airline service quality, and issues reported by travelers. Each tweet is labeled with one of three sentiment categories: positive, neutral, or negative. Tweets labeled as negative are further categorized by reason, such as late flight, customer service issue, canceled flight, and lost luggage, which gives insight into common complaints. The dataset also identifies the airline mentioned in each tweet, covering six major U.S. carriers: United Airlines, US Airways, American Airlines, Southwest Airlines, Delta Air Lines, and Virgin America. Additional metadata is provided for each tweet, including tweet ID, tweet text, tweet coordinates (if available), user information, and location data, allowing for further contextual analysis. The dataset is relatively small, with a total size of 8.46 MB, which makes it easy to handle in sentiment analysis tasks and machine learning applications.

MovieTweetings

Creators: Dooms, Simon; De Pessemier, Toon; Martens, Luc
Publication Date: 2013
MovieTweetings is a dataset consisting of movie ratings extracted from well-structured tweets on Twitter. The goal of this dataset is to provide the RecSys community with a live, natural, and always up-to-date movie ratings dataset. The dataset has been actively collecting ratings since February 28, 2013, and will be updated as much as possible to incorporate rating data from the newest tweets available. The dataset includes 921,398 ratings from 71,707 unique users. The ratings contained in the tweets are scaled from 0 to 10, as is the norm on the IMDb platform. In total, the dataset has a size of 26.2 MB and consists of two main files:

  • ratings.dat contains extracted ratings, structured as:
    user_id::movie_id::rating::rating_timestamp

    • user_id: Unique identifier for the user.
    • movie_id: IMDb identifier for the movie.
    • rating: User’s score on a 10-star scale.
    • rating_timestamp: Unix timestamp when the rating was extracted.
  • items.dat includes metadata about the rated movies, structured as:
    movie_id::movie_title (movie_year)::genre|genre|genre

    • movie_id: IMDb identifier for the movie.
    • movie_title: Name of the movie along with the release year.
    • genre: Pipe-separated list of genres.
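The two line formats above can be parsed with a few lines of Python. This is a minimal sketch under stated assumptions: movie_id is kept as a string in case the identifiers have leading zeros, titles are assumed to end in a four-digit year in parentheses, and the sample lines are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class Rating(NamedTuple):
    user_id: int
    movie_id: str  # kept as text; leading zeros are an assumption
    rating: int
    rating_timestamp: int  # Unix timestamp


class Item(NamedTuple):
    movie_id: str
    movie_title: str
    movie_year: Optional[int]
    genres: List[str]


_TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)$")


def parse_rating(line: str) -> Rating:
    """Parse one 'user_id::movie_id::rating::rating_timestamp' line."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return Rating(int(user_id), movie_id, int(rating), int(timestamp))


def parse_item(line: str) -> Item:
    """Parse one 'movie_id::movie_title (movie_year)::genre|genre' line."""
    movie_id, rest = line.rstrip("\n").split("::", 1)
    # Split the genre field off from the right, in case a title contains '::'.
    title_field, genre_field = rest.rsplit("::", 1)
    match = _TITLE_YEAR.match(title_field)
    if match:
        title, year = match.group("title"), int(match.group("year"))
    else:
        title, year = title_field, None
    genres = [g for g in genre_field.split("|") if g]
    return Item(movie_id, title, year, genres)


if __name__ == "__main__":
    # Made-up lines for illustration only.
    print(parse_rating("1::0000001::8::1362062307"))
    print(parse_item("0000001::Example Movie (2013)::Drama|Comedy"))
```

An empty genre field yields an empty list rather than a list containing an empty string.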

Stock return prediction with tweets

Creators: Madhyastha, Pranava; Sowinska, Karolina
Publication Date: 2020

This dataset is designed to analyze the impact of Twitter-based textual information on stock returns. Compiled by researchers Karolina Sowinska and Pranava Madhyastha, this dataset was published in 2020 and is made available under the GNU General Public License v3.0 or later. It provides valuable data for financial analytics and natural language processing, particularly in studying the relationship between social media sentiment and stock market performance. By linking tweets to stock return data, the dataset enables the development of predictive models for stock movement based on public sentiment. The dataset comprises 862,231 labeled tweets, all in English, each associated with specific companies. These tweets serve as samples for analyzing public opinion and sentiment regarding different stocks and financial events. A cleaned subset of 85,176 labeled instances is also included, making the dataset suitable for both large-scale machine learning models and more focused analyses. Each tweet is linked to corresponding stock return data, allowing for a company-level examination of how Twitter sentiment impacts one-day, two-day, three-day, and seven-day stock returns. This structured linkage between tweets and financial performance provides a unique opportunity to study the effects of social media on stock price fluctuations. The dataset is approximately 225 MB in size on GitHub, making it manageable for various analytical tasks, including sentiment analysis, text-based predictive modeling, and financial forecasting. It is structured into two primary components:

  • Tweet Data: This includes the textual content of tweets, user metadata, timestamps, and the companies referenced in each tweet. These features allow researchers to perform sentiment analysis, track user engagement, and examine the frequency of stock-related discussions on social media.

  • Stock Return Data: This includes numerical stock return values corresponding to the companies mentioned in the tweets. The returns are recorded over multiple time intervals, enabling the study of both short-term and long-term price movements in response to social media discussions.

IMDb movies extensive dataset

Creators: (Leone, Stefano)
Publication Date: 2019

The IMDb movies extensive dataset is a resource for researchers, data analysts, film industry professionals, and movie enthusiasts who want to explore various aspects of cinema. It contains detailed metadata on movies, including ratings, cast and crew details, and audience reception, which makes it suitable for studies of film trends, genre popularity, audience preferences, and predictive modeling of movie success. The dataset covers movies up to the year 2019, a broad temporal range that allows for longitudinal studies of film industry trends and developments. In total, it has a size of approximately 230 MB and includes 85,855 movies with attributes such as movie description, average rating, number of votes, and genre. The dataset is structured into multiple components. The ratings dataset includes rating details for the 85,855 movies, broken down from a demographic perspective. The names dataset includes 297,705 cast members with personal attributes such as birth details, death details, height, spouses, and children. The title principals dataset includes 835,513 cast members’ roles in movies, with attributes such as IMDb title ID, IMDb name ID, order of importance in the movie, role, and characters played.

Twitter Dataset

Creators: Cheng, Zhiyuan; Caverlee, James; Lee, Kyumin
Publication Date: 2010
This dataset is a collection of scraped public Twitter updates used in an academic project to study geolocation data related to tweeting. The authors provide both the training set and the test set used in the paper “You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users” (CIKM 2010). The training set contains 115,886 Twitter users and 3,844,612 updates from those users; all user locations are self-labeled, within the United States, at city-level granularity. The test set contains 5,136 Twitter users and 5,156,047 tweets from those users; all user locations were uploaded from the users’ smartphones in the form “UT: Latitude,Longitude”. In total, the dataset has a size of 30.0 kB. The Twitter activity covers a period of five months, from September 2009 to January 2010. Structurally, the dataset is divided into four text files. The training set users file (“training_set_users.txt”) contains user information in the format “UserID\tUserLocation”, and the training set tweets file (“training_set_tweets.txt”) stores tweets in the format “UserID\tTweetID\tTweet\tCreatedAt”, where \t denotes a tab character. The test set users file (“test_set_users.txt”) and the test set tweets file (“test_set_tweets.txt”) follow the same formats as their training set counterparts.
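A minimal sketch of reading these files follows, assuming the fields in each line are separated by a tab character and that a tweet's text could itself contain tabs. The function names and sample lines are hypothetical, and CreatedAt is kept as a plain string because its exact format is not stated here.

```python
import re
from typing import Optional, Tuple

_UT_PATTERN = re.compile(r"UT:\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_user_line(line: str) -> Tuple[str, str]:
    """Parse 'UserID<TAB>UserLocation' into (user_id, location)."""
    user_id, location = line.rstrip("\n").split("\t", 1)
    return user_id, location


def parse_tweet_line(line: str) -> Tuple[str, str, str, str]:
    """Parse 'UserID<TAB>TweetID<TAB>Tweet<TAB>CreatedAt'."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 4:
        raise ValueError("expected at least 4 tab-separated fields")
    user_id, tweet_id, created_at = parts[0], parts[1], parts[-1]
    # Rejoin the middle fields in case the tweet text contained tabs.
    text = "\t".join(parts[2:-1])
    return user_id, tweet_id, text, created_at


def parse_ut_location(location: str) -> Optional[Tuple[float, float]]:
    """Extract (latitude, longitude) from a 'UT: Latitude,Longitude' value."""
    match = _UT_PATTERN.search(location)
    if match is None:
        return None  # e.g. a self-labeled city name in the training set
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    # Made-up lines for illustration only.
    print(parse_user_line("42\tUT: 30.5,-96.25"))
    print(parse_tweet_line("42\t1001\tjust landed\t2009-09-15 12:00:00"))
    print(parse_ut_location("UT: 30.5,-96.25"))
```

Returning None for values without the "UT:" prefix lets the same helper run over both the self-labeled training locations and the coordinate-based test locations.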
