Web data: Amazon movie reviews

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset is a collection of approximately 8 million movie reviews from Amazon, spanning more than a decade up to October 2012. It is particularly valuable for sentiment analysis and for studying consumer behavior and the evolution of user expertise in online reviews. In total, the dataset is 3.1 GB in size. Each review includes the product’s unique identifier (ASIN), user ID, profile name, helpfulness rating, score, time of review (in Unix time), summary, and the full text of the review.

The dataset is organized with each review capturing multiple attributes:

  • Product Information: Including the product’s unique identifier (ASIN).

  • User Information: Such as user ID and profile name.

  • Review Details: Encompassing helpfulness rating, score, time of review, summary, and the full text.
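
Because the review time is stored as Unix time, it usually has to be converted before any analysis by date. The following is a minimal Python sketch of that conversion. The key name "time" and the record layout are assumptions made for illustration; the actual field names in the files may differ.

```python
from datetime import datetime, timezone


def review_date(record):
    """Return the review date (UTC) as an ISO string, e.g. '2012-10-12'.

    Assumes the record is a dict whose "time" key (a hypothetical name)
    holds the review time in seconds since the Unix epoch.
    """
    moment = datetime.fromtimestamp(int(record["time"]), tz=timezone.utc)
    return moment.date().isoformat()


# Hypothetical record used only to exercise the function.
example = {"time": 1350000000}
print(review_date(example))
```

Converting in UTC avoids results that depend on the time zone of the machine running the analysis.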

 

Amazon product co-purchasing network and ground-truth communities

Creators: Yang, Jaewon; Leskovec, Jure
Publication Date: 2012

This dataset provides a view of product relationships on Amazon, based on the “Customers Who Bought This Item Also Bought” feature. Products are represented as nodes: if a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. Each product category provided by Amazon defines a ground-truth community, and each connected component within a product category is regarded as a separate ground-truth community. Ground-truth communities with fewer than 3 nodes are removed. The top 5,000 communities with the highest quality, as described in the creators’ paper, are also provided. The dataset is 0.01 GB in size and encompasses 334,863 nodes (products) and 925,872 edges (co-purchasing relationships).

The dataset is structured into:

  • Network Data: An undirected graph where nodes represent products, and edges indicate co-purchasing relationships.

  • Ground-Truth Communities: Each product category defined by Amazon serves as a ground-truth community. Connected components within these categories are treated as separate communities, excluding those with fewer than three nodes. Additionally, the dataset provides the top 5,000 communities with the highest quality, as detailed in the associated research paper.
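
The community construction described above (connected components within a category, with small components dropped) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the creators' code: edges are given as pairs of product IDs, and category_of is a hypothetical mapping that assigns each product exactly one category.

```python
from collections import defaultdict


def ground_truth_communities(edges, category_of, min_size=3):
    """Split each category into connected components and drop small ones.

    edges: iterable of (product, product) pairs (undirected).
    category_of: dict mapping product -> category (one category per
    product is an assumption made for this sketch).
    """
    # Keep only edges whose two endpoints share a category.
    adjacency = defaultdict(set)
    for u, v in edges:
        cat_u = category_of.get(u)
        if cat_u is not None and cat_u == category_of.get(v):
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    communities = []
    for start in category_of:
        if start in seen:
            continue
        # Depth-first search over same-category neighbours.
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= min_size:
            communities.append(component)
    return communities


# Toy input: products 1-3 form a component in category "a";
# the components in category "b" are too small and are dropped.
toy_edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
toy_categories = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
print(ground_truth_communities(toy_edges, toy_categories))
```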

Google web graph

Creators: Leskovec, Jure; Lang, Kevin J.; Dasgupta, Anirban; Mahoney, Michael W.
Publication Date: 2002

The Google Web Graph dataset represents the web’s hyperlink structure as captured in 2002. Nodes correspond to individual web pages, and directed edges represent hyperlinks from one page to another. This structure is particularly valuable for studying web connectivity, page importance algorithms like PageRank, and the overall topology of the web during that period. In total, the dataset is 0.02 GB in size and comprises 875,713 nodes and 5,105,039 directed edges. It is presented as a single directed graph, a format that facilitates analyses of web page connectivity, identification of influential pages, and exploration of community structures within the web.
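
As an illustration of the kind of page-importance analysis such a graph supports, here is a small Python sketch of PageRank by power iteration. It is a textbook formulation, not code shipped with the dataset. It assumes the graph is supplied as a list of (source, target) pairs, and the damping factor of 0.85 and the iteration count are conventional defaults chosen for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank scores for a directed graph.

    edges: list of (source, target) pairs, one per hyperlink.
    Rank held by pages without outgoing links is spread evenly
    over all pages so that the scores keep summing to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy graph: a three-page cycle plus one page "d" that only links in.
scores = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
print(max(scores, key=scores.get))
```

For a graph with millions of edges, a sparse-matrix or graph-library implementation would be a more practical choice than this dictionary-based loop.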

AmazonQA

Creators: Gupta, Mansi; Kulkarni, Nitish; Chanda, Raghuveer; Rayasam, Anirudha; Lipton, Zachary C.
Publication Date: 2019

The creators introduce a new dataset and propose a method that combines information retrieval techniques for selecting relevant reviews (given a question) with “reading comprehension” models for synthesizing an answer (given a question and review). The dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, it adds annotations that mark each question as either answerable or unanswerable based on the available reviews. The dataset is approximately 4 GB in size and is available in JSON format.

The dataset uses the following variables:

  • questionText: The text of the question posed by the consumer.

  • questionType: Indicates whether the question is ‘yes/no’ for boolean questions or ‘descriptive’ for open-ended questions.

  • review_snippets: A list of extracted review snippets relevant to the question (up to ten).

  • answerText: The text of the answer provided.

  • answerType: Specifies the type of answer.

  • helpful: A list containing two integers; the first indicates the number of users who found the answer helpful, and the second indicates the total number of responses.

  • asin: The unique Amazon Standard Identification Number (ASIN) for the product the question pertains to.

  • qid: A unique question identifier within the dataset.

  • category: The product category.

  • top_review_wilson: The review with the highest Wilson score.

  • top_review_helpful: The review voted as most helpful by users.

  • is_answerable: A boolean indicating whether the question is answerable using the review snippets, based on an answerability classifier.

  • top_sentences_IR: A list of top sentences (up to ten) based on Information Retrieval (IR) score with the question.
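
The description does not state which Wilson score variant is used for top_review_wilson. A common choice for ranking by helpfulness votes is the lower bound of the Wilson score interval, sketched below in Python as an assumption rather than as the creators' method. The inputs follow the two-integer layout of the helpful variable (helpful votes, total responses), and z = 1.96 corresponds to a 95% confidence level.

```python
import math


def wilson_lower_bound(helpful_votes, total_votes, z=1.96):
    """Lower bound of the Wilson score interval for a vote proportion.

    Returns 0.0 when there are no votes. A few unanimous votes score
    lower than many mostly-positive votes, which is the reason this
    bound is often preferred over the raw helpful fraction.
    """
    if total_votes == 0:
        return 0.0
    p = helpful_votes / total_votes
    z2 = z * z
    centre = p + z2 / (2 * total_votes)
    margin = z * math.sqrt(
        p * (1 - p) / total_votes + z2 / (4 * total_votes ** 2)
    )
    return (centre - margin) / (1 + z2 / total_votes)


# Hypothetical "helpful" values in the [helpful, total] layout.
for helpful in ([2, 2], [90, 100]):
    print(helpful, round(wilson_lower_bound(*helpful), 3))
```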

Amazon Reviews: Unlocked Mobile Phones

Creators: PromptCloud, Inc.
Publication Date: 2019
This dataset covers more than 400,000 reviews of close to 4,400 unlocked mobile phones sold on Amazon.com. The creators analyzed it for insights into reviews, ratings, and price and the relationships between them, making it a rich resource for analyzing customer sentiment and product performance. The dataset is approximately 0.13 GB in size and is available in CSV format. The creators found that most reviewers gave 4-star and 3-star ratings and that the average review length is close to 230 characters. They also found that lengthier reviews tend to be more helpful and that there is a positive correlation between price and rating.

Structurally, each entry in the dataset includes the following variables:

  • Product Name: The name of the product (e.g., “Sprint EPIC 4G Galaxy SPH-D7”).
  • Brand Name: The manufacturer or parent company (e.g., “Samsung”).
  • Price: The listed price of the product, with values ranging from a minimum of $1.73 to a maximum of $2,598, and an average price of $226.86.
  • Rating: The user-assigned rating, ranging between 1 and 5 stars.
  • Reviews: The textual content of the user’s review, detailing their experience and opinions.
  • Review Votes: The number of helpfulness votes each review received from other users, with a minimum of 0, a maximum of 645, and an average of 1.50 votes.
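
A relationship such as the one reported between price and rating can be checked with a Pearson correlation. The Python sketch below assumes the CSV header uses the column names "Price" and "Rating" exactly as listed above, which may not match the actual file, and it skips rows where either value is missing or not numeric.

```python
import csv
import io
import math


def price_rating_correlation(csv_file):
    """Pearson correlation between the Price and Rating columns.

    csv_file: an open text file (or any iterable of CSV lines).
    Returns None if fewer than two usable rows exist or if either
    column has no variation.
    """
    prices, ratings = [], []
    for row in csv.DictReader(csv_file):
        try:
            price = float(row["Price"])
            rating = float(row["Rating"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric value
        prices.append(price)
        ratings.append(rating)

    n = len(prices)
    if n < 2:
        return None
    mean_price = sum(prices) / n
    mean_rating = sum(ratings) / n
    covariance = sum(
        (p - mean_price) * (r - mean_rating) for p, r in zip(prices, ratings)
    )
    var_price = sum((p - mean_price) ** 2 for p in prices)
    var_rating = sum((r - mean_rating) ** 2 for r in ratings)
    if var_price == 0 or var_rating == 0:
        return None
    return covariance / math.sqrt(var_price * var_rating)


# Tiny made-up sample; the row with an empty price is skipped.
sample = io.StringIO("Price,Rating\n100,3\n200,4\n,5\n300,5\n")
print(price_rating_correlation(sample))
```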

Amazon question/answer data

Creators: McAuley, Julian; Yang, Alex
Publication Date: 2016
This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions and around 4 million answers. It offers insights into consumer inquiries and the corresponding responses, facilitating research in natural language processing, question-answering systems, and e-commerce analytics. It can be combined with Amazon product review data by matching ASINs in the Q/A dataset with ASINs in the review data. The dataset is approximately 766 kB in size and is available in JSON format.

Structurally, each entry in the dataset includes the following variables:

  • asin: The Amazon Standard Identification Number (ASIN) of the product, e.g., “B000050B6Z”.

  • questionType: The type of question, either ‘yes/no’ or ‘open-ended’.

  • answerType: For yes/no questions, this indicates the type of answer: ‘Y’ for yes, ‘N’ for no, or ‘?’ if the polarity of the answer could not be determined.

  • answerTime: The raw timestamp of when the answer was provided.

  • unixTime: The answer timestamp converted to Unix time.

  • question: The text of the question asked by the consumer.

  • answer: The text of the answer provided.
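
Combining the Q/A entries with review data by ASIN amounts to a key-based join. The Python sketch below assumes both sides have already been parsed into dictionaries and that the review records also carry an "asin" key; the review field names shown are hypothetical.

```python
from collections import defaultdict


def join_on_asin(questions, reviews):
    """Pair each Q/A entry with all reviews of the same product.

    questions: list of dicts with an "asin" key (as described above).
    reviews: list of dicts assumed to carry an "asin" key as well.
    Returns a list of (question, matching_reviews) tuples; products
    without reviews get an empty list.
    """
    reviews_by_asin = defaultdict(list)
    for review in reviews:
        reviews_by_asin[review["asin"]].append(review)
    return [(q, reviews_by_asin.get(q["asin"], [])) for q in questions]


# Made-up records for illustration only.
qa_entries = [
    {"asin": "B000050B6Z", "question": "Does it include a charger?"},
    {"asin": "B000000000", "question": "Is it waterproof?"},
]
review_entries = [
    {"asin": "B000050B6Z", "reviewText": "Charger was in the box."},
]
for question, matches in join_on_asin(qa_entries, review_entries):
    print(question["asin"], len(matches))
```

Indexing the reviews once by ASIN keeps the join linear in the number of records, which matters at the scale of millions of questions and reviews.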

Amazon Product Reviews

Creators: Ni, Jianmo; Li, Jiacheng; McAuley, Julian
Publication Date: 2018

The Amazon Product Reviews dataset is a collection of 233.1 million customer reviews from Amazon, covering the period from May 1996 to October 2018. It includes ratings, textual reviews, helpfulness votes, product metadata (descriptions, category information, price, brand, and image features), and links to related products (e.g., also viewed, also bought graphs). The dataset serves as a valuable resource for analyzing consumer behavior and product trends and for developing recommendation systems. In total, the dataset is 34 GB in size. It is organized into the following files and subsets:

  • Complete Review Data: A comprehensive file containing all 233.1 million reviews.
  • Ratings Only: A CSV file focusing solely on ratings, excluding textual reviews and metadata.
  • 5-Core: A subset where all users and items have at least five reviews, comprising 75.26 million reviews.
  • Per-Category Data: Reviews and product metadata categorized by specific product types (e.g., Books, Electronics).
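
The 5-core condition (every remaining user and item has at least five reviews) generally requires repeated filtering, because removing one sparse user can push an item below the threshold, and the reverse. The Python sketch below shows that iterative idea on (user, item) pairs. It is an illustration of the concept, not the creators' preprocessing script, and the pair representation is an assumption.

```python
from collections import Counter


def k_core(reviews, k=5):
    """Return the reviews whose user and item both keep >= k reviews.

    reviews: iterable of (user_id, item_id) pairs, one per review.
    Filtering repeats until no further review is removed.
    """
    current = list(reviews)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


# Toy data with k=2: user "u3" has a single review and is removed.
toy = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i1")]
print(k_core(toy, k=2))
```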

Marketing Bias data

Creators: Wan, Mengting; Ni, Jianmo; Misra, Rishabh; McAuley, Julian
Publication Date: 2020

This dataset contains attributes of products sold on ModCloth and Amazon, in particular attributes describing how the products are marketed, which may introduce biases in recommendation systems. It is designed to facilitate research on marketing biases in product recommendations. The data also include user/item interactions for recommendation.

The dataset is 0.09 GB in size and is built upon two processed subsets:

  • ModCloth Dataset: Contains product attributes from the ModCloth platform.
  • Electronics Dataset: Comprises product attributes from Amazon’s Electronics category.

In total, the dataset includes 99,893 reviews for ModCloth and 1,292,954 reviews for the Electronics category of Amazon.
