AmazonQA

Creators:

Gupta, Mansi; Kulkarni, Nitish; Chanda, Raghuveer; Rayasam, Anirudha; Lipton, Zachary C.

Publication Date:

2019

Data Category:

Dataset Description:

AmazonQA is a dataset for review-based question answering. It was introduced together with a method that combines information retrieval techniques for selecting relevant reviews (given a question) with "reading comprehension" models for synthesizing an answer (given a question and the retrieved reviews), and it is intended for developing models of this kind. The dataset consists of 923k questions, 3.6M answers, and 14M reviews across 156k products. Building on the well-known Amazon dataset, the creators collected additional annotations that mark each question as either answerable or unanswerable based on the available reviews. The dataset is approximately 4 GB in size and is available in JSON format.
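
The retrieval step described above (selecting reviews relevant to a question) can be illustrated with a minimal sketch. This is not the creators' method: it assumes a plain TF-IDF-weighted term overlap as the relevance score, and the names tokenize and rank_reviews are hypothetical helpers introduced only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_reviews(question, reviews, top_k=10):
    """Rank reviews by TF-IDF-weighted term overlap with the question.

    Assumption: a simple stand-in for an IR scorer, not the scoring
    function used to build the dataset.
    """
    docs = [Counter(tokenize(review)) for review in reviews]
    n_docs = len(docs)
    # Document frequency: in how many reviews each term appears.
    df = Counter(term for doc in docs for term in doc)
    question_terms = set(tokenize(question))

    scored = []
    for idx, doc in enumerate(docs):
        score = 0.0
        for term in question_terms:
            if term in doc:
                # Smoothed inverse document frequency.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += doc[term] * idf
        scored.append((score, idx))

    # Highest score first; ties keep the original review order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Reviews sharing no term with the question are dropped.
    return [reviews[idx] for score, idx in scored[:top_k] if score > 0]
```

The default top_k=10 mirrors the "up to ten" limit stated for review_snippets; a reading comprehension model would then take the question and the returned reviews as input.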

The dataset uses the following variables:

questionText: The text of the question posed by the consumer.
questionType: Either "yesno" for boolean questions or "descriptive" for open-ended questions.
review_snippets: A list of extracted review snippets relevant to the question (up to ten).
answerText: The text of the answer provided.
answerType: Specifies the type of answer.
helpful: A list containing two integers; the first indicates the number of users who found the answer helpful, and the second indicates the total number of responses.
asin: The unique Amazon Standard Identification Number (ASIN) for the product the question pertains to.
qid: A unique question identifier within the dataset.
category: The product category.
top_review_wilson: The review with the highest Wilson score.
top_review_helpful: The review voted as most helpful by users.
is_answerable: A boolean indicating whether the question is answerable using the review snippets, based on an answerability classifier.
top_sentences_IR: A list of top sentences (up to ten) based on Information Retrieval (IR) score with the question.
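
The helpful field and the Wilson score mentioned for top_review_wilson can be related with a short sketch. The description does not say how the Wilson score is computed, so the code below assumes the commonly used lower bound of the Wilson score interval over helpful votes; the function name wilson_lower_bound and the 95% confidence level (z = 1.96) are assumptions made for illustration.

```python
import math


def wilson_lower_bound(helpful_votes, total_votes, z=1.96):
    """Lower bound of the Wilson score interval for a helpfulness ratio.

    Assumption: helpful_votes and total_votes correspond to the two
    integers stored in the `helpful` field; z=1.96 gives a 95% interval.
    """
    if total_votes == 0:
        # No votes: no evidence of helpfulness.
        return 0.0
    p = helpful_votes / total_votes
    z2 = z * z
    centre = p + z2 / (2 * total_votes)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total_votes)) / total_votes)
    return (centre - margin) / (1 + z2 / total_votes)
```

Unlike the raw ratio, this score penalizes small samples: an item with 1 helpful vote out of 1 scores lower than one with 90 out of 100, which is one plausible reason for a dataset to record both a top_review_wilson and a top_review_helpful.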

Publications Citing This Dataset:

Shen Gao, Xiuying Chen, Zhaochun Ren, Dongyan Zhao, and Rui Yan. 2021. Meaningful Answer Generation of E-Commerce Question-Answering. ACM Trans. Inf. Syst. 39, 2, Article 18 (April 2021), 26 pages. https://doi.org/10.1145/3432689

Variables:

Name	Description
questionText	String. The question.
questionType	String. Either "yesno" for a boolean question, or "descriptive" for a non-boolean question.
review_snippets	List of strings. Extracted review snippets relevant to the question (at most ten).
answerText	String. The text for the answer.
answerType	String. Type of the answer.
helpful	List of two integers. The first integer indicates the number of users who found the answer helpful. The second integer indicates the total number of responses.
asin	String. Unique product ID for the product the question pertains to.
qid	Integer. Unique question id for the question (in the entire dataset).
category	String. Product category.
top_review_wilson	String. The review with the highest Wilson score.
top_review_helpful	String. The review voted as most helpful by the users.
is_answerable	Boolean. Output of the answerability classifier indicating whether the question is answerable using the review snippets.
top_sentences_IR	List of strings. A list of top sentences (at most 10) based on IR score with the question.
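
As an illustration of the variable table, the following sketch parses one record and checks it against the listed names and types. It assumes each record is a flat JSON object with exactly these keys, which the description does not confirm; the sample record is invented for the example, and parse_record and answerable_yesno are hypothetical helper names.

```python
import json

# Expected Python types, taken from the variable table.
SCHEMA = {
    "questionText": str,
    "questionType": str,
    "review_snippets": list,
    "answerText": str,
    "answerType": str,
    "helpful": list,
    "asin": str,
    "qid": int,
    "category": str,
    "top_review_wilson": str,
    "top_review_helpful": str,
    "is_answerable": bool,
    "top_sentences_IR": list,
}

# Invented sample record; every value here is a placeholder.
SAMPLE_LINE = json.dumps({
    "questionText": "Does this case fit the 2018 model?",
    "questionType": "yesno",
    "review_snippets": ["Fits my 2018 model perfectly."],
    "answerText": "Yes, it fits.",
    "answerType": "placeholder",  # possible values are not listed in the table
    "helpful": [3, 4],
    "asin": "PLACEHOLDER_ASIN",
    "qid": 1,
    "category": "placeholder_category",
    "top_review_wilson": "Fits my 2018 model perfectly.",
    "top_review_helpful": "Sturdy case, good value.",
    "is_answerable": True,
    "top_sentences_IR": ["Fits my 2018 model perfectly."],
})


def parse_record(line):
    """Parse one JSON record and validate it against SCHEMA."""
    record = json.loads(line)
    for name, expected_type in SCHEMA.items():
        if name not in record:
            raise ValueError(f"missing variable: {name}")
        if not isinstance(record[name], expected_type):
            raise ValueError(f"unexpected type for {name}")
    if record["questionType"] not in ("yesno", "descriptive"):
        raise ValueError("questionType must be 'yesno' or 'descriptive'")
    if len(record["helpful"]) != 2:
        raise ValueError("helpful must hold exactly two integers")
    return record


def answerable_yesno(records):
    """Keep boolean questions that the classifier marked as answerable."""
    return [
        r for r in records
        if r["is_answerable"] and r["questionType"] == "yesno"
    ]
```

Calling parse_record(SAMPLE_LINE) returns the record as a dictionary, and answerable_yesno can then be applied to a list of parsed records to select a subset for a boolean question answering experiment.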

