
Social circles: Google+

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ from Google+. Google+ data was collected from users who had manually shared their circles using the ‘share circle’ feature. The dataset includes node features (profiles), circles, and ego networks. The dataset includes 107,614 nodes and 13,673,453 edges, providing valuable insights into the structure and characteristics of social connections.

It has a size of 0.4 GB and is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes are replaced with generic labels.

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.
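In the SNAP releases of these ego-network datasets, each ego's circles are typically distributed as a tab-separated text file (one circle per line: a circle name followed by the anonymized member node ids). The sketch below assumes that layout; the file naming and exact field conventions should be checked against the dataset's readme.

```python
import io

def parse_circles(lines):
    """Parse a SNAP-style .circles file: each line is
    'circleName<TAB>node1<TAB>node2<TAB>...'."""
    circles = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        circles[parts[0]] = set(parts[1:])
    return circles

# Toy input in the assumed format (node ids are anonymized integers):
sample = io.StringIO("circle0\t71\t215\t54\ncircle1\t173\n")
circles = parse_circles(sample)
print(sorted(circles["circle0"]))  # ['215', '54', '71']
```

The same pattern applies to the `.edges` files, which list one "node node" pair per line within an ego network.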

AmazonQA

Creators: Gupta, Mansi; Kulkarni, Nitish; Chanda, Raghuveer; Rayasam, Anirudha; Lipton, Zachary C.
Publication Date: 2019

We introduce a new dataset and propose a method that combines information retrieval techniques for selecting relevant reviews (given a question) and “reading comprehension” models for synthesizing an answer (given a question and review). Our dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, we collect additional annotations, marking each question as either answerable or unanswerable based on the available reviews. This dataset is particularly valuable for developing models that integrate information retrieval techniques to select relevant reviews and “reading comprehension” models to synthesize answers based on those reviews. The dataset is approximately 4 GB in size and is available in JSON format.

The dataset uses the following variables:

  • questionText: The text of the question posed by the consumer.

  • questionType: Indicates whether the question is ‘yes/no’ for boolean questions or ‘descriptive’ for open-ended questions.

  • review_snippets: A list of extracted review snippets relevant to the question (up to ten).

  • answerText: The text of the answer provided.

  • answerType: Specifies the type of answer.

  • helpful: A list containing two integers; the first indicates the number of users who found the answer helpful, and the second indicates the total number of responses.

  • asin: The unique Amazon Standard Identification Number (ASIN) for the product the question pertains to.

  • qid: A unique question identifier within the dataset.

  • category: The product category.

  • top_review_wilson: The review with the highest Wilson score.

  • top_review_helpful: The review voted as most helpful by users.

  • is_answerable: A boolean indicating whether the question is answerable using the review snippets, based on an answerability classifier.

  • top_sentences_IR: A list of top sentences (up to ten) based on Information Retrieval (IR) score with the question.
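Using the field names listed above, a typical workflow is to filter for answerable questions and weigh answers by their helpfulness votes. The records below are made-up illustrations, not actual dataset entries:

```python
# Toy AmazonQA-style records using the documented field names;
# the values are fabricated for illustration.
records = [
    {"qid": 1, "questionText": "Does it fit a 15-inch laptop?",
     "questionType": "yes/no", "answerText": "Yes, easily.",
     "helpful": [4, 5], "is_answerable": True, "category": "Electronics"},
    {"qid": 2, "questionText": "How loud is the fan?",
     "questionType": "descriptive", "answerText": "",
     "helpful": [0, 0], "is_answerable": False, "category": "Electronics"},
]

def helpful_ratio(record):
    """helpful = [users who found the answer helpful, total responses]."""
    found, total = record["helpful"]
    return found / total if total else 0.0

answerable = [r for r in records if r["is_answerable"]]
print(len(answerable))            # 1
print(helpful_ratio(records[0]))  # 0.8
```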

Social circles: Facebook

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. Facebook data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks, offering valuable insights into the structure and characteristics of social connections. The dataset includes 4,039 nodes and 88,234 edges. Facebook data has been anonymized by replacing the Facebook-internal ids for each user with a new value. The dataset is approximately 0.01 GB in size.

The dataset is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes (e.g., political affiliation) are replaced with generic labels (e.g., ‘anonymized feature 1’).

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.

YouTube-8M Dataset

Creators: Abu-El-Haija, Sami; Kothari, Nisarg; Lee, Joonseok; Natsev, Paul; Toderici, George; Varadarajan, Balakrishnan; Vijayanarasimhan, Sudheendra
Publication Date: 2016

YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs with high-quality machine-generated and partially human-verified annotations from a diverse vocabulary of 3,800+ visual entities.

It comprises two subsets:

  • 8M Segments Dataset: 230K human-verified segment labels, 1,000 classes, 5 segments/video

  • 8M Dataset (May 2018 version, current): 6.1M videos, 3,862 classes, 3.0 labels/video, 2.6B audio-visual features

The dataset comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to train a strong baseline model on this dataset in less than a day on a single GPU! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

YouTube offers the YouTube-8M dataset for download as TensorFlow Record files on their website. Starter code for the dataset can be found on their GitHub page.
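TFRecord files use a simple framing format: an 8-byte little-endian record length, a 4-byte masked CRC of the length, the payload, and a 4-byte masked CRC of the payload. The sketch below reads that framing in pure Python, skipping checksum verification for brevity; decoding the serialized tf.Example protos inside each record would additionally require TensorFlow or protobuf, and the writer here emits zeroed (invalid) checksums purely so the reader can be demonstrated in memory:

```python
import io
import struct

def read_tfrecords(stream):
    """Yield raw record payloads from a TFRecord byte stream.
    Frame layout: uint64 length (LE), uint32 masked CRC of the length,
    the payload bytes, uint32 masked CRC of the payload.
    Checksums are skipped here for brevity."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)                 # skip length CRC
        payload = stream.read(length)
        stream.read(4)                 # skip payload CRC
        yield payload

def write_tfrecord(stream, payload):
    """Write one framed record with zeroed (invalid) CRCs - test use only."""
    stream.write(struct.pack("<Q", len(payload)))
    stream.write(b"\x00" * 4)
    stream.write(payload)
    stream.write(b"\x00" * 4)

buf = io.BytesIO()
write_tfrecord(buf, b"video-features-1")
write_tfrecord(buf, b"video-features-2")
buf.seek(0)
payloads = [p.decode() for p in read_tfrecords(buf)]
print(payloads)  # ['video-features-1', 'video-features-2']
```

In practice, `tf.data.TFRecordDataset` is the standard way to consume these files; the pure-Python reader above is only meant to show what the on-disk framing looks like.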

Amazon Reviews: Unlocked Mobile Phones

Creators: PromptCloud, Inc.
Publication Date: 2019

We analyzed more than 400,000 reviews of close to 4,400 unlocked mobile phones sold on Amazon.com to derive insights about reviews, ratings, prices, and their relationships, making the dataset a rich resource for analyzing customer sentiment and product performance. The dataset is approximately 0.13 GB in size and is available in CSV format. The authors found that on Amazon’s product review platform most reviewers gave 4-star and 3-star ratings, and the average review length comes close to 230 characters. They also found that lengthier reviews tend to be more helpful and that there is a positive correlation between price and rating.

Structurally, each entry in the dataset includes the following variables:

  • Product Name: The name of the product (e.g., “Sprint EPIC 4G Galaxy SPH-D7”).
  • Brand Name: The manufacturer or parent company (e.g., “Samsung”).
  • Price: The listed price of the product, with values ranging from a minimum of $1.73 to a maximum of $2,598, and an average price of $226.86.
  • Rating: The user-assigned rating, ranging between 1 and 5 stars.
  • Reviews: The textual content of the user’s review, detailing their experience and opinions.
  • Review Votes: The number of helpfulness votes each review received from other users, with a minimum of 0, a maximum of 645, and an average of 1.50 votes.
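Given the CSV layout above, the dataset can be explored with nothing more than the standard library. The two rows below are fabricated samples in the documented column order, used only to show the parsing pattern:

```python
import csv
import io
from statistics import mean

# A two-row sample in the documented column layout (values are illustrative).
sample_csv = io.StringIO(
    "Product Name,Brand Name,Price,Rating,Reviews,Review Votes\n"
    'Sprint EPIC 4G Galaxy SPH-D7,Samsung,199.99,5,"Great phone, fast shipping.",1\n'
    'Sprint EPIC 4G Galaxy SPH-D7,Samsung,199.99,3,"Battery life could be better.",0\n'
)
rows = list(csv.DictReader(sample_csv))

avg_rating = mean(int(r["Rating"]) for r in rows)
avg_review_len = mean(len(r["Reviews"]) for r in rows)
print(avg_rating)  # 4
print(avg_review_len)
```

On the full file, the same aggregation over the `Rating` and `Reviews` columns reproduces the summary statistics quoted above (most ratings at 3-4 stars, average review length near 230 characters).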

Amazon question/answer data

Creators: McAuley, Julian; Yang, Alex
Publication Date: 2016

This dataset contains Question and Answer data from Amazon, totaling around 1.4 million answered questions and around 4 million answers. This dataset offers valuable insights into consumer inquiries and the corresponding responses, facilitating research in natural language processing, question-answering systems, and e-commerce analytics. It can be combined with Amazon product review data by matching ASINs in the Q/A dataset with ASINs in the review data. The dataset is approximately 766 kB in size and is available in JSON format.

Structurally, each entry in the dataset includes the following variables:

  • asin: The Amazon Standard Identification Number (ASIN) of the product, e.g., “B000050B6Z”.

  • questionType: The type of question, either ‘yes/no’ or ‘open-ended’.

  • answerType: For yes/no questions, this indicates the type of answer: ‘Y’ for yes, ‘N’ for no, or ‘?’ if the polarity of the answer could not be determined.

  • answerTime: The raw timestamp of when the answer was provided.

  • unixTime: The answer timestamp converted to Unix time.

  • question: The text of the question asked by the consumer.

  • answer: The text of the answer provided.
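The companion Amazon datasets from the same group are commonly distributed as gzipped files with one Python dict literal per line, read via an `eval`-based loader snippet; assuming this dataset follows the same convention, `ast.literal_eval` is a safer stand-in. The record below is fabricated from the field list above, and the gzipped file is built in memory for demonstration:

```python
import ast
import gzip
import io

def parse(path_or_file):
    """Yield one Q/A record per line from a gzipped file of dict literals.
    ast.literal_eval is a safe alternative to the eval() used in the
    commonly circulated loader snippet."""
    with gzip.open(path_or_file, "rt") as f:
        for line in f:
            yield ast.literal_eval(line)

# Build an in-memory gzipped sample in the assumed one-dict-per-line format.
record = {"asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y",
          "question": "Does it come with a charger?", "answer": "Yes, it does."}
raw = io.BytesIO()
with gzip.open(raw, "wt") as f:
    f.write(repr(record) + "\n")
raw.seek(0)

parsed = list(parse(raw))
print(parsed[0]["asin"])  # B000050B6Z
```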

Goodreads Datasets

Creators: Wan, Mengting; McAuley, Julian
Publication Date: 2017

The Goodreads Datasets provide a large-scale collection of book-related data, making them valuable for analyzing reading behavior, book popularity, and recommendation systems. They contain three primary components: book metadata, user-book interactions, and book reviews. The book metadata covers 2,360,655 books, including titles, authors, publication years, genres, and average ratings. The user-book interactions dataset comprises 228,648,342 interactions from 876,145 users, capturing explicit preferences such as bookshelf assignments (“read,” “to-read”) and user ratings. The book reviews dataset contains detailed user-written textual reviews, offering insights into reader sentiments and opinions. The total size of the combined datasets is approximately 11 GB, available in JSON and CSV formats.

Social Interaction QA (Social IQa)

Creators: Sap, Maarten; Rashkin, Hannah; Chen, Derek; Le Bras, Ronan; Choi, Yejin
Publication Date: 2019

Social Interaction QA (Social IQa) is a new question-answering benchmark for testing social commonsense intelligence. Unlike many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like “Jesse saw a concert” and a question like “Why did Jesse do this?”, humans can easily infer that Jesse wanted “to see their favorite performer” or “to enjoy the music”, and not “to see what’s happening inside” or “to see if it works”. The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations. The dataset is relatively small, with a size of about 0.01 GB, and is available in JSON format.

The structure of the dataset consists of a set of question-answer pairs, where each entry contains:

  • A context describing a social situation.
  • A question that requires reasoning about the context.
  • Three answer choices (one correct, two incorrect).
  • A label indicating the correct answer.
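A common layout for such JSONL releases is one entry per line with the three answer options as separate fields and a 1-indexed label distributed alongside; the field names (`context`, `question`, `answerA`/`B`/`C`) and the separate label file are assumptions here and should be checked against the official release:

```python
import json

# One JSONL line in the assumed Social IQa layout: a context, a question,
# and three answer options. The example text comes from the description above.
line = json.dumps({
    "context": "Jesse saw a concert last night.",
    "question": "Why did Jesse do this?",
    "answerA": "to see their favorite performer",
    "answerB": "to see what's happening inside",
    "answerC": "to see if it works",
})
label = 1  # 1-indexed correct-answer label, assumed to ship separately

entry = json.loads(line)
options = [entry["answerA"], entry["answerB"], entry["answerC"]]
print(options[label - 1])  # to see their favorite performer
```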
