
Social circles: Twitter

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘lists’) from Twitter, crawled from public sources. It includes node features (profiles), circles, and ego networks, covering 81,306 nodes and 1,768,149 edges and capturing the intricate web of connections among users. This makes the dataset particularly useful for analyzing community structures, social influence, and network dynamics. It is approximately 0.02 GB in size and structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes are replaced with generic labels.

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.
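In the SNAP releases of these circles datasets, each ego network is split across small plain-text files per ego node. A minimal parsing sketch, assuming the `.circles` layout is one circle per line with a circle name followed by tab-separated member ids (the exact file layout should be checked against the dataset's README):

```python
def parse_circles(lines):
    """Parse a .circles file: each line is assumed to be
    'circleName<TAB>member1<TAB>member2...'."""
    circles = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, members = parts[0], parts[1:]
        circles[name] = set(members)
    return circles

sample = ["circle0\t1\t2\t3", "circle1\t2\t4"]
print(parse_circles(sample))
```

The same tab-separated convention is easy to extend to the `.edges` and `.feat` files of an ego network.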

Wikipedia vote network

Creators: Leskovec, Jure; Huttenlocher, Daniel; Kleinberg, Jon
Publication Date: 2010

A small fraction of Wikipedia contributors are administrators: users with access to additional technical features that aid in maintenance. For a user to become an administrator, a Request for adminship (RfA) is issued, and the Wikipedia community decides, via public discussion or a vote, whom to promote. Using the latest complete dump of Wikipedia page edit history (from January 3, 2008), the creators extracted all administrator elections and vote history data. This yielded 2,794 elections with 103,663 total votes and 7,066 users participating in the elections (either casting a vote or being voted on). Of these, 1,235 elections resulted in a successful promotion, while 1,559 did not. About half of the votes in the dataset are cast by existing admins; the other half come from ordinary Wikipedia users. The network contains all Wikipedia voting data from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j indicates that user i voted on user j. In total, the dataset has a size of 0.01 GB.
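SNAP edge-list files of this kind are typically plain text, with `#`-prefixed comment lines followed by one tab-separated `FromNodeId  ToNodeId` pair per line. A sketch of loading the voting graph into an adjacency map, under that assumed layout:

```python
def load_directed_edges(lines):
    """Read a SNAP-style edge list: '#' lines are comments; data lines
    are 'fromId<TAB>toId'. Returns {voter: set(candidates voted on)}."""
    adj = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        u, v = line.split()
        adj.setdefault(u, set()).add(v)
    return adj

sample = ["# FromNodeId\tToNodeId", "30\t1412", "30\t3352", "3\t28"]
votes = load_directed_edges(sample)
print(sorted(votes["30"]))
```

From the adjacency map, out-degree (votes cast) and in-degree (votes received) follow directly.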

Social circles: Google+

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ from Google+. Google+ data was collected from users who had manually shared their circles using the ‘share circle’ feature. The dataset includes node features (profiles), circles, and ego networks. The dataset includes 107,614 nodes and 13,673,453 edges, providing valuable insights into the structure and characteristics of social connections.

It has a size of 0.4 GB and is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes are replaced with generic labels.

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.

AmazonQA

Creators: Gupta, Mansi; Kulkarni, Nitish; Chanda, Raghuveer; Rayasam, Anirudha; Lipton, Zachary C.
Publication Date: 2019

The creators introduce a new dataset and propose a method that combines information retrieval techniques for selecting relevant reviews (given a question) with “reading comprehension” models for synthesizing an answer (given a question and review). The dataset consists of 923k questions, 3.6M answers, and 14M reviews across 156k products. Building on the well-known Amazon dataset, it adds annotations marking each question as either answerable or unanswerable based on the available reviews. This makes it particularly valuable for developing models that retrieve relevant reviews and synthesize answers from them. The dataset is approximately 4 GB in size and is available in JSON format.

The dataset uses the following variables:

  • questionText: The text of the question posed by the consumer.

  • questionType: Indicates whether the question is ‘yes/no’ for boolean questions or ‘descriptive’ for open-ended questions.

  • review_snippets: A list of extracted review snippets relevant to the question (up to ten).

  • answerText: The text of the answer provided.

  • answerType: Specifies the type of answer.

  • helpful: A list containing two integers; the first indicates the number of users who found the answer helpful, and the second indicates the total number of responses.

  • asin: The unique Amazon Standard Identification Number (ASIN) for the product the question pertains to.

  • qid: A unique question identifier within the dataset.

  • category: The product category.

  • top_review_wilson: The review with the highest Wilson score.

  • top_review_helpful: The review voted as most helpful by users.

  • is_answerable: A boolean indicating whether the question is answerable using the review snippets, based on an answerability classifier.

  • top_sentences_IR: A list of top sentences (up to ten) based on Information Retrieval (IR) score with the question.
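Assuming the dataset is distributed as one JSON object per line with the fields listed above (the exact serialization should be checked against the release), a sketch of filtering out the answerable questions looks like this:

```python
import json

def answerable_questions(json_lines):
    """Keep records whose is_answerable flag is set; assumes one JSON
    object per line with the field names from the variable list above."""
    out = []
    for line in json_lines:
        rec = json.loads(line)
        if rec.get("is_answerable"):
            out.append((rec["qid"], rec["questionType"], rec["questionText"]))
    return out

sample = [
    '{"qid": 1, "questionType": "yes/no", "questionText": "Does it fit?", "is_answerable": true}',
    '{"qid": 2, "questionType": "descriptive", "questionText": "How big?", "is_answerable": false}',
]
print(answerable_questions(sample))
```

The same pattern extends to pairing `review_snippets` with `answerText` when building retrieval/reading-comprehension training pairs.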

Social circles: Facebook

Creators: McAuley, Julian; Leskovec, Jure
Publication Date: 2012

This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. The data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks, offering valuable insights into the structure and characteristics of social connections, and covers 4,039 nodes and 88,234 edges. The data has been anonymized by replacing the Facebook-internal id of each user with a new value. The dataset is approximately 0.01 GB in size.

The dataset is structured into:

  • Node features (profiles): Anonymized user profile information, where specific attributes (e.g., political affiliation) are replaced with generic labels (e.g., ‘anonymized feature 1’).

  • Circles: Lists of friends grouped by users, representing their social circles.

  • Ego networks: Subgraphs centered around individual users (egos), including their direct friends and the connections among those friends.

YouTube-8M Dataset

Creators: Abu-El-Haija, Sami; Kothari, Nisarg; Lee, Joonseok; Natsev, Paul; Toderici, George; Varadarajan, Balakrishnan; Vijayanarasimhan, Sudheendra
Publication Date: 2016

YouTube-8M is a large-scale labeled video dataset consisting of millions of YouTube video IDs with high-quality machine-generated and partially human-verified annotations drawn from a diverse vocabulary of 3,800+ visual entities.

It comprises two subsets:

  • 8M Segments Dataset: 230K human-verified segment labels, 1,000 classes, 5 segments per video

  • 8M Dataset (May 2018 version, current): 6.1M videos, 3,862 classes, 3.0 labels per video, 2.6B audio-visual features

Thus, it comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to train a strong baseline model on this dataset in less than a day on a single GPU! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

YouTube offers the YouTube-8M dataset for download as TensorFlow Record files on its website. Starter code for the dataset can be found on its GitHub page.
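Decoding the serialized examples inside the TFRecord files requires TensorFlow, but the on-disk framing itself is simple: each record is an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. A minimal sketch that walks this framing (CRC fields are skipped, not verified):

```python
import io
import struct

def iter_tfrecords(stream):
    """Yield raw record payloads from a TFRecord byte stream.
    Per record: uint64 LE length, 4-byte length CRC (ignored),
    payload, 4-byte payload CRC (ignored)."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)               # masked CRC32 of the length, skipped
        payload = stream.read(length)
        stream.read(4)               # masked CRC32 of the payload, skipped
        yield payload

# Build a fake one-record stream in memory to exercise the framing;
# zeroed CRC bytes are fine since this reader does not verify them.
data = b"example video features"
fake = struct.pack("<Q", len(data)) + b"\x00" * 4 + data + b"\x00" * 4
print(list(iter_tfrecords(io.BytesIO(fake))))
```

For real use, each payload would then be parsed as a serialized `tf.Example` protocol buffer using the starter code.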

Amazon Reviews: Unlocked Mobile Phones

Creators: PromptCloud, Inc.
Publication Date: 2019

The creators analyzed more than 400,000 reviews of close to 4,400 unlocked mobile phones sold on Amazon.com to derive insights into reviews, ratings, prices, and their relationships, making the dataset a rich resource for analyzing customer sentiment and product performance. It is approximately 0.13 GB in size and is available in CSV format. The authors found that most reviewers on Amazon's product review platform gave 4-star and 3-star ratings, and that the average review length is close to 230 characters. They also found that lengthier reviews tend to be more helpful and that there is a positive correlation between price and rating.

Structurally, each entry in the dataset includes the following variables:

  • Product Name: The name of the product (e.g., “Sprint EPIC 4G Galaxy SPH-D7”).
  • Brand Name: The manufacturer or parent company (e.g., “Samsung”).
  • Price: The listed price of the product, with values ranging from a minimum of $1.73 to a maximum of $2,598, and an average price of $226.86.
  • Rating: The user-assigned rating, ranging between 1 and 5 stars.
  • Reviews: The textual content of the user’s review, detailing their experience and opinions.
  • Review Votes: The number of helpfulness votes each review received from other users, with a minimum of 0, a maximum of 645, and an average of 1.50 votes.
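Given the column names above, a summary statistic like the mean rating per brand can be computed with the standard library alone. A sketch, assuming the CSV headers match the variable list exactly:

```python
import csv
import io
from collections import defaultdict

def mean_rating_by_brand(csv_text):
    """Average the Rating column per Brand Name; assumes headers match
    the variable list above."""
    totals = defaultdict(lambda: [0.0, 0])  # brand -> [sum, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = totals[row["Brand Name"]]
        t[0] += float(row["Rating"])
        t[1] += 1
    return {brand: s / n for brand, (s, n) in totals.items()}

sample = (
    "Product Name,Brand Name,Price,Rating,Reviews,Review Votes\n"
    "Phone A,Samsung,199.99,4,Good phone,3\n"
    "Phone B,Samsung,149.99,2,Meh,0\n"
    "Phone C,Apple,599.00,5,Great,12\n"
)
print(mean_rating_by_brand(sample))
```

In practice the real file may contain missing Price or Review Votes values, so a production loader would add per-field error handling.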

Amazon question/answer data

Creators: McAuley, Julian; Yang, Alex
Publication Date: 2016

This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions and around 4 million answers. It offers valuable insights into consumer inquiries and the corresponding responses, facilitating research in natural language processing, question-answering systems, and e-commerce analytics. It can be combined with Amazon product review data by matching ASINs in the Q/A dataset with ASINs in the review data. The dataset is approximately 766 kB in size and is available in JSON format.

Structurally, each entry in the dataset includes the following variables:

  • asin: The Amazon Standard Identification Number (ASIN) of the product, e.g., “B000050B6Z”.

  • questionType: The type of question, either ‘yes/no’ or ‘open-ended’.

  • answerType: For yes/no questions, this indicates the type of answer: ‘Y’ for yes, ‘N’ for no, or ‘?’ if the polarity of the answer could not be determined.

  • answerTime: The raw timestamp of when the answer was provided.

  • unixTime: The answer timestamp converted to Unix time.

  • question: The text of the question asked by the consumer.

  • answer: The text of the answer provided.
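Files from this family of Amazon datasets are commonly distributed with one Python-style dict literal per line (single-quoted keys) rather than strict JSON; assuming that format here, `ast.literal_eval` parses such lines safely without resorting to `eval`:

```python
import ast

def parse_qa(lines):
    """Parse lines that are Python dict literals (single-quoted keys),
    the loose-JSON format these Q/A files are often distributed in."""
    for line in lines:
        yield ast.literal_eval(line)

sample = [
    "{'asin': 'B000050B6Z', 'questionType': 'yes/no', 'answerType': 'Y', "
    "'question': 'Does it come with a charger?', 'answer': 'Yes.'}",
]
records = list(parse_qa(sample))
print(records[0]["answerType"])
```

If a particular release turns out to be strict JSON, `json.loads` can be substituted line for line.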
