Facebook URL Shares

Creators: Solomon Messing; Bogdan State; Chaya Nayak; Gary King; Nate Persily
Publication Date: 2018

The data describes web page addresses (URLs) that have been shared on Facebook starting January 1, 2017 and ending about a month before the present day. URLs are included if shared by at least 20 unique accounts and shared publicly at least once. We estimate the full data set will contain on the order of 2 million unique URLs shared in 300 million posts per week. The dataset provides insight into the dissemination of web content on Facebook, capturing how information spreads across the platform. Researchers can use this data to explore patterns in user engagement, the virality of content, and the reach of various web pages within the Facebook ecosystem. Because URLs must be shared by a minimum number of unique accounts, the data represents content with a certain level of engagement and excludes rarely shared URLs.

The dataset is structured to include the following key components:

  • URL Information: Each entry includes the web page address (URL) that was shared on Facebook.

  • Share Metrics: Data on the number of times each URL was shared, including the count of unique accounts that shared it and the total number of posts containing the URL.

  • Engagement Metrics: Information on user interactions with the shared URLs, such as likes, comments, and shares.
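
The inclusion rule described above (at least 20 unique sharing accounts and at least one public share) can be sketched as a simple filter. This is a minimal illustration only: the record layout, field names, and sample URLs are hypothetical and are not the dataset's actual column names.

```python
from dataclasses import dataclass


# Hypothetical record layout; field names are illustrative assumptions,
# not the dataset's actual columns.
@dataclass
class UrlShareStats:
    url: str
    unique_sharers: int  # distinct accounts that shared the URL
    public_shares: int   # posts that shared the URL publicly


MIN_UNIQUE_SHARERS = 20  # threshold stated in the dataset description
MIN_PUBLIC_SHARES = 1    # "shared publicly at least once"


def is_included(stats: UrlShareStats) -> bool:
    """Return True if a URL meets both stated inclusion criteria."""
    return (stats.unique_sharers >= MIN_UNIQUE_SHARERS
            and stats.public_shares >= MIN_PUBLIC_SHARES)


# Made-up candidates to exercise the filter.
candidates = [
    UrlShareStats("https://example.com/a", unique_sharers=25, public_shares=3),
    UrlShareStats("https://example.com/b", unique_sharers=19, public_shares=10),
    UrlShareStats("https://example.com/c", unique_sharers=40, public_shares=0),
]
included = [s.url for s in candidates if is_included(s)]
print(included)
```

Only the first candidate passes: the second falls below the account threshold and the third was never shared publicly.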

Facebook Ad Library

Creators: Franklin Fowler, Erika; Franz, Mike; King, Gary; Martin, Greg; Mukerjee, Zagreb; Persily, Nate
Publication Date: 2019

The Ad Library API provides programmatic access to the Facebook Ad Library, a collection of all political advertisements run on Facebook and Instagram since May 2018 in the US, and since other start dates in other countries. The codebook describes the scope, structure, and fields of these data. The Ad Library offers detailed information about each advertisement, including:

  • Ad Creative: Visual and textual content of the ad.

  • Impressions: Number of times the ad was displayed.

  • Spend: Estimated amount spent on the ad.

  • Demographics: Age, gender, and location breakdown of the audience reached.

Given that the Ad Library archives all ads related to political content, social issues, and elections since May 2018, the number of observations runs into the millions. The Ad Library’s data is structured to include various attributes for each advertisement:

  • Ad ID: Unique identifier for each ad.

  • Page ID and Name: Information about the page running the ad.

  • Ad Creative: Content and format of the ad.

  • Impressions and Spend: Metrics indicating the ad’s reach and budget.

  • Demographic Distribution: Breakdown of the audience by age, gender, and location.

Multi-aspect Reviews

Creators: Julian McAuley; Jure Leskovec; Dan Jurafsky
Publication Date: 2013
These datasets include reviews with multiple rated dimensions. They are particularly valuable for research in sentiment analysis, recommender systems, and user modeling, because they allow a nuanced understanding of user opinions beyond overall ratings. The most comprehensive are the beer review datasets from RateBeer and BeerAdvocate, which include sensory aspects such as taste, look, feel, and smell. The data is about 1 GB in size.
RateBeer:

  • Number of users: 40,213
  • Number of items: 110,419
  • Number of ratings/reviews: 2,855,232
  • Timespan: April, 2000 – November, 2011

BeerAdvocate:

  • Number of users: 33,387
  • Number of items: 66,051
  • Number of ratings/reviews: 1,586,259
  • Timespan: January, 1998 – November, 2011

The datasets are structured in a JSON format, with each entry representing a single review that includes:

  • Product Information: Details about the beer being reviewed.

  • User Information: Anonymized identifiers of the reviewers.

  • Review Content: Textual feedback provided by the user.

  • Ratings: Numerical scores for overall satisfaction and specific aspects (appearance, aroma, palate, taste).
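
As a rough sketch of how the per-aspect ratings might be used, the snippet below parses a few JSON review entries and averages each aspect. The key names (`beer_name`, `user_id`, `text`, `ratings`) and the sample reviews are assumptions made up for illustration; the real files may use different keys.

```python
import json
from statistics import mean

# Made-up sample entries; the keys are assumptions chosen to mirror the
# components listed above, not the dataset's documented schema.
SAMPLE = """
[
  {"beer_name": "Example Stout", "user_id": "u1", "text": "Roasty.",
   "ratings": {"overall": 4.0, "appearance": 4.5, "aroma": 4.0,
               "palate": 3.5, "taste": 4.0}},
  {"beer_name": "Example Stout", "user_id": "u2", "text": "Thin body.",
   "ratings": {"overall": 3.0, "appearance": 3.5, "aroma": 3.0,
               "palate": 2.5, "taste": 3.0}}
]
"""

ASPECTS = ("appearance", "aroma", "palate", "taste")


def aspect_means(reviews):
    """Average each rated aspect across a list of review dicts."""
    return {a: mean(r["ratings"][a] for r in reviews) for a in ASPECTS}


reviews = json.loads(SAMPLE)
means = aspect_means(reviews)
print(means)
```

Comparing these per-aspect averages with the overall score is one simple way to see which dimension drives a reviewer's overall opinion.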

Goodreads-books

Creators: Zając, Zygmunt
Publication Date: 2019

This dataset was created to meet the need for a good, clean dataset of books. It contains important features such as book titles, authors, average ratings, ISBN identifiers, language codes, number of pages, ratings count, text reviews count, publication dates, and publishers. The dataset supports a wide range of book-related analyses, such as trends in book popularity, author influence, and reader preferences. The data is 1.56 MB in size and was scraped via the Goodreads API. It contains over 10,000 observations, each representing a unique book entry with multiple attributes. The structure is straightforward: a single CSV file with the following key columns:

  • bookID: A unique identification number for each book.
  • title: The official title of the book.
  • authors: Names of the authors, with multiple authors separated by a delimiter.
  • average_rating: The average user rating for the book.
  • isbn & isbn13: The 10-digit and 13-digit International Standard Book Numbers, respectively.
  • language_code: The primary language in which the book is published (e.g., ‘eng’ for English).
  • num_pages: The total number of pages in the book.
  • ratings_count: The total number of ratings the book has received from users.
  • text_reviews_count: The total number of text reviews written by users.
  • publication_date: The original publication date of the book.
  • publisher: The name of the publishing house.
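
The column layout above lends itself to quick aggregate analyses. The following sketch reads rows with the listed column names and computes a ratings-weighted average rating per language code. The three sample rows, the date format, and the "/" author delimiter are made-up assumptions for illustration; with the real file you would open the CSV instead of the inline string.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column layout listed above; values are illustrative only.
SAMPLE_CSV = """bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
1,Example Book One,Author A/Author B,4.20,0000000001,9780000000001,eng,320,1500,120,9/16/2006,Example Press
2,Example Book Two,Author C,3.80,0000000002,9780000000002,eng,210,500,40,11/1/2004,Sample House
3,Libro de Ejemplo,Autor D,4.00,0000000003,9780000000003,spa,180,100,10,5/3/2001,Editorial Ejemplo
"""


def weighted_rating_by_language(rows):
    """Average rating per language_code, weighted by ratings_count."""
    totals = defaultdict(lambda: [0.0, 0])  # language -> [sum(rating * n), sum(n)]
    for row in rows:
        count = int(row["ratings_count"])
        totals[row["language_code"]][0] += float(row["average_rating"]) * count
        totals[row["language_code"]][1] += count
    # Skip languages with no ratings to avoid dividing by zero.
    return {lang: s / n for lang, (s, n) in totals.items() if n > 0}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
result = weighted_rating_by_language(rows)
print(result)
```

Weighting by `ratings_count` keeps a book with a handful of ratings from counting as much as one with thousands.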

COVID-19 Twitter Chatter Dataset

Creators: Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo
Publication Date: 2024

Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions, and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The dataset is 14.2 GB in size.

Social capital I: measurement and associations with economic mobility

Creators: Chetty, Raj; Jackson, Matthew O.; Kuchler, Theresa; Stroebel, Johannes; Hiller, Abigail; Oppenheimer, Sarah
Publication Date: 2022
Social capital – the strength of our relationships and communities – has been shown to play an important role in outcomes ranging from income to health. This dataset provides a detailed analysis of social capital across various U.S. communities, focusing on its impact on economic mobility. Using privacy-protected data on 21 billion friendships from Facebook, we measure three types of social capital in each neighborhood, high school, and college in the United States:

  • Cohesiveness: the degree to which social networks are fragmented into cliques
  • Economic connectedness: the degree to which low-income and high-income people are friends with each other
  • Civic engagement: rates of volunteering and participation in community organizations

The dataset is approximately 8 MB in size and structured into different geographical levels, including ZIP codes, high schools, and colleges across the United States. Each entry details the three key measures of social capital—economic connectedness, cohesiveness, and civic engagement—allowing for targeted analysis at various community levels.

 

Facebook Privacy-Protected Full URLs Data Set

Creators: Messing, Solomon; DeGregorio, Christina; Hillenbrand, Bennett; King, Gary; Mahanti, Saurav; Mukerjee, Zagreb; Nayak, Chaya; Persily, Nate; State, Bogdan; Wilkins, Arjun
Publication Date: 2020

This is a codebook for data on the demographics of people who viewed, shared, and otherwise interacted with web pages (URLs) shared on Facebook between January 1, 2017 and October 31, 2022. The data has about 68 million URLs, over 3.1 trillion rows, and over 71 trillion cell values. The data results from a collaboration between Facebook and Social Science One (at IQSS at Harvard) and was originally prepared for Social Science One grantees. The codebook describes the “full” URLs dataset, including its scope, structure, and fields. This is version 10 of the codebook and data (released 4/13/2023); the dataset was first described by Gary King and Nathaniel Persily at https://socialscience.one/blog/update-social-science-one. The dataset is organized to facilitate detailed analysis. Each entry corresponds to a unique URL and includes aggregated user interaction metrics. These metrics are further broken down by demographic dimensions such as age, gender, and country. For users in the United States, additional categorizations include political page affinity, offering insight into how political leanings may influence content engagement.

Video Game Sales

Creators: Smith, Gregory
Publication Date: 2016

This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com. The dataset is 1.36 MB in size and includes games released up to the year 2016, offering a historical perspective on video game sales over several decades. It supports analysis of sales trends across regions, platforms, and genres, making it a useful resource for market analysis and strategic planning within the video game industry. Each entry in the dataset includes the following attributes:

  • Rank: Overall sales ranking of the game.
  • Name: Title of the game.
  • Platform: The platform on which the game was released (e.g., PC, PS4).
  • Year: Year of the game’s release.
  • Genre: Genre classification of the game.
  • Publisher: Company that published the game.
  • NA_Sales: Sales figures in North America (in millions).
  • EU_Sales: Sales figures in Europe (in millions).
  • JP_Sales: Sales figures in Japan (in millions).
  • Other_Sales: Sales figures in the rest of the world (in millions).
  • Global_Sales: Total worldwide sales (in millions).
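
A minimal sketch of how these attributes could be analyzed: the snippet below totals `Global_Sales` by genre and checks each row's global figure against the sum of its regional columns. The sample rows are made up for illustration, and the expectation that regional figures add up to the global total is an assumption worth verifying on the real file, where rounding may cause small gaps.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the attribute names listed above; values are illustrative.
SAMPLE_CSV = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Example Racer,PS4,2015,Racing,Example Games,1.50,1.00,0.25,0.25,3.00
2,Example Quest,PC,2012,Role-Playing,Sample Soft,0.50,0.75,0.50,0.25,2.00
3,Example Racer 2,PS4,2016,Racing,Example Games,0.75,0.50,0.00,0.25,1.50
"""

REGIONS = ("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")


def global_sales_by_genre(rows):
    """Sum Global_Sales (in millions) for each genre."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Genre"]] += float(row["Global_Sales"])
    return dict(totals)


def regional_gap(row):
    """Reported global sales minus the sum of the four regional columns."""
    return float(row["Global_Sales"]) - sum(float(row[r]) for r in REGIONS)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_genre = global_sales_by_genre(rows)
gaps = [regional_gap(r) for r in rows]
print(by_genre)
print(gaps)
```

A nonzero gap on real data would flag rows where the regional breakdown and the global total disagree, which is worth checking before aggregating.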
