Resources by Stefan

Third Eye Data: TV News Archive chyrons

Creators: TV News Archive
Publication Date: 2017

The Third Eye: TV News Archive Chyrons dataset captures and analyzes the “lower third” text, known as chyrons, displayed during live TV news broadcasts. This dataset provides a unique look into the real-time editorial choices of major news networks, offering insights into how different media outlets frame news stories. Using Optical Character Recognition (OCR) technology, chyrons are extracted and archived continuously, making it possible to track how key topics are covered over time.

At its inception in September 2017, the dataset collected chyrons from four major news networks: BBC News, CNN, Fox News, and MSNBC. Within two weeks of its launch, over four million chyrons had been captured, which gives a sense of the volume of real-time data involved. The dataset has been continuously updated since, allowing for longitudinal studies of media framing and news presentation trends. Its size is approximately 12.5 kB in TSV format.

The dataset is structured into several key components. Each chyron entry includes:

  • The exact chyron text, showing the wording used by the network.
  • Timestamps, allowing analysis of how frequently specific topics appear.
  • Channel identifiers, enabling comparisons between different networks.
  • Duration data, indicating how long a chyron remained on screen, which can suggest emphasis or prioritization of certain stories.
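
As a rough illustration of how these fields could be used together, the sketch below counts, per channel, the chyrons that mention a keyword. The column order in `COLUMNS`, the channel codes, and the sample rows are assumptions made up for this example; they are not taken from the dataset, so check the actual TSV layout before reusing it.

```python
import csv
import io
from collections import Counter

# Hypothetical column order; the real file may differ.
COLUMNS = ["timestamp", "channel", "duration_seconds", "text"]


def count_chyrons_by_channel(tsv_text, keyword):
    """Count chyrons per channel whose text mentions `keyword` (case-insensitive)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in chyron text literally
    )
    counts = Counter()
    needle = keyword.lower()
    for row in reader:
        if needle in (row["text"] or "").lower():
            counts[row["channel"]] += 1
    return counts


# Made-up sample rows, for illustration only.
SAMPLE = (
    "2017-09-07 00:00\tCHANNEL_A\t12\tSTORM NEARS THE COAST\n"
    "2017-09-07 00:01\tCHANNEL_B\t30\tSENATE VOTES ON BUDGET\n"
    "2017-09-07 00:02\tCHANNEL_B\t8\tSTORM WARNINGS ISSUED\n"
    "2017-09-07 00:03\tCHANNEL_A\t15\tSTORM UPDATE AT THE TOP OF THE HOUR\n"
)

storm_counts = count_chyrons_by_channel(SAMPLE, "storm")
```

The same loop could accumulate the duration column instead of a simple count if on-screen time is a better proxy for emphasis than frequency.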

By leveraging this dataset, researchers, journalists, and media analysts can examine bias in news presentation, media influence on public perception, and breaking news coverage trends. It serves as a powerful tool for studying news framing, editorial strategies, and the evolution of televised news narratives across competing networks.

CO2 emissions and ancillary data for 343 cities from diverse sources

Creators: Nangini, Cathy; Peregon Anna; Ciais, Philippe; Weddige, Ulf; Vogel, Felix; Wang, Jun; Bréon, François-Marie; Bachra, Simeran; Wang, Yilong; Gurney, Kevin; Yamagata, Yoshiki; Appleby, Kyra; Telahoun, Sara; Canadell, Josep G; Grübler, Arnulf; Dhakal, Shobhakar; Creutzig, Felix
Publication Date: 2019

This dataset collects anthropogenic carbon dioxide emissions data, supplemented with various socio-economic and environmental factors, across 343 cities worldwide. It has dimensions 343 × 179 and combines CO2 emissions from CDP (187 cities, few in developing countries), the Bonn Center for Local Climate Action and Reporting (73 cities, mainly in developing countries), and data collected by Peking University (83 cities in China). Further, a set of socio-economic variables (called ancillary data) was collected from other datasets (e.g. socio-economic and traffic indices) or calculated (climate indices, urban area expansion), then combined with the emission data. The remaining attributes are descriptive (e.g. city name, country) or related to quality assurance/control checks. The file size is 1.8 MB, and the majority (88%) of the cities reported emissions between 2010 and 2015. Structurally, the dataset contains:

  • City Identification: Each entry includes city name, country, and other descriptive attributes.
  • CO₂ Emissions Data: Reported emissions with quality assurance/control checks.
  • Ancillary Variables: Socio-economic data, traffic indices, climate indices, urban area expansion metrics, and more.

Please open the file using Tab as the separator and " as the text delimiter.
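
For readers working in code rather than a spreadsheet application, the same settings can be passed to Python's standard `csv` module. This is a minimal sketch: the sample text and its column names are invented for illustration and do not come from the dataset.

```python
import csv
import io


def read_tab_separated(text):
    """Parse text that uses Tab as separator and the double quote as text delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quotechar='"')
    return list(reader)


# Invented sample: the quoted last field contains a Tab that must not split the field.
SAMPLE = (
    "City name\tCountry\tNote\n"
    '"Example City"\tExampleland\t"a note with a\ttab inside"\n'
)

rows = read_tab_separated(SAMPLE)
```

When reading the real file from disk, open it with `newline=""` as the `csv` documentation recommends, and set the encoding explicitly if city names contain non-ASCII characters.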

MUStARD: Multimodal Sarcasm Detection Dataset

Creators: Castro, Santiago; Hazarika, Devamanyu; Pérez-Rosas, Verónica; Zimmermann, Roger; Mihalcea, Rada; Poria, Soujanya
Publication Date: 2019

We release the MUStARD dataset, a multimodal video corpus for research in automated sarcasm discovery. The dataset is compiled from popular TV shows including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each entry combines textual transcripts, audio signals, and visual cues, enabling analysis of sarcasm as it manifests across different channels. Beyond isolated utterances, the dataset includes the preceding conversational context, showing how prior dialogue influences the interpretation of sarcasm. The dataset was compiled and released in 2019 and is approximately 11.9 kB in size.

Key numbers:

  • Total Utterances: 690

  • Sarcastic Utterances: 345

  • Non-Sarcastic Utterances: 345

The dataset is organized in JSON format with the following fields:

  • utterance: The text of the target utterance to classify.

  • speaker: Speaker of the target utterance.

  • context: List of utterances (in chronological order) preceding the target utterance.

  • context_speakers: Respective speakers of the context utterances.

  • sarcasm: Binary label indicating sarcasm (1 for sarcastic, 0 for non-sarcastic).
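
The fields above can be read with a few lines of Python. The sketch below assumes that the top level of the JSON file is an object keyed by an utterance ID; that layout, the IDs, and the sample records are assumptions made for this example. Only the five field names come from the description above.

```python
import json
from collections import Counter

# Made-up records using the documented field names; the IDs are hypothetical.
SAMPLE_JSON = """
{
  "example_1": {
    "utterance": "Oh great, another meeting.",
    "speaker": "PERSON_A",
    "context": ["We have a meeting at nine.", "And another one at ten."],
    "context_speakers": ["PERSON_B", "PERSON_B"],
    "sarcasm": 1
  },
  "example_2": {
    "utterance": "I will bring the slides.",
    "speaker": "PERSON_B",
    "context": ["Who is presenting?"],
    "context_speakers": ["PERSON_A"],
    "sarcasm": 0
  }
}
"""


def label_counts(records):
    """Count sarcastic (True) and non-sarcastic (False) target utterances."""
    return Counter(bool(record["sarcasm"]) for record in records.values())


def misaligned_context(records):
    """Return IDs whose context and context_speakers lists differ in length."""
    return [
        key
        for key, record in records.items()
        if len(record["context"]) != len(record["context_speakers"])
    ]


records = json.loads(SAMPLE_JSON)
counts = label_counts(records)
problems = misaligned_context(records)
```

On the full dataset, `label_counts` would be expected to report the 345/345 split listed under the key numbers, which makes it a quick sanity check after loading.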

State of the State

Creators: Fivethirtyeight
Publication Date: 2019

We conducted a text analysis of all 50 governors’ 2019 state of the state speeches to see what issues were talked about the most and whether there were differences between what Democratic and Republican governors were focusing on.

index.csv contains a listing of the 50 speeches, one for each state, as well as the name and party of the state’s governor and a link to an official source for the speech.

words.csv contains every one-word phrase that was mentioned in at least 10 speeches and every two- or three-word phrase that was mentioned in at least five speeches, after a list of stop-words was removed and the word “healthcare” was replaced with “health care” so that they were not counted as distinct phrases. It also contains the results of a chi^2 test for each phrase, including the associated p-value, which indicates how statistically significant the difference in usage is. Overall, the dataset is 134.5 kB in size.
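
To make the chi^2 columns more concrete, the sketch below runs a Pearson chi-square test on a 2 × 2 table (party × whether the speech mentions the phrase). It is an illustration under stated assumptions, not a reproduction of the analysis behind words.csv: the exact test setup used there is not described here, no continuity correction is applied, and the counts in the example are hypothetical.

```python
import math


def chi_square_2x2(dem_with, dem_total, rep_with, rep_total):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Returns (chi2, p_value) for the table
        [[dem_with, dem_total - dem_with],
         [rep_with, rep_total - rep_with]].
    """
    a, b = dem_with, dem_total - dem_with
    c, d = rep_with, rep_total - rep_with
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the statistic is undefined; report no evidence.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: a phrase used in 15 of 25 speeches by one party
# and in 5 of 25 speeches by the other.
chi2, p_value = chi_square_2x2(15, 25, 5, 25)
```

With only 50 speeches, expected cell counts can be small, in which case an exact test may be a better choice than the chi-square approximation.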

Steam Video Game and Bundle Data

Creators: Kang, Wang-Cheng; McAuley, Julian; Pathak, Apurva; Gupta, Kshitiz
Publication Date: 2018

These datasets collect user interactions and metadata from the Steam platform and are aimed at facilitating research in recommendation systems and user behavior analysis. They contain a large number of user reviews, capturing detailed feedback and engagement levels. They also describe game bundles, detailing which games are sold together, which supports the analysis of bundling strategies and their effectiveness. The data covers user reviews and interactions from October 2010 to January 2018, is approximately 1.4 GB in size, and includes:
  • Reviews: 7,793,069
  • Users: 2,567,538
  • Items: 15,474
  • Bundles: 615

Structurally, the dataset comprises several key components:

  • User Reviews: Each review entry includes the user ID, game ID, review text, and associated metadata such as timestamps and ratings.

  • Game Metadata: Information about each game, including game ID, title, genre, developer, and pricing details.

  • Bundle Details: Descriptions of game bundles, specifying the bundle ID, included games, and pricing information.

 

Social Recommendation Data

Creators: Cai, Chenwei; He, Ruining; McAuley, Julian; Zhao, Tong; King, Irwin
Publication Date: 2017

These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews). The user ratings allow for detailed analysis of user preferences, and the social (or trust) relationships make it possible to study how social connections influence user behavior and recommendations. The data is approximately 660 MB in size and includes:

Number of Observations:

  • LibraryThing:

    • Users: 73,882
    • Items: 337,561
    • Ratings: 979,053
    • Social Relations: 120,536
  • Epinions:

    • Users: 116,260
    • Items: 41,269
    • Ratings/Feedback: 181,394
    • Social Relations: 181,304

The dataset is structured into:

  • User Information: Anonymized user identifiers.

  • Item Information: Identifiers for items such as books or products.

  • Ratings/Feedback: User-provided ratings or feedback scores for items.

  • Social Relations: Mappings of social or trust relationships between users.


Twitch Livestreaming Interactions

Creators: Rappaz, Jérémie; McAuley, Julian; Aberer, Karl
Publication Date: 2021

This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes for 43 days in July 2019, resulting in 6,148 time steps. The dataset captures interactions between users and streamers at a high temporal resolution, allowing for detailed analysis of how audiences engage with live content. It is 6.47 GB in size.

Overall, it includes:

  • Users: 100k
  • Streamers (items): 162.6k
  • Interactions: 3M
  • Time steps: 6,148

Structurally, the dataset encompasses the following information:

  • User ID: Anonymized identifier for each user.

  • Stream ID: Identifier for each streaming session.

  • Streamer Username: Name of the channel or streamer.

  • Time Start: The initial time step (in 10-minute intervals) when the user was observed in the chat.

  • Time Stop: The final time step (in 10-minute intervals) when the user was observed in the chat.
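
Because Time Start and Time Stop are expressed in 10-minute steps, viewing time can be estimated directly from them. The sketch below assumes a comma-separated file without a header row, with columns in the order listed above, and treats the duration as (stop - start) × 10 minutes. Both the file layout and the treatment of the stop step are assumptions, since the description does not say whether the stop step is inclusive. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict

MINUTES_PER_STEP = 10  # the data was sampled every 10 minutes


def minutes_watched_per_user(csv_text):
    """Sum the estimated viewing time in minutes for each user ID."""
    totals = defaultdict(int)
    for user_id, _stream_id, _streamer, start, stop in csv.reader(io.StringIO(csv_text)):
        totals[user_id] += (int(stop) - int(start)) * MINUTES_PER_STEP
    return dict(totals)


# Made-up rows: user ID, stream ID, streamer username, time start, time stop.
SAMPLE = (
    "1,s100,streamer_a,0,3\n"
    "1,s101,streamer_b,10,11\n"
    "2,s100,streamer_a,2,3\n"
)

totals = minutes_watched_per_user(SAMPLE)
```

If the stop step turns out to be inclusive, adding one step per row before multiplying would be the corresponding adjustment.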

One million comic books panel

Creators: Iyyer, Mohit; Manjunatha, Varun; Guha, Anupam; Vyas, Yogarshi; Boyd-Graber, Jordan; Daumé III, Hal; Davis, Larry
Publication Date: 2016

Visual narrative is often a combination of explicit information and judicious omissions, relying on the viewer to supply missing details. In comics, most movements in time and space are hidden in the “gutters” between panels. To follow the story, readers logically connect panels together by inferring unseen actions through a process called “closure”. While computers can now describe what is explicitly depicted in natural images, in this paper we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels. We construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context. Various deep neural architectures underperform human baselines on these tasks, suggesting that COMICS contains fundamental challenges for both vision and language.

Overall, the dataset is organized into three components:

  • Panel Images: Each panel is stored as an image file, capturing the visual content of the comic scenes.

  • Textbox Transcriptions: Textual content from each panel is extracted using OCR, allowing for analysis of dialogues, narratives, and other textual elements.

  • Metadata: Additional information such as panel dimensions, position within the page, and associated comic book identifiers is included to facilitate detailed analyses.
