
Characterizing Online Discussion Using Coarse Discourse Sequences

Creators: Zhang, Amy; Culbertson, Brian; Paritosh, Praveen
Publication Date: 2017

In this work, we present a novel method for classifying comments in online discussions into a set of coarse discourse acts towards the goal of better understanding discussions at scale. To facilitate this study, we devise a categorization of coarse discourse acts designed to encompass general online discussion and allow for easy annotation by crowd workers. We collect and release a corpus of over 9,000 threads comprising over 100,000 comments manually annotated via paid crowdsourcing with discourse acts and randomly sampled from the site Reddit. Using our corpus, we demonstrate how the analysis of discourse acts can characterize different types of discussions, including discourse sequences such as Q&A pairs and chains of disagreement, as well as different communities. Finally, we conduct experiments to predict discourse acts using our corpus, finding that structured prediction models such as conditional random fields can achieve an F1 score of 75%. We also demonstrate how the broadening of discourse acts from simply question and answer to a richer set of categories can improve the recall performance of Q&A extraction.

Central African Republic: Displacement Data - Baseline Assessment

Creators: International Organization for Migration (IOM)
Publication Date: 2024

This dataset describes internally displaced persons (IDPs), returnees within CAR (former IDPs), and returnees from other countries, broken down by origin and by period of displacement between 2013 and the date of assessment. It covers the distribution of IDPs and returnees, including their origins, periods of displacement, and reasons for displacement. This information is crucial for understanding the dynamics of displacement within CAR and for planning humanitarian interventions. As a sub-component of mobility tracking, the baseline assessment collects data on the presence of displaced populations in defined geographic areas. This foundational data supports the identification of needs and the coordination of assistance efforts. The assessment was run in 6 prefectures (admin1), 16 sub-prefectures (admin2), and 367 localities. The number of observations varies across rounds of data collection. For instance, Round 6 of the assessment identified a total displaced population of 1,074,983 individuals, comprising 580,692 IDPs and 375,684 returnees. The dataset is 16.7 kB and is organized into several key components:

  • IDP Data: Information on internally displaced persons, including their locations, demographics, and displacement periods.

  • Returnee Data: Details about returnees, both former IDPs and those returning from other countries, including their areas of origin and return timelines.

  • Reasons for Displacement: Categorization of displacement causes, such as armed conflicts, inter-community tensions, or preventive measures. Notably, 67% of internal displacements are linked to armed conflicts, 26% to inter-community tensions, and 6% are preventive.

  • Geographical Distribution: Data on the distribution of displaced populations across various regions and localities within CAR.

Crowdsourced air traffic data from The OpenSky Network 2020

Creators: Olive, Xavier; Strohmeier, Martin; Lübbe, Jannis
Publication Date: 2022

The data in this dataset is derived and cleaned from the full OpenSky dataset to illustrate the development of air traffic during the COVID-19 pandemic. It spans all flights seen by the network’s more than 2,500 members since 1 January 2019. More data will be periodically included in the dataset until the end of the COVID-19 pandemic. The dataset aggregates ADS-B signals received by these volunteers worldwide, which makes for a rich and diverse data source. It includes records of 41,900,660 flights from 160,737 unique aircraft, covering flight operations at 13,934 airports across 127 countries. In total, the dataset has a size of 7.0 GB. Each month is represented by a separate CSV file containing the flight data for that period. Each file includes the following columns:

  • callsign: Identifier used for air traffic control communications.
  • number: Commercial flight number, if available.
  • icao24: Unique 24-bit address assigned to the aircraft’s transponder.
  • registration: Aircraft’s registration number.
  • typecode: Aircraft model code.
  • origin: ICAO code of the departure airport.
  • destination: ICAO code of the arrival airport.
  • firstseen: Timestamp of the first detection during the flight.
  • lastseen: Timestamp of the last detection during the flight.
  • day: Date of the flight.
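As a rough illustration of how one of the monthly CSV files might be read, the sketch below counts departures per origin airport using only the column names listed above. The sample rows, the timestamp format, and the helper name `count_departures` are invented for illustration and are not taken from the dataset itself.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample using the documented column names.
# All values (and the timestamp format) are invented for illustration.
SAMPLE = """callsign,number,icao24,registration,typecode,origin,destination,firstseen,lastseen,day
ABC123,AB123,abc123,X-TEST,A320,EDDF,LFPG,2020-03-01 08:00:00+00:00,2020-03-01 09:05:00+00:00,2020-03-01 00:00:00+00:00
XYZ9,,def456,,B738,EDDF,,2020-03-01 10:00:00+00:00,2020-03-01 11:30:00+00:00,2020-03-01 00:00:00+00:00
"""


def count_departures(csv_file):
    """Count flights per origin airport, skipping rows with an empty origin."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        origin = row["origin"].strip()
        if origin:
            counts[origin] += 1
    return counts


departures = count_departures(io.StringIO(SAMPLE))
print(departures.most_common(1))
```

For a real monthly file, the same function could be given an open file handle instead of the in-memory sample, assuming the file carries a header row with these column names.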

ConceptNet

Creators: ConceptNet
Publication Date: 2021

ConceptNet aims to give computers access to common-sense knowledge, the kind of information that ordinary people know but usually leave unstated. ConceptNet is a semantic network that represents things that computers should know about the world, especially for the purpose of understanding text written by people. Its “concepts” are represented using words and phrases from many different natural languages; unlike similar projects, it is not limited to a single language such as English. It expresses over 13 million links between these concepts, and makes the whole data set available under a Creative Commons license. ConceptNet is structured as a graph, where nodes represent concepts (words or phrases) and edges represent the relationships between these concepts. Each edge is labeled with a relation type, such as “IsA,” “PartOf,” or “RelatedTo,” indicating the nature of the relationship. The dataset is organized into sub-datasets based on language and relation types, allowing users to work with specific subsets relevant to their applications.
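The graph structure described above can be pictured with a small sketch. It assumes edges are available as (start, relation, end) triples; the three example edges and the helper `build_index` are invented for illustration and do not reflect ConceptNet's actual file format or API.

```python
from collections import defaultdict

# Invented example edges in (start, relation, end) form. Only the relation
# names ("IsA", "PartOf", "RelatedTo") come from the description above.
EDGES = [
    ("dog", "IsA", "animal"),
    ("tail", "PartOf", "dog"),
    ("dog", "RelatedTo", "pet"),
]


def build_index(edges):
    """Index outgoing edges by start concept, then by relation type."""
    index = defaultdict(lambda: defaultdict(list))
    for start, relation, end in edges:
        index[start][relation].append(end)
    return index


index = build_index(EDGES)
print(index["dog"]["IsA"])  # ['animal']
```

Indexing by concept and then by relation mirrors the way the sub-datasets are split by relation type, so a lookup such as "everything that dog IsA" is a single dictionary access.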

O Say Can You See: Early Washington, D.C., Law and Family Project

Creators: O Say Can You See Project
Publication Date: 2024

This site documents the challenge to slavery and the quest for freedom in early Washington, D.C., by collecting, digitizing, making accessible, and analyzing freedom suits filed between 1800 and 1862, as well as tracing the multigenerational family networks they reveal. The project encompasses hundreds of freedom suits from the Circuit Court for the District of Columbia, Maryland state courts, and the U.S. Supreme Court, providing invaluable insights into the legal battles fought by enslaved individuals seeking freedom. By exploring the web of litigants, jurists, attorneys, and community members present in case files, the project maps the relationships of early Washington, D.C., in depth, illustrating how each person is connected to others in the city and beyond. The dataset has a size of 3.0 kB and is organized into several interconnected components:

  • People: A database of individuals involved in the cases, including litigants, jurists, attorneys, and community members, with detailed profiles and social connections.

  • Families: Kinship and family networks of multigenerational Black, white, and mixed families, created using information derived from court records and genealogical research.

  • Cases: A collection of hundreds of freedom cases from various courts, providing detailed accounts of each legal battle.

  • Stories: Interactive analyses of the court cases, families, attorneys, and judges, focusing on historical or legal questions raised by these cases.

Super Bowl Ads

Creators: Superbowl Ads; FiveThirtyEight
Publication Date: 2021

This dataset contains a list of ads from the 10 brands that had the most advertisements in Super Bowls from 2000 to 2020, according to data from superbowl-ads.com, with matching videos found on YouTube. Each advertisement is evaluated across seven defining characteristics: humor, early product display, patriotism, celebrity presence, danger elements, inclusion of animals, and use of sexual content. This granular assessment allows for in-depth analysis of advertising strategies. Links to the corresponding YouTube videos are included, giving immediate access to the commercials for further qualitative analysis. The dataset documents 233 advertisements and has a total size of 38.6 kB. Structurally, it is organized as a CSV file with the following columns:

  • year: Year the advertisement aired.

  • brand: Brand of the advertiser, standardized to account for variations and sub-brands.

  • superbowl_ads_dot_com_url: Link to the advertisement’s entry on superbowl-ads.com.

  • youtube_url: Link to the corresponding YouTube video of the advertisement.

  • funny: Indicates if the ad was intended to be humorous (TRUE/FALSE).

  • show_product_quickly: Indicates if the product was shown within the first 10 seconds (TRUE/FALSE).

  • patriotic: Indicates if the ad had patriotic elements (TRUE/FALSE).

  • celebrity: Indicates if a celebrity appeared in the ad (TRUE/FALSE).

  • danger: Indicates if the ad involved elements of danger (TRUE/FALSE).

  • animals: Indicates if animals were featured in the ad (TRUE/FALSE).

  • use_sex: Indicates if sexual content was used to promote the product (TRUE/FALSE).
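A minimal sketch of how these columns could be loaded, assuming the flags are stored as the literal strings TRUE and FALSE as described above. The two sample rows, the placeholder URLs, and the helper names `load_ads` and `flag_share` are invented for illustration.

```python
import csv
import io

# The seven TRUE/FALSE characteristic columns listed above.
FLAG_COLUMNS = [
    "funny", "show_product_quickly", "patriotic",
    "celebrity", "danger", "animals", "use_sex",
]

# Invented sample rows; brand names and URLs are placeholders.
SAMPLE = """year,brand,superbowl_ads_dot_com_url,youtube_url,funny,show_product_quickly,patriotic,celebrity,danger,animals,use_sex
2018,BrandA,https://example.com/a,https://example.com/va,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE
2019,BrandA,https://example.com/b,https://example.com/vb,FALSE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE
"""


def load_ads(csv_file):
    """Read ads, converting year to int and TRUE/FALSE strings to booleans."""
    ads = []
    for row in csv.DictReader(csv_file):
        row["year"] = int(row["year"])
        for column in FLAG_COLUMNS:
            row[column] = row[column].strip().upper() == "TRUE"
        ads.append(row)
    return ads


def flag_share(ads, column):
    """Return the fraction of ads for which the given flag is set."""
    return sum(ad[column] for ad in ads) / len(ads) if ads else 0.0


ads = load_ads(io.StringIO(SAMPLE))
print(flag_share(ads, "funny"))  # 0.5
```

Converting the flags to booleans once at load time keeps later aggregation, such as the share of humorous ads per brand or per year, down to simple sums.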

Stack Exchange Data

Creators: Stack Exchange Inc.
Publication Date: 2014

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is provided as a separate archive of XML files compressed with 7-Zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory, and PostLinks. The dataset covers detailed records of questions, answers, comments, user profiles, and other related metadata from numerous Stack Exchange communities. This breadth allows for in-depth analysis of community interactions, content evolution, and knowledge dissemination patterns. The dataset has a size of 92.3 GB and captures content from the inception of each Stack Exchange site up to the date of the specific data dump. For example, the September 2023 release includes data up to that month. Structurally, the database is organized into individual archives for each Stack Exchange community. Each archive contains several XML files representing different data tables, including:

  • Posts.xml: Contains both questions and answers, with fields detailing post ID, creation date, score, body content, and related metadata.

  • Users.xml: Includes user information such as user ID, reputation, creation date, and profile details.

  • Comments.xml: Encompasses comments made on posts, including comment ID, post ID, user ID, and content.

  • Votes.xml: Records voting data on posts, detailing vote type, user ID, and timestamps.
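Because the XML files can be very large, streaming them is usually safer than loading them whole. The sketch below counts posts by type with the standard library's incremental parser. It assumes each post is a `row` element carrying its fields as attributes, with a `PostTypeId` attribute in which 1 marks a question and 2 an answer; these names and the sample content are assumptions that should be checked against the schema notes shipped with the dump.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Invented two-post sample. The element and attribute names are assumptions
# about the dump layout, not confirmed by the description above.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" CreationDate="2014-01-01T00:00:00.000" Score="3" Body="Example question" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2014-01-01T01:00:00.000" Score="5" Body="Example answer" />
</posts>
"""


def count_post_types(xml_file):
    """Stream the file and count rows per PostTypeId without keeping them in memory."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            counts[elem.get("PostTypeId")] += 1
            elem.clear()  # release the row's attributes once counted
    return counts


post_types = count_post_types(io.BytesIO(SAMPLE))
print(post_types["1"], post_types["2"])  # 1 1
```

The same function could be pointed at an extracted Posts.xml opened in binary mode; the counter keys stay as strings because XML attribute values are not typed.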

U.S.-Mexico Border Surveillance Data

Creators: Electronic Frontier Foundation (EFF)
Publication Date: 2024

This dataset includes the precise locations of Customs & Border Protection (CBP) surveillance towers, proposed tower sites, automated license plate readers, aerostats (tethered surveillance balloons), and facial recognition systems at land ports of entry. There is an accompanying blog post and map. This extensive mapping offers valuable insights into the deployment and reach of surveillance infrastructure along the border. By making this data publicly available, the database facilitates research into the implications of surveillance practices for civil liberties and border communities. The dataset was last updated on November 21, 2024, reflects the state of surveillance infrastructure up to that date, and has a total size of 187.3 kB. Structurally, the dataset is organized into several categories, each detailing a specific type of surveillance technology:

  • Surveillance Towers: Locations and specifications of existing and proposed CBP surveillance towers.

  • Automated License Plate Readers (ALPRs): Positions of ALPR systems used to monitor vehicle movements across the border.

  • Aerostats: Details on tethered surveillance balloons employed for aerial monitoring.

  • Facial Recognition Systems: Information on the deployment of facial recognition technology at land ports of entry.
