
Raw Bay Area Craigslist Rental Housing Posts

Creators: Pennington, Kate
Publication Date: 2018

Like many cities, San Francisco doesn’t track rents. The Bay Area Craigslist Rental Housing Posts dataset comprises rental housing listings from the San Francisco Bay Area spanning 2000 to 2018. Each entry includes attributes such as posting date, neighborhood, price, number of bedrooms and bathrooms, square footage, and geographic coordinates, enabling in-depth analysis of housing trends. The dataset documents 200,796 individual rental listings and has a total size of 16.3 kB. It is organized into three main components:

  1. Raw Data (2000-2012): This subset includes 167,090 entries with fields for posting date, title, and neighborhood.

  2. Raw Data (2013-2018): Comprising 58,551 entries, this subset offers more detailed information, including post ID, date, neighborhood, price, square footage, number of bedrooms, address, latitude, longitude, description, title, and details.

  3. Cleaned Data (2000-2018): This consolidated and processed dataset contains 200,796 entries with variables such as post ID, date, year, neighborhood, city, county, price, number of bedrooms, number of bathrooms, square footage, room type indicator, address, latitude, longitude, title, description, and details.
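The cleaned 2000-2018 component lends itself to simple trend analyses. A minimal sketch, assuming CSV headers that match the variable list above (the real column names in Pennington’s files may differ, and the sample rows are invented):

```python
import csv
import io
from statistics import median

# Illustrative sample in the cleaned-data schema; column names are
# assumptions based on the variable list, values are invented.
sample = io.StringIO(
    "post_id,date,year,nhood,price,beds\n"
    "1,2004-05-01,2004,Mission,1500,1\n"
    "2,2004-06-12,2004,SOMA,1900,2\n"
    "3,2017-03-02,2017,Mission,3400,1\n"
)

# Group asking rents by year, then take the median per year.
by_year = {}
for row in csv.DictReader(sample):
    by_year.setdefault(row["year"], []).append(int(row["price"]))

medians = {year: median(prices) for year, prices in sorted(by_year.items())}
```

The same grouping works for neighborhoods or bedroom counts by swapping the key column.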

Creators: Habernal, Ivan

This dataset contains transcribed court hearings sourced from official hearings of the European Court of Human Rights (https://www.echr.coe.int/webcasts-of-hearings). The hearings are 154 selected webcasts (videos) from 2012-2022 in their original language (no interpretation). With manual annotation of language labels and automatic processing of the extracted audio with pyannote and whisper-large-v2, the resulting dataset contains 4,000 speaker turns and 88,920 individual lines. The dataset has a size of 1.9 MB and contains two subsets: the transcripts, and the metadata with linked documents. The transcripts are additionally available as .txt or .xml.
Languages

The languages most represented in the transcripts are English and French.

A smaller portion also contains the following languages:

Russian, Spanish, Croatian, Italian, Portuguese, Turkish, Polish, Lithuanian, German, Ukrainian, Hungarian, Dutch, Albanian, Romanian, Serbian

The collected metadata is in English.
Dataset Structure
Data Instances

Each instance in transcripts represents an entire segment of a transcript, similar to a conversation turn in a dialog.

{'id': 0, 'webcast_id': '1021112_29112017', 'segment_id': 0, 'speaker_name': 'UNK', 'speaker_role': 'Announcer', 'data': {'begin': [12.479999542236328], 'end': [13.359999656677246], 'language': ['fr'], 'text': ['La Cour!']}}

Each instance in documents represents information on a document in HUDOC associated with a hearing, together with the metadata of that hearing. The actual document is linked and can also be found in HUDOC via the case_id. Note: hearing_type states the type of the hearing, while type states the type of the document. If the hearing is a “Grand Chamber hearing”, the “CHAMBER” document refers to a different hearing.

{
  'id': 16,
  'webcast_id': '1232311_02102012',
  'hearing_title': 'Michaud v. France (nos. 12323/11)',
  'hearing_date': '2012-10-02 00:00:00',
  'hearing_type': 'Chamber hearing',
  'application_number': ['12323/11'],
  'case_id': '001-115377',
  'case_name': 'CASE OF MICHAUD v. FRANCE',
  'case_url': 'https://hudoc.echr.coe.int/eng?i=001-115377',
  'ecli': 'ECLI:CE:ECHR:2012:1206JUD001232311',
  'type': 'CHAMBER',
  'document_date': '2012-12-06 00:00:00',
  'importance': 1,
  'articles': ['8', '8-1', '8-2', '34', '35'],
  'respondent_government': ['FRA'],
  'issue': 'Decision of the National Bar Council of 12 July 2007 "adopting regulations on internal procedures for implementing the obligation to combat money laundering and terrorist financing, and an internal supervisory mechanism to guarantee compliance with those procedures"; Article 21-1 of the Law of 31 December 1971; Law no. 2004-130 of 11 February 2004; Monetary and Financial Code',
  'strasbourg_caselaw': 'André and Other v. France, no 18603/03, 24 July 2008;Bosphorus Hava Yollari Turizm ve Ticaret Anonim Sirketi v. Ireland [GC], no 45036/98, ECHR 2005-VI;[…]',
  'external_sources': 'Directive 91/308/EEC, 10 June 1991;Article 6 of the Treaty on European Union;Charter of Fundamental Rights of the European Union;Articles 169, 170, 173, 175, 177, 184 and 189 of the Treaty establishing the European Community;Recommendations 12 and 16 of the financial action task force ("FATF") on money laundering;Council of Europe Convention on Laundering, Search, Seizure and Confiscation of the Proceeds from Crime and on the Financing of Terrorism (16 May 2005)',
  'conclusion': 'Remainder inadmissible;No violation of Article 8 - Right to respect for private and family life (Article 8-1 - Respect for correspondence;Respect for private life)',
  'separate_opinion': True
}
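A small sketch of working with transcript instances of the shape documented above, for example to group speaker turns by language. The two records here are invented stand-ins; only the field names follow the documented example:

```python
# Invented sample turns following the documented instance shape.
turns = [
    {"id": 0, "webcast_id": "1021112_29112017", "segment_id": 0,
     "speaker_name": "UNK", "speaker_role": "Announcer",
     "data": {"begin": [12.48], "end": [13.36],
              "language": ["fr"], "text": ["La Cour!"]}},
    {"id": 1, "webcast_id": "1021112_29112017", "segment_id": 1,
     "speaker_name": "UNK", "speaker_role": "Judge",
     "data": {"begin": [14.0], "end": [20.5],
              "language": ["en"], "text": ["The hearing is open."]}},
]

# Collect every line of text under its language label.
texts_by_language = {}
for turn in turns:
    data = turn["data"]
    for lang, text in zip(data["language"], data["text"]):
        texts_by_language.setdefault(lang, []).append(text)
```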

SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments

Creators: Lange, Kai-Robin; Jentsch, Carsten
Publication Date: 2023

A dataset of German parliamentary debates covering 74 years of plenary protocols across all 16 state parliaments of Germany as well as the German Bundestag. The debates are separated into individual speeches, which are enriched with metadata identifying the speaker as a member of the parliament (mp). In total, the dataset covers more than 15,000,000 speeches with a total size of 10 GB.

When using this data set, please cite the original paper: Lange, K.-R., Jentsch, C. (2023). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments. Proceedings of the 3rd Workshop on Computational Linguistics for Political Text Analysis @ KONVENS 2023.

The metadata is separated into two different types: time-specific metadata that holds only for a legislative period and can change over time (e.g. the party or constituency of an mp), and metadata that is considered fixed, such as the birth date or the name of a speaker. The former is stored along with the speeches, as it is considered temporal information of that point in time, but is additionally stored in the file all_mps_mapping.csv if there is the need to double-check something. The rest of the metadata is stored in the file all_mps_meta.csv. The metadata from this file can be matched with a speech by comparing the speaker ID variable “MPID”. The speeches of each parliament are saved in CSV format. Along with the speeches, they contain the following metadata:

  • Period: int. The period in which the speech took place
  • Session: int. The session in which the speech took place
  • Chair: boolean. The information if the speaker was the chair of the plenary session
  • Interjection: boolean. The information if the speech is a comment or an interjection from the crowd
  • Party: list (e.g. [“cdu”] or [“cdu”, “fdp”] when having more than one speaker during an interjection). List of the party of the speaker or the parties whom the comment/interjection references
  • Constituency: string. The constituency of the speaker in the current legislative period
  • MPID: int. The ID of the speaker, which can be used to get more meta-data from the file all_mps_meta.csv

The file all_mps_meta.csv contains the following meta information:

  • MPID: int. The ID of the speaker, which can be used to match the mp with his/her speeches.
  • WikipediaLink: The link to the mp’s Wikipedia page
  • WikiDataLink: The link to the mp’s WikiData page
  • Name: string. The full name of the mp.
  • Last Name: string. The last name of the mp, found on WikiData. If no last name is given on WikiData, the full name was heuristically cut at the last space to get the information necessary for splitting the speeches.
  • Born: string, format: YYYY-MM-DD. Birth date of the mp. If an exact birth date is found on WikiData, this exact date is used. Otherwise, a day in the year of birth given on Wikipedia is used.
  • SexOrGender: string. Information on the sex or gender of the mp. Disclaimer: This information was taken from WikiData, which does not seem to differentiate between sex and gender.
  • Occupation: list. Occupation(s) of the mp.
  • Religion: string. Religious beliefs of the mp.
  • AbgeordnetenwatchID: int. ID of the mp on the website Abgeordnetenwatch
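The MPID match described above can be sketched as a simple dictionary join. Column names follow the field lists in this entry, but the sample rows are invented:

```python
import csv
import io

# Invented sample rows in the documented speech and metadata schemas.
speeches = list(csv.DictReader(io.StringIO(
    "Period,Session,Chair,Interjection,Party,Constituency,MPID\n"
    "17,12,False,False,['spd'],Dortmund I,4711\n"
)))
meta = {row["MPID"]: row for row in csv.DictReader(io.StringIO(
    "MPID,Name,Born,SexOrGender\n"
    "4711,Max Mustermann,1970-01-01,male\n"
))}

# Enrich each speech with the speaker's fixed metadata via MPID.
enriched = [{**speech, **meta.get(speech["MPID"], {})} for speech in speeches]
```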

IAB-SMART-Mobility

Creators: Research Data Centre of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB)
Publication Date: 2023

The IAB-SMART Mobility data module provides smartphone-generated mobility indicators as additional variables for the Panel Labor Market and Social Security (PASS). Participants in IAB-SMART are PASS respondents from wave 11 (2017) with an Android smartphone who were invited to install the IAB-SMART app. With the user’s consent, the IAB-SMART app collected the location of the smartphone at half-hour intervals from January to August 2018. Mobility indicators such as the number of visited places or the share of location measurements at home were derived from this location information. The mobility indicators cover the entire observation period as well as weekdays and weekends separately. With IAB-SMART Mobility, research questions relating mobility behaviour to employment and unemployment can be analysed. The indicators from IAB-SMART Mobility complement the survey data from PASS and can also be linked with the administrative data from IAB (PASS-ADIAB). More information can be found on the website of the FDZ-BA/IAB: DOI 10.5164/IAB.IAB-SMART.de.en.v1

Detailed information about the data module, as well as a guide on how to link IAB-SMART Mobility with PASS data, can be found in the following publication: Zimmermann, Florian; Filser, Andreas; Haas, Georg-Christoph; Bähr, Sebastian (2023). The IAB-SMART-Mobility Module: An Innovative Research Dataset with Mobility Indicators Based on Raw Geodata. Jahrbücher für Nationalökonomie und Statistik. https://doi.org/10.1515/jbnst-2023-0051

Fact-Checking Facebook Politics Pages

Creators: Silverman, Craig; Strapagiel, Lauren; Shaban, Hamza; Hall, Ellie; Singer-Vine, Jeremy
Publication Date: 2016

This repository contains the data and analysis for the BuzzFeed News article, “Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate,” published October 20, 2016. The dataset specifically examines content from hyperpartisan Facebook pages, providing insights into the spread of false and misleading information within polarized political communities. It includes Facebook engagement figures—such as shares, reactions, and comments—offering a perspective on how users interact with content of varying accuracy. The dataset has a total size of 364.8 kB and contains over 1,000 posts from hyperpartisan political Facebook pages. Data was collected up to October 11, 2016, capturing a snapshot of political content leading up to the 2016 U.S. presidential election. Structurally, the dataset is organized into a spreadsheet with columns representing:

  • Post Content: Text of the Facebook post.

  • Fact-Check Rating: Assessment of the post’s accuracy.

  • Engagement Metrics: Counts of shares, reactions, and comments.
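The columns above support comparing engagement across fact-check ratings, the comparison at the heart of the analysis. A hedged sketch with invented ratings and counts (the spreadsheet’s exact column names may differ):

```python
from statistics import median

# Invented sample posts with a rating and one engagement metric each.
posts = [
    {"rating": "mostly true", "shares": 120},
    {"rating": "mostly true", "shares": 80},
    {"rating": "mostly false", "shares": 950},
]

# Group share counts by fact-check rating, then take the median per rating.
shares_by_rating = {}
for post in posts:
    shares_by_rating.setdefault(post["rating"], []).append(post["shares"])

median_shares = {r: median(s) for r, s in shares_by_rating.items()}
```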

2012-2016 Facebook Posts

Creators: Martinchek, Patrick
Publication Date: 2016

This dataset comprises Facebook posts from 15 mainstream media sources during the years 2012 to 2016. It includes posts from the top mainstream media outlets, offering insights into their social media strategies and audience engagement during a significant period in digital media evolution. The dataset is structured to include various fields such as post content, timestamps, and engagement metrics like likes, shares, and comments. Each record represents a single Facebook post, allowing for detailed analysis of individual entries. It has a size of 861.17 MB.

Huge Collection of Reddit Votes

Creators: Leake, Joseph
Publication Date: 2020

The dataset covers over 44 million upvotes and downvotes cast by Reddit users between 2007 and 2020. It is a tab-delimited list of votes cast by Reddit users who have opted in to make their voting history public, ensuring compliance with privacy preferences. Each row contains the submission ID of the thread being voted on, the subreddit the submission was located in, the epoch timestamp associated with the vote, the voter’s username, and whether it was an upvote or a downvote. A separate file contains information about the submissions that were voted on. The dataset has a size of 21.9 kB. Structurally, it is organized into two main components:

  1. Votes Data: A tab-delimited file where each row represents a vote with the following fields:

    • submission_id: Identifier of the Reddit submission that received the vote.

    • subreddit: Name of the subreddit where the submission was posted.

    • created_time: Epoch timestamp indicating when the vote was cast.

    • username: Reddit username of the voter.

    • vote: Type of vote, either 'upvote' or 'downvote'.

  2. Submissions Data: A separate file containing information about the submissions that received votes, including details such as submission titles, authors, and timestamps.
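Reading the votes file can be sketched with the stdlib csv module in tab-delimited mode. The rows below are invented, and whether the real file carries a header row is left open here (this sample has none):

```python
import csv
import io
from collections import Counter

# Invented sample rows in the documented field order.
raw = io.StringIO(
    "t3_abc\tAskReddit\t1577836800\talice\tupvote\n"
    "t3_abc\tAskReddit\t1577836900\tbob\tdownvote\n"
    "t3_def\tpython\t1577837000\talice\tupvote\n"
)
fields = ["submission_id", "subreddit", "created_time", "username", "vote"]
votes = [dict(zip(fields, row)) for row in csv.reader(raw, delimiter="\t")]

# Tally votes by subreddit and direction.
tallies = Counter((v["subreddit"], v["vote"]) for v in votes)
```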

Tracking Mastodon user numbers over time

Creators: Willison, Simon
Publication Date: 2022

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter. I’ve set up a new git scraper to track the number of registered user accounts on known Mastodon instances over time. The dataset collects data from numerous Mastodon instances, providing a holistic view of user distribution across the network. This approach captures the decentralized nature of Mastodon, offering insights into individual server growth and overall network expansion. By recording user numbers at regular intervals, the dataset enables the analysis of growth patterns over time, identifying trends and significant adoption milestones. The dataset includes user counts from approximately 1,830 Mastodon instances, with data points collected approximately every 20 minutes. This frequency allows for detailed temporal analysis of user growth. Data collection began on November 20, 2022, and has continued since then, capturing the rapid growth of Mastodon following significant events such as changes in other social media platforms.

The dataset is structured with each record representing a snapshot of user numbers across various Mastodon instances at a specific timestamp. Key fields include:

  • Instance Name: The domain name of the Mastodon instance.

  • User Count: The number of registered users on the instance at the time of data collection.

  • Timestamp: The date and time when the data was collected.
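With snapshots keyed on the fields above, per-instance growth between two collection times is a simple difference. A minimal sketch; the instance names are real Mastodon servers but the counts and interval are invented:

```python
# Two invented snapshots: instance name -> registered user count.
snapshot_earlier = {"mastodon.social": 800_000, "fosstodon.org": 40_000}
snapshot_later = {"mastodon.social": 950_000, "fosstodon.org": 47_000}

# Absolute user growth per instance between the two snapshots.
growth = {name: snapshot_later[name] - snapshot_earlier.get(name, 0)
          for name in snapshot_later}
fastest = max(growth, key=growth.get)
```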
