Resources by Stefan

National Zoning and Land Use Database

Creators: Desmond, Matthew; Mleczko, Matthew
Publication Date: 2023

We constructed the National Zoning and Land Use Database (NZLUD) by applying natural language processing techniques to publicly available administrative data. The NZLUD provides national zoning and land use data for the 2019-2022 period. We supply our source code to enable timely access to publicly available zoning information. Users can rerun our code at regular intervals to create panel zoning data going forward. Users can also extend it to additional municipalities or to zoning and land use measures not currently captured by our process. The intent is for this code to further automate the process of building national zoning and land use information in an open-source way. Overall, the dataset has a size of 82.2 kB and is structured into:

  • Municipal-Level Data: The file contains zoning and land use information for individual municipalities. This includes various measures extracted using NLP techniques from municipal codes, focusing on aspects such as zoning regulations and land use policies.

  • MSA-Level Data: This file aggregates the municipal-level data to the MSA level, providing a broader view of zoning and land use patterns across metropolitan areas.
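The relationship between the two files can be sketched in a few lines of Python. This is only an illustration: the column names (`municipality`, `msa`, `min_lot_size_acres`), the sample values, and the use of an unweighted mean are assumptions made for the example, not the NZLUD's actual schema or aggregation method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical municipal-level rows; names and values are illustrative
# assumptions, not taken from the NZLUD files.
municipal_rows = [
    {"municipality": "A", "msa": "Metro 1", "min_lot_size_acres": 0.25},
    {"municipality": "B", "msa": "Metro 1", "min_lot_size_acres": 0.75},
    {"municipality": "C", "msa": "Metro 2", "min_lot_size_acres": 1.0},
]


def aggregate_to_msa(rows, measure):
    """Average one municipal-level measure within each MSA (unweighted)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["msa"]].append(row[measure])
    return {msa: mean(values) for msa, values in grouped.items()}


msa_means = aggregate_to_msa(municipal_rows, "min_lot_size_acres")
print(msa_means)
```

A population-weighted average may be more appropriate for some measures; the unweighted mean is used here only to keep the sketch short.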

Project Foundations of Legal Data Science I

Creators: Fobbe, Seán
Publication Date: 2019

This project presents the first two of a new series of open and high-quality international legal data sets: comprehensive, fully reproducible, human- and machine-readable open access collections covering one hundred years of case law of the primary judicial organs of the United Nations and the League of Nations. The two collections are the Corpus of Decisions: International Court of Justice (CD-ICJ) and the Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ).

New York Prison Employee Discipline Data

Creators: The Marshall Project
Publication Date: 2023

The Marshall Project examined 12 years of employee discipline data and hundreds of prisoner lawsuits. The Marshall Project requested this data via New York’s Freedom of Information Law in the summer of 2020. We asked for: “An electronic copy of the database maintained by the Bureau of Labor Relations to document disciplinary cases and their resolutions involving alleged misconduct by DOCCS employees from January 1, 2010 to present.” We later asked for a second batch of data covering the time period up until the spring of 2022. By doing so, we provide an in-depth look into the internal disciplinary processes of the New York prison system, offering insights into the nature and frequency of infractions committed by prison staff, the outcomes of disciplinary proceedings, and patterns that may exist within the system. Such detailed information is typically challenging to access, making this dataset particularly valuable for research and analysis in criminal justice and institutional accountability. The dataset has a size of 1.2 MB.

Netflix Prize Data Set

Creators: Netflix
Publication Date: 2009

This dataset was constructed to support participants in the Netflix Prize. See [Web Link] for details about the prize.

There are over 480,000 customers in the dataset, each identified by a unique integer id.

The title and release year for each movie are also provided. There are over 17,000 movies in the dataset, each identified by a unique integer id.

The dataset contains over 100 million ratings and has a size of 7.7 kB. The ratings were collected between October 1998 and December 2005 and reflect the distribution of all ratings received during this period. Each rating has a customer id, a movie id, the date of the rating, and the value of the rating.
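As a rough illustration of the four fields that make up a rating, the record can be modelled as a small Python data class. The field names, the sample rows, and the integer rating values are assumptions made for this sketch; they do not reflect the layout of the distributed files.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Rating:
    # Field names are hypothetical; the description above only says each
    # rating carries a customer id, a movie id, a date, and a value.
    customer_id: int
    movie_id: int
    rated_on: date
    value: int


# Made-up sample rows, not real data from the Netflix Prize set.
ratings = [
    Rating(1001, 1, date(2004, 5, 1), 4),
    Rating(1002, 1, date(2005, 1, 15), 2),
    Rating(1001, 2, date(2005, 12, 31), 5),
]


def mean_rating_by_movie(rows):
    """Return the average rating value for each movie id."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [sum, count]
    for r in rows:
        totals[r.movie_id][0] += r.value
        totals[r.movie_id][1] += 1
    return {movie: total / count for movie, (total, count) in totals.items()}


averages = mean_rating_by_movie(ratings)
print(averages)
```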

As part of the original Netflix Prize a set of ratings was identified whose rating values were not provided in the original dataset. The object of the Prize was to accurately predict the ratings from this ‘qualifying’ set. These missing ratings are now available in the grand_prize.tar.gz dataset file.

Soccer Power Index (SPI) Ratings

Creators: FiveThirtyEight
Publication Date: 2022

This file contains links to the data behind our Club Soccer Predictions and Global Club Soccer Rankings. These data analyse soccer team performances worldwide, providing valuable insights into team strengths and match outcomes.

The SPI database covers data up to the year 2022, has a total size of 52.6 kB, and includes the following metrics:

  • SPI Rating: An overall measure of a team’s strength, combining offensive and defensive capabilities.

  • Offensive and Defensive Ratings: Separate evaluations of a team’s attacking and defensive proficiencies.

  • Match Probabilities: Predicted probabilities for home win, away win, and draw outcomes, offering insights into expected match results.

  • Projected Scores: Anticipated goal counts for both home and away teams, aiding in match analysis and forecasting.
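The match probabilities listed above lend themselves to a simple consistency check: the three outcome probabilities for a match should sum to one. The sketch below shows one way to represent a forecast and read off the most likely result. The class name, field names, and sample numbers are hypothetical and do not correspond to the actual column names in the SPI files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchForecast:
    # Hypothetical field names chosen for this illustration.
    home_team: str
    away_team: str
    prob_home_win: float
    prob_draw: float
    prob_away_win: float

    def most_likely_outcome(self) -> str:
        """Return the label of the outcome with the highest probability."""
        outcomes = {
            "home win": self.prob_home_win,
            "draw": self.prob_draw,
            "away win": self.prob_away_win,
        }
        return max(outcomes, key=outcomes.get)


def probabilities_are_consistent(forecast: MatchForecast, tol: float = 1e-6) -> bool:
    """Check that the three outcome probabilities sum to 1 within a tolerance."""
    total = forecast.prob_home_win + forecast.prob_draw + forecast.prob_away_win
    return abs(total - 1.0) <= tol


# Made-up example values, not taken from the dataset.
forecast = MatchForecast("Team A", "Team B", 0.5, 0.25, 0.25)
print(forecast.most_likely_outcome(), probabilities_are_consistent(forecast))
```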

2020 U.S. Election Emails

Creators: Mathur, Arunesh; Wang, Angelina; Schwemmer, Carsten; Hamin, Maia; Stewart, Brandon M.; Narayanan, Arvind
Publication Date: 2023

This is a preliminary release of the code and data associated with the research paper “Manipulative tactics are the norm in political emails: Evidence from 300K emails from the 2020 U.S. election cycle”.

The corpus contains emails from over 3,000 political campaigns and organizations in the 2020 election cycle in the U.S. The corpus aims to be comprehensive and includes coverage of emails from the candidates in prominent federal and state races as well as political organizations such as Political Action Committees (PACs) and political parties active in the 2020 cycle. We automated the process of signing up to receive emails from the websites of the political campaigns and organizations. For each entity’s website, if the bot discovered an email sign-up form, it filled it in with the information of a fictional recipient. The entire dataset contains 317,366 emails.

Comparative Constitutions Project Data

Creators: Comparative Constitutions Project
Publication Date: 2022

The dataset includes information on 799 constitutional systems and 2,999 amendments across various countries since 1789. The primary objective of the CCP is to record the characteristics of national constitutions written since 1789. The CCP aims to fill this informational gap by providing systematic data to comparative legal scholars for analysis long before they provide advice to constitution drafters. It is our hope that the analysis of, and insights from, these data will promote peace, justice, and human development through the constitution-making process. The dataset has a size of approximately 435 kB and is organized into several key components:

  • Chronology of Constitutional Events: This component documents each constitutional event (e.g., adoption, amendment, suspension) for recognized independent states since 1789.

  • Constitutional Texts: The CCP has collected the texts of nearly every constitutional system as well as most amendments, providing a repository for textual analysis.

  • Characteristics of National Constitutions: This dataset includes approximately 650 variables coded for each constitution, detailing various aspects such as governmental structure, rights, and amendment processes.

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Creators: Henderson, Peter; Krass, Mark S.; Zheng, Lucia; Guha, Neel; Manning, Christopher D.; Jurafsky, Dan; Ho, Daniel E.
Publication Date: 2022

We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives. The data were collected from 35 distinct sources, including court opinions, contracts, administrative rules, legislative records, and more. The dataset has a size of 256 GB. The temporal coverage of the dataset varies across its subsets, as each source spans a different time range. For instance, U.S. court opinions from CourtListener are synchronized as of December 31, 2022, while the Federal Register includes draft rulemaking documents filed by agencies over an extended period.
