Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Creators:
Henderson, Peter; Krass, Mark S.; Zheng, Lucia; Guha, Neel; Manning, Christopher D.; Jurafsky, Dan; Ho, Daniel E.
Publication Date:
2022
Data Category:
Dataset Description:
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives. The data is drawn from 35 sources, including court opinions, contracts, administrative rules, and legislative records, among others. The dataset is 256GB in size. Temporal coverage varies by subset because each source spans a different time range. For instance, U.S. court opinions from CourtListener are synchronized as of December 31, 2022, while the Federal Register includes draft rulemaking documents filed by agencies over an extended period.
Variables:
Details:
