Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

We curate a large corpus of legal and administrative data. The corpus serves two purposes: (1) it aggregates legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) it provides a dataset that can be used for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

Variables:
Name Description
text The document text.
created_timestamp The creation timestamp, included when the original source provided one. These timestamps may be inaccurate: for example, CourtListener case opinions carry the time the opinion was uploaded to CourtListener, not the time it was published.
downloaded_timestamp When the document was scraped.
url The source URL.
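
The four variables above can be pictured as one record per document. The following is a minimal sketch, not the dataset's official loader: it assumes the records are stored one JSON object per line, and the `LegalDocument` class and `parse_records` function are hypothetical names introduced for illustration. Timestamps are kept as plain strings because the record does not specify their format.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class LegalDocument:
    """One record with the four documented variables (hypothetical class)."""
    text: str
    url: str
    downloaded_timestamp: str
    # Only present when the source provided it, and possibly inaccurate.
    created_timestamp: Optional[str] = None


def parse_records(lines: Iterable[str]) -> Iterator[LegalDocument]:
    """Parse JSON Lines input (an assumed storage format) into records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw = json.loads(line)
        yield LegalDocument(
            text=raw["text"],
            url=raw["url"],
            downloaded_timestamp=raw["downloaded_timestamp"],
            # Treat a missing or empty creation timestamp as unknown.
            created_timestamp=raw.get("created_timestamp") or None,
        )
```

Because `created_timestamp` may reflect an upload time rather than a publication time, downstream code should treat it as a hint, not as an authoritative date.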
Publication Date:
2022
Creators:
Henderson, Peter; Krass, Mark S.; Zheng, Lucia; Guha, Neel; Manning, Christopher D.; Jurafsky, Dan; Ho, Daniel E.
Publisher:
arXiv
License:
Creative Commons Attribution Share Alike 4.0 International
