Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
Variables:
Name | Description |
---|---|
text | the document text |
created_timestamp | If the original source provided a creation timestamp for the document, it is included here. Note that these timestamps may be inaccurate: for example, CourtListener case opinions carry the time the opinion was uploaded to CourtListener, not the time the opinion was published. |
downloaded_timestamp | When the document was scraped |
url | the source url |
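
As a rough illustration of how the four variables above might be consumed, the following sketch parses records into a small typed structure. It assumes, purely for illustration, that each record is a JSON object on its own line and that timestamps are stored as plain strings; the `Record` class, the `parse_records` helper, and the sample values are hypothetical and not part of the dataset's tooling. Because `created_timestamp` is only present when the original source supplied one, it is treated as optional.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Record:
    """One document, using the variable names from the table above."""
    text: str
    url: str
    downloaded_timestamp: str
    # Only present when the original source provided it, and possibly inaccurate.
    created_timestamp: Optional[str] = None


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield a Record per non-empty line (assumed format: one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        yield Record(
            text=raw["text"],
            url=raw["url"],
            downloaded_timestamp=raw["downloaded_timestamp"],
            # Normalize a missing or empty value to None.
            created_timestamp=raw.get("created_timestamp") or None,
        )


# Hypothetical sample lines; the values are placeholders, not real dataset content.
sample_lines = [
    '{"text": "Example opinion text.", "url": "https://example.org/doc/1", '
    '"downloaded_timestamp": "2022-01-01", "created_timestamp": "2021-12-31"}',
    '{"text": "Example regulation text.", "url": "https://example.org/doc/2", '
    '"downloaded_timestamp": "2022-01-02"}',
]

for record in parse_records(sample_lines):
    print(record.url, record.created_timestamp)
```

Keeping `created_timestamp` optional makes the caveat in the table explicit in code: downstream filtering by date should handle both missing values and values that reflect upload time instead of publication time.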
Publication Date:
2022
Creators:
Henderson, Peter; Krass, Mark S.; Zheng, Lucia; Guha, Neel; Manning, Christopher D.; Jurafsky, Dan; Ho, Daniel E.
Publisher:
arXiv
License:
Creative Commons Attribution Share Alike 4.0 International