2025 ACL ACL 2025

Building a Long Text Privacy Policy Corpus with Multi-Class Labels

Abstract

AbstractLegal text poses distinctive challenges for natural language processing. The legal import of a term may depend on omissions, cross-references, or silence, Further, legal text is often susceptible to multiple valid, conflicting interpretations; as the saying goes: a good lawyer’s answer to any question is “it depends.”This work introduces a new, hand-coded dataset for the interpretation of privacy policies. It includes privacy policies from 149 firms, including materials incorporated by reference. The policies are annotated across 64 dimension that reflect the applicable legal rules and contested terms from EU and US privacy regulation and litigation. Our annotation methodology is designed to capture the capture core challenges peculiar to legal language, including indeterminacy, interdependence between clauses, meaningful silence, and the implications of legal defaults. We present a set of baseline results for the dataset using current large language models.

🧭 Keyword Pioneer — privacy policy corpus
🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Natural Language Processing, Speech & Audio
🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing