Title: Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Abstract: Organisations disclose their privacy practices by posting privacy policies on their websites. Although users often care about their digital privacy, they often do not read privacy policies because doing so requires a significant investment of time and effort. Natural language processing can help with privacy policy understanding, but there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. We therefore create PrivaSeer, a corpus of over one million English-language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline that crawls the web and then filters documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus, report results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
Subjects: Information Retrieval (cs.IR); Cryptography and Security (cs.CR)
Journal reference: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021
DOI: 10.18653/v1/2021.acl-long.532
Cite as: arXiv:2004.11131 [cs.IR]
  (or arXiv:2004.11131v2 [cs.IR] for this version)
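
The abstract names duplicate and near-duplicate removal as one filtering stage of the pipeline but does not say how it is implemented. The following is a minimal sketch of one common approach, assuming exact duplicates are caught by hashing normalised text and near-duplicates by Jaccard similarity over word shingles. The function names, the shingle size `k`, and the `threshold` value are illustrative assumptions, not details taken from the paper.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8, k=3):
    """Keep the first document of each group of exact or near duplicates.

    threshold and k are hypothetical settings chosen for illustration.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        # Exact-duplicate check on whitespace- and case-normalised text.
        normalised = " ".join(re.findall(r"\w+", doc.lower()))
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near-duplicate check against every document kept so far.
        doc_shingles = shingles(doc, k)
        if any(jaccard(doc_shingles, s) >= threshold for s in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

This sketch compares each document against all previously kept documents, so its cost grows quadratically with the number of documents. At the scale of a million policies, an approximate scheme such as MinHash with locality-sensitive hashing would be a more practical choice.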

Submission history

From: Mukund Srinath
[v1] Thu, 23 Apr 2020 13:21:00 GMT (925kb,D)
[v2] Sat, 30 Mar 2024 12:21:59 GMT (5477kb,D)
