Well, that is exactly why “I read and accept” is called the biggest lie on the web.
Research has shown that if the average American read all the privacy policies of the websites she visits, she would spend 244 hours a year doing just that [1]. What is more, doing so requires the educational level of a junior college student [2], which gives an idea of the average complexity of those documents.
Of course, if you are on the other side and run a business, creating a privacy policy is no easy feat either. Legal costs for doing so can reach US$3,000 [3], and you might still be unsure whether your real data practices (which can change frequently) correspond to what you actually wrote (something that changes much less often).
One of the goals of the SMOOTH project is to help you reconcile those two worlds. To that end, the module called SMOOTEXT handles the analysis of the textual components (privacy policies, cookie policies, possibly other legal agreements) and extracts from them the legal elements relevant to the GDPR framework.
To do so, we will use techniques from Natural Language Processing, the subfield of science and engineering that studies how to teach computers to understand and generate human language.
However, before doing so we need what is called annotated data. These are examples of the kind of things we want to extract, produced by a process we trust (mostly humans). Those examples are then given to our algorithms, which try to find and generalize the relevant patterns in the text so that they can recognize them whenever new privacy policies arrive. In SMOOTEXT we will address challenges such as how to extend and reuse existing datasets of this kind [4], how to generalize annotations made in one language to the other languages spoken in the EU, and how to incorporate user feedback in order to improve our algorithms over time.
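To make the idea of annotated data more concrete, here is a minimal sketch in Python. It is not the SMOOTEXT implementation: the sentences, the label names (`data_collection`, `third_party_sharing`, `retention`) and the functions `tokenize`, `train` and `predict` are all invented for illustration, and a simple bag-of-words Naive Bayes classifier stands in for whatever models the project ends up using.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical annotated data: (sentence, label) pairs written by hand.
# Both the sentences and the label names are illustrative assumptions.
ANNOTATED = [
    ("We collect your email address and phone number when you register.", "data_collection"),
    ("We collect information about the device you use.", "data_collection"),
    ("We share your personal data with advertising partners.", "third_party_sharing"),
    ("Your information may be shared with third parties for marketing.", "third_party_sharing"),
    ("We keep your data for two years after you close your account.", "retention"),
    ("Records are kept for as long as the law requires.", "retention"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the most probable label (Naive Bayes with add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior: how common the label is among the annotated examples.
        score = math.log(count / total_examples)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(ANNOTATED)
# A sentence that does not appear in the annotated examples.
print(predict(model, "We share data with advertising partners."))
```

The human-made examples are the only source of knowledge here: the program merely counts which words go with which label and reuses those counts on sentences it has never seen. Real privacy policies need far more examples and stronger models, which is why extending and reusing annotated datasets matters.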
For better or for worse, natural language is the de facto communication channel for sharing privacy policies. To avoid the time- and energy-consuming effort mentioned at the beginning, we plan to take advantage of progress in machine learning to teach computers to do that analysis. This will allow SMEs to focus their time and energy on making the right decisions for their business.
Written by: Matthias Gallé
[1] McDonald, Aleecia M., and Lorrie Faith Cranor. “The cost of reading privacy policies.” ISJLP 4 (2008): 543.
[2] Massey, Aaron K., et al. “Automated text mining for requirements analysis of policy documents.” Requirements Engineering Conference (RE), 2013 21st IEEE International. IEEE, 2013.
[3] https://www.freeprivacypolicy.com/blog/privacy-policy-cost/
[4] Gallé, Matthias, Athena Christofi, and Hady Elsahar. “The Case for a GDPR-specific Annotated Dataset of Privacy Policies.” (2019).