Towards Simplified Privacy Policies
Free email management service Unroll.me faced a public backlash in late April 2017 after it emerged, as part of a New York Times profile of Uber C.E.O. Travis Kalanick, that it had sold Lyft rider receipts to the ride-hailing startup so Uber could keep tabs on its competitor. Users asked why the company had betrayed their trust. After all, the service had access to their email inboxes precisely so it could rid them of unwanted newsletters and subscriptions.
In the wake of the revelation, C.E.O. Jojo Hedaya said “it was heartbreaking to see that some of our users were upset to learn about how we monetize our free service.” Perri Chase, the third co-founder of Unroll.me, who left the company after its acquisition by Slice Intelligence, defended the data collection in a Medium post, describing anyone outraged by Unroll.me’s monetization practices as “living under a rock.”
“I encourage you to go read the Terms of Service of every app you opt in to in order to see what rights they have over your data. This is not new. Is it good? Is it bad? Is that the point? You opt in for an awesome free product that clearly states the following and you are offended and surprised? Really?” Chase wrote.
The most surprising aspect of this incident is not that it happened, but that most people don’t bother reading privacy policies and terms-of-service agreements, instead consenting to a service without really knowing what they are signing up for. That is largely because most privacy policies are long, complicated, and mired in obtuse legalese, as if deliberately designed to confuse users.
At a time when people are online more than ever, on everything from computers and smartphones to tablets and smart Internet of Things (IoT) devices, this lack of understanding of privacy policies poses a significant privacy concern. It not only prevents users from making informed decisions, but also leads to serious mismatches in privacy expectations.
To resolve this information asymmetry, various solutions have been proposed. First came the Platform for Privacy Preferences (P3P), a W3C protocol from 2002. It lets a website publish policies (in P3P format) stating the kinds of data it expects to collect from visitors, while users, through their web browsers, specify the data they are willing to share. When a user visits a P3P-enabled site, the policies at both ends are compared; if they do not match, the user is prompted to decide whether to proceed and, in the process, risk giving up more personal information.
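P3P itself defines an XML vocabulary and an HTTP-based discovery mechanism, but the core matching idea can be sketched far more simply. The data categories and preference sets below are invented for illustration and are not P3P's actual vocabulary:

```python
# Simplified sketch of P3P-style policy matching (not the real P3P XML
# vocabulary): the site declares the data categories it collects, the
# user declares the categories they are willing to share, and the
# browser flags any mismatch before the user proceeds.

SITE_POLICY = {"email", "purchase_history", "location"}  # hypothetical site
USER_PREFS = {"email", "location"}                       # hypothetical user

def check_policy(site_policy, user_prefs):
    """Return the data categories the site wants beyond what the user allows."""
    return site_policy - user_prefs

mismatch = check_policy(SITE_POLICY, USER_PREFS)
if mismatch:
    print(f"Warning: site also collects {sorted(mismatch)} -- proceed?")
else:
    print("Policies match; no prompt needed.")
```

With the sets above, the browser would warn that the site also collects purchase history, which the user has not agreed to share.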
However, a lack of proper enforcement meant P3P was never widely adopted. There are also crowdsourced efforts like Terms of Service; Didn’t Read (ToS;DR) and TLDRLegal that aim to break down privacy notices into simple, easy-to-understand formats. While useful, such initiatives risk never being exhaustive or up to date. That said, recent studies have shown that crowdsourcing, if deployed carefully, can be a viable means of annotating website privacy policies.
Alternatively, researchers have studied the effect of length on the overall effectiveness of privacy policies, unsurprisingly finding that short-form privacy notices can inform users better, but also that they can reduce awareness when they are too short and exclude relevant information in the process.
A “Nutrition Label” for Privacy
In 2009, researchers from Carnegie Mellon University came up with a novel idea for simplifying privacy policies on the web: a “nutrition label” for privacy. The paper’s basic premise is to standardize privacy policies in a simplified, nutrition-label-like grid format, presenting a company’s data collection practices as simple yes/no answers.
The two-dimensional privacy nutrition label grid explicitly states the kinds of information a company collects and whether each is “opt in” (information is shared only when the user opts in) or “opt out” (information is shared by default unless the user opts out), in addition to highlighting information that may be shared with other companies. It also uses colored symbols, so that the label’s overall color intensity gives an at-a-glance sense of how much data a policy collects and shares. An example of such a label is shown in Figure 1.
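The label is essentially a small table keyed by data type. As a rough sketch, one could model it as follows; the data types, column names, and sample values here are illustrative and are not the exact CMU design:

```python
# A minimal sketch of a privacy "nutrition label" as a text grid.
# Rows are data types; columns say whether the data is collected,
# whether sharing is opt-in or opt-out, and whether it goes to others.

LABEL = {
    # data type: (collected?, "opt in"/"opt out"/None, shared with others?)
    "contact info":  (True,  "opt out", True),
    "purchase info": (True,  "opt in",  False),
    "location":      (False, None,      False),
}

def render(label):
    """Render the label dictionary as an aligned text grid."""
    rows = [f"{'data type':<15}{'collected':<11}{'choice':<9}shared"]
    for dtype, (collected, choice, shared) in label.items():
        rows.append(f"{dtype:<15}"
                    f"{('yes' if collected else 'no'):<11}"
                    f"{(choice or '-'):<9}"
                    f"{'yes' if shared else 'no'}")
    return "\n".join(rows)

print(render(LABEL))
```

A real label adds color intensity on top of this structure, which is what enables the quick visual comparison the researchers were after.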
The study recruited 24 participants, 16 students and 8 non-students, and used two natural-language privacy policies for the fictional companies ABC Group and Bell General and two label-formatted policies for the companies Acme and Button Co., with subtle differences in data collection.
Privee, a browser extension that automatically analyzes and grades privacy policies, combines two kinds of classifiers. Its rule-based classifiers look for specific regular-expression patterns in the policy text, while its machine learning classifier, a naive Bayes classifier trained on a database of 100 privacy policies that ships with the extension package, searches for specific bigram patterns within a document.
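The two classifier styles can be sketched as follows. The regex patterns, training sentences, and labels below are invented for illustration; they are not Privee's actual rules or training data:

```python
import math
import re
from collections import Counter

# Hypothetical rule patterns for one category (ad tracking); Privee's
# real regular expressions are more elaborate.
AD_TRACKING_RULES = [
    re.compile(r"third[- ]party (cookies|advertis)", re.I),
    re.compile(r"behaviou?ral advertising", re.I),
]

def rule_classify(text):
    """Rule-based classifier: fires if any pattern matches the policy text."""
    return any(p.search(text) for p in AD_TRACKING_RULES)

def bigrams(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return list(zip(tokens, tokens[1:]))

class BigramNaiveBayes:
    """Tiny naive Bayes over word bigrams with Laplace smoothing."""

    def __init__(self):
        self.bigram_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.bigram_counts[label].update(bigrams(text))

    def predict(self, text):
        vocab = set(self.bigram_counts[True]) | set(self.bigram_counts[False])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            total = sum(self.bigram_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for bg in bigrams(text):
                count = self.bigram_counts[label][bg]
                score += math.log((count + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

nb = BigramNaiveBayes()
nb.train("we share data with third party advertisers", True)
nb.train("we never share your personal data", False)
print(nb.predict("third party advertisers may receive data"))  # prints True
```

In Privee the naive Bayes model is trained on the 100 bundled policies; here a two-sentence corpus stands in to keep the sketch self-contained.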
The authors also note that the entire text need not be analyzed. Rather, they assume that if a policy quotes a law, information sharing will occur even if stated otherwise, and that if a policy does not clearly explain how certain information may be collected or used, default rules may be applied to “fill the gaps.” The study also limits itself to classifying text into six relevant categories: data collection, encryption, ad tracking, limited data retention, profiling through linkage analysis, and disclosure of personal information to advertisers.
After running the extension through 300 test classifications, a combination of rule-based and ML classifiers was found to work best for ad tracking and profiling, a rule-based classifier for encryption, and the ML classifier for the collection, limited retention, and ad disclosure categories. The result was a total of 27 misclassifications (11 for the rule-based classifiers, 16 for the ML classifier), with false negatives (18) outweighing false positives (9), and 28% of policies receiving an A grade, 50% a B, and 22% a C.
Most importantly, the study also highlights the inherent difficulties and ambiguities associated with natural language privacy policies by showing that “the ambiguity of text in privacy policies, as measured by semantic diversity, has statistical significance for whether a classification decision is more likely to succeed or fail.”
The Usable Privacy Policy Project, in addition to putting usability front and center, follows a Privee-like approach, annotating a corpus of 115 privacy policies in an attempt to describe and extract nine specific categories of information, such as data retention, data security, first-party collection/use, third-party sharing, and user choice.
The resulting annotation data is then used to build natural language processing models that identify sequences of text likely to pertain to the same data practice (e.g., data retention) across different policies.
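A crude stand-in for such a model is keyword scoring: rate each policy segment against per-category term sets and assign the best match. The keyword sets and category names below are invented for illustration and are far simpler than the project's actual models:

```python
# Illustrative sketch (not the project's actual NLP models): score each
# policy segment against per-category keyword sets and assign the
# best-matching data-practice category.

CATEGORY_KEYWORDS = {  # hypothetical keyword sets
    "data retention":      {"retain", "retention", "delete", "stored"},
    "third party sharing": {"third", "share", "partners", "advertisers"},
    "data security":       {"encrypt", "security", "protect", "ssl"},
}

def categorize(segment):
    """Assign the category whose keywords overlap the segment most."""
    words = set(segment.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(categorize("We retain your data for 30 days before we delete it"))
# prints: data retention
```

The trained models in the project learn these associations from the annotated corpus rather than from hand-picked keywords, but the output shape is similar: a category label per span of policy text.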
As the nutrition-label study notes, the format “allows for information to be found in the same place every time. It removes wiggle room and complicated terminology by using four standard symbols that can be compared easily. It allows for quick high-level visual feedback by looking at the overall intensity of the page, can be printed, can fit in a browser window, and has a glossary of useful terms attached.”
As the Unroll.me case shows, companies can take the view that users should read privacy policies before signing up, and that failing to do so signals a lack of concern for privacy on the user’s part. Whether or not that is fair, it is the responsibility of a website (or service) owner to do all it can to make its data practices readable and transparent, thereby reducing the time it takes to read them, or risk regulatory action for failing to comply.
Promising solutions on this front range from privacy nutrition labels, to grading policies automatically (Privee), to visualizing them as a map spanning multiple color-coded privacy-practice categories (Usable Privacy Policy Project). Whatever the approach, a deeper understanding of exactly which information matters most to users will help drive the conversation toward better, more effective privacy policies.