Towards Simplified Privacy Policies

Ravie Lakshmanan
13 min read · May 8, 2017


Dropbox registration page — Did you actually read the terms?

Introduction

Free email add-in service Unroll.me faced a public backlash in late April 2017, after a New York Times profile of Uber C.E.O. Travis Kalanick revealed that it had been selling Lyft rider receipts to the ride-hailing startup, which used them to keep tabs on its competitor. Users questioned why the company had betrayed their trust. After all, the service had access to their email inboxes in order to rid them of unwanted newsletters and subscriptions.

In the wake of the revelation, C.E.O. Jojo Hedaya said “it was heartbreaking to see that some of our users were upset to learn about how we monetize our free service.” Perri Chase, the third co-founder of Unroll.me, who left the company after its acquisition by Slice Intelligence, defended the data collection in a Medium post, dismissing anyone outraged by Unroll.me’s monetization practices as “living under a rock.”

“I encourage you to go read the Terms of Service of every app you opt in to in order to see what rights they have over your data. This is not new. Is it good? Is it bad? Is that the point? You optin for an awesome free product that clearly states the following and you are offended and surprised? Really?” Chase continued.

The most surprising aspect is not that this incident happened, but that most people don’t bother reading privacy policies and terms of service agreements, often consenting and signing up for a service without really knowing what they are agreeing to. That is largely because most privacy policies are long, complicated and mired in obtuse legalese, as if deliberately designed to confuse users.

In fact, a study undertaken by Aleecia M. McDonald and Lorrie Faith Cranor in 2008 estimated that it would take anywhere from 181 to 304 hours per year for a person to read the privacy policies of all the web services they use (at an estimated cost of $5,038 per year in lost productivity). And according to a 2014 Pew Research study, half of Americans don’t even know what a privacy policy is. “Some 52% of internet users believe — incorrectly — that this statement is true, and that privacy policies actually ensure the confidentiality of their personal information,” said the study.
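The scale of those numbers is easy to sanity-check with a quick back-of-the-envelope calculation. Below is a minimal sketch; every input is an illustrative assumption rather than a figure taken from the study:

```python
# Back-of-the-envelope estimate of the yearly cost of reading privacy
# policies, in the spirit of McDonald and Cranor's 2008 study.
# All inputs are illustrative assumptions, not the study's own figures.

AVG_POLICY_WORDS = 2_500   # average policy length (cited later in this article)
READING_WPM = 250          # a typical adult reading speed
SITES_PER_YEAR = 1_400     # hypothetical number of unique sites visited yearly
HOURLY_RATE = 17.0         # hypothetical dollar value of an hour of user time

minutes_per_policy = AVG_POLICY_WORDS / READING_WPM
hours_per_year = SITES_PER_YEAR * minutes_per_policy / 60
cost_per_year = hours_per_year * HOURLY_RATE

print(f"{minutes_per_policy:.0f} minutes per policy")           # -> 10 minutes
print(f"{hours_per_year:.0f} hours per year")                   # -> 233 hours
print(f"${cost_per_year:,.0f} per year in lost productivity")   # -> $3,967
```

With these assumptions the result lands squarely inside the study’s 181 to 304 hour range, which is exactly the point: the cost of informed consent is measured in work-weeks, not minutes.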

At a time when people are online more than ever before, on everything from computers to smartphones to tablets to internet of things (IoT) devices, this lack of understanding of privacy policies poses a significant concern. It not only prevents users from making informed decisions, but also leads to serious mismatches in privacy expectations.

What is a Privacy Policy?

A privacy policy, in simple terms, is a statement or legal document that puts a company’s data collection and usage practices in black and white. It declares the company’s stance on how it “collects, stores, and releases personal information it collects”. In other words, it tells users exactly which pieces of information the company collects from them, and whether that information is kept confidential or shared with third parties.

The exact nature of privacy policies differs from country to country, depending on the laws companies must abide by. In the European Union, this is broadly enforced by the Data Protection Directive (to be replaced by the General Data Protection Regulation effective May 25, 2018), which regulates the processing of personal information in the union. Australia has the Privacy Act 1988, which mandates that government and private-sector organizations ensure an “open and transparent management of personal information including having a privacy policy.” India similarly requires any company that “collects, receives, possess, stores, deals or handles” sensitive personal data to provide a privacy policy.

In the United States, however, there is no comprehensive law regulating online privacy. The Federal Trade Commission instead encourages companies to disclose information about their data collection practices, on the premise that doing so allows users to make enlightened decisions about which companies to trust with their information. The FTC can also brand a practice as deceptive if a company fails to obtain explicit opt-in consent or uses personal information in a manner inconsistent with the privacy policy under which the information was collected.

But the proliferation of long and hard-to-understand privacy policies means most users don’t even bother to read them in the first place. How many times have we clicked “I agree to the terms and privacy policy” without batting an eye while signing up for a new service? And even assuming a privacy policy is transparent, how is the average user to know whether the service behaves exactly as it says it does? The practice, therefore, is not just ineffective, but counterproductive to the very problem it is meant to solve.

Towards Simplified Privacy Policies

To resolve this information asymmetry, various solutions have been proposed. First came the Platform for Privacy Preferences (P3P) in 2002. Implemented as a protocol, it allows websites to publish policies (in P3P format) stating the different kinds of data they expect to collect from their visitors. Users, through their web browsers, can likewise specify the data they are willing to share with websites. When a user visits a P3P-enabled site, the policies at both ends are compared; if they do not match, the user is asked whether they are willing to proceed and, in the process, risk giving up more personal information.
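Because compact P3P policies were delivered as an ordinary HTTP response header, it is easy to probe whether a site still advertises one. A minimal sketch, assuming the requests library; the URL is a placeholder, and most modern sites no longer send this header:

```python
# Check whether a site advertises a P3P compact policy, which (where
# supported) is sent as a "P3P" response header that browsers could
# match against the user's stated preferences.
import requests

resp = requests.get("https://example.com", timeout=10)
p3p = resp.headers.get("P3P")

if p3p:
    # A compact policy looks like: CP="NOI DSP COR ..." where each
    # three-letter token summarizes one declared data practice.
    print("P3P compact policy:", p3p)
else:
    print("No P3P header; the site declares no machine-readable policy.")
```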

However, the lack of proper enforcement meant P3P was never widely adopted. There are also crowdsourcing efforts like Terms of Service; Didn’t Read (ToS;DR) and TLDRLegal that aim to break down privacy notices into simple, easy-to-understand formats. While they can be useful, there is always a risk that such initiatives will never be exhaustive or up to date. That said, recent studies have shown that crowdsourcing, if deployed carefully, can be a viable means of annotating website privacy policies.

By building an annotation tool that asked participants specific questions about data collection practices (e.g., “Does the policy state that the website might collect contact information about its users?”) and allowed them to select the passages in the policy that helped them arrive at their answers, the study found that an 80% crowdworker (or inter-annotator) agreement threshold for each question can produce “meaningful privacy policy annotations.”
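The threshold itself is straightforward to apply. A minimal sketch of that acceptance rule, with invented worker answers:

```python
# Keep a crowdsourced annotation only when at least 80% of workers gave
# the same answer to a question about a policy. The answers are invented.
from collections import Counter

AGREEMENT_THRESHOLD = 0.8

# Five workers answering: "Does the policy state that the website might
# collect contact information about its users?"
answers = ["yes", "yes", "yes", "yes", "no"]

majority_answer, majority_count = Counter(answers).most_common(1)[0]
agreement = majority_count / len(answers)

if agreement >= AGREEMENT_THRESHOLD:
    print(f"accepted: {majority_answer!r} (agreement {agreement:.0%})")
else:
    print(f"rejected: agreement {agreement:.0%} is below the threshold")
```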

Alternatively, researchers have studied the implications of length on the overall effectiveness of privacy policies, unsurprisingly finding that short-form privacy notices can inform users better, but can also reduce awareness when they are too short and omit relevant information in the process.

Studies have also been undertaken to automatically analyze Android mobile apps to check whether they ought to have a privacy policy and, in addition, to perform static analysis on the apps themselves, extracting the permissions they request and constructing an invocation graph of calls into third-party ad libraries that share data. The analysis results are then compared against the app’s privacy policy to identify potential inconsistencies. This way, the system not only makes data collection practices more transparent, but also serves as a useful screening tool for mismatches in data collection. A sketch of the first step appears below.
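Here is what that first step might look like: a minimal sketch that extracts requested permissions from a decoded AndroidManifest.xml. Real APKs store the manifest in a binary format, so this assumes it has already been decoded with a tool such as apktool, and the “sensitive” permission set is an illustrative choice:

```python
# Extract the permissions an Android app requests from its (decoded)
# AndroidManifest.xml and flag ones that imply personal data collection.
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def requested_permissions(manifest_path):
    """Return the android:name of every <uses-permission> element."""
    tree = ET.parse(manifest_path)
    return [elem.get(ANDROID_NS + "name")
            for elem in tree.getroot().iter("uses-permission")]

# Permissions that typically imply the app collects personal data and
# therefore ought to have a privacy policy (an illustrative subset).
SENSITIVE = {
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
}

perms = requested_permissions("AndroidManifest.xml")
print("requests sensitive data:", bool(SENSITIVE & set(perms)))
```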

A “Nutrition Label” for Privacy

In 2009, researchers from Carnegie Mellon University came up with a novel idea for simplifying privacy policies on the web: a “nutrition label” for privacy. The paper’s basic premise is to standardize privacy policies in a simplified, nutrition label-like grid format, presenting a company’s data collection practices with simple yes/no answers.

The two-dimensional privacy nutrition label grid explicitly states the kinds of information a company collects, and whether each is “opt in” (information is shared only when the user opts in) or “opt out” (information is shared by default unless the user opts out), in addition to highlighting the information that may be shared with other companies. It also uses colored symbols, so that a reader can gauge how much a company collects at a glance from the overall color intensity of the label. An example of such a label is shown in Figure 1.
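To make the grid idea concrete, here is a toy rendering in text form. The data types, uses, and practices below are simplified placeholders, not the paper’s exact categories or symbols:

```python
# A toy text rendering of a privacy "nutrition label": rows are data
# types, columns are uses, and each cell holds a simplified symbol.
ROWS = ["contact info", "cookies", "purchases", "location"]
COLS = ["marketing", "profiling", "sharing"]

# "in" = shared only on opt-in, "OUT" = shared unless the user opts out,
# "-" = not used for this purpose. All entries are invented.
practices = {
    ("contact info", "marketing"): "OUT",
    ("cookies", "profiling"): "OUT",
    ("purchases", "sharing"): "in",
}

print(" " * 14 + "".join(f"{col:>12}" for col in COLS))
for row in ROWS:
    cells = (practices.get((row, col), "-") for col in COLS)
    print(f"{row:<14}" + "".join(f"{cell:>12}" for cell in cells))
```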

The study recruited 24 participants, 16 students and 8 non-students, and made use of two natural language privacy policies for fictitious companies ABC Group and Bell General, and two label-formatted policies for companies Acme and Button Co., with subtle differences in data collection.

Figure 1: A privacy nutrition label for Bell General

The participants were then given four different tasks aimed at testing information finding, their understanding of the policies, and overall ease of use. The results showed that people answered more questions correctly with the label-formatted privacy policy. Participants were also much faster at answering questions and comparing policies with the label format than with natural language policies. In terms of satisfaction, the study found that users were more comfortable viewing the grid if they were given the natural language policy first, and were likewise more willing to read policies in label format after having seen the natural language version.

The idea of a label-based approach to simplifying privacy policies is not new. In a July 2001 report, former FTC commissioner Sheila F. Anthony said “consumers are not much better off today with incomprehensible privacy policies than they were five or six years ago when there were no privacy policies,” adding that “a standardized format for privacy policies, much like the food label required by the Nutritional Labeling and Education Act or the EnergyGuides required by The Energy Policy and Conservation Act of 1975 (“EPCA”), would allow consumers to quickly assess whether a particular site’s privacy policy satisfies their privacy goals.”

Privee

Citing the difficulties with navigating privacy policies, authors Sebastian Zimmeck and Steven M. Bellovin proposed a system for automatically analyzing website privacy policies in 2014. Called Privee, the system combines machine learning classification techniques with crowdsourced privacy policy repositories like ToS;DR to return policy analysis results to the user.

At a high level, Privee works as a Chrome browser extension. When a user requests a privacy policy analysis for a website by clicking the extension icon in the toolbar, Privee first checks whether results are available in a crowdsourcing repository. If so, they are returned and shown to the user. If no results are available, Privee scrapes the text of the privacy policy from the website, runs it through machine learning classifiers, and then returns the results in the form of grades A, B or C (from good to worse, in that order). The flow is depicted in Figure 2.
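In pseudocode, the decision flow is simple. The sketch below is a hypothetical stand-in for the extension’s real components; the repository contents, the classify heuristic, and all names are invented for illustration:

```python
# A high-level sketch of the Privee flow: consult a crowdsourced
# repository first, and fall back to automated classification on a miss.

# Stand-in for a crowdsourced repository of known results (e.g. ToS;DR).
CROWD_REPO = {"https://example.com/privacy": "B"}

def classify(policy_text):
    """Toy stand-in for Privee's classifiers: grade a policy from good
    ("A") to worse ("C") based on how many bad signs it contains."""
    bad_signs = ["third parties", "advertisers", "tracking"]
    hits = sum(sign in policy_text.lower() for sign in bad_signs)
    return "ABC"[min(hits, 2)]

def analyze_policy(url, fetch_policy_text):
    grade = CROWD_REPO.get(url)          # crowd results win when available
    if grade is None:
        grade = classify(fetch_policy_text(url))  # otherwise scrape + classify
    return grade

# Usage with a fake fetcher: the first call hits the repository, the
# second falls through to classification.
print(analyze_policy("https://example.com/privacy", lambda url: ""))
print(analyze_policy("https://other.example", lambda url:
                     "We share data with advertisers for tracking."))
```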

Privee uses two kinds of classifiers. Rule classifiers look for specific regular expression patterns in the policy text. The extension package also ships with a database of 100 training privacy policies that is used to train a machine learning classifier, a naive Bayes model that searches for specific bigram patterns within a document.
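The bigram naive Bayes idea is easy to sketch with off-the-shelf tooling. The tiny training set below is invented for illustration (Privee ships 100 real training policies), and the single binary label is a simplification of its six categories:

```python
# A minimal naive Bayes classifier over bigram features, the technique
# behind Privee's ML classifier. Training data here is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "we may share your personal information with third party advertisers",
    "we never disclose personal information to third parties",
    "your data may be sold to advertising partners",
    "information is used internally and is not shared",
]
train_labels = [1, 0, 1, 0]   # 1 = discloses data to advertisers, 0 = does not

# ngram_range=(2, 2) extracts bigram features from the policy text.
model = make_pipeline(CountVectorizer(ngram_range=(2, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["we share personal information with advertising partners"]))
```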

Figure 2: Privee program flow

The authors also note that the entire text need not be analyzed. Rather, the system assumes that if a policy quotes a law, information sharing will occur even if stated otherwise, and that where policies do not clearly explain how certain information may be collected or used, default rules can be applied to “fill the gaps.” The study also limits itself to classifying text along six relevant categories: data collection, encryption, ad tracking, limited data retention, profiling through linkage analysis, and disclosure of personal information to advertisers.

Across 300 test classifications, a combination of rule and ML classifiers proved best for ad tracking and profiling, a rule classifier for encryption, and the ML classifier for the collection, limited retention, and ad disclosure categories. The result was a total of 27 misclassifications (11 for the rule classifiers, 16 for the ML classifier), with false negatives (18) outweighing false positives (9), and with 28% of policies receiving an A, 50% a B, and 22% a C grade.

Most importantly, the study also highlights the inherent difficulties and ambiguities associated with natural language privacy policies by showing that “the ambiguity of text in privacy policies, as measured by semantic diversity, has statistical significance for whether a classification decision is more likely to succeed or fail.”

Usable Privacy Policy Project

The Usable Privacy Policy Project brings together a pool of researchers from Carnegie Mellon University, the Center on Law and Information Policy at Fordham University, and the Center for Internet and Society at Stanford University. Started in late 2013, it leverages crowdsourced privacy repositories in conjunction with natural language processing to extract key features from websites’ privacy policies and gain a semi-automated understanding of each policy.

In addition to putting usability front and center, the project follows a Privee-like approach, annotating a corpus of 115 privacy policies to describe and extract nine specific categories of information, such as data retention, data security, first-party collection/use, third-party sharing, and user choice.

The resulting annotation data is then used to train natural language processing models that identify sequences of text likely to pertain to the same data practice (e.g. data retention) across different policies.

Figure 3: The Usable Privacy Policy Project
Figure 4: Privacy Practices of Amazon.com

“We estimate an HMM-like model on our corpus, exploiting similarity across privacy policies to the extent it is evident in the data. In our formulation, each hidden state corresponds to an issue or topic, characterized by a distribution over words and bigrams appearing in privacy policy sections addressing that issue. The transition distribution captures tendencies of privacy policy authors to organize these sections in similar orders, though with some variation,” said the authors involved in the study.
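In the supervised setting, estimating such a model amounts to counting how topics follow one another across policies and which words each topic emits. Here is a minimal sketch of that estimation step over an invented two-policy corpus; the project’s real model is far richer and trained on much more data:

```python
# Supervised estimation of an HMM-like model: hidden states are policy
# topics; we count topic-to-topic transitions (section ordering) and
# per-topic word emissions. The tiny corpus below is invented.
from collections import Counter, defaultdict

# Each policy is a list of (topic, section_text) pairs.
corpus = [
    [("collection", "we collect your email address and name"),
     ("sharing", "we share data with our partners"),
     ("retention", "data is retained for one year")],
    [("collection", "we collect usage data"),
     ("retention", "we retain logs for ninety days")],
]

transitions = Counter()
emissions = defaultdict(Counter)
for policy in corpus:
    topics = [topic for topic, _ in policy]
    transitions.update(zip(topics, topics[1:]))   # ordering tendencies
    for topic, text in policy:
        emissions[topic].update(text.split())     # per-topic word counts

print(transitions)                            # how authors order sections
print(emissions["retention"].most_common(3))  # words typical of retention
```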

The NLP results are subsequently used to improve annotation interfaces and to generate simplified privacy policy models that break a policy down into its constituent practices, which are then shared with the user. The whole approach to semi-automated extraction is visualized in Figure 3, and a sample data practice breakdown for Amazon.com in Figure 4.

Comparison

Taking the three implementations together, it’s easy to see where each of them succeeds. The privacy nutrition label approach stands out for its simplicity, but also runs the risk of being too simplistic, leaving out information users need to make sound privacy decisions. Even so, having a label seems far preferable to users not reading the privacy policy at all.

As the study mentions, it “allows for information to be found in the same place every time. It removes wiggle room and complicated terminology by using four standard symbols that can be compared easily. It allows for quick high-level visual feedback by looking at the overall intensity of the page, can be printed, can fit in a browser window, and has a glossary of useful terms attached.”

Privee and the Usable Privacy Policy Project, on the other hand, extend a crowdsourced service like ToS;DR with automated machine learning and NLP capabilities to parse a website’s privacy policy. While grades and color coding certainly make the results easier for an average user to understand, their accuracy depends heavily on the accuracy of the classification (and the parameters chosen), which in turn relies on a crowdsourced repository that may not be up to date (or, at worst, may be inaccurate) and on the privacy policies themselves, which are often ambiguous and confusing.

All of this brings us back to the question of how long a privacy policy should be. The average privacy policy runs to roughly 2,500 words, which suggests that shorter policies could well produce greater user awareness. Could a privacy policy therefore be codified to follow a document template, so that it remains consistent across the board? If so, what should its word limit be, and would there be any need for automated analysis of policies at all?

As users begin to interact with services through more and more channels, several questions remain. How can a privacy policy really assure a user that the company is doing only what it says, and is not collecting personal information left undisclosed in the policy? And as users move from desktops to mobile apps and IoT devices, should those apps and devices be periodically analyzed for potential privacy policy mismatches?

Conclusion

Privacy policies are a ubiquitous feature of almost all websites and online services today, but the difficult language and sheer length of most policies more often than not discourage users from reading them. As a result, users rely on pre-formed expectations when interacting with websites, causing mismatches and exposing themselves to unintended privacy risks even when the practices are disclosed in the privacy policy.

If it is any consolation, the consolidation of businesses means multiple websites may end up sharing a single privacy policy even as new companies mushroom throughout the world. And while the combined policy that results from a merger may be longer than usual, studies have shown that such consolidation actually helps reduce the overall cost of reading privacy policies.

As evidenced in the Unroll.me case, companies can take the view that users should read their privacy policies before signing up, and that failing to do so signals a lack of concern for privacy on the part of the user. Whether or not that is fair, it is the responsibility of a website (or service) owner to do all it can to make its data practices readable and transparent, thereby reducing the time it takes to understand them, or else face regulatory action for failing to comply.

A couple of promising solutions on this front range from privacy nutrition labels, to grading policies (Privee), to visualizing them as a map spanning multiple color-coded privacy practice categories (Usable Privacy Policy Project). Whatever the approach, gaining a deeper understanding of exactly what kind of information matters most to users will help drive the conversation towards better, more effective privacy policies.


Written by Ravie Lakshmanan

Computational journalist and cybersecurity reporter
