October 23, 2014

Deep Data Privacy Risk

Cheap storage technology isn’t just changing what information we store; it’s also changing how we store familiar kinds of information. Both developments have implications for privacy.

The price of data storage, computer processing, and computer networking has plummeted. One consequence is Big Data: even unfunded start-up companies can afford to store and process categories of information, like locations of mobile devices, that translate into previously unmanageable heaps of digital data. But cheap data resources also make it possible to store and process familiar kinds of information in new ways.

Whether by accident or by design, these approaches sometimes remember how users interact with and change information over time, leading at once to new insights and peculiar, often occluded privacy risks. With much regret for glossing over great technical richness, but with much hope that a new abstraction might bring needed privacy and other non-technical expertise into the fold, I call this latter consequence of cheap data processing power “Deep Data.”

What is so different about Deep Data that a new, woefully abstract buzzword is needed to talk about it?

When companies move on a new category of Big Data, the need for a new privacy assessment is usually clear. Are we comfortable sharing the locations of our cell phones? Are we alright giving our bookseller a list of everything we’ve browsed and bought? Will we give the power company detailed information on when, how, and how much electricity we use? These aren’t easy questions, but the need to ask is often clear. Companies, for their part, are incentivized to announce new features with all the fanfare they can muster, shining their own light on such changes.

Conversely, Deep Data changes are often technically subtle and practically invisible. It may not be clear at all that a new privacy trade-off is on offer.

When technically motivated, Deep Data shifts rarely occasion mass-market announcements. (The technical team may or may not write a blog post.) The operational motivation is often performance, reliability, or contending with an expanding user base. The user experience goal is as often invisibility. Users shouldn’t notice anything, or, rather, they should stop noticing performance or availability problems.

Conversely, when the motivation is new data analysis capability, consumers can count themselves lucky when a privacy policy is updated, and luckier still when some anorak understands and popularizes understanding of the change. The more likely outcome when no user-visible change in functionality accompanies the shift is a tweak to a policy page in the backwaters of the website. The change occurs in stealth.

Deep Data techniques are progress, and not of their nature threatening to privacy. But to the extent they happen to take forms, technically and socially, that slip through cracks in the abstract vocabulary we use to discuss privacy and data science across disciplines, they pose peculiar privacy risk. “Deep Data” is worth the coinage as a tool to raise awareness.

Just enough databases

All of this generalizing calls for a more concrete view of the issue. There is technical detail here worth grasping at a high level, but we can use an intuitive example to stay grounded: items added to an online shopping cart. It is very common for websites to store this information on their servers, so that users can pick up where they left off from a past shopping session. A broad, unwritten consensus deems this practice acceptable.

A “traditional” approach to shopping carts handles information much as one might with a Microsoft Excel spreadsheet. The company sets up a large server computer as its shopping cart database. That database contains a few very large data tables, analogous to worksheets in a spreadsheet. One table has a row for each user account, with columns for user name, email address, and account number. Another table has a row for each item added to a shopping cart, with columns identifying who added the item, the item’s name or stock number, quantity, and price.

When a user adds an item to their shopping cart, a message is sent to the database to insert a new row in its shopping cart table. If the user later adds one more of that item, the database finds the previously created row and changes its quantity column value from one to two. If the user removes both items from the cart (or completes the purchase), the database finds and deletes the shopping cart table rows for that account.
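For the technically inclined, the insert, update, and delete steps just described can be sketched in a few lines. This is only an illustration, assuming Python's built-in sqlite3 module and an in-memory database; the table name, column names, and helper functions are hypothetical, not drawn from any real site.

```python
import sqlite3

# Hypothetical schema: one row per (account, item) pair in a cart.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_items ("
    " account_id INTEGER, item TEXT, quantity INTEGER, price_cents INTEGER,"
    " PRIMARY KEY (account_id, item))"
)

def add_item(account_id, item, price_cents):
    """Bump the quantity of an existing row in place, or insert a new row."""
    updated = conn.execute(
        "UPDATE cart_items SET quantity = quantity + 1"
        " WHERE account_id = ? AND item = ?",
        (account_id, item),
    )
    if updated.rowcount == 0:  # no existing row was found, so create one
        conn.execute(
            "INSERT INTO cart_items VALUES (?, ?, 1, ?)",
            (account_id, item, price_cents),
        )

def empty_cart(account_id):
    """Delete every cart row for the account, as on removal or purchase."""
    conn.execute("DELETE FROM cart_items WHERE account_id = ?", (account_id,))

def view_cart(account_id):
    """Return the current (item, quantity) rows for the account."""
    return conn.execute(
        "SELECT item, quantity FROM cart_items"
        " WHERE account_id = ? ORDER BY item",
        (account_id,),
    ).fetchall()
```

Note what is missing after empty_cart runs: the table holds no trace that the items were ever in the cart.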

This approach affords several benefits. The database can always find out exactly what’s in a user’s shopping cart by looking up the rows in the shopping cart table for that user. Because the table is updated with each change to the cart, the database doesn’t waste resources storing information on items that have been removed or purchased. Since shopping cart data is stored in one place, on the sole shopping cart database server, software powering the site always knows where to get definitive information on a shopping cart. This is the way website databases have largely been done for two decades.

Cheap data resources lessen the benefits of this approach and overcome many of its flaws. While one centralized shopping cart database makes clear where programs should request information, that centralization creates a single point of failure. If the shopping cart database server goes down (or its backup fails), the shopping cart breaks site-wide. Moreover, as more users sign up, the database server will have to be repeatedly shut down for upgrade or replacement. Eventually the upgrade-or-replace track will dead-end; much higher performance is possible with a network of less powerful, interconnected computers, each holding a part of all the shopping cart data to be stored. Computers can be added or removed from such a network to match need, at low cost. The benefit of deleting and updating information in the table as it’s changed, once vital to keeping storage costs low, is now irrelevant. Storage space is that much cheaper. A decentralized, rather than centralized, approach now makes sense more often.

An analogy

The trade-offs inherent in centralized versus decentralized databases are much like the trade-offs between centralized and decentralized record keeping systems.

Consider the files that states keep on corporations and other businesses formed under their laws. Each of the United States, large and small, has a Secretary of State or other official whose office keeps and certifies legal records on such businesses. In almost every case, there is one central office in the state capital, and all requests for information and filings are sent there. It is more important, legally speaking, for information coming out of this office to be reliable and correct than instantly available. For this reason, the offices review filings as they are submitted, rejecting those that are flawed without cluttering the public record. Despite the workload, these offices work passably well, given that the number of businesses registered in a state is manageable and requests and filings don’t have to be made that often.

Real estate records are handled differently, especially in large states. Often each county or other subdivision has a local records office that handles documents related to land in its vicinity. New documents, such as those recording title after the sale of a home, as well as requests for information, go to the local records office. Local offices may not be as well organized, as large, or as efficiently run as the big central business records office in the capital, but the difference is offset by the convenience of keeping records close to the property. In addition, the role of such offices is often more clerical, adding documents to the record (“recording” and indexing them) as they come in, with important conclusions about correctness and meaning to be made by lawyers only later, when needed. If the records office in a small town has to close for a repair, the filings for a new skyscraper in the major city 300 miles away won’t be held up.

Decentralization has its discontents. If the state decides to condemn property state-wide for a new border-spanning freeway, dealing with each relevant records office will be expensive and time-consuming. Because records are scattered about various offices, those records may not always be consistent, say about properties near the borders between counties. What’s more, putting off the process of review and analysis postpones many disputes that might have been more easily resolved early on. Resolution mechanisms for these conflicts, such as local courts of law, can handle these issues, but each brings its own practices and procedures into the mix. This adds up to far more complexity in depth. If it is clear that additional complexity and inefficiency will affect only relatively rare cases, the benefits of decentralization may outweigh the costs.

Deep Data storage

With a bit of the basics and an analogy under our belts, we’re ready to take on a Deep Data trend with serious privacy implications.

Consider the software developer who wants to move shopping cart data from a centralized database to a decentralized network of server computers. Each of those computers should be able to respond to a request for any user’s shopping cart information. Changes to shopping carts should likewise be accepted by any computer in the network, and end up in the right place. All of the computers should be able to take in a massive amount of requests and changes at a time, and respond very quickly to each request. If any computer in the network fails, the system as a whole should continue to work, and no data should be lost. Though the individual server that receives new information may know more than the others for a time, eventually that new knowledge has to be available via request to any of the other servers.

This kind of system has many of the problems of our generalized real estate records office, and can benefit from aspects of its approach.

One aspect of that approach is to put new information “on the record” with relatively little review, deferring analysis to a later time. The practical effect of this approach crops up in cases where, say, a user increases the quantity of an item already in their cart. In a traditional, centralized system, this kind of change is stored by first finding the row indicating that the item is in the user’s cart, then changing that record where it is stored (“mutating” it, in the jargon) to reflect the new quantity. This is much akin to the business records office, which pulls the file for a business when a new filing is received, checks it for correctness, and only then adds the document to the file.

A more lightweight approach might skip find-and-check, opting instead to log the fact that the user added an additional item immediately. If and when the software requires a current view of the shopping cart for a user, all of the log entries for that customer can be recalled, and then applied, one after the other, to reveal the current state of the user’s cart. Much as one might balance a checkbook by reading back carbon copies of checks and deposit slips, the current state of the cart can be derived from all items the user has added or removed to date.
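A minimal sketch of this log-and-replay idea follows, assuming a single in-memory Python list stands in for the log. The entry format and function names are hypothetical, and real systems add ordering, durability, and distribution on top.

```python
from collections import Counter

# Append-only log of (account_id, action, item) entries; nothing is edited.
log = []

def record(account_id, action, item):
    """Put a change on the record without looking anything up first."""
    log.append((account_id, action, item))

def current_cart(account_id):
    """Replay the account's entries in order to derive the cart's state."""
    cart = Counter()
    for entry_account, action, item in log:
        if entry_account != account_id:
            continue
        if action == "add":
            cart[item] += 1
        elif action == "remove" and cart[item] > 0:
            cart[item] -= 1
    # Report only items still in the cart; the log keeps the full history.
    return {item: quantity for item, quantity in cart.items() if quantity > 0}
```

The derived cart shows only what remains, but every add and remove is still sitting in the log, which is the property the rest of this piece is concerned with.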

This approach is being actively adopted, both in distributed data stores (Datomic, Cassandra), and in the way data is stored on individual server computers (LevelDB, RocksDB, recent SQLite). Though the area is still developing, and developing very quickly, at least some of these approaches have come to be termed “immutable” data approaches, so named because data is updated only by appending new data to a log, rather than by changing (“mutating”) data records in place.

Immutable log-based database systems vary in meaningful ways. Some run recurring housekeeping routines, called “compaction”, to remove old data, much as banks keep transaction records justifying balance figures only so long. To return to our example, if a user adds a textbook to their shopping cart, but later removes it, many systems will determine after some time that the corresponding log entries cancel each other out, and delete them to save space. Other systems remove records older than a certain age, or old versions of data records for which a more recent revision is available. Various clever approaches (such as vector clocks) are used to keep changes coming into different servers across a network in the correct order.
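The cancelling-out flavor of compaction can be sketched as below. This is a toy under stated assumptions: log entries are hypothetical (account_id, action, item) tuples, and the pairing rule (each removal cancels the most recent unmatched addition of the same item) is an illustrative choice, not how any named database does it.

```python
def compact(entries):
    """Drop each 'remove' together with the unmatched 'add' it cancels."""
    keep = [True] * len(entries)
    open_adds = {}  # (account_id, item) -> indexes of adds not yet cancelled
    for index, (account_id, action, item) in enumerate(entries):
        key = (account_id, item)
        if action == "add":
            open_adds.setdefault(key, []).append(index)
        elif action == "remove":
            keep[index] = False  # a removal never needs to survive compaction
            if open_adds.get(key):
                keep[open_adds[key].pop()] = False  # cancel the latest add
    # Surviving entries keep their original relative order.
    return [entry for entry, kept in zip(entries, keep) if kept]
```

Until a routine like this runs, the cancelled pair remains readable; a system that never runs it keeps the pair indefinitely.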

Still other systems, like some real estate record offices, rarely or never dispose of old records, but place them in deep storage that’s plentiful, safe, but slow to access. Some of these systems, including tools used to track changes to software projects over time (such as Git) or to track public ledgers of transactions (such as the Bitcoin blockchain), use cryptographic “hashing”, a kind of data fingerprinting, to ensure ordering among successive data records. These systems provide the ability to rewind to a point in history as a core feature. Like a chain of title to real property, each subsequent data entry must accurately refer to or otherwise incorporate the last relevant data, forming a chain of validity stretching back to the start of the system.
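The chaining idea can be illustrated with a short sketch, assuming SHA-256 from Python's standard hashlib and a made-up record format. It is not the actual on-disk format of Git or of the Bitcoin blockchain; every name here is hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(previous_hash, payload):
    """Fingerprint an entry together with the hash of its predecessor."""
    body = json.dumps({"prev": previous_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append(chain, payload):
    """Add an entry that refers back to the hash of the last one."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev": previous_hash,
        "payload": payload,
        "hash": entry_hash(previous_hash, payload),
    })

def verify(chain):
    """Check every link in order, back to the start of the chain."""
    previous_hash = GENESIS
    for entry in chain:
        if entry["prev"] != previous_hash:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["payload"]):
            return False
        previous_hash = entry["hash"]
    return True
```

Because each fingerprint covers the one before it, quietly editing or deleting an old entry breaks verification of everything after it. That is good for integrity, and it makes after-the-fact removal structurally awkward.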

System designers sometimes start out with robust compaction, but shift to a more archival system when they realize the value to be gleaned from analysis of how and when information changes. Electronic commerce portals, for instance, can derive powerful insights from information about what products a customer adds but later removes from their shopping cart, whether standing alone or in correlation to the products they eventually do purchase. Social networks may learn from postings or messages saved as drafts, but never shared or sent. Whether for technical or business value reasons, handling of information that is compatible with Deep Data storage converges upon data storage strategies that make these kinds of insights possible.

Privacy risk

Privacy and anonymity are cherished values in the subculture of open source software developers. That community also makes ample use of GitHub, a website that provides free, public sharing of software projects managed using Git, a software tool that tracks changes to computer files precisely over time. Git stores every change made to a project’s files in perpetuity, and each change (a “commit”) is tagged with the name of its author and the time and date it was made. Thousands of otherwise privacy-conscious developers make such information publicly available online every day, as they have done for decades using other websites and tools.

When and how often a programmer worked on a given software project, or on any project at all, can be extraordinarily revealing. A consulting client billed for work done five days in a week may find, upon source code delivery, that changes were made on only two. A former employer making a claim on intellectual property might show that changes to a passion project were made during work hours in the period of employment. A suspicious lover, told that their unfaithful partner was hard at work finishing a project, can note a telling absence of commits at the relevant time. What the Fourth Amendment denies a police investigator who wants to install a tracking program might be had for free from the suspect’s public commit history.
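To give a sense of how little effort such inferences take, here is a sketch that works on a list of commit timestamps in ISO 8601 form, such as those printed by git log --format=%aI. The function names, the nine-to-five window, and the weekday rule are assumptions for illustration only.

```python
from datetime import datetime

def days_with_commits(iso_timestamps):
    """Return the distinct calendar days, in each commit's own UTC offset."""
    return sorted({datetime.fromisoformat(ts).date() for ts in iso_timestamps})

def commits_during_hours(iso_timestamps, start_hour=9, end_hour=17):
    """Return timestamps that fall on a weekday inside the assumed workday."""
    matches = []
    for ts in iso_timestamps:
        moment = datetime.fromisoformat(ts)
        if moment.weekday() < 5 and start_hour <= moment.hour < end_hour:
            matches.append(ts)
    return matches
```

The first function answers the consulting client's question about days actually worked; the second answers the former employer's question about work hours.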

Not all users of GitHub run the same risk; the cost-benefit analysis is very personal. The point is rather that changes to how information we are already used to sharing is stored are prone to slip past our privacy reflexes. This is apparently true even when we’re fully aware of the data systems being used to store data about us and even, uniquely to Git, when the subject of stored data is the one choosing and using the data storage tool.

Privacy risk for users of other services adopting immutable data stores will, of course, vary. In the event of a security breach or a data leak, the additional privacy danger grows with the amount and sensitivity of information about past changes to data stored, but not yet compacted. If that information contains time and date data—say, for when a user adds and removes search alerts for information on a disease—the effect may be much more serious than GitHub commit times. Seemingly innocuous information that may become sensitive in specific circumstances—records showing a social media user’s change from the name their stalker knows to a new one—may pose severe latent privacy risk. Whether substantial or non-existent, risks of this kind should be assessed and factored into the overall risk profile of any proposal to deploy Deep Data techniques. Those who aren’t expert in the particulars of the specific technology on offer should know to ask the right questions of those who are.

“Deep Data” can serve as a trigger for this kind of more nuanced privacy analysis, and one that non-experts can wield and apply. The fundamental question is this: Separate and apart from the new kinds of information we will store about users, do the new ways we plan to store that information also create privacy risk? Might the privacy impact of a data breach be more serious? Might users feel violated by changes to the way their information is stored and processed, even if they’re otherwise accustomed to sharing that kind of information in the abstract?

Policy tools

Immutable data approaches to Deep Data also call out for more effective communication among policymakers and technologists about the technical costs and benefits of policy tools applicable to privacy. Recent controversial developments in European privacy policy, notably the so-called “right to be forgotten”, plot potential collision courses with the trajectory of Deep Data technologies. When SOPA and PIPA threatened to inadvertently hobble important DNS security improvements, the right experts spoke in the right way to impress lawmakers with the technical costs of the proposed approach. The technology community is primed to bristle at copyright laws; thanks to the right to be forgotten, it should now be aware that privacy regulation is no less fertile ground for system-breaking legal changes.

Policymakers aren’t database experts, but the track record of legislation shows lawmakers are capable of nuanced information approaches that sound in the theme of Deep Data storage problems. Data privacy laws on the books in the United States are primarily sector-specific, rather than general and cross-cutting, but common principles and concepts unite and describe much sector-specific regulation. Among those foundational principles are rights to be made aware of data that is stored and used, in part to ensure that people aren’t harmed by decisions based on outdated or inaccurate information about them. The principle is backed up in various laws for various public and private sectors by differing legal rights to affect data that is stored and processed.

In the case of credit history information, for example, subjects of data collection are entitled to various notices when credit information is used, and especially when it is used against their interests. The notices must make clear who holds the information that was used, and also that individuals have rights to receive, review, and correct information about them.

In contrast, regulations about education records provide various procedures for disputing information kept on file. When appeals of record keeping decisions run out, students have rights to insert dissenting statements into their files, to be reproduced with the rest whenever the file is requested. Records on file are inviolate, in part because educational institutions rely on them for effective administration, but the interest in fairness and correctness is served by adding information that will be assessed whenever the file is reviewed.

Together with rights to deletion and archiving of data deemed stale or irrelevant, such as procedures for expunging of criminal convictions and omission of long-past, arguably irrelevant bankruptcy records, the scope of corrective privacy law tools runs the metaphorical gamut from append-only supplementation through compaction to in-place mutation of data records where they’re stored. Those involved in policy formulation are not entirely unfamiliar with data storage problems and the balancing acts that make one or another approach most appropriate. They just aren’t aware of how those problems are being addressed to make the services their constituents rely on possible.

Technical choices have long flowed in the other direction, from applicable regulation to compliant system. Compliance with certain privacy regulation regimes, such as HIPAA and GLBA, has long guided the adoption of certain technologies that are either inherently compliant or broadly believed to be compliant. Other kinds of regulation, such as those affecting financial institutions, have long motivated the use of data storage approaches that facilitate auditability, like Write-Once, Read-Many (WORM) storage media.

But there is little evidence, especially in light of the recent developments, that knowledge flows the other way. What happens when a system becomes subject both to auditability requirements (save everything) and the right to be forgotten (but delete things when you’re asked)?

It isn’t helpful or useful to decry Congress for failing to meet the geeks on their terms; there are a thousand forms of expertise, none of which should monopolize legislative attention. The goal is rather to provide broad concepts, general but workable, that ensure policy wonks spot issues and call in experts before it’s too late to act on them without a SOPA/PIPA-scale blowup. “Big Data” is, at best, shorthand for a cartoon version of data science. It is also a powerful tool for awareness, wieldable by and intuitive to utter non-experts. It’s time we realized that Big Data, as an abstraction, is starting to leak in unfortunate ways. Deep Data will eventually have the same problems, but for now, it’s progress.

Your thoughts and feedback are always welcome by e-mail.