Friday, August 14, 2009

e-Privacy: A Framework for Data-Publishing against Realistic Adversaries (Part I)

We have a new paper with exciting results on privacy-preserving data publishing at VLDB 2009. What is privacy-preserving data publishing? Let us start with a motivating example. Consider the following set of medical records published by Gotham City Hospital:
ZIP CODE    AGE             DISEASE
130**       less than 30    Viral Infection
130**       less than 30    Heart Disease
1485*       at least 40     Cancer
1485*       at least 40     Heart Disease
130**       around 35       Cancer
130**       around 35       Cancer

Each record in this table corresponds to a unique patient in the hospital, and each patient has three attributes: her zip code, her age, and her disease. Each patient considers her disease to be sensitive; the other attributes are not sensitive, but might be used to link a record to a person.
The non-sensitive attributes have been coarsened to ensure that no patient can be uniquely identified. For example, the zip code of the first patient has been changed from 13021 to 130**, and the values in the age attribute have been changed to ranges. The hospital should ensure that an adversary cannot link any patient to her disease.
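
To make the coarsening concrete, here is a minimal sketch in Python. The helper names, the exact age (28) of the first patient, and the choice of how many digits to mask are my own illustrative placeholders, not something prescribed by the paper.

# Toy illustration of how non-sensitive attributes can be coarsened before publishing.
# How coarse to go (how many zip digits to mask, how wide an age range to use) is
# exactly the choice a publishing algorithm has to make; the values here are hand-picked.

def generalize_zip(zipcode, mask_digits):
    # Mask the trailing digits of a zip code, e.g. 13021 -> 130** for mask_digits=2.
    return zipcode[:len(zipcode) - mask_digits] + "*" * mask_digits

def generalize_age(age):
    # Map an exact age to one of the coarse ranges used in the table above.
    if age < 30:
        return "less than 30"
    if age < 40:
        return "around 35"
    return "at least 40"

# The first patient from the table; the exact age 28 is hypothetical.
raw_record = {"zip": "13021", "age": 28, "disease": "Viral Infection"}
published_record = {
    "zip": generalize_zip(raw_record["zip"], mask_digits=2),   # '130**'
    "age": generalize_age(raw_record["age"]),                  # 'less than 30'
    "disease": raw_record["disease"],                          # sensitive value published as-is
}
print(published_record)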

Suppose Rachel is an individual in the population. Given access to only this table, the adversary Alice may not be able to deduce Rachel's disease. But if Alice knows that Rachel is one of the individuals whose medical record is published in the table, and that Rachel is 35 years old and lives in zip code 13068, Alice can infer that Rachel has cancer.
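
To make Alice's inference concrete, here is a small, purely illustrative Python sketch of the linking attack: Alice keeps only the published rows that are consistent with what she knows about Rachel and looks at which diseases remain.

# The published, generalized table from above.
published_table = [
    {"zip": "130**", "age": "less than 30", "disease": "Viral Infection"},
    {"zip": "130**", "age": "less than 30", "disease": "Heart Disease"},
    {"zip": "1485*", "age": "at least 40",  "disease": "Cancer"},
    {"zip": "1485*", "age": "at least 40",  "disease": "Heart Disease"},
    {"zip": "130**", "age": "around 35",    "disease": "Cancer"},
    {"zip": "130**", "age": "around 35",    "disease": "Cancer"},
]

# Alice knows: Rachel is in the table, lives in zip 13068, and is 35 years old.
# Those facts are consistent only with rows generalized to ('130**', 'around 35').
def consistent_with_rachel(row):
    return row["zip"] == "130**" and row["age"] == "around 35"

candidates = {row["disease"] for row in published_table if consistent_with_rachel(row)}
print(candidates)   # {'Cancer'} -- every matching row has the same disease

The inference succeeds because the group Rachel falls into is homogeneous in its sensitive attribute, so coarsening the non-sensitive attributes alone does not protect her.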

So now let's be a bit more formal and describe the basic scenario of privacy-preserving data-publishing abstractly: You have a table T with sensitive information about individuals. You want to publish a sanitized version T' that (a) offers good utility and (b) preserves the privacy of the individuals in the table.

Now, we have defined neither utility nor privacy, and there might not be a "best" definition of these concepts. In the literature you find a variety of definitions that differ in what is considered sensitive information, in what privacy means, and against what types of adversaries privacy needs to be protected.

In prior work on this topic, privacy was protected either against very weak adversaries or against extremely powerful (basically omniscient) adversaries. For example, consider the weak adversary of t-closeness. This adversary knows the distribution of the diseases in T before you have released any information about T; for instance, since half of the patients in our example table have cancer, the t-closeness adversary believes that Rachel's chance of having cancer is 50%. Another weak adversary is captured in l-diversity. Here, the adversary believes that for Rachel all diseases are equally likely, and the adversary knows some facts about the world, such as "men are unlikely to have breast cancer." On the other extreme, differential privacy considers a very powerful adversary who is assumed to know all patients in T except Rachel. Differential privacy provides so much protection that no generalization of T can be released, and this much protection limits the utility of the released table.
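
One way to picture the difference between these adversaries (a rough sketch under my own simplifications, not the formal model of any of these papers) is as a prior belief about Rachel's disease that gets updated once the adversary learns which published group Rachel falls into:

from collections import Counter

# The published table, written as (zip, age, disease) tuples.
rows = [
    ("130**", "less than 30", "Viral Infection"),
    ("130**", "less than 30", "Heart Disease"),
    ("1485*", "at least 40",  "Cancer"),
    ("1485*", "at least 40",  "Heart Disease"),
    ("130**", "around 35",    "Cancer"),
    ("130**", "around 35",    "Cancer"),
]

# The t-closeness-style adversary's prior: the disease distribution over all of T.
counts = Counter(disease for _, _, disease in rows)
prior = {d: c / len(rows) for d, c in counts.items()}
# -> Cancer: 0.5, Heart Disease: ~0.33, Viral Infection: ~0.17

# Posterior once the adversary links Rachel to the group ('130**', 'around 35').
group = [disease for z, a, disease in rows if z == "130**" and a == "around 35"]
posterior = {d: group.count(d) / len(group) for d in set(group)}
# -> Cancer: 1.0

print("prior:    ", prior)
print("posterior:", posterior)

In this picture, the l-diversity adversary would start instead from a uniform prior over all diseases (tempered by facts like "men are unlikely to have breast cancer"), while the differential privacy adversary effectively knows every row of T except Rachel's even before the update.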

Is there a middle ground of adversaries we can work with, adversaries that are neither omniscient nor weaklings? I will tell you more about this in my next blog posting.
