The following is a guest post by Xiaokui Xiao, Assistant Professor in the Division of Information Systems at Nanyang Technological University. Xiaokui was a postdoc in the Cornell Big Red Data Group from 2008 to 2009.
Numerous organizations, like the U.S. Census Bureau or hospitals, maintain large collections of personal information (such as census data or medical records). These data collections are of significant research value, and there is much benefit in making them publicly available. Since the data is sensitive in nature, proper measures must be taken to ensure that its publication does not endanger the privacy of the individuals that contributed the data. A canonical solution to this problem is to modify the data before releasing it to the public such that the modification prevents inference of private information while retaining the statistical characteristics of the data. In our work, we are limiting the disclosure that happens through data publishing by using ε-differential privacy. Informally, ε-differential privacy requires that the published data should be generated using a randomized algorithm G, such that the output of G is not sensitive to any particular tuple in the input, i.e., the output of G should rely mainly on the general properties of the data. This ensures that, by observing the data modified by G, an adversary (who would like to find out information about individuals in the table) is not able to infer much information about any individual tuple, and hence, privacy is preserved.
The simplest method to enforce ε-differential privacy, as proposed by Dwork et al., is to first derive the frequency distribution of the tuples in the input data, and then publish a noisy version of the distribution. For example, given the medical records in Table 1, Dwork et al.'s method first maps the records to the frequency matrix in Table 2, where each entry in the first (second) column stores the number of diabetes (non-diabetes) patients in Table 1 that belong to a specific age group. After that, Dwork et al.'s method adds independent noise with a Θ(1) variance to each entry in Table 2, and then publishes the noisy frequency matrix.
Dwork et al.'s method provides reasonable accuracy for queries about individual entries in the frequency matrix, as it injects only a small amount of noise (with a constant variance) into each entry. Nevertheless, there exist numerous other queries for which Dwork et al.’s method fails to provide useful results. For example, for a count query answered by taking the sum of a constant fraction of the entries in the noisy frequency matrix, the approximate query result has a Θ(m) noise variance, where m denotes the total number of entries in the matrix. Note that m is typically an enormous number, as practical datasets often contain multiple attributes with sizable domains. Hence, a Θ(m) noise variance can render the approximate result meaningless, especially when the actual result of the query is small.
In our ICDE 2010 paper, we remedy the deficiency of Dwork’s method with Privelet (privacy preserving wavelet), a data publishing technique that not only ensures ε-differential privacy, but also provides accurate results for all range-count queries, i.e., count queries where the predicate on each attribute is a range. Privelet guarantees that any range-count query can be answered with a noise whose variance is polylogarithmic in m. This significantly improves over the O(m) noise variance bound provided by Dwork et al.’s method.
In a nutshell, the effectiveness of Privelet results from a novel application of wavelet transforms, a type of linear transformations that has been widely adopted for image processing and approximate query processing. As with Dwork et al.’s method, Privelet preserves privacy by modifying the frequency matrix M of the input data. Instead of injecting noise directly into M, however, Privelet first applies a wavelet transform on M, converting M to another matrix C. Privelet then adds a polylogarithmic noise to each entry in C, and maps C back to a noisy frequency matrix M∗. The matrix M∗ thus obtained has an interesting property: The result of any range-count query on M ∗ can be expressed as a weighted sum of a polylogarithmic number of entries in C. Furthermore, each of these entries contributes at most polylogarithmic noise variance to the weighted sum. Therefore, the variance of the noise in the query result is bounded by a polylogarithm of m.
You can read more about this work in our paper that will appear at ICDE 2010. See you in Long Beach for the talk!