Monday, September 28, 2009

Differential Privacy via Wavelet Transforms

The following is a guest post by Xiaokui Xiao, Assistant Professor in the Division of Information Systems at Nanyang Technological University. Xiaokui was a postdoc in the Cornell Big Red Data Group from 2008 to 2009.

Numerous organizations, like the U.S. Census Bureau or hospitals, maintain large collections of personal information (such as census data or medical records). These data collections are of significant research value, and there is much benefit in making them publicly available. Since the data is sensitive in nature, proper measures must be taken to ensure that its publication does not endanger the privacy of the individuals that contributed the data. A canonical solution to this problem is to modify the data before releasing it to the public such that the modification prevents inference of private information while retaining the statistical characteristics of the data. In our work, we are limiting the disclosure that happens through data publishing by using ε-differential privacy. Informally, ε-differential privacy requires that the published data should be generated using a randomized algorithm G, such that the output of G is not sensitive to any particular tuple in the input, i.e., the output of G should rely mainly on the general properties of the data. This ensures that, by observing the data modified by G, an adversary (who would like to find out information about individuals in the table) is not able to infer much information about any individual tuple, and hence, privacy is preserved.

The simplest method to enforce ε-differential privacy, as proposed by Dwork et al., is to first derive the frequency distribution of the tuples in the input data, and then publish a noisy version of the distribution. For example, given the medical records in Table 1, Dwork et al.'s method first maps the records to the frequency matrix in Table 2, where each entry in the first (second) column stores the number of diabetes (non-diabetes) patients in Table 1 that belong to a specific age group. After that, Dwork et al.'s method adds independent noise with a Θ(1) variance to each entry in Table 2, and then publishes the noisy frequency matrix.

Dwork et al.'s method provides reasonable accuracy for queries about individual entries in the frequency matrix, as it injects only a small amount of noise (with a constant variance) into each entry. Nevertheless, there exist numerous other queries for which Dwork et al.’s method fails to provide useful results. For example, for a count query answered by taking the sum of a constant fraction of the entries in the noisy frequency matrix, the approximate query result has a Θ(m) noise variance, where m denotes the total number of entries in the matrix. Note that m is typically an enormous number, as practical datasets often contain multiple attributes with sizable domains. Hence, a Θ(m) noise variance can render the approximate result meaningless, especially when the actual result of the query is small.

In our ICDE 2010 paper, we remedy the deficiency of Dwork’s method with Privelet (privacy preserving wavelet), a data publishing technique that not only ensures ε-differential privacy, but also provides accurate results for all range-count queries, i.e., count queries where the predicate on each attribute is a range. Privelet guarantees that any range-count query can be answered with a noise whose variance is polylogarithmic in m. This significantly improves over the O(m) noise variance bound provided by Dwork et al.’s method.

In a nutshell, the effectiveness of Privelet results from a novel application of wavelet transforms, a type of linear transformations that has been widely adopted for image processing and approximate query processing. As with Dwork et al.’s method, Privelet preserves privacy by modifying the frequency matrix M of the input data. Instead of injecting noise directly into M, however, Privelet first applies a wavelet transform on M, converting M to another matrix C. Privelet then adds a polylogarithmic noise to each entry in C, and maps C back to a noisy frequency matrix M∗. The matrix M∗ thus obtained has an interesting property: The result of any range-count query on M ∗ can be expressed as a weighted sum of a polylogarithmic number of entries in C. Furthermore, each of these entries contributes at most polylogarithmic noise variance to the weighted sum. Therefore, the variance of the noise in the query result is bounded by a polylogarithm of m.

You can read more about this work in our paper that will appear at ICDE 2010. See you in Long Beach for the talk!

Tuesday, September 15, 2009

Intensional Associations in Dataspaces

One problem that many users have in managing their data is how to obtain connected items while searching. For example, picture yourself searching for information on an interesting classroom project you developed some years ago. You may type a few keywords in a search tool that will lead you to one or two documents lost on the vast amount of information in your hard drive about that project. Unfortunately, not all documents you are interested in, such as graphs, emails, and results of interesting experiments, may contain the keywords you chose to type on the search box.

The problem in this example is that even though you could find some information related to your project, you cannot connect from this information to other important items in the same context. Together with colleagues from Saarland University and ETH Zurich, I have explored an idea to solve this problem in a paper recently accepted for publication at ICDE 2010. The full version of our paper can be found here (link to draft).

In order to define connections among items in a dataspace, we propose association trails. An association trail is a declarative definition of how items in the dataspace are connected by virtual association edges to other items. A set of association trails defines a logical graph of associations over the dataspace. For example, you may connect documents in your personal dataspace by associating items touched around the same time, documents with similar content, different versions of documents you authored or received, or items that reside in similar folder hierarchies in your email server and in your filesystem.

Coming back to our classroom project search, association trails create connections from your one or two search results to a rich set of related emails, documents, and experiment results. Automatically obtaining all of this context information from search results is called in our paper a neighborhood query. While neighborhood queries are very useful to help you find information in your data, they are also very expensive to process over the logical graph of connections created by association trails. In order to address this problem, our paper investigates a new indexing technique, called the grouping-compressed index (GCI). In a nutshell, GCI creates a compressed representation of the logical graph declared by association trails. We can use this compressed representation to answer neighborhood queries without ever having to expand it to the whole graph. As a consequence, GCI can achieve over an order of magnitude better indexing or querying times when compared to various alternatives.

Association trails have been integrated into the iMeMex Dataspace Management System and the code is released under an open-source license. If you are interested in dataspaces, you can also find out about other work I have done in iMeMex by taking a look at my PhD thesis.

I am looking forward to an interesting conference at Long Beach next year! Hope to see you there!

The SIGMOD 2010 Deadline -- Nov. 5

Remember, remember the Fifth of November,
And that SIGMOD matters a lot,
I know of no reason
Why SIGMOD submission
Should ever be forgot.

(And should you forget, it's V for VLDB for you!)

Saturday, September 12, 2009

Talk Announcement -- Daniel Deutch, Tel Aviv University

Date: Monday Sept. 21, 2009 noon to 1pm
Location: 5130 Upson Hall, Cornell

Querying Past and Future in Web Applications

Daniel Deutch

Abstract: Many businesses offer their services to customers via Web-based application interfaces. Reasoning about execution flows of such applications is extremely valuable for companies: it can be used to optimize business processes, employ targeted advertisements, reduce operational costs, and ultimately increase competitiveness. Such reasoning often operates in an environment inducing partial information and uncertainty of various flavors. First, the execution traces recorded for a Web application often contain only partial information on the activities that were performed at run-time, due to confidentiality, lack of storage space, etc. Second, even in the presence of fully detailed traces of the past executions, prediction of the behavior of future executions may still operate under terms of uncertainty. This is because executions often depend on unknown external parameters, such as users behavior, interaction with other applications, servers response time, etc.

In this talk I will consider (1) models for capturing Web applications and their executions. These models are expressive enough to capture common scenarios, while restrictive enough to allow for efficient query evaluation; (2) query evaluation algorithms over applications/execution traces under these models, and (3) practical implementations for recommending navigation flows within a web applications.

Friday, September 11, 2009

PIP:A Database System for Great and Small Expectations

The field of probabilistic databases attempts to ask the question: can we still reason about data that we aren't certain about? Related to a wide variety of fields including statistics, probability theory, and fuzzy logic, probabilistic databases treat data as following a probabilistic model rather than being known with certainty. When the data is queried, the database not only produces a query result, but also computes (or more commonly, estimates) statistical properties (i.e., Confidence, or Histogram) of that result.

For example, consider a risk-management application that uses statistical models to evaluate the long term effects of corporate decisions and policies. This application may use a DBMS to store predictions and statistical measures (e.g., error bounds) of those predictions. However, arbitrary queries made on the predictions do not translate naturally into queries on the corresponding statistical measures. A user who requires error bounds on the sum of a join over several tables of predictions must first obtain a formula for computing those bounds, assuming a closed form formula even exists.

A wide variety of probabilistic database systems have arisen supporting probabilistic data that follows finite, discrete distributions. Some even approximate continuous distributions by translating them into discrete distributions (i.e., by integrating over bins or sampling).

Unfortunately, these kinds of approximations are hard to instantiate prior to runtime; To maintain generality, these systems must provision (e.g., via bin size, or sample count) for a wide variety of unknown arbitrary queries and statistical measurements of unknown precision when creating probabilistic data tables. Worse still, since precision can be data-dependent, it may be impossible to accurately provision a query until after it has completed. These systems must generate an overabundance of samples or unnecessarily small bin sizes, lest they be unable to achieve sufficient precision.

For example, if a query contains a selection predicate, samples violating the predicate are dropped and do not contribute to the expectation. The more selective the predicate, the more samples are needed to maintain consistent accuracy. Our sample application may be queried to combine a model predicting customer profits with a model for predicting dissatisfied customers, perhaps as a result of a corporate decision to use a cheaper, but slower shipping company. If the query asks for profit loss due to dissatisfied customers, the query need only consider profit from customers under those conditions where the customer is dissatisfied (ie, the underlying model may include a correlation between ordering patterns and dependence on fast shipping).

We address these concerns in our paper PIP:A Database System for Great and Small Expectations (link to draft version), to be presented at ICDE 2010. PIP is a purely symbolic probabilistic database that supports continuous distributions. Using a C-Tables representation and techniques similar to those used by the MayBMS Probabilistic Database Management System, PIP maintains a symbolic representation of all uncertainty in the database. Queries are evaluated directly on this symbolic representation; a direct mapping exists between all relational algebra operators and their c-tables counterparts. Even better, the direct mapping is nearly trivial.

Because the query result is expressed symbolically, its statistical properties can be estimated very efficiently or even computed precisely (in certain cases). By exploiting knowledge about the underlying distributions (where available) and the query itself, PIP acts as a framework for applying a variety of statistical tools to improve optimization efficiency. PIP is extensible, allowing developers to encode new probability distributions into modules that are linked into PIP itself. These modules encode any or all of a wide range of pieces of information about the distribution. For example, if the CDF of a distribution is known, when computing the expectation of a variable following this distribution, PIP can use the CDF to improve sampling efficiency or even compute an exact value for the expectation.

PIP has been implemented as a plugin for Postgres. Some minor modifications to Postgres itself make it possible to employ static data queries on probabilistic data unchanged. However, the core functionality can be imported into any Postgres database. A beta version of PIP will be released officially soon. Check back on this blog for details.

Cornell papers accepted at ICDE 2010

The group has just had the following papers and a tutorial accepted at ICDE 2010:
  • Intensional Associations in Dataspaces, by Marcos Vaz Salles, Jens Dittrich, Lukas Blunschi (short paper)
  • Differential Privacy via Wavelet Transforms, by Xiaokui Xiao, Guozhang Wang, Johannes Gehrke (long paper)
  • Approximate Confidence Computation in Probabilistic Databases, by Dan Olteanu, Jiewen Huang, Christoph Koch (long paper)
  • PIP: A Database System for Great and Small Expectations, by Oliver Kennedy, Christoph Koch (long paper)
  • Privacy in Data Publishing, Johannes Gehrke, Daniel Kifer, Ashwin Machanavajjhala (three-hour tutorial)
More on this work coming soon on this blog!

Monday, September 7, 2009

New MMO: Monopoly City Streets

From their website: "A live worldwide game of MONOPOLY using Google Maps as the game board." Are there any interesting scaling issues with running this *live*? Maybe we can help them with our scripting language SGL ;-)?

Friday, September 4, 2009

e-Privacy: A Framework for Data-Publishing against Realistic Adversaries (Part II)

Recall the setting from my previous posting. We want to publish a table while maintaining the privacy of the individuals in the table. So far, there only existed extreme adversaries: Either very weak or super strong adversaries - nothing in between.
If we only protect privacy against weak adversaries then a smarter adversary will be able to breach the privacy of individuals in practice. For example, an adversary who knows "it is flu season and there are likely to be many elderly patients with flu symptoms in the hospital." can be smart enough to breach the privacy of individuals.
On the other hand if we protect against extremely strong adversaries then our published table is close to being useless.
That is why we would like to protect against realistic adversaries in the middle ground. So how do "realistic" adversaries look like?

In our VLDB paper, we introduce e-privacy which protects against adversaries who have collected some statistics about the population and use them to build their belief about individuals. An adversary can have the following knowledge:
  • Knowledge about the general population. The adversaries obtain this knowledge from other datasets, for example, through articles that they have read that contain survey data or by looking at other data sets.
  • Knowledge about specific individuals in the table. The adversaries obtain this knowledge as in prior work, for example, by knowing some facts about their friends or neighbors.
As we just outlined, the adversary starts with some beliefs about the world. Upon seeing the released table the adversary may revise her belief - for example, the adversary may know the rate of cancer in the general population and assumed this was the same for all patients but upon seeing the data from a hospital, might change her mind. By how much an adversary changes her belief depends on what we call the stubbornness of the adversary. Infinitely stubborn adversaries believe in facts, and thus they do not change their belief at all. Adversaries with finite stubbornness need more or less data to convince them that their prior beliefs were incorrect.

In the paper we explain how to take a table and generalize it in order to protect against these adversaries. This was impossible if you wanted to protect against the super strong adversaries. But now, we can find out which tables we can publish defending against realistic adversaries.

We believe that e-privacy with its defense against realistic adversaries is very useful in practice, and thus we hope that you will want to experiment with it and try it out on your own data. We are eager to learn about your experience!

For more details on the framework and an experimental evaluation of information content of data published with e-privacy please refer to our paper. If you want to know more about our projects on data privacy you can find more information here.

Wednesday, September 2, 2009

A Confluence of Column Stores and Search Engines: Opportunities and Challenges

At the USETIM (Using Search Engine Technology for Information Management) Workshop at VLDB 2009, we presented a paper that explores the technical similarities and differences between column stores and search engines. The motivation for the paper is that we believe that the two systems have a lot of similarities, and that development in both fields point towards a confluence. We therefore try to evaluate the synergies obtainable from implementing one system to support both workloads, and we find that there are both opportunities and challenges associated with developing such a hybrid system.

With a high-level perspective, we can claim that both column stores and search engines are column-oriented. While column stores obviously store column-oriented representations of tables, an inverted index used in search engines can be interpreted as a column-oriented representation of a term-document matrix where the entries describe the occurrence of terms in documents. Despite this similarity, it is clear that the queries typically supported in the two types of systems are quite different. While search engines focus on returning the top-k ranked documents for a keyword query, column stores are mainly used in decision support systems where queries with aggregations are dominant. However, typical search engine queries might become more similar to traditional DSS queries through the introduction of faceted search. In addition, new approaches suggested for storing the entries in inverted lists/posting lists make the storage layout in search engines more column-oriented. The fact that that column stores have been used as back-ends to search engines with good performance also indicates similarities between the fields.

The similarities might suggest that one system can support both workloads efficiently. Our qualitative evaluation indicates that there are clear opportunities associated with constructing a hybrid system, such as efficient storage structures for facets in faceted search engines based on a columnar layout, and technology transfer of loading techniques and indexing structures. But, there are also challenges that need to be overcome for a hybrid system to be efficient. Examples of challenges include the fact that search engine workloads require low latency, while throughput is the major optimization criterion for column stores. A successful hybrid system might also be able to support other related workloads, but supporting more workloads typically involves supporting more query languages. When supporting several query languages, it becomes a challenge to find a physical algebra for the system that enables supporting all query languages efficiently.

If you are interested in hearing some more details about this work, or in discussing the potential in developing a hybrid system for both search and decision support workloads, comment in this blog or read our paper at the USETIM workshop at VLDB 2009!