Thursday, February 10, 2011
CIDR 2011
Monday, October 18, 2010
Why and Where Provenance
This paper looks at problem of determining the provenance of a query answer, i.e. what data in the database "contributes to" the resulting answer. One of the insights of the paper is that the concept of provenance profoundly depends on what one means by "contributes to." Two notions of provenance are introduced, where provenance and why provenance, and shown to have very different behavior.
The distinction between why and where provenance is best seen with an example: Suppose ("Joe", 1234) is an answer to this query.
SELECT name, telephone
FROM employee, dept
WHERE employee.dno = dept.dno AND dept.name = "Computer Science"
The where provenance of 1234 is simply the corresponding phone number in Joe's record in the employee relation. The why provenance includes not only Joe's record in employee, but also the Computer Science record in dept because without that record, Joe's record would not be included in the result.
For why provenance, the paper gives precise characterization based on query *syntax.* Informally, a tuple in the database is part of the why provenance if it is used in some minimal derivation of the answer tuple (the qualifications "some" and "minimal" are important). This notion of provenance has nice properties---for instance, invariance to query rewriting.
For where provenance, the intuition guiding the above approach appears to break down. Examples are shown where two queries are equivalent yet exhibit different where provenance, and they suggest that a syntactic characterization may fail to fully capture where provenance.
Despite the challenges with where provenance, it appears as though subsequent work has developed approaches for where provenance. How were these challenges addressed?
In addition, the why provenance characterization is for SPJU queries only. Extending to include negation and aggregation seems important but quite challenging: the provenance of a tuple may include the entire database! Such an answer, while technically correct, may not be useful to the user. Is there a reasonable notion of weighted provenance, where some input tuples have more influence on the query answer than others?
In addition to where and why provenance, what other kinds of provenance might be useful?
Monday, October 4, 2010
Sunday, September 26, 2010
Is the Cloud Ready for Scientific Computing?
Last Thursday, in the DB breakfast at Cornell, we asked ourselves the question whether the cloud was ready for HPC. We discussed a paper from this year's VLDB conference by Schad, Dittrich, and Quiané-Ruiz reporting unexpected high variance in Amazon EC2's performance. The paper describes the results from instance types in different availability zones through a benchmark measuring instance startup, cpu, memory speed, disk I/O, network bandwidth, and S3 access times. The main lessons after analyzing the results of one month of data is that instances allocated to different physical system types and availabilities zones can have large variability in performance for CPU, disk I/O and network performance. In fact, similar observations have been made by other studies and benchmarks (such as this and this). Given these results, how tightly will cloud providers ever be able to specify and guarantee performance-based SLAs?
As a result, members of the HPC community feel that the cloud may not be ready for their scientific applications, which tend to be network and memory bound (for example, take a look at this nice paper for some results). However, recently, Amazon released a new instance type: the cluster computer instances, and results from a benchmark run on 800 such instances was reported to rank within the Top500 list of supercomputers. Will this start a new era for the HPC community to run their applications in the cloud?
The cloud democratizes access to resources; even researchers who do not have access to a supercomputer will be able to afford to rent hundreds of high-performance instances in the cloud and scale their simulations to unprecedented dimensions. I think that this means that also for HPC applications, "performance per dollar" and not only on "performance" as it was traditionally measured will be the an important metric in the future. If you look at the sorting benchmark homepage, you will see different two categories of benchmarks: Daytona, where the sort code needs to be general purpose, and Indy, where the goal is only to sort according to the benchmark specifications. In addition, there exists a benchmark that measures the amount of energy required to sort. Will we see similar developments in the measurement of supercomputing systems?
Monday, March 15, 2010
ICDE 2010 Trip Report
I would like to briefly summarize some of the interesting things I have seen at ICDE. This is clearly a biased view of everything that was there at the conference, so please do take it with a grain of salt!
There were three keynotes plus a banquet presentation. Pekka Kostamaa from Teradata told us about how data warehousing is becoming more complex. In particular, the programming model is less clearly only SQL, as many vendors now support MapReduce interfaces for complex analysis over the data. In addition, star schemas and nightly loads are a thing of the past. They see modern installations exhibiting very complex schemas, which reflect better and more comprehensive data integration of many parts of the business, and a move towards on-line loading and querying, e.g. to enable on-the-spot marketing. Donald Kossmann delivered a keynote on cloud architecture and his experience with his startup 28msec. He pointed out that the current classic web architecture of database servers and application servers with strict, coarse-grained data partitioning of data among database servers does not fully utilize cloud resources. He advocated more of a RAID-like architecture in which application and database server are combined into a single system that spreads data at finer granularity over a set of cloud compute nodes. Jeff Naughton’s keynote was a reflection on the peer review process in the database community. His concerned arguments are that low acceptance rates and a narrow view of the reviewing service are stifling creativity in the community. He presented some challenging suggestions for change, leading to discussion and food for thought. During the banquet, we had an extra presentation, in which Gio Wiederhold argued for the need to instill in the professional practice of software design considerations about cost, expected value, and the economics of software.
There were also many interesting paper presentations:
Hive – A Petabyte Scale Data Warehouse Using Hadoop: The authors present how to build a SQL engine over a Hadoop runtime. I asked some of the authors one-on-one what extra features they would love Hadoop to have in their experience. They pointed out that a MapReduceMerge model would ease things significantly. In addition, they would like more flexibility on when to take checkpoints, not at the end of every MapReduce task as is now the case. Moreover, they would also like to have a feature to pipe map-reduce jobs, i.e., send the output of a reduce step directly to the next mappers.
Usher – Improving Data Quality with Dynamic Forms: This work received the best student paper award. It presented a system to improve data quality by making data entry forms dynamic. The idea is to change the data entry form according to probabilistic model over the questions in the form. So the system may adapt the order the questions are asked, enable real-time feedback about entered values (e.g., via most-likely completions), and re-ask questions that are likely to have been entered incorrectly. One interesting aspect is that the authors in fact deployed their system for the transcription of paper-based patient intake forms in an HIV/AIDS clinic in Tanzania, showing that database research can have direct positive impact in problems faced by developing countries.
Optimizing ETL Workflows for Fault-Tolerance: The paper talks about which strategies to choose for fault-tolerance of complex ETL dataflow graphs. There are three basic alternatives for each job: restart from scratch, checkpointing, and process pairs. The authors design an optimizer that chooses different strategies while balancing the objectives of performance, fault-tolerance, and freshness.
FPGA Acceleration for the Frequent Item Problem: This is a paper exploring a problem we recently heard about at Cornell’s database lunch series. The authors explore different hardware designs starting from the Space-Saving algorithm. They show that a naïve translation of the algorithm into hardware does not obtain significant gains. By exploring pipelining, they show a design that is able to process about three times as many items per second as the best known CPU result.
The Similarity Join Database Operator: This paper shows how to integrate (1-D) similarity joins into a relational DBMS as database operators. Examples here are distance joins or kNN joins. One very interesting aspect of this work is that they present a set of algebraic rewrite rules for similarity joins. The authors are currently working on generalizing their techniques to the multi-dimensional case.
There were a few papers related to recent topics we covered on our classic DB reading group. Related to the Skyline operator paper, we had three presentations in a session dedicated to skyline processing. Another topic we recently read about that warranted a whole session was Top-K processing. This session included the paper that received the best paper award, TASM: Top-k Approximate Subtree Matching. Related to the progress estimation in SQL paper, we had one presentation about progress estimation in MapReduce with an implementation in Hadoop called Parallax.
There was of course a lot of interesting work coming from Cornell as well. Oliver presented PIP, a probabilistic database system for continuous distributions, and Xiaokui (now at Singapore) presented a paper on differential privacy via wavelet transforms. Christoph co-authored a paper on approximate confidence computation in probabilistic databases. Along with co-authors, Johannes gave a tutorial on privacy in data publishing and I presented work on modeling intensional associations in dataspaces.
Please feel free to add to this trip report if you would like to comment on your experience at the conference.
Thursday, October 8, 2009
Spatial indexing demos
Example:
(1) Go to http://www.cs.umd.edu/~hjs/
(2) Select "Online Demos"/"VASCO Spatial Index Demo"
(3) Scroll down; select "Lines"/"PM1 Quadtree demo"
(4) Draw some lines with the mouse
(5) Select "Operations"/"Move vertex"
(6) Move vertices around with the mouse and see how the quadtree changes.
Monday, September 28, 2009
Differential Privacy via Wavelet Transforms
Numerous organizations, like the U.S. Census Bureau or hospitals, maintain large collections of personal information (such as census data or medical records). These data collections are of significant research value, and there is much benefit in making them publicly available. Since the data is sensitive in nature, proper measures must be taken to ensure that its publication does not endanger the privacy of the individuals that contributed the data. A canonical solution to this problem is to modify the data before releasing it to the public such that the modification prevents inference of private information while retaining the statistical characteristics of the data. In our work, we are limiting the disclosure that happens through data publishing by using ε-differential privacy. Informally, ε-differential privacy requires that the published data should be generated using a randomized algorithm G, such that the output of G is not sensitive to any particular tuple in the input, i.e., the output of G should rely mainly on the general properties of the data. This ensures that, by observing the data modified by G, an adversary (who would like to find out information about individuals in the table) is not able to infer much information about any individual tuple, and hence, privacy is preserved.

The simplest method to enforce ε-differential privacy, as proposed by Dwork et al., is to first derive the frequency distribution of the tuples in the input data, and then publish a noisy version of the distribution. For example, given the medical records in Table 1, Dwork et al.'s method first maps the records to the frequency matrix in Table 2, where each entry in the first (second) column stores the number of diabetes (non-diabetes) patients in Table 1 that belong to a specific age group. After that, Dwork et al.'s method adds independent noise with a Θ(1) variance to each entry in Table 2, and then publishes the noisy frequency matrix.
Dwork et al.'s method provides reasonable accuracy for queries about individual entries in the frequency matrix, as it injects only a small amount of noise (with a constant variance) into each entry. Nevertheless, there exist numerous other queries for which Dwork et al.’s method fails to provide useful results. For example, for a count query answered by taking the sum of a constant fraction of the entries in the noisy frequency matrix, the approximate query result has a Θ(m) noise variance, where m denotes the total number of entries in the matrix. Note that m is typically an enormous number, as practical datasets often contain multiple attributes with sizable domains. Hence, a Θ(m) noise variance can render the approximate result meaningless, especially when the actual result of the query is small.
In our ICDE 2010 paper, we remedy the deficiency of Dwork’s method with Privelet (privacy preserving wavelet), a data publishing technique that not only ensures ε-differential privacy, but also provides accurate results for all range-count queries, i.e., count queries where the predicate on each attribute is a range. Privelet guarantees that any range-count query can be answered with a noise whose variance is polylogarithmic in m. This significantly improves over the O(m) noise variance bound provided by Dwork et al.’s method.
In a nutshell, the effectiveness of Privelet results from a novel application of wavelet transforms, a type of linear transformations that has been widely adopted for image processing and approximate query processing. As with Dwork et al.’s method, Privelet preserves privacy by modifying the frequency matrix M of the input data. Instead of injecting noise directly into M, however, Privelet first applies a wavelet transform on M, converting M to another matrix C. Privelet then adds a polylogarithmic noise to each entry in C, and maps C back to a noisy frequency matrix M∗. The matrix M∗ thus obtained has an interesting property: The result of any range-count query on M ∗ can be expressed as a weighted sum of a polylogarithmic number of entries in C. Furthermore, each of these entries contributes at most polylogarithmic noise variance to the weighted sum. Therefore, the variance of the noise in the query result is bounded by a polylogarithm of m.
You can read more about this work in our paper that will appear at ICDE 2010. See you in Long Beach for the talk!
