Monday, October 18, 2010

Why and Where Provenance

At DB Breakfast on Thursday October 7th, we continued our exploration of data provenance by reading the highly-cited paper:

Peter Buneman, Sanjeev Khanna, Wang Chiew Tan. Why and Where: A Characterization of Data Provenance. In ICDT, 2001.

This paper looks at problem of determining the provenance of a query answer, i.e. what data in the database "contributes to" the resulting answer. One of the insights of the paper is that the concept of provenance profoundly depends on what one means by "contributes to." Two notions of provenance are introduced, where provenance and why provenance, and shown to have very different behavior.

The distinction between why and where provenance is best seen with an example: Suppose ("Joe", 1234) is an answer to this query.

SELECT name, telephone
FROM employee, dept
WHERE employee.dno = dept.dno AND = "Computer Science"

The where provenance of 1234 is simply the corresponding phone number in Joe's record in the employee relation. The why provenance includes not only Joe's record in employee, but also the Computer Science record in dept because without that record, Joe's record would not be included in the result.

For why provenance, the paper gives precise characterization based on query *syntax.* Informally, a tuple in the database is part of the why provenance if it is used in some minimal derivation of the answer tuple (the qualifications "some" and "minimal" are important). This notion of provenance has nice properties---for instance, invariance to query rewriting.

For where provenance, the intuition guiding the above approach appears to break down. Examples are shown where two queries are equivalent yet exhibit different where provenance, and they suggest that a syntactic characterization may fail to fully capture where provenance.

Despite the challenges with where provenance, it appears as though subsequent work has developed approaches for where provenance. How were these challenges addressed?

In addition, the why provenance characterization is for SPJU queries only. Extending to include negation and aggregation seems important but quite challenging: the provenance of a tuple may include the entire database! Such an answer, while technically correct, may not be useful to the user. Is there a reasonable notion of weighted provenance, where some input tuples have more influence on the query answer than others?

In addition to where and why provenance, what other kinds of provenance might be useful?

Sunday, September 26, 2010

Is the Cloud Ready for Scientific Computing?

Last Thursday, in the DB breakfast at Cornell, we asked ourselves the question whether the cloud was ready for HPC. We discussed a paper from this year's VLDB conference by Schad, Dittrich, and Quiané-Ruiz reporting unexpected high variance in Amazon EC2's performance. The paper describes the results from instance types in different availability zones through a benchmark measuring instance startup, cpu, memory speed, disk I/O, network bandwidth, and S3 access times. The main lessons after analyzing the results of one month of data is that instances allocated to different physical system types and availabilities zones can have large variability in performance for CPU, disk I/O and network performance. In fact, similar observations have been made by other studies and benchmarks (such as this and this). Given these results, how tightly will cloud providers ever be able to specify and guarantee performance-based SLAs?

As a result, members of the HPC community feel that the cloud may not be ready for their scientific applications, which tend to be network and memory bound (for example, take a look at this nice paper for some results). However, recently, Amazon released a new instance type: the cluster computer instances, and results from a benchmark run on 800 such instances was reported to rank within the Top500 list of supercomputers. Will this start a new era for the HPC community to run their applications in the cloud?

The cloud democratizes access to resources; even researchers who do not have access to a supercomputer will be able to afford to rent hundreds of high-performance instances in the cloud and scale their simulations to unprecedented dimensions. I think that this means that also for HPC applications, "performance per dollar" and not only on "performance" as it was traditionally measured will be the an important metric in the future. If you look at the sorting benchmark homepage, you will see different two categories of benchmarks: Daytona, where the sort code needs to be general purpose, and Indy, where the goal is only to sort according to the benchmark specifications. In addition, there exists a benchmark that measures the amount of energy required to sort. Will we see similar developments in the measurement of supercomputing systems?

Monday, March 15, 2010

ICDE 2010 Trip Report

I would like to briefly summarize some of the interesting things I have seen at ICDE. This is clearly a biased view of everything that was there at the conference, so please do take it with a grain of salt!

There were three keynotes plus a banquet presentation. Pekka Kostamaa from Teradata told us about how data warehousing is becoming more complex. In particular, the programming model is less clearly only SQL, as many vendors now support MapReduce interfaces for complex analysis over the data. In addition, star schemas and nightly loads are a thing of the past. They see modern installations exhibiting very complex schemas, which reflect better and more comprehensive data integration of many parts of the business, and a move towards on-line loading and querying, e.g. to enable on-the-spot marketing. Donald Kossmann delivered a keynote on cloud architecture and his experience with his startup 28msec. He pointed out that the current classic web architecture of database servers and application servers with strict, coarse-grained data partitioning of data among database servers does not fully utilize cloud resources. He advocated more of a RAID-like architecture in which application and database server are combined into a single system that spreads data at finer granularity over a set of cloud compute nodes. Jeff Naughton’s keynote was a reflection on the peer review process in the database community. His concerned arguments are that low acceptance rates and a narrow view of the reviewing service are stifling creativity in the community. He presented some challenging suggestions for change, leading to discussion and food for thought. During the banquet, we had an extra presentation, in which Gio Wiederhold argued for the need to instill in the professional practice of software design considerations about cost, expected value, and the economics of software.

There were also many interesting paper presentations:

  • Hive – A Petabyte Scale Data Warehouse Using Hadoop: The authors present how to build a SQL engine over a Hadoop runtime. I asked some of the authors one-on-one what extra features they would love Hadoop to have in their experience. They pointed out that a MapReduceMerge model would ease things significantly. In addition, they would like more flexibility on when to take checkpoints, not at the end of every MapReduce task as is now the case. Moreover, they would also like to have a feature to pipe map-reduce jobs, i.e., send the output of a reduce step directly to the next mappers.

  • Usher – Improving Data Quality with Dynamic Forms: This work received the best student paper award. It presented a system to improve data quality by making data entry forms dynamic. The idea is to change the data entry form according to probabilistic model over the questions in the form. So the system may adapt the order the questions are asked, enable real-time feedback about entered values (e.g., via most-likely completions), and re-ask questions that are likely to have been entered incorrectly. One interesting aspect is that the authors in fact deployed their system for the transcription of paper-based patient intake forms in an HIV/AIDS clinic in Tanzania, showing that database research can have direct positive impact in problems faced by developing countries.

  • Optimizing ETL Workflows for Fault-Tolerance: The paper talks about which strategies to choose for fault-tolerance of complex ETL dataflow graphs. There are three basic alternatives for each job: restart from scratch, checkpointing, and process pairs. The authors design an optimizer that chooses different strategies while balancing the objectives of performance, fault-tolerance, and freshness.

  • FPGA Acceleration for the Frequent Item Problem: This is a paper exploring a problem we recently heard about at Cornell’s database lunch series. The authors explore different hardware designs starting from the Space-Saving algorithm. They show that a naïve translation of the algorithm into hardware does not obtain significant gains. By exploring pipelining, they show a design that is able to process about three times as many items per second as the best known CPU result.

  • The Similarity Join Database Operator: This paper shows how to integrate (1-D) similarity joins into a relational DBMS as database operators. Examples here are distance joins or kNN joins. One very interesting aspect of this work is that they present a set of algebraic rewrite rules for similarity joins. The authors are currently working on generalizing their techniques to the multi-dimensional case.

  • There were a few papers related to recent topics we covered on our classic DB reading group. Related to the Skyline operator paper, we had three presentations in a session dedicated to skyline processing. Another topic we recently read about that warranted a whole session was Top-K processing. This session included the paper that received the best paper award, TASM: Top-k Approximate Subtree Matching. Related to the progress estimation in SQL paper, we had one presentation about progress estimation in MapReduce with an implementation in Hadoop called Parallax.

  • There was of course a lot of interesting work coming from Cornell as well. Oliver presented PIP, a probabilistic database system for continuous distributions, and Xiaokui (now at Singapore) presented a paper on differential privacy via wavelet transforms. Christoph co-authored a paper on approximate confidence computation in probabilistic databases. Along with co-authors, Johannes gave a tutorial on privacy in data publishing and I presented work on modeling intensional associations in dataspaces.

Please feel free to add to this trip report if you would like to comment on your experience at the conference.