Monday, October 18, 2010

Why and Where Provenance

At DB Breakfast on Thursday October 7th, we continued our exploration of data provenance by reading the highly-cited paper:

Peter Buneman, Sanjeev Khanna, Wang Chiew Tan. Why and Where: A Characterization of Data Provenance. In ICDT, 2001.

This paper looks at problem of determining the provenance of a query answer, i.e. what data in the database "contributes to" the resulting answer. One of the insights of the paper is that the concept of provenance profoundly depends on what one means by "contributes to." Two notions of provenance are introduced, where provenance and why provenance, and shown to have very different behavior.

The distinction between why and where provenance is best seen with an example: Suppose ("Joe", 1234) is an answer to this query.

SELECT name, telephone
FROM employee, dept
WHERE employee.dno = dept.dno AND = "Computer Science"

The where provenance of 1234 is simply the corresponding phone number in Joe's record in the employee relation. The why provenance includes not only Joe's record in employee, but also the Computer Science record in dept because without that record, Joe's record would not be included in the result.

For why provenance, the paper gives precise characterization based on query *syntax.* Informally, a tuple in the database is part of the why provenance if it is used in some minimal derivation of the answer tuple (the qualifications "some" and "minimal" are important). This notion of provenance has nice properties---for instance, invariance to query rewriting.

For where provenance, the intuition guiding the above approach appears to break down. Examples are shown where two queries are equivalent yet exhibit different where provenance, and they suggest that a syntactic characterization may fail to fully capture where provenance.

Despite the challenges with where provenance, it appears as though subsequent work has developed approaches for where provenance. How were these challenges addressed?

In addition, the why provenance characterization is for SPJU queries only. Extending to include negation and aggregation seems important but quite challenging: the provenance of a tuple may include the entire database! Such an answer, while technically correct, may not be useful to the user. Is there a reasonable notion of weighted provenance, where some input tuples have more influence on the query answer than others?

In addition to where and why provenance, what other kinds of provenance might be useful?