Thursday, February 10, 2011

CIDR 2011

This year's CIDR was exciting. As might be expected, there was a clear focus on cloud technologies in the program, and cloud middleware and infrastructure systems had a strong offering in particular. Changes to the memory heirarchy effected by Flash and Phase Change Memory (Flash's heir apparent) were also a subject of intense discussion.

Two specific instances of cloud middleware took a rather unusual (and perhaps even a little Matrixy) approach to the architecture of the underlying cloud. MIT's Crowdsourced Databases, and Stanford's proposal for using humans to answer queries both attempt to build a crowdsource operator (an invocation of a service like Amazon's Mechanical Turk) into a traditional relation query optimizer. Aside from the obvious interface challenges, this operator introduces the potential for inaccuracies (c.f., My Database Hates Me) and an actual financial cost into the query optimizer's cost model.

An aspect of cloud computation addressed by many papers was the idea of transactions in the cloud. SAP's Transactional Intent, Microsoft's Deuteronomy, Google's Megastore, and several other presentations throughout the conference noted the difficulties of programming distributed datastores without transactional support and presented suggestions for creating what amounts to transactional infrastructures for cloud programming.

On a related note, a paradigm for distributed programming that appeared throughout many of these papers (and also Saarland's OctopusDB) was that of a log-structured database engine. Rather than the traditional approach of storing the primary copy of a datum sorted, to take advantage of sequential scans the primary datum is simply maintained in a log (in part, taking advantage of the support for fast random access in flash). Furthermore, by ensuring that the elements are sequenced in a canonical order, the log provides an effective synchronization abstraction.

Several presentatons such as MIT's Relational Cloud and Duke's Starfish made efforts towards a more generic cloud infrastructure, reducing the effort required to deploy, maintain, and tune a large scale data-processing system.

Microsoft had a strong hardware-layer offering this year, presenting several papers on Flash/PC memory-based algorithms. They were joined in architectures for Flash memory by a paper out of ITUC/INRIA.

Another idea was present, subtly appearing in a large number of papers: interactive semistructured queries. Instantiations of this idea ranged from interactive question-suggestion interfaces like MPI's IQ and Duke's Citizen Journalism, to typeahead suggestions for queries, forms, etc... like Tsinghua's DBEase, to LAWA's temporal queries over the way-back-machine, to spreadsheet-style relational database engines like MIT's schema-independent DBUI. These projects each attempt to provide an environment for non-technical users to construct queries. In each case, this ends up taking the form of an interactive session, where users refine a query by interactively querying the database schema. DBEase in particular has a pretty snazzy set of demos (http://dbease.cs.tsinghua.edu.cn) that I encourage you to check out.

Yet another hot topic this CIDR was data provenance. A slew of data provenance gathering systems for debugging and data validation were presented by Yahoo, Stanford, UPenn, and others. Of particular note, the UPenn paper makes note of an interesting challenge in data provenance: privacy. Exporting the provenance information of a tuple leaks information about the data that went into the tuple. How can we measure, and more importantly limit the exposure of sensitive information, without eliminating the usefulness of the provenance information.

An entirely new branch of research to me is computational activism. Berkeley's Data in the First Mile, and Duke's Computational Journalism both espouse the need for building good task specific UIs (and the corresponding computational backends) for use in (respectively) third-world countries, and journalism (i.e., fact checking, pattern/outlier discovery, and claim monitoring).

Several other interesting papers branched off into entirely unique directions. Berkeley's CALM quantifies the situations where synchronization primitives are required in a distributed program and provides programming language support for distributed programs along the lines of Evita Raced. A vision paper out of EPFL called for hybrid relational+hdfs database storage architectures, where the curation of flat data files is done on a pay-as-you-go basis: As data is extracted from the data files for use in queries, the resulting tables are stored and indexed for future use. A project out of Microsoft is attempting to unify database access control mechanisms with privacy control mechanisms. Saarland University's OctopusDB is a database engine that attempts to be one-size-fits-all by making a distinction between the conceptual act of storing data and the physical representation of that data on a storage medium.

Finally (and most importantly ;) ), Yanif Ahmad presented DBToaster... The one database compiler to rule them all.

Monday, October 18, 2010

Why and Where Provenance

At DB Breakfast on Thursday October 7th, we continued our exploration of data provenance by reading the highly-cited paper:

Peter Buneman, Sanjeev Khanna, Wang Chiew Tan. Why and Where: A Characterization of Data Provenance. In ICDT, 2001.

This paper looks at problem of determining the provenance of a query answer, i.e. what data in the database "contributes to" the resulting answer. One of the insights of the paper is that the concept of provenance profoundly depends on what one means by "contributes to." Two notions of provenance are introduced, where provenance and why provenance, and shown to have very different behavior.

The distinction between why and where provenance is best seen with an example: Suppose ("Joe", 1234) is an answer to this query.

SELECT name, telephone
FROM employee, dept
WHERE employee.dno = dept.dno AND dept.name = "Computer Science"

The where provenance of 1234 is simply the corresponding phone number in Joe's record in the employee relation. The why provenance includes not only Joe's record in employee, but also the Computer Science record in dept because without that record, Joe's record would not be included in the result.

For why provenance, the paper gives precise characterization based on query *syntax.* Informally, a tuple in the database is part of the why provenance if it is used in some minimal derivation of the answer tuple (the qualifications "some" and "minimal" are important). This notion of provenance has nice properties---for instance, invariance to query rewriting.

For where provenance, the intuition guiding the above approach appears to break down. Examples are shown where two queries are equivalent yet exhibit different where provenance, and they suggest that a syntactic characterization may fail to fully capture where provenance.

Despite the challenges with where provenance, it appears as though subsequent work has developed approaches for where provenance. How were these challenges addressed?

In addition, the why provenance characterization is for SPJU queries only. Extending to include negation and aggregation seems important but quite challenging: the provenance of a tuple may include the entire database! Such an answer, while technically correct, may not be useful to the user. Is there a reasonable notion of weighted provenance, where some input tuples have more influence on the query answer than others?

In addition to where and why provenance, what other kinds of provenance might be useful?

Sunday, September 26, 2010

Is the Cloud Ready for Scientific Computing?

Last Thursday, in the DB breakfast at Cornell, we asked ourselves the question whether the cloud was ready for HPC. We discussed a paper from this year's VLDB conference by Schad, Dittrich, and Quiané-Ruiz reporting unexpected high variance in Amazon EC2's performance. The paper describes the results from instance types in different availability zones through a benchmark measuring instance startup, cpu, memory speed, disk I/O, network bandwidth, and S3 access times. The main lessons after analyzing the results of one month of data is that instances allocated to different physical system types and availabilities zones can have large variability in performance for CPU, disk I/O and network performance. In fact, similar observations have been made by other studies and benchmarks (such as this and this). Given these results, how tightly will cloud providers ever be able to specify and guarantee performance-based SLAs?

As a result, members of the HPC community feel that the cloud may not be ready for their scientific applications, which tend to be network and memory bound (for example, take a look at this nice paper for some results). However, recently, Amazon released a new instance type: the cluster computer instances, and results from a benchmark run on 800 such instances was reported to rank within the Top500 list of supercomputers. Will this start a new era for the HPC community to run their applications in the cloud?

The cloud democratizes access to resources; even researchers who do not have access to a supercomputer will be able to afford to rent hundreds of high-performance instances in the cloud and scale their simulations to unprecedented dimensions. I think that this means that also for HPC applications, "performance per dollar" and not only on "performance" as it was traditionally measured will be the an important metric in the future. If you look at the sorting benchmark homepage, you will see different two categories of benchmarks: Daytona, where the sort code needs to be general purpose, and Indy, where the goal is only to sort according to the benchmark specifications. In addition, there exists a benchmark that measures the amount of energy required to sort. Will we see similar developments in the measurement of supercomputing systems?

Monday, March 15, 2010

ICDE 2010 Trip Report


I would like to briefly summarize some of the interesting things I have seen at ICDE. This is clearly a biased view of everything that was there at the conference, so please do take it with a grain of salt!


There were three keynotes plus a banquet presentation. Pekka Kostamaa from Teradata told us about how data warehousing is becoming more complex. In particular, the programming model is less clearly only SQL, as many vendors now support MapReduce interfaces for complex analysis over the data. In addition, star schemas and nightly loads are a thing of the past. They see modern installations exhibiting very complex schemas, which reflect better and more comprehensive data integration of many parts of the business, and a move towards on-line loading and querying, e.g. to enable on-the-spot marketing. Donald Kossmann delivered a keynote on cloud architecture and his experience with his startup 28msec. He pointed out that the current classic web architecture of database servers and application servers with strict, coarse-grained data partitioning of data among database servers does not fully utilize cloud resources. He advocated more of a RAID-like architecture in which application and database server are combined into a single system that spreads data at finer granularity over a set of cloud compute nodes. Jeff Naughton’s keynote was a reflection on the peer review process in the database community. His concerned arguments are that low acceptance rates and a narrow view of the reviewing service are stifling creativity in the community. He presented some challenging suggestions for change, leading to discussion and food for thought. During the banquet, we had an extra presentation, in which Gio Wiederhold argued for the need to instill in the professional practice of software design considerations about cost, expected value, and the economics of software.


There were also many interesting paper presentations:


  • Hive – A Petabyte Scale Data Warehouse Using Hadoop: The authors present how to build a SQL engine over a Hadoop runtime. I asked some of the authors one-on-one what extra features they would love Hadoop to have in their experience. They pointed out that a MapReduceMerge model would ease things significantly. In addition, they would like more flexibility on when to take checkpoints, not at the end of every MapReduce task as is now the case. Moreover, they would also like to have a feature to pipe map-reduce jobs, i.e., send the output of a reduce step directly to the next mappers.


  • Usher – Improving Data Quality with Dynamic Forms: This work received the best student paper award. It presented a system to improve data quality by making data entry forms dynamic. The idea is to change the data entry form according to probabilistic model over the questions in the form. So the system may adapt the order the questions are asked, enable real-time feedback about entered values (e.g., via most-likely completions), and re-ask questions that are likely to have been entered incorrectly. One interesting aspect is that the authors in fact deployed their system for the transcription of paper-based patient intake forms in an HIV/AIDS clinic in Tanzania, showing that database research can have direct positive impact in problems faced by developing countries.


  • Optimizing ETL Workflows for Fault-Tolerance: The paper talks about which strategies to choose for fault-tolerance of complex ETL dataflow graphs. There are three basic alternatives for each job: restart from scratch, checkpointing, and process pairs. The authors design an optimizer that chooses different strategies while balancing the objectives of performance, fault-tolerance, and freshness.


  • FPGA Acceleration for the Frequent Item Problem: This is a paper exploring a problem we recently heard about at Cornell’s database lunch series. The authors explore different hardware designs starting from the Space-Saving algorithm. They show that a naïve translation of the algorithm into hardware does not obtain significant gains. By exploring pipelining, they show a design that is able to process about three times as many items per second as the best known CPU result.


  • The Similarity Join Database Operator: This paper shows how to integrate (1-D) similarity joins into a relational DBMS as database operators. Examples here are distance joins or kNN joins. One very interesting aspect of this work is that they present a set of algebraic rewrite rules for similarity joins. The authors are currently working on generalizing their techniques to the multi-dimensional case.


  • There were a few papers related to recent topics we covered on our classic DB reading group. Related to the Skyline operator paper, we had three presentations in a session dedicated to skyline processing. Another topic we recently read about that warranted a whole session was Top-K processing. This session included the paper that received the best paper award, TASM: Top-k Approximate Subtree Matching. Related to the progress estimation in SQL paper, we had one presentation about progress estimation in MapReduce with an implementation in Hadoop called Parallax.


  • There was of course a lot of interesting work coming from Cornell as well. Oliver presented PIP, a probabilistic database system for continuous distributions, and Xiaokui (now at Singapore) presented a paper on differential privacy via wavelet transforms. Christoph co-authored a paper on approximate confidence computation in probabilistic databases. Along with co-authors, Johannes gave a tutorial on privacy in data publishing and I presented work on modeling intensional associations in dataspaces.



Please feel free to add to this trip report if you would like to comment on your experience at the conference.

Thursday, October 8, 2009

Spatial indexing demos

On his homepage, Hanan Samet provides applets for visually learning how various spatial data structures work. They are great way to quickly get an intuition of how the various algorithms work. The "move" functionality is a particularly nice and interactive way of seeing how changes to the data affect the data structures.

Example:

(1) Go to http://www.cs.umd.edu/~hjs/

(2) Select "Online Demos"/"VASCO Spatial Index Demo"

(3) Scroll down; select "Lines"/"PM1 Quadtree demo"

(4) Draw some lines with the mouse

(5) Select "Operations"/"Move vertex"

(6) Move vertices around with the mouse and see how the quadtree changes.

Monday, September 28, 2009

Differential Privacy via Wavelet Transforms

The following is a guest post by Xiaokui Xiao, Assistant Professor in the Division of Information Systems at Nanyang Technological University. Xiaokui was a postdoc in the Cornell Big Red Data Group from 2008 to 2009.

Numerous organizations, like the U.S. Census Bureau or hospitals, maintain large collections of personal information (such as census data or medical records). These data collections are of significant research value, and there is much benefit in making them publicly available. Since the data is sensitive in nature, proper measures must be taken to ensure that its publication does not endanger the privacy of the individuals that contributed the data. A canonical solution to this problem is to modify the data before releasing it to the public such that the modification prevents inference of private information while retaining the statistical characteristics of the data. In our work, we are limiting the disclosure that happens through data publishing by using ε-differential privacy. Informally, ε-differential privacy requires that the published data should be generated using a randomized algorithm G, such that the output of G is not sensitive to any particular tuple in the input, i.e., the output of G should rely mainly on the general properties of the data. This ensures that, by observing the data modified by G, an adversary (who would like to find out information about individuals in the table) is not able to infer much information about any individual tuple, and hence, privacy is preserved.


The simplest method to enforce ε-differential privacy, as proposed by Dwork et al., is to first derive the frequency distribution of the tuples in the input data, and then publish a noisy version of the distribution. For example, given the medical records in Table 1, Dwork et al.'s method first maps the records to the frequency matrix in Table 2, where each entry in the first (second) column stores the number of diabetes (non-diabetes) patients in Table 1 that belong to a specific age group. After that, Dwork et al.'s method adds independent noise with a Θ(1) variance to each entry in Table 2, and then publishes the noisy frequency matrix.

Dwork et al.'s method provides reasonable accuracy for queries about individual entries in the frequency matrix, as it injects only a small amount of noise (with a constant variance) into each entry. Nevertheless, there exist numerous other queries for which Dwork et al.’s method fails to provide useful results. For example, for a count query answered by taking the sum of a constant fraction of the entries in the noisy frequency matrix, the approximate query result has a Θ(m) noise variance, where m denotes the total number of entries in the matrix. Note that m is typically an enormous number, as practical datasets often contain multiple attributes with sizable domains. Hence, a Θ(m) noise variance can render the approximate result meaningless, especially when the actual result of the query is small.

In our ICDE 2010 paper, we remedy the deficiency of Dwork’s method with Privelet (privacy preserving wavelet), a data publishing technique that not only ensures ε-differential privacy, but also provides accurate results for all range-count queries, i.e., count queries where the predicate on each attribute is a range. Privelet guarantees that any range-count query can be answered with a noise whose variance is polylogarithmic in m. This significantly improves over the O(m) noise variance bound provided by Dwork et al.’s method.

In a nutshell, the effectiveness of Privelet results from a novel application of wavelet transforms, a type of linear transformations that has been widely adopted for image processing and approximate query processing. As with Dwork et al.’s method, Privelet preserves privacy by modifying the frequency matrix M of the input data. Instead of injecting noise directly into M, however, Privelet first applies a wavelet transform on M, converting M to another matrix C. Privelet then adds a polylogarithmic noise to each entry in C, and maps C back to a noisy frequency matrix M∗. The matrix M∗ thus obtained has an interesting property: The result of any range-count query on M ∗ can be expressed as a weighted sum of a polylogarithmic number of entries in C. Furthermore, each of these entries contributes at most polylogarithmic noise variance to the weighted sum. Therefore, the variance of the noise in the query result is bounded by a polylogarithm of m.

You can read more about this work in our paper that will appear at ICDE 2010. See you in Long Beach for the talk!