Monday, March 15, 2010

ICDE 2010 Trip Report

I would like to briefly summarize some of the interesting things I have seen at ICDE. This is clearly a biased view of everything that was there at the conference, so please do take it with a grain of salt!

There were three keynotes plus a banquet presentation.

Pekka Kostamaa from Teradata told us how data warehousing is becoming more complex. In particular, the programming model is no longer exclusively SQL, as many vendors now support MapReduce interfaces for complex analysis over the data. In addition, star schemas and nightly loads are a thing of the past: they see modern installations exhibiting very complex schemas, which reflect better and more comprehensive data integration across many parts of the business, and a move towards on-line loading and querying, e.g., to enable on-the-spot marketing.

Donald Kossmann delivered a keynote on cloud architecture and his experience with his startup 28msec. He pointed out that the classic web architecture of database servers and application servers, with strict, coarse-grained partitioning of data among database servers, does not fully utilize cloud resources. He advocated a more RAID-like architecture in which application and database server are combined into a single system that spreads data at finer granularity over a set of cloud compute nodes.

Jeff Naughton’s keynote was a reflection on the peer-review process in the database community. He argued, with evident concern, that low acceptance rates and a narrow view of reviewing service are stifling creativity in the community, and he presented some challenging suggestions for change, leading to discussion and food for thought.

During the banquet, we had an extra presentation in which Gio Wiederhold argued for the need to instill in the professional practice of software design considerations about cost, expected value, and the economics of software.

There were also many interesting paper presentations:

  • Hive – A Petabyte Scale Data Warehouse Using Hadoop: The authors present how to build a SQL engine on top of a Hadoop runtime. I asked some of the authors one-on-one what extra Hadoop features they would love to have, based on their experience. They pointed out that a MapReduceMerge model would ease things significantly. In addition, they would like more flexibility on when to take checkpoints, instead of at the end of every MapReduce task, as is the case now. Finally, they would like the ability to pipeline map-reduce jobs, i.e., to send the output of a reduce step directly to the next mappers.
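To make the pipelining wish concrete, here is a toy sketch (not Hadoop or Hive code; all function names are my own illustrations) contrasting today's materialized job chaining with the pipelined chaining the Hive authors asked for:

```python
# Toy sketch: chained MapReduce jobs, materialized vs. pipelined.
# This is an illustration of the idea only, not Hadoop's actual API.

def word_map(line):
    for w in line.split():
        yield (w, 1)

def count_reduce(key, values):
    yield (key, sum(values))

def run_job(mapper, reducer, records):
    """Run one MapReduce round eagerly and fully materialize its output,
    as Hadoop does between chained jobs (write to HDFS, then re-read)."""
    groups = {}
    for rec in records:
        for k, v in mapper(rec):
            groups.setdefault(k, []).append(v)
    out = []
    for k, vs in sorted(groups.items()):
        out.extend(reducer(k, vs))
    return out  # the next job cannot start before this list is complete

def run_job_pipelined(mapper, reducer, records):
    """Same logic, but the reduce output is a generator: each pair can be
    handed to the next job's mappers as soon as it is produced."""
    groups = {}
    for rec in records:
        for k, v in mapper(rec):
            groups.setdefault(k, []).append(v)
    for k, vs in sorted(groups.items()):
        yield from reducer(k, vs)

lines = ["a b a", "b c"]
counts = run_job(word_map, count_reduce, lines)
# A second "job" consuming the first job's reduce output directly:
upper = ((w.upper(), n) for w, n in run_job_pipelined(word_map, count_reduce, lines))
```

The pipelined variant avoids the intermediate materialization between jobs, which is exactly the cost the Hive team wanted to skip.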

  • Usher – Improving Data Quality with Dynamic Forms: This work received the best student paper award. It presented a system to improve data quality by making data entry forms dynamic. The idea is to adapt the data entry form according to a probabilistic model over the questions in the form. The system may thus adapt the order in which questions are asked, enable real-time feedback about entered values (e.g., via most-likely completions), and re-ask questions whose answers are likely to have been entered incorrectly. One interesting aspect is that the authors actually deployed their system for the transcription of paper-based patient intake forms in an HIV/AIDS clinic in Tanzania, showing that database research can have a direct positive impact on problems faced by developing countries.
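A minimal sketch of the idea as I understood it (this is my own reading, not the authors' code; the questions, history, and thresholds are made up): a joint distribution estimated from past submissions drives both question ordering and error flagging.

```python
# Usher-style dynamic forms, toy version: use historical submissions as
# a crude joint distribution over answers. All data here is invented.
from collections import Counter
from math import log2

history = [
    {"region": "north", "clinic": "A"},
    {"region": "north", "clinic": "A"},
    {"region": "north", "clinic": "B"},
    {"region": "south", "clinic": "C"},
]

def entropy(question, answered):
    """Remaining uncertainty about `question`, given answers so far."""
    rows = [r for r in history if all(r[q] == a for q, a in answered.items())]
    counts = Counter(r[question] for r in rows)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def next_question(pending, answered):
    """One plausible ordering policy: ask the most uncertain question next."""
    return max(pending, key=lambda q: entropy(q, answered))

def is_suspicious(question, answer, answered, threshold=0.2):
    """Flag answers that are rare in this context, as candidates to re-ask."""
    rows = [r for r in history if all(r[q] == a for q, a in answered.items())]
    p = sum(r[question] == answer for r in rows) / len(rows)
    return p < threshold
```

Most-likely completions fall out of the same model: the mode of the conditional distribution that `entropy` computes over.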

  • Optimizing ETL Workflows for Fault-Tolerance: The paper examines which fault-tolerance strategies to choose for complex ETL dataflow graphs. There are three basic alternatives for each job: restart from scratch, checkpointing, and process pairs. The authors design an optimizer that chooses among these strategies per job, balancing the objectives of performance, fault-tolerance, and freshness.
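A back-of-envelope sketch of the optimization problem as I understood it (not the authors' cost model; every formula and number below is an illustrative assumption): pick, per job, the strategy with the lowest expected cost.

```python
# Toy per-job strategy chooser for ETL fault-tolerance. The cost
# formulas are crude stand-ins, invented for illustration only.

def expected_cost(runtime, p_fail, checkpoint_overhead, redundancy_factor):
    """Expected completion time per strategy, under a single-failure model."""
    return {
        # Restart from scratch: with probability p_fail, pay roughly the
        # whole runtime again.
        "restart": runtime * (1 + p_fail),
        # Checkpointing: always pay the overhead; a failure only redoes
        # part of an interval (assume ~runtime/4 on average).
        "checkpoint": runtime + checkpoint_overhead + p_fail * runtime / 4,
        # Process pairs: redundant execution masks failures but costs
        # more resources on every run.
        "process_pairs": runtime * redundancy_factor,
    }

def choose_strategy(job):
    costs = expected_cost(**job)
    return min(costs, key=costs.get)

etl_jobs = {
    "extract":  dict(runtime=10,  p_fail=0.01, checkpoint_overhead=3, redundancy_factor=1.8),
    "big_join": dict(runtime=100, p_fail=0.30, checkpoint_overhead=3, redundancy_factor=1.8),
}
plan = {name: choose_strategy(job) for name, job in etl_jobs.items()}
```

Even this toy version shows the paper's point: cheap, reliable jobs favor plain restart, while long, failure-prone jobs justify checkpointing overhead.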

  • FPGA Acceleration for the Frequent Item Problem: This is a paper exploring a problem we recently heard about at Cornell’s database lunch series. The authors explore different hardware designs starting from the Space-Saving algorithm. They show that a naïve translation of the algorithm into hardware does not obtain significant gains. By exploring pipelining, they show a design that is able to process about three times as many items per second as the best known CPU result.
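For reference, the Space-Saving algorithm the hardware designs start from is simple to state in software (the paper's contribution is the FPGA design, not this sketch): keep at most k counters, and when a new item arrives with the table full, evict the minimum counter and inherit its count.

```python
# Space-Saving frequent-item summary (software sketch of the algorithm
# the FPGA designs are based on).

def space_saving(stream, k):
    counters = {}  # item -> estimated count, at most k entries
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the current minimum; the newcomer inherits its count,
            # so estimates over-approximate true frequencies.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

summary = space_saving(["a", "b", "a", "c", "a", "b", "d"], k=2)
```

An invariant worth noting: the counter values always sum to the number of items seen, which is what makes the error bounds of the algorithm easy to state.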

  • The Similarity Join Database Operator: This paper shows how to integrate (1-D) similarity joins into a relational DBMS as first-class database operators. Examples are distance joins and kNN joins. One very interesting aspect of this work is that the authors present a set of algebraic rewrite rules for similarity joins. They are currently working on generalizing their techniques to the multi-dimensional case.
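To illustrate one flavor of similarity join from the paper, here is a 1-D distance join sketched as a sort-merge-style band join (my own sketch, not the paper's operator implementation): pair values from R and S whose distance is at most eps.

```python
# 1-D distance join: emit (r, s) pairs with |r - s| <= eps.
# Sorting S and probing a band replaces the naive nested loop.
from bisect import bisect_left, bisect_right

def distance_join(r_vals, s_vals, eps):
    s_sorted = sorted(s_vals)
    out = []
    for r in r_vals:
        lo = bisect_left(s_sorted, r - eps)   # first s >= r - eps
        hi = bisect_right(s_sorted, r + eps)  # past the last s <= r + eps
        out.extend((r, s) for s in s_sorted[lo:hi])
    return out

pairs = distance_join([1.0, 5.0], [1.2, 3.0, 5.1], eps=0.5)
```

The appeal of making this a proper operator, as the paper argues, is that the optimizer can then apply algebraic rewrite rules to it like any other join.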

  • There were a few papers related to topics we recently covered in our classic-DB reading group. Related to the Skyline operator paper, there were three presentations in a session dedicated to skyline processing. Another topic we recently read about, Top-K processing, also warranted a whole session; it included the paper that received the best paper award, TASM: Top-k Approximate Subtree Matching. Related to the paper on progress estimation in SQL, there was one presentation about progress estimation in MapReduce, with a Hadoop implementation called Parallax.

  • There was of course a lot of interesting work coming from Cornell as well. Oliver presented PIP, a probabilistic database system for continuous distributions, and Xiaokui (now at Singapore) presented a paper on differential privacy via wavelet transforms. Christoph co-authored a paper on approximate confidence computation in probabilistic databases. Along with co-authors, Johannes gave a tutorial on privacy in data publishing and I presented work on modeling intensional associations in dataspaces.

Please feel free to add to this trip report if you would like to comment on your experience at the conference.