The Big Red Data Blog: February 2011

This year's CIDR was exciting. As might be expected, the program had a clear focus on cloud technologies, with cloud middleware and infrastructure systems making an especially strong showing. Changes to the memory hierarchy brought about by Flash and Phase Change Memory (Flash's heir apparent) were also a subject of intense discussion.

Two specific instances of cloud middleware took a rather unusual (and perhaps even a little Matrixy) approach to the architecture of the underlying cloud. MIT's Crowdsourced Databases and Stanford's proposal for using humans to answer queries both attempt to build a crowdsource operator (an invocation of a service like Amazon's Mechanical Turk) into a traditional relational query optimizer. Aside from the obvious interface challenges, this operator introduces two new terms into the query optimizer's cost model: the potential for inaccurate answers (cf. My Database Hates Me) and an actual financial cost.
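To make the cost-model point concrete, here is a minimal sketch of how an optimizer might trade money against accuracy for a crowdsource operator. It is not taken from either paper: the function names, the majority-vote scheme, and the assumption that workers answer independently with the same accuracy are all mine, chosen only for illustration.

```python
from math import comb
from typing import Optional, Tuple


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n workers is correct.

    Assumes (hypothetically) that each worker is independently correct
    with probability p, and that n is odd so there are no ties.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def crowd_cost(tuples: int, n: int, price: float) -> float:
    """Dollars spent asking n workers about each of `tuples` tuples."""
    return tuples * n * price


def cheapest_redundancy(p: float, target: float, tuples: int, price: float,
                        budget: float, max_workers: int = 15
                        ) -> Optional[Tuple[int, float]]:
    """Smallest odd number of workers per tuple that meets the accuracy
    target within budget, as (workers, dollars); None if no such plan."""
    for n in range(1, max_workers + 1, 2):
        cost = crowd_cost(tuples, n, price)
        if cost > budget:
            return None  # cost only grows with n, so give up
        if majority_accuracy(p, n) >= target:
            return n, cost
    return None
```

With workers who are right 80% of the time, one cent per answer, and 100 tuples, hitting 90% accuracy takes five workers per tuple (about five dollars); with a four-dollar budget the same call returns None, which is exactly the kind of decision a purely time-based cost model never has to make.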

An aspect of cloud computation addressed by many papers was the idea of transactions in the cloud. SAP's Transactional Intent, Microsoft's Deuteronomy, Google's Megastore, and several other presentations throughout the conference noted the difficulties of programming distributed datastores without transactional support and presented suggestions for creating what amounts to transactional infrastructures for cloud programming.

On a related note, a paradigm for distributed programming that appeared throughout many of these papers (and also Saarland's OctopusDB) was that of a log-structured database engine. The traditional approach stores the primary copy of a datum in sorted order to take advantage of sequential scans. Here, the primary copy is instead simply maintained in a log (in part, taking advantage of the fast random access that flash supports). Furthermore, because the log's entries are sequenced in a canonical order, the log provides an effective synchronization abstraction.
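As a rough illustration of the log-structured idea (a toy of my own, not the design of any of the systems above), the sketch below keeps the log as the primary copy, layers an index on top for random access, and uses the log's sequence numbers to bring a replica up to date. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Entry = Tuple[int, str, Any]  # (sequence number, key, value)


@dataclass
class LogStore:
    """Toy log-structured store: the append-only log is the primary copy."""
    log: List[Entry] = field(default_factory=list)
    index: Dict[str, int] = field(default_factory=dict)  # key -> latest log position

    def put(self, key: str, value: Any) -> int:
        """Append a write to the log and return its sequence number."""
        seq = len(self.log)
        self.log.append((seq, key, value))
        self.index[key] = seq
        return seq

    def get(self, key: str) -> Optional[Any]:
        """Random access through the index; nothing is stored sorted."""
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][2]

    def entries_since(self, seq: int) -> List[Entry]:
        """Every entry with sequence number >= seq, in log order."""
        return self.log[seq:]

    def apply(self, entries: List[Entry]) -> None:
        """Replay another store's entries, insisting on the canonical order."""
        for seq, key, value in entries:
            if seq != len(self.log):
                raise ValueError(f"expected entry {len(self.log)}, got {seq}")
            self.log.append((seq, key, value))
            self.index[key] = seq
```

A replica that has seen the first n entries catches up by calling `replica.apply(primary.entries_since(n))`. Because both sides agree on the order of entries, no further coordination is needed to agree on the resulting state.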

Several presentations, such as MIT's Relational Cloud and Duke's Starfish, made efforts towards a more generic cloud infrastructure, reducing the effort required to deploy, maintain, and tune a large-scale data-processing system.

Microsoft had a strong hardware-layer offering this year, presenting several papers on algorithms for Flash and Phase Change Memory. They were joined on the Flash architecture side by a paper out of ITUC/INRIA.

Another idea appeared more subtly, but in a large number of papers: interactive semistructured queries. Instantiations of this idea ranged from interactive question-suggestion interfaces like MPI's IQ and Duke's Citizen Journalism, to typeahead suggestions for queries, forms, and the like in Tsinghua's DBEase, to LAWA's temporal queries over the Wayback Machine, to spreadsheet-style relational database engines like MIT's schema-independent DBUI. Each of these projects attempts to provide an environment in which non-technical users can construct queries. In each case, this ends up taking the form of an interactive session, where users refine a query by interactively querying the database schema. DBEase in particular has a pretty snazzy set of demos (http://dbease.cs.tsinghua.edu.cn) that I encourage you to check out.

Yet another hot topic this CIDR was data provenance. A slew of provenance-gathering systems for debugging and data validation were presented by Yahoo, Stanford, UPenn, and others. Of particular note, the UPenn paper raises an interesting challenge in data provenance: privacy. Exporting the provenance information of a tuple leaks information about the data that went into the tuple. How can we measure, and more importantly limit, the exposure of sensitive information without eliminating the usefulness of the provenance information?

A branch of research that was entirely new to me is computational activism. Berkeley's Data in the First Mile and Duke's Computational Journalism both espouse the need for building good task-specific UIs (and the corresponding computational backends) for use in, respectively, third-world countries and journalism (i.e., fact checking, pattern/outlier discovery, and claim monitoring).

Several other interesting papers branched off in entirely unique directions. Berkeley's CALM quantifies the situations where synchronization primitives are required in a distributed program and provides programming language support for distributed programs along the lines of Evita Raced. A vision paper out of EPFL called for hybrid relational+HDFS database storage architectures, where the curation of flat data files is done on a pay-as-you-go basis: as data is extracted from the data files for use in queries, the resulting tables are stored and indexed for future use. A project out of Microsoft is attempting to unify database access control mechanisms with privacy control mechanisms. Saarland University's OctopusDB is a database engine that attempts to be one-size-fits-all by making a distinction between the conceptual act of storing data and the physical representation of that data on a storage medium.
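The pay-as-you-go curation idea from the EPFL vision paper can be illustrated with a small toy. This is only my reading of the idea, not the paper's architecture: the `LazyTable` class, its in-memory CSV "file", and the `file_scans` counter are assumptions made up for the sketch. The flat file is parsed only when a query first needs a column, and the index built at that point is kept for later queries.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


class LazyTable:
    """Toy pay-as-you-go store over a flat CSV file held in a string."""

    def __init__(self, raw_csv: str) -> None:
        self.raw_csv = raw_csv
        # column name -> {value -> matching rows}; filled in on demand
        self.indexes: Dict[str, Dict[str, List[dict]]] = {}
        self.file_scans = 0  # how many times the flat file was parsed

    def _scan(self) -> List[dict]:
        """Parse the whole flat file (the expensive step we want to avoid)."""
        self.file_scans += 1
        return list(csv.DictReader(io.StringIO(self.raw_csv)))

    def lookup(self, column: str, value: str) -> List[dict]:
        """Rows where `column` equals `value`, indexing the column on first use."""
        if column not in self.indexes:
            index: Dict[str, List[dict]] = defaultdict(list)
            for row in self._scan():
                index[row[column]].append(row)
            self.indexes[column] = dict(index)
        return self.indexes[column].get(value, [])
```

The first lookup on a column pays for a full scan of the file; repeated lookups on that column are answered from the stored index, and columns nobody queries are never indexed at all.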

Finally (and most importantly ;) ), Yanif Ahmad presented DBToaster... The one database compiler to rule them all.

Thursday, February 10, 2011

CIDR 2011

Cornell University Database Group