Thursday, October 8, 2009

Spatial indexing demos

On his homepage, Hanan Samet provides applets for visually learning how various spatial data structures work. They are a great way to quickly build an intuition for how the various algorithms work. The "move" functionality is a particularly nice and interactive way of seeing how changes to the data affect the data structures.

Example:

(1) Go to http://www.cs.umd.edu/~hjs/

(2) Select "Online Demos"/"VASCO Spatial Index Demo"

(3) Scroll down; select "Lines"/"PM1 Quadtree demo"

(4) Draw some lines with the mouse

(5) Select "Operations"/"Move vertex"

(6) Move vertices around with the mouse and see how the quadtree changes.

Monday, September 28, 2009

Differential Privacy via Wavelet Transforms

The following is a guest post by Xiaokui Xiao, Assistant Professor in the Division of Information Systems at Nanyang Technological University. Xiaokui was a postdoc in the Cornell Big Red Data Group from 2008 to 2009.

Numerous organizations, like the U.S. Census Bureau or hospitals, maintain large collections of personal information (such as census data or medical records). These data collections are of significant research value, and there is much benefit in making them publicly available. Since the data is sensitive in nature, proper measures must be taken to ensure that its publication does not endanger the privacy of the individuals that contributed the data. A canonical solution to this problem is to modify the data before releasing it to the public such that the modification prevents inference of private information while retaining the statistical characteristics of the data. In our work, we are limiting the disclosure that happens through data publishing by using ε-differential privacy. Informally, ε-differential privacy requires that the published data should be generated using a randomized algorithm G, such that the output of G is not sensitive to any particular tuple in the input, i.e., the output of G should rely mainly on the general properties of the data. This ensures that, by observing the data modified by G, an adversary (who would like to find out information about individuals in the table) is not able to infer much information about any individual tuple, and hence, privacy is preserved.


The simplest method to enforce ε-differential privacy, as proposed by Dwork et al., is to first derive the frequency distribution of the tuples in the input data, and then publish a noisy version of the distribution. For example, given the medical records in Table 1, Dwork et al.'s method first maps the records to the frequency matrix in Table 2, where each entry in the first (second) column stores the number of diabetes (non-diabetes) patients in Table 1 that belong to a specific age group. After that, Dwork et al.'s method adds independent noise with a Θ(1) variance to each entry in Table 2, and then publishes the noisy frequency matrix.
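
For readers who like code, here is a minimal sketch of this baseline in Python (the data and ε are made up for illustration; the Laplace mechanism with scale 1/ε is the standard instantiation, giving each entry a constant noise variance of 2/ε²):

    import numpy as np

    def publish_noisy_counts(freq_matrix, epsilon):
        """Dwork et al.-style publication: add independent Laplace noise
        of scale 1/epsilon to every entry of the frequency matrix.
        (Each tuple affects exactly one count, so the sensitivity is 1.)"""
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon,
                                  size=freq_matrix.shape)
        return freq_matrix + noise

    # Example: 4 age groups x 2 disease columns, epsilon = 0.1.
    # Each entry carries noise with variance 2/epsilon**2 -- constant in m --
    # but a sum over k entries accumulates variance k * 2/epsilon**2.
    counts = np.array([[2., 1.], [0., 3.], [4., 0.], [1., 1.]])
    print(publish_noisy_counts(counts, epsilon=0.1))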

Dwork et al.'s method provides reasonable accuracy for queries about individual entries in the frequency matrix, as it injects only a small amount of noise (with a constant variance) into each entry. Nevertheless, there exist numerous other queries for which Dwork et al.’s method fails to provide useful results. For example, for a count query answered by taking the sum of a constant fraction of the entries in the noisy frequency matrix, the approximate query result has a Θ(m) noise variance, where m denotes the total number of entries in the matrix. Note that m is typically an enormous number, as practical datasets often contain multiple attributes with sizable domains. Hence, a Θ(m) noise variance can render the approximate result meaningless, especially when the actual result of the query is small.

In our ICDE 2010 paper, we remedy this deficiency of Dwork et al.'s method with Privelet (privacy preserving wavelet), a data publishing technique that not only ensures ε-differential privacy, but also provides accurate results for all range-count queries, i.e., count queries where the predicate on each attribute is a range. Privelet guarantees that any range-count query can be answered with noise whose variance is polylogarithmic in m. This significantly improves over the Θ(m) noise variance bound of Dwork et al.'s method.

In a nutshell, the effectiveness of Privelet results from a novel application of wavelet transforms, a type of linear transformation that has been widely adopted for image processing and approximate query processing. As with Dwork et al.'s method, Privelet preserves privacy by modifying the frequency matrix M of the input data. Instead of injecting noise directly into M, however, Privelet first applies a wavelet transform on M, converting M to another matrix C. Privelet then adds polylogarithmic noise to each entry in C, and maps C back to a noisy frequency matrix M∗. The matrix M∗ thus obtained has an interesting property: the result of any range-count query on M∗ can be expressed as a weighted sum of a polylogarithmic number of entries in C. Furthermore, each of these entries contributes at most polylogarithmic noise variance to the weighted sum. Therefore, the variance of the noise in the query result is bounded by a polylogarithm of m.
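
To give a flavor of the machinery, here is a toy one-dimensional Haar wavelet pipeline in Python. This only illustrates the transform-noise-invert idea, not the actual Privelet algorithm, which carefully calibrates the noise scale per coefficient level:

    import numpy as np

    def haar(v):
        """One-dimensional Haar transform (length must be a power of 2)."""
        v = np.asarray(v, dtype=float)
        coeffs = []
        while len(v) > 1:
            avg = (v[0::2] + v[1::2]) / 2.0
            diff = (v[0::2] - v[1::2]) / 2.0
            coeffs.append(diff)   # detail coefficients at this level
            v = avg
        coeffs.append(v)          # overall average
        return coeffs

    def inverse_haar(coeffs):
        v = coeffs[-1]
        for diff in reversed(coeffs[:-1]):
            out = np.empty(2 * len(v))
            out[0::2] = v + diff
            out[1::2] = v - diff
            v = out
        return v

    # Privelet-style pipeline (the noise scale here is purely illustrative):
    # a range sum touches only O(log m) coefficients, so the noise it
    # accumulates is polylogarithmic in m rather than linear.
    m_counts = np.array([2., 1., 0., 3., 4., 0., 1., 1.])
    noisy = [c + np.random.laplace(scale=1.0, size=c.shape) for c in haar(m_counts)]
    print(inverse_haar(noisy))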

You can read more about this work in our paper that will appear at ICDE 2010. See you in Long Beach for the talk!

Tuesday, September 15, 2009

Intensional Associations in Dataspaces

One problem that many users have in managing their data is how to obtain connected items while searching. For example, picture yourself searching for information on an interesting classroom project you developed some years ago. You may type a few keywords into a search tool that will lead you to one or two documents lost in the vast amount of information on your hard drive about that project. Unfortunately, not all documents you are interested in, such as graphs, emails, and results of interesting experiments, may contain the keywords you chose to type into the search box.

The problem in this example is that even though you could find some information related to your project, you cannot connect from this information to other important items in the same context. Together with colleagues from Saarland University and ETH Zurich, I have explored an idea to solve this problem in a paper recently accepted for publication at ICDE 2010. The full version of our paper can be found here (link to draft).

In order to define connections among items in a dataspace, we propose association trails. An association trail is a declarative definition of how items in the dataspace are connected by virtual association edges to other items. A set of association trails defines a logical graph of associations over the dataspace. For example, you may connect documents in your personal dataspace by associating items touched around the same time, documents with similar content, different versions of documents you authored or received, or items that reside in similar folder hierarchies in your email server and in your filesystem.
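
As a toy illustration of one such trail -- "items touched around the same time" -- here is a sketch in Python. All names and the one-hour window are invented for this example; the system's real association trails are declarative specifications, not code:

    from datetime import datetime, timedelta

    # A toy dataspace: item id -> last-touched timestamp.
    items = {
        "report.tex":  datetime(2009, 3, 2, 10, 15),
        "results.csv": datetime(2009, 3, 2, 10, 40),
        "advisor.eml": datetime(2009, 5, 7, 9, 0),
    }

    def same_time_trail(a, b, window=timedelta(hours=1)):
        """One association trail: connect items touched within `window`."""
        return a != b and abs(items[a] - items[b]) <= window

    # The logical association graph is the set of virtual edges the trail defines.
    edges = [(a, b) for a in items for b in items if same_time_trail(a, b)]
    print(edges)  # a neighborhood query expands search hits along such edges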

Coming back to our classroom project search, association trails create connections from your one or two search results to a rich set of related emails, documents, and experiment results. Automatically obtaining all of this context information from search results is what our paper calls a neighborhood query. While neighborhood queries are very useful to help you find information in your data, they are also very expensive to process over the logical graph of connections created by association trails. In order to address this problem, our paper investigates a new indexing technique, called the grouping-compressed index (GCI). In a nutshell, GCI creates a compressed representation of the logical graph declared by association trails. We can use this compressed representation to answer neighborhood queries without ever having to expand it to the whole graph. As a consequence, GCI can achieve indexing and querying times that are over an order of magnitude better than those of various alternatives.

Association trails have been integrated into the iMeMex Dataspace Management System and the code is released under an open-source license. If you are interested in dataspaces, you can also find out about other work I have done in iMeMex by taking a look at my PhD thesis.

I am looking forward to an interesting conference at Long Beach next year! Hope to see you there!

The SIGMOD 2010 Deadline -- Nov. 5

Remember, remember the Fifth of November,
And that SIGMOD matters a lot,
I know of no reason
Why SIGMOD submission
Should ever be forgot.

(And should you forget, it's V for VLDB for you!)

Saturday, September 12, 2009

Talk Announcement -- Daniel Deutch, Tel Aviv University

Date: Monday Sept. 21, 2009 noon to 1pm
Location: 5130 Upson Hall, Cornell

Querying Past and Future in Web Applications

Daniel Deutch
http://www.cs.tau.ac.il/~danielde/


Abstract: Many businesses offer their services to customers via Web-based application interfaces. Reasoning about execution flows of such applications is extremely valuable for companies: it can be used to optimize business processes, employ targeted advertisements, reduce operational costs, and ultimately increase competitiveness. Such reasoning often operates in an environment inducing partial information and uncertainty of various flavors. First, the execution traces recorded for a Web application often contain only partial information on the activities that were performed at run-time, due to confidentiality, lack of storage space, etc. Second, even in the presence of fully detailed traces of past executions, prediction of the behavior of future executions may still operate under uncertainty. This is because executions often depend on unknown external parameters, such as user behavior, interaction with other applications, server response times, etc.

In this talk I will consider (1) models for capturing Web applications and their executions, which are expressive enough to capture common scenarios while restrictive enough to allow for efficient query evaluation; (2) query evaluation algorithms over applications/execution traces under these models; and (3) practical implementations for recommending navigation flows within a web application.

Friday, September 11, 2009

PIP: A Database System for Great and Small Expectations

The field of probabilistic databases asks the question: can we still reason about data that we aren't certain about? Related to a wide variety of fields including statistics, probability theory, and fuzzy logic, probabilistic databases treat data as following a probabilistic model rather than being known with certainty. When the data is queried, the database not only produces a query result, but also computes (or more commonly, estimates) statistical properties of that result (e.g., confidences or histograms).

For example, consider a risk-management application that uses statistical models to evaluate the long term effects of corporate decisions and policies. This application may use a DBMS to store predictions and statistical measures (e.g., error bounds) of those predictions. However, arbitrary queries made on the predictions do not translate naturally into queries on the corresponding statistical measures. A user who requires error bounds on the sum of a join over several tables of predictions must first obtain a formula for computing those bounds, assuming a closed-form formula even exists.

A wide variety of probabilistic database systems have arisen that support probabilistic data following finite, discrete distributions. Some even approximate continuous distributions by translating them into discrete ones (e.g., by integrating over bins or sampling).

Unfortunately, these kinds of approximations are hard to instantiate prior to runtime: to maintain generality, these systems must provision (e.g., via bin size or sample count) for a wide variety of unknown, arbitrary queries and statistical measurements of unknown precision when creating probabilistic data tables. Worse still, since precision can be data-dependent, it may be impossible to accurately provision a query until after it has completed. These systems must generate an overabundance of samples or unnecessarily small bin sizes, lest they be unable to achieve sufficient precision.

For example, if a query contains a selection predicate, samples violating the predicate are dropped and do not contribute to the expectation. The more selective the predicate, the more samples are needed to maintain consistent accuracy. Our sample application may be queried to combine a model predicting customer profits with a model for predicting dissatisfied customers, perhaps as a result of a corporate decision to use a cheaper, but slower shipping company. If the query asks for profit loss due to dissatisfied customers, the query need only consider profit from customers under those conditions where the customer is dissatisfied (i.e., the underlying model may include a correlation between ordering patterns and dependence on fast shipping).

We address these concerns in our paper PIP: A Database System for Great and Small Expectations (link to draft version), to be presented at ICDE 2010. PIP is a purely symbolic probabilistic database that supports continuous distributions. Using a c-tables representation and techniques similar to those used by the MayBMS Probabilistic Database Management System, PIP maintains a symbolic representation of all uncertainty in the database. Queries are evaluated directly on this symbolic representation; a direct mapping exists between all relational algebra operators and their c-tables counterparts. Even better, the direct mapping is nearly trivial.

Because the query result is expressed symbolically, its statistical properties can be estimated very efficiently or even computed precisely (in certain cases). By exploiting knowledge about the underlying distributions (where available) and the query itself, PIP acts as a framework for applying a variety of statistical tools to improve optimization efficiency. PIP is extensible, allowing developers to encode new probability distributions into modules that are linked into PIP itself. These modules encode any or all of a wide range of pieces of information about the distribution. For example, if the CDF of a distribution is known, when computing the expectation of a variable following this distribution, PIP can use the CDF to improve sampling efficiency or even compute an exact value for the expectation.
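
Here is a small illustration, under our own toy setup, of why knowing a CDF helps: to estimate E[X | X > c] for an exponential variable, rejection sampling wastes almost all of its samples when the predicate is selective, while inverting the conditional CDF makes every sample count:

    import math, random

    # Toy version of the idea: estimate E[X | X > c] for Exponential(lam).
    lam, c = 1.0, 3.0

    def rejection_estimate(n):
        """Naive approach: most samples violate the predicate and are wasted."""
        kept = [x for x in (random.expovariate(lam) for _ in range(n)) if x > c]
        return sum(kept) / len(kept), len(kept)

    def cdf_aware_estimate(n):
        """With the CDF, sample X | X > c directly by inverting the
        conditional CDF F(x) = 1 - exp(-lam*(x - c)) -- no waste."""
        samples = [c - math.log(1.0 - random.random()) / lam for _ in range(n)]
        return sum(samples) / n

    print(rejection_estimate(100000))   # keeps only ~5% of its samples here
    print(cdf_aware_estimate(100000))   # exact answer is c + 1/lam = 4.0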

PIP has been implemented as a plugin for Postgres. Some minor modifications to Postgres itself make it possible to run queries written for static data unchanged on probabilistic data. However, the core functionality can be imported into any Postgres database. A beta version of PIP will be released officially soon. Check back on this blog for details.

Cornell papers accepted at ICDE 2010

The group has just had the following papers and a tutorial accepted at ICDE 2010:
  • Intensional Associations in Dataspaces, by Marcos Vaz Salles, Jens Dittrich, Lukas Blunschi (short paper)
  • Differential Privacy via Wavelet Transforms, by Xiaokui Xiao, Guozhang Wang, Johannes Gehrke (long paper)
  • Approximate Confidence Computation in Probabilistic Databases, by Dan Olteanu, Jiewen Huang, Christoph Koch (long paper)
  • PIP: A Database System for Great and Small Expectations, by Oliver Kennedy, Christoph Koch (long paper)
  • Privacy in Data Publishing, Johannes Gehrke, Daniel Kifer, Ashwin Machanavajjhala (three-hour tutorial)
More on this work coming soon on this blog!

Monday, September 7, 2009

New MMO: Monopoly City Streets


From their website: "A live worldwide game of MONOPOLY using Google Maps as the game board." Are there any interesting scaling issues with running this *live*? Maybe we can help them with our scripting language SGL ;-)?

Friday, September 4, 2009

e-Privacy: A Framework for Data-Publishing against Realistic Adversaries (Part II)

Recall the setting from my previous posting. We want to publish a table while maintaining the privacy of the individuals in the table. So far, prior work has considered only extreme adversaries: either very weak or super strong adversaries - nothing in between.
If we only protect privacy against weak adversaries, then a smarter adversary will be able to breach the privacy of individuals in practice. For example, an adversary who knows that "it is flu season and there are likely to be many elderly patients with flu symptoms in the hospital" can be smart enough to breach the privacy of individuals.
On the other hand if we protect against extremely strong adversaries then our published table is close to being useless.
That is why we would like to protect against realistic adversaries in the middle ground. So what do "realistic" adversaries look like?

In our VLDB paper, we introduce e-privacy which protects against adversaries who have collected some statistics about the population and use them to build their belief about individuals. An adversary can have the following knowledge:
  • Knowledge about the general population. The adversaries obtain this knowledge, for example, from articles that they have read that contain survey data, or by looking at other data sets.
  • Knowledge about specific individuals in the table. The adversaries obtain this knowledge as in prior work, for example, by knowing some facts about their friends or neighbors.
As we just outlined, the adversary starts with some beliefs about the world. Upon seeing the released table, the adversary may revise her belief - for example, the adversary may know the rate of cancer in the general population and assume it is the same for all patients, but upon seeing the data from a hospital, she might change her mind. By how much an adversary changes her belief depends on what we call the stubbornness of the adversary. Infinitely stubborn adversaries believe in facts, and thus they do not change their belief at all. Adversaries with finite stubbornness need more or less data to convince them that their prior beliefs were incorrect.

In the paper we explain how to take a table and generalize it in order to protect against these adversaries. This was impossible if you wanted to protect against the super strong adversaries. But now we can find out which tables we can publish while defending against realistic adversaries.

We believe that e-privacy with its defense against realistic adversaries is very useful in practice, and thus we hope that you will want to experiment with it and try it out on your own data. We are eager to learn about your experience!

For more details on the framework and an experimental evaluation of information content of data published with e-privacy please refer to our paper. If you want to know more about our projects on data privacy you can find more information here.

Wednesday, September 2, 2009

A Confluence of Column Stores and Search Engines: Opportunities and Challenges

At the USETIM (Using Search Engine Technology for Information Management) Workshop at VLDB 2009, we presented a paper that explores the technical similarities and differences between column stores and search engines. The motivation for the paper is that we believe that the two systems have a lot of similarities, and that developments in both fields point towards a confluence. We therefore try to evaluate the synergies obtainable from implementing one system to support both workloads, and we find that there are both opportunities and challenges associated with developing such a hybrid system.

From a high-level perspective, we can claim that both column stores and search engines are column-oriented. While column stores obviously store column-oriented representations of tables, an inverted index used in search engines can be interpreted as a column-oriented representation of a term-document matrix where the entries describe the occurrence of terms in documents. Despite this similarity, it is clear that the queries typically supported in the two types of systems are quite different. While search engines focus on returning the top-k ranked documents for a keyword query, column stores are mainly used in decision support systems where queries with aggregations are dominant. However, typical search engine queries might become more similar to traditional DSS queries through the introduction of faceted search. In addition, new approaches suggested for storing the entries in inverted lists/posting lists make the storage layout in search engines more column-oriented. The fact that column stores have been used as back-ends to search engines with good performance also indicates similarities between the fields.
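
A tiny sketch (with toy data) of this reading of an inverted index as sparse columns of the term-document matrix:

    documents = {1: "cheap flights", 2: "cheap hotels", 3: "flights to lyon"}

    # The inverted index: one "column" of the term-document matrix per term,
    # storing only the rows (documents) with a non-zero entry.
    inverted_index = {}
    for doc_id, text in documents.items():
        for term in text.split():
            inverted_index.setdefault(term, []).append(doc_id)

    print(inverted_index["cheap"])  # [1, 2] -- the 'cheap' column, sparsely stored
    # A column store keeps attributes the same way: all values of one
    # column together, which is why the two layouts look so similar.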

The similarities might suggest that one system can support both workloads efficiently. Our qualitative evaluation indicates that there are clear opportunities associated with constructing a hybrid system, such as efficient storage structures for facets in faceted search engines based on a columnar layout, and technology transfer of loading techniques and indexing structures. But there are also challenges that need to be overcome for a hybrid system to be efficient. One example is the fact that search engine workloads require low latency, while throughput is the major optimization criterion for column stores. A successful hybrid system might also be able to support other related workloads, but each additional workload typically brings its own query language, and it then becomes a challenge to find a physical algebra for the system that supports all of these query languages efficiently.

If you are interested in hearing more details about this work, or in discussing the potential of developing a hybrid system for both search and decision support workloads, comment on this blog or read our paper at the USETIM workshop at VLDB 2009!

Monday, August 31, 2009

Course Announcement: CS6320 Database Management Systems (for graduate students)

CS6320 covers the foundations, design, and construction of large-scale data-centric computing systems, with a special focus on turning declarative specifications and queries into scalable systems. I will present the foundations of declarative languages and the implications of these foundations on the difficulty of solving problems specified using such languages (e.g., database queries; or constraint satisfaction problems in AI; or model checking problems in computer aided verification). I will also present the central concepts and algorithms used for building scalable systems for processing declarative languages. This includes parallel processing (e.g., in a map/reduce-style framework) and automata-based data stream processing techniques. Moreover, I will show how to engineer declarative languages for a purpose, achieving a good trade-off between expressive power and processing cost, and how to understand and implement optimizing compilers that turn declarative specifications into efficient code that executes them.

The course will be of interest to both systems and theory students. A big part of the grade is based on a research project, where you can choose between a theoretical and a systems project.

THE FIRST LECTURE WILL TAKE PLACE ON TUESDAY, SEPT. 1, 2009.

M.Eng. Project Kick-off Meeting Thursday Sept. 3 at 5pm

This is of interest to Cornell M.Eng. students interested in doing a CS7999 project with the database group. We will have a kick-off meeting for CS7999 projects on Thursday at 5-6pm in 5130 Upson Hall. We will present the available projects and required skills, and will answer your questions. We strongly recommend attending this meeting; it may not be possible to start a CS7999 project with us later this term. If you plan to attend, please RSVP to shawna@cs.cornell.edu.

Monday, August 17, 2009

Computer Science as Creativity

When asked to name some creative activities, most people will immediately say "painting," or "drawing," or "writing." These are common creative pursuits, but creative pursuits are certainly not limited to these or similar endeavors. They are examples of low floor/high ceiling activities: low floor meaning that they can be begun and practiced at a very young age with a very basic assortment of tools, and high ceiling meaning that there is a lot of room to improve---many new techniques can be learned and ever more sophisticated tools acquired.

Other creative pursuits, such as gardening, baking, or mixing drinks, require mastery of some prerequisite skills before one can begin. We certainly would not expect a five-year-old to be able to plant a bed of tulips or know how to mix a gin martini.

Mathematics, entrepreneurship, and athletics fall even further from the general idea of what creativity is. Yet all these fields involve creative thinking, original ways of looking at and solving problems. Bending and breaking the “rules” allows for advancement. Sometimes this is very acceptable, as in non-Euclidean geometry, where lines and planes need not be straight, and sometimes completely unacceptable, as witnessed in the commotion over steroid abuse in the Olympics or Major League Baseball.

That being said, why don't most of us consider computer science a creative field? Computers are staples in workplaces, schools, and homes; many of us can't imagine a day without Google or eBay; and an entire culture has carried over from the Internet to the outside world. The rise of computing and Internet technologies has made it easier to store, retrieve, analyze, and share important information, transforming almost every field of study. And yet we hear so many students---the very ones born into the "Digital Age"---complaining about how difficult or boring computer science is.

This is probably for the same reasons we don't think of math or particle physics as very creative: the learning curve is steep, it is more difficult to see personal progress being made, and there exist errors and wrong answers.

Nowadays, computer skills courses taught to young students involve mostly typing and learning to navigate the Internet or create slideshow presentations. Exposure to actual computer programming is minimal, appearing as TI-BASIC side projects in algebra and geometry classes. Programming classes for Java or C++ are often not available until the high school or college level.

Furthermore, how does the student know that he is becoming "better" at computer science? Given one programming language, he could be tested on his knowledge of the commands particular to that language. However, this only proves his knowledge of some definitions, and not necessarily his reasoning.

Learning to program can also be frustrating if, at every turn, one has to look out for missing parentheses or semicolons. The "aesthetics" of code---syntax, spaces, and indentation---are helpful, but they certainly aren't what computer science is "all about." Misplacing one character can throw off an entire program, and while perhaps the ideas are correct, the work becomes a lot of tedious nit-combing for those tiny mistakes.

The thinking, the logic, and the potential of computer science are what should be emphasized in these introductory classes. The theory and the application are at once alien and familiar, but not enough is being done to meld them in the learning mind. For example, we hear that “math is everywhere,” and whenever we go to the supermarket to buy food, or do our taxes, or cook from a recipe, we are using what we learned in math. We can see that computer science is all around, but using awkward analogies to explain the concepts does nothing to help explain why we need to know a dozen ways to sort objects or manipulate strings.

Yes, computer science is a creative domain. Not only can one apply creativity to it; it can also be applied creatively to other fields. That is the brilliance of such a field, but the stepping stones must be polished before the adjective "boring" can disappear. In an age where words like "blog," "Web 2.0" and "search engine" are commonplace, we ought to update our pedagogy to reflect how deeply computer science has affected our lives.

Friday, August 14, 2009

e-Privacy: A Framework for Data-Publishing against Realistic Adversaries (Part I)

We have a new paper with exciting results on privacy-preserving data publishing at VLDB 2009. What is privacy-preserving data publishing? Let us start with a motivating example. Consider the following set of medical records published by Gotham City Hospital:
ZIP CODE   AGE            DISEASE
130**      less than 30   Viral Infection
130**      less than 30   Heart Disease
1485*      at least 40    Cancer
1485*      at least 40    Heart Disease
130**      around 35      Cancer
130**      around 35      Cancer

Each record in this table corresponds to a unique patient in the hospital, and each patient has three attributes: her zip code, her age, and her disease. Each patient considers her disease to be sensitive; the other attributes are not sensitive, but might be used to link a record to a person.
The non-sensitive attributes have been coarsened to ensure that no patient can be uniquely identified. For example, the zip code of the first patient has been changed from 13021 to 130**, and the values in the age attributes have been changed to ranges. The hospital should ensure that an adversary cannot link any patient to her disease.

Suppose Rachel is an individual in the population. Given access to only this table, the adversary, Alice, may not be able to deduce Rachel's disease. But if Alice knows that Rachel is one of the individuals whose medical record is published in the table, and that Rachel is 35 years old and lives in zip code 13068, Alice can infer that Rachel has cancer.

So now let's be a bit more formal and describe the basic scenario of privacy-preserving data-publishing abstractly: You have a table T with sensitive information about individuals. You want to publish a sanitized version T' that (a) offers good utility and (b) preserves the privacy of the individuals in the table.

Now, we have neither defined utility nor privacy, and there might not be a "best" definition of these concepts. In the literature you find a variety of definitions that differ in what is considered sensitive information, in what privacy means and against what types of adversaries privacy needs to be protected.

In prior work on this topic, privacy was either protected against very weak adversaries or against extremely powerful (basically omniscient) adversaries. For example, consider the weak adversary of t-closeness. This adversary knows the distribution of the diseases in T before you have released any information about T; for instance, the adversary in t-closeness believes that Rachel's chance of having cancer is 50%. Another weak adversary is captured in l-diversity. Here, the adversary believes that for Rachel all diseases are equally likely, and the adversary knows some facts about the world, such as "men are unlikely to have breast cancer." On the other extreme, differential privacy considers a very powerful adversary who is assumed to know all patients in T except Rachel. Differential privacy provides so much protection that no generalization of T can be released, and this much privacy limits the utility of the released table.

Is there a middle ground of adversaries that we can work with that are neither omniscient nor weaklings? I will tell you more about this in my next blog posting.

Thursday, August 13, 2009

A SQL Compiler for High-Performance Delta Processing in Main-Memory Databases


DBToaster is a novel SQL compiler that generates database engines for high-performance main-memory processing of streaming data. In a nutshell, DBToaster aggressively compiles aggregate queries to incremental (or delta-) form, enabling stream data to be processed highly efficiently, one tuple at a time, through the entire query, in contrast to today's operator-centric query plan interpreters. We will demonstrate the DBToaster compiler at VLDB 2009 in the "Core DB Technology & System issues" Demo Session on Tuesday, August 25th, and Wednesday, August 26th.

The DBToaster compiler produces database engines in native code to perform incremental view maintenance of continuous queries posed on update streams. Update streams cannot be addressed efficiently by today's systems, and one clear motivating application is that of algorithmic trading on orderbook data, where buy and sell orders on an exchange's orderbooks are updated arbitrarily. Algorithmic trading applications require high-performance processing, have bounded size state in practice, and often involve queries with a processing scope that cannot be addressed by constructs such as windows or punctuations.

The novelty behind DBToaster's query compilation to view maintenance code is twofold: first, its use of a recursive compilation technique to aggressively simplify queries to delta forms, and second, its maintenance of multiple map data structures throughout recursive compilation to support delta processing. To explain recursive compilation, we contrast it with traditional view maintenance algorithms. Current view maintenance techniques use query plans to compute result deltas from changes to a single base relation. To emphasize, the incremental form of a user-specified query is itself a query, but critically, a simpler query that exploits new query inputs. Typically, the new inputs allow the elimination of joins and simplifications of aggregate computations.

Now, today's view maintenance algorithms stop at this point. DBToaster, on the other hand, ploughs straight ahead, recursively applying compilation to transform delta forms that are themselves queries into simpler and simpler queries, by considering combinations of base relation deltas. Our recursive compilation bottoms out at queries that can be represented as very simple procedural code statements. Furthermore, DBToaster maintains each delta form encountered as a map data structure, essentially a group-by aggregate index derived from applying aggregate distributivity properties together with join-graph decompositions. These maps are incrementally maintained by the procedural code generated by recursively compiling the maps' delta forms. DBToaster internally uses a map algebra to represent and reason about queries and map data structures, and performs recursive compilation through a set of transformation rules defined in our map algebra.
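
To make the flavor of the generated code concrete, here is a hand-written Python analogue for the query SELECT SUM(r.a * s.b) FROM R r, S s WHERE r.k = s.k. The map names are ours and real DBToaster emits native code, but the structure -- O(1) work per tuple against recursively maintained maps -- is the point:

    from collections import defaultdict

    # Maps maintained alongside the query result: per-key partial aggregates.
    sum_a_by_k = defaultdict(float)   # SUM(a) grouped by R.k
    sum_b_by_k = defaultdict(float)   # SUM(b) grouped by S.k
    result = 0.0                      # SUM(r.a * s.b) over the join

    def on_insert_r(k, a):
        """Delta of the query w.r.t. one insertion into R: the join
        partners' sums are already materialized, so this is O(1)."""
        global result
        result += a * sum_b_by_k[k]   # delta query: new tuple joined with S
        sum_a_by_k[k] += a            # recursively maintained map

    def on_insert_s(k, b):
        global result
        result += b * sum_a_by_k[k]
        sum_b_by_k[k] += b

    for k, a in [(1, 2.0), (2, 5.0)]:
        on_insert_r(k, a)
    on_insert_s(1, 10.0)              # result becomes 2.0 * 10.0 = 20.0
    print(result)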

We invite you to attend our demo at VLDB to see DBToaster in action. We will demonstrate DBToaster and its recursive compilation as applied to executing algorithmic trading strategies on orderbook data, and a data warehouse loading scenario, emulating simultaneous processing of a data integration query and an OLAP query, while incrementally loading the warehouse from an OLTP database. Our demonstration includes a visualization of recursive compilation and the map data structures as applied to these demo queries, as well as the ability to trace and step through the generated map maintenance code. DBToaster generates extremely high performance native code database engines, and we will demonstrate the performance of these query processors compared to both popular open-source database systems and commercial-grade tools. You'll even have the opportunity to try out your own queries on our datasets! Look out for the toaster...

Wednesday, August 12, 2009

Creativity and Rules

A balance must be struck between teaching creativity with very structured methods and teaching it with very open methods. In a previous post, we discussed how teaching creativity in too structured a way is really not teaching creativity at all. With too many set rules, the mind is stifled and not allowed to expand.

Having too few rules is also bad, however. Assignments that say "draw whatever you like" or "write about whatever you want" are good if the student is already brimming with imaginative ideas. Those who do not consider themselves very imaginative, who cannot think of 'good ideas,' or who simply feel at a loss may instead decide to copy off a neighbor's ideas. This is counterproductive. Yes, the assignment has been done; yes, it may be good work; but no, nothing has been gained towards becoming more creative or thoughtful. We desire a bounty of fresh ideas, which can certainly arise through emulation, but through flat-out imitation we learn nothing.

We can stand in front of a Pollock painting and declare, “My kid could do that”---after all, all Pollock did was throw buckets of paint on his canvasses, right? We can think that poetic acclaim will be easily achieved if we throw together an incoherent set of words and pass it off as a comment on postmodernism.

Without a base structure of learning and development, we have disorder, in the sense that there are ideas and tools floating in a sphere around us that we don't use for various reasons—perhaps we feel that they are unnecessary, we think that they are too difficult to utilize, or we are completely unaware of their existence.

One particular quote from H. Jackson Brown Jr.'s Life's Little Instruction Book has oft been repeated: "Learn the rules so you know how to break them properly." The basic rules are there to help, not to hinder. We learn to use pencils and markers the "proper way" when we are just learning how to write or color, and from there it is a small jump to breaking the rules, taking them apart, and then writing or coloring in novel ways.

For every creative field, there are an infinite number of tools at one's disposal. Probably the most challenging part of being creative is simply breaking into that field, and learning to use the most basic rules and tools to achieve the desired outcome. The creative ideas come with the toolbox, and we mold them to fit what we are able to achieve and the tools we achieve it with.

The nature of that basic toolbox determines what we generally think of as "creative." Although we have previously learned that any field can serve as a place of creative learning and thinking, we often do not think of certain fields as such. Mathematics, computer science, and physics all need creativity, yet the toolbox needed to properly break into these fields can be enough to turn many people away, who unfortunately dismiss them as "too hard" or "too boring."

Next: Computer Science as Creativity

Saturday, August 8, 2009

Micro-Review of KIDO'Z

I have been looking for websites where children can express their creativity through video games, and I recently came across KIDO'Z. Now the primary focus of KIDO'Z is not gaming, but to be "the Kid's Web Environment" as it says on their website. In the spirit of this being a micro-blog, here is a micro-review.

KIDO'Z opens up an Adobe Air Application that allows children to do three activities:
  • Browse the Web. There is a list of approved websites (that the parent can add to and block at her discretion) with the usual suspects: Thomas, Mickey Mouse, Dora, Tiger and Pooh, etc. The initial experience of selecting the website resembles the iTunes store and is thus very different from browsing the web. In addition, even going to "approved" websites does not prevent the display of material that is at the least not useful for children. For example, the website of Dora the Explorer shows at the bottom some sponsored links (including how to save on house insurance and how to find a local moving company). These sponsored links appear in the KIDO'Z browser as well; however, clicking on the links just reloads the webpage. I think this gives a distorted view of the concept of a link and the functionality of the Web.
  • Watch a limited set of YouTube videos. The most popular "channels" include classic cartoons and the "Animal Channel," to give you an idea of what is approved.
  • Play some games, none of which seem to tickle the creative spark in a child.
I tried to play with the "Parental Control Center," but I first ran into a wall: the password that I had created when first signing up had more than 10 characters, but the password box on the "Parents Login" site only let me input 10 characters. So I re-started the application and went to the parents' part of the website from within the app. The parent has full control over the content and can block and add content at will. Below is a video that gives you an idea of the type of control a parent has.




There do not seem to be any "social features" in the application: no networks of parents who recommend content to each other, only various lists of approved content with comments and ratings.

I can see that there is a market for this kind of application; it gives children exposure to content on the Web while keeping the parent in control. And I like the idea of the parent setting virtual boundaries but then letting the child explore the space by herself. As a portal for content that is suitable for children, the site is not bad. But personally I am not sold; I did not like that the interface of the application is different from a standard browser, and I think that, if you have the time, it is better to sit next to your child when she surfs the web, point her to some content that is worthwhile to explore, and discuss and answer her questions yourself.

Friday, August 7, 2009

Cornell best hotspot of brainpower in the nation, says website

This website ranks US cities by the percentage of residents holding master's and doctoral degrees. The top four cities are immediately adjacent to Cornell!

Saturday, August 1, 2009

Google Faculty Summit

This week I was at the Google Faculty Summit; I put up some notes here. (I don't add them to this blog because they are a little opinionated.)

Thursday, July 30, 2009

Cooperative Update Exchange in the Youtopia System

In the Youtopia project, we are building a system to allow communities to share structured data. There is an increasing amount of such data being shared on the Web, but there are few tools to make this sharing and management easy for everyone. What is needed is the equivalent of Wiki software for structured data, i.e. a collaborative database management system.

Designing a collaborative DBMS poses unique and exciting technical challenges because all functionality must be fully decentralized. In a recent paper to appear in VLDB 2009, we tackle one such challenge, namely, the problem of update exchange.

Update exchange is a process where changes to data in one table are automatically propagated to other tables as directed by a set of mappings (or tuple-generating dependencies). For example, suppose table A contains information about restaurants in New York State and table B contains data about restaurants in Ithaca, as well as restaurant reviews. If a new restaurant opens in Ithaca and is entered into table A, a mapping can automatically propagate that information to table B, notifying the owner of table B of the new restaurant and giving them a chance to supply a review.
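
A hypothetical rendering of this example as a single chase step (the schema and all names are ours, and the real system is far more general):

    # Table A: statewide restaurants; table B: Ithaca restaurants + reviews.
    table_a = [{"name": "Stella's", "city": "Ithaca"}]
    table_b = []

    def chase_step():
        """One firing of the mapping A(name, 'Ithaca') -> B(name, review).
        In Youtopia the new B-tuple is not filled in blindly: a human is
        notified, may confirm it, and can supply the missing review."""
        for r in table_a:
            if r["city"] == "Ithaca" and not any(b["name"] == r["name"] for b in table_b):
                table_b.append({"name": r["name"], "review": None})  # awaits curation

    chase_step()
    print(table_b)  # [{'name': "Stella's", 'review': None}]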

In our paper, we introduce a model for update exchange which is fundamentally cooperative. It is a variant of the classical chase that includes human intervention at crucial points. Thus, the users have a great deal of control over how the mappings fire. An additional advantage of our model is that the mappings may form arbitrary cycles - we are able to drop the acyclicity restrictions often found in the literature without risking infinite sequences of new tuple insertions.

Our paper presents our update exchange model and explains how to design a practical system that incorporates it. The key observation is that human intervention is slow and unpredictable, and the system must never delay new incoming queries and updates just because it is awaiting human input on an older update. This means that in general, multiple chases will be going on in the system simultaneously. We study the potential for interference among such chases. We provide a serializability framework and concrete algorithms that make it possible to avoid all such interference, should it be unacceptable in a given deployment scenario.

We are currently putting our work into practice and building a prototype of the Youtopia system. Stay tuned for updates, and we hope to see you at the talk in Lyon!

Saturday, July 25, 2009

What is Creativity?

The root of the words "create" and "creativity" come from the Latin creatus and creare, meaning "to make or produce," or literally, "to grow." - Jane Piirto

Creativity does not simply apply to art or writing, as early schooling tends to lead us to believe. Creativity is applicable to business, to physics, to politics, to pretty much every imaginable aspect of living. It is just a matter of making something new, something novel. (Of course, ideas don't grow in vacuum chambers, so the interpretation of creativity does depend on the zeitgeist.)

"To say that Thomas Edison invented electricity or that Albert Einstein discovered relativity is a convenient simplification.... But Edison's or Einstein's discoveries would be inconceivable without the prior knowledge, without the intellectual and social network that stimulated their thinking, and without the social mechanisms that recognized and spread their innovations. To say that the theory of relativity was created by Einstein is like saying that it is the spark that is responsible for the fire. The spark is necessary, but without air and tinder there would be no flame."
- Mihaly Csikszentmihalyi

Csikszentmihalyi defines "Big C" and "little c" as the two types of creativity, differing in magnitude. "Big C" Creativity is what leads to changing domains, changing entire ways of life, by someone who is (or becomes) well known by others in the relevant field. Einstein's theory of relativity, Beethoven's Moonlight Sonata, and Twain's literary works fall under Big C. "little c" creativity, on the other hand, concerns the individual's day-to-day life, and activities such as finding the fastest way to a destination using side streets, or figuring out how to keep rabbits out of the vegetable garden.

Everyone is born with a propensity towards creativity. Many debates have occurred over the years on whether the "amount" of creativity can be measured, or if it is correlated with other factors, such as intelligence, giftedness, talent, socioeconomic standing, and so on. Perhaps there are some facets of creativity that are only innate, or some factors that facilitate creativity (verbal acuity, an eye for detail, a particular physique) that are innate, but there are other aspects that can be actively developed.

Being creative is necessary for a fulfilling life, and it is something that should be developed and cultivated throughout one's lifetime. Unfortunately, creativity can be suppressed by the process of being educated or growing up, in favor of practicality or pragmatism. The good news is that at any time, creativity can be picked up again, re-nurtured, and re-embraced, and that it does not have to exist entirely separate from practicality.


Next time: Teaching Creativity with Too Much Freedom, and Being a Creative Individual

Further Reading

Creativity: Flow and the Psychology of Discovery and Invention (Mihaly Csikszentmihalyi)

Understanding Creativity (Jane Piirto)

Thursday, July 23, 2009

Invited talk at CIAA'09 on applications of automata in XML processing



Last week I gave an invited talk at the 14th International Conference on Implementation and Application of Automata (CIAA'09) in Sydney, Australia. I talked about applications of automata in XML processing. More specifically, I addressed three problem areas:
  1. Using automata for XML validation, where I addressed DTDs and XML Schema (more precisely, restrained competition grammars), their automata-theoretic analogs, and efficient streaming validation.
  2. Using automata in XML publish-subscribe and complex-event processing systems, and
  3. Using automata for evaluating highly expressive node-selecting queries with a constant number of sequential passes over the data.
I also talked a little about the industrial importance of complex-event processing systems.

Overall, the conference was a very nice opportunity to meet some old friends. Unfortunately, it was not an equally great opportunity to make new ones among the Koala population (i.e., form a coalition). All the local Koalas seemed to live in zoos and were said to die from stress if one touches them. (So that's illegal.) Generally, the wildlife is well protected there. I saw a sign at the beach which put picking up seashells or barnacles under a fine of up to $22,000.

(Note: the picture is from Wikipedia. I saw no Koalas, just barnacles.)

SIGMOD 2009 New Researcher Symposium

The goal of this year's SIGMOD New Researchers Symposium, which I co-chaired with Nesime Tatbul from ETH Zurich, was to give graduate students and junior researchers advice on the human factors of a research career, such as successfully making the move from graduate student to being part of, and productive in, a team of researchers in an industry lab, or to becoming junior faculty and having to build up a research group. The two panels, consisting of well-known members of the data management community, also covered issues such as attracting and mentoring graduate students or junior researchers and providing good leadership of a research group. The slides used by the panelists are now available here. (Note: some panelists did not use slides.)

Wednesday, July 22, 2009

Teaching Creativity with Little Freedom

I remember going through elementary school, and coming to greatly dislike certain phrases that my teachers seemed to enjoy repeating, whether in praise—"Wow, that's so creative!"—or admonishment—"Try to be more creative."

At that young age, I already had cynical, scornful tendencies toward such banal forms of encouragement. I was looking around at my peers' work, and oftentimes couldn't, for the life of me, see how it represented anything that I hadn't seen done before. To quibble, the term used was 'creative', not 'original', but I'm sure the teacher had seen it all as well.

So what did it mean to be creative? And furthermore, how was one supposed to become creative?

We had plenty of time for art, "creative" writing, and other such activities, which appeared on the schedule once or twice a week. It seemed, however, that these activities were either too structured or not structured enough. I recall a project from second grade where we had to draw leaves using provided stencils, cut them out, and color them. As the teacher demonstrated the process, she colored her leaves yellow-green. We all took out our boxes of crayons. I saw my left and right neighbors taking different shades of green, but I decided I didn't want green leaves. Out of my box of twenty-four, I picked a deep red and a bright orange, mixing them on paper to create a beautiful, fiery-looking leaf.

While I was admiring the blending of the waxy hues, my right-side neighbor noticed, and elbowed me, saying "You're supposed to make the leaves green."

I looked around, and quailed, seeing that everybody else in the room was using green. It could be that I hadn't been paying attention, but I had assumed the leaves could be any color we liked. The teacher came around to tell me that the leaves should be green. Why did they have to be green? Because they all had to look alike.

My seven-year-old self was completely stumped. I couldn't turn the leaf over and color that side; it would be facing the wrong way. I didn't have any more paper to work with; we were given only one sheet, which I had already cut into pieces. I decided to try and cover the orange crayon with dark green. Suffice it to say, it turned out badly.

At the end of the week, all of the leaves were hung up above the blackboard, including my orange-and-green mess. Only now do I wonder what would have happened to me if I had simply said "So what?" to my neighbor and let it be.


Next: What is Creativity?

Tuesday, July 21, 2009

An Evaluation of Checkpoint Recovery for Massively Multiplayer Online Games

One problem that current Massively Multiplayer Online Games (MMOs) face is how to provide persistence for their virtual worlds. We have studied this problem in a recent paper accepted for publication at VLDB 2009.

MMOs use a standard DBMS on the back-end to provide transactional guarantees for state updates. This decision is appropriate for updates that require full ACID guarantees, such as in-game financial transactions and item exchanges. However, the bulk of state updates in MMOs does not need full ACID properties. For example, character movement comprises a large fraction of the updates applied to a virtual world. As it turns out, game logic such as collision detection prevents character movement updates from generating any conflicts. We call these updates local updates.

The number of local updates a game must process may reach hundreds of thousands or even millions per second. For performance, these updates are applied in main memory only, and game developers hand-code persistence logic for durability.

In our paper, we have experimentally evaluated the performance of main-memory checkpoint recovery techniques for MMOs. Our study shows that these techniques are a viable alternative for providing durability for local updates. However, not all techniques we have studied are equally suited for MMOs. MMOs have stringent latency requirements, ruling out methods that introduce long pauses in the game. Here is a summary of our recommendations to game developers:

  1. Methods that copy only dirty objects on update have clear latency advantages over methods based on eager copies of the game state. They avoid latency peaks by spreading their overhead over a number of game ticks.

  2. When update rates are so dramatically large and skewed that the entire game state gets updated in a single tick of the game, little can be done to reduce the latency impact of the checkpoint algorithms. In this extreme situation, an algorithm based on an eager copy of the entire game state introduces the minimum pause in the game.

  3. Methods based on a double-backup organization either match or outperform log-based alternatives in terms of recovery time.

  4. The best method for a wide range of parameters is copy on update combined with a double backup (see the sketch after this list). This method outperforms alternatives by up to a factor of five in latency without any degradation in recovery time.
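
For intuition, here is a bare-bones sketch of the double-backup idea with dirty-object tracking. It is our simplification of the techniques in the paper: the real copy-on-update algorithms additionally snapshot pre-images so updates can proceed while a checkpoint pass is spread over many game ticks:

    class DoubleBackupState:
        """Toy game state with dirty tracking and two alternating backups.
        While one backup absorbs changes, the other remains a complete,
        consistent snapshot, so a crash never loses both."""
        def __init__(self, state):
            self.state = dict(state)
            self.backups = [dict(state), dict(state)]
            self.stale = [set(), set()]   # keys each backup has yet to absorb
            self.current = 0

        def update(self, key, value):     # a "local update" during a game tick
            self.state[key] = value
            self.stale[0].add(key)        # both backups are now stale for this key
            self.stale[1].add(key)

        def checkpoint(self):
            # Copy only the objects this backup is missing; the real system
            # spreads this copying work over a number of game ticks.
            target = self.backups[self.current]
            for key in self.stale[self.current]:
                target[key] = self.state[key]
            self.stale[self.current].clear()
            self.current ^= 1             # alternate to the other backup

    world = DoubleBackupState({"player_x": 0})
    world.update("player_x", 42)
    world.checkpoint()                    # backup 0 now holds player_x = 42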

Our evaluation is based on a detailed simulation model of the checkpoint methods, available for download here. This simulation model has been validated against a real implementation of a relevant subset of the techniques. We plan to make our implementation available for download as well. Stay tuned for updates in the near future!

Hope to see you at our talk in Lyon!

Sunday, July 19, 2009

New Social Gaming Company?

Techcrunch has an article about a new social gaming company called Tiny Speck. I think social gaming will be huge: we will all be constantly connected, we will carry around various intellectual prosthetics, and we will have time to kill in the checkout line at the grocery store. It will also be interesting to see how games adapt once our attention span for a game is only five minutes. To sample what's out there, check out the various games at Zynga (social games) or Kongregate (online flash games, many for the five-minute attention span).

Friday, July 17, 2009

Teaching a Foreign Language In a Virtual World

The New York Times had an article yesterday about an interesting type of game: a virtual world where children learn languages. Unfortunately, Wiz World Online is all in Chinese, so I could not play it, but this is definitely a development to watch, and I would love to hear from people who have played it to understand how the virtual world works. I think that games have great potential as a teaching tool where we can use AI and machine learning to personalize the game and learning experience without the personal attention of a teacher.

PS: When I just made a small modification to this entry, Google showed me an ad for "Learn Chinese online with Beijing's best teachers."

Demo of CAT (the Cornell Anonymization Toolkit) at SIGMOD 2009

At the recent SIGMOD conference, we presented our new tool for limiting disclosure in data publishing: CAT, the Cornell Anonymization Toolkit. While there has been a lot of interesting work on different algorithms for privacy-preserving data publishing, there are few tools that combine them with a usable GUI that visualizes what these algorithms achieve. CAT has an easy-to-use interface; it guides the user through the process of preparing a dataset for publication while limiting disclosure through the identification of records that have high risk under various attacker models.

CAT currently implements l-diversity and t-closeness.

Below are a few screenshots of CAT; the code is available here. We welcome feedback from users about features you would like to have, which aspects of the tool work well, and which need improvement.



Blackboard buys TerriblyClever

Blackboard, the company that (among other applications) sells the course content management system that Cornell uses, has purchased TerriblyClever, a company run by undergraduates from Stanford that develops university-centric iPhone services (such as GPS-supported maps, a directory, bills, and courses) under the name MobilEdu. Congratulations to the founders! I can only hope that this will have an impact on Blackboard beyond new apps on the iPhone; the interface of the version of the Blackboard course management system that I have used has no Web 2.0 features and is very clunky. In addition, management of assignments and submission of solutions does not work well.

MayBMS demo at SIGMOD

The MayBMS demo was presented at the 2009 ACM SIGMOD conference in Providence, RI. This was the first MayBMS demo since its official release on SourceForge earlier this year. As part of the demonstration, we showed two web applications built on top of the MayBMS server. The first was a human resource management application that uses what-if analysis to aid managerial decisions, and the second showed how random graphs and social networks can be modeled and queried as probabilistic databases. See the conference poster here.

Saturday, July 11, 2009

Building an OS for netbooks with two developers

Techcrunch reported on July 8 that, besides Google, the French company Jolicloud is also building an operating system for netbooks. They seem to be building on Linux, but what impressed me most is that their team consists of just two developers. I found this encouraging for us at a university, where we are always limited to working with a few select graduate students. It is the idea and the people that matter, not the size of the team!

Friday, July 10, 2009

Ashwin's thesis recognized

Cornell DB group alumnus Ashwin Machanavajjhala has won the 2009 ACM SIGMOD Doctoral Dissertation Award honorable mention. Congratulations Ashwin!

Tuesday, June 30, 2009

Nitin Gupta, with co-author Neha Singh of IIT Bombay, wins first place in the ACM Student Research Competition Grand Finals for their work on XML processing.

Friday, May 29, 2009

VLDB acceptance notifications out

The Cornell DB group got three technical papers and a demo accepted for presentation at this year's VLDB conference in Lyon, France!

Saturday, February 28, 2009

Michaela Goetz wins a Microsoft Research Graduate Women's Scholarship. Congratulations Mila!