CS6320 covers the foundations, design, and construction of large-scale data-centric computing systems, with a special focus on turning declarative specifications and queries into scalable systems. I will present the foundations of declarative languages and the implications of these foundations on the difficulty of solving problems specified using such languages (e.g., database queries; or constraint satisfaction problems in AI; or model checking problems in computer aided verification). I will also present the central concepts and algorithms used for building scalable systems for processing declarative languages. This includes parallel processing (e.g., in a map/reduce-style framework) and automata-based data stream processing techniques. Moreover, I will show how to engineer declarative languages for a purpose, achieving a good trade-off between expressive power and processing cost, and how to understand and implement optimizing compilers that turn declarative specifications into efficient code that executes them.
The course will be of interest to both systems and theory students. A big part of the grade is based on a research project, where you can choose between a theoretical and a systems project.
THE FIRST LECTURE WILL TAKE PLACE ON TUESDAY, SEPT. 1, 2009.
Monday, August 31, 2009
M.Eng. Project Kick-off Meeting Thursday Sept. 3 at 5pm
This is of interest to Cornell M.Eng. students interested in doing a CS7999 project with the database group. We will have a kick-off meeting for CS7999 projects on Thursday at 5-6pm in 5130 Upson Hall. We will present the available projects and required skills, and will answer your questions. We strongly recommend to attend this meeting; it may not be possible to start a CS7999 project with us later this term. If you plan to attend, please RSVP to shawna@cs.cornell.edu.
Wednesday, August 19, 2009
Monday, August 17, 2009
Computer Science as Creativity
When asked to name some creative activities, most people will immediately say "painting," or "drawing," or "writing." These are common creative pursuits, but creative pursuits are certainly not limited to these or similar endeavors. They are examples of low floor/high ceiling activities, low floor meaning that they can be begun and practiced started at a very young age with that very basic assortment of tools, and high ceiling meaning that there is a lot of space to improve in---many new innovations and techniques can be gained along with even more useful tools.
Other creative pursuits, such as gardening, baking, or mixing drinks need mastery of some prerequisite skills before beginning. We certainly would not expect a five year old to be able to plant a bed of tulips or know how to mix a gin martini.
Mathematics, entrepreneurship, and athletics fall even further from the general idea of what creativity is. Yet all these fields involve creative thinking, original ways of looking at and solving problems. Bending and breaking the “rules” allows for advancement. Sometimes this is very acceptable, as in non-Euclidean geometry, where lines and planes need not be straight, and sometimes completely unacceptable, as witnessed in the commotion over steroid abuse in the Olympics or Major League Baseball.
That being said, why don't most of us consider computer science a creative field? Computers are staples in workplaces, schools, and homes; many of us can’t imagine a day without Google or EBay; and an entire culture has carried over from the Internet to the outside world. The rise of computing and Internet technologies has made it easier to store, retrieve, analyze, and share important information, transforming almost every field of study. And yet we hear so many students---the very ones born into the “Digital Age”---complaining about how difficult or boring computer science is.
This is probably for the same reasons we don't think of math or particle physics as very creative. The learning curve is relatively steeper, it is more difficult to see personal progress being made, and there exist errors and wrong answers.
Nowadays, computer skills courses taught to young students involve mostly typing and learning to navigate the Internet or create slideshow presentations. Exposure to actual computer programming is minimal, appearing as TI-BASIC side projects in algebra and geometry classes. Programming classes for Java or C++ are often not available until the high school or college level.
Furthermore, how does the student know that he is becoming “better” at computer science? Given one programming language, he could be tested over his knowledge of the commands particular to that language. However, this only proves his knowledge of some definitions, and not necessarily the reasoning.
Learning to program can also be frustrating, if at every turn, one has to look out for missing parentheses or semicolons. The “aesthetics” of code---syntax, spaces, and indentations---are helpful, but they certainly aren’t what computer science is “all about.” Misplacing one character can throw off an entire program, and while perhaps the ideas are correct, the work becomes a lot of tedious nit-combing for those tiny mistakes.
The thinking, the logic, and the potential of computer science are what should be emphasized in these introductory classes. The theory and the application are at once alien and familiar, but not enough is being done to meld them in the learning mind. For example, we hear that “math is everywhere,” and whenever we go to the supermarket to buy food, or do our taxes, or cook from a recipe, we are using what we learned in math. We can see that computer science is all around, but using awkward analogies to explain the concepts does nothing to help explain why we need to know a dozen ways to sort objects or manipulate strings.
Yes, computer science is a creative domain. Not only can one apply creativity to it, it can also be applied to other fields creatively. That is the brilliance of such a field, but the stepping stones must be polished before the adjective “boring” can disappear. In an age where words like “blog,” “Web 2.0” and “search engine” are commonplace, we ought to update our pedagogy to reflect the importance that computer science has affected our lives.
Other creative pursuits, such as gardening, baking, or mixing drinks need mastery of some prerequisite skills before beginning. We certainly would not expect a five year old to be able to plant a bed of tulips or know how to mix a gin martini.
Mathematics, entrepreneurship, and athletics fall even further from the general idea of what creativity is. Yet all these fields involve creative thinking, original ways of looking at and solving problems. Bending and breaking the “rules” allows for advancement. Sometimes this is very acceptable, as in non-Euclidean geometry, where lines and planes need not be straight, and sometimes completely unacceptable, as witnessed in the commotion over steroid abuse in the Olympics or Major League Baseball.
That being said, why don't most of us consider computer science a creative field? Computers are staples in workplaces, schools, and homes; many of us can’t imagine a day without Google or EBay; and an entire culture has carried over from the Internet to the outside world. The rise of computing and Internet technologies has made it easier to store, retrieve, analyze, and share important information, transforming almost every field of study. And yet we hear so many students---the very ones born into the “Digital Age”---complaining about how difficult or boring computer science is.
This is probably for the same reasons we don't think of math or particle physics as very creative. The learning curve is relatively steeper, it is more difficult to see personal progress being made, and there exist errors and wrong answers.
Nowadays, computer skills courses taught to young students involve mostly typing and learning to navigate the Internet or create slideshow presentations. Exposure to actual computer programming is minimal, appearing as TI-BASIC side projects in algebra and geometry classes. Programming classes for Java or C++ are often not available until the high school or college level.
Furthermore, how does the student know that he is becoming “better” at computer science? Given one programming language, he could be tested over his knowledge of the commands particular to that language. However, this only proves his knowledge of some definitions, and not necessarily the reasoning.
Learning to program can also be frustrating, if at every turn, one has to look out for missing parentheses or semicolons. The “aesthetics” of code---syntax, spaces, and indentations---are helpful, but they certainly aren’t what computer science is “all about.” Misplacing one character can throw off an entire program, and while perhaps the ideas are correct, the work becomes a lot of tedious nit-combing for those tiny mistakes.
The thinking, the logic, and the potential of computer science are what should be emphasized in these introductory classes. The theory and the application are at once alien and familiar, but not enough is being done to meld them in the learning mind. For example, we hear that “math is everywhere,” and whenever we go to the supermarket to buy food, or do our taxes, or cook from a recipe, we are using what we learned in math. We can see that computer science is all around, but using awkward analogies to explain the concepts does nothing to help explain why we need to know a dozen ways to sort objects or manipulate strings.
Yes, computer science is a creative domain. Not only can one apply creativity to it, it can also be applied to other fields creatively. That is the brilliance of such a field, but the stepping stones must be polished before the adjective “boring” can disappear. In an age where words like “blog,” “Web 2.0” and “search engine” are commonplace, we ought to update our pedagogy to reflect the importance that computer science has affected our lives.
Friday, August 14, 2009
e-Privacy: A Framework for Data-Publishing against Realistic Adversaries (Part I)
We have a new paper with exciting results on privacy-preserving data publishing at VLDB 2009. What is privacy-preserving data publishing? Let us start with a motivating example. Consider the following set of medical records published by Gotham City Hospital:
Each record in this table corresponds to a unique patient in the hospital, and each patient has three attributes: her zip code, her age, and her disease. Each patient considers her disease to be sensitive; the other attributes are not sensitive, but might be used to link a record to a person.
The non-sensitive attributes have been coarsened to ensure that no patient can be uniquely identified. For example, the zip code of the first patient has been changed from 13021 to 130**, and the values in the age attributes have been changed to ranges. The hospital should ensure that an adversary cannot link any patient to her disease.
Suppose Rachel is an individual in the population. Given access to only this table, the adversary Alice, may not be able the deduce Rachel's disease. But if Alice knows that Rachel is one of the individuals whose medical record is published in the table, and that Rachel is 35 year old and lives in zip code 13068, Alice can infer that Rachel has cancer.
So now let's be a bit more formal and describe the basic scenario of privacy-preserving data-publishing abstractly: You have a table T with sensitive information about individuals. You want to publish a sanitized version T' that (a) offers good utility and (b) preserves the privacy of the individuals in the table.
Now, we have neither defined utility nor privacy, and there might not be a "best" definition of these concepts. In the literature you find a variety of definitions that differ in what is considered sensitive information, in what privacy means and against what types of adversaries privacy needs to be protected.
In prior work on this topic, privacy was either protected against very weak adversaries or against extremely powerful (basically omniscient) adversaries. For example, consider the weak adversary of t-closeness. This adversary knows the distribution of the diseases in T before you have released any information about T; for instance, the adversary in t-closeness believes that Rachel's chance of having cancer is 50%. Another weak adversary is captured in l-diversity. Here, the adversary believes that for Rachel all diseases are equally likely, and the adversary knows some facts about the world, such as "men are unlikely to have breast cancer." On the other extreme, differential privacy considers a very powerful adversary who is assumed to know all patients in T except Rachel. Differential privacy provides so much protection that no generalization of T can be released, and so much privacy limits the utility of the released table.
Is there a middle ground of adversaries that we can work with that are neither omniscient nor weaklings? I will tell you more about this in my next blog posting.
ZIP CODE | AGE | DISEASE |
130** | less than 30 | Viral Infection |
130** | less than 30 | Heart Disease |
1485* | at least 40 | Cancer |
1485* | at least 40 | Heart Disease |
130** | around 35 | Cancer |
130** | around 35 | Cancer |
Each record in this table corresponds to a unique patient in the hospital, and each patient has three attributes: her zip code, her age, and her disease. Each patient considers her disease to be sensitive; the other attributes are not sensitive, but might be used to link a record to a person.
The non-sensitive attributes have been coarsened to ensure that no patient can be uniquely identified. For example, the zip code of the first patient has been changed from 13021 to 130**, and the values in the age attributes have been changed to ranges. The hospital should ensure that an adversary cannot link any patient to her disease.
Suppose Rachel is an individual in the population. Given access to only this table, the adversary Alice, may not be able the deduce Rachel's disease. But if Alice knows that Rachel is one of the individuals whose medical record is published in the table, and that Rachel is 35 year old and lives in zip code 13068, Alice can infer that Rachel has cancer.
So now let's be a bit more formal and describe the basic scenario of privacy-preserving data-publishing abstractly: You have a table T with sensitive information about individuals. You want to publish a sanitized version T' that (a) offers good utility and (b) preserves the privacy of the individuals in the table.
Now, we have neither defined utility nor privacy, and there might not be a "best" definition of these concepts. In the literature you find a variety of definitions that differ in what is considered sensitive information, in what privacy means and against what types of adversaries privacy needs to be protected.
In prior work on this topic, privacy was either protected against very weak adversaries or against extremely powerful (basically omniscient) adversaries. For example, consider the weak adversary of t-closeness. This adversary knows the distribution of the diseases in T before you have released any information about T; for instance, the adversary in t-closeness believes that Rachel's chance of having cancer is 50%. Another weak adversary is captured in l-diversity. Here, the adversary believes that for Rachel all diseases are equally likely, and the adversary knows some facts about the world, such as "men are unlikely to have breast cancer." On the other extreme, differential privacy considers a very powerful adversary who is assumed to know all patients in T except Rachel. Differential privacy provides so much protection that no generalization of T can be released, and so much privacy limits the utility of the released table.
Is there a middle ground of adversaries that we can work with that are neither omniscient nor weaklings? I will tell you more about this in my next blog posting.
Thursday, August 13, 2009
A SQL Compiler for High-Performance Delta Processing in Main-Memory Databases
DBToaster is a novel SQL compiler that generates database engines for high-performance main-memory processing of streaming data. In a nutshell, DBToaster aggressively compiles aggregate queries to incremental (or delta-) form, enabling stream data to be processed highly efficiently, one tuple at a time, through the entire query, in contrast to today's operator-centric query plan interpreters. We will demonstrate the DBToaster compiler at VLDB 2009 in the "Core DB Technology & System issues" Demo Session on Tuesday, August 25th, and Wednesday, August 26th.
The DBToaster compiler produces database engines in native code to perform incremental view maintenance of continuous queries posed on update streams. Update streams cannot be addressed efficiently by today's systems, and one clear motivating application is that of algorithmic trading on orderbook data, where buy and sell orders on an exchange's orderbooks are updated arbitrarily. Algorithmic trading applications require high-performance processing, have bounded size state in practice, and often involve queries with a processing scope that cannot be addressed by constructs such as windows or punctuations.
The novelty behind DBToaster's query compilation to view maintenance code is twofold, first in its use of a recursive compilation technique to aggressively simplify queries to delta forms, and second, its maintenance of multiple map data structures throughout recursive compilation to support delta processing. To define recursive compilation, we contrast to traditional view maintenance algorithms. Current view maintenance techniques use query plans to compute result deltas from changes to a single base relation. To emphasize, the incremental form of a user-specified query is itself a query, but critically, a simpler query that exploits new query inputs. Typically, the new inputs allow the elimination of joins and simplifications of aggregate computations.
Now, today's view maintenance algorithms stop at this point. DBToaster on the other hand ploughs straight ahead, recursively applying compilation to transform delta forms that are themselves queries, to simpler and simpler queries, by considering combinations of base relation deltas. Our recursive compilation bottoms out at queries that can be represented as very simple procedural code statements. Furthermore, DBToaster maintains each delta form encountered as a map datastructure, essentially a group-by aggregate index derived from applying aggregate distributivity properties together with join-graph decompositions. These maps are incrementally maintained by the procedural code generated by recursively compiling the maps' delta forms. DBToaster internally uses a map algebra to represent and reason about queries and map datastructures, and performs recursive compilation through a set of transformation rules defined in our map algebra.
We invite you to attend our demo at VLDB to see DBToaster in action. We will demonstrate DBToaster and its recursive compilation as applied to executing algorithmic trading strategies on orderbook data, and a data warehouse loading scenario, emulating simultaneous processing of a data integration query and an OLAP query, while incrementally loading the warehouse from an OLTP database. Our demonstration includes a visualization of recursive compilation and the map datastructures as applied to these demo queries, as well as the ability to trace and step through the generated map maintenance code. DBToaster generates extremely high performance native code database engines, and we will demonstrate the performance of these query processors compared to both popular open source database systems, as well as commercial-grade tools. You'll even have the opportunity to try out your own queries on our datasets! Look out for the toaster...
Wednesday, August 12, 2009
Creativity and Rules
A balance must be struck, between teaching creativity with very structured methods and teaching it with very open methods. In a previous post, we discussed how teaching creativity in too structured a way is really not teaching creativity at all. With too many set rules, the mind is stifled and not allowed to expand.
Having too few rules is also bad, however. Assignments that say “draw whatever you like,” or “write about whatever you want” are good, if the student is already brimming with imaginative ideas. For those who do not consider themselves very imaginative, who cannot think of ‘good ideas,’ or simply feel at a loss, they may decide to copy off a neighbor’s ideas. This is counterproductive. Yes, the assignment has been done, yes, it may be good work, but no, nothing has been gained towards becoming more creative or thoughtful. We desire a bounty of fresh ideas, which can certainly arise through emulation, but through flat-out imitation we learn nothing.
We can stand in front of a Pollock painting and declare, “My kid could do that”---after all, all Pollock did was throw buckets of paint on his canvasses, right? We can think that poetic acclaim will be easily achieved if we throw together an incoherent set of words and pass it off as a comment on postmodernism.
Without a base structure of learning and development, we have disorder, in the sense that there are ideas and tools floating in a sphere around us, but that we don’t use for various reasons—perhaps we feel that they are unnecessary, we think that they are too difficult to be utilized, or we are completely unaware of their existence.
One particular quote from H. Jackson Brown Jr’s Life’s Little Instruction book has oft been repeated: “Learn the rules so you know how to break them properly.” The basic rules are there to help, and not to hinder. We learn to use pencils and markers the “proper way” when we are just learning how to write or color, and from there it is a small jump to breaking them, taking them apart, and then writing or coloring in novel ways.
For every creative field, there are an infinite number of tools at one’s disposal. Probably the most challenging part of being creative is simply breaking into that field, and learning to use the most basic of rules and tools to achieve the desired outcome. The creative ideas come with the toolbox, and we mold them to fit with what we are able to achieve, and what tools we achieve with.
That nature of that basic toolbox determines what we in general think of as “creative.” Although we have previously learned that any field can serve as a place of creative learning and thinking, we often do not think of certain fields as such. Mathematics, computer science, and physics all need creativity, yet the toolbox needed to properly break into the field can be enough to turn many people away, unfortunately dismissing them as “too hard” or “too boring.”
Next: Computer Science as Creativity
Having too few rules is also bad, however. Assignments that say “draw whatever you like,” or “write about whatever you want” are good, if the student is already brimming with imaginative ideas. For those who do not consider themselves very imaginative, who cannot think of ‘good ideas,’ or simply feel at a loss, they may decide to copy off a neighbor’s ideas. This is counterproductive. Yes, the assignment has been done, yes, it may be good work, but no, nothing has been gained towards becoming more creative or thoughtful. We desire a bounty of fresh ideas, which can certainly arise through emulation, but through flat-out imitation we learn nothing.
We can stand in front of a Pollock painting and declare, “My kid could do that”---after all, all Pollock did was throw buckets of paint on his canvasses, right? We can think that poetic acclaim will be easily achieved if we throw together an incoherent set of words and pass it off as a comment on postmodernism.
Without a base structure of learning and development, we have disorder, in the sense that there are ideas and tools floating in a sphere around us, but that we don’t use for various reasons—perhaps we feel that they are unnecessary, we think that they are too difficult to be utilized, or we are completely unaware of their existence.
One particular quote from H. Jackson Brown Jr’s Life’s Little Instruction book has oft been repeated: “Learn the rules so you know how to break them properly.” The basic rules are there to help, and not to hinder. We learn to use pencils and markers the “proper way” when we are just learning how to write or color, and from there it is a small jump to breaking them, taking them apart, and then writing or coloring in novel ways.
For every creative field, there are an infinite number of tools at one’s disposal. Probably the most challenging part of being creative is simply breaking into that field, and learning to use the most basic of rules and tools to achieve the desired outcome. The creative ideas come with the toolbox, and we mold them to fit with what we are able to achieve, and what tools we achieve with.
That nature of that basic toolbox determines what we in general think of as “creative.” Although we have previously learned that any field can serve as a place of creative learning and thinking, we often do not think of certain fields as such. Mathematics, computer science, and physics all need creativity, yet the toolbox needed to properly break into the field can be enough to turn many people away, unfortunately dismissing them as “too hard” or “too boring.”
Next: Computer Science as Creativity
Saturday, August 8, 2009
Micro-Review of KIDO'Z
I have been looking for websites where children can express their creativity through video games, and I recently came across KIDO'Z. Now the primary focus of KIDO'Z is not gaming, but to be "the Kid's Web Environment" as it says on their website. In the spirit of this being a micro-blog, here is a micro-review.
KIDO'Z opens up an Adobe Air Application that allows children to do three activities:
There do not seem to be any "social features" in the application; no networks of parents that recommend each other content; only various lists of approved content with comments and ratings.
I can see that there is a market for this kind of application; it gives children exposure to content on the Web while keeping the parent in control. And I like the idea of the parent setting virtual boundaries but then letting the child explore the space by herself. And as a portal for content that is suitable for children the site is not bad. But personally I am not sold; I did not like it that the interface of the application is different from a standard browser, and I think that if you have the time it is better to sit next to your child when she surfs the web, point her to some content that is worthwhile to explore, and discuss and answer her questions yourself.
KIDO'Z opens up an Adobe Air Application that allows children to do three activities:
- Browse the Web. There is a list of approved websites (that the parent can add to and block at her liberty) with the usual suspects: Thomas, Mickey Mouse, Dora, Tiger and Pooh, etc. The initial experience of selecting the website resembles the iTunes store and is thus very different from browsing the web. In addition, even going to "approved" websites does not prevent the display of material that is at the least not useful for children. For example, the website of Dora the Explorer shows at the bottom some sponsored links (including how to save on house insurance and how to find a local moving company). These sponsored links appear in the KIDO'Z browser as well, however clicking on the links just reloads the webpage. I think this gives a distorted view of concept of a link and the functionality of the Web.
- Watch a limited set of YouTube videos. The most popular "channels" include classic cartoons, the "Animal Channel" to give you an idea of what is approved.
- Some games, none of them seem to tickle the creative spark in a child.
There do not seem to be any "social features" in the application; no networks of parents that recommend each other content; only various lists of approved content with comments and ratings.
I can see that there is a market for this kind of application; it gives children exposure to content on the Web while keeping the parent in control. And I like the idea of the parent setting virtual boundaries but then letting the child explore the space by herself. And as a portal for content that is suitable for children the site is not bad. But personally I am not sold; I did not like it that the interface of the application is different from a standard browser, and I think that if you have the time it is better to sit next to your child when she surfs the web, point her to some content that is worthwhile to explore, and discuss and answer her questions yourself.
Labels:
creativity,
education,
games,
startups
Friday, August 7, 2009
Cornell best hotspot of brainpower in the nation, says website
This website ranks US cities by the percentage of resident master's and doctorate degrees. The top four cities are right adjacent to Cornell!
Saturday, August 1, 2009
Google Faculty Summit
This week I was at the Google Faculty Summit; I put up some notes here. (I don't add them to this blog because they are a little opinionated.)
Subscribe to:
Posts (Atom)