The field of probabilistic databases attempts to ask the question: can we still reason about data that we aren't certain about? Related to a wide variety of fields including statistics, probability theory, and fuzzy logic, probabilistic databases treat data as following a probabilistic model rather than being known with certainty. When the data is queried, the database not only produces a query result, but also computes (or more commonly, estimates) statistical properties (i.e., Confidence, or Histogram) of that result.
For example, consider a risk-management application that uses statistical models to evaluate the long term effects of corporate decisions and policies. This application may use a DBMS to store predictions and statistical measures (e.g., error bounds) of those predictions. However, arbitrary queries made on the predictions do not translate naturally into queries on the corresponding statistical measures. A user who requires error bounds on the sum of a join over several tables of predictions must first obtain a formula for computing those bounds, assuming a closed form formula even exists.
A wide variety of probabilistic database systems have arisen supporting probabilistic data that follows finite, discrete distributions. Some even approximate continuous distributions by translating them into discrete distributions (i.e., by integrating over bins or sampling).
Unfortunately, these kinds of approximations are hard to instantiate prior to runtime; To maintain generality, these systems must provision (e.g., via bin size, or sample count) for a wide variety of unknown arbitrary queries and statistical measurements of unknown precision when creating probabilistic data tables. Worse still, since precision can be data-dependent, it may be impossible to accurately provision a query until after it has completed. These systems must generate an overabundance of samples or unnecessarily small bin sizes, lest they be unable to achieve sufficient precision.
For example, if a query contains a selection predicate, samples violating the predicate are dropped and do not contribute to the expectation. The more selective the predicate, the more samples are needed to maintain consistent accuracy. Our sample application may be queried to combine a model predicting customer profits with a model for predicting dissatisfied customers, perhaps as a result of a corporate decision to use a cheaper, but slower shipping company. If the query asks for profit loss due to dissatisfied customers, the query need only consider profit from customers under those conditions where the customer is dissatisfied (ie, the underlying model may include a correlation between ordering patterns and dependence on fast shipping).
We address these concerns in our paper PIP:A Database System for Great and Small Expectations (link to draft version), to be presented at ICDE 2010. PIP is a purely symbolic probabilistic database that supports continuous distributions. Using a C-Tables representation and techniques similar to those used by the MayBMS Probabilistic Database Management System, PIP maintains a symbolic representation of all uncertainty in the database. Queries are evaluated directly on this symbolic representation; a direct mapping exists between all relational algebra operators and their c-tables counterparts. Even better, the direct mapping is nearly trivial.
Because the query result is expressed symbolically, its statistical properties can be estimated very efficiently or even computed precisely (in certain cases). By exploiting knowledge about the underlying distributions (where available) and the query itself, PIP acts as a framework for applying a variety of statistical tools to improve optimization efficiency. PIP is extensible, allowing developers to encode new probability distributions into modules that are linked into PIP itself. These modules encode any or all of a wide range of pieces of information about the distribution. For example, if the CDF of a distribution is known, when computing the expectation of a variable following this distribution, PIP can use the CDF to improve sampling efficiency or even compute an exact value for the expectation.
PIP has been implemented as a plugin for Postgres. Some minor modifications to Postgres itself make it possible to employ static data queries on probabilistic data unchanged. However, the core functionality can be imported into any Postgres database. A beta version of PIP will be released officially soon. Check back on this blog for details.