Saturday, November 17, 2012

Big Data News over the last week

The last week was really busy with the SIGMOD 2013 Deadline, and I did not get to posting that much, so I will catch up over the weekend.

Dell acquires Gale Technologies, a company that automates and manages physical and virtual resources. They provide templated provisioning and management through their GaleForce platform, which provides automation, self-service, and resource scheduling; they manage everything from compute, network, and storage to cloud resources (including Amazon and Rackspace) and edge devices. This space is hot since Cisco also recently announced that they would acquire Cloupia, which has similar capabilities: a resource lifecycle manager, an operations automation center, a capacity manager, etc. Management of resources and automation of in-house, cloud, and hybrid deployments is one of those boring "enterprise" tasks that one hears little about in the press, but that is crucially important in the enterprise.

Interesting speculation about Steve Sinofsky, including Amazon and Tableau as companies located in Seattle.

A nice sampling of Big Data and Enterprise Security. This reminds me of the first paper that I read in this space when I was still a graduate student and just starting in data mining: Wenke Lee, Salvatore J. Stolfo, Kui W. Mok. Mining Audit Data to Build Intrusion Detection Models. KDD 1998: 66-72. It contained an early version of item constraints, which they called axis attributes, and also a heuristic algorithm for finding patterns with low support.

The amount of storage used in Microsoft's SkyDrive has doubled over the last six months. They also announced selective sync. This comes at the same time as Dropbox is running its space race for universities (Cornell is currently Number 4!) with the goal to sign up (and lock in) students. But Dropbox is also innovating fast, having just announced Dropbox Chooser, an easy way to integrate Dropbox into web apps with a small piece of JavaScript.

Monday, November 12, 2012

Recent Big Data News

Elasticsearch has raised $10M in funding. Elasticsearch is an open-source, RESTful enterprise search engine built on Lucene that has quite a few nice features:
  • Distribution and high availability: Indexes are in shards, and each shard can be replicated.
  • Multi-tenancy, basically support for more than one index, each with a different configuration.
  • Their update API seems to have several levels of consistency.
  • They support all the features that are now standard such as facets, different weighting of various parts of documents, different mappers, etc.
  • And its API is very simple indeed.
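The sharding and replication idea from the list above can be illustrated with a toy sketch. This is not Elasticsearch's actual routing or allocation logic; the hash function, the round-robin placement, and all names below are assumptions made purely for illustration.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard by hashing its id (illustrative scheme)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def place_shards(num_shards: int, num_replicas: int, nodes: list) -> dict:
    """Place each shard's primary and replicas on distinct nodes, round-robin."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("need at least one node per copy of a shard")
    placement = {}
    for shard in range(num_shards):
        # Consecutive nodes (wrapping around) hold the copies of this shard.
        copies = [nodes[(shard + k) % len(nodes)] for k in range(num_replicas + 1)]
        placement[shard] = {"primary": copies[0], "replicas": copies[1:]}
    return placement


# Hypothetical three-node cluster with three shards and one replica each.
placement = place_shards(3, 1, ["node-a", "node-b", "node-c"])
for shard, copies in placement.items():
    print(shard, copies)
```

With one replica per shard, losing any single node in this toy layout still leaves a copy of every shard on another node, which is the availability argument behind replication.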

Recollect is a new startup that enables archiving, downloading, and searching of your pictures and activities (tweets and check-ins for now). Their UI has an interesting timeline look.

Big Data News for the weekend of November 10, 2012

Google improved its Google Cloud SQL MySQL database offering. They now have six tiers with exponentially increasing sizing and pricing, ranging from D1 (500MB RAM, 1GB storage, 850k I/Os per day at $1.46/day or 10 cents/hour) to D32 (16GB RAM, 10GB storage, 32M I/Os per day at $46.84 per day or $3.08 per hour).
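To compare the two billing options quoted above, here is a back-of-the-envelope sketch that uses only the prices listed in this post. The function name and the tier labels in the dictionary are mine; it computes the number of hours of use per day above which the daily price is cheaper than hourly billing.

```python
def break_even_hours(price_per_day: float, price_per_hour: float) -> float:
    """Hours of use per day above which the daily price beats hourly billing."""
    return price_per_day / price_per_hour


# Prices in USD as quoted in this post: (per day, per hour).
tiers = {
    "smallest tier": (1.46, 0.10),
    "largest tier": (46.84, 3.08),
}

for name, (per_day, per_hour) in tiers.items():
    hours = break_even_hours(per_day, per_hour)
    print(f"{name}: daily price wins above {hours:.1f} hours/day")
```

In both cases the break-even point is around 15 hours per day, so hourly billing only pays off for databases that sit idle for a good part of the day.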

The closest to this in the Amazon cloud is Amazon Relational Database Service (RDS); the other services that Amazon offers are Amazon DynamoDB (high-performance NoSQL) and SimpleDB (managed NoSQL for smaller datasets). The whole Google Cloud Platform is getting more and more pieces with Google App Engine, Google Compute Engine, Google Cloud Storage, Google BigQuery, and Google Cloud SQL.

A posting on GigaOM argues that an important role of IT departments in the future will be the integration of app ecosystems, and that this brings with it integration requirements from the cloud: identity, authentication, and other security and compliance-related functionality, troubleshooting of app ecosystems, and integration of new applications. I completely agree that one of the major benefits of the cloud is that we can finally have integrated app ecosystems beyond what clairvoyant app developers have imagined, and our ongoing research efforts on the SAFE application development environment are steps in this direction.
GigaOM also has some interesting stats about the growth of Hadoop.

Thursday, November 8, 2012

Big Data News Updates

A few weeks ago, Pentaho obtained Series C funding of $23 million. They have a nice 5-Minute Marketing Video available that explains their product. Their workbench has four aspects:
  • Data sources: They connect to a variety of data sources and have a data integration platform that can perform joins and column mappings across data sources; that is, it seems that one can create SPJ queries across data sources.
  • Reports: Once you have created an integrated data source, you can simply create a report by dragging and dropping columns and adding filters. It seems that the expressive power is equal to a SELECT-FROM-WHERE query over the integrated data source.
  • Analysis: Capabilities seem to be a subset of Excel with some visual OLAP functionality like in Excel PowerPivot.
  • Dashboard: This enables the creation of a panel of various linked reports and analyses, including mapping functionality.
It seems that their preferred interaction pattern is for a user to load the relevant subset of their data into main memory and then interact with it. They also seem to have the capability of doing the same analysis at a truly large scale over Hadoop. Their memory scale-out story is based on a distributed caching layer such as Infinispan/JBoss Data Grid or Memcached.

A competitor in this space is Jaspersoft, which I will discuss in a later post.

And for the crazy ones, an old Apple Ad.

Wednesday, October 31, 2012

Activities Around Data Services

In the last few days, there has been quite some news in the data services space:
  • ScaleBase announced that it received $10.5M of funding. Their product is what they call the "ScaleBase Data Traffic Manager." According to their website, they do the following:
    • They create a layer between any application and the database tier (currently this means MySQL, although they seem to be working on Oracle and Microsoft SQL Server)
    • They have two scale-out mechanisms:
      • They automatically partition the data.
    • Their system is based on a shared-nothing architecture
  • GigaOM speculates that interesting new data services will be in the stack in the future, in particular high-end analytics and more support for enterprise applications.
And Big Data is supposed to drive $232 billion in IT spending through 2016 according to a Gartner report.

Thursday, February 10, 2011

CIDR 2011

This year's CIDR was exciting. As might be expected, there was a clear focus on cloud technologies in the program, and cloud middleware and infrastructure systems had a strong offering in particular. Changes to the memory hierarchy effected by Flash and Phase Change Memory (Flash's heir apparent) were also a subject of intense discussion.

Two specific instances of cloud middleware took a rather unusual (and perhaps even a little Matrixy) approach to the architecture of the underlying cloud. MIT's Crowdsourced Databases and Stanford's proposal for using humans to answer queries both attempt to build a crowdsource operator (an invocation of a service like Amazon's Mechanical Turk) into a traditional relational query optimizer. Aside from the obvious interface challenges, this operator introduces the potential for inaccuracies (cf. My Database Hates Me) and an actual financial cost into the query optimizer's cost model.

An aspect of cloud computation addressed by many papers was the idea of transactions in the cloud. SAP's Transactional Intent, Microsoft's Deuteronomy, Google's Megastore, and several other presentations throughout the conference noted the difficulties of programming distributed datastores without transactional support and presented suggestions for creating what amounts to transactional infrastructures for cloud programming.

On a related note, a paradigm for distributed programming that appeared throughout many of these papers (and also Saarland's OctopusDB) was that of a log-structured database engine. Rather than following the traditional approach of storing the primary copy of a datum in sorted order to take advantage of sequential scans, the primary copy is simply maintained in a log (in part, taking advantage of the support for fast random access in flash). Furthermore, by ensuring that the elements are sequenced in a canonical order, the log provides an effective synchronization abstraction.
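A minimal sketch of the log-structured idea follows. It is an in-memory toy under my own assumptions and is not the design of any of the systems mentioned above; the class and method names are hypothetical. The primary copy of every datum lives in an append-only log, an index supports random access into it, and a replica synchronizes by replaying the log in its canonical order.

```python
class LogStore:
    """Toy key-value store whose primary copy of the data is an append-only log."""

    def __init__(self):
        self._log = []    # append-only list of (key, value) records
        self._index = {}  # key -> position of the latest record for that key

    def put(self, key, value) -> int:
        """Append a record and return its sequence number in the log."""
        seq = len(self._log)
        self._log.append((key, value))
        self._index[key] = seq
        return seq

    def get(self, key):
        """Random access into the log through the index."""
        return self._log[self._index[key]][1]

    def records_since(self, seq: int) -> list:
        """Records at position seq and later, in canonical log order."""
        return self._log[seq:]


primary = LogStore()
primary.put("x", 1)
primary.put("y", 2)
primary.put("x", 3)  # an update is just another append

# A replica catches up by replaying the primary's log from the start.
replica = LogStore()
for key, value in primary.records_since(0):
    replica.put(key, value)

print(primary.get("x"), replica.get("x"))
```

Because every replica applies the same records in the same order, agreeing on a position in the log is enough to agree on the state of the store.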

Several presentations, such as MIT's Relational Cloud and Duke's Starfish, made efforts towards a more generic cloud infrastructure, reducing the effort required to deploy, maintain, and tune a large-scale data-processing system.

Microsoft had a strong hardware-layer offering this year, presenting several papers on Flash/PC memory-based algorithms. They were joined in architectures for Flash memory by a paper out of ITUC/INRIA.

Another idea was present, subtly appearing in a large number of papers: interactive semistructured queries. Instantiations of this idea ranged from interactive question-suggestion interfaces like MPI's IQ and Duke's Citizen Journalism, to typeahead suggestions for queries, forms, etc. like Tsinghua's DBEase, to LAWA's temporal queries over the Wayback Machine, to spreadsheet-style relational database engines like MIT's schema-independent DBUI. These projects each attempt to provide an environment for non-technical users to construct queries. In each case, this ends up taking the form of an interactive session, where users refine a query by interactively querying the database schema. DBEase in particular has a pretty snazzy set of demos that I encourage you to check out.

Yet another hot topic this CIDR was data provenance. A slew of data provenance gathering systems for debugging and data validation were presented by Yahoo, Stanford, UPenn, and others. Of particular note, the UPenn paper raises an interesting challenge in data provenance: privacy. Exporting the provenance information of a tuple leaks information about the data that went into the tuple. How can we measure, and more importantly limit, the exposure of sensitive information without eliminating the usefulness of the provenance information?

An entirely new branch of research to me is computational activism. Berkeley's Data in the First Mile and Duke's Computational Journalism both espouse the need for building good task-specific UIs (and the corresponding computational backends) for use in (respectively) third-world countries and journalism (i.e., fact checking, pattern/outlier discovery, and claim monitoring).

Several other interesting papers branched off into entirely unique directions. Berkeley's CALM quantifies the situations where synchronization primitives are required in a distributed program and provides programming language support for distributed programs along the lines of Evita Raced. A vision paper out of EPFL called for hybrid relational+HDFS database storage architectures, where the curation of flat data files is done on a pay-as-you-go basis: as data is extracted from the data files for use in queries, the resulting tables are stored and indexed for future use. A project out of Microsoft is attempting to unify database access control mechanisms with privacy control mechanisms. Saarland University's OctopusDB is a database engine that attempts to be one-size-fits-all by making a distinction between the conceptual act of storing data and the physical representation of that data on a storage medium.

Finally (and most importantly ;) ), Yanif Ahmad presented DBToaster... The one database compiler to rule them all.

Monday, October 18, 2010

Why and Where Provenance

At DB Breakfast on Thursday October 7th, we continued our exploration of data provenance by reading the highly-cited paper:

Peter Buneman, Sanjeev Khanna, Wang Chiew Tan. Why and Where: A Characterization of Data Provenance. In ICDT, 2001.

This paper looks at the problem of determining the provenance of a query answer, i.e., what data in the database "contributes to" the resulting answer. One of the insights of the paper is that the concept of provenance profoundly depends on what one means by "contributes to." Two notions of provenance, where provenance and why provenance, are introduced and shown to have very different behavior.

The distinction between why and where provenance is best seen with an example: Suppose ("Joe", 1234) is an answer to this query.

SELECT name, telephone
FROM employee, dept
WHERE employee.dno = dept.dno AND = "Computer Science"

The where provenance of 1234 is simply the corresponding phone number in Joe's record in the employee relation. The why provenance includes not only Joe's record in employee, but also the Computer Science record in dept because without that record, Joe's record would not be included in the result.
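The distinction can be made concrete with a small sketch that evaluates the join by hand and records both kinds of provenance for each answer. It is a simplification for illustration and not the paper's formal definitions. The sample rows, the function name, and in particular the dept column name dname are assumptions, since the column being compared to "Computer Science" is not shown in the query above.

```python
employee = [
    {"name": "Joe", "telephone": 1234, "dno": 10},
    {"name": "Ann", "telephone": 5678, "dno": 20},
]
dept = [
    {"dno": 10, "dname": "Computer Science"},  # dname is an assumed column name
    {"dno": 20, "dname": "Mathematics"},
]


def query_with_provenance(employee, dept, dept_name):
    """Evaluate the join and annotate each answer with why and where provenance."""
    results = []
    for i, e in enumerate(employee):
        for j, d in enumerate(dept):
            if e["dno"] == d["dno"] and d["dname"] == dept_name:
                results.append({
                    "answer": (e["name"], e["telephone"]),
                    # Why: every source row this derivation of the answer needed.
                    "why": {("employee", i), ("dept", j)},
                    # Where: the source cell each output value was copied from.
                    "where": {
                        "name": ("employee", i, "name"),
                        "telephone": ("employee", i, "telephone"),
                    },
                })
    return results


for row in query_with_provenance(employee, dept, "Computer Science"):
    print(row["answer"], sorted(row["why"]), row["where"]["telephone"])
```

The where provenance of the telephone value points at a single cell in employee, while the why provenance also contains the dept row, matching the description above.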

For why provenance, the paper gives a precise characterization based on query *syntax.* Informally, a tuple in the database is part of the why provenance if it is used in some minimal derivation of the answer tuple (the qualifications "some" and "minimal" are important). This notion of provenance has nice properties, such as invariance to query rewriting.

For where provenance, the intuition guiding the above approach appears to break down. The authors show examples where two queries are equivalent yet exhibit different where provenance, and they suggest that a syntactic characterization may fail to fully capture where provenance.

Despite the challenges with where provenance, it appears as though subsequent work has developed approaches for where provenance. How were these challenges addressed?

In addition, the why provenance characterization is for SPJU queries only. Extending to include negation and aggregation seems important but quite challenging: the provenance of a tuple may include the entire database! Such an answer, while technically correct, may not be useful to the user. Is there a reasonable notion of weighted provenance, where some input tuples have more influence on the query answer than others?

In addition to where and why provenance, what other kinds of provenance might be useful?