Saturday, November 17, 2012

Big Data News over the last week

The last week was really busy with the SIGMOD 2013 deadline, and I did not get to post much, so I will catch up over the weekend.

Dell acquires Gale Technologies, a company that automates and manages physical and virtual resources. Gale provides templated provisioning and management through its GaleForce platform, which offers automation, self-service, and resource scheduling; it manages everything from compute, network, storage, and cloud resources (including Amazon and Rackspace) to edge devices. This space is hot: Cisco also recently announced that it would acquire Cloupia, which has similar capabilities: a resource lifecycle manager, an operations automation center, a capacity manager, and so on. Management of resources and automation of in-house, cloud, and hybrid deployments is one of those boring "enterprise" tasks that one hears little about in the press, but it is crucially important in the enterprise.

Interesting speculations about Steve Sinofsky's next move, including Amazon and Tableau as possible Seattle-based destinations.

A nice sampling of Big Data and Enterprise Security. This reminds me of the first paper that I read in this space when I was still a graduate student and just starting in data mining: Wenke Lee, Salvatore J. Stolfo, Kui W. Mok: Mining Audit Data to Build Intrusion Detection Models. KDD 1998: 66-72. It contained an early version of item constraints, which they called axis attributes, and also a heuristic algorithm for finding patterns with low support.

The amount of storage used in Microsoft's SkyDrive has doubled over the last six months. They also announced selective sync. This comes at the same time as Dropbox is running its space race for universities (Cornell is currently Number 4!) with the goal of signing up (and locking in) students. But Dropbox is also innovating fast, having just announced Dropbox Chooser, an easy way to integrate Dropbox into web apps with a small piece of JavaScript.

Monday, November 12, 2012

Recent Big Data News

Elasticsearch has raised $10M in funding. Elasticsearch is an open-source RESTful enterprise search engine built on Lucene with quite a few nice features:
  • Distribution and high availability: Indexes are split into shards, and each shard can be replicated.
  • Multi-tenancy: support for more than one index, each with its own configuration.
  • Their update API seems to have several levels of consistency.
  • They support all the features that are now standard such as facets, different weighting of various parts of documents, different mappers, etc.
  • And its API is very simple indeed.
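To illustrate that last point, here is a minimal sketch of what talking to Elasticsearch's REST API looks like. The helper functions, index name, and document are my own invention; all that is assumed from Elasticsearch is its classic JSON-over-HTTP shape (PUT a document to `/index/type/id`, search with a `match` query against `/index/_search`). The sketch only builds the requests rather than sending them:

```python
import json

ES_URL = "http://localhost:9200"  # assumed address of a local Elasticsearch node

def index_request(index, doc_type, doc_id, document):
    """Build the URL and JSON body for indexing a document (PUT /index/type/id)."""
    url = f"{ES_URL}/{index}/{doc_type}/{doc_id}"
    return url, json.dumps(document)

def match_query(field, text):
    """Build the JSON body of a simple full-text match query for /index/_search."""
    return json.dumps({"query": {"match": {field: text}}})

# Index a blog post, then query it back by title.
url, body = index_request("articles", "post", 1, {"title": "Big Data News"})
query = match_query("title", "big data")
```

Sending these with any HTTP client is all it takes, which is exactly why the API feels so simple.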

Recollect is a new startup that enables archiving, downloading, and searching of your pictures and activities (tweets and check-ins for now). Their UI has an interesting timeline look.

Big Data News for the weekend of November 10, 2012

Google improved its Google Cloud SQL MySQL database offering. They now have six tiers with exponentially increasing sizing and pricing, starting with D1 (500MB RAM, 1GB storage, 850k I/Os per day at $1.46/day or 10 cents/hour) up to G32 (16GB RAM, 10GB storage, 32M I/Os per day at $46.84 per day or $3.08 per hour).
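A quick back-of-the-envelope calculation on those two billing modes: using only the per-day and per-hour figures quoted above for the smallest and largest tiers, the flat daily price becomes the better deal once an instance runs more than roughly 15 hours a day.

```python
# Breakeven between per-hour and flat per-day billing, using the
# published prices for the D1 and G32 tiers quoted above.
TIERS = {
    "D1":  {"per_day": 1.46,  "per_hour": 0.10},
    "G32": {"per_day": 46.84, "per_hour": 3.08},
}

def breakeven_hours(tier):
    """Hours of daily use beyond which flat per-day billing is cheaper."""
    t = TIERS[tier]
    return t["per_day"] / t["per_hour"]

print(round(breakeven_hours("D1"), 1))   # ~14.6 hours/day for D1
print(round(breakeven_hours("G32"), 1))  # ~15.2 hours/day for G32
```

So hourly billing is attractive for development or bursty workloads, while an always-on production database should take the daily rate.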

The closest to this in the Amazon cloud is Amazon Relational Database Service (RDS); the other services that Amazon offers are Amazon DynamoDB (high-performance NoSQL) and SimpleDB (managed NoSQL for smaller datasets). The whole Google Cloud Platform is getting more and more pieces with Google App Engine, Google Compute Engine, Google Cloud Storage, Google BigQuery, and Google Cloud SQL.

A posting on GigaOM argues that an important future role of IT departments will be the integration of app ecosystems, and that this brings with it integration requirements from the cloud: identity, authentication, and other security- and compliance-related functionality; troubleshooting app ecosystems; and integrating new applications. I completely agree that one of the major benefits of the cloud is that we can finally have integrated app ecosystems beyond what clairvoyant app developers have imagined, and our ongoing research efforts on the SAFE application development environment are steps in this direction.
GigaOM also has some interesting stats about the growth of Hadoop.

Thursday, November 8, 2012

Big Data News Updates

Pentaho obtained $23 million in Series C funding a few weeks ago. They have a nice 5-Minute Marketing Video available that explains their product. Their workbench has four aspects:
  • Data sources: They connect to a variety of data sources and have a data integration platform that can perform joins across data sources and column mappings; i.e., it seems that one can create SPJ (select-project-join) queries across data sources.
  • Reports: Once you have created an integrated data source, you can simply create a report by dragging and dropping columns and adding filters. It seems that the expressive power is equal to a SELECT-FROM-WHERE query over the integrated data source.
  • Analysis: Capabilities seem to be a subset of Excel with some visual OLAP functionality like in Excel PowerPivot.
  • Dashboard: This enables the creation of a panel of various linked reports and analyses, including mapping functionality.
It seems that their preferred interaction pattern is for a user to load the relevant subset of their data into main memory and then interact with it. They also seem to have the capability of doing the same analysis at a truly large scale over Hadoop. Their memory scale-out story is based on a distributed caching layer such as Infinispan/JBoss Data Grid or Memcached.
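To make the SPJ shape mentioned above concrete, here is a hypothetical sketch of a select-project-join over two toy in-memory "data sources"; the source names, columns, and `spj` helper are all invented for illustration and are not Pentaho's API:

```python
# Two toy "data sources" (invented for illustration).
customers = [
    {"cust_id": 1, "name": "Acme", "region": "EU"},
    {"cust_id": 2, "name": "Globex", "region": "US"},
]
orders = [
    {"order_id": 10, "cust_id": 1, "total": 250.0},
    {"order_id": 11, "cust_id": 2, "total": 90.0},
]

def spj(left, right, join_key, predicate, columns):
    """Join two sources on join_key, filter rows (select), and project columns."""
    joined = [{**l, **r} for l in left for r in right if l[join_key] == r[join_key]]
    return [{c: row[c] for c in columns} for row in joined if predicate(row)]

# "Report": customers with an order over 100, projected to name and total.
report = spj(customers, orders, "cust_id",
             lambda row: row["total"] > 100, ["name", "total"])
```

Dragging columns and adding filters in the report designer corresponds to choosing `columns` and `predicate` here, which is why SELECT-FROM-WHERE seems to capture its expressive power.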

A competitor in this space is Jaspersoft, which I will discuss in a later post.

And for the crazy ones, an old Apple Ad.