Sunday, September 26, 2010

Is the Cloud Ready for Scientific Computing?

Last Thursday, at the DB breakfast at Cornell, we asked ourselves whether the cloud was ready for HPC. We discussed a paper from this year's VLDB conference by Schad, Dittrich, and Quiané-Ruiz reporting unexpectedly high variance in Amazon EC2's performance. The paper compares instance types across different availability zones using a benchmark that measures instance startup time, CPU, memory speed, disk I/O, network bandwidth, and S3 access times. The main lesson from analyzing one month of data is that instances allocated to different underlying physical system types and availability zones can show large variability in CPU, disk I/O, and network performance. In fact, similar observations have been made by other studies and benchmarks (such as this and this). Given these results, how tightly will cloud providers ever be able to specify and guarantee performance-based SLAs?
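To make the notion of variability concrete, here is a minimal Python sketch of the general idea, not the paper's actual benchmark harness: time a fixed-size workload repeatedly and summarize the spread with the coefficient of variation (standard deviation over mean). The in-memory sort used as the workload and the number of runs are illustrative assumptions only.

```python
import random
import statistics
import time

def cpu_microbenchmark(n=500_000):
    """Time a fixed-size in-memory sort as a stand-in for a CPU micro-benchmark."""
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()
    data.sort()
    return time.perf_counter() - start

def coefficient_of_variation(samples):
    """COV = stdev / mean; a higher value means less predictable performance."""
    return statistics.stdev(samples) / statistics.mean(samples)

if __name__ == "__main__":
    runs = [cpu_microbenchmark() for _ in range(30)]
    print("mean = {:.4f}s, COV = {:.2%}".format(
        statistics.mean(runs), coefficient_of_variation(runs)))
```

Running something like this on many instances of the same nominal type, as the paper does with its own suite, is what exposes the spread across physical hosts and zones.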

As a result, members of the HPC community feel that the cloud may not be ready for their scientific applications, which tend to be network and memory bound (for example, take a look at this nice paper for some results). However, Amazon recently released a new instance type, the Cluster Compute Instance, and a benchmark run on 800 such instances was reported to rank within the Top500 list of supercomputers. Will this usher in a new era in which the HPC community runs its applications in the cloud?

The cloud democratizes access to resources: even researchers who do not have access to a supercomputer will be able to afford to rent hundreds of high-performance instances in the cloud and scale their simulations to unprecedented dimensions. I think this means that, for HPC applications too, "performance per dollar", and not only raw "performance" as traditionally measured, will be an important metric in the future. If you look at the sorting benchmark homepage, you will see two different categories of benchmarks: Daytona, where the sort code needs to be general purpose, and Indy, where the code only needs to sort the specific inputs defined by the benchmark. In addition, there is a benchmark that measures the amount of energy required to sort. Will we see similar developments in the measurement of supercomputing systems?
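As a back-of-the-envelope illustration of why the metric matters, here is a toy Python sketch that ranks two hypothetical configurations by GFLOPS per dollar-hour; the names, throughput figures, and prices are made up for illustration and are not measured results.

```python
# Toy ranking of hypothetical configurations by "performance per dollar"
# rather than raw performance; all names and numbers are made up.

configs = [
    # (name, sustained GFLOPS, price in dollars per hour) -- illustrative only
    ("commodity-cluster", 400.0, 5.0),
    ("hpc-cluster",       900.0, 15.0),
]

for name, gflops, price_per_hour in sorted(
        configs, key=lambda c: c[1] / c[2], reverse=True):
    print(f"{name}: {gflops:.0f} GFLOPS, "
          f"{gflops / price_per_hour:.1f} GFLOPS per dollar-hour")
```

The point of the exercise is simply that the fastest system is not necessarily the best value once rental cost enters the picture, which is exactly the shift in perspective the cloud encourages.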