Tuesday, July 21, 2009

An Evaluation of Checkpoint Recovery for Massively Multiplayer Online Games

One problem that current Massively Multiplayer Online Games (MMOs) face is how to provide persistence for their virtual worlds. We have studied this problem in a recent paper accepted for publication at VLDB 2009.

MMOs use standard DBMS on the back-end to provide transactional guarantees for state updates. This decision is appropriate for updates that require full ACID guarantees, such as in-game financial transactions and item exchanges. The bulk of state updates in MMOs does not need, however, full ACID properties. For example, character movement comprises a large amount of the updates applied to a virtual world. As it turns out, game logic such as collision detection prevents character movement updates from generating any conflicts. We call these updates local updates.

The amount of local updates a game must process may exceed hundreds of thousands or millions per second. For performance, these updates are applied in main memory only, and game developers hand-code persistence logic for durability.

In our paper, we have experimentally evaluated the performance of main-memory checkpoint recovery techniques for MMOs. Our study shows that these techniques are a viable alternative to provide durability for local updates. Notwithstanding, not all techniques we have studied are equally suited for MMOs. MMOs have stringent latency requirements, ruling out methods that introduce long pauses in the game. Here is a summary of our recommendations to game developers:

  1. Methods that perform copy on update of dirty objects only have clear latency advantages over methods based on eager copies of the game state. They avoid latency peaks by spreading their overhead over a number of game ticks.

  2. When update rates are so dramatically large and skewed that the entire game state gets updated in a single tick of the game, little can be done to reduce the latency impact of the checkpoint algorithms. In this extreme situation, an algorithm based on an eager copy of the entire game state introduces the minimum pause in the game.

  3. Methods based on a double-backup organization either match or outperform log-based alternatives in terms of recovery time.

  4. The best method for a wide range of parameters is copy on update combined with a double backup. This method outperforms alternatives by up to a factor five in latency without any degradation in recovery time.

Our evaluation is based on a detailed simulation model of the checkpoint methods, available for download here. This simulation model has been validated against a real implementation of a relevant subset of the techniques. We plan to make our implementation also available for download. Keep tuned for updates in the near future!

Hope to see you at our talk in Lyon!

No comments:

Post a Comment