Modeling Metastability

Author

Andreas Haeberlen (University of Pennsylvania and Roblox)

Venue

HotOS 2025

Abstract

Recently, there has been increasing concern about a new failure mode in data-center systems. When there is an external shock, such as a sudden load spike or the failure of some machines, a system will sometimes respond with reduced throughput. In contrast to a traditional overload situation, however, the throughput does not recover once the external shock disappears; it remains permanently degraded. This phenomenon has been called a metastable failure. In this paper, we sketch a simple model that could help to explain how and why metastability arises. We also show how our model can be used to predict the presence or absence of metastable states in a given system.