Fail through the Cracks: An Analysis of Cross-System Interaction Failures in Modern Cloud Systems

Abstract

Modern cloud systems are typically orchestrated by interacting, interdependent (sub-)systems, each specializing in important features and services (e.g., data processing, storage, and resource management). In reality, many failures of cloud systems manifest through interactions across system boundaries. Hence, cloud system reliability depends not only on the reliability of each individual system, but also on the interactions between them. Unfortunately, such cross-system interaction failures are under-investigated. In this paper, we present the first analysis of 120 cross-system interaction (CSI) failures involving seven widely co-deployed, commonly interacting systems. We focus on understanding the discrepancies between the interacting systems as the root causes of CSI failures: CSI failures are not caused by traditional software defects in an individual system and thus cannot be analyzed in one isolated system. Our study reveals more than a dozen informative findings, which open up new research directions in combating CSI failures. We advocate for cross-system testing and verification and demonstrate that such efforts can effectively reveal potential discrepancies.