An OS-Hypervisor Infrastructure for Automated OS Crash Diagnosis and Recovery in a Virtualized Environment

  • Joefon Jann IBM Thomas J. Watson Research Center
  • R. Sarma Burugula IBM Thomas J. Watson Research Center
  • Ching-Farn E. Wu IBM Thomas J. Watson Research Center
  • Kaoutar El Maghraoui IBM Thomas J. Watson Research Center

Abstract


Recovering from OS crashes has traditionally been done using reboot or checkpoint-restart mechanisms. Such techniques either fail to preserve the state before the crash happens or require modifications to applications. To eliminate these problems, we present a novel OS-hypervisor infrastructure for automated OS crash diagnosis and recovery in virtual servers. Our approach uses a small hidden OS-repair-image that is dynamically created from the healthy running OS instance. Upon an OS crash, the hypervisor automatically loads this repair-image to perform diagnosis and repair. The offending process is then quarantined, and the fixed OS automatically resumes running without a reboot. Our experimental evaluations demonstrated that it takes less than 3 seconds to recover from an OS crash. This approach can significantly reduce the downtime and maintenance costs in data centers. This is the first design and implementation of an OS-hypervisor combo capable of automatically resurrecting a crashed commercial server-OS.
Keywords: Computer crashes, Kernel, Maintenance engineering, Virtual machine monitors, Hardware, Registers, Data structures, Operating Systems, Reliability, Availability, Computer Crash, System Recovery
Published
24/10/2012
JANN, Joefon; BURUGULA, R. Sarma; WU, Ching-Farn E.; MAGHRAOUI, Kaoutar El. An OS-Hypervisor Infrastructure for Automated OS Crash Diagnosis and Recovery in a Virtualized Environment. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 24., 2012, New York/USA. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 195-202.