April 29, 2021 18:09
Something went wrong!
First of all: sorry. As you can imagine, this should not have happened. As part of further improving the stability of my systems, I am posting this incident report. It is a more in-depth analysis of the incident and should give you an overview of what went wrong and how. Maybe you can learn from it, or at least understand how such a thing could happen in the first place.
2021.04.29 17:40 - 2021.04.29 18:20
Which services got affected?
All services hosted on the servers “Luke” and “Zeus”.
- 17:40:11 Electrical fuse failed; UPS takeover
- 17:56:00 UPS reports low battery voltage; server begins emergency shutdown procedure
- 18:??:?? Emergency shutdown hangs
- 18:??:?? UPS power fails prematurely, causes server to crash
- 18:18:00 Power restored; server boots
- 18:20:00 Services coming back online
What went wrong?
The UPS failed far too early: it should have kept the server alive for at least an hour even under full load, but it failed after 20 minutes! In addition, the emergency shutdown hung because the service that saves the VM states took too long to stop. The cause is the growth in server performance over the past years: many more resources now need to be suspended than at the time the respective service units were written, so suspending took far longer than expected (~6 minutes).
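To make the timing problem concrete, here is a minimal sketch of a budget check. It assumes that suspending a VM mostly means writing its memory to disk at a fixed rate. The `VM` class, the VM names, the write rate and the budget are all hypothetical and are not measurements from this incident.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str        # hypothetical VM name
    ram_gib: float   # memory that has to be written to disk on suspend


def estimated_suspend_seconds(vms, write_gib_per_s):
    """Rough estimate: all VM memory is written sequentially at a fixed rate."""
    if write_gib_per_s <= 0:
        raise ValueError("write_gib_per_s must be positive")
    return sum(vm.ram_gib for vm in vms) / write_gib_per_s


def fits_in_budget(vms, write_gib_per_s, budget_s):
    """True if the estimated suspend time stays within the allowed stop time."""
    return estimated_suspend_seconds(vms, write_gib_per_s) <= budget_s


# Assumed example values, for illustration only.
vms = [VM("vm-a", 16), VM("vm-b", 32)]
print(estimated_suspend_seconds(vms, 0.5))   # 48 GiB at 0.5 GiB/s
print(fits_in_budget(vms, 0.5, budget_s=90))
```

A check like this could be rerun whenever VMs are added or resized, so that the budget assumed when the service units were written does not silently go stale.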
How to improve?
- Investigate UPS health and possibly schedule further maintenance windows.
- Perform real load tests. The self-tests of the UPS are fine, but they do not reflect a real incident with a longer period of power failure.
- Disabled the respective service unit until I have time to apply it to only a selected group of VMs.
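One possible way to pick the "selected group of VMs" from the last point is sketched below. It is only an illustration under assumptions: each VM has a hypothetical priority (lower number means more important), suspend time is estimated as memory size divided by an assumed disk write rate, and VMs that no longer fit into the time budget are shut down normally instead of suspended. The function name and all values are made up for the example.

```python
def select_for_suspend(vms, write_gib_per_s, budget_s):
    """Greedily pick VMs to suspend within a time budget.

    vms: list of (name, ram_gib, priority) tuples; a lower priority
    number means the VM is more important (hypothetical convention).
    Returns (suspend, shutdown) as two lists of VM names.
    """
    remaining = budget_s
    suspend, shutdown = [], []
    # Consider the most important VMs first.
    for name, ram_gib, _priority in sorted(vms, key=lambda v: v[2]):
        cost = ram_gib / write_gib_per_s  # estimated seconds to save state
        if cost <= remaining:
            suspend.append(name)
            remaining -= cost
        else:
            shutdown.append(name)
    return suspend, shutdown


# Assumed example values, for illustration only.
vms = [("db", 16, 1), ("web", 8, 2), ("build", 64, 3)]
suspend, shutdown = select_for_suspend(vms, write_gib_per_s=0.5, budget_s=60)
print("suspend:", suspend)
print("shutdown:", shutdown)
```

The greedy approach is deliberately simple: it never suspends a less important VM at the expense of a more important one, at the cost of not always filling the budget optimally.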