The Incident Was Coming From Inside The House

Or: the day a cluster tried to heal itself to death.

The Silence

It started with Plex. Then Paperless. Then Pocket ID. Uptime Kuma started quietly lighting up like a Christmas tree. Services on one node had stopped answering, but the node itself was... fine? A ping to one of the Proxmox nodes crawled back after a few seconds. SSH eventually responded. ha-manager status said everything was marvellous.

Everything was not marvellous.

The First Fire: Storage, Doing Nothing At Terrifying Speed

Load average on the sick node: 102. CPU: 99.3% iowait, 0.7% system, and essentially nothing else. Six hundred sleeping processes, several stuck in folio_wait_bit_common - kernel-speak for "waiting for the page cache to finish the thing it will never finish." The NFS share backing every VM's disk had decided it was tired. The node had decided it was tired too, but politely - just enough to keep replying to corosync heartbeats while doing absolutely no useful work.
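For anyone who wants to spot this state on their own cluster, the check is simply "which tasks are in uninterruptible sleep, and what are they blocked on". A rough sketch that reads /proc directly - illustrative only, not tooling we actually had in place at the time:

    #!/usr/bin/env python3
    # List tasks stuck in uninterruptible sleep (D-state) and the kernel
    # function they are blocked in. Reads /proc directly; nothing here is
    # Proxmox-specific, and the script is a sketch rather than real tooling.
    import os

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
            # /proc/<pid>/stat looks like "pid (comm) state ..."; comm may
            # contain spaces, so split around the last ") " instead.
            before, _, after = raw.rpartition(") ")
            comm = before.split("(", 1)[1]
            state = after.split()[0]
            if state != "D":              # D = uninterruptible sleep
                continue
            with open(f"/proc/{pid}/wchan") as f:
                wchan = f.read().strip() or "?"
            print(f"{pid:>7}  {comm:<24}  blocked in {wchan}")
        except (OSError, IndexError):
            continue                      # process exited mid-read, or no access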

Honest bit: we never fully nailed down what kicked it into the stall. The symptoms were textbook NFS backpressure, but our telemetry at the time was not granular enough to finger the exact trigger. This will matter later.

This is the failure mode Proxmox HA was never designed to catch. HA reacts to node death: quorum loss, power cut, cable chewed. A node that is just slow stays in the cluster, votes in elections, looks healthy from the outside, and serves zero packets to anyone who actually wants to use it.

In other words: HA catches corpses, not zombies.

A SysRq-b - the "reboot right now, do not bother syncing" key - sent through a jumpbox unstuck the node. Back up in ninety seconds. One fire out. Which would be the end of the story, except a second one had already started while we were not looking.

The Second Fire: Auto-Remediation, Speedrunning Self-Destruction

While the node was sulking, seven services on it started looking "down" to Uptime Kuma. Each one fired a webhook at the AI assistant running on another node in the cluster - the one whose job it is to investigate alerts and restart the relevant VM. Seven webhooks, seven AI agents, all spawned at once.

Each agent loaded context, started SSH'ing into unreachable hosts, waited for timeouts, retried. None of them were allowed to actually succeed - the hosts they needed to reach were all on the sick node, which was still busy sulking.

All seven agents also happened to live on an LXC whose root filesystem was an NFS-backed .raw file, mounted via a kernel loop device. Seven parallel AI agents hammering that loop device did exactly what seven parallel AI agents on a loop device over NFS do. One kernel worker (kworker/u48:0+loop10) sitting at 25% CPU in D-state is a particularly bleak flavour of sadness.

The assistant's own node went from "quiet" to "load average 32" while it bravely tried to rescue services on a node that had already rescued itself. If this were a war film, the assistant would be the one heroically calling in an airstrike on its own position.

Two failures stacked. The storage stall came first. The assistant's spiral came second, feeding on the first. Each would have been survivable in isolation. Together they were the rough outline of a genuinely bad afternoon.

What Changed

A few things.

High Availability, but for real this time. All thirty-five running guests enrolled in HA. Migrations now go over the 5GbE USB adapter (a wild sentence to write) with type=insecure and a polite 512 MiB/s cap so migrations do not drown out VM traffic. shutdown_policy=migrate evacuates a node on clean reboot. The Plex container's config lost a single line (dev0: /dev/dri/card0) so it can now run on any node instead of being chained to the one with the exact matching DRM device index.

Rate-limiting, for the assistant. The webhook receiver got two new knobs: a hard cap of two concurrent remediation agents (queue of three, then drop), and a mass-outage detector. Five alerts in sixty seconds means something bigger than one service is wrong, and further remediation agents would just pile on. In that case: one consolidated DM, ten minutes of suppression, go look at the infrastructure, not the symptoms.
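Neither knob is clever, which is rather the point. A sketch of the shape of it - the names, numbers, and callbacks below are illustrative, not the actual receiver code:

    import threading, time
    from collections import deque

    MAX_CONCURRENT = 2     # remediation agents allowed to run at once
    MAX_QUEUED = 3         # waiting alerts beyond this get dropped
    BURST_THRESHOLD = 5    # this many alerts inside BURST_WINDOW...
    BURST_WINDOW = 60      # ...means mass outage, not one sick service
    SUPPRESS_FOR = 600     # seconds of silence after the consolidated DM

    slots = threading.Semaphore(MAX_CONCURRENT + MAX_QUEUED)  # running + queued
    running = threading.Semaphore(MAX_CONCURRENT)             # actually working
    recent = deque()                                          # alert timestamps
    suppressed_until = 0.0

    def handle_alert(alert, spawn_agent, send_dm):
        """Decide what a single incoming webhook is allowed to do."""
        global suppressed_until
        now = time.monotonic()

        # Mass-outage detection: lots of alerts in a short window means the
        # problem is the infrastructure, so stop treating the symptoms.
        recent.append(now)
        while recent and now - recent[0] > BURST_WINDOW:
            recent.popleft()
        if now < suppressed_until:
            return "suppressed"
        if len(recent) >= BURST_THRESHOLD:
            suppressed_until = now + SUPPRESS_FOR
            send_dm(f"{len(recent)} alerts in {BURST_WINDOW}s - this looks like "
                    "an infrastructure problem, not a service one. Pausing.")
            return "mass-outage"

        # Concurrency cap: two agents working, up to three waiting, rest dropped.
        if not slots.acquire(blocking=False):
            return "dropped"

        def run():
            try:
                with running:      # blocks until one of the two run slots frees
                    spawn_agent(alert)
            finally:
                slots.release()

        threading.Thread(target=run, daemon=True).start()
        return "accepted"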

Alerts that see zombies. A Grafana pipeline now watches for the slow-node failure class HA cannot:

pvestatd + node_exporter -> InfluxDB + Prometheus
                                     |
                             Grafana alert rule
                                     |
                   POST /notify-alert to openclaw
                                     |
                               WhatsApp DM

Nine rules live, covering: per-node iowait rate, PSI IO and memory pressure, CPU temperature, disk free, PBS datastore fill, per-guest IO pressure, and backup failures. The threshold for iowait is thirty percent sustained over three minutes. The original stall would have fired this alert long before the assistant started its improvised self-destruction routine, and we would have spotted it at a stage where "investigate" was a valid response.
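The iowait rule itself is nothing exotic: the fraction of CPU time spent in iowait across the window, compared against the threshold. Here is the same arithmetic in isolation, with made-up counter samples for an eight-core node taken three minutes apart (the real check lives in a Grafana alert rule over the metrics, not a script):

    IOWAIT_THRESHOLD = 0.30   # thirty percent, sustained over the window

    def iowait_fraction(start, end):
        # Fraction of CPU time spent in iowait between two counter samples.
        # Each sample maps CPU mode -> cumulative seconds, the way
        # node_exporter's node_cpu_seconds_total counters do, summed over cores.
        deltas = {mode: end[mode] - start[mode] for mode in end}
        total = sum(deltas.values())
        return deltas.get("iowait", 0.0) / total if total else 0.0

    # Hypothetical samples, three minutes apart, eight cores (1440 CPU-seconds):
    t0 = {"user": 1000.0, "system": 400.0, "idle": 90000.0, "iowait": 250.0}
    t1 = {"user": 1010.0, "system": 404.0, "idle": 90990.0, "iowait": 686.0}

    frac = iowait_fraction(t0, t1)
    if frac > IOWAIT_THRESHOLD:
        print(f"iowait at {frac:.0%} - node is alive but drowning in IO")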

The Lesson, Stuck on a Post-It

Two, really.

First: HA reacts to missing heartbeats, not missing bandwidth. If your monitoring cannot tell the difference between "the node is dead" and "the node is profoundly unwell," you only have half a monitoring system. Metric-based alerts on iowait, load, and pressure fill that gap. This is also why we still do not know exactly what broke that node: we were blind to the thing we most needed to see.

Second: an auto-remediation system is part of your infrastructure. If it can be triggered by the failure it is supposed to fix, and if the thing it runs on shares a fate with the thing it is trying to heal, you have built a Rube Goldberg machine that ends in your pager.

The assistant is still running. It is still helpful. It now just knows how to shut up when the whole room is on fire.