My Patching Dashboard Grew a Brain (And Now It Checks Its Own Homework)

A while back I wrote about teaching my servers to reboot themselves. That post ended with a little FastAPI thing I'd built on top of the Ansible playbooks — host inventory, a terminal overlay streaming output over SSE, some big buttons labelled UPDATE and REBOOT. I called it Herdmon. Mostly, it worked.

"Worked" has always done a lot of heavy lifting in that sentence.

The Silent Success Problem

Ansible's output has the emotional range of a passive-aggressive coworker. You kick off a dist-upgrade across a bunch of containers, watch the recap scroll past:

PLAY RECAP *****
host-a : ok=4 changed=0 unreachable=0 failed=0
host-b : ok=4 changed=0 unreachable=0 failed=0
host-c : ok=4 changed=0 unreachable=0 failed=0

…and shrug. Did it update? Did it skip? Did a dependency quietly pin an old kernel? Did the service come back up? You can't tell from a recap. You have to go look. And if you don't go look, you find out three days later when something breaks and you realise the upgrade you thought ran hasn't actually touched that container in two months.

I hit this exact failure mode the week I was demoing the dashboard to myself. Fired an update at every pending host. Watched the stream. Saw changed=0 on every line and thought "great, nothing to do." Then the pending count still claimed twenty-something hosts with updates.

The recap lied. The upgrade had happened — in a previous run that got SIGTERM'd mid-stream. The re-run had nothing to do because the apt work was already applied. The inventory count was stale because the agent hadn't phoned home since the upgrade. Everything was fine. It looked broken.

If the dashboard is going to claim a patch ran, it needs to also claim it actually worked.

Enter the AI Verifier

So I wired the dashboard to an AI agent I'd already built for other homelab plumbing. The agent has SSH access (read-only), can query the cluster manager, and can ask the monitoring stack whether a given service is alive. All of this was sitting there doing other jobs.

After every patch job, Herdmon fires the last 200 lines of Ansible output at the agent along with a structured brief: "here's the playbook, here are the targets, here's the return code, go check them." The agent SSHes into every target in parallel, runs systemctl --failed, checks the mount table, and asks the monitoring stack whether the service is still up. Then it returns a verdict.
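
The shape of that brief is the interesting bit: you only ever ship the tail of the output, so the payload stays small no matter how long the run was. A minimal sketch, with hypothetical names (this isn't lifted from Herdmon):

```python
from collections import deque


def build_brief(playbook: str, targets: list[str], rc: int,
                output_lines: list[str], tail: int = 200) -> dict:
    """Assemble the structured brief sent to the verifier agent.

    Only the last `tail` lines of Ansible output are included, so a
    multi-hour run produces the same payload size as a short one.
    """
    return {
        "playbook": playbook,
        "targets": targets,
        "return_code": rc,
        # deque(maxlen=...) keeps just the last `tail` lines as they stream in
        "output_tail": list(deque(output_lines, maxlen=tail)),
        "instruction": "verify these hosts post-patch",
    }
```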

Verdicts come in grades:

  • OK — everything's green.
  • WARN — minor unrelated issues (one container has a dead unit that has nothing to do with this patch).
  • DEGRADED — a patched host is in a bad-but-not-broken state (kernel update applied, reboot pending).
  • FAIL — something critical broke.
  • TIMEOUT — verification itself didn't finish in time. Crucially not the same as FAIL.
  • INCONCLUSIVE — checks ran but couldn't decide.
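
In code, the grades reduce to a small enum plus notification tiers; the point of keeping TIMEOUT and FAIL separate is that they land in different tiers. A sketch of one way to encode that (names and tiers are illustrative, not Herdmon's actual ones):

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    WARN = "warn"
    DEGRADED = "degraded"
    FAIL = "fail"
    TIMEOUT = "timeout"            # verification didn't finish; not a failure claim
    INCONCLUSIVE = "inconclusive"  # checks ran but couldn't decide


# Only FAIL demands immediate attention. TIMEOUT, INCONCLUSIVE and
# DEGRADED mean "go look", which is a quieter notification tier.
PAGE_NOW = {Verdict.FAIL}
GO_LOOK = {Verdict.TIMEOUT, Verdict.INCONCLUSIVE, Verdict.DEGRADED}
```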

The UI renders it as a colour-coded panel below the terminal. The agent also DMs me the summary with concerns, so I know whether to go look.

Within an hour of turning it on, it caught a real issue I would have missed: an old container had a broken logrotate service, silently failing every run for weeks. The patch didn't cause it, but the AI noticed while checking if the patch had broken anything. One of those "wait, while you're in there…" wins.

The Closed Loop

The AI's opinion is useful. The AI taking action is suspicious. But for a narrow set of known-safe remediations, I wanted the loop to close.

Specifically: NFS mounts on containers after a reboot. Classic failure mode — the container comes back up before the NFS server is happy, the mount unit wedges, and the service starts up staring at an empty directory. The fix is always the same five commands: daemon-reload, reset-failed, mount -a, findmnt to verify, kick the service. I had a manual procedure; I wrote it as an idempotent playbook.

Now when the AI spots a missing NFS mount in its checks, it can emit a structured action in its response:

{"type": "remediate-nfs", "targets": ["host"], "reason": "..."}

The webhook receiver POSTs that back to Herdmon as a new job. A 30-minute cooldown per (action-type, target) keeps it from looping. A supported-action allowlist keeps it from doing anything exotic. The dashboard shows the auto-fired job as a link inside the verdict panel so I can see what got triggered, when, and why.

Watching it self-heal is deeply satisfying and slightly unsettling.

HA Got Opinions

The rolling-restart from the old post worked fine when nothing was HA-managed. Then I enabled HA on every guest. Suddenly my playbook was fighting the cluster manager: I'd migrate a container, HA would migrate it back, I'd shut it down, HA would start it somewhere else. Standoffs.

The fix was to stop doing any of that. The playbook now calls ha-manager crm-command node-maintenance enable <node> and lets HA drain according to its own rules. I just poll until the node is empty. After the reboot I call the disable counterpart and HA rebalances. The playbook does the apt work and the reboot in the middle; HA does all the guest placement. Serial one node at a time, same as before, minus about three hundred lines of "migrate this VM, migrate that container, maybe force-stop this one" code I no longer maintain.
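
The whole drain step collapses to "enable maintenance, poll until empty". A sketch of that loop, assuming a Proxmox-style pvesh endpoint and checking only QEMU guests for brevity (containers would need the same poll against the /lxc path; the command runner is injectable so this can be exercised without a cluster):

```python
import json
import subprocess
import time


def drain_node(node: str, run=None, poll_interval: float = 10.0,
               timeout: float = 1800.0) -> None:
    """Enable HA maintenance on a node, then poll until HA has drained it."""
    if run is None:
        # Default runner: execute the command and hand back its stdout.
        run = lambda cmd: subprocess.run(
            cmd, shell=True, check=True, capture_output=True, text=True
        ).stdout
    run(f"ha-manager crm-command node-maintenance enable {node}")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # An empty guest list means HA has migrated everything away.
        guests = json.loads(run(f"pvesh get /nodes/{node}/qemu --output-format json"))
        if not guests:
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"{node} still has guests after {timeout}s")
```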

Less code, more correctness. The usual trade.

Stop Messaging Me For Things I Asked For

The moment I started actually using the verify loop for real patches, I realised something: the monitoring stack doesn't know the patch is happening. So every time a container moved nodes, the monitor flapped, and my phone buzzed with a WhatsApp. During a full rolling-restart that's forty-plus WhatsApps in a few minutes.

So the playbook now flips the alert system into maintenance mode scoped to the node being drained, for the duration of the drain. Suppressed alerts get batched and delivered as a single digest when the window closes. I get one "hey, here's what fired while you were working" DM at the end, instead of a phone that vibrates itself off the desk.
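
The suppress-and-digest mechanic is simple enough to fit in a dataclass. A sketch under my own (hypothetical) names: alerts for the node in maintenance get held, everything else passes through, and closing the window flushes one digest:

```python
from dataclasses import dataclass, field


@dataclass
class MaintenanceWindow:
    """Suppress alerts for one node while it drains; flush a digest at close."""
    node: str
    suppressed: list[str] = field(default_factory=list)
    active: bool = True

    def alert(self, host: str, message: str, send) -> None:
        if self.active and host == self.node:
            self.suppressed.append(message)   # hold for the digest
        else:
            send(message)                     # normal path: deliver immediately

    def close(self, send) -> None:
        self.active = False
        if self.suppressed:
            send(f"{len(self.suppressed)} alerts during maintenance on "
                 f"{self.node}:\n" + "\n".join(self.suppressed))
```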

The verdict panel respects the same maintenance scope: while the window is open, the AI's DM is suppressed but the verdict JSON still flows to the dashboard. I can see the outcome in the browser without waking up the entire notification pipeline.

The Mundane Wins

Things that weren't newsworthy but quietly make the thing nicer to use:

  • An UPDATE APP button in the top bar that pulses amber when a new commit is available on the tracking branch. Click it and the app pulls the new commit, restarts cleanly, and the front-end revalidates.
  • A CANCEL button in the terminal overlay. Hung jobs used to need a kill from another shell; now it's one click. Surprisingly load-bearing.
  • A live status strip above the stream: current task, per-host status chips, elapsed timer, filter-out-ok-noise toggle. Reading Ansible output used to be archaeology; now the job state is obvious at a glance.
  • Verdict deduplication: retrying the same broken state within ten minutes doesn't double-notify. Useful when you're iterating on a playbook fix.
  • Background cache pre-warm: the cluster overview endpoint used to take three seconds cold while SSH'ing every node. A background task keeps the cache fresh now; the UI is instant.

Most of that lives under the label of "quality of life" rather than "feature". Worth writing down anyway, because compounded together they made the dashboard go from "I grudgingly use this" to "I click on it for fun."

Lessons, Such As They Are

Automation that reports success isn't the same as automation that proves it. Ansible's recap is a snapshot of what ran, not a claim that the outcome is healthy. The moment you ask "did the thing actually work?" everything changes: now you need a second opinion, a model of what healthy looks like, and a feedback loop. The AI layer isn't about being clever — it's about cheaply producing the second opinion that you would otherwise have to hand-write for every service type.

The other lesson: closing the loop is terrifying until you box it in with an allowlist and a cooldown. Then it's just another service.

What's Next

I've been half-threatening to clean this up and put it on GitHub. It's probably going to happen. It's a single FastAPI app with vanilla JS (no build step, because I'm incredibly lazy about frontend tooling), it drives Ansible via subprocess, and the AI bits are cleanly pluggable — you could hook it to your own agent or skip that layer entirely. If that sounds like something you'd try, watch this space.

If you spot me pushing branches with names like feat/actually-a-readme over the next couple of weeks, that's what's happening.