I Taught My Servers to Reboot Themselves (Without Breaking Everything)
Rebooting a single server is easy. Rebooting three servers that are collectively running 32 containers, 5 virtual machines, a media stack, two reverse proxies, and a monitoring system -- without anyone noticing -- is a different kind of problem.
My Proxmox cluster has three nodes. Between them they run everything from Plex to Paperless to an AI agent that automatically restarts services when they crash. The kind of setup where "just reboot it" is technically correct and practically catastrophic. Every service goes down, every monitor fires, my phone lights up like a Christmas tree, and someone asks why the media server isn't working.
I'd been avoiding reboots. Not deliberately, just... conveniently. Updates would pile up. Kernel patches would sit there. One node was running a kernel two versions behind the others because I kept finding reasons not to restart it. "I'll do it this weekend" is the homelab equivalent of "I'll start going to the gym."
So I built a system that does it for me. Every Sunday at 7am. While I'm asleep. And it mostly works.
The Playbook
The core of it is an Ansible playbook. The idea is simple: process one node at a time, move everything off it, update it, reboot it, move everything back. Like renovating a house one room at a time while still living in it.
The "move everything off it" part is where it gets interesting. Proxmox supports two kinds of migration:
- VM live migration -- the VM keeps running while its memory state transfers to another node. Zero downtime. The kind of thing that makes you feel like you're living in the future.
- CT restart migration -- the container stops, moves to the new node, and starts there. A few seconds of downtime. Less futuristic, but all my containers use shared NFS storage, so there's no disk data to copy. It's more of a "pick up the box and put it on a different shelf" than an actual migration.
The playbook captures what's running on each node, migrates VMs live, restart-migrates all the containers, runs apt dist-upgrade, reboots, waits for the node to rejoin the cluster, then migrates everything back. Three nodes, serial execution, one at a time. Cluster quorum maintained throughout.
Oh, and it live-migrates the Ansible controller VM first, because the playbook is running on the very infrastructure it's rebooting. It's like changing the tyres on a car while driving it, except the car politely pulls over, swaps itself to a different chassis, and then continues the journey.
The Web UI
Running playbooks from the command line is fine. Running them from a web page with a big red button and a live terminal feed is better.
I built herdmon -- a FastAPI app that pulls host data from PatchMon (a self-hosted patch monitoring tool) and shows every host in my infrastructure with its pending update count, security patches, and reboot status. From there I can select hosts and run Ansible playbooks against them: system package updates via apt/dnf, LXC application updates via Proxmox community scripts, or full backup-then-update runs that snapshot each container to PBS before touching anything. The output streams in real time via Server-Sent Events -- ansible-playbook stdout piped line by line into a browser terminal overlay.
Adding cluster operations was mostly a matter of creating a second page. The Cluster Ops page shows four cards -- one per node plus the backup server. Each card shows how many containers and VMs are running, the kernel version, uptime, and PVE version. Below each card: an UPDATE button and a REBOOT button. At the bottom: a big amber ROLLING UPDATE & RESTART ALL button with a confirmation dialog, because some buttons should make you think twice.
The node status comes from SSHing into each box and scraping pct list and qm list. It's cached for 30 seconds so I'm not hammering the nodes every time someone refreshes. And I can watch the migrations happen in real time through the terminal overlay. Very satisfying.
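The 30-second cache is a few lines of code. A sketch of the pattern (the decorator name and injectable clock are mine, for testability -- herdmon's version may look different):

```python
import time

def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache a zero-arg function's result for ttl_seconds, so repeated
    page refreshes don't translate into repeated SSH scrapes."""
    def wrap(fn):
        state = {"at": None, "value": None}
        def cached():
            now = clock()
            if state["at"] is None or now - state["at"] >= ttl_seconds:
                state["value"] = fn()
                state["at"] = now
            return state["value"]
        return cached
    return wrap
```

Decorate the function that SSHes into the nodes with `@ttl_cache(30)` and the first refresh in any 30-second window does the work; the rest get the cached answer.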
The First Run
The first run did not go well.
I had deployed the playbook, tested the syntax, checked the logic. Then I pressed the button on the web UI. And immediately noticed there were two ansible-playbook processes running. Turns out I'd pressed it twice. Or the UI had. Either way, two rolling restarts were now competing for control of the same cluster.
The old version of the playbook -- the one that shuts down containers instead of migrating them -- was still loaded in memory from a previous run. So while I thought I was running the fancy new migration-based version, one of the processes was enthusiastically shutting down everything the old-fashioned way.
The Ansible VM got caught in the crossfire. It was shut down mid-run, which killed the playbook that was running on it. One of the nodes ended up in a state where systemd had queued a reboot but hadn't executed it, which meant every attempt to start a container was rejected with "Transaction is destructive." The entire node was in a bureaucratic deadlock with its own init system.
The fix was to let it reboot. I tried cancelling the pending jobs, isolating targets, daemon-reloading. None of it worked. In the end I sent echo b > /proc/sysrq-trigger through SSH, which is the Linux equivalent of pulling the plug and hoping for the best. It came back clean. Everything auto-started via onboot. Crisis resolved through percussive maintenance.
The Plex Problem
My Plex server runs in a privileged container with Intel iGPU passthrough for hardware transcoding. The container config references /dev/dri/card1 and /dev/dri/renderD128 -- the GPU device nodes that need to exist on whichever physical node runs the container.
All three of my nodes have identical Intel CPUs with identical iGPUs. The device paths are the same on all of them. In theory, Plex should be able to run anywhere.
In practice, when the playbook migrated Plex to another node, that node insisted /dev/dri/renderD128 didn't exist. Even though it definitely did. I checked. It was right there. Created 18 minutes before the start attempt, according to the timestamps.
My best theory is a udev race condition -- the device node existed but wasn't fully initialised, or the kernel module hadn't finished setting up the render interface even though the file was present. The kind of bug that only appears when you automate things, because a human would have waited 30 seconds and tried again.
The playbook continued anyway, because ignore_errors: true is the seatbelt of infrastructure automation. Plex sat stopped for the duration of the reboot, then started fine when it was migrated back to its home node. Not ideal, but not catastrophic.
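If the udev-race theory is right, a small wait-and-retry guard before the start attempt might sidestep it. A sketch, assuming a simple existence check is a good-enough readiness signal (a stricter version could try opening the node or waiting on udev settle):

```python
import os
import time

def wait_for_device(path, timeout=30.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until a device node (e.g. /dev/dri/renderD128) exists,
    or give up after `timeout` seconds. Returns True if it appeared."""
    deadline = clock() + timeout
    while clock() < deadline:
        if os.path.exists(path):
            return True
        sleep(interval)
    return os.path.exists(path)
```

Essentially the 30-second pause a human would have done by instinct, written down so the automation does it too.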
Uptime Kuma Has Opinions
I have 85 monitors in Uptime Kuma watching everything from Proxmox nodes to network switches. When 25 containers go down simultaneously because they're migrating to another node, that's 25 alerts firing in quick succession. My phone. The webhook. WhatsApp messages. The works.
So the playbook pauses Uptime Kuma monitors before migrating, and resumes them after everything's back. Smart, right?
Except the pause script creates a new Socket.IO connection for each monitor. When you're pausing 25 monitors in rapid succession, Uptime Kuma's rate limiter kicks in with "Too frequently, try again later." The last few monitors don't get paused. Those containers migrate. Those alerts fire. My phone lights up. The rate limiter was protecting me from my own automation.
The fix was embarrassingly simple: add a two-second delay between each pause call. Not elegant, but effective.
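The whole fix fits in a loop. A sketch with the actual Uptime Kuma call abstracted away (the `pause` callable stands in for whatever Socket.IO or API-client call does the pausing -- injected so the pacing logic stands on its own):

```python
import time

def pause_monitors(monitor_ids, pause, sleep=time.sleep, gap=2.0):
    """Pause monitors one at a time with a gap between calls, so Uptime
    Kuma's rate limiter doesn't reject the tail of the batch."""
    for i, mid in enumerate(monitor_ids):
        pause(mid)
        if i < len(monitor_ids) - 1:
            sleep(gap)  # the embarrassingly simple fix
```

The resume path is the same loop with the opposite call.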
The Speed Problem
The first successful run took one hour and one minute. For context, the actual reboots accounted for about three minutes of that. The apt upgrades were maybe two minutes. The remaining 56 minutes were container migrations.
The problem was sequential execution. Each pct migrate --restart takes about 50 seconds: stop the container, update the cluster config, start it on the target node. With 25 containers on the busiest node, that's 25 times 50 seconds. Twice -- once to migrate off, once to migrate back. Twenty minutes each direction, for an operation that involves no data transfer whatsoever because the storage is shared NFS.
Ansible's async feature fixed this. Fire all 25 migrations simultaneously with poll: 0, then use async_status to wait for them all to finish. The containers aren't competing for disk bandwidth because there's nothing to copy. They're just stopping, updating a config file, and starting on a different node. No reason they can't all do that at once.
```yaml
- name: Restart-migrate CTs
  ansible.builtin.shell: >
    pct migrate {{ item }} {{ migration_target }}
    --restart --timeout 120
  loop: "{{ ct_list }}"
  async: 300
  poll: 0
  register: ct_migrate_jobs
  ignore_errors: true

- name: Wait for CT migrations to complete
  ansible.builtin.async_status:
    jid: "{{ item.ansible_job_id }}"
  loop: "{{ ct_migrate_jobs.results }}"
  register: ct_migrate_results
  until: ct_migrate_results.finished
  retries: 30
  delay: 10
  ignore_errors: true
```
What It Looks Like Now
Every Sunday at 7am, a cron job on the Ansible VM kicks off the rolling restart. It processes each node in sequence. For each one:
- Pause Uptime Kuma monitors (with polite two-second gaps)
- Live-migrate all VMs to another node (zero downtime)
- Parallel restart-migrate all containers (seconds of downtime each)
- Run apt dist-upgrade
- Reboot, wait for cluster quorum
- Parallel migrate everything back
- Resume monitors
The Ansible controller VM gets special treatment -- it's always migrated first, before anything else, and returned to its home node at the end. The playbook runs uninterrupted across all three node reboots because it's always running on a node that isn't currently being rebooted.
Total runtime should be around 20 minutes now, down from an hour. I say "should" because the parallel migration fix went in after the first run, and the next test is tomorrow morning at 7am. I'll be asleep. That's rather the point.
What I Learned
LXC containers can't live-migrate. Full stop. They share the host kernel, so you can't seamlessly transfer a running process between two different kernel instances. But pct migrate --restart on shared storage is close enough -- the downtime is measured in seconds, not minutes.
VMs are the opposite. Live migration just works. The memory state streams across while the VM keeps running. I consistently forget how impressive this is until I watch it happen.
The hardest part of automation isn't making things work. It's making things fail gracefully. Every migration step has ignore_errors: true. Every SSH call has a timeout. There's a force-stop fallback for containers that refuse to migrate. The playbook is designed to finish even when individual steps break, because at 7am on a Sunday, nobody's around to fix it.
Well -- nobody human. There's also an AI agent sitting on the cluster that receives Uptime Kuma webhooks and attempts basic remediation when services go down: restart the service first, reboot the container if that doesn't work, then SSH in and try to figure out what went wrong. It's the last line of defence for when the rolling restart's ignore_errors ignores something it probably shouldn't have. That's a whole other story, and probably the next post.
And testing infrastructure automation in production is inevitable, because there is no staging environment for "reboot the cluster." You can syntax-check the playbook, dry-run individual commands, and review the logic until your eyes glaze over. But the first real run will always surface something you didn't think of. Like systemd queueing a reboot that can't be cancelled. Or a GPU device that exists but doesn't. Or two copies of your playbook fighting each other.
The playbook just climbs over it and carries on. Mostly.
herdmon is on GitHub if you want to poke around the code. It's a FastAPI app with vanilla JS, no build step, and an amber phosphor aesthetic that makes you feel like you're running a command centre. Which, honestly, is half the reason I built it.