96% (And It Only Took Seven Blog Posts to Get Here)
The last post ended with the model scoring 80% and confidently telling me that Plex -- the service running on my primary Proxmox node, mentioned hundreds of times in the training data -- does not exist. It was the best score yet and also the most embarrassing failure. Like acing an exam but writing your own name wrong.
This post is the one where it comes together.
96%. Forty-eight out of fifty. Same 9B model. Same Ollama deployment. Same consumer hardware. The difference, as it has been every single time, is the data.
Hunting the Poison
The Plex denial bugged me. The model had the right answer in its training data. It had been taught that Plex runs on the primary node. But it had also been taught, by 25 separate training rows generated by a different LLM, that Plex lives on fabricated VMs and containers that don't exist. The generation model -- the one that created the Q&A pairs -- had hallucinated plausible-sounding infrastructure details, and the training model absorbed them as fact.
I built a ground truth fact table: every service, every container ID, every IP, every node. Then I wrote a regex scanner to check all 9,778 training rows against it. Twenty-five rows contained invented service IDs -- fake VM numbers, wrong IPs, services assigned to the wrong Proxmox nodes.
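The scanner itself is simple enough to sketch. This is a minimal illustration of the approach rather than the actual script: the service names, container IDs, IPs, and regex patterns below are stand-ins, not my real fact table.

```python
import re

# Illustrative ground-truth fact table; the real one listed every service,
# container ID, IP, and Proxmox node straight from the documentation.
FACTS = {
    "plex": {"ct": "101", "ip": "192.168.1.10"},
    "uptime kuma": {"ct": "105", "ip": "192.168.1.14"},
}
KNOWN_IPS = {f["ip"] for f in FACTS.values()}

CT_RE = re.compile(r"\b(?:CT|VM|LXC|container)\s*(\d+)\b", re.IGNORECASE)
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scan_row(text: str) -> list[str]:
    """Return a list of contamination findings for one training row."""
    issues = []
    lowered = text.lower()
    for service, facts in FACTS.items():
        if service not in lowered:
            continue
        # A container ID mentioned alongside a known service must match the table.
        for ct in CT_RE.findall(text):
            if ct != facts["ct"]:
                issues.append(f"{service}: container ID {ct} not in fact table")
    # Any IP that appears nowhere in the fact table is invented.
    for ip in IP_RE.findall(text):
        if ip not in KNOWN_IPS:
            issues.append(f"invented IP {ip}")
    return issues
```

Run over all rows, anything with a non-empty findings list gets quarantined for manual review rather than auto-deleted, since a regex this blunt will occasionally misfire.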
For good measure, I also tried LLM-powered validation: sending every row to GPT-5.4 Mini to check against the fact table. Cost $4. It flagged 39% of the data as "errors." Most of those were false positives -- the fact table didn't include every port and sub-service, so the LLM flagged perfectly legitimate training data as wrong. Spent four dollars to learn that sometimes grep is smarter than GPT.
Teaching the Model What It Doesn't Have
Removing contaminated data fixes one problem. But V8 still failed ten questions, and most of those failures weren't about wrong facts -- they were about missing knowledge. The model didn't know the weekly update cycle runs on Saturdays. It didn't know which services lack backup coverage. It couldn't name the containers that Ansible doesn't manage.
The negation questions were the most interesting challenge. "What's not monitored by Uptime Kuma?" requires the model to know the complete set of things that ARE monitored and then reason about the complement. Teaching "X is not Y" without making the model refuse everything is the fine-tuning equivalent of teaching a toddler the word "no" -- powerful and immediately dangerous.
The fix was paired complement training. For every negation question, I wrote three variants:
- Positive: "What does Uptime Kuma monitor?" -- lists all the monitored services
- Negative: "What's not monitored?" -- names the specific unmonitored ones
- Point query: "Is Infisical monitored?" -- direct yes/no with reasoning
Eighty-six hand-written pairs targeting the exact ten questions the model got wrong. Not generated by an LLM. Not bulk-produced. Each one checked against the actual documentation, with the actual IPs, the actual commands, the actual service names. Tedious work. The kind of work that moves the needle.
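As a sketch, one negation fact expands into a triple shaped like this. The service lists and answer wording here are illustrative placeholders, not the real training data, which was written by hand against the docs:

```python
import json

# Hypothetical facts for illustration; the real lists came from the documentation.
MONITORED = ["Plex", "Sonarr"]   # truncated stand-in for the monitored set
UNMONITORED = ["Infisical"]

def complement_triple(monitored, unmonitored):
    """One positive, one negative, and one point query for the same fact."""
    triple = [
        {"q": "What does Uptime Kuma monitor?",
         "a": "Uptime Kuma monitors " + ", ".join(monitored) + "."},
        {"q": "What's not monitored by Uptime Kuma?",
         "a": "Not monitored: " + ", ".join(unmonitored) + "."},
    ]
    for svc in unmonitored:
        triple.append({
            "q": f"Is {svc} monitored by Uptime Kuma?",
            "a": f"No. {svc} is not in Uptime Kuma; it has no health check configured.",
        })
    return triple

# Append the triples to the training set in JSONL form.
with open("repair_pairs.jsonl", "w") as fh:
    for pair in complement_triple(MONITORED, UNMONITORED):
        fh.write(json.dumps(pair) + "\n")
```

The point of pairing is that the model always sees the positive set alongside the negation, so "not monitored" stays anchored to a concrete complement instead of drifting into blanket refusals.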
The Docker Odyssey (A Brief Interlude)
Flash Attention 2 makes training twice as fast, but compiling it has been a running saga across this project. Mac can't cross-compile CUDA kernels. Docker Desktop OOMs. Pre-built wheels have ABI mismatches.
The solution turned out to be my Windows PC. A dedicated GPU, 64GB of RAM, and a CPU that could actually compile 72 CUDA kernels without dying. Getting Docker to work over SSH on Windows required discovering that the Windows Credential Manager doesn't function in non-interactive sessions, then manually embedding base64-encoded Docker Hub credentials in the config file because nothing else worked.
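For anyone hitting the same wall: the workaround amounts to a `~/.docker/config.json` that carries the auth inline instead of pointing at a credential helper. The file shape below is standard Docker; the token is a placeholder (base64 of `username:password`), and any `credsStore` entry has to be removed or Docker will keep calling the broken helper:

```json
{
  "auths": {
    "https://index.docker.io/v1/": {
      "auth": "dXNlcm5hbWU6cGFzc3dvcmQ="
    }
  }
}
```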
Then `docker commit` silently dropped the container's startup command, so the next RunPod pod booted with no SSH daemon. Twenty minutes of "image pulling" followed by thirty rounds of "SSH not ready". Recommitting with `--change 'CMD ["/start.sh"]'` restored the startup command, and everything worked.
The image now ships with FA2 baked in. No more fifteen-minute compilation at pod startup. Finally.
96%
The eval ran on the pod for the first time -- Ollama actually started because I replaced the blind sleep 5 with an API readiness poll. Small things.
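The poll is nothing fancy. A sketch of the idea, assuming Ollama's default port and its `/api/tags` endpoint, which answers as soon as the server is actually serving:

```python
import time
import urllib.request
import urllib.error

def wait_for_ollama(url="http://localhost:11434/api/tags",
                    timeout=120, interval=2):
    """Poll Ollama's tags endpoint until it responds, instead of a blind sleep."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False
```

The eval script calls this once at startup and bails out with a real error if it returns False, instead of hammering a server that was never going to come up.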
| Category | V8 | V8.1 |
|---|---|---|
| Factual (specific) | 86% | 88% |
| Factual (vague) | 90% | 100% |
| Troubleshooting | 72% | 85% |
| Cross-service | 77% | 87% |
| Negation/boundary | ~37% | 100% |
| Hallucination traps | 100% | 100% |
| Overall | 80% | 96% |
Factual vague: perfect. Every "where's the monitoring stuff?" and "how do downloads work?" answered correctly. Negation: perfect. The model now knows exactly what isn't backed up, what isn't monitored, what isn't behind a reverse proxy, and what Ansible doesn't manage. Hallucination resistance: still perfect. Nine versions in and the model has never once described Kubernetes, Docker Swarm, or Portainer.
Two questions left: a troubleshooting scenario where the answer missed a specific keyword, and a Ghost blog diagnostic where the model never named the right config file. Forty-eight out of fifty.
The Scoreboard
| Version | Score | What Changed |
|---|---|---|
| V2 (4B) | 37% | Baseline |
| V3 (9B) | 46% | Better data, hard negatives |
| V5 | 64% | Read the Unsloth docs |
| V7 | 70% | Targeted data generation |
| V8 | 80% | Data surgery + gold pairs |
| V8.1 | 96% | Contamination cleanup + 86 repair pairs |
Total project cost across every version: about $100 in GPU time and $30 in API calls. The model runs locally on consumer hardware at 60 tokens per second. It answers questions about my infrastructure faster than I can open the documentation.
What's Next: Actually Using This Thing
Seven blog posts about training a model and I've never actually used it for anything beyond evaluation. That changes now.
The plan is to wire the model into the homelab itself. OpenClaw already handles automated alert remediation -- when Uptime Kuma detects a service is down, it SSHes in and tries to fix it. Giving it access to a model that knows every service, every dependency chain, and every diagnostic procedure means it could do more than just restart things. It could actually diagnose.
Beyond that, there's a retrieval-augmented setup where the model can pull live documentation at query time instead of relying purely on what it memorised during training. The current eval is 50 questions with fixed ground truth. Real-world homelab questions don't come with answer keys.
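A minimal version of that loop is easy to sketch. Almost everything here is an assumption: the model name, the doc snippets, and the crude keyword retrieval are placeholders; only the `/api/generate` endpoint and its `stream` flag are standard Ollama.

```python
import json
import urllib.request

def retrieve(question: str, docs: list[str], k: int = 3) -> list[str]:
    """Crude keyword-overlap retrieval; a real setup would use embeddings."""
    words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(words & set(d.lower().split())))
    return [d for d in scored[:k] if words & set(d.lower().split())]

def ask(question: str, docs: list[str], model: str = "homelab-9b") -> str:
    """Stuff retrieved snippets into the prompt and query the local model."""
    context = "\n".join(retrieve(question, docs))
    payload = {
        "model": model,  # hypothetical model name
        "prompt": f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The interesting part is that the fine-tuned model already knows the infrastructure, so retrieval only has to supply what changed since training, like last night's logs, rather than the whole knowledge base.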
And there might be a chat UI somewhere in there. Something where I can ask "what broke last night?" and get an answer that includes the actual log entries, the actual service names, and the actual fix -- not a generic troubleshooting flowchart from a model that's never seen my infrastructure.
The model started at 37%. It now scores 96% on a 50-question eval that includes hallucination traps, vague questions, multi-step procedures, and "what's NOT in your homelab" boundary tests. It runs on my Mac. It cost less than a decent night out. And for the first time, I'm going to stop training it and start using it.