My AI Forgot What Plex Is (But Got 80% on Everything Else)
The last post ended with a cliffhanger: V8 was training, the dataset had been surgically rebuilt with 140 hand-written gold pairs, and the target was 76-82%. Three days, several explosions, and one strongly-worded email later, the model hit 80%. Ten points above V7. Best score yet.
Getting there was the interesting part.
The Cheap GPU Trap (Sunday)
RunPod H100s are $2.69/hr. Vast.ai's marketplace had B200s at $3-6/hr and H100s at $1.47/hr. Easy maths. I built a complete training script for their marketplace -- interactive GPU selection, EU-only filtering, safety checks, the works.
The reality was different.
Vast.ai is a GPU marketplace, not a cloud provider. You're renting hardware from individual hosts, and the experience reflects that. SSH failed on a B200 in the UK because Vast.ai overrides your Docker entrypoint, silently breaking key injection. An H200 in Czechia ran at half the expected speed. An H200 in France showed 0% GPU utilisation -- sixty-one seconds per step on hardware that should have been ten times faster.
The real kicker was hidden in the API response: gpu_frac: 0.125. Those cheap listings weren't full GPUs. They were one-eighth slices. A "$1.47/hr H100" is actually $11.76/hr for the whole card, except you can't get the whole card, and one-eighth of an H100 can't fit a 9B model in VRAM anyway.
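If you're scripting against the marketplace anyway, the sanity check is one division: price every listing by the whole card, not the slice. Something like this, assuming the offer JSON exposes gpu_frac and an hourly price field the way that response did (the exact field names may differ):

```python
# Hypothetical sketch: normalise marketplace prices by the fractional GPU share,
# so a sliced listing can't masquerade as a cheap whole card. Field names
# (gpu_frac, dph_total, gpu_name) are assumptions based on the response above.
def whole_card_price(offer: dict) -> float:
    frac = offer.get("gpu_frac") or 1.0  # missing or zero -> assume a full GPU
    return offer["dph_total"] / frac

offers = [
    {"gpu_name": "H100", "dph_total": 1.47, "gpu_frac": 0.125},  # the listing from the post
    {"gpu_name": "H100", "dph_total": 2.50, "gpu_frac": 1.0},    # made-up full-card listing for contrast
]
for offer in offers:
    print(f"{offer['gpu_name']}: listed ${offer['dph_total']:.2f}/hr, "
          f"whole card ${whole_card_price(offer):.2f}/hr")
```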
Three failed attempts. Three instances I was charged for but couldn't use. I wrote Vast.ai a refund request citing the UK Consumer Rights Act and the EU Unfair Commercial Practices Directive -- the fractional allocation isn't disclosed at the point of sale, which is the kind of thing consumer protection law has opinions about.
The takeaway: Vast.ai's marketplace is not worth the money for ML training. RunPod gives you a dedicated full GPU, it works every time, and $2.69/hr with no surprises beats $1.47/hr for a mystery slice you can't SSH into. Sometimes the more expensive option is the cheaper one.
Three Explosions and a Lesson (Monday)
The V8 config added two changes that Codex and Unsloth's docs recommended: rsLoRA (rank-stabilised LoRA) and adamw_torch_fused (a faster optimiser for H100s). Both sounded great on paper.
First training run: loss dropped beautifully through warmup, hit 1.2 at step 100, then at step 130 -- exactly when the learning rate reached its maximum -- the loss jumped to 7.8. Then 11.5. Then 12.6. Gradient norm went from 0.1 to four million.
Killed the pod. Changed the optimiser back to adamw_8bit. Second run: identical explosion. Same step. Same epoch. Same four-million gradient norm.
The optimiser wasn't the problem. rsLoRA was. Here's the maths I should have done first:
Standard LoRA scaling: alpha / rank = 32 / 32 = 1.0
rsLoRA scaling: alpha / sqrt(rank) = 32 / sqrt(32) = 5.66
Every gradient update was 5.66 times larger than V7's. The model was fine during warmup when the learning rate was tiny, but the moment it hit full speed at 2e-4, the 5.66x multiplier blew everything up. Consistently. Reproducibly. Expensively.
Disabled rsLoRA. Added max_grad_norm=1.0 as a safety net. Third time's the charm.
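For the curious, the two knobs boil down to something like this -- a sketch against Unsloth's get_peft_model and Hugging Face's TrainingArguments, with placeholder values wherever I'm not reproducing the exact V8 config:

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments

# Illustrative only: base model name, sequence length, batch size, and epochs
# are placeholders, not the actual V8 config.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B",  # placeholder base model
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=False,   # scaling stays at alpha/rank = 1.0 instead of alpha/sqrt(rank) = 5.66
)

args = TrainingArguments(
    output_dir="outputs",
    learning_rate=2e-4,
    optim="adamw_8bit",        # reverted from adamw_torch_fused
    max_grad_norm=1.0,         # clip runaway gradients before they hit four million
    per_device_train_batch_size=2,
    num_train_epochs=2,
)
```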
The Flash Attention Compilation Quest
Flash Attention 2 makes training about 1.5-2x faster. V7 ran at 10.8s per step. Without FA2, V8 was crawling at 18-20s. Worth fixing.
The Docker image didn't include FA2 because compiling CUDA kernels on a Mac involves cross-compiling from ARM to x86 via Rosetta emulation. I tried:
- Pre-built wheels in Docker -- ABI mismatch. The wheel was compiled against a different PyTorch than Unsloth installs.
- Source compile in Docker -- OOM. Even with 48GB allocated to Docker Desktop, `nvcc` got killed.
- Unsloth's official Docker image -- FA2 not actually included. So much for "batteries included."
- Runtime pip install on the pod -- installed the broken pre-built wheel instead of compiling.
- Runtime pip install with `--no-binary --no-deps` -- compiled from source on the pod's 200GB RAM. Took 15 minutes. Worked.
72 CUDA kernels, compiled one by one on an H100 that was being paid for at $2.69/hr while doing zero training. But when training finally started: 5.2 seconds per step. Twice as fast as V7. The 15 minutes paid for themselves in the first hour.
I added a progress counter to the training script so next time I can watch "Compiling: 48/72 kernels..." instead of wondering if it's frozen.
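The counter is nothing clever -- roughly this, assuming the build goes through pip and that ninja's [n/total] lines are what you want to surface:

```python
import re
import subprocess
import sys

# Hypothetical sketch of the progress counter. Flags follow the run described
# above (pip's --no-binary needs a package spec, so :all: is assumed here);
# --no-build-isolation is per flash-attn's own install docs, not the post.
cmd = [
    sys.executable, "-m", "pip", "install", "-v", "flash-attn",
    "--no-binary", ":all:", "--no-deps", "--no-build-isolation",
]
proc = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)
kernel_progress = re.compile(r"\[(\d+)/(\d+)\]")  # ninja prints "[48/72] ..." per kernel
for line in proc.stdout:
    match = kernel_progress.search(line)
    if match:
        done, total = match.groups()
        print(f"\rCompiling: {done}/{total} kernels...", end="", flush=True)
print()
proc.wait()
```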
The Actual Training (Tuesday)
With FA2 working and the config debugged, V8 trained in 2.5 hours instead of V7's 6. Loss dropped smoothly from 1.85 to 0.58. Eval loss settled at 0.92. Gradient norms stayed between 0.10 and 0.17 the entire time. No explosions. No drama. Just a GPU doing its job.
Total cost: about $12 in RunPod time. The cheapest training run yet, and the fastest by a factor of two.
80%
The eval didn't go smoothly either. First attempt scored 0/50 because Ollama failed to start on the pod. Second attempt: also 0/50, because I'd added </think> as a stop token to suppress Qwen's thinking tags, and it was cutting off every response at the first token.
Third attempt: prefilled </think> in the chat template instead, dropped temperature from 0.7 to 0.3, and ran the eval locally on my Mac.
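The eval call itself is just Ollama's chat endpoint with the temperature turned down. A rough sketch -- model name and question are placeholders, and since the </think> prefill lives in the chat template rather than the request, this version simply strips any thinking block from the reply instead of using a stop token:

```python
import re
import requests

# Minimal sketch of one eval call against a local Ollama server. Model name is
# a placeholder; stripping <think>...</think> after generation is a stand-in
# for the template prefill described above, not the exact setup.
def ask(question: str, model: str = "homelab-v8") -> str:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": question}],
            "stream": False,
            "options": {"temperature": 0.3},
        },
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["message"]["content"]
    return re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()

print(ask("Which box runs Plex?"))
```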
80%. Forty out of fifty.
| Category | V5 | V7 | V8 |
|---|---|---|---|
| Factual (specific) | 69% | 80% | 86% |
| Factual (vague) | -- | 53% | 90% |
| Troubleshooting | 39% | 61% | 72% |
| Architecture | 90% | 90% | 100% |
| Cross-service | 62% | 78% | 77% |
| Hallucination traps | 100% | 100% | 100% |
| Overall | 64% | 70% | 80% |
Factual vague -- the category that measures whether the model understands casual, imprecise questions like "where's the monitoring stuff?" -- jumped 37 points. The hand-written gold pairs did exactly what they were designed to do.
Architecture reasoning hit 100%. Every "why is X configured this way?" question answered correctly with the actual design rationale. Hallucination traps: still perfect. Ask about Kubernetes, Docker Swarm, or Portainer and the model correctly tells you they're not used.
The One It Got Wrong
Question one of the eval: "Which box runs Plex?"
Model's answer: "Plex is not used in this homelab."
Plex is very much used. It's running as an LXC container on the primary Proxmox node. It's in the training data hundreds of times. So why did the model deny it?
I built a ground truth fact table mapping every service to its correct container ID, IP, and node. Then I scanned all 9,778 training rows against it. Found 25 contaminated rows where the LLM that generated the training data had invented fake Plex details -- fabricated VM and container IDs that don't exist anywhere in the infrastructure. The generation model hallucinated plausible-sounding infrastructure facts, and the training model learned them as truth.
I also tried LLM-powered validation -- sending every row to GPT-5.4 Mini to check against the fact table. Cost $4, flagged 39% of the data as "errors." Most were false positives. The fact table didn't include every port and sub-service, so the LLM flagged legitimate data as wrong. More noise than signal.
The regex scanner with known-bad patterns caught the real contamination. Sometimes the simple tool is the right one.
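The scanner is embarrassingly simple -- roughly this shape, with made-up patterns standing in for the real known-bad IDs:

```python
import json
import re

# Hypothetical sketch of the known-bad-pattern scanner. The patterns and file
# name are placeholders, not the real contaminated IDs; the actual list came
# from diffing generated rows against the ground-truth fact table.
KNOWN_BAD = [
    re.compile(r"plex.{0,80}\bct\s*999\b", re.IGNORECASE | re.DOTALL),        # fabricated container ID
    re.compile(r"plex.{0,80}\b10\.0\.99\.\d+\b", re.IGNORECASE | re.DOTALL),  # subnet that doesn't exist
]

def scan(path: str) -> list[int]:
    flagged = []
    with open(path) as f:
        for i, line in enumerate(f):
            row = json.loads(line)
            text = " ".join(str(v) for v in row.values())
            if any(p.search(text) for p in KNOWN_BAD):
                flagged.append(i)
    return flagged

bad_rows = scan("train.jsonl")
print(f"{len(bad_rows)} contaminated rows flagged")
```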
The Scoreboard
| Version | Score | What Changed |
|---|---|---|
| V2 (4B) | 37% | Baseline |
| V3 (9B) | 46% | Better data, hard negatives |
| V5 | 64% | Read the Unsloth docs |
| V7 | 70% | Targeted data generation |
| V8 | 80% | Data surgery + gold pairs + FA2 |
Total project spend: roughly $80 in GPU time and $25 in API calls across every version. The model runs on consumer hardware. The whole thing costs less than a month of most SaaS tools.
What's Next
V8.1 is prepped: 25 contaminated rows removed, 86 hand-written repair pairs targeting the 10 specific failures, corrective pairs finally included in the merge. The dataset is cleaner than it's ever been.
I also looked into switching to OpenAI's GPT-OSS-20B as a base model. At Q8 quantisation it's 21GB -- technically fits in my Mac's 64GB unified memory, but inference drops to 2-3 tokens per second. The Qwen 9B pushes 60 tokens per second at the same quality level. For an interactive assistant, speed matters more than marginal accuracy gains.
The model started this project unable to tell me where Plex runs. Nine versions later, it can trace cascading failures across 15 services, correctly refuse questions about technology that doesn't exist, and diagnose IP conflicts from vague symptoms. It just needs to remember that Plex is, in fact, still running. We'll get there.