Teaching My AI to Say 'I Don't Know' (And Then Teaching It to Stop)
The last post ended with a model that scored 46% and the profound insight that throwing more parameters at a problem is not, in fact, a substitute for having good training data. Revolutionary stuff. Nobel committee, you know where to find me.
What followed was about 24 hours of increasingly recursive AI engineering: using one AI to consult on how to train another AI, while a third generated the training data. At no point did I stop to question whether this was a reasonable way to spend a Saturday.
The model went from 46% to 70%. It also forgot how to answer simple questions, learned to refuse things it definitely knew, and got tripped up by a Unicode apostrophe. Progress.
Step One: Read the Manual (V5: 46% to 64%)
The first 18-point improvement cost nothing except my dignity. Unsloth, the framework I've been using for fine-tuning, publishes best practices on their website. I had not read them. Specifically:
Their trainer.train() function has a known gradient accumulation bug. The fix is literally one line: unsloth_train(trainer). I'd been training with incorrect loss normalisation for every run.
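If you're on an affected Unsloth version, the swap is roughly this -- check the current docs before copying it, since the workaround may have been folded into the library by now:

```python
from unsloth import unsloth_train

# Before: loss was normalised incorrectly whenever gradient accumulation > 1
# trainer_stats = trainer.train()

# After: Unsloth's patched training loop
trainer_stats = unsloth_train(trainer)
```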
I also wasn't masking system prompts and user questions during training. About 30% of the gradient signal was being wasted teaching the model to generate tokens it should never output. Setting those labels to -100 is the training equivalent of not studying the exam instructions.
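The masking itself is nothing clever -- copy the input IDs into the labels and blank out everything before the assistant's turn. A minimal sketch (field names and token counts are illustrative, not my actual pipeline):

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy skips these positions entirely

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Only the assistant's tokens should contribute to the loss."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len  # system prompt + user question
    return labels

# A 12-token sequence where the first 8 tokens are the prompt:
labels = mask_prompt_labels(list(range(12)), prompt_len=8)
assert labels[:8] == [IGNORE_INDEX] * 8  # no gradient signal from the prompt
assert labels[8:] == [8, 9, 10, 11]      # loss computed only on the answer
```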
These two changes -- plus actually using the recommended hyperparameters instead of the ones I'd made up -- took the model from 46% to 64%. Same data. Same model. Just competent configuration. The previous runs weren't bad because the approach was wrong. They were bad because I hadn't read the docs. There's a lesson in there somewhere but I'm choosing to ignore it.
Hallucination resistance hit 100%. Ask about Kubernetes, Portainer, or Docker Swarm and the model will politely explain that none of those exist here and redirect you to Proxmox. The 200 hard negatives from V3 finally had enough signal to actually stick. Which is good, because "confidently describing services that don't exist" is a personality trait, not a feature.
Step Two: Consult an AI About Training an AI (V7: 64% to 70%)
Troubleshooting was still at 39%. The model had all the facts but the diagnostic reasoning of a Magic 8-Ball. Ask "a service keeps bouncing up and down" and it would suggest checking the watchdog timer -- a perfectly plausible generic answer that happens to be completely wrong for this homelab, where the real culprit is almost always an IP conflict from duplicate DHCP leases on the UniFi gateway.
The model wasn't stupid. It was generically intelligent, which in a domain-specific context is arguably worse.
So I built four specialised data generators. The troubleshooting one was the most fun -- 64 scenarios inspired by the kinds of problems people actually hit in places like Reddit's r/homelab, r/Proxmox, and r/selfhosted. Real-world failure patterns, real frustration, real "help my Plex is buffering and I've already tried turning it off and on again" energy.
The key innovation (borrowed from a Codex consultation, because why not use AI to plan your AI training) was "discriminative diagnostic chains." Every training pair had to explain not just the right answer but why the obvious wrong answer is wrong. "You might think this is a watchdog restarting the service, but in this homelab the most common cause of flapping is IP conflicts. Check ip neigh on the host."
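In training-pair form that looks something like this (the wording is illustrative, not a verbatim pair from the generated set):

```python
# One "discriminative diagnostic chain" pair: name the tempting generic answer,
# then steer to the homelab-specific one.
pair = {
    "messages": [
        {"role": "user",
         "content": "A service keeps bouncing up and down every few minutes. Where do I start?"},
        {"role": "assistant",
         "content": "You might think this is a watchdog restarting the service, but in this "
                    "homelab the most common cause of flapping is an IP conflict from duplicate "
                    "DHCP leases on the UniFi gateway. Check ip neigh on the host and review "
                    "the lease table before touching the service config."},
    ]
}
```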
Teaching a model why not turns out to be more effective than teaching it why. Who knew. Probably the Codex model I asked, which had apparently read more ML papers than me. The bar was low.
Four generators. 1,013 pairs. Troubleshooting jumped from 39% to 61%. Cross-service went from 62% to 78%. The model could trace a NAS reboot through stale NFS mounts through empty Plex libraries through cascading Uptime Kuma alerts. Overall: 70%, smoke test 87%.
Cost: about $20 in API calls across GPT-5.4 Mini, DeepSeek V3.2, and Claude Haiku. Using three different teacher models because Codex told me that using a Qwen model to teach a Qwen model is like copying your own homework. Style diversity matters, apparently.
Step Three: The Part Where Teaching 'I Don't Know' Went Too Far
V7's headline number went up, but two categories went backwards: procedures dropped from 58% to 48%, and factual-vague from 65% to 53%. I taught the model two new bad habits in the process of fixing three old ones.
The 142 generated procedure pairs were in lovingly detailed runbook format. Prerequisites, numbered steps, failure branches, rollback instructions. Proper documentation that any sysadmin would be proud of. The eval, however, just checks whether you mentioned ansible-playbook and migrate in the same answer. The model was writing essays when the eval wanted bullet points.
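For reference, the check is about as blunt as it sounds -- something in this shape (the function name is mine, not the real eval script):

```python
def procedure_check(answer: str) -> bool:
    """Pass if the answer names the actual commands, regardless of how it's phrased."""
    a = answer.lower()
    return "ansible-playbook" in a and "migrate" in a
```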
The second regression was more painful. I'd added 251 negation pairs teaching boundary knowledge -- "I don't have the exact firewall rules," "the specific TLS cipher config isn't documented." Very responsible. Very accurate. The model took this as a general instruction to hedge everything.
Ask V5 "how do downloads work?" and it would rattle off Prowlarr, Sonarr, Radarr, and NZBGet with every IP address. Ask V7 the same question and it said "I don't have the full download pipeline details." It knew the answer. It had been trained to not trust itself.
I'd spent the day teaching the model to say "I don't know" for things it shouldn't know about. Mission accomplished. Unfortunately, it then started saying "I don't know" for things it absolutely did know about. Teaching restraint to a language model is like teaching a golden retriever not to fetch -- technically possible, but you're fighting the architecture.
Also, two hallucination trap "failures" turned out to be a Unicode bug. The model was answering "Kubernetes isn't used" with a fancy right quotation mark (U+2019) instead of a straight apostrophe (U+0027). The eval's string matching didn't handle curly quotes. Two points lost to typography. I fixed the eval, recalculated, and the hallucination score went back to 100%. The most expensive find-and-replace of my life.
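The fix in the eval is the kind of thing you write in thirty seconds once you've spotted it -- normalise typographic quotes before doing any string matching (the helper name is mine):

```python
# Map curly quotes to their ASCII equivalents before substring checks.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def normalise(text: str) -> str:
    return text.translate(QUOTES)

# The trap answer that was "failing": the U+2019 version now matches.
assert "kubernetes isn't used" in normalise("Kubernetes isn\u2019t used in this homelab.").lower()
```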
Step Four: Data Surgery (V8: The Current Bet)
Codex diagnosed the V7 regressions as "answer-policy misalignment." Not a knowledge problem. A vibes problem. The model has all the facts but expresses them wrong -- over-explaining when brevity was needed, refusing when confidence was appropriate.
V8 is surgical rather than additive:
- Cut 114 generic runbook procedure pairs (the ones teaching essay-style answers)
- Cut 20 chunk-boundary abstention pairs ("this section doesn't cover...")
- Added 140 hand-written gold pairs targeting exact eval failures
- Reduced oversampling ratios (the 3x troubleshoot amplification was too aggressive)
- Enabled rsLoRA and aligned hyperparameters with Unsloth's latest guidance
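The rsLoRA part is a single flag in Unsloth's adapter config. A minimal sketch, with placeholder ranks rather than the actual V8 values, and assuming the 9B base is already loaded via FastLanguageModel.from_pretrained:

```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,                      # the 9B base loaded earlier
    r=32,                       # placeholder rank, not the real run's value
    lora_alpha=32,
    use_rslora=True,            # rank-stabilised LoRA: scales by alpha/sqrt(r) instead of alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```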
The gold pairs are aggressively terse. "How do I do a rolling restart?" gets four sentences with ansible-playbook, rolling-restart.yml, migrate, and pvecm status. Not a runbook. Not a walkthrough. Just the answer you need at 2am when pve3 is being difficult.
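A gold pair, for contrast with the runbook style, looks more like this (again illustrative wording, not a lift from the dataset):

```python
gold_pair = {
    "messages": [
        {"role": "user", "content": "How do I do a rolling restart?"},
        {"role": "assistant",
         "content": "Run ansible-playbook rolling-restart.yml from the control node. "
                    "It migrates guests off each node before restarting it. "
                    "Check pvecm status between restarts to confirm the cluster has quorum. "
                    "No other prep is needed."},
    ]
}
```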
I also built a Docker container for NVIDIA's B200 GPU because one appeared on RunPod and I got excited. It has 192GB of HBM3e and costs $5.98/hr. I deployed to one in California, watched the 14GB container image crawl across the Atlantic at what felt like ISDN speeds, and quietly added an EU-only datacenter filter to the training script. Latency: the great equaliser.
The Scoreboard
| Version | Score | What Changed |
|---|---|---|
| V2 (4B) | 37% | Baseline |
| V3 (9B) | 46% | Better data, hard negatives |
| V5 | 64% | Read the Unsloth docs (embarrassing) |
| V7 | 70% | Targeted data generation ($20 in API calls) |
| V8 | 76-82%? | Data surgery + gold pairs (pending) |
Total spend: roughly $40 in RunPod GPU time and $20 in API calls for data generation. The model runs on consumer hardware. The training pipeline runs through Gitea Actions. The whole project costs less than a decent takeaway.
What I've Actually Learned
The model went from 46% to 70% on the same 9B architecture. Same parameter count, same quantisation, same Ollama deployment. The difference is entirely in what it was taught and how. Better technique gave 18 points. Better data gave 6 more -- and worse data took 10 back from individual categories along the way.
Every training pair doesn't just teach a fact -- it teaches a response pattern. Add too many diagnostic chains and the model starts diagnosing when you asked a simple factual question. Add too many refusals and it starts refusing things it knows. The dataset is a personality profile, and at 9 billion parameters, the model isn't big enough to smooth out your biases. It absorbs them faithfully and reflects them back at you.
V8 is training now. Whether it hits 80% or stalls at 72%, the path forward is clear: fewer generated pairs, more hand-written ones, and a healthy suspicion of any training data that teaches the model to say more when the right answer is less.
The model started this journey unable to tell me where Plex runs. Now it can trace a cascading failure across 15 services, correctly refuse questions about infrastructure that doesn't exist, and diagnose IP conflicts from vague symptoms. It just occasionally forgets how downloads work. We're getting there.