The 9B Model: More Parameters, Same Problems (Mostly)
This is the boring sequel. The 9B model trained on the first attempt. No SSH failures, no disk full errors, no cascading tracebacks. The custom Docker container did its job. I launched the pod, it pulled the image, SSH connected, training started, and 50 minutes later I had a model. Anticlimactic doesn't begin to cover it.
So instead of a post about what broke, this is a post about what the extra 5 billion parameters actually bought me. Spoiler: it's complicated.
The Quick Version
On the 15-question smoke test, the 9B scored 67% compared to the 4B's 47%. That's a genuine improvement. On the harder 50-question eval with hallucination traps and vague questions, it scored 41% versus the 4B's 37%. That's... less genuine.
More parameters made the model smarter in some ways and dumber in others. Which is not the narrative arc I was hoping for, but at least it's honest.
What Got Better
The smoke test improvement was real and meaningful:
| Category | 4B | 9B |
|---|---|---|
| Architecture | 100% | 100% |
| Procedures | 50% | 100% |
| Cross-Service | 71% | 75% |
| Factual Recall | 46% | 71% |
| Troubleshooting | 11% | 11% |
| Overall | 47% (7/15) | 67% (10/15) |
Procedures went from 50% to 100%. The 9B now correctly describes how to run Ansible updates and how to add DNS records via the API. The 4B would give you a vaguely correct command buried in wrong context -- the 9B gives you the right command in the right order. This alone makes it more useful at 2am.
Factual recall jumped from 46% to 71%. The 9B remembers backup retention policies (daily, weekly, monthly), knows which model OpenClaw uses, and correctly identifies the Factorio server's VLAN. The 4B just... didn't have room for these details. Five billion extra parameters turns out to be a lot of room for phone numbers.
The Full Eval Tells a Different Story
The 50-question eval is designed to be harder. It includes hallucination traps ("tell me about the Portainer dashboard"), vague questions ("something about VLANs and a game server?"), and multi-step procedures that require precise sequencing. This is where the 9B's improvement gets... modest.
| Category | 4B | 9B | Change |
|---|---|---|---|
| Architecture Reasoning | 3/5 (57%) | 5/5 (67%) | +10% |
| Factual (Specific) | 6/10 (53%) | 7/10 (65%) | +12% |
| Hallucination Traps | 0/5 (0%) | 1/5 (20%) | +20% |
| Negation/Boundary | 2/5 (33%) | 3/5 (46%) | +13% |
| Cross-Service | 2/5 (39%) | 3/5 (48%) | +9% |
| Troubleshooting | 3/8 (25%) | 3/8 (25%) | 0% |
| Factual (Vague) | 3/5 (53%) | 2/5 (33%) | -20% |
| Procedures | 2/7 (31%) | 0/7 (16%) | -15% |
| Overall | 21/50 (37%) | 24/50 (41%) | +4% |
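A quick note on reading this table: the fractions count questions that passed outright, while the percentages are average keyword-match scores, which is how 5/5 can sit next to 67%. A minimal sketch of that kind of grader (the keyword lists and the 0.6 threshold here are illustrative, not the actual harness):

```python
# Minimal sketch of a keyword-overlap grader. The threshold and keyword
# lists are illustrative; the real eval harness differs in detail.
def keyword_score(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords present in the model's answer."""
    answer = answer.lower()
    return sum(kw.lower() in answer for kw in keywords) / len(keywords)

def grade_category(results: list[tuple[str, list[str]]], threshold: float = 0.6):
    """Return (passes, total, mean score) for one category of (answer, keywords) pairs."""
    scores = [keyword_score(ans, kws) for ans, kws in results]
    return sum(s >= threshold for s in scores), len(scores), sum(scores) / len(scores)
```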
Some of these results need explaining, because they're counterintuitive.
The Surprises
Architecture went perfect. 5/5. The 9B understands every dependency chain, every design decision, every "why is this on that VLAN" question. This is the category where more parameters clearly help -- the model has enough capacity to hold the full mental map of how 30+ containers relate to each other. The 4B could do this sometimes. The 9B does it every time.
Hallucination resistance improved but is still terrible. 1/5 instead of 0/5. Progress, technically. The model still happily describes services that don't exist if you ask with enough confidence. "Tell me about the Portainer dashboard" and it'll invent one. This isn't a parameter problem -- it's a training data problem. We need more hard negatives. Many more.
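For context, a hard negative is a question about something that doesn't exist, paired with an explicit refusal. The exact schema of the dataset doesn't matter here; the shape does. Something like this, with invented wording throughout:

```python
# Illustrative hard-negative pairs. Portainer is the eval's trap question;
# the Kubernetes one is a made-up extra. The refusal wording is invented.
hard_negatives = [
    {
        "instruction": "Tell me about the Portainer dashboard.",
        "output": "There is no Portainer dashboard in this infrastructure. "
                  "No such service is deployed.",
    },
    {
        "instruction": "How do I restart the Kubernetes cluster?",
        "output": "There is no Kubernetes cluster here. That service "
                  "doesn't exist in this homelab.",
    },
]
```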
Procedures regressed. This was the shock. 2/7 on the 4B, 0/7 on the 9B. On the smoke test, procedures went to 100%. On the full eval, they went to 0/7. The difference? The smoke test asks straightforward "how do I run Ansible" questions. The full eval asks things like "walk me through setting up a new LXC from scratch" -- multi-step processes with specific ordering. The 9B gets the individual steps right but muddles the sequence. It's like it knows all the ingredients but can't follow the recipe.
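On the eval side, this failure mode suggests scoring order as well as presence. A tiny ordering-aware check might look like this (the step keywords are placeholders):

```python
# Illustrative ordering check: pass only if every expected step keyword
# appears in the answer, in the same order as the reference procedure.
def steps_in_order(answer: str, ordered_steps: list[str]) -> bool:
    answer = answer.lower()
    positions = [answer.find(step.lower()) for step in ordered_steps]
    return all(p >= 0 for p in positions) and positions == sorted(positions)

# e.g. steps_in_order(answer, ["create the container", "configure networking", "start"])
```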
Vague questions got worse. The 4B scored 53%, the 9B dropped to 33%. Ask "what's that reverse proxy thing?" and the 4B would give you a reasonable answer about Caddy. The 9B gives you a more detailed answer that somehow manages to miss more keywords. It's the curse of verbosity -- the model knows more, says more, and in doing so wanders further from the specific answer the eval is looking for. Whether that's actually worse for a human user is debatable.
Troubleshooting is identical. 3/8 on both models. 25%. This is the number that tells me the most. Doubling the parameters didn't move the needle at all on diagnostic reasoning. The model still generates plausible-sounding troubleshooting steps that are wrong. If 5 billion extra parameters can't fix this, the training data needs fundamentally different troubleshooting examples -- not just more of them, but better ones with actual diagnostic chains and verification steps.
What This Actually Means
The 9B is the better model for daily use. 67% on the smoke test versus 47% means it gets the answer right two-thirds of the time on the questions you'd actually ask. It knows more facts, understands more relationships, and can walk you through basic procedures.
But the full eval reveals that more parameters alone won't get us past 50%. The ceiling isn't the model -- it's the data. Specifically:
- Troubleshooting data needs real diagnostic chains. Not "the service is down, restart it" but "check the logs first, look for this specific error, try this fix, verify with this command." The current training pairs describe symptoms and solutions but skip the reasoning in between (there's a sketch of the target shape after this list).
- Hard negatives need massive expansion. 20 hard negatives in a 4,800-pair dataset is not enough. The model needs hundreds of "we don't use that" examples to reliably refuse fictional services.
- Procedures need step-by-step validation. The training data has procedures, but they weren't generated with enough emphasis on exact ordering and verification between steps.
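To make the first point concrete, here's the target shape for a troubleshooting pair, with the diagnostic chain written out. Every specific below (service, error string, commands) is invented for illustration:

```python
# Illustrative troubleshooting pair with an explicit diagnostic chain.
# Every specific here (service, log output, fix) is made up; the point
# is the structure: check, observe, act, verify.
diagnostic_pair = {
    "instruction": "Grafana is returning 502s through the reverse proxy. What do I do?",
    "output": (
        "1. Check the container is actually running: docker ps | grep grafana\n"
        "2. If it's up, read the recent errors: docker logs --tail 50 grafana\n"
        "3. If you see 'database is locked', the SQLite volume is wedged; "
        "restart the container: docker restart grafana\n"
        "4. Verify: curl -s -o /dev/null -w '%{http_code}' http://localhost:3000 "
        "should return 200. If the proxy still shows 502, check its upstream config."
    ),
}
```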
The Numbers
| Thing | Number |
|---|---|
| Training time | ~50 min on H100 |
| Training steps | 286 |
| Final loss | ~1.0 |
| Model size (Q8_0 GGUF) | 9 GB |
| RunPod cost | ~$2.50 |
| Things that broke | 0 |
| Smoke test improvement | 47% → 67% |
| Full eval improvement | 37% → 41% |
| Categories that regressed | 2 (procedures, vague) |
| Troubleshooting improvement | None. Zero. Zilch. |
What's Next
The 9B is the production model for now. It runs on my M5 Pro with 18GB unified memory, and the Q8_0 quantisation means minimal quality loss. For architecture questions, factual lookups, and basic procedures, it's genuinely useful.
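For anyone replicating the setup: the GGUF runs under llama.cpp or any of its bindings. A minimal sketch with llama-cpp-python, where the filename and context size are placeholders:

```python
# Minimal local-inference sketch via llama-cpp-python
# (pip install llama-cpp-python). Model path and n_ctx are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="./homelab-9b.Q8_0.gguf", n_ctx=4096)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Which VLAN is the Factorio server on?"}],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
```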
But the next improvement won't come from a bigger model. It'll come from better training data -- specifically, rewriting the troubleshooting pairs with proper diagnostic reasoning, flooding the dataset with hard negatives, and adding step-verified procedures.
The model knows what my infrastructure looks like. It can even tell you what depends on what. But ask it what to do when something breaks and it's still guessing.
At least it's guessing with more confidence now. I'm not sure if that's progress or just a more articulate way of being wrong.