# The Model Works (And Why I Had to Build My Own Docker Image to Get There)
Last time I left you with a model training on an H100 in the Netherlands. The loss was dropping. Everything looked promising.
Then the GGUF export failed. Then the 9B model couldn't download because the disk was full. Then my terminal filled with cascading Python tracebacks that my shell tried to execute as commands. Each traceback spawned more tracebacks. It was like watching a software error reproduce asexually.
But the 4B model? It works. It actually works. Ask it what happens if a Proxmox node goes down and it traces the dependency chain. Ask it about Kubernetes and it correctly says "we don't use that here." Ask it how to check if Caddy is running and it gives you the exact command on the right container.
Here's how I got there, and why "just use the Unsloth Docker image" stopped being an option.
## The Image Problem
The last post covered six failed attempts using off-the-shelf Docker images. Every single failure came from environment mismatches -- the SSH auth was wrong, or pip silently upgraded PyTorch to an incompatible CUDA version, or transformers was too old for Qwen3.5, or the vision-language tokenizer tried to parse training text as images. You know, normal Tuesday stuff.
Each pod launch meant 10+ minutes waiting for the image to pull, then another 2-3 minutes installing and upgrading packages, then discovering something new was broken. At $2.69/hour, those debugging sessions add up. I was essentially paying a cloud GPU to watch me google error messages.
So I built my own container. If the environment won't cooperate, make one that has no choice.
## The Custom Container
The idea was simple: bake everything in. Every dependency at the exact version that works. No pip installs at runtime. No surprises. The sort of thing you'd think everyone does but apparently requires six failed attempts to consider.
The Dockerfile starts from NVIDIA's CUDA 12.4 base image, installs PyTorch with the matching CUDA toolkit, then adds Unsloth, transformers 5.x, and every other training dependency. SSH is configured the standard RunPod way -- root user, public key injection at startup. The entry point script sets up SSH and keeps the container alive.
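In sketch form, the Dockerfile looks something like this. The specific versions and file names below are illustrative placeholders, not the exact ones I shipped -- the point is that everything is pinned at build time:

```dockerfile
# Base image matches the pod's CUDA runtime so driver/toolkit versions agree
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

# System deps: python, ssh server, and cmake (needed later for the GGUF export)
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip git cmake openssh-server && \
    rm -rf /var/lib/apt/lists/*

# Pin everything. Exact versions are placeholders here -- what matters is that
# they ARE pinned, so pip can never "helpfully" upgrade PyTorch under you.
RUN pip3 install --no-cache-dir torch==2.4.0 \
        --index-url https://download.pytorch.org/whl/cu124 && \
    pip3 install --no-cache-dir unsloth transformers trl peft

# RunPod-style SSH: root user, public key injected by the entrypoint at startup
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```

From the Mac, the cross-platform build-and-push is a single `docker buildx build --platform linux/amd64 -t youruser/trainer:latest --push .` invocation (tag name hypothetical).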
Building it was its own adventure. I'm on an M3 Mac, so the image needs to cross-compile for linux/amd64 via QEMU emulation. That takes about 20 minutes. The first push to Docker Hub hit a 502 Bad Gateway on the 3GB PyTorch layer. Because of course it did. Second push worked.
But after that? Every pod launch is the same. Image pulls in a few minutes (it's cached after the first run), SSH connects immediately, and training starts without installing a single package. The container went from a source of problems to something I don't think about anymore. Which is the highest compliment you can pay infrastructure.
## Training the 4B
With the container sorted, the actual training was anticlimactic. Which is exactly how it should be. Nobody writes blog posts about things that went smoothly, but here we are.
Qwen3.5-4B, bf16 precision, LoRA rank 32 with dropout 0.05, cosine learning rate schedule. The H100 chewed through 4,806 training pairs in about 50 minutes. Loss dropped steadily from 1.57 to 1.13 over two epochs. No crashes, no driver errors, no tokenizer surprises. I almost didn't know what to do with myself.
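For the curious, the cosine schedule itself is simple enough to write down. This is the textbook formula rather than Unsloth's internal implementation, and the warmup and learning-rate values are made-up defaults for illustration:

```python
import math

def cosine_lr(step, total_steps, max_lr=2e-4, min_lr=0.0, warmup_steps=0):
    """Textbook cosine decay: optional linear warmup, then a smooth
    cosine curve from max_lr down to min_lr over the remaining steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The appeal over a linear decay is the gentle tail: most of the run happens at a useful learning rate, and the last stretch anneals softly instead of slamming to zero.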
The GGUF export took another 15 minutes -- Unsloth clones and builds llama.cpp from source every single time, which feels like watching someone build a hammer before hanging each picture. But it worked because cmake was pre-installed in the container. That was another lesson from the failed attempts: the off-the-shelf images didn't have cmake, so the export would fail after training completed successfully. You'd have a trained model sitting on a pod you're paying for, unable to export it. Like buying a car and realising nobody installed doors.
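Under the hood the export boils down to two llama.cpp steps: convert the merged HF weights to an f16 GGUF, then quantize. A sketch that just assembles those commands -- the paths are hypothetical, and I'm assuming the stock llama.cpp tool names (`convert_hf_to_gguf.py`, `llama-quantize`) rather than whatever wrapper Unsloth generates:

```python
def gguf_export_cmds(model_dir, out_prefix, quants=("Q4_K_M", "Q8_0")):
    """Build the two-step llama.cpp export pipeline as argv lists:
    HF weights -> f16 GGUF, then one quantize pass per target format.
    Assumes llama.cpp is already cloned and built (hence cmake in the image)."""
    f16 = f"{out_prefix}-f16.gguf"
    cmds = [["python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
             "--outfile", f16, "--outtype", "f16"]]
    for q in quants:
        cmds.append(["llama.cpp/build/bin/llama-quantize",
                     f16, f"{out_prefix}-{q}.gguf", q])
    return cmds
```

Handy when the built-in export dies: you can replay the same steps by hand on a fresh pod without re-running training.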
## The Data Got Better Too
While the Docker image was building (20 minutes of QEMU emulation is a lot of time to fill), I cleaned up the training data. The original 4,908 pairs had 735 scoring below 2.0 out of 5 -- junk from early generation runs with no specific infrastructure details. Another 128 were near-duplicates asking the same question in slightly different fonts, essentially.
I filtered those out, then regenerated the weak spots. The troubleshooter and auditor personas were underrepresented in the original dataset, so I ran a targeted generation pass with improved prompts that emphasise answer variety -- "don't always open with the service name and container ID, sometimes lead with the command, sometimes with the context." Teaching an AI to not sound like an AI. Very meta.
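The filtering pass itself is nothing fancy. A sketch, assuming each pair is a dict with a question, an answer, and the 0-5 quality score (field names are made up):

```python
import re

def clean_pairs(pairs, min_score=2.0):
    """Drop low-scoring pairs, then drop near-duplicate questions.
    "Near-duplicate" here is crude: same question after lowercasing and
    stripping punctuation -- enough to catch lightly reworded copies."""
    seen, kept = set(), []
    for p in pairs:
        if p["score"] < min_score:
            continue
        key = re.sub(r"[^a-z0-9 ]", "", p["question"].lower()).strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept
```

Anything smarter (embedding similarity, fuzzy matching) would catch more duplicates, but exact-match-after-normalization was enough to remove the worst offenders.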
| Metric | Before | After |
|---|---|---|
| Total pairs | 4,908 | 4,806 |
| Mean quality score | 3.16 | 3.60 |
| Below 2.0 | 735 | 2 |
| Troubleshooting questions | 182 | 260 |
| Long answers (400+ chars) | 71% | 84% |
Fewer pairs, but better ones. The mean score jumped from 3.16 to 3.60 just by removing the bottom and adding targeted replacements. Turns out quality control is more effective than "generate more stuff and hope for the best." Who knew.
## How It Actually Performs
I ran two evaluation suites against the model. The first is a 15-question smoke test -- hand-picked questions covering the basics. The second is a proper 50-question eval with eight categories, including some deliberately nasty hallucination traps and vaguely worded questions that a human would understand but a model has to work for.
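Both suites roll up the same way: each answer gets a 0-1 score against expected facts, a question passes above some threshold, and results aggregate per category. The grading is the hard part; the roll-up is just bookkeeping. A sketch (category names and the 0.6 threshold are illustrative):

```python
from collections import defaultdict

def summarize(results, pass_threshold=0.6):
    """results: list of (category, score) tuples, score in [0, 1].
    Returns {category: (passed, total, avg_score)} -- the shape of
    the tables below."""
    buckets = defaultdict(list)
    for category, score in results:
        buckets[category].append(score)
    return {
        cat: (sum(s >= pass_threshold for s in scores),
              len(scores),
              sum(scores) / len(scores))
        for cat, scores in buckets.items()
    }
```

This is why "passed" and "avg score" can disagree in the tables: a category can average 40% while only one question actually clears the pass bar.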
### The 15-Question Smoke Test
| Category | Score |
|---|---|
| Architecture | 100% |
| Cross-Service | 71% |
| Procedures | 50% |
| Factual Recall | 46% |
| Troubleshooting | 11% |
| Overall | 47% (7/15) |
The smoke test tells a clear story: the model understands the architecture but can't fix things. Ask it why two Caddy containers exist and it nails the explanation. Ask it what to do when SSH breaks and it'll send you down a completely fictional troubleshooting path with full confidence. It's like a junior engineer who read the wiki but hasn't been on-call yet.
### The 50-Question Full Eval
The full eval is where things get more interesting -- and more honest.
| Category | Passed | Avg Score | Notes |
|---|---|---|---|
| Architecture Reasoning | 4/5 | 68% | Understands dependencies and design decisions |
| Factual (Specific) | 7/10 | 64% | Good on IPs and container IDs, misses CPU models |
| Factual (Vague) | 3/5 | 43% | Struggles when the question is informal |
| Cross-Service | 2/5 | 40% | Gets the big dependency chains, misses subtleties |
| Negation/Boundary | 1/5 | 25% | Doesn't know what it doesn't know |
| Hallucination Traps | 1/5 | 20% | Falls for trick questions about non-existent services |
| Troubleshooting | 1/8 | 12% | Confidently wrong diagnostic steps |
| Procedures | 0/7 | 11% | Can't describe multi-step processes accurately |
| Overall | 19/50 | 36% | |
Some patterns jump out:
Architecture is the strength. 4 out of 5 passed. The model genuinely understands why things are configured the way they are -- VLAN separation, reverse proxy placement, backup strategies. This makes sense: architectural knowledge is about relationships, and LoRA fine-tuning is good at learning patterns and associations.
Specific factual questions work better than vague ones. Ask "What IP is Plex on?" and it nails it. Ask "How do downloads work? Like TV shows and movies?" and it still figures out you mean the Sonarr/Radarr/NZBGet stack -- but the vaguer the question, the more likely it misses details. A 4B model doesn't have much room for inference about what you probably meant.
Hallucination traps are brutal. 1 out of 5. The model will happily describe services that don't exist in the homelab if you ask confidently enough. The hard negatives in the training data help -- it correctly rejects Kubernetes -- but it hasn't learned to reject every fictional service. This is a data problem more than a model size problem.
Procedures and troubleshooting are the floor. 0/7 on procedures, 1/8 on troubleshooting. The model can't reliably walk you through multi-step processes or diagnose problems. It knows what things are but not what to do about them. This is where 4 billion parameters genuinely aren't enough -- procedural knowledge requires more capacity to store step sequences accurately.
## What Broke Along the Way
Because something always breaks. Always. I'm starting to think this is a fundamental law of the universe, right up there with entropy and the fact that USB cables are always the wrong way round on the first try.
The GGUF export failed twice on the pod. The training script's built-in export worked fine (Q4 and Q8 both succeeded), but the pipeline script's separate export step failed with a cryptic conversion error. The LoRA adapter was already safely downloaded by that point, so no work was lost -- but it meant I had to spin up a second pod just to re-export the GGUF. Cost: about $0.50 and 10 minutes. The most expensive file format conversion of my life.
The 9B model couldn't download. After the 4B training and GGUF export, the 50GB container disk was full. The 9B model weights are ~18GB and there was nowhere to put them. A disk full error in 2026. We put AI on cloud GPUs but we still run out of disk space. The fix for next time: 100GB disk and cleanup of GGUF artifacts between model runs.
The 9B training script was broken. It was still using the old SFTTrainer with dataset_text_field, which doesn't work on transformers 5.x. Same issue I'd already fixed for the 4B script, but the 9B script hadn't been updated. It downloaded the 18GB model, crashed on the first training step, and the pod sat there burning money while I stared at the error message wondering why I didn't just use a spreadsheet.
## What's Next: 9B
The 4B model is good enough for quick lookups and architecture questions. But for troubleshooting -- the thing I actually want a homelab assistant for at 2am when something is on fire -- it needs more capacity. More parameters, more room to remember why that one NFS mount needs noac and what happens when you forget it.
The 9B training script is rewritten and ready. Same approach: bf16 LoRA, rank 32, cosine schedule, but with a smaller batch size to fit the larger model. The pipeline now cleans up disk between models and allocates 100GB instead of 50GB. Lessons learned, mistakes committed to version control so they can haunt future me.
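The cleanup step between models is just deleting the big artifacts once they've been pulled off the pod. A sketch, with a hypothetical directory layout -- the rule being that GGUFs and merged weights go, while the small LoRA adapter stays:

```python
from pathlib import Path

def free_disk_between_runs(workdir):
    """Delete GGUF files and anything under a 'merged' directory after
    they've been downloaded off the pod. Returns bytes freed."""
    freed = 0
    for path in list(Path(workdir).rglob("*")):
        if path.is_file() and (path.suffix == ".gguf" or "merged" in path.parts):
            freed += path.stat().st_size
            path.unlink()
    return freed
```

Cheap insurance: the 9B download failure above was exactly an 18GB model with nowhere to land.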
The plan is to train both models in a single pod session -- 4B first (~50 min), download its adapter and GGUF, clean up, then 9B (~90 min). Total cost should be around $6-7 for both models. Then I can compare them side by side on the same eval suite and see if the extra parameters actually help where it matters.
The 4B scores 36% on the full eval. 19 out of 50. It knows what everything is but can't tell you what to do when it breaks. The 9B needs to close that gap -- especially on procedures (0/7) and troubleshooting (1/8). If doubling the parameters doesn't help with "walk me through fixing X," then the problem is the training data, not the model.
Either way, we'll find out.
Spoiler: something will probably break during the 9B training too. I'll let you know what.