I Made My Homelab AI Actually Good (And Accidentally Rented an H100)

Two days ago I had a fine-tuned model that scored 87% on a 15-question test. It knew my IPs, my VMIDs, and most of my backup retention policy. Good enough, right?

No. Obviously not. Because I looked at the training data and realised half the answers didn't even contain IP addresses, a quarter were generic Linux advice that could've come from any Stack Overflow answer, and the "troubleshooting" questions were 3% of the dataset. Three percent. For a homelab assistant. The model was passing tests because the tests were easy, not because the data was good.

So I tore the whole pipeline apart and rebuilt it. New generation engine, new model provider, new training approach, and -- because apparently I have no self-control -- a cloud GPU pipeline that took six attempts to get working. I went looking for an A6000 on RunPod. None available. No A100s either. So naturally I rented an H100. Go big or go home, right? It's only $2.69 an hour. What could go wrong.

Quite a lot, as it turns out.

Tidying Up

First order of business: the repo was 49 files in the root directory. I reorganised it into proper folders, consolidated 11 near-identical generation scripts into one, and cleaned up a few things that shouldn't have been committed. Housekeeping. Moving on.

The Data Quality Problem

Here's what the original 3,004 training pairs actually looked like when I scored them properly:

| Problem | How Bad |
|---|---|
| No IP addresses in answer | 51% |
| Short answers (<150 chars) | 48% |
| Only "What IP is X?" questions | 68% |
| Troubleshooting pairs | 3% |
| Architecture reasoning | 1% |
| Hard negatives ("Kubernetes isn't used here") | 0 |

The mean quality score was 2.54 out of 5. Half the dataset was essentially flashcards -- "What IP is Sonarr?" followed by a single IP address. Technically correct, completely useless for teaching a model to actually help someone troubleshoot at 2am.
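The 0-5 scorer behind those numbers is a heuristic, not an LLM judge. A minimal sketch of that kind of grounding check -- the regexes, weights, and function name here are my own reconstruction, not the repo's actual scorer:

```python
import re

# Hypothetical weights and patterns -- the post describes the scorer's signals
# (IPs, CT/VM IDs, commands, question diversity) but not its exact code.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")         # e.g. 192.168.1.44
CTID_RE = re.compile(r"\b(?:CT|VM(?:ID)?)\s*\d+\b", re.I)  # e.g. CT 105, VMID 204
CMD_RE = re.compile(r"(?:systemctl|docker|pct|qm|journalctl)\b")

def score_pair(question: str, answer: str) -> float:
    """Score a Q&A pair 0-5 on infrastructure grounding, not factual correctness."""
    score = 1.0                                      # base: the pair exists
    if len(answer) >= 150:                           # not a flashcard answer
        score += 1.0
    if IP_RE.search(answer):                         # grounded in a concrete IP
        score += 1.0
    if CTID_RE.search(answer) or CMD_RE.search(answer):  # IDs or runnable commands
        score += 1.0
    if not question.lower().startswith("what ip"):   # penalise the dominant pattern
        score += 1.0
    return score
```

A flashcard pair like "What IP is Sonarr?" / "192.168.1.44" scores 2.0 under this sketch, which is roughly where the old dataset's 2.54 mean sat.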

Five Personas, One Chunk at a Time

The old pipeline sent entire documents to the LLM and asked for 20-40 pairs. The model skimmed -- grabbed the obvious facts, ignored the configs buried in paragraph three, and generated the same "What IP is X?" pattern repeatedly.

The new pipeline chunks documents into 3,500-character segments at markdown heading boundaries, then runs each chunk through five different personas:

  • Operator (5 pairs): "How do I check the status of X? What command restarts it?"
  • Troubleshooter (4 pairs): "X isn't responding after reboot, what do I check?"
  • Architect (3 pairs): "Why is X on the DMZ VLAN instead of management?"
  • Beginner (3 pairs): "I'm new here -- what is Caddy used for in this setup?"
  • Auditor (2 pairs): "Which services are externally accessible? What's the single point of failure?"

Same chunk, five completely different question styles. The Operator asks how to restart Caddy, the Architect asks why there are two Caddy containers, and the Troubleshooter asks what to do when Caddy stops responding. All grounded in the same specific IPs, CT IDs, and config paths from that chunk.

54 docs became 146 chunks. 146 chunks times 5 personas = 730 API calls, each asking for between two and five pairs depending on the persona. Total output: 2,175 pairs with a mean quality score of 4.06 out of 5.
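The chunking step itself is simple enough to sketch. This is my reconstruction of a heading-aware splitter, not the repo's actual code -- it prefers markdown heading boundaries and only hard-splits a section that exceeds the limit on its own:

```python
import re

def chunk_doc(text: str, max_len: int = 3500) -> list[str]:
    """Split a markdown doc into <=max_len chunks, breaking at heading boundaries."""
    # Split before each heading line, keeping the heading attached to its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks, current = [], ""
    for section in sections:
        if len(current) + len(section) <= max_len:
            current += section                 # pack small sections together
        else:
            if current:
                chunks.append(current)
            while len(section) > max_len:      # single oversized section: hard split
                chunks.append(section[:max_len])
                section = section[max_len:]
            current = section
    if current:
        chunks.append(current)
    return chunks
```

Packing adjacent small sections into one chunk is what keeps the average at 2.7 chunks per doc instead of one chunk per heading.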

The improvement over the old data:

| Metric | Old (Qwen3.5 Plus) | New (5-persona chunked) |
|---|---|---|
| Mean score | 2.54 | 4.06 |
| Score >= 4.0 | 12% | 66% |
| Below 2.0 | 23% | 0.3% |
| Avg answer length | 516 chars | 1,111 chars |
| Short answers | 48% | 0% |
| Duplicates | 46 | 5 |

Zero short answers. Five duplicates (all hard negatives asking about the same non-existent service). Every answer grounded in specific infrastructure details. The beginner persona filled a gap I hadn't even thought about -- the model couldn't previously explain what anything was, only recite its IP address.

Swapping to GPT-5.4 Nano

I'd been generating training data with Qwen3.5 Plus via OpenRouter. It worked, but I swapped to GPT-5.4 Nano -- OpenAI's newest small model. Similar cost, but it supports structured JSON output mode. Instead of hoping the model returns valid JSON and manually stripping markdown code blocks, you set response_format: {"type": "json_object"} and get syntactically valid JSON back every time. Fewer parse errors, cleaner pipeline.
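The request shape looks roughly like this, using the OpenAI-style chat completions format that OpenRouter exposes. The model slug and persona prompt here are illustrative, not copied from the repo:

```python
import json

def build_request(chunk: str, persona: str, n_pairs: int) -> dict:
    """Build an OpenAI-style chat completion payload with JSON mode enabled."""
    return {
        "model": "openai/gpt-5.4-nano",              # hypothetical OpenRouter slug
        "response_format": {"type": "json_object"},  # model must emit valid JSON
        "messages": [
            {"role": "system",
             "content": f"You are the {persona} persona for a homelab Q&A dataset. "
                        f'Return a JSON object with a "pairs" array of {n_pairs} '
                        f"question/answer objects grounded in the text."},
            {"role": "user", "content": chunk},
        ],
    }

def parse_pairs(raw: str) -> list[dict]:
    """With JSON mode there is no markdown fence to strip -- just parse."""
    return json.loads(raw)["pairs"]
```

The parse step is the whole payoff: no regex to peel off ```json fences, no retry loop for truncated braces.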

The RunPod Adventure

My Mac M3 Pro OOM'd trying to train (18GB unified memory isn't enough). The RTX 4080 works but ties up the Windows machine. So I looked at RunPod -- cloud GPUs by the minute.

What followed was an education in Docker image compatibility, CUDA driver versions, and the gap between "this should work" and "this actually works."

Attempt 1: H100 in Netherlands, unsloth/unsloth image. SSH rejected -- the image uses a non-root unsloth user with SSH_KEY env var, not RunPod's standard PUBLIC_KEY for root. Took three attempts to figure this out.

Attempt 2: Switched to runpod/pytorch image (SSH works out of the box). CUDA driver error -- the H100 in EU-NL-1 has driver 12.0.7, but pip install unsloth silently upgraded PyTorch to a version requiring CUDA 12.8+. The base image's PyTorch 2.4 was fine, but Unsloth's dependency resolver replaced it.

Attempt 3: Pinned Unsloth version. Wrong transformers version -- Qwen3.5 needs transformers >= 5.2.0 but the pinned Unsloth pulled 4.57.

Attempt 4: Let Unsloth install its deps but froze PyTorch. It upgraded PyTorch anyway through a transitive dependency.

Attempt 5: Back to the unsloth/unsloth image (everything pre-installed), SSH as unsloth user, source /tmp/unsloth_environment to activate conda. Upgraded transformers for Qwen3.5 support. Tokenizer crashed because Qwen3.5 is a vision-language model and the tokenizer tried to parse training text as base64-encoded images. Had to bypass the VL processor and call the underlying text tokenizer directly.

Attempt 6: Batch size 1 on an H100. 12 seconds per step. Same speed as the 4080. Because batch size 1 doesn't utilise the GPU's parallelism at all. Cranked it to 16 -- needed a DataCollatorForSeq2Seq to handle variable-length sequences.

It's currently training. On an H100. In the Netherlands. At $2.69/hour. With a batch size of 16 and about 584 steps to go.
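What DataCollatorForSeq2Seq does for that batch of 16 is pad every sequence up to the batch's longest, and mask the padding out of the loss with the label value -100. In plain Python, the equivalent is roughly:

```python
def collate(batch: list[dict], pad_id: int = 0) -> dict:
    """Pad a batch to its longest sequence, the way DataCollatorForSeq2Seq does."""
    longest = max(len(ex["input_ids"]) for ex in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        pad = longest - len(ex["input_ids"])
        out["input_ids"].append(ex["input_ids"] + [pad_id] * pad)
        out["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * pad)
        out["labels"].append(ex["labels"] + [-100] * pad)  # -100: ignored by the loss
    return out
```

The real collator also returns tensors rather than lists, but the padding logic is the part that batch size 1 never needed and batch size 16 can't live without.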

The script that orchestrates all of this -- runpod/train_runpod.sh -- handles the full lifecycle: launch pod, wait for SSH, upload data, install deps, train, download GGUFs, terminate. It tries H100 first, falls back through A100, A6000, and L40S. Auto-terminates on exit or Ctrl+C so you don't wake up to a $50 bill.
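The fallback logic boils down to trying GPU types in preference order until one launches. The real script is bash; sketched in Python with a hypothetical launch_pod callable standing in for the cloud API call:

```python
GPU_PREFERENCE = ["H100", "A100", "A6000", "L40S"]  # order from the script

class NoGPUAvailable(Exception):
    pass

def launch_with_fallback(launch_pod, gpu_types=GPU_PREFERENCE):
    """Try each GPU type in order; return (gpu, pod) for the first that launches.

    launch_pod is a hypothetical callable wrapping the provider API -- it should
    return a pod handle, or raise if that GPU type has no availability.
    """
    for gpu in gpu_types:
        try:
            return gpu, launch_pod(gpu)
        except Exception as exc:
            print(f"{gpu} unavailable ({exc}), falling back...")
    raise NoGPUAvailable(f"none of {gpu_types} could be launched")
```

The auto-terminate half of the lifecycle is the part worth copying even if nothing else is: a trap on exit and Ctrl+C that tears the pod down is the difference between $2.69 and $50.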

What I Actually Learned

  1. Data quality > data quantity. 2,175 high-quality pairs (score 4.06) will almost certainly outperform 3,004 mediocre ones (score 2.54). The old dataset was mostly flashcards. The new one teaches reasoning.
  2. Chunk first, then generate. Sending whole documents to an LLM for Q&A generation is like asking someone to summarise a book -- they hit the highlights and skip the details. Chunking forces coverage of every section.
  3. Personas multiply diversity for free. Same chunk, different perspective, completely different questions. The cost scales linearly but the diversity scales combinatorially.
  4. Cloud GPU setup is harder than it looks. Docker images, CUDA drivers, SSH auth, Python environments, vision-language tokenizer quirks -- any one of these can silently break your training pipeline. Test locally first.
  5. Transformers 5.x broke everything. SFTTrainer's dataset_text_field no longer auto-tokenizes. The whole internet's worth of Unsloth tutorials are written for transformers 4.x. If you're using a recent Docker image, you need to tokenize in the dataset map step and use the standard Trainer class.
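The fix from point 5 looks something like this: tokenize explicitly in the dataset map step instead of relying on SFTTrainer's dataset_text_field. The function is pure, so it's shown standalone -- in the real pipeline it gets wrapped in a lambda and passed to dataset.map, and the result feeds a standard Trainer:

```python
def tokenize_example(example: dict, tokenizer, max_len: int = 2048) -> dict:
    """Map-step tokenization: transformers 5.x no longer does this for you."""
    enc = tokenizer(example["text"], truncation=True, max_length=max_len)
    enc["labels"] = list(enc["input_ids"])  # causal LM: labels mirror the inputs
    return enc
```

For the Qwen3.5 vision-language quirk from attempt 5, the same function works once you hand it the text tokenizer pulled off the VL processor rather than the processor itself, so it never tries to decode your YAML as an image.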

The Numbers

| Thing | Number |
|---|---|
| Training pairs (new, 5-persona) | 2,175 |
| Mean quality score | 4.06 / 5.0 |
| Personas | Operator, Troubleshooter, Architect, Beginner, Auditor |
| Chunks per doc (avg) | 2.7 |
| API calls for generation | 730 |
| Generation model | GPT-5.4 Nano via OpenRouter |
| Generation cost | ~$12 |
| Training model | Qwen3.5-4B (V2: rank 32, DoRA, cosine schedule) |
| Training GPU | NVIDIA H100 80GB (RunPod, EU-NL-1) |
| Training cost | ~$2.69/hr |
| RunPod attempts before it worked | 6 |
| Times pip install unsloth broke PyTorch | 3 |
| SSH auth methods tried | 4 |

The Stack (V2)

  • Generation: GPT-5.4 Nano via OpenRouter, chunked docs, 5 personas, JSON object mode
  • Scoring: Custom 0-5 heuristic scorer (IPs, CT/VM IDs, commands, question diversity)
  • Training: Qwen3.5-4B with Unsloth, LoRA rank 32, DoRA, dropout 0.05, cosine LR
  • Hardware: NVIDIA H100 80GB on RunPod (~$2.69/hr) or RTX 4080 16GB (local WSL2)
  • Deployment: GGUF export -> Ollama
  • Pipeline: generate.sh (interactive model picker) -> check_data.sh -> train_runpod.sh

The model is training as I write this. An H100 in the Netherlands is processing 4,908 Q&A pairs about my homelab in a house in England. Each step takes about 4 seconds. The loss is dropping.

In about 30 minutes, I'll have a GGUF file that knows my infrastructure better than the first version ever did. Not because the model is bigger -- it's the same 4B parameters. But because the questions are better, the answers are deeper, and five different perspectives looked at every piece of documentation I've written.

Whether it's actually smarter? Ask it what happens if a Proxmox node goes down. If it traces the full dependency chain -- monitoring blind, media stack offline, but Plex still serving and the NAS still up -- then the personas did their job.

If it just names the node and nothing else -- well, at least it remembered something.