I Trained an AI on My Homelab (And It Knows Where Everything Lives)

At the end of my last post, I had a working search engine for my homelab docs. Point qmd at the markdown files, let Claude search them semantically, job done. I even hinted at what might come next.

What actually came next was a Saturday night rabbit hole. I was on my phone talking to Claude about whether RAG was really the best approach for my use case, and somewhere around midnight the conversation pivoted from "search is fine" to "but what if the model just knew it all?" One thing led to another, and by Sunday morning I had a plan to fine-tune a language model on my entire homelab documentation. I then spent the rest of Sunday -- minus a gym session and a trip to the shops -- learning how to actually do it.

There's a particular kind of stubbornness that comes from having 45 markdown files documenting every IP address, firewall rule, and questionable life choice in your homelab -- and still having to grep through them when someone asks "what port is Sonarr on?" Search helps. But what if the model just knew the answer?

So I fine-tuned a language model on my homelab documentation. All of it. Every VMID, every NFS mount path, every cursed SMB credential caching issue I've troubleshot at 2am. The model now knows my infrastructure better than I do, which is both impressive and slightly concerning.

The Setup

The idea was simple: take my private documentation repo -- 45 markdown files covering everything from Proxmox cluster topology to why I named my robot vacuum after a dead hamster's predecessor -- and turn it into training data for a small language model.

The model: Qwen3.5-4B, a 4 billion parameter model that fits on my RTX 4080 with 16GB VRAM. It's technically a multimodal vision-language model, which means it could theoretically understand screenshots of my Grafana dashboards too. Small enough to train at home, big enough to actually be useful. Hopefully.

The tooling: Unsloth for LoRA fine-tuning, which is basically the "I have one GPU and a dream" toolkit. It handles all the optimisation magic that makes training a 4B model on consumer hardware possible without setting anything on fire.

Getting it to actually run, however, was its own special kind of adventure.

The Windows Saga (A Cautionary Tale in Three Acts)

Act I: The Silent Treatment

Here's something the tutorials don't tell you: Unsloth and Windows don't get along. At all.

First, PyTorch quietly installed itself as CPU-only. No error, no warning, just a polite torch 2.10.0+cpu hiding in the version string like a passive-aggressive flatmate. Fixed that by reinstalling with the CUDA index URL.
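A quick sanity check catches this before you waste an evening: the build flavour hides in the version string itself, so a guard like this (my own addition, not part of any training script) fails fast at startup:

```python
def assert_cuda_build(version: str) -> None:
    """Fail fast if pip silently handed us the CPU-only PyTorch wheel."""
    if "+cpu" in version:
        raise RuntimeError(
            f"torch {version} is CPU-only; reinstall from the CUDA index URL"
        )

# The wheel pip serves by default on Windows would raise here:
#   assert_cuda_build("2.10.0+cpu")
# A CUDA build passes silently:
assert_cuda_build("2.10.0+cu121")
```

Drop `assert_cuda_build(torch.__version__)` at the top of the training script and the passive-aggressive flatmate announces itself immediately.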

Then Unsloth's model loader hung indefinitely. No error. No output. No progress bar. Just the word "Loading model..." and a blinking cursor that might as well have been mocking me. I confirmed the model did actually load -- by running the same code in the background and checking the output file 10 minutes later. It was fine. It just didn't feel like telling me about it.

Act II: The Permission Denied

When I finally coaxed it to the training step -- 842 steps, progress bar at 0%, the future looking bright -- it crashed. torch.compile tried to call nvcc to compile some CUDA kernels, and Windows responded with PermissionError: [Errno 13] Permission denied: 'nvcc'. Not "nvcc not found." Permission denied. As if nvcc existed but was being deliberately kept from me. (It did not exist.)

Act III: WSL2 to the Rescue

The fix? WSL2. But even that had opinions.

Installing Debian went fine. Then I needed the CUDA toolkit, which meant adding NVIDIA's repository, which is signed with SHA1, which Debian 13 (Trixie) has decided is beneath its security standards as of February 2026. So I had to convince Debian to accept an "insecure" repository from NVIDIA. Not exactly the words you want in your apt config.

Then the PATH. Oh, the PATH. Windows paths get inherited into WSL, complete with C:\Program Files (x86) and its lovely parentheses that break every bash command. Every time I tried to run something through WSL from Windows, the PATH expanded into a 2,000-character monstrosity that crashed on the first space.
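For what it's worth, WSL does have a documented off-switch for the inheritance. A two-line `/etc/wsl.conf` fragment (followed by `wsl --shutdown` to apply) keeps Windows paths out of the Linux PATH entirely:

```ini
# /etc/wsl.conf -- stop WSL appending the Windows PATH to $PATH
[interop]
appendWindowsPath = false
```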

The solution was just to type the commands myself in the WSL terminal like a normal person. Sometimes the simplest approach is the last one you try.

It worked first time. Because of course it did.

The lesson: if you're fine-tuning on Windows with a GPU, skip straight to WSL. Don't be a hero.

Generating 1,772 Training Pairs (The Hard Way)

You can't just feed raw markdown into a model and expect it to learn anything useful. You need structured question-answer pairs. Lots of them.

I could have written them by hand. I could have also hand-washed 1,772 dishes, but I didn't do that either.

Instead, I used Claude Code to orchestrate an army of parallel agents -- each one reading my documentation files and generating Q&A pairs in waves. The first three rounds were brute force: agents split by topic, then by document depth, then by "extract literally everything including individual firewall rule indices." 977 pairs, each round more specific than the last.

The breakthrough was flipping the prompt. Instead of "generate more pairs," I pointed two agents at the 977 existing questions and said "tell me what we forgot." They went through the docs line by line, comparing every sentence against what was already covered. 236 genuine gaps -- specific environment variables, cron schedules, SSH key types, API content-type headers that broader questions had glossed over. Completely different from generating more of the same.

A final verification pass caught another 305 pairs hiding in the details. Individual UniFi device IPs, the exact Chromium version OpenClaw uses (it's 145), the three possible claim states for Octopus Energy coffee rewards. My documentation is apparently more detailed than I realised.

Total: 1,772 unique, deduplicated Q&A pairs from 45 documents. Only 13 duplicates across the entire dataset (0.7%). Zero hallucinated facts, because every answer was generated directly from my actual documentation. The hard part isn't the code -- it's making sure your training data is specific enough to be useful. Generic questions get generic answers. The magic is in the pedantic specificity.
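The dedup step is the least glamorous part, but it's what makes the 0.7% figure meaningful. A minimal sketch of question-level deduplication, normalising case and whitespace before comparing (the Sonarr pair here is a made-up example, not from my dataset):

```python
def dedupe_pairs(pairs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each question, ignoring case and spacing."""
    seen: set[str] = set()
    unique = []
    for pair in pairs:
        key = " ".join(pair["question"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

pairs = [
    {"question": "What port is Sonarr on?", "answer": "8989"},
    {"question": "what  port is Sonarr on?", "answer": "8989"},  # near-duplicate
]
print(len(dedupe_pairs(pairs)))  # 1
```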

Training Day

With 1,772 pairs ready, it was time to actually train the thing. The configuration:

  • LoRA rank 16 -- attaching lightweight adapters to the attention and FFN layers
  • 0.47% of parameters trained -- 21.2 million out of 4.56 billion. The rest stays frozen.
  • Batch size 1 with gradient accumulation of 4 -- effective batch size of 4, because 16GB VRAM doesn't leave much room for opinions
  • 2 epochs -- each example seen twice
  • 842 total steps -- ~78 minutes on the RTX 4080
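Those 842 steps aren't arbitrary; they fall straight out of the dataset size once you assume a small eval hold-out (5% is my guess here, reverse-engineered from the step count rather than copied from a config):

```python
import math

pairs = 1772
train_examples = int(pairs * 0.95)      # assuming a 5% eval split -> 1683
effective_batch = 1 * 4                 # batch size 1 x grad accumulation 4
epochs = 2

steps_per_epoch = math.ceil(train_examples / effective_batch)  # 421
total_steps = steps_per_epoch * epochs
print(total_steps)  # 842
```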

Unsloth helpfully announced it would "smartly offload gradients to save VRAM" and then the progress bar started moving. Slowly. About 5 seconds per step, with loss values every 10 steps and eval loss every 50.

The loss curve was textbook:

Epoch   Train Loss   Eval Loss
0.1     2.04         1.37
0.5     1.06         1.09
1.0     1.00         0.99
1.3     0.73         0.92
1.5     0.69         0.88
1.8     0.70         0.87
2.0     0.63         --

Final training loss: 0.916. Eval loss settled at 0.87 and never diverged from training loss -- no overfitting. The model learned the content without just memorising exact phrasing.

For context: a loss around 0.6-1.0 is the sweet spot for this kind of fine-tuning. Low enough that it knows the material, high enough that it's not just regurgitating training examples verbatim. Below 0.3 and you're in "the model has memorised your exact sentences" territory, which sounds impressive until it can't handle a question phrased slightly differently.

The RTX 4080 sat at about 15.8GB VRAM during training -- essentially maxed out -- running at 50°C with 61% GPU utilisation. Unsloth's gradient offloading was doing the heavy lifting to keep it all in memory.

Training produces a LoRA adapter -- a small delta file that sits on top of the base model. To actually run this in Ollama, you need to merge it back into the base model and convert to GGUF format. Unsloth handles this automatically, but it needs llama.cpp compiled locally for quantisation -- which means cmake, build tools, and another round of apt-get install in WSL. Two exports: Q4_K_M (~2.5GB, everyday use) and Q8_0 (~4.5GB, higher precision). Both loadable directly into Ollama.
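Loading the result into Ollama then takes a short Modelfile pointing at the export (filename and model name here are illustrative):

```
# Modelfile -- point Ollama at the quantised export
FROM ./qwen-homelab-Q4_K_M.gguf
```

Then `ollama create homelab -f Modelfile` registers it, and `ollama run homelab` starts asking it questions.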

The Moment of Truth

Training done. GGUF exported. Model loaded into Ollama. Time to ask it things.

Ask it for the specs of a Proxmox node and it comes back with the correct IP, which VMs and containers run on it, and the hardware model. It got the RAM slightly wrong -- 32GB instead of 62GB -- but for a 4B model trained in 78 minutes, knowing which services live where is genuinely impressive.

Ask it what VLAN a specific service is on and it nails it. Correct subnet, correct container ID. No hesitation.

The cascading dependency questions are where it really shines. Ask what happens if the NAS goes offline and it walks through every affected service -- media stops, indexers lose access, download clients can't hand off files -- while correctly identifying which services keep running because they don't depend on NFS. It understood the architecture, not just the facts.

Ask about alert handling and it describes the full pipeline: the webhook endpoint, the dedup logic that prevents restart loops, the escalation ladder, the notification chain. All correct.

Ask for backup retention numbers and it recites the exact policy. Keep last, daily, weekly, monthly, yearly -- all the right values.

It's not going to replace my documentation, and it's not going to replace the Qwen 3.5 Plus that runs OpenClaw -- tool use, multi-step reasoning, and deciding when not to restart a container at 3am need a bigger brain. But as a fast, local knowledge base that runs on Ollama? Genuinely useful. Ask it a question, get an answer with the right IP address in it, move on with your life. No internet required, no API costs, and it can't sell my firewall rules to advertisers.
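Wiring it into scripts is one call against Ollama's local HTTP API. A minimal client sketch using only the standard library (the model name `homelab` is whatever you registered at `ollama create` time):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of chunked lines
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ask("homelab", "What VLAN is the NAS on?")
```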

What's Next

The model is static -- it knows my homelab as of today. But homelabs evolve constantly. New containers, changed IPs, updated configs. The training data goes stale.

So the plan is to automate this. A weekly pipeline that:

  1. Pulls the latest docs from my private repo
  2. Regenerates training data using the OpenRouter script
  3. Retrains in WSL via Unsloth
  4. Exports GGUF and reloads Ollama

A cron job that builds me a fresh AI every week. Because manually retraining is for people who haven't automated their free coffee yet.
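The schedule itself is a single crontab line -- something like this (script path is illustrative), firing Sunday at 3am when nothing should be asking the model questions anyway:

```
# crontab -e -- weekly retrain, Sundays at 03:00
0 3 * * 0 /home/me/homelab-llm/retrain.sh >> /var/log/retrain.log 2>&1
```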

One thought for later: since Qwen3.5-4B is actually a multimodal model, there's potential to feed it screenshots -- Grafana dashboards, error messages, terminal output -- and have it understand them in context of my specific homelab. That's a project for another weekend. And probably another blog post.


The Numbers

Thing                             Number
Documentation files processed     45
Q&A pairs (after deduplication)   1,772
Duplicate rate                    0.7%
Agent rounds                      5
Total agents spawned              29
Model                             Qwen3.5-4B (multimodal)
Trainable parameters              21.2M of 4.56B (0.47%)
Training loss (final)             0.916
Training time                     78 minutes
GGUF exports                      Q4_K_M (~2.5GB) + Q8_0 (~4.5GB)
Times Windows was unhelpful       Lost count

The Stack

For anyone who wants to try this:

  • Model: Qwen3.5-4B via Unsloth (multimodal, fine-tuned on text Q&A)
  • Method: LoRA (rank 16, bf16 precision, 0.47% of parameters trained)
  • Hardware: RTX 4080 16GB, running through WSL2 (Debian Trixie)
  • Data: 3,004 JSONL pairs (1,772 Claude + 1,232 OpenRouter Qwen3.5 Plus -- zero cross-source duplicates)
  • Training: 2 epochs, batch size 4 (gradient accumulation), adamw_8bit optimiser
  • Export: GGUF for Ollama (Q4_K_M + Q8_0)
  • Training time: ~78 minutes with Unsloth on WSL2

Three training scripts across three platforms:

Script                  Platform                Framework
train_unsloth.py        WSL2/Linux (RTX 4080)   Unsloth + CUDA
train_transformers.py   Windows (RTX 4080)      transformers + PEFT
train_mlx.py            Mac Mini M4             MLX + Metal

There's also generate_qa_openrouter.py for regenerating training data via OpenRouter whenever the docs change. It runs 10 parallel workers, chunks requests to avoid JSON truncation, deduplicates against existing data, and writes incrementally so a crash at doc 30 doesn't lose docs 1-29.
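The crash-resilience trick is nothing fancier than append-mode JSONL, flushed per document. A sketch of the idea (function name is mine, not the script's):

```python
import json

def append_pairs(path: str, pairs: list[dict]) -> None:
    """Append one document's Q&A pairs as JSONL, one object per line.

    Opening in append mode per document means a crash on doc 30
    leaves docs 1-29 safely on disk.
    """
    with open(path, "a", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```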


I now have a language model that knows my homelab better than I do. It knows every IP address, every VLAN, every backup schedule, and every fix I've ever documented.

It cannot, however, tell me why Windows said Permission denied for a binary that didn't exist. Some mysteries are beyond even 4.56 billion parameters.

But ask it which container runs the robot vacuum's brain? It'll tell you the ID, the IP, and which node it lives on. Without even thinking about it.