The Brain in the Other Room

Or: how I stopped letting a laptop sleep on the embedding job.

The previous setup ran a self-hosted RAG stack on a laptop. Open WebUI in Docker, Chroma for vectors, Tika for extraction, Infinity for embeddings, and an LLM running on the same box via the local server. It worked. It also stopped working every time the lid closed.

This is the post about getting the whole thing off the laptop and onto something that doesn't have a sleep mode.

The Plan, Such As It Was

Move all the Docker bits into a single LXC on the Proxmox cluster. Always on. No laptop involved. The catch: the LXC has no GPU. Reasonable for the half-dozen sidecars — vector DB, extraction, observability — but a real problem for the two pieces that genuinely need a GPU: the LLM and the embedding model.

So those got pushed across the LAN to a Windows box that has a discrete GPU and otherwise spends most of its time on games. Ollama running there, listening on the local network, serving both jobs from the same daemon.

A clean split: the LXC orchestrates, the GPU box does the math.

The Embedding Tax

The first cut still kept embeddings on the CPU side, in Infinity, in the LXC. The thinking was that rerank and embed could share a container — they were already roommates.

They could not.

Cold-warming bge-m3 and bge-reranker-v2-m3 on four CPU cores took almost ten minutes. The Docker healthcheck — set to a generous-feeling six-minute window — would time out before the server bound to its port. Open WebUI, depending on service_healthy, would never start.

The fix was both architectural and trivial. Pull bge-m3 on Ollama. Point Open WebUI's embedding engine at it. Drop the embedding model from Infinity entirely; let Infinity do reranking only, on its tiny per-query workload where CPU latency is fine.

Bulk ingest went from "come back in twenty minutes" to interactive. The 13 GB embedding-cache shrank to 1.1 GB. The healthcheck got a fifteen-minute grace window and stopped lying about whether things were ready.

The Model That Wouldn't Load

Qwen3.6-35B-A3B had been on the wishlist — a 35B-parameter mixture-of-experts model with only 3B active per token, the kind of shape that punches above its weight on consumer hardware. Unsloth had a quantised GGUF up. Pulled it. Pointed Ollama at it. Got back, immediately:

error loading model architecture: unknown model architecture: 'qwen35moe'

Turned out Qwen3.6 was already in the official Ollama library. Switched to that. The direct qwen3.6:35b-a3b build loaded on the first try.

The third-party GGUF and the official one used the same underlying architecture name, but the metadata layout differed in a way the daemon tolerated for one and not the other. Lesson logged.

The 262,144-Token Trap

Got qwen3.6:27b loaded. First chat: 23 seconds for a one-word reply. Checked ollama ps:

context_length: 262144
13.7 GB / 44.2 GB in VRAM (31%)

A 17 GB model weighing 44 GB in memory, with under a third on the GPU. The math felt wrong until the context length explained it: a quarter-million tokens. The KV cache pre-allocates for that, even if a fraction of it is ever used — about 27 GB on its own. That's what was spilling to system RAM.

The fix is one environment variable: OLLAMA_CONTEXT_LENGTH=8192. For RAG, where retrieved chunks rarely break 5K tokens, you don't need a quarter-million. Cap it, and the model fits in VRAM with room to spare. Throughput goes from "patient" to "snappy."

Setting an environment variable on Windows over SSH is, separately, an essay's worth of suffering. The variable is set. The daemon needs to be restarted from a desktop session to inherit it. Headless won't do.

What Works

The reverse proxy fronts the chat UI on a wildcard cert. Vector collections survive container recreates. Reranking still runs on CPU because nobody has time to chase 100ms per query. Single-sign-on through the homelab IdP merges by email so the original admin account stays admin. The LXC reboots and the whole stack comes back unattended.

What's Next

Actually using it. The architecture survived, the docs survived, the wiki entry survived. What hasn't been done is pointing it at the thing it was built for: a real corpus of years of homelab notes that keep getting lost in. That's the next post.

If you've ever embedded a thousand-document collection on a laptop CPU and watched the fan develop a personality, you'll know why this one mattered.

The Brain in the Other Room

The Plan, Such As It Was

The Embedding Tax

The Model That Wouldn't Load

The 262,144-Token Trap

What Works

What's Next

Read more

Colour Me Impressed: Building an AI Colouring Book Factory in an Afternoon

The Charger, the Car, and the API That Wasn't There

Three Lies My LLM Stack Told Me

I Just Wanted It Louder (So Naturally I Lost All Sound First)