Three Lies My LLM Stack Told Me

It started with an innocent question: "is there anything we can improve?" Two days later I'd benchmarked eleven model-and-runtime combinations, deleted over 150 GB of dead weights, patched an inference server to load a model format that doesn't officially exist yet, and discovered that my speed numbers had been lying to me the entire time.

This is the story of tuning a local LLM stack on an M5 Pro MacBook, and the three separate times a number turned out to be fiction.

Lie number one: my own benchmark

My menu-bar app (LlamaGUI, a little SwiftUI thing that starts and stops MLX backends) has a built-in tok/s probe. It streams a fixed prompt and counts the SSE chunks coming back. One chunk, one token. Obvious, right?

Wrong. My main model runs on lightning-mlx with multi-token-prediction speculation, and when speculation accepts a batch of draft tokens, the server packs them into a single chunk. The probe was reporting 19 tok/s on a model that actually decodes at 80. The fix was to read the server's own usage.completion_tokens and divide by the decode window:

ttft=0.30s tok/s=78.4 chunks=47 completion_tokens=216

47 chunks. 216 tokens. An undercount of more than 4x, sitting in my stats display since the day I built it. The local LLM community apparently knows this one well - "never count chunks on a speculating server" - but I had to find it the hard way.
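For anyone who wants to check their own probe, here is a minimal sketch of the two counting methods side by side. It is not the code from LlamaGUI (that is Swift); it assumes an OpenAI-style SSE stream in which some event carries a usage object, and the function names are made up for illustration. Whether your server emits usage while streaming is something to verify, not assume.

```python
import json


def parse_sse_stream(lines):
    """Return (content_chunks, completion_tokens) from SSE 'data:' lines.

    Assumes OpenAI-style chunk objects; completion_tokens stays None
    if the server never sends a usage object.
    """
    chunks = 0
    completion_tokens = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alives, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        choices = event.get("choices") or []
        # One chunk may carry several tokens when speculation accepts a batch.
        if any((c.get("delta") or {}).get("content") for c in choices):
            chunks += 1
        usage = event.get("usage")
        if usage and "completion_tokens" in usage:
            completion_tokens = usage["completion_tokens"]
    return chunks, completion_tokens


def decode_tok_per_s(completion_tokens, ttft_s, total_s):
    """Tokens divided by the decode window (total time minus time to first token)."""
    window = total_s - ttft_s
    if window <= 0:
        raise ValueError("decode window must be positive")
    return completion_tokens / window
```

If the chunk count and completion_tokens disagree by more than a token or two, the chunk count is the one that is wrong.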

Lie number two: the config file

Youssofal publishes excellent MTPLX-optimised quants, and his new 6-bit "Balance" build of Qwen3.6-35B-A3B claims 126 tok/s with verified quality. One problem: its metadata says it was forged with MTPLX 0.3.8 - a version that exists only on the author's machine. Public 0.3.7 refuses to load it, and in the GitHub issue full of "same here" comments, the author says support lands in the next big release, now called V1.0. The 0.3.8 that built this model will apparently never ship; it grew into something bigger.

I have the MTPLX source as an editable install, so I went digging. Three layers down:

The bundle declares an MTP quantisation policy (prequantized-int4) that 0.3.7 has never heard of. Easy - add it to the accepted set.
The MoE sidecar stores its 256 experts as individual tensors, but mlx-lm wants them stacked into switch layers. The trunk loader has a sanitize step for this; the MTP injection path never ran it. So the patch stacks the weight/scales/biases triples itself.
And the kicker: the bundle's config says the experts are quantised at group size 64. They are not. The actual tensors are packed at group size 32. The config just lies. The patch now derives bits and group size from the tensor shapes and ignores what the metadata claims.
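The shape trick in that last step can be sketched in a few lines. This is not the actual patch; it assumes the usual MLX-style layout, where quantised values are packed into 32-bit words along the input dimension and there is one scale per group, and it assumes the layer's true input width is known from elsewhere. The function name and the example numbers are illustrative only.

```python
def infer_quant_params(packed_cols, scales_cols, in_features):
    """Infer (bits, group_size) from tensor shapes instead of trusting config.

    Assumed layout (check against your loader):
      packed_cols = in_features * bits / 32   (values packed into uint32)
      scales_cols = in_features / group_size  (one scale per group)
    """
    if scales_cols <= 0 or in_features % scales_cols != 0:
        raise ValueError("scales shape does not divide the input width")
    group_size = in_features // scales_cols

    total_bits = packed_cols * 32
    if total_bits % in_features != 0:
        raise ValueError("packed shape is not a whole number of bits per value")
    bits = total_bits // in_features
    return bits, group_size


# Illustrative numbers: a 2048-wide layer stored as 256 packed columns
# with 64 scale columns is 4-bit at group size 32, whatever the config says.
bits, group_size = infer_quant_params(packed_cols=256, scales_cols=64, in_features=2048)
config_group_size = 64  # what the metadata claims in this made-up example
mismatch = group_size != config_group_size
```

The point of the design is that the shapes cannot lie: a config can be stale, but a tensor packed at group size 32 has twice as many scale columns as one packed at 64.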

After all that, Balance runs at 39-51 tok/s on my machine. The promised 126? That was measured on an M5 Max with more than twice my memory bandwidth. Which brings me to...

Lie number three: someone else's defaults

lightning-mlx defaults to drafting 5 speculative tokens per step, a value tuned on - you guessed it - an M5 Max. On my M5 Pro, a quick flag sweep found that depth 3 wins by twenty percent:

baseline (d=5):  74-79 tok/s on code
draft tokens 3:  89-95 tok/s on code

Same model, same machine, one flag. The deeper drafts were just wasting verify time on tokens that got rejected anyway, because my chip can't feed the verifier fast enough to make them worthwhile.
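A sweep like this is easy to script. The sketch below is a generic version, not my actual harness: `measure` is a hypothetical callable that restarts the server at a given draft depth and returns decode tok/s, and I am deliberately not spelling out the flag, since it depends on your server version.

```python
from statistics import median


def sweep_draft_depth(measure, depths, runs=3):
    """Measure each draft depth several times and return (best_depth, medians).

    measure(depth) -> tok/s is supplied by the caller (hypothetical here).
    The median damps one-off outliers such as thermal hiccups.
    """
    results = {}
    for depth in depths:
        samples = [measure(depth) for _ in range(runs)]
        results[depth] = median(samples)
    best = max(results, key=results.get)
    return best, results
```

Run it on the workload you actually care about; a depth that wins on code may not win on prose, because acceptance rates differ.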

The sweep also caught two traps. --mtp-optimistic doubles throughput on paper by skipping the speculation acceptance check - and promptly emitted malformed tool calls and degenerate output. And --stream-interval 50 looked 15% faster while silently dropping most of the streamed text. Both would have looked great on a dashboard.
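The defence against both traps is to refuse to record a speed unless the output is also correct. Here is a minimal sketch of such a gate. It assumes you have a reference text you trust (for instance from a non-streaming request with greedy decoding, which is my assumption, not a guarantee of identical output) and that tool-call arguments arrive as JSON strings; exact equality may be too strict for some setups.

```python
import json


def validate_run(streamed_chunks, final_text, tool_call_arguments=()):
    """Return a list of problems; an empty list means the run may be counted."""
    problems = []

    # Catches streams that silently drop text.
    streamed = "".join(streamed_chunks)
    if streamed != final_text:
        problems.append(
            f"stream mismatch: got {len(streamed)} of {len(final_text)} chars"
        )

    # Catches malformed tool calls from over-eager speculation.
    for i, raw in enumerate(tool_call_arguments):
        try:
            json.loads(raw)
        except (TypeError, ValueError):
            problems.append(f"tool call {i}: arguments are not valid JSON")

    return problems
```

A benchmark row with a non-empty problem list gets thrown away, however good its tok/s looks.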

The carnage

Along the way I benchmarked everything that looked promising on Hugging Face. Cohere's brand-new North Mini Code: fast, but it leaks its chain-of-thought into the output with no way to parse it out. Devstral Small 2: great benchmark scores, but dense 24B models crawl at 13-17 tok/s on this memory bandwidth. A distillation-calibrated DWQ quant: genuinely good, but within noise of what I already had. They're all gone now, along with two orphaned quants from a runtime I stopped using months ago. Disk went from 68% full to 60% free.

The final lineup is three models, every one benchmarked within the same 48 hours: the 35B MoE at ~90 tok/s for daily driving, the dense 27B as the careful second opinion, and a small Gemma for when I want a non-Qwen take.

What I actually learned

Every number in this story came with an asterisk I couldn't see until I measured it myself, on my hardware, with my workload. The benchmark was tuned for a different counting method, the config for a different loader, the defaults for a different chip, and the headline claims for a different memory bus.

Local inference is bandwidth-bound physics wrapped in software that mostly hasn't been calibrated for your machine. The single most useful thing in my stack right now isn't a model or a server - it's a fifty-line benchmark script with a warmup pass and a cache-busting nonce.
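The core of such a script fits in far fewer than fifty lines. What follows is a sketch of the idea rather than my actual script: `generate` is a hypothetical callable that sends one request and returns the server-reported completion token count, and the nonce goes at the front of the prompt on the assumption that the server caches by prompt prefix.

```python
import time
import uuid
from statistics import median


def bust_cache(prompt):
    """Prefix a random nonce so a prefix cache cannot reuse an earlier run."""
    return f"[run {uuid.uuid4().hex}] {prompt}"


def benchmark(generate, prompt, runs=3, warmup=1, clock=time.perf_counter):
    """Return median end-to-end tok/s over `runs` timed requests.

    generate(prompt) -> completion_tokens is supplied by the caller.
    Warmup requests are sent but not timed, so model load and first-run
    compilation do not pollute the numbers.
    """
    for _ in range(warmup):
        generate(bust_cache(prompt))

    rates = []
    for _ in range(runs):
        start = clock()
        tokens = generate(bust_cache(prompt))
        elapsed = clock() - start
        rates.append(tokens / elapsed)
    return median(rates)
```

Note that this times the whole request, prefill included; to get pure decode speed you would subtract time to first token, as in the probe fix earlier.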

Next up: MTPLX V1.0 is supposed to land soon, which makes my compatibility patch disposable and that Balance model officially supported. And mlx-lm has a native MTP pull request brewing that could speed up the reference server for free. I'll believe both when I measure them.

