The Need for Speculative Speed
Chasing faster local LLM inference on Apple Silicon — the wins, one spectacular near-miss, and the bug I almost walked away from.
There's a specific impatience that only shows up once you run language models locally. The cloud spoils you rotten, then you pull everything in-house and watch a model emit tokens at the speed of a tired typist. Something inside you whispers: this can be faster.
So when llama.cpp merged multi-token prediction — MTP, a flavour of speculative decoding — I wanted it on my Mac. Not eventually. Now.
The free lunch that's actually real
Normal decoding is a strict one-token-per-forward-pass tax. Speculative decoding cheats: the model drafts several tokens ahead, then verifies them in a single pass. Accepted drafts are free. Rejected ones cost nothing but pride.
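For the curious, the draft-then-verify loop fits in a few lines if you swap the models for toy functions. Everything here is a stand-in: `target_next` and `draft_next` are hypothetical callables over token lists, not llama.cpp or MLX APIs, and the sketch only covers greedy decoding. A real engine scores all drafted positions in one batched forward pass; the verification loop below stands in for that single pass.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], target_next: NextFn,
                     draft_next: NextFn, k: int = 4) -> List[Token]:
    """Return the tokens accepted in one draft-and-verify round (greedy only)."""
    # 1. Draft k tokens with the cheap drafter.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. Verify: keep drafted tokens for as long as they match the target's choice.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the verification pass yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], target_next: NextFn, draft_next: NextFn,
             n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens by repeating draft-and-verify rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, target_next, draft_next, k))
    return out[len(prompt):len(prompt) + n_tokens]
```

The property worth noticing is that the output is identical to plain greedy decoding with the target alone. The drafter only changes how many target passes it takes to get there.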
The first win came easy. A project called MTPLX runs a model's own built-in MTP heads as an internal drafter: no second model needed, exact sampling, no quality trade. On Qwen3.6-27B it took decode speed from 10 tokens/sec to 25. Two and a half times the speed, same output. If this were a film, the credits would roll here.
It was not a film.
The 35B that fought back
The bigger model — a 35-billion-parameter mixture-of-experts — was supposed to be next. Instead it became a saga.
The right runtime for it was a different engine, lightning-mlx. And lightning-mlx was fast — about 78 tokens/sec generating text. The catch hid in one word: generating. The moment I asked it to stream tokens, the way every chat client on earth does, it fell apart. Streamed responses arrived with a polite "hello", a "goodbye", and almost none of the actual content in between.
I tested every configuration. I disabled every clever feature. With plain greedy decoding and zero speculation, streaming still dropped the content on the floor. I concluded it was a broken subsystem, an upstream rewrite, not my problem to fix. I wrote it off. I moved the 35B back to the slow-but-reliable path and told myself that was the responsible call.
It nagged at me.
The bug I almost didn't fix
A day later I went back, stopped theorising, and did the boring thing: I added print statements and watched what actually happened. Three runs later, there it was.
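For a streaming pipeline, the boring thing can be as small as a pass-through wrapper dropped around each stage. This is a generic sketch rather than anything from lightning-mlx; `tap` and the stage labels are names I made up for illustration.

```python
import sys
from typing import Iterable, Iterator, TextIO


def tap(label: str, chunks: Iterable[str], out: TextIO = sys.stderr) -> Iterator[str]:
    """Yield chunks unchanged, printing each one as it passes this stage."""
    count = 0
    for chunk in chunks:
        count += 1
        print(f"[{label}] #{count} len={len(chunk)} {chunk!r}", file=out)
        yield chunk
    print(f"[{label}] done, {count} chunks", file=out)
```

Wrap the stream once before a suspect stage and once after it, then compare the chunk counts. If forty chunks go in and two come out, you have found the stage that is eating them.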
The streaming code asked "are we in tool-calling mode?" — and got the answer yes, always, because the tool-call parser defaults to switched-on. That flipped on a set of heuristics designed to suppress junk during tool calls. One of them quietly dropped any content chunk shorter than 16 characters. Streaming, of course, emits content a few characters at a time.
So the engine was working perfectly and then a safety filter, meant for a completely different situation, was eating the entire response on the way out the door.
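A toy reconstruction of that failure mode, for illustration only. `filter_chunks` and `MIN_CONTENT_CHARS` are hypothetical names and this is not the engine's actual code; the only detail taken from the real thing is the 16-character threshold.

```python
from typing import Iterable, List

MIN_CONTENT_CHARS = 16  # threshold from the post; the constant name is made up


def filter_chunks(chunks: Iterable[str], tool_mode: bool) -> List[str]:
    """Drop short content chunks whenever the tool-call heuristics are active."""
    if not tool_mode:
        return list(chunks)
    return [c for c in chunks if len(c) >= MIN_CONTENT_CHARS]


# A streamed response arrives a few characters at a time.
streamed = ["The", " quick", " brown", " fox", " jumps."]
```

With `tool_mode=True`, `filter_chunks(streamed, True)` returns an empty list because every chunk is under the limit. The same text delivered as one 26-character string passes the filter untouched, which would be consistent with non-streamed generation looking healthy the whole time.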
The fix was two lines.
    tool_mode = bool(request.tools)  # not "or a parser exists"
That's it. The thing I'd written off as a rewrite was a wrong boolean. The 35B now streams cleanly at 95 tokens/sec, faster than anything else in the lineup and a number I'd flatly told myself was impossible a day earlier.
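The difference between the old check and the new one can be sketched side by side. Only `bool(request.tools)` comes from the actual fix; the `Request` class, the `parser_configured` argument, and the shape of the "before" check are my assumptions, inferred from the comment on that line.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """Hypothetical stand-in for the engine's request object."""
    tools: List[dict] = field(default_factory=list)


def tool_mode_before(request: Request, parser_configured: bool = True) -> bool:
    # Assumed shape of the old check: true whenever a parser exists at all,
    # and the parser is switched on by default.
    return bool(request.tools) or parser_configured


def tool_mode_after(request: Request) -> bool:
    # The fix: only requests that actually pass tools enter tool mode.
    return bool(request.tools)
```

A plain chat request with no tools lands in tool mode under the first version and stays out of it under the second. A request that does carry tools behaves the same either way, which is why the change is safe for real tool calls.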
Gemma 4 just wanted to watch the world burn
Then there was Gemma 4, the brand-new model I wanted as a quality option. Brand-new is the operative phrase — its architecture was about two weeks old, and the tooling showed it. One server crashed with a GPU threading error. Another mangled the output. A third could only benchmark it, never serve it.
The fourth worked — after I discovered the model package had its chat template surgically removed, restored it from the original, and learned that turning on the model's thinking mode made it pour its entire answer into the "reasoning" channel and never actually answer. Thinking off, template restored: a perfectly pleasant model.
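A missing chat template is cheap to check for before blaming the server. The sketch below assumes a Hugging Face-style model folder, where the template lives either in the `chat_template` field of `tokenizer_config.json` or in a standalone `chat_template.jinja` file. Whether a given package follows that layout is an assumption, and `has_chat_template` is a helper name invented for this example.

```python
import json
from pathlib import Path


def has_chat_template(model_dir: str) -> bool:
    """Best-effort check for a chat template in a Hugging Face-style model folder."""
    root = Path(model_dir)
    # Some exports ship the template as a standalone file.
    if (root / "chat_template.jinja").is_file():
        return True
    config_path = root / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # An absent or empty field both count as "no template".
    return bool(config.get("chat_template"))
```

If this returns `False` for a converted package but `True` for the original release, copying the template across is a reasonable first thing to try.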
Cleaning house
With all three slots on fast, modern MLX backends, the old llama.cpp setup had nothing left to do — it was slower on every model it could run. So it got retired: GGUF models deleted, the build deleted, the now-pointless menu card pulled from the app. That alone reclaimed about 80 GB of disk. The local stack is now one tidy thing instead of two overlapping ones.
What the whole detour taught me
A few things stuck:
Benchmark on real content, not a convenient prompt. One of these backends looked like a winner on a synthetic test and quietly lost by 30% on actual prose. The test prompt was doing it a favour.
"Broken subsystem" is often "wrong boolean" wearing a trench coat. I called something unfixable, then fixed it in two lines the moment I stopped guessing and started instrumenting. Print statements remain undefeated.
Bleeding-edge model architectures bleed. If a model is younger than the milk in your fridge, expect its tooling to be held together with optimism.
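On the first of those lessons: comparing backends per prompt, rather than on one number, takes very little code. This is a minimal sketch under loud assumptions. Each backend is a hypothetical callable that takes a prompt and yields tokens, every yielded item counts as one token, and the timing lumps prompt processing together with decoding, so treat the figures as relative.

```python
import time
from typing import Callable, Dict, Iterable

Backend = Callable[[str], Iterable[str]]


def tokens_per_second(stream: Backend, prompt: str,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Throughput for one prompt, counting every yielded item as one token."""
    start = clock()
    n_tokens = sum(1 for _ in stream(prompt))
    elapsed = clock() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def compare(backends: Dict[str, Backend],
            prompts: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return {prompt_name: {backend_name: tokens/sec}} for a per-prompt view."""
    return {
        p_name: {b_name: tokens_per_second(fn, text) for b_name, fn in backends.items()}
        for p_name, text in prompts.items()
    }
```

Feed it at least one synthetic prompt and one chunk of real prose. A backend that wins the first row and loses the second is exactly the trap described above.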
The scoreboard: every model faster than where it started, one of them dramatically so, a runtime bug fixed for good, and a stack that's simpler than when I began. Next time the impatience whispers, I'll at least remember to add the print statements first.
Sharing the rake
One last thing. That two-line streaming fix was never model-specific — it broke streaming for anyone using that engine, on any model. Keeping it as a private patch felt a bit like finding a rake in the dark, stepping on it, and then carefully setting it back down for the next person.
So it went upstream as a pull request — root cause, repro, the whole story. Best case, it gets merged and nobody else loses an afternoon to a stray boolean. Worst case, I still have my copy and lost nothing. That's the quiet deal with building on open source: when you trip over the rake, you also get to move it.