Running Local LLMs: An Easter Weekend Rabbit Hole
Tags: LLM Inference, Apple Silicon, Benchmarking, Local AI

It started harmlessly enough. Easter weekend, a four-day break, and a thought that wouldn’t leave me alone: Can you actually run agentic coding workflows against a local LLM?

The pitch writes itself. If you’re GPU poor (no H100 cluster) or token poor (API costs add up), Apple Silicon gives you something interesting: you’re memory rich. My M3 Max has 64 GB of unified memory shared between CPU and GPU. That’s enough to load a 35-billion-parameter model at 4-bit quantization and still have headroom for a KV cache at 128,000 tokens of context. Theoretically, you could run AI inference for free. Minus hardware, electricity, and the hours you’ll never get back setting it all up.
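The arithmetic behind that claim is easy to sketch. The snippet below is a back-of-envelope estimate, not something taken from the benchmark harness: the layer count, KV head count, and head dimension are hypothetical placeholders (real values come from a model's config), and it ignores quantization scales, activations, and OS overhead.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bits_per_value: int = 16) -> float:
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    n_values = 2 * n_tokens * n_layers * n_kv_heads * head_dim
    return n_values * bits_per_value / 8 / 1e9


# 35B parameters at 4 bits per weight.
weights = weights_gb(35e9, 4)

# HYPOTHETICAL attention shape, chosen only for illustration.
kv = kv_cache_gb(128_000, n_layers=40, n_kv_heads=4, head_dim=128)

print(f"weights ~{weights:.1f} GB, KV cache ~{kv:.1f} GB, total ~{weights + kv:.1f} GB")
```

Under those assumed dimensions the total lands near 28 GB, which is why 64 GB of unified memory looks comfortable on paper. A model with more KV heads or more attention layers would shift the picture quickly.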

I wanted to find out if theory matched reality.

Down the Rabbit Hole

Any reasonable person would have picked one model, one backend, tested it, and moved on with their weekend. I am not that person.

The model landscape alone was overwhelming. Google had released Gemma 4 just days earlier: a whole family of Mixture-of-Experts models (26B, E4B, E2B) plus a dense 31B variant. Qwen had shipped two compelling options: Qwen3.5-35B-A3B (a MoE model activating only 3.5B parameters per token) and Qwen3-Coder-Next (a dense coding specialist). Then there’s the question of dense versus MoE architecture, different quantization formats (4-bit, 8-bit, MXFP4, NVFP4, GGUF), and KV cache strategies like Google’s TurboQuant (paper) that compress the cache from FP16 down to 3.5 bits.

TurboQuant had just been announced on March 24 (to be presented at ICLR 2026) and triggered a sell-off in memory stocks. The market assumed KV cache compression would reduce HBM demand, panic selling amplified the move (particularly out of China), and it coincided with Q2 hyperscaler budget resets that made demand look weaker than it was. The reaction was mostly sentiment: TurboQuant only compresses the KV cache (temporary inference memory), not model weights or training workloads, which are the actual drivers of memory demand.

But for local inference on a laptop? The idea is real: quantize the KV cache to 3 bits with zero accuracy loss, cut memory 6×, and unlock longer context on constrained hardware. oMLX 0.3.4 already shipped fused Metal kernels for it.

But even if you pick a model, you still have to pick a backend. And there are many: Ollama, oMLX, mlx-vlm, mlx-lm, vllm-mlx, llama-server, LM Studio, Docker Model Runner. Each has different performance characteristics, different model format support, different endpoint compatibility (OpenAI-compatible API, thinking/reasoning tokens, function calling, tool use). Most people install LM Studio and call it a day.

My engineer brain wouldn’t let me. Show me the numbers.

The Benchmark

So I built a benchmark harness (with liberal help from AI, naturally). A ~2,000-line Python tool that manages the full server lifecycle for each configuration: start the inference server, wait for health, run a warmup request, execute three measured runs, record everything, tear down, move to the next config. It measures what actually matters for agentic workloads: wall-clock time-to-first-token (TTFT) and decode throughput at realistic context depths. Not at 23 tokens of input like most model cards report, but at 32k, 64k, and 128k tokens. The kind of context depth you hit after five to ten turns of a coding agent reading files, proposing edits, and running tests.
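To make the two metrics concrete, here is a minimal sketch of how TTFT and decode throughput can be derived from the arrival times of streamed tokens. It is not the harness's actual code: the function name, the dataclass, and the choice to exclude the first token from the decode window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    ttft_s: float       # seconds from sending the request to the first token
    decode_tps: float   # tokens per second once generation has started


def measure_run(request_start: float, token_times: list[float]) -> RunMetrics:
    """Derive TTFT and decode throughput from wall-clock token arrival times.

    ASSUMPTION: decode throughput is measured from the first token to the
    last one, so prefill time is counted in TTFT only.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute decode throughput")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    return RunMetrics(ttft_s=ttft, decode_tps=(len(token_times) - 1) / decode_window)


# Synthetic timestamps: first token after 250 ms, then one token every 100 ms.
metrics = measure_run(0.0, [0.25, 0.35, 0.45, 0.55, 0.65])
print(metrics)
```

In a real run the timestamps would come from something like `time.perf_counter()` called as each chunk of a streaming response arrives. Keeping prefill out of the decode window matters at 32k+ context, where prefill can dominate the total request time.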

57 configurations. 8 backends. 7 models. 6 quantization formats. 4 context tiers.

There are existing tools for this space. llmfit tells you which models fit on your hardware. whatcani.run crowdsources short-prompt decode speeds across GPUs. Both are useful, but both answer “can I run this model?” at baseline context. Neither measures what happens at turn seven of a coding session, when your context window is 32,000 tokens deep and growing. That’s the gap this benchmark targets.

Each full run took multiple hours. I ran the suite multiple times over the long weekend as backends shipped updates, models dropped, and things broke and got fixed. 762 data rows and counting.

The benchmark harness is open source. Run it yourself and tell me what you find.

What the Numbers Say

Here’s what surprised me.

Short-prompt benchmarks lie

A model that advertises 65 tokens per second, measured at 23 tokens of input, can fail with an out-of-memory error at turn seven of a coding session. The effective throughput at the point of failure is zero.

The rankings at short prompts do not predict the rankings at agentic context depths. vllm-mlx scored the highest decode throughput in my evaluation: 105.9 t/s at baseline. But at 32,000 tokens of context, its TTFT exploded to 27.2 seconds. Meanwhile, mlx-lm stayed under 300 ms. The ranking inverts completely.

| Backend | TTFT at baseline | TTFT at 32k | What happened |
| --- | --- | --- | --- |
| vllm-mlx | 148 ms | 27,200 ms | 184× slower |
| mlx-lm | 178 ms | 286 ms | 1.6× slower |
| Ollama 0.19 | 59 ms | 80 ms | 1.4× slower |

If you only benchmark short prompts, you’ll pick the wrong backend.
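The slowdown column is just the ratio of the two TTFT values, and the ranking shift falls out of sorting on each column. A small sketch using only the three rows from the table above:

```python
# TTFT in milliseconds, copied from the table above: (baseline, 32k context).
ttft_ms = {
    "vllm-mlx": (148, 27_200),
    "mlx-lm": (178, 286),
    "Ollama 0.19": (59, 80),
}


def slowdown(baseline_ms: float, deep_ms: float) -> float:
    """How many times slower TTFT is at depth than at baseline."""
    return deep_ms / baseline_ms


slowdowns = {name: slowdown(*pair) for name, pair in ttft_ms.items()}

# Rank backends from fastest to slowest TTFT at each context depth.
by_baseline = sorted(ttft_ms, key=lambda name: ttft_ms[name][0])
by_32k = sorted(ttft_ms, key=lambda name: ttft_ms[name][1])

for name, factor in slowdowns.items():
    print(f"{name}: {factor:.1f}x slower at 32k")
print("baseline order:", by_baseline)
print("32k order:     ", by_32k)
```

On TTFT alone, vllm-mlx sits ahead of mlx-lm at baseline and drops to last place at 32k.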

[Figure: TTFT scaling with context depth across five backends. mlx-lm and Ollama stay under 1 second, while vllm-mlx, mlx-vlm, and oMLX explode to 30-240 seconds at 32k+ context.]

MoE is the only game in town at 128k

On 64 GB unified memory, dense models of 30B+ parameters simply cannot fit at long context. The KV cache eats whatever headroom the model weights leave.

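| Model | Architecture | 128k context | Decode (t/s) |
| --- | --- | --- | --- |
| Qwen3.5-35B-A3B | MoE (3.5B active) | Works | 42 |
| Gemma 4 E2B | MoE (2B active) | Works | 64 |
| Gemma 4 26B | MoE (4B active) | Works | 34 |
| Qwen3-32B | Dense (32B) | OOM | - |
| Qwen3-Coder-Next | Dense | OOM | - |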

MoE wins because fewer active parameters per token means less memory bandwidth demand during decode, and the smaller weight footprint leaves room for KV cache growth.
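The bandwidth half of that argument can be sketched as a rough ceiling: if decode is memory-bound, each generated token has to read every active weight once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The 400 GB/s bandwidth figure below is an assumption, not a measured value, and the model ignores KV cache reads, expert routing, and kernel overhead, so real throughput lands well below it.

```python
# ASSUMPTION: memory bandwidth of the test machine, in GB/s. Not measured here.
ASSUMED_BANDWIDTH_GB_S = 400.0


def decode_ceiling_tps(active_params: float, bits_per_weight: float,
                       bandwidth_gb_s: float = ASSUMED_BANDWIDTH_GB_S) -> float:
    """Upper bound on decode speed when every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


moe_ceiling = decode_ceiling_tps(3.5e9, 4)    # MoE with 3.5B active parameters, 4-bit
dense_ceiling = decode_ceiling_tps(32e9, 4)   # dense 32B, 4-bit

print(f"MoE ceiling:   {moe_ceiling:.0f} t/s")
print(f"dense ceiling: {dense_ceiling:.0f} t/s")
```

The absolute numbers are loose, but the ratio is the point: with roughly nine times fewer weights to read per token, the MoE model has roughly nine times the headroom.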

Prefix caching: the single biggest win

Backends with prefix caching (Ollama 0.19, mlx-lm) reuse the KV cache from previous requests when the conversation prefix matches. In an agentic loop, nearly the entire prefix is reusable after the first request.

57 seconds cold. 91 milliseconds warm. 626×.

That first request hurts. Every request after that is instant.
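Conceptually, prefix caching comes down to finding how many leading tokens of the new request match what is already cached, then running prefill only on the rest. The sketch below illustrates that idea with plain token ID lists. It is not how Ollama or mlx-lm implement it internally, and the function names are made up for this example.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading token IDs that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: list[int], incoming: list[int]) -> int:
    """Tokens that still need a forward pass once the shared prefix is reused."""
    return len(incoming) - shared_prefix_len(cached, incoming)


# Synthetic example: a 32,000-token conversation, then a 150-token new turn.
previous_request = list(range(32_000))
next_request = previous_request + [1] * 150

print("cold:", tokens_to_prefill([], next_request))                # nothing cached
print("warm:", tokens_to_prefill(previous_request, next_request))  # prefix reused
```

The warm case processes 150 tokens instead of 32,150, which is the shape of the cold-versus-warm gap above. Anything that changes early tokens (a reordered system prompt, a timestamp at the top) cuts the shared prefix short and forces a much larger prefill.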

[Figure: Cold versus warm TTFT across backends. Warm TTFT ranges from 80 ms to 470 ms, cold TTFT from 29 to 299 seconds, up to a 637× difference.]

mlx-lm 0.31.2: the April 6 picture

With Gemma 4 support landing in mlx-lm 0.31.2 this week, the numbers look like this:

| Model | Decode (t/s) | Warm TTFT |
| --- | --- | --- |
| Gemma 4 E2B (2B MoE) | 110.2 | 187 ms |
| Qwen3.5-35B-A3B | 71.1 | 210 ms |
| Gemma 4 26B (26B MoE) | 65.0 | 251 ms |
| Gemma 4 E4B (4B MoE) | 60.1 | 213 ms |

These numbers would have been unthinkable six months ago on consumer hardware.

[Figure: Horizontal bar chart of decode throughput at baseline across 13 backend configurations. vllm-mlx leads at 105.9 t/s, followed by mlx-lm at 93.6 t/s.]

[Figure: Grouped bar chart showing decode throughput degradation at 32k, 64k, and 128k context across top backends.]

For context, mlx-vlm (the only backend that supported Gemma 4 last week) served the same model at 29 t/s with 20-second TTFT. That’s 160× faster TTFT and 2.3× faster decode from a single backend update that landed days later.

The Ecosystem Moves at Lightspeed

The local LLM ecosystem ships releases daily. Not weekly. Daily.

Ollama went from 0.18 (llama.cpp under the hood) to 0.19 (native MLX backend) and throughput jumped 2.4× overnight, from 37 to 90 tokens per second. Same hardware, same model, same quantization. A framework-level architectural change that dwarfed every quantization optimization I tested (all of which were within 3% of each other).

It’s a whack-a-mole situation. mlx-vlm 0.4.4 fixed Gemma 4 chunked prefill but broke Qwen3.5 streaming. mlx-lm from main added Gemma 4 support and fixed a Qwen3.5 memory leak (540 KB/token down to 20 KB/token). oMLX 0.3.4 shipped fused Metal kernels for TurboQuant KV cache compression that cut overhead from 43% to 8%. llama.cpp b8670 added --clear-idle for +20% throughput but changed the --flash-attn flag syntax, breaking existing scripts.

Along the way I discovered half a dozen bugs across different stacks. HuggingFace’s xet transfer protocol crashes with ENOSPC on macOS because it writes temp files to an 893 MB RAM disk despite 112 GB of free disk space. LM Studio’s memory guardrail refuses to load a 38 GB model with 54 GB free. Ollama’s brew service keeps running the old version after an upgrade because of a symlink mismatch. Each run of the benchmark surfaced something new.

Every time I thought I was done, something changed. Multiple times. All over that long Easter weekend.

What I’d Actually Use

After all that, here’s where I landed. Six setups, each optimized for a different constraint:

| Goal | Model | Backend | Format / Quant | KV Cache | Max Context | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Best all-around | Qwen3.5-35B-A3B | mlx-lm 0.31.2 | MLX 4-bit | default (FP16) | 128k | 71 t/s, 210 ms warm TTFT |
| Best speed/param ratio | Gemma 4 26B-A4B | mlx-lm 0.31.2 | MLX 4-bit | default (FP16) | 128k | 65 t/s, 251 ms warm TTFT |
| Fastest response | Qwen3.5-35B-A3B | Ollama 0.19 | INT4 | prefix cache (built-in) | 64k | 80 ms warm TTFT, 90 t/s decode |
| Minimum KV cache overhead | Qwen3.5-35B-A3B | mlx-lm + TurboQuant | MLX 4-bit | TurboQuant 4-bit | 128k | 38 t/s |
| SSD offload / multitasking | Qwen3.5-35B-A3B | oMLX 0.3.4 | MLX 4-bit | SSD-paged | 64k | 50 t/s |
| GGUF / pre-M5 | Qwen3-Coder-Next | llama-server b8670 | GGUF Q4_K_M | --kv-q8 (8-bit KV) | 32k | 44 t/s |

A few notes on the less obvious rows:

Gemma 4 26B is the sleeper pick. 4B active parameters (comparable to Qwen3.5’s 3.5B), nearly the same decode speed. I did not measure output quality, so this recommendation is purely on the numbers.

TurboQuant compresses the KV cache from FP16 to 4 bits during inference. Decode is slower (38 vs 71 t/s) because of the compression overhead, but on a memory-constrained machine this is the difference between fitting 128k context and crashing.

SSD offload is a different approach: oMLX pages model weights to disk instead of holding them in unified memory. You trade decode speed for the ability to run a 35B-parameter model alongside other applications without starving them of RAM.

Update: Qwen3.6-27B (April 22, 2026)

Qwen released Qwen3.6-27B the day I thought I was done updating this post. It’s a dense 27B VLM with a hybrid attention architecture: 64 layers total, but only 16 use full attention. The rest use Gated DeltaNet linear attention with fixed-size state. This means the KV cache cost is roughly a quarter of a standard 27B model.
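The "roughly a quarter" figure follows from counting which layers keep a per-token cache. In the sketch below only the 64 and 16 layer counts come from the model description. The per-layer KV width and the linear-attention state size are hypothetical placeholders, and the point is the scaling behavior, not the absolute sizes.

```python
def cache_values(n_tokens: int, n_full_layers: int, n_linear_layers: int,
                 kv_width: int, state_size: int) -> tuple[int, int]:
    """Return (growing, fixed) cache sizes, counted in stored values.

    growing: keys and values kept per token in every full-attention layer.
    fixed:   one constant-size state per linear-attention layer.
    """
    growing = 2 * n_tokens * n_full_layers * kv_width
    fixed = n_linear_layers * state_size
    return growing, fixed


KV_WIDTH = 1024         # HYPOTHETICAL: n_kv_heads * head_dim per layer
STATE_SIZE = 1_000_000  # HYPOTHETICAL: values in one linear-attention state

hybrid_32k = cache_values(32_000, 16, 48, KV_WIDTH, STATE_SIZE)
hybrid_64k = cache_values(64_000, 16, 48, KV_WIDTH, STATE_SIZE)
standard_64k = cache_values(64_000, 64, 0, KV_WIDTH, 0)

print("growing-cache ratio at 64k:", hybrid_64k[0] / standard_64k[0])
print("fixed state at 32k vs 64k:", hybrid_32k[1], hybrid_64k[1])
```

The per-token part of the cache is a quarter of the all-full-attention case, and the linear-attention state stays the same size no matter how long the context gets.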

On paper, that’s exciting for long-context workloads. In practice, at 4-bit quantization via mlx-lm:

| Prompt | Decode | Prefill | TTFT |
| --- | --- | --- | --- |
| short (23 tok) | 20.3 t/s | 93 t/s | 249 ms |
| code (751 tok) | 20.3 t/s | 3,012 t/s | 249 ms |
| context-32k | 17.3 t/s | 107k t/s | 304 ms |
| context-64k | 15.0 t/s | 145k t/s | 454 ms |

The MXFP4 quantization was marginally faster at 21.6 t/s decode. TurboQuant and NVFP4 matched the 4-bit baseline. The 8-bit variant was bandwidth-limited at 12 t/s. Ollama’s NVFP4 format matched mlx-lm at 20 t/s, but its Q4_K_M GGUF was 35% slower at 13 t/s. The vllm-mlx backend had the highest raw decode (23 t/s) but terrible TTFT (3.7 seconds on the code prompt).

The 128k context test crashed mlx-lm (connection refused), despite the theoretical KV budget fitting easily. Ollama handled 128k at 7 t/s but with a 20-second TTFT. The DeltaNet architecture is genuinely memory-efficient (32k and 64k context work with only 15 GB RSS total), but the decode speed at ~20 t/s is 3.5× slower than the MoE Qwen3.5-35B-A3B (71 t/s) and 3× slower than Coder-Next (61 t/s). Dense 27B is dense 27B: every token reads all 27 billion parameters.

One surprise: mlx-vlm (the VLM backend) returns HTTP 500 on the qwen3_5 architecture despite the model being a VLM. The mlx-lm text path works perfectly. If you want to run this model, use mlx-lm or Ollama.

The interactive results now include all 24 Qwen3.6-27B measurements.

Caveats

All of this was tested on one machine: Apple M3 Max, 64 GB unified memory, macOS 15.4, AC power, High Performance mode. Your mileage will vary, especially on different hardware. Nvidia GPUs are a completely different story with CUDA and tensor cores. The upcoming Apple M5 introduces Metal tensor shaders, which will significantly change the llama-server / GGUF numbers on Apple Silicon.

The scope here was speed: TTFT and decode throughput. I deliberately did not measure output quality (perplexity, HumanEval, needle-in-haystack, context fidelity). That’s a separate project, and if there’s interest, I might do it next.

Three measurement runs per configuration is not enough for statistical rigor. But it’s enough to see the signal.
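With only three runs, about the most honest summary is a median plus a crude spread. The helper below is a sketch of that idea, not the harness's reporting code, and the sample values are synthetic placeholders rather than measurements.

```python
from statistics import median


def summarize(runs: list[float]) -> tuple[float, float]:
    """Return (median, relative spread), where spread = (max - min) / median."""
    mid = median(runs)
    return mid, (max(runs) - min(runs)) / mid


# SYNTHETIC decode throughputs (t/s) for three runs of one configuration.
mid, spread = summarize([10.0, 11.0, 12.0])
print(f"median {mid:.1f} t/s, spread {spread:.0%}")
```

A gap between two configurations that is smaller than their spread is probably noise, while a 2× gap with a spread of a few percent is a signal worth trusting.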


So, back to the original question: Can you run agentic coding workflows against a local LLM on Apple Silicon?

Yes. With the right model (MoE), the right backend (mlx-lm with prefix caching), and enough unified memory, it works today. The gap to cloud models is in reasoning quality, not inference speed.

The benchmark harness is on GitHub. You can also browse the full interactive results, all 815 measurements, sortable and filterable. Run it on your hardware and tell me what you find.

Turns out an Easter weekend rabbit hole can go pretty deep.