llama.cpp on Apple Silicon: 29 GGUF Benchmarks and a 200 t/s Surprise

Since the Easter weekend benchmarking I kept going. Quietly, over the last few weeks, the benchmark suite grew from 57 configurations to over 80. The biggest addition: llama.cpp (llama-server) across all eight model groups, with Unsloth’s dynamic GGUF quantizations. 29 new test configurations, plus two speculative decoding tests that turned out to be the fastest thing I’ve measured so far.

This post covers what changed.

What was added

The original benchmark focused on MLX backends (mlx-lm, mlx-vlm, vllm-mlx, oMLX) and Ollama. The llama.cpp numbers were limited to one model (Qwen3-Coder-Next at Q4_K_M via llama-server b8670). That left a gap: GGUF is the most portable model format, and llama.cpp is the most widely used inference engine. I wanted to know how it stacks up against the MLX ecosystem on Apple Silicon.

Each GGUF was downloaded from Unsloth’s quantized collections, benchmarked, then deleted before downloading the next. Peak disk usage was about 22 GB. The whole run took roughly five hours unattended.
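The download, benchmark, delete cycle can be sketched roughly as follows. This is a minimal sketch and not the actual harness: `run_one_at_a_time`, `download`, and `benchmark` are hypothetical names, and the two callables stand in for whatever fetches a GGUF and drives llama-server.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_one_at_a_time(
    model_ids: Iterable[str],
    download: Callable[[str, Path], Path],
    benchmark: Callable[[Path], dict],
) -> list[dict]:
    """Download, benchmark, and delete each model before fetching the next."""
    results = []
    for model_id in model_ids:
        # Each model gets its own scratch directory (hypothetical layout).
        workdir = Path(tempfile.mkdtemp(prefix="gguf-bench-"))
        try:
            gguf_path = download(model_id, workdir)
            row = benchmark(gguf_path)
            results.append({"model": model_id, **row})
        finally:
            # Delete the download even if the benchmark failed, so peak disk
            # usage stays at roughly one model at a time.
            shutil.rmtree(workdir, ignore_errors=True)
    return results
```

Keeping the cleanup in a `finally` block is what makes an unattended run safe: one failed configuration does not leave a multi-gigabyte file behind for the rest of the queue.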

Three quant levels per model:

Q4_K_M (or UD-Q4_K_M for MoE): standard 4-bit, the baseline everyone uses
UD-Q4_K_XL: Unsloth Dynamic 4-bit, selective high-precision layers
UD-Q2_K_XL: Unsloth Dynamic 2-bit, the smallest usable quant

Plus UD-Q6_K_XL for three models (E2B, E4B, Qwen3.6-27B) where 6-bit still fits in memory.

The numbers

All measurements on M3 Max, 64 GB, AC power, llama-server b9020. Three runs per configuration, median reported. Code prompt (751 tokens input, 512 tokens output).
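For reference, this is roughly how three runs collapse into one reported number. It is a sketch only: `summarize` and the tuple layout are assumptions, and the timings in the example are placeholders, not measurements from the benchmark.

```python
from statistics import median


def decode_tps(tokens_generated: int, decode_seconds: float) -> float:
    """Decode throughput for one run, in tokens per second."""
    return tokens_generated / decode_seconds


def summarize(runs):
    """runs: list of (tokens_generated, decode_seconds, ttft_ms) tuples."""
    return {
        "decode_tps": median(decode_tps(n, s) for n, s, _ in runs),
        "ttft_ms": median(t for _, _, t in runs),
    }


# Placeholder timings for three runs of a 512-token generation.
example_runs = [(512, 8.0, 120.0), (512, 8.2, 123.0), (512, 7.9, 131.0)]
print(summarize(example_runs))
```

With three runs, the median discards one slow and one fast outlier, so a single run disturbed by a background process does not move the reported number.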

Small and MoE models (2-4B active parameters)

Model	Quant	Decode (t/s)	TTFT (ms)	Memory
Gemma4-E2B (2B)	Q4_K_M	108.3	59	3.8 GB
Gemma4-E2B (2B)	UD-Q2_K_XL	114.1	55	3.2 GB
Gemma4-E4B (4B)	Q4_K_M	72.9	96	6.8 GB
Gemma4-26B-A4B (MoE)	Q4_K_M	63.9	123	22.0 GB
Gemma4-26B-A4B (MoE)	UD-Q2_K_XL	71.0	126	16.0 GB
Qwen3.5-35B-A3B (MoE)	Q4_K_M	58.3	—	—
Qwen3.6-35B-A3B (MoE)	Q4_K_M	58.4	—	—

Dense models (27-32B parameters)

Model	Quant	Decode (t/s)	TTFT (ms)	Memory
Qwen3-32B	Q4_K_M	17.4	—	—
Qwen3.6-27B	Q4_K_M	17.3	—	—
Gemma4-31B	Q4_K_M	16.5	—	—
Qwen3.6-27B	UD-Q6_K_XL	12.6	—	—

Dense models at 27-32B parameters land around 17 t/s via llama.cpp. That’s consistent with the mlx-lm numbers (18-20 t/s) and the Ollama GGUF results (13 t/s). The bandwidth wall is real: for a dense model, every decoded token has to read all the weights from memory, and the M3 Max memory bandwidth is fixed.
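A back-of-the-envelope version of the bandwidth argument, as a sketch. Both inputs are assumptions and not figures from the benchmark: 400 GB/s is Apple's nominal bandwidth for the top M3 Max configuration, and 20 GB is a rough size for a 32B dense model at 4-bit.

```python
def bandwidth_ceiling_tps(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_per_s / weights_gb


ASSUMED_BANDWIDTH_GB_S = 400.0  # assumption: nominal M3 Max figure
ASSUMED_WEIGHTS_GB = 20.0       # assumption: ~32B dense model at 4-bit

ceiling = bandwidth_ceiling_tps(ASSUMED_BANDWIDTH_GB_S, ASSUMED_WEIGHTS_GB)
measured = 17.4  # Qwen3-32B Q4_K_M from the table above
print(f"ceiling {ceiling:.1f} t/s, measured is {measured / ceiling:.0%} of it")
```

With those assumed inputs the ceiling is 20 t/s, so the measured 17.4 t/s would be about 87% of it. The estimate ignores KV cache reads and compute, so treat it as a sanity check rather than a prediction.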

The 2-bit surprise

UD-Q2_K_XL consistently matches or slightly beats Q4_K_M on decode speed. The Gemma4-26B MoE went from 63.9 t/s at Q4_K_M to 71.0 t/s at UD-Q2_K_XL, with 6 GB less memory. Smaller weights mean fewer bytes read from memory per token, and that means faster decode. Whether the quality holds at 2-bit quantization is a separate question I haven’t measured yet.
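To put the trade in relative terms, here is a small sketch that uses only numbers from the first table. The helper name `relative_change` is mine.

```python
def relative_change(new: float, old: float) -> float:
    """Fractional change from old to new (0.10 means +10%)."""
    return (new - old) / old


# (Q4_K_M, UD-Q2_K_XL) pairs taken from the table above.
pairs = {
    "Gemma4-E2B": {"tps": (108.3, 114.1), "mem_gb": (3.8, 3.2)},
    "Gemma4-26B-A4B": {"tps": (63.9, 71.0), "mem_gb": (22.0, 16.0)},
}

for model, row in pairs.items():
    speed = relative_change(row["tps"][1], row["tps"][0])
    memory = relative_change(row["mem_gb"][1], row["mem_gb"][0])
    print(f"{model}: decode {speed:+.1%}, memory {memory:+.1%}")
```

For the 26B MoE this works out to roughly 11% faster decode for roughly 27% less memory, so the speed gain is real but smaller than the memory saving.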

Speculative decoding: 206 tokens per second

This was the real surprise.

llama.cpp supports speculative decoding: run a small draft model to propose tokens, then verify them in parallel on the large model. If the draft model predicts well, you get the quality of the large model at close to the speed of the small one.
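The propose-then-verify loop can be illustrated with a toy greedy version. This is a sketch of the general technique, not llama.cpp's implementation: the "models" are plain functions mapping a token list to the next token, and a real engine scores all drafted positions in one batched forward pass instead of calling the target once per position.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target model checks the proposal (counted as one pass).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every guess matched, so the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes
```

Because every kept token is either confirmed or produced by the target, the output is identical to decoding with the target alone. Only the number of target passes changes.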

I paired Gemma4-E2B (Q4_K_M, 1.6 GB) as the draft model with Gemma4-26B-A4B as the main model. Both share the same Gemma 4 tokenizer (262,144 vocab), which is a requirement for llama.cpp’s speculative pipeline.
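For anyone wanting to try a similar pairing, a launch command can be assembled along these lines. This is a sketch under assumptions: the file names are hypothetical, the draft limits are arbitrary values rather than the settings used here, and the flag names (`-m`, `-md`, `--draft-max`, `--draft-min`, `--port`) are the ones I know from llama-server's help output, so check `llama-server --help` on your build.

```python
import shlex


def llama_server_spec_cmd(
    main_gguf: str,
    draft_gguf: str,
    port: int = 8080,
    draft_max: int = 16,
    draft_min: int = 1,
) -> list[str]:
    """Build a llama-server argument list with a draft model attached."""
    return [
        "llama-server",
        "-m", main_gguf,                   # main (verifier) model
        "-md", draft_gguf,                 # draft model
        "--draft-max", str(draft_max),     # most tokens drafted per step
        "--draft-min", str(draft_min),     # fewest tokens drafted per step
        "--port", str(port),
    ]


# Hypothetical file names; substitute the GGUFs you actually downloaded.
cmd = llama_server_spec_cmd("gemma4-26b-a4b.gguf", "gemma4-e2b.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.Popen`, which avoids shell quoting problems with model paths that contain spaces.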

Config	Quant (main)	Decode (t/s)	TTFT (ms)	Memory
26B standalone	UD-Q4_K_M	63.9	123	22.0 GB
26B + E2B draft	UD-Q4_K_M	206.5	121	31.0 GB
26B + E2B draft	UD-Q2_K_XL	202.4	126	25.2 GB

3.2× speedup. From 64 t/s to 206 t/s. TTFT stayed the same. The cost is ~9 GB extra memory for the draft model.

Bar chart comparing Gemma4-26B standalone at 64 t/s to speculative decoding at 197-206 t/s, a 3.1x speedup.

For context, the fastest MLX result across the entire benchmark suite is vllm-mlx at 105.9 t/s (baseline prompt, Qwen3.5-35B-A3B). Speculative decoding on llama.cpp nearly doubles that, on a model with comparable parameter count.

The catch: speculative decoding works best when the draft model’s predictions match the main model. For factual or formulaic text, acceptance rates are high and you get close to the theoretical speedup. For creative or unpredictable text, the draft model guesses wrong more often and the speedup drops. For coding tasks (the workload I care about), the structured nature of code should work in speculative decoding’s favor.
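The link between acceptance rate and speedup can be made concrete with the standard estimate from the speculative decoding literature. It rests on a simplifying assumption: each drafted token is accepted independently with the same probability. I did not measure acceptance rates, so the inputs below are illustrative only.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target pass.

    Assumes each drafted token is accepted independently with probability
    acceptance_rate; the result is (1 - a**(k + 1)) / (1 - a) for k drafts.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        # Every draft accepted, plus the bonus token from the target.
        return draft_len + 1.0
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Illustrative acceptance rates, not measurements.
for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_pass(rate, draft_len=8), 2))
```

This counts tokens per target pass, not wall-clock speedup: the time spent running the draft model is ignored, so the real gain is lower than the number this function returns.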

llama.cpp vs. MLX backends

Comparing the same model across backends (code prompt, 4-bit quantization):

Model	llama.cpp (GGUF)	mlx-lm (MLX)	Ollama	Winner
Gemma4-E2B	108 t/s	50 t/s*	—	llama.cpp (2.2×)
Gemma4-26B-A4B	64 t/s	29 t/s	64 t/s	llama.cpp = Ollama
Qwen3.5-35B-A3B	58 t/s	95 t/s	97 t/s	mlx-lm / Ollama
Qwen3-32B	17 t/s	19 t/s	—	~tie
Qwen3.6-27B	17 t/s	20 t/s	13 t/s	mlx-lm

* Gemma4-E2B ran via mlx-vlm (the VLM backend), not mlx-lm, because the model is a VLM architecture. mlx-lm can’t load it due to the VLM weight layout.

The pattern: llama.cpp wins on Gemma 4 models. MLX wins on Qwen models. Ollama (which uses its own MLX backend since 0.19) matches mlx-lm on Qwen but matches llama.cpp on Gemma. This probably reflects architecture-specific optimizations in each backend’s attention kernels.

Grouped bar chart comparing llama.cpp, mlx-lm, mlx-vlm, and Ollama decode speeds across five models at 4-bit quantization.

Why I tried speculative decoding on llama.cpp instead of mlx-lm

I did try mlx-lm first. Three approaches, all blocked:

Google’s official MTP drafter (the gemma4_assistant model type): uses a centroid-based output head that hooks into the backbone’s hidden states. It’s designed to inject into the forward pass, not work as an independent draft model. mlx-lm 0.31.3 has no gemma4_assistant module and raises ValueError on load.
E2B as draft for 26B via mlx-lm: E2B is a VLM with num_kv_shared_layers=20. mlx-lm’s gemma4_text.py can’t load it because 20 layers × 7 KV weight tensors = 140 “extra parameters” that don’t match the expected layout. Also, 26B is MoE (30 layers) while E2B is dense (35 layers), making them architecturally incompatible for mlx-lm’s speculative pipeline.
E2B draft for E4B (both dense): same num_kv_shared_layers bug affects E4B. Blocked until mlx-lm ships a fix.

llama.cpp handles all of this because GGUF models are self-contained: the weight layout is in the file, and the speculative pipeline just needs matching tokenizers. No architecture compatibility required.

What changed since Easter

Besides the 29 llama.cpp tests, a few other things:

Backend versions tracked: every CSV now records the backend version (llama-server b9020, Ollama 0.23.1, mlx-lm 0.31.3, etc.). The report deduplicates by version, so you can compare b8920 vs b9020 results side-by-side.
Ollama model IDs fixed: gemma4:2b and gemma4:4b were renamed to gemma4:e2b and gemma4:e4b in the Ollama registry. Those tests are now enabled.
Total measurements: the interactive results page now covers 900+ data rows across all backends and models.
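Deduplicating by version can be sketched like this. It is an illustration and not the report's actual code: the column names in `KEY_COLUMNS` are hypothetical, and the rule (keep the most recent row per model, backend, version, and quant) is my assumption about what the deduplication means.

```python
import csv
import io

# Hypothetical column names; the real CSV schema may differ.
KEY_COLUMNS = ("model", "backend", "backend_version", "quant")


def dedupe_by_version(rows):
    """Keep the last row seen for each (model, backend, version, quant)."""
    latest = {}
    for row in rows:
        key = tuple(row[column] for column in KEY_COLUMNS)
        latest[key] = row  # later rows overwrite earlier ones
    return list(latest.values())


# Placeholder data: two reruns on one build, one run on a newer build.
sample = io.StringIO(
    "model,backend,backend_version,quant,decode_tps\n"
    "m1,llama-server,b1,Q4_K_M,1.0\n"
    "m1,llama-server,b1,Q4_K_M,2.0\n"
    "m1,llama-server,b2,Q4_K_M,3.0\n"
)
for row in dedupe_by_version(csv.DictReader(sample)):
    print(row["backend_version"], row["decode_tps"])
```

Because the backend version is part of the key, reruns on the same build collapse into one row while results from different builds stay side by side.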

Updated recommendations

The previous post’s recommendations hold up, with one addition:

Goal	Setup	Result
Fastest decode	Gemma4-26B + E2B draft, llama.cpp spec	206 t/s
Best all-around	Qwen3.5-35B-A3B, mlx-lm 4-bit	95 t/s, 210 ms TTFT
Fastest response	Qwen3.5-35B-A3B, Ollama	80 ms warm TTFT
Smallest memory	Gemma4-E2B, llama.cpp UD-Q2_K_XL	114 t/s in 3.2 GB
GGUF portable	Gemma4-26B-A4B, llama-server Q4_K_M	64 t/s

Speculative decoding at 206 t/s is the new fastest configuration. But it needs 31 GB and two models loaded simultaneously, and the decode speed depends on how well the draft model predicts the main model’s output. For everyday use, Ollama or mlx-lm with a single MoE model is simpler and still fast.

The benchmark harness is on GitHub. The full interactive results are browsable: all 900+ measurements, filterable by model, backend, quant, and prompt type.