llama.cpp on Apple Silicon: 29 GGUF Benchmarks and a 200 t/s Surprise
Tags: LLM Inference, Apple Silicon, Benchmarking, llama.cpp, GGUF

Since the Easter weekend benchmarking I kept going. Quietly, over the last few weeks, the benchmark suite grew from 57 configurations to over 80. The biggest addition: llama.cpp (llama-server) across all eight model groups, with Unsloth’s dynamic GGUF quantizations. 29 new test configurations, plus two speculative decoding tests that turned out to be the fastest thing I’ve measured so far.

This post covers what changed.

What was added

The original benchmark focused on MLX backends (mlx-lm, mlx-vlm, vllm-mlx, oMLX) and Ollama. The llama.cpp numbers were limited to one model (Qwen3-Coder-Next at Q4_K_M via llama-server b8670). That left a gap: GGUF is the most portable model format, and llama.cpp is the most widely used inference engine. I wanted to know how it stacks up against the MLX ecosystem on Apple Silicon.

Each GGUF was downloaded from Unsloth’s quantized collections, benchmarked, then deleted before downloading the next. Peak disk usage was about 22 GB. The whole run took roughly five hours unattended.
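The download, benchmark, delete loop can be sketched roughly as follows. This is a minimal illustration, not the actual harness: `run_one_at_a_time`, `fetch`, and `bench` are hypothetical names, and the downloader and benchmark are passed in as plain callables so the loop itself stays independent of any particular tool.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_one_at_a_time(
    names: Iterable[str],
    fetch: Callable[[str, Path], Path],   # hypothetical: downloads one GGUF, returns its path
    bench: Callable[[Path], dict],        # hypothetical: runs the benchmark, returns metrics
    workdir: Path,
) -> tuple[list[dict], int]:
    """Fetch, benchmark, and delete each model in turn.

    Returns the benchmark results and the largest number of bytes
    a model file occupied on disk at any one time.
    """
    results: list[dict] = []
    peak_bytes = 0
    for name in names:
        path = fetch(name, workdir)
        try:
            peak_bytes = max(peak_bytes, path.stat().st_size)
            results.append({"model": name, **bench(path)})
        finally:
            # Free the disk before the next download, even if the benchmark failed.
            path.unlink(missing_ok=True)
    return results, peak_bytes
```

Because only one file exists at a time, peak disk usage is bounded by the largest single GGUF rather than by the sum of all of them.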

Three quant levels per model:

  • Q4_K_M (or UD-Q4_K_M for MoE): standard 4-bit, the baseline everyone uses
  • UD-Q4_K_XL: Unsloth Dynamic 4-bit, selective high-precision layers
  • UD-Q2_K_XL: Unsloth Dynamic 2-bit, the smallest usable quant

Plus UD-Q6_K_XL for three models (E2B, E4B, Qwen3.6-27B) where 6-bit still fits in memory.

The numbers

All measurements on M3 Max, 64 GB, AC power, llama-server b9020. Three runs per configuration, median reported. Code prompt (751 tokens input, 512 tokens output).
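For readers who want to reproduce the two headline metrics, here is one plausible way to derive them from token arrival timestamps. It is a sketch under assumptions: the function names are made up for illustration, and whether the harness excludes the first token from decode speed in exactly this way is my assumption, not something stated above.

```python
from statistics import median


def summarize_run(request_time: float, token_times: list[float]) -> dict:
    """Derive TTFT and decode speed from wall-clock timestamps in seconds.

    token_times[i] is the moment the i-th generated token arrived.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft_ms = (token_times[0] - request_time) * 1000.0
    # Assumption: decode speed ignores the first token, whose latency
    # is dominated by prompt processing and is already counted in TTFT.
    decode_tps = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    return {"ttft_ms": ttft_ms, "decode_tps": decode_tps}


def median_of_runs(runs: list[dict]) -> dict:
    """Report the per-metric median across repeated runs."""
    return {key: median(run[key] for run in runs) for key in runs[0]}
```

With three runs, the median simply drops the fastest and slowest value for each metric, which keeps a single noisy run from skewing the reported number.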

Small and MoE models (2-4B active parameters)

| Model | Quant | Decode (t/s) | TTFT (ms) | Memory |
|---|---|---|---|---|
| Gemma4-E2B (2B) | Q4_K_M | 108.3 | 59 | 3.8 GB |
| Gemma4-E2B (2B) | UD-Q2_K_XL | 114.1 | 55 | 3.2 GB |
| Gemma4-E4B (4B) | Q4_K_M | 72.9 | 96 | 6.8 GB |
| Gemma4-26B-A4B (MoE) | Q4_K_M | 63.9 | 123 | 22.0 GB |
| Gemma4-26B-A4B (MoE) | UD-Q2_K_XL | 71.0 | 126 | 16.0 GB |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | 58.3 | | |
| Qwen3.6-35B-A3B (MoE) | Q4_K_M | 58.4 | | |

Dense models (27-32B parameters)

| Model | Quant | Decode (t/s) | TTFT (ms) | Memory |
|---|---|---|---|---|
| Qwen3-32B | Q4_K_M | 17.4 | | |
| Qwen3.6-27B | Q4_K_M | 17.3 | | |
| Gemma4-31B | Q4_K_M | 16.5 | | |
| Qwen3.6-27B | UD-Q6_K_XL | 12.6 | | |

Dense models at 27-32B parameters land around 17 t/s via llama.cpp. That’s consistent with the mlx-lm numbers (18-20 t/s) and the Ollama GGUF results (13 t/s). The bandwidth wall is real: every token reads all parameters, and the M3 Max memory bandwidth is fixed.
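A back-of-the-envelope check makes the bandwidth wall concrete. The sketch below assumes two numbers that are not measurements from this benchmark: 400 GB/s of memory bandwidth (Apple's quoted figure for the higher-bin M3 Max) and roughly 4.85 bits per weight as a typical average for Q4_K_M. It also ignores KV-cache reads and compute, so it is a ceiling, not a prediction.

```python
def decode_ceiling_tps(params_billion: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token streams all weights once."""
    weight_gb = params_billion * bits_per_weight / 8.0  # GB of weights read per token
    return bandwidth_gb_s / weight_gb


# Assumed inputs (see the paragraph above), not values measured in this post:
#   400 GB/s : quoted bandwidth for the higher-bin M3 Max
#   4.85 bpw : rough average for Q4_K_M, which mixes 4-bit and 6-bit tensors
ceiling = decode_ceiling_tps(params_billion=32, bits_per_weight=4.85,
                             bandwidth_gb_s=400)
print(f"Dense 32B ceiling: {ceiling:.1f} t/s (measured: 17.4 t/s)")
```

Under those assumptions the ceiling for a dense 32B model comes out a little above 20 t/s, so a measured 17.4 t/s is already close to what the memory system allows.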

The 2-bit surprise

UD-Q2_K_XL consistently matches or slightly beats Q4_K_M on decode speed. The Gemma4-26B MoE went from 63.9 t/s at Q4_K_M to 71.0 t/s at UD-Q2_K_XL, with 6 GB less memory. Smaller weights = less memory bandwidth per token = faster decode. Whether the quality holds at 2-bit quantization is a separate question I haven’t measured yet.

Speculative decoding: 206 tokens per second

This was the real surprise.

llama.cpp supports speculative decoding: run a small draft model to propose tokens, then verify them in parallel on the large model. If the draft model predicts well, you get the quality of the large model at close to the speed of the small one.
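The draft-then-verify loop is easier to see in a toy version. The sketch below is not llama.cpp's implementation: the two "models" are stand-in Python functions that map a context to the next token, decoding is greedy only (sampling needs a rejection step that is omitted here), and verification runs in a loop where a real engine would use one batched forward pass.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy "model": context -> next token


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: list[int], n_new: int,
                       k: int = 4) -> tuple[list[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: list[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position; this counts as one pass.
        passes += 1
        accepted: list[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(tok)
        else:
            # All k proposals matched, so the same pass yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + n_new], passes


# Toy models: the draft agrees with the target except after token 5.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 5 else toy_target(ctx)


tokens, passes = speculative_decode(toy_target, toy_draft, prompt=[1], n_new=12)
print(tokens, "target passes:", passes)
```

The output is token-for-token what the target would have produced on its own; the only thing that changes is how many target passes it takes to get there.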

I paired Gemma4-E2B (Q4_K_M, 1.6 GB) as the draft model with Gemma4-26B-A4B as the main model. Both share the same Gemma 4 tokenizer (262,144 vocab), which is a requirement for llama.cpp’s speculative pipeline.
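Launching that pairing comes down to passing a second model to llama-server. The helper below only builds the command line; the GGUF file names are placeholders, `spec_server_cmd` is a made-up name, and the draft-length settings used in the benchmark are not shown because they are not stated here. `--model-draft` is the llama-server option for the draft model as far as I know, but option names do change between builds, so check `llama-server --help` on yours.

```python
import shlex


def spec_server_cmd(main_gguf: str, draft_gguf: str, port: int = 8080) -> list[str]:
    """Build a llama-server command line with a draft model attached."""
    return [
        "llama-server",
        "--model", main_gguf,         # large model that verifies tokens
        "--model-draft", draft_gguf,  # small model that proposes tokens
        "--port", str(port),
    ]


# Placeholder file names, not the exact files used in the benchmark.
cmd = spec_server_cmd("gemma4-26b-a4b-UD-Q4_K_M.gguf", "gemma4-e2b-Q4_K_M.gguf")
print(shlex.join(cmd))
# To actually start the server: subprocess.Popen(cmd)
```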

| Config | Quant (main) | Decode (t/s) | TTFT (ms) | Memory |
|---|---|---|---|---|
| 26B standalone | UD-Q4_K_M | 63.9 | 123 | 22.0 GB |
| 26B + E2B draft | UD-Q4_K_M | 206.5 | 121 | 31.0 GB |
| 26B + E2B draft | UD-Q2_K_XL | 202.4 | 126 | 25.2 GB |

3.2× speedup. From 64 t/s to 206 t/s. TTFT stayed the same. The cost is ~9 GB extra memory for the draft model.

Bar chart comparing Gemma4-26B standalone at 64 t/s to speculative decoding at 197-206 t/s, a 3.1x speedup.

For context, the fastest MLX result across the entire benchmark suite is vllm-mlx at 105.9 t/s (baseline prompt, Qwen3.5-35B-A3B). Speculative decoding on llama.cpp nearly doubles that, on a model with comparable parameter count.

The catch: speculative decoding works best when the draft model’s predictions match the main model. For factual or formulaic text, acceptance rates are high and you get close to the theoretical speedup. For creative or unpredictable text, the draft model guesses wrong more often and the speedup drops. For coding tasks (the workload I care about), the structured nature of code should work in speculative decoding’s favor.
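How strongly the acceptance rate drives the speedup can be estimated with the usual simplified model, in which each drafted token is accepted independently with probability `alpha`. This is a sketch under that assumption; the values of `k` and `draft_cost` below are purely illustrative and were not derived from the measurements in this post.

```python
def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Expected speedup over plain decoding.

    alpha      : per-token acceptance rate, assumed independent per token
    k          : tokens drafted per verification step
    draft_cost : time of one draft step relative to one target step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per target pass (accepted drafts plus one from the target).
    tokens_per_step = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Relative time per step: k draft steps plus one target pass.
    time_per_step = k * draft_cost + 1.0
    return tokens_per_step / time_per_step


# Illustrative inputs only: k=8 drafted tokens, draft step at 10% of a target step.
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}: {expected_speedup(alpha, k=8, draft_cost=0.1):.2f}x")
```

The takeaway is qualitative: the speedup climbs steeply as the acceptance rate approaches 1, which is why predictable output such as code benefits more than free-form text.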

llama.cpp vs. MLX backends

Comparing the same model across backends (code prompt, 4-bit quantization):

| Model | llama.cpp (GGUF) | mlx-lm (MLX) | Ollama | Winner |
|---|---|---|---|---|
| Gemma4-E2B | 108 t/s | 50 t/s* | | llama.cpp (2.2×) |
| Gemma4-26B-A4B | 64 t/s | 29 t/s | 64 t/s | llama.cpp = Ollama |
| Qwen3.5-35B-A3B | 58 t/s | 95 t/s | 97 t/s | mlx-lm / Ollama |
| Qwen3-32B | 17 t/s | 19 t/s | | ~tie |
| Qwen3.6-27B | 17 t/s | 20 t/s | 13 t/s | mlx-lm |

* Gemma4-E2B ran via mlx-vlm (the VLM backend), not mlx-lm, because the model is a VLM architecture. mlx-lm can’t load it due to the VLM weight layout.

The pattern: llama.cpp wins on Gemma 4 models. MLX wins on Qwen models. Ollama (which uses its own MLX backend since 0.19) matches mlx-lm on Qwen but matches llama.cpp on Gemma. This probably reflects architecture-specific optimizations in each backend’s attention kernels.

Grouped bar chart comparing llama.cpp, mlx-lm, mlx-vlm, and Ollama decode speeds across five models at 4-bit quantization.

Why I tried speculative decoding on llama.cpp instead of mlx-lm

I did try mlx-lm first. Three approaches, all blocked:

  1. Google’s official MTP drafter (the gemma4_assistant model type): uses a centroid-based output head that hooks into the backbone’s hidden states. It’s designed to inject into the forward pass, not work as an independent draft model. mlx-lm 0.31.3 has no gemma4_assistant module and raises ValueError on load.

  2. E2B as draft for 26B via mlx-lm: E2B is a VLM with num_kv_shared_layers=20. mlx-lm’s gemma4_text.py can’t load it because 20 layers × 7 KV weight tensors = 140 “extra parameters” that don’t match the expected layout. Also, 26B is MoE (30 layers) while E2B is dense (35 layers), making them architecturally incompatible for mlx-lm’s speculative pipeline.

  3. E2B draft for E4B (both dense): same num_kv_shared_layers bug affects E4B. Blocked until mlx-lm ships a fix.

llama.cpp handles all of this because GGUF models are self-contained: the weight layout is in the file, and the speculative pipeline just needs matching tokenizers. No architecture compatibility required.

What changed since Easter

Besides the 29 llama.cpp tests, a few other things:

  • Backend versions tracked: every CSV now records the backend version (llama-server b9020, Ollama 0.23.1, mlx-lm 0.31.3, etc.). The report deduplicates by version, so you can compare b8670 vs b9020 results side-by-side.
  • Ollama model IDs fixed: gemma4:2b and gemma4:4b were renamed to gemma4:e2b and gemma4:e4b in the Ollama registry. Those tests are now enabled.
  • Total measurements: the interactive results page now covers 900+ data rows across all backends and models.
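Deduplicating by version can be pictured like this. The column names (`model`, `backend`, `backend_version`, `quant`, `prompt`, `decode_tps`) and the rule that a later row replaces an earlier one are assumptions for illustration, and the numbers are dummy values, not benchmark results.

```python
import csv
import io


def latest_per_version(rows: list[dict]) -> list[dict]:
    """Keep one row per (model, backend, version, quant, prompt); later rows win."""
    kept: dict[tuple, dict] = {}
    for row in rows:
        key = (row["model"], row["backend"], row["backend_version"],
               row["quant"], row["prompt"])
        kept[key] = row  # a rerun on the same version replaces the older row
    return list(kept.values())


# Dummy data with hypothetical column names.
SAMPLE = """model,backend,backend_version,quant,prompt,decode_tps
model-a,llama-server,b8670,Q4_K_M,code,1.0
model-a,llama-server,b9020,Q4_K_M,code,2.0
model-a,llama-server,b9020,Q4_K_M,code,3.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in latest_per_version(rows):
    print(row["backend_version"], row["decode_tps"])
```

Keeping the version in the key is what lets two builds of the same backend sit next to each other in the report, while reruns on the same build collapse into a single row.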

Updated recommendations

The previous post’s recommendations hold up, with one addition:

| Goal | Setup | Result |
|---|---|---|
| Fastest decode | Gemma4-26B + E2B draft, llama.cpp spec | 206 t/s |
| Best all-around | Qwen3.5-35B-A3B, mlx-lm 4-bit | 95 t/s, 210 ms TTFT |
| Fastest response | Qwen3.5-35B-A3B, Ollama | 80 ms warm TTFT |
| Smallest memory | Gemma4-E2B, llama.cpp UD-Q2_K_XL | 114 t/s in 3.2 GB |
| GGUF portable | Gemma4-26B-A4B, llama-server Q4_K_M | 64 t/s |

Speculative decoding at 206 t/s is the new fastest configuration. But it needs 31 GB and two models loaded simultaneously, and the decode speed depends on how well the draft model predicts the main model’s output. For everyday use, Ollama or mlx-lm with a single MoE model is simpler and still fast.

The benchmark harness is on GitHub. The full interactive results are browsable: all 900+ measurements, filterable by model, backend, quant, and prompt type.