I kept going after the Easter weekend benchmarking. Over the last few weeks, the benchmark suite quietly grew from 57 configurations to over 80. The biggest addition: llama.cpp (llama-server) across all eight model groups, with Unsloth’s dynamic GGUF quantizations. That is 29 new test configurations, plus two speculative decoding tests that turned out to be the fastest thing I’ve measured so far.
This post covers what changed.
What was added
The original benchmark focused on MLX backends (mlx-lm, mlx-vlm, vllm-mlx, oMLX) and Ollama. The llama.cpp numbers were limited to one model (Qwen3-Coder-Next at Q4_K_M via llama-server b8670). That left a gap: GGUF is the most portable model format, and llama.cpp is the most widely used inference engine. I wanted to know how it stacks up against the MLX ecosystem on Apple Silicon.
Each GGUF was downloaded from Unsloth’s quantized collections, benchmarked, then deleted before downloading the next. Peak disk usage was about 22 GB. The whole run took roughly five hours unattended.
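The download, benchmark, delete cycle can be sketched as a small loop. This is a minimal illustration, not the actual harness: `run_sequentially`, `fake_download`, and `fake_benchmark` are hypothetical names, and the stand-ins below only write and measure a tiny placeholder file so the example runs without network access.

```python
import os
from typing import Callable, Dict, Iterable


def run_sequentially(
    names: Iterable[str],
    download: Callable[[str, str], str],
    benchmark: Callable[[str], float],
    workdir: str,
) -> Dict[str, float]:
    """Benchmark each model in turn, keeping at most one file on disk."""
    results: Dict[str, float] = {}
    for name in names:
        path = download(name, workdir)        # fetch exactly one model file
        try:
            results[name] = benchmark(path)   # e.g. a median decode rate
        finally:
            os.remove(path)                   # free disk before the next download
    return results


# Hypothetical stand-ins so the sketch is runnable offline.
def fake_download(name: str, workdir: str) -> str:
    path = os.path.join(workdir, name)
    with open(path, "wb") as f:
        f.write(b"\0" * 16)
    return path


def fake_benchmark(path: str) -> float:
    return float(os.path.getsize(path))
```

Deleting in a `finally` block keeps peak disk usage bounded by the largest single file even when a benchmark run fails partway through.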
Three quant levels per model:
- Q4_K_M (or UD-Q4_K_M for MoE): standard 4-bit, the baseline everyone uses
- UD-Q4_K_XL: Unsloth Dynamic 4-bit, selective high-precision layers
- UD-Q2_K_XL: Unsloth Dynamic 2-bit, the smallest usable quant
Plus UD-Q6_K_XL for three models (E2B, E4B, Qwen3.6-27B) where 6-bit still fits in memory.
The numbers
All measurements on M3 Max, 64 GB, AC power, llama-server b9020. Three runs per configuration, median reported. Code prompt (751 tokens input, 512 tokens output).
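For readers who want to reproduce the two headline metrics, here is one common way to derive them from token arrival timestamps and then take the median of three runs. The post does not spell out the harness's exact definitions, so treat the function names and the "decode rate excludes the first token" convention as assumptions.

```python
from statistics import median
from typing import List, Tuple


def ttft_ms(request_start: float, token_times: List[float]) -> float:
    """Time to first token, in milliseconds."""
    return (token_times[0] - request_start) * 1000.0


def decode_tps(token_times: List[float]) -> float:
    """Tokens per second after the first token has arrived."""
    elapsed = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / elapsed


def summarize(runs: List[Tuple[float, List[float]]]) -> dict:
    """runs: one (request_start, token_times) pair per run; returns medians."""
    return {
        "ttft_ms": median(ttft_ms(start, times) for start, times in runs),
        "decode_tps": median(decode_tps(times) for _, times in runs),
    }
```

Reporting the median rather than the mean keeps a single slow run (for example, one that hit a cold cache) from skewing the result.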
Small and MoE models (2-4B active parameters)
| Model | Quant | Decode (t/s) | TTFT (ms) | Memory |
|---|---|---|---|---|
| Gemma4-E2B (2B) | Q4_K_M | 108.3 | 59 | 3.8 GB |
| Gemma4-E2B (2B) | UD-Q2_K_XL | 114.1 | 55 | 3.2 GB |
| Gemma4-E4B (4B) | Q4_K_M | 72.9 | 96 | 6.8 GB |
| Gemma4-26B-A4B (MoE) | Q4_K_M | 63.9 | 123 | 22.0 GB |
| Gemma4-26B-A4B (MoE) | UD-Q2_K_XL | 71.0 | 126 | 16.0 GB |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | 58.3 | — | — |
| Qwen3.6-35B-A3B (MoE) | Q4_K_M | 58.4 | — | — |
Dense models (27-32B parameters)
| Model | Quant | Decode (t/s) | TTFT (ms) | Memory |
|---|---|---|---|---|
| Qwen3-32B | Q4_K_M | 17.4 | — | — |
| Qwen3.6-27B | Q4_K_M | 17.3 | — | — |
| Gemma4-31B | Q4_K_M | 16.5 | — | — |
| Qwen3.6-27B | UD-Q6_K_XL | 12.6 | — | — |
Dense models at 27-32B parameters land around 17 t/s via llama.cpp. That’s consistent with the mlx-lm numbers (18-20 t/s) and the Ollama GGUF results (13 t/s). The bandwidth wall is real: every token reads all parameters, and the M3 Max memory bandwidth is fixed.
The 2-bit surprise
UD-Q2_K_XL consistently matches or slightly beats Q4_K_M on decode speed. The Gemma4-26B MoE went from 63.9 t/s at Q4_K_M to 71.0 t/s at UD-Q2_K_XL, with 6 GB less memory. Smaller weights = less memory bandwidth per token = faster decode. Whether the quality holds at 2-bit quantization is a separate question I haven’t measured yet.
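The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every decoded token has to stream the weights once, decode speed cannot exceed bandwidth divided by bytes read per token. Both constants below are assumptions, not figures from this benchmark: 400 GB/s is a nominal bandwidth figure for this chip class, and 20 GB is a round-number stand-in for a dense ~32B model at 4-bit.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token streams the weights once."""
    return bandwidth_gb_s / gb_read_per_token


ASSUMED_BANDWIDTH_GB_S = 400.0   # assumption: nominal figure, not measured here
ASSUMED_DENSE_Q4_GB = 20.0       # assumption: rough size of a dense ~32B Q4 model

ceiling = decode_ceiling_tps(ASSUMED_BANDWIDTH_GB_S, ASSUMED_DENSE_Q4_GB)
measured = 17.4                  # Qwen3-32B Q4_K_M from the dense table
print(f"ceiling ~{ceiling:.1f} t/s, measured {measured} t/s "
      f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the dense results sit close to the ceiling, which is why shrinking the weights is one of the few levers left. For MoE models only the active experts are read per token, so the same formula applies to the active bytes rather than the full file size.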
Speculative decoding: 206 tokens per second
This was the real surprise.
llama.cpp supports speculative decoding: run a small draft model to propose tokens, then verify them in parallel on the large model. If the draft model predicts well, you get the quality of the large model at close to the speed of the small one.
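The propose-then-verify loop can be illustrated with toy models. This is a simplified greedy sketch, not llama.cpp's implementation: `target` and `draft` are hypothetical next-token functions, and a real engine verifies all drafted positions in one batched forward pass instead of the inner Python loop used here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (sequence, number of target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(min(k, goal - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks every drafted position (one batched pass in practice).
        passes += 1
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)      # always keep the target's own token
            if expected != tok:
                break                 # first mismatch: drop the rest of the draft
        else:
            # Every draft token was accepted: the pass also yields one bonus token.
            if len(seq) < goal:
                seq.append(target(seq))
    return seq, passes


# Toy models: the draft agrees with the target except after token 7.
def target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 11


def draft(seq: List[int]) -> int:
    return 5 if seq[-1] == 7 else target(seq)


out, passes = speculative_decode(target, draft, [1], n_new=20, k=4)
```

The output is identical to plain greedy decoding with the target model; the gain is that the target runs far fewer passes than it emits tokens, as long as the draft guesses well.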
I paired Gemma4-E2B (Q4_K_M, 1.6 GB) as the draft model with Gemma4-26B-A4B as the main model. Both share the same Gemma 4 tokenizer (262,144 vocab), which is a requirement for llama.cpp’s speculative pipeline.
| Config | Quant (main) | Decode (t/s) | TTFT (ms) | Memory |
|---|---|---|---|---|
| 26B standalone | UD-Q4_K_M | 63.9 | 123 | 22.0 GB |
| 26B + E2B draft | UD-Q4_K_M | 206.5 | 121 | 31.0 GB |
| 26B + E2B draft | UD-Q2_K_XL | 202.4 | 126 | 25.2 GB |
3.2× speedup. From 64 t/s to 206 t/s. TTFT stayed the same. The cost is ~9 GB extra memory for the draft model.
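For orientation, a launch command for this kind of pairing might be assembled as below. The file names are placeholders, the draft window values are arbitrary, and the flag names (`-m`, `-md`, `--draft-max`, `--draft-min`, `--port`) reflect llama-server builds known at the time of writing; they are an assumption for b9020, so check `llama-server --help` before relying on them.

```python
import shlex
from typing import List


def spec_server_cmd(
    main_gguf: str,
    draft_gguf: str,
    draft_max: int = 16,   # arbitrary illustration value
    draft_min: int = 1,    # arbitrary illustration value
    port: int = 8080,
) -> List[str]:
    """Build an argv list for llama-server with a draft model attached."""
    return [
        "llama-server",
        "-m", main_gguf,                 # main (verifier) model
        "-md", draft_gguf,               # draft model; flag name assumed
        "--draft-max", str(draft_max),   # max tokens drafted per step; assumed
        "--draft-min", str(draft_min),   # min tokens drafted per step; assumed
        "--port", str(port),
    ]


# Placeholder file names, not the actual files used in the benchmark.
cmd = spec_server_cmd("gemma4-26b-a4b-UD-Q4_K_M.gguf", "gemma4-e2b-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids shell-quoting problems when it is passed to `subprocess.Popen`.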
For context, the fastest MLX result across the entire benchmark suite is vllm-mlx at 105.9 t/s (baseline prompt, Qwen3.5-35B-A3B). Speculative decoding on llama.cpp nearly doubles that, on a model with comparable parameter count.
The catch: speculative decoding works best when the draft model’s predictions match the main model. For factual or formulaic text, acceptance rates are high and you get close to the theoretical speedup. For creative or unpredictable text, the draft model guesses wrong more often and the speedup drops. For coding tasks (the workload I care about), the structured nature of code should work in speculative decoding’s favor.
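How strongly the speedup depends on acceptance can be seen with the standard idealized model: if each drafted token is accepted independently with probability α and the draft window is k tokens, the expected number of tokens emitted per verification pass is (1 − α^(k+1)) / (1 − α). The acceptance rates and k = 4 below are illustrative assumptions, not measurements from this benchmark, and the formula ignores the cost of running the draft model.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target pass for acceptance probability alpha, window k."""
    if alpha >= 1.0:
        return float(k + 1)   # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative acceptance rates only; not measured values.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

The curve is steep near α = 1, which is consistent with the intuition above: predictable, structured output pays off disproportionately.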
llama.cpp vs. MLX backends
Comparing the same model across backends (code prompt, 4-bit quantization):
| Model | llama.cpp (GGUF) | mlx-lm (MLX) | Ollama | Winner |
|---|---|---|---|---|
| Gemma4-E2B | 108 t/s | 50 t/s* | — | llama.cpp (2.2×) |
| Gemma4-26B-A4B | 64 t/s | 29 t/s | 64 t/s | llama.cpp = Ollama |
| Qwen3.5-35B-A3B | 58 t/s | 95 t/s | 97 t/s | mlx-lm / Ollama |
| Qwen3-32B | 17 t/s | 19 t/s | — | ~tie |
| Qwen3.6-27B | 17 t/s | 20 t/s | 13 t/s | mlx-lm |
* Gemma4-E2B ran via mlx-vlm (the VLM backend), not mlx-lm, because the model is a VLM architecture. mlx-lm can’t load it due to the VLM weight layout.
The pattern: llama.cpp wins on Gemma 4 models. MLX wins on Qwen models. Ollama (which uses its own MLX backend since 0.19) matches mlx-lm on Qwen but matches llama.cpp on Gemma. This probably reflects architecture-specific optimizations in each backend’s attention kernels.
Why I tried speculative decoding on llama.cpp instead of mlx-lm
I did try mlx-lm first. Three approaches, all blocked:
- Google’s official MTP drafter (the `gemma4_assistant` model type): uses a centroid-based output head that hooks into the backbone’s hidden states. It’s designed to inject into the forward pass, not work as an independent draft model. mlx-lm 0.31.3 has no `gemma4_assistant` module and raises ValueError on load.
- E2B as draft for 26B via mlx-lm: E2B is a VLM with `num_kv_shared_layers=20`. mlx-lm’s `gemma4_text.py` can’t load it because 20 layers × 7 KV weight tensors = 140 “extra parameters” that don’t match the expected layout. Also, 26B is MoE (30 layers) while E2B is dense (35 layers), making them architecturally incompatible for mlx-lm’s speculative pipeline.
- E2B draft for E4B (both dense): same `num_kv_shared_layers` bug affects E4B. Blocked until mlx-lm ships a fix.
llama.cpp handles all of this because GGUF models are self-contained: the weight layout is in the file, and the speculative pipeline just needs matching tokenizers. No architecture compatibility required.
What changed since Easter
Besides the 29 llama.cpp tests, a few other things:
- Backend versions tracked: every CSV now records the backend version (llama-server b9020, Ollama 0.23.1, mlx-lm 0.31.3, etc.). The report deduplicates by version, so you can compare b8920 vs b9020 results side-by-side.
- Ollama model IDs fixed: `gemma4:2b` and `gemma4:4b` were renamed to `gemma4:e2b` and `gemma4:e4b` in the Ollama registry. Those tests are now enabled.
- Total measurements: the interactive results page now covers 900+ data rows across all backends and models.
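One way to read "deduplicates by version" is that results are grouped with the backend version as part of the key, so runs from different builds never get merged. The sketch below shows that grouping under assumed column names (`backend`, `backend_version`, `model`, `quant`, `decode_tps`); the real CSV schema is not shown in this post, and the sample rows are made up purely for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Made-up sample rows; names and numbers are placeholders, not benchmark data.
SAMPLE = """backend,backend_version,model,quant,decode_tps
backend-x,v1,model-a,q4,10.0
backend-x,v1,model-a,q4,11.0
backend-x,v1,model-a,q4,12.0
backend-x,v2,model-a,q4,20.0
backend-x,v2,model-a,q4,22.0
backend-x,v2,model-a,q4,21.0
"""


def median_by_version(rows):
    """Group rows by (backend, version, model, quant) and take the median."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["backend"], row["backend_version"], row["model"], row["quant"])
        groups[key].append(float(row["decode_tps"]))
    return {key: median(values) for key, values in groups.items()}


table = median_by_version(csv.DictReader(io.StringIO(SAMPLE)))
for key, value in sorted(table.items()):
    print(key, value)
```

Keeping the version in the key is what makes side-by-side comparisons between builds possible without re-running older configurations.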
Updated recommendations
The previous post’s recommendations hold up, with one addition:
| Goal | Setup | Result |
|---|---|---|
| Fastest decode | Gemma4-26B + E2B draft, llama.cpp spec | 206 t/s |
| Best all-around | Qwen3.5-35B-A3B, mlx-lm 4-bit | 95 t/s, 210 ms TTFT |
| Fastest response | Qwen3.5-35B-A3B, Ollama | 80 ms warm TTFT |
| Smallest memory | Gemma4-E2B, llama.cpp UD-Q2_K_XL | 114 t/s in 3.2 GB |
| GGUF portable | Gemma4-26B-A4B, llama-server Q4_K_M | 64 t/s |
Speculative decoding at 206 t/s is the new fastest configuration. But it needs 31 GB and two models loaded simultaneously, and the decode speed depends on how well the draft model predicts the main model’s output. For everyday use, Ollama or mlx-lm with a single MoE model is simpler and still fast.
The benchmark harness is on GitHub. The full interactive results are browsable: all 900+ measurements, filterable by model, backend, quant, and prompt type.