llm-bench results

Apple M3 Max · 64 GB · 2026-05-13 09:13 · 360 configs (median of 3 runs) · 1328 total measurements

ID · Model · Prompt · Backend · Ver · Fmt · Quant · KV Cache · TTFT · Decode · Prefill · Tokens · Total · Peak RSS

TTFT = time to first token (warm, with prefix cache where available)
Cold = first-request TTFT (shown in parentheses next to TTFT when >1.5× warm)
Decode = generation tokens/s
Prefill = prompt-eval tokens/s
Total = wall-clock time for the full response
Peak RSS = peak process-tree RAM during inference

All values are the median of 3 runs except Peak RSS (max). Backends: mlx-lm 0.31.2–0.31.3, mlx-vlm 0.4.3/0.4.4, Ollama 0.19–0.21, oMLX 0.3.4, llama.cpp b5220–b8920, vllm-mlx 0.1–0.2.9, LM Studio, Docker Model Runner.
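For reference, a minimal sketch of how these column values could be derived from raw per-run timestamps. This is not llm-bench's actual code: the RunTiming fields and the summarize helper are hypothetical, the prefill window is approximated by the TTFT window, and Peak RSS collection is omitted.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class RunTiming:
    """Raw timestamps for one inference run (seconds; hypothetical field names)."""
    request_sent: float   # request dispatched to the backend
    first_token: float    # first generated token received
    last_token: float     # final generated token received
    prompt_tokens: int    # prompt (prefill) token count
    output_tokens: int    # generated (decode) token count, assumed > 1

def ttft_s(r: RunTiming) -> float:
    # Time to first token: request dispatch -> first token.
    return r.first_token - r.request_sent

def prefill_tps(r: RunTiming) -> float:
    # Prompt-eval throughput, approximating the prefill window as the TTFT window.
    return r.prompt_tokens / ttft_s(r)

def decode_tps(r: RunTiming) -> float:
    # Generation throughput: tokens after the first, over the decode window.
    return (r.output_tokens - 1) / (r.last_token - r.first_token)

def summarize(cold: RunTiming, warm_runs: list[RunTiming]) -> dict:
    """Median-of-runs row; cold TTFT kept only when >1.5x the warm median."""
    warm_ttft = median(ttft_s(r) for r in warm_runs)
    row = {
        "ttft_s": warm_ttft,
        "decode_tps": median(decode_tps(r) for r in warm_runs),
        "prefill_tps": median(prefill_tps(r) for r in warm_runs),
        "total_s": median(r.last_token - r.request_sent for r in warm_runs),
    }
    cold_ttft = ttft_s(cold)
    if cold_ttft > 1.5 * warm_ttft:
        row["cold_ttft_s"] = cold_ttft  # rendered in parentheses in the table
    return row
```

The >1.5× gate in summarize mirrors the legend's rule for when a cold TTFT is shown alongside the warm value.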