llm-bench results

Apple M3 Max · 64 GB · 2026-05-13 09:13 · 360 configs (median of 3 runs) · 1328 total measurements

ID · Model · Prompt · Backend · Ver · Fmt · Quant · KV Cache · TTFT · Decode · Prefill · Tokens · Total · Peak RSS

TTFT = time to first token (warm, with prefix cache where available)
Cold = first-request TTFT (shown in parentheses next to TTFT when >1.5× warm)
Decode = generation tokens/s
Prefill = prompt-eval tokens/s
Total = wall-clock time for the full response
Peak RSS = peak process-tree RAM during inference

All values are the median of 3 runs except Peak RSS (max). Backends: mlx-lm 0.31.2–0.31.3, mlx-vlm 0.4.3/0.4.4, Ollama 0.19–0.21, oMLX 0.3.4, llama.cpp b5220–b8920, vllm-mlx 0.1–0.2.9, LM Studio, Docker Model Runner.
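For reference, a minimal sketch of how these column values could be derived from raw per-run timestamps. This is not llm-bench's actual code: the RunTiming fields and the summarize helper are hypothetical, the prefill window is approximated by the TTFT window, and Peak RSS collection is omitted.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class RunTiming:
    """Raw timestamps for one inference run (seconds; hypothetical field names)."""
    request_sent: float   # request dispatched to the backend
    first_token: float    # first generated token received
    last_token: float     # final generated token received
    prompt_tokens: int    # prompt (prefill) token count
    output_tokens: int    # generated (decode) token count, assumed > 1

def ttft_s(r: RunTiming) -> float:
    # Time to first token: request dispatch -> first token.
    return r.first_token - r.request_sent

def prefill_tps(r: RunTiming) -> float:
    # Prompt-eval throughput, approximating the prefill window as the TTFT window.
    return r.prompt_tokens / ttft_s(r)

def decode_tps(r: RunTiming) -> float:
    # Generation throughput: tokens after the first, over the decode window.
    return (r.output_tokens - 1) / (r.last_token - r.first_token)

def summarize(cold: RunTiming, warm_runs: list[RunTiming]) -> dict:
    """Median-of-runs row; cold TTFT kept only when >1.5x the warm median."""
    warm_ttft = median(ttft_s(r) for r in warm_runs)
    row = {
        "ttft_s": warm_ttft,
        "decode_tps": median(decode_tps(r) for r in warm_runs),
        "prefill_tps": median(prefill_tps(r) for r in warm_runs),
        "total_s": median(r.last_token - r.request_sent for r in warm_runs),
    }
    cold_ttft = ttft_s(cold)
    if cold_ttft > 1.5 * warm_ttft:
        row["cold_ttft_s"] = cold_ttft  # rendered in parentheses in the table
    return row
```

The >1.5× gate in summarize mirrors the legend's rule for when a cold TTFT is shown alongside the warm value.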