A systematic benchmark tool designed for the agentic workload profile. Evaluates 57 configurations across eight inference backends, seven models, six quantization formats, seven KV cache strategies, and four context tiers on Apple Silicon hardware.
Key features
- Full server lifecycle management per configuration (start, health check, warmup, measurement, teardown)
- Wall-clock TTFT (time to first token) and decode throughput measurement at realistic context depths (up to 128k tokens)
- Process-tree memory monitoring via recursive psutil sampling
- YAML-driven configuration with constraint validation
- OpenAI SSE and Ollama JSON-line streaming parsers
- Full dataset (762+ rows) released for reproducibility
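The process-tree memory monitoring mentioned above can be sketched with psutil's recursive child enumeration. This is an illustrative sketch, not the tool's actual API: the function name `sample_process_tree_rss` and its shape are assumptions.

```python
import psutil


def sample_process_tree_rss(pid: int) -> int:
    """Sum the resident set size (bytes) of a process and all descendants.

    Illustrative sketch of recursive psutil sampling; a benchmark
    harness would call this periodically and record the peak value.
    """
    try:
        root = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return 0
    total = root.memory_info().rss
    for child in root.children(recursive=True):
        try:
            total += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # child exited between enumeration and sampling
    return total
```

Summing over the whole tree matters because inference servers often fork worker processes whose memory would otherwise be missed.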
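The OpenAI-style SSE parsing in the feature list can be sketched as follows. This is a minimal sketch assuming the standard `data: {...}` / `data: [DONE]` event framing; the function name `parse_openai_sse` is illustrative.

```python
import json
from typing import Iterator


def parse_openai_sse(lines: Iterator[str]) -> Iterator[str]:
    """Yield content deltas from an OpenAI-style SSE chat stream.

    Each event arrives as a line 'data: {...}'; the stream terminates
    with 'data: [DONE]'. Events without a content delta (e.g. the
    initial role-only chunk) are skipped.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```

For TTFT measurement, the wall-clock timestamp of the first yielded delta (relative to request send time) is the quantity of interest.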
Findings
- TTFT diverges by 100x across backends at 32k context
- Prefix caching achieves 626x TTFT reduction
- A MoE (mixture-of-experts) architecture is required for 128k-token viability on 64 GB of unified memory
- Framework-level optimization yields far larger gains (2.4x) than switching quantization formats (<3%)
Interactive results
All 791 measurements, sortable by any column and filterable by model, backend, quantization, and context depth.