A systematic benchmark tool designed for the agentic workload profile. Evaluates 57 configurations across eight inference backends, seven models, six quantization formats, seven KV cache strategies, and four context tiers on Apple Silicon hardware.
Key features
- Full server lifecycle management per configuration (start, health check, warmup, measurement, teardown)
- Wall-clock TTFT (time to first token) and decode throughput measurement at realistic context depths (up to 128k tokens)
- Process-tree memory monitoring via recursive psutil sampling
- YAML-driven configuration with constraint validation
- OpenAI SSE and Ollama JSON-line streaming parsers
- Full dataset (762+ rows) released for reproducibility
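The process-tree memory monitoring mentioned above can be sketched with psutil's recursive child enumeration. This is an illustrative sketch, not the tool's actual API: the function name `sample_process_tree_rss` and its shape are assumptions.

```python
import psutil


def sample_process_tree_rss(pid: int) -> int:
    """Sum the resident set size (bytes) of a process and all descendants.

    Illustrative sketch of recursive psutil sampling; a benchmark
    harness would call this periodically and record the peak value.
    """
    try:
        root = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return 0
    total = root.memory_info().rss
    for child in root.children(recursive=True):
        try:
            total += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # child exited between enumeration and sampling
    return total
```

Summing over the whole tree matters because inference servers often fork worker processes whose memory would otherwise be missed.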
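The OpenAI-style SSE parsing in the feature list can be sketched as follows. This is a minimal sketch assuming the standard `data: {...}` / `data: [DONE]` event framing; the function name `parse_openai_sse` is illustrative.

```python
import json
from typing import Iterator


def parse_openai_sse(lines: Iterator[str]) -> Iterator[str]:
    """Yield content deltas from an OpenAI-style SSE chat stream.

    Each event arrives as a line 'data: {...}'; the stream terminates
    with 'data: [DONE]'. Events without a content delta (e.g. the
    initial role-only chunk) are skipped.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```

For TTFT measurement, the wall-clock timestamp of the first yielded delta (relative to request send time) is the quantity of interest.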
Findings
- TTFT diverges by 100x across backends at 32k context
- Prefix caching achieves 626x TTFT reduction
- A MoE (mixture-of-experts) architecture is required for 128k-token viability on 64 GB of unified memory
- Framework-level optimization yields far larger gains (2.4x) than switching quantization formats (<3%)
Interactive results
All 791 measurements, sortable by any column and filterable by model, backend, quantization, and context depth.