* spec: add speed-bench support for benchmarking * speed-bench : add trailing newline to requirements.txt * speed-bench : bump datasets to 4.8.0 to fix ty check * server-bench : remove now-unused type: ignore after datasets bump
5.2 KiB
SPEED-Bench server benchmark
A lightweight SPEED-Bench client for benchmarking an already-running llama-server through its OpenAI-compatible API. It is primarily meant to evaluate speculative decoding (draft model, n-gram, MTP, EAGLE3, ...) by reporting per-category throughput, latency, and draft acceptance.
The dataset handling follows the aiperf SPEED-Bench tutorial, which also documents the dataset layout in more detail.
Install
pip install -r tools/server/bench/speed-bench/requirements.txt
Start a server
The client does not launch the server, so start llama-server yourself first. If you care about throughput numbers, set the client --concurrency to the server's slot count (--np):
llama-server \
-m target.gguf \
-c 8192 \
--port 8080 \
-ngl 99 -fa on \
--np 1 \
--jinja
For speculative decoding, start the server with the appropriate flags for your setup (e.g. a draft model with -md, or --spec-type ngram-mod). See the speculative decoding doc for details.
Run
python tools/server/bench/speed-bench/speed_bench.py \
--url localhost:8080 \
--bench qualitative \
--category coding \
--osl 1024 \
--concurrency 1
Options
| Option | Default | Description |
|---|---|---|
--url |
localhost:8080 |
Server URL. The scheme and /v1 are optional and a trailing slash is fine, so localhost:8080 and http://localhost:8080/v1/ both work. |
--model |
none | Optional model field sent in each request. |
--bench |
qualitative |
SPEED-Bench config, e.g. qualitative, throughput_1k. See available dataset variants. |
--category |
all |
Category filter within the bench; comma-separated list or all. For qualitative the categories are coding, humanities, math, multilingual, qa, rag, reasoning, roleplay, stem, summarization, writing. For the throughput_{ISL} splits they are high_entropy, low_entropy, mixed. |
--osl |
1024 |
Output sequence length, mapped to max_tokens. |
--extra-inputs |
{"temperature":0} |
Extra request fields as a JSON object. |
--concurrency |
1 |
Concurrent client requests; usually match --np. |
--limit |
none | Max samples per category (handy for smoke tests). |
--timeout |
600 |
Per-request timeout in seconds. |
--output |
none | Save raw per-request results and the summary to JSON. |
A few common ones:
--category allruns every category in the bench.--category coding,mathruns just those two.--bench throughput_8kruns a fixed-input-length throughput split.--limit 8keeps at most 8 samples per category, which is enough for a quick check.
The throughput_{ISL} splits use fixed input lengths (1k - 32k), so they are handy for long-context testing and for comparing different llama-server batching settings (e.g. sweeping -ub / --ubatch-size) on prompts of a known size. Make sure the server -c is large enough for the chosen split. When raising -ub, also raise -b to at least the same value, since the physical ubatch cannot exceed the logical batch.
When --output is given, the JSON file holds the run config, the selected_samples / completed_samples / failed_samples counts, the per-category summary rows, and the per-sample results.
Metrics
The summary prints one row per category plus an overall row:
samples- how many samples finished successfully.avg_prompt_t/s- prefill throughput from llama.cpp (timings.prompt_per_second), averaged over the category's samples.avg_pred_t/s- decode throughput from llama.cpp (timings.predicted_per_second), averaged over the category's samples.avg_latency- average end-to-end request latency seen by the client.accept_rate-accepted / draft_nover the category, orn/aif nothing was drafted (draft_n == 0).
Baseline vs speculative decoding
Save a run from each server with --output, then diff the two JSON files with speed_bench_compare.py.
First, start a plain llama-server (no speculative decoding) and save a baseline:
python tools/server/bench/speed-bench/speed_bench.py \
--url localhost:8080 \
--bench qualitative \
--category all \
--osl 1024 \
--concurrency 1 \
--output baseline.json
Then restart llama-server with speculative decoding enabled and save another run:
python tools/server/bench/speed-bench/speed_bench.py \
--url localhost:8080 \
--bench qualitative \
--category all \
--osl 1024 \
--concurrency 1 \
--output spec.json
Finally compare the two:
python tools/server/bench/speed-bench/speed_bench_compare.py \
--baseline baseline.json \
--speculative spec.json
The comparison table adds:
decode_speedup = spec_avg_pred_t/s / base_avg_pred_t/slatency_speedup = base_avg_latency / spec_avg_latency
Keep --bench, --category, --osl, and --limit the same across both runs, otherwise they won't be using the same prompts.