mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-09 07:16:44 +02:00

Files

T

Ruixiang Wang 689a9a470e server-bench : add speed-bench for speculative decoding benchmarking (#23869 )

* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump

2026-05-29 23:09:47 +02:00

5.2 KiB

Raw Blame History

SPEED-Bench server benchmark

A lightweight SPEED-Bench client for benchmarking an already-running llama-server through its OpenAI-compatible API. It is primarily meant to evaluate speculative decoding (draft model, n-gram, MTP, EAGLE3, ...) by reporting per-category throughput, latency, and draft acceptance.

The dataset handling follows the aiperf SPEED-Bench tutorial, which also documents the dataset layout in more detail.

Install

pip install -r tools/server/bench/speed-bench/requirements.txt

Start a server

The client does not launch the server, so start llama-server yourself first. If you care about throughput numbers, set the client --concurrency to the server's slot count (--np):

llama-server \
  -m target.gguf \
  -c 8192 \
  --port 8080 \
  -ngl 99 -fa on \
  --np 1 \
  --jinja

For speculative decoding, start the server with the appropriate flags for your setup (e.g. a draft model with -md, or --spec-type ngram-mod). See the speculative decoding doc for details.

Run

python tools/server/bench/speed-bench/speed_bench.py \
  --url localhost:8080 \
  --bench qualitative \
  --category coding \
  --osl 1024 \
  --concurrency 1

Options

Option	Default	Description
`--url`	`localhost:8080`	Server URL. The scheme and `/v1` are optional and a trailing slash is fine, so `localhost:8080` and `http://localhost:8080/v1/` both work.
`--model`	none	Optional `model` field sent in each request.
`--bench`	`qualitative`	SPEED-Bench config, e.g. `qualitative`, `throughput_1k`. See available dataset variants.
`--category`	`all`	Category filter within the bench; comma-separated list or `all`. For `qualitative` the categories are `coding`, `humanities`, `math`, `multilingual`, `qa`, `rag`, `reasoning`, `roleplay`, `stem`, `summarization`, `writing`. For the `throughput_{ISL}` splits they are `high_entropy`, `low_entropy`, `mixed`.
`--osl`	`1024`	Output sequence length, mapped to `max_tokens`.
`--extra-inputs`	`{"temperature":0}`	Extra request fields as a JSON object.
`--concurrency`	`1`	Concurrent client requests; usually match `--np`.
`--limit`	none	Max samples per category (handy for smoke tests).
`--timeout`	`600`	Per-request timeout in seconds.
`--output`	none	Save raw per-request results and the summary to JSON.

A few common ones:

--category all runs every category in the bench.
--category coding,math runs just those two.
--bench throughput_8k runs a fixed-input-length throughput split.
--limit 8 keeps at most 8 samples per category, which is enough for a quick check.

The throughput_{ISL} splits use fixed input lengths (1k - 32k), so they are handy for long-context testing and for comparing different llama-server batching settings (e.g. sweeping -ub / --ubatch-size) on prompts of a known size. Make sure the server -c is large enough for the chosen split. When raising -ub, also raise -b to at least the same value, since the physical ubatch cannot exceed the logical batch.

When --output is given, the JSON file holds the run config, the selected_samples / completed_samples / failed_samples counts, the per-category summary rows, and the per-sample results.

Metrics

The summary prints one row per category plus an overall row:

samples - how many samples finished successfully.
avg_prompt_t/s - prefill throughput from llama.cpp (timings.prompt_per_second), averaged over the category's samples.
avg_pred_t/s - decode throughput from llama.cpp (timings.predicted_per_second), averaged over the category's samples.
avg_latency - average end-to-end request latency seen by the client.
accept_rate - accepted / draft_n over the category, or n/a if nothing was drafted (draft_n == 0).

Baseline vs speculative decoding

Save a run from each server with --output, then diff the two JSON files with speed_bench_compare.py.

First, start a plain llama-server (no speculative decoding) and save a baseline:

python tools/server/bench/speed-bench/speed_bench.py \
  --url localhost:8080 \
  --bench qualitative \
  --category all \
  --osl 1024 \
  --concurrency 1 \
  --output baseline.json

Then restart llama-server with speculative decoding enabled and save another run:

python tools/server/bench/speed-bench/speed_bench.py \
  --url localhost:8080 \
  --bench qualitative \
  --category all \
  --osl 1024 \
  --concurrency 1 \
  --output spec.json

Finally compare the two:

python tools/server/bench/speed-bench/speed_bench_compare.py \
  --baseline baseline.json \
  --speculative spec.json

The comparison table adds:

decode_speedup = spec_avg_pred_t/s / base_avg_pred_t/s
latency_speedup = base_avg_latency / spec_avg_latency

Keep --bench, --category, --osl, and --limit the same across both runs, otherwise they won't be using the same prompts.

5.2 KiB Raw Blame History