Replace TARGETARCH build-arg with runtime arch detection via uname -m.
BuildKit's TARGETARCH injection was unreliable for the multi-arch cpu
build, causing the arm64 image variant to download and embed the x86_64
llama-swap binary — resulting in "exec format error" on arm64 hosts.
With QEMU user-space emulation, uname -m correctly returns aarch64
inside an arm64 container build, so the download always fetches the
right binary for the actual target architecture. Also adds --fail to
curl so HTTP 404s produce a build error instead of silently embedding an
HTML error page.
fixes#818
Co-authored-by: Claude <noreply@anthropic.com>
Shift the non-unified container builds about 8 hours after the llama.cpp's projects container publishing window. The llama.cpp containers take a few hours to build and publish and 8 hours is expected to be enough time to remain fresh.
Additionally, add an extra build at 18:00 in case the 12:00 one does not pick things up. The container builds on the llama-swap side are cheap (just injecting llama-swap binary) so it is fine to run them a bit more frequently.
Implement performance monitoring on OSX for Apple Silicon hardware.
The implementation uses a combination of mactop and ioreg. If mactop is
installed (`brew install mactop`) it is used in a headless cli mode to
stream usage metrics. mactop hooks into unpublished(?) C based APIs in
OSX. Rather than introduce a cgo dependency into llama-swap's build
chain only for darwin I opted to go the external process route.
ioreg, which comes bundled with OSX is used as the fallback. It does not
provide temperature and power usage data but is able to show accurate
GPU and memory utilization.
Updates #771, #814
Add Windows specific shutdown code paths so stopping of child processes
is more reliable:
- stopping llama-swap won't leave behind any child processes it created
- uses Job Objects in Windows so the whole llama-swap tree is closed by
the os
- add procCtx to baseRouter. It replaces shutdownCtx as a signal for
managing lifetime state.
- shutdownCtx is only used by the router to stop handling new requests
during shutdown
- improve debug logging to make it easier to trace source of issues
Fixes#804
Updates #807
Note: The original proxy/process_unix.go had a noop for setProcAttributes
so it also did not stop grandchildren processes. This patch adds that capability
and improves reliability.
--
Stop() no longer hangs on a shell wrapper that forks the real binary.
The upstream is built with exec.CommandContext + cmd.Cancel +
cmd.WaitDelay, so cmd.Wait() returns even when a forked grandchild
inherits the stdout/stderr pipes. killProcess sends the stop signal
directly (not by cancelling the context) so cmd.WaitDelay measures from
process exit and never silently caps the caller's graceful timeout.
The upstream is also started in its own process group (Setpgid) on Unix,
so the graceful SIGTERM — and the SIGKILL escalation after the timeout —
are delivered to the whole group via the negative PID. A forked
grandchild is reaped with its parent instead of leaking as an orphan.
The loading-spinner SSE goroutine can no longer panic when it outlives
the request. net/http recycles the response writer via Reset(nil) once
ServeHTTP returns; the orphaned goroutine then flushed against a
nil-backed writer and crashed with a SIGSEGV. A release() fence on
loadingWriter lets any in-flight write finish then short-circuits later
writes/flushes, and all three ServeHTTP select branches run a
finishLoading helper (cancelLoad, waitForCompletion, release) before the
writer is reclaimed.
- internal/process: exec.CommandContext + WaitDelay, Setpgid process
groups, group-wide SIGTERM/SIGKILL teardown
- internal/router: release() fence + finishLoading on loadingWriter
fixes#804
update the concurrency middleware to respond with a JSON payload instead
of plain text when the request limit is reached to be compatible with
openai api standard
---------
Co-authored-by: Ludwik <l.czarnota@samsung.com>
This is a huge backend change that essentially started with rewriting
the concurrency handling for processes and blew up to a refactor of the
entire application. In short these are the improvements:
**Better state and life cycle management:**
Life cycle management of processes has always been the trickiest part of
the code. Juggling mutex locks between multiple locations to reduce race
conditions was complex. Too complex for my feeble brain to build a
simple mental model around as llama-swap gained more features. All of
that has been refactored. Most of the locks are gone, replaced with a
single run() that owns all state changes. There is one place to start
from now to understand and extend routing logic.
The improved life cycle management makes it easier to implement more
complex swap optimization strategies in the future like #727.
**Collation of requests:**
llama-swap previously handled requests and swapping in the order they
came in. For example requests for models in this order ABCABC would
result in 5 swaps. Now those requests are handled in this order AABBCC.
The result is less time waiting for swap under a high churn request
queue. This fixes#588#612.
A possible future enhancement is to support a starvation parameter so
swap can be forced when models have been waiting too long.
**Shared base implementation for groups and swap matrix:**
During the refactor it became clear that much of the swapping logic was
shared between these two implementations. That is not surprising
considering the swap matrix was added many moons after groups. Now they
share a common base and their specific swap strategies are implemented
into the swapPlanner interface.
Requests for bespoke or specific swapping scenarios is a common theme in
the issues. Now users can implement whatever bespoke and weird swapping
strategy they want in their own fork. Just ask your agent of choice to
implement swapPlanner. I'll still remaining more conservative on what
actually lands in core llama-swap and will continue to evaluate PRs if
the changes is good for everyone or just one specific use case.
**AI / Agentic Disclosure:**
I paid very close attention to the low level swap concurrency design and
implementation. It's important to keep that essential part reliable,
boring and no surprises. Backwards compatibility was also maintained,
even the one way non-exclusive group model loading behaviour that people
have rightly pointed out be a weird design decision.
With the underlying swap core done the web server, api and UI sitting on
top were largely ported over with Claude Code and Opus 4.7 in multiple
phases. If you're curious I kept the changes in docs/newrouter-todo.md.
I did several passes to make sure things weren't left behind.
However, even frontier LLMs at the time of this PR still make small
decisions that don't make a lot of sense. They get shit wrong all the
time, just in small subtle way.
That said, there's likely to be some new bugs introduced with this
massive refactor. I'm fairly confident that there's no major
architectural flaws that would cause goal seeking agents to make dumb,
ugly code decisions.
For a little while the legacy llama-swap will be available under
cmd/legacy/llama-swap. The plan is to eventually delete that entry point
as well as the proxy package.
On a bit of a personal note, this PR is exciting and a bit sad for me. I
hand wrote much of the original code and this PR ultimately replaces
much of it. While the old code served as a good reference for the agent
to implement the new stuff it still a bit sad to eventually delete it
all.
The backend uses cache_tokens=-1 as a sentinel for endpoints that don't
report cache stats (embeddings, vLLM). The activity table correctly
renders these as "-", but the totals widget summed the sentinels
directly, so each such request subtracted 1 from the displayed total.
- clamp cache_tokens with Math.max(0, ...) when reducing
This improves the support for activity logging from the v1/responses and
v1/messages endpoints.
- add chat endpoint selection to Playground > Chat > Settings
- improve metrics extraction for streaming v1/messages and v1/responses
endpoints (tested with llama-server)
Fixes#742
- update README.md with new docker instructions
- update docs/configuration.md
- update .github/workflows to have pinned action versions
- gofmt events package
- fix small bugs in CI scripts
- reduce config options for internal/perf/monitor and config. A ring buffer is used to keep 1hr of entries at max 5s granularity. For long term stats use prometheus monitoring on /metrics
Fixes#744
Ignore LACT devices that report zero total VRAM.
Some virtual GPUs on headless VMs report `MemTotalMB == 0` through LACT,
which makes them appear in performance monitoring despite not providing
useful memory data. Skip those entries so only usable GPU devices are
reported.
This makes performance monitoring cleaner on headless VMs with virtual
GPUs that report zero VRAM.
Co-authored-by: David Soušek <david.sousek@intelogy.co.uk>
actions/delete-package-versions can't see OCI manifest lists. When the
cpu build pushes a multi-arch image, the registry gets a tagged index
plus one untagged per-platform manifest per arch. The cleanup step with
`delete-only-untagged-versions: true` then deletes the per-platform
children, leaving the index dangling — `docker pull
ghcr.io/mostlygeek/llama-swap:cpu` 404s on the referenced sha.
Swap to dataaxiom/ghcr-cleanup-action, which inspects tagged manifest
lists first and excludes their children from deletion. Single-arch
backends behave the same as before.
Fix#746
Encountered a similar problem as in
https://github.com/mostlygeek/llama-swap/issues/709 but in my case I
only needed the :cpu version.
So decided to add the github action to build arm64 combined with the
amd64 version on the same :cpu tag. Already tested it from this fork:
ghcr.io/rhtenhove/llama-swap:cpu and it works perfectly fine.
Adding GPU support is a whole other beast, needing quite a bit more work
and isn't something I can test.
## Problem
The `/running` endpoint in `listRunningProcessesHandler` reads
`process.state` directly without holding `stateMutex`. Meanwhile,
`swapState()` writes to `process.state` while holding the write lock.
This is a data race flagged by the Go race detector.
Also fixes a minor typo: "processes was in state" → "process was in
state".
## Fix
- `proxymanager.go`: Replace `process.state` with
`process.CurrentState()` which acquires `stateMutex.RLock()` before
reading.
- `process.go`: Fix typo in error message.
## Verification
- `gofmt -l` — clean
- `go test -run "TestProcessGroup_|TestProxyManager_" ./proxy/` — all
pass
- `go test ./proxy/config/... ./proxy/cache/...
./proxy/configwatcher/...` — all pass
Add system theme detection with automatic switching when OS theme
changes.
- Add ThemeMode type with "light", "dark", and "system" options
- Add system theme listener using matchMedia API
- Update theme toggle to cycle through System → Light → Dark
- Add combined sun/moon icon for system theme mode
- Migrate existing theme preferences to new format
Add a comprehensive performance monitoring system that collects CPU, memory, swap, load average, network IO, and GPU stats. Provides both a REST API for the UI and a Prometheus /metrics endpoint.
Backend changes:
- New internal/perf package with configurable interval-based stats collection
- GPU monitoring via LACT (Unix socket) and nvidia-smi fallback on Linux
- Ring buffer (internal/ring) for time-series stat storage
- Prometheus /metrics endpoint with all system and GPU metrics
- Moved LogMonitor to internal/logmon package
- New PerformanceConfig for hot-reloadable monitoring settings
- REST /api/performance endpoint replacing SSE streaming
UI changes:
- New Performance page with real-time charts for CPU, memory, GPU, and network
- Reusable PerformanceChart component
- LLAMA_SWAP_URL environment variable support
- Improved capture dialog display
Other:
- Example Grafana dashboard for Prometheus metrics
- monitor-test standalone binary
- Config schema and example updates
fixes#596
- inference handles to store an activity record for all inference endpoints
- add path, status code, and content type to Activities page
- toggle on/off columns no Activities page
- add configurable capture level for inference endpoints so large binary blobs are not stored in memory
- store captures in compressed binary format
- Fix the histogram calculation to use server provided generation
tokens/second.
- Move histogram to Activities page where it can exist with the rest of
the token metrics
Fixes#681
The fsnotify-based config watcher does not work reliably when the config
file is bind-mounted into a Docker container as an individual file, and
mishandles k8s ConfigMap projections (atomically swapped symlinks).
Replace it with a small os.Stat-polling watcher and add SIGHUP as an
explicit reload signal.
- new proxy/configwatcher package: 2s os.Stat poller, follows symlinks,
fires on mtime/size change and on missing -> present transitions
- SIGHUP triggers reload unconditionally (works without --watch-config)
via the same ConfigFileChangedEvent pipeline so the UI sees identical
state transitions
- watcher goroutine now exits cleanly on shutdown via a context
- drop github.com/fsnotify/fsnotify dependency
fixes#682
Install uv after the cpp tool binaries are copied and before the
llama-swap binary, enabling `uv run` usage for Python-based inference
backends like vLLM.
- add python3-pip to runtime apt installs
- add `pip install uv --break-system-packages` after cpp installs
fixes#628
Co-authored-by: Claude <noreply@anthropic.com>
- matrix.go change logic to consider any proxy.Process not in
StateStopped or StateShutdown
- process.StopImmediately, and Stop() which called it had a subtle bug
where it only handled state transitions from StateReady to
StateStopping. StateStarting -> StateStopping was ignored completely.
fix: #670