wylab/llama-swap

Fork 1

mirror of https://github.com/mostlygeek/llama-swap.git synced 2026-06-09 06:46:34 +02:00

T

Benson Wong 02e015fa49

Linux CI / run-tests (push) Failing after 14m56s

Details

Windows CI / run-tests (push) Has been cancelled

Details

Introduce new routing backend (#790 )

This is a huge backend change that essentially started with rewriting
the concurrency handling for processes and blew up to a refactor of the
entire application. In short these are the improvements:

**Better state and life cycle management:** 

Life cycle management of processes has always been the trickiest part of
the code. Juggling mutex locks between multiple locations to reduce race
conditions was complex. Too complex for my feeble brain to build a
simple mental model around as llama-swap gained more features. All of
that has been refactored. Most of the locks are gone, replaced with a
single run() that owns all state changes. There is one place to start
from now to understand and extend routing logic.

The improved life cycle management makes it easier to implement more
complex swap optimization strategies in the future like #727.

**Collation of requests:**

llama-swap previously handled requests and swapping in the order they
came in. For example requests for models in this order ABCABC would
result in 5 swaps. Now those requests are handled in this order AABBCC.
The result is less time waiting for swap under a high churn request
queue. This fixes #588 #612.

A possible future enhancement is to support a starvation parameter so
swap can be forced when models have been waiting too long.

**Shared base implementation for groups and swap matrix:** 

During the refactor it became clear that much of the swapping logic was
shared between these two implementations. That is not surprising
considering the swap matrix was added many moons after groups. Now they
share a common base and their specific swap strategies are implemented
into the swapPlanner interface.

Requests for bespoke or specific swapping scenarios is a common theme in
the issues. Now users can implement whatever bespoke and weird swapping
strategy they want in their own fork. Just ask your agent of choice to
implement swapPlanner. I'll still remaining more conservative on what
actually lands in core llama-swap and will continue to evaluate PRs if
the changes is good for everyone or just one specific use case.

**AI / Agentic Disclosure:** 

I paid very close attention to the low level swap concurrency design and
implementation. It's important to keep that essential part reliable,
boring and no surprises. Backwards compatibility was also maintained,
even the one way non-exclusive group model loading behaviour that people
have rightly pointed out be a weird design decision.

With the underlying swap core done the web server, api and UI sitting on
top were largely ported over with Claude Code and Opus 4.7 in multiple
phases. If you're curious I kept the changes in docs/newrouter-todo.md.
I did several passes to make sure things weren't left behind.

However, even frontier LLMs at the time of this PR still make small
decisions that don't make a lot of sense. They get shit wrong all the
time, just in small subtle way.

That said, there's likely to be some new bugs introduced with this
massive refactor. I'm fairly confident that there's no major
architectural flaws that would cause goal seeking agents to make dumb,
ugly code decisions.

For a little while the legacy llama-swap will be available under
cmd/legacy/llama-swap. The plan is to eventually delete that entry point
as well as the proxy package.

On a bit of a personal note, this PR is exciting and a bit sad for me. I
hand wrote much of the original code and this PR ultimately replaces
much of it. While the old code served as a good reference for the agent
to implement the new stuff it still a bit sad to eventually delete it
all.

2026-05-28 21:47:01 -07:00

.github

Increase inactivity thresholds for stale issues

2026-05-17 22:52:58 -07:00

ai-plans

proxy: Refactor tests (#660 )

2026-04-16 22:47:42 -07:00

cmd

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

docker

Multi arch cpu (#746 )

2026-05-11 21:03:48 -07:00

docs

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

internal

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

models

first commit

2024-10-03 20:20:01 -07:00

proxy

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

scripts

Improve install script (#144 )

2025-05-23 09:39:55 -07:00

ui-svelte

ui-svelte: update link to performance discussion thread

2026-05-17 11:45:56 -07:00

.coderabbit.yaml

Disable auto review feature in coderabbit config

2026-05-18 10:40:21 -07:00

.gitignore

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

.goreleaser.yaml

fix more goreleaser deprecation warnings [skip ci]

2025-06-18 11:15:12 -07:00

AGENTS.md

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

CLAUDE.md

ui: persist playground state across route navigation (#525 )

2026-02-15 21:30:52 -08:00

config-schema.json

Changes and fixes before the release (docs/small tweaks) (#750 )

2026-05-13 21:18:19 -07:00

config.example.yaml

config.example.yaml: Improve matrix vs groups info

2026-05-17 15:59:25 -07:00

go.mod

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

go.sum

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

LICENSE.md

2024-10-04 09:31:08 -07:00

llama-swap.go

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

Makefile

Introduce new routing backend (#790 )

2026-05-28 21:47:01 -07:00

README.md

Changes and fixes before the release (docs/small tweaks) (#750 )

2026-05-13 21:18:19 -07:00

README.md

llama-swap

Run multiple generative AI models on your machine and hot-swap between them on demand. llama-swap works with any OpenAI and Anthropic API compatible server and is used by thousands of people to power their local AI workflows.

Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.

Features:

✅ Easy to deploy and configure: one binary, one configuration file. no external dependencies
✅ On-demand model switching
✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, stable-diffusion.cpp, etc.)
- future proof, upgrade your inference servers at any time.
✅ OpenAI API supported endpoints:
- v1/completions
- v1/chat/completions
- v1/responses
- v1/embeddings
- v1/models - list available models
- v1/audio/speech (#36)
- v1/audio/transcriptions (docs)
- v1/audio/voices
- v1/images/generations
- v1/images/edits
✅ Anthropic API supported endpoints:
- v1/messages
- v1/messages/count_tokens
✅ llama-server (llama.cpp) supported endpoints
- v1/rerank, v1/reranking, /rerank
- /infill - for code infilling
- /completion - for completion endpoint
✅ SDAPI via stable-diffusion.cpp's server
- /sdapi/v1/txt2img
- /sdapi/v1/img2img
- /sdapi/v1/loras - requires model in request body to fetch the correct loras
✅ llama-swap API
- /ui - web UI
- /upstream/:model_id - direct access to upstream server (demo)
- /running - list currently running models (#61)
- POST /api/models/unload - manually unload all running models (#58)
- POST /api/models/unload/:model_id - unload a specific model
- /logs - remote log monitoring
  - GET /logs returns buffered plain text logs.
    - If Accept: text/html is sent, /logs redirects to /ui/.
  - GET /logs/stream keeps the connection open for live log streaming.
    - Stream endpoints send buffered history first by default; add ?no-history to stream only new lines.
  - GET /logs/stream/proxy streams proxy logs only.
  - GET /logs/stream/upstream streams upstream process logs only.
  - GET /logs/stream/{model_id} streams logs for one model (including IDs with slashes, like author/model).
- /health - just returns "OK"
- /metrics - system and GPU metrics for prometheus
✅ API Key support - define keys to restrict access to API endpoints
✅ Customizable
- Run concurrent models with a custom DSL swap matrix (#643)
- Automatic unloading of models after timeout by setting a ttl
- Docker and Podman support using cmd and cmdStop together
- Preload models on startup with hooks (#235)
- Apply filters to requests to control inference with stripParams, setParams and setParamsByID

Web UI

llama-swap includes a real time web interface with a playground for testing out all sorts of local models:

View detailed token metrics:

Inspect request and responses:

Manually load and unload models:

Real time log streaming:

Installation

llama-swap can be installed in multiple ways

Docker
Homebrew (OSX and Linux)
WinGet
From release binaries
From source

Docker Install (download images)

Two types of container images are built nightly for llama-swap:

A unified container with llama-server, ik-llama-server, stable-diffusion.cpp, whisper.cpp and llama-swap built from source. This is only available for cuda and vulkan but has more capabilities. This one is recommended for use.
A legacy image that is based on llama.cpp's images and llama-swap copied into the container. Use this one if you prefer to stay close to llama.cpp's container images.

Unified container (Recommended)

$ docker pull ghcr.io/mostlygeek/llama-swap:unified-cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
 -v /path/to/models:/models \
 -v /path/to/custom/config.yaml:/etc/llama-swap/config/config.yaml \
 ghcr.io/mostlygeek/llama-swap:unified-cuda

Legacy container

$ docker pull ghcr.io/mostlygeek/llama-swap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
 -v /path/to/models:/models \
 -v /path/to/custom/config.yaml:/app/config.yaml \
 ghcr.io/mostlygeek/llama-swap:cuda

more examples

# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa

# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795

# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root

Homebrew Install (macOS/Linux)

brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080

WinGet Install (Windows)

Note

WinGet is maintained by community contributor Dvd-Znf (#327). It is not an official part of llama-swap.

# install
C:\> winget install llama-swap

# upgrade
C:\> winget upgrade llama-swap

Pre-built Binaries

Binaries are available on the release page for Linux, Mac, Windows and FreeBSD.

Building from source

Building requires Go and Node.js (for UI).
git clone https://github.com/mostlygeek/llama-swap.git
make clean all
look in the build/ subdirectory for the llama-swap binary

Configuration

# minimum viable config.yaml

models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf

That's all you need to get started:

models - holds all model configurations
model1 - the ID used in API calls
cmd - the command to run to start the server.
${PORT} - an automatically assigned port number

Almost all configuration settings are optional and can be added one step at a time:

Advanced features
- matrix to run concurrent models with a custom swap logic DSL
- hooks to run things on startup
- macros reusable snippets
Model customization
- ttl to automatically unload models
- aliases to use familiar model names (e.g., "gpt-4o-mini")
- env to pass custom environment variables to inference servers
- cmdStop gracefully stop Docker/Podman containers
- useModelName to override model names sent to upstream servers
- ${PORT} automatic port variables for dynamic port assignment
- filters rewrite parts of requests before sending to the upstream server

See the configuration documentation for all options.

How does llama-swap work?

When a request is made to an OpenAI compatible endpoint, llama-swap will extract the model value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.

In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, using a matrix allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.

Reverse Proxy Configuration (nginx)

If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. (#236)

Recommended nginx configuration snippets:

# SSE for UI events/logs
location /api/events {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

# Streaming chat completions (stream=true)
location /v1/chat/completions {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.

Monitoring Logs on the CLI

# sends up to the last 10KB of logs
$ curl http://host/logs

# streams combined logs
curl -Ns http://host/logs/stream

# stream llama-swap's proxy status logs
curl -Ns http://host/logs/stream/proxy

# stream logs from upstream processes that llama-swap loads
curl -Ns http://host/logs/stream/upstream

# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history'

Do I need to use llama.cpp's server (llama-server)?

Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.

For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to SIGTERM signals for proper shutdown.

Star History

Note

Thank you to everyone who has given this project a ⭐️!

Languages

Go 64.1%

Svelte 22.5%

TypeScript 6.4%

Shell 4.9%

Dockerfile 1%

Other 1.1%

README.md Unescape Escape

llama-swap

Features:

Web UI

Installation

Docker Install (download images)

Unified container (Recommended)

Legacy container

Homebrew Install (macOS/Linux)

WinGet Install (Windows)

Pre-built Binaries

Building from source

Configuration

How does llama-swap work?

Reverse Proxy Configuration (nginx)

Monitoring Logs on the CLI

Do I need to use llama.cpp's server (llama-server)?

Star History

README.md