llama-swap

mirror of https://github.com/mostlygeek/llama-swap.git synced 2026-06-09 06:46:34 +02:00

Author	SHA1	Message	Date
Benson Wong	46cea36bc2	proxy: remove legacy code. Thanks champ 🫡 (#822) Fixes #820	2026-06-06 21:00:30 -07:00
Benson Wong	02e015fa49	Introduce new routing backend (#790) This is a huge backend change that essentially started with rewriting the concurrency handling for processes and blew up into a refactor of the entire application. In short, these are the improvements: Better state and life cycle management: Life cycle management of processes has always been the trickiest part of the code. Juggling mutex locks between multiple locations to reduce race conditions was complex. Too complex for my feeble brain to build a simple mental model around as llama-swap gained more features. All of that has been refactored. Most of the locks are gone, replaced with a single run() that owns all state changes. There is now one place to start from to understand and extend routing logic. The improved life cycle management makes it easier to implement more complex swap optimization strategies in the future, like #727. Collation of requests: llama-swap previously handled requests and swapping in the order they came in. For example, requests for models in the order ABCABC would result in 5 swaps. Now those requests are handled in the order AABBCC. The result is less time waiting for swaps under a high-churn request queue. This fixes #588 #612. A possible future enhancement is to support a starvation parameter so a swap can be forced when models have been waiting too long. Shared base implementation for groups and swap matrix: During the refactor it became clear that much of the swapping logic was shared between these two implementations. That is not surprising considering the swap matrix was added many moons after groups. Now they share a common base and their specific swap strategies are implemented through the swapPlanner interface. Requests for bespoke or specific swapping scenarios are a common theme in the issues. 
Now users can implement whatever bespoke and weird swapping strategy they want in their own fork. Just ask your agent of choice to implement swapPlanner. I'll still remain more conservative about what actually lands in core llama-swap and will continue to evaluate PRs on whether the changes are good for everyone or just one specific use case. AI / Agentic Disclosure: I paid very close attention to the low level swap concurrency design and implementation. It's important to keep that essential part reliable and boring, with no surprises. Backwards compatibility was also maintained, even the one-way non-exclusive group model loading behaviour that people have rightly pointed out to be a weird design decision. With the underlying swap core done, the web server, API and UI sitting on top were largely ported over with Claude Code and Opus 4.7 in multiple phases. If you're curious, I kept the changes in docs/newrouter-todo.md. I did several passes to make sure things weren't left behind. However, even frontier LLMs at the time of this PR still make small decisions that don't make a lot of sense. They get shit wrong all the time, just in small, subtle ways. That said, there are likely to be some new bugs introduced with this massive refactor. I'm fairly confident that there are no major architectural flaws that would cause goal seeking agents to make dumb, ugly code decisions. For a little while the legacy llama-swap will be available under cmd/legacy/llama-swap. The plan is to eventually delete that entry point as well as the proxy package. On a bit of a personal note, this PR is exciting and a bit sad for me. I hand wrote much of the original code and this PR ultimately replaces much of it. While the old code served as a good reference for the agent to implement the new stuff, it is still a bit sad to eventually delete it all.	2026-05-28 21:47:01 -07:00
Benson Wong	a4b91e08cf	Changes and fixes before the release (docs/small tweaks) (#750) - update README.md with new Docker instructions - update docs/configuration.md - update .github/workflows to have pinned action versions - gofmt events package - fix small bugs in CI scripts - reduce config options for internal/perf/monitor and config. A ring buffer is used to keep 1hr of entries at max 5s granularity. For long-term stats use Prometheus monitoring on /metrics Fixes #744	2026-05-13 21:18:19 -07:00
Benson Wong	7e3e94a08a	proxy,ui: add performance monitoring with Prometheus metrics (#743) Add a comprehensive performance monitoring system that collects CPU, memory, swap, load average, network IO, and GPU stats. Provides both a REST API for the UI and a Prometheus /metrics endpoint. Backend changes: - New internal/perf package with configurable interval-based stats collection - GPU monitoring via LACT (Unix socket) and nvidia-smi fallback on Linux - Ring buffer (internal/ring) for time-series stat storage - Prometheus /metrics endpoint with all system and GPU metrics - Moved LogMonitor to internal/logmon package - New PerformanceConfig for hot-reloadable monitoring settings - REST /api/performance endpoint replacing SSE streaming UI changes: - New Performance page with real-time charts for CPU, memory, GPU, and network - Reusable PerformanceChart component - LLAMA_SWAP_URL environment variable support - Improved capture dialog display Other: - Example Grafana dashboard for Prometheus metrics - monitor-test standalone binary - Config schema and example updates fixes #596	2026-05-09 13:29:22 -07:00
Benson Wong	15bd55d3a9	proxy, ui-svelte: add /sdapi/v1 endpoint support (#587) Add proxy routes for stable-diffusion.cpp's /sdapi/v1/txt2img, /sdapi/v1/img2img, and /sdapi/v1/loras endpoints. POST endpoints use proxyInferenceHandler (model in JSON body), GET /loras uses proxyGETModelHandler (model in query param). Update the image playground with a dual-mode UI supporting both OpenAI and SDAPI backends. In SDAPI mode, loras are fetched first to prime the server-side cache, and all txt2img parameters are exposed (negative prompt, steps, cfg_scale, seed, batch_size, clip_skip, sampler, scheduler, lora selection with multipliers). - Add 3 sdapi route registrations in proxymanager.go - Add sdApi.ts client with generateSdImage and fetchSdLoras - Add SDAPI types (SdApiTxt2ImgRequest, SdApiResponse, etc.) - Add /sdapi to vite dev proxy config - Add backend tests for sdapi routing - Support batch image display in gallery grid https://claude.ai/code/session_0186MGX6NXdHVBTv2KH45fqn --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-03-19 22:08:31 +09:00
Benson Wong	20738f3623	proxy,ui-svelte: replace old UI with svelte+playground Replace the legacy React UI with the new Svelte-based one. Introduce a Playground in the UI to quickly test out text, image, text to speech and speech to text models behind llama-swap. Key Changes New Svelte UI (ui-svelte/) - Multi-tab Playground with Chat, Image Generation, Audio Transcription, and Speech interfaces - Chat: message editing/regeneration, markdown rendering with LaTeX math support, image attachments, code syntax highlighting - Image: size selector, download/fullscreen viewing - Audio: transcription with peer support - Speech: voice caching with manual refresh, download button - Responsive mobile layout with collapsible navigation - XSS fixes and accessibility improvements Proxy Improvements - Add gzip/brotli compression for UI static assets (proxy/ui_compress.go) - Add GET /v1/audio/voices?model={model} endpoint for voice listing - Add peer support for /v1/audio/transcriptions	2026-01-31 22:49:13 -08:00
Benson Wong	b429349e8a	add /ui/ to wol-proxy polling (#388)	2025-11-08 14:16:12 -08:00
Benson Wong	6aedbe121a	cmd/wol-proxy: show a loading page for / (#381) When requesting /, wol-proxy will show a loading page that polls /status every second. When the upstream server is ready, the loading page will refresh, causing the actual root page to be displayed.	2025-11-03 19:37:06 -08:00
Benson Wong	d18dc26d01	cmd/wol-proxy: tweak logs to show what is causing wake ups (#356) fix the extra wake ups being caused by wol-proxy * cmd/wol-proxy: tweak logs to show what is causing wake ups * cmd/wol-proxy: add skip wakeup * cmd/wol-proxy: replace ticker with SSE connection * cmd/wol-proxy: increase scanner buffer size * cmd/wol-proxy: improve failure tracking	2025-10-25 11:04:31 -07:00
Benson Wong	c07179d6e2	cmd/wol-proxy: add wol-proxy (#352) add a wake-on-lan proxy for llama-swap. When the target llama-swap server is unreachable, it will hold the request, send a WoL packet, and proxy the request once llama-swap is available.	2025-10-20 20:55:02 -07:00
Benson Wong	9fc0431531	Clean up and Documentation (#347) [skip ci] * cmd,misc: move misc binaries to cmd/ * docs: add docs and move examples/ there * misc: remove unused misc/assets dir * docs: add configuration.md * update README with better structure Updates: #334	2025-10-19 14:53:13 -07:00

11 Commits
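
The request collation described in the #790 commit message (ABCABC handled as AABBCC) can be illustrated with a small Go sketch. This is not llama-swap's actual code: the function names collate and countSwaps are hypothetical, and the sketch assumes that only one model is loaded at a time, that the whole queue is known up front, and that the initial load is not counted as a swap.

```go
package main

import "fmt"

// collate is a hypothetical helper (not from llama-swap). It groups queued
// requests by model, keeping models in order of first appearance and
// preserving arrival order within each model.
func collate(queue []string) []string {
	order := []string{}        // models in order of first appearance
	counts := map[string]int{} // queued requests per model
	for _, m := range queue {
		if counts[m] == 0 {
			order = append(order, m)
		}
		counts[m]++
	}
	out := make([]string, 0, len(queue))
	for _, m := range order {
		for i := 0; i < counts[m]; i++ {
			out = append(out, m)
		}
	}
	return out
}

// countSwaps counts how often the requested model differs from the one
// requested just before it. The initial load is assumed not to be a swap.
func countSwaps(seq []string) int {
	swaps := 0
	for i := 1; i < len(seq); i++ {
		if seq[i] != seq[i-1] {
			swaps++
		}
	}
	return swaps
}

func main() {
	queue := []string{"A", "B", "C", "A", "B", "C"}
	grouped := collate(queue)
	fmt.Println(queue, countSwaps(queue))     // [A B C A B C] 5
	fmt.Println(grouped, countSwaps(grouped)) // [A A B B C C] 2
}
```

A real router also has to handle requests that arrive while a swap is in progress, which is where the starvation parameter mentioned in the commit message would come in. The sketch leaves that out.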