- Login to git.wylab.me instead of ghcr.io
- Use Gitea-hosted llama.cpp-rocm base image instead of ghcr.io
- Rewrite fetch_llama_tag to use anonymous OCI registry API
- Add LS_UPSTREAM for release binary fetches on forks
- Add REGISTRY and BASE_TAG overrides for self-hosted builds
- Only build rocm platform
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Build the root image once, then derive the rootless variant from it
using a small inline Dockerfile that adds the non-root user and chowns
the writable directories. This halves the number of CI jobs (4 → 2) and
eliminates the redundant full CUDA compilation for the rootless variant.
- remove RUN_UID build arg from build-image.sh
- derive rootless image inline after root build completes
- collapse variant matrix out of unified-docker.yml
- push both root and rootless tags in a single CI job
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Keep request duration from being underreported when upstream timings
only cover part of the full request lifecycle.
- compare wall-clock and upstream timing durations
- keep token and throughput values from timings
- add regression coverage for underreported timings
Fixes #602
Add configurable HTTP timeout settings to both models and peers to support installations that require longer timeouts than the current hardcoded defaults.
Closes #618
Extend the existing config-schema workflow to also validate
config.example.yaml against config-schema.json using check-jsonschema.
- add config.example.yaml to PR and push path triggers
- install check-jsonschema via pip
- run validation of config.example.yaml against schema
https://claude.ai/code/session_01Y1oqwE6mwNs9UTJgZRgXtG
---------
Co-authored-by: Claude <noreply@anthropic.com>
Expose CMAKE_CUDA_ARCHITECTURES as a Docker build ARG so users can
customize CUDA architectures via --build-arg without editing the
Dockerfile.
- convert hardcoded ENV to ARG with default, feeding into ENV
- replace silent fallback defaults (:-) in scripts with :? guards
to fail fast if the env var is missing
- add usage example to Dockerfile header
Follow up to: #624
https://claude.ai/code/session_01EWiUe7jNABX7Uz95dUGJqK
Co-authored-by: Claude <noreply@anthropic.com>
Multiple fixes to the Vulkan build:
- use ubuntu 26.04 to be compatible with AMD 395+ (Strix Halo) hardware
- add home directory in container
- fix stable-diffusion install to actually enable vulkan
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- set up a GHA scheduled job to build the container nightly
- enable pushing llama-swap:unified and llama-swap:unified-Y-M-D
images to ghcr.io
- tidy up Dockerfile to use a non-root user and llama-swap as an entry
point
Add proxy routes for stable-diffusion.cpp's /sdapi/v1/txt2img,
/sdapi/v1/img2img, and /sdapi/v1/loras endpoints. POST endpoints
use proxyInferenceHandler (model in JSON body), GET /loras uses
proxyGETModelHandler (model in query param).
Update the image playground with a dual-mode UI supporting both
OpenAI and SDAPI backends. In SDAPI mode, loras are fetched first
to prime the server-side cache, and all txt2img parameters are
exposed (negative prompt, steps, cfg_scale, seed, batch_size,
clip_skip, sampler, scheduler, lora selection with multipliers).
- Add 3 sdapi route registrations in proxymanager.go
- Add sdApi.ts client with generateSdImage and fetchSdLoras
- Add SDAPI types (SdApiTxt2ImgRequest, SdApiResponse, etc.)
- Add /sdapi to vite dev proxy config
- Add backend tests for sdapi routing
- Support batch image display in gallery grid
https://claude.ai/code/session_0186MGX6NXdHVBTv2KH45fqn
---------
Co-authored-by: Claude <noreply@anthropic.com>
Upgrade vite and related dependencies to take advantage of Vite 8's
improved build times via Rolldown and Oxc.
- vite: ^6.3.5 → ^8.0.0
- @sveltejs/vite-plugin-svelte: ^5.0.3 → ^7.0.0
- svelte: ^5.19.0 → ^5.46.4
- vite-plugin-compression2: ^2.4.0 → ^2.5.1
- vitest: ^4.0.18 → ^4.1.0
---------
Co-authored-by: Claude <noreply@anthropic.com>
Use natural sorting for model names.
Previously the model list was sorted lexicographically, which resulted
in unintuitive ordering when numbers were included in the name.
Example:
Before
qwen3.5:2B
qwen3.5:35B-3AB
qwen3.5:9B
After
qwen3.5:2B
qwen3.5:9B
qwen3.5:35B-3AB
This change sorts models using natural order so numeric parts are
compared numerically.
Extend macro substitution to the name and description fields of
ModelConfig, matching the behavior already present for cmd, proxy,
checkEndpoint, and filters.
- substitute global/model macros (including MODEL_ID) in name and
description
- substitute PORT macro in name and description when allocated
- validate no unknown macros remain in name and description after
substitution
- add tests for macro substitution, MODEL_ID, and unknown macro error
Add a new configuration parameter globalTTL that all models will
inherit. The default value is 0, which matches the current
behavior of never automatically unloading a model.
The default for model.ttl has changed to -1, which means use the global
TTL value. Any model.ttl >= 0 is now valid, with 0 meaning never unload.
This allows a model to override a globalTTL > 0 and be configured to
never unload.
Fixes #459
Closes #512
Add a copy-to-clipboard button that appears on hover for each code block
rendered in the chat interface assistant messages.
- Svelte action `codeBlockCopy` injects a button into every `<pre>`
element
- MutationObserver reattaches buttons as streaming content arrives
- Button shows a check icon for 2 seconds after a successful copy
- Uses clipboard API with execCommand fallback for non-secure contexts
- CSS hides button by default and reveals it on pre:hover
https://claude.ai/code/session_01PTA5ao5YQuFAS6a9juLeZW
---------
Co-authored-by: Claude <noreply@anthropic.com>
Add `cuda13` as a supported build architecture, targeting the
`ghcr.io/ggml-org/llama.cpp:server-cuda13` upstream base image.
The `server-cuda13` image ships with CUDA 13 libraries, providing
improved performance on recent NVIDIA hardware compared to the existing
`server-cuda` (CUDA 12) image. Users with newer GPUs (e.g., RTX
50-series) benefit from reduced model load latency and higher token
throughput.
- Add `cuda13` to the allowed architectures list in
`docker/build-container.sh`
- Add `cuda13` to the CI matrix in `.github/workflows/containers.yml` so
the container is built and pushed automatically
Updated README to enhance the description of the web interface and added details about features like token metrics, request inspection, model management, and real-time log streaming.
Add a new Rerank tab to the playground that lets users test /v1/rerank
endpoints. Supports a visual table editor and a JSON editor mode that
stay in sync when toggling between them.
- add rerankApi.ts with typed wrapper for /v1/rerank
- add RerankInterface.svelte with query input, sortable document table,
color-coded scores, auto-add row, cancel/clear, and token usage
- add rerankLoading store to playgroundActivity derived store
- register Rerank tab in Playground.svelte
Updates #481
Add setParamsByID filter that applies different request parameters based
on the requested model ID, enabling per-alias behaviour for a single
loaded model.
- add SetParamsByID field to Filters struct and SanitizedSetParamsByID
method
- substitute ${MODEL_ID} and other macros in setParamsByID keys and
values
- validate no unknown macros remain in keys or values after substitution
- apply setParamsByID in proxyInferenceHandler after setParams (can
override it)
- update config-schema.json with setParamsByID definition
- update UI to show aliases and make them selectable in the Playground
Closes #534
Pause auto-scroll when the user scrolls up to review logs, and resume
when they scroll back to the bottom.
- add `userScrolledUp` state variable
- add `handleScroll` to detect scroll position with 40px threshold
- guard the auto-scroll effect with `!userScrolledUp`
Closes #529
- Keep Playground component mounted when navigating away, preserving
streaming/generating state
- Add animated gradient effect on Playground nav link when activity is
in progress
## Summary
- Add `--provenance=false` to docker build commands in
`build-container.sh`
- BuildKit attestation manifests are stored as untagged images in GHCR,
and the `delete-untagged-containers` cleanup job deletes them, breaking
the manifest list and causing `manifest unknown` errors on pull
- ref: https://github.com/actions/delete-package-versions/issues/162
Add saving request and response headers and bodies that go through
llama-swap in memory.
- captureBuffer added to configuration. Captures are enabled by default.
- 5MB of memory is allocated for req/response captures in a ring buffer.
Setting captureBuffer to 0 will disable captures.
- UI elements to view captured data added to Activity page. Includes
some QOL features like JSON formatting and recombining SSE chat streams
- capture saving is done at the byte level and has minimal impact on
llama-swap performance
Fixes #464
Ref #503
Reorganizes control placement in the playground interfaces and
improves form interactions for better UX, particularly on mobile
devices.
## Key Changes
- **AudioInterface & ImageInterface**: Moved "Clear" buttons from the
top control bar into the action button group below the form inputs for
better visual hierarchy and logical grouping
- **ImageInterface**:
- Added prompt clearing to the `clearImage()` function so the input
field is reset when clearing generated images
- Updated Clear button disabled state to also check if prompt is empty,
allowing users to clear an empty prompt
- Added responsive flex styling (`flex-1 md:flex-none`) to the Clear
button for better mobile layout
- **ExpandableTextarea**:
- Imported `untrack` from Svelte to properly handle reactive
dependencies
- Wrapped `expandedValue.length` in `untrack()` to prevent unnecessary
reactivity when setting cursor position
- Improved button visibility on mobile by changing opacity from
`opacity-0` to `opacity-60` with `md:opacity-0` breakpoint, making the
expand button more discoverable on touch devices
## Implementation Details
The `untrack()` usage in ExpandableTextarea ensures that reading the
text length doesn't create a reactive dependency, preventing potential
infinite loops while still allowing the effect to run when `isExpanded`
changes.
* .github/workflows: add UI tests and path-filter Go CI
Add ui-tests.yml workflow to run svelte type checking and vitest
on push/PR to main when ui-svelte/ files change.
- Add path filters to go-ci.yml and go-ci-windows.yml to skip
Go tests when only non-backend files change
- Filter on **/*.go, go.mod, go.sum, and Makefile
https://claude.ai/code/session_01E6acq54D8JjuE7pczxPGT7
* ui-svelte: remove unused declarations in SpeechInterface
Remove unused `generatedText` state and `clearAudio` function
that caused svelte-check errors.
https://claude.ai/code/session_01E6acq54D8JjuE7pczxPGT7
* .github/workflows: update Node.js to v24
Node 23 is end-of-life; bump to 24 in ui-tests.yml and release.yml.
https://claude.ai/code/session_01E6acq54D8JjuE7pczxPGT7
---------
Co-authored-by: Claude <noreply@anthropic.com>
Replace the legacy React UI with the new Svelte-based one. Introduce a Playground in the UI to quickly test out text, image, text to speech and speech to text models behind llama-swap.
Key Changes
New Svelte UI (ui-svelte/)
- Multi-tab Playground with Chat, Image Generation, Audio Transcription, and Speech interfaces
- Chat: message editing/regeneration, markdown rendering with LaTeX math support, image attachments, code syntax highlighting
- Image: size selector, download/fullscreen viewing
- Audio: transcription with peer support
- Speech: voice caching with manual refresh, download button
- Responsive mobile layout with collapsible navigation
- XSS fixes and accessibility improvements
Proxy Improvements
- Add gzip/brotli compression for UI static assets (proxy/ui_compress.go)
- Add GET /v1/audio/voices?model={model} endpoint for voice listing
- Add peer support for /v1/audio/transcriptions