Compare commits

...

24 Commits

Author SHA1 Message Date
Oleksandr Kuvshynov 408616adbd server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0, as we clear slot.drafted before.

To reproduce:
Run llama-server with devstral-2 as main model and devstral-2-small as
md, and verbose logging:

```
% ./build/bin/llama-server -v  \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug

Check the log:

slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 741
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 746
slot update_slots: id  3 | task 0 | accepted 16/0 draft tokens, new
n_tokens = 763
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 775
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 778
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 783
slot update_slots: id  3 | task 0 | accepted 8/0 draft tokens, new
n_tokens = 792
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 795
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 797
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 799
slot update_slots: id  3 | task 0 | accepted 0/0 draft tokens, new
n_tokens = 800
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 803
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 805
slot update_slots: id  3 | task 0 | accepted 6/0 draft tokens, new
n_tokens = 812
slot update_slots: id  3 | task 0 | accepted 3/0 draft tokens, new
n_tokens = 816
```

After the fix, get correct per round logging:

```
slot update_slots: id  3 | task 0 | accepted 7/8 draft tokens, new
n_tokens = 654
slot update_slots: id  3 | task 0 | accepted 1/2 draft tokens, new
n_tokens = 656
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 659
slot update_slots: id  3 | task 0 | accepted 1/16 draft tokens, new
n_tokens = 661
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 664
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 681
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 698
slot update_slots: id  3 | task 0 | accepted 3/4 draft tokens, new
n_tokens = 702
slot update_slots: id  3 | task 0 | accepted 5/12 draft tokens, new
n_tokens = 708
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 725
slot update_slots: id  3 | task 0 | accepted 1/1 draft tokens, new
n_tokens = 727
slot update_slots: id  3 | task 0 | accepted 8/16 draft tokens, new
n_tokens = 736
```
2025-12-20 10:57:40 +01:00
Xuan-Son Nguyen 9e39a1e6a9 server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options

* add docs

* load-on-startup

* fix

* Update common/arg.cpp

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-20 09:25:27 +01:00
Sigbjørn Skjæret 74e05131e9 ci : remove non-windows zip artifacts (#18201)
* remove non-windows zip artifacts

* add cuda dll links
2025-12-19 22:29:46 +01:00
Sigbjørn Skjæret f74747d886 ci : only save ccache on master (#18207) 2025-12-19 22:29:37 +01:00
Alfred ce734a8a2f ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (#17977)
* feat: implement real Q8_0

* feat: adding cmake option for configuring FP32 quantize group size

* typo: set() shall be used

---------

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
2025-12-19 09:42:28 -08:00
Pascal 14931a826e arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form

* arg: update doc

* arg: update test-arg-parser

* arg: address review feedback from ngxson

simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified

* arg: update doc
2025-12-19 18:01:56 +01:00
Julius Tischbein f99ef53d2a llama : Changing off_t to size_t for Windows (#18204) 2025-12-19 16:42:46 +02:00
Aman Gupta cc0a04343e server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input

This PR adds formatted strings to the server's send_error function

* llama-server: use string_format inline

* fix test
2025-12-19 12:10:00 +01:00
Xuan-Son Nguyen 98c1c7a7bf presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources

* update docs

* fix neg arg handling

* fix empty mmproj

* also filter out server-controlled args before to_ini()

* skip loading custom_models if not specified

* fix unset_reserved_args

* fix crash on windows
2025-12-19 12:08:20 +01:00
Aleksander Grygier acb73d8340 webui: Add editing attachments in user messages (#18147)
* feat: Enable editing attachments in user messages

* feat: Improvements for data handling & UI

* docs: Update Architecture diagrams

* chore: update webui build output

* refactor: Exports

* chore: update webui build output

* feat: Add handling paste for Chat Message Edit Form

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output
2025-12-19 11:14:07 +01:00
Daniel Bevenius 0a271d82b4 model-conversion : add verbose flag in run-org-model.py (#18194)
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.

The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.

The script will also be further cleaned/refactored in follow-up commits.
2025-12-19 08:43:16 +01:00
Naco Siren 52fc7fee8a android: fix missing screenshots for Android.md (#18156)
* Android basic sample app layout polish

* Add missing screenshots and polish android README doc

* Replace file blobs with URLs served by GitHub pages service.
2025-12-19 09:32:04 +02:00
Jeff Bolz cdbada8d10 vulkan: Add perf logger mode with concurrency (#17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).

GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
2025-12-19 06:36:46 +01:00
Xuan-Son Nguyen 8ea958d4d9 model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106)
* ASR with LFM2-Audio-1.5B

* Set rope_theta

* Fix comment

* Remove rope_theta setting

* Address PR feedback

* rename functions to conformer

* remove some redundant ggml_cont

* fix missing tensor

* add prefix "a." for conv tensors

* remove redundant reshape

* clean up

* add test model

---------

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
2025-12-19 00:18:01 +01:00
Pascal f9ec8858ed webui: display prompt processing stats (#18146)
* webui: display prompt processing stats

* feat: Improve UI of Chat Message Statistics

* chore: update webui build output

* refactor: Post-review improvements

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-18 17:55:03 +01:00
Taimur Ahmad f716588e63 ggml-cpu: extend support for RVV floating-point kernels (#17318)
* cmake: add BF16 RVV flag for ggml-cpu

* ggml-cpu: add floating-point conversion kernels

* ggml: add floating-point kernels

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: fix lmul in vec_dot_bf16

* ggml-cpu: change redsum to lmul 4, fix leftover

---------

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2025-12-18 16:02:09 +02:00
Xuan-Son Nguyen 4d1316c440 arg: fix ASAN error on sampler_type_names empty (#18167) 2025-12-18 14:30:32 +01:00
Sigbjørn Skjæret ec7b9329ae gguf-py : use copy-on-write mode for localtensor (#18162) 2025-12-18 13:45:38 +01:00
yulo 54189c0d39 remove i_major_dual (#18157)
Co-authored-by: zhang hui <you@example.com>
2025-12-18 12:50:56 +01:00
Aleksander Grygier 9ce64aed7d webui: Fix selecting generated output issues during active streaming (#18091)
* draft: incremental markdown rendering with stable blocks

* refactor: Logic improvements

* refactor: DRY Markdown post-processing logic

* refactor: ID generation improvements

* fix: Remove runes

* refactor: Clean up & add JSDocs

* chore: update webui static output

* fix: Add tick to prevent race conditions for rendering Markdown blocks

Suggestion from @ServeurpersoCom

Co-authored-by: Pascal <admin@serveurperso.com>

* chore: Run `npm audit fix`

* chore: update webui static output

* feat: Improve performance using global counter & id instead of UUID

* refactor: Enhance Markdown rendering with link and code features

* chore: update webui static output

* fix: Code block content extraction

* chore: update webui static output

* chore: update webui static output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-18 11:13:52 +01:00
Kim S. 900316da4e webui: fix chat screen shadow width (#18010)
* webui: fix chat screen shadow width

* chore: add index.html.gz
2025-12-18 11:08:42 +01:00
Johannes Gäßler 57c1e05643 llama: offload output layer to GPU first (#18148) 2025-12-18 08:12:18 +01:00
Sigbjørn Skjæret 9cff4cc554 convert : sort and use file parts from model index if present (#18043)
* keep file part order from model index

* treat index as authoritative

* sort index parts
2025-12-18 07:54:54 +01:00
Julius Tischbein 4d4f4cacd1 llama : Async DirectIO model loading on Linux (#18012)
* Uncached model read

* Removing additional --mmap arg

* Removing trailing whitespaces

* Adding fallback when O_DIRECT is not supported

* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp

* Adding maybe unused keyword for Mac and Windows.

* File seek aligned

* Removing all branches for direct_io in llama-model-loader.cpp

* Always use alignment from llama_file

* use_mmap=true
2025-12-18 08:27:19 +02:00
76 changed files with 3544 additions and 1164 deletions
+29
View File
@@ -70,6 +70,7 @@ jobs:
with:
key: macOS-latest-cmake-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -106,6 +107,7 @@ jobs:
with:
key: macOS-latest-cmake-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -142,6 +144,7 @@ jobs:
with:
key: macOS-latest-cmake-arm64-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dawn Dependency
id: dawn-depends
@@ -195,6 +198,7 @@ jobs:
with:
key: ubuntu-cpu-cmake-${{ matrix.build }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build Dependencies
id: build_depends
@@ -276,6 +280,7 @@ jobs:
with:
key: ubuntu-latest-cmake-sanitizer-${{ matrix.sanitizer }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -396,6 +401,7 @@ jobs:
with:
key: ubuntu-24-cmake-vulkan-deb
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -431,6 +437,7 @@ jobs:
with:
key: ubuntu-24-cmake-vulkan
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -490,6 +497,7 @@ jobs:
with:
key: ubuntu-24-cmake-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -562,6 +570,7 @@ jobs:
with:
key: ubuntu-latest-wasm-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Emscripten
run: |
@@ -609,6 +618,7 @@ jobs:
with:
key: ubuntu-22-cmake-hip
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with native CMake HIP support
id: cmake_build
@@ -641,6 +651,7 @@ jobs:
with:
key: ubuntu-22-cmake-musa
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with native CMake MUSA support
id: cmake_build
@@ -688,6 +699,7 @@ jobs:
with:
key: ubuntu-22-cmake-sycl
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -738,6 +750,7 @@ jobs:
with:
key: ubuntu-22-cmake-sycl-fp16
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -771,6 +784,7 @@ jobs:
with:
key: macOS-latest-cmake-ios
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -802,6 +816,7 @@ jobs:
with:
key: macOS-latest-cmake-tvos
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -863,6 +878,7 @@ jobs:
with:
key: macOS-latest-swift
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Download xcframework artifact
uses: actions/download-artifact@v4
@@ -905,6 +921,7 @@ jobs:
key: windows-msys2
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Setup ${{ matrix.sys }}
uses: msys2/setup-msys2@v2
@@ -973,6 +990,7 @@ jobs:
key: windows-latest-cmake-${{ matrix.build }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Download OpenBLAS
id: get_openblas
@@ -1077,6 +1095,7 @@ jobs:
with:
key: ubuntu-latest-cmake-cuda
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with CMake
run: |
@@ -1109,6 +1128,7 @@ jobs:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
@@ -1160,6 +1180,7 @@ jobs:
key: windows-latest-cmake-sycl
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install
run: |
@@ -1221,6 +1242,7 @@ jobs:
with:
key: ${{ github.job }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -1466,6 +1488,7 @@ jobs:
with:
key: ggml-ci-x64-cpu-low-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -1491,6 +1514,7 @@ jobs:
with:
key: ggml-ci-arm64-cpu-low-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -1516,6 +1540,7 @@ jobs:
with:
key: ggml-ci-x64-cpu-high-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -1541,6 +1566,7 @@ jobs:
with:
key: ggml-ci-arm64-cpu-high-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -1566,6 +1592,7 @@ jobs:
with:
key: ggml-ci-arm64-cpu-high-perf-sve
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -1701,6 +1728,7 @@ jobs:
with:
key: ggml-ci-arm64-cpu-kleidiai
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
@@ -2084,6 +2112,7 @@ jobs:
with:
key: ggml-ci-arm64-graviton4-kleidiai
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Test
id: ggml-ci
+8 -46
View File
@@ -66,16 +66,9 @@ jobs:
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -y -r llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip ./build/bin/*
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz -s ",./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts (zip)
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip
name: llama-bin-macos-arm64.zip
- name: Upload artifacts (tar)
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz
@@ -127,16 +120,9 @@ jobs:
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -y -r llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip ./build/bin/*
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz -s ",./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts (zip)
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip
name: llama-bin-macos-x64.zip
- name: Upload artifacts (tar)
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz
@@ -196,16 +182,9 @@ jobs:
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -y -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip ./build/bin/*
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.tar.gz --transform "s,./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts (zip)
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip
name: llama-bin-ubuntu-${{ matrix.build }}.zip
- name: Upload artifacts (tar)
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.tar.gz
@@ -256,16 +235,9 @@ jobs:
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -y -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip ./build/bin/*
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.tar.gz --transform "s,./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts (zip)
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip
name: llama-bin-ubuntu-vulkan-x64.zip
- name: Upload artifacts (tar)
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.tar.gz
@@ -716,16 +688,9 @@ jobs:
- name: Pack artifacts
id: pack_artifacts
run: |
zip -y -r llama-${{ steps.tag.outputs.name }}-xcframework.zip build-apple/llama.xcframework
tar -czvf llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz -C build-apple llama.xcframework
- name: Upload artifacts (zip)
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-xcframework.zip
name: llama-${{ steps.tag.outputs.name }}-xcframework.zip
- name: Upload artifacts (tar)
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz
@@ -797,7 +762,7 @@ jobs:
cp LICENSE ./build/bin/
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz --transform "s,./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts (tar)
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz
@@ -889,9 +854,6 @@ jobs:
with:
tag_name: ${{ steps.tag.outputs.name }}
body: |
> [!WARNING]
> **Release Format Update**: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
<details open>
${{ github.event.head_commit.message }}
@@ -911,8 +873,8 @@ jobs:
**Windows:**
- [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-x64.zip)
- [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-arm64.zip)
- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-13.1-x64.zip)
- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-12.4-x64.zip) - [CUDA 12.4 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/cudart-llama-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-13.1-x64.zip) - [CUDA 13.1 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/cudart-llama-bin-win-cuda-13.1-x64.zip)
- [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-vulkan-x64.zip)
- [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-sycl-x64.zip)
- [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-hip-radeon-x64.zip)
+46 -13
View File
@@ -96,6 +96,11 @@ common_arg & common_arg::set_sparam() {
return *this;
}
common_arg & common_arg::set_preset_only() {
is_preset_only = true;
return *this;
}
bool common_arg::in_example(enum llama_example ex) {
return examples.find(ex) != examples.end();
}
@@ -772,6 +777,11 @@ bool common_params_to_map(int argc, char ** argv, llama_example ex, std::map<com
}
auto opt = *arg_to_options[arg];
std::string val;
if (opt.value_hint == nullptr && opt.value_hint_2 == nullptr) {
// bool arg (need to reverse the meaning for negative args)
bool is_neg = std::find(opt.args_neg.begin(), opt.args_neg.end(), arg) != opt.args_neg.end();
val = is_neg ? "0" : "1";
}
if (opt.value_hint != nullptr) {
// arg with single value
check_arg(i);
@@ -873,7 +883,9 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
sampler_type_chars += common_sampler_type_to_chr(sampler);
sampler_type_names += common_sampler_type_to_str(sampler) + ";";
}
sampler_type_names.pop_back();
if (!sampler_type_names.empty()) {
sampler_type_names.pop_back(); // remove last semicolon
}
/**
@@ -1137,7 +1149,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_env("LLAMA_ARG_CTX_CHECKPOINTS").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"--cache-ram", "-cram"}, "N",
{"-cram", "--cache-ram"}, "N",
string_format("set the maximum cache size in MiB (default: %d, -1 - no limit, 0 - disable)"
"[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)", params.cache_ram_mib),
[](common_params & params, int value) {
@@ -1145,7 +1157,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_env("LLAMA_ARG_CACHE_RAM").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"--kv-unified", "-kvu"},
{"-kvu", "--kv-unified"},
"use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)",
[](common_params & params) {
params.kv_unified = true;
@@ -1194,7 +1206,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.system_prompt = value;
}
).set_examples({LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_DIFFUSION}));
).set_examples({LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_DIFFUSION, LLAMA_EXAMPLE_MTMD}));
add_opt(common_arg(
{"--perf"},
{"--no-perf"},
@@ -1413,7 +1425,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_sparam());
add_opt(common_arg(
{"--sampling-seq", "--sampler-seq"}, "SEQUENCE",
{"--sampler-seq", "--sampling-seq"}, "SEQUENCE",
string_format("simplified sequence for samplers that will be used (default: %s)", sampler_type_chars.c_str()),
[](common_params & params, const std::string & value) {
params.sampling.samplers = common_sampler_types_from_chars(value);
@@ -2071,26 +2083,26 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
));
add_opt(common_arg(
{"--override-tensor", "-ot"}, "<tensor name pattern>=<buffer type>,...",
{"-ot", "--override-tensor"}, "<tensor name pattern>=<buffer type>,...",
"override tensor buffer type", [](common_params & params, const std::string & value) {
parse_tensor_buffer_overrides(value, params.tensor_buft_overrides);
}
));
add_opt(common_arg(
{"--override-tensor-draft", "-otd"}, "<tensor name pattern>=<buffer type>,...",
{"-otd", "--override-tensor-draft"}, "<tensor name pattern>=<buffer type>,...",
"override tensor buffer type for draft model", [](common_params & params, const std::string & value) {
parse_tensor_buffer_overrides(value, params.speculative.tensor_buft_overrides);
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"--cpu-moe", "-cmoe"},
{"-cmoe", "--cpu-moe"},
"keep all Mixture of Experts (MoE) weights in the CPU",
[](common_params & params) {
params.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
}
).set_env("LLAMA_ARG_CPU_MOE"));
add_opt(common_arg(
{"--n-cpu-moe", "-ncmoe"}, "N",
{"-ncmoe", "--n-cpu-moe"}, "N",
"keep the Mixture of Experts (MoE) weights of the first N layers in the CPU",
[](common_params & params, int value) {
if (value < 0) {
@@ -2105,14 +2117,14 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_env("LLAMA_ARG_N_CPU_MOE"));
add_opt(common_arg(
{"--cpu-moe-draft", "-cmoed"},
{"-cmoed", "--cpu-moe-draft"},
"keep all Mixture of Experts (MoE) weights in the CPU for the draft model",
[](common_params & params) {
params.speculative.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_CPU_MOE_DRAFT"));
add_opt(common_arg(
{"--n-cpu-moe-draft", "-ncmoed"}, "N",
{"-ncmoed", "--n-cpu-moe-draft"}, "N",
"keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model",
[](common_params & params, int value) {
if (value < 0) {
@@ -2640,7 +2652,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_EMBEDDINGS"));
add_opt(common_arg(
{"--reranking", "--rerank"},
{"--rerank", "--reranking"},
string_format("enable reranking endpoint on server (default: %s)", "disabled"),
[](common_params & params) {
params.embedding = true;
@@ -3111,7 +3123,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
add_opt(common_arg(
{"--draft-max", "--draft", "--draft-n"}, "N",
{"--draft", "--draft-n", "--draft-max"}, "N",
string_format("number of tokens to draft for speculative decoding (default: %d)", params.speculative.n_max),
[](common_params & params, int value) {
params.speculative.n_max = value;
@@ -3487,3 +3499,24 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
return ctx_arg;
}
void common_params_add_preset_options(std::vector<common_arg> & args) {
// arguments below won't be treated as CLI args, only preset options
args.push_back(common_arg(
{"load-on-startup"}, "NAME",
"in server router mode, autoload this model on startup",
[](common_params &, const std::string &) { /* unused */ }
).set_env(COMMON_ARG_PRESET_LOAD_ON_STARTUP).set_preset_only());
// args.push_back(common_arg(
// {"pin"},
// "in server router mode, do not unload this model if models_max is exceeded",
// [](common_params &) { /* unused */ }
// ).set_preset_only());
// args.push_back(common_arg(
// {"unload-idle-seconds"}, "SECONDS",
// "in server router mode, unload models idle for more than this many seconds",
// [](common_params &, int) { /* unused */ }
// ).set_preset_only());
}
+10 -1
View File
@@ -8,6 +8,9 @@
#include <vector>
#include <cstring>
// pseudo-env variable to identify preset-only arguments
#define COMMON_ARG_PRESET_LOAD_ON_STARTUP "__PRESET_LOAD_ON_STARTUP"
//
// CLI argument parsing
//
@@ -22,6 +25,7 @@ struct common_arg {
const char * env = nullptr;
std::string help;
bool is_sparam = false; // is current arg a sampling param?
bool is_preset_only = false; // is current arg preset-only (not treated as CLI arg)
void (*handler_void) (common_params & params) = nullptr;
void (*handler_string) (common_params & params, const std::string &) = nullptr;
void (*handler_str_str)(common_params & params, const std::string &, const std::string &) = nullptr;
@@ -70,6 +74,7 @@ struct common_arg {
common_arg & set_excludes(std::initializer_list<enum llama_example> excludes);
common_arg & set_env(const char * env);
common_arg & set_sparam();
common_arg & set_preset_only();
bool in_example(enum llama_example ex);
bool is_exclude(enum llama_example ex);
bool get_value_from_env(std::string & output) const;
@@ -114,9 +119,13 @@ struct common_params_context {
bool common_params_parse(int argc, char ** argv, common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
// parse input arguments from CLI into a map
// TODO: support repeated args in the future
bool common_params_to_map(int argc, char ** argv, llama_example ex, std::map<common_arg, std::string> & out_map);
// populate preset-only arguments
// these arguments are not treated as command line arguments
// see: https://github.com/ggml-org/llama.cpp/issues/18163
void common_params_add_preset_options(std::vector<common_arg> & args);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
+197 -5
View File
@@ -2,6 +2,7 @@
#include "preset.h"
#include "peg-parser.h"
#include "log.h"
#include "download.h"
#include <fstream>
#include <sstream>
@@ -15,11 +16,22 @@ static std::string rm_leading_dashes(const std::string & str) {
return str.substr(pos);
}
std::vector<std::string> common_preset::to_args() const {
std::vector<std::string> common_preset::to_args(const std::string & bin_path) const {
std::vector<std::string> args;
if (!bin_path.empty()) {
args.push_back(bin_path);
}
for (const auto & [opt, value] : options) {
args.push_back(opt.args.back()); // use the last arg as the main arg
if (opt.is_preset_only) {
continue; // skip preset-only options (they are not CLI args)
}
// use the last arg as the main arg (i.e. --long-form)
args.push_back(opt.args.back());
// handle value(s)
if (opt.value_hint == nullptr && opt.value_hint_2 == nullptr) {
// flag option, no value
if (common_arg_utils::is_falsey(value)) {
@@ -63,6 +75,52 @@ std::string common_preset::to_ini() const {
return ss.str();
}
void common_preset::set_option(const common_preset_context & ctx, const std::string & env, const std::string & value) {
// try if option exists, update it
for (auto & [opt, val] : options) {
if (opt.env && env == opt.env) {
val = value;
return;
}
}
// if option does not exist, we need to add it
if (ctx.key_to_opt.find(env) == ctx.key_to_opt.end()) {
throw std::runtime_error(string_format(
"%s: option with env '%s' not found in ctx_params",
__func__, env.c_str()
));
}
options[ctx.key_to_opt.at(env)] = value;
}
void common_preset::unset_option(const std::string & env) {
for (auto it = options.begin(); it != options.end(); ) {
const common_arg & opt = it->first;
if (opt.env && env == opt.env) {
it = options.erase(it);
return;
} else {
++it;
}
}
}
bool common_preset::get_option(const std::string & env, std::string & value) const {
for (const auto & [opt, val] : options) {
if (opt.env && env == opt.env) {
value = val;
return true;
}
}
return false;
}
void common_preset::merge(const common_preset & other) {
for (const auto & [opt, val] : other.options) {
options[opt] = val; // overwrite existing options
}
}
static std::map<std::string, std::map<std::string, std::string>> parse_ini_from_file(const std::string & path) {
std::map<std::string, std::map<std::string, std::string>> parsed;
@@ -172,9 +230,14 @@ static std::string parse_bool_arg(const common_arg & arg, const std::string & ke
return value;
}
common_presets common_presets_load(const std::string & path, common_params_context & ctx_params) {
common_preset_context::common_preset_context(llama_example ex)
: ctx_params(common_params_parser_init(default_params, ex)) {
common_params_add_preset_options(ctx_params.options);
key_to_opt = get_map_key_opt(ctx_params);
}
common_presets common_preset_context::load_from_ini(const std::string & path, common_preset & global) const {
common_presets out;
auto key_to_opt = get_map_key_opt(ctx_params);
auto ini_data = parse_ini_from_file(path);
for (auto section : ini_data) {
@@ -188,7 +251,7 @@ common_presets common_presets_load(const std::string & path, common_params_conte
for (const auto & [key, value] : section.second) {
LOG_DBG("option: %s = %s\n", key.c_str(), value.c_str());
if (key_to_opt.find(key) != key_to_opt.end()) {
auto & opt = key_to_opt[key];
const auto & opt = key_to_opt.at(key);
if (is_bool_arg(opt)) {
preset.options[opt] = parse_bool_arg(opt, key, value);
} else {
@@ -199,8 +262,137 @@ common_presets common_presets_load(const std::string & path, common_params_conte
// TODO: maybe warn about unknown key?
}
}
if (preset.name == "*") {
// handle global preset
global = preset;
} else {
out[preset.name] = preset;
}
}
return out;
}
common_presets common_preset_context::load_from_cache() const {
common_presets out;
auto cached_models = common_list_cached_models();
for (const auto & model : cached_models) {
common_preset preset;
preset.name = model.to_string();
preset.set_option(*this, "LLAMA_ARG_HF_REPO", model.to_string());
out[preset.name] = preset;
}
return out;
}
struct local_model {
std::string name;
std::string path;
std::string path_mmproj;
};
common_presets common_preset_context::load_from_models_dir(const std::string & models_dir) const {
if (!std::filesystem::exists(models_dir) || !std::filesystem::is_directory(models_dir)) {
throw std::runtime_error(string_format("error: '%s' does not exist or is not a directory\n", models_dir.c_str()));
}
std::vector<local_model> models;
auto scan_subdir = [&models](const std::string & subdir_path, const std::string & name) {
auto files = fs_list(subdir_path, false);
common_file_info model_file;
common_file_info first_shard_file;
common_file_info mmproj_file;
for (const auto & file : files) {
if (string_ends_with(file.name, ".gguf")) {
if (file.name.find("mmproj") != std::string::npos) {
mmproj_file = file;
} else if (file.name.find("-00001-of-") != std::string::npos) {
first_shard_file = file;
} else {
model_file = file;
}
}
}
// single file model
local_model model{
/* name */ name,
/* path */ first_shard_file.path.empty() ? model_file.path : first_shard_file.path,
/* path_mmproj */ mmproj_file.path // can be empty
};
if (!model.path.empty()) {
models.push_back(model);
}
};
auto files = fs_list(models_dir, true);
for (const auto & file : files) {
if (file.is_dir) {
scan_subdir(file.path, file.name);
} else if (string_ends_with(file.name, ".gguf")) {
// single file model
std::string name = file.name;
string_replace_all(name, ".gguf", "");
local_model model{
/* name */ name,
/* path */ file.path,
/* path_mmproj */ ""
};
models.push_back(model);
}
}
// convert local models to presets
common_presets out;
for (const auto & model : models) {
common_preset preset;
preset.name = model.name;
preset.set_option(*this, "LLAMA_ARG_MODEL", model.path);
if (!model.path_mmproj.empty()) {
preset.set_option(*this, "LLAMA_ARG_MMPROJ", model.path_mmproj);
}
out[preset.name] = preset;
}
return out;
}
common_preset common_preset_context::load_from_args(int argc, char ** argv) const {
common_preset preset;
preset.name = COMMON_PRESET_DEFAULT_NAME;
bool ok = common_params_to_map(argc, argv, ctx_params.ex, preset.options);
if (!ok) {
throw std::runtime_error("failed to parse CLI arguments into preset");
}
return preset;
}
common_presets common_preset_context::cascade(const common_presets & base, const common_presets & added) const {
common_presets out = base; // copy
for (const auto & [name, preset_added] : added) {
if (out.find(name) != out.end()) {
// if exists, merge
common_preset & target = out[name];
target.merge(preset_added);
} else {
// otherwise, add directly
out[name] = preset_added;
}
}
return out;
}
common_presets common_preset_context::cascade(const common_preset & base, const common_presets & presets) const {
common_presets out;
for (const auto & [name, preset] : presets) {
common_preset tmp = base; // copy
tmp.name = name;
tmp.merge(preset);
out[name] = std::move(tmp);
}
return out;
}
+45 -3
View File
@@ -13,20 +13,62 @@
constexpr const char * COMMON_PRESET_DEFAULT_NAME = "default";
struct common_preset_context;
struct common_preset {
std::string name;
// TODO: support repeated args in the future
// options are stored as common_arg to string mapping, representing CLI arg and its value
std::map<common_arg, std::string> options;
// convert preset to CLI argument list
std::vector<std::string> to_args() const;
std::vector<std::string> to_args(const std::string & bin_path = "") const;
// convert preset to INI format string
std::string to_ini() const;
// TODO: maybe implement to_env() if needed
// modify preset options where argument is identified by its env variable
void set_option(const common_preset_context & ctx, const std::string & env, const std::string & value);
// unset option by its env variable
void unset_option(const std::string & env);
// get option value by its env variable, return false if not found
bool get_option(const std::string & env, std::string & value) const;
// merge another preset into this one, overwriting existing options
void merge(const common_preset & other);
};
// interface for multiple presets in one file
using common_presets = std::map<std::string, common_preset>;
common_presets common_presets_load(const std::string & path, common_params_context & ctx_params);
// context for loading and editing presets
struct common_preset_context {
common_params default_params; // unused for now
common_params_context ctx_params;
std::map<std::string, common_arg> key_to_opt;
common_preset_context(llama_example ex);
// load presets from INI file
common_presets load_from_ini(const std::string & path, common_preset & global) const;
// generate presets from cached models
common_presets load_from_cache() const;
// generate presets from local models directory
// for the directory structure, see "Using multiple models" in server/README.md
common_presets load_from_models_dir(const std::string & models_dir) const;
// generate one preset from CLI arguments
common_preset load_from_args(int argc, char ** argv) const;
// cascade multiple presets if exist on both: base < added
// if preset does not exist in base, it will be added without modification
common_presets cascade(const common_presets & base, const common_presets & added) const;
// apply presets over a base preset (same idea as CSS cascading)
common_presets cascade(const common_preset & base, const common_presets & presets) const;
};
+92 -7
View File
@@ -189,10 +189,10 @@ class ModelBase:
return tensors
prefix = "model" if not self.is_mistral_format else "consolidated"
part_names: set[str] = set(ModelBase.get_model_part_names(self.dir_model, prefix, ".safetensors"))
part_names: list[str] = ModelBase.get_model_part_names(self.dir_model, prefix, ".safetensors")
is_safetensors: bool = len(part_names) > 0
if not is_safetensors:
part_names = set(ModelBase.get_model_part_names(self.dir_model, "pytorch_model", ".bin"))
part_names = ModelBase.get_model_part_names(self.dir_model, "pytorch_model", ".bin")
tensor_names_from_index: set[str] = set()
@@ -209,7 +209,8 @@ class ModelBase:
if weight_map is None or not isinstance(weight_map, dict):
raise ValueError(f"Can't load 'weight_map' from {index_name!r}")
tensor_names_from_index.update(weight_map.keys())
part_names |= set(weight_map.values())
part_dict: dict[str, None] = dict.fromkeys(weight_map.values(), None)
part_names = sorted(part_dict.keys())
else:
weight_map = {}
else:
@@ -711,6 +712,9 @@ class ModelBase:
if "thinker_config" in config:
# rename for Qwen2.5-Omni
config["text_config"] = config["thinker_config"]["text_config"]
if "lfm" in config:
# rename for LFM2-Audio
config["text_config"] = config["lfm"]
return config
@classmethod
@@ -9712,12 +9716,12 @@ class LFM2Model(TextModel):
self._add_feed_forward_length()
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
is_vision_tensor = "vision_tower" in name or "multi_modal_projector" in name
if is_vision_tensor:
# skip vision tensors
if self._is_vision_tensor(name) or self._is_audio_tensor(name):
# skip multimodal tensors
return []
name = name.replace("language_model.", "")
name = name.replace("language_model.", "") # vision
name = name.replace("lfm.", "model.") # audio
# conv op requires 2d tensor
if 'conv.conv' in name:
@@ -9725,6 +9729,12 @@ class LFM2Model(TextModel):
return [(self.map_tensor_name(name), data_torch)]
def _is_vision_tensor(self, name: str) -> bool:
return "vision_tower" in name or "multi_modal_projector" in name
def _is_audio_tensor(self, name: str):
return any(p in name for p in ["audio", "codebook", "conformer", "depth_embedding", "depthformer", "depth_linear"])
@ModelBase.register("Lfm2MoeForCausalLM")
class LFM2MoeModel(TextModel):
@@ -9830,6 +9840,81 @@ class LFM2VLModel(MmprojModel):
return [] # skip other tensors
@ModelBase.register("Lfm2AudioForConditionalGeneration")
class LFM2AudioModel(MmprojModel):
has_vision_encoder = False
has_audio_encoder = True
model_name = "Lfm2AudioEncoder"
_batch_norm_tensors: list[dict[str, Tensor]] | None = None
def get_audio_config(self) -> dict[str, Any] | None:
return self.global_config.get("encoder")
def set_gguf_parameters(self):
assert self.hparams_audio is not None
self.hparams_audio["hidden_size"] = self.hparams_audio["d_model"]
self.hparams_audio["intermediate_size"] = self.hparams_audio["d_model"]
self.hparams_audio["num_attention_heads"] = self.hparams_audio["n_heads"]
super().set_gguf_parameters()
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.LFM2A)
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(1e-5)
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ".conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# skip language model tensors
if name.startswith("lfm."):
return []
# for training only
if any(p in name for p in ["audio_loss_weight"]):
return []
# for audio output
if any(p in name for p in ["codebook_offsets", "depth_embeddings", "depth_linear", "depthformer"]):
return []
# fold running_mean, running_var and eps into weight and bias for batch_norm
if "batch_norm" in name:
if self._batch_norm_tensors is None:
self._batch_norm_tensors = [{} for _ in range(self.block_count)]
assert bid is not None
self._batch_norm_tensors[bid][name] = data_torch
if len(self._batch_norm_tensors[bid]) < 5:
return []
weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
eps = 1e-5 # default value
a = weight / torch.sqrt(running_var + eps)
b = bias - running_mean * a
return [
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
]
# reshape conv weights
if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
data_torch = data_torch[:, None, None]
if "conv.depthwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[1] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
if "conv.pointwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[2] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])
return [(self.map_tensor_name(name), data_torch)]
@ModelBase.register("SmallThinkerForCausalLM")
class SmallThinkerModel(TextModel):
model_arch = gguf.MODEL_ARCH.SMALLTHINKER
+9 -9
View File
@@ -1,27 +1,27 @@
# Android
## Build with Android Studio
## Build GUI binding using Android Studio
Import the `examples/llama.android` directory into Android Studio, then perform a Gradle sync and build the project.
![Project imported into Android Studio](./android/imported-into-android-studio.png)
![Project imported into Android Studio](./android/imported-into-android-studio.jpg)
This Android binding supports hardware acceleration up to `SME2` for **Arm** and `AMX` for **x86-64** CPUs on Android and ChromeOS devices.
It automatically detects the host's hardware to load compatible kernels. As a result, it runs seamlessly on both the latest premium devices and older devices that may lack modern CPU features or have limited RAM, without requiring any manual configuration.
A minimal Android app frontend is included to showcase the bindings core functionalities:
1. **Parse GGUF metadata** via `GgufMetadataReader` from either a `ContentResolver` provided `Uri` or a local `File`.
2. **Obtain a `TierDetection` or `InferenceEngine`** instance through the high-level facade APIs.
3. **Send a raw user prompt** for automatic template formatting, prefill, and decoding. Then collect the generated tokens in a Kotlin `Flow`.
1. **Parse GGUF metadata** via `GgufMetadataReader` from either a `ContentResolver` provided `Uri` from shared storage, or a local `File` from your app's private storage.
2. **Obtain a `InferenceEngine`** instance through the `AiChat` facade and load your selected model via its app-private file path.
3. **Send a raw user prompt** for automatic template formatting, prefill, and batch decoding. Then collect the generated tokens in a Kotlin `Flow`.
For a production-ready experience that leverages advanced features such as system prompts and benchmarks, check out [Arm AI Chat](https://play.google.com/store/apps/details?id=com.arm.aichat) on Google Play.
For a production-ready experience that leverages advanced features such as system prompts and benchmarks, plus friendly UI features such as model management and Arm feature visualizer, check out [Arm AI Chat](https://play.google.com/store/apps/details?id=com.arm.aichat) on Google Play.
This project is made possible through a collaborative effort by Arm's **CT-ML**, **CE-ML** and **STE** groups:
| ![Home screen](./android/arm-ai-chat-home-screen.png) | ![System prompt](./android/system-prompt-setup.png) | !["Haiku"](./android/chat-with-system-prompt-haiku.png) |
| ![Home screen](https://naco-siren.github.io/ai-chat/policy/index/1-llm-starter-pack.png) | ![System prompt](https://naco-siren.github.io/ai-chat/policy/index/5-system-prompt.png) | !["Haiku"](https://naco-siren.github.io/ai-chat/policy/index/4-metrics.png) |
|:------------------------------------------------------:|:----------------------------------------------------:|:--------------------------------------------------------:|
| Home screen | System prompt | "Haiku" |
## Build on Android using Termux
## Build CLI on Android using Termux
[Termux](https://termux.dev/en/) is an Android terminal emulator and Linux environment app (no root required). As of writing, Termux is available experimentally in the Google Play Store; otherwise, it may be obtained directly from the project repo or on F-Droid.
@@ -52,7 +52,7 @@ To see what it might look like visually, here's an old demo of an interactive se
https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
## Cross-compile using Android NDK
## Cross-compile CLI using Android NDK
It's possible to build `llama.cpp` for Android on your host system via CMake and the Android NDK. If you are interested in this path, ensure you already have an environment prepared to cross-compile programs for Android (i.e., install the Android SDK). Note that, unlike desktop environments, the Android environment ships with a limited set of native libraries, and so only those libraries are available to CMake when building with the Android NDK (see: https://developer.android.com/ndk/guides/stable_apis.)
Once you're ready and have cloned `llama.cpp`, invoke the following in the project directory:
Binary file not shown.

After

Width:  |  Height:  |  Size: 479 KiB

@@ -22,6 +22,7 @@
"GGML_LLAMAFILE": "OFF",
"GGML_OPENCL": "ON",
"GGML_HEXAGON": "ON",
"GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128",
"LLAMA_CURL": "OFF"
}
},
@@ -36,6 +37,7 @@
"GGML_LLAMAFILE": "OFF",
"GGML_OPENCL": "ON",
"GGML_HEXAGON": "ON",
"GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128",
"LLAMA_CURL": "OFF"
}
},
View File
@@ -1,55 +1,57 @@
<?xml version="1.0" encoding="utf-8"?>
<androidx.constraintlayout.widget.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
xmlns:app="http://schemas.android.com/apk/res-auto"
xmlns:tools="http://schemas.android.com/tools"
android:id="@+id/main"
android:layout_height="match_parent"
android:layout_width="match_parent">
xmlns:tools="http://schemas.android.com/tools"
android:id="@+id/main"
android:layout_height="match_parent"
android:layout_width="match_parent">
<LinearLayout
android:fitsSystemWindows="true"
android:layout_width="match_parent"
android:layout_height="match_parent"
android:orientation="vertical"
android:layout_marginEnd="4dp"
tools:context=".MainActivity">
<FrameLayout
<ScrollView
android:layout_width="match_parent"
android:layout_height="0dp"
android:layout_weight="1">
android:layout_weight="1"
android:fadeScrollbars="false">
<ScrollView
<TextView
android:id="@+id/gguf"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:fadeScrollbars="false">
android:layout_margin="16dp"
android:text="Selected GGUF model's metadata will show here."
style="@style/TextAppearance.MaterialComponents.Body2" />
<TextView
android:id="@+id/gguf"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:layout_margin="16dp"
android:text="Selected GGUF model's metadata will show here."
style="@style/TextAppearance.MaterialComponents.Body2"
android:maxLines="100" />
</ScrollView>
</ScrollView>
</FrameLayout>
<com.google.android.material.divider.MaterialDivider
android:layout_width="match_parent"
android:layout_height="2dp"
android:layout_marginHorizontal="16dp"
android:layout_marginVertical="8dp" />
<androidx.recyclerview.widget.RecyclerView
android:id="@+id/messages"
android:layout_width="match_parent"
android:layout_height="0dp"
android:layout_weight="4"
android:padding="16dp"
android:fadeScrollbars="false"
android:scrollbars="vertical"
app:reverseLayout="true"
tools:listitem="@layout/item_message_assistant"/>
<LinearLayout
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:orientation="horizontal">
android:orientation="horizontal"
android:paddingStart="16dp"
android:paddingEnd="4dp">
<EditText
android:id="@+id/user_input"
@@ -67,7 +69,7 @@
style="@style/Widget.Material3.FloatingActionButton.Primary"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_margin="8dp"
android:layout_margin="12dp"
android:src="@drawable/outline_folder_open_24" />
</LinearLayout>
@@ -2,7 +2,8 @@
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:padding="8dp"
android:layout_marginHorizontal="16dp"
android:layout_marginVertical="8dp"
android:gravity="start">
<TextView
@@ -2,7 +2,8 @@
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:padding="8dp"
android:layout_marginHorizontal="16dp"
android:layout_marginVertical="8dp"
android:gravity="end">
<TextView
@@ -2,135 +2,22 @@
import argparse
import os
import sys
import importlib
from pathlib import Path
# Add parent directory to path for imports
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForImageTextToText, AutoConfig
import torch
import numpy as np
### If you want to dump RoPE activations, apply this monkey patch to the model
### class from Transformers that you are running (replace apertus.modeling_apertus
### with the proper package and class for your model
### === START ROPE DEBUG ===
# from transformers.models.apertus.modeling_apertus import apply_rotary_pos_emb
# orig_rope = apply_rotary_pos_emb
# torch.set_printoptions(threshold=float('inf'))
# torch.set_printoptions(precision=6, sci_mode=False)
# def debug_rope(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
# # log inputs
# summarize(q, "RoPE.q_in")
# summarize(k, "RoPE.k_in")
# # call original
# q_out, k_out = orig_rope(q, k, cos, sin, position_ids, unsqueeze_dim)
# # log outputs
# summarize(q_out, "RoPE.q_out")
# summarize(k_out, "RoPE.k_out")
# return q_out, k_out
# # Patch it
# import transformers.models.apertus.modeling_apertus as apertus_mod # noqa: E402
# apertus_mod.apply_rotary_pos_emb = debug_rope
### == END ROPE DEBUG ===
def summarize(tensor: torch.Tensor, name: str, max_seq: int = 3, max_vals: int = 3):
"""
Print a tensor in llama.cpp debug style.
Supports:
- 2D tensors (seq, hidden)
- 3D tensors (batch, seq, hidden)
- 4D tensors (batch, seq, heads, dim_per_head) via flattening heads × dim_per_head
Shows first and last max_vals of each vector per sequence position.
"""
t = tensor.detach().to(torch.float32).cpu()
# Determine dimensions
if t.ndim == 3:
_, s, _ = t.shape
elif t.ndim == 2:
_, s = 1, t.shape[0]
t = t.unsqueeze(0)
elif t.ndim == 4:
_, s, _, _ = t.shape
else:
print(f"Skipping tensor due to unsupported dimensions: {t.ndim}")
return
ten_shape = t.shape
print(f"ggml_debug: {name} = (f32) ... = {{{ten_shape}}}")
print(" [")
print(" [")
# Determine indices for first and last sequences
first_indices = list(range(min(s, max_seq)))
last_indices = list(range(max(0, s - max_seq), s))
# Check if there's an overlap between first and last indices or if we're at the edge case of s = 2 * max_seq
has_overlap = bool(set(first_indices) & set(last_indices)) or (max_seq * 2 == s)
# Combine indices
if has_overlap:
# If there's overlap, just use the combined unique indices
indices = sorted(list(set(first_indices + last_indices)))
separator_index = None
else:
# If no overlap, we'll add a separator between first and last sequences
indices = first_indices + last_indices
separator_index = len(first_indices)
for i, si in enumerate(indices):
# Add separator if needed
if separator_index is not None and i == separator_index:
print(" ...")
# Extract appropriate slice
vec = t[0, si]
if vec.ndim == 2: # 4D case: flatten heads × dim_per_head
flat = vec.flatten().tolist()
else: # 2D or 3D case
flat = vec.tolist()
# First and last slices
first = flat[:max_vals]
last = flat[-max_vals:] if len(flat) >= max_vals else flat
first_str = ", ".join(f"{v:12.4f}" for v in first)
last_str = ", ".join(f"{v:12.4f}" for v in last)
print(f" [{first_str}, ..., {last_str}]")
print(" ],")
print(" ]")
print(f" sum = {t.sum().item():.6f}\n")
def debug_hook(name):
def fn(_m, input, output):
if isinstance(input, torch.Tensor):
summarize(input, name + "_in")
elif isinstance(input, (tuple, list)) and len(input) > 0 and isinstance(input[0], torch.Tensor):
summarize(input[0], name + "_in")
if isinstance(output, torch.Tensor):
summarize(output, name + "_out")
elif isinstance(output, (tuple, list)) and len(output) > 0 and isinstance(output[0], torch.Tensor):
summarize(output[0], name + "_out")
return fn
unreleased_model_name = os.getenv("UNRELEASED_MODEL_NAME")
from utils.common import debug_hook
parser = argparse.ArgumentParser(description="Process model with specified path")
parser.add_argument("--model-path", "-m", help="Path to the model")
parser.add_argument("--prompt-file", "-f", help="Optional prompt file", required=False)
parser.add_argument("--verbose", "-v", action="store_true", help="Enable verbose debug output")
args = parser.parse_args()
model_path = os.environ.get("MODEL_PATH", args.model_path)
@@ -139,6 +26,12 @@ if model_path is None:
"Model path must be specified either via --model-path argument or MODEL_PATH environment variable"
)
### If you want to dump RoPE activations, uncomment the following lines:
### === START ROPE DEBUG ===
# from utils.common import setup_rope_debug
# setup_rope_debug("transformers.models.apertus.modeling_apertus")
### == END ROPE DEBUG ===
print("Loading model and tokenizer using AutoTokenizer:", model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
@@ -156,6 +49,7 @@ print("Number of layers: ", config.num_hidden_layers)
print("BOS token id: ", config.bos_token_id)
print("EOS token id: ", config.eos_token_id)
unreleased_model_name = os.getenv("UNRELEASED_MODEL_NAME")
if unreleased_model_name:
model_name_lower = unreleased_model_name.lower()
unreleased_module_path = (
@@ -184,9 +78,10 @@ else:
model_path, device_map="auto", offload_folder="offload", trust_remote_code=True, config=config
)
for name, module in model.named_modules():
if len(list(module.children())) == 0: # only leaf modules
module.register_forward_hook(debug_hook(name))
if args.verbose:
for name, module in model.named_modules():
if len(list(module.children())) == 0: # only leaf modules
module.register_forward_hook(debug_hook(name))
model_name = os.path.basename(model_path)
# Printing the Model class to allow for easier debugging. This can be useful
@@ -2,6 +2,8 @@
import os
import sys
import torch
def get_model_name_from_env_path(env_path_name):
model_path = os.getenv(env_path_name)
@@ -18,3 +20,131 @@ def get_model_name_from_env_path(env_path_name):
name = name[:-5]
return name
def summarize(tensor: torch.Tensor, name: str, max_seq: int = 3, max_vals: int = 3):
"""
Print a tensor in llama.cpp debug style.
Supports:
- 2D tensors (seq, hidden)
- 3D tensors (batch, seq, hidden)
- 4D tensors (batch, seq, heads, dim_per_head) via flattening heads × dim_per_head
Shows first and last max_vals of each vector per sequence position.
"""
t = tensor.detach().to(torch.float32).cpu()
# Determine dimensions
if t.ndim == 3:
_, s, _ = t.shape
elif t.ndim == 2:
_, s = 1, t.shape[0]
t = t.unsqueeze(0)
elif t.ndim == 4:
_, s, _, _ = t.shape
else:
print(f"Skipping tensor due to unsupported dimensions: {t.ndim}")
return
ten_shape = t.shape
print(f"ggml_debug: {name} = (f32) ... = {{{ten_shape}}}")
print(" [")
print(" [")
# Determine indices for first and last sequences
first_indices = list(range(min(s, max_seq)))
last_indices = list(range(max(0, s - max_seq), s))
# Check if there's an overlap between first and last indices or if we're at the edge case of s = 2 * max_seq
has_overlap = bool(set(first_indices) & set(last_indices)) or (max_seq * 2 == s)
# Combine indices
if has_overlap:
# If there's overlap, just use the combined unique indices
indices = sorted(list(set(first_indices + last_indices)))
separator_index = None
else:
# If no overlap, we'll add a separator between first and last sequences
indices = first_indices + last_indices
separator_index = len(first_indices)
for i, si in enumerate(indices):
# Add separator if needed
if separator_index is not None and i == separator_index:
print(" ...")
# Extract appropriate slice
vec = t[0, si]
if vec.ndim == 2: # 4D case: flatten heads × dim_per_head
flat = vec.flatten().tolist()
else: # 2D or 3D case
flat = vec.tolist()
# First and last slices
first = flat[:max_vals]
last = flat[-max_vals:] if len(flat) >= max_vals else flat
first_str = ", ".join(f"{v:12.4f}" for v in first)
last_str = ", ".join(f"{v:12.4f}" for v in last)
print(f" [{first_str}, ..., {last_str}]")
print(" ],")
print(" ]")
print(f" sum = {t.sum().item():.6f}\n")
def debug_hook(name):
def fn(_m, input, output):
if isinstance(input, torch.Tensor):
summarize(input, name + "_in")
elif isinstance(input, (tuple, list)) and len(input) > 0 and isinstance(input[0], torch.Tensor):
summarize(input[0], name + "_in")
if isinstance(output, torch.Tensor):
summarize(output, name + "_out")
elif isinstance(output, (tuple, list)) and len(output) > 0 and isinstance(output[0], torch.Tensor):
summarize(output[0], name + "_out")
return fn
def setup_rope_debug(model_module_path: str, function_name: str = "apply_rotary_pos_emb"):
"""
Apply monkey patch to dump RoPE activations for debugging.
Args:
model_module_path: Path to the model module (e.g., "transformers.models.apertus.modeling_apertus")
function_name: Name of the RoPE function to patch (default: "apply_rotary_pos_emb")
Example:
from utils.common import setup_rope_debug
setup_rope_debug("transformers.models.apertus.modeling_apertus")
"""
import importlib
# Import the module and get the original function
module = importlib.import_module(model_module_path)
orig_rope = getattr(module, function_name)
# Set torch print options for better debugging
torch.set_printoptions(threshold=float('inf'))
torch.set_printoptions(precision=6, sci_mode=False)
def debug_rope(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
# log inputs
summarize(q, "RoPE.q_in")
summarize(k, "RoPE.k_in")
# call original
q_out, k_out = orig_rope(q, k, cos, sin, position_ids, unsqueeze_dim)
# log outputs
summarize(q_out, "RoPE.q_out")
summarize(k_out, "RoPE.k_out")
return q_out, k_out
# Patch it
setattr(module, function_name, debug_rope)
print(f"RoPE debug patching applied to {model_module_path}.{function_name}")
+1
View File
@@ -254,6 +254,7 @@ set (GGML_OPENCL_TARGET_VERSION "300" CACHE STRING
"gmml: OpenCL API version to target")
option(GGML_HEXAGON "ggml: enable Hexagon backend" OFF)
set(GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE 128 CACHE STRING "ggml: quantize group size (32, 64, or 128)")
# toolchain for vulkan-shaders-gen
set (GGML_VULKAN_SHADERS_GEN_TOOLCHAIN "" CACHE FILEPATH "ggml: toolchain file for vulkan-shaders-gen")
+4
View File
@@ -458,6 +458,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_RV_ZFH)
string(APPEND MARCH_STR "_zfh")
endif()
if (GGML_XTHEADVECTOR)
string(APPEND MARCH_STR "_xtheadvector")
elseif (GGML_RVV)
@@ -465,6 +466,9 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_RV_ZVFH)
string(APPEND MARCH_STR "_zvfh")
endif()
if (GGML_RV_ZVFBFWMA)
string(APPEND MARCH_STR "_zvfbfwma")
endif()
endif()
if (GGML_RV_ZICBOP)
string(APPEND MARCH_STR "_zicbop")
+51 -6
View File
@@ -3320,13 +3320,33 @@ void ggml_cpu_fp16_to_fp32(const ggml_fp16_t * x, float * y, int64_t n) {
__m128 y_vec = _mm_cvtph_ps(x_vec);
_mm_storeu_ps(y + i, y_vec);
}
#elif defined(__riscv_zvfh)
for (int vl; i < n; i += vl) {
vl = __riscv_vsetvl_e16m1(n - i);
vfloat16m1_t vx = __riscv_vle16_v_f16m1((_Float16 *)&x[i], vl);
vfloat32m2_t vy = __riscv_vfwcvt_f_f_v_f32m2(vx, vl);
__riscv_vse32_v_f32m2(&y[i], vy, vl);
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfhmin)
// calculate step size
const int epr = __riscv_vsetvlmax_e16m2();
const int step = epr * 2;
const int np = (n & ~(step - 1));
// unroll by 2
for (; i < np; i += step) {
vfloat16m2_t ax0 = __riscv_vle16_v_f16m2((const _Float16*)x + i, epr);
vfloat32m4_t ay0 = __riscv_vfwcvt_f_f_v_f32m4(ax0, epr);
__riscv_vse32_v_f32m4(y + i, ay0, epr);
vfloat16m2_t ax1 = __riscv_vle16_v_f16m2((const _Float16*)x + i + epr, epr);
vfloat32m4_t ay1 = __riscv_vfwcvt_f_f_v_f32m4(ax1, epr);
__riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}
// leftovers
int vl;
for (i = np; i < n; i += vl) {
vl = __riscv_vsetvl_e16m2(n - i);
vfloat16m2_t ax0 = __riscv_vle16_v_f16m2((const _Float16*)x + i, vl);
vfloat32m4_t ay0 = __riscv_vfwcvt_f_f_v_f32m4(ax0, vl);
__riscv_vse32_v_f32m4(y + i, ay0, vl);
}
#endif
for (; i < n; ++i) {
@@ -3371,6 +3391,31 @@ void ggml_cpu_bf16_to_fp32(const ggml_bf16_t * x, float * y, int64_t n) {
(const __m128i *)(x + i))),
16)));
}
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfbfmin)
// calculate step size
const int epr = __riscv_vsetvlmax_e16m2();
const int step = epr * 2;
const int np = (n & ~(step - 1));
// unroll by 2
for (; i < np; i += step) {
vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, epr);
vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, epr);
__riscv_vse32_v_f32m4(y + i, ay0, epr);
vbfloat16m2_t ax1 = __riscv_vle16_v_bf16m2((const __bf16*)x + i + epr, epr);
vfloat32m4_t ay1 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax1, epr);
__riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}
// leftovers
int vl;
for (i = np; i < n; i += vl) {
vl = __riscv_vsetvl_e16m2(n - i);
vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, vl);
vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, vl);
__riscv_vse32_v_f32m4(y + i, ay0, vl);
}
#endif
for (; i < n; i++) {
y[i] = GGML_BF16_TO_FP32(x[i]);
+41 -1
View File
@@ -195,8 +195,48 @@ void ggml_vec_dot_bf16(int n, float * GGML_RESTRICT s, size_t bs, ggml_bf16_t *
sumf += (ggml_float)_mm_cvtss_f32(g);
#undef LOAD
#endif
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfbfwma)
size_t vl = __riscv_vsetvlmax_e32m4();
// initialize accumulators to all zeroes
vfloat32m4_t vsum0 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum1 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
// calculate step size
const size_t epr = __riscv_vsetvlmax_e16m2();
const size_t step = epr * 2;
const int np = (n & ~(step - 1));
// unroll by 2
for (; i < np; i += step) {
vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16 *)&x[i], epr);
vbfloat16m2_t ay0 = __riscv_vle16_v_bf16m2((const __bf16 *)&y[i], epr);
vsum0 = __riscv_vfwmaccbf16_vv_f32m4(vsum0, ax0, ay0, epr);
__asm__ __volatile__ ("" ::: "memory");
vbfloat16m2_t ax1 = __riscv_vle16_v_bf16m2((const __bf16 *)&x[i + epr], epr);
vbfloat16m2_t ay1 = __riscv_vle16_v_bf16m2((const __bf16 *)&y[i + epr], epr);
vsum1 = __riscv_vfwmaccbf16_vv_f32m4(vsum1, ax1, ay1, epr);
__asm__ __volatile__ ("" ::: "memory");
}
// accumulate in 1 register
vsum0 = __riscv_vfadd_vv_f32m4(vsum0, vsum1, vl);
// leftovers
for (i = np; i < n; i += vl) {
vl = __riscv_vsetvl_e16m2(n - i);
vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16 *)&x[i], vl);
vbfloat16m2_t ay0 = __riscv_vle16_v_bf16m2((const __bf16 *)&y[i], vl);
vsum0 = __riscv_vfwmaccbf16_vv_f32m4(vsum0, ax0, ay0, vl);
}
// reduce
vl = __riscv_vsetvlmax_e32m4();
vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m4_f32m1(vsum0, __riscv_vfmv_v_f_f32m1(0.0f, 1), vl);
sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);
#endif
for (; i < n; ++i) {
sumf += (ggml_float)(GGML_BF16_TO_FP32(x[i]) *
GGML_BF16_TO_FP32(y[i]));
+125 -22
View File
@@ -224,13 +224,71 @@ inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * GG
}
GGML_F16x_VEC_REDUCE(sumf[0], sum_00, sum_01, sum_02, sum_03);
GGML_F16x_VEC_REDUCE(sumf[1], sum_10, sum_11, sum_12, sum_13);
#elif defined(__riscv_v_intrinsic)
// todo: RVV impl
for (int i = 0; i < n; ++i) {
for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
sumf[j] += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[j][i])*GGML_CPU_FP16_TO_FP32(y[i]));
}
}
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
size_t vl = __riscv_vsetvlmax_e32m4();
// initialize accumulators to all zeroes
vfloat32m4_t vsum0_0 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum0_1 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum1_0 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum1_1 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
// calculate step size
const size_t epr = __riscv_vsetvlmax_e16m2();
const size_t step = epr * 2;
const int np = (n & ~(step - 1));
// unroll by 2 along the row dimension
for (int i = 0; i < np; i += step) {
vfloat16m2_t ay0 = __riscv_vle16_v_f16m2((const _Float16 *)(y + i), epr);
vfloat16m2_t ax0_0 = __riscv_vle16_v_f16m2((const _Float16 *)(x[0] + i), epr);
vfloat16m2_t ax1_0 = __riscv_vle16_v_f16m2((const _Float16 *)(x[1] + i), epr);
vsum0_0 = __riscv_vfwmacc_vv_f32m4(vsum0_0, ax0_0, ay0, epr);
vsum1_0 = __riscv_vfwmacc_vv_f32m4(vsum1_0, ax1_0, ay0, epr);
vfloat16m2_t ay1 = __riscv_vle16_v_f16m2((const _Float16 *)(y + i + epr), epr);
vfloat16m2_t ax0_1 = __riscv_vle16_v_f16m2((const _Float16 *)(x[0] + i + epr), epr);
vfloat16m2_t ax1_1 = __riscv_vle16_v_f16m2((const _Float16 *)(x[1] + i + epr), epr);
vsum0_1 = __riscv_vfwmacc_vv_f32m4(vsum0_1, ax0_1, ay1, epr);
vsum1_1 = __riscv_vfwmacc_vv_f32m4(vsum1_1, ax1_1, ay1, epr);
}
vfloat32m4_t vsum0 = __riscv_vfadd_vv_f32m4(vsum0_0, vsum0_1, vl);
vfloat32m4_t vsum1 = __riscv_vfadd_vv_f32m4(vsum1_0, vsum1_1, vl);
// leftovers
for (int i = np; i < n; i += vl) {
vl = __riscv_vsetvl_e16m2(n - i);
vfloat16m2_t ay = __riscv_vle16_v_f16m2((const _Float16 *)(y + i), vl);
vfloat16m2_t ax0 = __riscv_vle16_v_f16m2((const _Float16 *)(x[0] + i), vl);
vfloat16m2_t ax1 = __riscv_vle16_v_f16m2((const _Float16 *)(x[1] + i), vl);
vsum0 = __riscv_vfwmacc_vv_f32m4(vsum0, ax0, ay, vl);
vsum1 = __riscv_vfwmacc_vv_f32m4(vsum1, ax1, ay, vl);
}
// reduce
vl = __riscv_vsetvlmax_e32m2();
vfloat32m2_t acc0_0 = __riscv_vfadd_vv_f32m2(__riscv_vget_v_f32m4_f32m2(vsum0, 0),
__riscv_vget_v_f32m4_f32m2(vsum0, 1), vl);
vl = __riscv_vsetvlmax_e32m1();
vfloat32m1_t acc0_1 = __riscv_vfadd_vv_f32m1(__riscv_vget_v_f32m2_f32m1(acc0_0, 0),
__riscv_vget_v_f32m2_f32m1(acc0_0, 1), vl);
vfloat32m1_t redsum0 = __riscv_vfredusum_vs_f32m1_f32m1(
acc0_1, __riscv_vfmv_v_f_f32m1(0.0f, 1), vl);
vl = __riscv_vsetvlmax_e32m2();
vfloat32m2_t acc1_0 = __riscv_vfadd_vv_f32m2(__riscv_vget_v_f32m4_f32m2(vsum1, 0),
__riscv_vget_v_f32m4_f32m2(vsum1, 1), vl);
vl = __riscv_vsetvlmax_e32m1();
vfloat32m1_t acc1_1 = __riscv_vfadd_vv_f32m1(__riscv_vget_v_f32m2_f32m1(acc1_0, 0),
__riscv_vget_v_f32m2_f32m1(acc1_0, 1), vl);
vfloat32m1_t redsum1 = __riscv_vfredusum_vs_f32m1_f32m1(
acc1_1, __riscv_vfmv_v_f_f32m1(0.0f, 1), vl);
sumf[0] = __riscv_vfmv_f_s_f32m1_f32(redsum0);
sumf[1] = __riscv_vfmv_f_s_f32m1_f32(redsum1);
#else
const int np = (n & ~(GGML_F16_STEP - 1));
@@ -475,15 +533,39 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * GGML_RESTRICT y,
}
np = n;
#elif defined(__riscv_zvfh) // implies __riscv_v_intrinsic
const int np = n;
_Float16 hv = (_Float16)v;
for (int i = 0, avl; i < n; i += avl) {
avl = __riscv_vsetvl_e16m8(n - i);
vfloat16m8_t ax = __riscv_vle16_v_f16m8((const _Float16 *)&x[i], avl);
vfloat16m8_t ay = __riscv_vle16_v_f16m8((_Float16 *)&y[i], avl);
vfloat16m8_t ny = __riscv_vfmadd_vf_f16m8(ax, hv, ay, avl);
__riscv_vse16_v_f16m8((_Float16 *)&y[i], ny, avl);
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
const _Float16 scale = *(const _Float16*)(&s);
// calculate step size
const int epr = __riscv_vsetvlmax_e16m4();
const int step = epr * 2;
int np = (n & ~(step - 1));
// unroll by 2
for (int i = 0; i < np; i += step) {
vfloat16m4_t ax0 = __riscv_vle16_v_f16m4((const _Float16*)x + i, epr);
vfloat16m4_t ay0 = __riscv_vle16_v_f16m4((const _Float16*)y + i, epr);
ay0 = __riscv_vfmacc_vf_f16m4(ay0, scale, ax0, epr);
__riscv_vse16_v_f16m4((_Float16*)y + i, ay0, epr);
__asm__ __volatile__ ("" ::: "memory");
vfloat16m4_t ax1 = __riscv_vle16_v_f16m4((const _Float16*)x + i + epr, epr);
vfloat16m4_t ay1 = __riscv_vle16_v_f16m4((const _Float16*)y + i + epr, epr);
ay1 = __riscv_vfmacc_vf_f16m4(ay1, scale, ax1, epr);
__riscv_vse16_v_f16m4((_Float16*)y + i + epr, ay1, epr);
__asm__ __volatile__ ("" ::: "memory");
}
// leftovers
int vl;
for (int i = np; i < n; i += vl) {
vl = __riscv_vsetvl_e16m4(n - i);
vfloat16m4_t ax0 = __riscv_vle16_v_f16m4((const _Float16*)x + i, vl);
vfloat16m4_t ay0 = __riscv_vle16_v_f16m4((const _Float16*)y + i, vl);
ay0 = __riscv_vfmacc_vf_f16m4(ay0, scale, ax0, vl);
__riscv_vse16_v_f16m4((_Float16*)y + i, ay0, vl);
}
np = n;
#elif defined(GGML_SIMD)
const int np = (n & ~(GGML_F16_STEP - 1));
@@ -724,13 +806,34 @@ inline static void ggml_vec_scale_f16(const int n, ggml_fp16_t * y, const float
svst1_f16(pg, (__fp16 *)(y + np), out);
}
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
for (int i = 0, vl; i < n; i += vl) {
vl = __riscv_vsetvl_e16m2(n - i);
vfloat16m2_t vy = __riscv_vle16_v_f16m2((_Float16 *)&y[i], vl);
vfloat32m4_t vy32 = __riscv_vfwcvt_f_f_v_f32m4(vy, vl);
vy32 = __riscv_vfmul_vf_f32m4(vy32, v, vl);
vy = __riscv_vfncvt_f_f_w_f16m2(vy32, vl);
__riscv_vse16_v_f16m2((_Float16 *)&y[i], vy, vl);
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
const _Float16 scale = *(const _Float16*)(&s);
// calculate step size
const int epr = __riscv_vsetvlmax_e16m4();
const int step = epr * 2;
const int np = (n & ~(step - 1));
// unroll by 2
for (int i = 0; i < np; i += step) {
vfloat16m4_t ay0 = __riscv_vle16_v_f16m4((const _Float16*)y + i, epr);
ay0 = __riscv_vfmul_vf_f16m4(ay0, scale, epr);
__riscv_vse16_v_f16m4((_Float16*)y + i, ay0, epr);
__asm__ __volatile__ ("" ::: "memory");
vfloat16m4_t ay1 = __riscv_vle16_v_f16m4((const _Float16*)y + i + epr, epr);
ay1 = __riscv_vfmul_vf_f16m4(ay1, scale, epr);
__riscv_vse16_v_f16m4((_Float16*)y + i + epr, ay1, epr);
__asm__ __volatile__ ("" ::: "memory");
}
// leftovers
int vl;
for (int i = np; i < n; i += vl) {
vl = __riscv_vsetvl_e16m4(n - i);
vfloat16m4_t ay0 = __riscv_vle16_v_f16m4((const _Float16*)y + i, vl);
ay0 = __riscv_vfmul_vf_f16m4(ay0, scale, vl);
__riscv_vse16_v_f16m4((_Float16*)y + i, ay0, vl);
}
#elif defined(GGML_SIMD)
const int np = (n & ~(GGML_F16_STEP - 1));
+84 -45
View File
@@ -78,27 +78,25 @@ namespace ggml_cuda_mma {
// MIRRORED == Each data value is held exactly once per thread subgroup.
DATA_LAYOUT_I_MAJOR = 0, // Always used for Turing, Ampere, Ada Lovelace, consumer Blackwell, matrix A&B for RDNA4 and CDNA.
DATA_LAYOUT_J_MAJOR = 10, // Matrix C for CDNA and RDNA4, int and float matrix C for RDNA3.
DATA_LAYOUT_I_MAJOR_MIRRORED = 20,
DATA_LAYOUT_I_MAJOR_MIRRORED = 20, // Volta, matrix A&B for RDNA3.
DATA_LAYOUT_J_MAJOR_MIRRORED = 30,
DATA_LAYOUT_I_MAJOR_DUAL = 40, // Matrix A&B for RDNA3.
};
// Implemented mma combinations are:
// - (I_MAJOR, I_MAJOR) -> I_MAJOR
// - (I_MAJOR, I_MAJOR_MIRRORED) -> I_MAJOR
// - (I_MAJOR, J_MAJOR_MIRRORED) -> I_MAJOR
constexpr bool is_i_major(const data_layout dl) {
static constexpr bool is_i_major(const data_layout dl) {
return dl == DATA_LAYOUT_I_MAJOR ||
dl == DATA_LAYOUT_I_MAJOR_MIRRORED ||
dl == DATA_LAYOUT_I_MAJOR_DUAL;
dl == DATA_LAYOUT_I_MAJOR_MIRRORED;
}
constexpr data_layout get_input_data_layout() {
#if defined(RDNA3)
return DATA_LAYOUT_I_MAJOR_DUAL;
static constexpr __device__ data_layout get_input_data_layout() {
#if defined(RDNA3) || __CUDA_ARCH__ == GGML_CUDA_CC_VOLTA
return DATA_LAYOUT_I_MAJOR_MIRRORED;
#else
return DATA_LAYOUT_I_MAJOR;
#endif // defined(RDNA3)
#endif // defined(RDNA3) || __CUDA_ARCH__ == GGML_CUDA_CC_VOLTA
}
template <int I_, int J_, typename T, data_layout ds_=DATA_LAYOUT_I_MAJOR>
@@ -462,11 +460,65 @@ namespace ggml_cuda_mma {
}
};
template <int I_, int J_, typename T>
struct tile<I_, J_, T, DATA_LAYOUT_I_MAJOR_MIRRORED> {
static constexpr int I = I_;
static constexpr int J = J_;
static constexpr data_layout dl = DATA_LAYOUT_I_MAJOR_MIRRORED;
// RDNA3
static constexpr int ne = I * J / 32 * 2;
T x[ne] = {0};
static constexpr __device__ bool supported() {
if (I == 16 && J == 16) return true;
if (I == 16 && J == 8) return true;
if (I == 16 && J == 4) return true;
return false;
}
static __device__ __forceinline__ int get_i(const int /*l*/) {
if constexpr (supported()) {
return threadIdx.x % 16;
} else {
NO_DEVICE_CODE;
return -1;
}
}
static __device__ __forceinline__ int get_j(const int l) {
if constexpr (supported()) {
return l;
} else {
NO_DEVICE_CODE;
return -1;
}
}
};
template <int I_, int J_>
struct tile<I_, J_, half2, DATA_LAYOUT_I_MAJOR_MIRRORED> {
static constexpr int I = I_;
static constexpr int J = J_;
static constexpr data_layout dl = DATA_LAYOUT_I_MAJOR_MIRRORED;
#if defined(RDNA3)
static constexpr int ne = tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::ne;
half2 x[ne] = {{0.0f, 0.0f}};
static constexpr __device__ bool supported() {
return tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::supported();
}
static __device__ __forceinline__ int get_i(const int l) {
return tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::get_i(l);
}
static __device__ __forceinline__ int get_j(const int l) {
return tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::get_j(l);
}
#else // Volta
static constexpr int ne = I * J / (WARP_SIZE/4);
half2 x[ne] = {{0.0f, 0.0f}};
@@ -493,6 +545,29 @@ namespace ggml_cuda_mma {
return -1;
}
}
#endif // defined(RDNA3)
};
template <int I_, int J_>
struct tile<I_, J_, nv_bfloat162, DATA_LAYOUT_I_MAJOR_MIRRORED> {
static constexpr int I = I_;
static constexpr int J = J_;
static constexpr data_layout dl = DATA_LAYOUT_I_MAJOR_MIRRORED;
static constexpr int ne = tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::ne;
nv_bfloat162 x[ne] = {{0.0f, 0.0f}};
static constexpr __device__ bool supported() {
return tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::supported();
}
static __device__ __forceinline__ int get_i(const int l) {
return tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::get_i(l);
}
static __device__ __forceinline__ int get_j(const int l) {
return tile<I_, J_, float, DATA_LAYOUT_I_MAJOR_MIRRORED>::get_j(l);
}
};
template <int I_, int J_>
@@ -528,42 +603,6 @@ namespace ggml_cuda_mma {
}
};
template <int I_, int J_, typename T>
struct tile<I_, J_, T, DATA_LAYOUT_I_MAJOR_DUAL> {
static constexpr int I = I_;
static constexpr int J = J_;
static constexpr data_layout dl = DATA_LAYOUT_I_MAJOR_DUAL;
static constexpr int ne = I * J / 32 * 2;
T x[ne] = {0};
static constexpr __device__ bool supported() {
if (I == 16 && J == 16) return true;
if (I == 16 && J == 8) return true;
if (I == 16 && J == 4) return true;
return false;
}
static __device__ __forceinline__ int get_i(const int l) {
if constexpr (supported()) {
return threadIdx.x % 16;
} else {
NO_DEVICE_CODE;
return -1;
}
}
static __device__ __forceinline__ int get_j(const int l) {
if constexpr (supported()) {
return l;
} else {
NO_DEVICE_CODE;
return -1;
}
}
};
#if defined(TURING_MMA_AVAILABLE)
template <int I, int J>
static __device__ __forceinline__ tile<I, J/2, half2> get_half2(const tile<I, J, float> & tile_float) {
+14 -20
View File
@@ -102,31 +102,25 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const int
const int threads = 128;
GGML_ASSERT(nr % threads == 0);
if (n_t <= 32) {
const dim3 blocks(n_s, (nr + threads - 1) / threads, 1);
if (nc == 4) {
ssm_conv_f32<threads, 4><<<blocks, threads, 0, stream>>>(src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1,
dst, dst_nb0, dst_nb1, dst_nb2, n_t);
} else if (nc == 3) {
ssm_conv_f32<threads, 3><<<blocks, threads, 0, stream>>>(src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1,
dst, dst_nb0, dst_nb1, dst_nb2, n_t);
auto launch_kernel = [&](auto NC) {
constexpr int kNC = decltype(NC)::value;
if (n_t <= 32) {
const dim3 blocks(n_s, (nr + threads - 1) / threads, 1);
ssm_conv_f32<threads, kNC><<<blocks, threads, 0, stream>>>(src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1,
dst, dst_nb0, dst_nb1, dst_nb2, n_t);
} else {
GGML_ABORT("Only support kernel size = 3 or size = 4 right now.");
}
} else {
if (nc == 4) {
const int64_t split_n_t = 32;
dim3 blocks(n_s, (nr + threads - 1) / threads, (n_t + split_n_t - 1) / split_n_t);
ssm_conv_long_token_f32<threads, 4, split_n_t><<<blocks, threads, 0, stream>>>(
ssm_conv_long_token_f32<threads, kNC, split_n_t><<<blocks, threads, 0, stream>>>(
src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1, dst, dst_nb0, dst_nb1, dst_nb2, n_t);
} else if (nc == 3) {
const int64_t split_n_t = 32;
dim3 blocks(n_s, (nr + threads - 1) / threads, (n_t + split_n_t - 1) / split_n_t);
ssm_conv_long_token_f32<threads, 3, split_n_t><<<blocks, threads, 0, stream>>>(
src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1, dst, dst_nb0, dst_nb1, dst_nb2, n_t);
} else {
GGML_ABORT("Only support kernel size = 3 or size = 4 right now.");
}
};
switch (nc) {
case 3: launch_kernel(std::integral_constant<int, 3>{}); break;
case 4: launch_kernel(std::integral_constant<int, 4>{}); break;
case 9: launch_kernel(std::integral_constant<int, 9>{}); break;
default: GGML_ABORT("Only support kernel sizes 3, 4, 9 right now.");
}
}
+3 -1
View File
@@ -2,6 +2,7 @@ include(${HEXAGON_SDK_ROOT}/build/cmake/hexagon_fun.cmake)
include(ExternalProject)
option(GGML_HEXAGON_HTP_DEBUG "ggml-hexagon: enable HTP debug output" OFF)
set(GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE 128 CACHE STRING "ggml-hexagon: quantize group size (32, 64, or 128)")
add_library(htp_iface OBJECT
${CMAKE_CURRENT_BINARY_DIR}/htp_iface_stub.c)
@@ -41,7 +42,8 @@ set(HTP_CMAKE_ARGS
-DCMAKE_INSTALL_LIBDIR=${CMAKE_CURRENT_BINARY_DIR}
-DHEXAGON_SDK_ROOT=$ENV{HEXAGON_SDK_ROOT}
-DHEXAGON_TOOLS_ROOT=$ENV{HEXAGON_TOOLS_ROOT}
-DHEXAGON_HTP_DEBUG=${GGML_HEXAGON_HTP_DEBUG})
-DHEXAGON_HTP_DEBUG=${GGML_HEXAGON_HTP_DEBUG}
-DGGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE=${GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE})
ExternalProject_Add(htp-v68
SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/htp BUILD_ALWAYS ON
+2 -1
View File
@@ -31,7 +31,8 @@ add_library(${HTP_LIB} SHARED
)
target_compile_definitions(${HTP_LIB} PRIVATE
$<IF:$<BOOL:${HEXAGON_HTP_DEBUG}>,HTP_DEBUG=1,NDEBUG=1>)
$<IF:$<BOOL:${HEXAGON_HTP_DEBUG}>,HTP_DEBUG=1,NDEBUG=1>
FP32_QUANTIZE_GROUP_SIZE=${GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE})
build_idl(htp_iface.idl ${HTP_LIB})
+139
View File
@@ -92,6 +92,18 @@ static const uint8_t __attribute__((aligned(128))) repl_1x_fp16[128] = {
0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
};
// vdelta control to replicate first fp16 value across all elements
static const uint8_t __attribute__((aligned(128))) repl_2x_fp16[128] = {
0x00, 0x00, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x10, 0x10, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x20, 0x20, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x10, 0x10, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x00, 0x00, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x10, 0x10, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x20, 0x20, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
0x10, 0x10, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02, 0x08, 0x08, 0x02, 0x02, 0x04, 0x04, 0x02, 0x02,
};
// vdelta control to expand first 32 e8m0 values into 32 uint32 elements
static const uint8_t __attribute__((aligned(128))) expand_x32_e8m0[128] = {
0x00, 0x00, 0x00, 0x00, 0x01, 0x04, 0x00, 0x00, 0x02, 0x00, 0x08, 0x08, 0x01, 0x02, 0x00, 0x04, 0x04, 0x00, 0x00,
@@ -1594,6 +1606,118 @@ static void matmul_f16_f32(struct htp_tensor * restrict src0,
// *** dynamic quant
static inline void quantize_block_fp32_q8x1(float * restrict x, uint8_t * restrict y_q, uint8_t * restrict y_d) {
assert((unsigned long) x % 128 == 0);
assert((unsigned long) y_q % 128 == 0);
HVX_Vector * vx = (HVX_Vector *) x;
HVX_Vector zero = Q6_V_vsplat_R(0);
// Use reduce max fp32 to find max(abs(e)) first
HVX_Vector vmax0_sf = hvx_vec_reduce_max_fp32(hvx_vec_abs_fp32(vx[0]));
HVX_Vector vmax1_sf = hvx_vec_reduce_max_fp32(hvx_vec_abs_fp32(vx[1]));
HVX_Vector vmax2_sf = hvx_vec_reduce_max_fp32(hvx_vec_abs_fp32(vx[2]));
HVX_Vector vmax3_sf = hvx_vec_reduce_max_fp32(hvx_vec_abs_fp32(vx[3]));
// Load and convert into QF32
HVX_Vector vx0_qf = Q6_Vqf32_vsub_VsfVsf(vx[0], zero); // 32 elements
HVX_Vector vx1_qf = Q6_Vqf32_vsub_VsfVsf(vx[1], zero); // 32 elements
HVX_Vector vx2_qf = Q6_Vqf32_vsub_VsfVsf(vx[2], zero); // 32 elements
HVX_Vector vx3_qf = Q6_Vqf32_vsub_VsfVsf(vx[3], zero); // 32 elements
// Convert to QF32
HVX_Vector vmax0_qf = Q6_Vqf32_vsub_VsfVsf(vmax0_sf, zero);
HVX_Vector vmax1_qf = Q6_Vqf32_vsub_VsfVsf(vmax1_sf, zero);
HVX_Vector vmax2_qf = Q6_Vqf32_vsub_VsfVsf(vmax2_sf, zero);
HVX_Vector vmax3_qf = Q6_Vqf32_vsub_VsfVsf(vmax3_sf, zero);
// Combine and convert to fp16
HVX_Vector vmax01_hf = Q6_Vh_vdeal_Vh(Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(vmax1_qf, vmax0_qf)));
HVX_Vector vmax23_hf = Q6_Vh_vdeal_Vh(Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(vmax3_qf, vmax2_qf)));
// Convert into fp16
HVX_Vector vx01_hf = Q6_Vh_vdeal_Vh(Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(vx1_qf, vx0_qf)));
HVX_Vector vx23_hf = Q6_Vh_vdeal_Vh(Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(vx3_qf, vx2_qf)));
// Replicate first fp16 scale across all lanes
HVX_Vector ctrl = *(const HVX_Vector *) repl_2x_fp16;
vmax01_hf = Q6_V_vdelta_VV(vmax01_hf, ctrl);
vmax23_hf = Q6_V_vdelta_VV(vmax23_hf, ctrl);
HVX_Vector vd01_qf16 = Q6_Vqf16_vmpy_VhfVhf(vmax01_hf, Q6_Vh_vsplat_R(0x2008)); // 1.0 / 127.0
HVX_Vector vd23_qf16 = Q6_Vqf16_vmpy_VhfVhf(vmax23_hf, Q6_Vh_vsplat_R(0x2008)); // 1.0 / 127.0
HVX_Vector vd01_hf = Q6_Vhf_equals_Vqf16(vd01_qf16);
HVX_Vector vd23_hf = Q6_Vhf_equals_Vqf16(vd23_qf16);
hvx_vec_store_u(y_d + 0, 2, vd01_hf);
HVX_Vector rotated_vd_hf = Q6_V_vror_VR(vd01_hf, 64);
hvx_vec_store_u(y_d + 2, 2, rotated_vd_hf);
hvx_vec_store_u(y_d + 4, 2, vd23_hf);
rotated_vd_hf = Q6_V_vror_VR(vd23_hf, 64);
hvx_vec_store_u(y_d + 6, 2, rotated_vd_hf);
// Divide input by the scale
HVX_Vector vd01_inv_hf = hvx_vec_inverse_fp16(vd01_hf);
HVX_Vector vd23_inv_hf = hvx_vec_inverse_fp16(vd23_hf);
vx01_hf = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(vx01_hf, vd01_inv_hf));
vx23_hf = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(vx23_hf, vd23_inv_hf));
// Convert to int8
HVX_Vector vx01_i16 = hvx_vec_i16_from_hf_rnd_sat(vx01_hf);
HVX_Vector vx23_i16 = hvx_vec_i16_from_hf_rnd_sat(vx23_hf);
HVX_Vector vx_i8 = Q6_Vb_vpack_VhVh_sat(vx23_i16, vx01_i16);
*(HVX_Vector *) y_q = vx_i8;
}
static inline void quantize_block_fp32_q8x2(float * restrict x, uint8_t * restrict y_q, uint8_t * restrict y_d) {
assert((unsigned long) x % 128 == 0);
assert((unsigned long) y_q % 128 == 0);
HVX_Vector * vx = (HVX_Vector *) x;
// Load and convert into QF32
HVX_Vector zero = Q6_V_vsplat_R(0);
HVX_Vector vx0_qf = Q6_Vqf32_vsub_VsfVsf(vx[0], zero); // 32 elements
HVX_Vector vx1_qf = Q6_Vqf32_vsub_VsfVsf(vx[1], zero); // 32 elements
HVX_Vector vx2_qf = Q6_Vqf32_vsub_VsfVsf(vx[2], zero); // 32 elements
HVX_Vector vx3_qf = Q6_Vqf32_vsub_VsfVsf(vx[3], zero); // 32 elements
// Convert into fp16
HVX_Vector vx01_hf = Q6_Vh_vdeal_Vh(Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(vx1_qf, vx0_qf)));
HVX_Vector vx23_hf = Q6_Vh_vdeal_Vh(Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(vx3_qf, vx2_qf)));
// Compute max and scale
HVX_Vector vmax01_hf = hvx_vec_reduce_max_fp16(hvx_vec_abs_fp16(vx01_hf));
HVX_Vector vmax23_hf = hvx_vec_reduce_max_fp16(hvx_vec_abs_fp16(vx23_hf));
// Replicate first fp16 scale across all lanes
HVX_Vector ctrl = *(const HVX_Vector *) repl_1x_fp16;
vmax01_hf = Q6_V_vdelta_VV(vmax01_hf, ctrl);
vmax23_hf = Q6_V_vdelta_VV(vmax23_hf, ctrl);
HVX_Vector vd01_qf16 = Q6_Vqf16_vmpy_VhfVhf(vmax01_hf, Q6_Vh_vsplat_R(0x2008)); // 1.0 / 127.0
HVX_Vector vd23_qf16 = Q6_Vqf16_vmpy_VhfVhf(vmax23_hf, Q6_Vh_vsplat_R(0x2008)); // 1.0 / 127.0
HVX_Vector vd01_hf = Q6_Vhf_equals_Vqf16(vd01_qf16);
HVX_Vector vd23_hf = Q6_Vhf_equals_Vqf16(vd23_qf16);
hvx_vec_store_u(y_d + 0, 4, vd01_hf);
hvx_vec_store_u(y_d + 4, 4, vd23_hf);
// Divide input by the scale
HVX_Vector vd01_inv_hf = hvx_vec_inverse_fp16(vd01_hf);
HVX_Vector vd23_inv_hf = hvx_vec_inverse_fp16(vd23_hf);
vx01_hf = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(vx01_hf, vd01_inv_hf));
vx23_hf = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(vx23_hf, vd23_inv_hf));
// Convert to int8
HVX_Vector vx01_i16 = hvx_vec_i16_from_hf_rnd_sat(vx01_hf);
HVX_Vector vx23_i16 = hvx_vec_i16_from_hf_rnd_sat(vx23_hf);
HVX_Vector vx_i8 = Q6_Vb_vpack_VhVh_sat(vx23_i16, vx01_i16);
*(HVX_Vector *) y_q = vx_i8;
}
static inline void quantize_block_fp32_q8x4(float * restrict x, uint8_t * restrict y_q, uint8_t * restrict y_d) {
assert((unsigned long) x % 128 == 0);
assert((unsigned long) y_q % 128 == 0);
@@ -1655,10 +1779,24 @@ static void quantize_row_fp32_q8x4x2(float * restrict x, uint8_t * restrict y, u
uint8_t * restrict t_d = (uint8_t *) x;
for (uint32_t i = 0; i < nb; i++) {
#if FP32_QUANTIZE_GROUP_SIZE == 32
quantize_block_fp32_q8x1(x + (i * 2 + 0) * qk / 2, y_q + (i * 2 + 0) * qblk_size / 2,
t_d + (i * 2 + 0) * dblk_size / 2);
quantize_block_fp32_q8x1(x + (i * 2 + 1) * qk / 2, y_q + (i * 2 + 1) * qblk_size / 2,
t_d + (i * 2 + 1) * dblk_size / 2);
#elif FP32_QUANTIZE_GROUP_SIZE == 64
quantize_block_fp32_q8x2(x + (i * 2 + 0) * qk / 2, y_q + (i * 2 + 0) * qblk_size / 2,
t_d + (i * 2 + 0) * dblk_size / 2);
quantize_block_fp32_q8x2(x + (i * 2 + 1) * qk / 2, y_q + (i * 2 + 1) * qblk_size / 2,
t_d + (i * 2 + 1) * dblk_size / 2);
#elif FP32_QUANTIZE_GROUP_SIZE == 128
quantize_block_fp32_q8x4(x + (i * 2 + 0) * qk / 2, y_q + (i * 2 + 0) * qblk_size / 2,
t_d + (i * 2 + 0) * dblk_size / 2);
quantize_block_fp32_q8x4(x + (i * 2 + 1) * qk / 2, y_q + (i * 2 + 1) * qblk_size / 2,
t_d + (i * 2 + 1) * dblk_size / 2);
#else
#error "FP32_QUANTIZE_GROUP_SIZE must be 32, 64, or 128"
#endif
}
// now copy the scales into final location
@@ -1671,6 +1809,7 @@ static void quantize_fp32_q8x4x2(const struct htp_tensor * src,
uint32_t nth,
uint32_t ith,
uint32_t nrows_per_thread) {
uint64_t t1 = HAP_perf_get_qtimer_count();
const uint32_t ne0 = src->ne[0];
+104 -42
View File
@@ -1527,6 +1527,8 @@ private:
#endif // GGML_VULKAN_MEMORY_DEBUG
static bool vk_perf_logger_enabled = false;
static bool vk_perf_logger_concurrent = false;
static bool vk_enable_sync_logger = false;
// number of calls between perf logger prints
static uint32_t vk_perf_logger_frequency = 1;
@@ -1577,14 +1579,14 @@ class vk_perf_logger {
flops.clear();
}
void log_timing(const ggml_tensor * node, const char *fusion_name, uint64_t time) {
std::string get_node_fusion_name(const ggml_tensor * node, const char *fusion_name, uint64_t *n_flops) {
*n_flops = 0;
std::string fusion_str;
if (fusion_name) {
fusion_str = fusion_name + std::string(" ");
}
if (node->op == GGML_OP_UNARY) {
timings[fusion_str + ggml_unary_op_name(ggml_get_unary_op(node))].push_back(time);
return;
return fusion_str + ggml_unary_op_name(ggml_get_unary_op(node));
}
if (node->op == GGML_OP_MUL_MAT || node->op == GGML_OP_MUL_MAT_ID) {
const uint64_t m = node->ne[0];
@@ -1606,9 +1608,8 @@ class vk_perf_logger {
name += " batch=" + std::to_string(batch);
}
name = fusion_str + name;
timings[name].push_back(time);
flops[name].push_back(m * n * (k + (k - 1)) * batch);
return;
*n_flops = m * n * (k + (k - 1)) * batch;
return name;
}
if (node->op == GGML_OP_CONV_2D || node->op == GGML_OP_CONV_TRANSPOSE_2D) {
std::string name = ggml_op_name(node->op);
@@ -1624,20 +1625,17 @@ class vk_perf_logger {
uint64_t size_M = Cout;
uint64_t size_K = Cin * KW * KH;
uint64_t size_N = N * OW * OH;
uint64_t n_flops = size_M * size_N * (size_K + (size_K - 1));
*n_flops = size_M * size_N * (size_K + (size_K - 1));
name += " M=Cout=" + std::to_string(size_M) + ", K=Cin*KW*KH=" + std::to_string(size_K) +
", N=N*OW*OH=" + std::to_string(size_N);
name = fusion_str + name;
flops[name].push_back(n_flops);
timings[name].push_back(time);
return;
return name;
}
if (node->op == GGML_OP_RMS_NORM) {
std::string name = ggml_op_name(node->op);
name += "(" + std::to_string(node->ne[0]) + "," + std::to_string(node->ne[1]) + "," + std::to_string(node->ne[2]) + "," + std::to_string(node->ne[3]) + ")";
name = fusion_str + name;
timings[name].push_back(time);
return;
return name;
}
if (node->op == GGML_OP_FLASH_ATTN_EXT) {
const ggml_tensor * dst = node;
@@ -1653,8 +1651,7 @@ class vk_perf_logger {
" k(" << k->ne[0] << "," << k->ne[1] << "," << k->ne[2] << "," << k->ne[3] << "), " <<
" v(" << v->ne[0] << "," << v->ne[1] << "," << v->ne[2] << "," << v->ne[3] << "), " <<
" m(" << (m?m->ne[0]:0) << "," << (m?m->ne[1]:0) << "," << (m?m->ne[2]:0) << "," << (m?m->ne[3]:0) << ")";
timings[name.str()].push_back(time);
return;
return name.str();
}
if (node->op == GGML_OP_TOP_K) {
std::stringstream name;
@@ -1662,11 +1659,38 @@ class vk_perf_logger {
name << ggml_op_name(node->op) <<
" K=" << node->ne[0] <<
" (" << node->src[0]->ne[0] << "," << node->src[0]->ne[1] << "," << node->src[0]->ne[2] << "," << node->src[0]->ne[3] << ")";
timings[name.str()].push_back(time);
return;
return name.str();
}
timings[fusion_str + ggml_op_name(node->op)].push_back(time);
return fusion_str + ggml_op_name(node->op);
}
void log_timing(const ggml_tensor * node, const char *fusion_name, uint64_t time) {
uint64_t n_flops;
std::string name = get_node_fusion_name(node, fusion_name, &n_flops);
if (n_flops) {
flops[name].push_back(n_flops);
}
timings[name].push_back(time);
}
void log_timing(const std::vector<ggml_tensor *> &nodes, const std::vector<const char *> &names, uint64_t time) {
uint64_t total_flops = 0;
std::string name;
for (size_t n = 0; n < nodes.size(); ++n) {
uint64_t n_flops = 0;
name += get_node_fusion_name(nodes[n], names[n], &n_flops);
total_flops += n_flops;
if (n != nodes.size() - 1) {
name += ", ";
}
}
if (total_flops) {
flops[name].push_back(total_flops);
}
timings[name].push_back(time);
}
private:
std::map<std::string, std::vector<uint64_t>> timings;
std::map<std::string, std::vector<uint64_t>> flops;
@@ -1729,7 +1753,9 @@ struct ggml_backend_vk_context {
std::unique_ptr<vk_perf_logger> perf_logger;
vk::QueryPool query_pool;
std::vector<const char *> query_fusion_names;
std::vector<int> query_fusion_node_count;
std::vector<ggml_tensor *> query_nodes;
std::vector<int> query_node_idx;
int32_t num_queries {};
int32_t query_idx {};
};
@@ -5194,6 +5220,8 @@ static void ggml_vk_instance_init() {
}
vk_perf_logger_enabled = getenv("GGML_VK_PERF_LOGGER") != nullptr;
vk_perf_logger_concurrent = getenv("GGML_VK_PERF_LOGGER_CONCURRENT") != nullptr;
vk_enable_sync_logger = getenv("GGML_VK_SYNC_LOGGER") != nullptr;
const char* GGML_VK_PERF_LOGGER_FREQUENCY = getenv("GGML_VK_PERF_LOGGER_FREQUENCY");
if (GGML_VK_PERF_LOGGER_FREQUENCY != nullptr) {
@@ -11820,15 +11848,18 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
}
}
#define ENABLE_SYNC_LOGGING 0
if (need_sync) {
#if ENABLE_SYNC_LOGGING
std::cerr << "sync" << std::endl;
#endif
if (vk_enable_sync_logger) {
std::cerr << "sync" << std::endl;
}
ctx->unsynced_nodes_written.clear();
ctx->unsynced_nodes_read.clear();
ggml_vk_sync_buffers(ctx, compute_ctx);
if (vk_perf_logger_enabled && vk_perf_logger_concurrent) {
ctx->query_node_idx[ctx->query_idx] = node_idx;
compute_ctx->s->buffer.writeTimestamp(vk::PipelineStageFlagBits::eAllCommands, ctx->query_pool, ctx->query_idx++);
}
}
// Add all fused nodes to the unsynchronized lists.
for (int32_t i = 0; i < ctx->num_additional_fused_ops + 1; ++i) {
@@ -11845,20 +11876,20 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
}
}
}
#if ENABLE_SYNC_LOGGING
for (int i = 0; i < ctx->num_additional_fused_ops + 1; ++i) {
auto *n = cgraph->nodes[node_idx + i];
std::cerr << node_idx + i << " " << ggml_op_name(n->op) << " " << n->name;
if (n->op == GGML_OP_GLU) {
std::cerr << " " << ggml_glu_op_name(ggml_get_glu_op(n)) << " " << (n->src[1] ? "split" : "single") << " ";
if (vk_enable_sync_logger) {
for (int i = 0; i < ctx->num_additional_fused_ops + 1; ++i) {
auto *n = cgraph->nodes[node_idx + i];
std::cerr << node_idx + i << " " << ggml_op_name(n->op) << " " << n->name;
if (n->op == GGML_OP_GLU) {
std::cerr << " " << ggml_glu_op_name(ggml_get_glu_op(n)) << " " << (n->src[1] ? "split" : "single") << " ";
}
if (n->op == GGML_OP_ROPE) {
const int mode = ((const int32_t *) n->op_params)[2];
std::cerr << " rope mode: " << mode;
}
std::cerr << std::endl;
}
if (n->op == GGML_OP_ROPE) {
const int mode = ((const int32_t *) n->op_params)[2];
std::cerr << " rope mode: " << mode;
}
std::cerr << std::endl;
}
#endif
switch (node->op) {
case GGML_OP_REPEAT:
@@ -13138,12 +13169,16 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
ctx->query_pool = ctx->device->device.createQueryPool(query_create_info);
ctx->num_queries = query_create_info.queryCount;
ctx->query_fusion_names.resize(ctx->num_queries);
ctx->query_fusion_node_count.resize(ctx->num_queries);
ctx->query_nodes.resize(ctx->num_queries);
ctx->query_node_idx.resize(ctx->num_queries);
}
ctx->device->device.resetQueryPool(ctx->query_pool, 0, cgraph->n_nodes+1);
std::fill(ctx->query_fusion_names.begin(), ctx->query_fusion_names.end(), nullptr);
std::fill(ctx->query_fusion_node_count.begin(), ctx->query_fusion_node_count.end(), 0);
std::fill(ctx->query_nodes.begin(), ctx->query_nodes.end(), nullptr);
std::fill(ctx->query_node_idx.begin(), ctx->query_node_idx.end(), 0);
GGML_ASSERT(ctx->compute_ctx.expired());
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
@@ -13272,9 +13307,16 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
} else {
compute_ctx = ctx->compute_ctx.lock();
}
ctx->query_nodes[ctx->query_idx] = cgraph->nodes[i];
ctx->query_fusion_names[ctx->query_idx] = fusion_string;
compute_ctx->s->buffer.writeTimestamp(vk::PipelineStageFlagBits::eAllCommands, ctx->query_pool, ctx->query_idx++);
if (!vk_perf_logger_concurrent) {
// track a single node/fusion for the current query
ctx->query_nodes[ctx->query_idx] = cgraph->nodes[i];
ctx->query_fusion_names[ctx->query_idx] = fusion_string;
compute_ctx->s->buffer.writeTimestamp(vk::PipelineStageFlagBits::eAllCommands, ctx->query_pool, ctx->query_idx++);
} else {
// track a fusion string and number of fused ops for the current node_idx
ctx->query_fusion_names[i] = fusion_string;
ctx->query_fusion_node_count[i] = ctx->num_additional_fused_ops;
}
}
if (enqueued) {
@@ -13316,12 +13358,32 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
// Get the results and pass them to the logger
std::vector<uint64_t> timestamps(cgraph->n_nodes + 1);
VK_CHECK(ctx->device->device.getQueryPoolResults(ctx->query_pool, 0, ctx->query_idx, (cgraph->n_nodes + 1)*sizeof(uint64_t), timestamps.data(), sizeof(uint64_t), vk::QueryResultFlagBits::e64 | vk::QueryResultFlagBits::eWait), "get timestamp results");
for (int i = 1; i < ctx->query_idx; i++) {
auto node = ctx->query_nodes[i];
auto name = ctx->query_fusion_names[i];
ctx->perf_logger->log_timing(node, name, uint64_t((timestamps[i] - timestamps[i-1]) * ctx->device->properties.limits.timestampPeriod));
if (!vk_perf_logger_concurrent) {
// Log each op separately
for (int i = 1; i < ctx->query_idx; i++) {
auto node = ctx->query_nodes[i];
auto name = ctx->query_fusion_names[i];
ctx->perf_logger->log_timing(node, name, uint64_t((timestamps[i] - timestamps[i-1]) * ctx->device->properties.limits.timestampPeriod));
}
} else {
// Log each group of nodes
int prev_node_idx = 0;
for (int i = 1; i < ctx->query_idx; i++) {
auto cur_node_idx = ctx->query_node_idx[i];
std::vector<ggml_tensor *> nodes;
std::vector<const char *> names;
for (int node_idx = prev_node_idx; node_idx < cur_node_idx; ++node_idx) {
if (ggml_op_is_empty(cgraph->nodes[node_idx]->op)) {
continue;
}
nodes.push_back(cgraph->nodes[node_idx]);
names.push_back(ctx->query_fusion_names[node_idx]);
node_idx += ctx->query_fusion_node_count[node_idx];
}
prev_node_idx = cur_node_idx;
ctx->perf_logger->log_timing(nodes, names, uint64_t((timestamps[i] - timestamps[i-1]) * ctx->device->properties.limits.timestampPeriod));
}
}
ctx->perf_logger->print_timings();
}
+52
View File
@@ -690,6 +690,8 @@ class MODEL_TENSOR(IntEnum):
V_TOK_EOI = auto() # cogvlm
# audio (mtmd)
A_ENC_EMBD_POS = auto()
A_ENC_EMBD_NORM = auto()
A_ENC_EMBD_TO_LOGITS = auto()
A_ENC_CONV1D = auto()
A_PRE_NORM = auto()
A_POST_NORM = auto()
@@ -700,8 +702,13 @@ class MODEL_TENSOR(IntEnum):
A_ENC_OUTPUT = auto()
A_ENC_OUTPUT_NORM = auto()
A_ENC_FFN_UP = auto()
A_ENC_FFN_NORM = auto()
A_ENC_FFN_GATE = auto()
A_ENC_FFN_DOWN = auto()
A_ENC_FFN_UP_1 = auto()
A_ENC_FFN_NORM_1 = auto()
A_ENC_FFN_GATE_1 = auto()
A_ENC_FFN_DOWN_1 = auto()
A_MMPROJ = auto()
A_MMPROJ_FC = auto()
A_MM_NORM_PRE = auto()
@@ -713,6 +720,16 @@ class MODEL_TENSOR(IntEnum):
NEXTN_HNORM = auto()
NEXTN_SHARED_HEAD_HEAD = auto()
NEXTN_SHARED_HEAD_NORM = auto()
# lfm2 audio
A_ENC_NORM_CONV = auto()
A_ENC_LINEAR_POS = auto()
A_ENC_POS_BIAS_U = auto()
A_ENC_POS_BIAS_V = auto()
A_ENC_OUT = auto()
A_ENC_CONV_DW = auto() # SSM conv
A_ENC_CONV_NORM = auto() # SSM conv
A_ENC_CONV_PW1 = auto()
A_ENC_CONV_PW2 = auto()
MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
@@ -1064,7 +1081,10 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.V_TOK_BOI: "v.boi",
MODEL_TENSOR.V_TOK_EOI: "v.eoi",
# audio (mtmd)
# note: all audio tensor names must use prefix "a." or "mm.a."
MODEL_TENSOR.A_ENC_EMBD_POS: "a.position_embd",
MODEL_TENSOR.A_ENC_EMBD_NORM: "a.position_embd_norm",
MODEL_TENSOR.A_ENC_EMBD_TO_LOGITS: "a.embd_to_logits",
MODEL_TENSOR.A_ENC_CONV1D: "a.conv1d.{bid}",
MODEL_TENSOR.A_PRE_NORM: "a.pre_ln",
MODEL_TENSOR.A_POST_NORM: "a.post_ln",
@@ -1074,13 +1094,28 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.A_ENC_INPUT_NORM: "a.blk.{bid}.ln1",
MODEL_TENSOR.A_ENC_OUTPUT: "a.blk.{bid}.attn_out",
MODEL_TENSOR.A_ENC_OUTPUT_NORM: "a.blk.{bid}.ln2",
MODEL_TENSOR.A_ENC_FFN_NORM: "a.blk.{bid}.ffn_norm",
MODEL_TENSOR.A_ENC_FFN_UP: "a.blk.{bid}.ffn_up",
MODEL_TENSOR.A_ENC_FFN_GATE: "a.blk.{bid}.ffn_gate",
MODEL_TENSOR.A_ENC_FFN_DOWN: "a.blk.{bid}.ffn_down",
MODEL_TENSOR.A_ENC_FFN_NORM_1: "a.blk.{bid}.ffn_norm_1",
MODEL_TENSOR.A_ENC_FFN_UP_1: "a.blk.{bid}.ffn_up_1",
MODEL_TENSOR.A_ENC_FFN_GATE_1: "a.blk.{bid}.ffn_gate_1",
MODEL_TENSOR.A_ENC_FFN_DOWN_1: "a.blk.{bid}.ffn_down_1",
MODEL_TENSOR.A_MMPROJ: "mm.a.mlp.{bid}",
MODEL_TENSOR.A_MMPROJ_FC: "mm.a.fc",
MODEL_TENSOR.A_MM_NORM_PRE: "mm.a.norm_pre",
MODEL_TENSOR.A_MM_NORM_MID: "mm.a.norm_mid",
# lfm2 audio
MODEL_TENSOR.A_ENC_NORM_CONV: "a.blk.{bid}.norm_conv",
MODEL_TENSOR.A_ENC_LINEAR_POS: "a.blk.{bid}.linear_pos",
MODEL_TENSOR.A_ENC_POS_BIAS_U: "a.blk.{bid}.pos_bias_u",
MODEL_TENSOR.A_ENC_POS_BIAS_V: "a.blk.{bid}.pos_bias_v",
MODEL_TENSOR.A_ENC_OUT: "a.pre_encode.out",
MODEL_TENSOR.A_ENC_CONV_DW: "a.blk.{bid}.conv_dw",
MODEL_TENSOR.A_ENC_CONV_NORM: "a.blk.{bid}.conv_norm",
MODEL_TENSOR.A_ENC_CONV_PW1: "a.blk.{bid}.conv_pw1",
MODEL_TENSOR.A_ENC_CONV_PW2: "a.blk.{bid}.conv_pw2",
# NextN/MTP
MODEL_TENSOR.NEXTN_EH_PROJ: "blk.{bid}.nextn.eh_proj",
MODEL_TENSOR.NEXTN_EMBED_TOKENS: "blk.{bid}.nextn.embed_tokens",
@@ -1145,6 +1180,8 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.V_TOK_EOI,
# audio
MODEL_TENSOR.A_ENC_EMBD_POS,
MODEL_TENSOR.A_ENC_EMBD_NORM,
MODEL_TENSOR.A_ENC_EMBD_TO_LOGITS,
MODEL_TENSOR.A_ENC_CONV1D,
MODEL_TENSOR.A_PRE_NORM,
MODEL_TENSOR.A_POST_NORM,
@@ -1154,13 +1191,27 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.A_ENC_INPUT_NORM,
MODEL_TENSOR.A_ENC_OUTPUT,
MODEL_TENSOR.A_ENC_OUTPUT_NORM,
MODEL_TENSOR.A_ENC_FFN_NORM,
MODEL_TENSOR.A_ENC_FFN_UP,
MODEL_TENSOR.A_ENC_FFN_GATE,
MODEL_TENSOR.A_ENC_FFN_DOWN,
MODEL_TENSOR.A_ENC_FFN_NORM_1,
MODEL_TENSOR.A_ENC_FFN_UP_1,
MODEL_TENSOR.A_ENC_FFN_GATE_1,
MODEL_TENSOR.A_ENC_FFN_DOWN_1,
MODEL_TENSOR.A_MMPROJ,
MODEL_TENSOR.A_MMPROJ_FC,
MODEL_TENSOR.A_MM_NORM_PRE,
MODEL_TENSOR.A_MM_NORM_MID,
MODEL_TENSOR.A_ENC_NORM_CONV,
MODEL_TENSOR.A_ENC_LINEAR_POS,
MODEL_TENSOR.A_ENC_POS_BIAS_U,
MODEL_TENSOR.A_ENC_POS_BIAS_V,
MODEL_TENSOR.A_ENC_OUT,
MODEL_TENSOR.A_ENC_CONV_DW,
MODEL_TENSOR.A_ENC_CONV_NORM,
MODEL_TENSOR.A_ENC_CONV_PW1,
MODEL_TENSOR.A_ENC_CONV_PW2,
],
MODEL_ARCH.LLAMA: [
MODEL_TENSOR.TOKEN_EMBD,
@@ -3363,6 +3414,7 @@ class VisionProjectorType:
LIGHTONOCR = "lightonocr"
COGVLM = "cogvlm"
JANUS_PRO = "janus_pro"
LFM2A = "lfm2a" # audio
GLM4V = "glm4v"
+71
View File
@@ -1535,10 +1535,20 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_EMBD_POS: (
"audio_tower.embed_positions", # ultravox
"audio_embedding.embedding", # lfm2
),
MODEL_TENSOR.A_ENC_EMBD_NORM: (
"audio_embedding.embedding_norm", # lfm2
),
MODEL_TENSOR.A_ENC_EMBD_TO_LOGITS: (
"audio_embedding.to_logits", # lfm2
),
MODEL_TENSOR.A_ENC_CONV1D: (
"audio_tower.conv{bid}", # ultravox
"conformer.pre_encode.conv.{bid}", # lfm2
),
MODEL_TENSOR.A_PRE_NORM: (),
@@ -1550,36 +1560,76 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_ATTN_Q: (
"audio_tower.layers.{bid}.self_attn.q_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_q", # lfm2
),
MODEL_TENSOR.A_ENC_ATTN_K: (
"audio_tower.layers.{bid}.self_attn.k_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_k", # lfm2
),
MODEL_TENSOR.A_ENC_ATTN_V: (
"audio_tower.layers.{bid}.self_attn.v_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_v", # lfm2
),
MODEL_TENSOR.A_ENC_INPUT_NORM: (
"audio_tower.layers.{bid}.self_attn_layer_norm", # ultravox
"conformer.layers.{bid}.norm_self_att", # lfm2
),
MODEL_TENSOR.A_ENC_OUTPUT: (
"audio_tower.layers.{bid}.self_attn.out_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_out", # lfm2
),
MODEL_TENSOR.A_ENC_OUTPUT_NORM: (
"audio_tower.layers.{bid}.final_layer_norm", # ultravox
"conformer.layers.{bid}.norm_out", # lfm2
),
MODEL_TENSOR.A_ENC_FFN_NORM: (
"conformer.layers.{bid}.norm_feed_forward1", # lfm2
),
MODEL_TENSOR.A_ENC_FFN_UP: (
"audio_tower.layers.{bid}.fc1", # ultravox
"conformer.layers.{bid}.feed_forward1.linear1", # lfm2
),
MODEL_TENSOR.A_ENC_FFN_GATE: (),
MODEL_TENSOR.A_ENC_FFN_DOWN: (
"audio_tower.layers.{bid}.fc2", # ultravox
"conformer.layers.{bid}.feed_forward1.linear2", # lfm2
),
MODEL_TENSOR.A_ENC_FFN_UP_1: (
"conformer.layers.{bid}.feed_forward2.linear1", # lfm2
),
MODEL_TENSOR.A_ENC_FFN_DOWN_1: (
"conformer.layers.{bid}.feed_forward2.linear2", # lfm2
),
MODEL_TENSOR.A_ENC_FFN_NORM_1: (
"conformer.layers.{bid}.norm_feed_forward2", # lfm2
),
MODEL_TENSOR.A_ENC_LINEAR_POS: (
"conformer.layers.{bid}.self_attn.linear_pos", # lfm2
),
MODEL_TENSOR.A_ENC_POS_BIAS_U: (
"conformer.layers.{bid}.self_attn.pos_bias_u", # lfm2
),
MODEL_TENSOR.A_ENC_POS_BIAS_V: (
"conformer.layers.{bid}.self_attn.pos_bias_v", # lfm2
),
MODEL_TENSOR.A_ENC_OUT: (
"conformer.pre_encode.out", # lfm2
),
# note: some tensors below has "audio." pseudo-prefix, to prevent conflicts with vision tensors
@@ -1587,6 +1637,7 @@ class TensorNameMap:
MODEL_TENSOR.A_MMPROJ: (
"audio.multi_modal_projector.linear_{bid}", # ultravox
"audio_adapter.model.{bid}" # lfm2
),
MODEL_TENSOR.A_MMPROJ_FC: (
@@ -1602,6 +1653,26 @@ class TensorNameMap:
"audio.multi_modal_projector.ln_mid", # ultravox
),
MODEL_TENSOR.A_ENC_CONV_DW: (
"conformer.layers.{bid}.conv.depthwise_conv", # lfm2
),
MODEL_TENSOR.A_ENC_CONV_NORM: (
"conformer.layers.{bid}.conv.batch_norm", # lfm2
),
MODEL_TENSOR.A_ENC_CONV_PW1: (
"conformer.layers.{bid}.conv.pointwise_conv1", # lfm2
),
MODEL_TENSOR.A_ENC_CONV_PW2: (
"conformer.layers.{bid}.conv.pointwise_conv2", # lfm2
),
MODEL_TENSOR.A_ENC_NORM_CONV: (
"conformer.layers.{bid}.norm_conv", # lfm2
),
# NextN/MTP tensors for GLM4_MOE
MODEL_TENSOR.NEXTN_EH_PROJ: (
"model.layers.{bid}.eh_proj",
+1 -1
View File
@@ -288,7 +288,7 @@ class LocalTensor:
data_range: LocalTensorRange
def mmap_bytes(self) -> np.ndarray:
return np.memmap(self.data_range.filename, mode='r', offset=self.data_range.offset, shape=self.data_range.size)
return np.memmap(self.data_range.filename, mode='c', offset=self.data_range.offset, shape=self.data_range.size)
class SafetensorsLocal:
+123 -28
View File
@@ -13,9 +13,10 @@
#ifdef __has_include
#if __has_include(<unistd.h>)
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#if defined(_POSIX_MAPPED_FILES)
#include <sys/mman.h>
#include <fcntl.h>
#endif
#if defined(_POSIX_MEMLOCK_RANGE)
#include <sys/resource.h>
@@ -74,7 +75,7 @@ struct llama_file::impl {
return ret;
}
impl(const char * fname, const char * mode) {
impl(const char * fname, const char * mode, [[maybe_unused]] const bool use_direct_io = false) {
fp = ggml_fopen(fname, mode);
if (fp == NULL) {
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
@@ -153,13 +154,40 @@ struct llama_file::impl {
write_raw(&val, sizeof(val));
}
void read_aligned_chunk(size_t offset, void * dest, size_t size) const {
throw std::runtime_error("DirectIO is not implemented on Windows.");
}
~impl() {
if (fp) {
std::fclose(fp);
}
}
#else
impl(const char * fname, const char * mode) {
impl(const char * fname, const char * mode, [[maybe_unused]] const bool use_direct_io = false) {
#ifdef __linux__
// Try unbuffered I/O for read only
if (use_direct_io && std::strcmp(mode, "rb") == 0) {
fd = open(fname, O_RDONLY | O_DIRECT);
if (fd != -1) {
struct stat file_stats{};
fstat(fd, &file_stats);
size = file_stats.st_size;
alignment = file_stats.st_blksize;
off_t ret = lseek(fd, 0, SEEK_SET);
if (ret == -1) {
throw std::runtime_error(format("seek error: %s", strerror(errno)));
}
return;
}
LLAMA_LOG_WARN("Failed to open model %s with error: %s. Falling back to buffered I/O",
fname, strerror(errno));
}
#endif
fp = ggml_fopen(fname, mode);
if (fp == NULL) {
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
@@ -170,27 +198,30 @@ struct llama_file::impl {
}
size_t tell() const {
// TODO: this ifdef is never true?
#ifdef _WIN32
__int64 ret = _ftelli64(fp);
#else
long ret = std::ftell(fp);
#endif
if (ret == -1) {
throw std::runtime_error(format("ftell error: %s", strerror(errno)));
if (fd == -1) {
long ret = std::ftell(fp);
if (ret == -1) {
throw std::runtime_error(format("ftell error: %s", strerror(errno)));
}
return (size_t) ret;
}
return (size_t) ret;
off_t pos = lseek(fd, 0, SEEK_CUR);
if (pos == -1) {
throw std::runtime_error(format("lseek error: %s", strerror(errno)));
}
return (size_t) pos;
}
void seek(size_t offset, int whence) const {
// TODO: this ifdef is never true?
#ifdef _WIN32
int ret = _fseeki64(fp, (__int64) offset, whence);
#else
int ret = std::fseek(fp, (long) offset, whence);
#endif
if (ret != 0) {
off_t ret = 0;
if (fd == -1) {
ret = std::fseek(fp, (long) offset, whence);
} else {
ret = lseek(fd, offset, whence);
}
if (ret == -1) {
throw std::runtime_error(format("seek error: %s", strerror(errno)));
}
}
@@ -200,13 +231,55 @@ struct llama_file::impl {
return;
}
errno = 0;
std::size_t ret = std::fread(ptr, len, 1, fp);
if (ferror(fp)) {
throw std::runtime_error(format("read error: %s", strerror(errno)));
if (fd == -1) {
std::size_t ret = std::fread(ptr, len, 1, fp);
if (ferror(fp)) {
throw std::runtime_error(format("read error: %s", strerror(errno)));
}
if (ret != 1) {
throw std::runtime_error("unexpectedly reached end of file");
}
} else {
bool successful = false;
while (!successful) {
off_t ret = read(fd, ptr, len);
if (ret == -1) {
if (errno == EINTR) {
continue; // Interrupted by signal, retry
}
throw std::runtime_error(format("read error: %s", strerror(errno)));
}
if (ret == 0) {
throw std::runtime_error("unexpectedly reached end of file");
}
successful = true;
}
}
if (ret != 1) {
throw std::runtime_error("unexpectedly reached end of file");
}
void read_aligned_chunk(size_t offset, void * dest, size_t size) const {
off_t aligned_offset = offset & ~(alignment - 1);
off_t offset_from_alignment = offset - aligned_offset;
size_t bytes_to_read = (offset_from_alignment + size + alignment - 1) & ~(alignment - 1);
void * raw_buffer = nullptr;
int ret = posix_memalign(&raw_buffer, alignment, bytes_to_read);
if (ret != 0) {
throw std::runtime_error(format("posix_memalign failed with error %d", ret));
}
struct aligned_buffer_deleter {
void operator()(void * p) const { free(p); }
};
std::unique_ptr<void, aligned_buffer_deleter> buffer(raw_buffer);
seek(aligned_offset, SEEK_SET);
read_raw(buffer.get(), bytes_to_read);
uintptr_t actual_data = reinterpret_cast<uintptr_t>(buffer.get()) + offset_from_alignment;
memcpy(dest, reinterpret_cast<void *>(actual_data), size);
}
uint32_t read_u32() const {
@@ -231,22 +304,43 @@ struct llama_file::impl {
}
~impl() {
if (fp) {
if (fd != -1) {
close(fd);
} else {
std::fclose(fp);
}
}
int fd = -1;
#endif
FILE * fp;
size_t size;
void read_raw_at(void * ptr, size_t len, size_t offset) const {
if (alignment != 1) {
read_aligned_chunk(offset, ptr, len);
} else {
seek(offset, SEEK_SET);
read_raw(ptr, len);
}
}
size_t read_alignment() const {
return alignment;
}
size_t alignment = 1;
FILE * fp{};
size_t size{};
};
llama_file::llama_file(const char * fname, const char * mode) : pimpl(std::make_unique<impl>(fname, mode)) {}
llama_file::llama_file(const char * fname, const char * mode, const bool use_direct_io) :
pimpl(std::make_unique<impl>(fname, mode, use_direct_io)) {}
llama_file::~llama_file() = default;
size_t llama_file::tell() const { return pimpl->tell(); }
size_t llama_file::size() const { return pimpl->size; }
size_t llama_file::read_alignment() const { return pimpl->read_alignment(); }
int llama_file::file_id() const {
#ifdef _WIN32
return _fileno(pimpl->fp);
@@ -261,6 +355,7 @@ int llama_file::file_id() const {
void llama_file::seek(size_t offset, int whence) const { pimpl->seek(offset, whence); }
void llama_file::read_raw(void * ptr, size_t len) const { pimpl->read_raw(ptr, len); }
void llama_file::read_raw_at(void * ptr, size_t len, size_t offset) const { pimpl->read_raw_at(ptr, len, offset); }
uint32_t llama_file::read_u32() const { return pimpl->read_u32(); }
+5 -1
View File
@@ -3,6 +3,7 @@
#include <cstdint>
#include <memory>
#include <vector>
#include <cstdio>
struct llama_file;
struct llama_mmap;
@@ -13,7 +14,7 @@ using llama_mmaps = std::vector<std::unique_ptr<llama_mmap>>;
using llama_mlocks = std::vector<std::unique_ptr<llama_mlock>>;
struct llama_file {
llama_file(const char * fname, const char * mode);
llama_file(const char * fname, const char * mode, bool use_direct_io = false);
~llama_file();
size_t tell() const;
@@ -24,11 +25,14 @@ struct llama_file {
void seek(size_t offset, int whence) const;
void read_raw(void * ptr, size_t len) const;
void read_raw_at(void * ptr, size_t len, size_t offset) const;
void read_aligned_chunk(size_t offset, void * dest, size_t size) const;
uint32_t read_u32() const;
void write_raw(const void * ptr, size_t len) const;
void write_u32(uint32_t val) const;
size_t read_alignment() const;
private:
struct impl;
std::unique_ptr<impl> pimpl;
+56 -13
View File
@@ -504,7 +504,7 @@ llama_model_loader::llama_model_loader(
get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
llm_kv = LLM_KV(llm_arch_from_string(arch_name));
files.emplace_back(new llama_file(fname.c_str(), "rb"));
files.emplace_back(new llama_file(fname.c_str(), "rb", !use_mmap));
contexts.emplace_back(ctx);
// Save tensors data offset of the main file.
@@ -572,7 +572,7 @@ llama_model_loader::llama_model_loader(
}
}
files.emplace_back(new llama_file(fname_split, "rb"));
files.emplace_back(new llama_file(fname_split, "rb", !use_mmap));
contexts.emplace_back(ctx);
// Save tensors data offset info of the shard.
@@ -935,7 +935,15 @@ bool llama_model_loader::load_all_data(
// 4 staging buffers for async uploads, each sized 1MB seems to be a good default for single NVMe drives.
// NVMe raid configurations might require more / larger buffers.
constexpr size_t n_buffers = 4;
constexpr size_t buffer_size = 1 * 1024 * 1024; // 1MB
size_t alignment = 1;
for (const auto & file : files) {
alignment = std::max(file->read_alignment(), alignment);
}
// Buffer size: balance between memory usage and I/O efficiency
// 64MB works well for NVMe drives
const size_t buffer_size = alignment != 1 ? 64 * 1024 * 1024 + 2 * alignment : 1 * 1024 * 1024;
std::vector<ggml_backend_buffer_t> host_buffers;
std::vector<ggml_backend_event_t> events;
@@ -985,6 +993,7 @@ bool llama_model_loader::load_all_data(
// If the backend is supported, create pinned memory buffers and events for synchronisation.
for (size_t idx = 0; idx < n_buffers; ++idx) {
auto * buf = ggml_backend_buft_alloc_buffer(host_buft, buffer_size);
if (!buf) {
LLAMA_LOG_DEBUG("%s: failed to allocate host buffer for async uploads for device %s\n", func,
ggml_backend_dev_name(dev));
@@ -1066,9 +1075,9 @@ bool llama_model_loader::load_all_data(
}
} else {
const auto & file = files.at(weight->idx);
if (ggml_backend_buffer_is_host(cur->buffer)) {
file->seek(weight->offs, SEEK_SET);
file->read_raw(cur->data, n_size);
file->read_raw_at(cur->data, n_size, weight->offs);
if (check_tensors) {
validation_result.emplace_back(std::async(std::launch::async, [cur, n_size] {
return std::make_pair(cur, ggml_validate_row_data(cur->type, cur->data, n_size));
@@ -1077,26 +1086,60 @@ bool llama_model_loader::load_all_data(
} else {
// If upload_backend is valid load the tensor in chunks to pinned memory and upload the buffers asynchronously to the GPU.
if (upload_backend) {
file->seek(weight->offs, SEEK_SET);
size_t offset = weight->offs;
alignment = file->read_alignment();
size_t aligned_offset = offset & ~(alignment - 1);
size_t offset_from_alignment = offset - aligned_offset;
file->seek(aligned_offset, SEEK_SET);
// Calculate aligned read boundaries
size_t read_start = aligned_offset;
size_t read_end = (offset + n_size + alignment - 1) & ~(alignment - 1);
size_t bytes_read = 0;
size_t data_read = 0; // Actual tensor data copied (excluding padding)
while (bytes_read < n_size) {
size_t read_iteration = std::min<size_t>(buffer_size, n_size - bytes_read);
while (bytes_read < read_end - read_start) {
size_t read_size = std::min<size_t>(buffer_size, read_end - read_start - bytes_read);
// Align the destination pointer within the pinned buffer
uintptr_t ptr_dest_aligned = (reinterpret_cast<uintptr_t>(host_ptrs[buffer_idx]) + alignment - 1) & ~(alignment - 1);
// Wait for previous upload to complete before reusing buffer
ggml_backend_event_synchronize(events[buffer_idx]);
file->read_raw(host_ptrs[buffer_idx], read_iteration);
ggml_backend_tensor_set_async(upload_backend, cur, host_ptrs[buffer_idx], bytes_read, read_iteration);
// Read aligned chunk from file
file->read_raw(reinterpret_cast<void *>(ptr_dest_aligned), read_size);
// Calculate actual data portion (excluding alignment padding)
uintptr_t ptr_data = ptr_dest_aligned;
size_t data_to_copy = read_size;
// Skip alignment padding at start of first chunk
if (bytes_read == 0) {
ptr_data += offset_from_alignment;
data_to_copy -= offset_from_alignment;
}
// Trim alignment padding at end of last chunk
if (aligned_offset + bytes_read + read_size > offset + n_size) {
data_to_copy -= (read_end - (offset + n_size));
}
// Async upload actual data to GPU
ggml_backend_tensor_set_async(upload_backend, cur,
reinterpret_cast<void *>(ptr_data), data_read, data_to_copy);
ggml_backend_event_record(events[buffer_idx], upload_backend);
bytes_read += read_iteration;
data_read += data_to_copy;
bytes_read += read_size;
++buffer_idx;
buffer_idx %= n_buffers;
}
} else {
read_buf.resize(n_size);
file->seek(weight->offs, SEEK_SET);
file->read_raw(read_buf.data(), n_size);
file->read_raw_at(read_buf.data(), n_size, weight->offs);
ggml_backend_tensor_set(cur, read_buf.data(), 0, n_size);
if (check_tensors && !ggml_validate_row_data(cur->type, read_buf.data(), n_size)) {
throw std::runtime_error(format("tensor '%s' has invalid data", ggml_get_name(cur)));
+7 -5
View File
@@ -2378,10 +2378,10 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
if (cpu_dev == nullptr) {
throw std::runtime_error(format("%s: no CPU backend found", __func__));
}
const int i_gpu_start = std::max((int) hparams.n_layer - n_gpu_layers, (int) 0);
const int act_gpu_layers = devices.empty() ? 0 : std::min(n_gpu_layers, (int)n_layer + 1);
const int i_gpu_start = std::max(int(hparams.n_layer) + 1 - n_gpu_layers, 0);
const int act_gpu_layers = devices.empty() ? 0 : std::min(n_gpu_layers, int(n_layer) + 1);
auto get_layer_buft_list = [&](int il) -> llama_model::impl::layer_dev {
const bool is_swa = il < (int) hparams.n_layer && hparams.is_swa(il);
const bool is_swa = il < int(hparams.n_layer) && hparams.is_swa(il);
if (il < i_gpu_start || (il - i_gpu_start) >= act_gpu_layers) {
LLAMA_LOG_DEBUG("load_tensors: layer %3d assigned to device %s, is_swa = %d\n", il, ggml_backend_dev_name(cpu_dev), is_swa);
return {cpu_dev, &pimpl->cpu_buft_list};
@@ -6693,10 +6693,12 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
if (llama_supports_gpu_offload()) {
const int n_gpu = std::min(n_gpu_layers, int(hparams.n_layer));
LLAMA_LOG_INFO("%s: offloading %d repeating layers to GPU\n", __func__, n_gpu);
if (n_gpu_layers > (int) hparams.n_layer) {
int n_repeating = n_gpu;
if (n_repeating > 0) {
LLAMA_LOG_INFO("%s: offloading output layer to GPU\n", __func__);
n_repeating--;
}
LLAMA_LOG_INFO("%s: offloading %d repeating layers to GPU\n", __func__, n_repeating);
const int max_backend_supported_layers = hparams.n_layer + 1;
const int max_offloadable_layers = hparams.n_layer + 1;
+20 -32
View File
@@ -292,10 +292,6 @@ static void llama_params_fit_impl(
if (mparams->split_mode == LLAMA_SPLIT_MODE_ROW) {
throw std::runtime_error("changing weight allocation for LLAMA_SPLIT_MODE_ROW not implemented, abort");
}
if (hp_ngl < 2*nd) {
throw std::runtime_error("model has only " + std::to_string(hp_ngl) + " layers but need at least "
+ std::to_string(2*nd) + " to fit memory for " + std::to_string(nd) + " devices, abort");
}
}
if (!tensor_buft_overrides) {
throw std::runtime_error("did not provide buffer to set tensor_buft_overrides, abort");
@@ -362,8 +358,7 @@ static void llama_params_fit_impl(
auto set_ngl_tensor_split_tbo = [&](
const std::vector<ngl_t> & ngl_per_device,
const std::vector<ggml_backend_buffer_type_t> & overflow_bufts,
llama_model_params & mparams,
const bool add_nonrepeating) {
llama_model_params & mparams) {
mparams.n_gpu_layers = 0;
for (size_t id = 0; id < nd; id++) {
mparams.n_gpu_layers += ngl_per_device[id].n_layer;
@@ -371,13 +366,9 @@ static void llama_params_fit_impl(
tensor_split[id] = ngl_per_device[id].n_layer;
}
}
assert(uint32_t(mparams.n_gpu_layers) <= hp_ngl);
uint32_t il0 = hp_ngl - mparams.n_gpu_layers; // start index for tensor buft overrides
assert(uint32_t(mparams.n_gpu_layers) <= hp_ngl + 1);
uint32_t il0 = hp_ngl + 1 - mparams.n_gpu_layers; // start index for tensor buft overrides
if (add_nonrepeating) {
mparams.n_gpu_layers += 1;
tensor_split[nd - 1] += 1;
}
mparams.tensor_split = tensor_split;
size_t itbo = 0;
@@ -408,10 +399,9 @@ static void llama_params_fit_impl(
auto get_memory_for_layers = [&](
const char * func_name,
const std::vector<ngl_t> & ngl_per_device,
const std::vector<ggml_backend_buffer_type_t> & overflow_bufts,
const bool add_nonrepeating) -> std::vector<int64_t> {
const std::vector<ggml_backend_buffer_type_t> & overflow_bufts) -> std::vector<int64_t> {
llama_model_params mparams_copy = *mparams;
set_ngl_tensor_split_tbo(ngl_per_device, overflow_bufts, mparams_copy, add_nonrepeating);
set_ngl_tensor_split_tbo(ngl_per_device, overflow_bufts, mparams_copy);
const dmds_t dmd_nl = llama_get_device_memory_data(
path_model, &mparams_copy, cparams, devs, hp_ngl, hp_nct, hp_nex, log_level);
@@ -469,9 +459,6 @@ static void llama_params_fit_impl(
LLAMA_LOG_DEBUG("%s: id=%zu, target=%" PRId64 " MiB\n", __func__, id, targets[id]/MiB);
}
// whether for the optimal memory use we expect to load at least some MoE tensors:
const bool partial_moe = hp_nex > 0 && global_surplus_cpu_moe > 0;
std::vector<ggml_backend_buffer_type_t> overflow_bufts; // which bufts the partial layers of a device overflow to:
overflow_bufts.reserve(nd);
for (size_t id = 0; id < nd - 1; ++id) {
@@ -480,7 +467,7 @@ static void llama_params_fit_impl(
overflow_bufts.push_back(ggml_backend_cpu_buffer_type());
std::vector<ngl_t> ngl_per_device(nd);
std::vector<int64_t> mem = get_memory_for_layers(__func__, ngl_per_device, overflow_bufts, partial_moe);
std::vector<int64_t> mem = get_memory_for_layers(__func__, ngl_per_device, overflow_bufts);
if (hp_nex > 0) {
for (size_t id = 0; id < nd; id++) {
ngl_per_device[id].overflow_type = LAYER_FRACTION_MOE;
@@ -493,13 +480,14 @@ static void llama_params_fit_impl(
// - interpolate the memory use / layer between low and high linearly to get a guess where it meets our target
// - check memory use of our guess, replace either the low or high bound
// - once we only have a difference of a single layer, stop and return the lower bound that just barely still fits
// - the last device has the output layer, which cannot be a partial layer
if (hp_nex == 0) {
LLAMA_LOG_INFO("%s: filling dense layers back-to-front:\n", __func__);
} else {
LLAMA_LOG_INFO("%s: filling dense-only layers back-to-front:\n", __func__);
}
for (int id = nd - 1; id >= 0; id--) {
uint32_t n_unassigned = hp_ngl;
uint32_t n_unassigned = hp_ngl + 1;
for (size_t jd = id + 1; jd < nd; ++jd) {
assert(n_unassigned >= ngl_per_device[jd].n_layer);
n_unassigned -= ngl_per_device[jd].n_layer;
@@ -508,10 +496,10 @@ static void llama_params_fit_impl(
std::vector<ngl_t> ngl_per_device_high = ngl_per_device;
ngl_per_device_high[id].n_layer = n_unassigned;
if (hp_nex > 0) {
ngl_per_device_high[id].n_part = ngl_per_device_high[id].n_layer;
ngl_per_device_high[id].n_part = size_t(id) < nd - 1 ? ngl_per_device_high[id].n_layer : ngl_per_device_high[id].n_layer - 1;
}
if (ngl_per_device_high[id].n_layer > 0) {
std::vector<int64_t> mem_high = get_memory_for_layers(__func__, ngl_per_device_high, overflow_bufts, partial_moe);
std::vector<int64_t> mem_high = get_memory_for_layers(__func__, ngl_per_device_high, overflow_bufts);
if (mem_high[id] > targets[id]) {
assert(ngl_per_device_high[id].n_layer > ngl_per_device[id].n_layer);
uint32_t delta = ngl_per_device_high[id].n_layer - ngl_per_device[id].n_layer;
@@ -526,7 +514,7 @@ static void llama_params_fit_impl(
if (hp_nex) {
ngl_per_device_test[id].n_part += step_size;
}
const std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts, partial_moe);
const std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts);
if (mem_test[id] <= targets[id]) {
ngl_per_device = ngl_per_device_test;
@@ -553,7 +541,7 @@ static void llama_params_fit_impl(
__func__, dev_names[id].c_str(), ngl_per_device[id].n_layer, mem[id]/MiB, projected_margin/MiB);
}
if (hp_nex == 0 || global_surplus_cpu_moe <= 0) {
set_ngl_tensor_split_tbo(ngl_per_device, overflow_bufts, *mparams, partial_moe);
set_ngl_tensor_split_tbo(ngl_per_device, overflow_bufts, *mparams);
return;
}
@@ -576,13 +564,13 @@ static void llama_params_fit_impl(
for (size_t id = 0; id <= id_dense_start; id++) {
std::vector<ngl_t> ngl_per_device_high = ngl_per_device;
for (size_t jd = id_dense_start; jd < nd; jd++) {
const uint32_t n_layer_move = ngl_per_device_high[jd].n_layer;
const uint32_t n_layer_move = jd < nd - 1 ? ngl_per_device_high[jd].n_layer : ngl_per_device_high[jd].n_layer - 1;
ngl_per_device_high[id].n_layer += n_layer_move;
ngl_per_device_high[jd].n_layer -= n_layer_move;
ngl_per_device_high[jd].n_part = 0;
}
size_t id_dense_start_high = nd - 1;
std::vector<int64_t> mem_high = get_memory_for_layers(__func__, ngl_per_device_high, overflow_bufts, partial_moe);
std::vector<int64_t> mem_high = get_memory_for_layers(__func__, ngl_per_device_high, overflow_bufts);
if (mem_high[id] > targets[id]) {
assert(ngl_per_device_high[id].n_layer >= ngl_per_device_high[id].n_part);
@@ -610,7 +598,7 @@ static void llama_params_fit_impl(
break;
}
}
const std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts, partial_moe);
const std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts);
if (mem_test[id] <= targets[id]) {
ngl_per_device = ngl_per_device_test;
@@ -637,7 +625,7 @@ static void llama_params_fit_impl(
}
// try to fit at least part of one more layer
if (ngl_per_device[id_dense_start].n_layer > 0) {
if (ngl_per_device[id_dense_start].n_layer > (id < nd - 1 ? 0 : 1)) {
std::vector<ngl_t> ngl_per_device_test = ngl_per_device;
size_t id_dense_start_test = id_dense_start;
ngl_per_device_test[id_dense_start_test].n_layer--;
@@ -649,7 +637,7 @@ static void llama_params_fit_impl(
}
ngl_per_device_test[id].overflow_type = LAYER_FRACTION_UP;
LLAMA_LOG_DEBUG("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_UP\n", __func__);
std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts, partial_moe);
std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts);
if (mem_test[id] < targets[id]) {
ngl_per_device = ngl_per_device_test;
mem = mem_test;
@@ -659,7 +647,7 @@ static void llama_params_fit_impl(
ngl_per_device_test[id].overflow_type = LAYER_FRACTION_GATE;
LLAMA_LOG_DEBUG("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_GATE\n", __func__);
mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts, partial_moe);
mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts);
if (mem_test[id] < targets[id]) {
ngl_per_device = ngl_per_device_test;
mem = mem_test;
@@ -670,7 +658,7 @@ static void llama_params_fit_impl(
} else {
ngl_per_device_test[id].overflow_type = LAYER_FRACTION_ATTN;
LLAMA_LOG_DEBUG("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_ATTN\n", __func__);
mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts, partial_moe);
mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts);
if (mem_test[id] < targets[id]) {
ngl_per_device = ngl_per_device_test;
mem = mem_test;
@@ -687,7 +675,7 @@ static void llama_params_fit_impl(
__func__, dev_names[id].c_str(), ngl_per_device[id].n_layer, ngl_per_device[id].n_part, mem[id]/MiB, projected_margin/MiB);
}
set_ngl_tensor_split_tbo(ngl_per_device, overflow_bufts, *mparams, partial_moe);
set_ngl_tensor_split_tbo(ngl_per_device, overflow_bufts, *mparams);
}
bool llama_params_fit(
+25
View File
@@ -16,6 +16,7 @@ int main(void) {
for (int ex = 0; ex < LLAMA_EXAMPLE_COUNT; ex++) {
try {
auto ctx_arg = common_params_parser_init(params, (enum llama_example)ex);
common_params_add_preset_options(ctx_arg.options);
std::unordered_set<std::string> seen_args;
std::unordered_set<std::string> seen_env_vars;
for (const auto & opt : ctx_arg.options) {
@@ -37,6 +38,30 @@ int main(void) {
exit(1);
}
}
// ensure shorter argument precedes longer argument
if (opt.args.size() > 1) {
const std::string first(opt.args.front());
const std::string last(opt.args.back());
if (first.length() > last.length()) {
fprintf(stderr, "test-arg-parser: shorter argument should come before longer one: %s, %s\n",
first.c_str(), last.c_str());
assert(false);
}
}
// same check for negated arguments
if (opt.args_neg.size() > 1) {
const std::string first(opt.args_neg.front());
const std::string last(opt.args_neg.back());
if (first.length() > last.length()) {
fprintf(stderr, "test-arg-parser: shorter negated argument should come before longer one: %s, %s\n",
first.c_str(), last.c_str());
assert(false);
}
}
}
} catch (std::exception & e) {
printf("%s\n", e.what());
+4 -4
View File
@@ -7295,11 +7295,11 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_l2_norm(GGML_TYPE_F32, {64, 5, 4, 3}, 1e-12f));
for (int64_t d_conv : {3, 4}) {
for (int64_t d_conv : {3, 4, 9}) {
for (int64_t d_inner: {1024, 1536, 2048}) {
test_cases.emplace_back(new test_ssm_conv(GGML_TYPE_F32, {4, d_inner, 1, 1}, {d_conv, d_inner, 1, 1}));
test_cases.emplace_back(new test_ssm_conv(GGML_TYPE_F32, {8, d_inner, 1, 1}, {d_conv, d_inner, 1, 1}));
test_cases.emplace_back(new test_ssm_conv(GGML_TYPE_F32, {4, d_inner, 4, 1}, {d_conv, d_inner, 1, 1}));
test_cases.emplace_back(new test_ssm_conv(GGML_TYPE_F32, {d_conv, d_inner, 1, 1}, {d_conv, d_inner, 1, 1}));
test_cases.emplace_back(new test_ssm_conv(GGML_TYPE_F32, {2 * d_conv, d_inner, 1, 1}, {d_conv, d_inner, 1, 1}));
test_cases.emplace_back(new test_ssm_conv(GGML_TYPE_F32, {d_conv, d_inner, 4, 1}, {d_conv, d_inner, 1, 1}));
}
}
+1
View File
@@ -15,6 +15,7 @@ add_library(mtmd
clip-graph.h
models/models.h
models/cogvlm.cpp
models/conformer.cpp
models/glm4v.cpp
models/internvl.cpp
models/kimivl.cpp
+17
View File
@@ -138,6 +138,21 @@
#define TN_TOK_BOI "v.boi"
#define TN_TOK_EOI "v.eoi"
// (conformer) lfm2
#define TN_PRE_ENCODE_OUT "a.pre_encode.out.%s"
#define TN_FFN_NORM "%s.blk.%d.ffn_norm.%s"
#define TN_FFN_NORM_1 "%s.blk.%d.ffn_norm_1.%s"
#define TN_FFN_UP_1 "%s.blk.%d.ffn_up_1.%s"
#define TN_FFN_DOWN_1 "%s.blk.%d.ffn_down_1.%s"
#define TN_POS_BIAS_U "%s.blk.%d.pos_bias_u"
#define TN_POS_BIAS_V "%s.blk.%d.pos_bias_v"
#define TN_NORM_CONV "%s.blk.%d.norm_conv.%s"
#define TN_LINEAR_POS "%s.blk.%d.linear_pos.%s"
#define TN_CONV_DW "%s.blk.%d.conv_dw.%s"
#define TN_CONV_NORM "%s.blk.%d.conv_norm.%s"
#define TN_CONV_PW1 "%s.blk.%d.conv_pw1.%s"
#define TN_CONV_PW2 "%s.blk.%d.conv_pw2.%s"
// align x to upper multiple of n
#define CLIP_ALIGN(x, n) ((((x) + (n) - 1) / (n)) * (n))
@@ -170,6 +185,7 @@ enum projector_type {
PROJECTOR_TYPE_LIGHTONOCR,
PROJECTOR_TYPE_COGVLM,
PROJECTOR_TYPE_JANUS_PRO,
PROJECTOR_TYPE_LFM2A,
PROJECTOR_TYPE_GLM4V,
PROJECTOR_TYPE_UNKNOWN,
};
@@ -198,6 +214,7 @@ static std::map<projector_type, std::string> PROJECTOR_TYPE_NAMES = {
{ PROJECTOR_TYPE_LIGHTONOCR,"lightonocr"},
{ PROJECTOR_TYPE_COGVLM, "cogvlm"},
{ PROJECTOR_TYPE_JANUS_PRO, "janus_pro"},
{ PROJECTOR_TYPE_LFM2A, "lfm2a"},
{ PROJECTOR_TYPE_GLM4V, "glm4v"},
};
+31
View File
@@ -4,6 +4,7 @@
#include "clip.h"
#include "clip-impl.h"
#include <array>
#include <vector>
#include <unordered_set>
#include <cstdint>
@@ -142,6 +143,30 @@ struct clip_layer {
ggml_tensor * deepstack_fc2_w = nullptr;
ggml_tensor * deepstack_fc2_b = nullptr;
// lfm2
ggml_tensor * ff_norm_w = nullptr;
ggml_tensor * ff_norm_b = nullptr;
ggml_tensor * ff_norm_1_w = nullptr;
ggml_tensor * ff_norm_1_b = nullptr;
ggml_tensor * ff_up_1_w = nullptr;
ggml_tensor * ff_up_1_b = nullptr;
ggml_tensor * ff_down_1_w = nullptr;
ggml_tensor * ff_down_1_b = nullptr;
ggml_tensor * pos_bias_u = nullptr;
ggml_tensor * pos_bias_v = nullptr;
ggml_tensor * norm_conv_w = nullptr;
ggml_tensor * norm_conv_b = nullptr;
ggml_tensor * linear_pos_w = nullptr;
ggml_tensor * conv_norm_w = nullptr;
ggml_tensor * conv_norm_b = nullptr;
ggml_tensor * conv_dw_w = nullptr;
ggml_tensor * conv_dw_b = nullptr;
ggml_tensor * conv_pw1_w = nullptr;
ggml_tensor * conv_pw1_b = nullptr;
ggml_tensor * conv_pw2_w = nullptr;
ggml_tensor * conv_pw2_b = nullptr;
bool has_deepstack() const {
return deepstack_fc1_w != nullptr;
}
@@ -286,6 +311,12 @@ struct clip_model {
ggml_tensor * mm_boi = nullptr;
ggml_tensor * mm_eoi = nullptr;
// lfm2 audio
std::array<ggml_tensor *, 7> pre_encode_conv_X_w = {nullptr};
std::array<ggml_tensor *, 7> pre_encode_conv_X_b = {nullptr};
ggml_tensor * pre_encode_out_w = nullptr;
ggml_tensor * pre_encode_out_b = nullptr;
bool audio_has_avgpool() const {
return proj_type == PROJECTOR_TYPE_QWEN2A
|| proj_type == PROJECTOR_TYPE_VOXTRAL;
+86
View File
@@ -837,6 +837,10 @@ static ggml_cgraph * clip_image_build_graph(clip_ctx * ctx, const clip_image_f32
{
builder = std::make_unique<clip_graph_llava>(ctx, img);
} break;
case PROJECTOR_TYPE_LFM2A:
{
builder = std::make_unique<clip_graph_conformer>(ctx, img);
} break;
case PROJECTOR_TYPE_GLM4V:
{
builder = std::make_unique<clip_graph_glm4v>(ctx, img);
@@ -1187,6 +1191,15 @@ struct clip_model_loader {
hparams.audio_window_len = 400;
hparams.audio_hop_len = 160;
} break;
case PROJECTOR_TYPE_LFM2A:
{
// audio preprocessing params
hparams.audio_chunk_len = 1; // in seconds
hparams.audio_sample_rate = 16000;
hparams.audio_n_fft = 512;
hparams.audio_window_len = 400;
hparams.audio_hop_len = 160;
} break;
default:
break;
}
@@ -1611,6 +1624,52 @@ struct clip_model_loader {
model.mm_1_w = get_tensor(string_format(TN_LLAVA_PROJ, 1, "weight"));
model.mm_1_b = get_tensor(string_format(TN_LLAVA_PROJ, 1, "bias"));
} break;
case PROJECTOR_TYPE_LFM2A:
{
for (int i : {0, 2, 3, 5, 6}) {
model.pre_encode_conv_X_w[i] = get_tensor(string_format(TN_CONV1D, i, "weight"));
model.pre_encode_conv_X_b[i] = get_tensor(string_format(TN_CONV1D, i, "bias"));
}
model.pre_encode_out_w = get_tensor(string_format(TN_PRE_ENCODE_OUT, "weight"));
model.pre_encode_out_b = get_tensor(string_format(TN_PRE_ENCODE_OUT, "bias"));
model.mm_0_w = get_tensor(string_format(TN_MM_AUDIO_MLP, 0, "weight"));
model.mm_0_b = get_tensor(string_format(TN_MM_AUDIO_MLP, 0, "bias"));
model.mm_1_w = get_tensor(string_format(TN_MM_AUDIO_MLP, 1, "weight"));
model.mm_1_b = get_tensor(string_format(TN_MM_AUDIO_MLP, 1, "bias"));
model.mm_3_w = get_tensor(string_format(TN_MM_AUDIO_MLP, 3, "weight"));
model.mm_3_b = get_tensor(string_format(TN_MM_AUDIO_MLP, 3, "bias"));
for (int il = 0; il < hparams.n_layer; ++il) {
auto & layer = model.layers[il];
layer.ff_norm_w = get_tensor(string_format(TN_FFN_NORM, prefix, il, "weight"));
layer.ff_norm_b = get_tensor(string_format(TN_FFN_NORM, prefix, il, "bias"));
layer.ff_norm_1_w = get_tensor(string_format(TN_FFN_NORM_1, prefix, il, "weight"));
layer.ff_norm_1_b = get_tensor(string_format(TN_FFN_NORM_1, prefix, il, "bias"));
layer.ff_up_1_w = get_tensor(string_format(TN_FFN_UP_1, prefix, il, "weight"));
layer.ff_up_1_b = get_tensor(string_format(TN_FFN_UP_1, prefix, il, "bias"));
layer.ff_down_1_w = get_tensor(string_format(TN_FFN_DOWN_1, prefix, il, "weight"));
layer.ff_down_1_b = get_tensor(string_format(TN_FFN_DOWN_1, prefix, il, "bias"));
layer.pos_bias_u = get_tensor(string_format(TN_POS_BIAS_U, prefix, il));
layer.pos_bias_v = get_tensor(string_format(TN_POS_BIAS_V, prefix, il));
layer.norm_conv_w = get_tensor(string_format(TN_NORM_CONV, prefix, il, "weight"));
layer.norm_conv_b = get_tensor(string_format(TN_NORM_CONV, prefix, il, "bias"));
layer.linear_pos_w = get_tensor(string_format(TN_LINEAR_POS, prefix, il, "weight"));
layer.conv_norm_w = get_tensor(string_format(TN_CONV_NORM, prefix, il, "weight"));
layer.conv_norm_b = get_tensor(string_format(TN_CONV_NORM, prefix, il, "bias"));
layer.conv_dw_w = get_tensor(string_format(TN_CONV_DW, prefix, il, "weight"));
layer.conv_dw_b = get_tensor(string_format(TN_CONV_DW, prefix, il, "bias"));
layer.conv_pw1_w = get_tensor(string_format(TN_CONV_PW1, prefix, il, "weight"));
layer.conv_pw1_b = get_tensor(string_format(TN_CONV_PW1, prefix, il, "bias"));
layer.conv_pw2_w = get_tensor(string_format(TN_CONV_PW2, prefix, il, "weight"));
layer.conv_pw2_b = get_tensor(string_format(TN_CONV_PW2, prefix, il, "bias"));
}
} break;
default:
GGML_ASSERT(false && "unknown projector type");
}
@@ -3004,6 +3063,10 @@ int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * im
{
n_patches += 2; // for BOI and EOI token embeddings
} break;
case PROJECTOR_TYPE_LFM2A:
{
n_patches = ((((img->nx + 1) / 2) + 1) / 2 + 1) / 2;
} break;
default:
GGML_ABORT("unsupported projector type");
}
@@ -3362,6 +3425,27 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
}
set_input_i32("pos_w", pos_data);
} break;
case PROJECTOR_TYPE_LFM2A:
{
GGML_ASSERT(imgs.entries.size() == 1);
const auto n_frames = clip_n_output_tokens(ctx, imgs.entries.front().get());
auto d_model = 512;
auto seq_len = n_frames * 2 - 1;
std::vector<float> pos_emb(d_model*seq_len);
std::vector<double> inv_freq(d_model / 2);
for (size_t i = 0; i < inv_freq.size(); ++i) {
inv_freq[i] = std::exp(-(std::log(10000.0) / (float)d_model) * (2.0f * (float)(i)));
}
for (int64_t pos = 0; pos < seq_len; ++pos) {
for (size_t i = 0; i < inv_freq.size(); ++i) {
const float ang = (n_frames - pos - 1) * inv_freq[i];
pos_emb[pos*d_model + 2*i + 0] = sinf(ang); // even
pos_emb[pos*d_model + 2*i + 1] = cosf(ang); // odd
}
}
set_input_f32("pos_emb", pos_emb);
} break;
default:
GGML_ABORT("Unknown projector type");
}
@@ -3456,6 +3540,8 @@ int clip_n_mmproj_embd(const struct clip_ctx * ctx) {
return ctx->model.mm_2_w->ne[1];
case PROJECTOR_TYPE_COGVLM:
return ctx->model.mm_4h_to_h_w->ne[1];
case PROJECTOR_TYPE_LFM2A:
return ctx->model.position_embeddings->ne[0];
case PROJECTOR_TYPE_GLM4V:
return ctx->model.mm_ffn_down_w->ne[1];
default:
+217
View File
@@ -0,0 +1,217 @@
#include "models.h"
ggml_cgraph * clip_graph_conformer::build() {
const int n_frames = img.nx;
const int n_pos = n_frames / 2;
const int n_pos_embd = (((((n_frames + 1) / 2) + 1) / 2 + 1) / 2) * 2 - 1;
GGML_ASSERT(model.position_embeddings->ne[1] >= n_pos);
ggml_tensor * pos_emb = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 512, n_pos_embd);
ggml_set_name(pos_emb, "pos_emb");
ggml_set_input(pos_emb);
ggml_build_forward_expand(gf, pos_emb);
ggml_tensor * inp = build_inp_raw(1);
cb(inp, "input", -1);
auto * cur = ggml_cont(ctx0, ggml_transpose(ctx0, inp));
// pre encode, conv subsampling
{
// layer.0 - conv2d
cur = ggml_conv_2d(ctx0, model.pre_encode_conv_X_w[0], cur, 2, 2, 1, 1, 1, 1);
cur = ggml_add(ctx0, cur, model.pre_encode_conv_X_b[0]);
cb(cur, "conformer.pre_encode.conv.{}", 0);
// layer.1 - relu
cur = ggml_relu_inplace(ctx0, cur);
// layer.2 conv2d dw
cur = ggml_conv_2d_dw_direct(ctx0, model.pre_encode_conv_X_w[2], cur, 2, 2, 1, 1, 1, 1);
cur = ggml_add(ctx0, cur, model.pre_encode_conv_X_b[2]);
cb(cur, "conformer.pre_encode.conv.{}", 2);
// layer.3 conv2d
cur = ggml_conv_2d_direct(ctx0, model.pre_encode_conv_X_w[3], cur, 1, 1, 0, 0, 1, 1);
cur = ggml_add(ctx0, cur, model.pre_encode_conv_X_b[3]);
cb(cur, "conformer.pre_encode.conv.{}", 3);
// layer.4 - relu
cur = ggml_relu_inplace(ctx0, cur);
// layer.5 conv2d dw
cur = ggml_conv_2d_dw_direct(ctx0, model.pre_encode_conv_X_w[5], cur, 2, 2, 1, 1, 1, 1);
cur = ggml_add(ctx0, cur, model.pre_encode_conv_X_b[5]);
cb(cur, "conformer.pre_encode.conv.{}", 5);
// layer.6 conv2d
cur = ggml_conv_2d_direct(ctx0, model.pre_encode_conv_X_w[6], cur, 1, 1, 0, 0, 1, 1);
cur = ggml_add(ctx0, cur, model.pre_encode_conv_X_b[6]);
cb(cur, "conformer.pre_encode.conv.{}", 6);
// layer.7 - relu
cur = ggml_relu_inplace(ctx0, cur);
// flatten channel and frequency axis
cur = ggml_cont(ctx0, ggml_permute(ctx0, cur, 0, 2, 1, 3));
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2]);
// calculate out
cur = ggml_mul_mat(ctx0, model.pre_encode_out_w, cur);
cur = ggml_add(ctx0, cur, model.pre_encode_out_b);
cb(cur, "conformer.pre_encode.out", -1);
}
// pos_emb
cb(pos_emb, "pos_emb", -1);
for (int il = 0; il < hparams.n_layer; il++) {
const auto & layer = model.layers[il];
auto * residual = cur;
cb(cur, "layer.in", il);
// feed_forward1
cur = build_norm(cur, layer.ff_norm_w, layer.ff_norm_b, NORM_TYPE_NORMAL, 1e-5, il);
cb(cur, "conformer.layers.{}.norm_feed_forward1", il);
cur = build_ffn(cur, layer.ff_up_w, layer.ff_up_b, nullptr, nullptr, layer.ff_down_w, layer.ff_down_b, FFN_SILU,
il);
cb(cur, "conformer.layers.{}.feed_forward1.linear2", il);
const auto fc_factor = 0.5f;
residual = ggml_add(ctx0, residual, ggml_scale(ctx0, cur, fc_factor));
// self-attention
{
cur = build_norm(residual, layer.ln_1_w, layer.ln_1_b, NORM_TYPE_NORMAL, 1e-5, il);
cb(cur, "conformer.layers.{}.norm_self_att", il);
ggml_tensor * Qcur = ggml_mul_mat(ctx0, layer.q_w, cur);
Qcur = ggml_add(ctx0, Qcur, layer.q_b);
Qcur = ggml_reshape_3d(ctx0, Qcur, d_head, n_head, Qcur->ne[1]);
ggml_tensor * Q_bias_u = ggml_add(ctx0, Qcur, layer.pos_bias_u);
Q_bias_u = ggml_permute(ctx0, Q_bias_u, 0, 2, 1, 3);
ggml_tensor * Q_bias_v = ggml_add(ctx0, Qcur, layer.pos_bias_v);
Q_bias_v = ggml_permute(ctx0, Q_bias_v, 0, 2, 1, 3);
// TODO @ngxson : some cont can/should be removed when ggml_mul_mat support these cases
ggml_tensor * Kcur = ggml_mul_mat(ctx0, layer.k_w, cur);
Kcur = ggml_add(ctx0, Kcur, layer.k_b);
Kcur = ggml_reshape_3d(ctx0, Kcur, d_head, n_head, Kcur->ne[1]);
Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 0, 2, 1, 3));
ggml_tensor * Vcur = ggml_mul_mat(ctx0, layer.v_w, cur);
Vcur = ggml_add(ctx0, Vcur, layer.v_b);
Vcur = ggml_reshape_3d(ctx0, Vcur, d_head, n_head, Vcur->ne[1]);
Vcur = ggml_cont(ctx0, ggml_permute(ctx0, Vcur, 1, 2, 0, 3));
// build_attn won't fit due to matrix_ac and matrix_bd separation
ggml_tensor * matrix_ac = ggml_mul_mat(ctx0, Q_bias_u, Kcur);
matrix_ac = ggml_cont(ctx0, ggml_permute(ctx0, matrix_ac, 1, 0, 2, 3));
cb(matrix_ac, "conformer.layers.{}.self_attn.id3", il);
auto * p = ggml_mul_mat(ctx0, layer.linear_pos_w, pos_emb);
cb(p, "conformer.layers.{}.self_attn.linear_pos", il);
p = ggml_reshape_3d(ctx0, p, d_head, n_head, p->ne[1]);
p = ggml_permute(ctx0, p, 0, 2, 1, 3);
auto * matrix_bd = ggml_mul_mat(ctx0, Q_bias_v, p);
matrix_bd = ggml_cont(ctx0, ggml_permute(ctx0, matrix_bd, 1, 0, 2, 3));
// rel shift
{
const auto pos_len = matrix_bd->ne[0];
const auto q_len = matrix_bd->ne[1];
const auto h = matrix_bd->ne[2];
matrix_bd = ggml_pad(ctx0, matrix_bd, 1, 0, 0, 0);
matrix_bd = ggml_roll(ctx0, matrix_bd, 1, 0, 0, 0);
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, q_len, pos_len + 1, h);
matrix_bd = ggml_view_3d(ctx0, matrix_bd, q_len, pos_len, h, matrix_bd->nb[1],
matrix_bd->nb[2], matrix_bd->nb[0] * q_len);
matrix_bd = ggml_cont_3d(ctx0, matrix_bd, pos_len, q_len, h);
}
matrix_bd = ggml_view_3d(ctx0, matrix_bd, matrix_ac->ne[0], matrix_bd->ne[1],
matrix_bd->ne[2], matrix_bd->nb[1], matrix_bd->nb[2], 0);
auto * scores = ggml_add(ctx0, matrix_ac, matrix_bd);
scores = ggml_scale(ctx0, scores, 1.0f / std::sqrt(d_head));
cb(scores, "conformer.layers.{}.self_attn.id0", il);
ggml_tensor * attn = ggml_soft_max(ctx0, scores);
ggml_tensor * x = ggml_mul_mat(ctx0, attn, Vcur);
x = ggml_permute(ctx0, x, 2, 0, 1, 3);
x = ggml_cont_2d(ctx0, x, x->ne[0] * x->ne[1], x->ne[2]);
ggml_tensor * out = ggml_mul_mat(ctx0, layer.o_w, x);
out = ggml_add(ctx0, out, layer.o_b);
cb(out, "conformer.layers.{}.self_attn.linear_out", il);
cur = out;
}
residual = ggml_add(ctx0, residual, cur);
cur = build_norm(residual, layer.norm_conv_w, layer.norm_conv_b, NORM_TYPE_NORMAL, 1e-5, il);
cb(cur, "conformer.layers.{}.norm_conv", il);
// conv
{
auto * x = cur;
x = ggml_mul_mat(ctx0, layer.conv_pw1_w, x);
x = ggml_add(ctx0, x, layer.conv_pw1_b);
cb(x, "conformer.layers.{}.conv.pointwise_conv1", il);
// ggml_glu doesn't support sigmoid
// TODO @ngxson : support this ops in ggml
{
int64_t d = x->ne[0] / 2;
ggml_tensor * gate = ggml_sigmoid(ctx0, ggml_view_2d(ctx0, x, d, x->ne[1], x->nb[1], d * x->nb[0]));
x = ggml_mul(ctx0, ggml_view_2d(ctx0, x, d, x->ne[1], x->nb[1], 0), gate);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
}
// use ggml_ssm_conv for f32 precision
x = ggml_pad(ctx0, x, 4, 0, 0, 0);
x = ggml_roll(ctx0, x, 4, 0, 0, 0);
x = ggml_pad(ctx0, x, 4, 0, 0, 0);
x = ggml_ssm_conv(ctx0, x, layer.conv_dw_w);
x = ggml_add(ctx0, x, layer.conv_dw_b);
x = ggml_add(ctx0, ggml_mul(ctx0, x, layer.conv_norm_w), layer.conv_norm_b);
x = ggml_silu(ctx0, x);
// pointwise_conv2
x = ggml_mul_mat(ctx0, layer.conv_pw2_w, x);
x = ggml_add(ctx0, x, layer.conv_pw2_b);
cur = x;
}
residual = ggml_add(ctx0, residual, cur);
cur = build_norm(residual, layer.ff_norm_1_w, layer.ff_norm_1_b, NORM_TYPE_NORMAL, 1e-5, il);
cb(cur, "conformer.layers.{}.norm_feed_forward2", il);
cur = build_ffn(cur, layer.ff_up_1_w, layer.ff_up_1_b, nullptr, nullptr, layer.ff_down_1_w, layer.ff_down_1_b,
FFN_SILU, il); // TODO(tarek): read activation for ffn from hparams
cb(cur, "conformer.layers.{}.feed_forward2.linear2", il);
residual = ggml_add(ctx0, residual, ggml_scale(ctx0, cur, fc_factor));
cb(residual, "conformer.layers.{}.conv.id", il);
cur = build_norm(residual, layer.ln_2_w, layer.ln_2_b, NORM_TYPE_NORMAL, 1e-5, il);
cb(cur, "conformer.layers.{}.norm_out", il);
}
// audio adapter
cur = build_norm(cur, model.mm_0_w, model.mm_0_b, NORM_TYPE_NORMAL, 1e-5, -1);
cb(cur, "audio_adapter.model.{}", 0);
cur = build_ffn(cur, model.mm_1_w, model.mm_1_b, nullptr, nullptr, model.mm_3_w, model.mm_3_b, FFN_GELU_ERF, -1);
cb(cur, "projected", -1);
ggml_build_forward_expand(gf, cur);
return gf;
}
+5
View File
@@ -57,6 +57,11 @@ struct clip_graph_whisper_enc : clip_graph {
ggml_cgraph * build() override;
};
struct clip_graph_conformer : clip_graph {
clip_graph_conformer(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph(ctx, img) {}
ggml_cgraph * build() override;
};
struct clip_graph_glm4v : clip_graph {
clip_graph_glm4v(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph(ctx, img) {}
ggml_cgraph * build() override;
+53
View File
@@ -535,3 +535,56 @@ bool mtmd_audio_preprocessor_whisper::preprocess(
return true;
}
//
// mtmd_audio_preprocessor_conformer
//
void mtmd_audio_preprocessor_conformer::initialize() {
g_cache.fill_sin_cos_table(hparams.audio_n_fft);
g_cache.fill_hann_window(hparams.audio_window_len, true);
g_cache.fill_mel_filterbank_matrix(
hparams.n_mel_bins,
hparams.audio_n_fft,
hparams.audio_sample_rate);
}
bool mtmd_audio_preprocessor_conformer::preprocess(
const float * samples,
size_t n_samples,
std::vector<mtmd_audio_mel> & output) {
// empty audio
if (n_samples == 0) {
return false;
}
filter_params params;
params.n_mel = hparams.n_mel_bins;
params.n_fft_bins = 1 + (hparams.audio_n_fft / 2);
params.hann_window_size = hparams.audio_window_len;
params.hop_length = hparams.audio_hop_len;
params.sample_rate = hparams.audio_sample_rate;
params.center_padding = true;
params.preemph = 0.97f;
params.use_natural_log = true;
params.norm_per_feature = true;
// make sure the global cache is initialized
GGML_ASSERT(!g_cache.sin_vals.empty());
GGML_ASSERT(!g_cache.cos_vals.empty());
GGML_ASSERT(!g_cache.filters.data.empty());
mtmd_audio_mel out_full;
bool ok = log_mel_spectrogram(
samples,
n_samples,
4, // n_threads
params,
out_full);
if (!ok) {
return false;
}
output.push_back(std::move(out_full));
return true;
}
+6
View File
@@ -32,3 +32,9 @@ struct mtmd_audio_preprocessor_whisper : mtmd_audio_preprocessor {
void initialize() override;
bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) override;
};
struct mtmd_audio_preprocessor_conformer : mtmd_audio_preprocessor {
mtmd_audio_preprocessor_conformer(const clip_ctx * ctx) : mtmd_audio_preprocessor(ctx) {}
void initialize() override;
bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) override;
};
+19
View File
@@ -309,9 +309,24 @@ int main(int argc, char ** argv) {
if (g_is_interrupted) return 130;
auto eval_system_prompt_if_present = [&] {
if (params.system_prompt.empty()) {
return 0;
}
common_chat_msg msg;
msg.role = "system";
msg.content = params.system_prompt;
return eval_message(ctx, msg);
};
LOG_WRN("WARN: This is an experimental CLI for testing multimodal capability.\n");
LOG_WRN(" For normal use cases, please use the standard llama-cli\n");
if (eval_system_prompt_if_present()) {
return 1;
}
if (is_single_turn) {
g_is_generating = true;
if (params.prompt.find(mtmd_default_marker()) == std::string::npos) {
@@ -321,6 +336,7 @@ int main(int argc, char ** argv) {
params.prompt = mtmd_default_marker() + params.prompt;
}
}
common_chat_msg msg;
msg.role = "user";
msg.content = params.prompt;
@@ -369,6 +385,9 @@ int main(int argc, char ** argv) {
ctx.n_past = 0;
ctx.chat_history.clear();
llama_memory_clear(llama_get_memory(ctx.lctx), true);
if (eval_system_prompt_if_present()) {
return 1;
}
LOG("Chat history cleared\n\n");
continue;
}
+3
View File
@@ -332,6 +332,9 @@ struct mtmd_context {
case PROJECTOR_TYPE_GLMA:
audio_preproc = std::make_unique<mtmd_audio_preprocessor_whisper>(ctx_a);
break;
case PROJECTOR_TYPE_LFM2A:
audio_preproc = std::make_unique<mtmd_audio_preprocessor_conformer>(ctx_a);
break;
default:
GGML_ABORT("unsupported audio projector type");
}
+1
View File
@@ -84,6 +84,7 @@ add_test_vision "ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0"
add_test_audio "ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0"
add_test_audio "ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M"
add_test_audio "ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M"
add_test_audio "ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0"
# to test the big models, run: ./tests.sh big
if [ "$RUN_BIG_TESTS" = true ]; then
+27 -13
View File
@@ -75,9 +75,9 @@ For the ful list of features, please refer to [server's changelog](https://githu
| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
| `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
| `--list-devices` | print list of available devices and exit |
| `--override-tensor, -ot <tensor name pattern>=<buffer type>,...` | override tensor buffer type |
| `--cpu-moe, -cmoe` | keep all Mixture of Experts (MoE) weights in the CPU<br/>(env: LLAMA_ARG_CPU_MOE) |
| `--n-cpu-moe, -ncmoe N` | keep the Mixture of Experts (MoE) weights of the first N layers in the CPU<br/>(env: LLAMA_ARG_N_CPU_MOE) |
| `-ot, --override-tensor <tensor name pattern>=<buffer type>,...` | override tensor buffer type |
| `-cmoe, --cpu-moe` | keep all Mixture of Experts (MoE) weights in the CPU<br/>(env: LLAMA_ARG_CPU_MOE) |
| `-ncmoe, --n-cpu-moe N` | keep the Mixture of Experts (MoE) weights of the first N layers in the CPU<br/>(env: LLAMA_ARG_N_CPU_MOE) |
| `-ngl, --gpu-layers, --n-gpu-layers N` | max. number of layers to store in VRAM (default: -1)<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
| `-sm, --split-mode {none,layer,row}` | how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
| `-ts, --tensor-split N0,N1,N2,...` | fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
@@ -120,7 +120,7 @@ For the ful list of features, please refer to [server's changelog](https://githu
| -------- | ----------- |
| `--samplers SAMPLERS` | samplers that will be used for generation in the order, separated by ';'<br/>(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature) |
| `-s, --seed SEED` | RNG seed (default: -1, use random seed for -1) |
| `--sampling-seq, --sampler-seq SEQUENCE` | simplified sequence for samplers that will be used (default: edskypmxt) |
| `--sampler-seq, --sampling-seq SEQUENCE` | simplified sequence for samplers that will be used (default: edskypmxt) |
| `--ignore-eos` | ignore end of stream token and continue generating (implies --logit-bias EOS-inf) |
| `--temp N` | temperature (default: 0.8) |
| `--top-k N` | top-k sampling (default: 40, 0 = disabled)<br/>(env: LLAMA_ARG_TOP_K) |
@@ -156,8 +156,8 @@ For the ful list of features, please refer to [server's changelog](https://githu
| Argument | Explanation |
| -------- | ----------- |
| `--ctx-checkpoints, --swa-checkpoints N` | max number of context checkpoints to create per slot (default: 8)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)<br/>(env: LLAMA_ARG_CTX_CHECKPOINTS) |
| `--cache-ram, -cram N` | set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
| `--kv-unified, -kvu` | use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
| `-cram, --cache-ram N` | set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
| `-kvu, --kv-unified` | use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
| `--context-shift, --no-context-shift` | whether to use context shift on infinite text generation (default: disabled)<br/>(env: LLAMA_ARG_CONTEXT_SHIFT) |
| `-r, --reverse-prompt PROMPT` | halt generation at PROMPT, return control in interactive mode<br/> |
| `-sp, --special` | special tokens output enabled (default: false) |
@@ -172,9 +172,9 @@ For the ful list of features, please refer to [server's changelog](https://githu
| `--mmproj-offload, --no-mmproj-offload` | whether to enable GPU offloading for multimodal projector (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_OFFLOAD) |
| `--image-min-tokens N` | minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MIN_TOKENS) |
| `--image-max-tokens N` | maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MAX_TOKENS) |
| `--override-tensor-draft, -otd <tensor name pattern>=<buffer type>,...` | override tensor buffer type for draft model |
| `--cpu-moe-draft, -cmoed` | keep all Mixture of Experts (MoE) weights in the CPU for the draft model<br/>(env: LLAMA_ARG_CPU_MOE_DRAFT) |
| `--n-cpu-moe-draft, -ncmoed N` | keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model<br/>(env: LLAMA_ARG_N_CPU_MOE_DRAFT) |
| `-otd, --override-tensor-draft <tensor name pattern>=<buffer type>,...` | override tensor buffer type for draft model |
| `-cmoed, --cpu-moe-draft` | keep all Mixture of Experts (MoE) weights in the CPU for the draft model<br/>(env: LLAMA_ARG_CPU_MOE_DRAFT) |
| `-ncmoed, --n-cpu-moe-draft N` | keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model<br/>(env: LLAMA_ARG_N_CPU_MOE_DRAFT) |
| `-a, --alias STRING` | set alias for model name (to be used by REST API)<br/>(env: LLAMA_ARG_ALIAS) |
| `--host HOST` | ip address to listen, or bind to an UNIX socket if the address ends with .sock (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
| `--port PORT` | port to listen (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
@@ -184,7 +184,7 @@ For the ful list of features, please refer to [server's changelog](https://githu
| `--webui-config-file PATH` | JSON file that provides default WebUI settings (overrides WebUI defaults)<br/>(env: LLAMA_ARG_WEBUI_CONFIG_FILE) |
| `--webui, --no-webui` | whether to enable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_WEBUI) |
| `--embedding, --embeddings` | restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
| `--reranking, --rerank` | enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
| `--rerank, --reranking` | enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
| `--api-key KEY` | API key to use for authentication (default: none)<br/>(env: LLAMA_API_KEY) |
| `--api-key-file FNAME` | path to file containing API keys (default: none) |
| `--ssl-key-file FNAME` | path to file a PEM-encoded SSL private key<br/>(env: LLAMA_ARG_SSL_KEY_FILE) |
@@ -212,7 +212,7 @@ For the ful list of features, please refer to [server's changelog](https://githu
| `--lora-init-without-apply` | load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
| `-td, --threads-draft N` | number of threads to use during generation (default: same as --threads) |
| `-tbd, --threads-batch-draft N` | number of threads to use during batch and prompt processing (default: same as --threads-draft) |
| `--draft-max, --draft, --draft-n N` | number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |
| `--draft, --draft-n, --draft-max N` | number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |
| `--draft-min, --draft-n-min N` | minimum number of draft tokens to use for speculative decoding (default: 0)<br/>(env: LLAMA_ARG_DRAFT_MIN) |
| `--draft-p-min P` | minimum speculative decoding probability (greedy) (default: 0.8)<br/>(env: LLAMA_ARG_DRAFT_P_MIN) |
| `-cd, --ctx-size-draft N` | size of the prompt context for the draft model (default: 0, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE_DRAFT) |
@@ -1443,6 +1443,12 @@ Example:
```ini
version = 1
; (Optional) This section provides global settings shared across all presets.
; If the same key is defined in a specific preset, it will override the value in this global section.
[*]
c = 8192
n-gpu-layer = 8
; If the key corresponds to an existing model on the server,
; this will be used as the default config for that model
[ggml-org/MY-MODEL-GGUF:Q8_0]
@@ -1462,12 +1468,20 @@ model-draft = ./my-models/draft.gguf
model-draft = /Users/abc/my-models/draft.gguf
; If the key does NOT correspond to an existing model,
; you need to specify at least the model path
; you need to specify at least the model path or HF repo
[custom_model]
model = /Users/abc/my-awesome-model-Q4_K_M.gguf
```
Note: some arguments are controlled by router (e.g., host, port, API key, HF repo, model alias). They will be removed or overwritten upload loading.
Note: some arguments are controlled by router (e.g., host, port, API key, HF repo, model alias). They will be removed or overwritten upon loading.
The precedence rule for preset options is as follows:
1. **Command-line arguments** passed to `llama-server` (highest priority)
2. **Model-specific options** defined in the preset file (e.g. `[ggml-org/MY-MODEL...]`)
3. **Global options** defined in the preset file (`[*]`)
We also offer additional options that are exclusive to presets (these aren't treated as command-line arguments):
- `load-on-startup` (boolean): Controls whether the model loads automatically when the server starts
### Routing requests
Binary file not shown.
+18 -4
View File
@@ -1974,19 +1974,33 @@ struct server_context_impl {
if (!slot.can_split()) {
if (slot.task->n_tokens() > n_ubatch) {
send_error(slot, "input is too large to process. increase the physical batch size", ERROR_TYPE_SERVER);
send_error(slot,
string_format(
"input (%d tokens) is too large to process. increase the physical batch "
"size (current batch size: %d)",
slot.task->n_tokens(), n_ubatch),
ERROR_TYPE_SERVER);
slot.release();
continue;
}
if (slot.task->n_tokens() > slot.n_ctx) {
send_error(slot, "input is larger than the max context size. skipping", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
send_error(
slot,
string_format(
"input (%d tokens) is larger than the max context size (%d tokens). skipping",
slot.task->n_tokens(), slot.n_ctx),
ERROR_TYPE_EXCEED_CONTEXT_SIZE);
slot.release();
continue;
}
} else {
if (slot.task->n_tokens() >= slot.n_ctx) {
send_error(slot, "the request exceeds the available context size, try increasing it", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
send_error(slot,
string_format("request (%d tokens) exceeds the available context size (%d "
"tokens), try increasing it",
slot.task->n_tokens(), slot.n_ctx),
ERROR_TYPE_EXCEED_CONTEXT_SIZE);
slot.release();
continue;
}
@@ -2614,7 +2628,7 @@ struct server_context_impl {
}
}
SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) slot.drafted.size(), slot.prompt.n_tokens());
SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) n_draft, slot.prompt.n_tokens());
}
}
+116 -231
View File
@@ -82,154 +82,30 @@ static std::filesystem::path get_server_exec_path() {
#endif
}
struct local_model {
std::string name;
std::string path;
std::string path_mmproj;
};
static std::vector<local_model> list_local_models(const std::string & dir) {
if (!std::filesystem::exists(dir) || !std::filesystem::is_directory(dir)) {
throw std::runtime_error(string_format("error: '%s' does not exist or is not a directory\n", dir.c_str()));
}
std::vector<local_model> models;
auto scan_subdir = [&models](const std::string & subdir_path, const std::string & name) {
auto files = fs_list(subdir_path, false);
common_file_info model_file;
common_file_info first_shard_file;
common_file_info mmproj_file;
for (const auto & file : files) {
if (string_ends_with(file.name, ".gguf")) {
if (file.name.find("mmproj") != std::string::npos) {
mmproj_file = file;
} else if (file.name.find("-00001-of-") != std::string::npos) {
first_shard_file = file;
} else {
model_file = file;
}
}
}
// single file model
local_model model{
/* name */ name,
/* path */ first_shard_file.path.empty() ? model_file.path : first_shard_file.path,
/* path_mmproj */ mmproj_file.path // can be empty
};
if (!model.path.empty()) {
models.push_back(model);
}
};
auto files = fs_list(dir, true);
for (const auto & file : files) {
if (file.is_dir) {
scan_subdir(file.path, file.name);
} else if (string_ends_with(file.name, ".gguf")) {
// single file model
std::string name = file.name;
string_replace_all(name, ".gguf", "");
local_model model{
/* name */ name,
/* path */ file.path,
/* path_mmproj */ ""
};
models.push_back(model);
}
}
return models;
}
//
// server_presets
//
server_presets::server_presets(int argc, char ** argv, common_params & base_params, const std::string & presets_path)
: ctx_params(common_params_parser_init(base_params, LLAMA_EXAMPLE_SERVER)) {
if (!presets_path.empty()) {
presets = common_presets_load(presets_path, ctx_params);
SRV_INF("Loaded %zu presets from %s\n", presets.size(), presets_path.c_str());
}
// populate reserved args (will be appended by the router)
for (auto & opt : ctx_params.options) {
if (opt.env == nullptr) {
continue;
}
std::string env = opt.env;
if (env == "LLAMA_ARG_PORT" ||
env == "LLAMA_ARG_HOST" ||
env == "LLAMA_ARG_ALIAS" ||
env == "LLAMA_ARG_API_KEY" ||
env == "LLAMA_ARG_MODELS_DIR" ||
env == "LLAMA_ARG_MODELS_MAX" ||
env == "LLAMA_ARG_MODELS_PRESET" ||
env == "LLAMA_ARG_MODEL" ||
env == "LLAMA_ARG_MMPROJ" ||
env == "LLAMA_ARG_HF_REPO" ||
env == "LLAMA_ARG_NO_MODELS_AUTOLOAD" ||
env == "LLAMA_ARG_SSL_KEY_FILE" ||
env == "LLAMA_ARG_SSL_CERT_FILE") {
control_args[env] = opt;
}
}
// read base args from router's argv
common_params_to_map(argc, argv, LLAMA_EXAMPLE_SERVER, base_args);
// remove any router-controlled args from base_args
for (const auto & cargs : control_args) {
auto it = base_args.find(cargs.second);
if (it != base_args.end()) {
base_args.erase(it);
}
static void unset_reserved_args(common_preset & preset, bool unset_model_args) {
preset.unset_option("LLAMA_ARG_SSL_KEY_FILE");
preset.unset_option("LLAMA_ARG_SSL_CERT_FILE");
preset.unset_option("LLAMA_API_KEY");
preset.unset_option("LLAMA_ARG_MODELS_DIR");
preset.unset_option("LLAMA_ARG_MODELS_MAX");
preset.unset_option("LLAMA_ARG_MODELS_PRESET");
preset.unset_option("LLAMA_ARG_MODELS_AUTOLOAD");
if (unset_model_args) {
preset.unset_option("LLAMA_ARG_MODEL");
preset.unset_option("LLAMA_ARG_MMPROJ");
preset.unset_option("LLAMA_ARG_HF_REPO");
}
}
common_preset server_presets::get_preset(const std::string & name) {
auto it = presets.find(name);
if (it != presets.end()) {
return it->second;
}
return common_preset();
}
void server_presets::render_args(server_model_meta & meta) {
common_preset preset = meta.preset; // copy
// merging 3 kinds of args:
// 1. model-specific args (from preset)
// force removing control args if any
for (auto & cargs : control_args) {
if (preset.options.find(cargs.second) != preset.options.end()) {
SRV_WRN("Preset '%s' contains reserved arg '%s', removing it\n", preset.name.c_str(), cargs.second.args[0]);
preset.options.erase(cargs.second);
}
}
// 2. base args (from router)
// inherit from base args
for (const auto & [arg, value] : base_args) {
preset.options[arg] = value;
}
// 3. control args (from router)
// set control values
preset.options[control_args["LLAMA_ARG_HOST"]] = CHILD_ADDR;
preset.options[control_args["LLAMA_ARG_PORT"]] = std::to_string(meta.port);
preset.options[control_args["LLAMA_ARG_ALIAS"]] = meta.name;
if (meta.in_cache) {
preset.options[control_args["LLAMA_ARG_HF_REPO"]] = meta.name;
} else {
preset.options[control_args["LLAMA_ARG_MODEL"]] = meta.path;
if (!meta.path_mmproj.empty()) {
preset.options[control_args["LLAMA_ARG_MMPROJ"]] = meta.path_mmproj;
}
}
// disable SSL for child processes (HTTPS already handled by router)
preset.options[control_args["LLAMA_ARG_SSL_KEY_FILE"]] = "";
preset.options[control_args["LLAMA_ARG_SSL_CERT_FILE"]] = "";
meta.args = preset.to_args();
// add back the binary path at the front
meta.args.insert(meta.args.begin(), get_server_exec_path().string());
void server_model_meta::update_args(common_preset_context & ctx_preset, std::string bin_path) {
// update params
unset_reserved_args(preset, false);
preset.set_option(ctx_preset, "LLAMA_ARG_HOST", CHILD_ADDR);
preset.set_option(ctx_preset, "LLAMA_ARG_PORT", std::to_string(port));
preset.set_option(ctx_preset, "LLAMA_ARG_ALIAS", name);
// TODO: maybe validate preset before rendering ?
// render args
args = preset.to_args(bin_path);
}
//
@@ -240,20 +116,22 @@ server_models::server_models(
const common_params & params,
int argc,
char ** argv,
char ** envp) : base_params(params), presets(argc, argv, base_params, params.models_preset) {
for (int i = 0; i < argc; i++) {
base_args.push_back(std::string(argv[i]));
}
char ** envp)
: ctx_preset(LLAMA_EXAMPLE_SERVER),
base_params(params),
base_preset(ctx_preset.load_from_args(argc, argv)) {
for (char ** env = envp; *env != nullptr; env++) {
base_env.push_back(std::string(*env));
}
GGML_ASSERT(!base_args.empty());
// clean up base preset
unset_reserved_args(base_preset, true);
// set binary path
try {
base_args[0] = get_server_exec_path().string();
bin_path = get_server_exec_path().string();
} catch (const std::exception & e) {
bin_path = argv[0];
LOG_WRN("failed to get server executable path: %s\n", e.what());
LOG_WRN("using original argv[0] as fallback: %s\n", base_args[0].c_str());
LOG_WRN("using original argv[0] as fallback: %s\n", argv[0]);
}
load_models();
}
@@ -262,7 +140,7 @@ void server_models::add_model(server_model_meta && meta) {
if (mapping.find(meta.name) != mapping.end()) {
throw std::runtime_error(string_format("model '%s' appears multiple times", meta.name.c_str()));
}
presets.render_args(meta); // populate meta.args
meta.update_args(ctx_preset, bin_path); // render args
std::string name = meta.name;
mapping[name] = instance_t{
/* subproc */ std::make_shared<subprocess_s>(),
@@ -271,86 +149,62 @@ void server_models::add_model(server_model_meta && meta) {
};
}
static std::vector<local_model> list_custom_path_models(server_presets & presets) {
// detect any custom-path models in presets
std::vector<local_model> custom_models;
for (auto & [model_name, preset] : presets.presets) {
local_model model;
model.name = model_name;
std::vector<common_arg> to_erase;
for (auto & [arg, value] : preset.options) {
std::string env(arg.env ? arg.env : "");
if (env == "LLAMA_ARG_MODEL") {
model.path = value;
to_erase.push_back(arg);
}
if (env == "LLAMA_ARG_MMPROJ") {
model.path_mmproj = value;
to_erase.push_back(arg);
}
}
for (auto & arg : to_erase) {
preset.options.erase(arg);
}
if (!model.name.empty() && !model.path.empty()) {
custom_models.push_back(model);
}
}
return custom_models;
}
// TODO: allow refreshing cached model list
void server_models::load_models() {
// loading models from 3 sources:
// 1. cached models
auto cached_models = common_list_cached_models();
for (const auto & model : cached_models) {
server_model_meta meta{
/* preset */ presets.get_preset(model.to_string()),
/* name */ model.to_string(),
/* path */ model.manifest_path,
/* path_mmproj */ "", // auto-detected when loading
/* in_cache */ true,
/* port */ 0,
/* status */ SERVER_MODEL_STATUS_UNLOADED,
/* last_used */ 0,
/* args */ std::vector<std::string>(),
/* exit_code */ 0
};
add_model(std::move(meta));
}
// 2. local models specificed via --models-dir
common_presets cached_models = ctx_preset.load_from_cache();
SRV_INF("Loaded %zu cached model presets\n", cached_models.size());
// 2. local models from --models-dir
common_presets local_models;
if (!base_params.models_dir.empty()) {
auto local_models = list_local_models(base_params.models_dir);
for (const auto & model : local_models) {
if (mapping.find(model.name) != mapping.end()) {
// already exists in cached models, skip
continue;
}
server_model_meta meta{
/* preset */ presets.get_preset(model.name),
/* name */ model.name,
/* path */ model.path,
/* path_mmproj */ model.path_mmproj,
/* in_cache */ false,
/* port */ 0,
/* status */ SERVER_MODEL_STATUS_UNLOADED,
/* last_used */ 0,
/* args */ std::vector<std::string>(),
/* exit_code */ 0
};
add_model(std::move(meta));
local_models = ctx_preset.load_from_models_dir(base_params.models_dir);
SRV_INF("Loaded %zu local model presets from %s\n", local_models.size(), base_params.models_dir.c_str());
}
// 3. custom-path models from presets
common_preset global = {};
common_presets custom_presets = {};
if (!base_params.models_preset.empty()) {
custom_presets = ctx_preset.load_from_ini(base_params.models_preset, global);
SRV_INF("Loaded %zu custom model presets from %s\n", custom_presets.size(), base_params.models_preset.c_str());
}
// cascade, apply global preset first
cached_models = ctx_preset.cascade(global, cached_models);
local_models = ctx_preset.cascade(global, local_models);
custom_presets = ctx_preset.cascade(global, custom_presets);
// note: if a model exists in both cached and local, local takes precedence
common_presets final_presets;
for (const auto & [name, preset] : cached_models) {
final_presets[name] = preset;
}
for (const auto & [name, preset] : local_models) {
final_presets[name] = preset;
}
// process custom presets from INI
for (const auto & [name, custom] : custom_presets) {
if (final_presets.find(name) != final_presets.end()) {
// apply custom config if exists
common_preset & target = final_presets[name];
target.merge(custom);
} else {
// otherwise add directly
final_presets[name] = custom;
}
}
// 3. custom-path models specified in presets
auto custom_models = list_custom_path_models(presets);
for (const auto & model : custom_models) {
// server base preset from CLI args take highest precedence
for (auto & [name, preset] : final_presets) {
preset.merge(base_preset);
}
// convert presets to server_model_meta and add to mapping
for (const auto & preset : final_presets) {
server_model_meta meta{
/* preset */ presets.get_preset(model.name),
/* name */ model.name,
/* path */ model.path,
/* path_mmproj */ model.path_mmproj,
/* in_cache */ false,
/* preset */ preset.second,
/* name */ preset.first,
/* port */ 0,
/* status */ SERVER_MODEL_STATUS_UNLOADED,
/* last_used */ 0,
@@ -359,10 +213,38 @@ void server_models::load_models() {
};
add_model(std::move(meta));
}
// log available models
SRV_INF("Available models (%zu) (*: custom preset)\n", mapping.size());
{
std::unordered_set<std::string> custom_names;
for (const auto & [name, preset] : custom_presets) {
custom_names.insert(name);
}
SRV_INF("Available models (%zu) (*: custom preset)\n", mapping.size());
for (const auto & [name, inst] : mapping) {
bool has_custom = custom_names.find(name) != custom_names.end();
SRV_INF(" %c %s\n", has_custom ? '*' : ' ', name.c_str());
}
}
// load any autoload models
std::vector<std::string> models_to_load;
for (const auto & [name, inst] : mapping) {
SRV_INF(" %c %s\n", inst.meta.preset.name.empty() ? ' ' : '*', name.c_str());
std::string val;
if (inst.meta.preset.get_option(COMMON_ARG_PRESET_LOAD_ON_STARTUP, val)) {
models_to_load.push_back(name);
}
}
if ((int)models_to_load.size() > base_params.models_max) {
throw std::runtime_error(string_format(
"number of models to load on startup (%zu) exceeds models_max (%d)",
models_to_load.size(),
base_params.models_max
));
}
for (const auto & name : models_to_load) {
SRV_INF("(startup) loading model %s\n", name.c_str());
load(name);
}
}
@@ -526,7 +408,7 @@ void server_models::load(const std::string & name) {
{
SRV_INF("spawning server instance with name=%s on port %d\n", inst.meta.name.c_str(), inst.meta.port);
presets.render_args(inst.meta); // update meta.args
inst.meta.update_args(ctx_preset, bin_path); // render args
std::vector<std::string> child_args = inst.meta.args; // copy
std::vector<std::string> child_env = base_env; // copy
@@ -877,7 +759,12 @@ void server_models_routes::init_routes() {
{"args", meta.args},
};
if (!meta.preset.name.empty()) {
status["preset"] = meta.preset.to_ini();
common_preset preset_copy = meta.preset;
unset_reserved_args(preset_copy, false);
preset_copy.unset_option("LLAMA_ARG_HOST");
preset_copy.unset_option("LLAMA_ARG_PORT");
preset_copy.unset_option("LLAMA_ARG_ALIAS");
status["preset"] = preset_copy.to_ini();
}
if (meta.is_failed()) {
status["exit_code"] = meta.exit_code;
@@ -888,8 +775,6 @@ void server_models_routes::init_routes() {
{"object", "model"}, // for OAI-compat
{"owned_by", "llamacpp"}, // for OAI-compat
{"created", t}, // for OAI-compat
{"in_cache", meta.in_cache},
{"path", meta.path},
{"status", status},
// TODO: add other fields, may require reading GGUF metadata
});
+14 -25
View File
@@ -51,9 +51,6 @@ static std::string server_model_status_to_string(server_model_status status) {
struct server_model_meta {
common_preset preset;
std::string name;
std::string path;
std::string path_mmproj; // only available if in_cache=false
bool in_cache = false; // if true, use -hf; use -m otherwise
int port = 0;
server_model_status status = SERVER_MODEL_STATUS_UNLOADED;
int64_t last_used = 0; // for LRU unloading
@@ -67,19 +64,8 @@ struct server_model_meta {
bool is_failed() const {
return status == SERVER_MODEL_STATUS_UNLOADED && exit_code != 0;
}
};
// the server_presets struct holds the presets read from presets.ini
// as well as base args from the router server
struct server_presets {
common_presets presets;
common_params_context ctx_params;
std::map<common_arg, std::string> base_args;
std::map<std::string, common_arg> control_args; // args reserved for server control
server_presets(int argc, char ** argv, common_params & base_params, const std::string & models_dir);
common_preset get_preset(const std::string & name);
void render_args(server_model_meta & meta);
void update_args(common_preset_context & ctx_presets, std::string bin_path);
};
struct subprocess_s;
@@ -97,11 +83,12 @@ private:
std::condition_variable cv;
std::map<std::string, instance_t> mapping;
common_params base_params;
std::vector<std::string> base_args;
std::vector<std::string> base_env;
common_preset_context ctx_preset;
server_presets presets;
common_params base_params;
std::string bin_path;
std::vector<std::string> base_env;
common_preset base_preset; // base preset from llama-server CLI args
void update_meta(const std::string & name, const server_model_meta & meta);
@@ -116,27 +103,29 @@ public:
void load_models();
// check if a model instance exists
// check if a model instance exists (thread-safe)
bool has_model(const std::string & name);
// return a copy of model metadata
// return a copy of model metadata (thread-safe)
std::optional<server_model_meta> get_meta(const std::string & name);
// return a copy of all model metadata
// return a copy of all model metadata (thread-safe)
std::vector<server_model_meta> get_all_meta();
// load and unload model instances
// these functions are thread-safe
void load(const std::string & name);
void unload(const std::string & name);
void unload_all();
// update the status of a model instance
// update the status of a model instance (thread-safe)
void update_status(const std::string & name, server_model_status status);
// wait until the model instance is fully loaded
// wait until the model instance is fully loaded (thread-safe)
// return when the model is loaded or failed to load
void wait_until_loaded(const std::string & name);
// load the model if not loaded, otherwise do nothing
// load the model if not loaded, otherwise do nothing (thread-safe)
// return false if model is already loaded; return true otherwise (meta may need to be refreshed)
bool ensure_model_loaded(const std::string & name);
@@ -11,6 +11,8 @@ flowchart TB
C_Screen["ChatScreen"]
C_Form["ChatForm"]
C_Messages["ChatMessages"]
C_Message["ChatMessage"]
C_MessageEditForm["ChatMessageEditForm"]
C_ModelsSelector["ModelsSelector"]
C_Settings["ChatSettings"]
end
@@ -54,7 +56,9 @@ flowchart TB
%% Component hierarchy
C_Screen --> C_Form & C_Messages & C_Settings
C_Form & C_Messages --> C_ModelsSelector
C_Messages --> C_Message
C_Message --> C_MessageEditForm
C_Form & C_MessageEditForm --> C_ModelsSelector
%% Components → Hooks → Stores
C_Form & C_Messages --> H1 & H2
@@ -93,7 +97,7 @@ flowchart TB
classDef apiStyle fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
class R1,R2,RL routeStyle
class C_Sidebar,C_Screen,C_Form,C_Messages,C_ModelsSelector,C_Settings componentStyle
class C_Sidebar,C_Screen,C_Form,C_Messages,C_Message,C_MessageEditForm,C_ModelsSelector,C_Settings componentStyle
class H1,H2 hookStyle
class S1,S2,S3,S4,S5 storeStyle
class SV1,SV2,SV3,SV4,SV5 serviceStyle
@@ -16,6 +16,8 @@ end
C_Form["ChatForm"]
C_Messages["ChatMessages"]
C_Message["ChatMessage"]
C_MessageUser["ChatMessageUser"]
C_MessageEditForm["ChatMessageEditForm"]
C_Attach["ChatAttachments"]
C_ModelsSelector["ModelsSelector"]
C_Settings["ChatSettings"]
@@ -38,7 +40,7 @@ end
S1Error["<b>Error Handling:</b><br/>showErrorDialog()<br/>dismissErrorDialog()<br/>isAbortError()"]
S1Msg["<b>Message Operations:</b><br/>addMessage()<br/>sendMessage()<br/>updateMessage()<br/>deleteMessage()<br/>getDeletionInfo()"]
S1Regen["<b>Regeneration:</b><br/>regenerateMessage()<br/>regenerateMessageWithBranching()<br/>continueAssistantMessage()"]
S1Edit["<b>Editing:</b><br/>editAssistantMessage()<br/>editUserMessagePreserveResponses()<br/>editMessageWithBranching()"]
S1Edit["<b>Editing:</b><br/>editAssistantMessage()<br/>editUserMessagePreserveResponses()<br/>editMessageWithBranching()<br/>clearEditMode()<br/>isEditModeActive()<br/>getAddFilesHandler()<br/>setEditModeActive()"]
S1Utils["<b>Utilities:</b><br/>getApiOptions()<br/>parseTimingData()<br/>getOrCreateAbortController()<br/>getConversationModel()"]
end
subgraph S2["conversationsStore"]
@@ -88,6 +90,10 @@ end
RE7["getChatStreaming()"]
RE8["getAllLoadingChats()"]
RE9["getAllStreamingChats()"]
RE9a["isEditModeActive()"]
RE9b["getAddFilesHandler()"]
RE9c["setEditModeActive()"]
RE9d["clearEditMode()"]
end
subgraph ConvExports["conversationsStore"]
RE10["conversations()"]
@@ -182,7 +188,10 @@ end
%% Component hierarchy
C_Screen --> C_Form & C_Messages & C_Settings
C_Messages --> C_Message
C_Message --> C_ModelsSelector
C_Message --> C_MessageUser
C_MessageUser --> C_MessageEditForm
C_MessageEditForm --> C_ModelsSelector
C_MessageEditForm --> C_Attach
C_Form --> C_ModelsSelector
C_Form --> C_Attach
C_Message --> C_Attach
@@ -190,6 +199,7 @@ end
%% Components use Hooks
C_Form --> H1
C_Message --> H1 & H2
C_MessageEditForm --> H1
C_Screen --> H2
%% Hooks use Stores
@@ -244,7 +254,7 @@ end
classDef apiStyle fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
class R1,R2,RL routeStyle
class C_Sidebar,C_Screen,C_Form,C_Messages,C_Message componentStyle
class C_Sidebar,C_Screen,C_Form,C_Messages,C_Message,C_MessageUser,C_MessageEditForm componentStyle
class C_ModelsSelector,C_Settings componentStyle
class C_Attach componentStyle
class H1,H2,H3 methodStyle
+10 -10
View File
@@ -25,7 +25,7 @@
"@chromatic-com/storybook": "^4.1.2",
"@eslint/compat": "^1.2.5",
"@eslint/js": "^9.18.0",
"@internationalized/date": "^3.8.2",
"@internationalized/date": "^3.10.1",
"@lucide/svelte": "^0.515.0",
"@playwright/test": "^1.49.1",
"@storybook/addon-a11y": "^10.0.7",
@@ -862,9 +862,9 @@
}
},
"node_modules/@internationalized/date": {
"version": "3.8.2",
"resolved": "https://registry.npmjs.org/@internationalized/date/-/date-3.8.2.tgz",
"integrity": "sha512-/wENk7CbvLbkUvX1tu0mwq49CVkkWpkXubGel6birjRPyo6uQ4nQpnq5xZu823zRCwwn82zgHrvgF1vZyvmVgA==",
"version": "3.10.1",
"resolved": "https://registry.npmjs.org/@internationalized/date/-/date-3.10.1.tgz",
"integrity": "sha512-oJrXtQiAXLvT9clCf1K4kxp3eKsQhIaZqxEyowkBcsvZDdZkbWrVmnGknxs5flTD0VGsxrxKgBCZty1EzoiMzA==",
"dev": true,
"license": "Apache-2.0",
"dependencies": {
@@ -2109,9 +2109,9 @@
}
},
"node_modules/@sveltejs/kit": {
"version": "2.48.5",
"resolved": "https://registry.npmjs.org/@sveltejs/kit/-/kit-2.48.5.tgz",
"integrity": "sha512-/rnwfSWS3qwUSzvHynUTORF9xSJi7PCR9yXkxUOnRrNqyKmCmh3FPHH+E9BbgqxXfTevGXBqgnlh9kMb+9T5XA==",
"version": "2.49.2",
"resolved": "https://registry.npmjs.org/@sveltejs/kit/-/kit-2.49.2.tgz",
"integrity": "sha512-Vp3zX/qlwerQmHMP6x0Ry1oY7eKKRcOWGc2P59srOp4zcqyn+etJyQpELgOi4+ZSUgteX8Y387NuwruLgGXLUQ==",
"dev": true,
"license": "MIT",
"dependencies": {
@@ -5797,9 +5797,9 @@
}
},
"node_modules/mdast-util-to-hast": {
"version": "13.2.0",
"resolved": "https://registry.npmjs.org/mdast-util-to-hast/-/mdast-util-to-hast-13.2.0.tgz",
"integrity": "sha512-QGYKEuUsYT9ykKBCMOEDLsU5JRObWQusAolFMeko/tYPufNkRffBAQjIE+99jbA87xv6FgmjLtwjh9wBWajwAA==",
"version": "13.2.1",
"resolved": "https://registry.npmjs.org/mdast-util-to-hast/-/mdast-util-to-hast-13.2.1.tgz",
"integrity": "sha512-cctsq2wp5vTsLIcaymblUriiTcZd0CwWtCbLvrOzYCDZoWyMNV8sZ7krj09FSnsiJi3WVsHLM4k6Dq/yaPyCXA==",
"license": "MIT",
"dependencies": {
"@types/hast": "^3.0.0",
+1 -1
View File
@@ -26,7 +26,7 @@
"@chromatic-com/storybook": "^4.1.2",
"@eslint/compat": "^1.2.5",
"@eslint/js": "^9.18.0",
"@internationalized/date": "^3.8.2",
"@internationalized/date": "^3.10.1",
"@lucide/svelte": "^0.515.0",
"@playwright/test": "^1.49.1",
"@storybook/addon-a11y": "^10.0.7",
+7
View File
@@ -124,3 +124,10 @@ declare global {
SettingsConfigType
};
}
declare global {
interface Window {
idxThemeStyle?: number;
idxCodeBlock?: number;
}
}
@@ -8,6 +8,7 @@
ChatFormTextarea
} from '$lib/components/app';
import { INPUT_CLASSES } from '$lib/constants/input-classes';
import { SETTING_CONFIG_DEFAULT } from '$lib/constants/settings-config';
import { config } from '$lib/stores/settings.svelte';
import { modelsStore, modelOptions, selectedModelId } from '$lib/stores/models.svelte';
import { isRouterMode } from '$lib/stores/server.svelte';
@@ -66,7 +67,7 @@
let message = $state('');
let pasteLongTextToFileLength = $derived.by(() => {
const n = Number(currentConfig.pasteLongTextToFileLen);
return Number.isNaN(n) ? 2500 : n;
return Number.isNaN(n) ? Number(SETTING_CONFIG_DEFAULT.pasteLongTextToFileLen) : n;
});
let previousIsLoading = $state(isLoading);
let recordingSupported = $state(false);
@@ -12,13 +12,21 @@
onCopy?: (message: DatabaseMessage) => void;
onContinueAssistantMessage?: (message: DatabaseMessage) => void;
onDelete?: (message: DatabaseMessage) => void;
onEditWithBranching?: (message: DatabaseMessage, newContent: string) => void;
onEditWithBranching?: (
message: DatabaseMessage,
newContent: string,
newExtras?: DatabaseMessageExtra[]
) => void;
onEditWithReplacement?: (
message: DatabaseMessage,
newContent: string,
shouldBranch: boolean
) => void;
onEditUserMessagePreserveResponses?: (message: DatabaseMessage, newContent: string) => void;
onEditUserMessagePreserveResponses?: (
message: DatabaseMessage,
newContent: string,
newExtras?: DatabaseMessageExtra[]
) => void;
onNavigateToSibling?: (siblingId: string) => void;
onRegenerateWithBranching?: (message: DatabaseMessage, modelOverride?: string) => void;
siblingInfo?: ChatMessageSiblingInfo | null;
@@ -45,6 +53,8 @@
messageTypes: string[];
} | null>(null);
let editedContent = $state(message.content);
let editedExtras = $state<DatabaseMessageExtra[]>(message.extra ? [...message.extra] : []);
let editedUploadedFiles = $state<ChatUploadedFile[]>([]);
let isEditing = $state(false);
let showDeleteDialog = $state(false);
let shouldBranchAfterEdit = $state(false);
@@ -85,6 +95,16 @@
function handleCancelEdit() {
isEditing = false;
editedContent = message.content;
editedExtras = message.extra ? [...message.extra] : [];
editedUploadedFiles = [];
}
function handleEditedExtrasChange(extras: DatabaseMessageExtra[]) {
editedExtras = extras;
}
function handleEditedUploadedFilesChange(files: ChatUploadedFile[]) {
editedUploadedFiles = files;
}
async function handleCopy() {
@@ -107,6 +127,8 @@
function handleEdit() {
isEditing = true;
editedContent = message.content;
editedExtras = message.extra ? [...message.extra] : [];
editedUploadedFiles = [];
setTimeout(() => {
if (textareaElement) {
@@ -143,9 +165,10 @@
onContinueAssistantMessage?.(message);
}
function handleSaveEdit() {
async function handleSaveEdit() {
if (message.role === 'user' || message.role === 'system') {
onEditWithBranching?.(message, editedContent.trim());
const finalExtras = await getMergedExtras();
onEditWithBranching?.(message, editedContent.trim(), finalExtras);
} else {
// For assistant messages, preserve exact content including trailing whitespace
// This is important for the Continue feature to work properly
@@ -154,15 +177,30 @@
isEditing = false;
shouldBranchAfterEdit = false;
editedUploadedFiles = [];
}
function handleSaveEditOnly() {
async function handleSaveEditOnly() {
if (message.role === 'user') {
// For user messages, trim to avoid accidental whitespace
onEditUserMessagePreserveResponses?.(message, editedContent.trim());
const finalExtras = await getMergedExtras();
onEditUserMessagePreserveResponses?.(message, editedContent.trim(), finalExtras);
}
isEditing = false;
editedUploadedFiles = [];
}
async function getMergedExtras(): Promise<DatabaseMessageExtra[]> {
if (editedUploadedFiles.length === 0) {
return editedExtras;
}
const { parseFilesToMessageExtras } = await import('$lib/utils/browser-only');
const result = await parseFilesToMessageExtras(editedUploadedFiles);
const newExtras = result?.extras || [];
return [...editedExtras, ...newExtras];
}
function handleShowDeleteDialogChange(show: boolean) {
@@ -197,6 +235,8 @@
class={className}
{deletionInfo}
{editedContent}
{editedExtras}
{editedUploadedFiles}
{isEditing}
{message}
onCancelEdit={handleCancelEdit}
@@ -206,6 +246,8 @@
onEdit={handleEdit}
onEditKeydown={handleEditKeydown}
onEditedContentChange={handleEditedContentChange}
onEditedExtrasChange={handleEditedExtrasChange}
onEditedUploadedFilesChange={handleEditedUploadedFilesChange}
{onNavigateToSibling}
onSaveEdit={handleSaveEdit}
onSaveEditOnly={handleSaveEditOnly}
@@ -244,7 +244,7 @@
<div class="info my-6 grid gap-4">
{#if displayedModel()}
<span class="inline-flex flex-wrap items-center gap-2 text-xs text-muted-foreground">
<div class="inline-flex flex-wrap items-start gap-2 text-xs text-muted-foreground">
{#if isRouter}
<ModelsSelector
currentModel={displayedModel()}
@@ -258,11 +258,13 @@
{#if currentConfig.showMessageStats && message.timings && message.timings.predicted_n && message.timings.predicted_ms}
<ChatMessageStatistics
promptTokens={message.timings.prompt_n}
promptMs={message.timings.prompt_ms}
predictedTokens={message.timings.predicted_n}
predictedMs={message.timings.predicted_ms}
/>
{/if}
</span>
</div>
{/if}
{#if config().showToolCalls}
@@ -0,0 +1,391 @@
<script lang="ts">
import { X, ArrowUp, Paperclip, AlertTriangle } from '@lucide/svelte';
import { Button } from '$lib/components/ui/button';
import { Switch } from '$lib/components/ui/switch';
import { ChatAttachmentsList, DialogConfirmation, ModelsSelector } from '$lib/components/app';
import { INPUT_CLASSES } from '$lib/constants/input-classes';
import { SETTING_CONFIG_DEFAULT } from '$lib/constants/settings-config';
import { AttachmentType, FileTypeCategory, MimeTypeText } from '$lib/enums';
import { config } from '$lib/stores/settings.svelte';
import { useModelChangeValidation } from '$lib/hooks/use-model-change-validation.svelte';
import { setEditModeActive, clearEditMode } from '$lib/stores/chat.svelte';
import { conversationsStore } from '$lib/stores/conversations.svelte';
import { modelsStore } from '$lib/stores/models.svelte';
import { isRouterMode } from '$lib/stores/server.svelte';
import {
autoResizeTextarea,
getFileTypeCategory,
getFileTypeCategoryByExtension,
parseClipboardContent
} from '$lib/utils';
interface Props {
messageId: string;
editedContent: string;
editedExtras?: DatabaseMessageExtra[];
editedUploadedFiles?: ChatUploadedFile[];
originalContent: string;
originalExtras?: DatabaseMessageExtra[];
showSaveOnlyOption?: boolean;
onCancelEdit: () => void;
onSaveEdit: () => void;
onSaveEditOnly?: () => void;
onEditKeydown: (event: KeyboardEvent) => void;
onEditedContentChange: (content: string) => void;
onEditedExtrasChange?: (extras: DatabaseMessageExtra[]) => void;
onEditedUploadedFilesChange?: (files: ChatUploadedFile[]) => void;
textareaElement?: HTMLTextAreaElement;
}
let {
messageId,
editedContent,
editedExtras = [],
editedUploadedFiles = [],
originalContent,
originalExtras = [],
showSaveOnlyOption = false,
onCancelEdit,
onSaveEdit,
onSaveEditOnly,
onEditKeydown,
onEditedContentChange,
onEditedExtrasChange,
onEditedUploadedFilesChange,
textareaElement = $bindable()
}: Props = $props();
let fileInputElement: HTMLInputElement | undefined = $state();
let saveWithoutRegenerate = $state(false);
let showDiscardDialog = $state(false);
let isRouter = $derived(isRouterMode());
let currentConfig = $derived(config());
let pasteLongTextToFileLength = $derived.by(() => {
const n = Number(currentConfig.pasteLongTextToFileLen);
return Number.isNaN(n) ? Number(SETTING_CONFIG_DEFAULT.pasteLongTextToFileLen) : n;
});
let hasUnsavedChanges = $derived.by(() => {
if (editedContent !== originalContent) return true;
if (editedUploadedFiles.length > 0) return true;
const extrasChanged =
editedExtras.length !== originalExtras.length ||
editedExtras.some((extra, i) => extra !== originalExtras[i]);
if (extrasChanged) return true;
return false;
});
let hasAttachments = $derived(
(editedExtras && editedExtras.length > 0) ||
(editedUploadedFiles && editedUploadedFiles.length > 0)
);
let canSubmit = $derived(editedContent.trim().length > 0 || hasAttachments);
function getEditedAttachmentsModalities(): ModelModalities {
const modalities: ModelModalities = { vision: false, audio: false };
for (const extra of editedExtras) {
if (extra.type === AttachmentType.IMAGE) {
modalities.vision = true;
}
if (
extra.type === AttachmentType.PDF &&
'processedAsImages' in extra &&
extra.processedAsImages
) {
modalities.vision = true;
}
if (extra.type === AttachmentType.AUDIO) {
modalities.audio = true;
}
}
for (const file of editedUploadedFiles) {
const category = getFileTypeCategory(file.type) || getFileTypeCategoryByExtension(file.name);
if (category === FileTypeCategory.IMAGE) {
modalities.vision = true;
}
if (category === FileTypeCategory.AUDIO) {
modalities.audio = true;
}
}
return modalities;
}
function getRequiredModalities(): ModelModalities {
const beforeModalities = conversationsStore.getModalitiesUpToMessage(messageId);
const editedModalities = getEditedAttachmentsModalities();
return {
vision: beforeModalities.vision || editedModalities.vision,
audio: beforeModalities.audio || editedModalities.audio
};
}
const { handleModelChange } = useModelChangeValidation({
getRequiredModalities,
onValidationFailure: async (previousModelId) => {
if (previousModelId) {
await modelsStore.selectModelById(previousModelId);
}
}
});
function handleFileInputChange(event: Event) {
const input = event.target as HTMLInputElement;
if (!input.files || input.files.length === 0) return;
const files = Array.from(input.files);
processNewFiles(files);
input.value = '';
}
function handleGlobalKeydown(event: KeyboardEvent) {
if (event.key === 'Escape') {
event.preventDefault();
attemptCancel();
}
}
function attemptCancel() {
if (hasUnsavedChanges) {
showDiscardDialog = true;
} else {
onCancelEdit();
}
}
function handleRemoveExistingAttachment(index: number) {
if (!onEditedExtrasChange) return;
const newExtras = [...editedExtras];
newExtras.splice(index, 1);
onEditedExtrasChange(newExtras);
}
function handleRemoveUploadedFile(fileId: string) {
if (!onEditedUploadedFilesChange) return;
const newFiles = editedUploadedFiles.filter((f) => f.id !== fileId);
onEditedUploadedFilesChange(newFiles);
}
function handleSubmit() {
if (!canSubmit) return;
if (saveWithoutRegenerate && onSaveEditOnly) {
onSaveEditOnly();
} else {
onSaveEdit();
}
saveWithoutRegenerate = false;
}
async function processNewFiles(files: File[]) {
if (!onEditedUploadedFilesChange) return;
const { processFilesToChatUploaded } = await import('$lib/utils/browser-only');
const processed = await processFilesToChatUploaded(files);
onEditedUploadedFilesChange([...editedUploadedFiles, ...processed]);
}
function handlePaste(event: ClipboardEvent) {
if (!event.clipboardData) return;
const files = Array.from(event.clipboardData.items)
.filter((item) => item.kind === 'file')
.map((item) => item.getAsFile())
.filter((file): file is File => file !== null);
if (files.length > 0) {
event.preventDefault();
processNewFiles(files);
return;
}
const text = event.clipboardData.getData(MimeTypeText.PLAIN);
if (text.startsWith('"')) {
const parsed = parseClipboardContent(text);
if (parsed.textAttachments.length > 0) {
event.preventDefault();
onEditedContentChange(parsed.message);
const attachmentFiles = parsed.textAttachments.map(
(att) =>
new File([att.content], att.name, {
type: MimeTypeText.PLAIN
})
);
processNewFiles(attachmentFiles);
setTimeout(() => {
textareaElement?.focus();
}, 10);
return;
}
}
if (
text.length > 0 &&
pasteLongTextToFileLength > 0 &&
text.length > pasteLongTextToFileLength
) {
event.preventDefault();
const textFile = new File([text], 'Pasted', {
type: MimeTypeText.PLAIN
});
processNewFiles([textFile]);
}
}
$effect(() => {
if (textareaElement) {
autoResizeTextarea(textareaElement);
}
});
$effect(() => {
setEditModeActive(processNewFiles);
return () => {
clearEditMode();
};
});
</script>
<svelte:window onkeydown={handleGlobalKeydown} />
<input
bind:this={fileInputElement}
type="file"
multiple
class="hidden"
onchange={handleFileInputChange}
/>
<div
class="{INPUT_CLASSES} w-full max-w-[80%] overflow-hidden rounded-3xl backdrop-blur-md"
data-slot="edit-form"
>
<ChatAttachmentsList
attachments={editedExtras}
uploadedFiles={editedUploadedFiles}
readonly={false}
onFileRemove={(fileId) => {
if (fileId.startsWith('attachment-')) {
const index = parseInt(fileId.replace('attachment-', ''), 10);
if (!isNaN(index) && index >= 0 && index < editedExtras.length) {
handleRemoveExistingAttachment(index);
}
} else {
handleRemoveUploadedFile(fileId);
}
}}
limitToSingleRow
class="py-5"
style="scroll-padding: 1rem;"
/>
<div class="relative min-h-[48px] px-5 py-3">
<textarea
bind:this={textareaElement}
bind:value={editedContent}
class="field-sizing-content max-h-80 min-h-10 w-full resize-none bg-transparent text-sm outline-none"
onkeydown={onEditKeydown}
oninput={(e) => {
autoResizeTextarea(e.currentTarget);
onEditedContentChange(e.currentTarget.value);
}}
onpaste={handlePaste}
placeholder="Edit your message..."
></textarea>
<div class="flex w-full items-center gap-3" style="container-type: inline-size">
<Button
class="h-8 w-8 shrink-0 rounded-full bg-transparent p-0 text-muted-foreground hover:bg-foreground/10 hover:text-foreground"
onclick={() => fileInputElement?.click()}
type="button"
title="Add attachment"
>
<span class="sr-only">Attach files</span>
<Paperclip class="h-4 w-4" />
</Button>
<div class="flex-1"></div>
{#if isRouter}
<ModelsSelector
forceForegroundText={true}
useGlobalSelection={true}
onModelChange={handleModelChange}
/>
{/if}
<Button
class="h-8 w-8 shrink-0 rounded-full p-0"
onclick={handleSubmit}
disabled={!canSubmit}
type="button"
title={saveWithoutRegenerate ? 'Save changes' : 'Send and regenerate'}
>
<span class="sr-only">{saveWithoutRegenerate ? 'Save' : 'Send'}</span>
<ArrowUp class="h-5 w-5" />
</Button>
</div>
</div>
</div>
<div class="mt-2 flex w-full max-w-[80%] items-center justify-between">
{#if showSaveOnlyOption && onSaveEditOnly}
<div class="flex items-center gap-2">
<Switch id="save-only-switch" bind:checked={saveWithoutRegenerate} class="scale-75" />
<label for="save-only-switch" class="cursor-pointer text-xs text-muted-foreground">
Update without re-sending
</label>
</div>
{:else}
<div></div>
{/if}
<Button class="h-7 px-3 text-xs" onclick={attemptCancel} size="sm" variant="ghost">
<X class="mr-1 h-3 w-3" />
Cancel
</Button>
</div>
<DialogConfirmation
bind:open={showDiscardDialog}
title="Discard changes?"
description="You have unsaved changes. Are you sure you want to discard them?"
confirmText="Discard"
cancelText="Keep editing"
variant="destructive"
icon={AlertTriangle}
onConfirm={onCancelEdit}
onCancel={() => (showDiscardDialog = false)}
/>
@@ -1,20 +1,122 @@
<script lang="ts">
import { Clock, Gauge, WholeWord } from '@lucide/svelte';
import { Clock, Gauge, WholeWord, BookOpenText, Sparkles } from '@lucide/svelte';
import { BadgeChatStatistic } from '$lib/components/app';
import * as Tooltip from '$lib/components/ui/tooltip';
import { ChatMessageStatsView } from '$lib/enums';
interface Props {
predictedTokens: number;
predictedMs: number;
promptTokens?: number;
promptMs?: number;
}
let { predictedTokens, predictedMs }: Props = $props();
let { predictedTokens, predictedMs, promptTokens, promptMs }: Props = $props();
let activeView: ChatMessageStatsView = $state(ChatMessageStatsView.GENERATION);
let tokensPerSecond = $derived((predictedTokens / predictedMs) * 1000);
let timeInSeconds = $derived((predictedMs / 1000).toFixed(2));
let promptTokensPerSecond = $derived(
promptTokens !== undefined && promptMs !== undefined
? (promptTokens / promptMs) * 1000
: undefined
);
let promptTimeInSeconds = $derived(
promptMs !== undefined ? (promptMs / 1000).toFixed(2) : undefined
);
let hasPromptStats = $derived(
promptTokens !== undefined &&
promptMs !== undefined &&
promptTokensPerSecond !== undefined &&
promptTimeInSeconds !== undefined
);
</script>
<BadgeChatStatistic icon={WholeWord} value="{predictedTokens} tokens" />
<div class="inline-flex items-center text-xs text-muted-foreground">
<div class="inline-flex items-center rounded-sm bg-muted-foreground/15 p-0.5">
{#if hasPromptStats}
<Tooltip.Root>
<Tooltip.Trigger>
<button
type="button"
class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
ChatMessageStatsView.READING
? 'bg-background text-foreground shadow-sm'
: 'hover:text-foreground'}"
onclick={() => (activeView = ChatMessageStatsView.READING)}
>
<BookOpenText class="h-3 w-3" />
<span class="sr-only">Reading</span>
</button>
</Tooltip.Trigger>
<Tooltip.Content>
<p>Reading (prompt processing)</p>
</Tooltip.Content>
</Tooltip.Root>
{/if}
<Tooltip.Root>
<Tooltip.Trigger>
<button
type="button"
class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
ChatMessageStatsView.GENERATION
? 'bg-background text-foreground shadow-sm'
: 'hover:text-foreground'}"
onclick={() => (activeView = ChatMessageStatsView.GENERATION)}
>
<Sparkles class="h-3 w-3" />
<span class="sr-only">Generation</span>
</button>
</Tooltip.Trigger>
<Tooltip.Content>
<p>Generation (token output)</p>
</Tooltip.Content>
</Tooltip.Root>
</div>
<BadgeChatStatistic icon={Clock} value="{timeInSeconds}s" />
<BadgeChatStatistic icon={Gauge} value="{tokensPerSecond.toFixed(2)} tokens/s" />
<div class="flex items-center gap-1 px-2">
{#if activeView === ChatMessageStatsView.GENERATION}
<BadgeChatStatistic
class="bg-transparent"
icon={WholeWord}
value="{predictedTokens} tokens"
tooltipLabel="Generated tokens"
/>
<BadgeChatStatistic
class="bg-transparent"
icon={Clock}
value="{timeInSeconds}s"
tooltipLabel="Generation time"
/>
<BadgeChatStatistic
class="bg-transparent"
icon={Gauge}
value="{tokensPerSecond.toFixed(2)} tokens/s"
tooltipLabel="Generation speed"
/>
{:else if hasPromptStats}
<BadgeChatStatistic
class="bg-transparent"
icon={WholeWord}
value="{promptTokens} tokens"
tooltipLabel="Prompt tokens"
/>
<BadgeChatStatistic
class="bg-transparent"
icon={Clock}
value="{promptTimeInSeconds}s"
tooltipLabel="Prompt processing time"
/>
<BadgeChatStatistic
class="bg-transparent"
icon={Gauge}
value="{promptTokensPerSecond!.toFixed(2)} tokens/s"
tooltipLabel="Prompt processing speed"
/>
{/if}
</div>
</div>
@@ -1,18 +1,17 @@
<script lang="ts">
import { Check, X, Send } from '@lucide/svelte';
import { Card } from '$lib/components/ui/card';
import { Button } from '$lib/components/ui/button';
import { ChatAttachmentsList, MarkdownContent } from '$lib/components/app';
import { INPUT_CLASSES } from '$lib/constants/input-classes';
import { config } from '$lib/stores/settings.svelte';
import { autoResizeTextarea } from '$lib/utils';
import ChatMessageActions from './ChatMessageActions.svelte';
import ChatMessageEditForm from './ChatMessageEditForm.svelte';
interface Props {
class?: string;
message: DatabaseMessage;
isEditing: boolean;
editedContent: string;
editedExtras?: DatabaseMessageExtra[];
editedUploadedFiles?: ChatUploadedFile[];
siblingInfo?: ChatMessageSiblingInfo | null;
showDeleteDialog: boolean;
deletionInfo: {
@@ -26,6 +25,8 @@
onSaveEditOnly?: () => void;
onEditKeydown: (event: KeyboardEvent) => void;
onEditedContentChange: (content: string) => void;
onEditedExtrasChange?: (extras: DatabaseMessageExtra[]) => void;
onEditedUploadedFilesChange?: (files: ChatUploadedFile[]) => void;
onCopy: () => void;
onEdit: () => void;
onDelete: () => void;
@@ -40,6 +41,8 @@
message,
isEditing,
editedContent,
editedExtras = [],
editedUploadedFiles = [],
siblingInfo = null,
showDeleteDialog,
deletionInfo,
@@ -48,6 +51,8 @@
onSaveEditOnly,
onEditKeydown,
onEditedContentChange,
onEditedExtrasChange,
onEditedUploadedFilesChange,
onCopy,
onEdit,
onDelete,
@@ -61,12 +66,6 @@
let messageElement: HTMLElement | undefined = $state();
const currentConfig = config();
$effect(() => {
if (isEditing && textareaElement) {
autoResizeTextarea(textareaElement);
}
});
$effect(() => {
if (!messageElement || !message.content.trim()) return;
@@ -98,44 +97,23 @@
role="group"
>
{#if isEditing}
<div class="w-full max-w-[80%]">
<textarea
bind:this={textareaElement}
bind:value={editedContent}
class="min-h-[60px] w-full resize-none rounded-2xl px-3 py-2 text-sm {INPUT_CLASSES}"
onkeydown={onEditKeydown}
oninput={(e) => {
autoResizeTextarea(e.currentTarget);
onEditedContentChange(e.currentTarget.value);
}}
placeholder="Edit your message..."
></textarea>
<div class="mt-2 flex justify-end gap-2">
<Button class="h-8 px-3" onclick={onCancelEdit} size="sm" variant="ghost">
<X class="mr-1 h-3 w-3" />
Cancel
</Button>
{#if onSaveEditOnly}
<Button
class="h-8 px-3"
onclick={onSaveEditOnly}
disabled={!editedContent.trim()}
size="sm"
variant="outline"
>
<Check class="mr-1 h-3 w-3" />
Save
</Button>
{/if}
<Button class="h-8 px-3" onclick={onSaveEdit} disabled={!editedContent.trim()} size="sm">
<Send class="mr-1 h-3 w-3" />
Send
</Button>
</div>
</div>
<ChatMessageEditForm
bind:textareaElement
messageId={message.id}
{editedContent}
{editedExtras}
{editedUploadedFiles}
originalContent={message.content}
originalExtras={message.extra}
showSaveOnlyOption={!!onSaveEditOnly}
{onCancelEdit}
{onSaveEdit}
{onSaveEditOnly}
{onEditKeydown}
{onEditedContentChange}
{onEditedExtrasChange}
{onEditedUploadedFilesChange}
/>
{:else}
{#if message.extra && message.extra.length > 0}
<div class="mb-2 max-w-[80%]">
@@ -66,10 +66,14 @@
await conversationsStore.navigateToSibling(siblingId);
}
async function handleEditWithBranching(message: DatabaseMessage, newContent: string) {
async function handleEditWithBranching(
message: DatabaseMessage,
newContent: string,
newExtras?: DatabaseMessageExtra[]
) {
onUserAction?.();
await chatStore.editMessageWithBranching(message.id, newContent);
await chatStore.editMessageWithBranching(message.id, newContent, newExtras);
refreshAllMessages();
}
@@ -104,11 +108,12 @@
async function handleEditUserMessagePreserveResponses(
message: DatabaseMessage,
newContent: string
newContent: string,
newExtras?: DatabaseMessageExtra[]
) {
onUserAction?.();
await chatStore.editUserMessagePreserveResponses(message.id, newContent);
await chatStore.editUserMessagePreserveResponses(message.id, newContent, newExtras);
refreshAllMessages();
}
@@ -17,7 +17,13 @@
AUTO_SCROLL_INTERVAL,
INITIAL_SCROLL_DELAY
} from '$lib/constants/auto-scroll';
import { chatStore, errorDialog, isLoading } from '$lib/stores/chat.svelte';
import {
chatStore,
errorDialog,
isLoading,
isEditing,
getAddFilesHandler
} from '$lib/stores/chat.svelte';
import {
conversationsStore,
activeMessages,
@@ -181,7 +187,18 @@
dragCounter = 0;
if (event.dataTransfer?.files) {
processFiles(Array.from(event.dataTransfer.files));
const files = Array.from(event.dataTransfer.files);
if (isEditing()) {
const handler = getAddFilesHandler();
if (handler) {
handler(files);
return;
}
}
processFiles(files);
}
}
@@ -410,7 +427,7 @@
<div class="conversation-chat-form pointer-events-auto rounded-t-3xl pb-4">
<ChatForm
disabled={hasPropsError}
disabled={hasPropsError || isEditing()}
isLoading={isCurrentConversationLoading}
onFileRemove={handleFileRemove}
onFileUpload={handleFileUpload}
@@ -587,7 +604,7 @@
&::after {
content: '';
position: fixed;
position: absolute;
bottom: 0;
z-index: -1;
left: 0;
@@ -1,5 +1,6 @@
<script lang="ts">
import { BadgeInfo } from '$lib/components/app';
import * as Tooltip from '$lib/components/ui/tooltip';
import { copyToClipboard } from '$lib/utils';
import type { Component } from 'svelte';
@@ -7,19 +8,37 @@
class?: string;
icon: Component;
value: string | number;
tooltipLabel?: string;
}
let { class: className = '', icon: Icon, value }: Props = $props();
let { class: className = '', icon: Icon, value, tooltipLabel }: Props = $props();
function handleClick() {
void copyToClipboard(String(value));
}
</script>
<BadgeInfo class={className} onclick={handleClick}>
{#snippet icon()}
<Icon class="h-3 w-3" />
{/snippet}
{#if tooltipLabel}
<Tooltip.Root>
<Tooltip.Trigger>
<BadgeInfo class={className} onclick={handleClick}>
{#snippet icon()}
<Icon class="h-3 w-3" />
{/snippet}
{value}
</BadgeInfo>
{value}
</BadgeInfo>
</Tooltip.Trigger>
<Tooltip.Content>
<p>{tooltipLabel}</p>
</Tooltip.Content>
</Tooltip.Root>
{:else}
<BadgeInfo class={className} onclick={handleClick}>
{#snippet icon()}
<Icon class="h-3 w-3" />
{/snippet}
{value}
</BadgeInfo>
{/if}
@@ -7,15 +7,19 @@
import remarkRehype from 'remark-rehype';
import rehypeKatex from 'rehype-katex';
import rehypeStringify from 'rehype-stringify';
import { copyCodeToClipboard, preprocessLaTeX } from '$lib/utils';
import { rehypeRestoreTableHtml } from '$lib/markdown/table-html-restorer';
import type { Root as HastRoot, RootContent as HastRootContent } from 'hast';
import type { Root as MdastRoot } from 'mdast';
import { browser } from '$app/environment';
import { onDestroy, tick } from 'svelte';
import { rehypeRestoreTableHtml } from '$lib/markdown/table-html-restorer';
import { rehypeEnhanceLinks } from '$lib/markdown/enhance-links';
import { rehypeEnhanceCodeBlocks } from '$lib/markdown/enhance-code-blocks';
import { remarkLiteralHtml } from '$lib/markdown/literal-html';
import { copyCodeToClipboard, preprocessLaTeX } from '$lib/utils';
import '$styles/katex-custom.scss';
import githubDarkCss from 'highlight.js/styles/github-dark.css?inline';
import githubLightCss from 'highlight.js/styles/github.css?inline';
import { mode } from 'mode-watcher';
import { remarkLiteralHtml } from '$lib/markdown/literal-html';
import CodePreviewDialog from './CodePreviewDialog.svelte';
interface Props {
@@ -23,33 +27,24 @@
class?: string;
}
interface MarkdownBlock {
id: string;
html: string;
}
let { content, class: className = '' }: Props = $props();
let containerRef = $state<HTMLDivElement>();
let processedHtml = $state('');
let renderedBlocks = $state<MarkdownBlock[]>([]);
let unstableBlockHtml = $state('');
let previewDialogOpen = $state(false);
let previewCode = $state('');
let previewLanguage = $state('text');
function loadHighlightTheme(isDark: boolean) {
if (!browser) return;
let pendingMarkdown: string | null = null;
let isProcessing = false;
const existingThemes = document.querySelectorAll('style[data-highlight-theme]');
existingThemes.forEach((style) => style.remove());
const style = document.createElement('style');
style.setAttribute('data-highlight-theme', 'true');
style.textContent = isDark ? githubDarkCss : githubLightCss;
document.head.appendChild(style);
}
$effect(() => {
const currentMode = mode.current;
const isDark = currentMode === 'dark';
loadHighlightTheme(isDark);
});
const themeStyleId = `highlight-theme-${(window.idxThemeStyle = (window.idxThemeStyle ?? 0) + 1)}`;
let processor = $derived(() => {
return remark()
@@ -61,139 +56,64 @@
.use(rehypeKatex) // Render math using KaTeX
.use(rehypeHighlight) // Add syntax highlighting
.use(rehypeRestoreTableHtml) // Restore limited HTML (e.g., <br>, <ul>) inside Markdown tables
.use(rehypeStringify); // Convert to HTML string
.use(rehypeEnhanceLinks) // Add target="_blank" to links
.use(rehypeEnhanceCodeBlocks) // Wrap code blocks with header and actions
.use(rehypeStringify, { allowDangerousHtml: true }); // Convert to HTML string
});
function enhanceLinks(html: string): string {
if (!html.includes('<a')) {
return html;
/**
* Removes click event listeners from copy and preview buttons.
* Called on component destroy.
*/
function cleanupEventListeners() {
if (!containerRef) return;
const copyButtons = containerRef.querySelectorAll<HTMLButtonElement>('.copy-code-btn');
const previewButtons = containerRef.querySelectorAll<HTMLButtonElement>('.preview-code-btn');
for (const button of copyButtons) {
button.removeEventListener('click', handleCopyClick);
}
const tempDiv = document.createElement('div');
tempDiv.innerHTML = html;
// Make all links open in new tabs
const linkElements = tempDiv.querySelectorAll('a[href]');
let mutated = false;
for (const link of linkElements) {
const target = link.getAttribute('target');
const rel = link.getAttribute('rel');
if (target !== '_blank' || rel !== 'noopener noreferrer') {
mutated = true;
}
link.setAttribute('target', '_blank');
link.setAttribute('rel', 'noopener noreferrer');
}
return mutated ? tempDiv.innerHTML : html;
}
function enhanceCodeBlocks(html: string): string {
if (!html.includes('<pre')) {
return html;
}
const tempDiv = document.createElement('div');
tempDiv.innerHTML = html;
const preElements = tempDiv.querySelectorAll('pre');
let mutated = false;
for (const [index, pre] of Array.from(preElements).entries()) {
const codeElement = pre.querySelector('code');
if (!codeElement) {
continue;
}
mutated = true;
let language = 'text';
const classList = Array.from(codeElement.classList);
for (const className of classList) {
if (className.startsWith('language-')) {
language = className.replace('language-', '');
break;
}
}
const rawCode = codeElement.textContent || '';
const codeId = `code-${Date.now()}-${index}`;
codeElement.setAttribute('data-code-id', codeId);
codeElement.setAttribute('data-raw-code', rawCode);
const wrapper = document.createElement('div');
wrapper.className = 'code-block-wrapper';
const header = document.createElement('div');
header.className = 'code-block-header';
const languageLabel = document.createElement('span');
languageLabel.className = 'code-language';
languageLabel.textContent = language;
const copyButton = document.createElement('button');
copyButton.className = 'copy-code-btn';
copyButton.setAttribute('data-code-id', codeId);
copyButton.setAttribute('title', 'Copy code');
copyButton.setAttribute('type', 'button');
copyButton.innerHTML = `
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-copy-icon lucide-copy"><rect width="14" height="14" x="8" y="8" rx="2" ry="2"/><path d="M4 16c-1.1 0-2-.9-2-2V4c0-1.1.9-2 2-2h10c1.1 0 2 .9 2 2"/></svg>
`;
const actions = document.createElement('div');
actions.className = 'code-block-actions';
actions.appendChild(copyButton);
if (language.toLowerCase() === 'html') {
const previewButton = document.createElement('button');
previewButton.className = 'preview-code-btn';
previewButton.setAttribute('data-code-id', codeId);
previewButton.setAttribute('title', 'Preview code');
previewButton.setAttribute('type', 'button');
previewButton.innerHTML = `
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-eye lucide-eye-icon"><path d="M2.062 12.345a1 1 0 0 1 0-.69C3.5 7.73 7.36 5 12 5s8.5 2.73 9.938 6.655a1 1 0 0 1 0 .69C20.5 16.27 16.64 19 12 19s-8.5-2.73-9.938-6.655"/><circle cx="12" cy="12" r="3"/></svg>
`;
actions.appendChild(previewButton);
}
header.appendChild(languageLabel);
header.appendChild(actions);
wrapper.appendChild(header);
const clonedPre = pre.cloneNode(true) as HTMLElement;
wrapper.appendChild(clonedPre);
pre.parentNode?.replaceChild(wrapper, pre);
}
return mutated ? tempDiv.innerHTML : html;
}
async function processMarkdown(text: string): Promise<string> {
try {
let normalized = preprocessLaTeX(text);
const result = await processor().process(normalized);
const html = String(result);
const enhancedLinks = enhanceLinks(html);
return enhanceCodeBlocks(enhancedLinks);
} catch (error) {
console.error('Markdown processing error:', error);
// Fallback to plain text with line breaks
return text.replace(/\n/g, '<br>');
for (const button of previewButtons) {
button.removeEventListener('click', handlePreviewClick);
}
}
/**
* Removes this component's highlight.js theme style from the document head.
* Called on component destroy to clean up injected styles.
*/
function cleanupHighlightTheme() {
if (!browser) return;
const existingTheme = document.getElementById(themeStyleId);
existingTheme?.remove();
}
/**
* Loads the appropriate highlight.js theme based on dark/light mode.
* Injects a scoped style element into the document head.
* @param isDark - Whether to load the dark theme (true) or light theme (false)
*/
function loadHighlightTheme(isDark: boolean) {
if (!browser) return;
const existingTheme = document.getElementById(themeStyleId);
existingTheme?.remove();
const style = document.createElement('style');
style.id = themeStyleId;
style.textContent = isDark ? githubDarkCss : githubLightCss;
document.head.appendChild(style);
}
/**
* Extracts code information from a button click target within a code block.
* @param target - The clicked button element
* @returns Object with rawCode and language, or null if extraction fails
*/
function getCodeInfoFromTarget(target: HTMLElement) {
const wrapper = target.closest('.code-block-wrapper');
@@ -209,12 +129,7 @@
return null;
}
const rawCode = codeElement.getAttribute('data-raw-code');
if (rawCode === null) {
console.error('No raw code found');
return null;
}
const rawCode = codeElement.textContent ?? '';
const languageLabel = wrapper.querySelector<HTMLElement>('.code-language');
const language = languageLabel?.textContent?.trim() || 'text';
@@ -222,6 +137,28 @@
return { rawCode, language };
}
/**
* Generates a unique identifier for a HAST node based on its position.
* Used for stable block identification during incremental rendering.
* @param node - The HAST root content node
* @param indexFallback - Fallback index if position is unavailable
* @returns Unique string identifier for the node
*/
function getHastNodeId(node: HastRootContent, indexFallback: number): string {
const position = node.position;
if (position?.start?.offset != null && position?.end?.offset != null) {
return `hast-${position.start.offset}-${position.end.offset}`;
}
return `${node.type}-${indexFallback}`;
}
/**
* Handles click events on copy buttons within code blocks.
* Copies the raw code content to the clipboard.
* @param event - The click event from the copy button
*/
async function handleCopyClick(event: Event) {
event.preventDefault();
event.stopPropagation();
@@ -245,6 +182,25 @@
}
}
/**
* Handles preview dialog open state changes.
* Clears preview content when dialog is closed.
* @param open - Whether the dialog is being opened or closed
*/
function handlePreviewDialogOpenChange(open: boolean) {
previewDialogOpen = open;
if (!open) {
previewCode = '';
previewLanguage = 'text';
}
}
/**
* Handles click events on preview buttons within HTML code blocks.
* Opens a preview dialog with the rendered HTML content.
* @param event - The click event from the preview button
*/
function handlePreviewClick(event: Event) {
event.preventDefault();
event.stopPropagation();
@@ -266,6 +222,61 @@
previewDialogOpen = true;
}
/**
* Processes markdown content into stable and unstable HTML blocks.
* Uses incremental rendering: stable blocks are cached, unstable block is re-rendered.
* @param markdown - The raw markdown string to process
*/
async function processMarkdown(markdown: string) {
if (!markdown) {
renderedBlocks = [];
unstableBlockHtml = '';
return;
}
const normalized = preprocessLaTeX(markdown);
const processorInstance = processor();
const ast = processorInstance.parse(normalized) as MdastRoot;
const processedRoot = (await processorInstance.run(ast)) as HastRoot;
const processedChildren = processedRoot.children ?? [];
const stableCount = Math.max(processedChildren.length - 1, 0);
const nextBlocks: MarkdownBlock[] = [];
for (let index = 0; index < stableCount; index++) {
const hastChild = processedChildren[index];
const id = getHastNodeId(hastChild, index);
const existing = renderedBlocks[index];
if (existing && existing.id === id) {
nextBlocks.push(existing);
continue;
}
const html = stringifyProcessedNode(
processorInstance,
processedRoot,
processedChildren[index]
);
nextBlocks.push({ id, html });
}
let unstableHtml = '';
if (processedChildren.length > stableCount) {
const unstableChild = processedChildren[stableCount];
unstableHtml = stringifyProcessedNode(processorInstance, processedRoot, unstableChild);
}
renderedBlocks = nextBlocks;
await tick(); // Force DOM sync before updating unstable HTML block
unstableBlockHtml = unstableHtml;
}
/**
* Attaches click event listeners to copy and preview buttons in code blocks.
* Uses data-listener-bound attribute to prevent duplicate bindings.
*/
function setupCodeBlockActions() {
if (!containerRef) return;
@@ -287,40 +298,97 @@
}
}
function handlePreviewDialogOpenChange(open: boolean) {
previewDialogOpen = open;
/**
* Converts a single HAST node to an enhanced HTML string.
* Applies link and code block enhancements to the output.
* @param processorInstance - The remark/rehype processor instance
* @param processedRoot - The full processed HAST root (for context)
* @param child - The specific HAST child node to stringify
* @returns Enhanced HTML string representation of the node
*/
function stringifyProcessedNode(
processorInstance: ReturnType<typeof processor>,
processedRoot: HastRoot,
child: unknown
) {
const root: HastRoot = {
...(processedRoot as HastRoot),
children: [child as never]
};
if (!open) {
previewCode = '';
previewLanguage = 'text';
return processorInstance.stringify(root);
}
/**
* Queues markdown for processing with coalescing support.
* Only processes the latest markdown when multiple updates arrive quickly.
* @param markdown - The markdown content to render
*/
async function updateRenderedBlocks(markdown: string) {
pendingMarkdown = markdown;
if (isProcessing) {
return;
}
isProcessing = true;
try {
while (pendingMarkdown !== null) {
const nextMarkdown = pendingMarkdown;
pendingMarkdown = null;
await processMarkdown(nextMarkdown);
}
} catch (error) {
console.error('Failed to process markdown:', error);
renderedBlocks = [];
unstableBlockHtml = markdown.replace(/\n/g, '<br>');
} finally {
isProcessing = false;
}
}
$effect(() => {
if (content) {
processMarkdown(content)
.then((result) => {
processedHtml = result;
})
.catch((error) => {
console.error('Failed to process markdown:', error);
processedHtml = content.replace(/\n/g, '<br>');
});
} else {
processedHtml = '';
}
const currentMode = mode.current;
const isDark = currentMode === 'dark';
loadHighlightTheme(isDark);
});
$effect(() => {
if (containerRef && processedHtml) {
updateRenderedBlocks(content);
});
$effect(() => {
const hasRenderedBlocks = renderedBlocks.length > 0;
const hasUnstableBlock = Boolean(unstableBlockHtml);
if ((hasRenderedBlocks || hasUnstableBlock) && containerRef) {
setupCodeBlockActions();
}
});
onDestroy(() => {
cleanupEventListeners();
cleanupHighlightTheme();
});
</script>
<div bind:this={containerRef} class={className}>
<!-- eslint-disable-next-line no-at-html-tags -->
{@html processedHtml}
{#each renderedBlocks as block (block.id)}
<div class="markdown-block" data-block-id={block.id}>
<!-- eslint-disable-next-line no-at-html-tags -->
{@html block.html}
</div>
{/each}
{#if unstableBlockHtml}
<div class="markdown-block markdown-block--unstable" data-block-id="unstable">
<!-- eslint-disable-next-line no-at-html-tags -->
{@html unstableBlockHtml}
</div>
{/if}
</div>
<CodePreviewDialog
@@ -331,6 +399,11 @@
/>
<style>
.markdown-block,
.markdown-block--unstable {
display: contents;
}
/* Base typography styles */
div :global(p:not(:last-child)) {
margin-bottom: 1rem;
@@ -0,0 +1,7 @@
import Root from './switch.svelte';
export {
Root,
//
Root as Switch
};
@@ -0,0 +1,29 @@
<script lang="ts">
import { Switch as SwitchPrimitive } from 'bits-ui';
import { cn, type WithoutChildrenOrChild } from '$lib/components/ui/utils.js';
let {
ref = $bindable(null),
class: className,
checked = $bindable(false),
...restProps
}: WithoutChildrenOrChild<SwitchPrimitive.RootProps> = $props();
</script>
<SwitchPrimitive.Root
bind:ref
bind:checked
data-slot="switch"
class={cn(
'peer inline-flex h-[1.15rem] w-8 shrink-0 items-center rounded-full border border-transparent shadow-xs transition-all outline-none focus-visible:border-ring focus-visible:ring-[3px] focus-visible:ring-ring/50 disabled:cursor-not-allowed disabled:opacity-50 data-[state=checked]:bg-primary data-[state=unchecked]:bg-input dark:data-[state=unchecked]:bg-input/80',
className
)}
{...restProps}
>
<SwitchPrimitive.Thumb
data-slot="switch-thumb"
class={cn(
'pointer-events-none block size-4 rounded-full bg-background ring-0 transition-transform data-[state=checked]:translate-x-[calc(100%-2px)] data-[state=unchecked]:translate-x-0 dark:data-[state=checked]:bg-primary-foreground dark:data-[state=unchecked]:bg-foreground'
)}
/>
</SwitchPrimitive.Root>
+4
View File
@@ -0,0 +1,4 @@
export enum ChatMessageStatsView {
GENERATION = 'generation',
READING = 'reading'
}
@@ -1,5 +1,7 @@
export { AttachmentType } from './attachment';
export { ChatMessageStatsView } from './chat';
export {
FileTypeCategory,
FileTypeImage,
@@ -0,0 +1,162 @@
/**
* Rehype plugin to enhance code blocks with wrapper, header, and action buttons.
*
* Wraps <pre><code> elements with a container that includes:
* - Language label
* - Copy button
* - Preview button (for HTML code blocks)
*
* This operates directly on the HAST tree for better performance,
* avoiding the need to stringify and re-parse HTML.
*/
import type { Plugin } from 'unified';
import type { Root, Element, ElementContent } from 'hast';
import { visit } from 'unist-util-visit';
declare global {
interface Window {
idxCodeBlock?: number;
}
}
const COPY_ICON_SVG = `<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-copy-icon lucide-copy"><rect width="14" height="14" x="8" y="8" rx="2" ry="2"/><path d="M4 16c-1.1 0-2-.9-2-2V4c0-1.1.9-2 2-2h10c1.1 0 2 .9 2 2"/></svg>`;
const PREVIEW_ICON_SVG = `<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-eye lucide-eye-icon"><path d="M2.062 12.345a1 1 0 0 1 0-.69C3.5 7.73 7.36 5 12 5s8.5 2.73 9.938 6.655a1 1 0 0 1 0 .69C20.5 16.27 16.64 19 12 19s-8.5-2.73-9.938-6.655"/><circle cx="12" cy="12" r="3"/></svg>`;
/**
* Creates an SVG element node from raw SVG string.
* Since we can't parse HTML in HAST directly, we use the raw property.
*/
function createRawHtmlElement(html: string): Element {
return {
type: 'element',
tagName: 'span',
properties: {},
children: [{ type: 'raw', value: html } as unknown as ElementContent]
};
}
function createCopyButton(codeId: string): Element {
return {
type: 'element',
tagName: 'button',
properties: {
className: ['copy-code-btn'],
'data-code-id': codeId,
title: 'Copy code',
type: 'button'
},
children: [createRawHtmlElement(COPY_ICON_SVG)]
};
}
function createPreviewButton(codeId: string): Element {
return {
type: 'element',
tagName: 'button',
properties: {
className: ['preview-code-btn'],
'data-code-id': codeId,
title: 'Preview code',
type: 'button'
},
children: [createRawHtmlElement(PREVIEW_ICON_SVG)]
};
}
function createHeader(language: string, codeId: string): Element {
const actions: Element[] = [createCopyButton(codeId)];
if (language.toLowerCase() === 'html') {
actions.push(createPreviewButton(codeId));
}
return {
type: 'element',
tagName: 'div',
properties: { className: ['code-block-header'] },
children: [
{
type: 'element',
tagName: 'span',
properties: { className: ['code-language'] },
children: [{ type: 'text', value: language }]
},
{
type: 'element',
tagName: 'div',
properties: { className: ['code-block-actions'] },
children: actions
}
]
};
}
function createWrapper(header: Element, preElement: Element): Element {
return {
type: 'element',
tagName: 'div',
properties: { className: ['code-block-wrapper'] },
children: [header, preElement]
};
}
function extractLanguage(codeElement: Element): string {
const className = codeElement.properties?.className;
if (!Array.isArray(className)) return 'text';
for (const cls of className) {
if (typeof cls === 'string' && cls.startsWith('language-')) {
return cls.replace('language-', '');
}
}
return 'text';
}
/**
* Generates a unique code block ID using a global counter.
*/
function generateCodeId(): string {
if (typeof window !== 'undefined') {
return `code-${(window.idxCodeBlock = (window.idxCodeBlock ?? 0) + 1)}`;
}
// Fallback for SSR - use timestamp + random
return `code-${Date.now()}-${Math.random().toString(36).slice(2, 7)}`;
}
/**
* Rehype plugin to enhance code blocks with wrapper, header, and action buttons.
* This plugin wraps <pre><code> elements with a container that includes:
* - Language label
* - Copy button
* - Preview button (for HTML code blocks)
*/
export const rehypeEnhanceCodeBlocks: Plugin<[], Root> = () => {
return (tree: Root) => {
visit(tree, 'element', (node: Element, index, parent) => {
if (node.tagName !== 'pre' || !parent || index === undefined) return;
const codeElement = node.children.find(
(child): child is Element => child.type === 'element' && child.tagName === 'code'
);
if (!codeElement) return;
const language = extractLanguage(codeElement);
const codeId = generateCodeId();
codeElement.properties = {
...codeElement.properties,
'data-code-id': codeId
};
const header = createHeader(language, codeId);
const wrapper = createWrapper(header, node);
// Replace pre with wrapper in parent
(parent.children as ElementContent[])[index] = wrapper;
});
};
};
@@ -0,0 +1,33 @@
/**
* Rehype plugin to enhance links with security attributes.
*
* Adds target="_blank" and rel="noopener noreferrer" to all anchor elements,
* ensuring external links open in new tabs safely.
*/
import type { Plugin } from 'unified';
import type { Root, Element } from 'hast';
import { visit } from 'unist-util-visit';
/**
* Rehype plugin that adds security attributes to all links.
* This plugin ensures external links open in new tabs safely by adding:
* - target="_blank"
* - rel="noopener noreferrer"
*/
export const rehypeEnhanceLinks: Plugin<[], Root> = () => {
return (tree: Root) => {
visit(tree, 'element', (node: Element) => {
if (node.tagName !== 'a') return;
const props = node.properties ?? {};
// Only modify if href exists
if (!props.href) return;
props.target = '_blank';
props.rel = 'noopener noreferrer';
node.properties = props;
});
};
};
+227 -170
View File
@@ -74,6 +74,8 @@ class ChatStore {
private processingStates = new SvelteMap<string, ApiProcessingState | null>();
private activeConversationId = $state<string | null>(null);
private isStreamingActive = $state(false);
private isEditModeActive = $state(false);
private addFilesHandler: ((files: File[]) => void) | null = $state(null);
// ─────────────────────────────────────────────────────────────────────────────
// Loading State
@@ -171,6 +173,7 @@ class ChatStore {
updateProcessingStateFromTimings(
timingData: {
prompt_n: number;
prompt_ms?: number;
predicted_n: number;
predicted_per_second: number;
cache_n: number;
@@ -212,6 +215,7 @@ class ChatStore {
if (message.role === 'assistant' && message.timings) {
const restoredState = this.parseTimingData({
prompt_n: message.timings.prompt_n || 0,
prompt_ms: message.timings.prompt_ms,
predicted_n: message.timings.predicted_n || 0,
predicted_per_second:
message.timings.predicted_n && message.timings.predicted_ms
@@ -282,6 +286,7 @@ class ChatStore {
private parseTimingData(timingData: Record<string, unknown>): ApiProcessingState | null {
const promptTokens = (timingData.prompt_n as number) || 0;
const promptMs = (timingData.prompt_ms as number) || undefined;
const predictedTokens = (timingData.predicted_n as number) || 0;
const tokensPerSecond = (timingData.predicted_per_second as number) || 0;
const cacheTokens = (timingData.cache_n as number) || 0;
@@ -320,6 +325,7 @@ class ChatStore {
speculative: false,
progressPercent,
promptTokens,
promptMs,
cacheTokens
};
}
@@ -536,6 +542,7 @@ class ChatStore {
this.updateProcessingStateFromTimings(
{
prompt_n: timings?.prompt_n || 0,
prompt_ms: timings?.prompt_ms,
predicted_n: timings?.predicted_n || 0,
predicted_per_second: tokensPerSecond,
cache_n: timings?.cache_n || 0,
@@ -768,10 +775,11 @@ class ChatStore {
content: streamingState.response
};
if (lastMessage.thinking?.trim()) updateData.thinking = lastMessage.thinking;
const lastKnownState = this.getCurrentProcessingStateSync();
const lastKnownState = this.getProcessingState(conversationId);
if (lastKnownState) {
updateData.timings = {
prompt_n: lastKnownState.promptTokens || 0,
prompt_ms: lastKnownState.promptMs,
predicted_n: lastKnownState.tokensDecoded || 0,
cache_n: lastKnownState.cacheTokens || 0,
predicted_ms:
@@ -959,6 +967,160 @@ class ChatStore {
// Editing
// ─────────────────────────────────────────────────────────────────────────────
clearEditMode(): void {
this.isEditModeActive = false;
this.addFilesHandler = null;
}
async continueAssistantMessage(messageId: string): Promise<void> {
const activeConv = conversationsStore.activeConversation;
if (!activeConv || this.isLoading) return;
const result = this.getMessageByIdWithRole(messageId, 'assistant');
if (!result) return;
const { message: msg, index: idx } = result;
if (this.isChatLoading(activeConv.id)) return;
try {
this.errorDialogState = null;
this.setChatLoading(activeConv.id, true);
this.clearChatStreaming(activeConv.id);
const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
const dbMessage = allMessages.find((m) => m.id === messageId);
if (!dbMessage) {
this.setChatLoading(activeConv.id, false);
return;
}
const originalContent = dbMessage.content;
const originalThinking = dbMessage.thinking || '';
const conversationContext = conversationsStore.activeMessages.slice(0, idx);
const contextWithContinue = [
...conversationContext,
{ role: 'assistant' as const, content: originalContent }
];
let appendedContent = '',
appendedThinking = '',
hasReceivedContent = false;
const abortController = this.getOrCreateAbortController(msg.convId);
await ChatService.sendMessage(
contextWithContinue,
{
...this.getApiOptions(),
onChunk: (chunk: string) => {
hasReceivedContent = true;
appendedContent += chunk;
const fullContent = originalContent + appendedContent;
this.setChatStreaming(msg.convId, fullContent, msg.id);
conversationsStore.updateMessageAtIndex(idx, { content: fullContent });
},
onReasoningChunk: (reasoningChunk: string) => {
hasReceivedContent = true;
appendedThinking += reasoningChunk;
conversationsStore.updateMessageAtIndex(idx, {
thinking: originalThinking + appendedThinking
});
},
onTimings: (timings: ChatMessageTimings, promptProgress?: ChatMessagePromptProgress) => {
const tokensPerSecond =
timings?.predicted_ms && timings?.predicted_n
? (timings.predicted_n / timings.predicted_ms) * 1000
: 0;
this.updateProcessingStateFromTimings(
{
prompt_n: timings?.prompt_n || 0,
prompt_ms: timings?.prompt_ms,
predicted_n: timings?.predicted_n || 0,
predicted_per_second: tokensPerSecond,
cache_n: timings?.cache_n || 0,
prompt_progress: promptProgress
},
msg.convId
);
},
onComplete: async (
finalContent?: string,
reasoningContent?: string,
timings?: ChatMessageTimings
) => {
const fullContent = originalContent + (finalContent || appendedContent);
const fullThinking = originalThinking + (reasoningContent || appendedThinking);
await DatabaseService.updateMessage(msg.id, {
content: fullContent,
thinking: fullThinking,
timestamp: Date.now(),
timings
});
conversationsStore.updateMessageAtIndex(idx, {
content: fullContent,
thinking: fullThinking,
timestamp: Date.now(),
timings
});
conversationsStore.updateConversationTimestamp();
this.setChatLoading(msg.convId, false);
this.clearChatStreaming(msg.convId);
this.clearProcessingState(msg.convId);
},
onError: async (error: Error) => {
if (this.isAbortError(error)) {
if (hasReceivedContent && appendedContent) {
await DatabaseService.updateMessage(msg.id, {
content: originalContent + appendedContent,
thinking: originalThinking + appendedThinking,
timestamp: Date.now()
});
conversationsStore.updateMessageAtIndex(idx, {
content: originalContent + appendedContent,
thinking: originalThinking + appendedThinking,
timestamp: Date.now()
});
}
this.setChatLoading(msg.convId, false);
this.clearChatStreaming(msg.convId);
this.clearProcessingState(msg.convId);
return;
}
console.error('Continue generation error:', error);
conversationsStore.updateMessageAtIndex(idx, {
content: originalContent,
thinking: originalThinking
});
await DatabaseService.updateMessage(msg.id, {
content: originalContent,
thinking: originalThinking
});
this.setChatLoading(msg.convId, false);
this.clearChatStreaming(msg.convId);
this.clearProcessingState(msg.convId);
this.showErrorDialog(
error.name === 'TimeoutError' ? 'timeout' : 'server',
error.message
);
}
},
msg.convId,
abortController.signal
);
} catch (error) {
if (!this.isAbortError(error)) console.error('Failed to continue message:', error);
if (activeConv) this.setChatLoading(activeConv.id, false);
}
}
async editAssistantMessage(
messageId: string,
newContent: string,
@@ -989,11 +1151,10 @@ class ChatStore {
);
await conversationsStore.updateCurrentNode(newMessage.id);
} else {
await DatabaseService.updateMessage(msg.id, { content: newContent, timestamp: Date.now() });
await DatabaseService.updateMessage(msg.id, { content: newContent });
await conversationsStore.updateCurrentNode(msg.id);
conversationsStore.updateMessageAtIndex(idx, {
content: newContent,
timestamp: Date.now()
content: newContent
});
}
conversationsStore.updateConversationTimestamp();
@@ -1003,7 +1164,11 @@ class ChatStore {
}
}
async editUserMessagePreserveResponses(messageId: string, newContent: string): Promise<void> {
async editUserMessagePreserveResponses(
messageId: string,
newContent: string,
newExtras?: DatabaseMessageExtra[]
): Promise<void> {
const activeConv = conversationsStore.activeConversation;
if (!activeConv) return;
@@ -1012,11 +1177,18 @@ class ChatStore {
const { message: msg, index: idx } = result;
try {
await DatabaseService.updateMessage(messageId, {
content: newContent,
timestamp: Date.now()
});
conversationsStore.updateMessageAtIndex(idx, { content: newContent, timestamp: Date.now() });
const updateData: Partial<DatabaseMessage> = {
content: newContent
};
// Update extras if provided (including empty array to clear attachments)
// Deep clone to avoid Proxy objects from Svelte reactivity
if (newExtras !== undefined) {
updateData.extra = JSON.parse(JSON.stringify(newExtras));
}
await DatabaseService.updateMessage(messageId, updateData);
conversationsStore.updateMessageAtIndex(idx, updateData);
const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
const rootMessage = allMessages.find((m) => m.type === 'root' && m.parent === null);
@@ -1034,7 +1206,11 @@ class ChatStore {
}
}
async editMessageWithBranching(messageId: string, newContent: string): Promise<void> {
async editMessageWithBranching(
messageId: string,
newContent: string,
newExtras?: DatabaseMessageExtra[]
): Promise<void> {
const activeConv = conversationsStore.activeConversation;
if (!activeConv || this.isLoading) return;
@@ -1056,6 +1232,15 @@ class ChatStore {
const parentId = msg.parent || rootMessage?.id;
if (!parentId) return;
// Use newExtras if provided, otherwise copy existing extras
// Deep clone to avoid Proxy objects from Svelte reactivity
const extrasToUse =
newExtras !== undefined
? JSON.parse(JSON.stringify(newExtras))
: msg.extra
? JSON.parse(JSON.stringify(msg.extra))
: undefined;
const newMessage = await DatabaseService.createMessageBranch(
{
convId: msg.convId,
@@ -1066,7 +1251,7 @@ class ChatStore {
thinking: msg.thinking || '',
toolCalls: msg.toolCalls || '',
children: [],
extra: msg.extra ? JSON.parse(JSON.stringify(msg.extra)) : undefined,
extra: extrasToUse,
model: msg.model
},
parentId
@@ -1185,167 +1370,35 @@ class ChatStore {
}
}
async continueAssistantMessage(messageId: string): Promise<void> {
const activeConv = conversationsStore.activeConversation;
if (!activeConv || this.isLoading) return;
const result = this.getMessageByIdWithRole(messageId, 'assistant');
if (!result) return;
const { message: msg, index: idx } = result;
if (this.isChatLoading(activeConv.id)) return;
try {
this.errorDialogState = null;
this.setChatLoading(activeConv.id, true);
this.clearChatStreaming(activeConv.id);
const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
const dbMessage = allMessages.find((m) => m.id === messageId);
if (!dbMessage) {
this.setChatLoading(activeConv.id, false);
return;
}
const originalContent = dbMessage.content;
const originalThinking = dbMessage.thinking || '';
const conversationContext = conversationsStore.activeMessages.slice(0, idx);
const contextWithContinue = [
...conversationContext,
{ role: 'assistant' as const, content: originalContent }
];
let appendedContent = '',
appendedThinking = '',
hasReceivedContent = false;
const abortController = this.getOrCreateAbortController(msg.convId);
await ChatService.sendMessage(
contextWithContinue,
{
...this.getApiOptions(),
onChunk: (chunk: string) => {
hasReceivedContent = true;
appendedContent += chunk;
const fullContent = originalContent + appendedContent;
this.setChatStreaming(msg.convId, fullContent, msg.id);
conversationsStore.updateMessageAtIndex(idx, { content: fullContent });
},
onReasoningChunk: (reasoningChunk: string) => {
hasReceivedContent = true;
appendedThinking += reasoningChunk;
conversationsStore.updateMessageAtIndex(idx, {
thinking: originalThinking + appendedThinking
});
},
onTimings: (timings: ChatMessageTimings, promptProgress?: ChatMessagePromptProgress) => {
const tokensPerSecond =
timings?.predicted_ms && timings?.predicted_n
? (timings.predicted_n / timings.predicted_ms) * 1000
: 0;
this.updateProcessingStateFromTimings(
{
prompt_n: timings?.prompt_n || 0,
predicted_n: timings?.predicted_n || 0,
predicted_per_second: tokensPerSecond,
cache_n: timings?.cache_n || 0,
prompt_progress: promptProgress
},
msg.convId
);
},
onComplete: async (
finalContent?: string,
reasoningContent?: string,
timings?: ChatMessageTimings
) => {
const fullContent = originalContent + (finalContent || appendedContent);
const fullThinking = originalThinking + (reasoningContent || appendedThinking);
await DatabaseService.updateMessage(msg.id, {
content: fullContent,
thinking: fullThinking,
timestamp: Date.now(),
timings
});
conversationsStore.updateMessageAtIndex(idx, {
content: fullContent,
thinking: fullThinking,
timestamp: Date.now(),
timings
});
conversationsStore.updateConversationTimestamp();
this.setChatLoading(msg.convId, false);
this.clearChatStreaming(msg.convId);
this.clearProcessingState(msg.convId);
},
onError: async (error: Error) => {
if (this.isAbortError(error)) {
if (hasReceivedContent && appendedContent) {
await DatabaseService.updateMessage(msg.id, {
content: originalContent + appendedContent,
thinking: originalThinking + appendedThinking,
timestamp: Date.now()
});
conversationsStore.updateMessageAtIndex(idx, {
content: originalContent + appendedContent,
thinking: originalThinking + appendedThinking,
timestamp: Date.now()
});
}
this.setChatLoading(msg.convId, false);
this.clearChatStreaming(msg.convId);
this.clearProcessingState(msg.convId);
return;
}
console.error('Continue generation error:', error);
conversationsStore.updateMessageAtIndex(idx, {
content: originalContent,
thinking: originalThinking
});
await DatabaseService.updateMessage(msg.id, {
content: originalContent,
thinking: originalThinking
});
this.setChatLoading(msg.convId, false);
this.clearChatStreaming(msg.convId);
this.clearProcessingState(msg.convId);
this.showErrorDialog(
error.name === 'TimeoutError' ? 'timeout' : 'server',
error.message
);
}
},
msg.convId,
abortController.signal
);
} catch (error) {
if (!this.isAbortError(error)) console.error('Failed to continue message:', error);
if (activeConv) this.setChatLoading(activeConv.id, false);
}
getAddFilesHandler(): ((files: File[]) => void) | null {
return this.addFilesHandler;
}
public isChatLoadingPublic(convId: string): boolean {
return this.isChatLoading(convId);
public getAllLoadingChats(): string[] {
return Array.from(this.chatLoadingStates.keys());
}
public getAllStreamingChats(): string[] {
return Array.from(this.chatStreamingStates.keys());
}
public getChatStreamingPublic(
convId: string
): { response: string; messageId: string } | undefined {
return this.getChatStreaming(convId);
}
public getAllLoadingChats(): string[] {
return Array.from(this.chatLoadingStates.keys());
public isChatLoadingPublic(convId: string): boolean {
return this.isChatLoading(convId);
}
public getAllStreamingChats(): string[] {
return Array.from(this.chatStreamingStates.keys());
isEditing(): boolean {
return this.isEditModeActive;
}
setEditModeActive(handler: (files: File[]) => void): void {
this.isEditModeActive = true;
this.addFilesHandler = handler;
}
// ─────────────────────────────────────────────────────────────────────────────
@@ -1409,13 +1462,17 @@ class ChatStore {
export const chatStore = new ChatStore();
export const isLoading = () => chatStore.isLoading;
export const activeProcessingState = () => chatStore.activeProcessingState;
export const clearEditMode = () => chatStore.clearEditMode();
export const currentResponse = () => chatStore.currentResponse;
export const errorDialog = () => chatStore.errorDialogState;
export const activeProcessingState = () => chatStore.activeProcessingState;
export const isChatStreaming = () => chatStore.isStreaming();
export const isChatLoading = (convId: string) => chatStore.isChatLoadingPublic(convId);
export const getChatStreaming = (convId: string) => chatStore.getChatStreamingPublic(convId);
export const getAddFilesHandler = () => chatStore.getAddFilesHandler();
export const getAllLoadingChats = () => chatStore.getAllLoadingChats();
export const getAllStreamingChats = () => chatStore.getAllStreamingChats();
export const getChatStreaming = (convId: string) => chatStore.getChatStreamingPublic(convId);
export const isChatLoading = (convId: string) => chatStore.isChatLoadingPublic(convId);
export const isChatStreaming = () => chatStore.isStreaming();
export const isEditing = () => chatStore.isEditing();
export const isLoading = () => chatStore.isLoading;
export const setEditModeActive = (handler: (files: File[]) => void) =>
chatStore.setEditModeActive(handler);
+1
View File
@@ -342,6 +342,7 @@ export interface ApiProcessingState {
// Progress information from prompt_progress
progressPercent?: number;
promptTokens?: number;
promptMs?: number;
cacheTokens?: number;
}