forked from wylab/llama.cpp
Compare commits
19 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 905942abdb | |||
| b5040086d4 | |||
| d39130a398 | |||
| b81ba1f96b | |||
| 210eb9ed0a | |||
| cb4d86c4d7 | |||
| 86e7299ef5 | |||
| 60d83a0149 | |||
| 87e25a1d1b | |||
| 213701b51a | |||
| be20e7f49d | |||
| 7ed03b8974 | |||
| 1d894a790e | |||
| 1f3e1b66e2 | |||
| 148ec970b6 | |||
| 2cccbaa008 | |||
| 8e558309dc | |||
| 0a423800ff | |||
| d12f781074 |
@@ -47,6 +47,7 @@ build*
|
||||
!build-info.cpp.in
|
||||
!build-info.sh
|
||||
!build.zig
|
||||
!docs/build.md
|
||||
/libllama.so
|
||||
/llama-*
|
||||
android-ndk-*
|
||||
|
||||
@@ -13,7 +13,7 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
|
||||
> [!IMPORTANT]
|
||||
[2024 Jun 12] Binaries have been renamed w/ a `llama-` prefix. `main` is now `llama-cli`, `server` is `llama-server`, etc (https://github.com/ggerganov/llama.cpp/pull/7809)
|
||||
|
||||
### Recent API changes
|
||||
## Recent API changes
|
||||
|
||||
- [2024 Jun 26] The source code and CMake build scripts have been restructured https://github.com/ggerganov/llama.cpp/pull/8006
|
||||
- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
|
||||
@@ -24,7 +24,7 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
|
||||
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
|
||||
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
|
||||
|
||||
### Hot topics
|
||||
## Hot topics
|
||||
|
||||
- **`convert.py` has been deprecated and moved to `examples/convert_legacy_llama.py`, please use `convert_hf_to_gguf.py`** https://github.com/ggerganov/llama.cpp/pull/7430
|
||||
- Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021
|
||||
@@ -39,37 +39,6 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
|
||||
|
||||
----
|
||||
|
||||
<details>
|
||||
<summary>Table of Contents</summary>
|
||||
<ol>
|
||||
<li>
|
||||
<a href="#description">Description</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#usage">Usage</a>
|
||||
<ul>
|
||||
<li><a href="#get-the-code">Get the Code</a></li>
|
||||
<li><a href="#build">Build</a></li>
|
||||
<li><a href="#blas-build">BLAS Build</a></li>
|
||||
<li><a href="#prepare-and-quantize">Prepare and Quantize</a></li>
|
||||
<li><a href="#run-the-quantized-model">Run the quantized model</a></li>
|
||||
<li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
|
||||
<li><a href="#quantization">Quantization</a></li>
|
||||
<li><a href="#interactive-mode">Interactive mode</a></li>
|
||||
<li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
|
||||
<li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
|
||||
<li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
|
||||
<li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
|
||||
<li><a href="#android">Android</a></li>
|
||||
<li><a href="#docker">Docker</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li><a href="#contributing">Contributing</a></li>
|
||||
<li><a href="#coding-guidelines">Coding guidelines</a></li>
|
||||
<li><a href="#docs">Docs</a></li>
|
||||
</ol>
|
||||
</details>
|
||||
|
||||
## Description
|
||||
|
||||
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
|
||||
@@ -87,14 +56,6 @@ Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomm
|
||||
improved significantly thanks to many contributions. It is the main playground for developing new features for the
|
||||
[ggml](https://github.com/ggerganov/ggml) library.
|
||||
|
||||
**Supported platforms:**
|
||||
|
||||
- [X] Mac OS
|
||||
- [X] Linux
|
||||
- [X] Windows (via CMake)
|
||||
- [X] Docker
|
||||
- [X] FreeBSD
|
||||
|
||||
**Supported models:**
|
||||
|
||||
Typically finetunes of the base models below are supported as well.
|
||||
@@ -150,12 +111,6 @@ Typically finetunes of the base models below are supported as well.
|
||||
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
|
||||
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
|
||||
|
||||
**HTTP server**
|
||||
|
||||
[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
|
||||
|
||||
[simplechat](./examples/server/public_simplechat) is a simple chat client, which can be used to chat with the model exposed using above web server (use --path to point to simplechat), from a local web browser.
|
||||
|
||||
**Bindings:**
|
||||
|
||||
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
|
||||
@@ -224,9 +179,10 @@ Unless otherwise noted these projects are open-source with permissive licensing:
|
||||
|
||||
- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
|
||||
|
||||
---
|
||||
## Demo
|
||||
|
||||
Here is a typical run using LLaMA v2 13B on M2 Ultra:
|
||||
<details>
|
||||
<summary>Typical run using LLaMA v2 13B on M2 Ultra</summary>
|
||||
|
||||
```
|
||||
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
|
||||
@@ -306,454 +262,85 @@ llama_print_timings: eval time = 24513.59 ms / 399 runs ( 61.44 ms
|
||||
llama_print_timings: total time = 25431.49 ms
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook</summary>
|
||||
|
||||
And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook:
|
||||
|
||||
https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
|
||||
|
||||
</details>
|
||||
|
||||
## Usage
|
||||
|
||||
Here are the end-to-end binary build and model conversion steps for most supported models.
|
||||
|
||||
### Get the Code
|
||||
### Basic usage
|
||||
|
||||
Firstly, you need to get the binary. There are different methods that you can follow:
|
||||
- Method 1: Clone this repository and build locally, see [how to build](./docs/build.md)
|
||||
- Method 2: If you are using MacOS or Linux, you can install llama.cpp via [brew, flox or nix](./docs/install.md)
|
||||
- Method 3: Use a Docker image, see [documentation for Docker](./docs/docker.md)
|
||||
- Method 4: Download pre-built binary from [releases](https://github.com/ggerganov/llama.cpp/releases)
|
||||
|
||||
You can run a basic completion using this command:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/ggerganov/llama.cpp
|
||||
cd llama.cpp
|
||||
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
|
||||
|
||||
# Output:
|
||||
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
|
||||
```
|
||||
|
||||
### Build
|
||||
See [this page](./examples/main/README.md) for a full list of parameters.
|
||||
|
||||
In order to build llama.cpp you have four different options.
|
||||
### Conversation mode
|
||||
|
||||
- Using `make`:
|
||||
- On Linux or MacOS:
|
||||
|
||||
```bash
|
||||
make
|
||||
```
|
||||
|
||||
- On Windows:
|
||||
|
||||
1. Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
|
||||
2. Extract `w64devkit` on your pc.
|
||||
3. Run `w64devkit.exe`.
|
||||
4. Use the `cd` command to reach the `llama.cpp` folder.
|
||||
5. From here you can run:
|
||||
```bash
|
||||
make
|
||||
```
|
||||
|
||||
- Notes:
|
||||
- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `make -j 8` will run 8 jobs in parallel.
|
||||
- For faster repeated compilation, install [ccache](https://ccache.dev/).
|
||||
- For debug builds, run `make LLAMA_DEBUG=1`
|
||||
|
||||
- Using `CMake`:
|
||||
|
||||
```bash
|
||||
cmake -B build
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
**Notes**:
|
||||
|
||||
- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
|
||||
- For faster repeated compilation, install [ccache](https://ccache.dev/).
|
||||
- For debug builds, there are two cases:
|
||||
|
||||
1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):
|
||||
|
||||
```bash
|
||||
cmake -B build -DCMAKE_BUILD_TYPE=Debug
|
||||
cmake --build build
|
||||
```
|
||||
|
||||
2. Multi-config generators (`-G` param set to Visual Studio, XCode...):
|
||||
|
||||
```bash
|
||||
cmake -B build -G "Xcode"
|
||||
cmake --build build --config Debug
|
||||
```
|
||||
|
||||
- Using `gmake` (FreeBSD):
|
||||
|
||||
1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
|
||||
2. Add your user to **video** group
|
||||
3. Install compilation dependencies.
|
||||
|
||||
```bash
|
||||
sudo pkg install gmake automake autoconf pkgconf llvm15 openblas
|
||||
|
||||
gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
|
||||
```
|
||||
|
||||
### Homebrew
|
||||
|
||||
On Mac and Linux, the homebrew package manager can be used via
|
||||
```
|
||||
brew install llama.cpp
|
||||
```
|
||||
The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
|
||||
|
||||
### Nix
|
||||
|
||||
On Mac and Linux, the Nix package manager can be used via
|
||||
```
|
||||
nix profile install nixpkgs#llama-cpp
|
||||
```
|
||||
For flake enabled installs.
|
||||
|
||||
Or
|
||||
```
|
||||
nix-env --file '<nixpkgs>' --install --attr llama-cpp
|
||||
```
|
||||
For non-flake enabled installs.
|
||||
|
||||
This expression is automatically updated within the [nixpkgs repo](https://github.com/NixOS/nixpkgs/blob/nixos-24.05/pkgs/by-name/ll/llama-cpp/package.nix#L164).
|
||||
|
||||
#### Flox
|
||||
|
||||
On Mac and Linux, Flox can be used to install llama.cpp within a Flox environment via
|
||||
```
|
||||
flox install llama-cpp
|
||||
```
|
||||
Flox follows the nixpkgs build of llama.cpp.
|
||||
|
||||
### Metal Build
|
||||
|
||||
On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
|
||||
To disable the Metal build at compile time use the `GGML_NO_METAL=1` flag or the `GGML_METAL=OFF` cmake option.
|
||||
|
||||
When built with Metal support, you can explicitly disable GPU inference with the `--n-gpu-layers|-ngl 0` command-line
|
||||
argument.
|
||||
|
||||
### BLAS Build
|
||||
|
||||
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS. There are currently several different BLAS implementations available for build and use:
|
||||
|
||||
- #### Accelerate Framework:
|
||||
|
||||
This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
|
||||
|
||||
- #### OpenBLAS:
|
||||
|
||||
This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
|
||||
|
||||
- Using `make`:
|
||||
- On Linux:
|
||||
```bash
|
||||
make GGML_OPENBLAS=1
|
||||
```
|
||||
|
||||
- On Windows:
|
||||
|
||||
1. Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
|
||||
2. Download the latest version of [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases).
|
||||
3. Extract `w64devkit` on your pc.
|
||||
4. From the OpenBLAS zip that you just downloaded copy `libopenblas.a`, located inside the `lib` folder, inside `w64devkit\x86_64-w64-mingw32\lib`.
|
||||
5. From the same OpenBLAS zip copy the content of the `include` folder inside `w64devkit\x86_64-w64-mingw32\include`.
|
||||
6. Run `w64devkit.exe`.
|
||||
7. Use the `cd` command to reach the `llama.cpp` folder.
|
||||
8. From here you can run:
|
||||
|
||||
```bash
|
||||
make GGML_OPENBLAS=1
|
||||
```
|
||||
|
||||
- Using `CMake` on Linux:
|
||||
|
||||
```bash
|
||||
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
- #### BLIS
|
||||
|
||||
Check [BLIS.md](docs/BLIS.md) for more information.
|
||||
|
||||
- #### SYCL
|
||||
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.
|
||||
|
||||
llama.cpp based on SYCL is used to **support Intel GPU** (Data Center Max series, Flex series, Arc series, Built-in GPU and iGPU).
|
||||
|
||||
For detailed info, please refer to [llama.cpp for SYCL](README-sycl.md).
|
||||
|
||||
- #### Intel oneMKL
|
||||
Building through oneAPI compilers will make avx_vnni instruction set available for intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./README-sycl.md).
|
||||
|
||||
- Using manual oneAPI installation:
|
||||
By default, `GGML_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DGGML_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. Otherwise please install oneAPI and follow the below steps:
|
||||
```bash
|
||||
source /opt/intel/oneapi/setvars.sh # You can skip this step if in oneapi-basekit docker image, only required for manual installation
|
||||
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
- Using oneAPI docker image:
|
||||
If you do not want to source the environment vars and install oneAPI manually, you can also build the code using intel docker container: [oneAPI-basekit](https://hub.docker.com/r/intel/oneapi-basekit). Then, you can use the commands given above.
|
||||
|
||||
Check [Optimizing and Running LLaMA2 on Intel® CPU](https://www.intel.com/content/www/us/en/content-details/791610/optimizing-and-running-llama2-on-intel-cpu.html) for more information.
|
||||
|
||||
- #### CUDA
|
||||
|
||||
This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
|
||||
|
||||
For Jetson user, if you have Jetson Orin, you can try this: [Offical Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an old model(nano/TX2), need some additional operations before compiling.
|
||||
|
||||
- Using `make`:
|
||||
```bash
|
||||
make GGML_CUDA=1
|
||||
```
|
||||
- Using `CMake`:
|
||||
|
||||
```bash
|
||||
cmake -B build -DGGML_CUDA=ON
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
|
||||
|
||||
| Option | Legal values | Default | Description |
|
||||
|-------------------------------|------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| GGML_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
|
||||
| GGML_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
|
||||
| GGML_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. |
|
||||
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, RDNA3). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
|
||||
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models |
|
||||
| GGML_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
|
||||
| GGML_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
|
||||
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
|
||||
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
|
||||
|
||||
- #### hipBLAS
|
||||
|
||||
This provides BLAS acceleration on HIP-supported AMD GPUs.
|
||||
Make sure to have ROCm installed.
|
||||
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).
|
||||
|
||||
- Using `make`:
|
||||
```bash
|
||||
make GGML_HIPBLAS=1
|
||||
```
|
||||
- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
|
||||
```bash
|
||||
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
|
||||
cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
|
||||
&& cmake --build build --config Release -- -j 16
|
||||
```
|
||||
On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DGGML_HIP_UMA=ON`.
|
||||
However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
|
||||
|
||||
Note that if you get the following error:
|
||||
```
|
||||
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
|
||||
```
|
||||
Try searching for a directory under `HIP_PATH` that contains the file
|
||||
`oclc_abi_version_400.bc`. Then, add the following to the start of the
|
||||
command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something
|
||||
like:
|
||||
```bash
|
||||
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
|
||||
HIP_DEVICE_LIB_PATH=<directory-you-just-found> \
|
||||
cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
|
||||
&& cmake --build build -- -j 16
|
||||
```
|
||||
|
||||
- Using `make` (example for target gfx1030, build with 16 CPU threads):
|
||||
```bash
|
||||
make -j16 GGML_HIPBLAS=1 GGML_HIP_UMA=1 AMDGPU_TARGETS=gfx1030
|
||||
```
|
||||
|
||||
- Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
|
||||
```bash
|
||||
set PATH=%HIP_PATH%\bin;%PATH%
|
||||
cmake -S . -B build -G Ninja -DAMDGPU_TARGETS=gfx1100 -DGGML_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build
|
||||
```
|
||||
Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
|
||||
Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.
|
||||
|
||||
|
||||
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
|
||||
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
|
||||
The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
|
||||
|
||||
| Option | Legal values | Default | Description |
|
||||
|------------------------|------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| GGML_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
|
||||
| GGML_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
|
||||
| GGML_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
|
||||
|
||||
- #### Vulkan
|
||||
|
||||
**With docker**:
|
||||
|
||||
You don't need to install Vulkan SDK. It will be installed inside the container.
|
||||
|
||||
```sh
|
||||
# Build the image
|
||||
docker build -t llama-cpp-vulkan -f .devops/llama-cli-vulkan.Dockerfile .
|
||||
|
||||
# Then, use it:
|
||||
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-vulkan -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
|
||||
```
|
||||
|
||||
**Without docker**:
|
||||
|
||||
Firstly, you need to make sure you have installed [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html)
|
||||
|
||||
For example, on Ubuntu 22.04 (jammy), use the command below:
|
||||
|
||||
```bash
|
||||
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
|
||||
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
|
||||
apt update -y
|
||||
apt-get install -y vulkan-sdk
|
||||
# To verify the installation, use the command below:
|
||||
vulkaninfo
|
||||
```
|
||||
|
||||
Alternatively your package manager might be able to provide the appropriate libraries.
|
||||
For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
|
||||
For Fedora 40, you can install `vulkan-devel`, `glslc` and `glslang` packages.
|
||||
|
||||
Then, build llama.cpp using the cmake command below:
|
||||
|
||||
```bash
|
||||
cmake -B build -DGGML_VULKAN=1
|
||||
cmake --build build --config Release
|
||||
# Test the output binary (with "-ngl 33" to offload all layers to GPU)
|
||||
./bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4
|
||||
|
||||
# You should see in the output, ggml_vulkan detected your GPU. For example:
|
||||
# ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32
|
||||
```
|
||||
|
||||
### Prepare and Quantize
|
||||
|
||||
> [!NOTE]
|
||||
> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.
|
||||
|
||||
To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
|
||||
|
||||
Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
|
||||
It does not support LLaMA 3, you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.
|
||||
If you want a more ChatGPT-like experience, you can run in conversation mode by passing `-cnv` as a parameter:
|
||||
|
||||
```bash
|
||||
# obtain the official LLaMA model weights and place them in ./models
|
||||
ls ./models
|
||||
llama-2-7b tokenizer_checklist.chk tokenizer.model
|
||||
# [Optional] for models using BPE tokenizers
|
||||
ls ./models
|
||||
<folder containing weights and tokenizer json> vocab.json
|
||||
# [Optional] for PyTorch .bin models like Mistral-7B
|
||||
ls ./models
|
||||
<folder containing weights and tokenizer json>
|
||||
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
|
||||
|
||||
# install Python dependencies
|
||||
python3 -m pip install -r requirements.txt
|
||||
|
||||
# convert the model to ggml FP16 format
|
||||
python3 convert_hf_to_gguf.py models/mymodel/
|
||||
|
||||
# quantize the model to 4-bits (using Q4_K_M method)
|
||||
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
|
||||
|
||||
# update the gguf filetype to current version if older version is now unsupported
|
||||
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
|
||||
# Output:
|
||||
# > hi, who are you?
|
||||
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
|
||||
#
|
||||
# > what is 1+1?
|
||||
# Easy peasy! The answer to 1+1 is... 2!
|
||||
```
|
||||
|
||||
### Run the quantized model
|
||||
By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
|
||||
|
||||
```bash
|
||||
# start inference on a gguf model
|
||||
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
|
||||
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
|
||||
```
|
||||
|
||||
When running the larger models, make sure you have enough disk space to store all the intermediate files.
|
||||
You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
|
||||
|
||||
### Running on Windows with prebuilt binaries
|
||||
|
||||
You will find prebuilt Windows binaries on the release page.
|
||||
|
||||
Simply download and extract the latest zip package of choice: (e.g. `llama-b1380-bin-win-avx2-x64.zip`)
|
||||
|
||||
From the unzipped folder, open a terminal/cmd window here and place a pre-converted `.gguf` model file. Test out the main example like so:
|
||||
|
||||
```
|
||||
.\main -m llama-2-7b.Q4_0.gguf -n 128
|
||||
```bash
|
||||
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
|
||||
```
|
||||
|
||||
### Memory/Disk Requirements
|
||||
### Web server
|
||||
|
||||
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
|
||||
[llama.cpp web server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
|
||||
|
||||
| Model | Original size | Quantized size (Q4_0) |
|
||||
|------:|--------------:|----------------------:|
|
||||
| 7B | 13 GB | 3.9 GB |
|
||||
| 13B | 24 GB | 7.8 GB |
|
||||
| 30B | 60 GB | 19.5 GB |
|
||||
| 65B | 120 GB | 38.5 GB |
|
||||
Example usage:
|
||||
|
||||
### Quantization
|
||||
```bash
|
||||
./llama-server -m your_model.gguf --port 8080
|
||||
|
||||
Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
|
||||
|
||||
*(outdated)*
|
||||
|
||||
| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|
||||
|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
|
||||
| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
|
||||
| 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
|
||||
| 7B | ms/tok @ 4th | 127 | 55 | 54 | 76 | 83 | 72 |
|
||||
| 7B | ms/tok @ 8th | 122 | 43 | 45 | 52 | 56 | 67 |
|
||||
| 7B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
|
||||
| 13B | perplexity | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
|
||||
| 13B | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G |
|
||||
| 13B | ms/tok @ 4th | - | 103 | 105 | 148 | 160 | 131 |
|
||||
| 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
|
||||
| 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
|
||||
|
||||
- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
|
||||
- recent k-quants improvements and new i-quants
|
||||
- [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
|
||||
- [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
|
||||
- [#4773 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4773)
|
||||
- [#4856 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4856)
|
||||
- [#4861 - importance matrix](https://github.com/ggerganov/llama.cpp/pull/4861)
|
||||
- [#4872 - MoE models](https://github.com/ggerganov/llama.cpp/pull/4872)
|
||||
- [#4897 - 2-bit quantization](https://github.com/ggerganov/llama.cpp/pull/4897)
|
||||
- [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
|
||||
- [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
|
||||
- [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
|
||||
- [#4996 - k-qunats tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
|
||||
- [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
|
||||
- [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
|
||||
- [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)
|
||||
|
||||
### Perplexity (measuring model quality)
|
||||
|
||||
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
|
||||
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
|
||||
|
||||
The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
|
||||
The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.
|
||||
|
||||
#### How to run
|
||||
|
||||
1. Download/extract: https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
|
||||
2. Run `./llama-perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
|
||||
3. Output:
|
||||
# Basic web UI can be accessed via browser: http://localhost:8080
|
||||
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
|
||||
```
|
||||
perplexity : calculating perplexity over 655 chunks
|
||||
24.43 seconds per pass - ETA 4.45 hours
|
||||
[1]4.5970,[2]5.1807,[3]6.0382,...
|
||||
```
|
||||
And after 4.45 hours, you will have the final perplexity.
|
||||
|
||||
### Interactive mode
|
||||
|
||||
If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
|
||||
> [!NOTE]
|
||||
> If you prefer basic usage, please consider using conversation mode instead of interactive mode
|
||||
|
||||
In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
|
||||
|
||||
Here is an example of a few-shot interaction, invoked with the command
|
||||
@@ -804,173 +391,44 @@ The `grammars/` folder contains a handful of sample grammars. To write your own,
|
||||
|
||||
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
|
||||
|
||||
### Obtaining and using the Facebook LLaMA 2 model
|
||||
## Build
|
||||
|
||||
- Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.
|
||||
- Alternatively, if you want to save time and space, you can download already converted and quantized models from [TheBloke](https://huggingface.co/TheBloke), including:
|
||||
- [LLaMA 2 7B base](https://huggingface.co/TheBloke/Llama-2-7B-GGUF)
|
||||
- [LLaMA 2 13B base](https://huggingface.co/TheBloke/Llama-2-13B-GGUF)
|
||||
- [LLaMA 2 70B base](https://huggingface.co/TheBloke/Llama-2-70B-GGUF)
|
||||
- [LLaMA 2 7B chat](https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF)
|
||||
- [LLaMA 2 13B chat](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF)
|
||||
- [LLaMA 2 70B chat](https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF)
|
||||
Please refer to [Build llama.cpp locally](./docs/build.md)
|
||||
|
||||
### Seminal papers and background on the models
|
||||
## Supported backends
|
||||
|
||||
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
|
||||
- LLaMA:
|
||||
- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
|
||||
- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
|
||||
- GPT-3
|
||||
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
|
||||
- GPT-3.5 / InstructGPT / ChatGPT:
|
||||
- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
|
||||
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
|
||||
| Backend | Target devices |
|
||||
| --- | --- |
|
||||
| [Metal](./docs/build.md#metal-build) | Apple Silicon |
|
||||
| [BLAS](./docs/build.md#blas-build) | All |
|
||||
| [BLIS](./docs/backend/BLIS.md) | All |
|
||||
| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
|
||||
| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
|
||||
| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
|
||||
| [Vulkan](./docs/build.md#vulkan) | GPU |
|
||||
|
||||
### Android
|
||||
## Tools
|
||||
|
||||
#### Build on Android using Termux
|
||||
[Termux](https://github.com/termux/termux-app#installation) is a method to execute `llama.cpp` on an Android device (no root required).
|
||||
```
|
||||
apt update && apt upgrade -y
|
||||
apt install git make cmake
|
||||
```
|
||||
### Prepare and Quantize
|
||||
|
||||
It's recommended to move your model inside the `~/` directory for best performance:
|
||||
```
|
||||
cd storage/downloads
|
||||
mv model.gguf ~/
|
||||
```
|
||||
> [!NOTE]
|
||||
> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.
|
||||
|
||||
[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.
|
||||
To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
|
||||
|
||||
#### Building the Project using Android NDK
|
||||
Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.
|
||||
Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
|
||||
It does not support LLaMA 3, you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.
|
||||
|
||||
Execute the following commands on your computer to avoid downloading the NDK to your mobile. Alternatively, you can also do this in Termux:
|
||||
```
|
||||
$ mkdir build-android
|
||||
$ cd build-android
|
||||
$ export NDK=<your_ndk_directory>
|
||||
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
|
||||
$ make
|
||||
```
|
||||
To learn more about quantizing model, [read this documentation](./examples/quantize/README.md)
|
||||
|
||||
Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if Android 11+ then run the command twice).
|
||||
### Perplexity (measuring model quality)
|
||||
|
||||
Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to add executable permission:
|
||||
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
|
||||
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
|
||||
|
||||
(Assumed that you have pushed the built executable files to the /sdcard/llama.cpp/bin path using `adb push`)
|
||||
```
|
||||
$cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
|
||||
$cd /data/data/com.termux/files/home/bin
|
||||
$chmod +x ./*
|
||||
```
|
||||
To learn more how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)
|
||||
|
||||
Download model [llama-2-7b-chat.Q4_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf), and push it to `/sdcard/llama.cpp/`, then move it to `/data/data/com.termux/files/home/model/`
|
||||
|
||||
```
|
||||
$mv /sdcard/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf /data/data/com.termux/files/home/model/
|
||||
```
|
||||
|
||||
Now, you can start chatting:
|
||||
```
|
||||
$cd /data/data/com.termux/files/home/bin
|
||||
$./llama-cli -m ../model/llama-2-7b-chat.Q4_K_M.gguf -n 128 -cml
|
||||
```
|
||||
|
||||
Here's a demo of an interactive session running on Pixel 5 phone:
|
||||
|
||||
https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
|
||||
|
||||
### Docker
|
||||
|
||||
#### Prerequisites
|
||||
* Docker must be installed and running on your system.
|
||||
* Create a folder to store big models & intermediate files (ex. /llama/models)
|
||||
|
||||
#### Images
|
||||
We have three Docker images available for this project:
|
||||
|
||||
1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
3. `ghcr.io/ggerganov/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
|
||||
Additionally, there the following images, similar to the above:
|
||||
|
||||
- `ghcr.io/ggerganov/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
|
||||
The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](.github/workflows/docker.yml). If you need different settings (for example, a different CUDA or ROCm library, you'll need to build the images locally for now).
|
||||
|
||||
#### Usage
|
||||
|
||||
The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image.
|
||||
|
||||
Replace `/path/to/models` below with the actual path where you downloaded the models.
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
|
||||
```
|
||||
|
||||
On completion, you are ready to play!
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
|
||||
```
|
||||
|
||||
or with a light image:
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
|
||||
```
|
||||
|
||||
or with a server image:
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
|
||||
```
|
||||
|
||||
### Docker With CUDA
|
||||
|
||||
Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.
|
||||
|
||||
#### Building Locally
|
||||
|
||||
```bash
|
||||
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
|
||||
docker build -t local/llama.cpp:light-cuda -f .devops/llama-cli-cuda.Dockerfile .
|
||||
docker build -t local/llama.cpp:server-cuda -f .devops/llama-server-cuda.Dockerfile .
|
||||
```
|
||||
|
||||
You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.
|
||||
|
||||
The defaults are:
|
||||
|
||||
- `CUDA_VERSION` set to `11.7.1`
|
||||
- `CUDA_DOCKER_ARCH` set to `all`
|
||||
|
||||
The resulting images, are essentially the same as the non-CUDA images:
|
||||
|
||||
1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
|
||||
2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
|
||||
3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.
|
||||
|
||||
#### Usage
|
||||
|
||||
After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.
|
||||
|
||||
```bash
|
||||
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
|
||||
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
|
||||
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
|
||||
```
|
||||
|
||||
### Contributing
|
||||
## Contributing
|
||||
|
||||
- Contributors can open PRs
|
||||
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
|
||||
@@ -981,12 +439,29 @@ docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m
|
||||
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
|
||||
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
|
||||
|
||||
### Docs
|
||||
## Other documentations
|
||||
|
||||
- [main (cli)](./examples/main/README.md)
|
||||
- [server](./examples/server/README.md)
|
||||
- [jeopardy](./examples/jeopardy/README.md)
|
||||
- [BLIS](./docs/BLIS.md)
|
||||
- [GBNF grammars](./grammars/README.md)
|
||||
|
||||
**Development documentations**
|
||||
|
||||
- [How to build](./docs/build.md)
|
||||
- [Running on Docker](./docs/docker.md)
|
||||
- [Build on Android](./docs/android.md)
|
||||
- [Performance troubleshooting](./docs/token_generation_performance_tips.md)
|
||||
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
|
||||
- [GBNF grammars](./grammars/README.md)
|
||||
|
||||
**Seminal papers and background on the models**
|
||||
|
||||
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
|
||||
- LLaMA:
|
||||
- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
|
||||
- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
|
||||
- GPT-3
|
||||
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
|
||||
- GPT-3.5 / InstructGPT / ChatGPT:
|
||||
- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
|
||||
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
|
||||
|
||||
+70
-45
@@ -190,6 +190,12 @@ int32_t cpu_get_num_math() {
|
||||
// CLI argument parsing
|
||||
//
|
||||
|
||||
void gpt_params_handle_hf_token(gpt_params & params) {
|
||||
if (params.hf_token.empty() && std::getenv("HF_TOKEN")) {
|
||||
params.hf_token = std::getenv("HF_TOKEN");
|
||||
}
|
||||
}
|
||||
|
||||
void gpt_params_handle_model_default(gpt_params & params) {
|
||||
if (!params.hf_repo.empty()) {
|
||||
// short-hand to avoid specifying --hf-file -> default it to --model
|
||||
@@ -237,6 +243,8 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
|
||||
|
||||
gpt_params_handle_model_default(params);
|
||||
|
||||
gpt_params_handle_hf_token(params);
|
||||
|
||||
if (params.escape) {
|
||||
string_process_escapes(params.prompt);
|
||||
string_process_escapes(params.input_prefix);
|
||||
@@ -472,6 +480,14 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
|
||||
else { invalid_param = true; }
|
||||
return true;
|
||||
}
|
||||
if (arg == "--attention") {
|
||||
CHECK_ARG
|
||||
std::string value(argv[i]);
|
||||
/**/ if (value == "causal") { params.attention_type = LLAMA_ATTENTION_TYPE_CAUSAL; }
|
||||
else if (value == "non-causal") { params.attention_type = LLAMA_ATTENTION_TYPE_NON_CAUSAL; }
|
||||
else { invalid_param = true; }
|
||||
return true;
|
||||
}
|
||||
if (arg == "--defrag-thold" || arg == "-dt") {
|
||||
CHECK_ARG
|
||||
params.defrag_thold = std::stof(argv[i]);
|
||||
@@ -644,6 +660,14 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
|
||||
params.model_url = argv[i];
|
||||
return true;
|
||||
}
|
||||
if (arg == "-hft" || arg == "--hf-token") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
return true;
|
||||
}
|
||||
params.hf_token = argv[i];
|
||||
return true;
|
||||
}
|
||||
if (arg == "-hfr" || arg == "--hf-repo") {
|
||||
CHECK_ARG
|
||||
params.hf_repo = argv[i];
|
||||
@@ -1468,8 +1492,10 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
|
||||
"For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead" });
|
||||
|
||||
options.push_back({ "embedding" });
|
||||
options.push_back({ "embedding", " --pooling {none,mean,cls}",
|
||||
options.push_back({ "embedding", " --pooling {none,mean,cls,last}",
|
||||
"pooling type for embeddings, use model default if unspecified" });
|
||||
options.push_back({ "embedding", " --attention {causal,non-causal}",
|
||||
"attention type for embeddings, use model default if unspecified" });
|
||||
|
||||
options.push_back({ "context hacking" });
|
||||
options.push_back({ "*", " --rope-scaling {none,linear,yarn}",
|
||||
@@ -1566,6 +1592,7 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
|
||||
options.push_back({ "*", "-mu, --model-url MODEL_URL", "model download url (default: unused)" });
|
||||
options.push_back({ "*", "-hfr, --hf-repo REPO", "Hugging Face model repository (default: unused)" });
|
||||
options.push_back({ "*", "-hff, --hf-file FILE", "Hugging Face model file (default: unused)" });
|
||||
options.push_back({ "*", "-hft, --hf-token TOKEN", "Hugging Face access token (default: value from HF_TOKEN environment variable)" });
|
||||
|
||||
options.push_back({ "retrieval" });
|
||||
options.push_back({ "retrieval", " --context-file FNAME", "file to load context from (repeat to specify multiple files)" });
|
||||
@@ -2005,9 +2032,9 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
|
||||
llama_model * model = nullptr;
|
||||
|
||||
if (!params.hf_repo.empty() && !params.hf_file.empty()) {
|
||||
model = llama_load_model_from_hf(params.hf_repo.c_str(), params.hf_file.c_str(), params.model.c_str(), mparams);
|
||||
model = llama_load_model_from_hf(params.hf_repo.c_str(), params.hf_file.c_str(), params.model.c_str(), params.hf_token.c_str(), mparams);
|
||||
} else if (!params.model_url.empty()) {
|
||||
model = llama_load_model_from_url(params.model_url.c_str(), params.model.c_str(), mparams);
|
||||
model = llama_load_model_from_url(params.model_url.c_str(), params.model.c_str(), params.hf_token.c_str(), mparams);
|
||||
} else {
|
||||
model = llama_load_model_from_file(params.model.c_str(), mparams);
|
||||
}
|
||||
@@ -2175,6 +2202,7 @@ struct llama_context_params llama_context_params_from_gpt_params(const gpt_param
|
||||
cparams.yarn_beta_slow = params.yarn_beta_slow;
|
||||
cparams.yarn_orig_ctx = params.yarn_orig_ctx;
|
||||
cparams.pooling_type = params.pooling_type;
|
||||
cparams.attention_type = params.attention_type;
|
||||
cparams.defrag_thold = params.defrag_thold;
|
||||
cparams.cb_eval = params.cb_eval;
|
||||
cparams.cb_eval_user_data = params.cb_eval_user_data;
|
||||
@@ -2194,7 +2222,7 @@ static bool starts_with(const std::string & str, const std::string & prefix) {
|
||||
return str.rfind(prefix, 0) == 0;
|
||||
}
|
||||
|
||||
static bool llama_download_file(const std::string & url, const std::string & path) {
|
||||
static bool llama_download_file(const std::string & url, const std::string & path, const std::string & hf_token) {
|
||||
|
||||
// Initialize libcurl
|
||||
std::unique_ptr<CURL, decltype(&curl_easy_cleanup)> curl(curl_easy_init(), &curl_easy_cleanup);
|
||||
@@ -2209,6 +2237,15 @@ static bool llama_download_file(const std::string & url, const std::string & pat
|
||||
curl_easy_setopt(curl.get(), CURLOPT_URL, url.c_str());
|
||||
curl_easy_setopt(curl.get(), CURLOPT_FOLLOWLOCATION, 1L);
|
||||
|
||||
// Check if hf-token or bearer-token was specified
|
||||
if (!hf_token.empty()) {
|
||||
std::string auth_header = "Authorization: Bearer ";
|
||||
auth_header += hf_token.c_str();
|
||||
struct curl_slist *http_headers = NULL;
|
||||
http_headers = curl_slist_append(http_headers, auth_header.c_str());
|
||||
curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers);
|
||||
}
|
||||
|
||||
#if defined(_WIN32)
|
||||
// CURLSSLOPT_NATIVE_CA tells libcurl to use standard certificate store of
|
||||
// operating system. Currently implemented under MS-Windows.
|
||||
@@ -2404,6 +2441,7 @@ static bool llama_download_file(const std::string & url, const std::string & pat
|
||||
struct llama_model * llama_load_model_from_url(
|
||||
const char * model_url,
|
||||
const char * path_model,
|
||||
const char * hf_token,
|
||||
const struct llama_model_params & params) {
|
||||
// Basic validation of the model_url
|
||||
if (!model_url || strlen(model_url) == 0) {
|
||||
@@ -2411,7 +2449,7 @@ struct llama_model * llama_load_model_from_url(
|
||||
return NULL;
|
||||
}
|
||||
|
||||
if (!llama_download_file(model_url, path_model)) {
|
||||
if (!llama_download_file(model_url, path_model, hf_token)) {
|
||||
return NULL;
|
||||
}
|
||||
|
||||
@@ -2459,14 +2497,14 @@ struct llama_model * llama_load_model_from_url(
|
||||
// Prepare download in parallel
|
||||
std::vector<std::future<bool>> futures_download;
|
||||
for (int idx = 1; idx < n_split; idx++) {
|
||||
futures_download.push_back(std::async(std::launch::async, [&split_prefix, &split_url_prefix, &n_split](int download_idx) -> bool {
|
||||
futures_download.push_back(std::async(std::launch::async, [&split_prefix, &split_url_prefix, &n_split, hf_token](int download_idx) -> bool {
|
||||
char split_path[PATH_MAX] = {0};
|
||||
llama_split_path(split_path, sizeof(split_path), split_prefix, download_idx, n_split);
|
||||
|
||||
char split_url[LLAMA_CURL_MAX_URL_LENGTH] = {0};
|
||||
llama_split_path(split_url, sizeof(split_url), split_url_prefix, download_idx, n_split);
|
||||
|
||||
return llama_download_file(split_url, split_path);
|
||||
return llama_download_file(split_url, split_path, hf_token);
|
||||
}, idx));
|
||||
}
|
||||
|
||||
@@ -2485,6 +2523,7 @@ struct llama_model * llama_load_model_from_hf(
|
||||
const char * repo,
|
||||
const char * model,
|
||||
const char * path_model,
|
||||
const char * hf_token,
|
||||
const struct llama_model_params & params) {
|
||||
// construct hugging face model url:
|
||||
//
|
||||
@@ -2500,7 +2539,7 @@ struct llama_model * llama_load_model_from_hf(
|
||||
model_url += "/resolve/main/";
|
||||
model_url += model;
|
||||
|
||||
return llama_load_model_from_url(model_url.c_str(), path_model, params);
|
||||
return llama_load_model_from_url(model_url.c_str(), path_model, hf_token, params);
|
||||
}
|
||||
|
||||
#else
|
||||
@@ -2508,6 +2547,7 @@ struct llama_model * llama_load_model_from_hf(
|
||||
struct llama_model * llama_load_model_from_url(
|
||||
const char * /*model_url*/,
|
||||
const char * /*path_model*/,
|
||||
const char * /*hf_token*/,
|
||||
const struct llama_model_params & /*params*/) {
|
||||
fprintf(stderr, "%s: llama.cpp built without libcurl, downloading from an url not supported.\n", __func__);
|
||||
return nullptr;
|
||||
@@ -2517,6 +2557,7 @@ struct llama_model * llama_load_model_from_hf(
|
||||
const char * /*repo*/,
|
||||
const char * /*model*/,
|
||||
const char * /*path_model*/,
|
||||
const char * /*hf_token*/,
|
||||
const struct llama_model_params & /*params*/) {
|
||||
fprintf(stderr, "%s: llama.cpp built without libcurl, downloading from Hugging Face not supported.\n", __func__);
|
||||
return nullptr;
|
||||
@@ -2581,51 +2622,35 @@ std::vector<llama_token> llama_tokenize(
|
||||
}
|
||||
|
||||
std::string llama_token_to_piece(const struct llama_context * ctx, llama_token token, bool special) {
|
||||
std::vector<char> result(8, 0);
|
||||
const int n_tokens = llama_token_to_piece(llama_get_model(ctx), token, result.data(), result.size(), special);
|
||||
if (n_tokens < 0) {
|
||||
result.resize(-n_tokens);
|
||||
int check = llama_token_to_piece(llama_get_model(ctx), token, result.data(), result.size(), special);
|
||||
GGML_ASSERT(check == -n_tokens);
|
||||
} else {
|
||||
result.resize(n_tokens);
|
||||
std::string piece;
|
||||
piece.resize(piece.capacity()); // using string internal cache, 15 bytes + '\n'
|
||||
const int n_chars = llama_token_to_piece(llama_get_model(ctx), token, &piece[0], piece.size(), 0, special);
|
||||
if (n_chars < 0) {
|
||||
piece.resize(-n_chars);
|
||||
int check = llama_token_to_piece(llama_get_model(ctx), token, &piece[0], piece.size(), 0, special);
|
||||
GGML_ASSERT(check == -n_chars);
|
||||
}
|
||||
else {
|
||||
piece.resize(n_chars);
|
||||
}
|
||||
|
||||
return std::string(result.data(), result.size());
|
||||
return piece;
|
||||
}
|
||||
|
||||
std::string llama_detokenize_spm(llama_context * ctx, const std::vector<llama_token> & tokens) {
|
||||
const llama_token bos_id = llama_token_bos(llama_get_model(ctx));
|
||||
|
||||
std::string piece;
|
||||
std::string result;
|
||||
|
||||
for (size_t i = 0; i < tokens.size(); ++i) {
|
||||
piece = llama_token_to_piece(ctx, tokens[i]);
|
||||
|
||||
// remove the leading space of the first non-BOS token
|
||||
if (((tokens[0] == bos_id && i == 1) || (tokens[0] != bos_id && i == 0)) && piece[0] == ' ') {
|
||||
piece = piece.substr(1);
|
||||
}
|
||||
|
||||
result += piece;
|
||||
std::string llama_detokenize(llama_context * ctx, const std::vector<llama_token> & tokens, bool special) {
|
||||
std::string text;
|
||||
text.resize(std::max(text.capacity(), tokens.size()));
|
||||
int32_t n_chars = llama_detokenize(llama_get_model(ctx), tokens.data(), (int32_t)tokens.size(), &text[0], (int32_t)text.size(), false, special);
|
||||
if (n_chars < 0) {
|
||||
text.resize(-n_chars);
|
||||
n_chars = llama_detokenize(llama_get_model(ctx), tokens.data(), (int32_t)tokens.size(), &text[0], (int32_t)text.size(), false, special);
|
||||
GGML_ASSERT(n_chars <= (int32_t)text.size()); // whitespace trimming is performed after per-token detokenization
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
std::string llama_detokenize_bpe(llama_context * ctx, const std::vector<llama_token> & tokens) {
|
||||
std::string piece;
|
||||
std::string result;
|
||||
|
||||
for (size_t i = 0; i < tokens.size(); ++i) {
|
||||
piece = llama_token_to_piece(ctx, tokens[i]);
|
||||
|
||||
result += piece;
|
||||
}
|
||||
text.resize(n_chars);
|
||||
|
||||
// NOTE: the original tokenizer decodes bytes after collecting the pieces.
|
||||
return result;
|
||||
return text;
|
||||
}
|
||||
|
||||
bool llama_should_add_bos_token(const llama_model * model) {
|
||||
|
||||
+9
-14
@@ -99,6 +99,7 @@ struct gpt_params {
|
||||
enum llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
|
||||
enum llama_rope_scaling_type rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED;
|
||||
enum llama_pooling_type pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED; // pooling type for embeddings
|
||||
enum llama_attention_type attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED; // attention type for embeddings
|
||||
|
||||
// // sampling parameters
|
||||
struct llama_sampling_params sparams;
|
||||
@@ -107,6 +108,7 @@ struct gpt_params {
|
||||
std::string model_draft = ""; // draft model for speculative decoding
|
||||
std::string model_alias = "unknown"; // model alias
|
||||
std::string model_url = ""; // model url to download
|
||||
std::string hf_token = ""; // HF token
|
||||
std::string hf_repo = ""; // HF repo
|
||||
std::string hf_file = ""; // HF file
|
||||
std::string prompt = "";
|
||||
@@ -255,6 +257,7 @@ struct gpt_params {
|
||||
bool spm_infill = false; // suffix/prefix/middle pattern for infill
|
||||
};
|
||||
|
||||
void gpt_params_handle_hf_token(gpt_params & params);
|
||||
void gpt_params_handle_model_default(gpt_params & params);
|
||||
|
||||
bool gpt_params_parse_ex (int argc, char ** argv, gpt_params & params);
|
||||
@@ -310,8 +313,8 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
|
||||
struct llama_model_params llama_model_params_from_gpt_params (const gpt_params & params);
|
||||
struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params);
|
||||
|
||||
struct llama_model * llama_load_model_from_url(const char * model_url, const char * path_model, const struct llama_model_params & params);
|
||||
struct llama_model * llama_load_model_from_hf(const char * repo, const char * file, const char * path_model, const struct llama_model_params & params);
|
||||
struct llama_model * llama_load_model_from_url(const char * model_url, const char * path_model, const char * hf_token, const struct llama_model_params & params);
|
||||
struct llama_model * llama_load_model_from_hf(const char * repo, const char * file, const char * path_model, const char * hf_token, const struct llama_model_params & params);
|
||||
|
||||
// Batch utils
|
||||
|
||||
@@ -349,21 +352,13 @@ std::string llama_token_to_piece(
|
||||
llama_token token,
|
||||
bool special = true);
|
||||
|
||||
// TODO: these should be moved in llama.h C-style API under single `llama_detokenize` function
|
||||
// that takes into account the tokenizer type and decides how to handle the leading space
|
||||
//
|
||||
// detokenizes a vector of tokens into a string
|
||||
// should work similar to Python's `tokenizer.decode`
|
||||
// removes the leading space from the first non-BOS token
|
||||
std::string llama_detokenize_spm(
|
||||
// optionally renders special/control tokens
|
||||
std::string llama_detokenize(
|
||||
llama_context * ctx,
|
||||
const std::vector<llama_token> & tokens);
|
||||
|
||||
// detokenizes a vector of tokens into a string
|
||||
// should work similar to Python's `tokenizer.decode`
|
||||
std::string llama_detokenize_bpe(
|
||||
llama_context * ctx,
|
||||
const std::vector<llama_token> & tokens);
|
||||
const std::vector<llama_token> & tokens,
|
||||
bool special = true);
|
||||
|
||||
// Uses the value from the model metadata if possible, otherwise
|
||||
// defaults to true when model type is SPM, otherwise false.
|
||||
|
||||
+188
-18
@@ -487,6 +487,9 @@ class Model:
|
||||
if chkhsh == "7967bfa498ade6b757b064f31e964dddbb80f8f9a4d68d4ba7998fcf281c531a":
|
||||
# ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-code
|
||||
res = "jina-v2-code"
|
||||
if chkhsh == "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b":
|
||||
# ref: https://huggingface.co/THUDM/glm-4-9b-chat
|
||||
res = "chatglm-bpe"
|
||||
if chkhsh == "7fc505bd3104ca1083b150b17d088b59534ede9bde81f0dd2090967d7fe52cee":
|
||||
# ref: https://huggingface.co/LumiOpen/Viking-7B
|
||||
res = "viking"
|
||||
@@ -2445,7 +2448,7 @@ class Gemma2Model(Model):
|
||||
raise ValueError("query_pre_attn_scalar must be equal to n_embd / n_head")
|
||||
|
||||
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
|
||||
del bid # unusem
|
||||
del bid # unused
|
||||
|
||||
# lm_head is not used in llama.cpp, while autoawq will include this tensor in model
|
||||
# To prevent errors, skip loading lm_head.weight.
|
||||
@@ -3176,6 +3179,190 @@ class JaisModel(Model):
|
||||
self.gguf_writer.add_max_alibi_bias(self.max_alibi_bias)
|
||||
|
||||
|
||||
@Model.register("ChatGLMModel", "ChatGLMForConditionalGeneration")
|
||||
class ChatGLMModel(Model):
|
||||
model_arch = gguf.MODEL_ARCH.CHATGLM
|
||||
|
||||
def set_vocab_chatglm3(self):
|
||||
dir_model = self.dir_model
|
||||
hparams = self.hparams
|
||||
tokens: list[bytearray] = []
|
||||
toktypes: list[int] = []
|
||||
scores: list[float] = []
|
||||
|
||||
from transformers import AutoTokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)
|
||||
vocab_size = hparams.get("padded_vocab_size", len(tokenizer.get_vocab()))
|
||||
assert max(tokenizer.get_vocab().values()) < vocab_size
|
||||
role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"]
|
||||
special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"] + role_special_tokens
|
||||
for token_id in range(vocab_size):
|
||||
piece = tokenizer._convert_id_to_token(token_id)
|
||||
if token_id == 0:
|
||||
piece = "<unk>"
|
||||
elif token_id == 1:
|
||||
piece = "<bos>"
|
||||
elif token_id == 2:
|
||||
piece = "<eos>"
|
||||
|
||||
text = piece.encode("utf-8")
|
||||
score = 0.0
|
||||
# Referencing the tokenizer Python implementation(https://huggingface.co/THUDM/chatglm3-6b/blob/main/tokenization_chatglm.py),
|
||||
# it is only valid if it is less than tokenizer.tokenizer.sp_model.vocab_size()
|
||||
if len(piece) != 0 and token_id < tokenizer.tokenizer.sp_model.vocab_size():
|
||||
score = tokenizer.tokenizer.sp_model.get_score(token_id)
|
||||
|
||||
if len(piece) == 0:
|
||||
text = f"[PAD{token_id}]".encode("utf-8")
|
||||
|
||||
if token_id >= tokenizer.tokenizer.sp_model.vocab_size():
|
||||
if piece in special_tokens:
|
||||
# show special tokens in prompt
|
||||
toktype = SentencePieceTokenTypes.USER_DEFINED
|
||||
else:
|
||||
toktype = SentencePieceTokenTypes.UNKNOWN
|
||||
tokens.append(text)
|
||||
scores.append(score)
|
||||
toktypes.append(toktype)
|
||||
continue
|
||||
|
||||
toktype = SentencePieceTokenTypes.NORMAL
|
||||
if tokenizer.tokenizer.sp_model.is_unknown(token_id):
|
||||
toktype = SentencePieceTokenTypes.UNKNOWN
|
||||
elif tokenizer.tokenizer.sp_model.is_control(token_id):
|
||||
toktype = SentencePieceTokenTypes.CONTROL
|
||||
elif tokenizer.tokenizer.sp_model.is_unused(token_id):
|
||||
toktype = SentencePieceTokenTypes.UNUSED
|
||||
elif tokenizer.tokenizer.sp_model.is_byte(token_id):
|
||||
toktype = SentencePieceTokenTypes.BYTE
|
||||
|
||||
tokens.append(text)
|
||||
scores.append(score)
|
||||
toktypes.append(toktype)
|
||||
|
||||
self.gguf_writer.add_tokenizer_model("llama")
|
||||
# glm3 needs prefix and suffix formatted as:
|
||||
# prompt = "[gMASK]sop<|user|>\n" + prompt + "<|assistant|>"
|
||||
self.gguf_writer.add_tokenizer_pre("chatglm-spm")
|
||||
self.gguf_writer.add_token_list(tokens)
|
||||
self.gguf_writer.add_token_scores(scores)
|
||||
self.gguf_writer.add_token_types(toktypes)
|
||||
|
||||
special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
|
||||
special_vocab.add_to_gguf(self.gguf_writer)
|
||||
|
||||
@staticmethod
|
||||
def token_bytes_to_string(b):
|
||||
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
|
||||
byte_encoder = bytes_to_unicode()
|
||||
return ''.join([byte_encoder[ord(char)] for char in b.decode('latin-1')])
|
||||
|
||||
@staticmethod
|
||||
def bpe(mergeable_ranks: dict[bytes, int], token: bytes, max_rank: int | None = None) -> list[bytes]:
|
||||
parts = [bytes([b]) for b in token]
|
||||
while True:
|
||||
min_idx = None
|
||||
min_rank = None
|
||||
for i, pair in enumerate(zip(parts[:-1], parts[1:])):
|
||||
rank = mergeable_ranks.get(pair[0] + pair[1])
|
||||
if rank is not None and (min_rank is None or rank < min_rank):
|
||||
min_idx = i
|
||||
min_rank = rank
|
||||
if min_rank is None or (max_rank is not None and min_rank >= max_rank):
|
||||
break
|
||||
assert min_idx is not None
|
||||
parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
|
||||
return parts
|
||||
|
||||
def set_vocab(self):
|
||||
if "THUDM/chatglm3-6b" in self.hparams.get("_name_or_path", ""):
|
||||
self.set_vocab_chatglm3()
|
||||
return
|
||||
|
||||
dir_model = self.dir_model
|
||||
hparams = self.hparams
|
||||
tokens: list[str] = []
|
||||
toktypes: list[int] = []
|
||||
|
||||
from transformers import AutoTokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)
|
||||
vocab_size = hparams["padded_vocab_size"]
|
||||
assert max(tokenizer.get_vocab().values()) < vocab_size
|
||||
|
||||
tokpre = self.get_vocab_base_pre(tokenizer)
|
||||
|
||||
merges = []
|
||||
vocab = {}
|
||||
mergeable_ranks = tokenizer.mergeable_ranks
|
||||
for token, rank in mergeable_ranks.items():
|
||||
vocab[ChatGLMModel.token_bytes_to_string(token)] = rank
|
||||
if len(token) == 1:
|
||||
continue
|
||||
merged = ChatGLMModel.bpe(mergeable_ranks, token, max_rank=rank)
|
||||
assert len(merged) >= 2 and len(merged) <= 7
|
||||
merges.append(' '.join(map(ChatGLMModel.token_bytes_to_string, merged)))
|
||||
|
||||
# for this kind of tokenizer, added_vocab is not a subset of vocab, so they need to be combined
|
||||
added_vocab = tokenizer.get_added_vocab()
|
||||
reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **added_vocab}.items()}
|
||||
|
||||
for i in range(vocab_size):
|
||||
if i not in reverse_vocab:
|
||||
tokens.append(f"[PAD{i}]")
|
||||
toktypes.append(gguf.TokenType.USER_DEFINED)
|
||||
elif reverse_vocab[i] in added_vocab:
|
||||
tokens.append(reverse_vocab[i])
|
||||
if tokenizer.added_tokens_decoder[i].special:
|
||||
toktypes.append(gguf.TokenType.CONTROL)
|
||||
else:
|
||||
toktypes.append(gguf.TokenType.USER_DEFINED)
|
||||
else:
|
||||
tokens.append(reverse_vocab[i])
|
||||
toktypes.append(gguf.TokenType.NORMAL)
|
||||
|
||||
self.gguf_writer.add_tokenizer_model("gpt2")
|
||||
self.gguf_writer.add_tokenizer_pre(tokpre)
|
||||
self.gguf_writer.add_token_list(tokens)
|
||||
self.gguf_writer.add_token_types(toktypes)
|
||||
|
||||
special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
|
||||
special_vocab.merges = merges
|
||||
# only add special tokens when they were not already loaded from config.json
|
||||
special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
|
||||
special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
|
||||
# this one is usually not in config.json anyway
|
||||
special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
|
||||
special_vocab.add_to_gguf(self.gguf_writer)
|
||||
|
||||
def set_gguf_parameters(self):
|
||||
self.gguf_writer.add_name(self.hparams.get("_name_or_path").split("/")[1]) # THUDM/glm4-9b-chat or THUDM/chatglm3-6b
|
||||
n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
|
||||
n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
|
||||
n_head_kv = self.hparams.get("multi_query_group_num", n_head)
|
||||
self.gguf_writer.add_context_length(self.hparams.get("seq_length", n_embed))
|
||||
self.gguf_writer.add_embedding_length(n_embed)
|
||||
self.gguf_writer.add_feed_forward_length(self.hparams.get("ffn_hidden_size", 4 * n_embed))
|
||||
self.gguf_writer.add_block_count(self.hparams["num_layers"])
|
||||
self.gguf_writer.add_head_count(n_head)
|
||||
self.gguf_writer.add_head_count_kv(n_head_kv)
|
||||
self.gguf_writer.add_layer_norm_rms_eps(self.hparams["layernorm_epsilon"])
|
||||
self.gguf_writer.add_file_type(self.ftype)
|
||||
self.gguf_writer.add_rope_dimension_count(64)
|
||||
self.gguf_writer.add_add_bos_token(False)
|
||||
rope_freq = 10000
|
||||
if "rope_ratio" in self.hparams:
|
||||
rope_freq = rope_freq * self.hparams["rope_ratio"]
|
||||
self.gguf_writer.add_rope_freq_base(rope_freq)
|
||||
|
||||
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
|
||||
del bid # unused
|
||||
|
||||
if name.endswith(".rotary_pos_emb.inv_freq"):
|
||||
return []
|
||||
|
||||
name = name.removeprefix("transformer.")
|
||||
return [(self.map_tensor_name(name), data_torch)]
|
||||
|
||||
###### CONVERSION LOGIC ######
|
||||
|
||||
|
||||
@@ -3225,10 +3412,6 @@ def parse_args() -> argparse.Namespace:
|
||||
"--vocab-only", action="store_true",
|
||||
help="extract only the vocab",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--awq-path", type=Path, default=None,
|
||||
help="Path to scale awq cache file",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--outfile", type=Path,
|
||||
help="path to write to; default: based on input. {ftype} will be replaced by the outtype.",
|
||||
@@ -3306,19 +3489,6 @@ def main() -> None:
|
||||
|
||||
dir_model = args.model
|
||||
|
||||
if args.awq_path:
|
||||
sys.path.insert(1, str(Path(__file__).parent / 'awq-py'))
|
||||
from awq.apply_awq import add_scale_weights # type: ignore[import-not-found]
|
||||
tmp_model_path = args.model / "weighted_model"
|
||||
dir_model = tmp_model_path
|
||||
if tmp_model_path.is_dir():
|
||||
logger.info(f"{tmp_model_path} exists as a weighted model.")
|
||||
else:
|
||||
tmp_model_path.mkdir(parents=True, exist_ok=True)
|
||||
logger.info("Saving new weighted model ...")
|
||||
add_scale_weights(str(args.model), str(args.awq_path), str(tmp_model_path))
|
||||
logger.info(f"Saved weighted model at {tmp_model_path}.")
|
||||
|
||||
if not dir_model.is_dir():
|
||||
logger.error(f'Error: {args.model} is not a directory')
|
||||
sys.exit(1)
|
||||
|
||||
@@ -0,0 +1,56 @@
|
||||
|
||||
# Android
|
||||
|
||||
## Build on Android using Termux
|
||||
[Termux](https://github.com/termux/termux-app#installation) is a method to execute `llama.cpp` on an Android device (no root required).
|
||||
```
|
||||
apt update && apt upgrade -y
|
||||
apt install git make cmake
|
||||
```
|
||||
|
||||
It's recommended to move your model inside the `~/` directory for best performance:
|
||||
```
|
||||
cd storage/downloads
|
||||
mv model.gguf ~/
|
||||
```
|
||||
|
||||
[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.
|
||||
|
||||
## Building the Project using Android NDK
|
||||
Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.
|
||||
|
||||
Execute the following commands on your computer to avoid downloading the NDK to your mobile. Alternatively, you can also do this in Termux:
|
||||
```
|
||||
$ mkdir build-android
|
||||
$ cd build-android
|
||||
$ export NDK=<your_ndk_directory>
|
||||
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
|
||||
$ make
|
||||
```
|
||||
|
||||
Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if Android 11+ then run the command twice).
|
||||
|
||||
Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to add executable permission:
|
||||
|
||||
(Assumed that you have pushed the built executable files to the /sdcard/llama.cpp/bin path using `adb push`)
|
||||
```
|
||||
$cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
|
||||
$cd /data/data/com.termux/files/home/bin
|
||||
$chmod +x ./*
|
||||
```
|
||||
|
||||
Download model [llama-2-7b-chat.Q4_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf), and push it to `/sdcard/llama.cpp/`, then move it to `/data/data/com.termux/files/home/model/`
|
||||
|
||||
```
|
||||
$mv /sdcard/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf /data/data/com.termux/files/home/model/
|
||||
```
|
||||
|
||||
Now, you can start chatting:
|
||||
```
|
||||
$cd /data/data/com.termux/files/home/bin
|
||||
$./llama-cli -m ../model/llama-2-7b-chat.Q4_K_M.gguf -n 128 -cml
|
||||
```
|
||||
|
||||
Here's a demo of an interactive session running on Pixel 5 phone:
|
||||
|
||||
https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
|
||||
+288
@@ -0,0 +1,288 @@
|
||||
# Build llama.cpp locally
|
||||
|
||||
**To get the Code:**
|
||||
|
||||
```bash
|
||||
git clone https://github.com/ggerganov/llama.cpp
|
||||
cd llama.cpp
|
||||
```
|
||||
|
||||
In order to build llama.cpp you have four different options.
|
||||
|
||||
- Using `make`:
|
||||
- On Linux or MacOS:
|
||||
|
||||
```bash
|
||||
make
|
||||
```
|
||||
|
||||
- On Windows:
|
||||
|
||||
1. Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
|
||||
2. Extract `w64devkit` on your pc.
|
||||
3. Run `w64devkit.exe`.
|
||||
4. Use the `cd` command to reach the `llama.cpp` folder.
|
||||
5. From here you can run:
|
||||
```bash
|
||||
make
|
||||
```
|
||||
|
||||
- Notes:
|
||||
- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `make -j 8` will run 8 jobs in parallel.
|
||||
- For faster repeated compilation, install [ccache](https://ccache.dev/).
|
||||
- For debug builds, run `make LLAMA_DEBUG=1`
|
||||
|
||||
- Using `CMake`:
|
||||
|
||||
```bash
|
||||
cmake -B build
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
**Notes**:
|
||||
|
||||
- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
|
||||
- For faster repeated compilation, install [ccache](https://ccache.dev/).
|
||||
- For debug builds, there are two cases:
|
||||
|
||||
1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):
|
||||
|
||||
```bash
|
||||
cmake -B build -DCMAKE_BUILD_TYPE=Debug
|
||||
cmake --build build
|
||||
```
|
||||
|
||||
2. Multi-config generators (`-G` param set to Visual Studio, XCode...):
|
||||
|
||||
```bash
|
||||
cmake -B build -G "Xcode"
|
||||
cmake --build build --config Debug
|
||||
```
|
||||
|
||||
- Using `gmake` (FreeBSD):
|
||||
|
||||
1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
|
||||
2. Add your user to **video** group
|
||||
3. Install compilation dependencies.
|
||||
|
||||
```bash
|
||||
sudo pkg install gmake automake autoconf pkgconf llvm15 openblas
|
||||
|
||||
gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
|
||||
```
|
||||
|
||||
## Metal Build
|
||||
|
||||
On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
|
||||
To disable the Metal build at compile time use the `GGML_NO_METAL=1` flag or the `GGML_METAL=OFF` cmake option.
|
||||
|
||||
When built with Metal support, you can explicitly disable GPU inference with the `--n-gpu-layers|-ngl 0` command-line
|
||||
argument.
|
||||
|
||||
## BLAS Build
|
||||
|
||||
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS. There are currently several different BLAS implementations available for build and use:
|
||||
|
||||
### Accelerate Framework:
|
||||
|
||||
This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
|
||||
|
||||
### OpenBLAS:
|
||||
|
||||
This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
|
||||
|
||||
- Using `make`:
|
||||
- On Linux:
|
||||
```bash
|
||||
make GGML_OPENBLAS=1
|
||||
```
|
||||
|
||||
- On Windows:
|
||||
|
||||
1. Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
|
||||
2. Download the latest version of [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases).
|
||||
3. Extract `w64devkit` on your pc.
|
||||
4. From the OpenBLAS zip that you just downloaded copy `libopenblas.a`, located inside the `lib` folder, inside `w64devkit\x86_64-w64-mingw32\lib`.
|
||||
5. From the same OpenBLAS zip copy the content of the `include` folder inside `w64devkit\x86_64-w64-mingw32\include`.
|
||||
6. Run `w64devkit.exe`.
|
||||
7. Use the `cd` command to reach the `llama.cpp` folder.
|
||||
8. From here you can run:
|
||||
|
||||
```bash
|
||||
make GGML_OPENBLAS=1
|
||||
```
|
||||
|
||||
- Using `CMake` on Linux:
|
||||
|
||||
```bash
|
||||
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
### BLIS
|
||||
|
||||
Check [BLIS.md](./backend/BLIS.md) for more information.
|
||||
|
||||
### SYCL
|
||||
|
||||
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.
|
||||
|
||||
llama.cpp based on SYCL is used to **support Intel GPU** (Data Center Max series, Flex series, Arc series, Built-in GPU and iGPU).
|
||||
|
||||
For detailed info, please refer to [llama.cpp for SYCL](./backend/SYCL.md).
|
||||
|
||||
### Intel oneMKL
|
||||
|
||||
Building through oneAPI compilers will make avx_vnni instruction set available for intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./backend/SYCL.md).
|
||||
|
||||
- Using manual oneAPI installation:
|
||||
By default, `GGML_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DGGML_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. Otherwise please install oneAPI and follow the below steps:
|
||||
```bash
|
||||
source /opt/intel/oneapi/setvars.sh # You can skip this step if in oneapi-basekit docker image, only required for manual installation
|
||||
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
- Using oneAPI docker image:
|
||||
If you do not want to source the environment vars and install oneAPI manually, you can also build the code using intel docker container: [oneAPI-basekit](https://hub.docker.com/r/intel/oneapi-basekit). Then, you can use the commands given above.
|
||||
|
||||
Check [Optimizing and Running LLaMA2 on Intel® CPU](https://www.intel.com/content/www/us/en/content-details/791610/optimizing-and-running-llama2-on-intel-cpu.html) for more information.
|
||||
|
||||
### CUDA
|
||||
|
||||
This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
|
||||
|
||||
For Jetson user, if you have Jetson Orin, you can try this: [Offical Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an old model(nano/TX2), need some additional operations before compiling.
|
||||
|
||||
- Using `make`:
|
||||
```bash
|
||||
make GGML_CUDA=1
|
||||
```
|
||||
- Using `CMake`:
|
||||
|
||||
```bash
|
||||
cmake -B build -DGGML_CUDA=ON
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
|
||||
|
||||
| Option | Legal values | Default | Description |
|
||||
|-------------------------------|------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| GGML_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
|
||||
| GGML_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
|
||||
| GGML_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. |
|
||||
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, RDNA3). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
|
||||
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models |
|
||||
| GGML_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
|
||||
| GGML_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
|
||||
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
|
||||
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
|
||||
|
||||
### hipBLAS
|
||||
|
||||
This provides BLAS acceleration on HIP-supported AMD GPUs.
|
||||
Make sure to have ROCm installed.
|
||||
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).
|
||||
|
||||
- Using `make`:
|
||||
```bash
|
||||
make GGML_HIPBLAS=1
|
||||
```
|
||||
- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
|
||||
```bash
|
||||
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
|
||||
cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
|
||||
&& cmake --build build --config Release -- -j 16
|
||||
```
|
||||
On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DGGML_HIP_UMA=ON`.
|
||||
However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
|
||||
|
||||
Note that if you get the following error:
|
||||
```
|
||||
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
|
||||
```
|
||||
Try searching for a directory under `HIP_PATH` that contains the file
|
||||
`oclc_abi_version_400.bc`. Then, add the following to the start of the
|
||||
command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something
|
||||
like:
|
||||
```bash
|
||||
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
|
||||
HIP_DEVICE_LIB_PATH=<directory-you-just-found> \
|
||||
cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
|
||||
&& cmake --build build -- -j 16
|
||||
```
|
||||
|
||||
- Using `make` (example for target gfx1030, build with 16 CPU threads):
|
||||
```bash
|
||||
make -j16 GGML_HIPBLAS=1 GGML_HIP_UMA=1 AMDGPU_TARGETS=gfx1030
|
||||
```
|
||||
|
||||
- Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
|
||||
```bash
|
||||
set PATH=%HIP_PATH%\bin;%PATH%
|
||||
cmake -S . -B build -G Ninja -DAMDGPU_TARGETS=gfx1100 -DGGML_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build
|
||||
```
|
||||
Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
|
||||
Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.
|
||||
|
||||
|
||||
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
|
||||
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
|
||||
The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
|
||||
|
||||
| Option | Legal values | Default | Description |
|
||||
|------------------------|------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| GGML_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
|
||||
| GGML_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
|
||||
| GGML_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
|
||||
|
||||
### Vulkan
|
||||
|
||||
**With docker**:
|
||||
|
||||
You don't need to install Vulkan SDK. It will be installed inside the container.
|
||||
|
||||
```sh
|
||||
# Build the image
|
||||
docker build -t llama-cpp-vulkan -f .devops/llama-cli-vulkan.Dockerfile .
|
||||
|
||||
# Then, use it:
|
||||
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-vulkan -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
|
||||
```
|
||||
|
||||
**Without docker**:
|
||||
|
||||
Firstly, you need to make sure you have installed [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html)
|
||||
|
||||
For example, on Ubuntu 22.04 (jammy), use the command below:
|
||||
|
||||
```bash
|
||||
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
|
||||
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
|
||||
apt update -y
|
||||
apt-get install -y vulkan-sdk
|
||||
# To verify the installation, use the command below:
|
||||
vulkaninfo
|
||||
```
|
||||
|
||||
Alternatively your package manager might be able to provide the appropriate libraries.
|
||||
For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
|
||||
For Fedora 40, you can install `vulkan-devel`, `glslc` and `glslang` packages.
|
||||
|
||||
Then, build llama.cpp using the cmake command below:
|
||||
|
||||
```bash
|
||||
cmake -B build -DGGML_VULKAN=1
|
||||
cmake --build build --config Release
|
||||
# Test the output binary (with "-ngl 33" to offload all layers to GPU)
|
||||
./bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4
|
||||
|
||||
# You should see in the output, ggml_vulkan detected your GPU. For example:
|
||||
# ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32
|
||||
```
|
||||
|
||||
### Android
|
||||
|
||||
To read documentation for how to build on Android, [click here](./android.md)
|
||||
@@ -1,4 +1,4 @@
|
||||
## Add a new model architecture to `llama.cpp`
|
||||
# Add a new model architecture to `llama.cpp`
|
||||
|
||||
Adding a model requires few steps:
|
||||
|
||||
@@ -0,0 +1,86 @@
|
||||
# Docker
|
||||
|
||||
## Prerequisites
|
||||
* Docker must be installed and running on your system.
|
||||
* Create a folder to store big models & intermediate files (ex. /llama/models)
|
||||
|
||||
## Images
|
||||
We have three Docker images available for this project:
|
||||
|
||||
1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
3. `ghcr.io/ggerganov/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
|
||||
Additionally, there the following images, similar to the above:
|
||||
|
||||
- `ghcr.io/ggerganov/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
- `ghcr.io/ggerganov/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
|
||||
|
||||
The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](.github/workflows/docker.yml). If you need different settings (for example, a different CUDA or ROCm library, you'll need to build the images locally for now).
|
||||
|
||||
## Usage
|
||||
|
||||
The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image.
|
||||
|
||||
Replace `/path/to/models` below with the actual path where you downloaded the models.
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
|
||||
```
|
||||
|
||||
On completion, you are ready to play!
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
|
||||
```
|
||||
|
||||
or with a light image:
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
|
||||
```
|
||||
|
||||
or with a server image:
|
||||
|
||||
```bash
|
||||
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
|
||||
```
|
||||
|
||||
## Docker With CUDA
|
||||
|
||||
Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.
|
||||
|
||||
## Building Docker locally
|
||||
|
||||
```bash
|
||||
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
|
||||
docker build -t local/llama.cpp:light-cuda -f .devops/llama-cli-cuda.Dockerfile .
|
||||
docker build -t local/llama.cpp:server-cuda -f .devops/llama-server-cuda.Dockerfile .
|
||||
```
|
||||
|
||||
You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.
|
||||
|
||||
The defaults are:
|
||||
|
||||
- `CUDA_VERSION` set to `11.7.1`
|
||||
- `CUDA_DOCKER_ARCH` set to `all`
|
||||
|
||||
The resulting images, are essentially the same as the non-CUDA images:
|
||||
|
||||
1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
|
||||
2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
|
||||
3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.
|
||||
|
||||
## Usage
|
||||
|
||||
After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.
|
||||
|
||||
```bash
|
||||
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
|
||||
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
|
||||
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
|
||||
```
|
||||
@@ -0,0 +1,39 @@
|
||||
# Install pre-built version of llama.cpp
|
||||
|
||||
## Homebrew
|
||||
|
||||
On Mac and Linux, the homebrew package manager can be used via
|
||||
|
||||
```sh
|
||||
brew install llama.cpp
|
||||
```
|
||||
The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
|
||||
|
||||
## Nix
|
||||
|
||||
On Mac and Linux, the Nix package manager can be used via
|
||||
|
||||
```sh
|
||||
nix profile install nixpkgs#llama-cpp
|
||||
```
|
||||
For flake enabled installs.
|
||||
|
||||
Or
|
||||
|
||||
```sh
|
||||
nix-env --file '<nixpkgs>' --install --attr llama-cpp
|
||||
```
|
||||
|
||||
For non-flake enabled installs.
|
||||
|
||||
This expression is automatically updated within the [nixpkgs repo](https://github.com/NixOS/nixpkgs/blob/nixos-24.05/pkgs/by-name/ll/llama-cpp/package.nix#L164).
|
||||
|
||||
## Flox
|
||||
|
||||
On Mac and Linux, Flox can be used to install llama.cpp within a Flox environment via
|
||||
|
||||
```sh
|
||||
flox install llama-cpp
|
||||
```
|
||||
|
||||
Flox follows the nixpkgs build of llama.cpp.
|
||||
@@ -229,7 +229,7 @@ private func tokenize(text: String, add_bos: Bool) -> [llama_token] {
|
||||
|
||||
private func token_to_piece(token: llama_token, buffer: inout [CChar]) -> String? {
|
||||
var result = [CChar](repeating: 0, count: 8)
|
||||
let nTokens = llama_token_to_piece(model, token, &result, Int32(result.count), false)
|
||||
let nTokens = llama_token_to_piece(model, token, &result, Int32(result.count), 0, false)
|
||||
if nTokens < 0 {
|
||||
let actualTokensCount = -Int(nTokens)
|
||||
result = .init(repeating: 0, count: actualTokensCount)
|
||||
@@ -238,6 +238,7 @@ private func token_to_piece(token: llama_token, buffer: inout [CChar]) -> String
|
||||
token,
|
||||
&result,
|
||||
Int32(result.count),
|
||||
0,
|
||||
false
|
||||
)
|
||||
assert(check == actualTokensCount)
|
||||
|
||||
@@ -87,4 +87,4 @@ The LORA rank can be configured for each model tensor type separately with these
|
||||
|
||||
The LORA rank of 'norm' tensors should always be 1.
|
||||
|
||||
To see all available options use `finetune --help`.
|
||||
To see all available options use `llama-finetune --help`.
|
||||
|
||||
@@ -8,7 +8,7 @@ if [[ ! $LLAMA_MODEL_DIR ]]; then LLAMA_MODEL_DIR="./models"; fi
|
||||
if [[ ! $LLAMA_TRAINING_DIR ]]; then LLAMA_TRAINING_DIR="."; fi
|
||||
|
||||
# MODEL="$LLAMA_MODEL_DIR/openllama-3b-v2-q8_0.gguf" # This is the model the readme uses.
|
||||
MODEL="$LLAMA_MODEL_DIR/openllama-3b-v2.gguf" # An f16 model. Note in this case with "-g", you get an f32-format .BIN file that isn't yet supported if you use it with "main --lora" with GPU inferencing.
|
||||
MODEL="$LLAMA_MODEL_DIR/openllama-3b-v2.gguf" # An f16 model. Note in this case with "-g", you get an f32-format .BIN file that isn't yet supported if you use it with "llama-cli --lora" with GPU inferencing.
|
||||
|
||||
while getopts "dg" opt; do
|
||||
case $opt in
|
||||
|
||||
@@ -322,7 +322,7 @@ actor LlamaContext {
|
||||
defer {
|
||||
result.deallocate()
|
||||
}
|
||||
let nTokens = llama_token_to_piece(model, token, result, 8, false)
|
||||
let nTokens = llama_token_to_piece(model, token, result, 8, 0, false)
|
||||
|
||||
if nTokens < 0 {
|
||||
let newResult = UnsafeMutablePointer<Int8>.allocate(capacity: Int(-nTokens))
|
||||
@@ -330,7 +330,7 @@ actor LlamaContext {
|
||||
defer {
|
||||
newResult.deallocate()
|
||||
}
|
||||
let nNewTokens = llama_token_to_piece(model, token, newResult, -nTokens, false)
|
||||
let nNewTokens = llama_token_to_piece(model, token, newResult, -nTokens, 0, false)
|
||||
let bufferPointer = UnsafeBufferPointer(start: newResult, count: Int(nNewTokens))
|
||||
return Array(bufferPointer)
|
||||
} else {
|
||||
|
||||
@@ -1,3 +1,4 @@
|
||||
-r ../../requirements/requirements-convert_legacy_llama.txt
|
||||
--extra-index-url https://download.pytorch.org/whl/cpu
|
||||
pillow~=10.2.0
|
||||
torch~=2.2.1
|
||||
|
||||
@@ -4,7 +4,89 @@ You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-
|
||||
|
||||
Note: It is synced from llama.cpp `main` every 6 hours.
|
||||
|
||||
## Llama 2 7B
|
||||
Example usage:
|
||||
|
||||
```bash
|
||||
# obtain the official LLaMA model weights and place them in ./models
|
||||
ls ./models
|
||||
llama-2-7b tokenizer_checklist.chk tokenizer.model
|
||||
# [Optional] for models using BPE tokenizers
|
||||
ls ./models
|
||||
<folder containing weights and tokenizer json> vocab.json
|
||||
# [Optional] for PyTorch .bin models like Mistral-7B
|
||||
ls ./models
|
||||
<folder containing weights and tokenizer json>
|
||||
|
||||
# install Python dependencies
|
||||
python3 -m pip install -r requirements.txt
|
||||
|
||||
# convert the model to ggml FP16 format
|
||||
python3 convert_hf_to_gguf.py models/mymodel/
|
||||
|
||||
# quantize the model to 4-bits (using Q4_K_M method)
|
||||
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
|
||||
|
||||
# update the gguf filetype to current version if older version is now unsupported
|
||||
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
|
||||
```
|
||||
|
||||
Run the quantized model:
|
||||
|
||||
```bash
|
||||
# start inference on a gguf model
|
||||
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
|
||||
```
|
||||
|
||||
When running the larger models, make sure you have enough disk space to store all the intermediate files.
|
||||
|
||||
## Memory/Disk Requirements
|
||||
|
||||
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
|
||||
|
||||
| Model | Original size | Quantized size (Q4_0) |
|
||||
|------:|--------------:|----------------------:|
|
||||
| 7B | 13 GB | 3.9 GB |
|
||||
| 13B | 24 GB | 7.8 GB |
|
||||
| 30B | 60 GB | 19.5 GB |
|
||||
| 65B | 120 GB | 38.5 GB |
|
||||
|
||||
## Quantization
|
||||
|
||||
Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
|
||||
|
||||
*(outdated)*
|
||||
|
||||
| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|
||||
|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
|
||||
| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
|
||||
| 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
|
||||
| 7B | ms/tok @ 4th | 127 | 55 | 54 | 76 | 83 | 72 |
|
||||
| 7B | ms/tok @ 8th | 122 | 43 | 45 | 52 | 56 | 67 |
|
||||
| 7B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
|
||||
| 13B | perplexity | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
|
||||
| 13B | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G |
|
||||
| 13B | ms/tok @ 4th | - | 103 | 105 | 148 | 160 | 131 |
|
||||
| 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
|
||||
| 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
|
||||
|
||||
- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
|
||||
- recent k-quants improvements and new i-quants
|
||||
- [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
|
||||
- [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
|
||||
- [#4773 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4773)
|
||||
- [#4856 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4856)
|
||||
- [#4861 - importance matrix](https://github.com/ggerganov/llama.cpp/pull/4861)
|
||||
- [#4872 - MoE models](https://github.com/ggerganov/llama.cpp/pull/4872)
|
||||
- [#4897 - 2-bit quantization](https://github.com/ggerganov/llama.cpp/pull/4897)
|
||||
- [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
|
||||
- [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
|
||||
- [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
|
||||
- [#4996 - k-qunats tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
|
||||
- [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
|
||||
- [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
|
||||
- [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)
|
||||
|
||||
**Llama 2 7B**
|
||||
|
||||
| Quantization | Bits per Weight (BPW) |
|
||||
|--------------|-----------------------|
|
||||
@@ -18,7 +100,8 @@ Note: It is synced from llama.cpp `main` every 6 hours.
|
||||
| Q5_K_M | 5.68 |
|
||||
| Q6_K | 6.56 |
|
||||
|
||||
## Llama 2 13B
|
||||
**Llama 2 13B**
|
||||
|
||||
Quantization | Bits per Weight (BPW)
|
||||
-- | --
|
||||
Q2_K | 3.34
|
||||
@@ -31,7 +114,7 @@ Q5_K_S | 5.51
|
||||
Q5_K_M | 5.67
|
||||
Q6_K | 6.56
|
||||
|
||||
# Llama 2 70B
|
||||
**Llama 2 70B**
|
||||
|
||||
Quantization | Bits per Weight (BPW)
|
||||
-- | --
|
||||
|
||||
@@ -366,7 +366,8 @@ Notice that each `probs` is an array of length `n_probs`.
|
||||
"assistant_name": "",
|
||||
"user_name": "",
|
||||
"default_generation_settings": { ... },
|
||||
"total_slots": 1
|
||||
"total_slots": 1,
|
||||
"chat_template": ""
|
||||
}
|
||||
```
|
||||
|
||||
@@ -374,6 +375,7 @@ Notice that each `probs` is an array of length `n_probs`.
|
||||
- `user_name` - the required anti-prompt to generate the prompt in case you have specified a system prompt for all slots.
|
||||
- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
|
||||
- `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
|
||||
- `chat_template` - the model's original Jinja2 prompt template
|
||||
|
||||
- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
|
||||
|
||||
|
||||
@@ -2605,7 +2605,7 @@ int main(int argc, char ** argv) {
|
||||
// if a custom chat template is not supplied, we will use the one that comes with the model (if any)
|
||||
if (params.chat_template.empty()) {
|
||||
if (!ctx_server.validate_model_chat_template()) {
|
||||
LOG_ERROR("The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses", {});
|
||||
LOG_WARNING("The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses", {});
|
||||
params.chat_template = "chatml";
|
||||
}
|
||||
}
|
||||
@@ -2967,11 +2967,20 @@ int main(int argc, char ** argv) {
|
||||
};
|
||||
|
||||
const auto handle_props = [&ctx_server](const httplib::Request & req, httplib::Response & res) {
|
||||
std::string template_key = "tokenizer.chat_template", curr_tmpl;
|
||||
int32_t tlen = llama_model_meta_val_str(ctx_server.model, template_key.c_str(), nullptr, 0);
|
||||
if (tlen > 0) {
|
||||
std::vector<char> curr_tmpl_buf(tlen + 1, 0);
|
||||
if (llama_model_meta_val_str(ctx_server.model, template_key.c_str(), curr_tmpl_buf.data(), curr_tmpl_buf.size()) == tlen) {
|
||||
curr_tmpl = std::string(curr_tmpl_buf.data(), tlen);
|
||||
}
|
||||
}
|
||||
res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
|
||||
json data = {
|
||||
{ "system_prompt", ctx_server.system_prompt.c_str() },
|
||||
{ "default_generation_settings", ctx_server.default_generation_settings_for_props },
|
||||
{ "total_slots", ctx_server.params.n_parallel }
|
||||
{ "total_slots", ctx_server.params.n_parallel },
|
||||
{ "chat_template", curr_tmpl.c_str() }
|
||||
};
|
||||
|
||||
res.set_content(data.dump(), "application/json; charset=utf-8");
|
||||
|
||||
@@ -1175,4 +1175,5 @@ endif()
|
||||
|
||||
if (BUILD_SHARED_LIBS)
|
||||
set_target_properties(ggml PROPERTIES POSITION_INDEPENDENT_CODE ON)
|
||||
target_compile_definitions(ggml PRIVATE GGML_SHARED GGML_BUILD)
|
||||
endif()
|
||||
|
||||
@@ -68,7 +68,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q4_0(
|
||||
const int iqs4 = k_KQ % QI4_0;
|
||||
const int shift = k_KQ & (QI8_1/2);
|
||||
|
||||
const int v = (get_int_from_uint8(K_q4_0[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int v = (get_int_b2(K_q4_0[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int u = Q_q8[k_KQ_0/WARP_SIZE];
|
||||
|
||||
const int sumi = ggml_cuda_dp4a(v, u, 0);
|
||||
@@ -108,7 +108,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q4_1(
|
||||
const int iqs4 = k_KQ % QI4_1;
|
||||
const int shift = k_KQ & (QI8_1/2);
|
||||
|
||||
const int v = (get_int_from_uint8_aligned(K_q4_1[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int v = (get_int_b4(K_q4_1[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int u = Q_q8[k_KQ_0/WARP_SIZE];
|
||||
|
||||
const int sumi = ggml_cuda_dp4a(v, u, 0);
|
||||
@@ -153,8 +153,8 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q5_0(
|
||||
const int iqs8 = k_KQ % QI8_1;
|
||||
const int shift = k_KQ & (QI8_1/2);
|
||||
|
||||
int v = (get_int_from_uint8(K_q5_0[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int vh = get_int_from_uint8(K_q5_0[ib].qh, 0) >> (iqs8 * QI5_0);
|
||||
int v = (get_int_b2(K_q5_0[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int vh = get_int_b2(K_q5_0[ib].qh, 0) >> (iqs8 * QI5_0);
|
||||
v |= (vh << 4) & 0x00000010; // 0 -> 4
|
||||
v |= (vh << 11) & 0x00001000; // 1 -> 12
|
||||
v |= (vh << 18) & 0x00100000; // 2 -> 20
|
||||
@@ -200,8 +200,8 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q5_1(
|
||||
const int iqs8 = k_KQ % QI8_1;
|
||||
const int shift = k_KQ & (QI8_1/2);
|
||||
|
||||
int v = (get_int_from_uint8(K_q5_1[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int vh = get_int_from_uint8(K_q5_1[ib].qh, 0) >> (iqs8 * QI5_1);
|
||||
int v = (get_int_b2(K_q5_1[ib].qs, iqs4) >> shift) & 0x0F0F0F0F;
|
||||
const int vh = get_int_b2(K_q5_1[ib].qh, 0) >> (iqs8 * QI5_1);
|
||||
v |= (vh << 4) & 0x00000010; // 0 -> 4
|
||||
v |= (vh << 11) & 0x00001000; // 1 -> 12
|
||||
v |= (vh << 18) & 0x00100000; // 2 -> 20
|
||||
@@ -249,7 +249,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q8_0(
|
||||
const int ib = k_KQ / QI8_0;
|
||||
const int iqs = k_KQ % QI8_0;
|
||||
|
||||
const int v = get_int_from_int8(K_q8_0[ib].qs, iqs);
|
||||
const int v = get_int_b2(K_q8_0[ib].qs, iqs);
|
||||
|
||||
T Q_d;
|
||||
if (std::is_same<T, half>::value) {
|
||||
@@ -408,7 +408,7 @@ static __device__ __forceinline__ T dequantize_1_q5_0(const void * __restrict__
|
||||
|
||||
const T d = x[ib].d;
|
||||
const int ql0 = x[ib].qs[iqs];
|
||||
const int qh0 = get_int_from_uint8(x[ib].qh, 0);
|
||||
const int qh0 = get_int_b2(x[ib].qh, 0);
|
||||
const int ql = ((ql0 >> (4*shift)) & 0x0F);
|
||||
const int qh = ((qh0 >> idq) << 4) & 0x10;
|
||||
const int q = (ql | qh) - 16;
|
||||
@@ -433,7 +433,7 @@ static __device__ __forceinline__ T dequantize_1_q5_1(const void * __restrict__
|
||||
|
||||
const half2 dm = x[ib].dm;
|
||||
const int ql0 = x[ib].qs[iqs];
|
||||
const int qh0 = get_int_from_uint8_aligned(x[ib].qh, 0);
|
||||
const int qh0 = get_int_b4(x[ib].qh, 0);
|
||||
const int ql = ((ql0 >> (4*shift)) & 0x0F);
|
||||
const int qh = ((qh0 >> idq) << 4) & 0x10;
|
||||
const int q = (ql | qh);
|
||||
|
||||
@@ -59,6 +59,12 @@ void ggml_cuda_op_mul_mat_q(
|
||||
case GGML_TYPE_Q6_K:
|
||||
mul_mat_q_case<GGML_TYPE_Q6_K>(ctx, args, stream);
|
||||
break;
|
||||
case GGML_TYPE_IQ4_XS:
|
||||
mul_mat_q_case<GGML_TYPE_IQ4_XS>(ctx, args, stream);
|
||||
break;
|
||||
case GGML_TYPE_IQ4_NL:
|
||||
mul_mat_q_case<GGML_TYPE_IQ4_NL>(ctx, args, stream);
|
||||
break;
|
||||
default:
|
||||
GGML_ASSERT(false);
|
||||
break;
|
||||
@@ -87,6 +93,8 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) {
|
||||
case GGML_TYPE_Q4_K:
|
||||
case GGML_TYPE_Q5_K:
|
||||
case GGML_TYPE_Q6_K:
|
||||
case GGML_TYPE_IQ4_XS:
|
||||
case GGML_TYPE_IQ4_NL:
|
||||
mmq_supported = true;
|
||||
break;
|
||||
default:
|
||||
|
||||
+185
-43
@@ -92,15 +92,17 @@ static constexpr __device__ int get_mmq_y_device() {
|
||||
|
||||
static constexpr __host__ __device__ tile_x_sizes mmq_get_dp4a_tile_x_sizes(ggml_type type, int mmq_y) {
|
||||
return type == GGML_TYPE_Q4_0 ? MMQ_DP4A_TXS_Q4_0 :
|
||||
type == GGML_TYPE_Q4_1 ? MMQ_DP4A_TXS_Q4_1 :
|
||||
type == GGML_TYPE_Q5_0 ? MMQ_DP4A_TXS_Q5_0 :
|
||||
type == GGML_TYPE_Q5_1 ? MMQ_DP4A_TXS_Q5_1 :
|
||||
type == GGML_TYPE_Q8_0 ? MMQ_DP4A_TXS_Q8_0 :
|
||||
type == GGML_TYPE_Q2_K ? MMQ_DP4A_TXS_Q2_K :
|
||||
type == GGML_TYPE_Q3_K ? MMQ_DP4A_TXS_Q3_K :
|
||||
type == GGML_TYPE_Q4_K ? MMQ_DP4A_TXS_Q4_K :
|
||||
type == GGML_TYPE_Q5_K ? MMQ_DP4A_TXS_Q5_K :
|
||||
type == GGML_TYPE_Q6_K ? MMQ_DP4A_TXS_Q6_K :
|
||||
type == GGML_TYPE_Q4_1 ? MMQ_DP4A_TXS_Q4_1 :
|
||||
type == GGML_TYPE_Q5_0 ? MMQ_DP4A_TXS_Q5_0 :
|
||||
type == GGML_TYPE_Q5_1 ? MMQ_DP4A_TXS_Q5_1 :
|
||||
type == GGML_TYPE_Q8_0 ? MMQ_DP4A_TXS_Q8_0 :
|
||||
type == GGML_TYPE_Q2_K ? MMQ_DP4A_TXS_Q2_K :
|
||||
type == GGML_TYPE_Q3_K ? MMQ_DP4A_TXS_Q3_K :
|
||||
type == GGML_TYPE_Q4_K ? MMQ_DP4A_TXS_Q4_K :
|
||||
type == GGML_TYPE_Q5_K ? MMQ_DP4A_TXS_Q5_K :
|
||||
type == GGML_TYPE_Q6_K ? MMQ_DP4A_TXS_Q6_K :
|
||||
type == GGML_TYPE_IQ4_XS ? MMQ_DP4A_TXS_Q5_0 :
|
||||
type == GGML_TYPE_IQ4_NL ? MMQ_DP4A_TXS_Q5_0 :
|
||||
tile_x_sizes{0, 0, 0};
|
||||
}
|
||||
|
||||
@@ -128,15 +130,17 @@ static_assert(MMQ_MMA_TILE_X_K_Q6_K % 8 == 4, "Wrong padding.");
|
||||
|
||||
static constexpr __host__ __device__ int mmq_get_mma_tile_x_k(ggml_type type) {
|
||||
return type == GGML_TYPE_Q4_0 ? MMQ_MMA_TILE_X_K_Q4_0 :
|
||||
type == GGML_TYPE_Q4_1 ? MMQ_MMA_TILE_X_K_Q4_1 :
|
||||
type == GGML_TYPE_Q5_0 ? MMQ_MMA_TILE_X_K_Q5_0 :
|
||||
type == GGML_TYPE_Q5_1 ? MMQ_MMA_TILE_X_K_Q5_1 :
|
||||
type == GGML_TYPE_Q8_0 ? MMQ_MMA_TILE_X_K_Q8_0 :
|
||||
type == GGML_TYPE_Q2_K ? MMQ_MMA_TILE_X_K_Q2_K :
|
||||
type == GGML_TYPE_Q3_K ? MMQ_MMA_TILE_X_K_Q3_K :
|
||||
type == GGML_TYPE_Q4_K ? MMQ_MMA_TILE_X_K_Q4_K :
|
||||
type == GGML_TYPE_Q5_K ? MMQ_MMA_TILE_X_K_Q5_K :
|
||||
type == GGML_TYPE_Q6_K ? MMQ_MMA_TILE_X_K_Q6_K :
|
||||
type == GGML_TYPE_Q4_1 ? MMQ_MMA_TILE_X_K_Q4_1 :
|
||||
type == GGML_TYPE_Q5_0 ? MMQ_MMA_TILE_X_K_Q5_0 :
|
||||
type == GGML_TYPE_Q5_1 ? MMQ_MMA_TILE_X_K_Q5_1 :
|
||||
type == GGML_TYPE_Q8_0 ? MMQ_MMA_TILE_X_K_Q8_0 :
|
||||
type == GGML_TYPE_Q2_K ? MMQ_MMA_TILE_X_K_Q2_K :
|
||||
type == GGML_TYPE_Q3_K ? MMQ_MMA_TILE_X_K_Q3_K :
|
||||
type == GGML_TYPE_Q4_K ? MMQ_MMA_TILE_X_K_Q4_K :
|
||||
type == GGML_TYPE_Q5_K ? MMQ_MMA_TILE_X_K_Q5_K :
|
||||
type == GGML_TYPE_Q6_K ? MMQ_MMA_TILE_X_K_Q6_K :
|
||||
type == GGML_TYPE_IQ4_XS ? MMQ_MMA_TILE_X_K_Q5_0 :
|
||||
type == GGML_TYPE_IQ4_NL ? MMQ_MMA_TILE_X_K_Q5_0 :
|
||||
0;
|
||||
}
|
||||
|
||||
@@ -185,9 +189,9 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q4_0 * bxi = (const block_q4_0 *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q4_0 + threadIdx.x] = get_int_from_uint8(bxi->qs, kqsx);
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q4_0 + threadIdx.x] = get_int_b2(bxi->qs, kqsx);
|
||||
#else
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_from_uint8(bxi->qs, kqsx);
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_b2(bxi->qs, kqsx);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
@@ -348,9 +352,9 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q4_1 * bxi = (const block_q4_1 *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q4_1 + threadIdx.x] = get_int_from_uint8_aligned(bxi->qs, kqsx);
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q4_1 + threadIdx.x] = get_int_b4(bxi->qs, kqsx);
|
||||
#else
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_from_uint8_aligned(bxi->qs, kqsx);
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_b4(bxi->qs, kqsx);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
@@ -509,8 +513,8 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
|
||||
const block_q5_0 * bxi = (const block_q5_0 *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
const int ql = get_int_from_uint8(bxi->qs, kqsx);
|
||||
const int qh = get_int_from_uint8(bxi->qh, 0) >> (4 * (threadIdx.x % QI5_0));
|
||||
const int ql = get_int_b2(bxi->qs, kqsx);
|
||||
const int qh = get_int_b2(bxi->qh, 0) >> (4 * (threadIdx.x % QI5_0));
|
||||
|
||||
int qs0 = (ql >> 0) & 0x0F0F0F0F;
|
||||
qs0 |= (qh << 4) & 0x00000010; // 0 -> 4
|
||||
@@ -674,8 +678,8 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
|
||||
const block_q5_1 * bxi = (const block_q5_1 *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
const int ql = get_int_from_uint8_aligned(bxi->qs, kqsx);
|
||||
const int qh = get_int_from_uint8_aligned(bxi->qh, 0) >> (4 * (threadIdx.x % QI5_1));
|
||||
const int ql = get_int_b4(bxi->qs, kqsx);
|
||||
const int qh = get_int_b4(bxi->qh, 0) >> (4 * (threadIdx.x % QI5_1));
|
||||
|
||||
int qs0 = (ql >> 0) & 0x0F0F0F0F;
|
||||
qs0 |= (qh << 4) & 0x00000010; // 0 -> 4
|
||||
@@ -839,9 +843,9 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q8_0 * bxi = (const block_q8_0 *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q8_0 + threadIdx.x] = get_int_from_int8(bxi->qs, kqsx);
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q8_0 + threadIdx.x] = get_int_b2(bxi->qs, kqsx);
|
||||
#else
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_from_int8(bxi->qs, kqsx);
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_b2(bxi->qs, kqsx);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
@@ -984,7 +988,7 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
|
||||
const block_q2_K * bxi = (const block_q2_K *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
const int x_ql_0 = get_int_from_uint8(bxi->qs, kqsx);
|
||||
const int x_ql_0 = get_int_b2(bxi->qs, kqsx);
|
||||
|
||||
#pragma unroll
|
||||
for (int l = 0; l < QR2_K; ++l) {
|
||||
@@ -1166,8 +1170,8 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
|
||||
const block_q3_K * bxi = (const block_q3_K *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
const int x_ql_0 = get_int_from_uint8(bxi->qs, kqsx);
|
||||
const int x_qh_0 = get_int_from_uint8(bxi->hmask, kqsx % (QI3_K/2)) >> (4 * (kqsx / (QI3_K/2)));
|
||||
const int x_ql_0 = get_int_b2(bxi->qs, kqsx);
|
||||
const int x_qh_0 = get_int_b2(bxi->hmask, kqsx % (QI3_K/2)) >> (4 * (kqsx / (QI3_K/2)));
|
||||
|
||||
#pragma unroll
|
||||
for (int l = 0; l < QR3_K; ++l) {
|
||||
@@ -1225,11 +1229,11 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
|
||||
const int ksc_low = ksc % (QI3_K/8);
|
||||
const int shift_low = 4 * (ksc / (QI3_K/8));
|
||||
const int sc_low = (get_int_from_uint8(bxi->scales, ksc_low) >> shift_low) & 0x0F0F0F0F;
|
||||
const int sc_low = (get_int_b2(bxi->scales, ksc_low) >> shift_low) & 0x0F0F0F0F;
|
||||
|
||||
const int ksc_high = QI3_K/8;
|
||||
const int shift_high = 2 * ksc;
|
||||
const int sc_high = ((get_int_from_uint8(bxi->scales, ksc_high) >> shift_high) << 4) & 0x30303030;
|
||||
const int sc_high = ((get_int_b2(bxi->scales, ksc_high) >> shift_high) << 4) & 0x30303030;
|
||||
|
||||
const int sc = __vsubss4(sc_low | sc_high, 0x20202020);
|
||||
|
||||
@@ -1393,9 +1397,9 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q4_K * bxi = (const block_q4_K *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q4_K + threadIdx.x] = get_int_from_uint8_aligned(bxi->qs, kqsx);
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q4_K + threadIdx.x] = get_int_b4(bxi->qs, kqsx);
|
||||
#else
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_from_uint8_aligned(bxi->qs, kqsx);
|
||||
x_qs[i*(WARP_SIZE + 1) + threadIdx.x] = get_int_b4(bxi->qs, kqsx);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
@@ -1610,11 +1614,11 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q5_K * bxi = (const block_q5_K *) x + kbx0 + i*stride + kbx;
|
||||
const int ky = QR5_K*kqsx;
|
||||
|
||||
const int ql = get_int_from_uint8_aligned(bxi->qs, kqsx);
|
||||
const int ql = get_int_b4(bxi->qs, kqsx);
|
||||
const int ql0 = (ql >> 0) & 0x0F0F0F0F;
|
||||
const int ql1 = (ql >> 4) & 0x0F0F0F0F;
|
||||
|
||||
const int qh = get_int_from_uint8_aligned(bxi->qh, kqsx % (QI5_K/4));
|
||||
const int qh = get_int_b4(bxi->qh, kqsx % (QI5_K/4));
|
||||
const int qh0 = ((qh >> (2 * (kqsx / (QI5_K/4)) + 0)) << 4) & 0x10101010;
|
||||
const int qh1 = ((qh >> (2 * (kqsx / (QI5_K/4)) + 1)) << 4) & 0x10101010;
|
||||
|
||||
@@ -1832,11 +1836,11 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q6_K * bxi = (const block_q6_K *) x + kbx0 + i*stride + kbx;
|
||||
const int ky = QR6_K*kqsx;
|
||||
|
||||
const int ql = get_int_from_uint8(bxi->ql, kqsx);
|
||||
const int ql = get_int_b2(bxi->ql, kqsx);
|
||||
const int ql0 = (ql >> 0) & 0x0F0F0F0F;
|
||||
const int ql1 = (ql >> 4) & 0x0F0F0F0F;
|
||||
|
||||
const int qh = get_int_from_uint8(bxi->qh, (QI6_K/4) * (kqsx / (QI6_K/2)) + kqsx % (QI6_K/4));
|
||||
const int qh = get_int_b2(bxi->qh, (QI6_K/4) * (kqsx / (QI6_K/2)) + kqsx % (QI6_K/4));
|
||||
const int qh0 = ((qh >> (2 * ((kqsx % (QI6_K/2)) / (QI6_K/4)))) << 4) & 0x30303030;
|
||||
const int qh1 = (qh >> (2 * ((kqsx % (QI6_K/2)) / (QI6_K/4)))) & 0x30303030;
|
||||
|
||||
@@ -1883,9 +1887,9 @@ template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinlin
|
||||
const block_q6_K * bxi = (const block_q6_K *) x + kbx0 + i*stride + (threadIdx.x % (WARP_SIZE/8)) / 4;
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_sc[i*MMQ_MMA_TILE_X_K_Q6_K + threadIdx.x % (WARP_SIZE/8)] = get_int_from_int8(bxi->scales, threadIdx.x % (QI6_K/8));
|
||||
x_sc[i*MMQ_MMA_TILE_X_K_Q6_K + threadIdx.x % (WARP_SIZE/8)] = get_int_b2(bxi->scales, threadIdx.x % (QI6_K/8));
|
||||
#else
|
||||
x_sc[i*(WARP_SIZE/8) + i/8 + threadIdx.x % (WARP_SIZE/8)] = get_int_from_int8(bxi->scales, threadIdx.x % (QI6_K/8));
|
||||
x_sc[i*(WARP_SIZE/8) + i/8 + threadIdx.x % (WARP_SIZE/8)] = get_int_b2(bxi->scales, threadIdx.x % (QI6_K/8));
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
}
|
||||
@@ -2018,6 +2022,124 @@ static __device__ __forceinline__ void vec_dot_q6_K_q8_1_mma(
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinline__ void load_tiles_iq4_nl(
|
||||
const char * __restrict__ x, int * __restrict__ x_tile, const int & kbx0, const int & i_max, const int & stride) {
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
int * x_qs = (int *) x_tile;
|
||||
float * x_df = (float *) (x_qs + WARP_SIZE*2);
|
||||
#else
|
||||
constexpr tile_x_sizes txs = mmq_get_dp4a_tile_x_sizes(GGML_TYPE_IQ4_NL, mmq_y);
|
||||
int * x_qs = (int *) x_tile;
|
||||
float * x_df = (float *) (x_qs + txs.qs);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
|
||||
const int kbx = threadIdx.x / QI4_NL;
|
||||
const int kqsx = threadIdx.x % QI4_NL;
|
||||
|
||||
#pragma unroll
|
||||
for (int i0 = 0; i0 < mmq_y; i0 += nwarps) {
|
||||
int i = i0 + threadIdx.y;
|
||||
|
||||
if (need_check) {
|
||||
i = min(i, i_max);
|
||||
}
|
||||
|
||||
const block_iq4_nl * bxi = (const block_iq4_nl *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
const int aux_q4 = get_int_b2(bxi->qs, kqsx);
|
||||
const int2 v = get_int_from_table_16(aux_q4);
|
||||
const int k0 = 8 * (threadIdx.x / 4) + threadIdx.x % 4;
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q5_0 + k0 + 0] = v.x;
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q5_0 + k0 + 4] = v.y;
|
||||
#else
|
||||
x_qs[i*(2*WARP_SIZE + 1) + k0 + 0] = v.x;
|
||||
x_qs[i*(2*WARP_SIZE + 1) + k0 + 4] = v.y;
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
const int blocks_per_tile_x_row = WARP_SIZE / QI4_NL;
|
||||
const int kbxd = threadIdx.x % blocks_per_tile_x_row;
|
||||
|
||||
#pragma unroll
|
||||
for (int i0 = 0; i0 < mmq_y; i0 += nwarps * QI4_NL) {
|
||||
int i = i0 + threadIdx.y * QI4_NL + threadIdx.x / blocks_per_tile_x_row;
|
||||
|
||||
if (need_check) {
|
||||
i = min(i, i_max);
|
||||
}
|
||||
|
||||
const block_iq4_nl * bxi = (const block_iq4_nl *) x + kbx0 + i*stride + kbxd;
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_df[i*MMQ_MMA_TILE_X_K_Q5_0 + kbxd] = __half2float(bxi->d);
|
||||
#else
|
||||
x_df[i*(WARP_SIZE/4) + i/4 + kbxd] = __half2float(bxi->d);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
}
|
||||
|
||||
template <int mmq_y, int nwarps, bool need_check> static __device__ __forceinline__ void load_tiles_iq4_xs(
|
||||
const char * __restrict__ x, int * __restrict__ x_tile, const int & kbx0, const int & i_max, const int & stride) {
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
int * x_qs = (int *) x_tile;
|
||||
float * x_df = (float *) (x_qs + WARP_SIZE*2);
|
||||
#else
|
||||
constexpr tile_x_sizes txs = mmq_get_dp4a_tile_x_sizes(GGML_TYPE_IQ4_XS, mmq_y);
|
||||
int * x_qs = (int *) x_tile;
|
||||
float * x_df = (float *) (x_qs + txs.qs);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
|
||||
const int kbx = 0; // threadIdx.x / QI4_XS
|
||||
const int kqsx = threadIdx.x; // threadIdx.x % QI4_XS
|
||||
|
||||
#pragma unroll
|
||||
for (int i0 = 0; i0 < mmq_y; i0 += nwarps) {
|
||||
int i = i0 + threadIdx.y;
|
||||
|
||||
if (need_check) {
|
||||
i = min(i, i_max);
|
||||
}
|
||||
|
||||
const block_iq4_xs * bxi = (const block_iq4_xs *) x + kbx0 + i*stride + kbx;
|
||||
|
||||
const int aux_q4 = get_int_b4(bxi->qs, kqsx);
|
||||
const int2 v = get_int_from_table_16(aux_q4);
|
||||
const int k0 = 8 * (threadIdx.x / 4) + threadIdx.x % 4;
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q5_0 + k0 + 0] = v.x;
|
||||
x_qs[i*MMQ_MMA_TILE_X_K_Q5_0 + k0 + 4] = v.y;
|
||||
#else
|
||||
x_qs[i*(2*WARP_SIZE + 1) + k0 + 0] = v.x;
|
||||
x_qs[i*(2*WARP_SIZE + 1) + k0 + 4] = v.y;
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
|
||||
#pragma unroll
|
||||
for (int i0 = 0; i0 < mmq_y; i0 += nwarps * 4) {
|
||||
int i = i0 + threadIdx.y * 4 + threadIdx.x / (WARP_SIZE/4);
|
||||
|
||||
if (need_check) {
|
||||
i = min(i, i_max);
|
||||
}
|
||||
|
||||
const block_iq4_xs * bxi = (const block_iq4_xs *) x + kbx0 + i*stride;
|
||||
|
||||
const float d = __half2float(bxi->d);
|
||||
|
||||
const int ls = ((bxi->scales_l[(threadIdx.x % 8)/2] >> (4*(threadIdx.x % 2))) & 0x0F)
|
||||
| (((bxi->scales_h >> (2*(threadIdx.x % 8))) & 0x03) << 4);
|
||||
|
||||
#ifdef INT8_MMA_AVAILABLE
|
||||
x_df[i*MMQ_MMA_TILE_X_K_Q5_0 + threadIdx.x % 8] = d * (ls - 32);
|
||||
#else
|
||||
x_df[i*(WARP_SIZE/4) + i/4 + threadIdx.x % 8] = d * (ls - 32);
|
||||
#endif // INT8_MMA_AVAILABLE
|
||||
}
|
||||
}
|
||||
|
||||
template<int mmq_x, int mmq_y, int nwarps, bool need_check>
|
||||
static __device__ __forceinline__ void mmq_write_back_dp4a(
|
||||
const float * __restrict__ sum, float * __restrict__ dst, const int & stride, const int & i_max, const int & j_max) {
|
||||
@@ -2167,6 +2289,22 @@ struct mmq_type_traits<mmq_x, mmq_y, nwarps, need_check, GGML_TYPE_Q6_K> {
|
||||
static constexpr vec_dot_mmq_t vec_dot_dp4a = vec_dot_q6_K_q8_1_dp4a<mmq_x, mmq_y, nwarps>;
|
||||
};
|
||||
|
||||
template <int mmq_x, int mmq_y, int nwarps, bool need_check>
|
||||
struct mmq_type_traits<mmq_x, mmq_y, nwarps, need_check, GGML_TYPE_IQ4_NL> {
|
||||
static constexpr int vdr = VDR_IQ4_NL_Q8_1_MMQ;
|
||||
static constexpr load_tiles_mmq_t load_tiles = load_tiles_iq4_nl<mmq_y, nwarps, need_check>;
|
||||
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_q5_0_q8_1_mma<mmq_x, mmq_y, nwarps>;
|
||||
static constexpr vec_dot_mmq_t vec_dot_dp4a = vec_dot_q5_0_q8_1_dp4a<mmq_x, mmq_y, nwarps>;
|
||||
};
|
||||
|
||||
template <int mmq_x, int mmq_y, int nwarps, bool need_check>
|
||||
struct mmq_type_traits<mmq_x, mmq_y, nwarps, need_check, GGML_TYPE_IQ4_XS> {
|
||||
static constexpr int vdr = VDR_IQ4_XS_Q8_1_MMQ;
|
||||
static constexpr load_tiles_mmq_t load_tiles = load_tiles_iq4_xs<mmq_y, nwarps, need_check>;
|
||||
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_q5_0_q8_1_mma<mmq_x, mmq_y, nwarps>;
|
||||
static constexpr vec_dot_mmq_t vec_dot_dp4a = vec_dot_q5_0_q8_1_dp4a<mmq_x, mmq_y, nwarps>;
|
||||
};
|
||||
|
||||
static bool mmq_need_sum(const ggml_type type_x) {
|
||||
switch (type_x) {
|
||||
case GGML_TYPE_Q4_0:
|
||||
@@ -2184,6 +2322,8 @@ static bool mmq_need_sum(const ggml_type type_x) {
|
||||
case GGML_TYPE_Q5_K:
|
||||
return true;
|
||||
case GGML_TYPE_Q6_K:
|
||||
case GGML_TYPE_IQ4_XS:
|
||||
case GGML_TYPE_IQ4_NL:
|
||||
return false;
|
||||
default:
|
||||
GGML_ASSERT(false);
|
||||
@@ -2263,9 +2403,9 @@ static __device__ void mul_mat_q_process_tile(
|
||||
|
||||
template <ggml_type type, int mmq_x, int nwarps, bool need_check>
|
||||
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
|
||||
#if defined(RDNA3) || defined(RDNA2) || defined(RDNA1)
|
||||
#if defined(RDNA3) || defined(RDNA2)
|
||||
__launch_bounds__(WARP_SIZE*nwarps, 2)
|
||||
#endif // defined(RDNA3) || defined(RDNA2) || defined(RDNA1)
|
||||
#endif // defined(RDNA3) || defined(RDNA2)
|
||||
#else
|
||||
#if __CUDA_ARCH__ >= CC_VOLTA
|
||||
__launch_bounds__(WARP_SIZE*nwarps, 1)
|
||||
@@ -2608,6 +2748,8 @@ extern DECL_MMQ_CASE(GGML_TYPE_Q3_K);
|
||||
extern DECL_MMQ_CASE(GGML_TYPE_Q4_K);
|
||||
extern DECL_MMQ_CASE(GGML_TYPE_Q5_K);
|
||||
extern DECL_MMQ_CASE(GGML_TYPE_Q6_K);
|
||||
extern DECL_MMQ_CASE(GGML_TYPE_IQ4_NL);
|
||||
extern DECL_MMQ_CASE(GGML_TYPE_IQ4_XS);
|
||||
|
||||
// -------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
|
||||
@@ -22,7 +22,8 @@ SOURCE_FATTN_WMMA_CASE = "DECL_FATTN_WMMA_F16_CASE({head_size}, {cols_per_block}
|
||||
|
||||
TYPES_MMQ = [
|
||||
"GGML_TYPE_Q4_0", "GGML_TYPE_Q4_1", "GGML_TYPE_Q5_0", "GGML_TYPE_Q5_1", "GGML_TYPE_Q8_0",
|
||||
"GGML_TYPE_Q2_K", "GGML_TYPE_Q3_K", "GGML_TYPE_Q4_K", "GGML_TYPE_Q5_K", "GGML_TYPE_Q6_K"
|
||||
"GGML_TYPE_Q2_K", "GGML_TYPE_Q3_K", "GGML_TYPE_Q4_K", "GGML_TYPE_Q5_K", "GGML_TYPE_Q6_K",
|
||||
"GGML_TYPE_IQ4_NL", "GGML_TYPE_IQ4_XS"
|
||||
]
|
||||
|
||||
SOURCE_MMQ = """// This file has been autogenerated by generate_cu_files.py, do not edit manually.
|
||||
|
||||
@@ -0,0 +1,5 @@
|
||||
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
|
||||
|
||||
#include "../mmq.cuh"
|
||||
|
||||
DECL_MMQ_CASE(GGML_TYPE_IQ4_NL);
|
||||
@@ -0,0 +1,5 @@
|
||||
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
|
||||
|
||||
#include "../mmq.cuh"
|
||||
|
||||
DECL_MMQ_CASE(GGML_TYPE_IQ4_XS);
|
||||
@@ -1,36 +1,8 @@
|
||||
#include "common.cuh"
|
||||
#include <cstdint>
|
||||
|
||||
static __device__ __forceinline__ int get_int_from_int8(const int8_t * x8, const int & i32) {
|
||||
const uint16_t * x16 = (const uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment
|
||||
|
||||
int x32 = 0;
|
||||
x32 |= x16[0] << 0;
|
||||
x32 |= x16[1] << 16;
|
||||
|
||||
return x32;
|
||||
}
|
||||
|
||||
static __device__ __forceinline__ int get_int_from_uint8(const uint8_t * x8, const int & i32) {
|
||||
const uint16_t * x16 = (const uint16_t *) (x8 + sizeof(int) * i32); // assume at least 2 byte alignment
|
||||
|
||||
int x32 = 0;
|
||||
x32 |= x16[0] << 0;
|
||||
x32 |= x16[1] << 16;
|
||||
|
||||
return x32;
|
||||
}
|
||||
|
||||
static __device__ __forceinline__ int get_int_from_int8_aligned(const int8_t * x8, const int & i32) {
|
||||
return *((const int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment
|
||||
}
|
||||
|
||||
static __device__ __forceinline__ int get_int_from_uint8_aligned(const uint8_t * x8, const int & i32) {
|
||||
return *((const int *) (x8 + sizeof(int) * i32)); // assume at least 4 byte alignment
|
||||
}
|
||||
|
||||
static __device__ __forceinline__ int get_int_b2(const void * x, const int & i32) {
|
||||
const uint16_t * x16 = (const uint16_t *) x;
|
||||
const uint16_t * x16 = (const uint16_t *) x; // assume at least 2 byte alignment
|
||||
|
||||
int x32 = x16[2*i32 + 0] << 0;
|
||||
x32 |= x16[2*i32 + 1] << 16;
|
||||
@@ -768,6 +740,7 @@ static __device__ __forceinline__ float vec_dot_q6_K_q8_1(
|
||||
}
|
||||
|
||||
#define VDR_IQ2_XXS_Q8_1_MMVQ 2
|
||||
#define VDR_IQ2_XXS_Q8_1_MMQ 2
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq2_xxs_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
@@ -802,6 +775,7 @@ static __device__ __forceinline__ float vec_dot_iq2_xxs_q8_1(
|
||||
}
|
||||
|
||||
#define VDR_IQ2_XS_Q8_1_MMVQ 2
|
||||
#define VDR_IQ2_XS_Q8_1_MMQ 2
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq2_xs_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
@@ -840,6 +814,7 @@ static __device__ __forceinline__ float vec_dot_iq2_xs_q8_1(
|
||||
}
|
||||
|
||||
#define VDR_IQ2_S_Q8_1_MMVQ 2
|
||||
#define VDR_IQ2_S_Q8_1_MMQ 2
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq2_s_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
@@ -887,6 +862,7 @@ static __device__ __forceinline__ float vec_dot_iq2_s_q8_1(
|
||||
}
|
||||
|
||||
#define VDR_IQ3_XXS_Q8_1_MMVQ 2
|
||||
#define VDR_IQ3_XXS_Q8_1_MMQ 2
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq3_xxs_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
@@ -921,6 +897,7 @@ static __device__ __forceinline__ float vec_dot_iq3_xxs_q8_1(
|
||||
}
|
||||
|
||||
#define VDR_IQ3_S_Q8_1_MMVQ 2
|
||||
#define VDR_IQ3_S_Q8_1_MMQ 2
|
||||
|
||||
// TODO: don't use lookup table for signs
|
||||
static __device__ __forceinline__ float vec_dot_iq3_s_q8_1(
|
||||
@@ -962,6 +939,9 @@ static __device__ __forceinline__ float vec_dot_iq3_s_q8_1(
|
||||
return d * sumi;
|
||||
}
|
||||
|
||||
#define VDR_IQ1_S_Q8_1_MMVQ 1
|
||||
#define VDR_IQ1_S_Q8_1_MMQ 1
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq1_s_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
const block_iq1_s * bq1 = (const block_iq1_s *) vbq + kbx;
|
||||
@@ -992,6 +972,9 @@ static __device__ __forceinline__ float vec_dot_iq1_s_q8_1(
|
||||
return d1q * (ds.x*sumi + ds.y*delta);
|
||||
}
|
||||
|
||||
#define VDR_IQ1_M_Q8_1_MMVQ 1
|
||||
#define VDR_IQ1_M_Q8_1_MMQ 1
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq1_m_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
|
||||
@@ -1051,6 +1034,7 @@ static __device__ __forceinline__ int2 get_int_from_table_16(const int & q4) {
|
||||
}
|
||||
|
||||
#define VDR_IQ4_NL_Q8_1_MMVQ 2
|
||||
#define VDR_IQ4_NL_Q8_1_MMQ 4
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq4_nl_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
@@ -1074,6 +1058,7 @@ static __device__ __forceinline__ float vec_dot_iq4_nl_q8_1(
|
||||
}
|
||||
|
||||
#define VDR_IQ4_XS_Q8_1_MMVQ 4
|
||||
#define VDR_IQ4_XS_Q8_1_MMQ 4
|
||||
|
||||
static __device__ __forceinline__ float vec_dot_iq4_xs_q8_1(
|
||||
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
|
||||
|
||||
@@ -3493,10 +3493,6 @@ static void ggml_sycl_mul_mat_batched_sycl(ggml_backend_sycl_context & ctx,
|
||||
SYCL_CHECK(ggml_sycl_set_device(ctx.device));
|
||||
queue_ptr main_stream = ctx.stream();;
|
||||
|
||||
bool no_mixed_dtypes = main_stream->get_backend() == sycl::backend::ext_oneapi_cuda ||
|
||||
main_stream->get_backend() == sycl::backend::ext_oneapi_hip;
|
||||
|
||||
|
||||
void * src0_ddq = src0->data;
|
||||
sycl::half *src0_as_f16 = (sycl::half *)src0_ddq;
|
||||
float * src1_ddf = (float *) src1->data;
|
||||
@@ -3514,15 +3510,10 @@ static void ggml_sycl_mul_mat_batched_sycl(ggml_backend_sycl_context & ctx,
|
||||
sycl::half *src1_f16 = src1->type == GGML_TYPE_F16 ? (sycl::half *)src1_ddf
|
||||
: src1_f16_alloc.get();
|
||||
|
||||
ggml_sycl_pool_alloc<sycl::half> dst_f16(ctx.pool());
|
||||
char * dst_t;
|
||||
|
||||
dpct::library_data_t cu_compute_type = dpct::library_data_t::real_float;
|
||||
dpct::library_data_t cu_data_type = dpct::library_data_t::real_float;
|
||||
if (no_mixed_dtypes) {
|
||||
cu_compute_type = dpct::library_data_t::real_half;
|
||||
cu_data_type = dpct::library_data_t::real_half;
|
||||
}
|
||||
|
||||
// dst strides
|
||||
size_t nbd2 = dst->nb[2];
|
||||
@@ -3531,26 +3522,10 @@ static void ggml_sycl_mul_mat_batched_sycl(ggml_backend_sycl_context & ctx,
|
||||
const float alpha_f32 = 1.0f;
|
||||
const float beta_f32 = 0.0f;
|
||||
|
||||
const sycl::half alpha_f16 = 1.0f;
|
||||
const sycl::half beta_f16 = 0.0f;
|
||||
|
||||
const void * alpha = &alpha_f32;
|
||||
const void * beta = &beta_f32;
|
||||
if (no_mixed_dtypes) {
|
||||
alpha = &alpha_f16;
|
||||
beta = &beta_f16;
|
||||
}
|
||||
|
||||
// TODO: Renable (dst->op_params[0] =! GGML_PREC_DEFAULT) pathway
|
||||
// when oneMKL open source supports half, half, float, float: datatypes
|
||||
|
||||
dst_t = (char *) dst_ddf;
|
||||
if (no_mixed_dtypes) {
|
||||
dst_t = (char *) dst_f16.alloc(ne_dst);
|
||||
|
||||
nbd2 /= sizeof(float) / sizeof(sycl::half);
|
||||
nbd3 /= sizeof(float) / sizeof(sycl::half);
|
||||
}
|
||||
|
||||
GGML_ASSERT(ne12 % ne02 == 0);
|
||||
GGML_ASSERT(ne13 % ne03 == 0);
|
||||
@@ -3612,11 +3587,6 @@ static void ggml_sycl_mul_mat_batched_sycl(ggml_backend_sycl_context & ctx,
|
||||
(void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23,
|
||||
cu_compute_type)));
|
||||
}
|
||||
|
||||
if (no_mixed_dtypes) {
|
||||
const to_fp32_sycl_t to_fp32_sycl = ggml_get_to_fp32_sycl(GGML_TYPE_F16);
|
||||
to_fp32_sycl(dst_f16.get(), dst_ddf, ne_dst, main_stream);
|
||||
}
|
||||
}
|
||||
catch (sycl::exception const &exc) {
|
||||
std::cerr << exc.what() << "Exception caught at file:" << __FILE__
|
||||
|
||||
@@ -2426,6 +2426,7 @@ namespace dpct
|
||||
b, ldb, beta, c, ldc, batch_size);
|
||||
break;
|
||||
}
|
||||
#endif
|
||||
case detail::get_type_combination_id(
|
||||
library_data_t::real_int8, library_data_t::real_int8,
|
||||
library_data_t::real_int32, library_data_t::real_int32):
|
||||
@@ -2458,7 +2459,6 @@ namespace dpct
|
||||
batch_size);
|
||||
break;
|
||||
}
|
||||
#endif
|
||||
case detail::get_type_combination_id(
|
||||
library_data_t::real_half, library_data_t::real_half,
|
||||
library_data_t::real_half, library_data_t::real_float):
|
||||
@@ -2595,6 +2595,7 @@ namespace dpct
|
||||
stride_c, batch_size);
|
||||
break;
|
||||
}
|
||||
#endif
|
||||
case detail::get_type_combination_id(
|
||||
library_data_t::real_int8, library_data_t::real_int8,
|
||||
library_data_t::real_int32, library_data_t::real_int32):
|
||||
@@ -2623,7 +2624,6 @@ namespace dpct
|
||||
beta, c, ldc, stride_c, batch_size);
|
||||
break;
|
||||
}
|
||||
#endif
|
||||
case detail::get_type_combination_id(
|
||||
library_data_t::real_half, library_data_t::real_half,
|
||||
library_data_t::real_half, library_data_t::real_float):
|
||||
|
||||
@@ -120,7 +120,6 @@ class Keys:
|
||||
MIDDLE_ID = "tokenizer.ggml.middle_token_id"
|
||||
EOT_ID = "tokenizer.ggml.eot_token_id"
|
||||
|
||||
|
||||
#
|
||||
# recommended mapping of model tensor names for storage in gguf
|
||||
#
|
||||
@@ -163,6 +162,7 @@ class MODEL_ARCH(IntEnum):
|
||||
OPENELM = auto()
|
||||
ARCTIC = auto()
|
||||
DEEPSEEK2 = auto()
|
||||
CHATGLM = auto()
|
||||
BITNET = auto()
|
||||
T5 = auto()
|
||||
JAIS = auto()
|
||||
@@ -289,6 +289,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
|
||||
MODEL_ARCH.OPENELM: "openelm",
|
||||
MODEL_ARCH.ARCTIC: "arctic",
|
||||
MODEL_ARCH.DEEPSEEK2: "deepseek2",
|
||||
MODEL_ARCH.CHATGLM: "chatglm",
|
||||
MODEL_ARCH.BITNET: "bitnet",
|
||||
MODEL_ARCH.T5: "t5",
|
||||
MODEL_ARCH.JAIS: "jais",
|
||||
@@ -924,6 +925,18 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
|
||||
MODEL_TENSOR.FFN_DOWN_SHEXP,
|
||||
MODEL_TENSOR.FFN_UP_SHEXP,
|
||||
],
|
||||
MODEL_ARCH.CHATGLM : [
|
||||
MODEL_TENSOR.TOKEN_EMBD,
|
||||
MODEL_TENSOR.ROPE_FREQS,
|
||||
MODEL_TENSOR.OUTPUT_NORM,
|
||||
MODEL_TENSOR.OUTPUT,
|
||||
MODEL_TENSOR.ATTN_NORM,
|
||||
MODEL_TENSOR.ATTN_QKV,
|
||||
MODEL_TENSOR.ATTN_OUT,
|
||||
MODEL_TENSOR.FFN_NORM,
|
||||
MODEL_TENSOR.FFN_DOWN,
|
||||
MODEL_TENSOR.FFN_UP,
|
||||
],
|
||||
MODEL_ARCH.BITNET: [
|
||||
MODEL_TENSOR.ATTN_Q,
|
||||
MODEL_TENSOR.ATTN_K,
|
||||
@@ -1020,6 +1033,9 @@ MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
|
||||
MODEL_TENSOR.ROPE_FREQS,
|
||||
MODEL_TENSOR.ATTN_ROT_EMBD,
|
||||
],
|
||||
MODEL_ARCH.CHATGLM: [
|
||||
MODEL_TENSOR.ROPE_FREQS,
|
||||
],
|
||||
}
|
||||
|
||||
#
|
||||
|
||||
@@ -24,6 +24,7 @@ class TensorNameMap:
|
||||
"backbone.embedding", # mamba
|
||||
"backbone.embeddings", # mamba-hf
|
||||
"transformer.in_out_embed", # Grok
|
||||
"embedding.word_embeddings", # chatglm
|
||||
"transformer.token_embeddings", # openelm
|
||||
"shared", # t5
|
||||
),
|
||||
@@ -55,6 +56,7 @@ class TensorNameMap:
|
||||
"output", # llama-pth bloom internlm2
|
||||
"word_embeddings_for_head", # persimmon
|
||||
"lm_head.linear", # phi2
|
||||
"output_layer", # chatglm
|
||||
),
|
||||
|
||||
# Output norm
|
||||
@@ -71,12 +73,14 @@ class TensorNameMap:
|
||||
"model.norm_f", # mamba-qbert
|
||||
"backbone.norm_f", # mamba
|
||||
"transformer.rms_norm", # Grok
|
||||
"encoder.final_layernorm", # chatglm
|
||||
"transformer.norm", # openelm
|
||||
),
|
||||
|
||||
# Rope frequencies
|
||||
MODEL_TENSOR.ROPE_FREQS: (
|
||||
"rope.freqs", # llama-pth
|
||||
"rotary_pos_emb.inv_freq", # chatglm
|
||||
),
|
||||
}
|
||||
|
||||
@@ -101,6 +105,7 @@ class TensorNameMap:
|
||||
"backbone.layers.{bid}.norm", # mamba
|
||||
"transformer.decoder_layer.{bid}.rms_norm", # Grok
|
||||
"transformer.blocks.{bid}.norm_attn_norm.norm_1", # dbrx
|
||||
"encoder.layers.{bid}.input_layernorm", # chatglm
|
||||
"transformer.layers.{bid}.attn_norm", # openelm
|
||||
),
|
||||
|
||||
@@ -124,6 +129,7 @@ class TensorNameMap:
|
||||
"transformer.h.{bid}.mixer.Wqkv", # phi2
|
||||
"encoder.layers.{bid}.attn.Wqkv", # nomic-bert
|
||||
"model.layers.{bid}.self_attn.qkv_proj", # phi3
|
||||
"encoder.layers.{bid}.self_attention.query_key_value", # chatglm
|
||||
"transformer.layers.{bid}.attn.qkv_proj", # openelm
|
||||
),
|
||||
|
||||
@@ -135,7 +141,7 @@ class TensorNameMap:
|
||||
"transformer.h.{bid}.attn.q_proj", # gpt-j
|
||||
"model.layers.layers.{bid}.self_attn.q_proj", # plamo
|
||||
"model.layers.{bid}.attention.wq", # internlm2
|
||||
"transformer.decoder_layer.{bid}.multi_head_attention.query" # Grok
|
||||
"transformer.decoder_layer.{bid}.multi_head_attention.query",# Grok
|
||||
),
|
||||
|
||||
# Attention key
|
||||
@@ -147,7 +153,7 @@ class TensorNameMap:
|
||||
"transformer.h.{bid}.attn.k", # refact
|
||||
"model.layers.layers.{bid}.self_attn.k_proj", # plamo
|
||||
"model.layers.{bid}.attention.wk", # internlm2
|
||||
"transformer.decoder_layer.{bid}.multi_head_attention.key" # Grok
|
||||
"transformer.decoder_layer.{bid}.multi_head_attention.key",# Grok
|
||||
),
|
||||
|
||||
# Attention value
|
||||
@@ -182,6 +188,7 @@ class TensorNameMap:
|
||||
"encoder.layers.{bid}.attn.out_proj", # nomic-bert
|
||||
"transformer.decoder_layer.{bid}.multi_head_attention.linear", # Grok
|
||||
"transformer.blocks.{bid}.norm_attn_norm.attn.out_proj", # dbrx
|
||||
"encoder.layers.{bid}.self_attention.dense", # chatglm
|
||||
"transformer.layers.{bid}.attn.out_proj", # openelm
|
||||
),
|
||||
|
||||
@@ -218,6 +225,7 @@ class TensorNameMap:
|
||||
"h.{bid}.ln_2", # gpt2
|
||||
"model.layers.{bid}.ffn_norm", # internlm2
|
||||
"transformer.decoder_layer.{bid}.rms_norm_2", # Grok
|
||||
"encoder.layers.{bid}.post_attention_layernorm", # chatglm
|
||||
"transformer.layers.{bid}.ffn_norm", # openelm
|
||||
),
|
||||
|
||||
@@ -268,6 +276,7 @@ class TensorNameMap:
|
||||
"model.layers.{bid}.mlp.c_fc", # starcoder2
|
||||
"encoder.layer.{bid}.mlp.gated_layers_v", # jina-bert-v2
|
||||
"model.layers.{bid}.residual_mlp.w3", # arctic
|
||||
"encoder.layers.{bid}.mlp.dense_h_to_4h", # chatglm
|
||||
),
|
||||
|
||||
MODEL_TENSOR.FFN_UP_EXP: (
|
||||
@@ -337,6 +346,7 @@ class TensorNameMap:
|
||||
"transformer.layers.{bid}.ffn.proj_2", # openelm
|
||||
"model.layers.{bid}.residual_mlp.w2", # arctic
|
||||
"encoder.layer.{bid}.mlp.down_layer", # jina-bert-v2
|
||||
"encoder.layers.{bid}.mlp.dense_4h_to_h", # chatglm
|
||||
),
|
||||
|
||||
MODEL_TENSOR.FFN_DOWN_EXP: (
|
||||
|
||||
+29
-3
@@ -88,8 +88,10 @@ extern "C" {
|
||||
LLAMA_VOCAB_PRE_TYPE_DBRX = 13,
|
||||
LLAMA_VOCAB_PRE_TYPE_SMAUG = 14,
|
||||
LLAMA_VOCAB_PRE_TYPE_PORO = 15,
|
||||
LLAMA_VOCAB_PRE_TYPE_VIKING = 16,
|
||||
LLAMA_VOCAB_PRE_TYPE_JAIS = 17,
|
||||
LLAMA_VOCAB_PRE_TYPE_CHATGLM3 = 16,
|
||||
LLAMA_VOCAB_PRE_TYPE_CHATGLM4 = 17,
|
||||
LLAMA_VOCAB_PRE_TYPE_VIKING = 18,
|
||||
LLAMA_VOCAB_PRE_TYPE_JAIS = 19,
|
||||
};
|
||||
|
||||
// note: these values should be synchronized with ggml_rope
|
||||
@@ -180,6 +182,12 @@ extern "C" {
|
||||
LLAMA_POOLING_TYPE_LAST = 3,
|
||||
};
|
||||
|
||||
enum llama_attention_type {
|
||||
LLAMA_ATTENTION_TYPE_UNSPECIFIED = -1,
|
||||
LLAMA_ATTENTION_TYPE_CAUSAL = 0,
|
||||
LLAMA_ATTENTION_TYPE_NON_CAUSAL = 1,
|
||||
};
|
||||
|
||||
enum llama_split_mode {
|
||||
LLAMA_SPLIT_MODE_NONE = 0, // single GPU
|
||||
LLAMA_SPLIT_MODE_LAYER = 1, // split layers and KV across GPUs
|
||||
@@ -297,6 +305,7 @@ extern "C" {
|
||||
|
||||
enum llama_rope_scaling_type rope_scaling_type; // RoPE scaling type, from `enum llama_rope_scaling_type`
|
||||
enum llama_pooling_type pooling_type; // whether to pool (sum) embedding results by sequence id
|
||||
enum llama_attention_type attention_type; // attention type to use for embeddings
|
||||
|
||||
// ref: https://github.com/ggerganov/llama.cpp/pull/2054
|
||||
float rope_freq_base; // RoPE base frequency, 0 = from model
|
||||
@@ -897,6 +906,7 @@ extern "C" {
|
||||
/// @param tokens The tokens pointer must be large enough to hold the resulting tokens.
|
||||
/// @return Returns the number of tokens on success, no more than n_tokens_max
|
||||
/// @return Returns a negative number on failure - the number of tokens that would have been returned
|
||||
/// @param add_special Allow to add BOS and EOS tokens if model is configured to do so.
|
||||
/// @param parse_special Allow tokenizing special and/or control tokens which otherwise are not exposed and treated
|
||||
/// as plaintext. Does not insert a leading space.
|
||||
LLAMA_API int32_t llama_tokenize(
|
||||
@@ -911,15 +921,31 @@ extern "C" {
|
||||
// Token Id -> Piece.
|
||||
// Uses the vocabulary in the provided context.
|
||||
// Does not write null terminator to the buffer.
|
||||
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
|
||||
// User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
|
||||
// @param special If true, special tokens are rendered in the output.
|
||||
LLAMA_API int32_t llama_token_to_piece(
|
||||
const struct llama_model * model,
|
||||
llama_token token,
|
||||
char * buf,
|
||||
int32_t length,
|
||||
int32_t lstrip,
|
||||
bool special);
|
||||
|
||||
/// @details Convert the provided tokens into text (inverse of llama_tokenize()).
|
||||
/// @param text The char pointer must be large enough to hold the resulting text.
|
||||
/// @return Returns the number of chars/bytes on success, no more than text_len_max.
|
||||
/// @return Returns a negative number on failure - the number of chars/bytes that would have been returned.
|
||||
/// @param remove_special Allow to remove BOS and EOS tokens if model is configured to do so.
|
||||
/// @param unparse_special If true, special tokens are rendered in the output.
|
||||
LLAMA_API int32_t llama_detokenize(
|
||||
const struct llama_model * model,
|
||||
const llama_token * tokens,
|
||||
int32_t n_tokens,
|
||||
char * text,
|
||||
int32_t text_len_max,
|
||||
bool remove_special,
|
||||
bool unparse_special);
|
||||
|
||||
/// Apply chat template. Inspired by hf apply_chat_template() on python.
|
||||
/// Both "model" and "custom_template" are optional, but at least one is required. "custom_template" has higher precedence than "model"
|
||||
/// NOTE: This function does not use a jinja parser. It only support a pre-defined list of template. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
|
||||
|
||||
@@ -1,2 +1,3 @@
|
||||
-r ./requirements-convert_legacy_llama.txt
|
||||
--extra-index-url https://download.pytorch.org/whl/cpu
|
||||
torch~=2.2.1
|
||||
|
||||
@@ -1,2 +1,3 @@
|
||||
-r ./requirements-convert_legacy_llama.txt
|
||||
--extra-index-url https://download.pytorch.org/whl/cpu
|
||||
torch~=2.2.1
|
||||
|
||||
+613
-281
File diff suppressed because it is too large
Load Diff
+8
-10
@@ -232,8 +232,7 @@ static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & t
|
||||
};
|
||||
|
||||
auto _get_flags = [&] (const size_t pos) -> codepoint_flags {
|
||||
static const codepoint_flags undef(codepoint_flags::UNDEFINED);
|
||||
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : undef;
|
||||
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : codepoint_flags{};
|
||||
};
|
||||
|
||||
size_t _prev_end = offset_ini;
|
||||
@@ -295,9 +294,9 @@ static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & t
|
||||
continue;
|
||||
}
|
||||
// regex: <space>?[^\s\p{L}\p{N}]+
|
||||
if (!(flags2.is_whitespace || flags2.is_letter || flags2.is_number || flags2.is_undefined)) {
|
||||
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
|
||||
pos += (cpt == ' ');
|
||||
while (!(flags2.is_whitespace || flags2.is_letter || flags2.is_number || flags2.is_undefined)) {
|
||||
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
|
||||
flags2 = _get_flags(++pos);
|
||||
}
|
||||
_add_token(pos);
|
||||
@@ -351,8 +350,7 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
|
||||
};
|
||||
|
||||
auto _get_flags = [&] (const size_t pos) -> codepoint_flags {
|
||||
static const codepoint_flags undef(codepoint_flags::UNDEFINED);
|
||||
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : undef;
|
||||
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags(cpts[pos]) : codepoint_flags{};
|
||||
};
|
||||
|
||||
size_t _prev_end = offset_ini;
|
||||
@@ -394,8 +392,8 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
|
||||
}
|
||||
}
|
||||
|
||||
// regex: [^\r\n\p{L}\p{N}]?\p{L}+ //####FIXME: the first \p{L} is correct?
|
||||
if (!(cpt == '\r' || cpt == '\n' || /*flags.is_letter |*/ flags.is_number)) {
|
||||
// regex: [^\r\n\p{L}\p{N}]?\p{L}+
|
||||
if (!(cpt == '\r' || cpt == '\n' || flags.is_number)) {
|
||||
if (flags.is_letter || _get_flags(pos+1).is_letter) { // one or more letters
|
||||
pos++;
|
||||
while (_get_flags(pos).is_letter) {
|
||||
@@ -421,9 +419,9 @@ static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string &
|
||||
|
||||
// regex: <space>?[^\s\p{L}\p{N}]+[\r\n]*
|
||||
auto flags2 = (cpt == ' ' ? _get_flags(pos+1) : flags);
|
||||
if (!(flags2.is_whitespace || flags2.is_letter || flags2.is_number || flags2.is_undefined)) {
|
||||
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags.as_uint()) {
|
||||
pos += (cpt == ' ');
|
||||
while (!(flags2.is_whitespace || flags2.is_letter || flags2.is_number || flags2.is_undefined)) {
|
||||
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_number) && flags2.as_uint()) {
|
||||
flags2 = _get_flags(++pos);
|
||||
}
|
||||
uint32_t cpt2 = _get_cpt(pos);
|
||||
|
||||
@@ -58,6 +58,10 @@ int main(void) {
|
||||
"{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}",
|
||||
//Phi-3-vision
|
||||
"{% for message in messages %}{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{- '<|assistant|>\n' -}}{% endif %}",
|
||||
// ChatGLM3
|
||||
"{% for message in messages %}{% if loop.first %}[gMASK]sop<|{{ message['role'] }}|>\n {{ message['content'] }}{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}",
|
||||
// ChatGLM4
|
||||
u8"[gMASK]<sop>{% for item in messages %}{% if item['tools'] is defined %}<|system|>\n你是一个名为 ChatGLM 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。\n\n# 可用工具{% set tools = item['tools'] %}{% for tool in tools %}{% if tool['type'] == 'function' %}\n\n## {{ tool['function']['name'] }}\n\n{{ tool['function'] | tojson(indent=4) }}\n......{% endif %}{% endfor %}{% endif %}{% if item['content'] %}<|{{ item['role'] }}|>{{ item['metadata'] }}\n{{ item['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}",
|
||||
// MiniCPM-3B-OpenHermes-2.5-v2-GGUF
|
||||
u8"{% for message in messages %}{% if message['role'] == 'user' %}{{'<用户>' + message['content'].strip() + '<AI>'}}{% else %}{{message['content'].strip()}}{% endif %}{% endfor %}",
|
||||
// DeepSeek-V2
|
||||
@@ -98,6 +102,10 @@ int main(void) {
|
||||
"<|system|>\nYou are a helpful assistant<|end|>\n<|user|>\nHello<|end|>\n<|assistant|>\nHi there<|end|>\n<|user|>\nWho are you<|end|>\n<|assistant|>\n I am an assistant <|end|>\n<|user|>\nAnother question<|end|>\n<|assistant|>\n",
|
||||
//Phi-3-vision
|
||||
"<|system|>\nYou are a helpful assistant<|end|>\n<|user|>\nHello<|end|>\n<|assistant|>\nHi there<|end|>\n<|user|>\nWho are you<|end|>\n<|assistant|>\n I am an assistant <|end|>\n<|user|>\nAnother question<|end|>\n<|assistant|>\n",
|
||||
// ChatGLM3
|
||||
"[gMASK]sop<|system|>\n You are a helpful assistant<|user|>\n Hello<|assistant|>\n Hi there<|user|>\n Who are you<|assistant|>\n I am an assistant <|user|>\n Another question<|assistant|>",
|
||||
// ChatGLM4
|
||||
"[gMASK]<sop><|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\nHi there<|user|>\nWho are you<|assistant|>\n I am an assistant <|user|>\nAnother question<|assistant|>",
|
||||
// MiniCPM-3B-OpenHermes-2.5-v2-GGUF
|
||||
u8"You are a helpful assistant<用户>Hello<AI>Hi there<用户>Who are you<AI>I am an assistant<用户>Another question<AI>",
|
||||
// DeepSeek-V2
|
||||
|
||||
@@ -195,11 +195,11 @@ int main(int argc, char **argv) {
|
||||
const bool add_special = false;
|
||||
|
||||
for (const auto & test_kv : k_tests) {
|
||||
const std::vector<llama_token> res = llama_tokenize(ctx, test_kv.first, add_special);
|
||||
const std::vector<llama_token> res = llama_tokenize(ctx, test_kv.first, add_special, true);
|
||||
|
||||
printf("\n");
|
||||
printf("src: '%s'\n", test_kv.first.c_str());
|
||||
printf("res: '%s'\n", llama_detokenize_bpe(ctx, res).c_str());
|
||||
printf("res: '%s'\n", llama_detokenize(ctx, res).c_str());
|
||||
printf("tok: ");
|
||||
for (const auto & tok : res) {
|
||||
printf("%d ", tok);
|
||||
@@ -216,8 +216,8 @@ int main(int argc, char **argv) {
|
||||
if (!correct) {
|
||||
fprintf(stderr, "%s : failed test: '%s'\n", __func__, test_kv.first.c_str());
|
||||
fprintf(stderr, "%s : detokenized to: '%s' instead of '%s'\n", __func__,
|
||||
llama_detokenize_bpe(ctx, res).c_str(),
|
||||
llama_detokenize_bpe(ctx, test_kv.second).c_str());
|
||||
llama_detokenize(ctx, res).c_str(),
|
||||
llama_detokenize(ctx, test_kv.second).c_str());
|
||||
fprintf(stderr, "%s : expected tokens: ", __func__);
|
||||
for (const auto & t : test_kv.second) {
|
||||
fprintf(stderr, "%6d '%s', ", t, llama_token_to_piece(ctx, t).c_str());
|
||||
@@ -253,7 +253,7 @@ int main(int argc, char **argv) {
|
||||
{
|
||||
const auto t_start = ggml_time_us();
|
||||
|
||||
res = llama_tokenize(ctx, text, add_special);
|
||||
res = llama_tokenize(ctx, text, add_special, true);
|
||||
|
||||
const auto t_end = ggml_time_us();
|
||||
|
||||
@@ -272,7 +272,7 @@ int main(int argc, char **argv) {
|
||||
}
|
||||
|
||||
for (const auto & tok : res) {
|
||||
//ofs << tok << " '" << string_strip(llama_detokenize_bpe(ctx, std::vector<int>{tok})) << "'" << std::endl;
|
||||
//ofs << tok << " '" << string_strip(llama_detokenize(ctx, std::vector<int>{tok})) << "'" << std::endl;
|
||||
ofs << tok << "\n";
|
||||
}
|
||||
}
|
||||
|
||||
@@ -11,6 +11,7 @@
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
#include <atomic>
|
||||
|
||||
int main(int argc, char **argv) {
|
||||
if (argc < 2 || argc > 3) {
|
||||
@@ -63,7 +64,10 @@ int main(int argc, char **argv) {
|
||||
}
|
||||
}
|
||||
|
||||
GGML_ASSERT(llama_vocab_type(model) == LLAMA_VOCAB_TYPE_BPE);
|
||||
//GGML_ASSERT(llama_vocab_type(model) == LLAMA_VOCAB_TYPE_BPE);
|
||||
if (llama_vocab_type(model) != LLAMA_VOCAB_TYPE_BPE) {
|
||||
return 99;
|
||||
}
|
||||
|
||||
#ifdef _WIN32
|
||||
// We need this for unicode console support
|
||||
@@ -74,7 +78,7 @@ int main(int argc, char **argv) {
|
||||
const int n_vocab = llama_n_vocab(model);
|
||||
|
||||
for (int i = 0; i < n_vocab; ++i) {
|
||||
std::string str = llama_detokenize_bpe(ctx, std::vector<int>(1, i));
|
||||
std::string str = llama_detokenize(ctx, std::vector<int>(1, i));
|
||||
try {
|
||||
auto cps = unicode_cpts_from_utf8(str);
|
||||
std::vector<llama_token> tokens = llama_tokenize(ctx, str, false, true);
|
||||
@@ -90,7 +94,7 @@ int main(int argc, char **argv) {
|
||||
fprintf(stderr, "]\n");
|
||||
return 2;
|
||||
}
|
||||
std::string check = llama_detokenize_bpe(ctx, tokens);
|
||||
std::string check = llama_detokenize(ctx, tokens);
|
||||
if (check != str) {
|
||||
fprintf(stderr, "%s : error: token %d detokenizes to '%s'(%zu) but tokenization of this detokenizes to '%s'(%zu)\n",
|
||||
__func__, i, str.c_str(), str.length(), check.c_str(), check.length());
|
||||
@@ -108,26 +112,23 @@ int main(int argc, char **argv) {
|
||||
|
||||
std::vector<std::thread> threads(nthread);
|
||||
|
||||
std::atomic_int errcode = {};
|
||||
|
||||
for (int i = 0; i < nthread; ++i) {
|
||||
threads[i] = std::thread([i, nthread, ctx]() {
|
||||
for (uint32_t cp = i; cp < 0x0010ffff; cp += nthread) {
|
||||
if (!( // NOLINT
|
||||
(cp < 0x03 || cp > 0x05) && cp != 0x0b && cp != 0x11 &&
|
||||
(cp < 0x13 || cp > 0x17) && cp != 0x19 &&
|
||||
(cp < 0x1c || cp > 0x1e) &&
|
||||
(cp < 0xd800 || cp > 0xdfff) &&
|
||||
(cp < 0x00040000 || cp >= 0x000e0000)
|
||||
)) {
|
||||
threads[i] = std::thread([i, nthread, ctx, &errcode]() {
|
||||
for (uint32_t cp = i; !errcode && cp < 0x00110000; cp += nthread) {
|
||||
if ((0x0000D800 <= cp && cp <= 0x0000DFFF) || // surrogates \p{Cs}
|
||||
(0x00040000 <= cp && cp <= 0x000E0000)) { // undefined \p{Cn}
|
||||
continue;
|
||||
}
|
||||
|
||||
std::string str = unicode_cpt_to_utf8(cp);
|
||||
std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
|
||||
std::string check = llama_detokenize_bpe(ctx, tokens);
|
||||
std::string check = llama_detokenize(ctx, tokens);
|
||||
if (cp != 9601 && str != check) {
|
||||
fprintf(stderr, "error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
|
||||
fprintf(stderr, "error: codepoint 0x%x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
|
||||
cp, check.c_str(), check.length(), str.c_str(), str.length());
|
||||
std::exit(3);
|
||||
errcode = 3;
|
||||
}
|
||||
}
|
||||
});
|
||||
@@ -136,6 +137,10 @@ int main(int argc, char **argv) {
|
||||
for (auto & t : threads) {
|
||||
t.join();
|
||||
}
|
||||
|
||||
if (errcode) {
|
||||
return errcode;
|
||||
}
|
||||
}
|
||||
|
||||
llama_free_model(model);
|
||||
|
||||
@@ -11,6 +11,7 @@
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
#include <atomic>
|
||||
|
||||
int main(int argc, char ** argv) {
|
||||
if (argc < 2) {
|
||||
@@ -51,7 +52,10 @@ int main(int argc, char ** argv) {
|
||||
}
|
||||
}
|
||||
|
||||
GGML_ASSERT(llama_vocab_type(model) == LLAMA_VOCAB_TYPE_SPM);
|
||||
//GGML_ASSERT(llama_vocab_type(model) == LLAMA_VOCAB_TYPE_SPM);
|
||||
if (llama_vocab_type(model) != LLAMA_VOCAB_TYPE_SPM) {
|
||||
return 99;
|
||||
}
|
||||
|
||||
#ifdef _WIN32
|
||||
// We need this for unicode console support
|
||||
@@ -62,9 +66,9 @@ int main(int argc, char ** argv) {
|
||||
const int n_vocab = llama_n_vocab(model);
|
||||
|
||||
for (int i = 0; i < n_vocab; ++i) {
|
||||
std::string str = llama_detokenize_spm(ctx, std::vector<int>(1, i));
|
||||
std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
|
||||
std::string check = llama_detokenize_spm(ctx, tokens);
|
||||
std::string str = llama_detokenize(ctx, std::vector<int>(1, i), true);
|
||||
std::vector<llama_token> tokens = llama_tokenize(ctx, str, false, true);
|
||||
std::string check = llama_detokenize(ctx, tokens);
|
||||
if (check != str) {
|
||||
fprintf(stderr, "%s : error: token %d detokenizes to '%s'(%zu) but tokenization of this detokenizes to '%s'(%zu)\n",
|
||||
__func__, i, str.c_str(), str.length(), check.c_str(), check.length());
|
||||
@@ -78,20 +82,23 @@ int main(int argc, char ** argv) {
|
||||
|
||||
std::vector<std::thread> threads(nthread);
|
||||
|
||||
std::atomic_int errcode = {};
|
||||
|
||||
for (int i = 0; i < nthread; ++i) {
|
||||
threads[i] = std::thread([i, nthread, ctx]() {
|
||||
for (uint32_t cp = i; cp < 0x0010ffff; cp += nthread) {
|
||||
if (cp >= 0xd800 && cp <= 0xdfff) {
|
||||
threads[i] = std::thread([i, nthread, ctx, &errcode]() {
|
||||
for (uint32_t cp = i; !errcode && cp < 0x00110000; cp += nthread) {
|
||||
if ((0x0000D800 <= cp && cp <= 0x0000DFFF) || // surrogates \p{Cs}
|
||||
(0x00040000 <= cp && cp <= 0x000E0000)) { // undefined \p{Cn}
|
||||
continue;
|
||||
}
|
||||
|
||||
std::string str = unicode_cpt_to_utf8(cp);
|
||||
std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
|
||||
std::string check = llama_detokenize_spm(ctx, tokens);
|
||||
std::vector<llama_token> tokens = llama_tokenize(ctx, str, false, true);
|
||||
std::string check = llama_detokenize(ctx, tokens);
|
||||
if (cp != 9601 && str != check) {
|
||||
fprintf(stderr, "error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
|
||||
fprintf(stderr, "error: codepoint 0x%x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
|
||||
cp, check.c_str(), check.length(), str.c_str(), str.length());
|
||||
std::exit(3);
|
||||
errcode = 3;
|
||||
}
|
||||
}
|
||||
});
|
||||
@@ -100,6 +107,10 @@ int main(int argc, char ** argv) {
|
||||
for (auto & t : threads) {
|
||||
t.join();
|
||||
}
|
||||
|
||||
if(errcode) {
|
||||
return errcode;
|
||||
}
|
||||
}
|
||||
|
||||
llama_free_model(model);
|
||||
|
||||
+221
-91
@@ -13,7 +13,7 @@ import subprocess
|
||||
import random
|
||||
import unicodedata
|
||||
|
||||
from typing import Callable, Iterator
|
||||
from typing import Iterator
|
||||
|
||||
import cffi
|
||||
from transformers import AutoTokenizer
|
||||
@@ -24,17 +24,20 @@ logger = logging.getLogger("test-tokenizer-random")
|
||||
|
||||
class LibLlama:
|
||||
|
||||
DEFAULT_PATH_LLAMA_H = "./llama.h"
|
||||
DEFAULT_PATH_LIBLLAMA = "./build/libllama.so" # CMakeLists.txt: BUILD_SHARED_LIBS ON
|
||||
DEFAULT_PATH_LLAMA_H = "./include/llama.h"
|
||||
DEFAULT_PATH_INCLUDES = ["./ggml/include/", "./include/"]
|
||||
DEFAULT_PATH_LIBLLAMA = "./build/src/libllama.so" # CMakeLists.txt: BUILD_SHARED_LIBS ON
|
||||
|
||||
def __init__(self, path_llama_h: str = None, path_libllama: str = None):
|
||||
def __init__(self, path_llama_h: str = None, path_includes: list[str] = [], path_libllama: str = None):
|
||||
path_llama_h = path_llama_h or self.DEFAULT_PATH_LLAMA_H
|
||||
path_includes = path_includes or self.DEFAULT_PATH_INCLUDES
|
||||
path_libllama = path_libllama or self.DEFAULT_PATH_LIBLLAMA
|
||||
(self.ffi, self.lib) = self._load_libllama_cffi(path_llama_h, path_libllama)
|
||||
(self.ffi, self.lib) = self._load_libllama_cffi(path_llama_h, path_includes, path_libllama)
|
||||
self.lib.llama_backend_init()
|
||||
|
||||
def _load_libllama_cffi(self, path_llama_h: str, path_libllama: str):
|
||||
cmd = ["gcc", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)=", path_llama_h]
|
||||
def _load_libllama_cffi(self, path_llama_h: str, path_includes: list[str], path_libllama: str):
|
||||
cmd = ["gcc", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
|
||||
cmd += ["-I" + path for path in path_includes] + [path_llama_h]
|
||||
res = subprocess.run(cmd, stdout=subprocess.PIPE)
|
||||
assert (res.returncode == 0)
|
||||
source = res.stdout.decode()
|
||||
@@ -79,6 +82,7 @@ class LibLlamaModel:
|
||||
raise RuntimeError("error: failed to create context for model '%s'" % path_model)
|
||||
n_tokens_max = self.lib.llama_n_ctx(self.ctx)
|
||||
self.token_ids = self.ffi.new("llama_token[]", n_tokens_max)
|
||||
self.text_buff = self.ffi.new("uint8_t[]", 1024)
|
||||
|
||||
def free(self):
|
||||
if self.ctx:
|
||||
@@ -89,14 +93,78 @@ class LibLlamaModel:
|
||||
self.model = None
|
||||
self.lib = None
|
||||
|
||||
def tokenize(self, text: str, n_tokens_max: int = 0, add_special: bool = False, parse_special: bool = False) -> list[int]:
|
||||
n_tokens_max = n_tokens_max if n_tokens_max > 0 else len(self.token_ids)
|
||||
def tokenize(self, text: str, add_special: bool = False, parse_special: bool = False) -> list[int]:
|
||||
text = text.encode("utf-8")
|
||||
num = self.lib.llama_tokenize(self.model, text, len(text), self.token_ids, n_tokens_max, add_special, parse_special)
|
||||
if num < 0:
|
||||
return []
|
||||
num = self.lib.llama_tokenize(self.model, text, len(text), self.token_ids, len(self.token_ids), add_special, parse_special)
|
||||
while num < 0 and len(self.token_ids) < (16 << 20):
|
||||
self.token_ids = self.ffi.new("llama_token[]", -2 * num)
|
||||
num = self.lib.llama_tokenize(self.model, text, len(text), self.token_ids, len(self.token_ids), add_special, parse_special)
|
||||
return list(self.token_ids[0:num])
|
||||
|
||||
def detokenize(self, ids: list[int], remove_special: bool = False, unparse_special: bool = False) -> str:
|
||||
if len(self.token_ids) < len(ids):
|
||||
self.token_ids = self.ffi.new("llama_token[]", 2 * len(ids))
|
||||
for i, id in enumerate(ids):
|
||||
self.token_ids[i] = id
|
||||
num = self.lib.llama_detokenize(self.model, self.token_ids, len(ids), self.text_buff, len(self.text_buff), remove_special, unparse_special)
|
||||
while num < 0 and len(self.text_buff) < (16 << 20):
|
||||
self.text_buff = self.ffi.new("uint8_t[]", -2 * num)
|
||||
num = self.lib.llama_detokenize(self.model, self.token_ids, len(ids), self.text_buff, len(self.text_buff), remove_special, unparse_special)
|
||||
return str(self.ffi.buffer(self.text_buff, num), encoding="utf-8", errors="replace") # replace errors with '\uFFFD'
|
||||
|
||||
|
||||
class Tokenizer:
|
||||
|
||||
def encode(self, text: str) -> list[int]:
|
||||
raise NotImplementedError
|
||||
|
||||
def decode(self, ids: list[int]) -> str:
|
||||
raise NotImplementedError
|
||||
|
||||
|
||||
class TokenizerGroundtruth (Tokenizer):
|
||||
|
||||
def __init__(self, dir_tokenizer: str):
|
||||
self.model = AutoTokenizer.from_pretrained(dir_tokenizer)
|
||||
# guess BOS and EOS
|
||||
ids = self.encode("a")
|
||||
assert 1 <= len(ids) <= 3
|
||||
add_bos_token = len(ids) > 1 and self.model.bos_token_id == ids[0]
|
||||
add_eos_token = len(ids) > 1 and self.model.eos_token_id == ids[-1]
|
||||
self.add_bos_token = getattr(self.model, "add_bos_token", add_bos_token)
|
||||
self.add_eos_token = getattr(self.model, "add_eos_token", add_eos_token)
|
||||
# build vocab
|
||||
tokens = list(self.model.get_vocab().values())
|
||||
self.vocab = self.model.batch_decode(tokens, skip_special_tokens=True)
|
||||
self.vocab = list(sorted(self.vocab))
|
||||
# tokens and lists
|
||||
self.special_tokens = list(self.model.all_special_tokens)
|
||||
self.added_tokens = list(self.model.added_tokens_encoder)
|
||||
self.bos_token = self.model.bos_token
|
||||
self.eos_token = self.model.eos_token
|
||||
|
||||
def encode(self, text: str) -> list[int]:
|
||||
return self.model.encode(text, add_special_tokens=True)
|
||||
|
||||
def decode(self, ids: list[int]) -> str:
|
||||
return self.model.decode(ids, skip_special_tokens=False)
|
||||
|
||||
|
||||
class TokenizerLlamaCpp (Tokenizer):
|
||||
|
||||
libllama: LibLlama = None
|
||||
|
||||
def __init__(self, vocab_file: str):
|
||||
if not self.libllama:
|
||||
self.libllama = LibLlama()
|
||||
self.model = LibLlamaModel(self.libllama, vocab_file, mparams=dict(vocab_only=True), cparams=dict(n_ctx=4096))
|
||||
|
||||
def encode(self, text: str) -> list[int]:
|
||||
return self.model.tokenize(text, add_special=True, parse_special=True)
|
||||
|
||||
def decode(self, ids: list[int]) -> str:
|
||||
return self.model.detokenize(ids, remove_special=False, unparse_special=True)
|
||||
|
||||
|
||||
def generator_custom_text() -> Iterator[str]:
|
||||
"""General tests"""
|
||||
@@ -165,19 +233,48 @@ def generator_custom_text_edge_cases() -> Iterator[str]:
|
||||
'a </s> b', # rstrip phi-3
|
||||
'a <mask> b', # lstrip jina-v2
|
||||
'\xa0aC', # deepseek
|
||||
'\u2029 \uA3E4', # deepseek-llm
|
||||
"a ?",
|
||||
'å', # mpt
|
||||
'\U000ac517', # utf-8 encode error, falcon
|
||||
'\U000522f4', # utf-8 encode error, starcoder
|
||||
"<s><s><unk><s>a<s>b<s>c<unk>d<unk></s>",
|
||||
"<s> <s> <unk><s>a<s>b<s>c<unk>d<unk></s>",
|
||||
]
|
||||
|
||||
|
||||
def generator_vocab_words(vocab: list[str]) -> Iterator[str]:
|
||||
def generator_vocab_words(tokenizer: TokenizerGroundtruth) -> Iterator[str]:
|
||||
"""Brute force check all vocab words"""
|
||||
yield from vocab
|
||||
yield from tokenizer.vocab
|
||||
|
||||
|
||||
def generator_added_lr_strip(tokenizer) -> Iterator[str]:
|
||||
WHITESPACES = ["", " ", " ", " "]
|
||||
special_tokens = list(tokenizer.all_special_tokens)
|
||||
added_tokens = list(tokenizer.added_tokens_encoder)
|
||||
all_tokens = list(sorted(set(special_tokens + added_tokens)))
|
||||
def generator_ascii_lr_strip() -> Iterator[str]:
|
||||
WHITESPACES = ["", " ", " "]
|
||||
CHARACTERS = list(chr(i) for i in range(1, 0x80)) + [""]
|
||||
for char1 in CHARACTERS:
|
||||
for char2 in CHARACTERS:
|
||||
for lstrip in WHITESPACES:
|
||||
for rstrip in WHITESPACES:
|
||||
yield lstrip + char1 + char2 + rstrip
|
||||
yield lstrip + char1 + rstrip + char2
|
||||
yield char1 + lstrip + char2 + rstrip
|
||||
|
||||
|
||||
def generator_apostrophe() -> Iterator[str]:
|
||||
WHITESPACES = ["", " ", " "]
|
||||
CHARACTERS = list(chr(i) for i in range(1, 0x80)) + [""]
|
||||
for char1 in CHARACTERS:
|
||||
for char2 in CHARACTERS:
|
||||
for lstrip in WHITESPACES:
|
||||
for rstrip in WHITESPACES:
|
||||
yield char1 + lstrip + "'" + rstrip + char2
|
||||
yield char1 + char2 + lstrip + "'" + rstrip + "z"
|
||||
yield "a" + lstrip + "'" + rstrip + char1 + char2
|
||||
|
||||
|
||||
def generator_added_lr_strip(tokenizer: TokenizerGroundtruth) -> Iterator[str]:
|
||||
WHITESPACES = ["", " ", " ", "\n", "\r\n", "\n\n", "\t", "\t\t"]
|
||||
all_tokens = list(sorted(set(tokenizer.special_tokens + tokenizer.added_tokens)))
|
||||
for token in all_tokens:
|
||||
for lstrip in WHITESPACES:
|
||||
for rstrip in WHITESPACES:
|
||||
@@ -187,11 +284,9 @@ def generator_added_lr_strip(tokenizer) -> Iterator[str]:
|
||||
yield "a" + lstrip + token + rstrip + "z"
|
||||
|
||||
|
||||
def generator_random_added_tokens(tokenizer, iterations=100) -> Iterator[str]:
|
||||
special_tokens = list(tokenizer.all_special_tokens)
|
||||
added_tokens = list(tokenizer.added_tokens_encoder)
|
||||
separations = [" ", "\n", "\t", "-", "!", "one", "1", "<s>", "</s>"]
|
||||
all_tokens = list(sorted(set(special_tokens + added_tokens + separations)))
|
||||
def generator_random_added_tokens(tokenizer: TokenizerGroundtruth, iterations=100) -> Iterator[str]:
|
||||
separations = [" ", "\n", "\t", "-", "!", "one", "1", "<s>", "</s>"]
|
||||
all_tokens = list(sorted(set(tokenizer.special_tokens + tokenizer.added_tokens + separations)))
|
||||
rand = random.Random()
|
||||
for m in range(iterations):
|
||||
rand.seed(m)
|
||||
@@ -242,13 +337,13 @@ def generator_unicodes() -> Iterator[str]:
|
||||
def _valid(cpt):
|
||||
if cpt >= 0x30000: # unassigned and supplementary
|
||||
return False
|
||||
if 0x00D800 <= cpt <= 0x00F8FF: # Surrogates
|
||||
return False
|
||||
if unicodedata.category(chr(cpt)) == "Cn":
|
||||
# if cpt == 0x2029: # deepseek-llm
|
||||
# return False
|
||||
if unicodedata.category(chr(cpt)) in ("Cn", "Cs", "Co"): # undefined, surrogates, private
|
||||
return False
|
||||
return True
|
||||
|
||||
characters = [chr(cpt) for cpt in range(1, MAX_CODEPOINTS) if _valid(cpt)]
|
||||
characters = [chr(cpt) for cpt in range(0, MAX_CODEPOINTS) if _valid(cpt)]
|
||||
|
||||
yield from characters
|
||||
|
||||
@@ -273,11 +368,11 @@ def generator_random_unicodes(iterations=100) -> Iterator[str]:
|
||||
yield "".join(text)
|
||||
|
||||
|
||||
def generator_random_vocab_chars(vocab: list[str], iterations=100) -> Iterator[str]:
|
||||
def generator_random_vocab_chars(tokenizer: TokenizerGroundtruth, iterations=100) -> Iterator[str]:
|
||||
"""Brute force random text with vocab characters"""
|
||||
|
||||
vocab_chars = set()
|
||||
for word in vocab:
|
||||
for word in tokenizer.vocab:
|
||||
vocab_chars.update(word)
|
||||
vocab_chars = list(sorted(vocab_chars))
|
||||
|
||||
@@ -288,10 +383,10 @@ def generator_random_vocab_chars(vocab: list[str], iterations=100) -> Iterator[s
|
||||
yield "".join(text)
|
||||
|
||||
|
||||
def generator_random_vocab_words(vocab: list[str], iterations=100) -> Iterator[str]:
|
||||
def generator_random_vocab_words(tokenizer: TokenizerGroundtruth, iterations=100) -> Iterator[str]:
|
||||
"""Brute force random text from vocab words"""
|
||||
|
||||
vocab = [w.strip() for w in vocab]
|
||||
vocab = [w.strip() for w in tokenizer.vocab]
|
||||
yield from vocab
|
||||
|
||||
rand = random.Random()
|
||||
@@ -307,7 +402,7 @@ def generator_random_vocab_words(vocab: list[str], iterations=100) -> Iterator[s
|
||||
yield "".join(text)
|
||||
|
||||
|
||||
def compare_tokenizers(func_tokenize1: Callable, func_tokenize2: Callable, generator: Iterator[str]):
|
||||
def compare_tokenizers(tokenizer1: TokenizerGroundtruth, tokenizer2: TokenizerLlamaCpp, generator: Iterator[str]):
|
||||
|
||||
def find_first_mismatch(ids1: list[int], ids2: list[int]):
|
||||
for i, (a, b) in enumerate(zip(ids1, ids2)):
|
||||
@@ -317,34 +412,67 @@ def compare_tokenizers(func_tokenize1: Callable, func_tokenize2: Callable, gener
|
||||
return -1
|
||||
return min(len(ids1), len(ids2))
|
||||
|
||||
t_tokenizer1 = 0
|
||||
t_tokenizer2 = 0
|
||||
def check_detokenizer(text: str, text1: str, text2: str) -> bool:
|
||||
if text1 == text2: # equal to TokenizerGroundtruth?
|
||||
return True
|
||||
# equal to source text?
|
||||
if tokenizer1.add_bos_token: # remove BOS
|
||||
if text2.startswith(tokenizer1.bos_token):
|
||||
text2 = text2[len(tokenizer1.bos_token):]
|
||||
if tokenizer1.add_eos_token: # remove EOS
|
||||
if text2.endswith(tokenizer1.eos_token):
|
||||
text2 = text2[:-len(tokenizer1.eos_token)]
|
||||
return text == text2
|
||||
|
||||
t_encode1 = 0
|
||||
t_encode2 = 0
|
||||
t_decode1 = 0
|
||||
t_decode2 = 0
|
||||
t_start = time.perf_counter()
|
||||
num_errors = 10
|
||||
encode_errors = 0
|
||||
decode_errors = 0
|
||||
MAX_ERRORS = 10
|
||||
|
||||
logger.info("%s: %s" % (generator.__name__, "ini"))
|
||||
for text in generator:
|
||||
# print(repr(text), text.encode())
|
||||
# print(repr(text), hex(ord(text[0])), text.encode())
|
||||
t0 = time.perf_counter()
|
||||
ids1 = func_tokenize1(text)
|
||||
ids1 = tokenizer1.encode(text)
|
||||
t1 = time.perf_counter()
|
||||
ids2 = func_tokenize2(text)
|
||||
ids2 = tokenizer2.encode(text)
|
||||
t2 = time.perf_counter()
|
||||
t_tokenizer1 += t1 - t0
|
||||
t_tokenizer2 += t2 - t1
|
||||
if ids1 != ids2:
|
||||
text1 = tokenizer1.decode(ids1)
|
||||
t3 = time.perf_counter()
|
||||
text2 = tokenizer2.decode(ids1)
|
||||
t4 = time.perf_counter()
|
||||
t_encode1 += t1 - t0
|
||||
t_encode2 += t2 - t1
|
||||
t_decode1 += t3 - t2
|
||||
t_decode2 += t4 - t3
|
||||
if encode_errors < MAX_ERRORS and ids1 != ids2:
|
||||
i = find_first_mismatch(ids1, ids2)
|
||||
ids1 = list(ids1)[max(0, i - 2) : i + 5 + 1]
|
||||
ids2 = list(ids2)[max(0, i - 2) : i + 5 + 1]
|
||||
logger.error(" TokenIDs: " + str(ids1))
|
||||
logger.error(" Expected: " + str(ids2))
|
||||
logger.error(" Expected: " + str(ids1))
|
||||
logger.error(" Result: " + str(ids2))
|
||||
encode_errors += 1
|
||||
logger.error(f" {encode_errors=}")
|
||||
if decode_errors < MAX_ERRORS and not check_detokenizer(text, text1, text2):
|
||||
i = find_first_mismatch(text1, text2)
|
||||
text1 = list(text1[max(0, i - 2) : i + 5 + 1])
|
||||
text2 = list(text2[max(0, i - 2) : i + 5 + 1])
|
||||
logger.error(" Expected: " + " ".join(hex(ord(x)) for x in text1))
|
||||
logger.error(" Result: " + " ".join(hex(ord(x)) for x in text2))
|
||||
decode_errors += 1
|
||||
logger.error(f" {decode_errors=}")
|
||||
if encode_errors >= MAX_ERRORS and decode_errors >= MAX_ERRORS:
|
||||
logger.error(f" EXIT: {encode_errors=} {decode_errors=}")
|
||||
# raise Exception()
|
||||
num_errors += 1
|
||||
if num_errors > 10:
|
||||
break
|
||||
break
|
||||
|
||||
t_total = time.perf_counter() - t_start
|
||||
logger.info("%s: end, tok1: %.3f tok2: %.3f total: %.3f" % (generator.__name__, t_tokenizer1, t_tokenizer2, t_total))
|
||||
logger.info(f"{generator.__name__}: end, {t_encode1=:.3f} {t_encode2=:.3f} {t_decode1=:.3f} {t_decode2=:.3f} {t_total=:.3f}")
|
||||
|
||||
|
||||
def main(argv: list[str] = None):
|
||||
@@ -357,74 +485,76 @@ def main(argv: list[str] = None):
|
||||
logging.basicConfig(level = logging.DEBUG if args.verbose else logging.INFO)
|
||||
logger.info(f"VOCABFILE: '{args.vocab_file}'")
|
||||
|
||||
model = LibLlamaModel(LibLlama(), args.vocab_file, mparams=dict(vocab_only=True), cparams=dict(n_ctx=4096))
|
||||
tokenizer = AutoTokenizer.from_pretrained(args.dir_tokenizer)
|
||||
tokenizer1 = TokenizerGroundtruth(args.dir_tokenizer)
|
||||
tokenizer2 = TokenizerLlamaCpp(args.vocab_file)
|
||||
|
||||
def func_tokenize1(text: str):
|
||||
return model.tokenize(text, add_special=True, parse_special=True)
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_custom_text())
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_custom_text_edge_cases())
|
||||
compare_tokenizers(tokenizer1, tokenizer2, generator_ascii_lr_strip())
|
||||
compare_tokenizers(tokenizer1, tokenizer2, generator_apostrophe())
|
||||
compare_tokenizers(tokenizer1, tokenizer2, generator_unicodes())
|
||||
compare_tokenizers(tokenizer1, tokenizer2, generator_vocab_words(tokenizer1))
|
||||
compare_tokenizers(tokenizer1, tokenizer2, generator_added_lr_strip(tokenizer1))
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_random_added_tokens(tokenizer1, 10_000))
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_random_chars(10_000))
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_random_unicodes(10_000))
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_random_vocab_chars(tokenizer1, 10_000))
|
||||
# compare_tokenizers(tokenizer1, tokenizer2, generator_random_vocab_words(tokenizer1, 5_000))
|
||||
|
||||
def func_tokenize2(text: str):
|
||||
return tokenizer.encode(text, add_special_tokens=True)
|
||||
|
||||
ids = func_tokenize2("a")
|
||||
assert 1 <= len(ids) <= 3
|
||||
add_bos_token = len(ids) > 1 and tokenizer.bos_token_id == ids[0]
|
||||
add_eos_token = len(ids) > 1 and tokenizer.eos_token_id == ids[-1]
|
||||
tokenizer.add_bos_token = getattr(tokenizer, "add_bos_token", add_bos_token)
|
||||
tokenizer.add_eos_token = getattr(tokenizer, "add_eos_token", add_eos_token)
|
||||
|
||||
vocab = list(sorted(tokenizer.batch_decode(list(tokenizer.get_vocab().values()), skip_special_tokens=True)))
|
||||
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_custom_text())
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_custom_text_edge_cases())
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_unicodes())
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_vocab_words(vocab))
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_added_lr_strip(tokenizer))
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_random_added_tokens(tokenizer, 10_000))
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_random_chars(10_000))
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_random_unicodes(10_000))
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_random_vocab_chars(vocab, 10_000))
|
||||
compare_tokenizers(func_tokenize1, func_tokenize2, generator_random_vocab_words(vocab, 5_000))
|
||||
|
||||
model.free()
|
||||
tokenizer2.model.free()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# main()
|
||||
|
||||
if True:
|
||||
logging.basicConfig(
|
||||
level = logging.DEBUG,
|
||||
format = "%(asctime)s.%(msecs)03d %(name)s %(levelname)s %(message)s",
|
||||
datefmt = "%Y-%m-%d %H:%M:%S",
|
||||
filename = logger.name + ".log",
|
||||
filemode = "a"
|
||||
)
|
||||
logging.basicConfig(
|
||||
level = logging.DEBUG,
|
||||
format = "%(asctime)s.%(msecs)03d %(name)s %(levelname)s %(message)s",
|
||||
datefmt = "%Y-%m-%d %H:%M:%S",
|
||||
filename = logger.name + ".log",
|
||||
filemode = "a"
|
||||
format = "%(levelname)s %(message)s",
|
||||
)
|
||||
|
||||
path_tokenizers = "./models/tokenizers/"
|
||||
path_vocab_format = "./models/ggml-vocab-%s.gguf"
|
||||
|
||||
# import os
|
||||
# tokenizers = os.listdir(path_tokenizers)
|
||||
tokenizers = [
|
||||
# "llama-spm", # SPM
|
||||
# "phi-3", # SPM
|
||||
# "bert-bge", # WPM
|
||||
# "jina-v2-en", # WPM
|
||||
"gpt-2", # BPE
|
||||
"llama-spm", # SPM
|
||||
"phi-3", # SPM
|
||||
"gemma", # SPM
|
||||
"gemma-2", # SPM
|
||||
"baichuan", # SPM
|
||||
"bert-bge", # WPM
|
||||
"jina-v2-en", # WPM
|
||||
"llama-bpe", # BPE
|
||||
"phi-2", # BPE
|
||||
"deepseek-llm", # BPE
|
||||
"deepseek-coder", # BPE
|
||||
"falcon", # BPE
|
||||
"mpt", # BPE
|
||||
"starcoder", # BPE
|
||||
"gpt-2", # BPE
|
||||
"stablelm2", # BPE
|
||||
"refact", # BPE
|
||||
"qwen2", # BPE
|
||||
"olmo", # BPE
|
||||
"jina-v2-es", # BPE
|
||||
"jina-v2-de", # BPE
|
||||
"jina-v2-code", # BPE
|
||||
"smaug-bpe", # BPE
|
||||
"phi-2", # BPE
|
||||
"deepseek-coder", # BPE
|
||||
"deepseek-llm", # BPE
|
||||
"poro-chat", # BPE
|
||||
"jina-v2-code", # BPE
|
||||
"viking", # BPE
|
||||
"jais", # BPE
|
||||
]
|
||||
|
||||
logger.info("=" * 50)
|
||||
for tokenizer in tokenizers:
|
||||
logger.info("=" * 50)
|
||||
logger.info("-" * 50)
|
||||
logger.info(f"TOKENIZER: '{tokenizer}'")
|
||||
vocab_file = path_vocab_format % tokenizer
|
||||
dir_tokenizer = path_tokenizers + "/" + tokenizer
|
||||
|
||||
Reference in New Issue
Block a user