Compare commits

...

5 Commits

Author SHA1 Message Date
Georgi Gerganov 6fdddb4987 metal : support virtual devices (#18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Daniel Bevenius 6156ae5111 model-conversion : add debug option to conversion script (#19265)
This commit adds a debug option to the model conversion script to enable
using the Python debugger (pdb) during model conversion.

The motivation for this is that I've found myself adding this a few
times now and it would be quicker to have this flag as an option and a
makefile target/recipe for it.
2026-02-02 11:29:57 +01:00
Johannes Gäßler 59377a6c87 ggml-backend: fix async set/get fallback sync (#19179) 2026-02-02 10:00:05 +01:00
Georgi Gerganov 1239267cc4 authors : update (#19263)
[no ci]
2026-02-02 08:51:25 +02:00
Christian Kastner 7a4ca3cbd9 docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-02 08:38:55 +02:00
47 changed files with 1316 additions and 507 deletions
+749 -336
View File
File diff suppressed because it is too large Load Diff
+1 -1
View File
@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2023-2024 The ggml authors
Copyright (c) 2023-2026 The ggml authors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
+1 -1
View File
@@ -9,7 +9,7 @@ Download [MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) PyTorch m
### Build llama.cpp
Readme modification time: 20250206
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
+2 -2
View File
@@ -8,11 +8,11 @@ Download [MiniCPM-o-4](https://huggingface.co/openbmb/MiniCPM-o-4) PyTorch model
### Build llama.cpp
Readme modification time: 20250206
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
git clone https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
+1 -1
View File
@@ -8,7 +8,7 @@ Download [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-
### Build llama.cpp
Readme modification time: 20250206
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
+1 -1
View File
@@ -8,7 +8,7 @@ Download [MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) PyTorch m
### Build llama.cpp
Readme modification time: 20250206
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
+2 -2
View File
@@ -8,11 +8,11 @@ Download [MiniCPM-V-4](https://huggingface.co/openbmb/MiniCPM-V-4) PyTorch model
### Build llama.cpp
Readme modification time: 20250731
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
git clone https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
+2 -2
View File
@@ -8,11 +8,11 @@ Download [MiniCPM-V-4_5](https://huggingface.co/openbmb/MiniCPM-V-4_5) PyTorch m
### Build llama.cpp
Readme modification time: 20250826
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
git clone https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
+1 -1
View File
@@ -1,7 +1,7 @@
# Migration notice for binary filenames
> [!IMPORTANT]
[2024 Jun 12] Binaries have been renamed w/ a `llama-` prefix. `main` is now `llama-cli`, `server` is `llama-server`, etc (https://github.com/ggerganov/llama.cpp/pull/7809)
[2024 Jun 12] Binaries have been renamed w/ a `llama-` prefix. `main` is now `llama-cli`, `server` is `llama-server`, etc (https://github.com/ggml-org/llama.cpp/pull/7809)
This migration was important, but it is a breaking change that may not always be immediately obvious to users.
@@ -28,7 +28,7 @@ int main(int argc, char** argv) {
fprintf(stdout, "\n");
fprintf(stdout, "WARNING: The binary '%s' is deprecated.\n", filename.c_str());
fprintf(stdout, " Please use '%s' instead.\n", replacement_filename.c_str());
fprintf(stdout, " See https://github.com/ggerganov/llama.cpp/tree/master/examples/deprecation-warning/README.md for more information.\n");
fprintf(stdout, " See https://github.com/ggml-org/llama.cpp/tree/master/examples/deprecation-warning/README.md for more information.\n");
fprintf(stdout, "\n");
return EXIT_FAILURE;
+1 -1
View File
@@ -402,7 +402,7 @@ class SchemaConverter:
Transforms a regular expression pattern into a GBNF rule.
Input: https://json-schema.org/understanding-json-schema/reference/regular_expressions
Output: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
Output: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
Unsupported features: negative/positive lookaheads, greedy/non-greedy modifiers.
+4 -1
View File
@@ -33,11 +33,14 @@ DEVICE ?= auto
causal-convert-model-bf16: OUTTYPE=bf16
causal-convert-model-bf16: causal-convert-model
causal-convert-model-debug: DEBUG=--debug
causal-convert-model-debug: causal-convert-model
causal-convert-model:
$(call validate_model_path,causal-convert-model)
@MODEL_NAME="$(MODEL_NAME)" OUTTYPE="$(OUTTYPE)" MODEL_PATH="$(MODEL_PATH)" \
METADATA_OVERRIDE="$(METADATA_OVERRIDE)" \
./scripts/causal/convert-model.sh
./scripts/causal/convert-model.sh $(DEBUG)
causal-convert-mm-model-bf16: OUTTYPE=bf16
causal-convert-mm-model-bf16: MM_OUTTYPE=f16
@@ -4,12 +4,17 @@ set -e
# Parse command line arguments
MMPROJ=""
DEBUG=""
while [[ $# -gt 0 ]]; do
case $1 in
--mmproj)
MMPROJ="--mmproj"
shift
;;
--debug)
DEBUG="1"
shift
;;
*)
shift
;;
@@ -28,7 +33,12 @@ echo "Data type: ${TYPE}"
echo "Converted model path:: ${CONVERTED_MODEL}"
echo "Metadata override: ${METADATA_OVERRIDE}"
CMD_ARGS=("python" "../../convert_hf_to_gguf.py" "--verbose")
if [[ -n "$DEBUG" ]]; then
CMD_ARGS=("python" "-m" "pdb")
else
CMD_ARGS=("python")
fi
CMD_ARGS+=("../../convert_hf_to_gguf.py" "--verbose")
CMD_ARGS+=("${MODEL_PATH}")
CMD_ARGS+=("--outfile" "${CONVERTED_MODEL}")
CMD_ARGS+=("--outtype" "${TYPE}")
+1 -1
View File
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -6,7 +6,7 @@
// This documentation is still a work in progress.
// If you wish some specific topics to be covered, feel free to drop a comment:
//
// https://github.com/ggerganov/whisper.cpp/issues/40
// https://github.com/ggml-org/whisper.cpp/issues/40
//
// ## Overview
//
+2
View File
@@ -258,6 +258,7 @@ void ggml_backend_tensor_set_async(ggml_backend_t backend, struct ggml_tensor *
GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
if (backend->iface.set_tensor_async == NULL) {
ggml_backend_synchronize(backend);
ggml_backend_tensor_set(tensor, data, offset, size);
} else {
backend->iface.set_tensor_async(backend, tensor, data, offset, size);
@@ -271,6 +272,7 @@ void ggml_backend_tensor_get_async(ggml_backend_t backend, const struct ggml_ten
GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor read out of bounds");
if (backend->iface.get_tensor_async == NULL) {
ggml_backend_synchronize(backend);
ggml_backend_tensor_get(tensor, data, offset, size);
} else {
backend->iface.get_tensor_async(backend, tensor, data, offset, size);
+1 -1
View File
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -1,5 +1,5 @@
/**
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024 The ggml authors
* Copyright (c) 2023-2026 The ggml authors
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
+1 -1
View File
@@ -71,7 +71,7 @@ else()
# disabling fast math is needed in order to pass tests/test-backend-ops
# note: adding -fno-inline fixes the tests when using MTL_SHADER_VALIDATION=1
# note: unfortunately, we have to call it default.metallib instead of ggml.metallib
# ref: https://github.com/ggerganov/whisper.cpp/issues/1720
# ref: https://github.com/ggml-org/whisper.cpp/issues/1720
# note: adding -g causes segmentation fault during compile
#set(XC_FLAGS -fno-fast-math -fno-inline -g)
set(XC_FLAGS -fno-fast-math -fno-inline)
+8
View File
@@ -15,14 +15,22 @@ typedef struct ggml_metal * ggml_metal_t;
ggml_metal_t ggml_metal_init(ggml_metal_device_t dev);
void ggml_metal_free(ggml_metal_t ctx);
const char * ggml_metal_get_name(ggml_metal_t ctx);
void ggml_metal_synchronize(ggml_metal_t ctx);
void ggml_metal_set_tensor_async(ggml_metal_t ctx, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
void ggml_metal_get_tensor_async(ggml_metal_t ctx, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
bool ggml_metal_cpy_tensor_async(ggml_metal_t ctx_src, ggml_metal_t ctx_dst, const struct ggml_tensor * src, struct ggml_tensor * dst);
enum ggml_status ggml_metal_graph_compute (ggml_metal_t ctx, struct ggml_cgraph * gf);
void ggml_metal_graph_optimize(ggml_metal_t ctx, struct ggml_cgraph * gf);
void ggml_metal_event_record(ggml_metal_t ctx, ggml_metal_event_t ev);
void ggml_metal_event_wait (ggml_metal_t ctx, ggml_metal_event_t ev);
ggml_metal_event_t ggml_metal_get_ev_cpy(ggml_metal_t ctx);
void ggml_metal_set_n_cb (ggml_metal_t ctx, int n_cb);
void ggml_metal_set_abort_callback (ggml_metal_t ctx, ggml_abort_callback abort_callback, void * user_data);
bool ggml_metal_supports_family (ggml_metal_t ctx, int family);
+99 -6
View File
@@ -24,9 +24,13 @@ struct ggml_metal_command_buffer {
};
struct ggml_metal {
char name[128];
ggml_metal_device_t dev;
ggml_metal_library_t lib;
ggml_metal_event_t ev_cpy; // for async copies
dispatch_queue_t d_queue;
// additional, inference-time compiled pipelines
@@ -117,7 +121,11 @@ ggml_metal_t ggml_metal_init(ggml_metal_device_t dev) {
}
}
//const struct ggml_metal_device_props * props_dev = ggml_metal_device_get_props(dev);
res->ev_cpy = ggml_metal_device_event_init(dev);
const struct ggml_metal_device_props * props_dev = ggml_metal_device_get_props(dev);
snprintf(res->name, sizeof(res->name), "%s", props_dev->name);
res->d_queue = dispatch_queue_create("ggml-metal", DISPATCH_QUEUE_CONCURRENT);
@@ -206,9 +214,15 @@ void ggml_metal_free(ggml_metal_t ctx) {
dispatch_release(ctx->d_queue);
ggml_metal_device_event_free(ctx->dev, ctx->ev_cpy);
free(ctx);
}
const char * ggml_metal_get_name(ggml_metal_t ctx) {
return ctx->name;
}
void ggml_metal_synchronize(ggml_metal_t ctx) {
// wait for any backend operations to finish
if (ctx->cmd_buf_last) {
@@ -273,8 +287,8 @@ void ggml_metal_set_tensor_async(ggml_metal_t ctx, struct ggml_tensor * tensor,
// wrap the source data into a Metal buffer
id<MTLDevice> device = ggml_metal_device_get_obj(ctx->dev);
id<MTLBuffer> buf_src = [device newBufferWithBytes:data
length:size
options:MTLResourceStorageModeShared];
length:size
options:MTLResourceStorageModeShared];
GGML_ASSERT(buf_src);
@@ -316,9 +330,9 @@ void ggml_metal_get_tensor_async(ggml_metal_t ctx, const struct ggml_tensor * te
@autoreleasepool {
id<MTLDevice> device = ggml_metal_device_get_obj(ctx->dev);
id<MTLBuffer> buf_dst = [device newBufferWithBytesNoCopy:data
length:size
options:MTLResourceStorageModeShared
deallocator:nil];
length:size
options:MTLResourceStorageModeShared
deallocator:nil];
GGML_ASSERT(buf_dst);
@@ -356,6 +370,49 @@ void ggml_metal_get_tensor_async(ggml_metal_t ctx, const struct ggml_tensor * te
}
}
bool ggml_metal_cpy_tensor_async(ggml_metal_t ctx_src, ggml_metal_t ctx_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
@autoreleasepool {
struct ggml_metal_buffer_id bid_src = ggml_metal_get_buffer_id(src);
struct ggml_metal_buffer_id bid_dst = ggml_metal_get_buffer_id(dst);
if (bid_src.metal == nil || bid_dst.metal == nil) {
return false;
}
// queue the copy operation into the Metal context
// this will be queued at the end, after any currently ongoing GPU operations
id<MTLCommandQueue> queue = ggml_metal_device_get_queue(ctx_src->dev);
id<MTLCommandBuffer> cmd_buf = [queue commandBuffer];
id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];
[encoder copyFromBuffer:bid_src.metal
sourceOffset:bid_src.offs
toBuffer:bid_dst.metal
destinationOffset:bid_dst.offs
size:ggml_nbytes(src)];
[encoder endEncoding];
ggml_metal_event_t ev_cpy = ggml_metal_get_ev_cpy(ctx_src);
ggml_metal_event_record(ctx_src, ev_cpy);
[cmd_buf commit];
// do not wait here for completion
//[cmd_buf waitUntilCompleted];
// instead, remember a reference to the command buffer and wait for it later if needed
[ctx_src->cmd_bufs_ext addObject:cmd_buf];
ctx_src->cmd_buf_last = cmd_buf;
[cmd_buf retain];
ggml_metal_event_wait(ctx_dst, ev_cpy);
return true;
}
}
enum ggml_status ggml_metal_graph_compute(ggml_metal_t ctx, struct ggml_cgraph * gf) {
// number of nodes encoded by the main thread (empirically determined)
const int n_main = 64;
@@ -530,6 +587,42 @@ void ggml_metal_graph_optimize(ggml_metal_t ctx, struct ggml_cgraph * gf) {
//printf("%s: graph optimize took %.3f ms\n", __func__, (ggml_time_us() - t_start) / 1000.0);
}
void ggml_metal_event_record(ggml_metal_t ctx, ggml_metal_event_t ev) {
@autoreleasepool {
id<MTLCommandQueue> queue = ggml_metal_device_get_queue(ctx->dev);
id<MTLCommandBuffer> cmd_buf = [queue commandBuffer];
ggml_metal_event_encode_signal(ev, cmd_buf);
[cmd_buf commit];
[ctx->cmd_bufs_ext addObject:cmd_buf];
ctx->cmd_buf_last = cmd_buf;
[cmd_buf retain];
}
}
void ggml_metal_event_wait(ggml_metal_t ctx, ggml_metal_event_t ev) {
@autoreleasepool {
id<MTLCommandQueue> queue = ggml_metal_device_get_queue(ctx->dev);
id<MTLCommandBuffer> cmd_buf = [queue commandBuffer];
ggml_metal_event_encode_wait(ev, cmd_buf);
[cmd_buf commit];
[ctx->cmd_bufs_ext addObject:cmd_buf];
ctx->cmd_buf_last = cmd_buf;
[cmd_buf retain];
}
}
ggml_metal_event_t ggml_metal_get_ev_cpy(ggml_metal_t ctx) {
return ctx->ev_cpy;
}
void ggml_metal_set_n_cb(ggml_metal_t ctx, int n_cb) {
if (ctx->n_cb != n_cb) {
ctx->n_cb = MIN(n_cb, GGML_METAL_MAX_COMMAND_BUFFERS);
+5 -3
View File
@@ -17,10 +17,12 @@ struct ggml_metal_device_deleter {
typedef std::unique_ptr<ggml_metal_device, ggml_metal_device_deleter> ggml_metal_device_ptr;
ggml_metal_device_t ggml_metal_device_get(void) {
static ggml_metal_device_ptr ctx { ggml_metal_device_init() };
ggml_metal_device_t ggml_metal_device_get(int device) {
static std::vector<ggml_metal_device_ptr> devs;
return ctx.get();
devs.emplace_back(ggml_metal_device_init(device));
return devs.back().get();
}
struct ggml_metal_pipelines {
+13 -3
View File
@@ -205,7 +205,9 @@ void ggml_metal_rsets_free(ggml_metal_rsets_t rsets);
//
struct ggml_metal_device_props {
int device;
char name[128];
char desc[128];
size_t max_buffer_size;
size_t max_working_set_size;
@@ -224,11 +226,15 @@ struct ggml_metal_device_props {
int op_offload_min_batch_size;
};
ggml_metal_device_t ggml_metal_device_init(void);
typedef struct ggml_metal_event * ggml_metal_event_t;
void ggml_metal_event_encode_signal(ggml_metal_event_t ev, ggml_metal_cmd_buf_t cmd_buf);
void ggml_metal_event_encode_wait (ggml_metal_event_t ev, ggml_metal_cmd_buf_t cmd_buf);
ggml_metal_device_t ggml_metal_device_init(int device);
void ggml_metal_device_free(ggml_metal_device_t dev);
// return a singleton that is automatically destroyed when the program exits
ggml_metal_device_t ggml_metal_device_get(void);
ggml_metal_device_t ggml_metal_device_get(int device);
void * ggml_metal_device_get_obj (ggml_metal_device_t dev); // id<MTLDevice>
void * ggml_metal_device_get_queue(ggml_metal_device_t dev); // id<MTLCommandQueue>
@@ -240,6 +246,10 @@ void ggml_metal_device_rsets_rm (ggml_metal_device_t dev, ggml_metal_rset_t rset
void ggml_metal_device_rsets_keep_alive(ggml_metal_device_t dev);
ggml_metal_event_t ggml_metal_device_event_init(ggml_metal_device_t dev);
void ggml_metal_device_event_free(ggml_metal_device_t dev, ggml_metal_event_t ev);
void ggml_metal_device_event_synchronize(ggml_metal_device_t dev, ggml_metal_event_t ev);
void ggml_metal_device_get_memory(ggml_metal_device_t dev, size_t * free, size_t * total);
bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_tensor * op);
+64 -7
View File
@@ -24,9 +24,6 @@
static const NSInteger MTLGPUFamilyMetal3_GGML = 5001;
static const NSInteger MTLGPUFamilyMetal4_GGML = 5002;
// virtual address for GPU memory allocations
static atomic_uintptr_t g_addr_device = 0x000000400ULL;
#if !GGML_METAL_EMBED_LIBRARY
// Here to assist with NSBundle Path Hack
@interface GGMLMetalClass : NSObject
@@ -523,6 +520,9 @@ struct ggml_metal_device {
ggml_metal_library_t library;
struct ggml_metal_device_props props;
// virtual address for GPU memory allocations
atomic_uintptr_t addr_virt;
};
//
@@ -618,7 +618,7 @@ void ggml_metal_rsets_free(ggml_metal_rsets_t rsets) {
free(rsets);
}
ggml_metal_device_t ggml_metal_device_init(void) {
ggml_metal_device_t ggml_metal_device_init(int device) {
ggml_metal_device_t dev = calloc(1, sizeof(struct ggml_metal_device));
assert(dev != NULL);
@@ -632,6 +632,9 @@ ggml_metal_device_t ggml_metal_device_init(void) {
GGML_LOG_ERROR("%s: error: failed to create command queue\n", __func__);
}
dev->addr_virt = 0x000000400ULL;
dev->props.device = device;
dev->props.has_simdgroup_reduction = [dev->mtl_device supportsFamily:MTLGPUFamilyApple7];
dev->props.has_simdgroup_reduction |= [dev->mtl_device supportsFamily:MTLGPUFamilyMetal3_GGML];
@@ -792,7 +795,8 @@ ggml_metal_device_t ggml_metal_device_init(void) {
dev->props.max_working_set_size = dev->mtl_device.maxBufferLength;
}
strncpy(dev->props.name, [[dev->mtl_device name] UTF8String], sizeof(dev->props.name) - 1);
snprintf(dev->props.name, sizeof(dev->props.name), "%s%d", "MTL", device);
snprintf(dev->props.desc, sizeof(dev->props.desc), "%s", [[dev->mtl_device name] UTF8String]);
dev->library = ggml_metal_library_init(dev);
if (!dev->library) {
@@ -922,6 +926,59 @@ void ggml_metal_device_rsets_keep_alive(ggml_metal_device_t dev) {
atomic_store_explicit(&dev->rsets->d_loop, 2*dev->rsets->keep_alive_s, memory_order_relaxed);
}
struct ggml_metal_event {
void * obj; // id<MTLEvent>
atomic_int value;
};
void ggml_metal_event_encode_signal(ggml_metal_event_t ev, ggml_metal_cmd_buf_t cmd_buf_raw) {
id<MTLEvent> event = (id<MTLEvent>)ev->obj;
id<MTLCommandBuffer> cmd_buf = (id<MTLCommandBuffer>) cmd_buf_raw;
[cmd_buf encodeSignalEvent:event value:atomic_fetch_add_explicit(&ev->value, 1, memory_order_relaxed) + 1];
}
void ggml_metal_event_encode_wait(ggml_metal_event_t ev, ggml_metal_cmd_buf_t cmd_buf_raw) {
id<MTLEvent> event = (id<MTLEvent>)ev->obj;
id<MTLCommandBuffer> cmd_buf = (id<MTLCommandBuffer>) cmd_buf_raw;
[cmd_buf encodeWaitForEvent:event value:atomic_load_explicit(&ev->value, memory_order_relaxed)];
}
ggml_metal_event_t ggml_metal_device_event_init(ggml_metal_device_t dev) {
id<MTLEvent> event = [dev->mtl_device newEvent];
ggml_metal_event_t ev = calloc(1, sizeof(struct ggml_metal_event));
ev->obj = (__bridge void *)event;
ev->value = 0;
return ev;
}
void ggml_metal_device_event_free(ggml_metal_device_t dev, ggml_metal_event_t ev) {
id<MTLEvent> event = ev->obj;
[event release];
free(ev);
GGML_UNUSED(dev);
}
void ggml_metal_device_event_synchronize(ggml_metal_device_t dev, ggml_metal_event_t ev) {
@autoreleasepool {
id<MTLEvent> event = ev->obj;
id<MTLCommandBuffer> cmd_buf = [dev->mtl_queue commandBuffer];
[cmd_buf encodeWaitForEvent:event value:atomic_load_explicit(&ev->value, memory_order_relaxed)];
[cmd_buf commit];
[cmd_buf waitUntilCompleted];
}
}
void ggml_metal_device_get_memory(ggml_metal_device_t dev, size_t * free, size_t * total) {
if (@available(macOS 10.12, iOS 16.0, *)) {
*total = dev->mtl_device.recommendedMaxWorkingSetSize;
@@ -1344,8 +1401,8 @@ ggml_metal_buffer_t ggml_metal_buffer_init(ggml_metal_device_t dev, size_t size,
res->all_data = ggml_metal_host_malloc(size_aligned);
res->is_shared = true;
} else {
// use virtual address from g_addr_device counter
res->all_data = (void *) atomic_fetch_add_explicit(&g_addr_device, size_aligned, memory_order_relaxed);
// use virtual address
res->all_data = (void *) atomic_fetch_add_explicit(&dev->addr_virt, size_aligned, memory_order_relaxed);
res->is_shared = false;
}
res->all_size = size_aligned;
+318 -108
View File
@@ -7,11 +7,12 @@
#include "ggml-metal-context.h"
#include "ggml-metal-ops.h"
// globals
#define GGML_METAL_NAME "MTL"
#define GGML_METAL_MAX_DEVICES 16
// initialized in ggml_backend_metal_reg
static ggml_backend_reg g_ggml_metal_reg;
static ggml_backend_device g_ggml_metal_device;
// number of Metal devices
// note: can be overriden with GGML_METAL_DEVICES env to simulate virtual devices
static int g_devices = 1;
////////////////////////////////////////////////////////////////////////////////
// backend interface
@@ -165,10 +166,28 @@ static ggml_backend_buffer_i ggml_backend_metal_buffer_private_i = {
/* .reset = */ NULL,
};
static bool ggml_backend_buffer_is_metal(ggml_backend_buffer_t buffer) {
return buffer->iface.free_buffer == ggml_backend_metal_buffer_shared_free_buffer ||
buffer->iface.free_buffer == ggml_backend_metal_buffer_private_free_buffer;
}
//
// buffer types
//
struct ggml_backend_metal_buffer_type {
int device;
std::string name;
};
struct ggml_backend_metal_buffer_type_deleter {
void operator()(ggml_backend_metal_buffer_type * ctx) const {
delete ctx;
}
};
typedef std::unique_ptr<ggml_backend_metal_buffer_type, ggml_backend_metal_buffer_type_deleter> ggml_backend_metal_buffer_type_ptr;
// common method for allocating shread or private Metal buffers
static ggml_backend_buffer_t ggml_backend_metal_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size, bool shared) {
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)buft->device->context;
@@ -218,9 +237,9 @@ static size_t ggml_backend_metal_buffer_type_get_alloc_size(ggml_backend_buffer_
// default (shared) buffer type
static const char * ggml_backend_metal_buffer_type_shared_get_name(ggml_backend_buffer_type_t buft) {
return "Metal";
ggml_backend_metal_buffer_type * ctx = (ggml_backend_metal_buffer_type *)buft->context;
GGML_UNUSED(buft);
return ctx->name.c_str();
}
static ggml_backend_buffer_t ggml_backend_metal_buffer_type_shared_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
@@ -249,29 +268,54 @@ static bool ggml_backend_metal_buffer_type_shared_is_host(ggml_backend_buffer_ty
GGML_UNUSED(buft);
}
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_shared(void) {
static ggml_backend_buffer_type ggml_backend_buffer_type_metal = {
/* .iface = */ {
/* .get_name = */ ggml_backend_metal_buffer_type_shared_get_name,
/* .alloc_buffer = */ ggml_backend_metal_buffer_type_shared_alloc_buffer,
/* .get_alignment = */ ggml_backend_metal_buffer_type_shared_get_alignment,
/* .get_max_size = */ ggml_backend_metal_buffer_type_shared_get_max_size,
/* .get_alloc_size = */ ggml_backend_metal_buffer_type_shared_get_alloc_size,
/* .is_host = */ ggml_backend_metal_buffer_type_shared_is_host,
},
/* .device = */ &g_ggml_metal_device,
/* .context = */ NULL,
};
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_shared(int device) {
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
return &ggml_backend_buffer_type_metal;
static std::vector<ggml_backend_buffer_type> bufts;
static std::vector<ggml_backend_metal_buffer_type_ptr> ctxs;
static bool initialized = false;
if (!initialized) {
bufts.reserve(g_devices);
ctxs.reserve(g_devices);
for (int i = 0; i < g_devices; ++i) {
ggml_backend_metal_buffer_type * raw_ctx =
new ggml_backend_metal_buffer_type {
/* .device = */ i,
/* .name = */ GGML_METAL_NAME + std::to_string(i),
};
ctxs.emplace_back(raw_ctx);
ggml_backend_buffer_type buft = {
/* .iface = */ {
/* .get_name = */ ggml_backend_metal_buffer_type_shared_get_name,
/* .alloc_buffer = */ ggml_backend_metal_buffer_type_shared_alloc_buffer,
/* .get_alignment = */ ggml_backend_metal_buffer_type_shared_get_alignment,
/* .get_max_size = */ ggml_backend_metal_buffer_type_shared_get_max_size,
/* .get_alloc_size = */ ggml_backend_metal_buffer_type_shared_get_alloc_size,
/* .is_host = */ ggml_backend_metal_buffer_type_shared_is_host,
},
/* .device = */ ggml_backend_reg_dev_get(ggml_backend_metal_reg(), i),
/* .context = */ raw_ctx,
};
bufts.emplace_back(buft);
}
initialized = true;
}
return &bufts[device];
}
// default (private) buffer type
static const char * ggml_backend_metal_buffer_type_private_get_name(ggml_backend_buffer_type_t buft) {
return "Metal_Private";
ggml_backend_metal_buffer_type * ctx = (ggml_backend_metal_buffer_type *)buft->context;
GGML_UNUSED(buft);
return ctx->name.c_str();
}
static ggml_backend_buffer_t ggml_backend_metal_buffer_type_private_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
@@ -300,29 +344,53 @@ static bool ggml_backend_metal_buffer_type_private_is_host(ggml_backend_buffer_t
GGML_UNUSED(buft);
}
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_private(void) {
static ggml_backend_buffer_type ggml_backend_buffer_type_metal = {
/* .iface = */ {
/* .get_name = */ ggml_backend_metal_buffer_type_private_get_name,
/* .alloc_buffer = */ ggml_backend_metal_buffer_type_private_alloc_buffer,
/* .get_alignment = */ ggml_backend_metal_buffer_type_private_get_alignment,
/* .get_max_size = */ ggml_backend_metal_buffer_type_private_get_max_size,
/* .get_alloc_size = */ ggml_backend_metal_buffer_type_private_get_alloc_size,
/* .is_host = */ ggml_backend_metal_buffer_type_private_is_host,
},
/* .device = */ &g_ggml_metal_device,
/* .context = */ NULL,
};
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_private(int device) {
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
return &ggml_backend_buffer_type_metal;
static std::vector<ggml_backend_buffer_type> bufts;
static std::vector<ggml_backend_metal_buffer_type_ptr> ctxs;
static bool initialized = false;
if (!initialized) {
bufts.reserve(g_devices);
ctxs.reserve(g_devices);
for (int i = 0; i < g_devices; ++i) {
ggml_backend_metal_buffer_type * raw_ctx = new ggml_backend_metal_buffer_type{
/* .device = */ i,
/* .name = */ GGML_METAL_NAME + std::to_string(i) + "_Private"
};
ctxs.emplace_back(raw_ctx);
ggml_backend_buffer_type buft = {
/* .iface = */ {
/* .get_name = */ ggml_backend_metal_buffer_type_private_get_name,
/* .alloc_buffer = */ ggml_backend_metal_buffer_type_private_alloc_buffer,
/* .get_alignment = */ ggml_backend_metal_buffer_type_private_get_alignment,
/* .get_max_size = */ ggml_backend_metal_buffer_type_private_get_max_size,
/* .get_alloc_size = */ ggml_backend_metal_buffer_type_private_get_alloc_size,
/* .is_host = */ ggml_backend_metal_buffer_type_private_is_host,
},
/* .device = */ ggml_backend_reg_dev_get(ggml_backend_metal_reg(), i),
/* .context = */ raw_ctx,
};
bufts.emplace_back(buft);
}
initialized = true;
}
return &bufts[device];
}
// mapped buffer type
static const char * ggml_backend_metal_buffer_type_mapped_get_name(ggml_backend_buffer_type_t buft) {
return "Metal_Mapped";
ggml_backend_metal_buffer_type * ctx = (ggml_backend_metal_buffer_type *)buft->context;
GGML_UNUSED(buft);
return ctx->name.c_str();
}
static ggml_backend_buffer_t ggml_backend_metal_buffer_type_mapped_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
@@ -352,31 +420,55 @@ static bool ggml_backend_metal_buffer_type_mapped_is_host(ggml_backend_buffer_ty
GGML_UNUSED(buft);
}
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_mapped(void) {
// note: not obvious, but this buffer type still needs to implement .alloc_buffer:
// https://github.com/ggml-org/llama.cpp/pull/15832#discussion_r2333177099
static ggml_backend_buffer_type ggml_backend_buffer_type_mapped_metal = {
/* .iface = */ {
/* .get_name = */ ggml_backend_metal_buffer_type_mapped_get_name,
/* .alloc_buffer = */ ggml_backend_metal_buffer_type_mapped_alloc_buffer,
/* .get_alignment = */ ggml_backend_metal_buffer_type_mapped_get_alignment,
/* .get_max_size = */ ggml_backend_metal_buffer_type_mapped_get_max_size,
/* .get_alloc_size = */ ggml_backend_metal_buffer_type_mapped_get_alloc_size,
/* .is_host = */ ggml_backend_metal_buffer_type_mapped_is_host,
},
/* .device = */ &g_ggml_metal_device,
/* .context = */ NULL,
};
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_mapped(int device) {
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
return &ggml_backend_buffer_type_mapped_metal;
static std::vector<ggml_backend_buffer_type> bufts;
static std::vector<ggml_backend_metal_buffer_type_ptr> ctxs;
static bool initialized = false;
if (!initialized) {
bufts.reserve(g_devices);
ctxs.reserve(g_devices);
for (int i = 0; i < g_devices; ++i) {
ggml_backend_metal_buffer_type * raw_ctx = new ggml_backend_metal_buffer_type{
/* .device = */ i,
/* .name = */ GGML_METAL_NAME + std::to_string(i) + "_Mapped"
};
ctxs.emplace_back(raw_ctx);
// note: not obvious, but this buffer type still needs to implement .alloc_buffer:
// https://github.com/ggml-org/llama.cpp/pull/15832#discussion_r2333177099
ggml_backend_buffer_type buft = {
/* .iface = */ {
/* .get_name = */ ggml_backend_metal_buffer_type_mapped_get_name,
/* .alloc_buffer = */ ggml_backend_metal_buffer_type_mapped_alloc_buffer,
/* .get_alignment = */ ggml_backend_metal_buffer_type_mapped_get_alignment,
/* .get_max_size = */ ggml_backend_metal_buffer_type_mapped_get_max_size,
/* .get_alloc_size = */ ggml_backend_metal_buffer_type_mapped_get_alloc_size,
/* .is_host = */ ggml_backend_metal_buffer_type_mapped_is_host,
},
/* .device = */ ggml_backend_reg_dev_get(ggml_backend_metal_reg(), i),
/* .context = */ raw_ctx,
};
bufts.emplace_back(buft);
}
initialized = true;
}
return &bufts[device];
}
// backend
static const char * ggml_backend_metal_name(ggml_backend_t backend) {
return "Metal";
ggml_metal_t ctx = (ggml_metal_t)backend->context;
GGML_UNUSED(backend);
return ggml_metal_get_name(ctx);
}
static void ggml_backend_metal_free(ggml_backend_t backend) {
@@ -409,12 +501,24 @@ static void ggml_backend_metal_get_tensor_async(ggml_backend_t backend, const gg
}
static bool ggml_backend_metal_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const ggml_tensor * src, ggml_tensor * dst) {
return false;
if (!ggml_backend_is_metal(backend_src) || !ggml_backend_is_metal(backend_dst)) {
return false;
}
GGML_UNUSED(backend_src);
GGML_UNUSED(backend_dst);
GGML_UNUSED(src);
GGML_UNUSED(dst);
if (!ggml_backend_buffer_is_metal(src->buffer) || !ggml_backend_buffer_is_metal(dst->buffer)) {
return false;
}
ggml_metal_t ctx_src = (ggml_metal_t)backend_src->context;
ggml_metal_t ctx_dst = (ggml_metal_t)backend_dst->context;
//ggml_backend_buffer_t buf_src = src->view_src ? src->view_src->buffer : src->buffer;
//ggml_backend_buffer_t buf_dst = dst->view_src ? dst->view_src->buffer : dst->buffer;
//ggml_metal_buffer_t buf_ctx_src = (ggml_metal_buffer_t)buf_src->context;
//ggml_metal_buffer_t buf_ctx_dst = (ggml_metal_buffer_t)buf_dst->context;
return ggml_metal_cpy_tensor_async(ctx_src, ctx_dst, src, dst);
}
static enum ggml_status ggml_backend_metal_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
@@ -423,6 +527,20 @@ static enum ggml_status ggml_backend_metal_graph_compute(ggml_backend_t backend,
return ggml_metal_graph_compute(ctx, cgraph);
}
static void ggml_backend_metal_event_record(ggml_backend_t backend, ggml_backend_event_t event) {
ggml_metal_t ctx = (ggml_metal_t)backend->context;
ggml_metal_event_t ev = (ggml_metal_event_t)event->context;
ggml_metal_event_record(ctx, ev);
}
static void ggml_backend_metal_event_wait(ggml_backend_t backend, ggml_backend_event_t event) {
ggml_metal_t ctx = (ggml_metal_t)backend->context;
ggml_metal_event_t ev = (ggml_metal_event_t)event->context;
ggml_metal_event_wait(ctx, ev);
}
static void ggml_backend_metal_graph_optimize(ggml_backend_t backend, ggml_cgraph * cgraph) {
ggml_metal_t ctx = (ggml_metal_t)backend->context;
@@ -435,7 +553,6 @@ static void ggml_backend_metal_set_n_cb(ggml_backend_t backend, int n_cb) {
ggml_metal_t ctx = (ggml_metal_t)backend->context;
ggml_metal_set_n_cb(ctx, n_cb);
}
static ggml_backend_i ggml_backend_metal_i = {
@@ -450,12 +567,8 @@ static ggml_backend_i ggml_backend_metal_i = {
/* .graph_plan_update = */ NULL,
/* .graph_plan_compute = */ NULL,
/* .graph_compute = */ ggml_backend_metal_graph_compute,
// the events API is needed only for multi-GPU setups, so likely no need to implement it for Metal
// in any case, these docs seem relevant if we ever decide to implement it:
// https://developer.apple.com/documentation/metal/mtlcommandbuffer#Synchronizing-Passes-with-Events
/* .event_record = */ NULL,
/* .event_wait = */ NULL,
/* .event_record = */ ggml_backend_metal_event_record,
/* .event_wait = */ ggml_backend_metal_event_wait,
/* .graph_optimize = */ ggml_backend_metal_graph_optimize,
};
@@ -519,15 +632,17 @@ void ggml_backend_metal_capture_next_compute(ggml_backend_t backend) {
// backend device
static const char * ggml_backend_metal_device_get_name(ggml_backend_dev_t dev) {
return "Metal";
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
GGML_UNUSED(dev);
const ggml_metal_device_props * props_dev = ggml_metal_device_get_props(ctx_dev);
return props_dev->name;
}
static const char * ggml_backend_metal_device_get_description(ggml_backend_dev_t dev) {
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
return ggml_metal_device_get_props(ctx_dev)->name;
return ggml_metal_device_get_props(ctx_dev)->desc;
}
static void ggml_backend_metal_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
@@ -550,14 +665,14 @@ static void ggml_backend_metal_device_get_props(ggml_backend_dev_t dev, ggml_bac
ggml_backend_metal_device_get_memory(dev, &props->memory_free, &props->memory_total);
props->caps = {
/* .async = */ true,
/* .host_buffer = */ false,
/* .buffer_from_host_ptr = */ true,
/* .events = */ false,
/* .async = */ true,
/* .host_buffer = */ false,
/* .buffer_from_host_ptr = */ true,
/* .events = */ true,
};
}
static ggml_backend_t ggml_backend_metal_device_init(ggml_backend_dev_t dev, const char * params) {
static ggml_backend_t ggml_backend_metal_device_init_backend(ggml_backend_dev_t dev, const char * params) {
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
ggml_metal_t ctx = ggml_metal_init(ctx_dev);
@@ -587,7 +702,7 @@ static ggml_backend_buffer_type_t ggml_backend_metal_device_get_buffer_type(ggml
const ggml_metal_device_props * props_dev = ggml_metal_device_get_props(ctx_dev);
return props_dev->use_shared_buffers ? ggml_backend_metal_buffer_type_shared() : ggml_backend_metal_buffer_type_private();
return props_dev->use_shared_buffers ? ggml_backend_metal_buffer_type_shared(props_dev->device) : ggml_backend_metal_buffer_type_private(props_dev->device);
}
static ggml_backend_buffer_t ggml_backend_metal_device_buffer_mapped(ggml_backend_dev_t dev, void * ptr, size_t size, size_t max_tensor_size) {
@@ -595,7 +710,9 @@ static ggml_backend_buffer_t ggml_backend_metal_device_buffer_mapped(ggml_backen
ggml_metal_buffer_t res = ggml_metal_buffer_map(ctx_dev, ptr, size, max_tensor_size);
return ggml_backend_buffer_init(ggml_backend_metal_buffer_type_mapped(), ggml_backend_metal_buffer_shared_i, res, size);
const ggml_metal_device_props * props_dev = ggml_metal_device_get_props(ctx_dev);
return ggml_backend_buffer_init(ggml_backend_metal_buffer_type_mapped(props_dev->device), ggml_backend_metal_buffer_shared_i, res, size);
}
static bool ggml_backend_metal_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
@@ -606,9 +723,10 @@ static bool ggml_backend_metal_device_supports_op(ggml_backend_dev_t dev, const
static bool ggml_backend_metal_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
return
buft->device == dev && (
buft->iface.get_name == ggml_backend_metal_buffer_type_shared_get_name ||
buft->iface.get_name == ggml_backend_metal_buffer_type_private_get_name ||
buft->iface.get_name == ggml_backend_metal_buffer_type_mapped_get_name;
buft->iface.get_name == ggml_backend_metal_buffer_type_mapped_get_name);
GGML_UNUSED(dev);
}
@@ -632,45 +750,97 @@ static bool ggml_backend_metal_device_offload_op(ggml_backend_dev_t dev, const g
get_op_batch_size(op) >= ggml_metal_device_get_props(ctx_dev)->op_offload_min_batch_size;
}
static ggml_backend_event_t ggml_backend_metal_device_event_new(ggml_backend_dev_t dev) {
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
ggml_metal_event_t event = ggml_metal_device_event_init(ctx_dev);
GGML_ASSERT(event);
ggml_backend_event_t ev = new ggml_backend_event {
/* .device = */ dev,
/* .context = */ event,
};
return ev;
}
static void ggml_backend_metal_device_event_free(ggml_backend_dev_t dev, ggml_backend_event_t event) {
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
ggml_metal_event_t ev = (ggml_metal_event_t)event->context;
ggml_metal_device_event_free(ctx_dev, ev);
delete event;
}
static void ggml_backend_metal_device_event_synchronize(ggml_backend_dev_t dev, ggml_backend_event_t event) {
ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
ggml_metal_event_t evt = (ggml_metal_event_t)event->context;
ggml_metal_device_event_synchronize(ctx_dev, evt);
}
static ggml_backend_device_i ggml_backend_metal_device_i = {
/* .get_name = */ ggml_backend_metal_device_get_name,
/* .get_description = */ ggml_backend_metal_device_get_description,
/* .get_memory = */ ggml_backend_metal_device_get_memory,
/* .get_type = */ ggml_backend_metal_device_get_type,
/* .get_props = */ ggml_backend_metal_device_get_props,
/* .init_backend = */ ggml_backend_metal_device_init,
/* .init_backend = */ ggml_backend_metal_device_init_backend,
/* .get_buffer_type = */ ggml_backend_metal_device_get_buffer_type,
/* .get_host_buffer_type = */ NULL,
/* .buffer_from_host_ptr = */ ggml_backend_metal_device_buffer_mapped,
/* .supports_op = */ ggml_backend_metal_device_supports_op,
/* .supports_buft = */ ggml_backend_metal_device_supports_buft,
/* .offload_op = */ ggml_backend_metal_device_offload_op,
/* .event_new = */ NULL,
/* .event_free = */ NULL,
/* .event_synchronize = */ NULL,
/* .event_new = */ ggml_backend_metal_device_event_new,
/* .event_free = */ ggml_backend_metal_device_event_free,
/* .event_synchronize = */ ggml_backend_metal_device_event_synchronize,
};
// backend registry
struct ggml_backend_metal_reg {
std::vector<ggml_backend_dev_t> devices;
};
typedef struct ggml_backend_metal_reg * ggml_backend_metal_reg_t;
static ggml_backend_metal_reg_t ggml_backend_metal_reg_init(void) {
ggml_backend_metal_reg_t ctx = new struct ggml_backend_metal_reg;
return ctx;
}
static void ggml_backend_metal_reg_free(ggml_backend_metal_reg_t ctx) {
delete ctx;
}
struct ggml_backend_metal_reg_deleter {
void operator()(ggml_backend_metal_reg_t ctx) {
ggml_backend_metal_reg_free(ctx);
}
};
typedef std::unique_ptr<struct ggml_backend_metal_reg, ggml_backend_metal_reg_deleter> ggml_backend_metal_reg_ptr;
static const char * ggml_backend_metal_reg_get_name(ggml_backend_reg_t reg) {
return "Metal";
return GGML_METAL_NAME;
GGML_UNUSED(reg);
}
static size_t ggml_backend_metal_reg_device_count(ggml_backend_reg_t reg) {
return 1;
GGML_UNUSED(reg);
ggml_backend_metal_reg_t ctx = (ggml_backend_metal_reg_t)reg->context;
return ctx->devices.size();
}
static ggml_backend_dev_t ggml_backend_metal_reg_device_get(ggml_backend_reg_t reg, size_t index) {
GGML_ASSERT(index == 0);
return &g_ggml_metal_device;
GGML_UNUSED(reg);
GGML_UNUSED(index);
ggml_backend_metal_reg_t ctx = (ggml_backend_metal_reg_t)reg->context;
GGML_ASSERT(index < ctx->devices.size());
return ctx->devices[index];
}
static ggml_backend_feature g_ggml_backend_metal_features[] = {
@@ -698,27 +868,67 @@ static void * ggml_backend_metal_get_proc_address(ggml_backend_reg_t reg, const
static ggml_backend_reg_i ggml_backend_metal_reg_i = {
/* .get_name = */ ggml_backend_metal_reg_get_name,
/* .device_count = */ ggml_backend_metal_reg_device_count,
/* .device_get = */ ggml_backend_metal_reg_device_get,
/* .get_device_count = */ ggml_backend_metal_reg_device_count,
/* .get_device = */ ggml_backend_metal_reg_device_get,
/* .get_proc_address = */ ggml_backend_metal_get_proc_address,
};
ggml_backend_reg_t ggml_backend_metal_reg(void) {
{
g_ggml_metal_reg = {
/* .api_version = */ GGML_BACKEND_API_VERSION,
/* .iface = */ ggml_backend_metal_reg_i,
/* .context = */ NULL,
};
static ggml_backend_dev_t ggml_backend_metal_device_init(ggml_backend_reg_t reg, int device) {
return new ggml_backend_device {
/* .iface = */ ggml_backend_metal_device_i,
/* .reg = */ reg,
/* .context = */ ggml_metal_device_get(device),
};
}
g_ggml_metal_device = {
/* .iface = */ ggml_backend_metal_device_i,
/* .reg = */ &g_ggml_metal_reg,
/* .context = */ ggml_metal_device_get(),
};
static void ggml_backend_metal_device_free(ggml_backend_dev_t dev) {
delete dev;
}
struct ggml_backend_device_deleter {
void operator()(ggml_backend_dev_t ctx) {
ggml_backend_metal_device_free(ctx);
}
};
typedef std::unique_ptr<ggml_backend_device, ggml_backend_device_deleter> ggml_backend_device_ptr;
ggml_backend_reg_t ggml_backend_metal_reg(void) {
static ggml_backend_reg reg;
static bool initialized = false;
{
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
const char * env = getenv("GGML_METAL_DEVICES");
if (env) {
g_devices = atoi(env);
}
static std::vector<ggml_backend_device_ptr> devs;
if (!initialized) {
static ggml_backend_metal_reg_ptr reg_ctx(ggml_backend_metal_reg_init());
for (int i = 0; i < g_devices; ++i) {
auto * dev = ggml_backend_metal_device_init(&reg, i);
devs.emplace_back(dev);
reg_ctx->devices.push_back(dev);
}
reg = {
/* .api_version = */ GGML_BACKEND_API_VERSION,
/* .iface = */ ggml_backend_metal_reg_i,
/* .context = */ reg_ctx.get(),
};
}
initialized = true;
}
return &g_ggml_metal_reg;
return &reg;
}
GGML_BACKEND_DL_IMPL(ggml_backend_metal_reg)
+1 -1
View File
@@ -3740,7 +3740,7 @@ static enum ggml_status ggml_backend_opencl_buffer_init_tensor(ggml_backend_buff
// Reuse extra of the parent tensor. The offset of this view tensor
// becomes `extra->offset + view_offs` and needs to be calculated when
// it is used. This changes is needed because of the change to
// ggml_alloc.c in https://github.com/ggerganov/llama.cpp/pull/7640.
// ggml_alloc.c in https://github.com/ggml-org/llama.cpp/pull/7640.
// `buffer` passed in here will always be `tensor->buffer`. It is OK
// to allocate extras from the same buffer context for ordinary
// intermediate tensors. But for views into kv cache tensors, doing so
+1 -1
View File
@@ -3390,7 +3390,7 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
// mmvq and mmq need the __dp4a instruction which is available for gen12+
// Workaround in https://github.com/ggerganov/llama.cpp/commit/95f84d5ce8b449a9b16009434aca800df504a02e
// Workaround in https://github.com/ggml-org/llama.cpp/commit/95f84d5ce8b449a9b16009434aca800df504a02e
use_mul_mat_q = use_mul_mat_q && (src0->type != GGML_TYPE_IQ2_XXS);
#ifdef SYCL_USE_XMX
use_mul_mat_q = use_mul_mat_q && (src1->ne[1] <= MMQ_MAX_BATCH_SIZE);
@@ -330,7 +330,7 @@ void string_to_spv_func(std::string name, std::string in_path, std::string out_p
std::vector<std::string> cmd = {GLSLC, "-fshader-stage=compute", target_env, in_path, "-o", out_path};
#endif
// disable spirv-opt for coopmat shaders for https://github.com/ggerganov/llama.cpp/issues/10734
// disable spirv-opt for coopmat shaders for https://github.com/ggml-org/llama.cpp/issues/10734
// disable spirv-opt for bf16 shaders for https://github.com/ggml-org/llama.cpp/issues/15344
// disable spirv-opt for rope shaders for https://github.com/ggml-org/llama.cpp/issues/16860
if (!coopmat && name.find("bf16") == std::string::npos && name.find("rope") == std::string::npos) {
+1 -1
View File
@@ -6562,7 +6562,7 @@ static void ggml_compute_backward(
case GGML_OP_DIAG_MASK_INF: {
if (src0_needs_grads) {
/* ggml_diag_mask_inf_impl() shouldn't be here */
/* ref: https://github.com/ggerganov/llama.cpp/pull/4203#discussion_r1412377992 */
/* ref: https://github.com/ggml-org/llama.cpp/pull/4203#discussion_r1412377992 */
const int n_past = ((const int32_t *) tensor->op_params)[0];
ggml_add_or_set(ctx, cgraph, isrc0, ggml_diag_mask_zero_impl(ctx, grad, n_past, false));
}
+1 -1
View File
@@ -233,7 +233,7 @@ int32_t llm_chat_apply_template(
llm_chat_template tmpl,
const std::vector<const llama_chat_message *> & chat,
std::string & dest, bool add_ass) {
// Taken from the research: https://github.com/ggerganov/llama.cpp/issues/5527
// Taken from the research: https://github.com/ggml-org/llama.cpp/issues/5527
std::stringstream ss;
if (tmpl == LLM_CHAT_TEMPLATE_CHATML) {
// chatml template
+1
View File
@@ -317,6 +317,7 @@ llama_context::llama_context(
auto dev_type = ggml_backend_dev_type(ggml_backend_get_device(backend.get()));
if (dev_type == GGML_BACKEND_DEVICE_TYPE_CPU) {
// ignore CPU backend
// TODO: should we ignore ACCEL types too?
continue;
}
auto * dev = ggml_backend_get_device(backend.get());
+1 -1
View File
@@ -195,7 +195,7 @@ struct llama_hparams {
uint32_t n_deepstack_layers = 0;
// needed by encoder-decoder models (e.g. T5, FLAN-T5)
// ref: https://github.com/ggerganov/llama.cpp/pull/8141
// ref: https://github.com/ggml-org/llama.cpp/pull/8141
llama_token dec_start_token_id = LLAMA_TOKEN_NULL;
uint32_t dec_n_layer = 0;
+4 -4
View File
@@ -90,7 +90,7 @@ static_assert(std::is_trivially_copyable<llm_symbol>::value, "llm_symbol is not
//
// SPM tokenizer
// original implementation:
// https://github.com/ggerganov/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4
// https://github.com/ggml-org/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4
//
struct llm_bigram_spm {
@@ -285,7 +285,7 @@ struct llm_tokenizer_bpe : llm_tokenizer {
// original regex from tokenizer.json
//"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
// adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
// adapted: https://github.com/ggml-org/llama.cpp/pull/6920#issuecomment-2080233989
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
};
break;
@@ -2390,7 +2390,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
// maintain a list of tokens that cause end-of-generation
// this is currently determined based on the token text, which is obviously not ideal
// ref: https://github.com/ggerganov/llama.cpp/issues/9606
// ref: https://github.com/ggml-org/llama.cpp/issues/9606
special_eog_ids.clear();
if (special_fim_pad_id != LLAMA_TOKEN_NULL && special_eog_ids.count(special_fim_pad_id) == 0) {
@@ -3079,7 +3079,7 @@ std::vector<llama_token> llama_vocab::impl::tokenize(
}
int32_t llama_vocab::impl::token_to_piece(llama_token token, char * buf, int32_t length, int32_t lstrip, bool special) const {
// ref: https://github.com/ggerganov/llama.cpp/pull/7587#discussion_r1620983843
// ref: https://github.com/ggml-org/llama.cpp/pull/7587#discussion_r1620983843
static const int attr_special = LLAMA_TOKEN_ATTR_UNKNOWN | LLAMA_TOKEN_ATTR_CONTROL;
const llama_token_attr attr = token_get_attr(token);
if (!special && (attr & attr_special)) {
+1 -1
View File
@@ -14,7 +14,7 @@ llm_build_deepseek2::llm_build_deepseek2(const llama_model & model, const llm_gr
const uint32_t kv_lora_rank = hparams.n_lora_kv;
// We have to pre-scale kq_scale and attn_factor to make the YaRN RoPE work correctly.
// See https://github.com/ggerganov/llama.cpp/discussions/7416 for detailed explanation.
// See https://github.com/ggml-org/llama.cpp/discussions/7416 for detailed explanation.
// And also: https://github.com/ggml-org/llama.cpp/pull/17945 [TAG_DEEPSEEK2_YARN_LOG_MUL_FIX]
// first cancel the adjustment from llama_hparams::yarn_attn_factor_adjust to get the original attn_factor
+1 -1
View File
@@ -1,4 +1,4 @@
// ref: https://github.com/ggerganov/llama.cpp/issues/4952#issuecomment-1892864763
// ref: https://github.com/ggml-org/llama.cpp/issues/4952#issuecomment-1892864763
#include <cstdio>
#include <string>
+1 -1
View File
@@ -290,7 +290,7 @@ static void power_iteration(
ggml_gallocr_free(allocr);
// TODO @ngxson : The output vector is randomly inverted
// Solution: https://github.com/ggerganov/llama.cpp/pull/8069#issuecomment-2185328171
// Solution: https://github.com/ggml-org/llama.cpp/pull/8069#issuecomment-2185328171
}
static void run_pca(
+1 -1
View File
@@ -190,7 +190,7 @@ struct lora_merge_ctx {
gguf_set_val_u32(ctx_out, "general.file_type", LLAMA_FTYPE_MOSTLY_F16);
// check if all lora adapters have the same tensors
// TODO: remove this when we can support merging subset of adapters. Ref: https://github.com/ggerganov/llama.cpp/pull/8607#discussion_r1686027777
// TODO: remove this when we can support merging subset of adapters. Ref: https://github.com/ggml-org/llama.cpp/pull/8607#discussion_r1686027777
static const char * err_no_subset_adapter = "Input adapters do not have the same list of tensors. This is not yet supported. Please merge the adapter one-by-one instead of merging all at once.";
if (adapters.size() > 1) {
for (size_t i = 1; i < adapters.size(); ++i) {
+1 -1
View File
@@ -29,7 +29,7 @@ In addition to the KL divergence the following statistics are calculated with `-
* Mean change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse.
* Pearson correlation coefficient of the "correct" token probabilites between models.
* Percentiles of change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse. Can be used to judge noise vs. quality loss from quantization. If the percentiles are symmetric then the quantization is essentially just adding noise. If the negative values are significantly larger than the positive values then this indicates that the model is actually becoming worse from the quantization.
* The root mean square of the change in token probabilities. If you were to assume that the quantization simply causes Gaussian noise on the token probabilities then this would be the standard deviation of said noise. The uncertainty on the value is calculated that the change in token probabilities follows a Gaussian distribution. Related discussion: https://github.com/ggerganov/llama.cpp/discussions/2875 .
* The root mean square of the change in token probabilities. If you were to assume that the quantization simply causes Gaussian noise on the token probabilities then this would be the standard deviation of said noise. The uncertainty on the value is calculated that the change in token probabilities follows a Gaussian distribution. Related discussion: https://github.com/ggml-org/llama.cpp/discussions/2875 .
* Same top p: Percentage of how often the token was assigned the highest probabilites by both models. The uncertainty is calculated from the Gaussian approximation of the binomial distribution.
## LLaMA 3 8b Scoreboard
+1 -1
View File
@@ -1096,7 +1096,7 @@ return html`
</section>
<footer>
<p><${ModelGenerationInfo} /></p>
<p>Powered By <a href="https://github.com/ggerganov/llama.cpp#readme" target="_blank">llama.cpp</a> and <a href="https://ggml.ai/" target="_blank">ggml.ai</a></p>
<p>Powered By <a href="https://github.com/ggml-org/llama.cpp#readme" target="_blank">llama.cpp</a> and <a href="https://ggml.ai/" target="_blank">ggml.ai</a></p>
</footer>
</div>
`;
+1 -1
View File
@@ -1281,7 +1281,7 @@
<footer>
<p><${ModelGenerationInfo} /></p>
<p>Powered by <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> and <a href="https://ggml.ai">ggml.ai</a>.</p>
<p>Powered by <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> and <a href="https://ggml.ai">ggml.ai</a>.</p>
</footer>
</div>
`;
@@ -1,5 +1,5 @@
/* Author: Yazan Agha-Schrader */
/* Inspiration from llama.cpp logo/banner https://github.com/ggerganov/llama.cpp#readme */
/* Inspiration from llama.cpp logo/banner https://github.com/ggml-org/llama.cpp#readme */
.theme-mangotango {
+1 -1
View File
@@ -1032,7 +1032,7 @@
<footer>
<p><${ModelGenerationInfo} /></p>
<p>Powered by <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> and <a href="https://ggml.ai">ggml.ai</a>.</p>
<p>Powered by <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> and <a href="https://ggml.ai">ggml.ai</a>.</p>
</footer>
</div>
`;
+1 -1
View File
@@ -1036,7 +1036,7 @@
<footer>
<p><${ModelGenerationInfo} /></p>
<p>Powered by <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> and <a href="https://ggml.ai">ggml.ai</a>.</p>
<p>Powered by <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> and <a href="https://ggml.ai">ggml.ai</a>.</p>
</footer>
</div>
`;