server : free draft/MTP resources on sleep to fix VRAM leak (#23461)

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-09 07:16:44 +02:00

The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).

For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.

Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.
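
The effect of the fix can be sketched with stand-in types. This is a minimal illustration, not the actual llama.cpp code: `vram_tracker`, `fake_resource`, `fake_server`, `vram_while_sleeping`, and the byte sizes are all hypothetical, and only the member names `spec`, `ctx_dft`, `model_dft`, and `llama_init` are taken from the commit. It assumes each resource is owned by a smart pointer whose `reset()` releases it, and that dependents should be released before the objects they depend on.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical accounting for GPU memory; not part of llama.cpp.
struct vram_tracker {
    std::size_t              bytes_in_use = 0;
    std::vector<std::string> freed; // resource names, in release order
};

// Hypothetical resource: "allocates" in the constructor, frees in the destructor.
struct fake_resource {
    vram_tracker & vram;
    std::string    name;
    std::size_t    size;

    fake_resource(vram_tracker & v, std::string n, std::size_t s)
        : vram(v), name(std::move(n)), size(s) {
        vram.bytes_in_use += size;
    }
    ~fake_resource() {
        vram.bytes_in_use -= size;
        vram.freed.push_back(name);
    }
    fake_resource(const fake_resource &) = delete;
    fake_resource & operator=(const fake_resource &) = delete;
};

// Hypothetical stand-in for the server context; sizes are arbitrary units.
struct fake_server {
    vram_tracker & vram;
    std::unique_ptr<fake_resource> llama_init; // main model + context
    std::unique_ptr<fake_resource> model_dft;  // draft model
    std::unique_ptr<fake_resource> ctx_dft;    // draft context (KV cache, buffers)
    std::unique_ptr<fake_resource> spec;       // speculative decoder

    explicit fake_server(vram_tracker & v) : vram(v) {}

    void load() {
        llama_init = std::make_unique<fake_resource>(vram, "llama_init", 1000);
        model_dft  = std::make_unique<fake_resource>(vram, "model_dft", 300);
        ctx_dft    = std::make_unique<fake_resource>(vram, "ctx_dft", 200);
        spec       = std::make_unique<fake_resource>(vram, "spec", 50);
    }

    // Behavior described as the bug: only the main model/context is released.
    void destroy_before_fix() {
        llama_init.reset();
    }

    // Behavior described as the fix: draft resources first, dependents before
    // the objects they use, then the main model/context.
    void destroy_after_fix() {
        spec.reset();
        ctx_dft.reset();
        model_dft.reset();
        llama_init.reset();
    }
};

// Returns the simulated VRAM still held right after entering the sleeping state.
std::size_t vram_while_sleeping(bool fixed) {
    vram_tracker vram;
    fake_server  srv(vram);
    srv.load();
    if (fixed) {
        srv.destroy_after_fix();
    } else {
        srv.destroy_before_fix();
    }
    return vram.bytes_in_use;
}

// Usage example: compare both variants.
void run_sketch() {
    assert(vram_while_sleeping(false) == 550); // draft resources still allocated
    assert(vram_while_sleeping(true) == 0);    // everything released
}
```

In this sketch the unfixed variant keeps 550 units allocated while sleeping (300 + 200 + 50 for the draft model, draft context, and decoder), while the fixed variant releases everything in the order `spec`, `ctx_dft`, `model_dft`, `llama_init`.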

ref: https://github.com/ggml-org/llama.cpp/issues/23395

Assisted-by: llama.cpp:local pi

Aman Gupta, 2026-05-21 16:11:11 +08:00 (committed by GitHub)

parent c9021714e8

commit 52fb93a2bd

1 changed file with 4 additions and 0 deletions

									
tools/server/server-context.cpp (+4)
												
@@ -701,6 +701,10 @@ private:
     bool sleeping = false;
     void destroy() {
+        spec.reset();
+        ctx_dft.reset();
+        model_dft.reset();
         llama_init.reset();
         ctx_tgt = nullptr;