Support Needed: `ValueError: No available memory for the cache blocks` with Mistral-Nemo-12B-Instruct on NVIDIA GeForce RTX 4090 (16GB) in Docker

Hello everyone,

I’m trying to load and run the Mistral-Nemo-12B-Instruct model on my NVIDIA GeForce RTX 4090 (16 GB GDDR6) in a Docker container on a Windows 11 Pro laptop. I’m using the latest Docker release and running into a GPU memory allocation issue.

Issue

When I attempt to initialize the model, I receive the following error:

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.


I’ve monitored the GPU with `nvidia-smi`, which reports nearly all of the memory in use (16,022 MiB of 16,376 MiB) while GPU utilization sits at only 3%:

Sat Nov 9 12:02:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 …     WDDM |   00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8              6W /  175W |   16022MiB /  16376MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


Below is the error trace from the model initialization attempt:

INFO 2024-11-10 16:38:01.642 model_runner.py:692] Loading model weights took 22.8445 GB
INFO 2024-11-10 16:38:11.634 distributed_gpu_executor.py:56] # GPU blocks: 0, # CPU blocks: 1638
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 722, in <module>
[rank0]:     engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
[rank0]:   File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 32, in from_engine_args
[rank0]:     engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]:     self._run_workers("initialize_cache",
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 178, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 367, in raise_if_cache_size_invalid
[rank0]:     raise ValueError("No available memory for the cache blocks. "
[rank0]: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
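
For context, my understanding is that vLLM first loads the model weights and then carves whatever remains of its memory budget (total VRAM × `gpu_memory_utilization`) into fixed-size KV cache blocks, so the `# GPU blocks: 0` line seems to follow directly from the weights alone exceeding that budget. Here is a rough sketch of that arithmetic (simplified, not the engine's exact accounting; the figures come from the log and `nvidia-smi` output above):

```python
# Rough sketch of vLLM's KV cache budgeting (my simplified understanding,
# not the engine's exact accounting).
total_vram_gb = 16.376            # from nvidia-smi: 16376 MiB, treated loosely as GB
gpu_memory_utilization = 0.9      # vLLM's default fraction of VRAM it may use
weights_gb = 22.8445              # "Loading model weights took 22.8445 GB" above

budget_gb = total_vram_gb * gpu_memory_utilization
left_for_kv_cache_gb = budget_gb - weights_gb
print(f"budget {budget_gb:.2f} GB, left for KV cache {left_for_kv_cache_gb:.2f} GB")
# budget 14.74 GB, left for KV cache -8.11 GB -> nothing left, hence 0 GPU blocks
```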

Steps I’ve Taken

- **Set GPU Memory Utilization**: Attempted to set `gpu_memory_utilization` to `1.0` to maximize usage (see the sketch after this list).
- **Reduced CPU Blocks and Increased GPU Blocks**: Adjusted the `NUM_GPU_BLOCKS` and `NUM_CPU_BLOCKS` parameters to prioritize GPU over CPU.
- **Checked Chunked Prefill Setting**: Verified that chunked prefill was disabled, as recommended for large models.
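
For reference, here is roughly how I have been trying to pass the setting. This is a minimal sketch against plain vLLM; since the NIM container wraps vLLM, I’ve assumed the same engine argument applies there (the model identifier below is just a placeholder for the packaged model):

```python
# Minimal sketch with plain vLLM; the NIM container may expose the same
# knob differently (e.g. via an environment variable).
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # placeholder identifier
    gpu_memory_utilization=1.0,  # fraction of VRAM vLLM may claim (default 0.9)
    max_model_len=4096,          # a shorter context also shrinks the KV cache requirement
)
```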

Questions

1. Is there a specific configuration of `gpu_memory_utilization` and block settings that could better allocate memory for this large model on my 16 GB GPU?
2. Would enabling or adjusting any specific environment variables or model settings improve memory allocation for the cache blocks on this setup?
3. Are there known compatibility issues with running this model in Docker on Windows 11, or are there additional memory management configurations recommended for this setup?

Any guidance or suggestions to address this memory error would be greatly appreciated!

 Thank you for your help.

Unfortunately, I think you’ll have trouble running this model with only 16 GB of GPU memory. With NIM we only support FP16 on 4090 GPUs at the moment, which means you’d need upwards of ~25 GB of GPU memory just to load the model weights. I’d recommend checking out build.nvidia.com to see if there are any smaller models that might work for your use case.
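
To put a rough number on that (weights only, ignoring KV cache and activation overhead; the parameter count is approximate):

```python
# Back-of-envelope FP16 weight footprint for a ~12B-parameter model.
params = 12.2e9        # approximate parameter count for Mistral-Nemo-12B
bytes_per_param = 2    # FP16
print(f"~{params * bytes_per_param / 1e9:.1f} GB for the weights alone")  # ~24.4 GB
```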
