Hello everyone,
I’m trying to load and run the Mistral-Nemo-12B-Instruct model on my NVIDIA GeForce RTX 4090 (16 GB GDDR6) in a Docker container on a Windows 11 Pro laptop. I’m using the latest Docker release and am hitting a GPU memory allocation error when the engine initializes.
Issue
When I attempt to initialize the model, I receive the following error:
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
I’ve monitored the GPU with `nvidia-smi`, and it reports the memory as nearly full (16,022 MiB of 16,376 MiB used) while GPU utilization sits at only 3%:
Sat Nov 9 12:02:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Driver-Model  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 …    WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8              6W / 175W  |   16022MiB / 16376MiB  |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Below are further details on the error trace from the model initialization attempt:
INFO 2024-11-10 16:38:01.642 model_runner.py:692] Loading model weights took 22.8445 GB
INFO 2024-11-10 16:38:11.634 distributed_gpu_executor.py:56] # GPU blocks: 0, # CPU blocks: 1638
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 722, in <module>
[rank0]: engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
[rank0]: File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 32, in from_engine_args
[rank0]: engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]: self._run_workers("initialize_cache",
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 178, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 367, in raise_if_cache_size_invalid
[rank0]: raise ValueError("No available memory for the cache blocks. "
[rank0]: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Steps I’ve Taken
- **Set GPU Memory Utilization**: Attempted to set `gpu_memory_utilization` to `1.0` to maximize usage (see the sketch after this list).
- **Reduced CPU Blocks and Increased GPU Blocks**: Adjusted the `NUM_GPU_BLOCKS` and `NUM_CPU_BLOCKS` parameters to prioritize GPU blocks over CPU blocks.
- **Checked Chunked Prefill Setting**: Verified that chunked prefill is disabled, as recommended for large models.
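To make the above concrete, here is roughly what I am trying to tune, expressed as plain vLLM engine arguments. This is only an illustrative sketch: the NIM container drives vLLM through its own entry point rather than through this API, and the Hugging Face model ID below is a stand-in for the weights the container ships with.

```python
# Illustrative sketch only: I actually launch the model via the NIM container's
# OpenAI server, not this script, and the model ID is a placeholder.
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # placeholder model ID
    gpu_memory_utilization=1.0,    # step 1: let the engine use as much of the 16 GB as possible
    enable_chunked_prefill=False,  # step 3: chunked prefill left disabled
    # max_model_len also bounds how many KV-cache blocks the engine must fit;
    # I have not changed it yet, but e.g. max_model_len=8192 would lower that bound.
)
```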
Questions
1. Is there a specific combination of `gpu_memory_utilization` and block settings that would allocate memory more effectively for this model on my 16 GB GPU?
2. Would enabling or adjusting any particular environment variables or model settings improve memory allocation for the cache blocks on this setup?
3. Are there known compatibility issues for this model with Docker on Windows 11, or additional memory-management configurations recommended for this kind of setup?
Any guidance or suggestions to address this memory error would be greatly appreciated!
Thank you for your help.