Problem with installation of Llama 3.1 8b NIM

We have a problem installing the Llama 3.1 8B NIM.
I installed it on an AWS g6.4xlarge instance.
I installed Docker, CUDA, the NVIDIA drivers, and the NVIDIA Container Toolkit without errors.
I generated the API key, then logged in to nvcr.io with the user and the API key without any problem.
When I execute the command:
export NGC_API_KEY=
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2
I get this output:
ERROR 09-02 19:10:39.451 worker_base.py:382] Error executing method initialize_cache. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 374, in execute_method
    return executor(*args, **kwargs)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 212, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 372, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (30544). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 703, in <module>
[rank0]:     engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
[rank0]:   File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
[rank0]:     engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]:     self._run_workers("initialize_cache",
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 349, in _run_workers
[rank0]:     self.driver_worker.execute_method(method, *driver_args,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 383, in execute_method
[rank0]:     raise e
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 374, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 372, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (30544). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

I don't know where I can change max_model_len or gpu_memory_utilization to solve the problem.
Thanks in advance.

Hi @pablopagotto – at the moment the best way to set the max_model_len parameter is to override the container entry point and pass it in as a command line flag, along the lines of this command:

docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2 \
python3 -m vllm_nvext.entrypoints.openai.api_server --max-model-len 30544

Modifying the gpu_memory_utilization can be done with a similar flag, but in this situation I don’t think it would have much impact.
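
For anyone wondering why raising gpu_memory_utilization is unlikely to help here, a rough back-of-the-envelope estimate of the KV cache size is sketched below. It assumes the published Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128) and a 2-byte (fp16/bf16) KV cache; the element size is an assumption, not something read from the NIM container, and the function names are just illustrative.

```python
# Rough KV-cache sizing for Llama 3.1 8B (a sketch, not how vLLM computes it).
NUM_LAYERS = 32      # transformer layers in Llama 3.1 8B
NUM_KV_HEADS = 8     # grouped-query attention: 8 key/value heads
HEAD_DIM = 128       # dimension of each attention head
BYTES_PER_ELEM = 2   # ASSUMPTION: fp16/bf16 KV cache

GIB = 1024 ** 3


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies (K and V tensors in every layer)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_gib(tokens: int) -> float:
    """GiB of KV cache needed to hold `tokens` tokens."""
    return tokens * kv_bytes_per_token() / GIB


if __name__ == "__main__":
    # The two token counts come from the error message above.
    for label, tokens in [("max seq len", 131072), ("fits in cache", 30544)]:
        print(f"{label}: {tokens} tokens -> {kv_gib(tokens):.2f} GiB")
```

Under these assumptions a single 131072-token sequence needs about 16 GiB of KV cache, while the 30544 tokens reported in the error correspond to roughly 3.7 GiB. The g6.4xlarge has a single 24 GB L4 GPU, and the 8B weights alone take around 16 GB at 16-bit precision, so even with utilization set close to 1.0 there is not enough room left for the full context. Lowering max_model_len is the practical fix.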