Problem with installation of Llama 3.1 8b NIM

pablopagotto · September 3, 2024, 6:28pm

We have a problem with the installation of LLama 3.1 8b NIM.
I installed in a g6.4xlarge Instance of AWS
I installed Docker, Cuda, NVIDIA, NVIDIA Container Toolkit without errors.
I generated the API Key. Then I login to nvcr.io with the user and the API Key without problem,
When I execute the comamand :
“export NGC_API_KEY=
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p “$LOCAL_NIM_CACHE”
docker run -it --rm
–gpus all
–shm-size=16GB
-e NGC_API_KEY
-v “$LOCAL_NIM_CACHE:/opt/nim/.cache”
-u $(id -u)
-p 8000:8000
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2”
I have these output:
“ERROR 09-02 19:10:39.451 worker_base.py:382] Error executing method initialize_cache. This might cause deadlock in distributed execution.
Traceback (most recent call last):
File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py”, line 374, in execute_method
return executor(*args, **kwargs)
File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py”, line 212, in initialize_cache
raise_if_cache_size_invalid(num_gpu_blocks,
File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py”, line 372, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model’s max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (30544). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
[rank0]: Traceback (most recent call last):
[rank0]: File “/usr/lib/python3.10/runpy.py”, line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File “/usr/lib/python3.10/runpy.py”, line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File “/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py”, line 703, in
[rank0]: engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
[rank0]: File “/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py”, line 33, in from_engine_args
[rank0]: engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py”, line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py”, line 380, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py”, line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py”, line 265, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py”, line 377, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py”, line 62, in initialize_cache
[rank0]: self._run_workers(“initialize_cache”,
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py”, line 349, in _run_workers
[rank0]: self.driver_worker.execute_method(method, *driver_args,
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py”, line 383, in execute_method
[rank0]: raise e
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py”, line 374, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py”, line 212, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File “/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py”, line 372, in raise_if_cache_size_invalid
[rank0]: raise ValueError(
[rank0]: ValueError: The model’s max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (30544). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.”

I dont know where can I change the max_model_len or the gpu_memory_utilization to solve the problem.
Thank for advance.

neal.vaidya · September 4, 2024, 4:02am

Hi @pablopagotto – at the moment the best way to set the max_model_len parameter is to override the container entry point and pass it in as a command line flag, along the lines of this command:

docker run -it --rm \
–gpus all \
–shm-size=16GB \
-e NGC_API_KEY \
-v “$LOCAL_NIM_CACHE:/opt/nim/.cache” \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2 \
python3 -m vllm_nvext.entrypoints.openai.api_server --max-model-len 30544

Modifying the gpu_memory_utilization can be done with a similar flag, but in this situation I don’t think it would have much impact.

Topic		Replies	Views
Reusing a stored model (llama-3.1-8b-instruct) with a proper profile Models nim , llama-31-8b-instruct , llama	0	159	October 30, 2024
/opt/nim/start-server.sh: line 61: 32 Killed python3 -m vllm_nvext.entrypoints.openai.api_server Container: CUDA	0	275	July 9, 2024
NIM API key not Found Models nim , llama-31-8b-instruct , llama	4	618	September 21, 2024
NVIDIA NIM Container with CUDA out of Memory Problem Docker and NVIDIA Docker cuda , ubuntu , docker , nim , llama3-8b-instruct	2	544	September 20, 2024
Getting Started With NVIDIA NIM Tutorial Issues with NGC Registry Access/Accounts ubuntu , nim , llm , llama3-8b-instruct	7	1584	July 24, 2024
Issues while starting NIM container in A10 VM Models nim , llama3-8b-instruct	4	154	September 4, 2024
401 unauthorized access Visual AI Agent nim , llama-31-70b-instruct , llama	12	88	April 28, 2025
Unable to Run NIM on H100 GPU Due to Profile Compatibility Issue Despite Sufficient GPU Resources Models nim , llama-31-8b-instruct , llama	1	214	November 12, 2024
Model says there is a compatible profile but fails on data type Models nim , mistral-7b-instruct-v03	4	683	August 21, 2024
NIM nim/meta/llama3-8b-instruct - no API key is detected NGC GPU Cloud	2	806	July 23, 2024

Problem with installation of Llama 3.1 8b NIM

Related topics