Reusing a stored model (llama-3.1-8b-instruct) with a proper profile

First I run a NIM container with the llama-3.1-8b-instruct model, and it loads and runs successfully.

export CONTAINER_NAME=Llama3_1-8B-Instruct
Repository=nim/meta/llama-3.1-8b-instruct
Latest_Tag=1.2.2
export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

The container starts up and uses both of the two available GPUs.
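
To double-check, from a second terminal I query the OpenAI-compatible endpoint on port 8000 and look at the GPUs (just my own sanity check of the running container):

# list the served models and confirm both GPUs are in use
curl -s http://localhost:8000/v1/models | python3 -m json.tool
nvidia-smi
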
If I check the available profiles, I see that a suitable one exists and is picked correctly:

docker run -it --rm \
	--gpus all \
	-e NGC_API_KEY \
	-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
	$IMG_NAME list-model-profiles
...
MODEL PROFILES
- Compatible with system and runnable:
  - 6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d (vllm-bf16-tp2)
  - 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1)
  - With LoRA support:
    - a95e5c7221dae587b4fc32448df265320ce79064a970297649d97a84eb9dc3ba (vllm-bf16-tp2-lora)
    - dfd9bee71abb7582246f7fb8c2aedd9119909b9639e1b4b0260ef6865545ede7 (vllm-bf16-tp1-lora)
...
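
As I understand it, the selected profile could also be pinned explicitly via NIM_MODEL_PROFILE using the full hash from the listing above; a sketch of how I would launch the cached model that way (I have not needed this so far, but it becomes relevant below):

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE=6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
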

Then I store the cached model locally:

export MODEL_STORE=/tmp/model_store
mkdir -p $MODEL_STORE
sudo chmod -R 777 $MODEL_STORE

docker run -it --rm \
	--gpus all \
	-e NGC_API_KEY \
	-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
	-v $MODEL_STORE:$MODEL_STORE \
	$IMG_NAME create-model-store \
	-p 6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d \
	-m $MODEL_STORE

and it is stored as expected:

ls $MODEL_STORE
checksums.blake3        model-00001-of-00004.safetensors  model.safetensors.index.json  tokenizer.json
config.json             model-00002-of-00004.safetensors  NOTICE.txt                    tool_use_config.json
generation_config.json  model-00003-of-00004.safetensors  special_tokens_map.json
LICENSE.txt             model-00004-of-00004.safetensors  tokenizer_config.json
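
As a sanity check on the exported files (the layout looks like a standard Hugging Face checkpoint), I also confirm the model's configured context length, which matters for the error further down:

grep max_position_embeddings $MODEL_STORE/config.json
# expecting "max_position_embeddings": 131072 for this model
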

And now I try to run a NIM container using this locally stored model:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=$MODEL_STORE \
  -e NIM_SERVED_MODEL_NAME=llama-3.1-8b \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v $MODEL_STORE:$MODEL_STORE \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

This fails with the following error:

RuntimeError: The number of CUDA devices has changed since the first call to torch.cuda.device_count(). This is not allowed and may result in undefined behavior. Please check out https://github.com/vllm-project/vllm/issues/6056 to find the first call to torch.cuda.device_count() and defer it until the engine is up. Or you can set CUDA_VISIBLE_DEVICES to the GPUs you want to use.

As the error suggests, I rerun with CUDA_VISIBLE_DEVICES set:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=$MODEL_STORE \
  -e NIM_SERVED_MODEL_NAME=llama-3.1-8b \
  -e CUDA_VISIBLE_DEVICES="0,1" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v $MODEL_STORE:$MODEL_STORE \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

and get another error:

[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (42480). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

I also notice that the launch output (INFO 2024-10-30 13:25:38.353 launch.py:92] running command ['/opt/nim/llm/.venv/bin/python3' ...) contains "tensor_parallel_size": 1, which is wrong, as I expect it to be equal to 2.
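
For reference, this is how I check it from a second terminal while the container is still up, in case I am misreading the startup log:

docker logs $CONTAINER_NAME 2>&1 | grep -o '"tensor_parallel_size": [0-9]*'
# prints "tensor_parallel_size": 1, even though both GPUs are visible
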

In an attempt to fix this I add -e NIM_MODEL_PROFILE="vllm-bf16-tp2", but to no avail: it still passes "tensor_parallel_size": 1 and crashes with

[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (42480). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
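
For completeness, that attempt looks roughly like this (the only change from the previous run is the added NIM_MODEL_PROFILE variable):

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=$MODEL_STORE \
  -e NIM_SERVED_MODEL_NAME=llama-3.1-8b \
  -e NIM_MODEL_PROFILE="vllm-bf16-tp2" \
  -e CUDA_VISIBLE_DEVICES="0,1" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v $MODEL_STORE:$MODEL_STORE \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
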

I have also tried NIM_TENSOR_PARALLEL_SIZE, but, as already clarified in The intended usage of NIM_TENSOR_PARALLEL_SIZE, it is not supposed to work for this.
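
As a stopgap for the KV-cache error itself (not for the wrong tensor_parallel_size), I assume the context length could be capped at startup, e.g. with NIM_MAX_MODEL_LEN, though I have not verified that this variable is honored when loading from a local model store:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=$MODEL_STORE \
  -e NIM_SERVED_MODEL_NAME=llama-3.1-8b \
  -e NIM_MAX_MODEL_LEN=32768 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v $MODEL_STORE:$MODEL_STORE \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
# 32768 is below the reported 42480-token KV-cache limit, so it should fit on a single GPU
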

So, in the end, is there a way to start a NIM container from a locally stored model with the proper profile and the expected tensor_parallel_size value?