First I run a NIM container with the llama-3.1-8b-instruct model, and it loads and runs successfully:
export CONTAINER_NAME=Llama3_1-8B-Instruct
Repository=nim/meta/llama-3.1-8b-instruct
Latest_Tag=1.2.2
export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
The container starts up and uses both of the 2 available GPUs.
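For reference, a quick sanity check of the running server through the OpenAI-compatible API on the published port 8000 looks roughly like this (the model name is the one the container should report under /v1/models; adjust it if yours differs):
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'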
If I check the available model profiles, I see that there is one matching this setup and it is picked correctly:
docker run -it --rm \
--gpus all \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
$IMG_NAME list-model-profiles
...
MODEL PROFILES
- Compatible with system and runnable:
- 6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d (vllm-bf16-tp2)
- 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1)
- With LoRA support:
- a95e5c7221dae587b4fc32448df265320ce79064a970297649d97a84eb9dc3ba (vllm-bf16-tp2-lora)
- dfd9bee71abb7582246f7fb8c2aedd9119909b9639e1b4b0260ef6865545ede7 (vllm-bf16-tp1-lora)
...
Then I store the cached model locally:
export MODEL_STORE=/tmp/model_store
mkdir -p $MODEL_STORE
sudo chmod -R 777 $MODEL_STORE
docker run -it --rm \
--gpus all \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-v $MODEL_STORE:$MODEL_STORE \
$IMG_NAME create-model-store \
-p 6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d \
-m $MODEL_STORE
and it is stored as expected:
ls $MODEL_STORE
checksums.blake3 model-00001-of-00004.safetensors model.safetensors.index.json tokenizer.json
config.json model-00002-of-00004.safetensors NOTICE.txt tool_use_config.json
generation_config.json model-00003-of-00004.safetensors special_tokens_map.json
LICENSE.txt model-00004-of-00004.safetensors tokenizer_config.json
And now I try to run a NIM container using this locally stored model:
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=$MODEL_STORE \
-e NIM_SERVED_MODEL_NAME=llama-3.1-8b \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v $MODEL_STORE:$MODEL_STORE \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
With that I get the following error:
RuntimeError: The number of CUDA devices has changed since the first call to torch.cuda.device_count(). This is not allowed and may result in undefined behavior. Please check out https://github.com/vllm-project/vllm/issues/6056 to find the first call to torch.cuda.device_count() and defer it until the engine is up. Or you can set CUDA_VISIBLE_DEVICES to the GPUs you want to use.
As suggested in the message, I rerun with CUDA_VISIBLE_DEVICES set:
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=$MODEL_STORE \
-e NIM_SERVED_MODEL_NAME=llama-3.1-8b \
-e CUDA_VISIBLE_DEVICES="0,1" \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v $MODEL_STORE:$MODEL_STORE \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
and get another error:
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (42480). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
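The message itself suggests lowering the sequence length; a minimal sketch of that workaround, assuming this image version honors the NIM_MAX_MODEL_LEN override from the NIM configuration reference, would be to add to the docker run above:
-e NIM_MAX_MODEL_LEN=32768 \
That would only mask the symptom here, though, since the KV-cache shortfall seems to come from the model being loaded on a single GPU.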
I also notice that in the startup output
INFO 2024-10-30 13:25:38.353 launch.py:92] running command ['/opt/nim/llm/.venv/bin/python3' ...
the launched command contains "tensor_parallel_size": 1,
which is wrong, as I expect it to be 2.
As an attempt to fix this I add -e NIM_MODEL_PROFILE="vllm-bf16-tp2", but to no avail: it still passes "tensor_parallel_size": 1 and crashes with
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (42480). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
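(Possibly relevant: if NIM_MODEL_PROFILE expects the full profile ID rather than the short name, which is my assumption, the tp2 entry from list-model-profiles above would be passed as:
-e NIM_MODEL_PROFILE="6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d" \
)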
I have also tried using NIM_TENSOR_PARALLEL_SIZE, but as already clarified in "The intended usage of NIM_TENSOR_PARALLEL_SIZE", it is not expected to work here.
So, in the end, is there a way to start a NIM container from a locally stored model with the proper profile and the expected tensor_parallel_size value?