I’m using VIA Engine to host the Vita-2.0 model on an A100 VM. The VM specifications are:
- RAM: 70GB
- NVIDIA Driver Version: 535.161.08
- VRAM: 40GB
I’m using the nvcr.io/metropolis/via-dp/via-engine:2.0-dp image to deploy the model.
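For reference, the GPU model, driver version, and VRAM listed above can be confirmed on the VM with a query like the one below (just a sanity check, not part of the deployment):

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv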
I ran the container with the following commands:
export BACKEND_PORT=8000
export FRONTEND_PORT=9000
export NVIDIA_API_KEY=<My-NVIDIA-API-Key>
export NGC_API_KEY=<My-NGC-Key>
export MODEL_PATH=./via-vita-model/vita_2.0.1
export NGC_MODEL_CACHE=./via-vita-model/cache
export VLM_MODEL_TO_USE=vita-2.0
docker run -it --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 --tmpfs /tmp:exec --name via-server-2 \
--gpus '"device=all"' \
-p $FRONTEND_PORT:$FRONTEND_PORT \
-p $BACKEND_PORT:$BACKEND_PORT \
-e BACKEND_PORT=$BACKEND_PORT \
-e FRONTEND_PORT=$FRONTEND_PORT \
-e NVIDIA_API_KEY=$NVIDIA_API_KEY \
-e NGC_API_KEY=$NGC_API_KEY \
-e VLM_BATCH_SIZE=1 \
-v $NGC_MODEL_CACHE:/root/.via/ngc_model_cache \
-v ./via-vita-model/vita_2.0.1:/root/via-vita-model/vita_2.0.1 \
-e MODEL_PATH=$MODEL_PATH \
-e VLM_MODEL_TO_USE=vita-2.0 \
-v via-hf-cache:/tmp/huggingface \
nvcr.io/metropolis/via-dp/via-engine:2.0-dp
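After the container starts, a quick sanity check that the GPU is visible inside it (assuming the NVIDIA container toolkit mounts nvidia-smi into the container, which it normally does when --gpus is used) would be:

# List the GPUs visible inside the running container
docker exec via-server-2 nvidia-smi -L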
With this command, I was getting the following NVML error:
[10/09/2024-05:12:34] [TRT-LLM] [I] Compute capability: (8, 0)
[10/09/2024-05:12:34] [TRT-LLM] [I] SM count: 108
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 427, in main
cluster_config = infer_cluster_config()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 538, in infer_cluster_config
cluster_info=infer_cluster_info(),
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 460, in infer_cluster_info
sm_clock = pynvml.nvmlDeviceGetMaxClockInfo(
File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 2182, in nvmlDeviceGetMaxClockInfo
_nvmlCheckReturn(ret)
File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR: Failed to build TRT engine
2024-10-09 05:12:36,731 ERROR Failed to load VIA pipeline - Failed to generate TRT-LLM engine
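For what it's worth, the call that fails here asks NVML for the GPU's maximum SM clock. Whether the driver exposes that information on this VM can be checked directly with nvidia-smi; if the Max Clocks section reports N/A, that would match the Not Supported result pynvml returns inside the container, which I understand can happen on some virtualized GPU instances:

# Show the clock information NVML reports for the GPU, including Max Clocks
nvidia-smi -q -d CLOCK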
To fix this issue, I referred to this GitHub issue and upgraded the tensorrt-llm version to 0.12.0.dev2024070200, but I’m now getting the error below:
Selecting FP16 mode
Converting Checkpoint ...
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070200
Traceback (most recent call last):
File "/opt/nvidia/via/via-engine/models/vita20/trt_helper/convert_checkpoint.py", line 14, in <module>
from tensorrt_llm.models.llama.weight import load_from_gptq_llama
ModuleNotFoundError: No module named 'tensorrt_llm.models.llama.weight'
ERROR: Failed to convert checkpoint
2024-10-09 05:16:23,658 ERROR Failed to load VIA pipeline - Failed to generate TRT-LLM engine
Killed
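In case it helps narrow things down, these are minimal checks I can run: the first two inside the container (assuming python3 is the interpreter the engine uses) to confirm the installed TensorRT-LLM version and reproduce the exact import that convert_checkpoint.py fails on, and the last one on the host to see whether the trailing "Killed" came from the kernel OOM killer:

# Inside the container: confirm the installed TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Inside the container: the exact import that convert_checkpoint.py fails on
python3 -c "from tensorrt_llm.models.llama.weight import load_from_gptq_llama"
# On the host: check whether the final "Killed" came from the kernel OOM killer
sudo dmesg -T | grep -i -E "out of memory|killed process"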
Note: I downloaded the vita-2.0 model from NGC.
Could you please suggest a possible solution to fix this issue?