Error running Nvidia VSS blueprint || pods kept restarting and crashing multiple times and never completed

adaher · January 9, 2025, 12:21pm

I ran into some issues while deploying the Helm chart: the Kubernetes pods kept restarting and crashing multiple times and never completed.

I have:
8 H100 GPUs: 4 for the LLM, 2 for VILA, 1 each for the other models
CUDA version: 12.2
NVIDIA driver: 535.161.08

I checked the logs of vss-deployment-8f96df479-bzkdh (the last one) and got:

Nemo-embedding-embedding-deploymend-654cdcb5c8-6znwj logs:

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement).
ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.", "exc_info": "None", "stack_info": "None"}
{"level": "ERROR", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "Traceback (most recent call last):\n File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py\", line 99, in <module>\n main()\n File \"/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py\", line 42, in main\n inference_env = prepare_environment()\n File \"/opt/nim/llm/nim_llm_sdk/entrypoints/args.py\", line 195, in prepare_environment\n engine_args, extracted_name = inject_ngc_hub(engine_args)\n File \"/opt/nim/llm/nim_llm_sdk/hub/ngc_injector.py\", line 239, in inject_ngc_hub\n system = get_hardware_spec()\n File \"/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py\", line 354, in get_hardware_spec\n gpus = GPUInspect()\n File \"/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py\", line 172, in __init__\n GPUInspect._safe_exec(cuda.cuInit(0))\n File \"/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py\", line 180, in _safe_exec\n raise RuntimeError(f\"Unexpected error: {status.name}\")\nRuntimeError: Unexpected error: CUDA_ERROR_SYSTEM_NOT_READY", "stack_info": "None"}
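Because the container writes each log record as one JSON object with the traceback packed into the exc_info field, the actual exception is easy to miss. The following is a minimal Python sketch, not part of the NIM tooling, that pulls the final traceback line out of such a record. The helper name last_traceback_line is hypothetical, and the sample record is an abbreviated copy of the ERROR line above, assumed to be valid JSON.

```python
import json


def last_traceback_line(log_line):
    """Return the final non-empty line of the traceback in exc_info, or None.

    Hypothetical helper: assumes one JSON object per log line with an
    "exc_info" field, as in the records shown in this post.
    """
    record = json.loads(log_line)
    exc_info = record.get("exc_info")
    # The logger writes the literal string "None" when there is no exception.
    if not exc_info or exc_info == "None":
        return None
    lines = [line for line in exc_info.splitlines() if line.strip()]
    return lines[-1].strip() if lines else None


# Abbreviated copy of the ERROR record above (middle frames elided).
sample = json.dumps({
    "level": "ERROR",
    "message": "",
    "exc_info": (
        "Traceback (most recent call last):\n"
        " ...\n"
        "RuntimeError: Unexpected error: CUDA_ERROR_SYSTEM_NOT_READY"
    ),
})

print(last_traceback_line(sample))
```

Applied to the record above, this isolates the RuntimeError raised when cuda.cuInit(0) fails, which is the line the rest of this thread focuses on.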

adaher · January 9, 2025, 12:24pm

yuweiw · January 10, 2025, 1:38am

From the log you attached, there may be a problem with the driver you installed. How did you install the NVIDIA driver? Did you install it according to the guide install-the-nvidia-driver?

And did you run the following command successfully?

sudo microk8s enable nvidia

adaher · January 10, 2025, 11:43am

Yes, the NVIDIA driver version is 535.161.08.

deloitte@computeinstance-e00n5a1hgrdnpd7z4h:~/Desktop/nvidia-graphics-drivers-535.orig-amd64$ sudo microk8s enable nvidia
Infer repository core for addon nvidia
Addon core/nvidia is already enabled

yuweiw · January 13, 2025, 3:12am

It looks like there’s no problem with the driver version. We’ll check this issue. Could you also share your Ubuntu version and system memory?

adaher · January 13, 2025, 7:59am

Hi,
Thank you for your help.
Ubuntu 22.04 LTS
RAM is 1600 GiB

yuweiw · January 13, 2025, 8:55am

Have you tried rebooting your device and redeploying the VSS?

adaher · January 13, 2025, 10:01am

Yes, the same issue: the pods crashed and never completed.

yuweiw · January 14, 2025, 2:15am

OK. Let’s narrow down the problem based on the error message CUDA_ERROR_SYSTEM_NOT_READY first.

Could you try to run the simple sample deviceQuery.cpp on your device and attach the result? Please note that you should use the tagged branch of this repo that corresponds to your CUDA version.
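As a quicker first check than building deviceQuery, the cuInit(0) call that fails in the traceback can be probed directly. The following is a minimal Python sketch, assuming the driver library is installed as libcuda.so.1 and loading it through ctypes. The function name check_cuda_init is hypothetical, and the table lists only a few CUresult codes from cuda.h. It does not replace the deviceQuery result requested above.

```python
import ctypes

# A few CUresult values from cuda.h; 802 is the error in the traceback above.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    999: "CUDA_ERROR_UNKNOWN",
}


def check_cuda_init(lib_name="libcuda.so.1"):
    """Call cuInit(0) from the CUDA driver library.

    Returns (status_code, status_name), or None if the library cannot be
    loaded. The default library name is an assumption about the host setup.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    status = libcuda.cuInit(0)
    return status, CU_RESULT_NAMES.get(status, "code not listed in this sketch")


if __name__ == "__main__":
    result = check_cuda_init()
    if result is None:
        print("Could not load the CUDA driver library")
    else:
        print("cuInit(0) returned %d (%s)" % result)
```

Running it both on the host and inside one of the failing pods may help tell whether the error comes from the driver installation itself or from how the GPUs are exposed to the containers. The CUDA documentation describes CUDA_ERROR_SYSTEM_NOT_READY as the system not yet being ready to start CUDA work and advises verifying that the system configuration is valid and that all required driver daemons are running.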

yuweiw · March 5, 2025, 5:41am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

system · March 19, 2025, 5:41am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
