Error running NVIDIA VSS blueprint || pods kept restarting and crashing multiple times and never completed

I ran into some issues while deploying the Helm chart. After I created the Kubernetes pods, they kept restarting and crashing multiple times and never completed.

I have:
8 H100 GPUs: 4 for the LLM, 2 for VILA, 1 each for the other models
CUDA version: 12.2
NVIDIA driver: 535.161.08

I checked the logs of vss-deployment-8f96df479-bzkdh (the last one) and got:

Nemo-embedding-embedding-deploymend-654cdcb5c8-6znwj logs:

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement).
ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.
{“level”: “None”, “time”: “None”, “file_name”: “None”, “file_path”: “None”, “line_number”: “-1”, “message”: “You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.”, “exc_info”: “None”, “stack_info”: “None”}
{“level”: “ERROR”, “time”: “None”, “file_name”: “None”, “file_path”: “None”, “line_number”: “-1”, “message”: “”, “exc_info”: “Traceback (most recent call last):\n File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File "/usr/lib/python3.10/runpy.py", line 86, in _run_code\n exec(code, run_globals)\n File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 99, in \n main()\n File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 42, in main\n inference_env = prepare_environment()\n File "/opt/nim/llm/nim_llm_sdk/entrypoints/args.py", line 195, in prepare_environment\n engine_args, extracted_name = inject_ngc_hub(engine_args)\n File "/opt/nim/llm/nim_llm_sdk/hub/ngc_injector.py", line 239, in inject_ngc_hub\n system = get_hardware_spec()\n File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 354, in get_hardware_spec\n gpus = GPUInspect()\n File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 172, in init\n GPUInspect._safe_exec(cuda.cuInit(0))\n File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 180, in _safe_exec\n raise RuntimeError(f"Unexpected error: {status.name}")\nRuntimeError: Unexpected error: CUDA_ERROR_SYSTEM_NOT_READY”, “stack_info”: “None”}

From the log you attached, there may be a problem with the driver you installed. How did you install the NVIDIA driver? Did you install it according to the guide install-the-nvidia-driver?

And did you run the following command successfully?

sudo microk8s enable nvidia

Yes, the NVIDIA driver version is 535.161.08.

deloitte@computeinstance-e00n5a1hgrdnpd7z4h:~/Desktop/nvidia-graphics-drivers-535.orig-amd64$ sudo microk8s enable nvidia
Infer repository core for addon nvidia
Addon core/nvidia is already enabled

It looks like there’s no problem with the driver version. We’ll check this issue. Could you also share your Ubuntu version and system memory?

Hi,
Thank you for your help.
Ubuntu 22.04 LTS
RAM is 1600 GiB

Have you tried rebooting your device and redeploying the VSS?

Yes, same issue: the pods crashed and never completed.

OK. Let’s narrow down the problem based on the error message CUDA_ERROR_SYSTEM_NOT_READY first.

Could you try to run the simple sample deviceQuery.cpp on your device and attach the result? Please use the tag of that repo that corresponds to your CUDA version.
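
If building the sample is inconvenient, a quicker first check is to make the same driver call that fails in the traceback, cuInit(0), directly on the host. The following is only a minimal sketch, assuming the driver library is loadable as libcuda.so.1; the helper names check_cuda_init and describe_cu_result are hypothetical, and the table covers only a few CUresult codes from cuda.h.

```python
import ctypes

# A small subset of CUresult codes from cuda.h.
# 802 is the code reported in the NIM log earlier in this thread.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code: int) -> str:
    """Map a numeric CUresult to its name, or a generic label if unlisted."""
    return CU_RESULT_NAMES.get(code, f"CUresult {code}")


def check_cuda_init(lib_name: str = "libcuda.so.1") -> str:
    """Call cuInit(0) from the CUDA driver API and describe the result.

    lib_name is an assumption: it is the usual driver library name on Linux.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return f"driver library not loadable: {exc}"
    # CUresult cuInit(unsigned int Flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return describe_cu_result(libcuda.cuInit(0))


if __name__ == "__main__":
    print(check_cuda_init())
```

If this also reports CUDA_ERROR_SYSTEM_NOT_READY when run on the host, the problem is below Kubernetes and the NIM container; if it reports CUDA_SUCCESS on the host but the pod still fails, the container GPU setup is the more likely place to look.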