I ran into some issues while deploying the Helm chart. After I brought up the Kubernetes pods, they kept restarting and crashing and never finished starting up.
I have:
8 H100 GPUs: 4 for the LLM, 2 for VILA, and 1 each for the other models
CUDA version: 12.2
NVIDIA driver: 535.161.08
I checked the logs of vss-deployment-8f96df479-bzkdh (the last one) and got:
Nemo-embedding-embedding-deploymend-654cdcb5c8-6znwj logs:
The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement).
ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.
{“level”: “None”, “time”: “None”, “file_name”: “None”, “file_path”: “None”, “line_number”: “-1”, “message”: “You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.”, “exc_info”: “None”, “stack_info”: “None”}
{“level”: “ERROR”, “time”: “None”, “file_name”: “None”, “file_path”: “None”, “line_number”: “-1”, “message”: “”, “exc_info”: “Traceback (most recent call last):\n File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File "/usr/lib/python3.10/runpy.py", line 86, in _run_code\n exec(code, run_globals)\n File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 99, in <module>\n main()\n File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 42, in main\n inference_env = prepare_environment()\n File "/opt/nim/llm/nim_llm_sdk/entrypoints/args.py", line 195, in prepare_environment\n engine_args, extracted_name = inject_ngc_hub(engine_args)\n File "/opt/nim/llm/nim_llm_sdk/hub/ngc_injector.py", line 239, in inject_ngc_hub\n system = get_hardware_spec()\n File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 354, in get_hardware_spec\n gpus = GPUInspect()\n File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 172, in __init__\n GPUInspect._safe_exec(cuda.cuInit(0))\n File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 180, in _safe_exec\n raise RuntimeError(f"Unexpected error: {status.name}")\nRuntimeError: Unexpected error: CUDA_ERROR_SYSTEM_NOT_READY”, “stack_info”: “None”}
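The traceback ends at cuda.cuInit(0), so the same call could be tried on its own, on the node or in any GPU pod, to see whether the error is specific to the NIM container. Below is a minimal sketch of that check, not the code NIM itself runs. It assumes the driver library can be loaded as libcuda.so.1. The names KNOWN_STATUS, status_name and check_cuda_init are made up for illustration, and the table only covers a few CUresult values from cuda.h.

```python
import ctypes

# Small subset of CUresult values from cuda.h (illustrative, not exhaustive).
KNOWN_STATUS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
}


def status_name(code):
    """Map a CUresult integer to its name, falling back to the raw number."""
    return KNOWN_STATUS.get(code, f"CUresult {code}")


def check_cuda_init(lib=None):
    """Call cuInit(0) on the CUDA driver library and return the status name.

    `lib` is a hypothetical hook so the function can be exercised without a GPU;
    by default the real driver library is loaded (assumed to be libcuda.so.1).
    """
    if lib is None:
        lib = ctypes.CDLL("libcuda.so.1")
    return status_name(int(lib.cuInit(0)))


if __name__ == "__main__":
    try:
        print(check_cuda_init())
    except OSError as exc:
        # Raised when the driver library is not visible in this environment.
        print(f"could not load libcuda.so.1: {exc}")
```

If this also prints CUDA_ERROR_SYSTEM_NOT_READY outside the NIM container, the problem is likely at the host or driver level rather than in the Helm chart.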