Please provide the following information when creating a topic:
Hardware Platform (GPU model and numbers): Nvidia H100 x 8
System Memory: 2 TB
Ubuntu Version: 22.04
NVIDIA GPU Driver Version (valid for GPU only): 550.127.08
Issue Type( questions, new requirements, bugs): bugs
How to reproduce the issue ? (This is for bugs. Including the command line used and other details for reproducing)
Requirement details (This is for new requirement. Including the logs for the pods, the description for the pods)
Hi, I have a problem deploying VSS on a DGX Kubernetes cluster that was deployed using BCM. I followed the Quickstart guide and tried deploying the Helm chart using the attached overrides. I tried various configurations, but without success. I also tried disabling and enabling guardrails, again with no success. The current situation is as follows:
Please find attached the vss-blueprint-0 logs and the vss-deployment describe output.
In the past few days I was actually able to deploy the chart, but the web interface hung constantly, so I tried to completely uninstall the Helm chart and redeploy it, but with no success. I created the secrets with the Docker key and API key correctly. I really don’t understand what I am missing. Could you please suggest how to fix the problem?
There is no obvious error message in the log. In theory, if you are using 8x H100 and only deploy VSS on your device, you don’t need to use the overrides.yaml file. The only thing that might cause problems is your driver and CUDA version.
Could you attach the result of running the nvidia-smi command?
You can try to use the NVIDIA driver 535.161.08.
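For anyone collecting the same information on each worker node, here is a minimal Python sketch that reads the GPU name and driver version through nvidia-smi's CSV query mode. The helper names `parse_gpu_report` and `query_gpus` are made up for illustration; the sketch assumes `nvidia-smi` is on the PATH of the node where it runs.

```python
import subprocess


def parse_gpu_report(csv_text):
    """Turn 'name, driver_version' CSV lines into (name, driver) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma in the GPU name cannot break parsing.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        rows.append((name, driver))
    return rows


def query_gpus():
    """Run nvidia-smi on this node and return one (name, driver) tuple per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_report(out)
```

Calling `query_gpus()` on a node returns one entry per GPU, which makes it easy to see whether every GPU reports the same driver before and after a driver change.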
There are no other significant errors in the logs. I need to understand the impact of downgrading the driver version on other workloads in the cluster, and then I will try your suggestion to reinstall driver 535.161.08.
You can wait until the deployment is successful. The first deployment may take a little longer; you can wait 40 to 50 minutes to see whether it succeeds.
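Rather than checking by hand during that wait, the readiness check can be scripted. The following is only a sketch: it polls `kubectl get pods -o json` and reports pods whose `Ready` condition is not yet `True`. The function names, the `default` namespace, and the 30-second polling interval are assumptions for illustration; the 50-minute timeout mirrors the upper bound mentioned above.

```python
import json
import subprocess
import time


def unready_pods(pods_json):
    """Return names of pods in a `kubectl get pods -o json` dump that are not Ready."""
    pending = []
    for pod in json.loads(pods_json).get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed pods never report Ready; do not wait on them
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending


def wait_for_pods(namespace="default", timeout_s=50 * 60, interval_s=30):
    """Poll until every pod in the namespace is Ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        out = subprocess.run(
            ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        pending = unready_pods(out)
        if not pending:
            return True
        if time.monotonic() >= deadline:
            print("Still not ready:", ", ".join(pending))
            return False
        time.sleep(interval_s)
```

If `wait_for_pods()` returns `False`, the printed pod names are the ones worth inspecting with `kubectl describe` and `kubectl logs`.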
Now when I try to access the web UI via port forward, the page never loads, and in the output of the port forward command I see the following errors:
Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
E0218 14:29:32.339239 436 portforward.go:381] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:9000->127.0.0.1:48282: write tcp4 127.0.0.1:9000->127.0.0.1:48282: write: broken pipe
Handling connection for 9000
E0218 14:29:32.731004 436 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:9000->[::1]:55824: write tcp6 [::1]:9000->[::1]:55824: write: broken pipe
E0218 14:30:02.136458 436 portforward.go:370] error creating forwarding stream for port 9000 -> 9000: Timeout occurred
E0218 14:30:02.136496 436 portforward.go:370] error creating forwarding stream for port 9000 -> 9000: Timeout occurred
E0218 14:30:02.233257 436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:02.612103 436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:32.558864 436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:32.920601 436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
E0218 14:31:02.915904 436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
E0218 14:31:03.277598 436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
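To make a log like this easier to compare between attempts, a small Python sketch can tally the connection lines and the two error kinds that appear in it. The function name `summarize` and the category labels are invented for illustration, and the regular expression only assumes the `E<mmdd> <time> <pid> portforward.go:<line>]` prefix visible in the output above.

```python
import re
from collections import Counter

# Matches error lines such as:
# E0218 14:30:02.136458 436 portforward.go:370] error creating forwarding stream ...
KLOG_ERROR = re.compile(r"^E\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ portforward\.go:\d+\] (.*)$")


def summarize(log_text):
    """Count handled connections and error kinds in kubectl port-forward output."""
    counts = Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("Handling connection"):
            counts["connections"] += 1
            continue
        match = KLOG_ERROR.match(line)
        if not match:
            continue  # e.g. the "Forwarding from ..." banner lines
        message = match.group(1)
        if "broken pipe" in message:
            counts["broken_pipe"] += 1
        elif "Timeout occurred" in message:
            counts["timeout"] += 1
        else:
            counts["other_error"] += 1
    return counts
```

As a rough reading, a broken pipe usually means the local side of the connection (for example the browser) went away while data was still being copied, whereas the timeouts indicate that the forwarding stream toward the pod could not be created in time. A log dominated by timeouts would therefore point toward the path between the API server and the pod rather than toward the browser.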
I am accessing the cluster in a remote datacenter and the network of the worker nodes is not directly accessible.
I don’t see errors in the pods; the page just stops loading and the sample icon images are not fully loaded.
The test I ran was deployed without the overrides.yaml option, and the results are as shown above. It seems that the web UI just dies when I first try to load the page.