VSS Installation problem

Max.70 · February 16, 2025, 10:43am

Please provide the following information when creating a topic:

Hardware Platform (GPU model and numbers): Nvidia H100 x 8
System Memory: 2 TB
Ubuntu Version: 22.04
NVIDIA GPU Driver Version (valid for GPU only): 550.127.08
Issue Type( questions, new requirements, bugs): bugs
How to reproduce the issue ? (This is for bugs. Including the command line used and other details for reproducing)
Requirement details (This is for new requirement. Including the logs for the pods, the description for the pods)

Hi, I have a problem deploying VSS on a DGX Kubernetes cluster deployed using BCM. I followed the Quickstart guide and tried deploying the Helm chart using the attached overrides. I tried various configurations, and also tried disabling and enabling guardrails, but without success. The current situation is as follows:

 kubectl get pods -w
NAME                                                  READY   STATUS     RESTARTS       AGE
etcd-etcd-deployment-6ff4564cb8-l8jlj                 1/1     Running    0              53m
milvus-milvus-deployment-f64989c5d-4hdvr              1/1     Running    0              53m
minio-minio-deployment-6849c855b7-l7q8n               1/1     Running    0              53m
nemo-embedding-embedding-deployment-c8579989b-79v8m   1/1     Running    15             53m
nemo-rerank-ranking-deployment-5c9bcc97db-2q5s2       1/1     Running    3 (110s ago)   53m
neo4j-neo4j-deployment-5c9ff6fbc8-nkgbd               1/1     Running    0              53m
vss-blueprint-0                                       0/1     Running    1 (22m ago)    53m
vss-vss-deployment-f8bb8f778-jjwwj                    0/1     Init:2/3   0              53m

Please find attached the vss-blueprint-0 logs and the vss-deployment describe output.
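For readers hitting the same `Init:2/3` state, the diagnostics mentioned above can be collected roughly as follows. This is only a sketch: the function names are made up for illustration, and the init container name has to be read from the `describe` output, since it is not given in this thread.

```shell
# Sketch: gather diagnostics for a pod stuck in Init:N/M.
# Function names are illustrative; pod and container names come from your cluster.

# Print each init container of a pod with its ready flag.
init_status() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'
}

# Show the logs of one init container ($2) of a pod ($1).
init_logs() {
  kubectl logs "$1" -c "$2"
}

# Full describe output, including the Events section at the end.
pod_events() {
  kubectl describe pod "$1"
}
```

For example, `init_status vss-vss-deployment-f8bb8f778-jjwwj` lists the init containers, and the one that is not ready is the one to pass to `init_logs`.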

In the past days I was actually able to deploy the chart, but the web interface hung constantly, so I completely uninstalled the Helm chart and redeployed it, without success. I created the secrets with the Docker key and API key correctly. I really don’t understand what I am missing. Could you please suggest how to fix the problem?

Thanks in advance
vss-blueprint-0.txt (9.8 KB)
vss-vss-deployment-f8bb8f778-jjwwj.txt (9.9 KB)
overrides.yaml.txt (1.5 KB)

yuweiw · February 17, 2025, 8:11am

There is no obvious error message in the log. In theory, if you are using 8x H100 and only deploying VSS on your device, you don’t need the overrides.yaml file. The only thing that might cause problems is your driver and CUDA version.
Could you attach the result of running the nvidia-smi command?
You can try to use the NVIDIA driver 535.161.08.
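If it helps, the driver version can also be checked from a script rather than by reading the full `nvidia-smi` table. A minimal sketch, assuming `nvidia-smi` is on the PATH of the node; the function names are illustrative only.

```shell
# Sketch: compare the installed driver version against an expected one.

# Print the driver version once per GPU.
driver_versions() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# Succeed only if every GPU reports exactly the expected version ($1).
driver_is() {
  expected="$1"
  versions="$(driver_versions | sort -u)"
  [ "$versions" = "$expected" ]
}
```

For example, `driver_is 535.161.08 && echo "driver matches"` prints the message only when all GPUs report that version.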

Max.70 · February 17, 2025, 8:21am

Hi
Here’s the nvidia-smi output:

nvidia-smi
Mon Feb 17 09:19:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   25C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   26C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:52:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
| N/A   28C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9D:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:D1:00.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DF:00.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Thank you

yuweiw · February 17, 2025, 8:39am

Can you try the following steps?
1. Reboot your device
2. export NGC_API_KEY=<your_ngc_api_key>
3. Install the Helm chart following our Guide

If the above method does not work, it is still recommended that you reinstall the NVIDIA driver at version 535.161.08.
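Steps 2 and 3 above could be scripted along these lines. This is a sketch under assumptions: the release name and chart reference are placeholders, the function names are invented for illustration, and the exact chart and any extra flags should be taken from the guide.

```shell
# Sketch: refuse to install when NGC_API_KEY is missing.

# Fail early if NGC_API_KEY is unset or empty (step 2).
require_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
}

# Install the chart (step 3). $1 = release name, $2 = chart reference;
# both are placeholders, the real values come from the guide.
install_vss() {
  require_ngc_key || return 1
  helm install "$1" "$2"
}
```

Checking the key first avoids a deployment that starts but can never pull images or models.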

Max.70 · February 18, 2025, 10:00am

I restarted both nodes and redeployed the Helm chart with no overrides, but I am still facing the same problem:

 kubectl get pods -w
NAME                                                 READY   STATUS     RESTARTS   AGE
etcd-etcd-deployment-6ff4564cb8-kjkzw                1/1     Running    0          16m
milvus-milvus-deployment-f64989c5d-sdm6n             1/1     Running    0          16m
minio-minio-deployment-6849c855b7-qsbdf              1/1     Running    0          16m
nemo-embedding-embedding-deployment-d984ff59-qtx69   1/1     Running    0          16m
nemo-rerank-ranking-deployment-c95ccfcf-9fkj5        1/1     Running    0          16m
neo4j-neo4j-deployment-5c9ff6fbc8-jrpjq              1/1     Running    0          16m
vss-blueprint-0                                      0/1     Running    0          16m
vss-vss-deployment-6fc4ccfd94-w85rf                  0/1     Init:2/3   0          16m

No other significant errors in the logs. I need to understand how downgrading the driver version would affect other workloads on the cluster before trying your suggestion to reinstall driver 535.161.08.

yuweiw · February 18, 2025, 10:10am

You can wait until the deployment is successful. The first deployment may take a little longer; wait 40 to 50 minutes to see whether it succeeds.
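Instead of watching `kubectl get pods -w` by hand, the wait can be made explicit with `kubectl wait`. A minimal sketch; the function names and the 60-minute default are assumptions, chosen to sit a little above the 40 to 50 minutes mentioned.

```shell
# Sketch: block until pods are Ready, or fail once the timeout expires.

# Wait for one pod ($1); optional timeout ($2), default 60m.
wait_for_pod() {
  kubectl wait --for=condition=Ready "pod/$1" --timeout="${2:-60m}"
}

# Wait for every pod in the current namespace; optional timeout ($1).
wait_for_all() {
  kubectl wait --for=condition=Ready pod --all --timeout="${1:-60m}"
}
```

For example, `wait_for_pod vss-blueprint-0 50m` returns a non-zero status if the pod is still not Ready after 50 minutes.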

Max.70 · February 18, 2025, 1:38pm

You were right. After quite a while, all the pods are now running:

kubectl get pods -w
NAME                                                 READY   STATUS    RESTARTS       AGE
etcd-etcd-deployment-6ff4564cb8-kjkzw                1/1     Running   0              3h35m
milvus-milvus-deployment-f64989c5d-sdm6n             1/1     Running   0              3h35m
minio-minio-deployment-6849c855b7-qsbdf              1/1     Running   0              3h35m
nemo-embedding-embedding-deployment-d984ff59-qtx69   1/1     Running   0              3h35m
nemo-rerank-ranking-deployment-c95ccfcf-9fkj5        1/1     Running   0              3h35m
neo4j-neo4j-deployment-5c9ff6fbc8-jrpjq              1/1     Running   0              3h35m
vss-blueprint-0                                      1/1     Running   0              3h35m
vss-vss-deployment-6fc4ccfd94-w85rf                  1/1     Running   2 (124m ago)   3h35m

Now, when I try to access the web UI via port forwarding, the page never loads, and the port-forward command shows the following errors:

Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
E0218 14:29:32.339239     436 portforward.go:381] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:9000->127.0.0.1:48282: write tcp4 127.0.0.1:9000->127.0.0.1:48282: write: broken pipe
Handling connection for 9000
E0218 14:29:32.731004     436 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:9000->[::1]:55824: write tcp6 [::1]:9000->[::1]:55824: write: broken pipe
E0218 14:30:02.136458     436 portforward.go:370] error creating forwarding stream for port 9000 -> 9000: Timeout occurred
E0218 14:30:02.136496     436 portforward.go:370] error creating forwarding stream for port 9000 -> 9000: Timeout occurred
E0218 14:30:02.233257     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:02.612103     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:32.558864     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:32.920601     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
E0218 14:31:02.915904     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
E0218 14:31:03.277598     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred

I am accessing the cluster in a remote datacenter, and the network of the worker nodes is not directly accessible.
I don’t see errors in the pods; the page just stops loading and the sample icon images are not fully loaded.

Is there anything I can do to get the web UI working?

Thank you

yuweiw · February 19, 2025, 6:06am

OK. Could you try running the Helm chart without the overrides.yaml? Your GPU resources are sufficient. Alternatively, you can modify the profile as below:

nim-llm:           [0,1,2,3]
vss:               [4,5]
nemo-embedding:    [6]
nemo-rerank:       [7]

Max.70 · February 19, 2025, 2:41pm

The test I ran was deployed without the overrides.yaml option, and the results are as shown above. It seems that the web UI just dies when I first try to load the page.

yuweiw · February 20, 2025, 2:03am

OK. Can you load the web UI successfully from the internal network, without port forwarding?

What are your specific steps for port forwarding?

Max.70 · February 21, 2025, 12:46pm

OK, I tried connecting directly to the NodePort and now the web UI loads correctly. I will use it that way. Thank you for your support.
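For anyone following the same route, the NodePort can be looked up roughly like this. A sketch only: the service name is not given in this thread, so it is passed in as a parameter (find it with `kubectl get svc`), and the function names are illustrative.

```shell
# Sketch: find the NodePort of a service and build the URL to open.

# Print the node port(s) of a service ($1), space separated if several.
nodeport_of() {
  kubectl get svc "$1" -o jsonpath='{.spec.ports[*].nodePort}'
}

# Build the URL from a node IP ($1) and a node port ($2).
ui_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}
```

Node addresses are listed by `kubectl get nodes -o wide`. The resulting URL only works from a machine that can reach the node network directly.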

system · March 7, 2025, 12:46pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
