VSS Installation problem

Please provide the following information when creating a topic:

  • Hardware Platform (GPU model and numbers): Nvidia H100 x 8
  • System Memory: 2 TB
  • Ubuntu Version: 22.04
  • NVIDIA GPU Driver Version (valid for GPU only): 550.127.08
  • Issue Type( questions, new requirements, bugs): bugs
  • How to reproduce the issue ? (This is for bugs. Including the command line used and other details for reproducing)
  • Requirement details (This is for new requirement. Including the logs for the pods, the description for the pods)

Hi, I have a problem deploying VSS on a DGX Kubernetes cluster that was deployed using BCM. I followed the Quickstart guide and tried deploying the Helm chart with the attached overrides. I tried various configurations, and also tried disabling and enabling guardrails, but without success. The current situation is as follows:

 kubectl get pods -w
NAME                                                  READY   STATUS     RESTARTS       AGE
etcd-etcd-deployment-6ff4564cb8-l8jlj                 1/1     Running    0              53m
milvus-milvus-deployment-f64989c5d-4hdvr              1/1     Running    0              53m
minio-minio-deployment-6849c855b7-l7q8n               1/1     Running    0              53m
nemo-embedding-embedding-deployment-c8579989b-79v8m   1/1     Running    15             53m
nemo-rerank-ranking-deployment-5c9bcc97db-2q5s2       1/1     Running    3 (110s ago)   53m
neo4j-neo4j-deployment-5c9ff6fbc8-nkgbd               1/1     Running    0              53m
vss-blueprint-0                                       0/1     Running    1 (22m ago)    53m
vss-vss-deployment-f8bb8f778-jjwwj                    0/1     Init:2/3   0              53m

Please find attached the vss-blueprint-0 logs and the vss-deployment describe output.

In the past few days I was actually able to deploy the chart, but the web interface hung constantly, so I completely uninstalled the Helm chart and tried to redeploy it, without success. I created the secrets with the Docker key and API key correctly. I really don’t understand what I am missing. Could you please suggest how to fix the problem?

Thanks in advance
vss-blueprint-0.txt (9.8 KB)
vss-vss-deployment-f8bb8f778-jjwwj.txt (9.9 KB)
overrides.yaml.txt (1.5 KB)

There is no obvious error message in the log. In theory, if you are using 8x H100 and only deploy VSS on that machine, you don’t need the overrides.yaml file. The only thing that might cause problems is your driver and CUDA version.
Could you attach the output of the nvidia-smi command?
You can also try NVIDIA driver 535.161.08.
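
For a pod stuck in an Init state, like vss-vss-deployment in the listing above (Init:2/3), it can help to see which init container has not completed and then read that container's log. The following is a minimal sketch, not taken from the VSS guide: the helper names not_ready_pods and init_status are made up for illustration, and only standard kubectl subcommands are used.

```shell
# Hypothetical helpers (names are illustrative, not from the VSS docs).

# Print "<pod> <status>" for every pod whose READY count is incomplete (e.g. 0/1).
not_ready_pods() {
  kubectl get pods --no-headers |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}

# Print each init container of pod $1 with its ready flag (true/false).
init_status() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'
}
```

Running not_ready_pods, then init_status <pod>, then kubectl logs <pod> -c <init-container-name> should show what the blocking init container is waiting for.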

Hi
Here’s the nvidia-smi output:

nvidia-smi
Mon Feb 17 09:19:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   25C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   26C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:52:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
| N/A   28C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9D:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:D1:00.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DF:00.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Thank you

Can you try the following steps?
1. Reboot your device.
2. export NGC_API_KEY=<your_ngc_api_key>
3. Install the Helm chart following our guide.

If that does not work, we still recommend reinstalling the driver at version 535.161.08.

I restarted both nodes and redeployed the Helm chart with no overrides, but I am still facing the same problem:

 kubectl get pods -w
NAME                                                 READY   STATUS     RESTARTS   AGE
etcd-etcd-deployment-6ff4564cb8-kjkzw                1/1     Running    0          16m
milvus-milvus-deployment-f64989c5d-sdm6n             1/1     Running    0          16m
minio-minio-deployment-6849c855b7-qsbdf              1/1     Running    0          16m
nemo-embedding-embedding-deployment-d984ff59-qtx69   1/1     Running    0          16m
nemo-rerank-ranking-deployment-c95ccfcf-9fkj5        1/1     Running    0          16m
neo4j-neo4j-deployment-5c9ff6fbc8-jrpjq              1/1     Running    0          16m
vss-blueprint-0                                      0/1     Running    0          16m
vss-vss-deployment-6fc4ccfd94-w85rf                  0/1     Init:2/3   0          16m

There are no other significant errors in the logs. I need to understand the impact of downgrading the driver on the other workloads in the cluster before I try your suggestion to reinstall driver 535.161.08.

You can wait until the deployment succeeds. The first deployment may take a little longer; wait 40 to 50 minutes and see whether it succeeds.
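
Instead of re-running kubectl get pods by hand, the wait can be scripted. This is a minimal sketch, assuming all VSS pods are in the current namespace; wait_for_pods is a hypothetical helper, and the 60m default is an arbitrary choice rather than a value from the VSS documentation.

```shell
# Hypothetical helper: block until every pod in the current namespace is Ready,
# or until the timeout ($1, default 60m; an assumed value) expires.
wait_for_pods() {
  kubectl wait --for=condition=Ready pod --all --timeout="${1:-60m}"
}
```

kubectl wait exits non-zero when the timeout expires, so something like wait_for_pods 90m || kubectl get pods prints the remaining pod states if the deployment is still not ready.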

You were right: after quite a while, all the pods are now running:

kubectl get pods -w
NAME                                                 READY   STATUS    RESTARTS       AGE
etcd-etcd-deployment-6ff4564cb8-kjkzw                1/1     Running   0              3h35m
milvus-milvus-deployment-f64989c5d-sdm6n             1/1     Running   0              3h35m
minio-minio-deployment-6849c855b7-qsbdf              1/1     Running   0              3h35m
nemo-embedding-embedding-deployment-d984ff59-qtx69   1/1     Running   0              3h35m
nemo-rerank-ranking-deployment-c95ccfcf-9fkj5        1/1     Running   0              3h35m
neo4j-neo4j-deployment-5c9ff6fbc8-jrpjq              1/1     Running   0              3h35m
vss-blueprint-0                                      1/1     Running   0              3h35m
vss-vss-deployment-6fc4ccfd94-w85rf                  1/1     Running   2 (124m ago)   3h35m

Now when I try to access the web UI via port forwarding, the page never loads, and the port-forward command shows the following errors:

Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
E0218 14:29:32.339239     436 portforward.go:381] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:9000->127.0.0.1:48282: write tcp4 127.0.0.1:9000->127.0.0.1:48282: write: broken pipe
Handling connection for 9000
E0218 14:29:32.731004     436 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:9000->[::1]:55824: write tcp6 [::1]:9000->[::1]:55824: write: broken pipe
E0218 14:30:02.136458     436 portforward.go:370] error creating forwarding stream for port 9000 -> 9000: Timeout occurred
E0218 14:30:02.136496     436 portforward.go:370] error creating forwarding stream for port 9000 -> 9000: Timeout occurred
E0218 14:30:02.233257     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:02.612103     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:32.558864     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
E0218 14:30:32.920601     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000
E0218 14:31:02.915904     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred
E0218 14:31:03.277598     436 portforward.go:347] error creating error stream for port 9000 -> 9000: Timeout occurred

I am accessing a cluster in a remote datacenter, and the worker nodes’ network is not directly accessible.
I don’t see errors in the pods; the page just stops loading and the sample icon images are not fully loaded.

Is there anything I can do to get the web UI working?

Thank you

OK. Could you try running the Helm chart without overrides.yaml? Your GPU resources are sufficient. Alternatively, you can modify the profile to use the GPU assignment below.

nim-llm:           [0,1,2,3]
vss:               [4,5]
nemo-embedding:    [6]
nemo-rerank:       [7]

The test I ran was deployed without overrides.yaml, and the results are as shown above. It seems that the web UI just dies when I first try to load the page.

OK. Can you load the web UI successfully from the internal network, without port forwarding?

What specific steps are you using for port forwarding?

OK, I tried connecting directly to the NodePort and now the web UI loads correctly. I will use it that way. Thank you for your support.
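
For anyone with the same symptom, the NodePort can be looked up from the Service instead of going through port forwarding. This is a minimal sketch under stated assumptions: the service name vss-service is a guess (check kubectl get svc for the actual name), port 9000 is the web UI port seen in the port-forward output above, and ui_url is a hypothetical helper.

```shell
# Hypothetical helper: print the web UI URL for node IP $1.
# $2 = Service name (default vss-service, an assumption)
# $3 = Service port (default 9000, the port forwarded earlier in this thread)
ui_url() {
  svc="${2:-vss-service}"
  port="${3:-9000}"
  # Select the port entry whose .port matches, then read its nodePort.
  node_port=$(kubectl get svc "$svc" \
    -o jsonpath="{.spec.ports[?(@.port==${port})].nodePort}")
  echo "http://$1:${node_port}"
}
```

Node IPs are listed by kubectl get nodes -o wide. This only works if the Service is of type NodePort and the node IP is reachable from the machine running the browser.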
