VSS Blueprint Helm Installation - NeMo embedding pod failure

Please provide the following information when creating a topic:

  • Hardware Platform: 8 x H100

  • System Memory: 2.1 TB

  • Ubuntu Version: Ubuntu 22.04.5 LTS

  • NVIDIA GPU Driver Version: 570.133.20

  • Issue Type: NeMo embedding pod keeps failing its liveness probe and restarting

  • How to reproduce the issue?
    I am trying to deploy the VSS blueprint by following the steps from:

These are the steps taken:
(Driver and Fabric manager are preinstalled on the server)

sudo snap install microk8s --classic

sudo microk8s enable nvidia

sudo microk8s enable hostpath-storage

sudo snap install kubectl --classic
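
At this point the GPUs should be visible to the cluster. A quick optional check (the nvidia.com/gpu capacity may take a few minutes to appear after enabling the addon):

sudo microk8s kubectl get nodes
sudo microk8s kubectl describe node | grep -i nvidia.com/gpu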

export NGC_API_KEY=<YOUR_LEGACY_NGC_API_KEY>
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=$NGC_API_KEY

sudo microk8s kubectl create secret generic graph-db-creds-secret \
  --from-literal=username=neo4j --from-literal=password=password

sudo microk8s kubectl create secret generic ngc-api-key-secret \
  --from-literal=NGC_API_KEY=$NGC_API_KEY

sudo microk8s helm fetch \
  https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.3.0.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
  --set global.ngcImagePullSecretName=ngc-docker-reg-secret
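
To check the rollout after the install, the release and pod status can be watched with:

sudo microk8s helm list
sudo watch microk8s kubectl get pod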

I am facing an issue where my NeMo embedding pod does not come up properly. Here are the logs and status:

Pod status:
Containers:
  embedding-container:
    Container ID:   containerd://5bb1d1fd40690aa7803339bc11680389043aa4083223ed035b5f700333c08c12
    Image:          nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0
    Image ID:       nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2@sha256:b7ff8b85e9661699f6803ad85a48ed41a5c52212284ca4632a3bd240aee61859
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 23 May 2025 10:54:56 +0800
      Finished:     Fri, 23 May 2025 10:56:55 +0800
    Ready:          False
    Restart Count:  23
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1

Pod events:
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Normal   Killing    5m6s (x6 over 15m)   kubelet  Container embedding-container failed liveness probe, will be restarted
  Normal   Created    4m36s (x7 over 16m)  kubelet  Created container: embedding-container
  Normal   Pulled     4m36s (x6 over 14m)  kubelet  Container image "nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0" already present on machine
  Normal   Started    4m35s (x7 over 16m)  kubelet  Started container embedding-container
  Warning  Unhealthy  3m6s (x21 over 16m)  kubelet  Liveness probe failed: Get "http://10.1.129.211:8000/v1/health/ready": dial tcp 10.1.129.211:8000: connect: connection refused
  Warning  BackOff    60s (x9 over 2m36s)  kubelet  Back-off restarting failed container embedding-container in pod nemo-embedding-embedding-deployment-59d77cdcc4-k42hm_default(54ee3d80-1114-4169-a9a5-b12db1c28bb4)

Pod Log:
Overriding NIM_LOG_LEVEL: replacing NIM_LOG_LEVEL=unset with NIM_LOG_LEVEL=INFO
Running automatic profile selection: NIM_MANIFEST_PROFILE is not set
Selected profile: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379
"timestamp": "2025-05-23 02:54:57,379", "level": "INFO", "message": "Matched profile_id in manifest from env NIM_MODEL_PROFILE to: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379"
"timestamp": "2025-05-23 02:54:57,379", "level": "INFO", "message": "Using the profile specified by the user: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379"
"timestamp": "2025-05-23 02:54:57,379", "level": "INFO", "message": "Downloading manifest profile: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379"

2025-05-23T02:54:57.379831Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https

2025-05-23T02:54:59.180400Z INFO nim_hub_ngc::api::tokio: Downloaded filename: special_tokens_map.json to blob: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3-2-nv-embedqa-1b-v2/blobs/1acefce3b827946744850199a9618076"

2025-05-23T02:54:59.181011Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https

2025-05-23T02:55:01.569852Z INFO nim_hub_ngc::api::tokio: Downloaded filename: tokenizer_config.json to blob: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3-2-nv-embedqa-1b-v2/blobs/3ee146788ee5492ffdf691a066305e40"

2025-05-23T02:55:01.570446Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https

2025-05-23T02:55:02.916547Z INFO nim_hub_ngc::api::tokio: Downloaded filename: metadata.json to blob: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3-2-nv-embedqa-1b-v2/blobs/316ea9bb2437bae9ce6034f970f33835"

2025-05-23T02:55:02.917095Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https

Could you please attach the logs of all the failed pods?

  • Use the command below to check the failed pods:
sudo watch microk8s kubectl get pod
  • Attach the logs of all the failed pods (example commands below).
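
For example, substituting the actual names of the failed pods:

sudo microk8s kubectl logs <pod_name>
sudo microk8s kubectl describe pod <pod_name>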

Hi yuweiw, only the nemo-embedding pod fails, while the other pods seem fine.

Here are the screenshots:
Pods status:

Pods Logs:

Pods Events:

The last two pods are also not in the "READY" state. Could you please attach their logs as well?

vss-blueprint-0 pod:

You may find the logs in vss_logs.txt:
vss_logs.txt (14.3 KB)

vss-vss-deployment-649f65b85d-6vmdk:
This pod has not initialized, so there are no events or logs. Here is some information you may find useful:


Could you stop all the VSS services and try to deploy llama-3_2-nv-embedqa-1b-v2 separately, to see whether it deploys successfully?

I am not sure how to deploy the model correctly. Could you guide me through it?

  1. Install docker-ce by following the official instructions. Once you have installed docker-ce, follow the post-installation steps to ensure that Docker can be run without sudo.
  2. Install nvidia-container-toolkit by following the install-guide.
  3. There are detailed steps on the page I attached before, along the lines of the sketch below. You can follow those steps to deploy the model independently.
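
Roughly, the standalone run follows the usual NIM quick-start pattern (a minimal sketch; the exact flags on the official page take precedence, and the image tag is taken from your pod spec above):

export NGC_API_KEY=<YOUR_NGC_API_KEY>
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
docker run -it --rm --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0

Once the container reports it is ready, the same health endpoint used by the liveness probe can be checked with curl http://localhost:8000/v1/health/ready.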

Hi, I have deployed the model successfully with docker.

This basically rules out a problem with the embedqa service itself. Most likely, it's a problem with your environment.

Could you try reinstalling your driver and Fabric Manager at version 535.183.06? Please ensure that the versions of these two components are exactly the same.
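
One way to confirm that the two versions match after reinstalling (a sketch, assuming a standard Ubuntu package install):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
dpkg -l | grep -E 'nvidia-driver|fabricmanager'
systemctl status nvidia-fabricmanager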

Hi, I managed to fix the NeMo embedding pod issue by extending initialDelaySeconds to 500s. However, vss-blueprint-0 and vss-vss-deployment-649f65b85d-hqmrg still face issues.
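
For reference, the probe change was applied roughly like this (a sketch; the deployment name is taken from the pod name above, and the exact field layout may differ in your chart):

sudo microk8s kubectl edit deployment nemo-embedding-embedding-deployment

        livenessProbe:
          httpGet:
            path: /v1/health/ready
            port: 8000
          initialDelaySeconds: 500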

Unlike the NeMo embedding pod, vss-blueprint-0 keeps failing its startup probe.

The logs, however, do not show any abnormal messages.
vss-blueprint_logs.txt (14.4 KB)

I have tried to extend the initialDelaySeconds of the startup probe for vss-blueprint-0, but the pod configuration did not update even though I forcefully replaced the pod. I wanted to modify its Deployment, but I cannot find one.
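
I assume the probe has to be changed on whatever controller owns the pod; something like the sketch below is what I had in mind (the -0 suffix suggests a StatefulSet rather than a Deployment, and the StatefulSet name is a placeholder):

sudo microk8s kubectl get statefulset,deployment
sudo microk8s kubectl get pod vss-blueprint-0 -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
sudo microk8s kubectl edit statefulset <statefulset_name>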

If your network connection is not very good, it is recommended that you wait for a longer period of time. The LLM and VLM models are quite large, and the download may be interrupted due to network issues.

Could you try the fully-local-single-gpu-deployment first? The resources that need to be downloaded are relatively small that way.

Hi, I tried the single-GPU deployment, but it seems to hit a different issue where vss-blueprint-0 fails to pull the image. The NeMo embedding pod also hits the same issue again, even though I have updated initialDelaySeconds.

Could you attach the error logs? You can open several debugging terminals and try running the following command to print logs in real time.

sudo microk8s kubectl logs -f <pod_name>
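
In parallel, the cluster events can be watched in another terminal, for example:

sudo microk8s kubectl get events -A --watch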

If there are no abnormalities in the logs, we still recommend that you try the suggestions in #11. Our development and debugging both use this CUDA version.

Thanks for the suggestion. I understand that it’s likely environment related. However, due to internal constraints and change control policies in our company, I’d prefer to avoid reinstalling the driver and Fabric Manager unless absolutely necessary.

I am sticking with the default installation, as I managed to fix the vss-blueprint-0 pod.

However, the vss-vss-deployment pod takes incredibly long to download the nvidia/vila-1.5-40b model (more than an hour, and it still hasn't finished).
Logs:

I also noticed one failure when I checked again later.

Based on your debugging, it seems that the issue might be related to your network environment.
You can try using the ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita model first.

Okay, I just need to update the model path value under the "env" section in the deployment configuration file, right?
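
For example, something like the following entry (I am guessing the variable name here; please correct me if the chart uses a different one):

          - name: MODEL_PATH
            value: ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita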

Hi, I have tried to use ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita but it shows an error in the log:
ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine.

Attached log file here:
vss-vss-deployment-log.txt (5.4 KB)

Please modify VLM_MODEL_TO_USE as shown below.

          - name: VLM_MODEL_TO_USE
            value: nvila
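
After changing the value, the edit can be applied and the rollout watched, for example (the Deployment name here is inferred from the pod name above):

sudo microk8s kubectl edit deployment vss-vss-deployment
sudo microk8s kubectl rollout status deployment/vss-vss-deployment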

Hi, I have updated VLM_MODEL_TO_USE. There seems to be an issue where the log does not progress any further and the pod is not ready either.

Log file:
vss-vss-deployment-log.txt (14.6 KB)