- Hardware Platform: 8 x H100
- System Memory: 2.1 TB
- Ubuntu Version: Ubuntu 22.04.5 LTS
- NVIDIA GPU Driver Version: 570.133.20
- Issue Type: Nemo embedding pod kept failing liveness probe and restarting
- How to reproduce the issue?
I am trying to deploy the VSS blueprint by following the steps from:
- (Setup the Prerequisites — Video Search and Summarization Agent)
- (Deploy Using Helm — Video Search and Summarization Agent)
These are the steps taken:
(The driver and Fabric Manager are preinstalled on the server.)
sudo snap install microk8s --classic
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
sudo snap install kubectl --classic
export NGC_API_KEY=<YOUR_LEGACY_NGC_API_KEY>
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=$NGC_API_KEY
sudo microk8s kubectl create secret generic graph-db-creds-secret \
    --from-literal=username=neo4j --from-literal=password=password
sudo microk8s kubectl create secret generic ngc-api-key-secret \
    --from-literal=NGC_API_KEY=$NGC_API_KEY
sudo microk8s helm fetch \
    https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.3.0.tgz \
    --username='$oauthtoken' --password=$NGC_API_KEY
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
    --set global.ngcImagePullSecretName=ngc-docker-reg-secret
My NeMo embedding pod never becomes ready: it keeps failing its liveness probe and ends up in CrashLoopBackOff. Here are the status, events, and logs:
Pods Status:
Containers:
  embedding-container:
    Container ID:   containerd://5bb1d1fd40690aa7803339bc11680389043aa4083223ed035b5f700333c08c12
    Image:          nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0
    Image ID:       nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2@sha256:b7ff8b85e9661699f6803ad85a48ed41a5c52212284ca4632a3bd240aee61859
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 23 May 2025 10:54:56 +0800
      Finished:     Fri, 23 May 2025 10:56:55 +0800
    Ready:          False
    Restart Count:  23
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
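For reference, the exit code and timestamps in the status above can be decoded with a few lines of shell. This is only a sketch: it assumes GNU `date` and the usual convention that an exit code above 128 means the process was ended by signal number (code - 128). The variable names are mine and purely illustrative.

```shell
# Exit code and timestamps copied from the pod status above.
exit_code=137
started="2025-05-23 10:54:56 +0800"
finished="2025-05-23 10:56:55 +0800"

# Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
signal_number=$((exit_code - 128))
signal_name=$(kill -l "$signal_number")
echo "exit code $exit_code -> signal $signal_number (SIG$signal_name)"

# How long the last attempt ran before it was terminated (GNU date assumed).
run_seconds=$(( $(date -d "$finished" +%s) - $(date -d "$started" +%s) ))
echo "container ran for $run_seconds seconds"
```

This reports signal 9 (SIGKILL) and a run time of 119 seconds. A SIGKILL after roughly two minutes would be consistent with the kubelet killing the container once the liveness probe gave up, although exit code 137 on its own cannot rule out other causes such as an out-of-memory kill.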
Pod Events:
Events:
  Type     Reason     Age                  From     Message
  Normal   Killing    5m6s (x6 over 15m)   kubelet  Container embedding-container failed liveness probe, will be restarted
  Normal   Created    4m36s (x7 over 16m)  kubelet  Created container: embedding-container
  Normal   Pulled     4m36s (x6 over 14m)  kubelet  Container image "nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0" already present on machine
  Normal   Started    4m35s (x7 over 16m)  kubelet  Started container embedding-container
  Warning  Unhealthy  3m6s (x21 over 16m)  kubelet  Liveness probe failed: Get "http://10.1.129.211:8000/v1/health/ready": dial tcp 10.1.129.211:8000: connect: connection refused
  Warning  BackOff    60s (x9 over 2m36s)  kubelet  Back-off restarting failed container embedding-container in pod nemo-embedding-embedding-deployment-59d77cdcc4-k42hm_default(54ee3d80-1114-4169-a9a5-b12db1c28bb4)
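Since the events point at the liveness probe, one way to see which probe timings the chart actually applied is to print the probe definition from the Deployment. The sketch below derives the Deployment name from the pod name in the events, on the assumption that the pod follows the usual `<deployment>-<replicaset-hash>-<suffix>` naming, and filters on the container name `embedding-container` shown above. The `kubectl` call is skipped when `microk8s` is not installed.

```shell
# Pod name copied from the BackOff event above.
pod_name="nemo-embedding-embedding-deployment-59d77cdcc4-k42hm"

# Assumption: standard Deployment pod naming, so dropping the last two
# hyphen-separated parts (ReplicaSet hash and pod suffix) gives the Deployment.
deployment_name="${pod_name%-*-*}"
echo "deployment: $deployment_name"

# Print the liveness probe configured for the embedding container.
if command -v microk8s >/dev/null 2>&1; then
  sudo microk8s kubectl get deployment "$deployment_name" \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="embedding-container")].livenessProbe}'
fi
```

Fields such as `initialDelaySeconds`, `periodSeconds`, and `failureThreshold` in that output would show how long the container is given to start listening on port 8000 before it gets restarted.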
Pod Log:
Overriding NIM_LOG_LEVEL: replacing NIM_LOG_LEVEL=unset with NIM_LOG_LEVEL=INFO
Running automatic profile selection: NIM_MANIFEST_PROFILE is not set
Selected profile: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379
"timestamp": "2025-05-23 02:54:57,379", "level": "INFO", "message": "Matched profile_id in manifest from env NIM_MODEL_PROFILE to: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379"
"timestamp": "2025-05-23 02:54:57,379", "level": "INFO", "message": "Using the profile specified by the user: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379"
"timestamp": "2025-05-23 02:54:57,379", "level": "INFO", "message": "Downloading manifest profile: 8918df0dca55add3cce5d64bd465a9b4970951d45fe1742daedab84d3092e379"
2025-05-23T02:54:57.379831Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https
2025-05-23T02:54:59.180400Z INFO nim_hub_ngc::api::tokio: Downloaded filename: special_tokens_map.json to blob: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3-2-nv-embedqa-1b-v2/blobs/1acefce3b827946744850199a9618076"
2025-05-23T02:54:59.181011Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https
2025-05-23T02:55:01.569852Z INFO nim_hub_ngc::api::tokio: Downloaded filename: tokenizer_config.json to blob: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3-2-nv-embedqa-1b-v2/blobs/3ee146788ee5492ffdf691a066305e40"
2025-05-23T02:55:01.570446Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https
2025-05-23T02:55:02.916547Z INFO nim_hub_ngc::api::tokio: Downloaded filename: metadata.json to blob: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3-2-nv-embedqa-1b-v2/blobs/316ea9bb2437bae9ce6034f970f33835"
2025-05-23T02:55:02.917095Z INFO nim_hub_ngc::api::tokio::builder: ngc configured with api_loc: api.ngc.nvidia.com auth_loc: authn.nvidia.com scheme: https













