VSS Deployment - "vss-blueprint-0" Pod Keeps Crashing

I have been trying to deploy the NVIDIA VSS model on a server equipped with eight A100 GPUs. I successfully started the Kubernetes pods using a microk8s cluster. I then downloaded and deployed the NGC project for the VSS model.

The supporting pods all show a “Running” status and are reported as ready. However, the pod named “vss-blueprint-0” keeps restarting in a loop while its status shows “Running”. It never becomes ready, so the “vss-vss-deployment-” pod stays stuck in its init phase.

I waited for more than 30 minutes, but this pod just restarted several times without making any progress.

I tried increasing the failureThreshold, but it made no difference. The log of this pod is below for reference:

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.3.0
Model: meta/llama-3.1-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement).

ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.

{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 0462612f0f2de63b2d423bc3863030835c0fbdbc13b531868670cc416e030029 is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 09af04392c0375ae5493ca5e6ea0134890ac28f75efd244a57f414f86e97b133 is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 0b0193d56ec0bba1840ea429993c776f9168a1ca4699e81f4db48319dd7e5c3a is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 162948dba7374caeb8f7886f7c62a105fd198cfc2dd533aa1cdb34eaea872af0 is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 3195a1b385c57c3cae2113f63a37c6ad5aacfd17915922b6a3abf109aa210606 is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 338caffa8378682a78aba3921720011f23ead03e8827e484a5333317b97c7527 is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 395082aa40085d35f004dd3056d7583aea330417ed509b4315099a66cfc72bdd is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Profile 4141ba6fa7f7f67f402f0878ae823817775c473a745d55096b9cfe836c968df7 is not fully defined with checksums", "exc_info": "None", "stack_info": "None"}

And here is the status of the pods:

NAME                                                   READY   STATUS     RESTARTS       AGE
etcd-etcd-deployment-6c6c94c64c-qtdww                  1/1     Running    0              46m
milvus-milvus-deployment-77d974cdd9-dkd2l              1/1     Running    0              46m
minio-minio-deployment-6f85f8b94b-cjz7l                1/1     Running    0              46m
nemo-embedding-embedding-deployment-654cdcb5c8-sh7rc   1/1     Running    0              46m
nemo-rerank-ranking-deployment-6bf8c94b54-zqgsq        1/1     Running    0              46m
neo4j-neo4j-deployment-8bb9b8b69-z8dcg                 1/1     Running    0              46m
vss-blueprint-0                                        0/1     Running    2 (5m1s ago)   15m
vss-vss-deployment-8f96df479-slqrb                     0/1     Init:2/3   0              46m

Finally, the only event shown in the “describe” output for this pod is a connection refused error on the TCP call to the startupProbe endpoint.
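
For context on the failureThreshold change, here is a rough sketch of how I understand the startup probe time budget, assuming standard Kubernetes probe semantics: the container is restarted after failureThreshold consecutive probe failures, with one probe every periodSeconds after initialDelaySeconds. The function names and the numbers are placeholders of mine, not values taken from the VSS chart, and the estimate ignores timeoutSeconds.

```python
import math


def startup_budget_seconds(failure_threshold: int,
                           period_seconds: int = 10,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container has to start before the kubelet restarts it."""
    return initial_delay_seconds + failure_threshold * period_seconds


def min_failure_threshold(required_seconds: int,
                          period_seconds: int = 10,
                          initial_delay_seconds: int = 0) -> int:
    """Smallest failureThreshold that gives at least `required_seconds` of startup time."""
    remaining = max(0, required_seconds - initial_delay_seconds)
    return max(1, math.ceil(remaining / period_seconds))


if __name__ == "__main__":
    # Hypothetical values: 30 failures at a 10 s period gives a 5 minute budget.
    print(startup_budget_seconds(failure_threshold=30, period_seconds=10))
    # Hypothetical target: allow 30 minutes for the model to load.
    print(min_failure_threshold(required_seconds=1800, period_seconds=10))
```

If the budget computed from the values actually applied to the pod is shorter than the time the model needs to load, the restarts would be expected. The increased failureThreshold may also not have reached the running pod spec at all.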