Getting errors while running the Blueprint VSS demo

I am currently working with a virtual machine configured with 8xA100 (40GB) GPUs and attempting to run the Blueprint VSS Engine. However, I am encountering several errors during the execution and I have attached the necessary documents and error logs below.

Could you please assist in troubleshooting and resolving the issues related to this configuration?

I would appreciate guidance on the necessary steps that I should take.

Thank you in advance for the help.

vvs-deployment.txt (2.1 KB)
k8_logs.txt (940 Bytes)
describe_vss_logs.txt (4.9 KB)
pods_log.txt (3.0 KB)
get_secrets.txt (527 Bytes)

vss-blueprint-0                                        0/1     Running    11 (4m ago)   88m
vss-vss-deployment-5f7959797c-996mq                    0/1     Init:2/3   0             36m

It looks like the two pods above cannot start properly because of insufficient resources.
Since you are using A100 (40 GB) GPUs, could you try changing the GPU limit under resources to 4?

You can also attach the log by running the command below.

sudo microk8s kubectl logs vss-vss-deployment-POD-NAME
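
Because the pod status is Init:2/3, the third init container has not finished yet, so the logs of that init container may say more than the logs of the main container. A minimal sketch, assuming bash and the same microk8s setup as above; the pod and container names in the usage lines are placeholders, and the KUBECTL variable and both function names are made up for this example:

```shell
#!/usr/bin/env bash
# Assumption: kubectl is reached through microk8s, as in the command above.
KUBECTL="${KUBECTL:-sudo microk8s kubectl}"

# Print the names of a pod's init containers, in the order they run.
init_containers() {
  $KUBECTL get pod "$1" -o jsonpath='{.spec.initContainers[*].name}'
}

# Print the logs of one named container (init or regular) in a pod.
container_logs() {
  $KUBECTL logs "$1" -c "$2"
}

# Usage (placeholders, replace with your own names):
#   init_containers vss-vss-deployment-POD-NAME
#   container_logs  vss-vss-deployment-POD-NAME INIT-CONTAINER-NAME
```

With status Init:2/3, the init container to look at is most likely the third name printed by init_containers.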

Also, how much RAM does your system have?

Thanks for the reply.

In that case, I have increased my resources to:

16xA100(40GB)
1.3 TB RAM
96 Core CPU

I have attached the .yaml and log files, but I am still facing the same issue.

logs.txt (10.3 KB)
overrides.txt (1.5 KB)

Startup may take a long time since you are using A100 (40 GB) GPUs. Have you added --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the helm install command?

  vss:
    applicationSpecs:
      vss-deployment:
        containers:
          vss:
            startupProbe:
              failureThreshold: 360
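
To confirm that an override like this actually reached the cluster, the live Deployment can be queried for the probe value. This is only a sketch: the Deployment name vss-vss-deployment is inferred from the pod name earlier in the thread, the container name vss is taken from the values path above, and KUBECTL and the function name are made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: kubectl is reached through microk8s, as elsewhere in this thread.
KUBECTL="${KUBECTL:-sudo microk8s kubectl}"

# Print the startupProbe failureThreshold of the "vss" container
# in the given Deployment (empty output means no value is set).
startup_failure_threshold() {
  $KUBECTL get deployment "$1" \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="vss")].startupProbe.failureThreshold}'
}

# Usage (Deployment name is an assumption, check it with "get deployments"):
#   startup_failure_threshold vss-vss-deployment
```

If the command prints something other than 360, the value from overrides.yaml was probably not picked up, for example because of wrong indentation or because the file was not passed to helm.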

We tried defining it in overrides.yaml, but we still face the same issue.

@yuweiw

Are there any updates on this issue, please?

We do not currently have an 8xA100 (40 GB) or 16xA100 (40 GB) machine on hand, so we cannot guarantee a successful deployment. The smallest per-GPU memory on which we have deployed successfully is 48 GB (8xL40S).

You mentioned 1.3 TB of RAM earlier, but that figure may be your device's storage rather than its RAM. Could you run top and attach the results? Note that VSS requires at least 256 GB of system memory.
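
As a quick alternative to reading the top output by hand, total RAM can be read from /proc/meminfo and compared with the 256 GB figure. A minimal sketch, assuming a Linux host with bash and awk; the function name mem_total_gib and the variable names are made up for this example:

```shell
#!/usr/bin/env bash
# Print total system RAM in whole GiB, read from a meminfo-style file
# (defaults to /proc/meminfo, which reports MemTotal in kB).
mem_total_gib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

required_gib=256   # minimum system memory quoted in this thread
total_gib="$(mem_total_gib 2>/dev/null)"

if [ "${total_gib:-0}" -ge "$required_gib" ]; then
  echo "OK: ${total_gib} GiB of RAM"
else
  echo "Below requirement: ${total_gib:-unknown} GiB of RAM (< ${required_gib} GiB)"
fi
```

Disk capacity does not appear in MemTotal, so this separates RAM from storage.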

You can also try setting the GPU limit under resources to 0, which means there is no limit.

  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

You can also consider the following two deployment methods.

  1. Use a remote LLM endpoint. The steps are described here: Link.
  2. Use the 7b Llama model instead of the 70b Llama model. The steps are described here: Link.