I am running the Blueprint VSS Engine on a virtual machine configured with 8xA100 (40GB) GPUs. I am encountering several errors during execution and have attached the relevant documents and error logs below.
Could you please assist in troubleshooting and resolving the issues related to this configuration?
I would appreciate guidance on the necessary steps that I should take.
It looks like the two pods above cannot run properly because of insufficient resources.
Since you are using A100 (40GB) GPUs, could you try changing the GPU limit under resources to 4?
You can also attach the logs by running the command below.
Startup may take a long time because you are using A100 (40GB) GPUs. Have you added --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the helm install command?
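To see why raising failureThreshold helps, here is a rough sketch of how a Kubernetes startup probe bounds the time a container is given to come up: roughly failureThreshold multiplied by periodSeconds. The helper name is hypothetical, and the 10-second period is only the Kubernetes default; the period actually used by the VSS chart is an assumption here and should be checked in your deployed values.

```python
def startup_budget_seconds(failure_threshold: int,
                           period_seconds: int = 10,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate upper bound on container startup time allowed by a startup probe.

    period_seconds defaults to 10, the Kubernetes default; the VSS chart may
    override it (assumption: not verified against the chart).
    """
    return initial_delay_seconds + failure_threshold * period_seconds


if __name__ == "__main__":
    budget = startup_budget_seconds(360)
    # With the default 10 s period, 360 failures allow about one hour.
    print(f"{budget} s (~{budget / 60:.0f} min)")
```

If the pod is still restarting after that window, the startup probe is unlikely to be the limiting factor, and the pod logs are the next place to look.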
We do not currently have an 8xA100 (40GB) or 16xA100 (40GB) machine on hand, so we cannot guarantee a successful deployment. The smallest per-GPU memory we have successfully deployed on is 48GB (8xL40s).
The 1.3 TB you reported earlier as RAM may actually be your machine's storage rather than its RAM. Could you run top and attach the output? Note that VSS requires at least 256 GB of system memory.
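As a complement to the top output, a minimal sketch like the following reads the MemTotal field from /proc/meminfo on Linux and compares it with the 256 GB figure mentioned above. The function names are hypothetical, and the threshold is treated as GiB for simplicity. Note that MemTotal is usually slightly lower than the installed RAM, so a machine with exactly 256 GB installed may report a little under the threshold.

```python
import re
from pathlib import Path

REQUIRED_GIB = 256  # minimum system memory stated in this thread


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    return int(match.group(1)) / (1024 ** 2)


def has_enough_ram(meminfo_text: str, required_gib: float = REQUIRED_GIB) -> bool:
    """True if the reported total memory meets the required amount."""
    return mem_total_gib(meminfo_text) >= required_gib


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only
    if meminfo.exists():
        text = meminfo.read_text()
        print(f"MemTotal: {mem_total_gib(text):.1f} GiB, "
              f"meets {REQUIRED_GIB} GiB: {has_enough_ram(text)}")
```

This reports RAM only, so it helps separate the RAM figure from disk capacity.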
You can also try setting the GPU limit under resources to 0, which means there is no limit.
resources:
  limits:
    nvidia.com/gpu: 0 # no limit
You can also consider the following two deployment methods.
Use a remote LLM endpoint. The steps are described here: Link.
Use the 7b llama model instead of the 70b llama model. The steps are described here: Link.