One-click script deployment of VSS on AWS

To deploy the VSS blueprint on the AWS cloud, we are using a p4d.24xlarge (8 x A100) Ubuntu instance in the us-east-2 (Ohio) region.

We were able to run the OneClick script successfully, and the node was created as expected. However, at the end of the script's execution, the following error was displayed. In the config.yml file, I have also specified the availability zones as us-east-2a and us-east-2b respectively.

After running the script, we executed the following commands to verify the deployment:

  1. We ran the following command to get the frontend URL:

ubuntu@ip-xxx-xx-xx-xx:~/dist$ ./envbuild.sh -f config.yml info

Output:

preparing artifacts

access_urls:
  app:
    backend: http://x.xx.xxx.xxx:30081/
    frontend: http://x.xx.xxx.xxx:30082/
ssh_command:
  app:
    master: ssh -i /home/ubuntu/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@x.xx.xxx.xxx

  2. As per the instructions in the blog, we SSH'd into the instance and ran the following command to list the pods. However, only one pod is listed.

ubuntu@ip-10-0-0-28:~$ kubectl get pod
Output:

NAME       READY   STATUS    RESTARTS   AGE
dnsutils   1/1     Running   0          35m

  3. When attempting to access the frontend URL in the browser, we are unable to reach it (see the reachability check sketched below).
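
For reference, a minimal reachability check from the master node might look like the following (a sketch only: the placeholder IP and the NodePorts 30081/30082 are taken from the info output above, and it assumes the AWS security group allows inbound traffic on those ports):

curl -s -o /dev/null -w "%{http_code}\n" http://x.xx.xxx.xxx:30082/
kubectl get pods -A

If the VSS application pods do not show up in the second command, the NodePort services have nothing to route to, which would also explain the unreachable frontend URL.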

Additionally, I have attached the logs generated during the execution of the script on the Ubuntu machine for your reference.

Kindly assist us in resolving these issues.
I am sharing the error message as well as the config.yml file below.


config.yml.txt (2.9 KB)

We will analyze this problem as soon as possible.

Can you check the GPU operator pods?
kubectl get po -A

Check the logs of any pod under the nvidia-gpu-operator namespace that is failing or stuck.
kubectl logs -n nvidia-gpu-operator <pod name>
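
For example, a quick way to narrow this down (a sketch, using only the standard kubectl field selector on the pod phase) is to list the pods in the GPU Operator namespace that are not in the Running phase:

kubectl get pods -n nvidia-gpu-operator --field-selector=status.phase!=Running

Note that Completed (Succeeded) pods will also match this filter; the ones to look at are those stuck in Pending or Init states.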

Yes, I ran the commands you suggested, but unfortunately we are still seeing the same issue.

After checking the pods, this is the issue I am getting:

ubuntu@ip-xx-x-x-xx:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-6p9nh
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-6p9nh" is waiting to start: PodInitializing

It seems the GPU Operator deployment is having an issue.
Is nvidia-operator-validator* the only pod not in the Running state?

"nvidia-operator-validator-6p9nh" is waiting to start: PodInitializing
This means some of the init containers have not completed successfully.

Could you check the logs of those init containers with the following command?
kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-6p9nh -c <container name>
e.g.
kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-6p9nh -c driver-validation
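
If it is not obvious which init container is stuck, the init containers and their states can be listed first (a sketch using standard kubectl options; the pod name is taken from the output above):

kubectl get pod -n nvidia-gpu-operator nvidia-operator-validator-6p9nh -o jsonpath='{.spec.initContainers[*].name}'
kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-6p9nh

The describe output shows each init container's state and the reason it is waiting, which should point to the failing validation step.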

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.