Whatever NVIDIA image I use, I get a timeout error during deployment. The instance does seem to be created and looks OK, but it's still worrisome. I've tried it multiple times and with multiple NVIDIA image types.
Details:
nvidia-ngc-image has resource warnings
nvidia-ngc-image-1-software: {“ResourceType”:“runtimeconfig.v1beta1.waiter”,“ResourceErrorCode”:“504”,“ResourceErrorMessage”:“Timeout expired.”}
Overview - nvidia-ngc-image-1
nvidia-ngc-image nvidia-ngc-image.jinja
nvidia-ngc-image-vm-tmpl vm_instance.py
nvidia-ngc-image-1-vm vm instance
software-status software_status.py
nvidia-ngc-image-1-config config
nvidia-ngc-image-1-software config waiter <---- this seems to be where the failure happens.
software-status-script software_status_script.py
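For anyone scripting around these deployments, here is a minimal Python sketch of how the warning line quoted above could be recognized as a waiter timeout rather than some other failure. The helper names (parse_resource_error, is_waiter_timeout) are hypothetical, and the "<resource-name>: {json}" layout is assumed from the message as pasted in this thread. The forum shows the payload with typographic quotes, so the sketch converts them to plain quotes before parsing.

```python
import json

# Typographic quotes (as rendered in the forum post) mapped to plain ASCII quotes.
_QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"'})


def parse_resource_error(line):
    """Split '<resource-name>: {json}' into (resource name, parsed dict).

    The layout is an assumption based on the message quoted in this thread.
    """
    name, _, payload = line.partition(": ")
    return name.strip(), json.loads(payload.translate(_QUOTE_MAP))


def is_waiter_timeout(error):
    """True if the error is a 504 reported by a runtimeconfig waiter resource."""
    return (
        error.get("ResourceType") == "runtimeconfig.v1beta1.waiter"
        and error.get("ResourceErrorCode") == "504"
    )


sample = (
    "nvidia-ngc-image-1-software: "
    "{“ResourceType”:“runtimeconfig.v1beta1.waiter”,"
    "“ResourceErrorCode”:“504”,“ResourceErrorMessage”:“Timeout expired.”}"
)
name, error = parse_resource_error(sample)
print(name, is_waiter_timeout(error))
```

A match here only says that the waiter gave up; it does not by itself tell you whether the VM was created.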
I’m still seeing this issue using “PyTorch from NVIDIA”:
nvidia-gpu-cloud-pytorch-image has resource level errors
nvidia-gpu-cloud-pytorch-image-1-software: {“ResourceType”:“runtimeconfig.v1beta1.waiter”,“ResourceErrorCode”:“504”,“ResourceErrorMessage”:“Timeout expired.”}
I just deployed that image in us-west1-b with a V100. It did seem to take an unusually long time. After you get the timeout, go ahead and check the GCP console to see whether a VM instance was actually created. We have seen timeout messages displayed even though an instance was successfully created. Obviously, you won’t want GPU instances running that you aren’t using.
I’ll report this to GCP and suggest you do the same. You wouldn’t want an instance you weren’t aware of hanging around for days and billing your account.
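If you would rather check from a script than from the console, the following is a rough Python sketch, not an official procedure. It assumes the gcloud CLI is installed and authenticated for the right project, and that the instances in question have names starting with "nvidia-" (a guess based on the deployment names in this thread). The function names are hypothetical; the "name", "status" and "zone" fields are the ones returned by gcloud compute instances list --format=json.

```python
import json
import subprocess


def list_instances():
    """Return all Compute Engine instances in the active gcloud project.

    Assumes the gcloud CLI is installed and already authenticated.
    """
    result = subprocess.run(
        ["gcloud", "compute", "instances", "list", "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def running_with_prefix(instances, prefix):
    """Return (name, zone) for RUNNING instances whose name starts with prefix."""
    return [
        # The zone field is a full resource URL; keep only its last path segment.
        (inst["name"], inst["zone"].rsplit("/", 1)[-1])
        for inst in instances
        if inst["name"].startswith(prefix) and inst.get("status") == "RUNNING"
    ]


# Example usage (calls gcloud, so it is left commented out here):
# for name, zone in running_with_prefix(list_instances(), "nvidia-"):
#     print(f"still running: {name} in {zone}")
```

Anything this reports that you are not using can then be stopped or deleted from the console.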
The same error has appeared again when I try to create an instance based on “NVIDIA GPU-Optimized Image for PyTorch”:
nvidia-gpu-cloud-pytorch-image-2-software: {“ResourceType”:“runtimeconfig.v1beta1.waiter”,“ResourceErrorCode”:“504”,“ResourceErrorMessage”:“Timeout expired.”}