Timeout when creating NVIDIA images on GCP

I am creating NVIDIA images on GCP, following https://docs.nvidia.com/ngc/ngc-gcp-setup-guide/introduction-to-using-ngc-gcp.html#introduction-to-using-ngc-gcp
I am using a VM instance with 2 GPUs (V100).

Whichever NVIDIA image I use, I get a timeout error during deployment. The instance does seem to be created and looks OK, but it’s still worrisome. I have tried this multiple times and with multiple NVIDIA image types.

nvidia-ngc-image has resource warnings
nvidia-ngc-image-1-software: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}

Overview - nvidia-ngc-image-1
nvidia-ngc-image nvidia-ngc-image.jinja
nvidia-ngc-image-vm-tmpl vm_instance.py
nvidia-ngc-image-1-vm vm instance
software-status software_status.py
nvidia-ngc-image-1-config config
nvidia-ngc-image-1-software config waiter <---- this seems to be where the failure happens.
software-status-script software_status_script.py

A new image has been posted on the Google Cloud marketplace that should eliminate this issue. Please let us know if you see this concern again.

I’m still seeing this issue using “PyTorch from NVIDIA”:
nvidia-gpu-cloud-pytorch-image has resource level errors
nvidia-gpu-cloud-pytorch-image-1-software: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}
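
For anyone who wants to check these messages in a script, here is a minimal Python sketch that splits a resource message like the one quoted above into its resource name and JSON details, then tests whether it is the waiter timeout (code 504) discussed in this thread. The function names are made up for illustration, and the quote normalization is only there in case the message was copied from a page that converted the quotes to typographic ones.

```python
import json

# Map typographic double quotes (common in copy-pasted text) to ASCII quotes.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def parse_resource_message(line):
    """Split 'resource-name: {json}' into (name, details dict)."""
    name, _, payload = line.partition(": ")
    return name.strip(), json.loads(payload.translate(_QUOTES))


def is_waiter_timeout(details):
    """True if the details describe a runtimeconfig waiter that timed out."""
    return (
        details.get("ResourceType") == "runtimeconfig.v1beta1.waiter"
        and details.get("ResourceErrorCode") == "504"
    )


line = (
    'nvidia-gpu-cloud-pytorch-image-1-software: '
    '{"ResourceType":"runtimeconfig.v1beta1.waiter",'
    '"ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}'
)
name, details = parse_resource_message(line)
print(name, is_waiter_timeout(details))
```

A match only says that the waiter gave up waiting; it does not say whether the VM itself was created.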

Hi, are you using the 19.11.3 image? Which zones and GPU types have you tried? I can try to reproduce the problem.

Greg Crider

Yes, it’s the 19.11.3 image here: https://console.cloud.google.com/marketplace/details/nvidia-ngc-public/nvidia-gpu-cloud-pytorch-image

I wasn’t aware that zones mattered, so I’m trying a central one now. The only GPU type available for the image appears to be the Tesla V100.

Same issue. I also get the same issue in us-west1-b, which had a Tesla P100 available.

I just deployed that image in us-west1-b with a V100. It did seem to take an unusually long time. After you get the timeout, go ahead and check the GCP console to see if a VM instance was actually created. We have seen timeout messages displayed even though an instance was successfully created. Obviously, you won’t want GPU instances running that you aren’t using.
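
The same check can be done from a script instead of the console. The sketch below is a rough Python example, assuming the gcloud CLI is installed and authenticated for the project in question; it filters the JSON output of "gcloud compute instances list --format=json" for running instances that have GPUs attached. The function names and the sample record are illustrative only and are not taken from a real deployment.

```python
import json
import subprocess


def list_instances():
    """Return all Compute Engine instances in the active gcloud project.

    Assumes the gcloud CLI is installed and authenticated.
    """
    result = subprocess.run(
        ["gcloud", "compute", "instances", "list", "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def running_gpu_instances(instances):
    """Keep RUNNING instances that have at least one GPU attached."""
    found = []
    for inst in instances:
        gpus = inst.get("guestAccelerators") or []
        if inst.get("status") == "RUNNING" and gpus:
            # The zone is reported as a URL; keep only its last segment.
            zone = inst.get("zone", "").rsplit("/", 1)[-1]
            count = sum(int(g.get("acceleratorCount", 0)) for g in gpus)
            found.append((inst["name"], zone, count))
    return found


# Hypothetical sample shaped like the gcloud JSON output, for illustration.
sample = [
    {
        "name": "nvidia-ngc-image-1-vm",
        "zone": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b",
        "status": "RUNNING",
        "guestAccelerators": [{"acceleratorCount": 1, "acceleratorType": "nvidia-tesla-v100"}],
    },
    {"name": "cpu-only-vm", "zone": "zones/us-central1-a", "status": "RUNNING"},
]
for name, zone, count in running_gpu_instances(sample):
    print(f"{name} in {zone}: {count} GPU(s) attached")
```

To run it against a real project, pass list_instances() to running_gpu_instances() instead of the sample list.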

OK, it looks like the instance was indeed created. However, the messages aren’t labeled as warnings but as “resource errors”. Thank you!

I’ll report this to Google Cloud and suggest you do the same. You wouldn’t want an instance that you weren’t aware of hanging around for days and billing your account.