timeout when creating Nvidia images on GCP

yaron4ubwc · June 27, 2019, 1:11pm

I am creating Nvidia images on GCP, following the "Using NGC with Google Cloud Platform Setup Guide" (NVIDIA GPU Cloud Documentation),
using a VM instance with 2 GPUs (V100).

Whatever Nvidia image I use, I get a timeout error during deployment. The instance does seem to be created and looks OK, but it’s still worrisome. I have tried it multiple times and with multiple Nvidia image types.

Details:
nvidia-ngc-image has resource warnings
nvidia-ngc-image-1-software: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}

Overview - nvidia-ngc-image-1
nvidia-ngc-image nvidia-ngc-image.jinja
nvidia-ngc-image-vm-tmpl vm_instance.py
nvidia-ngc-image-1-vm vm instance
software-status software_status.py
nvidia-ngc-image-1-config config
nvidia-ngc-image-1-software config waiter <---- this seems to be where the failure happens.
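
For anyone triaging the same message, here is a minimal Python sketch that splits a warning line like the one above into the resource name and its JSON details, then checks whether it is a runtimeconfig waiter timeout. The function names `parse_resource_warning` and `is_waiter_timeout` are made up for illustration and are not part of any Google Cloud or NVIDIA tooling. The quote normalization is only there because text copied from the forum may carry typographic quotes.

```python
import json

# Map typographic quotes (as a forum may render them) to plain ASCII quotes.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def parse_resource_warning(line):
    """Split '<resource-name>: {json}' into (name, details dict)."""
    name, _, payload = line.partition(": ")
    return name.strip(), json.loads(payload.translate(_QUOTES))


def is_waiter_timeout(details):
    """True when the failing resource is a runtimeconfig waiter reporting 504."""
    resource_type = details.get("ResourceType", "")
    return (
        resource_type.startswith("runtimeconfig.")
        and resource_type.endswith(".waiter")
        and details.get("ResourceErrorCode") == "504"
    )


line = (
    'nvidia-ngc-image-1-software: {"ResourceType":"runtimeconfig.v1beta1.waiter",'
    '"ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}'
)
name, details = parse_resource_warning(line)
print(name, is_waiter_timeout(details))
```

A match here only says that the waiter step timed out. It does not say whether the VM itself was created.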
software-status-script software_status_script.py

gcrider · July 15, 2019, 8:13pm

A new image has been posted on the Google Cloud marketplace that should eliminate this issue. Please let us know if you see this concern again.

aaronbriel · February 6, 2020, 8:59pm

I’m still seeing this issue using “PyTorch from NVIDIA”:
nvidia-gpu-cloud-pytorch-image has resource level errors
nvidia-gpu-cloud-pytorch-image-1-software: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}

gcrider · February 6, 2020, 9:31pm

Hi, are you using the 19.11.3 image? Which zones and GPU types have you tried? I can try and recreate.

Greg Crider
NGC PM

aaronbriel · February 6, 2020, 9:42pm

Yes, it’s the 19.11.3 image here: Google Cloud console

I wasn’t aware zones mattered, so I’m trying a central one now. The only GPU type available for the image appears to be the Tesla V100.

aaronbriel · February 6, 2020, 10:12pm

Same issue. I also get the same issue in us-west1-b, which had a Tesla P100 available.

gcrider · February 6, 2020, 10:27pm

I just deployed that image in us-west1-b with a V100. It did seem to take an unusually long time. After you get the timeout, go ahead and check the GCP console to see if a VM instance was actually created. We have seen timeout messages displayed even though an instance was successfully created. Obviously, you won’t want GPU instances running that you aren’t using.
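
The same check can be scripted instead of done in the console. The sketch below assumes an installed and authenticated gcloud CLI and uses `gcloud compute instances list --format=json`. The helper names, the sample project name, and the sample data are hypothetical. The filtering is done in Python on the `name`, `zone`, and `status` fields of the JSON output.

```python
import json
import subprocess


def list_instances_json():
    """Return raw JSON from gcloud (assumes gcloud is installed and authenticated)."""
    result = subprocess.run(
        ["gcloud", "compute", "instances", "list", "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def find_deployment_vms(instances_json, name_prefix):
    """Return (name, zone, status) for instances whose name starts with name_prefix."""
    return [
        # The zone field is a full resource URL; keep only its last path segment.
        (vm["name"], vm["zone"].rsplit("/", 1)[-1], vm["status"])
        for vm in json.loads(instances_json)
        if vm["name"].startswith(name_prefix)
    ]


# Hypothetical sample shaped like the gcloud output, so the sketch runs offline.
sample = json.dumps([
    {
        "name": "nvidia-gpu-cloud-pytorch-image-1-vm",
        "zone": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b",
        "status": "RUNNING",
    },
    {
        "name": "unrelated-vm",
        "zone": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a",
        "status": "TERMINATED",
    },
])
matches = find_deployment_vms(sample, "nvidia-gpu-cloud-pytorch-image")
print(matches)
```

In real use, pass `list_instances_json()` in place of `sample`. An instance reported as RUNNING that you do not need can be stopped with `gcloud compute instances stop NAME --zone ZONE`.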

aaronbriel · February 6, 2020, 10:39pm

Ok, it looks like the instance was indeed created. The warnings aren’t marked as such, however, but rather as “resource errors”. Thank you!

gcrider · February 6, 2020, 10:45pm

I’ll report this to Google Cloud and suggest you do the same. You wouldn’t want an instance you weren’t aware of hanging around for days and billing your account.

morzhakovva · June 30, 2021, 12:20pm

The same error has appeared again when I try to create an instance based on “NVIDIA GPU-Optimized Image for PyTorch”:
nvidia-gpu-cloud-pytorch-image-2-software: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}
