URGENT ESCALATION: Locked out of A100x8 (Network crashed, cannot stop/start)

I was directed here by support. My Azure A100x8 instance (try, ID: wdfmpritd) has a crashed network agent. Ports show as ‘Unhealthy’, SSH times out, and the CLI throws a not_found error.

Port 8888 is also returning a 502 Bad Gateway, and Cloudflare shows ‘Host Error’. The machine is completely unreachable.

Because this is an Azure cluster, the dashboard states ‘This environment does not support stopping and starting’. I am completely locked out and have critical data on this volume.

Can an engineer please either:

  1. Force a backend hard-reboot of the node to restore the network.

  2. Detach the volume and mount it to a recovery instance.

A standard user cannot fix this from the UI. Please advise.

Hi, I feel your pain and urgency, but since these are GPUs in an Azure instance, you will need to contact Microsoft Azure support.
NVIDIA has no access to Azure instances in general. If it is possible at all, only Azure support would be able to do something like reset your instance.

all the best
-Frank