A100 PCIe in Disabled* mode

I tried using the A100 Multi-Instance GPU (MIG) feature to run my workload.
After the workload ran successfully I disabled MIG mode, and the GPU now shows MIG as “Disabled*”.
Info from nvidia-smi (sorry, I’m away from the GPU right now, so I can’t upload a screenshot):

MIG M. : Disabled*
Volatile-GPU-Util : N/A

Does anyone know what the “Disabled*” mode means?

I followed the following guide, created my GPU instance and compute instance successfully, and then ran my workload:

Then I used the following command to quit MIG mode:
sudo nvidia-smi -i 0 -mig 0

Resetting my GPU with the following command did not clear the “Disabled*” mode:
sudo nvidia-smi --gpu-reset

Hi @lichking ,

Is this on a DGX system or a PCIe based one? Usually we see “Disabled*” because the GPU needs to be reset for the MIG enable/disable to take effect. While resetting the GPU with nvidia-smi should do the right thing, there’s often something holding a handle to the GPU device (persistence daemon, loaded driver, etc.) or an active NVLink connection that prevents it from actually resetting.

If you reboot the node, does it remain in “Disabled*”?
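
To check whether a mode change is still pending without reading the table output, the current and pending MIG modes can be queried separately. The sketch below assumes GPU index 0 and a driver that exposes the `mig.mode.current` and `mig.mode.pending` query fields; `describe_mig_state` is a hypothetical helper written for this example, not part of nvidia-smi. As far as I understand it, the asterisk is shown when the two values differ.

```shell
# Hypothetical helper: compare the current and pending MIG modes.
describe_mig_state() {
    if [ "$1" = "$2" ]; then
        echo "MIG mode is $1; no change pending"
    else
        echo "MIG mode is $1, pending $2; reset or reboot needed"
    fi
}

# Only query the GPU when nvidia-smi is installed (GPU index 0 is an assumption).
if command -v nvidia-smi >/dev/null 2>&1; then
    # Output looks like "Enabled,Disabled" once the spaces are stripped.
    out=$(nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending \
              --format=csv,noheader | tr -d ' ')
    describe_mig_state "${out%%,*}" "${out##*,}"
fi
```

If the two values still differ after a reset attempt, the reset most likely did not go through.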

Thanks for your reply.
I solved this problem by rebooting my server.

After disabling MIG mode with:
sudo nvidia-smi -i 0 -mig 0
I rebooted the node. After the reboot, MIG mode shows “Disabled” instead of “Disabled*”, as expected.
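
For anyone who wants to try a reset before falling back to a reboot, the advice in this thread could be sketched roughly as follows. This is an untested outline, not a confirmed fix: it assumes GPU index 0, that the persistence daemon runs as a systemd service named `nvidia-persistenced`, and that no other process still has the GPU open. `run` and `disable_mig` are hypothetical helpers. By default the script only prints the commands; set `DRY_RUN=0` to actually execute them.

```shell
# DRY_RUN=1 (the default here) only prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: tear down MIG on one GPU, then try a reset.
disable_mig() {
    gpu="$1"
    run sudo nvidia-smi mig -dci -i "$gpu"        # destroy compute instances
    run sudo nvidia-smi mig -dgi -i "$gpu"        # destroy GPU instances
    run sudo systemctl stop nvidia-persistenced   # assumed service name
    run sudo nvidia-smi -i "$gpu" -mig 0          # request MIG disable
    run sudo nvidia-smi --gpu-reset -i "$gpu"     # apply the pending change
}

disable_mig 0
```

If the reset is still refused, something else is holding the device, and a reboot (as above) remains the reliable fallback.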
