We have run into a problem with our 8x A100 GPU DGX server. One of the GPUs recently started having issues, and we don’t know what the replacement process is, or whether replacement is even possible on these systems, since they use the SXM versions.
Assuming you have a support entitlement, you should contact NVIDIA Enterprise Support ESPCommunity. It’s best to file a case and explain the issue; they will help you debug and resolve it. I have not heard of a single GPU failing on its own, but if the entire GPU tray fails, the procedure is rather straightforward: they send a replacement part and schedule a technician to perform the replacement. The faulty tray is then sent back in a prepaid parcel.
Fingers crossed you’ll be able to resolve the issue asap.
Oh, it’s a Gigabyte server! This was posted in the DGX forum, so I (like @ilb, I suspect) assumed it was an NVIDIA DGX server, which Enterprise Support can help with.
In the case of the Gigabyte server, you should contact Gigabyte directly. They should be able to coordinate replacing the GPU (or potentially the whole GPU tray), depending on what they determine is needed. If necessary, they can escalate to NVIDIA for assistance, but in general they are the ones who handle hardware-level problems on their servers.