Quadro RTX 6000 GPU Cards Disappearing

We recently purchased a server with 4 Quadro RTX 6000 GPUs in it. We installed CentOS 7.6 along with the drivers from the CUDA repo.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:12:00.0 Off |                  Off |
| 35%   32C    P8    21W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:5C:00.0 Off |                  Off |
| 33%   34C    P8    21W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:86:00.0 Off |                  Off |
| 33%   27C    P8     4W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     On   | 00000000:AF:00.0 Off |                  Off |
| 33%   28C    P8    14W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The problem we are having is that when a user runs an AMBER process on each GPU, the GPUs disappear from nvidia-smi and /dev after some time. “lspci | grep -i nvidia” shows nothing either. The vendor management interface states “Unknown” for each PCIe slot that should be showing a Quadro RTX 6000. After speaking with the server vendor, we have updated the BIOS and the management firmware. They state we should update the firmware on the GPU cards. I assume they mean drivers, as I can’t find anything about updating firmware for these cards.

This behavior is intermittent, though; it doesn’t occur every time. I ran the NVIDIA HPL program across all CPUs and GPUs at 100% utilization on both and had no problems.

Should we update to the driver version available through the NVIDIA site (430.50)? We have proper cooling and power for this node. There are no messages in the management logs about hitting temperature or power peaks, and when running the HPL benchmark the readings were at about 75% of the caution set point at their highest.

Thanks

Any messages in the logs (e.g. /var/log/messages) about “GPU has fallen off the bus”?
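
A quick way to look is something along these lines on the affected node. The log paths and the extra “Xid” / “NVRM” patterns are assumptions (Xid events and NVRM kernel messages often accompany a GPU falling off the bus); adjust for your distro:

```shell
# Sketch: scan common log locations for signs of a GPU dropping off the bus.
# Log paths and the Xid/NVRM patterns are assumptions; adjust as needed.
for log in /var/log/messages /var/log/syslog; do
    [ -r "$log" ] && grep -iE "fallen off the bus|Xid|NVRM" "$log"
done
# The kernel ring buffer often has the relevant lines as well:
dmesg 2>/dev/null | grep -iE "fallen off the bus|Xid|NVRM"
echo "log scan complete"
```

On a systemd-based system, `journalctl -k` is another place to search for the same patterns.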

GPUs “disappearing” under heavy computational load is correlated with insufficient power supply. AMBER places a different load profile on the GPUs than HPL. Did you buy this machine with the Quadros already installed, or did you install all or some of them yourselves?

I have used Quadros for many years, but recall only rare instances of VBIOS upgrades being available through system integrators (HP, Dell, etc), never from NVIDIA itself. Maybe someone else has a good pointer in that regard.

What are the specs of the host system (CPU[s], memory, mass storage), and what are the ratings of the PSU[s] (power supply unit)? Taking a guess at your system configuration, my rule of thumb indicates you would need a 2000W PSU for rock-solid operation.

Updating to the latest driver available for your GPU is usually a good idea, but is also unlikely to resolve this particular issue.

njuffa,

Thanks for the reply. I don’t see any messages in our syslog server about “GPU has fallen off the bus” or anything related to the bus. The only indication we get when they disappear is a SLURM notification, because it can’t access /dev/nvidia[0-3].
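
For reference, a minimal check in the same spirit as that SLURM notification might look like this. The script itself is hypothetical; /dev/nvidia0 through /dev/nvidia3 are the standard device nodes for a 4-GPU box:

```shell
# Hypothetical health check: verify /dev/nvidia0..3 all exist,
# and report how many are missing (they vanish when the GPUs drop off).
missing=0
for i in 0 1 2 3; do
    if [ ! -e "/dev/nvidia$i" ]; then
        echo "MISSING: /dev/nvidia$i"
        missing=$((missing + 1))
    fi
done
echo "missing devices: $missing"
```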

We actually have 5 of these systems, all doing the same thing, so your assessment that it could be the power is plausible, as they all have that in common. These are pre-built systems from HP.

CPU - Dual Intel Gold 6226 (12 cores @ 2.7GHz)
Memory - 12 x 16GB @ 2933MHz
Storage - Single 1TB 7.2K HDD
PSU - Redundant 1600W

Looking at the past 24 hours of the power meter for one of the servers through the iLO, it shows a maximum of 1517W was reached. So it is very possible that it peaked just a bit higher and caused this issue. However, there are no power events logged in the iLO, which I would think there would be.

I will take this information to HP and see what they say.

Norbert’s rule of thumb for rock-solid operation: Total nominal wattage of system components should not exceed 60% of nominal wattage of PSU.

Both CPUs and GPUs produce short term (millisecond) spikes in power consumption as loads change, usually around 25% higher than what you can see with low resolution reporting, such as nvidia-smi. You want more headroom to account for TDP specs on CPU (!= maximum power), manufacturing tolerances in the PSU, aging of critical electronic components over lifetime of the system, etc, etc.

In my experience, any usage above 85% of nominal PSU wattage will (at the most inopportune times) cause spontaneous reboots. Below that threshold, insufficient power supply can cause local brown-outs, which cause electronic switching speeds to decrease, which causes electronic components to malfunction (e.g. due to violation of timing requirements at flip-flops, etc).

4 x Quadro RTX 6000 = 1040W
2 x Intel Gold 6226 = 250W
192 GB DDR4 @0.4 W/GB = 77W
1x HDD = 6W
Motherboard components = 15W

total nominal power = 1388W

So my rule of thumb says you would want total nominal PSU wattage of 2300W. If you call my rule of thumb overly conservative (as some do :-), and go with, say, maximum load of 70% that puts you at 2000W nominal PSU wattage. 80 PLUS Titanium compliant PSUs recommended for this kind of high-end server, at minimum use 80 PLUS Platinum compliant ones.
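
As a sanity check, the arithmetic above can be reproduced with plain shell integer math. The component wattages are the nominal figures quoted in this thread; the 60% and 70% factors are the rule-of-thumb load limits under discussion:

```shell
# Nominal component power draw (watts), per the figures quoted above.
gpu=$((4 * 260))   # 4x Quadro RTX 6000, 260W board power each
cpu=$((2 * 125))   # 2x Xeon Gold 6226, 125W TDP each
mem=77             # 192 GB DDR4 at ~0.4 W/GB
hdd=6              # single 7.2K HDD
mobo=15            # motherboard components
total=$((gpu + cpu + mem + hdd + mobo))
echo "total nominal load: ${total}W"

# Rule of thumb: nominal load should not exceed 60% (or, less
# conservatively, 70%) of the PSU rating, i.e. the PSU should be
# rated at least total/0.6 (ceiling via integer math).
psu60=$(( (total * 100 + 59) / 60 ))
psu70=$(( (total * 100 + 69) / 70 ))
echo "PSU rating needed at 60% load: ${psu60}W"
echo "PSU rating needed at 70% load: ${psu70}W"
```

This yields 1388W total, and roughly 2300W / 2000W PSU ratings for the 60% and 70% limits, matching the figures above.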

I am not sure how to interpret “redundant 1600W” PSU. Does that supply 3200W, or 1600W from one PSU at any given time with the other PSU acting as an idling spare? If the former, it would be best to load both PSUs about equally. If the latter, you have a system with an insufficient power supply.

Discussing your setup with the system integrator (HP) seems like a good idea. They should also be able to advise you on the availability of new VBIOS versions for the Quadro GPUs.

Thanks for the insight. From what I have read about HP’s power profiles, they are in a balanced configuration, sharing the load equally. I believe by “redundant” they mean the PSUs are in an N+1 configuration, so the system can’t draw more than the 1600W maximum. That is how I am interpreting it.

Thank you again.

Riley

It may very well be an NVIDIA problem, but when HPE sells Quadro products they own the support process as well, even if the root cause lies with NVIDIA.

If you have purchased a server from HPE configured by HPE with 4xQuadro RTX6000, that is supposed to work properly. If it doesn’t, first and foremost insist that a support ticket be opened by HPE and assigned to your case. That ticket will have an HPE supplied ticket number. They should supply you that ticket number.

Escalate the issue with HPE. Then if you are not satisfied with the support from HPE, send me a private message on these forums with your support ticket number, and I will escalate your issue directly with HPE.

Be advised that you will need to provide clear, explicit instructions for how to reproduce the failure. Also, be sure to use a GPU driver that HPE recommends for their system. They should be able to provide you with instructions and/or a download link for the latest GPU driver recommended by HPE for the product they sold you. You should be using that driver. If it still fails, then escalate.
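
When filing the ticket, it helps to record exactly which driver was loaded at the time. A sketch using nvidia-smi’s query flags, which exist in the 418.x driver (the fallback branch is only there so the snippet degrades gracefully on a machine without the tool):

```shell
# Sketch: capture the loaded driver version for the support ticket.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
else
    driver="nvidia-smi not found"
fi
echo "driver: $driver"
```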

FWIW, I don’t call it overly conservative. In my line of work we have to be very conservative, and that makes perfect sense to me.