Quadro RTX 6000 GPU Cards Disappearing

We recently purchased a server with 4 Quadro RTX 6000 GPUs in it. We installed CentOS 7.6 along with the drivers from the CUDA repo.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:12:00.0 Off |                  Off |
| 35%   32C    P8    21W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:5C:00.0 Off |                  Off |
| 33%   34C    P8    21W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:86:00.0 Off |                  Off |
| 33%   27C    P8     4W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     On   | 00000000:AF:00.0 Off |                  Off |
| 33%   28C    P8    14W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The problem we are having is that when a user runs an AMBER process on each GPU, the GPUs disappear from nvidia-smi and /dev after some time. “lspci | grep -i nvidia” shows nothing either. The vendor management interface states “Unknown” for each PCIe slot that should be showing a Quadro RTX 6000. After speaking with the server vendor, we have updated the BIOS and the management firmware. They state we should update the firmware on the GPU cards. I assume they mean drivers, as I can’t find anything about updating firmware for these cards.

This behavior is intermittent, though; it doesn’t occur every time. I ran the NVIDIA HPL program across all CPUs and GPUs at 100% utilization on both and had no problems.

Should we update to the driver version available through the NVIDIA site (430.50)? We have proper cooling and power for this node. There are no messages in the management logs about hitting temperature or power peaks, and when running the HPL benchmark the readings were at about 75% of the caution set point at their highest.

Thanks

Any messages in the logs (e.g. /var/log/messages) about “GPU has fallen off the bus”?
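
A quick way to look is something along these lines on the affected node. The log paths and the extra “Xid” / “NVRM” patterns are assumptions (Xid events and NVRM kernel messages often accompany a GPU falling off the bus); adjust for your distro:

```shell
# Sketch: scan common log locations for signs of a GPU dropping off the bus.
# Log paths and the Xid/NVRM patterns are assumptions; adjust as needed.
for log in /var/log/messages /var/log/syslog; do
    [ -r "$log" ] && grep -iE "fallen off the bus|Xid|NVRM" "$log"
done
# The kernel ring buffer often has the relevant lines as well:
dmesg 2>/dev/null | grep -iE "fallen off the bus|Xid|NVRM"
echo "log scan complete"
```

On a systemd-based system, `journalctl -k` is another place to search for the same patterns.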

GPUs “disappearing” under heavy computational load is correlated with insufficient power supply. AMBER places a different load profile on the GPUs than HPL. Did you buy this machine with the Quadros already installed, or did you install all or some of them yourselves?

I have used Quadros for many years, but recall only rare instances of VBIOS upgrades being available through system integrators (HP, Dell, etc), never from NVIDIA itself. Maybe someone else has a good pointer in that regard.

What are the specs of the host system (CPU[s], memory, mass storage), and what are the ratings of the PSU[s] (power supply unit)? Taking a guess at your system configuration, my rule of thumb indicates you would need a 2000W PSU for rock-solid operation.

Updating to the latest driver available for your GPU is usually a good idea, but is also unlikely to resolve this particular issue.

njuffa,

Thanks for the reply. I don’t see any messages in our syslog server about “GPU has fallen off the bus” or anything related to the bus. The only indication we get when they disappear is a SLURM notification, because it can’t access /dev/nvidia[0-3].
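
For reference, a minimal check in the same spirit as that SLURM notification might look like this. The script itself is hypothetical; /dev/nvidia0 through /dev/nvidia3 are the standard device nodes for a 4-GPU box:

```shell
# Hypothetical health check: verify /dev/nvidia0..3 all exist,
# and report how many are missing (they vanish when the GPUs drop off).
missing=0
for i in 0 1 2 3; do
    if [ ! -e "/dev/nvidia$i" ]; then
        echo "MISSING: /dev/nvidia$i"
        missing=$((missing + 1))
    fi
done
echo "missing devices: $missing"
```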

We actually have 5 of these systems, all doing the same thing, so your assessment that it could be the power is plausible, as they all have that in common. These are pre-built systems from HP.

CPU - Dual Intel Gold 6226 (12 cores @ 2.7GHz)
Memory - 12 x 16GB @ 2933MHz
Storage - Single 1TB 7.2K HDD
PSU - Redundant 1600W

Looking at the past 24 hours of the power meter for one of the servers through the iLO, it shows a maximum of 1517W was reached. So it is very possible that it peaked just a bit higher and caused this issue. However, there are no power events logged in the iLO, which I would think there would be.

I will take this information to HP and see what they say.

Norbert’s rule of thumb for rock-solid operation: Total nominal wattage of system components should not exceed 60% of nominal wattage of PSU.

Both CPUs and GPUs produce short term (millisecond) spikes in power consumption as loads change, usually around 25% higher than what you can see with low resolution reporting, such as nvidia-smi. You want more headroom to account for TDP specs on CPU (!= maximum power), manufacturing tolerances in the PSU, aging of critical electronic components over lifetime of the system, etc, etc.

In my experience, any usage above 85% of nominal PSU wattage will (at the most inopportune times) cause spontaneous reboots. Below that threshold, insufficient power supply can cause local brown-outs, which cause electronic switching speeds to decrease, which causes electronic components to malfunction (e.g. due to violation of timing requirements at flip-flops, etc).

4 x Quadro RTX 6000 = 1040W
2 x Intel Gold 6226 = 250W
192 GB DDR4 @0.4 W/GB = 77W
1x HDD = 6W
Motherboard components = 15W

total nominal power = 1388W

So my rule of thumb says you would want total nominal PSU wattage of 2300W. If you call my rule of thumb overly conservative (as some do :-), and go with, say, maximum load of 70% that puts you at 2000W nominal PSU wattage. 80 PLUS Titanium compliant PSUs recommended for this kind of high-end server, at minimum use 80 PLUS Platinum compliant ones.
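
As a sanity check, the arithmetic above can be reproduced with plain shell integer math. The component wattages are the nominal figures quoted in this thread; the 60% and 70% factors are the rule-of-thumb load limits under discussion:

```shell
# Nominal component power draw (watts), per the figures quoted above.
gpu=$((4 * 260))   # 4x Quadro RTX 6000, 260W board power each
cpu=$((2 * 125))   # 2x Xeon Gold 6226, 125W TDP each
mem=77             # 192 GB DDR4 at ~0.4 W/GB
hdd=6              # single 7.2K HDD
mobo=15            # motherboard components
total=$((gpu + cpu + mem + hdd + mobo))
echo "total nominal load: ${total}W"

# Rule of thumb: nominal load should not exceed 60% (or, less
# conservatively, 70%) of the PSU rating, i.e. the PSU should be
# rated at least total/0.6 (ceiling via integer math).
psu60=$(( (total * 100 + 59) / 60 ))
psu70=$(( (total * 100 + 69) / 70 ))
echo "PSU rating needed at 60% load: ${psu60}W"
echo "PSU rating needed at 70% load: ${psu70}W"
```

This yields 1388W total, and roughly 2300W / 2000W PSU ratings for the 60% and 70% limits, matching the figures above.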

I am not sure how to interpret “redundant 1600W” PSU. Does that supply 3200W, or 1600W from one PSU at any given time with the other PSU acting as an idling spare? If the former, it would be best to load both PSUs about equally. If the latter, you have a system with an insufficient power supply.

Discussing your setup with the system integrator (HP) seems like a good idea. They should also be able to advise you on the availability of new VBIOS versions for the Quadro GPUs.

Thanks for the insight. From what I have read about HP’s power profiles, they are in a balanced configuration, sharing the load equally. I believe by “redundant” they mean the PSUs are in an N+1 configuration, so the system can’t draw more than the 1600W maximum. That is how I am interpreting it.

Thank you again.

Riley

It may very well be an NVIDIA problem, but when HPE sells Quadro products they own the support process as well, even if the root cause lies with NVIDIA.

If you have purchased a server from HPE configured by HPE with 4xQuadro RTX6000, that is supposed to work properly. If it doesn’t, first and foremost insist that a support ticket be opened by HPE and assigned to your case. That ticket will have an HPE supplied ticket number. They should supply you that ticket number.

Escalate the issue with HPE. Then if you are not satisfied with the support from HPE, send me a private message on these forums with your support ticket number, and I will escalate your issue directly with HPE.

Be advised that you will need to provide clear, explicit instructions for how to reproduce the failure. Also, be sure to use a GPU driver that HPE recommends for their system. They should be able to provide you with instructions and/or a download link for the latest GPU driver recommended by HPE for the product they sold you. You should be using that driver. If it still fails, then escalate.
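
When filing the ticket, it helps to record exactly which driver was loaded at the time. A sketch using nvidia-smi’s query flags, which exist in the 418.x driver (the fallback branch is only there so the snippet degrades gracefully on a machine without the tool):

```shell
# Sketch: capture the loaded driver version for the support ticket.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
else
    driver="nvidia-smi not found"
fi
echo "driver: $driver"
```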

FWIW, I don’t call it overly conservative. In my line of work we have to be very conservative, and that makes perfect sense to me.