Quadro RTX 6000 causes HPE server to power off - peaks way over power limit

I have an HPE DL580 Gen8 server with four Quadro RTX 6000 cards, and it will frequently power off with a hardware power fault.

Message is:
System Power Fault Detected (XR: 10 A2 MID: FF 0F F0 00 00…

This is apparently an emergency protection shutdown.

The system has a 6000-watt power supply (four 1500-watt supplies), and the cards are fed with the standard HPE 8+6-pin cables from the power distribution panel. It has 3 TiB of RAM and four 18C/36T Xeons. The system is about three years old and has been fine with four GTX 1080 Ti cards, and Maxwell Titans before that.

OS is Ubuntu Server 18.04 LTS, and all Ubuntu and HPE patches are applied.

Looking at ‘nvidia-smi -q -d POWER’, I see samples like:

==============NVSMI LOG==============

Timestamp                           : Tue Mar 26 15:45:12 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 252.35 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.37 sec
        Number of Samples           : 119
        Max                         : 392.15 W
        Min                         : 62.49 W
        Avg                         : 144.21 W

GPU 00000000:81:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 270.69 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.38 sec
        Number of Samples           : 119
        Max                         : 362.39 W
        Min                         : 60.93 W
        Avg                         : 132.90 W

Lowering the power limit makes it better, but it just reduces the probability of a shutdown - it doesn’t eliminate the problem.

==============NVSMI LOG==============

Timestamp                           : Sat Mar 23 11:07:43 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 90.29 W
        Power Limit                 : 150.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 150.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.38 sec
        Number of Samples           : 119
        Max                         : 338.29 W
        Min                         : 65.20 W
        Avg                         : 124.28 W

GPU 00000000:81:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 95.68 W
        Power Limit                 : 150.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 150.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.38 sec
        Number of Samples           : 119
        Max                         : 375.68 W
        Min                         : 62.28 W
        Avg                         : 124.25 W
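A quick way to watch for these transients is to scrape the per-GPU peak out of the Power Samples sections in a loop. A minimal sketch (it assumes the exact text layout shown above, and the ‘peaks’ helper is just an illustrative name, not anything nvidia-smi ships):

```shell
# Pull each 'Max' wattage out of the Power Samples sections.
# 'Max Power Limit' lines sit under Power Readings and have
# 'Power' as the second field, so a bare 'Max :' match after a
# 'Power Samples' header is unambiguous in this format.
peaks() {
  awk '/Power Samples/ {s=1}
       s && $1 == "Max" && $2 == ":" {print $3; s=0}'
}

# Demo against a fragment of the log above; live use would be:
#   nvidia-smi -q -d POWER | peaks
printf '%s\n' \
  '    Power Samples' \
  '        Max                         : 392.15 W' \
  '        Min                         : 62.49 W' | peaks
# -> 392.15
```

Running that every few seconds and logging anything well above the enforced limit makes it easier to correlate the spikes with the shutdowns.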

Any ideas on how to tame the power consumption so that the Quadros don’t frighten the server?

The voltage regulator/power limiter will always exhibit some latency in hardware, so you won’t really get rid of those power spikes.
What kind of redundancy did you configure the PSUs for? Does setting them to N+1 resolve the issue?

The system decided on N+1 (although I’ve never been able to push it past just over 2000 watts, which is quite a bit less than even N+2 capacity).

Power mode is set to “max performance”, which also sets the power supplies to “always full”.

Should be enough power available, then. I’m a bit puzzled why this is happening now with the Quadros - the 1080 Tis would also spike to 400 W. I don’t really know of any options other than contacting HPE or checking the PSUs.

If you’re using them for compute only, maybe check whether setting application clocks with nvidia-smi tames them a bit. Also, make sure nvidia-persistenced is running.
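For reference, application clocks are set as a memory/graphics pair; something like the following (the clock values below are placeholders - pick a pair from the supported list, and this assumes your driver package installed the nvidia-persistenced systemd unit):

```shell
# List the memory/graphics clock pairs this GPU supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin application clocks on GPU 0 as <memClk,graphicsClk> in MHz
# (7000,1305 is a placeholder pair, not a recommendation)
sudo nvidia-smi -i 0 -ac 7000,1305

# Keep the driver loaded (and the settings in place) between jobs
sudo systemctl enable --now nvidia-persistenced
```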

Yes, compute-only. I’ll look at the clock settings.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:41:00.0 Off |                  Off |
| 33%   56C    P2   119W / 200W |  23457MiB / 24190MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:81:00.0 Off |                  Off |
| 38%   62C    P2    97W / 200W |  23457MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I’ve done sudo nvidia-smi -i 0 -pm 1 on each of the cards, and persistence mode shows “on”. I don’t have any nvidia-persistenced daemons running, however. The ‘-pl’ settings are sticky - do I also need to start the persistenced daemon?

We have an application (a TensorFlow script) that often triggers the shutdown in the first couple of minutes of a run. If the system survives the first five minutes, it seems to run the full 8 hours.

The application runs in a Docker container with runtime=nvidia (nvidia-docker 2.0.3 / docker 18.09.3).

Also, we have three of these DL580 Gen9 systems with the same build options. I’ve moved the Quadros between systems: all of them run fine with the 1080 Tis and power off with the Quadros. That pretty much rules out a hardware problem - though all are running the same OS and firmware versions.

Setting the persistence mode with nvidia-smi is deprecated in favour of nvidia-persistenced, but it should still have the same effect.

Generix - THANK YOU!

I’m testing now with sudo nvidia-smi -i 0 --lock-gpu-clocks=300,1590 and it seems to be much more stable.

Before, it was often running at 1920 MHz.

With the clocks locked at 1590 MHz, the peak wattage seldom goes above 260 W.

PS: I also started the persistenced service.
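To make the 1590 MHz lock survive a reboot, one option is a small one-shot systemd unit that re-applies it at boot. A sketch only - the unit name gpu-clock-lock.service is made up (nothing the driver ships), and it assumes nvidia-smi lives at /usr/bin:

```shell
# Re-apply the clock lock (all GPUs) after the persistence daemon is up
cat <<'EOF' | sudo tee /etc/systemd/system/gpu-clock-lock.service
[Unit]
Description=Lock GPU clocks to limit power spikes
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi --lock-gpu-clocks=300,1590

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now gpu-clock-lock.service
```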