Quadro RTX 6000 causes HPE server to power off - peaks way over power limit

I have an HPE DL580 Gen8 server with four Quadro RTX 6000 cards, and it will frequently power off with a hardware power fault.

Message is:
System Power Fault Detected (XR: 10 A2 MID: FF 0F F0 00 00…

This is apparently an emergency protection shutdown.

The system has a 6000-watt power supply (four 1500-watt supplies), and the cards are fed with the standard HPE 8+6-pin cables from the power distribution panel. It has 3 TiB of RAM and four 18C/36T Xeons. The system is about three years old and has been fine with four GTX 1080 Ti cards, and Maxwell Titans before that.

OS is Ubuntu Server 18.04 LTS, and all Ubuntu and HPE patches are applied.

Looking at ‘nvidia-smi -q -d POWER’, I see samples like:

==============NVSMI LOG==============

Timestamp                           : Tue Mar 26 15:45:12 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 252.35 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.37 sec
        Number of Samples           : 119
        Max                         : 392.15 W
        Min                         : 62.49 W
        Avg                         : 144.21 W

GPU 00000000:81:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 270.69 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.38 sec
        Number of Samples           : 119
        Max                         : 362.39 W
        Min                         : 60.93 W
        Avg                         : 132.90 W

Lowering the power limit makes it better, but it just reduces the probability of a shutdown - it doesn’t eliminate the problem.

==============NVSMI LOG==============

Timestamp                           : Sat Mar 23 11:07:43 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 90.29 W
        Power Limit                 : 150.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 150.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.38 sec
        Number of Samples           : 119
        Max                         : 338.29 W
        Min                         : 65.20 W
        Avg                         : 124.28 W

GPU 00000000:81:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 95.68 W
        Power Limit                 : 150.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 150.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
    Power Samples
        Duration                    : 2.38 sec
        Number of Samples           : 119
        Max                         : 375.68 W
        Min                         : 62.28 W
        Avg                         : 124.25 W
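A quick way to watch for these transients is to scrape the per-GPU peak out of the Power Samples sections in a loop. A minimal sketch (it assumes the exact text layout shown above, and the ‘peaks’ helper is just an illustrative name, not anything nvidia-smi ships):

```shell
# Pull each 'Max' wattage out of the Power Samples sections.
# 'Max Power Limit' lines sit under Power Readings and have
# 'Power' as the second field, so a bare 'Max :' match after a
# 'Power Samples' header is unambiguous in this format.
peaks() {
  awk '/Power Samples/ {s=1}
       s && $1 == "Max" && $2 == ":" {print $3; s=0}'
}

# Demo against a fragment of the log above; live use would be:
#   nvidia-smi -q -d POWER | peaks
printf '%s\n' \
  '    Power Samples' \
  '        Max                         : 392.15 W' \
  '        Min                         : 62.49 W' | peaks
# -> 392.15
```

Running that every few seconds and logging anything well above the enforced limit makes it easier to correlate the spikes with the shutdowns.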

Any ideas on how to tame the power consumption so that the Quadros don’t frighten the server?

The voltage regulator/power limiter will always exhibit some latency in hardware, so you won’t really get rid of those power spikes.
What kind of redundancy did you configure the PSUs for? Does setting them to N+1 resolve the issue?

The system decided on N+1 (although I’ve never been able to push it past just over 2000 watts, which is quite a bit less than even N+2 capacity).

Power mode is set to “max performance”, which also sets the power supplies to “always full”.

Should be enough power available, then. I’m a bit puzzled why this is happening now with the Quadros - the 1080 Tis would also spike to 400 W. I don’t really know of any options other than contacting HPE or checking the PSUs.

If you’re using them for compute only, maybe check whether setting application clocks with nvidia-smi tames them a bit. Also, make sure nvidia-persistenced is running.
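For reference, application clocks are set as a memory/graphics pair; something like the following (the clock values below are placeholders - pick a pair from the supported list, and this assumes your driver package installed the nvidia-persistenced systemd unit):

```shell
# List the memory/graphics clock pairs this GPU supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin application clocks on GPU 0 as <memClk,graphicsClk> in MHz
# (7000,1305 is a placeholder pair, not a recommendation)
sudo nvidia-smi -i 0 -ac 7000,1305

# Keep the driver loaded (and the settings in place) between jobs
sudo systemctl enable --now nvidia-persistenced
```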

Yes, compute-only. I’ll look at the clock settings.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:41:00.0 Off |                  Off |
| 33%   56C    P2   119W / 200W |  23457MiB / 24190MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:81:00.0 Off |                  Off |
| 38%   62C    P2    97W / 200W |  23457MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I’ve done sudo nvidia-smi -i 0 -pm 1 on each of the cards, and persistence mode shows “on”. I don’t have any nvidia-persistenced daemons running, however. The ‘-pl’ settings are sticky - do I also need to start the persistenced daemon?

We have an application (a TensorFlow script) that often triggers the shutdown in the first couple of minutes of a run. If the system survives the first five minutes, it seems to run the full 8 hours.

The application runs in a Docker container with runtime=nvidia (nvidia-docker 2.0.3 / docker 18.09.3).

Also, we have three of these DL580 Gen9 systems with the same build options. I’ve moved the Quadros between systems: all of them run fine with the 1080 Tis and power off with the Quadros. That pretty much rules out a hardware problem - though all are running the same OS and firmware versions.

Setting the persistence mode with nvidia-smi is deprecated in favour of nvidia-persistenced, but it should still have the same effect.

Generix - THANK YOU!

I’m testing now with sudo nvidia-smi -i 0 --lock-gpu-clocks=300,1590 and it seems to be much more stable.

Before, it was often running at 1920 MHz.

With the clocks locked at 1590 MHz, the peak wattage seldom goes above 260 W.

PS: I also started the persistenced service.
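To make the 1590 MHz lock survive a reboot, one option is a small one-shot systemd unit that re-applies it at boot. A sketch only - the unit name gpu-clock-lock.service is made up (nothing the driver ships), and it assumes nvidia-smi lives at /usr/bin:

```shell
# Re-apply the clock lock (all GPUs) after the persistence daemon is up
cat <<'EOF' | sudo tee /etc/systemd/system/gpu-clock-lock.service
[Unit]
Description=Lock GPU clocks to limit power spikes
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi --lock-gpu-clocks=300,1590

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now gpu-clock-lock.service
```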