Terrible idle power on 418.43 (on Fedora 29 x86_64 when compared to 415.27)

Fedora 29

With 415.27, idle power without X was around 6-7 W, and with X it was under 1 W.
(That it wasn't under 1 W without X is a separate bug, I guess.)

With 418.43, idle power appears to be ~21-22 W.

[root@sky ~]# nvidia-smi
Fri Mar 1 03:32:09 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   48C    P8    21W / 338W |      1MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@sky ~]# nvidia-smi --format=csv,noheader --query-gpu=temperature.gpu,pstate,fan.speed,power.draw,power.limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video
49, P8, 0 %, 21.66 W, 338.00 W, 645 MHz, 645 MHz, 405 MHz, 585 MHz

The problem appears to be related to the clocks not dropping to 300 MHz: graphics and SM sit at 645 MHz even though the card reports P8.
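For anyone who wants to check for the same symptom automatically, here is a rough sketch that parses the CSV line from the query above and flags a card that reports P8 while still drawing too much power or holding its clocks up. The helper names (`parse_sample`, `number`, `looks_stuck`, `query_gpu`) and the 10 W threshold are my own assumptions for illustration, not anything from the driver; the 300 MHz figure is the idle clock mentioned above.

```python
import subprocess

# Same fields, in the same order, as the --query-gpu list used above.
FIELDS = [
    "temperature.gpu", "pstate", "fan.speed", "power.draw", "power.limit",
    "clocks.current.graphics", "clocks.current.sm",
    "clocks.current.memory", "clocks.current.video",
]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def number(text):
    """Drop the unit suffix: '21.66 W' -> 21.66, '645 MHz' -> 645.0."""
    return float(text.split()[0])


def looks_stuck(sample, max_idle_watts=10.0, idle_clock_mhz=300.0):
    """True if the card claims P8 but power or graphics clock is still high.

    Both thresholds are assumptions; adjust them for your card.
    """
    if sample["pstate"] != "P8":
        return False
    return (number(sample["power.draw"]) > max_idle_watts
            or number(sample["clocks.current.graphics"]) > idle_clock_mhz)


def query_gpu():
    """Run nvidia-smi once and parse the first GPU's line."""
    cmd = ["nvidia-smi", "--format=csv,noheader",
           "--query-gpu=" + ",".join(FIELDS)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])
```

Calling `looks_stuck(query_gpu())` on an affected machine should return `True` for the state shown above. The sketch does not handle fields that nvidia-smi reports as unsupported on a given card.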

Note: I upgraded Fedora, the NVIDIA driver, CUDA 10.0 → 10.1 and cuDNN 7.4.2.24 → 7.5.0.56 all at the same time, but AFAICT this looks to be purely a driver issue.

This also appears to result in a roughly 30-45 MHz lower top frequency.

This is on an eVGA 11G-P4-2382-KR (RTX 2080 Ti)

Have you left your PC idling for more than 40 seconds and checked power consumption after that?

Yes, it dropped from P2 under a CUDA workload to P3, P5, then P8, and 90+ seconds later it's still P8 @ ~21 W.
Starting X, stopping it, starting a CUDA workload, stopping it, waiting a couple of minutes… none of it eliminated this baseline 21 W.
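If you want to record the P2 → P8 ramp-down rather than eyeball it, something like the following could log the performance state and power draw at fixed intervals after the workload ends. This is only a sketch: `watch_idle` and `read_from_nvidia_smi` are hypothetical helper names, and the sample count and interval are arbitrary defaults. The reader and the sleep function are passed in so the loop can be tried without a GPU.

```python
import subprocess
import time


def read_from_nvidia_smi():
    """Return (pstate, watts) for the first GPU using nvidia-smi."""
    cmd = ["nvidia-smi", "--format=csv,noheader,nounits",
           "--query-gpu=pstate,power.draw"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    pstate, watts = [v.strip() for v in out.splitlines()[0].split(",")]
    return pstate, float(watts)


def watch_idle(read_sample=read_from_nvidia_smi, samples=12, interval_s=10,
               sleep=time.sleep):
    """Collect (elapsed_seconds, pstate, watts) tuples at a fixed interval."""
    history = []
    for i in range(samples):
        if i:  # no delay before the first reading
            sleep(interval_s)
        pstate, watts = read_sample()
        history.append((i * interval_s, pstate, watts))
    return history


if __name__ == "__main__":
    for elapsed, pstate, watts in watch_idle():
        print("%4ds  %s  %.2f W" % (elapsed, pstate, watts))
```

With the defaults this covers roughly two minutes, which is longer than the 40 seconds asked about above.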

However, after rebooting behaviour seems sane again.

This appears to be triggered by a CUDA application aborting on an exception.

As soon as this triggered:

terminate called after throwing an instance of 'lczero::Exception'
what(): CUDA error: an illegal memory access was encountered (…/…/src/neural/cuda/network_cudnn.cc:547)
Aborted (core dumped)

The card is now stuck in some sort of high energy idle state:

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|100%   31C    P8    21W / 338W |      1MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

(rebooting fixes it)