Linux driver 410.79 on RTX 2080: NVRM: RmInitAdapter failed! (0x26:0xffff:1127)

I’m running a RHEL 7.6 system (kernel 3.10.0-957.10.1.el7.x86_64) that we just upgraded to an RTX 2080 Ti (it previously held two GTX 780 Ti cards that ran stably for years). After running a bunch of OpenCL and CUDA tests, the GPU fails with the following in dmesg:

...
[32438.624521] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32439.139541] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32439.139560] NVRM: rm_init_adapter failed for device bearing minor number 0
[32439.246182] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32443.757184] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32443.757202] NVRM: rm_init_adapter failed for device bearing minor number 0
[32443.864800] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32448.372512] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32448.372530] NVRM: rm_init_adapter failed for device bearing minor number 0
[32448.481516] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32452.993060] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32452.993083] NVRM: rm_init_adapter failed for device bearing minor number 0
[32453.128607] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32457.637375] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32457.637404] NVRM: rm_init_adapter failed for device bearing minor number 0
[32458.018091] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32462.547348] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32462.547384] NVRM: rm_init_adapter failed for device bearing minor number 0
[32462.660787] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
...

After that, all GPU operations hang or fail (e.g. nvidia-smi seems to stall). A reboot fixes it.
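
For anyone comparing notes on this error, here is a rough sketch of how the RmInitAdapter failures in a dmesg excerpt could be counted and the gap between retries measured. The function names (find_init_failures, retry_intervals) are made up for illustration, and the sample text is just a few lines copied from the log above; the regular expression assumes the timestamp and message layout shown there.

```python
import re

# Matches lines such as:
# [32439.139541] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
# The layout is assumed from the dmesg excerpt in this thread.
FAILURE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def find_init_failures(log_text):
    """Return a list of (timestamp_seconds, error_code) tuples."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            failures.append((float(match.group("ts")), match.group("code")))
    return failures


def retry_intervals(failures):
    """Return the seconds elapsed between consecutive failures."""
    times = [ts for ts, _ in failures]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# A few lines taken from the log above, used only as sample input.
SAMPLE = """\
[32438.624521] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32439.139541] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32439.139560] NVRM: rm_init_adapter failed for device bearing minor number 0
[32443.757184] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
"""

failures = find_init_failures(SAMPLE)
print(len(failures), "failures;", sorted({code for _, code in failures}))
print(["%.1f s" % gap for gap in retry_intervals(failures)])
```

The same functions should work on a full dmesg dump saved to a file; a steady interval between failures would suggest something is repeatedly retrying the device rather than the card failing at random.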

Please use gpu-burn to check for a hw fault.

I have run gpu-burn for an hour without any problems (as well as several hours of the same OpenCL and CUDA code that caused the original crash). However, I suspect that I was actually using driver 410.70 during the crash (and that the system only picked up the newer version after the reboot), so I’m hoping whatever went wrong has been fixed.
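
In case it helps someone else rule out the same kind of version mix-up, below is a minimal sketch for checking which NVIDIA kernel module is actually loaded. It assumes the driver exposes its version in /proc/driver/nvidia/version on a line containing "Kernel Module" followed by the version number; that layout can differ between driver releases, so treat the pattern as an assumption. The helper names are hypothetical.

```python
import re

# Assumed location of the loaded driver's version information.
VERSION_FILE = "/proc/driver/nvidia/version"

# Assumed layout: "... Kernel Module  <version>  <build date>".
VERSION_RE = re.compile(r"Kernel Module\s+(\d+(?:\.\d+)+)")


def parse_driver_version(text):
    """Extract a version such as '410.79' from the file contents, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def read_loaded_driver_version(path=VERSION_FILE):
    """Return the loaded module's version, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return parse_driver_version(handle.read())
    except OSError:
        # File missing or unreadable: the module is probably not loaded.
        return None


print("loaded NVIDIA kernel module:", read_loaded_driver_version() or "not found")
```

Comparing this value right after a driver update, before rebooting, against the version of the installed package would show whether the running module is still the old one.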