I’m running a RHEL 7.6 system (kernel 3.10.0-957.10.1.el7.x86_64) that we just upgraded to an RTX 2080 Ti; it previously held two GTX 780 Ti cards that ran stably for years. After running a bunch of OpenCL and CUDA tests, the GPU stops responding and dmesg shows the following:
...
[32438.624521] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32439.139541] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32439.139560] NVRM: rm_init_adapter failed for device bearing minor number 0
[32439.246182] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32443.757184] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32443.757202] NVRM: rm_init_adapter failed for device bearing minor number 0
[32443.864800] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32448.372512] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32448.372530] NVRM: rm_init_adapter failed for device bearing minor number 0
[32448.481516] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32452.993060] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32452.993083] NVRM: rm_init_adapter failed for device bearing minor number 0
[32453.128607] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32457.637375] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32457.637404] NVRM: rm_init_adapter failed for device bearing minor number 0
[32458.018091] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32462.547348] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32462.547384] NVRM: rm_init_adapter failed for device bearing minor number 0
[32462.660787] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
...
After this, all GPU operations hang or fail (e.g. nvidia-smi seems to stall). A reboot fixes it.
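
In case it helps anyone compare notes, here is a minimal sketch for summarizing the retry loop from saved dmesg output: how many times RmInitAdapter failed, which status codes appeared, and how far apart the failures were. It is not an NVIDIA tool. The function name `summarize_failures` and the regular expression are assumptions based only on the line format in the log above, and the sample text is two of the failures copied from that log.

```python
import re

# Assumed line format, taken from the dmesg excerpt above, e.g.:
# [32439.139541] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
FAILURE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def summarize_failures(log_text):
    """Return (failure count, sorted status codes, gaps in seconds between failures)."""
    stamps = []
    codes = set()
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            stamps.append(float(match.group("ts")))
            codes.add(match.group("code"))
    # Gap between each failure and the next one, rounded to milliseconds.
    gaps = [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]
    return len(stamps), sorted(codes), gaps


# Two consecutive failures copied from the log above.
SAMPLE = """\
[32438.624521] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32439.139541] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32439.139560] NVRM: rm_init_adapter failed for device bearing minor number 0
[32439.246182] nvidia 0000:83:00.0: irq 145 for MSI/MSI-X
[32443.757184] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[32443.757202] NVRM: rm_init_adapter failed for device bearing minor number 0
"""

count, codes, gaps = summarize_failures(SAMPLE)
print("failures:", count)
print("status codes:", codes)
print("seconds between failures:", gaps)
```

To run it against a real log, the text passed to `summarize_failures` would be the contents of a file written with something like `dmesg > dmesg.txt`. A single repeated status code at a steady interval would suggest the driver is retrying the same initialization step rather than hitting different errors.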