GPU devices lost with 'NVRM: RmInitAdapter failed' when CPU or network is busy

Hi Nvidia,

I have a GPU cluster with four V100 32GB GPUs per node. Driver version 396.37 worked very well for us. After we recently upgraded the CUDA driver to 418.39 (we also tested 418.40, with the same result), we started hitting a problem.

When the CPUs are busy (e.g. 40 of the 96 CPUs are at 100% utilization), or when the network is busy copying data, even with a single process, the GPU devices are lost.

$ nvidia-smi
No devices were found

$ dmesg

[1523156.634328] NVRM: RmInitAdapter failed! (0x25:0x51:1084)
[1523156.640743] NVRM: rm_init_adapter failed for device bearing minor number 0
[1523156.649004] nvidia 0000:1e:00.0: irq 577 for MSI/MSI-X

Both driver versions were tested:
nvidia-driver-418.39-4.el7.x86_64
nvidia-driver-418.40.04-4.el7.x86_64
nvidia-bug-report.log.gz (430 KB)
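
To track how often this failure signature shows up on a node, the kernel log can be filtered for it. This is only a rough sketch: it assumes the message text matches the dmesg excerpt above, and the function name count_rm_init_failures is made up for illustration.

```shell
# Hypothetical helper: count "RmInitAdapter failed" lines in kernel log
# text read from stdin.
count_rm_init_failures() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so that status is masked here.
    grep -c 'NVRM: RmInitAdapter failed' || true
}

# Usage on an affected node (reading dmesg may require root on some systems):
#   dmesg | count_rm_init_failures
```

Because the helper only reads stdin, it can also be pointed at a saved log file instead of live dmesg output.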

Hi Nvidia,

Can you give us an update on the status of this bug? It is a serious obstacle to our using the latest drivers effectively. Thanks.

Liwei

You might want to email it to linux-bugs[at]nvidia.com for more attention.
Looking at the logs, though, I see the driver continuously initializing and deinitializing in quick succession, which suggests that nvidia-persistenced is not running. With additional load on the bus, this can run into timing problems, possibly leading to the issue you’re observing. Please enable nvidia-persistenced to start on boot and check whether that resolves the issue.
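
A minimal sketch of that suggestion, assuming a systemd-based distribution where the driver package installs a unit named nvidia-persistenced (packaging varies, so the unit name is an assumption). The helper all_gpus_persistent is a made-up name; it reads the CSV that nvidia-smi prints for its persistence_mode query field.

```shell
# Hypothetical helper: succeed only if every line of
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# (read from stdin) reports "Enabled". Empty input counts as failure.
all_gpus_persistent() {
    awk -F', *' '
        {
            seen = 1
            gsub(/[ \t\r]+$/, "", $2)      # drop trailing whitespace
            if ($2 != "Enabled") bad = 1
        }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# On the node itself (requires root; the unit name is an assumption):
#   sudo systemctl enable --now nvidia-persistenced
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
#       | all_gpus_persistent && echo "persistence mode on" \
#       || echo "persistence mode off, or no GPUs listed"
```

If nvidia-smi prints "No devices were found" instead of CSV rows, the helper also fails, so the same check flags the lost-device state.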

Thanks for the answer. I enabled persistence mode and the issue seems to have gone away. I’ll run more tests to make sure.

I’m getting this error in dmesg on Ubuntu 18.04 with a single K80, and nvidia-smi also reports "No devices were found". I checked with
sudo systemctl status nvidia-persistenced, as per "Setting up nvidia-persistenced - #11 by mikechen6688", but it’s still not working.

When I run sudo systemctl status nvidia-persistenced, I do see:
nvidia-persistenced[856]: device 0000:04:00.0 - registered
nvidia-persistenced[856]: device 0000:04:00.0 - failed to open.

Is there anything else worth checking? I’m running the NVIDIA server driver metapackage from nvidia-driver-460-server (proprietary).

I installed this driver with:
sudo ubuntu-drivers autoinstall

I also tried the steps here to disable nouveau:
https://towardsdatascience.com/deep-learning-gpu-installation-on-ubuntu-18-4-9b12230a1d31
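
Whether the nouveau blacklist actually took effect can be checked from the loaded-module list. A rough sketch, using a made-up helper name (nouveau_loaded) that reads lsmod output from stdin:

```shell
# Hypothetical helper: succeed if lsmod-style text on stdin lists the
# nouveau kernel module.
nouveau_loaded() {
    # The module name is the first whitespace-separated column of lsmod output.
    awk '$1 == "nouveau" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage:
#   lsmod | nouveau_loaded && echo "nouveau is still loaded" \
#       || echo "nouveau is not loaded"
```

If nouveau is still loaded after blacklisting, the initramfs may not have been regenerated, though that is a guess and not something the logs in this thread confirm.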

I’m running in UEFI mode, if that makes any difference, and have enabled Above 4G Decoding in the BIOS. The card is properly cooled and shows up in hwinfo --gfxcard --short:
graphics card:
Matrox G200 SE A (PCI)
nVidia GK210GL [Tesla K80]
nVidia GK210GL [Tesla K80]