I have a GPU cluster with four V100 32GB GPUs per node. Driver version 396.37 worked very well for us. We recently upgraded the CUDA driver to 418.39 (we also tested 418.40, with the same result) and started hitting a problem.
When the CPUs are busy (e.g. 40 of the 96 CPUs at 100% utilization), or when the network is busy copying data with just a single process, the GPU devices are lost:
No devices were found
[1523156.634328] NVRM: RmInitAdapter failed! (0x25:0x51:1084)
[1523156.640743] NVRM: rm_init_adapter failed for device bearing minor number 0
[1523156.649004] nvidia 0000:1e:00.0: irq 577 for MSI/MSI-X
Both driver versions were tested.
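In case it helps anyone trying to catch the moment the devices drop, here is a minimal sketch (not an NVIDIA tool) that scans kernel log text for NVRM lines reporting a failure, like the ones quoted above. The function name `find_nvrm_failures` and the regular expression are my own assumptions for illustration; the sample text is just the three log lines from this post.

```python
import re

# Matches kernel log lines such as:
#   [1523156.634328] NVRM: RmInitAdapter failed! (0x25:0x51:1084)
# The pattern is an assumption based only on the lines quoted in this post.
NVRM_FAILURE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM:\s+(?P<msg>.*failed.*)$"
)


def find_nvrm_failures(kernel_log: str):
    """Return (timestamp, message) pairs for NVRM lines that report a failure."""
    hits = []
    for line in kernel_log.splitlines():
        match = NVRM_FAILURE.match(line.strip())
        if match:
            hits.append((float(match.group("ts")), match.group("msg")))
    return hits


# Sample input: the kernel log lines from this report.
sample = """\
[1523156.634328] NVRM: RmInitAdapter failed! (0x25:0x51:1084)
[1523156.640743] NVRM: rm_init_adapter failed for device bearing minor number 0
[1523156.649004] nvidia 0000:1e:00.0: irq 577 for MSI/MSI-X
"""

failures = find_nvrm_failures(sample)
for ts, msg in failures:
    print(f"{ts:.6f} {msg}")
```

On a real node the same function could be fed the output of `dmesg` instead of the sample string, for example from a periodic job that runs while the CPU or network load is applied.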
nvidia-bug-report.log.gz (430 KB)