A new server just arrived, and I installed Ubuntu 16.04 and CUDA + cuDNN as usual. After installing everything, one of the GPUs is missing from nvidia-smi. All 4 appear in lspci:
19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev ff)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
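
One thing that stands out in the listing above is that 67:00.0 reports rev ff while the other three report rev a1. I am only assuming this difference is relevant, I have not confirmed it. Here is a minimal Python sketch that picks out the odd revision from lspci text; the function names and the SAMPLE constant are my own for illustration and are not part of any NVIDIA tool:

```python
import re
from collections import Counter

# Matches lspci lines for NVIDIA devices, with an optional PCI domain prefix, e.g.
# 67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [...] (rev ff)
LSPCI_LINE = re.compile(
    r"^(?P<addr>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r".*NVIDIA.*\(rev (?P<rev>[0-9a-f]{2})\)\s*$"
)


def parse_nvidia_devices(lspci_text):
    """Return a list of (pci_address, revision) tuples for NVIDIA devices."""
    devices = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices.append((match.group("addr"), match.group("rev")))
    return devices


def odd_revisions(devices):
    """Return the devices whose revision differs from the most common one."""
    if not devices:
        return []
    most_common_rev = Counter(rev for _, rev in devices).most_common(1)[0][0]
    return [(addr, rev) for addr, rev in devices if rev != most_common_rev]


# The lspci output from this post, copied verbatim.
SAMPLE = """\
19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev ff)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
"""

if __name__ == "__main__":
    for addr, rev in odd_revisions(parse_nvidia_devices(SAMPLE)):
        print(f"{addr} has an unexpected revision: {rev}")
```

On the listing above this flags only 67:00.0, which is also the address nvidia-smi complains about further down.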
When I run nvidia-smi, this message appears in dmesg:
[ 8.672022] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
[ 8.672138] caller os_map_kernel_space.part.7+0xd8/0x120 [nvidia] mapping multiple BARs
[ 11.860957] NVRM: RmInitAdapter failed! (0x26:0xffff:1125)
[ 11.860979] NVRM: rm_init_adapter failed for device bearing minor number 2
If I leave the server running for some time, nvidia-smi then throws:
Unable to determine the device handle for GPU 0000:67:00.0: Unknown Error
I attached two bug reports: one before nvidia-smi “breaks” and another one after. Any help is welcome.
nvidia-bug-report-after.log.gz (2.06 MB)
nvidia-bug-report-before.log.gz (1.87 MB)