Hello, one of the GPUs isn’t detected and doesn’t show up when I run nvidia-smi. Here are the log files:
nvidia-bug-report.log.gz (5.5 MB)
kern.log (4.2 MB)
Would the issue likely be software or hardware related?
[ 1.173703] pci 0000:41:00.0: BAR 8: no space for [mem size 0x3f0000000 64bit pref]
[ 1.173705] pci 0000:41:00.0: BAR 8: failed to assign [mem size 0x3f0000000 64bit pref]
[ 1.173707] pci 0000:41:00.0: BAR 10: no space for [mem size 0x7e000000 64bit pref]
[ 1.173708] pci 0000:41:00.0: BAR 10: failed to assign [mem size 0x7e000000 64bit pref]
[ 1.173709] pci 0000:41:00.0: BAR 7: no space for [mem size 0x00fc0000]
[ 1.173710] pci 0000:41:00.0: BAR 7: failed to assign [mem size 0x00fc0000]
[ 7.987242] kernel: NVRM: GPU 0000:41:00.0: Failed to copy vbios to system memory.
[ 7.987334] kernel: NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x30:0xffff:948)
Please check if setting either kernel parameter
pci=realloc
or
pci=realloc=off
helps.
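To confirm whether the parameter actually took effect and whether the BAR failures persist after a reboot, something like the following Python sketch may help. It is only an illustration: the function names `failed_bar_assignments` and `pci_option` are made up for this example, the sample text is copied from the log lines quoted above, and the regular expression assumes the kernel keeps this exact message format.

```python
import re

# Sample lines copied from the kernel log quoted in this thread.
SAMPLE_LOG = """\
[ 1.173705] pci 0000:41:00.0: BAR 8: failed to assign [mem size 0x3f0000000 64bit pref]
[ 1.173708] pci 0000:41:00.0: BAR 10: failed to assign [mem size 0x7e000000 64bit pref]
[ 1.173710] pci 0000:41:00.0: BAR 7: failed to assign [mem size 0x00fc0000]
"""

# Assumes the "failed to assign" message format shown above.
FAILED_BAR = re.compile(
    r"pci (\S+): BAR (\d+): failed to assign \[mem size (0x[0-9a-fA-F]+)"
)


def failed_bar_assignments(log_text):
    """Return (device, bar_number, size_in_bytes) for each failed BAR assignment."""
    return [
        (dev, int(bar), int(size, 16))
        for dev, bar, size in FAILED_BAR.findall(log_text)
    ]


def pci_option(cmdline):
    """Return the value of the pci= boot parameter, or None if it is absent."""
    for token in cmdline.split():
        if token.startswith("pci="):
            return token[len("pci="):]
    return None


# Print each window the kernel could not place, in MiB.
for dev, bar, size in failed_bar_assignments(SAMPLE_LOG):
    print(f"{dev} BAR {bar}: {size / 2**20:.2f} MiB could not be assigned")
```

On the affected machine, `pci_option(open("/proc/cmdline").read())` should return `realloc` or `realloc=off` if the boot parameter was really applied, and the output of `dmesg` can be passed to `failed_bar_assignments` in place of the sample text.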
Neither of these kernel parameters solved the issue, and the logs report the same error. Note that all GPUs were working fine until recently. Any other suggestions are highly appreciated.
Maybe something is wrong with the driver version?
Oct 16 07:03:57 ws-cluster-01 kernel: [586195.760861] NVRM: API mismatch: the client has the version 515.76, but
Oct 16 07:03:57 ws-cluster-01 kernel: [586195.760861] NVRM: this kernel module has the version 515.65.01. Please
Oct 16 07:03:57 ws-cluster-01 kernel: [586195.760861] NVRM: make sure that this kernel module and all NVIDIA driver
Oct 16 07:03:57 ws-cluster-01 kernel: [586195.760861] NVRM: components have the same version.
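If it helps to check the whole kern.log for this kind of message, here is a rough Python sketch that pulls the two version numbers out of an API mismatch report. The function name `parse_api_mismatch` is hypothetical, the sample text is taken from the lines above, and the pattern assumes the wording of the NVRM message stays as shown.

```python
import re

# Sample lines copied from the kern.log excerpt in this thread.
SAMPLE_LOG = """\
Oct 16 07:03:57 ws-cluster-01 kernel: [586195.760861] NVRM: API mismatch: the client has the version 515.76, but
Oct 16 07:03:57 ws-cluster-01 kernel: [586195.760861] NVRM: this kernel module has the version 515.65.01. Please
"""

# Dotted numeric version such as 515.76 or 515.65.01.
VERSION = r"(\d+(?:\.\d+)*)"


def parse_api_mismatch(log_text):
    """Return (client_version, module_version), or None if no mismatch is reported."""
    client = re.search(r"client has the version " + VERSION, log_text)
    module = re.search(r"kernel module has the version " + VERSION, log_text)
    if client is None or module is None:
        return None
    return client.group(1), module.group(1)


result = parse_api_mismatch(SAMPLE_LOG)
if result is not None:
    client_version, module_version = result
    print(f"user-space client: {client_version}, kernel module: {module_version}")
```

A mismatch like this usually means the user-space libraries were updated while the older kernel module was still loaded, so comparing the timestamp of the message with the date of the driver update can show whether it is related to the missing GPU at all.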
Have you updated your drivers recently?
I did, but I’m not sure whether that was before the issue started. I also purged and reinstalled the NVIDIA drivers recently as part of troubleshooting.
It is not a driver issue but an issue between the kernel, the GPU, and the mainboard (UEFI firmware). Please check whether the affected GPU works in another system; it might just as well be broken and requesting resources that cannot be fulfilled.