The compilation of the kernel module is successful, but when loading I get the error:
[53764.480742] VFIO - User Level meta-driver version: 0.3
[53764.535519] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[53764.536991] nvidia 0000:83:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[53764.537038] NVRM: The NVIDIA GPU 0000:83:00.0 (PCI ID: 10de:2208)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 470.94 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: in this release's README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.
[53764.537140] nvidia: probe of 0000:83:00.0 failed with error -1
[53764.537159] NVRM: The NVIDIA probe routine failed for 1 device(s).
[53764.537159] NVRM: None of the NVIDIA devices were initialized.
[53764.537298] nvidia-nvlink: Unregistered the Nvlink Core, major device number 235
Nouveau is blacklisted and not loaded. 470 should support this card.
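For what it's worth, one quick way to double-check that nouveau really is out of the picture is to look at /proc/modules. A minimal sketch (the helper name is mine, not from any NVIDIA tooling):

```python
# Sketch: confirm which GPU kernel modules are loaded by parsing
# /proc/modules. Assumes a Linux system.
from pathlib import Path


def loaded_modules(text: str) -> set:
    """Return the set of module names from /proc/modules-style text."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


if __name__ == "__main__":
    mods = loaded_modules(Path("/proc/modules").read_text())
    print("nouveau loaded:", "nouveau" in mods)
    print("nvidia loaded:", "nvidia" in mods)
```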
Thank you for your help.
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
I’d say the message is misleading. The logs show that the GPU was working fine for quite some time until the driver reported the GPU failing while in use. Then you tried up/downgrading the driver and now it doesn’t even recognize the type of GPU anymore. I guess it’s simply broken. Please remove it and check if it works in another system.
Here is the sequence of events. We had an older GPU in this machine (a Tesla series, I believe). That card was working. I removed that card Friday evening and tried to install the driver last night and this morning. The card in the machine right now is brand new.
I should mention that this is a shared server machine where X is not going to be used. The intent is to use cuda via pytorch.
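Since the end goal is headless CUDA via pytorch, a minimal sanity check can be run from a shell session without X. A sketch (the import is guarded so the script also reports a missing pytorch install cleanly; the helper name is mine):

```python
# Sketch: headless check of whether PyTorch can see the GPU.
# Assumes nothing beyond an optional PyTorch install.
def describe_cuda() -> str:
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # First visible device; on this machine it should be the new card.
        return "CUDA available: " + torch.cuda.get_device_name(0)
    return "PyTorch is installed but CUDA is not available"


if __name__ == "__main__":
    print(describe_cuda())
```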
The logs are from January 24th, yesterday. You were installing driver 440, which is too old. Then you installed 470.94, which worked. Then you installed 510.39, which worked. Then you installed 465.19.01, which worked for some time, then it broke. Then you rebooted, still broken. Then you wildly installed all kinds of driver versions, none of which ever worked again. Maybe you just confused the GPU by installing/loading/unloading/loading/unloading/installing other drivers/loading/unloading… so all it needs is a power off.
Or it’s just broken. Please check if it works in another system.
Thank you for your help. So a reboot and a clean run of NVIDIA-Linux-x86_64-470.94.run was successful.
Now the goal is to install cuda_11.3, as this is the version supported by pytorch. What was recommended on NVIDIA’s website is cuda_11.3.0_465.19.01_linux.run. Is that the source of the 465 driver that confused me and the GPU? Should I not install the 470 driver at all?
I understand that cuda_11.3.0 also tries to install a kernel driver. So I properly removed 470 and installed cuda_11.3.0 with its bundled 465 driver. The driver now loads properly, but pytorch still fails to see the GPU, and dmesg reports errors like “RmInitAdapter failed!”, even after a reboot.
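To make the relevant dmesg lines easier to pick out of the noise, here is a small filter sketch (the function name and patterns are mine; they just match the messages seen in this thread):

```python
# Sketch: extract NVRM / RmInitAdapter lines from dmesg output so the
# driver-side failures are easy to spot among unrelated kernel messages.
import re
import subprocess

# Patterns taken from the errors reported in this thread.
NVRM_PATTERN = re.compile(r"NVRM|RmInitAdapter|fell off the bus")


def find_nvrm_errors(dmesg_text: str) -> list:
    """Return dmesg lines that mention NVRM-related failures."""
    return [line for line in dmesg_text.splitlines()
            if NVRM_PATTERN.search(line)]


if __name__ == "__main__":
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in find_nvrm_errors(out):
        print(line)
```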
nvidia-bug-report.log.gz (300.6 KB)
It’s also falling off the bus. I repeat:
I guess it’s simply broken. Please remove it and check if it works in another system.