nvidia-smi reports "No devices were found" for a V100 GPU on IBM Cloud

Hello,
I am trying to install NVIDIA drivers on an IBM Cloud VM with a V100 GPU. I have tried both the open and the proprietary drivers, but I have been unable to get nvidia-smi to work. Here are the details of the system and what I have tried so far.

System: Ubuntu 22.04.3 LTS

# lspci
3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

I then installed the CUDA Toolkit (“CUDA Toolkit 12.3 Update 1 Downloads”) by going to [https://developer.nvidia.com/cuda-downloads] and selecting the OS and other options. I used the network (.deb) installer, which automatically adds the following to block loading of the nouveau driver.

# cat /lib/modprobe.d/nvidia-graphics-drivers.conf
blacklist nouveau
blacklist lbm-nouveau
alias nouveau off
alias lbm-nouveau off

I installed the legacy driver version because, per this page: [Chapter 4. Installing the NVIDIA Driver], the V100 GPU is not listed for GSP. However, I get the following output:

# nvidia-smi
No devices were found

Tailing /var/log/kern.log shows:

Dec  8 13:22:49 gpubox kernel: [   10.688806] ACPI Warning: \_SB.PCI0.S38.S08._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
Dec  8 13:22:51 gpubox kernel: [   12.632479] loop3: detected capacity change from 0 to 8
Dec  8 13:22:54 gpubox kernel: [   15.693600] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec  8 13:22:54 gpubox kernel: [   15.725462] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
Dec  8 13:22:59 gpubox kernel: [   20.537105] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec  8 13:22:59 gpubox kernel: [   20.537906] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
Dec  8 13:24:27 gpubox kernel: [  108.231444] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec  8 13:24:27 gpubox kernel: [  108.232264] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
Dec  8 13:24:32 gpubox kernel: [  113.041459] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec  8 13:24:32 gpubox kernel: [  113.042306] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
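
For anyone hitting the same thing, here is a rough sketch of how to pull just the adapter-init failures out of a large kernel log. The helper name nvrm_failures is made up for illustration, and the /var/log/kern.log path is the one used above; adjust it if your distribution logs elsewhere.

```shell
# nvrm_failures is a hypothetical helper, not part of any NVIDIA tool.
# It prints only the lines that contain the literal text "RmInitAdapter failed".
nvrm_failures() {
    grep -F 'RmInitAdapter failed' "$1"
}

# Assumed log location (the same file tailed above).
log=/var/log/kern.log
if [ -r "$log" ]; then
    # Show the five most recent failures; "|| true" keeps the script going
    # when the log contains no matching lines.
    nvrm_failures "$log" | tail -n 5 || true
fi
```

If this prints nothing, the driver is no longer logging the init failure, which is a quick way to tell whether a change made a difference after a reboot.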

Anything else I can try to have the GPU driver recognize the GPU?

Thanks!

nvidia-bug-report.log.gz (1.0 MB)

Uploaded log file.

I tried multiple things, including the “open” driver, so the log file is pretty big. The report for the last reboot starts at line 286808. HTH.

Please try a 470 driver. I suspect that the RmInitAdapter failure code 0x11:0x45 is the new NVIDIA code for “passthrough not allowed”.

Thanks @generix. I was checking the page below for the latest NVIDIA 470 driver, and the V100 is not listed as supported. Any thoughts? Do you think it might still work?

Never mind, I will need to try the Data Center Driver from [Data Center Driver for Linux x64 | 470.223.02 | Linux 64-bit | NVIDIA]. I will try it out and report back. It looks like the CUDA Toolkit version would be quite old (11.4), though.

Both drivers are actually identical.

Well, I tried the 470 driver and still got the same error message. I am thinking this might be a problem on the cloud side, so I have contacted IBM. Let’s see.

What worked for me!
You can try:

Edit /etc/default/grub and make sure it contains (keeping any options that are already on that line):

GRUB_CMDLINE_LINUX="pci=nomsi"

If it’s not there, add it, then run:

sudo update-grub
sudo reboot

After the reboot, the output of “sudo cat /proc/cmdline” should include “pci=nomsi”. Then check again with:

nvidia-smi
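
If you want to script the post-reboot check, something like the following should do. It is only a sketch: has_nomsi is a made-up helper name, and it only matches pci=nomsi as a standalone token, so a combined form such as pci=nomsi,noaer would not be detected.

```shell
# has_nomsi is a hypothetical helper: succeeds if the given kernel command
# line contains pci=nomsi as a separate, space-delimited token.
has_nomsi() {
    case " $1 " in
        *" pci=nomsi "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    if has_nomsi "$(cat /proc/cmdline)"; then
        echo "pci=nomsi is active"
    else
        echo "pci=nomsi is NOT active; check /etc/default/grub and rerun update-grub"
    fi
fi
```

If it reports that the option is not active, the usual cause is that update-grub was not run after editing the file, or that the VM has not been rebooted since.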