Hello,
I am trying to install NVIDIA drivers on an IBM Cloud VM with a V100 GPU. I have tried both the open and the proprietary drivers, but I have been unable to get nvidia-smi to work. Here are the details of the system and what I have tried so far.
System: Ubuntu 22.04.3 LTS
# lspci
3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
I then installed the CUDA Toolkit ("CUDA Toolkit 12.3 Update 1 Downloads") from https://developer.nvidia.com/cuda-downloads, selecting the options for my OS. I used the network (.deb) installer, which automatically adds the following configuration to block the nouveau driver from loading:
# cat /lib/modprobe.d/nvidia-graphics-drivers.conf
blacklist nouveau
blacklist lbm-nouveau
alias nouveau off
alias lbm-nouveau off
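To confirm that this configuration took effect after a reboot, the list of loaded kernel modules can be checked. The following is only a minimal sketch, assuming a Linux system where /proc/modules is readable; the helper name module_loaded is hypothetical and not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named module appears in
# /proc/modules-style text read from stdin.
module_loaded() {
    # The first whitespace-separated field of each line is the module name.
    # An exact match is used so that "nvidia" does not match "nvidia_uvm".
    awk -v name="$1" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Report the state of the two modules relevant here, if /proc/modules exists.
if [ -r /proc/modules ]; then
    for mod in nouveau nvidia; do
        if module_loaded "$mod" < /proc/modules; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
fi
```

With the configuration above in place, the expectation would be that nouveau is reported as not loaded and nvidia as loaded.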
I installed the legacy driver version because, according to the page "Chapter 4. Installing the NVIDIA Driver", the V100 GPU is not listed for GSP. However, I get the following output:
# nvidia-smi
No devices were found
Tailing /var/log/kern.log shows:
Dec 8 13:22:49 gpubox kernel: [ 10.688806] ACPI Warning: \_SB.PCI0.S38.S08._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
Dec 8 13:22:51 gpubox kernel: [ 12.632479] loop3: detected capacity change from 0 to 8
Dec 8 13:22:54 gpubox kernel: [ 15.693600] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec 8 13:22:54 gpubox kernel: [ 15.725462] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
Dec 8 13:22:59 gpubox kernel: [ 20.537105] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec 8 13:22:59 gpubox kernel: [ 20.537906] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
Dec 8 13:24:27 gpubox kernel: [ 108.231444] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec 8 13:24:27 gpubox kernel: [ 108.232264] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
Dec 8 13:24:32 gpubox kernel: [ 113.041459] NVRM: GPU 0000:04:01.0: RmInitAdapter failed! (0x11:0x45:2550)
Dec 8 13:24:32 gpubox kernel: [ 113.042306] NVRM: GPU 0000:04:01.0: rm_init_adapter failed, device minor number 0
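Because the same failure repeats, it can help to condense the log into one line per GPU and error code before sharing it. This is a rough sketch that assumes the log lines have exactly the format shown above; the function name nvrm_failures is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel-log text on stdin and print, for each
# GPU address and error code, how many RmInitAdapter failures were logged.
nvrm_failures() {
    # Keep only the PCI address and the parenthesized error code, then
    # count identical pairs.
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed! (\(.*\))$/\1 \2/p' |
        sort | uniq -c
}

# Summarize the kernel log if it is readable (this usually requires root).
if [ -r /var/log/kern.log ]; then
    nvrm_failures < /var/log/kern.log
fi
```

Each output line contains a count followed by the PCI address and the error code, which makes it easy to see whether every attempt fails in the same way.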
Is there anything else I can try to get the driver to recognize the GPU?
Thanks!