'nvidia-smi' does not discover the GPU on a GH200 Supermicro system running Ubuntu 22.04

Hi,

I have set up a Supermicro GH200 server running Ubuntu 22.04 with kernel 6.5.0-28-generic.
‘nvidia-smi’ does not discover the GPU, even though the GPU is detected as a PCIe device and the NVIDIA kernel modules are loaded:
~$ nvidia-smi
No devices were found
~$ lsmod | grep -i nvidia
nvidia_uvm 4698112 0
nvidia_drm 106496 0
nvidia_modeset 1630208 1 nvidia_drm
nvidia 8843264 2 nvidia_uvm,nvidia_modeset
video 69632 1 nvidia_modeset
drm_kms_helper 270336 4 ast,nvidia_drm
drm 790528 6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
ecc 45056 1 nvidia

When I ran ‘nvidia-smi’, the following errors appeared in dmesg:
[ 150.692469] workqueue: drm_fb_helper_damage_work [drm_kms_helper] hogged CPU for >20000us 4 times, consider switching to WQ_UNBOUND
[ 156.721903] loop5: detected capacity change from 0 to 8
[ 197.086199] ACPI Warning: _SB.PCI9.RP00.GPU0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230331/nsarguments-61)
[ 199.819710] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x400000000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 200.683412] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x4005f0000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 201.548682] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x400be0000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 202.440019] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x4011d0000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 202.440048] NVRM: GPU memory zone movable auto onlining failed!

[ 263.352127] NVRM: nvAssertOkFailedNoLog: Assertion failed: Generic operating system error [NV_ERR_OPERATING_SYSTEM] (0x00000059) returned from kmemsysNumaAddMemory_HAL(pGpu, pKernelMemorySystem, 0, 0, numaOnlineSize, &numaNodeId) @ kern_mem_sys.c:971
[ 263.566059] NVRM: RmInitNvDevice: *** Cannot load state into the device
[ 263.566073] NVRM: RmInitAdapter: RmInitNvDevice failed, bailing out of RmInitAdapter
[ 263.769710] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d08f713b62b00 >= 3d08e8e95d7100
[ 263.769719] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[ 263.800075] NVRM: GPU 0009:01:00.0: RmInitAdapter failed! (0x25:0x40:1054)
[ 263.818772] NVRM: GPU 0009:01:00.0: rm_init_adapter failed, device minor number 0
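
The NVRM messages above name the sysfs file and the value the driver expects. A minimal sketch for checking it, assuming a POSIX shell; the function name check_auto_online_blocks is only illustrative, and the optional path argument exists so the check can be tried against a scratch file:

```shell
# Compare the memory auto-onlining policy with the value the NVRM
# messages ask for. The default path is the one quoted in the log.
check_auto_online_blocks() {
    f="${1:-/sys/devices/system/memory/auto_online_blocks}"
    if [ ! -r "$f" ]; then
        echo "cannot read $f"
        return 2
    fi
    mode="$(cat "$f")"
    if [ "$mode" = "online_movable" ]; then
        echo "ok: $mode"
    else
        echo "mismatch: $mode (driver expects online_movable)"
        return 1
    fi
}
```

Calling check_auto_online_blocks with no argument reads the real sysfs file. If it reports a mismatch, the value can be changed at runtime with `echo online_movable | sudo tee /sys/devices/system/memory/auto_online_blocks`, or at boot with the kernel parameter `memhp_default_state=online_movable`. Whether that is sufficient on this system is not confirmed anywhere in this thread.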

Any solution to get past this issue?
Thanks

Which nvidia drivers did you install?

~$ dpkg -l | grep -i nvidia
ii cuda-nsight-compute-12-4 12.4.1-1 arm64 NVIDIA Nsight Compute
ii cuda-nvtx-12-4 12.4.127-1 arm64 NVIDIA Tools Extension
ii libnvidia-cfg1-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-550 550.67-0ubuntu1.22.04.2 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVIDIA libcompute package
ii libnvidia-decode-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVENC Video Encoding runtime library
ii libnvidia-extra-550:arm64 550.67-0ubuntu1.22.04.2 arm64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii nsight-compute-2024.1.1 2024.1.1.4-1 arm64 NVIDIA Nsight Compute
ii nvidia-compute-utils-550 550.67-0ubuntu1.22.04.2 arm64 NVIDIA compute utilities
ii nvidia-dkms-550-open 550.67-0ubuntu1.22.04.2 arm64 NVIDIA DKMS package (open kernel module)
ii nvidia-driver-550-open 550.67-0ubuntu1.22.04.2 arm64 NVIDIA driver (open kernel) metapackage
ii nvidia-firmware-550-550.67 550.67-0ubuntu1.22.04.2 arm64 Firmware files used by the kernel module
ii nvidia-kernel-common-550 550.67-0ubuntu1.22.04.2 arm64 Shared files used with the kernel module
ii nvidia-kernel-source-550-open 550.67-0ubuntu1.22.04.2 arm64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA’s Prime
ii nvidia-settings 510.47.03-0ubuntu1 arm64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-550 550.67-0ubuntu1.22.04.2 arm64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18.2 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-550 550.67-0ubuntu1.22.04.2 arm64 NVIDIA binary Xorg driver

I see you installed the “-open” drivers, which the documentation requires (see Appendix A.1 in
https://docs.nvidia.com/grace-ubuntu-install-guide.pdf).

You could try downgrading the driver to see if that helps, as there may still be some incompatibilities with the most recent driver version. FWIW, I have version 535.161.07.

I downgraded to 535-open and still see the same issue.
~$ dpkg -l | grep -i nvidia
ii cuda-nsight-compute-12-4 12.4.1-1 arm64 NVIDIA Nsight Compute
ii cuda-nvtx-12-4 12.4.127-1 arm64 NVIDIA Tools Extension
ii libnvidia-cfg1-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-535 535.171.04-0ubuntu0.22.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA libcompute package
rc libnvidia-compute-550:arm64 550.67-0ubuntu1.22.04.2 arm64 NVIDIA libcompute package
ii libnvidia-decode-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 NVENC Video Encoding runtime library
ii libnvidia-extra-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-535:arm64 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii nsight-compute-2024.1.1 2024.1.1.4-1 arm64 NVIDIA Nsight Compute
ii nvidia-compute-utils-535 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA compute utilities
ii nvidia-dkms-535-open 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA DKMS package (open kernel module)
ii nvidia-driver-535-open 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA driver (open kernel) metapackage
ii nvidia-firmware-535-535.171.04 535.171.04-0ubuntu0.22.04.1 arm64 Firmware files used by the kernel module
ii nvidia-kernel-common-535 535.171.04-0ubuntu0.22.04.1 arm64 Shared files used with the kernel module
ii nvidia-kernel-source-535-open 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA’s Prime
ii nvidia-settings 510.47.03-0ubuntu1 arm64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-535 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18.2 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-535 535.171.04-0ubuntu0.22.04.1 arm64 NVIDIA binary Xorg driver
~$ nvidia-smi
No devices were found

dmesg errors when running the command ‘nvidia-smi’:
[ 394.805953] ACPI Warning: _SB.PCI9.RP00.GPU0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230331/nsarguments-61)
[ 397.879200] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x400000000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 398.740001] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x4005f0000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 399.582525] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x400be0000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 400.428091] NVRM: Failing GPU memory onlining as the onlining zone is not movable. pa: 0x4011d0000000 size: 0x8000000
NVRM: The NVIDIA GPU 0009:01:00.0 installed in the system
NVRM: requires auto onlining mode online_movable enabled in
NVRM: /sys/devices/system/memory/auto_online_blocks
[ 400.428115] NVRM: GPU memory zone movable auto onlining failed!
[ 463.914492] NVRM: nvAssertOkFailedNoLog: Assertion failed: Generic operating system error [NV_ERR_OPERATING_SYSTEM] (0x00000059) returned from kmemsysNumaAddMemory_HAL(pGpu, pKernelMemorySystem, 0, 0, numaOnlineSize, &numaNodeId) @ kern_mem_sys.c:918
[ 464.266957] NVRM: nvAssertFailedNoLog: Assertion failed: pmaTotalMemorySize >= numaTotalSize @ mem_mgr.c:2987
[ 464.267099] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ mem_mgr.c:705
[ 464.267509] NVRM: RmInitNvDevice: *** Cannot initialize the device
[ 464.267617] NVRM: RmInitAdapter: RmInitNvDevice failed, bailing out of RmInitAdapter
[ 464.297777] NVOC: __nvoc_objDelete: Child class PrereqTracker not freed from parent class OBJGPU.NVRM: iovaspaceDestruct_IMPL: 1 left-over mappings in IOVAS 0x90100
[ 464.297821] NVRM: GPU 0009:01:00.0: RmInitAdapter failed! (0x24:0x40:908)
[ 464.299341] NVRM: GPU 0009:01:00.0: rm_init_adapter failed, device minor number 0
[ 464.885305] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885355] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885380] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885396] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0500
[ 464.885408] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885422] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885434] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885446] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020800000000
[ 464.885456] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0000
[ 464.885467] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885478] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885488] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885499] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885509] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0500
[ 464.885519] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885529] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885539] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885549] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885559] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0500
[ 464.885568] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885579] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885588] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885598] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885609] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0500
[ 464.885618] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885628] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885639] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885648] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885657] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0000
[ 464.885667] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885678] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885687] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885697] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885706] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0520
[ 464.885716] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885727] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885736] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885745] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885754] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0000
[ 464.885764] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885775] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885785] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885795] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885804] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0520
[ 464.885814] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 464.885825] arm-smmu-v3 arm-smmu-v3.8.auto: event 0x10 received:
[ 464.885834] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0001010000000010
[ 464.885845] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000020000000000
[ 464.885855] arm-smmu-v3 arm-smmu-v3.8.auto: 0x00000000ffed0520
[ 464.885865] arm-smmu-v3 arm-smmu-v3.8.auto: 0x0000000000000000
[ 465.177898] NVRM: _kgspBootGspRm: unexpected WPR2 already up, cannot proceed with booting GSP
[ 465.177995] NVRM: _kgspBootGspRm: (the GPU is likely in a bad state and may need to be reset)
[ 465.178078] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 465.207607] NVRM: GPU 0009:01:00.0: RmInitAdapter failed! (0x62:0x40:1671)
[ 465.210183] NVRM: GPU 0009:01:00.0: rm_init_adapter failed, device minor number 0
[ 469.888292] arm_smmu_evtq_thread: 454292 callbacks suppressed
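
The onlining failures are easier to compare across driver versions when reduced to one line per memory block. A rough sketch, assuming a POSIX shell with grep and sed; summarize_onlining_failures is a made-up helper name, and it only understands the exact message format shown in the logs above:

```shell
# Read dmesg text on stdin and print the physical address and the size
# in MiB of every block the driver failed to online.
summarize_onlining_failures() {
    grep 'NVRM: Failing GPU memory onlining' |
    sed -n 's/.*pa: \(0x[0-9a-fA-F]*\) size: \(0x[0-9a-fA-F]*\).*/\1 \2/p' |
    while read -r pa size; do
        # Shell arithmetic accepts the 0x prefix; 1048576 bytes = 1 MiB.
        echo "$pa $(( $size / 1048576 )) MiB"
    done
}
```

Used as `sudo dmesg | summarize_onlining_failures`. For the log above, each failing block has size 0x8000000, which is 128 MiB.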

@kalaivanan1 have you managed to resolve your issue? FWIW, my kernel is 6.2.0-1015-nvidia-64k, and I noticed you are on 6.5.0-28-generic. Have you installed the updates and the NVIDIA-optimised kernel mentioned in step 17 on p. 26 of the install guide? I would try that too, and if it does not resolve the issue, contact the vendor for support.
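
A quick way to confirm which kernel flavour is actually booted, sketched under the assumption that the NVIDIA-optimised kernels keep the "-nvidia-64k" suffix seen in the version strings in this thread; check_kernel_flavor is just an illustrative name:

```shell
# Classify a kernel release string. With no argument, the release of
# the running kernel (uname -r) is used.
check_kernel_flavor() {
    release="${1:-$(uname -r)}"
    case "$release" in
        *-nvidia-64k) echo "nvidia-64k kernel: $release" ;;
        *) echo "other kernel: $release"; return 1 ;;
    esac
}
```

Run check_kernel_flavor after rebooting into the new kernel. As a cross-check, `getconf PAGESIZE` should report 65536 on a kernel built with 64 KiB pages.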

@ilb Not yet. I have reached out to NVIDIA support. I am now running kernel 6.5.0-1015-nvidia-64k, but the GPU discovery issue persists and is being actively worked on.