nvidia-smi -L can't see one of the A30 GPUs

We have a Dell PowerEdge R7525 server with two A30 GPUs running Debian 12. Everything worked fine until we moved it to a new site and upgraded to the latest Debian 12.4.

Now nvidia-smi shows only one GPU, although lspci shows both:

$ sudo lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
$ sudo nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-b61d4129-a6c8-5826-13ac-dec0a24a6ff4)

I checked the PCIe section in the Dell iDRAC, and it shows both GPUs as healthy.
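In case it matters, this is how I was planning to double-check the negotiated PCIe link state from the OS side as well (just a sketch; the bus addresses are taken from the lspci output above):

```shell
# Compare the PCIe link capability vs. the negotiated link state of both
# A30s. After a physical move, a degraded link (e.g. "Speed 2.5GT/s,
# Width x1" instead of the full 16GT/s x16) can point at a riser or
# seating problem even when iDRAC reports the card as healthy.
for dev in 0000:21:00.0 0000:81:00.0; do
    echo "== $dev =="
    sudo lspci -vvv -s "$dev" | grep -E 'LnkCap:|LnkSta:'
done
```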

I tried downgrading to older Debian 12 kernels and also tried the latest NVIDIA GPU driver for the A30 on Linux (x86_64-535.129.03); neither helped. I am currently running driver 525.105.17, since that is the version that used to work on this server.

Here is the relevant dmesg output:

[Mon Jan 15 21:23:19 2024] VFIO - User Level meta-driver version: 0.3
[Mon Jan 15 21:23:19 2024] nvidia: loading out-of-tree module taints kernel.
[Mon Jan 15 21:23:19 2024] nvidia: module license 'NVIDIA' taints kernel.
[Mon Jan 15 21:23:19 2024] Disabling lock debugging due to kernel taint
[Mon Jan 15 21:23:19 2024] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Mon Jan 15 21:23:19 2024] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[Mon Jan 15 21:23:19 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
[Mon Jan 15 21:23:19 2024] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[Mon Jan 15 21:23:19 2024] nvidia-uvm: Loaded the UVM driver, major device number 236.
[Mon Jan 15 21:23:19 2024] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
[Mon Jan 15 21:23:20 2024] [drm] [nvidia-drm] [GPU ID 0x00002100] Loading driver
[Mon Jan 15 21:23:20 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:21:00.0 on minor 1
[Mon Jan 15 21:23:20 2024] [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
[Mon Jan 15 21:23:20 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 2
[Mon Jan 15 21:23:20 2024] [drm] [nvidia-drm] [GPU ID 0x00008100] Unloading driver
[Mon Jan 15 21:23:20 2024] [drm] [nvidia-drm] [GPU ID 0x00002100] Unloading driver
[Mon Jan 15 21:23:20 2024] nvidia-modeset: Unloading
[Mon Jan 15 21:23:20 2024] nvidia-uvm: Unloaded the UVM driver.
[Mon Jan 15 21:23:20 2024] nvidia-nvlink: Unregistered Nvlink Core, major device number 238
[Mon Jan 15 21:23:24 2024] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[Mon Jan 15 21:23:24 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
[Mon Jan 15 21:23:24 2024] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
[Mon Jan 15 21:23:24 2024] [drm] [nvidia-drm] [GPU ID 0x00002100] Loading driver
[Mon Jan 15 21:23:24 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:21:00.0 on minor 1
[Mon Jan 15 21:23:24 2024] [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
[Mon Jan 15 21:23:24 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 2
[Mon Jan 15 21:23:43 2024] nvidia 0000:21:00.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[Mon Jan 15 21:23:45 2024] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[Mon Jan 15 21:23:46 2024] NVRM: GPU 0000:81:00.0: RmInitAdapter failed! (0x62:0xffff:2351)
[Mon Jan 15 21:23:46 2024] NVRM: GPU 0000:81:00.0: rm_init_adapter failed, device minor number 1
[Mon Jan 15 21:23:46 2024] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[Mon Jan 15 21:23:46 2024] NVRM: GPU 0000:81:00.0: RmInitAdapter failed! (0x62:0xffff:2351)
[Mon Jan 15 21:23:46 2024] NVRM: GPU 0000:81:00.0: rm_init_adapter failed, device minor number 1
[Mon Jan 15 21:27:37 2024] nvidia 0000:21:00.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[Mon Jan 15 21:27:39 2024] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[Mon Jan 15 21:27:39 2024] NVRM: GPU 0000:81:00.0: RmInitAdapter failed! (0x62:0xffff:2351)
[Mon Jan 15 21:27:39 2024] NVRM: GPU 0000:81:00.0: rm_init_adapter failed, device minor number 1
[Mon Jan 15 21:27:39 2024] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[Mon Jan 15 21:27:39 2024] NVRM: GPU 0000:81:00.0: RmInitAdapter failed! (0x62:0xffff:2351)
[Mon Jan 15 21:27:39 2024] NVRM: GPU 0000:81:00.0: rm_init_adapter failed, device minor number 1

We opened the server and checked the GPU riser cards and cables but didn't see any issues. We also tried upgrading the Dell server firmware (iDRAC, BIOS, etc.), which didn't help either.
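One thing we have not tried yet is forcing a PCIe re-enumeration of just the bad GPU from sysfs, to re-train the link without a full reboot. A sketch of what I had in mind (untested on this box):

```shell
#!/bin/sh
# Sketch (untested here): detach the failing GPU from the PCI bus and ask
# the kernel to rediscover it, forcing link re-training without a reboot.
# The nvidia driver stack should be unloaded first so nothing holds the device.
dev=0000:81:00.0

sudo modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia   # unload driver stack
echo 1 | sudo tee "/sys/bus/pci/devices/$dev/remove"           # drop the device
sleep 1
echo 1 | sudo tee /sys/bus/pci/rescan                          # re-enumerate the bus
sudo modprobe nvidia                                           # reload the driver
```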

We also run "nvidia-persistenced", and it fails on the same GPU:

Jan 15 22:03:14 bja systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Jan 15 22:03:14 bja nvidia-persistenced[111453]: Started (111453)
Jan 15 22:03:15 bja nvidia-persistenced[111453]: device 0000:81:00.0 - failed to open.
Jan 15 22:03:15 bja systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.

Any thoughts? Should I request that any parts or cables be replaced?

koi@bja:/proc/driver/nvidia/gpus$ cat 0000\:21\:00.0/information           
Model: 		 NVIDIA A30
IRQ:   		 243
GPU UUID: 	 GPU-b61d4129-a6c8-5826-13ac-dec0a24a6ff4
Video BIOS: 	 92.00.66.00.04
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:21:00.0
Device Minor: 	 0
GPU Firmware: 	 525.105.17
GPU Excluded:	 No
koi@bja:/proc/driver/nvidia/gpus$ 
koi@bja:/proc/driver/nvidia/gpus$ cat 0000\:81\:00.0/information 
Model: 		 NVIDIA A30
IRQ:   		 244
GPU UUID: 	 GPU-79a5f55c-2a6c-bf44-47e3-d43bad5d8a7f
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:81:00.0
Device Minor: 	 1
GPU Firmware: 	 N/A
GPU Excluded:	 No

The bad GPU shows no "Video BIOS" or "GPU Firmware" version.
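Since the failure happens right after gsp_tu10x.bin is loaded, I was also considering ruling out the GSP firmware path by disabling it via the driver's NVreg_EnableGpuFirmware module parameter. This is only my assumption that the RmInitAdapter failure is GSP-related; sketch:

```shell
# Sketch (assumption: the RmInitAdapter failure is tied to the GSP
# firmware path, based on the gsp_tu10x.bin load immediately before the
# failure in dmesg). Disable GSP offload for the nvidia driver:
echo 'options nvidia NVreg_EnableGpuFirmware=0' | sudo tee /etc/modprobe.d/nvidia-gsp.conf
sudo update-initramfs -u    # Debian: rebuild initramfs so the option applies at early boot
sudo reboot
```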

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

We were in a hurry to move this server to the new location, so we temporarily "borrowed" another A30 from a spare server and got it working.

We will need to put this "bad" GPU into a server and bring it online in order to run the report.