GPU has fallen off the bus (Xid 79)

$ dmesg -T 
[Tue Apr  1 16:39:18 2025] NVRM: GPU at PCI:0000:b1:00: GPU-40f90754-6e68-502b-4a05-f1f7df8d092b
[Tue Apr  1 16:39:18 2025] NVRM: GPU Board Serial Number: 1324323012725
[Tue Apr  1 16:39:18 2025] NVRM: Xid (PCI:0000:b1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Tue Apr  1 16:39:18 2025] NVRM: GPU 0000:b1:00.0: GPU has fallen off the bus.
[Tue Apr  1 16:39:18 2025] NVRM: GPU 0000:b1:00.0: GPU serial number is 1324323012725.
[Tue Apr  1 16:39:18 2025] NVRM: A GPU crash dump has been created. If possible, please run
                           NVRM: nvidia-bug-report.sh as root to collect this data before
                           NVRM: the NVIDIA kernel module is unloaded.
$ nvidia-debugdump --list
Found 2 NVIDIA devices
        Device ID:              0
        Device name:            NVIDIA RTX A6000
        GPU internal ID:        1324223019966

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error
$ lsmod | grep nvidia
nvidia_uvm           1323008  14
nvidia_drm             65536  0
nvidia_modeset       1298432  1 nvidia_drm
nvidia              56778752  908 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  4 ast,nvidia_drm
drm                   495616  7 drm_kms_helper,drm_vram_helper,ast,nvidia,nvidia_drm,ttm

I have reinstalled the driver (now version 535.230.02) and tried moving the graphics card to a different slot, but the problem still occurs.
nvidia-bug-report.log.gz (742.4 KB)

This is most commonly (but not exclusively) caused by insufficient power from a PSU (see similar threads on this forum), so start by verifying this.
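
One rough way to start that check is to log per-GPU power draw while the workload runs and compare the peaks with what the PSU can deliver. The sketch below assumes the samples come from `nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv,noheader,nounits -l 1`; `peak_power` is a hypothetical helper name, not part of any NVIDIA tool. GPU-side readings cannot prove the PSU is adequate, they only show whether a failure coincides with a power spike.

```shell
#!/usr/bin/env bash
# Sketch: summarise peak power draw per GPU from nvidia-smi CSV samples on stdin.
# Expected input rows (assumed format): "<index>, <power.draw>, <power.limit>"
peak_power() {
  awk -F', *' '
    # Skip rows where the driver could not report a number (e.g. "[N/A]").
    $2 !~ /^[0-9.]+$/ { next }
    {
      # Track the highest draw seen for each GPU index.
      if (!($1 in peak) || $2 + 0 > peak[$1]) peak[$1] = $2 + 0
      limit[$1] = $3
    }
    END {
      for (i in peak)
        printf "GPU %s peak %.2f W (limit %s W)\n", i, peak[i], limit[i]
    }
  ' | sort
}

# Example usage (run while the workload is active, stop with Ctrl-C):
#   nvidia-smi --query-gpu=index,power.draw,power.limit \
#     --format=csv,noheader,nounits -l 1 | tee power.csv
#   peak_power < power.csv
```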

Despite recently replacing the server’s PSUs with two new 2000W units, the problem persisted. When running an LLM task, one GPU consistently hit the ‘fallen off the bus’ error, even under light load. I identified the problematic GPU by the serial number reported in dmesg and removed it. The remaining GPU has now run the same task stably under full load for over 30 minutes. Can we conclude that the removed GPU is likely faulty?
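
For anyone repeating this step, the serial number of the failing card can be pulled out of the kernel log rather than read by eye. This is a minimal sketch that assumes the NVRM message format shown in the dmesg output earlier in this thread; `xid79_gpus` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print "<pci-address> <serial>" for each GPU that logged
# "GPU has fallen off the bus". Reads kernel log text on stdin.
xid79_gpus() {
  awk '
    # Remember every PCI address that reported the failure.
    /NVRM: GPU [^ ]+: GPU has fallen off the bus/ {
      addr = $0
      sub(/.*NVRM: GPU /, "", addr)
      sub(/: GPU has fallen off the bus.*/, "", addr)
      fallen[addr] = 1
    }
    # Print the serial number only for addresses recorded above.
    /NVRM: GPU [^ ]+: GPU serial number is / {
      addr = $0
      sub(/.*NVRM: GPU /, "", addr)
      sub(/: GPU serial number is .*/, "", addr)
      serial = $0
      sub(/.*GPU serial number is /, "", serial)
      sub(/[^0-9].*$/, "", serial)
      if (addr in fallen) print addr, serial
    }
  '
}

# Example usage:
#   sudo dmesg | xid79_gpus
```

The printed serial can then be compared with the label on the physical board, or with the `serial` column of `nvidia-smi --query-gpu=index,pci.bus_id,serial --format=csv` on a boot where both GPUs are still visible.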

That would be my bet in this situation. As a final test, you could try running the removed GPU on its own in another machine to see whether the problem persists.