RTX 8000 with NVLink is lost during TensorFlow training with the error message "GPU has fallen off the bus"

Hi all.

I need help here.

While training a TensorFlow model on NVIDIA Quadro RTX 8000 GPUs, a GPU is lost most of the time with the error message “GPU has fallen off the bus.” After that, the process named “UVM GPU1 BH” goes to 100% CPU usage and can’t be killed, even with kill -9. dmesg also reports “NVLINK: Failed to train link 0 to remote PCI:0000:24:00”.

I have done some googling to try to resolve the issue, but no luck so far.
Any advice will be appreciated.

GPU Spec:
8 x RTX 8000 with 4 NVLink bridges
Driver Version: 455.45.01

Thanks

nvidia-bug-report.log.gz (3.8 MB)

Below is a screenshot of the actual dmesg output.

Probably either overheating or an insufficient power supply, I guess. Please monitor the temperatures and try limiting the GPU clocks.
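
In case it helps with the monitoring part, here is a rough Python sketch that polls `nvidia-smi` for temperature, power draw and SM clock per GPU. The helper names (`parse_gpu_csv`, `hot_gpus`, `query_gpus`) and the 85 °C threshold are my own assumptions for illustration, not anything from the driver or from NVIDIA's documentation, so adjust them to your cards.

```python
import subprocess

# Standard nvidia-smi query properties: GPU index, core temperature (C),
# power draw (W) and current SM clock (MHz).
FIELDS = "index,temperature.gpu,power.draw,clocks.sm"


def _num(value):
    """Convert a CSV field to float; return None for entries like [N/A]."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        idx, temp, power, clock = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(idx),
            "temp_c": _num(temp),
            "power_w": _num(power),
            "sm_mhz": _num(clock),
        })
    return rows


def hot_gpus(rows, limit_c=85.0):
    """Return indices of GPUs at or above limit_c (assumed threshold)."""
    return [r["index"] for r in rows
            if r["temp_c"] is not None and r["temp_c"] >= limit_c]


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `query_gpus()` every few seconds while the training job runs and logging the result should show whether temperature or power draw spikes just before the GPU drops. For the limiting part, `nvidia-smi -pl` (power limit) and `nvidia-smi -lgc` (lock GPU clocks) are the usual knobs; both need root, and suitable values depend on the card.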