I need help here.
While training a TensorFlow model on NVIDIA Quadro RTX 8000 GPUs, the system loses a GPU most of the time with the error message “GPU has fallen off the bus.” After that, a process named “UVM GPU1 BH” goes to 100% CPU usage and cannot be killed with kill -9. dmesg also reports “NVLINK: Failed to train link 0 to remote PCI:0000:24:00”.
I have done some googling to try to resolve the issue, but still no luck.
Any advice will be appreciated.
8 x RTX 8000 with 4 NVLINK bridges
Driver Version: 455.45.01
nvidia-bug-report.log.gz (3.8 MB)
Below is a screenshot of the actual dmesg output.
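As a text alternative to the screenshot, here is a minimal sketch of how the two messages quoted above could be pulled out of saved dmesg output. The function name `filter_gpu_errors` is hypothetical, and it only matches the two strings quoted in this post. The exact prefix and layout of the kernel log lines is an assumption, so the patterns may need adjusting.

```python
import re

# Substrings quoted in this post. The surrounding log format (timestamps,
# prefixes) is an assumption, so only the quoted text itself is matched.
PATTERNS = [
    re.compile(r"GPU has fallen off the bus", re.IGNORECASE),
    re.compile(r"NVLINK: Failed to train link", re.IGNORECASE),
]


def filter_gpu_errors(dmesg_text):
    """Return the lines of dmesg_text that contain one of the quoted messages."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(pattern.search(line) for pattern in PATTERNS)
    ]
```

One way to use it would be to save the kernel log to a file (for example with `dmesg > dmesg.txt`), read the file into a string, and print whatever `filter_gpu_errors` returns.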