RTX 5090 - ERROR: GPU:1: Error while waiting for GPU progress

Hi,

While running a distributed PyTorch training job, my computer went to a black screen showing “ERROR: GPU:1: Error while waiting for GPU progress: 0x0000ca7d:0 2:0:4048:4040”. I’m running dual RTX 5090s, and since this error, using GPU:1 always gives the same error. GPU:0 works fine, though.

I’m on Ubuntu 20.04 with CUDA 12.8, using the NVIDIA driver (open kernel) metapackage from nvidia-driver-570-open (proprietary).

Here’s the bug report log.
nvidia-bug-report.log.gz (945.2 KB)