Random freezes and CUDA errors

Hi guys, I have some code that I’ve been using for a while, but I recently updated my NVIDIA driver (I use Linux) to 440xx and started experiencing random freezes. Training starts fine, but it usually hangs forever somewhere in the first epoch. Other times it runs for 8 epochs and then hangs, sometimes it freezes the entire computer, and sometimes it fails with random device-side assertion errors or CUDA errors (different errors on different runs)…

I’ve already tried rolling back to the older driver, updating PyTorch to 1.6, and several combinations of PyTorch and CUDA (Torch 1.4 + CUDA 10.2 on conda, Torch 1.6 + CUDA 10.2 on conda, Torch 1.6 + CUDA 11 on Arch Linux’s Python), but the issue persists…

My dmesg is full of:

[  959.195648] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195768] pcieport 0000:00:03.0: AER: can't find device of ID0018
[  959.195769] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195898] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  959.195902] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00001040/00002000
[  959.195903] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.195908] pcieport 0000:00:03.0: AER:    [12] Timeout               
[  959.195915] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195986] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.195987] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.195989] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.195992] pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
[  959.195996] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.195998] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.196000] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.196007] pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
[  959.196148] pcieport 0000:00:03.0: AER: can't find device of ID0018
[  959.196150] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.196291] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.196293] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.196294] pcieport 0000:00:03.0: AER:    [ 6] BadTLP                
[  959.196300] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
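
In case it helps anyone compare logs, here is a rough sketch for tallying AER messages like the ones above by device, severity, and error name. The regular expressions are an assumption based only on the line format shown in this post, and the function name summarize_aer is made up for illustration; other kernels may format these lines differently.

```python
import re
from collections import Counter

# Header lines, e.g.
# "[  959.195898] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)"
HEADER_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+pcieport\s+(?P<dev>[0-9a-fA-F:.]+):\s+AER:\s+"
    r"PCIe Bus Error: severity=(?P<sev>[^,]+), type=(?P<layer>[^,]+),"
)
# Detail lines naming the error bit, e.g. "... AER:    [ 6] BadTLP"
DETAIL_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+pcieport\s+(?P<dev>[0-9a-fA-F:.]+):\s+AER:\s+"
    r"\[\s*(?P<bit>\d+)\]\s+(?P<name>\S+)"
)


def summarize_aer(lines):
    """Count AER reports per (device, severity, layer) and per (device, error name)."""
    severities = Counter()
    details = Counter()
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            severities[(m["dev"], m["sev"].strip(), m["layer"].strip())] += 1
            continue
        m = DETAIL_RE.match(line)
        if m:
            details[(m["dev"], m["name"])] += 1
    return severities, details


# A few lines copied from the dmesg output in this post.
SAMPLE = """\
[  959.195898] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  959.195902] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00001040/00002000
[  959.195903] pcieport 0000:00:03.0: AER:    [ 6] BadTLP
[  959.195908] pcieport 0000:00:03.0: AER:    [12] Timeout
[  959.195915] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[  959.195986] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  959.195987] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000040/00002000
[  959.195989] pcieport 0000:00:03.0: AER:    [ 6] BadTLP
"""

if __name__ == "__main__":
    sev, det = summarize_aer(SAMPLE.splitlines())
    for key, count in sev.most_common():
        print("report", key, count)
    for key, count in det.most_common():
        print("error ", key, count)
```

To run it on a real log, save the dmesg output to a file and pass the open file object to summarize_aer instead of SAMPLE.splitlines().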

And when it breaks, I also get a lot of lines like: EDAC sbridge: Seeking for: PCI ID 8086:6f6d

I also noticed that training runs well for a while; then, once the errors start to appear, the time to finish an epoch suddenly increases a lot until the program eventually freezes.
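
One way to pin down when the slowdown starts (so it can be lined up against the dmesg timestamps) is to time each step and flag the ones that are much slower than the recent median. This is only a sketch: the class name SlowdownDetector and its window, factor, and min_samples parameters are made up for illustration. When timing GPU work in PyTorch, the step should be synchronized (for example with torch.cuda.synchronize()) before reading the clock, since kernels launch asynchronously.

```python
import statistics
import time
from collections import deque


class SlowdownDetector:
    """Flag durations that exceed `factor` times the median of recent durations."""

    def __init__(self, window=50, factor=3.0, min_samples=10):
        self.history = deque(maxlen=window)  # most recent step durations, in seconds
        self.factor = factor
        self.min_samples = min_samples

    def record(self, seconds):
        """Store one duration; return True if it looks abnormally slow."""
        slow = False
        if len(self.history) >= self.min_samples:
            baseline = statistics.median(self.history)
            slow = seconds > self.factor * baseline
        self.history.append(seconds)
        return slow


if __name__ == "__main__":
    detector = SlowdownDetector(window=20, factor=3.0, min_samples=5)
    for step in range(10):
        start = time.perf_counter()
        # Stand-in for one training step; step 8 is made artificially slow.
        time.sleep(0.05 if step == 8 else 0.005)
        elapsed = time.perf_counter() - start
        if detector.record(elapsed):
            print(f"step {step}: slow ({elapsed:.3f}s) at wall time {time.time():.0f}")
```

The median is used instead of the mean so that a single slow step does not shift the baseline much.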

This issue appeared after I updated the NVIDIA drivers, and it happens with both PyTorch and TensorFlow. To rule out a hardware problem, I booted into Win10 and did some benchmarking with GFXBench; everything seems fine and the results are comparable to those of other RTX 2080 Ti cards.

Any help is much appreciated. Thanks!

Hi,

I might have a similar problem. I see similar messages in dmesg.

Can you check what the C stack trace is at the time it hangs? E.g. via gdb -p $PID -ex 'thread apply all bt' -ex="set confirm off" -ex quit. In my case, I see cuEventSynchronize in there.
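
If you would rather trigger this from a script (say, from a watchdog that notices the hang), a minimal sketch could wrap the same gdb command with subprocess. The function names here are made up for illustration, and it assumes gdb is installed and that you have permission to attach to the target process.

```python
import os
import shlex
import subprocess


def gdb_backtrace_cmd(pid):
    """Build the gdb invocation from this post for the given process ID."""
    return [
        "gdb", "-p", str(int(pid)),
        "-ex", "thread apply all bt",
        "-ex", "set confirm off",
        "-ex", "quit",
    ]


def dump_native_stacks(pid, timeout=60):
    """Attach gdb to a (possibly hung) process and return gdb's stdout as text.

    Assumes gdb is on PATH and that ptrace attach is permitted for `pid`.
    """
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call dump_native_stacks(pid) on the hung process.
    print(shlex.join(gdb_backtrace_cmd(os.getpid())))
```

Searching the returned text for frames such as cuEventSynchronize then shows which threads are blocked inside the CUDA driver.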

See here for more details:

Maybe this is also related: