Crash on multi-GPU training with Linux kernel v5 (NVIDIA driver 418.56, Fedora 29)

The system hangs when a single process controls more than one GPU under Linux kernel v5 with NVIDIA driver 418.56.

The crash happens as soon as neural network inference starts, e.g. with PyTorch’s “DataParallel” module.

I suspect this happens during a call to cudaDeviceSynchronize().

This can be reproduced reliably; a minimal sketch of the failing setup is below.
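A minimal sketch of the kind of code that hangs here (the network and tensor sizes are placeholders; any small model shows the same behaviour on this machine):

[code]
import torch
import torch.nn as nn

# Placeholder model -- any small network reproduces the hang for me.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # one process, two GPUs

x = torch.randn(32, 128).cuda()

with torch.no_grad():
    y = model(x)          # inference starts; the system hangs around here

torch.cuda.synchronize()  # should end up in cudaDeviceSynchronize()
print(y.shape)
[/code]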

I have found references on the web to issues with dual-GPU setups (SLI?) and Linux kernel v5, and I wanted to know whether there are workarounds or updates on the issue:

[url]https://github.com/tensorflow/tensorflow/issues/26653[/url]
[url]https://devtalk.nvidia.com/default/topic/1048320/linux/arch-linux-not-booting-anymore-using-418-43-5-with-x-server-1-20-4-1/post/5320077/#5320077[/url]

Unfortunately I have no control over the operating system and cannot downgrade the Linux kernel.

Best regards