The system hangs when one process controls more than one GPU under Linux kernel 5.x with the NVIDIA Linux driver 418.56.
The hang occurs as soon as inference of a neural network starts, e.g. with PyTorch’s “DataParallel” module.
I suspect this happens during a call to cudaDeviceSynchronize().
This can be reproduced reliably.
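For reference, here is a minimal sketch along the lines of what triggers the hang on my setup (the model and tensor sizes are placeholders, and it assumes at least two GPUs are visible to the process):

[code]
import torch
import torch.nn as nn

# Placeholder model; any network replicated across multiple GPUs shows the problem here.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = nn.DataParallel(model).cuda()   # replicate the model across all visible GPUs

x = torch.randn(64, 1024).cuda()

with torch.no_grad():
    y = model(x)             # forward pass is scattered across the GPUs

torch.cuda.synchronize()     # wraps cudaDeviceSynchronize(); this is where the hang appears to occur
print(y.shape)
[/code]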
I have found references on the web to issues with dual-GPU setups (SLI?) and Linux kernel 5.x, and I would like to know whether there are any workarounds or updates on this issue:
[url]https://github.com/tensorflow/tensorflow/issues/26653[/url]
[url]https://devtalk.nvidia.com/default/topic/1048320/linux/arch-linux-not-booting-anymore-using-418-43-5-with-x-server-1-20-4-1/post/5320077/#5320077[/url]
Unfortunately, I have no control over the operating system and cannot downgrade the Linux kernel.
Best regards