We have some GPUs that randomly lock up and become unresponsive.
These GPUs are running TensorFlow jobs and, after working for some time, stop responding. My process runs in Docker, and the container can no longer be stopped.
The driver version is 418, with CUDA 10 and Ubuntu 22.214.171.124; the GPUs are 1080 Ti cards.
I am not sure what can be done to help mitigate this problem.
If more information is needed, I will be happy to provide it.