My deep learning process is stuck in D state, and I want to know why

We have some GPUs that randomly lock up and become unresponsive.

These GPUs are running TensorFlow jobs and, after working for some time, stop responding. My process runs inside Docker, and the container can no longer be stopped.

The driver version is 418, with CUDA 10 on Ubuntu (kernel 4.4.0.116). The GPUs are 1080 Ti cards.

I am not sure what can be done to help mitigate this problem.

If more information is needed, I will be happy to provide it.
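
For reference, here is a rough sketch of how the stuck processes could be listed on the affected host. It assumes a Linux `/proc` filesystem; the helper names (`parse_stat_line`, `read_wchan`, `find_d_state`) are made up for illustration and are not part of any library. It only reports which processes are in D (uninterruptible sleep) state and the kernel symbol they are waiting in, when the kernel exposes it. It does not fix anything.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) parsed from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel wait channel for pid, or '?' if it is unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as f:
            return f.read().strip() or "?"
    except OSError:
        return "?"


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm, wchan) for processes in D state."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry is unreadable
        if state == "D":
            stuck.append((pid, comm, read_wchan(pid, proc_root)))
    return stuck


if __name__ == "__main__":
    for pid, comm, wchan in find_d_state():
        print(pid, comm, wchan)
```

If the wait channel points into the NVIDIA kernel module, that would at least suggest the process is blocked inside the driver, though it would not prove a driver bug on its own.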

Is this caused by a driver bug?