We have some GPUs that randomly lock up and become non-responsive.
These GPUs run TensorFlow jobs and, after working for some time, stop responding.
The hardware SKUs are Titan XP and 1080 Ti.
The driver version is 410.78, with CUDA 10 on Debian 8 (Jessie), Linux kernel 4.4.92.
When the GPUs finally respond, we see ERR! reported for fan speed and current power usage (in watts); the other metrics report fine.
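In case it helps with reproducing or timestamping the failure, here is a minimal sketch of how the two affected metrics could be polled and flagged. It assumes nvidia-smi is on the PATH and uses its --query-gpu fields index, name, fan.speed and power.draw. The helper names and the error markers it looks for ("ERR" and "Error") are assumptions for illustration, since the exact string printed in CSV mode may differ from the ERR! shown in the default table.

```python
import subprocess

# Properties passed to `nvidia-smi --query-gpu`, in output column order.
QUERY_FIELDS = ["index", "name", "fan.speed", "power.draw"]
# Only these columns are checked for error values.
METRIC_FIELDS = ("fan.speed", "power.draw")
# Assumption: an unreadable metric contains one of these substrings.
ERROR_MARKERS = ("ERR", "Error")


def find_error_fields(csv_line):
    """Return {field: value} for each checked metric that looks like an error."""
    values = [v.strip() for v in csv_line.split(",")]
    record = dict(zip(QUERY_FIELDS, values))
    return {
        field: record[field]
        for field in METRIC_FIELDS
        if field in record and any(m in record[field] for m in ERROR_MARKERS)
    }


def query_gpus(timeout_s=30):
    """Run nvidia-smi once and return one CSV line per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    # A hung GPU can block nvidia-smi, so the call is bounded by a timeout.
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def main():
    try:
        lines = query_gpus()
    except (subprocess.SubprocessError, OSError) as exc:
        # Covers a timeout, a non-zero exit status, or nvidia-smi being absent.
        print("nvidia-smi query failed:", exc)
        return
    for line in lines:
        bad = find_error_fields(line)
        if bad:
            print("GPU", line.split(",")[0].strip(), "reports errors:", bad)


if __name__ == "__main__":
    main()
```

Run from cron or a loop with timestamps, this would at least show when a card first starts reporting errors relative to the job that was running on it.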
I am not sure what can be done to mitigate this problem.
If more information is needed, I will be happy to provide it.
nvidia-bug-report.log.gz (2.21 MB)