Some GPUs do not work during image training

On two 8x V100 systems, some GPUs do not work during image training.
I want to check whether this problem is a hardware defect.

In the nvidia-smi --query output, the information for the GPUs that do not work differs from the information for the GPUs that are running. Is this related to the problem?

Clocks Throttle Reasons

- Idle : **Active**
- Applications Clocks Setting : Not Active
- SW Power Cap : Not Active
- HW Slowdown : Not Active
- HW Thermal Slowdown : Not Active
- HW Power Brake Slowdown : Not Active
- Sync Boost : Not Active
- SW Thermal Slowdown : Not Active
- Display Clock Setting : Not Active
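To compare the GPUs side by side, the throttle reasons can be pulled out of the nvidia-smi text output with a small script. The following is only a sketch: the helper names (`parse_throttle_reasons`, `active_reasons`, `query_live`) are hypothetical, the sample text is the section quoted above, and it assumes the driver labels the section "Clocks Throttle Reasons" (some newer drivers may use a different label). As far as I understand the nvidia-smi documentation, "Idle : Active" only means that nothing is running on that GPU and its clocks have dropped to the idle state, so by itself it would suggest the training job is not placing work on that GPU rather than prove a hardware defect.

```python
import subprocess

# Reason names as they appear in the output quoted in this post.
REASONS = (
    "Idle",
    "Applications Clocks Setting",
    "SW Power Cap",
    "HW Slowdown",
    "HW Thermal Slowdown",
    "HW Power Brake Slowdown",
    "Sync Boost",
    "SW Thermal Slowdown",
    "Display Clock Setting",
)

# Sample section copied from the post (one GPU).
SAMPLE = """\
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
"""


def parse_throttle_reasons(text):
    """Return one dict per GPU section, mapping reason name -> state string."""
    gpus = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Clocks Throttle Reasons"):
            # A new section starts: one section per GPU in the report.
            current = {}
            gpus.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            if key in REASONS:  # ignore unrelated "name : value" lines
                current[key] = value.strip()
    return gpus


def active_reasons(gpu):
    """List the reasons reported as exactly 'Active' for one GPU."""
    return [name for name, state in gpu.items() if state == "Active"]


def query_live():
    """Run nvidia-smi on this machine (assumes nvidia-smi is on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "-q", "-d", "PERFORMANCE"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_throttle_reasons(out)


# Parse the sample from the post; on the real system use query_live() instead.
for index, gpu in enumerate(parse_throttle_reasons(SAMPLE)):
    print(index, active_reasons(gpu))
```

On the affected systems, replacing `parse_throttle_reasons(SAMPLE)` with `query_live()` while the training job is running would show, per GPU section in report order, which reasons are active. GPUs that report only Idle during training are the ones the job is not using.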

I have attached the nvidia-bug-report log.

nvidia-bug-report_gpua-v100-wa909.log (7.5 MB)