I’m getting this error when training Kaldi models on 4 ZOTAC RTX 2080 Ti cards in an ASUS WS X299 SAGE/10G mobo.
It happens in the middle of training, at a different point each time: all the GPUs start computing and training the model, then suddenly one (or several) of them fails and I get this error.
I have ruled out every software cause I could think of and am starting to consider hardware problems.
It’s important to mention that this server suffered an accidental shutdown a couple of weeks ago, in the middle of a training run that was already far along and had shown no failures, so I am wondering whether that damaged something in the GPUs.
Any advice would be greatly appreciated.
Curious fact: today one of the cards failed with the same error right after I plugged in an external USB drive!
It is unlikely that the GPUs were damaged by a sudden unexpected shutdown. It is much more likely that the GPUs triggered the sudden unexpected shutdown.
Unexpected spurious failures are most frequently caused by an insufficient power supply. What are the hardware specifications of this system (CPU, DRAM, mass storage)? What is the nominal wattage of the power supply unit (PSU)? Does the PSU claim compliance with an 80 PLUS level (e.g. Gold, Platinum)? Alternatively, what are the make and model of the PSU?
Based on a guess at the system configuration, a 1900W power supply would guarantee a rock-solid system; a 1600W power supply may do in a pinch.
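As a rough sketch of how an estimate of this kind can be made: add up the nominal wattage of the main components and divide by the fraction of PSU capacity you are willing to load. Every number below is an assumption for illustration, not a measurement of this system: 250 W per GPU (factory-overclocked cards may draw more), 140 W for the CPU (the actual CPU is not known yet), and a 60% target load as a conservative rule of thumb. The function names are made up for this example.

```python
# Back-of-the-envelope PSU sizing. All wattages are ASSUMED nominal values;
# replace them with the figures from the actual component data sheets.

ASSUMED_GPU_WATTS = 250.0   # assumption: nominal board power per GPU
ASSUMED_CPU_WATTS = 140.0   # assumption: CPU TDP, actual CPU unknown
ASSUMED_TARGET_LOAD = 0.60  # assumption: conservative rule of thumb


def total_nominal_watts(num_gpus, gpu_watts, cpu_watts, other_watts=0.0):
    """Sum the nominal power of the GPUs, the CPU and any other components."""
    return num_gpus * gpu_watts + cpu_watts + other_watts


def psu_watts_for_load(total_watts, target_load_fraction):
    """PSU rating at which total_watts equals the given fraction of capacity."""
    if not 0.0 < target_load_fraction <= 1.0:
        raise ValueError("target_load_fraction must be in (0, 1]")
    return total_watts / target_load_fraction


def load_fraction(total_watts, psu_watts):
    """Fraction of the PSU's nominal rating consumed by total_watts."""
    return total_watts / psu_watts


total = total_nominal_watts(4, ASSUMED_GPU_WATTS, ASSUMED_CPU_WATTS)
print(f"Nominal component power: {total:.0f} W")
print(f"PSU rating for {ASSUMED_TARGET_LOAD:.0%} load: "
      f"{psu_watts_for_load(total, ASSUMED_TARGET_LOAD):.0f} W")
print(f"Load on a 1600 W PSU: {load_fraction(total, 1600.0):.0%}")
```

GPUs can draw well above their nominal power for very short intervals, which is why leaving generous headroom matters more for a multi-GPU box than the sum of the nominal figures suggests.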
I see that Zotac offers multiple different versions of the RTX 2080 Ti, including Extreme, Maxx, and ArcticStorm. Which one are you using?
Check whether all auxiliary power cables are connected properly, and that no converters, Y-splitters, or daisy chaining are used. Make sure each card is properly seated in the PCIe slot and mechanically secured at the bracket.
Monitor GPU temperature under full load with nvidia-smi to check whether there are any thermal issues.
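If you want a log you can line up with the time of a failure, a small polling script around nvidia-smi is one option. The following is a minimal sketch: it uses nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options, while the helper names, the 2-second interval, and the 85 °C warning threshold are arbitrary choices for illustration (check the data sheet of your card for its actual thermal limits).

```python
import csv
import io
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["index", "temperature.gpu", "power.draw", "clocks.sm"]


def to_number(text):
    """Convert a CSV field to float; return None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_smi_csv(output):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(output), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [to_number(field.strip()) for field in record]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def hot_gpus(rows, limit_c):
    """Return the indices of GPUs at or above limit_c degrees Celsius."""
    return [int(r["index"]) for r in rows
            if r["temperature.gpu"] is not None and r["temperature.gpu"] >= limit_c]


def read_gpu_status():
    """Run nvidia-smi once and return the parsed readings."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_smi_csv(result.stdout)


def watch(interval_s=2.0, limit_c=85.0):
    """Print timestamped readings until interrupted (Ctrl+C)."""
    while True:
        rows = read_gpu_status()
        stamp = time.strftime("%H:%M:%S")
        for r in rows:
            print(stamp, r)
        for idx in hot_gpus(rows, limit_c):
            print(f"{stamp} WARNING: GPU {idx} at or above {limit_c} C")
        time.sleep(interval_s)
```

Call `watch()` in a second terminal while the training job runs, and redirect the output to a file so that the last readings before a failure are preserved.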