Ubuntu 18.04 with 2 RTX 2080 Ti system frozen when training deep learning models using cuda

okay thanks! I don’t think that either, but the weird thing is that only happens when I run deep learning model training…

Hi tera, thanks for your information, I tried both of my rtx2080ti, similar issue, that’s why im so puzzled, do you think it’s cuda problem or the gpu problem?

Unfortunately I have no idea. It’s a production machine, so I can’t do many experiments on it. Also, all our other RTX cards are in a different location so I can’t easily swap them around.

What I’ve just done is run the same job on a different machine with an RTX 2080Ti and mostly the same software installed. If that crashes the machine overnight, some people will be very angry with me tomorrow morning…

The other machine ran the job just fine. So this seems to point in the direction of a hardware issue.

Okay, thanks for the info, planning to replace psu and motherboard… see how it goes…

My case was a hardware failure.

It turns out to be one of the ram chip is defective, casing that hard freezing. thanks for you guys’ help!

ram chip on GPU card, or system RAM?

What specifically was the hardware issue? And for cabling from GPU to PSU, did you use a cable with 2 heads on both sides?