Nvidia Tesla T4 crashing

So I recently purchased a new T4. However, when I try to run training in PyTorch or TensorFlow, the GPU crashes abruptly before the training starts. I collected the dmesg output and ran nvidia-bug-report.sh to get more details. Please have a look at the attached files. Could anyone suggest why the GPU is falling off the bus?
dmesg.txt (105.7 KB) nvidia-bug-report.log.gz (890.8 KB)

The T4 doesn’t have its own fan; it relies on the server chassis to provide the necessary airflow. If you’re using it in a normal desktop/workstation, you’ll need an add-on fan for it. Otherwise you’ll get
XID 79 the gpu has fallen off the bus
like in your dmesg.
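
For anyone who wants to check their own log for the same error, here is a minimal sketch that scans dmesg text for NVIDIA Xid lines. It assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line format written by the driver; the function name `find_xid_events` is made up for illustration and is not part of any NVIDIA tool.

```python
import re

# Assumed line format written by the NVIDIA kernel driver, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]


# Example use on a saved log (the file name is only a placeholder):
#   with open("dmesg.txt") as f:
#       for pci, code in find_xid_events(f.read()):
#           print(pci, "Xid", code)
```

An entry with code 79 is the "fallen off the bus" case discussed here.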


Hi @generix ,
Thanks for your quick reply. Indeed, it doesn’t have its own fan. But I was checking the temperature and it was not overheating. Maybe I can run "watch nvidia-smi" for continuous monitoring. Even the GPU utilization was less than 60%. But is it possible that it is also due to the graphics clock or something else?
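
As an alternative to "watch nvidia-smi", a rough sketch of logging the relevant values through nvidia-smi's CSV query mode could look like this. The helper names `parse_sample` and `read_sample` are made up for illustration, and GPU index 0 is an assumption about the setup.

```python
import subprocess

# Fields passed to `nvidia-smi --query-gpu`.
FIELDS = ["temperature.gpu", "power.draw", "utilization.gpu", "clocks.sm"]


def _to_float(text):
    """Convert a CSV cell to float; return None for values like "[N/A]"."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sample(csv_line):
    """Parse one line of `--format=csv,noheader,nounits` output into a dict."""
    values = [v.strip() for v in csv_line.split(",")]
    return dict(zip(FIELDS, (_to_float(v) for v in values)))


def read_sample(gpu_index=0):
    """Query nvidia-smi once for the given GPU (index 0 is an assumption)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])
```

Calling `read_sample()` in a loop with a short sleep during training gives a log of temperature, power draw, utilization and SM clock right up to the moment of the crash.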

XID 79 can also be due to power problems, though unlikely since the T4 is bus-powered only, so only applicable if it’s overstretching your mainboard’s slot power.
Nevertheless, without proper cooling, there’s no sense in looking any further.

Indeed, it seems to be the cooling. I added a small fan next to it and it is not crashing anymore.

Thanks a lot!!

The cooling is still insufficient: the T4 reached its default temperature target of 84°C at a power draw of 41W with 99% GPU usage, meaning it’s throttling clocks to stay alive.
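
The reasoning behind that can be written down as a simple heuristic. This is only a sketch: the 70 W board power limit is taken from the T4's published spec, and the 90% "busy" threshold and the 0.8 power margin are arbitrary assumptions, not values reported by the driver.

```python
def looks_thermally_limited(temp_c, power_w, util_pct,
                            temp_target_c=84.0, power_limit_w=70.0,
                            busy_pct=90.0, power_margin=0.8):
    """Heuristic: the GPU is busy and sits at its temperature target,
    yet draws well below its power limit, which suggests the clocks
    are being held back by temperature rather than by power."""
    at_target = temp_c >= temp_target_c
    busy = util_pct >= busy_pct
    below_power = power_w < power_margin * power_limit_w
    return at_target and busy and below_power


# Values from the post above: 84°C, 41 W, 99% utilization.
print(looks_thermally_limited(84, 41, 99))  # True
```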

OK, I will try to add a better fan for proper cooling. Thanks for your suggestion. I will get back to you on that soon.

I put in a bigger fan now. However, the temperature still quickly caps at 84°C. I’m quite sure the heat dissipation is good, but it still saturates. Could there be any power-related issues?

No, that’s unrelated to power. To get an impression, see this:
If you don’t have access to a 3D printer, check on eBay.