T4 thermal integration

What do I need to know when integrating one or more T4 cards into a host system to ensure optimal, reliable thermal dissipation and performance?

What are the requirements of the “outside” system surrounding the T4 card, in terms of fans and airflow?
It seems like the T4 cannot be expected to work by itself in still air, without outside cooling assistance.

Looking at the integrated sensor temperature (as reported by nvidia-smi), how hot is too hot? In some cases it can easily exceed 70 °C, and the card is too hot to physically handle if removed straight after power-off. Is it required/expected to stay below 50 °C?
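For anyone watching this in practice, the die temperature can be polled programmatically. Below is a minimal Python sketch that parses the CSV output of `nvidia-smi --query-gpu`; the 85 °C slowdown figure is a placeholder assumption, so read your card's actual "GPU Slowdown Temp" / "GPU Shutdown Temp" values from `nvidia-smi -q` before relying on it:

```python
import subprocess

# Placeholder assumption: read your card's real limits from the
# "GPU Slowdown Temp" / "GPU Shutdown Temp" lines of `nvidia-smi -q`.
SLOWDOWN_C = 85

def parse_temps(csv_text):
    """Parse 'index, temperature.gpu' CSV lines from nvidia-smi."""
    temps = {}
    for line in csv_text.strip().splitlines():
        idx, temp = (field.strip() for field in line.split(","))
        temps[int(idx)] = int(temp)
    return temps

def read_temps():
    """Query every GPU in the system (requires the NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_temps(out)

if __name__ == "__main__":
    # Offline demo with captured sample output; swap in read_temps()
    # on a machine with the card installed.
    sample = "0, 71\n1, 68\n"
    for idx, t in parse_temps(sample).items():
        print(f"GPU {idx}: {t} C ({SLOWDOWN_C - t} C below slowdown)")
```

Logging this once a minute during a representative workload gives a much better picture than spot checks after shutdown.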

We’re experiencing some serious issues with T4 overheating, in enterprise ML applications.

The fanless, passive architecture seems like a poor design choice, in my opinion.

The use case here is not silent, fanless home theatre PCs. These systems have to work reliably in industrial applications, with two or more T4 cards per system.

The fanless approach makes the T4 thermal performance much more dependent on the rest of the architecture inside the case, and airflow and/or conductive thermal coupling to the card.

The T4 product brief makes reference to the 'System Design Guide for NVIDIA Enterprise GPU Products Design Guide (DG-07562-001)'.

Where is this document available?

https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-product-brief.pdf

The product brief specifies the supported operating temperature of the card as 0 °C to 50 °C. Note that for this class of data-center card, that figure appears to refer to the ambient (inlet air) temperature of the environment, not the on-die temperature reported by nvidia-smi.

Has anybody actually got the card operating, with a workload, at a temperature that stays at less than 50 °C ?
In my practical experience, this seems very challenging, without spraying the card with liquid nitrogen or something.

“The T4 supports bi-directional airflow either from left to right, or from right to left. CFM requirements are identical for both airflow directions.”

What is the actual CFM requirement?
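The brief doesn't state the number, but a first-principles energy balance gives a lower bound on the bulk airflow. The sketch below assumes the T4's 70 W board power and sea-level air properties; the real requirement in NVIDIA's design guide will be higher, since the heatsink needs a certain air velocity and static pressure through its fins, not just bulk flow through the chassis:

```python
# Rough lower bound on airflow from an energy balance:
#   volumetric flow = P / (rho * cp * dT)
# Assumptions (not from the product brief): sea-level air with
# rho = 1.2 kg/m^3 and cp = 1005 J/(kg K).
RHO = 1.2                  # air density, kg/m^3
CP = 1005.0                # specific heat of air, J/(kg K)
M3S_PER_CFM = 0.000471947  # 1 CFM in m^3/s

def min_cfm(power_w, delta_t_c):
    """Minimum bulk airflow (CFM) needed to carry away power_w
    watts with an air temperature rise of delta_t_c degrees C."""
    m3s = power_w / (RHO * CP * delta_t_c)
    return m3s / M3S_PER_CFM

if __name__ == "__main__":
    # T4 board power is 70 W; allow the air to heat up by 10 C.
    print(f"{min_cfm(70, 10):.1f} CFM")  # roughly a 12 CFM lower bound
```

This says even a modest 12 CFM would carry 70 W away at a 10 °C air rise in the ideal case; the gap between that and the strong fans people actually need reflects the pressure drop across the T4's dense fin stack.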

This person has made an elegant 3D-printed blower fan adapter which channels forced air straight through the channel over the T4 heatsink and outside the back of the case.

I haven’t tried this yet but plan on trying it.

https://www.thingiverse.com/thing:2561569

This looks like it could work very well - but why wasn’t this a standard part of the T4 hardware?

It seems almost unprofessional, an improper use of the product, to attach my own 3D-printed fan duct onto hardware purchased for a professional application. But do I have a better choice?

I would also be interested in such information (thermal integration guidelines and documentation).

We are also trying to integrate such a GPU, and preliminary tests show that the thermal integration is critical to using it.

Hi,
We installed a T4 in an old HP ProLiant SL390s G7 server.
The T4 is ideal since no additional power cables are needed. However, its passive cooling and the small diameter of its heatsink are a huge problem.
It was installed together with K20.
We used gpu-burn to check how it will behave under heavy load.
With no modifications and no special configuration in the server, it went up to 84 °C and started to throttle its clock down to 450 MHz, which is well below expectations.
Then we changed a BIOS setting to enable extra cooling and heavily modified the airflow tunnels to redirect more air, borrowing from the K20's tunnel.
It was a bit risky but partially worked: the K20 ran 5 °C hotter, but the T4 was 10 °C cooler, which let it run at around 750 MHz (not perfect, but better anyway).

We also have observations from a different, similarly old, Dell server. The server was not able to boot correctly until we disabled the temperature diagnostic for a PCI device. After that, the T4 predictably climbed to a ridiculous 100 °C and the server went down in an emergency shutdown. We then found a constant extra-cooling option in the BIOS settings, which enabled the T4 to work correctly. Again, not a perfect solution, due to the high power consumption.

Shall we buy T4s? Yes. Cooling is indeed a problem, but it is the only way to reuse our old machines with new software requiring compute capability >= 5.0. We can install 2x or 3x T4 in a single node instead of 1x K20, which is a good deal. More demanding devices like the K20 or P100 cannot fit due to the lack of power supply capacity and space. The HP SL390s in particular is strange inside, and almost only the original M2070 fits.

I plan to publish more on our experiences but this must wait two weeks or so.

I would also be interested in those thermal specifications.
I am currently integrating the T4 and V100, so if anyone has information about the parameters that are useful for the thermal design, please let me know.

We have just assembled a desktop workstation with a Tesla T4. We are using a Thunderbolt Type-C display connection, but the system does not boot with the T4 GPU, and the card overheats during the startup attempts. However, our system cannot work without a GPU either. Are there any suggestions?
Thanks

We have successfully used a T4 in a small dedicated PCIe extension. The important lessons learned here were:

  1. the T4 has excellent heat dissipation, provided that there is substantial airflow going through the card.

  2. as a consequence of 1., if there is NO airflow, then even at idle the GPU will overheat and be permanently damaged!

In our case, with a 57 CFM fan whose output is almost fully channeled through the GPU, I was able to run a test load (100 % utilization; 70 W) continuously without thermal throttling (25 °C ambient temperature).
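A simple way to validate a setup like this is to log (time, temperature) samples during a sustained load and check that the card actually plateaus rather than still climbing when the run ends. A minimal, GPU-agnostic Python sketch follows; the window size and tolerance are arbitrary assumptions to tune for your sampling interval:

```python
def soak_summary(samples, window=5, tol=1.0):
    """Summarize a thermal soak run.

    samples: list of (seconds, temp_c) pairs taken during a
    sustained load. Reports the peak temperature and whether the
    last `window` samples stayed within `tol` degrees C of each
    other, i.e. whether the card reached a thermal steady state
    instead of still climbing when the run stopped.
    """
    temps = [t for _, t in samples]
    tail = temps[-window:]
    return {
        "peak_c": max(temps),
        "steady": max(tail) - min(tail) <= tol,
    }
```

If `steady` is False, the run was too short to prove anything: the card might still have throttled a few minutes later.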