RTX Titan heavy throttling

Hello,
I have an issue with one of two RTX Titan cards in a deep-learning workstation. If I run a deep-learning workload or a stress test on the machine, one card tops out at 78°C with a fan speed of about 60%, ~270W power consumption and GPU clocks around 1860 MHz, which seems to be OK. However, the other card under the same load reaches a steady state of 89°C with a fan speed of 100%, a power consumption of only 168W and GPU clocks below 800 MHz. It seems like the card is throttling hard, which is also indicated by nvidia-smi showing SW Thermal Slowdown as active under load, with SW Power Cap and HW Thermal Slowdown becoming active every now and then. When the workload is started on cooled-down cards, the affected card periodically ramps its fans up to 124% fan speed, with power consumption and GPU clocks decreasing more and more before it reaches the steady state described above.

Thermal images indicate that the cooling fins of the affected card are much cooler than on the non-affected card, while the back of the affected card is hotter than the other one. It is also visible that, at least on the back of the card, the GPU is the hottest spot and that no memory chip seems to be overheating. This leads me to suspect that there is some kind of heat transfer issue between the GPU and the cooler on the affected card. I have found many reports on the net that the RTX Titan is susceptible to overheating issues and not suitable for multi-GPU setups. However, both cards have plenty of space and cooling around them, and the affected card shows the exact same behavior when installed as a single card in another case. Also, none of the throttling reports on the net mention GPU clocks dropping even remotely as far as I observe in this case. Is there a chance that the issue is related to the thermal paste on the GPU not working as intended, preventing the heat from reaching the cooler? Has anybody here experienced the same behavior on an RTX Titan?
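For anyone who wants to log the same symptoms per card, here is a minimal Python sketch that reads nvidia-smi's CSV query output and flags cards reporting a thermal slowdown. It is only an illustration under some assumptions: the helper names (parse_samples, thermally_throttled, query_gpus) are made up for this example, the SAMPLE text is hand-written from the figures in the post above rather than captured output, and the clocks_throttle_reasons.* field names may be exposed under different names on newer drivers (check nvidia-smi --help-query-gpu).

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi. Assumption: these names are accepted by
# the installed driver; verify with `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "temperature.gpu",
    "fan.speed",
    "power.draw",
    "clocks.sm",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]

# Hand-written sample modeled on the numbers in the post (not real output).
SAMPLE = (
    "0, 78, 60, 270.00, 1860, Not Active, Not Active, Not Active\n"
    "1, 89, 100, 168.00, 795, Active, Not Active, Not Active\n"
)


def parse_samples(csv_text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for raw in csv.reader(io.StringIO(csv_text)):
        if not raw:
            continue  # skip blank lines
        values = [value.strip() for value in raw]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def thermally_throttled(row):
    """True if the SW or HW thermal slowdown reason is reported as Active."""
    return (
        row["clocks_throttle_reasons.sw_thermal_slowdown"] == "Active"
        or row["clocks_throttle_reasons.hw_thermal_slowdown"] == "Active"
    )


def query_gpus():
    """Run nvidia-smi once and return its CSV text (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Uses the sample text; swap in query_gpus() on a machine with the cards.
    for gpu in parse_samples(SAMPLE):
        state = "THROTTLED" if thermally_throttled(gpu) else "ok"
        print(gpu["index"], gpu["temperature.gpu"], gpu["clocks.sm"], state)
```

Running this in a loop under load and comparing the two cards side by side should show whether the slowdown flags line up with the drop in clocks and power draw.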

Best regards

Hi there @denis.fisseler and welcome to the NVIDIA developer forums.

How long has the Titan RTX been running DL workloads?

The card was released (and likely produced) 6 years ago. Even thermal paste does not last forever, so it is very likely that the thermal paste has dried out and gone brittle. If there was not 100% contact to begin with, this will affect heat transport significantly and could explain your issues.

Hello. Thank you for the answer.

The affected card has been running light and medium-heavy DL workloads for several years. Unfortunately, I am not sure exactly when the issue first appeared, or whether it developed slowly or suddenly, since the typical user does not monitor a card's power consumption or check for performance differences between individual cards, let alone take thermal images of the hardware.

So, the most effective action would be to disassemble the card and check whether the thermal paste has dried out? As far as I know, this should only affect the GPU, since the memory chips and power regulators are coupled to the heat sink via thermal pads. Are there any special requirements a suitable replacement thermal paste has to meet?

Important note! Please only attempt the disassembly if the GPU does not have any warranty left. If it is still under warranty for some reason, try that route first to get it repaired or replaced.

If you disassemble the GPU, you must replace the thermal paste, regardless of whether it seems dry or not. Any high-quality thermal paste will do.

And yes, the thermal pads can be reused, but only if they are still undamaged after disassembly! If they are ripped or broken, please replace them with pads of exactly the same thickness.

Thanks!

Today, I finally had the time and tools to disassemble the card. I managed to separate the cooler from the board with only minor damage to some thermal pads. The GPU and some of the power coils were covered in completely hard and dry thermal paste. In addition to the paste being completely dry, there was a dark spot in the paste on the cooler, at the edge of the area where the GPU had been. Cleaning off the thermal paste revealed that the cooler's contact surface for the GPU chip seemingly has some factory defects. In the center of the cooling surface are three deep dents, with some smaller ones off-center. The most eye-catching detail is a large defect in the surface where the dark spot of thermal paste was visible before. There, the silvery galvanized coating of the cooling surface is missing and the copper is shining through. This coating defect also has something on it that looks like solder and sits higher than the surrounding surface. To me, this looks like a potential reason for the GPU and cooler not having sufficient contact. I am not sure whether this can be compensated for by thermal paste. The GPU surface does not seem to be damaged in any way.

Looks a bit like someone used liquid metal thermal paste before.
Judging by the scratch I would not recommend ever using this GPU again.
I am sorry.

I fully agree that the corroded spot looks like some metal-metal or acid-metal reaction took place. What is strange is that I can guarantee this GPU was never tampered with outside the factory - no thermal paste renewal, nothing. It went directly from being delivered by Nvidia into the server and was never removed from the machine until now. The blue Loctite on all screws was intact and the card did not even have fingerprints on it when I removed it from the server. So the only logical explanation for the defects on the cooler surface would be a severe quality control issue, or that I was sold a refurbished unit as new. What I do not understand, though, is why the GPU should never be used again. After all, only the cooler surface seems to be affected and not the GPU surface.

Let me rephrase this. I personally would not use it again because I would be very worried about overheating and, very likely, stability issues. From a simple visual inspection you cannot guarantee that the actual chip is unaffected. The package is usually not very thick, to ensure proper heat distribution.

If you want to try, that is absolutely up to you but also completely at your own risk.