Controlling fan speed of Titan and Titan X with TCC enabled

I was easily able to activate TCC on the cards, and it works great for my needs except for one small thing.
There is no way to control the fans anymore, and the built-in fan curve is bad… I have seen cards go to 85 degrees and beyond with the fans only ramping from 20% up to 40% and staying there, and the cards throttling down a lot.

Also, after TCC is activated the cards are not detected by any of the software I would usually use for fan control, not even NVIDIA Inspector.

So does anyone have any idea about this, any way to control the fans on TCC-activated cards?
Even if it is just setting a fixed rate from the command line?
Thanks


What is the temperature inside the computer case? Is there adequate air flow? Is it possible air flow is obstructed by cabling or other plug-in cards? Are there dust accumulations (“dust bunnies”)?

I am surprised whenever I hear about such throttling issues because I have used numerous GPUs over the years (both consumer and professional), often high-end cards running at full speed for extended periods of time, but I have never run into such an issue.

I am also not aware of issues with “bad fan curves”, which does not mean they could not exist with some VBIOS versions. Are you running the original VBIOS installed on the card?

It is not a problem with airflow.
When TCC is off and I have control of the fans, they go up to 100% because I use a very aggressive fan curve, and all is fine. I use the cards for GPU rendering.

The problem is that as soon as I activate TCC mode I don’t have any control over the fans.
Not a single program actually sees the cards at all, so none can control them.
The fans then go to a maximum of about 40%, which is far from enough to cool rendering GPUs, four of them in a stack.

So the problem is how to make the fans on the cards go to 100% speed when they are in TCC mode.

A GPU in TCC mode is a 3D controller, not a graphics card, which is presumably why the programs you normally use to control the GPU fan don’t work. I don’t know what to do about it.

Your reference to “4 GPUs in a stack” seems to imply that you have actively cooled GPUs that are placed too close together to ensure adequate airflow (and possibly that there is hot air from one GPU flowing to the next one in the stack), which is presumably why you have been forced to manually increase fan speed for adequate cooling to begin with. This does not sound like a properly engineered enclosure to me.

The only idea I have is to use a powerful fan to push cool air from outside the case into the (presumably very narrow) gaps between the GPUs. That is the kind of hacky “chickenwire & duct tape” approach I have used for cobbled-together, insufficiently cooled systems before.

By 4 GPUs in a stack I mean there are 4 of them installed in one case.
They are all the NVIDIA reference fan design, i.e. blowing hot air out of the case.
There is no issue with temperature at all when the fan speed increases properly as the temperature rises.
After hours of rendering they don’t go much over 75 degrees.
So the only issue is that once the cards are in TCC mode the fans never go to full speed. Of course the temperature then goes well beyond 80 degrees, which is the problem.
For example, how are Tesla cards cooled?
Don’t they do intensive calculations as well, and get used for rendering too? So I assume there must be some fan control :)

If none of the GPUs overheats (75 degrees seems perfectly fine, even 80 would be OK), the only other reason I can think of that would cause them to down-clock is if they exceed the power limit, or if the power supply does not deliver enough power. nvidia-smi can show the max. power rating for the GPU as well as the current power draw.
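For reference, the same numbers can also be read programmatically through NVML, the library underneath nvidia-smi, which sees GPUs running the TCC driver as well. A minimal sketch, assuming the pynvml Python bindings (the nvidia-ml-py package) are installed:

# Sketch: query power draw vs. enforced power limit per GPU via NVML.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        draw_w  = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0         # reported in milliwatts
        limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0
        print(f"GPU {i}: {draw_w:.1f} W of {limit_w:.1f} W limit")
finally:
    pynvml.nvmlShutdown()

If the draw is pinned at the limit under load, power capping rather than temperature would be the likely cause of the down-clocking.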

What does not make any sense to me is the statement that the GPUs do not overheat at stock fan speeds, but nonetheless down-clock unless cooled more aggressively. Something is very wrong in that scenario, but I cannot remotely diagnose what it is. You also stated earlier that you saw GPUs go up to 85 degrees; how does that jibe with your information that “after hours” of operation they only reach 75 degrees? Power consumption will differ widely based on workload, are you quoting temperatures from two different workloads by any chance?

As a sanity check, I would make sure that all the power connectors are plugged in, and make sure the power supply has sufficient output to drive four GPUs. With some power supplies you may need to take care how GPUs are matched to “rails”. My recommendation would be to use a power supply that is rated at 1.5x the combined peak power consumption of the GPUs plus the CPU. For example, if each of the four GPUs is specified for 235W max power, use a 1500W PSU.
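Spelled out as a back-of-the-envelope calculation (the helper function and the 65 W CPU figure below are just illustrative placeholders):

def recommended_psu_watts(gpu_peak_watts, cpu_peak_watts, headroom=1.5):
    # 1.5x the combined peak draw of the GPUs plus the CPU, per the rule of thumb above
    return headroom * (sum(gpu_peak_watts) + cpu_peak_watts)

print(recommended_psu_watts([235, 235, 235, 235], 65))   # -> 1507.5, i.e. a 1500 W class unit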

How are Tesla cards cooled? It depends on whether you have an actively or passively cooled model. The actively cooled ones come with a fan, basically the same way as a consumer GPU. The passively cooled models have a heat fin assembly, and require that the fans in the server enclosure blow just the right amount of air over these fins. This usually means you need to buy such a system from an integrator that partners with NVIDIA so you can be sure the cooling is set up correctly.

Some adventurous souls have tried integrating passively cooled Teslas into their own systems, and often it doesn’t work right due to insufficient cooling. That does not mean it can’t be done, one just has to have the knowledge and experience to set this up correctly, and few people have that.

I think you misunderstood me.
There are 2 scenarios:

  1. GPUs in standard, non-TCC mode.
    All fan control works with programs such as Afterburner. There is a temperature curve, and when the cards are rendering the fan speed goes up to 100% as the heat rises, keeping the cards at around 75 degrees or so.

  2. GPUs in TCC mode.
    Fan control does not work; programs such as Afterburner cannot even detect the cards, so the fan curve does not apply. Even when rendering pushes card temperatures over 85 degrees, the fans top out at about 40%.
    That is not enough to cool the cards properly.

All my systems have high-quality 1500W power supplies, with all power cables connected to the GPUs and all the additional ones to the motherboards as well.
So the only issue is that once I put a card into TCC mode I cannot control the fan speed, and the 40% maximum fan speed in that case is not enough to cool the cards properly.


Is there any solution for this?

I think it’s quite possible that there aren’t any tools to adjust fan speeds for a card in TCC mode.

I think there is also possibly a disagreement on what constitutes “properly cool cards”. Many modern GPUs (&) publish a set of temperatures in nvidia-smi that roughly indicate the intended thermal behavior: the GPU has a throttle threshold (“GPU Slowdown Temp”) and a shutdown threshold (“GPU Shutdown Temp”). The suggestion being made here is that the GPUs will reach over 85C with the fan not rising above 40%. That could well be the case if 85C is below both the throttle threshold and the shutdown threshold. If so, then we simply have a disagreement on “properly cool cards”. The final arbiter of that is NVIDIA’s design engineers, not anyone else. You’re free to disagree, but the card is designed around NVIDIA’s definition of “properly cool cards”, not anyone else’s.

If the card goes into thermal throttle with the fans at 40%, and it were my card, I would file a bug.

(&) Some low-end GPUs seem to have their thermal control listed mostly as N/A in nvidia-smi.
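The thresholds in question can also be read programmatically; a rough sketch using the pynvml bindings (assumed installed), where fields a board does not support raise NVMLError, corresponding to the “N/A” entries mentioned in the footnote:

# Sketch: read the slowdown/shutdown thresholds that nvidia-smi reports, via NVML.
import pynvml

pynvml.nvmlInit()
try:
    h = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; pick whichever index you care about
    def q(fn, *args):
        # return "N/A" for fields the board does not expose, like nvidia-smi does
        try:
            return fn(h, *args)
        except pynvml.NVMLError:
            return "N/A"
    print("current temp :", q(pynvml.nvmlDeviceGetTemperature, pynvml.NVML_TEMPERATURE_GPU), "C")
    print("slowdown temp:", q(pynvml.nvmlDeviceGetTemperatureThreshold, pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN), "C")
    print("shutdown temp:", q(pynvml.nvmlDeviceGetTemperatureThreshold, pynvml.NVML_TEMPERATURE_THRESHOLD_SHUTDOWN), "C")
    print("fan speed    :", q(pynvml.nvmlDeviceGetFanSpeed), "%")
finally:
    pynvml.nvmlShutdown()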

I am not aware of any. Have you tried one of the third-party fan control programs? There is one that looked quite reasonable in this YouTube video by JayzTwoCents, where it seems to work with an NVIDIA GPU (RTX something) just fine.

This is not an endorsement or recommendation. I have not used the software demonstrated in the video, and I seriously doubt that it can control the fans of a GPU running with the TCC driver, simply because such a GPU does not appear as a graphics card to Windows (which would otherwise grab control of it), and thus the APIs available for a GPU running with the WDDM driver are not available when running with the TCC driver.

The OP didn’t state that he experienced thermal throttling.

I agree with Robert and Norbert’s comments, but I think the issue is more around clock speed scaling with temperature and attempts to mitigate it.

My water-cooled GTX 1080, running a task at 100% GPU utilization, is clocked at 1847 MHz @ 52C. The same card running the same task in factory air-cooled trim: 1607 MHz @ 82C, with no thermal throttling. I did not note what percentage speed the fans were running at, but I don’t believe they were at 100%. In both cases, no overclocking or tweaking was done (Linux).

So, assuming the fans are not maxed out in stock trim, there is a performance benefit to being able to increase them, albeit at the risk of reducing the life of the fans.

Later: On reflection, quite possibly the NVIDIA engineers have considered that increasing the fans to 100% may not increase clocks that much, and so they choose the fan speed as a balance between a minor clock-speed gain and fan life. It’s only by switching to water cooling that decent gains can be made.

Isn’t that implied by:

[Speculation:] The factors considered for the default maximum fan speed on actively-cooled GPUs are likely fan life and noise level. The latter might even be subject to government regulations somewhere around the world, e.g. in office environments. NVIDIA is not going to tell us what their design considerations were when setting the “fan curves”.

By observation, under normal operation the fans spin up to about 66% to 70% of maximum. However, as dust accumulates on the fins of the GPU heat sink, I have seen fans move to higher speeds, and I have seen them reach 100% with some GPUs under these circumstances. Even in my home office (no pets in the household) I need to blow out the GPU heat sink fins once a year to keep temperatures and fan speeds low.

It is certainly true that many (but not all!) CUDA applications become memory bound as boost clocks move up, so that running at very high boost clocks mostly increases power consumption, with relatively minor actual performance gain. For this reason, some people advocate downclocking of GPUs so as to hit the performance/power sweet spot.
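For what it is worth, on GPUs that expose application clocks this kind of downclocking can be scripted through NVML. A sketch with pynvml, under the assumptions that the board supports application clocks (many GeForce cards do not), the process has administrator privileges, and the 1200 MHz target is just a placeholder:

# Sketch: pin the GPU to a lower application clock to trade a little performance for power.
import pynvml

pynvml.nvmlInit()
try:
    h = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem_clocks = pynvml.nvmlDeviceGetSupportedMemoryClocks(h)               # supported memory clocks, MHz
    gfx_clocks = pynvml.nvmlDeviceGetSupportedGraphicsClocks(h, mem_clocks[0])
    target = min(gfx_clocks, key=lambda c: abs(c - 1200))                   # supported clock nearest 1200 MHz
    pynvml.nvmlDeviceSetApplicationsClocks(h, mem_clocks[0], target)        # (memory clock, graphics clock)
    print(f"application clocks set to {mem_clocks[0]} / {target} MHz")
    # pynvml.nvmlDeviceResetApplicationsClocks(h) would undo this
finally:
    pynvml.nvmlShutdown()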


My interpretation of that was “the clock speed reduced as the temperature rose”. I know it’s a fine point, but it’s illustrated by the figures I quoted for my 1080: when it was air-cooled, it was at its target temperature, but it was not throttling, as defined by the nvidia-smi -q stanza “Clocks Throttle Reasons”.

I have just run a stock air-cooled GTX 1060, and here are the relevant sections:

Fan Speed                             : 57 %
Performance State                     : P2
Clocks Throttle Reasons
    Idle                              : Not Active
    Applications Clocks Setting       : Not Active
    SW Power Cap                      : Not Active
    HW Slowdown                       : Not Active
        HW Thermal Slowdown           : Not Active
        HW Power Brake Slowdown       : Not Active
    Sync Boost                        : Not Active
    SW Thermal Slowdown               : Not Active
    Display Clock Setting             : Not Active
Utilization
    Gpu                               : 100 %
    Memory                            : 14 %
    Encoder                           : 0 %
    Decoder                           : 0 %
Temperature
    GPU Current Temp                  : 82 C
    GPU Shutdown Temp                 : 102 C
    GPU Slowdown Temp                 : 99 C
    GPU Max Operating Temp            : N/A
    GPU Target Temperature            : 83 C
    Memory Current Temp               : N/A
    Memory Max Operating Temp         : N/A
Clocks
    Graphics                          : 1809 MHz
    SM                                : 1809 MHz
    Memory                            : 3802 MHz
    Video                             : 1620 MHz
Max Clocks
    Graphics                          : 1911 MHz
    SM                                : 1911 MHz
    Memory                            : 4004 MHz
    Video                             : 1708 MHz

To me, I would not describe this as “thermally throttled”, as neither “HW Thermal Slowdown” nor “SW Thermal Slowdown” is active. The fan is running at 57%.

Yes, the clock speed has been ramped down as it warmed up; as I say, a fine point…
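The same stanza can also be polled programmatically, which is handy for logging during a long run. A rough sketch with pynvml, assuming a driver and pynvml version recent enough to expose the thermal-slowdown bits:

# Sketch: decode the "Clocks Throttle Reasons" bitmask via NVML.
import pynvml

pynvml.nvmlInit()
try:
    h = pynvml.nvmlDeviceGetHandleByIndex(0)
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)   # bitmask of active reasons
    bits = {
        "SW Power Cap":        pynvml.nvmlClocksThrottleReasonSwPowerCap,
        "HW Slowdown":         pynvml.nvmlClocksThrottleReasonHwSlowdown,
        "SW Thermal Slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
        "HW Thermal Slowdown": pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
    }
    for name, bit in bits.items():
        print(f"{name:20s}: {'Active' if reasons & bit else 'Not Active'}")
finally:
    pynvml.nvmlShutdown()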

I thought this horse was dead, but let’s beat it up some more :-)

The OP simply stated “throttled”. They also stated “GPU at 85 deg C”. In my experience, when a GPU reaches 85 deg C, the “SW Thermal Slowdown” will be active. In other words, from the two pieces of information “throttled” + “85 deg C” I concluded “thermal throttling”. I stand by that conclusion. I grant that OP did not provide nvidia-smi output proving thermal throttling.

Conversely, you are showing nvidia-smi output where the GPU temperature is 82 deg C and thus below the thermal limit (83 deg C), so obviously no thermal throttling is taking place, i.e. SW Thermal Slowdown is Not Active.

I do not see a conflict in interpretation between the two scenarios. I have two air-cooled GPUs running side by side right now. One is at 79 deg C and not thermally throttled. The other is at 85 deg C and thermally throttled, with clock reduced to 1450 MHz, fan at 68%.

What OP seemed to be saying: If we had the means to force the fan to run at 100%, the temperature of the GPU would fall below the thermal limit, SW Thermal Slowdown would be Not Active, and GPU clock frequency would increase. This seems very plausible and desirable, but it is simply not possible to achieve in the absence of a means of playing with fan curves.

Now, thermal throttling with fan at 68% is one thing, but throttling with fan at only 40% as claimed by OP would have me agree with Robert_Crovella’s conclusion:

You win :-)

In other news: Feeling somewhat adventurous today I took the plunge and installed the (free, but closed-source) Fan Control program that was demonstrated in the video I linked above.

Good news: With this program I was able to control the fans on both of my GPUs, with one using the WDDM driver and the other the TCC driver. With fans cranked to 100% I achieved about 5 deg C reduction in GPU temperature under full compute load.

Bad news: The program seems to somehow interfere with GPU clock control. My WDDM-driven GPU locks to exactly 1200 MHz about ten seconds after the Fan Control program starts. Once I exit the program, the GPU clock immediately becomes dynamic again and the clock goes back up to ~1500 MHz. I tried this five times with 100% repro.

I cannot tolerate interference with the GPU clock, so I removed the program.