CUDA performance degradation as the GPU card heats up

Hi, I’m new to this forum, so please excuse me if I have addressed my performance issue to the wrong audience.

I have tested the 3D implementation of the X-, Y-, and Z-derivatives presented in the NVIDIA Technical Blog, see https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/ and https://developer.nvidia.com/blog/finite-difference-methods-cuda-c-part-2/. On top of these I have implemented a simple explicit Runge-Kutta based PDE solver for a simple model problem (the 3D advection equation u_t + u_x + u_y + u_z = 0 with periodic boundary conditions on the unit cube) to test how large a grid I can handle on the NVIDIA RTX A3000 12GB GPU in my Dell laptop. During these experiments I noticed that the GPU performance appeared to drop as the temperature of the GPU card (as reported by nvidia-smi) went up.
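For reference, the benchmark driver is structured roughly as follows. This is a minimal sketch rather than my actual code; the kernel is merely a placeholder for the RK stages built from the derivative kernels in the blog posts, but the timing with a single pair of cudaEvent_t records is the same:

```cpp
// Minimal sketch of the timing harness (not my actual solver; the kernel is a
// stand-in for the RK stages built from the derivative kernels in the blog
// posts). One pair of cudaEvent_t records brackets the whole run.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void rk_step(float *u, int n, float dt)        // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += dt * u[i];                          // stand-in RK update
}

int main()
{
    const int nx = 256, n = nx * nx * nx, numSteps = 100;
    const float dt = 1e-3f;

    float *d_u;
    cudaMalloc(&d_u, n * sizeof(float));
    cudaMemset(d_u, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                    // time the entire run of steps
    for (int step = 0; step < numSteps; ++step)
        rk_step<<<(n + 255) / 256, 256>>>(d_u, n, dt);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // total wall time in milliseconds
    printf("%d steps: %f ms\n", numSteps, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_u);
    return 0;
}
```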

To test this systematically I created three grids:

512x512x512
256x256x256
128x128x128

which occupy 9330MiB / 12288MiB, 1266MiB / 12288MiB, and 258MiB / 12288MiB of device memory, respectively. For each grid I ran the solver for 10, 20, 50, 100, 200, and so forth time steps. The expected behavior is that the execution time (measured with cudaEvent_t, as in the sketch above) grows linearly with the number of time steps, since every time step executes the same number of arithmetic operations. The tables below show a different behavior, however:

512x512x512 grid:

# of time steps | Execution time (ms) | RMS error    | Comment
10              | 4687.727539         | 1.643180e-06 |
20              | 11877.449219        | 3.286361e-06 | Fans kick in
50              | 66119.992188        | 8.215902e-06 | Fans kick in
100             | 295313.437500       | 1.643180e-05 | Fans kick in

256x256x256 grid:

# of time steps | Execution time (ms) | RMS error    | Comment
10              | 396.707855          | 1.322414e-05 |
20              | 793.666565          | 2.644829e-05 |
50              | 2002.138306         | 6.612072e-05 |
100             | 4100.682617         | 1.322414e-04 |
200             | 9145.752930         | 2.644829e-04 | Fans kick in
500             | 45611.710938        | 6.612071e-04 | Fans kick in
1000            | 409868.281250       | 1.322414e-03 | Fans kick in

128x128x128 grid:

# of time steps | Execution time (ms) | RMS error    | Comment
10              | 45.130592           | 1.070874e-04 |
20              | 93.768578           | 2.141748e-04 |
50              | 240.425858          | 5.354367e-04 |
100             | 480.789276          | 1.070873e-03 |
200             | 989.427795          | 2.141744e-03 |
500             | 2349.962158         | 5.354315e-03 |
1000            | 4891.455566         | 1.070832e-02 | Fans kick in
2000            | 14188.320312        | 2.141417e-02 | Fans kick in
5000            | 65190.347656        | 5.349237e-02 | Fans kick in
10000           | N/A                 | N/A          | Crashed due to overheating

The larger the memory usage, the sooner the fans kick in and performance starts decreasing. The RMS error is only included to verify that the numerically computed solution stays “close” to the exact solution. In the 10,000-step run the temperature reached 121C (per nvidia-smi), which caused the GPU card to fail, eventually bringing down the entire laptop. The ambient temperature of the GPU card is approximately 45C.

The computations achieve very high GPU utilization (98%-100%), which is desirable for HPC workloads.

Question: Is this the expected behavior of laptop (workstation) GPUs? Is there anything that can be done to prevent the GPU from overheating, e.g., “underclocking” or something similar?

This should not happen. Modern CPUs and GPUs include thermal monitoring that can trigger a thermal shutdown when temperatures get to that level, to prevent permanent damage to the hardware. However, such temperatures should not be reached even under full and sustained computational load.

It is normal for CPUs and GPUs to heat up when under load, but in a well-designed air-cooled configuration neither of these should really get above 85 degrees Celsius in reasonable ambient conditions, e.g. 30 degrees Celsius maximum. If you are operating the computer outside in Death Valley (currently 55 degrees Celsius) or the hot interior of a car baking in the summer sun (that may reach similar temperatures), you are likely running the system outside specified design limits.

It is possible for the fan-heatsink combinations of actively-cooled components to become clogged with dust, pet hair, etc., which greatly reduces their effectiveness. Usually this can be fixed by an end user with a can of compressed air; sometimes it may require partial disassembly of shrouds.

It is also possible for fans to become defective; in fact, fans are among the components of a computer that tend to fail earliest, although it generally takes years of even 24/7 usage to get to that point. The system BIOS of your computer may offer a pre-boot hardware self-test that includes a check for the proper operation of fans. For the GPU, the output of nvidia-smi should indicate whether the fans are spinning up properly. By the time the GPU temperature reaches 85 degrees Celsius, the fan would typically spin at 80% to 100% of its maximum RPM.
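If you prefer checking programmatically over eyeballing nvidia-smi, NVML exposes the same fan reading. A minimal sketch, assuming the NVML header and library that ship with the CUDA toolkit/driver are available (note that on many laptops the fans are managed by the system firmware rather than the GPU, in which case the query simply reports that it is not supported):

```cpp
// Sketch: query the GPU fan speed via NVML (link with -lnvidia-ml).
// On many laptops the fans are managed by the system firmware rather than the
// GPU, in which case this query returns NVML_ERROR_NOT_SUPPORTED.
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int fanPct = 0;
    nvmlReturn_t rc = nvmlDeviceGetFanSpeed(dev, &fanPct);
    if (rc == NVML_SUCCESS)
        printf("fan speed: %u%% of maximum RPM\n", fanPct);
    else
        printf("fan speed query failed: %s\n", nvmlErrorString(rc));

    nvmlShutdown();
    return 0;
}
```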

Lastly, it is possible that your system was misdesigned, e.g. fitted with an oversized GPU of excessive wattage, or with obstructed internal airflow, so that the thermal load exceeds the design limits of the thermal solution.

I assume by “ambient temperature of the GPU” you mean the temperature inside the computer case or enclosure. If so, 45 degrees Celsius seems normal to me in summer conditions, where room temperature (that is, the air outside the computer) may already be at 26 degrees Celsius.

“CUDA performance degradation as the GPU card heats up”

Some degradation is normal and expected. Modern GPUs (as well as modern CPUs) use dynamic clock control. One of the inputs to the control mechanism is usually the processor temperature, and that is definitely the case for GPUs. All other parameters being equal, a GPU will be able to boost its clock (and thus processing speed) higher and for a longer duration when it is operating at a lower temperature. Sustaining maximum clock boost is typically only possible when the GPU temperature stays below 60 degrees Celsius. Achieving this on a permanent basis typically requires water cooling, unless you are operating an air-cooled computer outside in the Alaskan or Siberian winter.
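One way to watch the clock management in action is to poll the GPU temperature and the SM clock while the solver is running, either with nvidia-smi or via NVML. A minimal sketch, again assuming the NVML header and library are available; the throttle-reason bits indicate whether a thermal slowdown is currently active:

```cpp
// Minimal NVML polling sketch (assumes nvml.h from the CUDA toolkit/driver is
// available; link with -lnvidia-ml). Prints GPU temperature, SM clock, and
// whether a thermal slowdown is active, once per second.
#include <chrono>
#include <cstdio>
#include <thread>
#include <nvml.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int i = 0; i < 600; ++i) {                       // poll for ~10 minutes
        unsigned int tempC = 0, smMHz = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smMHz);

        unsigned long long reasons = 0;
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);
        bool thermal = reasons & (nvmlClocksThrottleReasonSwThermalSlowdown |
                                  nvmlClocksThrottleReasonHwThermalSlowdown);

        printf("%4d s  temp %3u C  SM clock %4u MHz  thermal slowdown: %s\n",
               i, tempC, smMHz, thermal ? "yes" : "no");
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    nvmlShutdown();
    return 0;
}
```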

Since I already had it typed up, just adding my $0.02 to the commentary from njuffa.

It’s expected behavior that if the GPU temperature gets high enough, the GPU will eventually start to reduce its clocks and possibly take other power-saving steps in order to prevent itself from overheating. This isn’t unique or specific to laptop GPUs.

It’s also true that if a GPU gets hot enough, it will shut itself off (which would likely have catastrophic system consequences, such as causing the entire system to fail or shut off). Again, this is not unique or specific to laptop GPUs.

The GPU designers and system designers are aware of such considerations, and generally the goal would be to design a system that does not become compromised in this way (thermal shutdown) regardless of what workload you run. I wouldn’t be able to explain the behavior of your system at that particular data point (the failure of the 10,000-step run, the reading of “121C”, the GPU card failing), but the remainder of your observations seem plausible, probably “expected”.

As for that particular failure: if the laptop is relatively new or under warranty, I would at least consider contacting the system vendor for support. Certainly if the laptop is being run in a “hot” ambient environment (i.e., the room the laptop is in is hot), that might contribute to the observations, although again, ideally, the GPU should be able to thermally protect itself.

Thank you very much for your very detailed reply. I’m using my Dell laptop at normal Scandinavian room temperature (~20C). When my laptop is idling, nvidia-smi typically reports 45C, which I consider normal (I have never seen temperatures as low as 30C). As long as the temperature of my GPU card stays below 90C, the performance degradation is negligible, but beyond that performance drops significantly. The behavior is consistent: the more memory I use, the quicker the GPU heats up, which is what I would expect. I will blow out the heat sinks and fans to see if that helps mitigate the problem. Ideally I’d like the temperature to stay below 90C, since that would allow me to run real workloads.

A little bit of background: I’m interested in leveraging the CUDA capabilities when the laptop isn’t used for anything else, akin to cycle scavenging. I realize that my GPU card is not a server card and that the cooling capabilities of a laptop are inferior to those of a server because of the form factor. But I don’t want to run things overnight if I risk damaging my laptop due to overheating.
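In the spirit of the “underclocking” idea from my original question, I also plan to experiment with lowering the board power limit. Below is a minimal sketch using NVML; I don’t know yet whether the limit is actually adjustable on a laptop RTX A3000, and the 50 W target is just an illustrative value:

```cpp
// Sketch: lower the board power limit via NVML (link with -lnvidia-ml).
// Requires administrator/root privileges; not every GPU (especially laptop
// parts) allows the limit to be changed, hence the constraints query first.
// The 50 W target is only an illustrative value.
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int minmW = 0, maxmW = 0, curmW = 0;
    nvmlDeviceGetPowerManagementLimitConstraints(dev, &minmW, &maxmW);
    nvmlDeviceGetPowerManagementLimit(dev, &curmW);
    printf("current power limit: %u mW (allowed range %u-%u mW)\n",
           curmW, minmW, maxmW);

    unsigned int target = (50000 > minmW) ? 50000 : minmW;   // 50 W, clamped
    nvmlReturn_t rc = nvmlDeviceSetPowerManagementLimit(dev, target);
    printf("setting %u mW: %s\n", target, nvmlErrorString(rc));

    nvmlShutdown();
    return 0;
}
```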

Thank you for responding so quickly. My motherboard is less than a year old, so I would expect it to be OK. The temperature reading of 121C may be a “false” reading by nvidia-smi; I have only seen a reading that high once. I have noticed that once the temperature reaches 100C, certain data, such as timing data, is sometimes corrupted (typically reported as zero). As mentioned in my reply to njuffa, I will clean the heat sinks and fans to see if that helps.

The 30 degrees Celsius I mentioned above referred to the room temperature. Since the audience here is global, I could not exclude extreme room temperatures that can occur in summer in various countries of the northern hemisphere. In my experience the air temperature inside a computer typically fluctuates between 35 and 50 degrees Celsius, depending on environmental factors.

This observation is rooted in two factors, I think. (1) Larger data leads to longer processing times. If we plot power dissipation vs. temperature, we observe hysteresis: while the power may change as a step function, the resulting temperature change takes many seconds to minutes to take full effect. (2) The fast DRAMs used with GPUs and the associated memory controllers dissipate a significant portion of the overall GPU power; in fact, you may observe from nvidia-smi output that the hot spot is actually the memory temperature. Memory-intensive codes are therefore prime candidates for maximum GPU heating.
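If the driver exposes it, the memory temperature can be read programmatically through NVML’s field-value interface. A minimal sketch; the NVML_FI_DEV_MEMORY_TEMP field is not populated on every GPU/driver combination, so treat this as an illustration rather than a guaranteed reading:

```cpp
// Sketch: read the memory temperature via NVML field values (link with
// -lnvidia-ml). Not every GPU/driver combination populates this field, and
// the value is assumed to be reported as an unsigned int.
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlFieldValue_t fv = {};
    fv.fieldId = NVML_FI_DEV_MEMORY_TEMP;         // memory temperature field
    nvmlReturn_t rc = nvmlDeviceGetFieldValues(dev, 1, &fv);

    if (rc == NVML_SUCCESS && fv.nvmlReturn == NVML_SUCCESS)
        printf("memory temperature: %u C\n", fv.value.uiVal);
    else
        printf("memory temperature not available on this GPU/driver\n");

    nvmlShutdown();
    return 0;
}
```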

Both CPUs and GPUs have sophisticated thermal monitoring, thermal throttling, and shutdown mechanisms, which make permanent damage from overheating virtually impossible. It is true, however, that in general running semiconductors hot accelerates their physical aging. Unless you plan to run the hardware at full load 24/7 for more than five years, you are unlikely to observe the fatal end of this aging process.

I looked up the specifications of your GPU and I am puzzled how one would dissipate 70W from the GPU alone (and probably twice that for the entire system under full load) in a laptop form factor. I would suggest contacting Dell regarding the overheating issues.

Dusting the laptop improved the situation. The GPU temperature was pegged at 100C (room temperature 22C). This time the 10,000-step run did not crash, but its execution time (391s) was still about 6.5 times that of the 5,000-step run (56s). I’ll reach out to Dell hardware support to see what can be done to fine-tune the cooling/fans.