When executing my CUDA program I have noticed that if it is run twice in close succession, the second execution is much faster (~100µs vs ~300µs). This correlates very well with the current performance state of the CUDA device: when the runs are close together, the device stays in P2, but after some delay it drops to P5 and then to P8, which leads to the slower execution time (see graph here: [url]https://imgur.com/jrUpXaa[/url]).
My question is now: how do I best avoid this decreased performance? Is there a setting that will prevent the CUDA device from going to P5 and P8? It would need to be something I can apply either through the API or from the command line.
My system is running Windows Server 2012 R2 and the CUDA device is a TITAN X (Pascal) with driver version 371.90 running in TCC mode.
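To make the observation concrete, a minimal test along the following lines is what I mean by running it twice close after each other; the kernel and buffer size below are just stand-ins, not my actual pipeline.

    // Minimal repro sketch: time the same launch twice and compare.
    // The kernel and buffer size are placeholders, not the real pipeline.
    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * 2.0f + 1.0f;
    }

    static float timedRunUs(float *d, int n)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();                 // include launch + execution latency
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<float, std::micro>(t1 - t0).count();
    }

    int main()
    {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));

        // First timed run: the device has been idle and may still be in P5/P8.
        printf("cold run: %.0f us\n", timedRunUs(d, n));

        // Second timed run immediately afterwards: the device should be in P2 by now.
        printf("warm run: %.0f us\n", timedRunUs(d, n));

        cudaFree(d);
        return 0;
    }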
I had a look at the throttle reasons; the ones in play are NONE, IDLE and UNKNOWN, and they correlate with the performance state, the Streaming Multiprocessor (SM) clock and the memory clock. See the following graphs: https://imgur.com/a/gGi9G
It is a little difficult to see in the graphs, but they show that when the system is running at full speed, the throttle reason UNKNOWN is reported. When the SM clock drops from 1417 MHz to 1240 MHz, no throttle reason is reported. When the SM clock then drops to 1012 MHz and below, the throttle reason IDLE is reported.
I am running the default clock settings, as can be seen here:
Attached GPUs : 1
GPU 0000:82:00.0
    Clocks
        Graphics : 139 MHz
        SM : 139 MHz
        Memory : 405 MHz
        Video : 544 MHz
    Applications Clocks
        Graphics : 1417 MHz
        Memory : 5005 MHz
    Default Applications Clocks
        Graphics : 1417 MHz
        Memory : 5005 MHz
    Max Clocks
        Graphics : 1911 MHz
        SM : 1911 MHz
        Memory : 5005 MHz
        Video : 1708 MHz
    SM Clock Samples
        Duration : 330.48 sec
        Number of Samples : 100
        Max : 1417 MHz
        Min : 139 MHz
        Avg : 598 MHz
    Memory Clock Samples
        Duration : 330.48 sec
        Number of Samples : 100
        Max : 4513 MHz
        Min : 405 MHz
        Avg : 1493 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
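For completeness, the performance state, clocks and throttle reasons shown above can also be polled programmatically through NVML. A minimal sketch of such a query (assuming the nvml.h header and NVML library that ship with the driver/toolkit; not necessarily how I produced the graphs):

    // Poll P-state, SM/memory clocks and the current throttle reasons via NVML.
    // This is only a sketch; error handling is reduced to the bare minimum.
    #include <cstdio>
    #include <nvml.h>

    int main()
    {
        if (nvmlInit() != NVML_SUCCESS) return 1;

        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        nvmlPstates_t pstate;
        unsigned int smClock = 0, memClock = 0;
        unsigned long long reasons = 0;

        nvmlDeviceGetPerformanceState(dev, &pstate);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClock);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClock);
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

        printf("P%d  SM %u MHz  MEM %u MHz  idle-throttled: %s\n",
               (int)pstate, smClock, memClock,
               (reasons & nvmlClocksThrottleReasonGpuIdle) ? "yes" : "no");

        nvmlShutdown();
        return 0;
    }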
Are you accounting for auto boost clock? This causes clocks to be increased dynamically above the base clock, while the power state as such remains unchanged.
From the moment the first CUDA kernel starts executing, and while CUDA kernels continue to execute, the power state should be P0. Only if there is inactivity should the state drop to P2, and if the GPU remains inactive the power state will drop lower and lower, down to P12 I think.
If the app has a very short runtime, your observations may be skewed by the limited temporal resolution of the data reported by nvidia-smi.
Not sure what you mean by “accounting for auto boost clock”.
...
Clock Policy
    Auto Boost : N/A
    Auto Boost Default : N/A
...
I assumed that this was caused by using TCC mode, and that it meant that Auto Boost was not applicable in this mode.
The power state never goes to P0, but it will drop to P5 and P8 after some period of inactivity (~20 seconds). My app uses CUDA for a limited part of a computation pipeline that runs when an outside event occurs, so I frequently see inactivity for more than those 20 seconds, causing the device to drop to P8 and the next call to incur the penalty.
I can work around this by keeping the CUDA device busy, e.g. by having it do some non-trivial work every 10 seconds, but that feels like the wrong way to approach the problem; I'd much rather just have a setting instructing the device to stay ready.
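For illustration, the kind of keep-alive workaround I have in mind looks roughly like this; the 10-second interval, the dummy kernel and the buffer size are placeholders:

    // Keep-alive sketch: launch a small dummy kernel from a background thread
    // often enough that the GPU never sits idle long enough to leave P2.
    // The interval and the amount of dummy work are placeholders.
    #include <atomic>
    #include <chrono>
    #include <thread>
    #include <cuda_runtime.h>

    __global__ void keepAliveKernel(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;               // cheap but non-trivial work
    }

    std::atomic<bool> g_run{true};

    void keepAliveLoop()
    {
        const int n = 1 << 16;
        float *buf = nullptr;
        cudaMalloc(&buf, n * sizeof(float));
        while (g_run.load()) {
            keepAliveKernel<<<(n + 255) / 256, 256>>>(buf, n);
            cudaDeviceSynchronize();
            std::this_thread::sleep_for(std::chrono::seconds(10));
        }
        cudaFree(buf);
    }

    int main()
    {
        std::thread keeper(keepAliveLoop);
        // ... the real event-driven pipeline would run here ...
        std::this_thread::sleep_for(std::chrono::seconds(60)); // stand-in for the pipeline
        g_run = false;
        keeper.join();
        return 0;
    }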
txbob, I may have misunderstood your comment. I don't care about a variation of 200µs over a period of 20s; I care about a 200µs variation in a ~2ms pipeline, which can be triggered less often than once every 20s.
It seems like the GPU is used for only an extremely short duration, and based on that I think what may be happening is that the power state goes to P0 briefly, but that the next time nvidia-smi observes it (limited temporal granularity), it has already fallen back to P2. There may also be the opposite case: since switching the power state has a certain minimum latency, the kernel may have finished executing before the power state could be switched to P0. I think that is the less likely scenario, though, because the switching happens quite fast.
GPUs are designed as throughput devices, not low-latency devices, and the power management is based on that and on the need to run as efficiently as possible; European regulators in particular are always breathing down the computer industry's neck in that regard. It seems your use case requires the GPU to act as a low-latency device, because throughput-wise a 200µs variation over a 20s period should be a don't-care. As I said, to the best of my knowledge the basic operation of the state machine for the GPU power states is not user programmable.
Not sure why there is no auto boost clock on the Titan X Pascal. I am surprised and it is news to me.
njuffa, you are absolutely right. My focus is on low latency which is not the typical use case for a GPU, and I realize that I should have stated that in my opening question.
Is it not possible to disable this throttling? I have noticed things like “power management” and “PowerMizer” which sound promising, but I have not found a way to modify them from the API or nvidia-smi.
In the early days of GPU power management, when it was still very crude and caused much bigger performance artifacts than what you are observing, I requested some degree of user control but the philosophy then was to make the management fully autonomous. I would be surprised if the philosophy has changed since then.
In many PCs the GPU is the single biggest power consumer and the EU has in the past “thought out loud” about regulating PC power consumption on more than one occasion, now that they are done with vacuum cleaners [#]. I assume that this provides significant incentives for GPU manufacturers to make their consumer cards as efficient as possible to stay off the radar of regulators, while the efficiency needs of supercomputers do the same for professional GPUs in the Tesla line.
It seems your use case is simply not optimally aligned with the design of GPUs as high-throughput machines with the trade-off of sometimes higher latencies. I know it is politically incorrect to say this here, but if that were my code, I would probably try to do this latency-sensitive computation on the host CPU, using the highest-clocked parts I can find (~4 GHz these days), the best compilers, and possibly hand-optimized code (SIMD intrinsics etc.).
Actually, the code that is run is extremely parallelizable, so it is ideal for running on the GPU. And my naive CUDA implementation outperforms my current best C implementation (as long as I don’t get that 200µs penalty).
How about overclocking the P8 and P5 states to resemble P2? My CUDA device is only rarely used, and very well cooled. Can you recommend any tools for doing this? I tried the Nvidia Inspector ([url]http://orbmu2k.de/tools/nvidia-inspector-tool[/url]), but it does not appear to work on my card with its current setup.
Again, that’s the kind of user control over the power-modes state machine that isn’t made available by NVIDIA. There may be people who have cracked the state table and figured out how to manipulate it with their own tools (along the lines of editable fan curves), but I have never seen mention of such a tool.
Google may be your best friend.
If a very fast GPU is just 200µs faster than the equivalent CPU solution, that does not sound like a very strong case for using a GPU to me. I don’t know how highly optimized the C solution is, but usually plain C cannot get you anywhere near full performance on a modern CPU unless it can be auto-vectorized (and despite decades of research, auto-vectorizers give up rather quickly in my experience).
Alternatively you could try turning your naive CUDA implementation into a sophisticated CUDA implementation and making up the 200µs that way.
A persistent kernel might be the solution to latency issues. The kernel would permanently poll for new work and perform the computation once work is provided.
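A rough sketch of what that could look like, using a command flag in mapped pinned host memory that the host sets when work arrives (the flag protocol, buffer size and the trivial “work” below are placeholders):

    // Persistent-kernel sketch: a single block spins on a command word in
    // mapped pinned host memory and processes a buffer whenever the host
    // signals new work. Flag values, sizes and the "work" are placeholders.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void persistentKernel(volatile int *cmd, float *data, int n)
    {
        __shared__ int localCmd;
        for (;;) {
            if (threadIdx.x == 0) {              // thread 0 polls the host flag
                int c;
                do { c = *cmd; } while (c == 0); // 0 = idle, 1 = work, -1 = quit
                localCmd = c;
            }
            __syncthreads();
            if (localCmd == -1) return;

            for (int i = threadIdx.x; i < n; i += blockDim.x)
                data[i] = data[i] * 2.0f + 1.0f; // placeholder work
            __threadfence_system();              // make results visible to the host

            __syncthreads();
            if (threadIdx.x == 0) *cmd = 0;      // signal completion
        }
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);

        const int n = 1 << 16;
        int *cmd = nullptr;
        float *data = nullptr;
        cudaHostAlloc(&cmd, sizeof(int), cudaHostAllocMapped);
        cudaMalloc(&data, n * sizeof(float));
        volatile int *vcmd = cmd;
        *vcmd = 0;

        int *dCmd = nullptr;
        cudaHostGetDevicePointer(&dCmd, cmd, 0);

        persistentKernel<<<1, 256>>>(dCmd, data, n); // launched once, runs until told to quit

        *vcmd = 1;                                   // submit one piece of work
        while (*vcmd != 0) { }                       // wait until the kernel resets the flag

        *vcmd = -1;                                  // tell the kernel to exit
        cudaDeviceSynchronize();
        cudaFree(data);
        cudaFreeHost(cmd);
        return 0;
    }

Since TCC mode has no display watchdog, a long-running kernel like this should be viable; the obvious trade-off is that the spinning block permanently occupies part of the GPU.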