When executing my CUDA program I have noticed that if it is run twice in close succession, the second execution is much faster (~100µs vs ~300µs). This correlates very well with the current performance state of the CUDA device: when the runs are close together, the device stays in P2, but after some delay it drops to P5 and then to P8, which leads to the slower execution time (see graph here: [url]https://imgur.com/jrUpXaa[/url]).
My question is now: how do I best avoid this decreased performance? Is there a setting that will prevent the CUDA device from going to P5 and P8? It would need to be something I can apply either through the API or from the command line.
My system is running Windows Server 2012 R2 and the CUDA device is a TITAN X (Pascal) with driver version 371.90 running in TCC mode.
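To make the observation concrete, a minimal test along the following lines is what I mean by running it twice close after each other; the kernel and buffer size below are just stand-ins, not my actual pipeline.

    // Minimal repro sketch: time the same launch twice and compare.
    // The kernel and buffer size are placeholders, not the real pipeline.
    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * 2.0f + 1.0f;
    }

    static float timedRunUs(float *d, int n)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();                 // include launch + execution latency
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<float, std::micro>(t1 - t0).count();
    }

    int main()
    {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));

        // First timed run: the device has been idle and may still be in P5/P8.
        printf("cold run: %.0f us\n", timedRunUs(d, n));

        // Second timed run immediately afterwards: the device should be in P2 by now.
        printf("warm run: %.0f us\n", timedRunUs(d, n));

        cudaFree(d);
        return 0;
    }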
I had a look at the throttle reasons; the ones in play are NONE, IDLE and UNKNOWN, and they correlate with the performance state, the Streaming Multiprocessor (SM) clock and the memory clock. See the following graphs: https://imgur.com/a/gGi9G
It is a little difficult to see in the graphs, but they show that when the system is running at full speed, the throttle reason UNKNOWN is reported. When the SM clock drops from 1417 MHz to 1240 MHz, no throttle reason is reported. When the SM clock then drops to 1012 MHz and below, the throttle reason IDLE is reported.
I am running the default clock settings, as can be seen here:
Attached GPUs : 1
GPU 0000:82:00.0
    Clocks
        Graphics : 139 MHz
        SM : 139 MHz
        Memory : 405 MHz
        Video : 544 MHz
    Applications Clocks
        Graphics : 1417 MHz
        Memory : 5005 MHz
    Default Applications Clocks
        Graphics : 1417 MHz
        Memory : 5005 MHz
    Max Clocks
        Graphics : 1911 MHz
        SM : 1911 MHz
        Memory : 5005 MHz
        Video : 1708 MHz
    SM Clock Samples
        Duration : 330.48 sec
        Number of Samples : 100
        Max : 1417 MHz
        Min : 139 MHz
        Avg : 598 MHz
    Memory Clock Samples
        Duration : 330.48 sec
        Number of Samples : 100
        Max : 4513 MHz
        Min : 405 MHz
        Avg : 1493 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
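For completeness, the performance state, clocks and throttle reasons shown above can also be polled programmatically through NVML. A minimal sketch of such a query (assuming the nvml.h header and NVML library that ship with the driver/toolkit; not necessarily how I produced the graphs):

    // Poll P-state, SM/memory clocks and the current throttle reasons via NVML.
    // This is only a sketch; error handling is reduced to the bare minimum.
    #include <cstdio>
    #include <nvml.h>

    int main()
    {
        if (nvmlInit() != NVML_SUCCESS) return 1;

        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        nvmlPstates_t pstate;
        unsigned int smClock = 0, memClock = 0;
        unsigned long long reasons = 0;

        nvmlDeviceGetPerformanceState(dev, &pstate);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClock);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClock);
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

        printf("P%d  SM %u MHz  MEM %u MHz  idle-throttled: %s\n",
               (int)pstate, smClock, memClock,
               (reasons & nvmlClocksThrottleReasonGpuIdle) ? "yes" : "no");

        nvmlShutdown();
        return 0;
    }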
Are you accounting for auto boost clock? This causes clocks to be increased dynamically above the base clock, while the power state as such remains unchanged.
From the moment the first CUDA kernel starts executing, and while CUDA kernels continue to execute, the power state should be P0. Only if there is inactivity should the state drop to P2, and if the GPU remains inactive the power state will drop lower and lower, down to P12 I think.
If the app has a very short runtime, your observations may be skewed by the limited temporal resolution of the data reported by nvidia-smi.
Not sure what you mean by “accounting for auto boost clock”.
...
Clock Policy
    Auto Boost : N/A
    Auto Boost Default : N/A
...
I assumed that this was caused by using TCC mode, and that it meant that Auto Boost was not applicable in this mode.
The power state never goes to P0, but it will drop to P5 and P8 after some period of inactivity (~20 seconds). My app uses CUDA for a limited part of a computation pipeline that runs when an outside event occurs, so I frequently see inactivity for more than those 20 seconds, causing the device to drop to P8 and the next call to incur the penalty.
I can work around this by keeping the CUDA device busy, e.g. by having it do some non-trivial work every 10 seconds, but that feels like the wrong way to approach the problem; I'd much rather just have a setting instructing the device to stay ready.
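For illustration, the kind of keep-alive workaround I have in mind looks roughly like this; the 10-second interval, the dummy kernel and the buffer size are placeholders:

    // Keep-alive sketch: launch a small dummy kernel from a background thread
    // often enough that the GPU never sits idle long enough to leave P2.
    // The interval and the amount of dummy work are placeholders.
    #include <atomic>
    #include <chrono>
    #include <thread>
    #include <cuda_runtime.h>

    __global__ void keepAliveKernel(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;               // cheap but non-trivial work
    }

    std::atomic<bool> g_run{true};

    void keepAliveLoop()
    {
        const int n = 1 << 16;
        float *buf = nullptr;
        cudaMalloc(&buf, n * sizeof(float));
        while (g_run.load()) {
            keepAliveKernel<<<(n + 255) / 256, 256>>>(buf, n);
            cudaDeviceSynchronize();
            std::this_thread::sleep_for(std::chrono::seconds(10));
        }
        cudaFree(buf);
    }

    int main()
    {
        std::thread keeper(keepAliveLoop);
        // ... the real event-driven pipeline would run here ...
        std::this_thread::sleep_for(std::chrono::seconds(60)); // stand-in for the pipeline
        g_run = false;
        keeper.join();
        return 0;
    }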
txbob, I may have misunderstood your comment. I don't care about a variation of 200µs over a period of 20s; I care about a 200µs variation in a ~2ms pipeline, which can be triggered less often than once every 20s.
It seems like the GPU is used for only an extremely short duration, and based on that I think what may be happening is that the power state goes to P0 briefly, but that the next time nvidia-smi observes it (limited temporal granularity), it has already fallen back to P2. There may also be the opposite case: since switching the power state has a certain minimum latency, the kernel may have finished executing before the power state could be switched to P0. I think that is the less likely scenario, though, because the switching happens quite fast.
GPUs are designed as throughput devices, not low-latency devices, and the power management is based on that and on the need to run as efficiently as possible; European regulators in particular are always breathing down the computer industry's neck in that regard. It seems your use case requires the GPU to act as a low-latency device, because throughput-wise a 200µs variation over a 20s period should be a don't-care. As I said, to the best of my knowledge the basic operation of the state machine for the GPU power states is not user programmable.
Not sure why there is no auto boost clock on the Titan X Pascal. I am surprised and it is news to me.
njuffa, you are absolutely right. My focus is on low latency which is not the typical use case for a GPU, and I realize that I should have stated that in my opening question.
Is it not possible to disable this throttling? I have noticed things like “power management” and “PowerMizer” which sound promising, but I have not found a way to modify them from the API or nvidia-smi.
In the early days of GPU power management, when it was still very crude and caused much bigger performance artifacts than what you are observing, I requested some degree of user control but the philosophy then was to make the management fully autonomous. I would be surprised if the philosophy has changed since then.
In many PCs the GPU is the single biggest power consumer and the EU has in the past “thought out loud” about regulating PC power consumption on more than one occasion, now that they are done with vacuum cleaners [#]. I assume that this provides significant incentives for GPU manufacturers to make their consumer cards as efficient as possible to stay off the radar of regulators, while the efficiency needs of supercomputers do the same for professional GPUs in the Tesla line.
It seems your use case is simply not optimally aligned with the design of GPUs as high-throughput machines with the trade-off of sometimes higher latencies. I know it is politically incorrect to say this here, but if that were my code, I would probably try to do this latency-sensitive computation on the host CPU, using the highest-clocked parts I can find (~4 GHz these days), the best compilers, and possibly hand-optimized code (SIMD intrinsics etc.).
Actually, the code that is run is extremely parallelizable, so it is ideal for running on the GPU. And my naive CUDA implementation outperforms my current best C implementation (as long as I don’t get that 200µs penalty).
How about overclocking the P8 and P5 states to resemble P2? My CUDA device is only rarely used, and very well cooled. Can you recommend any tools for doing this? I tried the Nvidia Inspector ([url]http://orbmu2k.de/tools/nvidia-inspector-tool[/url]), but it does not appear to work on my card with its current setup.
Again, that’s the kind of user control over the power-modes state machine that isn’t made available by NVIDIA. There may be people who have cracked the state table and figured out how to manipulate it with their own tools (along the lines of editable fan curves), but I have never seen mention of such a tool.
Google may be your best friend.
If a very fast GPU is just 200µs faster than the equivalent CPU solution, that does not sound like a very strong case for using a GPU to me. I don’t know how highly optimized the C solution is, but usually plain C cannot get you anywhere near full performance on a modern CPU unless it can be auto-vectorized (and despite decades of research, auto-vectorizers give up rather quickly in my experience).
Alternatively you could try turning your naive CUDA implementation into a sophisticated CUDA implementation and making up the 200µs that way.
A persistent kernel might be the solution to latency issues. The kernel would permanently poll for new work and perform the computation once work is provided.
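A rough sketch of what that could look like, using a command flag in mapped pinned host memory that the host sets when work arrives (the flag protocol, buffer size and the trivial “work” below are placeholders):

    // Persistent-kernel sketch: a single block spins on a command word in
    // mapped pinned host memory and processes a buffer whenever the host
    // signals new work. Flag values, sizes and the "work" are placeholders.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void persistentKernel(volatile int *cmd, float *data, int n)
    {
        __shared__ int localCmd;
        for (;;) {
            if (threadIdx.x == 0) {              // thread 0 polls the host flag
                int c;
                do { c = *cmd; } while (c == 0); // 0 = idle, 1 = work, -1 = quit
                localCmd = c;
            }
            __syncthreads();
            if (localCmd == -1) return;

            for (int i = threadIdx.x; i < n; i += blockDim.x)
                data[i] = data[i] * 2.0f + 1.0f; // placeholder work
            __threadfence_system();              // make results visible to the host

            __syncthreads();
            if (threadIdx.x == 0) *cmd = 0;      // signal completion
        }
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);

        const int n = 1 << 16;
        int *cmd = nullptr;
        float *data = nullptr;
        cudaHostAlloc(&cmd, sizeof(int), cudaHostAllocMapped);
        cudaMalloc(&data, n * sizeof(float));
        volatile int *vcmd = cmd;
        *vcmd = 0;

        int *dCmd = nullptr;
        cudaHostGetDevicePointer(&dCmd, cmd, 0);

        persistentKernel<<<1, 256>>>(dCmd, data, n); // launched once, runs until told to quit

        *vcmd = 1;                                   // submit one piece of work
        while (*vcmd != 0) { }                       // wait until the kernel resets the flag

        *vcmd = -1;                                  // tell the kernel to exit
        cudaDeviceSynchronize();
        cudaFree(data);
        cudaFreeHost(cmd);
        return 0;
    }

Since TCC mode has no display watchdog, a long-running kernel like this should be viable; the obvious trade-off is that the spinning block permanently occupies part of the GPU.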