Erratic compute speeds

I am currently having a few issues with compute speed on GTX 1080s and Titan XPs slowing down significantly during computation. The variations in speed change from trial to trial, and are enormous (slowdowns of a factor of 60 or more for periods of time that vary in length). When I run the code through nvvp the kernel execution speeds seem reasonable, but large “white-spaces” open up between kernel launches. These gaps are not explainable in terms of waiting for synchronous events to complete or any other host code in-between. I am guessing it could be heat related, but am surprised at that because the box sits in a cold server room. Interestingly, a slightly older version of the code compiled with CUDA 7.5 doesn’t seem to have this problem, but the development code compiled with either 9.2 or 10.1 does. Has anyone else noticed erratic compute performance in their code?


White spaces between kernel launches? Triple-check your host code. Any swap activity due to lack of system memory? Other applications / users hammering the CPU? Any intense floating-point computations running as part of your host code?

I’ll see if I can reproduce this again in the morning and post a screenshot of the nvvp timeline. Basically the code evolves a system of equations on a 2D grid using a numerical time integration procedure. About 40-50 kernel launches per time step (4th order Runge-Kutta integrator). A few data reduction steps and small device-to-host memory copies each timestep. The main issue is that it runs just fine for a few hundred timesteps, then for no apparent reason I get a huge delay (and I mean huge - I’ll post the nvvp timeline tomorrow) between kernel launches, occurring once per timestep. No other apps running - freshly rebooted system. The delay appears between a pair of kernels that are halfway through a host routine and, as far as I can tell, is not caused by host code. The strange thing is the problem is variable from run to run. Same model, same binary and ptx code, vastly different timing results. At its worst, it runs 60x slower than normal. nvidia-smi shows GPU utilisation and power consumption drop very low, so I doubt it is an overheating problem. The really weird thing is the nvvp timeline seems to get confused and show the driver API kernel launches occurring AFTER the kernels have already been run. It is as if there are a couple of clocks on the GPU and their times are skewed (driver API reports times from one clock and kernel start and finish times are from a different clock).

I’ve managed to reproduce the issue. Here is a screenshot from nvvp at the point where the execution suddenly slows by 60x. ptx files compiled with CUDA 10.1.105, running on a Titan Xp in WDDM mode. This is the 3rd GPU of 4. Windows has a few processes using GPU 4, but this is the only process running on GPU 3 as far as I can tell. Driver version 436.15. The last thing the driver API does before the large gaps is to schedule two kernels, which are then actually run at the end of the large gap. There is only one stream, and the driver API is not waiting for anything else in the stream to finish.

I don’t generally bother trying to debug such issues in WDDM mode. It sounds to me like command batching is biting you. Good luck!

If you have a reproducible test case, and have source code that you’re willing to provide, you’re welcome to file a bug. The instructions are linked in a sticky post at the top of this forum. If you are compiling ptx to cubin, then I would also suggest testing with the latest (10.1.243) before filing a bug.

Hi Robert - thanks for your thoughts. I haven’t been able to reproduce this sudden 60x slowdown in TCC mode. I’ll keep testing.

nvidia-smi utilization is a measurement of what percentage of the last time interval (I believe 1 second) a kernel was actually running. Nothing more than that.

So I would look at the host code. Because nvidia-smi is telling you that the (GPU) idle gaps between your kernels on the timeline are getting longer. If the GPU were actually getting slower (e.g. due to clock throttling or the like) and the kernels were taking longer, this number would go in the other direction.
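To make that reasoning concrete, here is a minimal sketch of the arithmetic. It assumes utilization is simply the fraction of a sample window during which some kernel is running, as described above; the function names and the timeline numbers (8 ms kernels, 2 ms gaps, a 1000 ms window) are made up for illustration and are not taken from the profile in this thread.

```python
def utilization(kernel_intervals, window_start, window_end):
    """Fraction of [window_start, window_end) during which a kernel was running."""
    busy = 0
    covered_until = window_start
    for start, end in sorted(kernel_intervals):
        # Clip to the window and skip time already counted by an earlier kernel.
        start = max(start, covered_until)
        end = min(end, window_end)
        if end > start:
            busy += end - start
            covered_until = end
    return busy / (window_end - window_start)


def make_timeline(n_kernels, kernel_ms, gap_ms):
    """Hypothetical back-to-back launches: each kernel is followed by an idle gap."""
    intervals, t = [], 0
    for _ in range(n_kernels):
        intervals.append((t, t + kernel_ms))
        t += kernel_ms + gap_ms
    return intervals


WINDOW_MS = 1000  # assumed sample interval of about one second

baseline = utilization(make_timeline(200, 8, 2), 0, WINDOW_MS)
longer_gaps = utilization(make_timeline(200, 8, 32), 0, WINDOW_MS)    # same kernels, bigger gaps
slower_kernels = utilization(make_timeline(200, 32, 2), 0, WINDOW_MS)  # slower kernels, same gaps

print("baseline:", baseline)
print("longer gaps:", longer_gaps)
print("slower kernels:", slower_kernels)
```

With these made-up numbers, stretching the gaps pulls the figure below the baseline, while stretching the kernels pushes it above, which is why a falling utilization points at the time between launches rather than at the kernels themselves.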

There are a variety of ways to limit profiler scope. One simple method is to put cudaProfilerStart() and cudaProfilerStop() in your code, and turn the profiling on and off. So if the first 4 minutes of your code require 100000 loops, then after 100000 loops in your host code, turn profiling on. In this case you tell the profiler to start with profiling disabled.

If need be, instrument your host code in-between the kernel launches. The NVTX extensions are one way to do this.

Hi Robert - sorry, but I have just realised that the white space opening up in the TCC results was at a point where it is possible the CPU thread was not in my dll. I’ve edited my previous post to reflect this. Sorry. Greg.

Hi Robert - I think the above problem has been biting me again. Here is a snapshot from nvvp which shows the kernels drastically slowing down, and then for no apparent reason speeding up again. The amount of work they are doing remains constant throughout the entire model run, and the memory access patterns aren’t changing either. If this was an overheating issue I wouldn’t expect it to last for 0.5s and then go away again. This is a Quadro P520 running in WDDM mode. I would be surprised if I am the only one who this issue affects.

This doesn’t seem to be the same issue. In the graph you attached previously, there were larger gaps between the kernel executions, this time around it seems as if the kernels themselves are slowed down by a factor of about 4.

I have a Quadro P2200, and I have seen what happens during both thermal and power throttling. For short-term throttling, I have never seen the throughput drop by a factor of 4.

How is that guaranteed? For example, there may be data-driven branches in your code, or in math functions called by your code. In other words, the code may be alternating between fast and slow paths.

Yes, it is possible that branching is causing slightly different performance, but I wouldn’t expect the difference to be large… I’ll see if the erratic speed also occurs with different GPUs.

@njuffa I think you’re right in that it is a different issue than my first report, but it is still an example of erratic compute behaviour. When you see power throttling is it a gradual slow down or does it kick in suddenly?

I’ve run the same code using the same input conditions on a GTX Titan and it ran smoothly the whole way. The attached nvprof2.png shows a snapshot from the profiler. The kernels execute a whole lot faster than on the little Quadro P520, and so the overhead for the context synchronize calls is more evident. (I omit these for the production code but I’ve got them in there while I’m profiling). I’ve bracketed the three main kernels with QueryPerformanceCounter() calls on the host side to get microsecond timings for these three together. The attached compare.png is showing how the P520 is slowing down periodically, whereas the Titan runs like clockwork. Both GPUs were in WDDM mode.
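For anyone wanting to do the same comparison on their own timings, below is a rough sketch of how per-timestep host timings could be scanned for slow episodes. The sample values, the 2x-median threshold, and the function names are all assumptions for illustration; they are not the data behind compare.png.

```python
from statistics import median


def find_slow_steps(step_times_us, factor=2.0):
    """Return (index, time) pairs for steps slower than factor x the median step."""
    baseline = median(step_times_us)
    return [(i, t) for i, t in enumerate(step_times_us) if t > factor * baseline]


def group_episodes(slow_steps):
    """Merge runs of consecutive slow step indices into (first, last) episodes."""
    episodes = []
    for i, _ in slow_steps:
        if episodes and i == episodes[-1][1] + 1:
            episodes[-1] = (episodes[-1][0], i)
        else:
            episodes.append((i, i))
    return episodes


# Hypothetical microsecond timings for the three bracketed kernels, one per timestep.
timings_us = [500, 510, 495, 2000, 2100, 1990, 505, 498, 502, 2050, 500]

slow = find_slow_steps(timings_us)
episodes = group_episodes(slow)
print("slow steps:", slow)
print("episodes:", episodes)
```

Using the median as the baseline keeps a handful of slow steps from skewing the reference value; if the slowdown covered more than half of the run, a different baseline (for example the fastest few steps) would be needed.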

I’m not sure anyone will have a fix for me, but I would like to know if NVIDIA are aware of this issue.

It might be that your GPU is either overheating or hitting a power cap condition. Both of these can be identified using nvidia-smi tool. There are many answered forum questions about nvidia-smi and it has command-line help available.

Other than that I don’t have any suggestions that haven’t already been stated in this thread.
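As a starting point for checking the overheating and power cap conditions with nvidia-smi, here is a small sketch that decodes the active clock throttle reasons. It assumes the query field clocks_throttle_reasons.active prints a hexadecimal bitmask and that the bit values match the nvmlClocksThrottleReason constants in NVML; both should be verified against your driver version (newer drivers may use different field names). The function names are hypothetical.

```python
import subprocess

# Bit values assumed to match the nvmlClocksThrottleReason* constants in NVML.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask_text):
    """Turn a bitmask string such as '0x0000000000000024' into a list of reason names."""
    mask = int(mask_text.strip(), 0)  # base 0 accepts the '0x' prefix
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]


def query_throttle_reasons(gpu_index=0):
    """Ask nvidia-smi for the active throttle bitmask of one GPU (needs a driver install)."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=clocks_throttle_reasons.active",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return decode_throttle_reasons(out)


# Decoding a hand-written example mask; no GPU is needed for this part.
print(decode_throttle_reasons("0x0000000000000024"))
```

Calling query_throttle_reasons() in a loop while the application runs would show whether a power cap or thermal slowdown bit appears at the moments the kernels slow down.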

For short-term events like the one shown in your graph (0.6 seconds), I observe a minor reduction in clock speed accompanied by a minor reduction in supply voltage. No sudden 4x jumps.

Because the P2200 power rating is at the limit of what a PCIe slot can supply (75W), and the fan is tiny, these kinds of short-term power or thermal throttling events happen frequently under full load. A good tool for continuous monitoring of these conditions under Windows is TechPowerUp’s GPU-Z. Its graphical output will show a green bar when hitting the power cap, a magenta bar when hitting the thermal cap. You can certainly also monitor with nvidia-smi, as suggested by Robert Crovella.

Thanks Robert, njuffa. I’ve monitored the performance with GPU-Z on the Quadro P520 and sure enough it says it is hitting the thermal performance limit. Interesting that it hits this limit at 56 deg. Screenshot attached. At least I now know how to respond to any users of our software that notice the solver engine slowing down and speeding up intermittently. Thanks for all your help - it is really appreciated. Greg.

Well, I can see magenta bars in the perf-cap row, but it seems your screenshot shows the GPU while idling, so none of the numeric data displayed shows us much of interest. It would be more interesting if you had two screenshots taken while the app is running, one during a thermal perf-cap event, and one when no thermal perf-cap is applied.

A thermal limit of 56 deg C seems abnormally low. Typically, GPUs are configured for a limit around 80 deg C. For example, on my Quadro P2200, thermal slowdown kicks in when the temperature exceeds 82 deg C. And you should see fan speed go to > 95% before that happens. Voltage, clock speed, and fan control are very fine-grained with Pascal-based cards.

What does nvidia-smi -q state about the configured temperature limits? With my Quadro this section looks like so:

       GPU Current Temp            : 82 C
       GPU Shutdown Temp           : 104 C
       GPU Slowdown Temp           : 101 C
       GPU Max Operating Temp      : N/A
       Memory Current Temp         : N/A
       Memory Max Operating Temp   : N/A

Check that the GPU’s fan isn’t clogged up with accumulated dust/lint (use a can of compressed air to blow it out, if necessary) and make sure the system enclosure is adequately ventilated. The low-end Quadros do not really blow much hot air out through the slot bracket. Instead much of it gets distributed inside the case. Without proper ventilation, you wind up with high intake temperatures for the GPU. If you are in the Southern Hemisphere, it’s summer there now, so your ambient temperature may already be on the high side.

BTW, I assume your GPU is a Quadro P620. I am not aware of a Quadro P520.

The GPU-Z snapshot was for several minutes of time encompassing two runs of my test model. Here are the temperature limits from nvidia-smi -q:

        GPU Current Temp            : 39 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : 57 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

It seems the “GPU Max Operating Temp” is the limiter…
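For checking this on other machines, a quick sketch of how the temperature section of nvidia-smi -q could be parsed to find the lowest configured limit. The sample text is the listing posted above; the function names are hypothetical, and the "Label : value C" line layout is assumed to match what your driver prints.

```python
import re

# Temperature section as posted above for the Quadro P520.
SAMPLE = """
        GPU Current Temp            : 39 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : 57 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
"""


def parse_temperatures(text):
    """Map each label to an integer in deg C, or None where nvidia-smi reports N/A."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(N/A|\d+)\s*C?\s*$", line)
        if match:
            label, value = match.groups()
            values[label] = None if value == "N/A" else int(value)
    return values


def lowest_limit(values):
    """Return (label, deg C) of the lowest configured limit, ignoring current readings."""
    limits = {k: v for k, v in values.items() if "Current" not in k and v is not None}
    if not limits:
        return None
    label = min(limits, key=limits.get)
    return label, limits[label]


temps = parse_temperatures(SAMPLE)
print("current:", temps["GPU Current Temp"])
print("lowest limit:", lowest_limit(temps))
```

For the listing above this picks out the max operating temperature rather than the slowdown temperature, which matches the 56 deg reading at which GPU-Z reported the thermal cap.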

I agree, that seems to be the culprit. Does the “Advanced” tab of GPU-Z show the same limit? GPU-Z shows the available temperature limit range for my GPU as follows:

minimum 65 deg C
default 83 deg C
maximum 98 deg C

In my case the thermal limit is at the default setting of 83 deg C. I am looking at the configuration switches of nvidia-smi but cannot find a switch to set the limit, although it should be configurable.

Very weird. Is this a mobile Quadro in a laptop by any chance? [Speculation:] In that case the GPU cooling solution may use the laptop’s outer shell as a heat sink, and the low temperature limit is required to avoid burns when users actually place the system in their lap.

[Later:] It seems there is a Quadro P520 Mobile, so I assume this is what you have in your system: