CUDA performance gets slower after a sleep on the host side


I’m developing a real-time image processing application using CUDA.
I have a Quadro P2000, and I’m using CUDA 11.5 on Windows 10.

My application’s scenario is like this: the sensor acquires data at a rate of 20 fps to 120 fps depending on circumstances, and my CUDA program does some image/signal processing and displays the result image at the acquisition rate.

After the CUDA work finishes, the host thread needs to wait until new data comes in, so the flow of the whole application looks like this:

  1. Data acquisition
  2. Run CUDA application
  3. Push the result image to the display
  4. Wait until the next data acquired
    (repeat steps 1 to 4)

When I profile my CUDA application without the ‘wait’, it usually takes 5 ms, which is reasonable, but when I add any wait, whether a busy wait or a sleep, the CUDA application’s running time increases to up to 25 ms.

What’s interesting is that it maintains the highest performance for about the first 10 seconds; then performance gradually decreases and ends up at 25 ms. Moreover, when it is at the highest performance, around 5 ms computation time, the GPU utilization is 10%, but it reaches 35% when performance has degraded.

I found a similar topic here: Why kernel calculate speed got slower after waiting for a while?
It says the issue may be related to CUDA’s lazy initialization, and that I need to keep my GPU busy. However, this is not the case for my application, and my application needs to wait until new data is acquired. I also tried to change the persistence mode, but my GPU doesn’t support it.

What can I do to keep my graphics card busy while the application waits for new data, or is there anything else I can try to make my application’s performance steady?

Thank you.

Please note that CUDA is not suitable for hard real-time work, as none of the components of the software stack come with deadline guarantees. If GPU throughput is generally sufficient to keep up with the work at hand, a CUDA application may be soft real-time capable. This automatically implies that your use case can tolerate missed deadlines. Missed deadlines may manifest as dropped frames etc. Given that, your observations may not actually constitute a problem.

If the GPU needs to wait for new work to be sent over from the host, it is idling, and this may cause GPU power management to reduce the operating frequency. Am I correctly assuming that this issue only exists at the low end (20 fps) sensor rate, and doesn’t occur at the high end (120 fps) sensor rate? I am also assuming that you have confirmed that the lower kernel performance is in fact caused by GPU clocks being lowered in between data frames.

I am not aware of a way for programmers to programmatically influence power management parameters (e.g. adjusting time delay for power stage transitions [hysteresis]) while running, so your best bet to keep the GPU operating at the fastest performance state is to keep it busy, for example by sending dummy data frames for it to crunch, so it always operates near 120 fps. Can you re-configure the sensor to always send data at the 120 fps rate? If the sensor data flows through a host-side buffer, you could replay the most recent buffer contents until new sensor data replaces it.

Hi njuffa, thank you so much for the detailed reply.

I understand that CUDA does not guarantee hard deadlines; soft real-time capability works just fine for me. In fact, it’s been working fine.

I didn’t check the GPU’s operating frequency, but GPU power management sounds like a reasonable explanation, and it accounts for the increase in GPU utilization, too. And yes, the issue only exists at the low frame rates. The sensor’s frame rate depends on the depth we want to measure, so it is not possible to fix it at 120 fps. I think I have figured out the root cause, and I will just leave it as-is, since a 25 ms processing time is still sufficient for the 20 fps scenario. Thank you so much!

Depending on whether your device supports this feature, nvmlDeviceSetApplicationsClocks may be of use.

@rs277 makes a good point. For some reason I thought that the Quadro P2000 does not support application clocks. I have a Quadro P2000 in my Windows 10 system, running with the TCC driver, NVIDIA driver version 522.06. At least in this configuration, it does support setting application clocks, as I just confirmed by experiment.

Setting application clocks with nvidia-smi requires administrative privileges. I simply opened a Command Prompt with the Run as administrator option. Supported clock settings can be displayed with nvidia-smi -q -d SUPPORTED_CLOCKS -i <GPU number>. Available Graphics clocks are listed grouped by the associated Memory clock setting.

In my machine, the Quadro P2000 is GPU number 0, so I tried:

> nvidia-smi -i 0 -ac 3504,1404
Application clocks set to "(MEM 3504, SM 1404)" for GPU 00000000:17:00.0
All done.

I used GPU-Z 2.50.0 to confirm that the Quadro P2000 is operating at the selected application clock of 1404 MHz. Note that power capping and thermal capping are still in effect when application clocks are used, and this may lower the clock below the selected value.

The way I understand it, application clocks are only applied in the highest performance power state, and idling GPUs transition to lower performance power states courtesy of GPU power management and will be clocked down. In my case the Quadro P2000 drops to 139 MHz for the core and 203 MHz for the memory when idle. So the use of application clocks may not help with OP’s scenario.

Maybe nvmlDeviceSetGpuLockedClocks() then? (Volta and above only).

If the OP is developing the app. for wider deployment and if the hardware can be specified and if the above works, (too many ifs…), this may help.
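For deployment, the same setting nvidia-smi -lgc applies can be made from inside the application via NVML. This is a minimal sketch (requires linking against the NVML library, a Volta-or-later GPU for locked clocks, and administrative privileges; the GPU index 1 matches the example above and is an assumption):

```c
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }
    nvmlDevice_t dev;
    rc = nvmlDeviceGetHandleByIndex(1, &dev);  /* GPU 1, as in the example above */
    if (rc == NVML_SUCCESS) {
        /* Pin the graphics clock to 1395 MHz (min == max), Volta and later only.
           Equivalent to: nvidia-smi -i 1 -lgc 1395 */
        rc = nvmlDeviceSetGpuLockedClocks(dev, 1395, 1395);
        if (rc != NVML_SUCCESS)
            fprintf(stderr, "SetGpuLockedClocks failed: %s\n", nvmlErrorString(rc));
        /* nvmlDeviceResetGpuLockedClocks(dev) restores default clock behavior. */
    }
    nvmlShutdown();
    return 0;
}
```

The app would typically lock the clocks at startup and reset them on shutdown, keeping in mind the idle power-draw cost noted below.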

Yes, that seems to work. I also have a Turing-based Quadro RTX 4000 in the same system. As an experiment, I fixed its core clock at 1395 MHz:

> nvidia-smi -i 1 -lgc 1395
GPU clocks set to "(gpuClkMin 1395, gpuClkMax 1395)" for GPU 00000000:65:00.0
All done.

Now, when I idle the card, the GPU clock stays at 1395 MHz and the memory clock at 1625 MHz. Power draw is at a whopping 37W, so that is something to keep in mind. Normally, when idling, the GPU power management drops voltage and frequency to very low values to reduce power draw below 10W.

As with application clocks, the specified “locked clock” setting appears to be overridden by power capping and thermal capping, which makes perfect sense as these are designed to prevent damage to the GPU hardware.
