Why does kernel execution speed get slower after waiting for a while?

Hi, I found a strange phenomenon: the execution speed of the same kernel differs between continuous execution and intermittent execution. For example, one execution takes about 200 ms if I execute the kernel 100 times continuously. But if I execute the kernel ten times, wait for 10 seconds, and then execute another ten times, the first execution after the wait (the 11th) takes about 400 ms.
I tried using cudaEvent to record the time and got the same result, so the timing part is correct.
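For reference, the cudaEvent timing I used looked roughly like the following standalone sketch (the kernel here is just a trivial stand-in, and the buffer size and launch configuration are made up; the real code times the bilateral filter):

// Sketch: timing one kernel launch with cudaEvent instead of clock().
// The kernel, buffer size, and launch configuration are placeholders.
#include <cuda_runtime.h>
#include <iostream>

__global__ void dummyKernel(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
    std::cout << ms << " ms (cudaEvent)" << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}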
I also tried using Nsight to see what happened, but I didn't see any difference between the first execution and the others within one group.
Is this some optimization strategy of CUDA? How can I avoid this phenomenon and keep the kernel execution speed steady?
The CUDA version I am using is 10.2.
Here is the test code.

Wait 10 seconds between groups:

for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        std::clock_t start, end;
        start = clock();
        std::cout << "number: " << i * 10 + j << std::endl;
        bilateralFilter_kernel<<<>>>;   // launch configuration and arguments omitted
        cudaDeviceSynchronize();
        end = clock();
        double elapsedTime = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << elapsedTime << " s cuda calc time" << std::endl;
    }
    std::cout << "wait 10 S" << std::endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(10000));
}

Result:
number: 0
0.261 s cuda calc time
number: 1
0.22 s cuda calc time
number: 2
0.219 s cuda calc time
number: 3
0.22 s cuda calc time
number: 4
0.22 s cuda calc time
number: 5
0.22 s cuda calc time
number: 6
0.219 s cuda calc time
number: 7
0.223 s cuda calc time
number: 8
0.218 s cuda calc time
number: 9
0.218 s cuda calc time
wait 10 S
number: 10
0.431 s cuda calc time
number: 11
0.217 s cuda calc time
number: 12
0.219 s cuda calc time
number: 13
0.218 s cuda calc time
number: 14
0.219 s cuda calc time
number: 15
0.22 s cuda calc time
number: 16
0.22 s cuda calc time
number: 17
0.219 s cuda calc time
number: 18
0.218 s cuda calc time
number: 19
0.218 s cuda calc time
.
.
.
wait 10 S
number: 90
0.449 s cuda calc time
number: 91
0.22 s cuda calc time
number: 92
0.22 s cuda calc time
number: 93
0.22 s cuda calc time
number: 94
0.22 s cuda calc time
number: 95
0.219 s cuda calc time
number: 96
0.219 s cuda calc time
number: 97
0.221 s cuda calc time
number: 98
0.22 s cuda calc time
number: 99
0.22 s cuda calc time

Continuous 100 times, no wait:

for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        std::clock_t start, end;
        start = clock();
        std::cout << "number: " << i * 10 + j << std::endl;
        bilateral_kernel_float<<<>>>;   // launch configuration and arguments omitted
        getLastCudaError("Cuda failed after bilateral_kernel_float:");
        cudaDeviceSynchronize();
        end = clock();
        double elapsedTime = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << elapsedTime << " s cuda calc time" << std::endl;
    }
    // std::cout << "wait 10 S" << std::endl;
    // std::this_thread::sleep_for(std::chrono::milliseconds(10000));
}

Result:
number: 0
0.259 s cuda calc time
number: 1
0.219 s cuda calc time
number: 2
0.223 s cuda calc time
number: 3
0.222 s cuda calc time
number: 4
0.221 s cuda calc time
number: 5
0.22 s cuda calc time
number: 6
0.22 s cuda calc time
number: 7
0.22 s cuda calc time
number: 8
0.218 s cuda calc time
number: 9
0.22 s cuda calc time
number: 10
0.221 s cuda calc time
number: 11
0.22 s cuda calc time
number: 12
0.22 s cuda calc time
number: 13
0.219 s cuda calc time
number: 14
0.22 s cuda calc time
number: 15
0.22 s cuda calc time
number: 16
0.219 s cuda calc time
number: 17
0.22 s cuda calc time
number: 18
0.219 s cuda calc time
number: 19
0.222 s cuda calc time
number: 20
0.223 s cuda calc time
number: 21
0.222 s cuda calc time
number: 22
0.223 s cuda calc time
number: 23
0.223 s cuda calc time
number: 24
0.223 s cuda calc time
number: 25
0.222 s cuda calc time
.
.
.
number: 80
0.224 s cuda calc time
number: 81
0.225 s cuda calc time
number: 82
0.223 s cuda calc time
number: 83
0.224 s cuda calc time
number: 84
0.223 s cuda calc time
number: 85
0.224 s cuda calc time
number: 86
0.223 s cuda calc time
number: 87
0.224 s cuda calc time
number: 88
0.225 s cuda calc time
number: 89
0.225 s cuda calc time
number: 90
0.224 s cuda calc time
number: 91
0.226 s cuda calc time
number: 92
0.223 s cuda calc time
number: 93
0.225 s cuda calc time
number: 94
0.223 s cuda calc time
number: 95
0.225 s cuda calc time
number: 96
0.223 s cuda calc time
number: 97
0.224 s cuda calc time
number: 98
0.224 s cuda calc time
number: 99
0.224 s cuda calc time

There is some evidence that putting things to sleep may cause CUDA to incur additional overhead after a wake. My suggestion would be that if you desire the most consistent CUDA performance, don’t put the thread to sleep.

I wondered whether driver persistence might be a factor:

https://docs.nvidia.com/deploy/driver-persistence/index.html

It's worth a try to see if it makes any difference. However, I would not expect any teardown due to non-persistence to occur as long as the CUDA context remains present, and I wouldn't expect putting a thread to sleep to have an impact on a CUDA context. I realize this almost sounds contradictory to my previous statements, so it is still worth trying. What I'm suggesting is that even with a context established, there is anecdotal evidence that putting a thread to sleep and waking it up again seems to incur some (CUDA?) overhead on wake-up.

True, I had overlooked the context still being present.

While it may not apply here, some general words of advice regarding performance measurements:

Since by default dynamic clock control is being applied to GPUs, one should generally not expect the same kernel operating on the same data to execute in the same amount of time. Also, the performance of the same kernel operating on the same data can vary between different instances of the same GPU model.

GPU clock frequencies vary with power draw, device temperature, and power-management state, as well as other environmental parameters. Power state and clock adjustments occur on relatively small time steps, but are not instantaneous.

In my work I see performance differences of up to 20%. For what it is worth, similar clock variations apply to modern CPUs. This is just how modern processors work to squeeze the last bit of useful performance out of a comatose Moore’s Law. The alternative would be to operate these processors at fixed but noticeably lower clocks.

In some cases it may be possible to lock GPU clocks with nvidia-smi: look for “application clocks”.

Good suggestions and quite possibly relevant. Running kernels with essentially zero gap vs. a 10-second gap could certainly play into GPU performance states. The context isn't going to get torn down, but it's quite possible that certain aspects of GPU management may be affected. So I think both persistence mode and clock/power management are worth checking.

It’s probably too crude to just lump that all into “CUDA initialization”.

Here is the kind of power state management I observe on my GPUs. The behavior may well differ based on GPU type, VBIOS, and driver version. The frequencies below are approximate, indicating only the rough range.

As long as the GPU is driving the GUI only, it cruises along at 300 MHz. When a CUDA context is initialized, the GPU frequency climbs to 1000 MHz; that seems to be the "standby for compute" mode. Only when a kernel actually starts running does the GPU go into "full speed ahead" mode. The mode switch appears to happen within <= 0.5 seconds of kernel run time, and only in this mode can boost clocks be applied. The boost clock mechanism first attempts to reach the nominal boost speed, around 1500 MHz. If there is still power and thermal headroom, it tries to reach the peak boost speed, around 1800+ MHz. Reaching peak boost typically requires boosting voltage above 1.0 V, which drives up power draw significantly. One issue with actively cooled GPUs is that the fan control has significantly more lag than the voltage/frequency adjusting mechanism. So with heavily loading kernels, the GPU may initially get too warm at the nominal boost clock, and the clock will temporarily drop to 1400 or 1300 MHz, only reaching nominal boost speed after a couple of minutes once the fan system is fully spooled up.

If there is a gap in GPU activity, with no further kernels submitted, the GPU goes back to the 1000 MHz "standby for compute" mode. I have not paid attention to how fast this mode switch happens. But it is certainly possible that if an app submits relatively short-running kernels at fairly large time intervals, the GPU clock never rises above 1000 MHz (best I can tell given the limited time resolution of sensor queries). I have seen this with an app of mine.
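If one wants to watch these transitions programmatically rather than by eyeballing nvidia-smi, something along the following lines should work: a rough sketch that polls the SM clock through NVML once per second (assuming the NVML header and library are available; this is illustrative only, not taken from any of the posts above).

// Poll the SM clock once per second to observe power-state transitions
// while another process runs kernels. Requires nvml.h and the NVML library.
#include <nvml.h>
#include <cstdio>
#include <chrono>
#include <thread>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
        nvmlShutdown();
        return 1;
    }

    for (int i = 0; i < 60; i++) {
        unsigned int smClockMHz = 0;
        if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClockMHz) == NVML_SUCCESS) {
            printf("t=%3d s  SM clock: %u MHz\n", i, smClockMHz);
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    nvmlShutdown();
    return 0;
}

Link against the NVML library when building (libnvidia-ml on Linux); nvidia-smi itself is built on the same interface.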

Hi everyone,
Thank you very much for replying.
Here is what I have tried based on your advice.

1. Persistence Mode & Application Clocks
I had considered persistence mode before, because this phenomenon looks very much like an energy-saving optimization strategy, but I noticed that the document says:

On Windows the kernel mode driver is loaded at Windows startup and kept loaded until Windows shutdown. Consequently Windows users can mostly ignore the driver persistence implications described in this document.

So I didn't try it at that time. This time I tried to enable persistence mode on my computer, but found out that my GPU doesn't support it.

    Product Name                          : NVIDIA GeForce GTX 1650
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : N/A

I also tried to set application clocks to keep the GPU frequency steady. Unfortunately, application clocks are not supported either.

 Clocks
        Graphics                          : 37 MHz
        SM                                : 37 MHz
        Memory                            : 95 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 4001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A

2. Compute Mode
While checking the GPU performance settings with nvidia-smi, I noticed that "compute mode" seemed relevant to this problem, so I tried to set it to EXCLUSIVE_PROCESS. However, nvidia-smi says the compute mode cannot be changed under WDDM, and my GPU doesn't support TCC. Also, there is no other thread using CUDA during the experiment, so I don't expect EXCLUSIVE_PROCESS would be useful anyway.

3. cudaStreamQuery
While searching for information on TCC/WDDM, I found that someone said the WDDM driver may not send launch requests to the GPU immediately. Calling cudaStreamQuery(0) after launching a kernel is said to have the implicit effect of forcing the WDDM driver to send the batched requests to the GPU right away. I tried cudaStreamQuery, but nothing changed. A rough sketch of what I tried is shown below.
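(The kernel, buffer size, and launch configuration in this sketch are placeholders, not the real bilateral filter code.)

// Sketch of the cudaStreamQuery(0) flush attempt on WDDM.
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 0.1f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    // On WDDM, kernel launches can be batched by the driver; querying the
    // stream is reported to flush the pending work to the GPU immediately.
    cudaStreamQuery(0);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}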

4. Timeout detection and recovery (TDR)
I read the document again and found that the Windows driver may cause overhead.

Note: Driver reload events, e.g. due to TDR or new driver installation, will result in reset of non-persistent state.

According to the Windows documentation, the TDR timeout is 2 seconds.

Timeout detection in WDDM

The GPU scheduler, which is part of the DirectX graphics kernel subsystem (Dxgkrnl.sys), detects that the GPU is taking more than the permitted amount of time to execute a particular task. The GPU scheduler then tries to preempt this particular task. The preempt operation has a “wait” timeout, which is the actual TDR timeout. The default timeout period in Windows Vista and later operating systems is 2 seconds. If the GPU cannot complete or preempt the current task within the TDR timeout period, the OS diagnoses that the GPU is frozen.

I also found that if I wait 2 s between kernel executions, every execution takes 400 ms, while if I wait only 1 s between executions, every execution takes about 200 ms. So TDR seemed to be the main culprit. However, I first noticed this speed-change problem on the Jetson platform, and the Jetson series runs Linux (Ubuntu), so TDR may not be the main reason either.

5. Keep the GPU Busy
I wrote a simple kernel that just copies the data from the input image to the output image and adds 0.1 to every pixel, and used it to keep the GPU busy during the ten-second wait. At first I launched this keepBusyKernel once per second. After 10 s I launched the test kernel (the bilateral filter) again, and the first execution still took 400 ms, longer than the normal level. I almost gave up, but then I realized that the keepBusyKernel is so simple that it only takes about 1 ms, so maybe it wasn't keeping the GPU busy enough. So I tried launching the keepBusyKernel once every 500 ms, and this time the GPU speed stayed steady.

So the key point is to keep the GPU busy!

I also tried making the keepBusyKernel work on different data than the test kernel and got the same result. This is reasonable, because when I first found the speed-change problem on the Jetson platform, after waiting for 10 seconds about 8 different kernels I launched all became roughly 2 times slower, and they were processing different data and different GPU memory. So, conversely, the keepBusyKernel doesn't have to be the same kernel or use the same data as the test kernel; we just have to keep the GPU busy. A sketch of the approach is shown below.
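(The kernel body, buffer size, and launch configuration in this sketch are simplified placeholders; only the idea of launching a cheap kernel every ~500 ms during the idle period comes from my experiment.)

// Sketch of the "keep the GPU busy" workaround.
#include <cuda_runtime.h>
#include <chrono>
#include <thread>

__global__ void keepBusyKernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] + 0.1f;   // copy input to output, add 0.1
}

// Call this during the idle period instead of a plain sleep.
void keepGpuBusy(const float *d_in, float *d_out, int n, int seconds)
{
    for (int i = 0; i < seconds * 2; i++) {               // every 500 ms
        keepBusyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
}

int main()
{
    const int n = 1024 * 1024;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    keepGpuBusy(d_in, d_out, n, 10);   // stands in for the 10-second wait

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

In my case the buffers were the input/output images of the filter pipeline, but as noted above, any data seems to work.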

Conclusion
Although I still haven't found an official explanation or a recommended solution, I think we can use this workaround: launch a simple kernel periodically to keep the GPU busy, so that we get better and steadier performance. Because the simple kernel costs very little time, the real target kernel can still be launched quickly whenever it is requested.
Thank you everyone, your replies have been very helpful.

