Peaks and slow performance with cudaDeviceSynchronize

Before I explain my problem: I'm using Ubuntu 20.04, my GPU is an NVIDIA GeForce RTX 2080 SUPER, and I'm using CUDA 11.4.
Because the NVIDIA developer forums don't allow attaching more than one file, I merged all my attachments into one file (at the end of the post); every time I reference an attachment, you can find it there.
I've got a strange problem when running a kernel and measuring its execution time precisely (in microseconds).
The kernel code I'm running is:

__global__
void test(uint8_t* data, const unsigned int num)
{
    // each thread computes a dummy sum and writes one byte of output
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int sum = 0;
    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            sum += i*j - num;
        }
    }
    data[index] = sum % 256;
}

I launch this kernel in a loop 1000 times. Before each launch I start a timer, and I stop it after the kernel has finished.
The weird thing is that the timing usually shows some spikes; the performance is not stable.
The code I use to measure the kernel execution time is:

cudaProfilerStart();
for (int i = 0; i < 1000; i++)
{
    begin = std::chrono::high_resolution_clock::now();
    test<<<1, 256>>>(frameRawData, 92);
    cudaDeviceSynchronize();
    end = std::chrono::high_resolution_clock::now();
    timeCheckingFile << std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count() << "\n";
}
cudaProfilerStop();

As you can see, after each measurement I write the result to a file. I wrote a Python script that builds a performance graph from this file, where the x-axis is the iteration number (0-1000) and the y-axis is the measured time (in microseconds).
(The first image in the attachment at the end of the post: the graph.)
This is the graph my script produced.
As you can see, most of the iterations are consistent, but some of them spike up to 350 microseconds!
Sometimes when I run the program I see no spikes at all, and sometimes I see plenty of them.
I used the NVIDIA Visual Profiler and realized that the cudaDeviceSynchronize call I use to wait for the kernel to finish takes most of my execution time.
The result from the Visual Profiler is:
(The second image in the attachment at the end of the post: the timeline from the NVIDIA Visual Profiler.)
As you can see, most of the GPU work happens at regular intervals, but there are a couple of larger gaps (which correspond to the performance spikes).
When I zoom in on these gaps, I see that their cause is the cudaDeviceSynchronize function.
(The third image in the attachment at the end of the post: the zoomed-in timeline from the NVIDIA Visual Profiler.)
I thought this might be a thread priority problem, but I tested my program with perf and it seems my program isn't waiting at all.
If it helps, this is the NVIDIA profiler output I got when running the program again:
(The last image in the attachment at the end of the post: the NVIDIA profiler results.)
Does anyone know the cause of this unstable performance and these inconsistent spikes in my program?

Hi @user20205,
Thank you very much for the detailed report! I see that you posted it in the CUDA-GDB forum section (which is dedicated to CUDA-GDB tool support), but the nature of the question makes it more suitable for the CUDA Programming and Performance section of the NVIDIA Developer Forums.

I have moved your topic there.

So when you do these measurements, there is literally nothing else going on, i.e. the machine is entirely idle before and after the run of the test program? Are there any background processes doing I/O, using PCIe in the process? The working hypothesis would be resource contention. Ordinary Linux is not an RTOS; any kernel call, including those made by CUDA, could cause an indeterminate delay, especially if there is resource contention, e.g. via a lock.

According to the profile data in the bottom picture, the kernel seems to be extremely short running, with an average execution time of 2.4 microseconds. The speed at which kernels can be launched is finite. For many years, the speed of light was 5 microseconds per kernel launch, i.e. 200,000 launches per second. I couldn't tell you what it is with your hardware without measuring it; it may well be somewhat faster than that traditional rate.

If kernel invocations are stuffed into the GPU command queue by host code faster than they can be consumed by the GPU, this would eventually overflow the launch queue, causing a stall. While this doesn't quite fit the graph (one would expect launch latency to remain high once the queue depth has been exhausted), and the working hypothesis of a "queue full" stall is therefore weak, I think this does call for a quick experiment that employs longer-running kernels, e.g. 10 microseconds on average, along the lines of the sketch below.
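
One simple way to set up that experiment (just a sketch; test_longer and reps are illustrative names, and the repetition count needed to reach roughly 10 microseconds would have to be calibrated on your GPU) is to make the amount of busy work a kernel parameter:

__global__
void test_longer(uint8_t* data, const unsigned int num, const unsigned int reps)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int sum = 0;
    // same arithmetic as the original kernel, repeated 'reps' times
    for (unsigned int r = 0; r < reps; r++)
        for (int i = 0; i < 100; i++)
            for (int j = 0; j < 100; j++)
                sum += i*j - num;
    data[index] = sum % 256;
}

// e.g. test_longer<<<1, 256>>>(frameRawData, 92, 8);  // start at ~8x the work and adjust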

The presence of cudaDeviceSynchronize() in the work-issuance loop pretty much guarantees that the pending depth of the command queue will never be large.

However, I think the suggestion of trying out longer work will tend to remove the variation, at least as a percentage of kernel duration.

By this I mean that I have observed variability in the launch overhead, which is most evident when timing very short kernels. I suspect that this variability depends on the platform (for example, I expect a display GPU to show significantly higher variability).
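
If you want to see whether the variability is in the kernel itself or in the host-side launch/synchronize path, one option (a minimal sketch, untested, reusing your kernel and buffer) is to time each iteration with CUDA events alongside the host clock; the events measure the GPU-side execution, so the difference between the two numbers is essentially launch/sync overhead:

cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);
for (int i = 0; i < 1000; i++)
{
    auto begin = std::chrono::high_resolution_clock::now();
    cudaEventRecord(evStart);
    test<<<1, 256>>>(frameRawData, 92);
    cudaEventRecord(evStop);
    cudaEventSynchronize(evStop);          // wait for the kernel and the stop event
    auto end = std::chrono::high_resolution_clock::now();
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, evStart, evStop);   // GPU-side time, in milliseconds
    auto hostUs = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
    // log gpuMs and hostUs (e.g. to your timeCheckingFile); if gpuMs stays flat
    // while hostUs spikes, the spikes are host-side overhead
}
cudaEventDestroy(evStart);
cudaEventDestroy(evStop);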

If you feel this is a problem you could file a bug. I personally don’t feel there is an issue with a properly constructed test case, running on a non-display GPU:

$ nvidia-smi
Mon Nov  1 09:45:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 00000000:04:00.0 Off |                    0 |
| N/A   34C    P8    30W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   39C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20Xm         On   | 00000000:83:00.0 Off |                    0 |
| N/A   33C    P8    18W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20Xm         On   | 00000000:84:00.0 Off |                    0 |
| N/A   32C    P8    19W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[user2@dc10 misc]$ cat t1911.cu
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_profiler_api.h>

__global__
void test(uint8_t* data, const unsigned int num)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int sum = 0;
    for (int i =0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            sum += i*j - num;
        }
    }
    data[index] = sum % 256;
}
size_t nTPB = 256;
size_t nBLK = 1;
size_t depth = 1000;
int main(){
    uint8_t *frameRawData;
    cudaMalloc(&frameRawData, nTPB*nBLK*sizeof(uint8_t));
    test<<<nBLK,nTPB>>>(frameRawData, 92);
    cudaDeviceSynchronize();
    std::vector<size_t > times;
    cudaProfilerStart();
    for(int i = 0; i < depth; i++)
    {
        auto begin = std::chrono::high_resolution_clock::now();
        test<<<nBLK,nTPB>>>(frameRawData, 92);
        cudaDeviceSynchronize();
        auto end = std::chrono::high_resolution_clock::now();
        times.push_back(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count());
    }
    cudaProfilerStop();
    size_t avg = 0;
    size_t max = 0;
    for (int i = 0; i < depth; i++){
      max = std::max(max, times[i]);
      avg += times[i];}
    avg /= depth;
    std::cout << "avg: " << avg << "us  max: " << max << "us" << std::endl;
}
$ nvcc -o t1911 t1911.cu -std=c++14 -lineinfo
$ ./t1911
avg: 9us  max: 23us
$ ./t1911
avg: 9us  max: 20us
$ ./t1911
avg: 10us  max: 20us
$

(CUDA 11.4)

In this case, if we simply declare that the launch overhead may be as much as ~25 microseconds, then the problem seems to “disappear”. I’m not aware of any guaranteed specifications on launch overhead; it is demonstrable that launch overhead may vary depending on the exact launch pattern (e.g. configuration of kernel arguments).
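
As an illustration (again just a sketch, not a claim about specific numbers on any particular hardware), one can look at the launch overhead in isolation by timing only the launch call and synchronizing outside the timed region:

auto begin = std::chrono::high_resolution_clock::now();
test<<<nBLK, nTPB>>>(frameRawData, 92);   // asynchronous launch; control returns to the host quickly
auto end = std::chrono::high_resolution_clock::now();
cudaDeviceSynchronize();                  // completion is waited for outside the timed region
// end - begin now approximates the host-side launch cost only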

At some point I stop worrying about the noise.


I agree, brainfart on my part.

I would say the take-home message here is that neither the operating system nor the layers of the CUDA software stack have any hard real-time properties: there is no guaranteed upper bound on any operation, it’s best effort only. CUDA acceleration can be used for some tasks that require soft real-time operation, i.e. nothing bad happens when a deadline is missed.

Generally speaking, I would recommend that kernels be designed to run in the millisecond range rather than the microsecond range. This allows smooth scaling from the fastest to the slowest supported platform (say, an RTX 3090 vs. a GTX 1060).
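
A common way to get into that range (a sketch only; process_batch and n are illustrative names, not from the original code) is to batch many elements into a single launch with a grid-stride loop, instead of issuing many microsecond-scale kernels:

__global__
void process_batch(uint8_t* data, const unsigned int num, size_t n)
{
    // grid-stride loop: each thread handles multiple elements, so one launch
    // covers a large batch instead of many tiny kernels
    for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += (size_t)gridDim.x * blockDim.x)
    {
        unsigned int sum = 0;
        for (int i = 0; i < 100; i++)
            for (int j = 0; j < 100; j++)
                sum += i*j - num;
        data[idx] = sum % 256;
    }
}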

Hey @Robert_Crovella
Thank you very much for the helpful answer!
When I ran the nvidia-smi command, just like you did, I found my answer.
Apparently, the Xorg server's refresh activity was making cudaDeviceSynchronize wait (or making the kernel work wait before it could start).
Anyway, when I turned off the Xorg server (sudo init 3), I managed to get great times (about 50 microseconds maximum per kernel run).
So it's worth saying, for everyone who's trying to achieve this kind of performance: be aware that the Xorg server may cause performance spikes in your program.
