Before I explain my problem: I’m using Ubuntu 20.04, my GPU is an NVIDIA GeForce RTX 2080 SUPER, and I’m using CUDA 11.4.
Because the NVIDIA developer forums don’t allow attaching more than one file, I merged all my attachments into a single file (at the end of the post); every time I reference an image, you can find it there.
I’ve got a strange problem when running a kernel function and measuring its execution time precisely (in microseconds).
The kernel code I’m running is:
__global__
void test(uint8_t* data, const unsigned int num)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Dummy workload: 100 x 100 iterations of integer arithmetic
    unsigned int sum = 0;
    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            sum += i * j - num;
        }
    }

    // Write one byte per thread so the work is not optimized away
    data[index] = sum % 256;
}
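For completeness, in case someone wants to reproduce this: frameRawData (used in the timing loop below) only needs to be a device buffer with at least one byte per launched thread. A minimal setup sketch (not my exact setup code) would be:

uint8_t* frameRawData = nullptr;
cudaMalloc(&frameRawData, 256 * sizeof(uint8_t)); // one byte per thread of the <<<1, 256>>> launch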
I run this kernel in a loop 1000 times. Each time, I start a timer right before launching the kernel and stop it after the kernel has finished.
The weird thing is that the timing usually shows some spikes; the performance is not stable.
The code I’m running to measure the kernel execution time is:
cudaProfilerStart();
for (int i = 0; i < 1000; i++)
{
    auto begin = std::chrono::high_resolution_clock::now();

    test<<<1, 256>>>(frameRawData, 92);
    cudaDeviceSynchronize();   // wait for the kernel to finish before stopping the timer

    auto end = std::chrono::high_resolution_clock::now();
    timeCheckingFile << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "\n";
}
cudaProfilerStop();
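For reference, a variant of the measurement using CUDA events instead of std::chrono would look roughly like this (a sketch, not the code I actually ran); it measures only the GPU time between the two events and excludes the host-side wait inside cudaDeviceSynchronize:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int i = 0; i < 1000; i++)
{
    cudaEventRecord(start);
    test<<<1, 256>>>(frameRawData, 92);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // block until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
    timeCheckingFile << ms * 1000.0f << "\n";   // store it in microseconds like the chrono version
}

cudaEventDestroy(start);
cudaEventDestroy(stop);

If the spikes only showed up in the std::chrono numbers and not in the event timings, that would point to the host-side wait rather than the kernel itself.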
As you can see, after each measurement I write the result to a file. I wrote a Python script that plots a performance graph from this file, where the x-axis is the iteration number (0 - 1000) and the y-axis is the measured time (in microseconds).
(The first image - the performance graph - in the attachment at the end of the post)
This is the graph my script produced.
As you can see, most of the iterations are consistent, but some of them spike up to 350 microseconds!
Sometimes when I run the program I see no spikes at all, and sometimes I see plenty of them.
I used the NVIDIA Visual Profiler and realized that the cudaDeviceSynchronize I use to wait for the kernel to finish takes most of my execution time.
The result of the Visual Profiler is:
(The second image - timeline from the NVIDIA Visual Profiler - in the attachment at the end of the post)
As you can see, most of the kernel launches are evenly spaced, but there are a couple of larger gaps (which correspond to the performance spikes).
When I zoom in on these gaps, I see that they occur inside the cudaDeviceSynchronize call.
(The third image - timeline from the NVIDIA Visual Profiler - in the attachment at the end of the post)
I thought this might be a thread-priority problem, but I checked my program with perf and it seems the process isn’t blocked waiting at all.
If it helps, this is the NVIDIA profiler output I got when running it again:
(The last image - NVIDIA profiler results - in the attachment at the end of the post)
Does anyone know what causes this unstable performance and these inconsistent spikes in my program?