Hi,
I have been doing performance benchmarking on RTX 2070 GPU. One thing that I notice is that the performance of the first launched kernels can somehow negatively affect the performance of subsequent ones. As an example:
// Benchmark 1
for (int i = 0; i < 1000; ++i)
    kernel_A<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
    kernel_B<<<...>>>();

// Benchmark 2
for (int i = 0; i < 1000; ++i)
    kernel_B<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
    kernel_B<<<...>>>();

// Benchmark 3
for (int i = 0; i < 1000; ++i)
    kernel_C<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
    kernel_B<<<...>>>();

// Benchmark 4
for (int i = 0; i < 1000; ++i)
    kernel_B<<<...>>>();
The performance numbers reported by nvprof are roughly the same for Benchmarks 2, 3, and 4, whereas the numbers for Benchmark 1 are roughly 10% worse. I suspect this is because kernel_A has low SM occupancy (in its final execution wave it can only dispatch one thread block per SM), but I am not sure how that could affect the performance of kernel_B, especially given that there is a cudaDeviceSynchronize call in between.
Could anyone please give me some hints on how to mitigate this problem? Thanks.
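In case anyone wants to try reproducing this, here is a minimal standalone sketch of the Benchmark 1 pattern. The kernel bodies and launch configurations are made-up placeholders (my real kernels differ), and it times the second loop with CUDA events rather than nvprof:

```cuda
// Minimal repro sketch. kernel_A is a dummy stand-in launched with a
// single block to mimic the low-occupancy case; kernel_B is the kernel
// whose timing varies between benchmarks.
#include <cstdio>

__global__ void kernel_A(float *x) {          // low-occupancy placeholder
    x[threadIdx.x] += 1.0f;
}

__global__ void kernel_B(float *x) {          // kernel being measured
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

int main() {
    float *d;
    cudaMalloc(&d, 1 << 20);                  // 1 MiB scratch buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // First loop: the kernel that seems to影响 -- err, affect -- later timings.
    for (int i = 0; i < 1000; ++i)
        kernel_A<<<1, 256>>>(d);
    cudaDeviceSynchronize();

    // Second loop: time kernel_B with events.
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
        kernel_B<<<64, 256>>>(d);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel_B x1000: %.3f ms\n", ms);

    cudaFree(d);
    return 0;
}
```

Swapping the first loop to launch kernel_B instead of kernel_A should correspond to the Benchmark 2 case.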