Same kernel and data exhibit different performance

I have a custom kernel whose performance differs between runs, even though the data is identical in both runs. I have run the kernel through Nsight Compute, and the GPU Throughput charts show different results.

For the fast run, the kernel shows 8.11% Compute and 33.78% Memory throughput (776 usec runtime). For the slow run, it shows 6.80% Compute and 51.60% Memory (924 usec runtime).

This is the exact same kernel (built into a library and called externally). The data passed to the kernel is identical, and the grid and block dimensions are identical. The achieved occupancy is 13.70 vs. 13.69 and the achieved active warps are 8.77 vs. 8.76 for the two runs.

This kernel is called thousands of times per run, so the performance difference has a noticeable impact on the overall performance of the code. Sadly, I cannot post the code here, as it's proprietary.

The data passed to the kernel is allocated the same way in both runs; the only difference is how the data is populated. The "fast" run is populated by copying from existing device memory, whereas the "slow" run is populated by a kernel. The "slow" run also has a lower overall memory footprint (i.e., less total memory used). I have run cuda-memcheck on both runs and it reports no errors in either case.
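For context, the two population paths look roughly like this. This is a minimal sketch, not the actual proprietary code; `dst`, `src`, `n`, `some_value`, and the launch configuration are all placeholder names:

```cuda
// Hypothetical sketch of the two ways the input buffer gets populated.
__global__ void populate(float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = some_value(i);  // placeholder for the real per-element initialization
}

void fill_fast(float *dst, const float *src, int n) {
    // "Fast" run: destination is filled by a device-to-device copy.
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
}

void fill_slow(float *dst, int n) {
    // "Slow" run: destination is filled by a kernel instead.
    populate<<<(n + 255) / 256, 256>>>(dst, n);
}
```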

Does anyone have any idea why the performance characteristics of the two runs are so different? I can see differences in the Memory Throughput Breakdown, but I have no idea why they exist. The kernels surrounding this one (i.e., called before and after it) are the same as well, and there is a synchronization between the kernels.

I can post screenshots of these tables if it will help. I am running on a single A100 (the same GPU is used in both runs) with CUDA 11.4. I have noticed the same behavior on a V100 with CUDA 11.2.

You could investigate whether there are L2 cache hit-rate differences between the two cases. I don't really have enough information here to say that the L2 cache is definitely a consideration/concern (it may not be), so it's just a guess.
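If it helps, the L1 and L2 sector hit rates can be collected directly with the Nsight Compute CLI and compared between the two runs. For example (the application and kernel names here are placeholders):

```shell
# Collect L1 and L2 (LTS) sector hit rates for the kernel in question.
ncu --kernel-name myKernel \
    --metrics l1tex__t_sector_hit_rate.pct,lts__t_sector_hit_rate.pct \
    ./my_app
```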

Hi Robert, thanks for the quick reply.

I have attached a screenshot of the two runs (slow on the left, fast on the right). Nsight Compute says the grid is small; however, the timings for both kernels are consistent. I have verified the timings with Nsight Systems (the successor to nvprof): the standard deviation of the kernel times is slightly higher for the fast run, but the average times are consistent with the figures given here. The kernel is called tens to hundreds of thousands of times in a typical run.

The DRAM "cycles active" percentage is higher for the slow kernel (probably the result of more cache misses?). I am not sure what could be causing this difference, though.

EDIT: I have found the bug. One of my arrays was too large, which led to a lot of cache misses.