Same kernel and data exhibit different performance

I have a custom kernel whose performance differs between runs, even though the data is identical in both runs. I have run the kernel through Nsight Compute, and the GPU Throughput charts show different results.

For the fast run, the kernel shows 8.11% Compute and 33.78% Memory throughput (776 usec runtime). For the slow run, it shows 6.80% Compute and 51.60% Memory (924 usec runtime).

This is the exact same kernel (built into a library and called externally). The data passed to the kernel is identical, and the grid and block dimensions are identical. The achieved occupancy is 13.70 vs. 13.69 and the achieved active warps are 8.77 vs. 8.76 for the two runs.

This kernel is called thousands of times per run, so the performance difference has a noticeable impact on the overall performance of the code. Sadly, I cannot post the code here, as it's proprietary.

The data passed to the kernel is allocated the same way in both runs; the only difference is how the data is populated. The "fast" run is populated by copying from existing device memory, whereas the "slow" run is populated by a kernel. The "slow" run also has a lower overall memory footprint (i.e., less total memory used). I have run cuda-memcheck on both runs and it reports no errors in either case.
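For context, the two population paths look roughly like this. This is a minimal sketch, not the actual proprietary code; `dst`, `src`, `n`, `some_value`, and the launch configuration are all placeholder names:

```cuda
// Hypothetical sketch of the two ways the input buffer gets populated.
__global__ void populate(float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = some_value(i);  // placeholder for the real per-element initialization
}

void fill_fast(float *dst, const float *src, int n) {
    // "Fast" run: destination is filled by a device-to-device copy.
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
}

void fill_slow(float *dst, int n) {
    // "Slow" run: destination is filled by a kernel instead.
    populate<<<(n + 255) / 256, 256>>>(dst, n);
}
```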

Does anyone have any idea why the performance characteristics of the two runs are so different? I can see differences in the Memory Throughput Breakdown, but I have no idea why they exist. The kernels surrounding this one (i.e., called before and after it) are the same as well, and there is a synchronization between the kernels.

I can post screenshots of these tables if it will help. I am running on a single A100 (the same GPU is used in both runs) with CUDA 11.4. I have noticed the same behavior on a V100 with CUDA 11.2.

You could investigate whether there are L2 cache hit-rate differences between the two cases. I don't really have enough information here to say that the L2 cache is definitely a consideration/concern (it may not be), so it's just a guess.
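If it helps, the L1 and L2 sector hit rates can be collected directly with the Nsight Compute CLI and compared between the two runs. For example (the application and kernel names here are placeholders):

```shell
# Collect L1 and L2 (LTS) sector hit rates for the kernel in question.
ncu --kernel-name myKernel \
    --metrics l1tex__t_sector_hit_rate.pct,lts__t_sector_hit_rate.pct \
    ./my_app
```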

Hi Robert, thanks for the quick reply.

I have attached a screenshot of the two runs (slow on the left, fast on the right). Nsight Compute says the grid is small; however, the timings for both kernels are consistent. I have verified the timings with Nsight Systems (the successor to nvprof): the standard deviation of the kernel times is slightly higher for the fast run, but the average times are consistent with the figures given here. The kernel is called tens to hundreds of thousands of times in a typical run.

The DRAM "cycles active" percentage is higher for the slow kernel (probably the result of more cache misses?). I am not sure what could be causing this difference, though.

EDIT: I have found the bug. One of my arrays was too large, which led to a lot of cache misses.