I have a custom kernel that exhibits different performance across two runs, even though the data in the two runs is identical. I have run the kernel through Nsight Compute, and the GPU Throughput charts show different results.
For the fast run, the kernel shows 8.11% Compute and 33.78% Memory throughput (776 µs runtime). For the slow run, it shows 6.80% Compute and 51.60% Memory throughput (924 µs runtime).
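For reference, this is roughly how I would cross-check a single launch's duration with CUDA events outside the profiler; my_kernel and its arguments below are placeholders, not the real code:

```cpp
#include <cuda_runtime.h>

// Placeholder standing in for the proprietary kernel.
__global__ void my_kernel(const float* in, float* out, int n) { /* ... */ }

// Time one launch with CUDA events, independently of Nsight Compute.
float time_one_launch(const float* d_in, float* d_out, int n,
                      dim3 grid, dim3 block) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the launch to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // compare against the ~776/924 µs the profiler reports
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```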
This is the exact same kernel (built into a library and called externally). The data passed to the kernel is identical, and the launch configuration (number of blocks and threads) is identical. Achieved occupancy is 13.70% vs. 13.69%, and achieved active warps per SM are 8.77 vs. 8.76, for the fast and slow runs respectively.
This kernel is called thousands of times per run, so the performance difference has a noticeable impact on the overall performance of the code. Sadly, I cannot post the code here as it’s proprietary.
The data being passed to the kernel is allocated the same way in both runs. The only difference between the runs is how the data is populated: the “fast” run’s input is populated by copying from existing device memory, whereas the “slow” run’s input is populated by a kernel. The “slow” run also has a lower overall memory footprint (i.e., less total device memory in use). I have run cuda-memcheck on both runs and it reports no errors in either case.
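To give a concrete picture without posting the proprietary code, the two population paths are roughly as follows; populate_kernel and the buffer names are placeholders, and the real initialization logic is more involved:

```cpp
#include <cuda_runtime.h>

// Placeholder initialization kernel; the real one computes values
// rather than writing a constant.
__global__ void populate_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.0f;  // stand-in for the real computed value
}

// "Fast" run: the input buffer is filled by a device-to-device copy
// from memory that already holds the data.
void populate_fast(float* d_input, const float* d_src, int n) {
    cudaMemcpy(d_input, d_src, n * sizeof(float), cudaMemcpyDeviceToDevice);
}

// "Slow" run: the same buffer is filled by a kernel instead.
void populate_slow(float* d_input, int n) {
    populate_kernel<<<(n + 255) / 256, 256>>>(d_input, n);
    cudaDeviceSynchronize();
}
```

Either way, the kernel under investigation receives identical data.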
Does anyone have any idea why the performance characteristics of the two runs are so different? I can see differences in the Memory Throughput Breakdown tables, but I have no idea why they exist. The kernels surrounding this kernel (i.e., those called before and after it) are also the same in both runs, and there’s a synchronize between the kernels.
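The launch sequence around the kernel looks roughly like the following sketch in both runs; the kernel names are placeholders, and cudaDeviceSynchronize stands in for the actual synchronization call:

```cpp
#include <cuda_runtime.h>

// Placeholder kernels standing in for the proprietary ones.
__global__ void kernel_before()      { /* ... */ }
__global__ void kernel_in_question() { /* ... */ }  // the kernel whose runtime varies
__global__ void kernel_after()       { /* ... */ }

void run_sequence(dim3 grid, dim3 block) {
    kernel_before<<<grid, block>>>();
    cudaDeviceSynchronize();                // synchronize between kernels
    kernel_in_question<<<grid, block>>>();
    cudaDeviceSynchronize();
    kernel_after<<<grid, block>>>();
    cudaDeviceSynchronize();
}
```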
I can post screenshots of these tables if that would help. I am running on a single A100 (the same GPU in both runs) with CUDA 11.4, and I have noticed the same behavior on a V100 with CUDA 11.2.