When running the kernel below on a GV100, performance improves significantly when the block size is changed from 64 threads to 128 threads (from 28 us to 15 us).
__global__ void SimpleCopy(float* __restrict__ dest, const float* __restrict__ src)
{
    // No bounds check needed: the grid exactly covers the buffer in both configurations.
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    dest[index] = src[index];
}
Yes, I know that 64 seems like a small block size; however, even after putting all the numbers together, I can't explain the speedup.
For the first configuration, the kernel is launched with 64 threads/block and 13440 blocks.
For the second configuration, the kernel is launched with 128 threads/block and 6720 blocks.
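Note that both configurations launch the same total number of threads: 13440 × 64 = 6720 × 128 = 860,160, one per element. For reference, here is a minimal, self-contained repro sketch (the host scaffolding and event timing here are illustrative, not my exact harness; only the kernel and the two launch shapes come from above):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void SimpleCopy(float* __restrict__ dest, const float* __restrict__ src)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    dest[index] = src[index];
}

// Launch the kernel once as a warm-up, then time a second launch with CUDA events.
static float TimeLaunchUs(float* dest, const float* src, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    SimpleCopy<<<blocks, threads>>>(dest, src);   // warm-up launch
    cudaEventRecord(start);
    SimpleCopy<<<blocks, threads>>>(dest, src);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f;                          // convert ms to us
}

int main()
{
    const int n = 13440 * 64;                     // 860,160 elements in both configs
    float *src, *dest;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dest, n * sizeof(float));
    printf("64 threads/block:  %.1f us\n", TimeLaunchUs(dest, src, 13440, 64));
    printf("128 threads/block: %.1f us\n", TimeLaunchUs(dest, src, 6720, 128));
    cudaFree(src);
    cudaFree(dest);
    return 0;
}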
GV100 is compute capability 7.0, so the maxima are 32 blocks/SM, 64 warps/SM, and 2048 threads/SM.
No shared memory is used, and Nsight Compute reports a register count of 16 per thread, so neither is the limiting factor when computing occupancy.
Doing the math (and double-checking it with the occupancy-API sketch after this list), we should have:
- For the first configuration (grid size 13440 blocks, block size 64 threads):
  32 blocks/SM <=> 2048 threads/SM <=> 64 warps/SM
- For the second configuration (grid size 6720 blocks, block size 128 threads):
  16 blocks/SM <=> 2048 threads/SM <=> 64 warps/SM
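The hand calculation can be verified programmatically with the runtime occupancy API (a minimal sketch; the printout format is mine):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void SimpleCopy(float* __restrict__ dest, const float* __restrict__ src)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    dest[index] = src[index];
}

int main()
{
    const int blockSizes[2] = {64, 128};
    for (int i = 0; i < 2; ++i) {
        int blocksPerSM = 0;
        // Resident blocks per SM for this kernel at the given block size (no dynamic shared memory)
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, SimpleCopy, blockSizes[i], 0);
        printf("block size %3d: %2d blocks/SM = %2d warps/SM\n",
               blockSizes[i], blocksPerSM, blocksPerSM * blockSizes[i] / 32);
    }
    return 0;
}

On compute capability 7.0 this should report 32 blocks/SM and 16 blocks/SM respectively, i.e. 64 warps/SM in both cases.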
So, the same number of warps per SM. Moreover, my GV100 reports 80 SMs, and both grid sizes are a multiple of 80; in fact both work out to 13440 / (80 × 32) = 6720 / (80 × 16) = 5.25 waves of blocks, so any tail effect is identical in the two cases.
But then why the performance difference? Where else does the block size play any role?
In Nsight Compute I see that Memory Throughput, Mem Busy, and Mem Pipes Busy are almost double for the second configuration, but I don't understand why. I would have expected the same values, since we have the same number of warps per SM in both scenarios.
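For context, here is my back-of-the-envelope conversion of the quoted timings into effective bandwidth (the arithmetic uses only the numbers above; GV100's HBM2 peak is on the order of 900 GB/s):

#include <stdio.h>

int main(void)
{
    // 860,160 floats, each read once and written once
    const double bytes = 860160.0 * 4.0 * 2.0;                       // ~6.88 MB moved
    printf("64 threads/block:  %.0f GB/s\n", bytes / 28e-6 / 1e9);   // ~246 GB/s
    printf("128 threads/block: %.0f GB/s\n", bytes / 15e-6 / 1e9);   // ~459 GB/s
    return 0;
}

So neither configuration comes close to saturating the memory system, which makes the near-2x difference in the memory metrics all the more puzzling to me.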