Kernel execution time varies depending on grid

Hey, I’m testing grid configurations for a given kernel. Based on the information the CUDA occupancy calculator prints, I’ve selected the number of threads per block that maximizes the number of active warps per multiprocessor, in this case 6 active warps.

Here are the results I get for my kernel, which uses 69 registers, 84 bytes of smem and 52 bytes of cmem:
(threads per block, number of blocks) kernel execution time [ms]
(192,113) 1.580
(176,123) 1.625
(64,338) 1.565
(48,450) 1.620

Each execution time is an average over 100 samples of kernel execution (the whole application was run 100 times, not just the kernel). Is there any particular reason why the execution times differ? The stopwatch I’m using has a precision of 0.5 microseconds, so the difference isn’t due to timer precision. The rule of thumb is that the optimum grid is application dependent, but as far as I can see the number of threads per block does not correlate with the execution time for my specific kernel.

What’s more, a similar kernel running the same kind of code, i.e. LDPC decoding of the same code at a different coding rate, actually shows an anticorrelation between the number of threads per block and the minimum execution time. The conclusions I reach for one coding rate are not the same as the ones I reach for the other.

I guess what pretty much sums up my question is: why is there such a grid dependency and application dependency? Does it have to do with the fact that 192 and 64 threads per block are multiples of the warp size? Do the 16 idle threads in the last warp of each block, with either 176 or 48 threads per block, generate that much overhead?

Your block size should definitely be a multiple of the warp size (or better, of 64 = twice the warp size). Otherwise the last warp of each block is only partly filled and the cores for the missing threads sit idle (although, as far as I’m aware, there is no additional overhead associated with those idle cores).

Other than that, the differences are probably caused by different memory access patterns. So, testing on different cards (with different bus widths) will likely add to the anticorrelation you’ve already seen.