I wouldn’t worry about the if statements. With all the memory accesses in your kernel, your performance is almost certainly memory bound.
DenisR is correct that your problem is memory coalescing. All of your memory accesses go through global memory pointers, and with the way you index them, not all of them will be coalesced. At certain dimensions, however, some of the accesses do coalesce, which leads to the performance spikes you see.
You can use the CUDA visual profiler (download from the forum sticky post) to count the number of uncoalesced memory accesses vs coalesced ones. If you are unaware, the difference in performance can be an order of magnitude. See the programming guide for all the gory details on how to coalesce.
Things to do to significantly boost the performance of your kernel:
- Your access pattern for reading “tab” is perfect for the 2D texture cache, so read it through a 2D texture.
- Make sure that the final write to “var” is coalesced. To do this, you will need to use cudaMallocPitch to allocate your 2D memory with some padding at the end of each row to ensure coalescing.
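For the second point, here is a minimal sketch of a pitched allocation with a coalesced write. I don't have your kernel in front of me, so the names write_var and fill_pitched, the float element type, the 32x8 block shape, and the placeholder value written are all my assumptions. The parts that matter are stepping between rows by the pitch in bytes and letting threadIdx.x run along a row.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: each thread writes one element of a pitched 2D array.
// Consecutive threadIdx.x values touch consecutive addresses within a row,
// and cudaMallocPitch aligns the start of every row, so the writes can coalesce.
__global__ void write_var(float *var, size_t pitch_bytes, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column: fastest-varying index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {
        // Step to row y in bytes (the pitch is a byte count), then index by element.
        float *row = (float *)((char *)var + (size_t)y * pitch_bytes);
        row[x] = (float)(y * width + x);  // placeholder value, not your computation
    }
}

// Allocates a padded width x height array, fills it on the device, and copies
// it back into a tightly packed host buffer. Returns the first error seen.
cudaError_t fill_pitched(float *host_out, int width, int height)
{
    float *d_var = nullptr;
    size_t pitch = 0;  // row stride in bytes, chosen by the runtime
    cudaError_t err = cudaMallocPitch((void **)&d_var, &pitch,
                                      width * sizeof(float), height);
    if (err != cudaSuccess) return err;

    dim3 block(32, 8);  // assumed shape: 32 threads along a row
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    write_var<<<grid, block>>>(d_var, pitch, width, height);
    err = cudaGetLastError();  // catches launch configuration errors

    if (err == cudaSuccess) {
        // Source rows are pitch bytes apart; destination rows are packed.
        err = cudaMemcpy2D(host_out, width * sizeof(float),
                           d_var, pitch,
                           width * sizeof(float), height,
                           cudaMemcpyDeviceToHost);
    }
    cudaFree(d_var);
    return err;
}
```

Calling fill_pitched(out, 5, 3) with a 15-element host buffer should leave out[i] == i. The padding never shows up on the host side because cudaMemcpy2D skips it.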
Edit: I forgot to add that if you want to see how well you are using the device, count the number of bytes your kernel reads and writes, then divide by the running time of the kernel to get an effective bandwidth in GiB/s. Something near 70 GiB/s should be attainable in your case.
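To make the arithmetic concrete: a kernel that reads 256 MiB and writes 256 MiB in 10 ms moves 0.5 GiB in 0.01 s, which is 50 GiB/s. Below is a rough sketch of measuring this with CUDA events. The copy_kernel is only a stand-in I made up (one float read and one float written per element), and the helper names are hypothetical; substitute your own kernel and your own byte counts.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Effective bandwidth in GiB/s: total bytes moved divided by elapsed time.
double effective_gib_per_s(size_t bytes_read, size_t bytes_written, double elapsed_ms)
{
    const double gib = 1024.0 * 1024.0 * 1024.0;
    double seconds = elapsed_ms / 1000.0;
    return (double)(bytes_read + bytes_written) / gib / seconds;
}

// Hypothetical stand-in kernel: reads one float and writes one float per element.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Times a single launch with CUDA events and returns the effective GiB/s,
// or a negative value if allocation or the launch fails.
double measure_copy_bandwidth(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0;
    }
    cudaMemset(d_in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);

    if (err != cudaSuccess || ms <= 0.0f) return -1.0;
    // The kernel reads `bytes` and writes `bytes`.
    return effective_gib_per_s(bytes, bytes, ms);
}
```

A single launch gives a noisy number, so in practice you would average over several launches after a warm-up run.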