Assume a kernel uses 24 registers per thread and no shared memory. The target is a Kepler GPU (GTX 680, compute capability 3.0) with 8 multiprocessors (SMX).
I am tuning the kernel's block size for "good" occupancy (I understand this is different from execution time). After experimenting with the CUDA Occupancy Calculator spreadsheet, I found multiple configurations that reach 100% theoretical occupancy, e.g. block sizes of 128, 256, or 512. All three reach the maximum threads and warps per SM; the only difference is the number of active thread blocks per SM, which is 16, 8, and 4 respectively.
Now, assuming all three configurations give good hardware utilization, I want to pick the fastest one. Should I use a large block of 512 threads or a small block of 128? What should I consider at this point? Is it the hardware overhead of block switching? Does the rule of thumb "larger blocks hide latency better" still hold here? After all, every one of these configurations yields the same number of resident warps to schedule.
After some benchmarking (a simple windowing operator on a grayscale image), I got the following results:
Execution time: 128 (1.049 ms), 256 (1.063 ms), 512 (1.101 ms). Each value is the median of 50 runs.
Achieved occupancy: 128 (82.1%), 256 (79.1%), 512 (77.6%), measured with nvprof.
These results point toward smaller blocks: many small blocks seem to achieve better occupancy than a few large ones. But my initial questions remain largely unanswered. What am I missing here?