OK, I was assuming that the GPU was not taking advantage of the additional parallelism offered by the extra blocks, based on the profiler reporting the same occupancy for the 64- and 128-block cases in my previous post. Your theory assumes that the GPU is able to exploit that parallelism. I agree that if that assumption holds, half-populated warps could give better performance, but I don’t believe it holds here, unless the profiler is incorrect or I am misinterpreting its output.
You argued earlier in the thread that all 128 blocks should be able to be active at the same time, but I don’t believe that’s correct. The formula for the register footprint of a block is more complicated than (registers per thread) * (threads per block), according to cell B34 of the Occupancy Calculator ( http://forums.nvidia.com/index.php?showtopic=31279 ). By that formula, only 48 blocks of my kernel can be active at a time on my GeForce card (regardless of whether I use 32 threads/block or 16 threads/block), due to the register limit.
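The footprint argument can be sketched numerically. Below is a minimal Python sketch under stated assumptions: the 64-thread allocation granularity, 256-register rounding step, and 8192-register file per SM are taken from the Occupancy Calculator's defaults for compute 1.x hardware, and the figure of 40 registers per thread is purely hypothetical, chosen just to illustrate the rounding.

```python
def ceil_to(x, multiple):
    """Round x up to the nearest multiple."""
    return -(-x // multiple) * multiple

def regs_per_block(threads_per_block, regs_per_thread,
                   thread_granularity=64, reg_granularity=256):
    # Assumption (from the Occupancy Calculator's compute 1.x formula):
    # the thread count is first rounded up to the allocation granularity,
    # so a 16-thread block costs the same registers as a 32-thread block.
    alloc_threads = ceil_to(threads_per_block, thread_granularity)
    return ceil_to(alloc_threads * regs_per_thread, reg_granularity)

def blocks_per_sm(threads_per_block, regs_per_thread, regfile=8192):
    # How many blocks the register file alone allows to be resident.
    return regfile // regs_per_block(threads_per_block, regs_per_thread)

# With the hypothetical 40 registers/thread, both 16- and 32-thread
# blocks round up to 64 allocated threads, i.e. 64 * 40 = 2560 registers,
# so the register limit caps both configurations identically.
print(regs_per_block(16, 40))   # 2560
print(regs_per_block(32, 40))   # 2560
print(blocks_per_sm(16, 40))    # 3
```

This is why halving the threads per block does not buy more resident blocks here: the per-block register cost is quantized to the same allocation size in both cases.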
Also, as a test I increased the number of threads to 16384. Grouped 32 to a block (512 blocks), the runtime is 719 ms; grouped 16 to a block (1024 blocks), it is 266 ms, so the effect persists even with a larger number of blocks. Occupancy for both is 0.125.