better performance from underpopulated warps

I have an application with 2048 threads. If I group the threads as 64 blocks (each block having 32 threads), kernel computation takes ~90ms. If I group the threads as 128 blocks (each block having 16 threads), kernel computation only takes ~30ms. Can anyone think of any reason why this would happen?

I am using an 8800GTX. Here is the register/memory usage I get when I compile with nvcc --ptxas-options -v in case it is relevant:

ptxas info : 0 bytes lmem, 24 bytes smem, 1242688 bytes cmem, 35 registers

Thanks for any ideas.

Is this kernel memory or compute bound? I could imagine scenarios for memory-bound kernels where half-full warps would be faster if that resulted in more blocks. Memory coalescing actually happens at the half-warp level, so a 16-thread block probably doesn’t waste any memory bandwidth. More blocks could potentially hide some of the memory latency.

What is the occupancy shown by the profiler for these two cases?


I believe that the kernel is memory bound, especially since I am not yet using coalescing. That is an interesting point about half warps possibly allowing a greater degree of parallelism than full warps. However, occupancy is the same for both cases (both are .125) according to the profiler, so I don’t believe that’s it - the GPU is not able to take advantage of that greater degree of parallelism (the 64 block kernel already hits the register limit).

Your GPU has 16 multiprocessors, each capable of executing 8 active blocks at the same time. Each multiprocessor has 8192 registers, which are shared among its active blocks. So in the worst case, with the maximum number of active blocks (8 per multiprocessor, 128 per GPU), you can have 8192/8 = 1024 registers per block.

If your block has 16 threads, each thread could still use 64 registers.
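The register arithmetic in this post can be checked with a few lines of Python. This is only a rough sketch of the naive model described above, where a block's footprint is (registers per thread) * (threads per block). The hardware figures are the ones quoted in the thread, the function name is made up for illustration, and any allocation granularity the hardware applies is ignored.

```python
# Naive register budget: footprint = regs per thread * threads per block.
# Hardware figures are the ones quoted in the thread for the 8800 GTX.
REGISTERS_PER_SM = 8192
MAX_BLOCKS_PER_SM = 8
REGS_PER_THREAD = 35  # from the ptxas output earlier in the thread


def naive_blocks_per_sm(threads_per_block, regs_per_thread=REGS_PER_THREAD):
    """Blocks that fit on one multiprocessor under the naive model
    (hypothetical helper, ignores allocation granularity)."""
    regs_per_block = threads_per_block * regs_per_thread
    return min(MAX_BLOCKS_PER_SM, REGISTERS_PER_SM // regs_per_block)


# Per-block budget when all 8 block slots are in use.
budget_per_block = REGISTERS_PER_SM // MAX_BLOCKS_PER_SM

print("registers per block with 8 active blocks:", budget_per_block)
print("registers per thread at 16 threads/block:", budget_per_block // 16)
print("naive active blocks, 16 threads/block:", naive_blocks_per_sm(16))
print("naive active blocks, 32 threads/block:", naive_blocks_per_sm(32))
```

Under this naive model the 16-thread blocks fit 8 per multiprocessor, while the 32-thread blocks at 35 registers per thread fit only 7.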

The difference in execution time could be due to many things, such as:

  1. Whether the code path is the same for all threads

  2. The way memory is accessed

  3. Thread synchronization (if any)

  4. The type of your data and its organization over the banked memory structure

  5. The number of registers (this is not the case in your example)

If memory access is involved (it almost always is), the difference in execution time could be due to bank organization.

An easy example is to imagine that each thread has its own properly aligned memory location. In the 128-block case, the 16 threads of one half-populated warp access 16 memory locations in 16 different banks, which implies only one “stall” due to memory latency, because each bank is accessed only once. In the 64-block case, the 32 threads of a warp access 32 memory locations in the same 16 banks, so each bank is accessed twice from the same warp, which implies two “stalls”: one for the first bank access and one for the second.
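The bank-counting argument in this post can be written out as a toy model. The sketch below assumes, purely for illustration, that thread i accesses word i and that word i lives in bank i % 16. The function names are hypothetical. It reproduces the post's reasoning only and is not a measurement of how the hardware actually schedules the accesses.

```python
from collections import Counter

NUM_BANKS = 16  # bank count stated in the post above


def accesses_per_bank(num_threads, num_banks=NUM_BANKS):
    """Count how many threads of one warp hit each bank.

    Assumption (illustrative only): thread i touches word i,
    and word i is stored in bank i % num_banks.
    """
    return Counter(i % num_banks for i in range(num_threads))


def serialized_rounds(num_threads):
    """Number of "stalls" in the post's model: the largest number
    of accesses that land on any single bank."""
    return max(accesses_per_bank(num_threads).values())


print("16 threads per warp:", serialized_rounds(16), "round(s)")
print("32 threads per warp:", serialized_rounds(32), "round(s)")
```

In this model the half-populated warp needs one round and the full warp needs two, which is the one-stall versus two-stall picture described above.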

Isn’t this the same thing that seibert was suggesting, or am I missing something? Basically that with 16 threads/block you would be using double the amount of threads (say 2*n threads), but each one only requires half the memory access (but the same amount of arithmetic), and if we can interleave the 2n threads together we could conceivably get a faster runtime. But if we can’t interleave the two batches of n threads, and we need to do the first n, and then the second n, sequentially, then I don’t see how it could be faster. And in this case, since occupancy is the same with 64 blocks and 128 blocks, that means we don’t get to take advantage of the extra parallelism, and I don’t see how it would be faster. Please correct me if you were suggesting something else, thanks.

It is not faster in general, and I don’t know how you concluded that from my post.

I am saying that in this specific situation 128 half-populated warps are faster than 64 fully populated ones.

But that holds only for this specific situation, not in general, because this GPU has 128 cores. So in this specific situation, 128 blocks reach the GPU’s limit of cores. For a larger number of threads, let’s say 4096, you cannot benefit from this if each thread still needs to access memory, because memory is banked at a fixed 16 banks (you would need 256 cores, or with 128 cores you would need a warp of 64 threads and memory banked into 32 banks). So if you try to make 256 blocks of half-populated warps with 16 threads each, those blocks cannot all execute at the same time, since the GPU has only 128 cores. They will be executed as 2x128 blocks, meaning each warp also makes two memory accesses, the same as 128 blocks of fully populated warps with 32 threads (also two memory accesses).

So you should get the same results.

OK, I was assuming that the GPU was not taking advantage of the additional parallelism offered by the additional blocks, based on the profiler reporting the same occupancy for the 64 and 128 block cases in my previous post. Your theory is based on the assumption that the GPU is able to take advantage of that parallelism. I agree that if that assumption holds, then half-populated warps could give better performance. But I don’t believe that assumption holds here, unless the profiler is incorrect or I am misinterpreting its output.

You made an argument earlier in the thread that all 128 blocks should be able to be active at the same time, but I don’t believe it’s correct. The formula for the register footprint of a block is more complicated than (registers per thread) * (threads per block), according to cell B34 of the Occupancy Calculator. According to that formula, only 48 blocks total (regardless of whether we use 32 threads/block or 16 threads/block) of my kernel can be active at a time on my GeForce card, due to the register limit.
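For anyone who wants to reproduce the 48-block figure, here is a rough sketch of one register-footprint rule that gives it. The allocation constants (warps rounded up to a multiple of 2, registers rounded up to a multiple of 256, 24 resident warps per multiprocessor) are assumptions about this generation of hardware, and I can’t vouch that this is exactly what cell B34 computes. It does match both the 48 active blocks and the .125 occupancy quoted in the thread.

```python
import math

# Figures quoted in the thread.
REGISTERS_PER_SM = 8192
NUM_SMS = 16
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32

# Assumptions (not stated in the thread) about allocation on this GPU.
MAX_WARPS_PER_SM = 24        # assumed resident-warp limit per multiprocessor
WARP_ALLOC_GRANULARITY = 2   # assumed: warps per block rounded up to a multiple of 2
REG_ALLOC_UNIT = 256         # assumed: registers per block rounded up to a multiple of 256


def round_up(value, multiple):
    """Round value up to the nearest multiple."""
    return math.ceil(value / multiple) * multiple


def block_register_footprint(threads_per_block, regs_per_thread):
    """Registers reserved for one block under the assumed rounding rules."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    padded_warps = round_up(warps, WARP_ALLOC_GRANULARITY)
    return round_up(padded_warps * WARP_SIZE * regs_per_thread, REG_ALLOC_UNIT)


def active_blocks_per_sm(threads_per_block, regs_per_thread):
    """Blocks that fit on one multiprocessor, limited by registers and block slots."""
    footprint = block_register_footprint(threads_per_block, regs_per_thread)
    return min(MAX_BLOCKS_PER_SM, REGISTERS_PER_SM // footprint)


def occupancy(threads_per_block, regs_per_thread):
    """Resident warps divided by the assumed per-multiprocessor warp limit."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    return active_blocks_per_sm(threads_per_block, regs_per_thread) * warps / MAX_WARPS_PER_SM


for tpb in (16, 32):
    blocks = active_blocks_per_sm(tpb, 35)
    print(tpb, "threads/block:",
          block_register_footprint(tpb, 35), "registers per block,",
          blocks * NUM_SMS, "active blocks on the GPU,",
          "occupancy", occupancy(tpb, 35))
```

Because a 16-thread block still occupies one warp, both launch configurations get the same footprint under these rules, which is consistent with the profiler reporting identical occupancy for the two cases.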

Also, as a test I tried increasing the number of threads to 16384. If I group those threads 32 to a block (512 blocks), the runtime is 719ms. If I group them 16 to a block (1024 blocks), the runtime is 266ms, so this effect persists even with a larger number of blocks. Occupancy for both is .125.
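As a quick sanity check on these numbers, the sketch below computes the speedup of 16-thread blocks over 32-thread blocks at each problem size, and how the runtime scales when the thread count grows 8x. The timings are the ones reported in this thread (the 2048-thread figures were given as approximate), and the helper names are made up for illustration.

```python
# Runtimes in milliseconds reported in the thread:
# timings_ms[total_threads][threads_per_block]
timings_ms = {
    2048: {32: 90.0, 16: 30.0},     # reported as approximate (~90 ms, ~30 ms)
    16384: {32: 719.0, 16: 266.0},
}


def speedup(total_threads):
    """Runtime with 32 threads/block divided by runtime with 16 threads/block."""
    t = timings_ms[total_threads]
    return t[32] / t[16]


def scaling(threads_per_block):
    """Runtime growth when going from 2048 to 16384 threads (8x the work)."""
    return timings_ms[16384][threads_per_block] / timings_ms[2048][threads_per_block]


for n in timings_ms:
    print(n, "threads: speedup from 16-thread blocks =", round(speedup(n), 2))

for tpb in (32, 16):
    print(tpb, "threads/block: runtime grows by", round(scaling(tpb), 2), "for 8x the threads")
```

Both configurations scale roughly in proportion to the thread count, and the gap between them stays close to 3x at both sizes.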