Overhead of launching thread blocks

The Programming Guide says thread scheduling has zero overhead. What about the overhead of launching thread blocks? I see cases where performance improves if I launch fewer but bigger blocks (there is no synchronization within a thread block, so both schemes basically do the same work). And of course, I’m concerned about launching hundreds of thousands of blocks…


Can you give an example? I suspect you have a microbenchmark that is not very representative of real performance. It is normally good to have lots of blocks, so that your implementation scales with newer GPUs.

You should also be aware that bigger thread blocks can help with some kinds of in-block data dependency issues. While you need 32 threads to fill a warp, you need 64 to make sure you don’t lose performance to read-after-write dependencies.

It turned out that the “maximum 8 active blocks” limit per multiprocessor was the culprit: it capped the total number of active warps. I originally had 32 threads per block, so only 256 threads from 8 blocks were scheduled at a time (still above the 192 recommended for avoiding read-after-write waits).
Going to a bigger block size let me have more active threads and do a better job of hiding memory latency…
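
For anyone who wants to check the arithmetic, here is a minimal host-side sketch in plain C++ (no GPU needed). The function name `active_threads_per_sm` and its parameters are made up for illustration. The 8-block limit comes from this thread; the per-multiprocessor thread limit is an assumption you should look up for your own device, and register and shared memory limits are ignored entirely.

```cpp
#include <algorithm>

// Rough estimate of how many threads can be resident on one multiprocessor.
// threads_per_block:  block size chosen at kernel launch.
// max_blocks_per_sm:  active-block limit (8 in the case discussed above).
// max_threads_per_sm: ASSUMED hardware thread limit; check your device.
// Register and shared memory usage are NOT modeled in this sketch.
int active_threads_per_sm(int threads_per_block,
                          int max_blocks_per_sm,
                          int max_threads_per_sm) {
    if (threads_per_block <= 0 || threads_per_block > max_threads_per_sm) {
        return 0;  // such a block could not be resident at all
    }
    // Blocks that fit if only the thread limit mattered.
    int blocks_by_threads = max_threads_per_sm / threads_per_block;
    // The tighter of the two limits decides how many blocks are resident.
    int resident_blocks = std::min(max_blocks_per_sm, blocks_by_threads);
    return resident_blocks * threads_per_block;
}
```

With 32 threads per block and an 8-block limit, `active_threads_per_sm(32, 8, 768)` gives 256, matching the number above. Assuming a 768-thread limit, `active_threads_per_sm(128, 8, 768)` gives 768, so the block limit no longer caps the number of active threads.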