I am curious. Since the GeForce 8800 GS has 12 multiprocessors with 8 processors each, does that mean that if I launch 12 blocks of 8 threads each, I am getting the maximum amount of parallelism possible? Is there a speed benefit to specifying more threads? Is there a performance hit?



It's a bit trickier than that (it also depends on memory usage, etc.). On the number of blocks you are on the right track: it should most likely be a multiple of 12 (for the GTS-class parts). The number of threads per block should most likely be a multiple of 32. The occupancy calculator in the thread below should be helpful.
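To make the "multiple of 32, multiple of 12" advice concrete, here is a minimal host-side sketch (plain C++, no CUDA needed to compile it; the constants 32 and 12 are taken from this thread, and the function names are my own):

```cpp
// Round a requested thread count up to a multiple of the warp size (32),
// and a requested block count up to a multiple of the SM count
// (12 on the 8800 GS/GTS-class parts discussed in this thread).
constexpr int kWarpSize = 32;
constexpr int kNumSMs   = 12;

int roundUpThreads(int requested) {
    // Ceiling division, then scale back up: any partial warp is padded out.
    return ((requested + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

int roundUpBlocks(int requested) {
    return ((requested + kNumSMs - 1) / kNumSMs) * kNumSMs;
}
```

So a launch of 8 threads per block still occupies a full warp of 32 (`roundUpThreads(8)` gives 32), which is why 8 threads per block wastes 3/4 of each warp.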

As was mentioned, you want a multiple of 32 threads per block because that is the number that are run at once (a “warp”) due to pipelining and other scheduling issues. If possible, you also want more than 12 blocks, because more than one block (from the same kernel) can be interleaved on a given multiprocessor, and this can hide some of the latency to global memory. The number of blocks you can run simultaneously will be limited by your shared memory requirements, and the number of threads per block will be limited by the number of registers each thread needs.
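The resource limits described above can be sketched as a back-of-the-envelope calculation. This is a simplified model, not the real scheduler: I am assuming G80-class limits of 16 KB shared memory and 8192 registers per multiprocessor, and ignoring allocation granularity and the per-SM thread-count cap:

```cpp
#include <climits>

// Assumed per-SM resource limits for a G80-class part (see lead-in).
constexpr int kSharedMemPerSM = 16 * 1024; // bytes
constexpr int kRegistersPerSM = 8192;

// Rough estimate of how many blocks of a kernel can be resident on one
// SM at once, given the block's shared memory use (bytes) and each
// thread's register use. The tighter of the two limits wins.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int sharedPerBlock) {
    int bySmem = sharedPerBlock > 0 ? kSharedMemPerSM / sharedPerBlock
                                    : INT_MAX;
    int byRegs = regsPerThread > 0
                     ? kRegistersPerSM / (regsPerThread * threadsPerBlock)
                     : INT_MAX;
    return bySmem < byRegs ? bySmem : byRegs;
}
```

For example, a kernel using 4 KB of shared memory per block and 10 registers per thread at 64 threads per block is shared-memory-limited to 4 resident blocks per SM, even though the register budget would allow 12.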

Yes, there is a benefit to using more threads per thread block:

  • you need at least 16 threads (a half-warp) to get coalescing when accessing global memory, and coalescing significantly improves performance.

  • having more threads helps "hide" global memory access latency (which for G80 is between 400 and 600 clocks). Think of it this way: if multiple threads per processor issue memory access instructions (say, reads), you incur the latency once, after which a new read completes with each clock. Same idea as pipelining.

  • having more threads helps hide read-after-write register conflicts. According to the programming guide, you need 192 threads per block to completely avoid the performance hit due to these conflicts (if your code creates them).
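Putting the coalescing bullet in concrete terms: a coalesced kernel has each thread of a (half-)warp touch consecutive words. The sketch below mimics the standard CUDA index expression in plain C++ so the access pattern can be checked on the host; `globalIndex` is my own illustrative name:

```cpp
// The global element index a CUDA thread would compute for the usual
// coalesced access pattern: blockIdx.x * blockDim.x + threadIdx.x.
// Consecutive threadIdx values then map to consecutive words, which is
// what G80's half-warp coalescing rules require.
int globalIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```

With 192 threads per block (a multiple of 32, and the count the programming guide cites for hiding register read-after-write latency), threads 0..15 of block 2 touch indices 384..399: sixteen consecutive words, so the half-warp's reads coalesce into one transaction.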