Single workgroup on multiple multiprocessors


I can't tell from the programming guide and the best practices guide whether warps from the same workgroup (block, in CUDA terms) can run on multiple multiprocessors, or whether each block is handled by a single multiprocessor. For example, if I have a GeForce GTX 580 with 16 SMs and launch a kernel with a single block of 512 threads, does each multiprocessor handle one warp, or are all 16 warps handled by a single SM?



Perhaps you should read chapter 2.1 of the OpenCL Programming Guide. Obviously, the “Best Practices” guide is not where you should expect a description of the hardware implementation.

A workgroup/block is executed on a single multiprocessor. A multiprocessor can run multiple workgroups/blocks at the same time as long as resources permit (registers, local/shared memory, thread count, etc.; the C for CUDA docs are more detailed in this respect). Warps are bundles of work items/threads from one workgroup/block that are processed together.
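To make the “as long as resources permit” part concrete, here is a rough sketch of how the number of blocks resident on one multiprocessor could be estimated. The names `SMLimits` and `max_resident_blocks` are made up for illustration, the default limits are assumed values for a Fermi-class (compute capability 2.0) part, and the model ignores allocation granularity, so treat the results as an approximation and use the official occupancy calculator for real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Assumed per-multiprocessor limits (hypothetical defaults, Fermi-like)."""
    max_threads: int = 1536        # resident threads per SM
    max_blocks: int = 8            # resident blocks per SM
    registers: int = 32768         # 32-bit registers per SM
    shared_mem_bytes: int = 49152  # shared/local memory per SM


def max_resident_blocks(threads_per_block, regs_per_thread,
                        shared_bytes_per_block, limits=SMLimits()):
    """Estimate how many blocks fit on one SM at the same time.

    Simplified: each resource gives an upper bound, and the tightest
    bound wins. Real hardware also rounds allocations up.
    """
    if threads_per_block <= 0 or regs_per_thread <= 0:
        raise ValueError("threads_per_block and regs_per_thread must be positive")

    by_threads = limits.max_threads // threads_per_block
    by_registers = limits.registers // (regs_per_thread * threads_per_block)
    if shared_bytes_per_block > 0:
        by_shared = limits.shared_mem_bytes // shared_bytes_per_block
    else:
        by_shared = limits.max_blocks  # no shared memory used: not a constraint

    return min(limits.max_blocks, by_threads, by_registers, by_shared)


if __name__ == "__main__":
    # 512 threads per block, 20 registers per thread, no shared memory
    print(max_resident_blocks(512, 20, 0))
    # 256 threads per block, 16 registers per thread, 16 KiB shared memory
    print(max_resident_blocks(256, 16, 16384))
```

In this model a block is never split: if it does not fit on one multiprocessor as a whole, it cannot be placed at all.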

And to answer your question: if your NDRange specifies a single workgroup/block, it will be assigned to a single multiprocessor, so all of its warps run on that one SM.
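For the numbers in the question, the arithmetic can be sketched as follows. The helper names are made up for illustration, and the warp size of 32 is the usual value on NVIDIA hardware rather than something stated in this thread.

```python
import math

WARP_SIZE = 32  # assumed warp size (typical for NVIDIA GPUs)


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps a block is split into (last warp may be partial)."""
    return math.ceil(threads_per_block / warp_size)


def max_busy_multiprocessors(num_blocks, num_sms):
    """Upper bound on SMs doing work: a block lives on exactly one SM."""
    return min(num_blocks, num_sms)


if __name__ == "__main__":
    # One block of 512 threads on a 16-SM device
    print(warps_per_block(512))              # 16 warps, all in the same block
    print(max_busy_multiprocessors(1, 16))   # 1 SM busy, the rest idle
    # Same 512 threads split into 16 blocks of 32 threads
    print(max_busy_multiprocessors(16, 16))  # up to 16 SMs busy
```

So a launch needs at least as many blocks as there are multiprocessors before every SM can have work; how the blocks are actually distributed is up to the hardware scheduler.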

I had already read the Programming Guide, but I wasn’t completely sure that the statement “The threads of a thread block execute concurrently on one multiprocessor.” meant that multiprocessors cannot share a block. Thank you for clearing things up.