Threads per block vs. threads per multiprocessor: a contradiction?


The docs say:

Each block can consist of at most 512 threads.

Each block is executed on one multiprocessor (MP).

One MP can manage 768 threads.

Is this a contradiction, or did I miss something?

Greetings, Uwe

That sounds right to me. Remember that the threads of a block do not all execute at the same instant: each block is divided into warps of 32 threads, and the MP executes those warps. Each MP processes 8 threads per clock, but through instruction latency and pipelining it is effectively processing 32 threads (one warp) at once (each group of 8 is in one stage of the pipeline at any given time). So once the blocks are divided into warps, the MP switches between its resident warps rather than running them all simultaneously. The 768 figure is a limit on how many threads (24 warps) an MP can keep resident, not on how many it executes in the same clock, which accounts for the apparent contradiction.
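To make the warp arithmetic concrete, here is a minimal sketch in plain host-side C++ (no CUDA calls). The constants are simply the figures quoted in this thread, hard-coded as assumptions rather than queried from a device, and the function names are made up for illustration.

```cpp
// Figures quoted in this thread (assumptions of this sketch,
// not values read from an actual device).
constexpr int kWarpSize           = 32;   // threads per warp
constexpr int kThreadsPerClock    = 8;    // threads one MP processes per clock
constexpr int kMaxThreadsPerBlock = 512;  // per-block limit
constexpr int kMaxThreadsPerMP    = 768;  // resident-thread limit per MP

// Number of warps a block is split into; a partially filled
// warp still occupies a whole warp, hence the rounding up.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Clocks needed to push one instruction through for a full warp.
constexpr int clocks_per_warp_instruction() {
    return kWarpSize / kThreadsPerClock;
}

// Number of warps an MP can keep resident at once.
constexpr int max_resident_warps_per_mp() {
    return kMaxThreadsPerMP / kWarpSize;
}

// Compile-time checks of the numbers discussed above.
static_assert(warps_per_block(kMaxThreadsPerBlock) == 16, "512 threads = 16 warps");
static_assert(clocks_per_warp_instruction() == 4, "32 threads / 8 per clock");
static_assert(max_resident_warps_per_mp() == 24, "768 threads = 24 warps");
```

So a maximal 512-thread block is 16 warps, while the MP has room for 24 resident warps; the two limits measure different things.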

But wouldn't it be enough if each MP could handle 512 threads, since no more than that are ever scheduled onto one MP?

You can have more than one block per MP.

Plus, remember that your kernel is launched as a grid of blocks, so (in general) you’ll have far more blocks than you have MPs. The internal CUDA scheduler keeps the pending blocks in a “queue” of sorts, and keeps the MPs busy with them until all the blocks have completed.

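EDIT: Also, this is why most people limit their blocks to a maximum of 256 threads: three such blocks fit on one MP (3 x 256 = 768), provided registers and shared memory allow it, whereas a single 512-thread block leaves 256 of the 768 thread slots unused because a second 512-thread block would not fit. There is also the convenience factor: 256 threads is equivalent to a 16x16 block of threads, which is nice for 2D work (e.g. blocked matrix multiplication).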
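The block-size reasoning can be sketched the same way, again as plain host-side C++ with hypothetical function names. This assumes only the 768-thread limit from the thread plus a cap of 8 resident blocks per MP, which is my assumption for this hardware generation. It deliberately ignores register and shared-memory usage, which can lower the real number further.

```cpp
// Assumed per-MP limits (hard-coded for illustration, not queried).
constexpr int kThreadLimitPerMP = 768;  // resident threads per MP, as quoted above
constexpr int kBlockLimitPerMP  = 8;    // assumption: max resident blocks per MP

// How many blocks of a given size fit on one MP, counting only
// the thread limit and the block limit.
constexpr int resident_blocks_per_mp(int threads_per_block) {
    return (kThreadLimitPerMP / threads_per_block) < kBlockLimitPerMP
               ? (kThreadLimitPerMP / threads_per_block)
               : kBlockLimitPerMP;
}

// How many of the 768 thread slots those blocks actually fill.
constexpr int resident_threads_per_mp(int threads_per_block) {
    return resident_blocks_per_mp(threads_per_block) * threads_per_block;
}

// 512-thread blocks: only one fits, leaving 256 slots idle.
static_assert(resident_blocks_per_mp(512) == 1, "one 512-thread block per MP");
static_assert(resident_threads_per_mp(512) == 512, "256 slots unused");

// 256-thread blocks: three fit and fill the MP completely.
static_assert(resident_blocks_per_mp(256) == 3, "three 256-thread blocks per MP");
static_assert(resident_threads_per_mp(256) == 768, "all slots used");

// Very small blocks run into the block limit instead.
static_assert(resident_blocks_per_mp(64) == 8, "capped by the block limit");
static_assert(resident_threads_per_mp(64) == 512, "8 x 64 threads");
```

Under these assumptions, block sizes such as 128, 192, 256 and 384 all fill the 768 slots, while both 512 and 64 top out at 512 resident threads.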

Thanks, that is what I misunderstood.

Greetings, Uwe