Distribution of Threads to Multiprocessors

In the CUDA documentation it is written:

[i]2.2.2 Grid of Thread Blocks
There is a limited maximum number of threads that a block can contain. However, blocks of same dimensionality and size that execute the same kernel can be batched together into a grid of blocks, so that the total number of threads that can be launched in a single kernel invocation is much larger.

The maximum number of threads per block is 512
[/i]
I understand that many threads can be “launched”, but I wonder how many threads or kernels are actually executed together.

Is it possible that more than one thread block is executed at the same time? Is this only the case when there are fewer than 512 threads per block?

Is there more detailed information on how the threads are distributed across the multiprocessors?

Thanks

I can’t point you to a specific section in the guide that says this concisely (it has been a while since I read it; I’m sure it is in there somewhere), but here is the short answer. It is possible, and in fact most desirable, to have more than one block executing at a time. Given the constraints of register and shared memory usage, the device will try to run as many blocks on each multiprocessor as it can. And there are 16 multiprocessors! (Fewer on the low-end cards.)
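If you don’t want to guess the multiprocessor count of your particular card, you can query it at runtime. Here is a minimal sketch, assuming device 0, using cudaGetDeviceProperties:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Multiprocessor count and the per-block resource limits discussed above
    printf("Multiprocessors:          %d\n", prop.multiProcessorCount);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    printf("Registers per block:      %d\n", prop.regsPerBlock);
    printf("Shared memory per block:  %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    return 0;
}
[/code]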

But this directly conflicts with section 5.2 of the same guide, which states: “64 threads per block is minimal and makes sense only if there are multiple concurrent blocks. 192 or 256 threads per block is better and usually allows for enough registers to compile.” 256 x 8 = 2048, which is much more than a multiprocessor can handle. I would guess that “block” in this quote should be “multiprocessor”, where it would make more sense. I would like this clarified.

Another question: is it possible to have completely different sets of threads running concurrently on different multiprocessors? I am still waiting for my Tesla before I can learn by experimentation.

8 is a maximum, not mandatory. The objective is to reach 100% multiprocessor occupancy, that is, 24 warps running on every multiprocessor. With 256 threads per block (8 warps), there will be a maximum of 3 blocks running on every multiprocessor at a time. 3 blocks of 256 threads make up the full 24 warps and maximize concurrency.

At the same time, there are limits on registers (8192 per multiprocessor) and shared memory (16 KB per multiprocessor). For instance, if your kernel requires 12 registers per thread, using 256 threads per block allows a maximum of 2 blocks per multiprocessor (3 blocks would require 9216 registers), which is 66% occupancy.
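Written out as a quick back-of-the-envelope calculation (illustrative numbers only; the per-thread register count is what ptxas reports when you compile with --ptxas-options=-v):

[code]
// 8192 registers per multiprocessor, 24 resident warps max, 32 threads per warp
int regs_per_thread   = 12;                                      // reported by ptxas
int threads_per_block = 256;                                     // chosen launch configuration
int regs_per_block    = regs_per_thread * threads_per_block;     // 12 * 256 = 3072
int blocks_per_sm     = 8192 / regs_per_block;                   // 8192 / 3072 = 2
int warps_resident    = blocks_per_sm * threads_per_block / 32;  // 2 * 256 / 32 = 16
float occupancy       = (float)warps_resident / 24.0f;           // 16 / 24 = 0.67
[/code]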

Everything is clearly explained in chapters 2 and 3 and section 5.2 of the Programming Guide. Have a look at it.

Not quite so clear, but it is starting to emerge. Section 3.2, “A block is processed by only one multiprocessor”, explains why the numbers made sense for a multiprocessor. For a while I had perceived a block as the number of threads for a processor, not the multiprocessor.

It appears as if each multiprocessor could execute a different instruction stream. Is this true? If not, why the boundary? Is it a boundary due to communication and sharing? How are results shared between multiprocessors with no synchronization?

Every multiprocessor has an independent instruction decoder, so yes, different blocks can run different instructions (the so-called “fat kernel”). Multiprocessors cannot communicate very easily, though. The best approach is to design your kernels so that communication between multiprocessors is not needed. Compute capability 1.1 devices (like the 8600 and 8800 GT) can perform atomic updates to global memory, which can be used for inter-block communication. There are tricks described in other posts showing how an atomic update can be hacked onto the older devices like the 8800 GTX and GTS, but they should be avoided unless you really need them.
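As a rough sketch of what I mean by using atomics for inter-block communication (the kernel name, data layout and 256-thread block size are made up for illustration; it needs a compute 1.1 device, e.g. compiled with -arch=sm_11 on the toolkits of that era):

[code]
// Each block sums its slice of the input and adds the partial result to one
// global counter with atomicAdd.
__global__ void accumulate(const int *data, int n, int *total)
{
    __shared__ int blockSum[256];              // assumes <= 256 threads per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    blockSum[threadIdx.x] = (i < n) ? data[i] : 0;
    __syncthreads();

    if (threadIdx.x == 0) {                    // thread 0 sums the block's values...
        int sum = 0;
        for (int t = 0; t < blockDim.x; ++t)
            sum += blockSum[t];
        atomicAdd(total, sum);                 // ...and publishes it atomically
    }
}
[/code]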

In principle, even every warp can execute a different instruction stream. Threads within a block can synchronize using __syncthreads(), but blocks cannot synchronize and communicate with each other.
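A minimal illustration of what __syncthreads() buys you inside a block (kernel name and 128-thread block size are assumptions for the example): every thread writes its own shared-memory slot, and the barrier guarantees all writes are visible before any thread reads a neighbour’s slot.

[code]
__global__ void shiftLeft(const float *in, float *out)
{
    __shared__ float buf[128];                  // one slot per thread, blockDim.x <= 128

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];
    __syncthreads();                            // barrier: all writes done before any read

    int next = (threadIdx.x + 1) % blockDim.x;  // read the neighbour's element
    out[i] = buf[next];
}
[/code]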

But all threads in a block should finish (or sync) before another block is scheduled to the same SM, right? I guess this is the case because of the usage of shared memory. Hence, executing a larger number of threads per block might sometimes lead to a decrease in performance because of load-balancing issues: one thread in the block might keep on doing its job while all the others have finished. Isn’t that true?