Distribution of Threads to Multiprocessors

In the CUDA documentation it is written:

[i]2.2.2 Grid of Thread Blocks
There is a limited maximum number of threads that a block can contain. However, blocks of same dimensionality and size that execute the same kernel can be batched together into a grid of blocks, so that the total number of threads that can be launched in a single kernel invocation is much larger.

The maximum number of threads per block is 512
[/i]
I understand that many threads can be “launched”, but I wonder how many threads or kernels are actually executed together.

Is it possible that more than one thread block is executed at the same time? Is this only the case when there are fewer than 512 threads per block?

Is there more detailed information on how the threads are distributed across the multiprocessors?

Thanks

I can’t point you to a specific section in the guide that says this concisely (it has been a while since I read it; I’m sure it is in there somewhere), but here is the short answer. It is possible, and in fact most desirable, to have more than one block executing at a time. Given the constraints of register and shared memory usage, the device will try to run as many blocks on each multiprocessor as it can. And there are 16 multiprocessors! (Fewer on the low-end cards.)
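If you don’t want to guess the multiprocessor count of your particular card, you can query it at runtime. Here is a minimal sketch, assuming device 0, using cudaGetDeviceProperties:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Multiprocessor count and the per-block resource limits discussed above
    printf("Multiprocessors:          %d\n", prop.multiProcessorCount);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    printf("Registers per block:      %d\n", prop.regsPerBlock);
    printf("Shared memory per block:  %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    return 0;
}
[/code]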

But this directly conflicts with section 5.2 of the same guide, which states: “64 threads per block is minimal and makes sense only if there are multiple concurrent blocks. 192 or 256 threads per block is better and usually allows for enough registers to compile.” 256 x 8 = 2048, which is much more than a multiprocessor can handle. I would guess that “block” in this quote should be “multiprocessor”, where it would make more sense. I would like this clarified.

Another question: is it possible to have completely different sets of threads running concurrently on different multiprocessors? I am still waiting for my Tesla before I can learn by experimentation.

8 is a maximum, not mandatory. The objective is to reach 100% multiprocessor occupancy, that is, 24 warps running on every multiprocessor. With 256 threads per block (8 warps), there will be a maximum of 3 blocks running on every multiprocessor at a time. 3 blocks of 256 threads make up the full 24 warps and maximize concurrency.

At the same time, there are limits on registers (8192 per multiprocessor) and shared memory (16 KB per multiprocessor). For instance, if your kernel requires 12 registers per thread, using 256 threads per block allows a maximum of 2 blocks per multiprocessor (3 blocks would require 9216 registers), which is 66% occupancy.
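Written out as a quick back-of-the-envelope calculation (illustrative numbers only; the per-thread register count is what ptxas reports when you compile with --ptxas-options=-v):

[code]
// 8192 registers per multiprocessor, 24 resident warps max, 32 threads per warp
int regs_per_thread   = 12;                                      // reported by ptxas
int threads_per_block = 256;                                     // chosen launch configuration
int regs_per_block    = regs_per_thread * threads_per_block;     // 12 * 256 = 3072
int blocks_per_sm     = 8192 / regs_per_block;                   // 8192 / 3072 = 2
int warps_resident    = blocks_per_sm * threads_per_block / 32;  // 2 * 256 / 32 = 16
float occupancy       = (float)warps_resident / 24.0f;           // 16 / 24 = 0.67
[/code]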

Everything is clearly explained in chapters 2 and 3 and section 5.2 of the Programming Guide. Have a look at it.

Not quite so clear, but it is starting to emerge. Section 3.2, “A block is processed by only one multiprocessor”, explains why the numbers made sense for a multiprocessor. For a while I had perceived a block as the number of threads for a processor, not the multiprocessor.

It appears as if each multiprocessor could execute a different instruction stream. Is this true? If not, why the boundary? Is it a boundary due to communication and sharing? How are results shared between multiprocessors with no synchronization?

Every multiprocessor has an independent instruction decoder, so yes, different blocks can run different instructions (the so-called “fat kernel”). Multiprocessors cannot communicate very easily, though. The best approach is to design your kernels so that communication between multiprocessors is not needed. Compute capability 1.1 devices (like the 8600 and 8800 GT) can perform atomic updates to global memory, which can be used for inter-block communication. There are tricks described in other posts showing how an atomic update can be hacked onto the older devices like the 8800 GTX and GTS, but they should be avoided unless you really need them.
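As a rough sketch of what I mean by using atomics for inter-block communication (the kernel name, data layout and 256-thread block size are made up for illustration; it needs a compute 1.1 device, e.g. compiled with -arch=sm_11 on the toolkits of that era):

[code]
// Each block sums its slice of the input and adds the partial result to one
// global counter with atomicAdd.
__global__ void accumulate(const int *data, int n, int *total)
{
    __shared__ int blockSum[256];              // assumes <= 256 threads per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    blockSum[threadIdx.x] = (i < n) ? data[i] : 0;
    __syncthreads();

    if (threadIdx.x == 0) {                    // thread 0 sums the block's values...
        int sum = 0;
        for (int t = 0; t < blockDim.x; ++t)
            sum += blockSum[t];
        atomicAdd(total, sum);                 // ...and publishes it atomically
    }
}
[/code]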

In principle, even every warp can execute a different instruction stream. Threads within a block can synchronize using __syncthreads(), but blocks cannot synchronize and communicate with each other.
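A minimal illustration of what __syncthreads() buys you inside a block (kernel name and 128-thread block size are assumptions for the example): every thread writes its own shared-memory slot, and the barrier guarantees all writes are visible before any thread reads a neighbour’s slot.

[code]
__global__ void shiftLeft(const float *in, float *out)
{
    __shared__ float buf[128];                  // one slot per thread, blockDim.x <= 128

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];
    __syncthreads();                            // barrier: all writes done before any read

    int next = (threadIdx.x + 1) % blockDim.x;  // read the neighbour's element
    out[i] = buf[next];
}
[/code]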

But all threads in a block should finish (or sync) before another block is scheduled to the same SM, right? I guess this is the case because of the usage of shared memory. Hence, executing a larger number of threads per block might sometimes lead to a decrease in performance because of load-balancing issues: one thread in the block might keep on doing its job while all the others have finished. Isn’t that true?