How are blocks scheduled for execution?

I know that a block has a maximum number of threads (e.g. 512), and that when a block is executed, all of its threads run the kernel code at the same time.
But how are the blocks themselves scheduled? If I define a grid with, say, 1024 blocks, how many of them are executed at once? From what I’ve seen, all blocks are executed eventually, as a grid.
I’d appreciate an explanation, or a link to an article that contains the relevant information.
I would also like to find out whether the CUDA API provides calls that return that kind of stats (for a given GPU).
Thanks a lot.


This link illustrates how the Fermi scheduler works.

Thank you for sending that link, it’s an interesting read.
I also found this article:
There is a paragraph there which is relevant, in case anyone is interested:

In the first post of this series we mentioned that the grouping of threads into thread blocks mimics how thread processors are grouped on the GPU. This group of thread processors is called a streaming multiprocessor, denoted SM in the table above. The CUDA execution model issues thread blocks on multiprocessors, and once issued they do not migrate to other SMs. Multiple thread blocks can concurrently reside on a multiprocessor subject to available resources (on-chip registers and shared memory) and the limit shown in the last row of the table. The limits on threads and thread blocks in this table are associated with the compute capability and not just a particular device: all devices of the same compute capability have the same limits. There are other characteristics, however, such as the number of multiprocessors per device, that depend on the particular device and not the compute capability. All of these characteristics, whether defined by the particular device or its compute capability, can be obtained using the cudaDeviceProp type.
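In case it helps anyone, here is a rough sketch of how those numbers could be queried at runtime. It is not taken from the article: `dummyKernel` and `concurrentBlocks` are made-up names for illustration, the block size of 512 and grid of 1024 blocks are just the figures from the original question, and the final number is only an upper bound on resident blocks under the assumption that nothing else is running on the device. It uses `cudaGetDeviceProperties` for the per-device limits and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (available in CUDA 6.5 and later) to ask how many blocks of a given kernel can reside on one SM.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel. It is never launched here; it only gives
// the occupancy query a real kernel (with real register usage) to inspect.
__global__ void dummyKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];
}

// Hypothetical helper: upper bound on the number of blocks that can be
// resident on the whole device at once, capped by the grid size.
int concurrentBlocks(int blocksPerSM, int numSMs, int gridBlocks) {
    return std::min(blocksPerSM * numSMs, gridBlocks);
}

int main() {
    int device = 0;  // assumption: query the first GPU
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device:                  %s\n", prop.name);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    int blockSize = 512;    // threads per block, as in the question
    int gridBlocks = 1024;  // blocks in the grid, as in the question

    // Ask the runtime how many blocks of this kernel fit on one SM,
    // given its register usage and no dynamic shared memory.
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dummyKernel, blockSize, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Resident blocks per SM:  %d\n", blocksPerSM);
    printf("Resident blocks (max):   %d of %d\n",
           concurrentBlocks(blocksPerSM, prop.multiProcessorCount, gridBlocks),
           gridBlocks);
    return 0;
}
```

If I understand the quoted paragraph correctly, the remaining blocks of the grid wait until a resident block finishes and frees its slot on some SM, so the printed maximum is the most that can be in flight at any moment, not a guarantee of what the hardware does at each instant.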


More details might be found in this paper, but I have not read it yet.