How are blocks scheduled for execution?

markcitizen · December 6, 2016, 12:15am

Hello,
I know that a block has a maximum number of threads (e.g. 512), and that when one block is executed all threads run the kernel code at the same time.
But how are blocks scheduled? If I define a grid with, say, 1024 blocks, how many of them are executed at once? From what I’ve seen, all blocks in the grid are eventually executed.
I’d appreciate an explanation, or a link to an article that contains the relevant information.
I would also like to know whether the CUDA API provides calls that return this kind of information for a given GPU.
Thanks a lot,

M

LongY · December 7, 2016, 11:10pm

This link illustrates how the Fermi scheduler works.
[url]https://users.ices.utexas.edu/~sreepai/fermi-tbs/[/url]

markcitizen · December 8, 2016, 11:17pm

Hello,
Thank you for sending that link; it’s an interesting read.
I also found this article:

There is a paragraph there which is relevant, in case anyone is interested:

"In the first post of this series we mentioned that the grouping of threads into thread blocks mimics how thread processors are grouped on the GPU. This group of thread processors is called a streaming multiprocessor, denoted SM in the table above. The CUDA execution model issues thread blocks on multiprocessors, and once issued they do not migrate to other SMs. Multiple thread blocks can concurrently reside on a multiprocessor subject to available resources (on-chip registers and shared memory) and the limit shown in the last row of the table. The limits on threads and thread blocks in this table are associated with the compute capability and not just a particular device: all devices of the same compute capability have the same limits. There are other characteristics, however, such as the number of multiprocessors per device, that depend on the particular device and not the compute capability. All of these characteristics, whether defined by the particular device or its compute capability, can be obtained using the cudaDeviceProp type."
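To make the "subject to available resources" part concrete, here is a rough host-side sketch in plain C++ of how the number of concurrently resident blocks could be estimated. The struct field names mirror fields of cudaDeviceProp, which a real program would fill with cudaGetDeviceProperties. The helper names (SmLimits, KernelUsage, residentBlocksPerSm, wavesForGrid) are made up for illustration, and the model is simplified: it ignores register and shared memory allocation granularity. For an actual kernel, the runtime call cudaOccupancyMaxActiveBlocksPerMultiprocessor reports the per-SM figure directly.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits. Field names mirror cudaDeviceProp; any values plugged in
// here are up to the caller (normally read from the device at run time).
struct SmLimits {
    int multiProcessorCount;          // number of SMs on the device
    int maxThreadsPerMultiProcessor;  // resident-thread limit per SM
    int maxBlocksPerMultiProcessor;   // resident-block limit per SM
    int regsPerMultiprocessor;        // 32-bit registers per SM
    int sharedMemPerMultiprocessor;   // bytes of shared memory per SM
};

// Resource usage of one kernel launch (hypothetical helper type).
struct KernelUsage {
    int threadsPerBlock;
    int regsPerThread;
    int sharedMemPerBlock;  // bytes; 0 if the kernel uses no shared memory
};

// Simplified estimate of how many blocks can be resident on one SM at once:
// the tightest of the block, thread, register and shared memory limits.
int residentBlocksPerSm(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threadsPerBlock > 0 && k.regsPerThread > 0);
    int byThreads = sm.maxThreadsPerMultiProcessor / k.threadsPerBlock;
    int byRegs = sm.regsPerMultiprocessor / (k.regsPerThread * k.threadsPerBlock);
    int limit = std::min({sm.maxBlocksPerMultiProcessor, byThreads, byRegs});
    if (k.sharedMemPerBlock > 0) {
        limit = std::min(limit, sm.sharedMemPerMultiprocessor / k.sharedMemPerBlock);
    }
    return limit;
}

// Number of "waves" needed to run gridBlocks when at most concurrentBlocks
// are resident at once (ceiling division). Returns 0 if nothing can be resident.
int wavesForGrid(int gridBlocks, int concurrentBlocks) {
    if (concurrentBlocks <= 0) return 0;
    return (gridBlocks + concurrentBlocks - 1) / concurrentBlocks;
}
```

As a worked example with hypothetical limits of 16 SMs, 1536 threads, 8 blocks, 32768 registers and 49152 bytes of shared memory per SM, a kernel using 512 threads per block, 20 registers per thread and 4096 bytes of shared memory per block is limited to 3 resident blocks per SM. That gives 48 blocks in flight at once, so a 1024-block grid takes roughly 22 waves. "Waves" is only an approximation, since a new block can be issued to an SM as soon as an earlier one finishes.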
Thanks,

M

SPWorley · December 9, 2016, 10:05am

More details might be found in this paper, but I have not read it yet.
