How will blocks be distributed among SMs?

Say I have 30 blocks with 32 threads in each block, and the register and shared memory requirements of each thread are minimal, so 8 blocks may be placed on one SM.

Considering a GTX 280: will these 30 blocks be distributed among its 30 SMs, or will only four SMs be involved (1st: 8 blocks, 2nd: 8 blocks, 3rd: 8 blocks, 4th: 6 blocks)?

Is this controllable from the programmer’s side?

It’s not controllable from the programmer’s side, and to be honest I don’t know how the hardware schedules blocks.

This question has practical importance when choosing launch parameters for input data that is not big enough to fill all SMs with any combination of block/grid sizes. Which option should one choose: many blocks with a small number of threads, or fewer blocks with the maximum possible number of threads in each block?

My observations on my own tasks favor the second approach (maximum threads, fewer blocks), but I believe this is very task-dependent.
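For reference, a minimal sketch of how the two configurations can be timed against each other with CUDA events; the kernel myKernel, its workload, and the sizes here are purely illustrative:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;           // placeholder work
    }

    // Time one launch configuration with CUDA events.
    static float timeConfig(float *d_data, int n, int threadsPerBlock)
    {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        const int n = 30 * 512;               // deliberately small problem
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Option 1: many small blocks; option 2: fewer, larger blocks.
        printf("32 threads/block:  %.3f ms\n", timeConfig(d_data, n, 32));
        printf("512 threads/block: %.3f ms\n", timeConfig(d_data, n, 512));

        cudaFree(d_data);
        return 0;
    }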

The total number of resident threads is what determines how well memory latency is hidden. To hide register read-after-write dependencies, you need 192 or more threads per multiprocessor, as suggested by the programming guide; with 32-thread blocks, that means at least six resident blocks per SM (6 × 32 = 192). So it really depends on how the time of your kernel is balanced between memory and arithmetic operations.

While thread block scheduling and assignment to multiprocessors are not defined (i.e. any order and assignment is correct), you can spread thread blocks across the multiprocessors with a simple “hack”:

  • a multiprocessor has 16 KB of shared memory (smem) available;
  • you can control how many thread blocks run per multiprocessor by forcing occupancy with smem requests (easily done with the third argument of the kernel launch configuration, <<<grid, block, smemBytes>>>). For example, if you request more than 8 KB, only one thread block will be assigned per multiprocessor.
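A minimal sketch of this smem trick; the kernel name and sizes are illustrative, and the larger-than-8 KB request is the only essential part:

    #include <cuda_runtime.h>

    // The kernel declares dynamic shared memory but need not use it;
    // the allocation alone limits how many blocks fit on an SM.
    __global__ void spreadKernel(float *out)
    {
        extern __shared__ float smem[];   // sized at launch time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = (float)i;                // placeholder work
    }

    int main()
    {
        const int blocks = 30, threads = 32;
        float *d_out;
        cudaMalloc(&d_out, blocks * threads * sizeof(float));

        // Request more than 8 KB of dynamic smem per block: with 16 KB
        // per multiprocessor, two such blocks cannot coexist, so at most
        // one block is resident per SM and the 30 blocks spread across
        // the 30 SMs of a GTX 280.
        size_t smemBytes = 9 * 1024;
        spreadKernel<<<blocks, threads, smemBytes>>>(d_out);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }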

Paulius

Brilliant, I had not thought of such a trick :-) Thank you.