Scheduling blocks to SMs at runtime

skyblues · October 26, 2008, 3:08am

In order to fully utilize all the SMs available, how many blocks must there be?
Assumption: there are three active blocks assigned per SM…

If I have six blocks for execution, only two SMs are used, right?

Thanks

Sarnath · October 26, 2008, 3:36am

I have exactly the same question as skyblues… With the growing number of SMs on the GTX280, some existing kernels would not work as expected if the above assumption does NOT hold!

Can someone shed some light here?

pramodsub · October 26, 2008, 5:28am

The assumption is not entirely true. The number of blocks assigned to an SM depends on the number of registers each thread uses, the amount of shared memory used by a block, and the number of threads in the block. For instance, each SM may have 8192 registers and each of your threads may require 20 registers. So, for a 16x16 block size, you’d require a total of 20*256 = 5120 registers. Two such blocks would need 10240 registers, which is more than the SM has, so only one block can be scheduled per SM.
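The register arithmetic above can be sketched as a small host-side calculation. This is only a rough model: the 8192-register figure comes from the post, while the shared memory size, thread limit, and block limit per SM are assumed values for early CUDA hardware, and the function name `blocks_per_sm` is hypothetical. Real hardware also allocates resources in coarser chunks, so the true number can be lower.

```python
# Rough sketch of how many blocks fit on one SM.
# REGISTERS_PER_SM is taken from the post; the other limits are assumptions.
REGISTERS_PER_SM = 8192        # from the example above
SHARED_MEM_PER_SM = 16 * 1024  # bytes (assumed)
MAX_THREADS_PER_SM = 768       # assumed
MAX_BLOCKS_PER_SM = 8          # assumed


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block):
    """Return how many blocks fit on one SM under the limits above.

    Ignores allocation granularity, so real hardware may fit fewer.
    """
    limits = [
        MAX_BLOCKS_PER_SM,
        MAX_THREADS_PER_SM // threads_per_block,
        REGISTERS_PER_SM // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:
        limits.append(SHARED_MEM_PER_SM // smem_per_block)
    return min(limits)


# 16x16 block, 20 registers per thread, no shared memory:
# the register limit (8192 // 5120) is the binding one.
print(blocks_per_sm(256, 20, 0))  # 1
```

The occupancy calculator mentioned later in the thread does the same kind of calculation with the exact per-architecture limits.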

That said, for your scenario, if the number of blocks is less than the number of SMs, I’d expect them to be assigned to different SMs. Since packing many blocks onto one SM only helps increase that SM’s throughput, it makes more sense to put the blocks on different SMs.

Cheers,

Pramod

Romant · October 26, 2008, 8:40am

In fact, the scheduling is not defined. However, it is possible to force each SM to handle one (or two, or any number of) blocks at a time by specifying the number of shared memory bytes each block requires in the kernel call.

For example, if you specify (8K + 1 byte), then only one block will be active on each SM, because two such blocks no longer fit in the SM's 16 KB of shared memory.
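A minimal sketch of why the extra byte matters, assuming 16 KB of shared memory per SM (the figure for GPUs of this generation). The function name is hypothetical, and the model ignores the small amount of shared memory the runtime itself uses per block, so in practice the cutoff sits slightly below 8K.

```python
SHARED_MEM_PER_SM = 16 * 1024  # bytes; assumed for GPUs of this generation


def blocks_limited_by_smem(smem_per_block):
    """Blocks that fit on one SM if shared memory is the only limit."""
    return SHARED_MEM_PER_SM // smem_per_block


print(blocks_limited_by_smem(8 * 1024))      # 2
print(blocks_limited_by_smem(8 * 1024 + 1))  # 1
```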

alex_dubinsky · October 27, 2008, 2:25am

You can’t know if the scheduler will put three blocks onto one SM or spread them out. Obviously to spread them out would be smarter, but we don’t know (and never found out experimentally). As Romant said, you can force them to spread out by using too many resources for more than one block to fit in an SM.

Sarnath · October 27, 2008, 4:43am

The main point is the scheduler should schedule blocks such that latencies are not exposed.

Say I run 64 threads per block… Then I won’t get maximum performance unless 3 blocks are active per multiprocessor. That configuration would be my preferred one, compared to 3 blocks each running on a separate SM (with poor performance).

I think it is not a big deal for NVIDIA to disclose this trivial information… I hope their scheduler is at least somewhat intelligent about doing this work.

This becomes a big problem when you design kernels that you want to work on single-precision hardware, double-precision hardware, and hardware with a varying number of SMs (especially with the GTX280 series… where there are 30 SMs…)

Eri_Rubin · October 27, 2008, 3:44pm

Launching a kernel with 6 blocks on a GTX280 is a huge waste of resources. You should have at least enough blocks to fill all the SMs. Anyway, calculating how many blocks will run concurrently on an SM is easy with the occupancy calculator. You just need to add the GT200 spec to it for a GTX280.
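To put numbers on this, here is a small sketch using the figures already mentioned in the thread: 30 SMs on a GTX280 and three active blocks per SM. The function names are hypothetical. Since the placement policy is undocumented, the second function only gives the range of SMs that six blocks could occupy, not what the hardware actually does.

```python
import math


def min_blocks_to_fill(num_sms, blocks_per_sm):
    """Blocks needed so every SM holds its maximum number of active blocks."""
    return num_sms * blocks_per_sm


def sms_used_range(num_blocks, num_sms, blocks_per_sm):
    """Return (fewest, most) SMs that num_blocks could occupy.

    fewest: blocks packed as tightly as possible;
    most: blocks spread one per SM.
    """
    packed = min(num_sms, math.ceil(num_blocks / blocks_per_sm))
    spread = min(num_sms, num_blocks)
    return packed, spread


print(min_blocks_to_fill(30, 3))  # 90
print(sms_used_range(6, 30, 3))   # (2, 6)
```

Either way, six blocks leave at least 24 of the 30 SMs idle.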

alex_dubinsky · October 27, 2008, 4:11pm

That’s silly. You won’t get better performance bunching up the blocks on a few SMs. Sure, an SM with 3 blocks will run 2.5x more “efficiently”, but 3 SMs each running one block have 3x more resources. In the end it’ll be about the same, but spreading them out should be a bit faster.
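The comparison above works out as follows, taking the 2.5x figure from the post at face value and measuring throughput in units of "one SM running one block". Both the unit and the function name are illustrative assumptions, not measured data.

```python
def relative_throughput(num_sms, per_sm_speedup):
    """Total throughput, in units of one SM running a single block."""
    return num_sms * per_sm_speedup


packed = relative_throughput(1, 2.5)  # three blocks sharing one SM
spread = relative_throughput(3, 1.0)  # one block on each of three SMs

print(packed)  # 2.5
print(spread)  # 3.0
```

Under these numbers, spreading out comes out ahead by a factor of 3.0 / 2.5, which matches the "a bit faster" estimate.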
