Scheduling blocks to SMs at runtime

To fully utilize all the available SMs, how many blocks must there be?
Assumption: there are three active blocks assigned per SM…

If I have six blocks to execute, only two SMs are used, right?


I have exactly the same question as skyblues… With the growing number of SMs on the GTX 280, some existing kernels would not work as expected if the above condition does NOT hold!

Can someone shed some light here?

That assumption is not necessarily true. The number of blocks assigned to an SM depends on the number of registers each thread uses, the amount of shared memory used by a block, and the number of threads in the block. For instance, each SM may have 8192 registers and each of your threads may require 20 registers. So, for a 16x16 block size, you’d require a total of 20*256 = 5120 registers per block. This means only one block can be scheduled per SM.

That said, for your scenario, if the number of blocks is less than the number of SMs, I’d expect that they are probably assigned to different SMs. Since assigning many blocks to an SM only helps increase throughput, it makes more sense to put the blocks on different SMs.



In fact, the scheduling behavior is not defined. However, it is possible to force each SM to handle one (or two, or any number of) blocks at a time by specifying the number of shared memory bytes required in the kernel call.

For example, if you request (8 KB + 1 byte) of shared memory, then only one block will be active on each SM.

You can’t know whether the scheduler will put three blocks onto one SM or spread them out. Obviously spreading them out would be smarter, but we don’t know (and never found out experimentally). As Romant said, you can force them to spread out by using too many resources for more than one block to fit on an SM.

The main point is that the scheduler should schedule blocks such that latencies are not exposed.

Say I run 64 threads per block… Then I won’t get maximum performance if 3 blocks are not active per multiprocessor. I would prefer that configuration to 3 blocks running on separate SMs (each performing poorly).

I don’t think it’s a big deal for NVIDIA to disclose this trivial information… I hope their scheduler is at least somewhat intelligent about this.

This becomes a big problem when you design kernels that should work on single-precision and double-precision hardware, plus hardware with a variable number of SMs (especially the GTX 280 series, which has 30 SMs…).

Launching a kernel with 6 blocks on a GTX 280 is a huge waste of resources. You should have at least enough blocks to fill all the SMs. Anyway, calculating how many blocks will run concurrently on an SM is easy with the occupancy calculator; you just need to add the GT200 spec to it for the GTX 280.

That’s silly. You won’t get better performance by bunching the blocks up on a few SMs. Sure, an SM with 3 blocks will run 2.5x more “efficiently”, but 3 SMs each running one block have 3x the resources. In the end it’ll be about the same, but spreading them out should be a bit faster.