More blocks than SMs may not make sense

Hello Forum,

My application launches a kernel that basically runs its threads in parallel. They are not cooperating threads and do not call syncthreads(); they basically run independently until the program ends.

It seems specifying more thread blocks than there are SMs won’t help in this case, since there are no opportunities for the threads to wait on a sync. From debugging and reading, it appears that only one block can exist on one SM at a time.

Is this understanding correct?

Thanks for your help.

No, more than one block can run on a single SM at the same time. Whether this happens depends on how many registers and how much shared memory each block requires. If the SM has sufficient resources to run more than one block, it will.
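
If you want to check this for your own kernel, later CUDA toolkits (6.5 and up) expose the calculation directly through cudaOccupancyMaxActiveBlocksPerMultiprocessor. A minimal, untested sketch; the kernel, block size, and shared memory figures below are just placeholders:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(const float *in, float *out)   // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
    }

    int main()
    {
        int blocksPerSM = 0;
        const int blockSize = 256;      // threads per block (assumed)
        const size_t dynSmem = 0;       // dynamic shared memory per block (assumed)

        // Asks the runtime how many blocks of this kernel fit on one SM,
        // given the kernel's register and shared memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, dynSmem);
        printf("Blocks resident per SM: %d\n", blocksPerSM);
        return 0;
    }

The same number can also be worked out by hand from the per-SM limits in the programming guide.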

You usually want more blocks in general, for efficiency, robustness, and scaling.

First, if your blocks each take a variable amount of time, you’ll end up with idle SMs, since your kernel’s runtime is determined by the very slowest block in your kernel.
If you have many more blocks than SMs, then on Fermi your kernel’s runtime is instead determined by the AVERAGE runtime of all blocks (which is ideal).

Second, more blocks allow better scaling across various hardware. Perhaps a new GPU has more SMs or can run more blocks per SM… but you hardwired the block count to be lower than that number because you were assuming some other GPU as a reference. So you lose horsepower by leaving part of the GPU idle.
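
The SM count is easy to query at runtime, so there’s no need to hardwire it. Rough sketch; the residency target and block size are assumptions and the kernel is a placeholder:

    #include <cuda_runtime.h>

    // Placeholder kernel; stands in for whatever the real kernel does.
    __global__ void myKernel(float *data)
    {
        data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
    }

    void launchScaled(float *d_data)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);          // device 0 assumed
        int numSMs = prop.multiProcessorCount;      // e.g. 30 on the GT200 discussed below

        int blocksPerSM = 8;                        // assumed residency target
        dim3 grid(numSMs * blocksPerSM);            // scales with whatever GPU is present
        dim3 block(256);                            // assumed block size
        myKernel<<<grid, block>>>(d_data);
    }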

Third, don’t be scared of high block counts. The overhead of launching a new block is quite small. While I have not timed it, it’s likely on the order of tens of clock cycles, not millions.
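
If you want a rough number on your own card, you could launch an empty kernel with a large grid and divide the elapsed time by the block count. Untested sketch; the grid and block sizes are arbitrary:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void emptyKernel() {}

    int main()
    {
        const int numBlocks = 1 << 16;      // 65536 blocks (arbitrary)
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        emptyKernel<<<numBlocks, 32>>>();   // empty blocks: time is mostly scheduling overhead
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("~%.1f ns per block, launch overhead included\n", ms * 1e6f / numBlocks);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }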

Last, a very large (almost hypocritical) caveat: despite all I just posted, I actually don’t follow the above advice of “use lots of blocks” because of GT200 block-scheduling inefficiencies. Instead of letting the GPU do dynamic block assignment, I dynamically schedule my own work inside each block (using atomic queues). This isn’t necessary on Fermi.
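
For the curious, the idea is roughly this (a bare sketch of the pattern, not my actual code): launch about one block per SM and let each block pull work items off a global counter with atomicAdd until nothing is left.

    // Global work counter; reset to 0 before each launch (e.g. via cudaMemcpyToSymbol).
    __device__ unsigned int g_nextItem = 0;

    __global__ void persistentKernel(const float *in, float *out, unsigned int numItems)
    {
        // Each work item is blockDim.x consecutive elements; the arrays are
        // assumed to hold numItems * blockDim.x elements.
        while (true) {
            __shared__ unsigned int item;
            if (threadIdx.x == 0)
                item = atomicAdd(&g_nextItem, 1);   // one thread claims the next item
            __syncthreads();
            if (item >= numItems)
                break;                              // queue drained: the whole block exits

            unsigned int i = item * blockDim.x + threadIdx.x;
            out[i] = in[i] * 2.0f;                  // placeholder work

            __syncthreads();                        // don't overwrite `item` while still in use
        }
    }

    // Host side (assumed): reset g_nextItem, then e.g. persistentKernel<<<numSMs, 256>>>(...).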

According to http://synergy.cs.vt.edu/pubs/papers/feng-iscas2010-gpusync.pdf:

I have a question:

If one SM can handle one block at a time, what is the difference if our block uses all the shared memory or just 1/4 of it, for example…?

Imagine that we have 2 SMs and 4 blocks which each use all the shared memory.

Now the second situation: we also have 2 SMs and 4 blocks, each using, for example, 1/10 of the shared memory.

What will be the difference between those two situations?

If one SM can handle 1 block at a time, then there shouldn’t be any difference…?

Thanks for the comprehensive reply.

  • First, my blocks don’t vary. They all do the exact same thing, just on different input data.

  • I compute the number of threads that can run on an SM. In my case, based on register and shared memory use, it works out to about 600 per SM.

  • On my GPU the max threads per block is 512, so I should be able to run 2 blocks of 300 threads each on one SM simultaneously.

  • I multiply by the number of SMs, which is 30 for my GPU, so I have a grid size of 60 (30x2) and a block size of 300.

  • What appears to be happening is that only 30 blocks are active. I see this by attaching the NSIGHT debugger: I am only allowed to set the CUDA focus to blocks < 30, even though the grid dimension is reported as 60. I don’t know why, other than that these blocks are not “active”.

  • So it appears that half the blocks never get a chance to become active, because the SMs are busy with the first set and never free up. My best time comes when I simply set the number of blocks equal to the number of SMs, maximize the threads per block (512), and give up the other 88 threads per SM I might have used. Any other fiddling does nothing or makes it worse.

Since I can query the device for the number of SMs, that’s the number of blocks I launch (I don’t hardwire it), and that’s how I adapt to other GPUs.

I can’t comment on your particular situation, but be aware that attaching the debugger changes the execution configuration, and the debugger will not attach to all SMs.
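
One debugger-free way to see where blocks actually land is to have each block record the SM it ran on by reading the %smid special register. Sketch only (untested), but the register read itself is standard inline PTX:

    // Each block writes the id of the SM it executed on; run this with the full
    // grid (e.g. 60 blocks) and count how many distinct SM ids show up and how
    // many blocks map to each.
    __global__ void recordSMId(unsigned int *smids)
    {
        if (threadIdx.x == 0) {
            unsigned int smid;
            asm("mov.u32 %0, %%smid;" : "=r"(smid));
            smids[blockIdx.x] = smid;
        }
    }

Pairing this with per-block clock() timestamps would also show which blocks actually overlapped in time.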

Okay, I checked it out and:

  1. No matter the compute capability, one SM can handle up to 8 blocks.
  2. There are several restrictions, such as:
    the maximum number of threads per multiprocessor,
    the maximum number of registers per multiprocessor,
    the maximum amount of shared memory per multiprocessor (which is the same as the shared memory per block),
    and others…
  3. If, for example, 4 blocks each use 1/4 of those per-multiprocessor resources, they are executed concurrently; the SM “switches” between them during computation… (a small worked example follows at the end of this post).

So I think it’s efficient to use 8 * num_of_SM blocks if they can all be run at once…
I don’t know whether making even more blocks would be efficient…

The per-multiprocessor restrictions are in the Programming Guide v3.0, Appendix G…
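
As a small worked example of how those limits interact, using the compute capability 1.3 figures (1024 threads, 16384 registers, and 16 KB shared memory per SM, at most 8 blocks per SM); the per-block numbers are assumptions and allocation granularity is ignored:

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const int threadsPerBlock = 300;    // as in the earlier post
        const int regsPerThread   = 16;     // assumed
        const int smemPerBlock    = 4096;   // bytes, assumed

        int byThreads = 1024  / threadsPerBlock;                    // -> 3
        int bySmem    = 16384 / smemPerBlock;                       // -> 4
        int byRegs    = 16384 / (threadsPerBlock * regsPerThread);  // -> 3
        int blocksPerSM = std::min(std::min(byThreads, bySmem),
                                   std::min(byRegs, 8));            // hard cap of 8
        printf("Blocks resident per SM: %d\n", blocksPerSM);        // prints 3
        return 0;
    }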
