More blocks than SMs may not make sense

Hello Forum,

My application launches a kernel that basically runs its threads in parallel. They are not cooperating threads and do not call syncthreads(); they basically run independently until the program ends.

It seems specifying more thread blocks than there are SMs won’t help in this case, since there are no opportunities for the threads to wait on a sync. From debugging and reading, it appears that only one block can exist on one SM at a time.

Is this understanding correct?

Thanks for your help.

No, more than one block can run on a single SM at the same time. Whether this happens depends on how many registers and how much shared memory each block requires. If the SM has sufficient resources to run more than one block, it will.
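
If you want to check this for your own kernel, later CUDA toolkits (6.5 and up) expose the calculation directly through cudaOccupancyMaxActiveBlocksPerMultiprocessor. A minimal, untested sketch; the kernel, block size, and shared memory figures below are just placeholders:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(const float *in, float *out)   // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
    }

    int main()
    {
        int blocksPerSM = 0;
        const int blockSize = 256;      // threads per block (assumed)
        const size_t dynSmem = 0;       // dynamic shared memory per block (assumed)

        // Asks the runtime how many blocks of this kernel fit on one SM,
        // given the kernel's register and shared memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, dynSmem);
        printf("Blocks resident per SM: %d\n", blocksPerSM);
        return 0;
    }

The same number can also be worked out by hand from the per-SM limits in the programming guide.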

You usually want more blocks in general, for efficiency, robustness, and scaling.

First, if your blocks each take a variable amount of time, you’ll end up with idle SMs, since your kernel’s runtime is determined by the very slowest block in your kernel.
If you have many more blocks than SMs, then on Fermi your kernel’s runtime is instead determined by the AVERAGE runtime of all blocks (which is ideal).

Second, more blocks allow better scaling across various hardware. Perhaps a new GPU has more SMs or can run more blocks per SM… but you hardwired the block count to be lower than that number because you were assuming some other GPU as a reference. So you lose horsepower by leaving part of the GPU idle.
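
The SM count is easy to query at runtime, so there’s no need to hardwire it. Rough sketch; the residency target and block size are assumptions and the kernel is a placeholder:

    #include <cuda_runtime.h>

    // Placeholder kernel; stands in for whatever the real kernel does.
    __global__ void myKernel(float *data)
    {
        data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
    }

    void launchScaled(float *d_data)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);          // device 0 assumed
        int numSMs = prop.multiProcessorCount;      // e.g. 30 on the GT200 discussed below

        int blocksPerSM = 8;                        // assumed residency target
        dim3 grid(numSMs * blocksPerSM);            // scales with whatever GPU is present
        dim3 block(256);                            // assumed block size
        myKernel<<<grid, block>>>(d_data);
    }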

Third, don’t be scared of high block counts. The overhead of launching a new block is quite small. While I have not timed it, it’s likely on the order of tens of clock cycles, not millions.
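
If you want a rough number on your own card, you could launch an empty kernel with a large grid and divide the elapsed time by the block count. Untested sketch; the grid and block sizes are arbitrary:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void emptyKernel() {}

    int main()
    {
        const int numBlocks = 1 << 16;      // 65536 blocks (arbitrary)
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        emptyKernel<<<numBlocks, 32>>>();   // empty blocks: time is mostly scheduling overhead
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("~%.1f ns per block, launch overhead included\n", ms * 1e6f / numBlocks);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }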

Last, a very large (almost hypocritical) caveat: despite all I just posted, I actually don’t follow the above advice of “use lots of blocks” because of GT200 block-scheduling inefficiencies. Instead of letting the GPU do dynamic block assignment, I dynamically schedule my own work inside each block (using atomic queues). This isn’t necessary on Fermi.
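
For the curious, the idea is roughly this (a bare sketch of the pattern, not my actual code): launch about one block per SM and let each block pull work items off a global counter with atomicAdd until nothing is left.

    // Global work counter; reset to 0 before each launch (e.g. via cudaMemcpyToSymbol).
    __device__ unsigned int g_nextItem = 0;

    __global__ void persistentKernel(const float *in, float *out, unsigned int numItems)
    {
        // Each work item is blockDim.x consecutive elements; the arrays are
        // assumed to hold numItems * blockDim.x elements.
        while (true) {
            __shared__ unsigned int item;
            if (threadIdx.x == 0)
                item = atomicAdd(&g_nextItem, 1);   // one thread claims the next item
            __syncthreads();
            if (item >= numItems)
                break;                              // queue drained: the whole block exits

            unsigned int i = item * blockDim.x + threadIdx.x;
            out[i] = in[i] * 2.0f;                  // placeholder work

            __syncthreads();                        // don't overwrite `item` while still in use
        }
    }

    // Host side (assumed): reset g_nextItem, then e.g. persistentKernel<<<numSMs, 256>>>(...).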

According to http://synergy.cs.vt.edu/pubs/papers/feng-iscas2010-gpusync.pdf:

I have a question:

If one SM can handle one block at a time, what is the difference if our block uses all the shared memory or just 1/4 of it, for example…?

Imagine that we have 2 SMs and 4 blocks which each use all the shared memory.

Now the second situation: we also have 2 SMs and 4 blocks, each using, for example, 1/10 of the shared memory.

What will be the difference between those two situations?

If one SM can handle 1 block at a time, then there shouldn’t be any difference…?

Thanks for the comprehensive reply.

  • First, my blocks don’t vary. They all do the exact same thing, just on different input data.

  • I compute the number of threads that can run on an SM. In my case, based on register and shared memory use, it works out to about 600 per SM.

  • On my GPU the max threads per block is 512, so I should be able to run 2 blocks of 300 threads each on one SM simultaneously.

  • I multiply by the number of SMs, which is 30 for my GPU, so I have a grid size of 60 (30x2) and a block size of 300.

  • What appears to be happening is that only 30 blocks are active. I see this by attaching the NSIGHT debugger: I am only allowed to set the CUDA focus to blocks < 30, even though the grid dimension is reported as 60. I don’t know why, other than that these blocks are not “active”.

  • So it appears that half the blocks never get a chance to become active, because the SMs are busy with the first set and never free up. My best time comes when I simply set the number of blocks equal to the number of SMs, maximize the threads per block (512), and give up the other 88 threads per SM I might have used. Any other fiddling does nothing or makes it worse.

Since I can query the device for the number of SMs, that’s the number of blocks I launch (I don’t hardwire it), and that’s how I adapt to other GPUs.

I can’t comment on your particular situation, but be aware that attaching the debugger changes the execution configuration, and the debugger will not attach to all SMs.
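
One debugger-free way to see where blocks actually land is to have each block record the SM it ran on by reading the %smid special register. Sketch only (untested), but the register read itself is standard inline PTX:

    // Each block writes the id of the SM it executed on; run this with the full
    // grid (e.g. 60 blocks) and count how many distinct SM ids show up and how
    // many blocks map to each.
    __global__ void recordSMId(unsigned int *smids)
    {
        if (threadIdx.x == 0) {
            unsigned int smid;
            asm("mov.u32 %0, %%smid;" : "=r"(smid));
            smids[blockIdx.x] = smid;
        }
    }

Pairing this with per-block clock() timestamps would also show which blocks actually overlapped in time.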

Okay, I checked it out and:

  1. No matter the compute capability, one SM can handle up to 8 blocks.
  2. There are several restrictions, such as:
    the maximum number of threads per multiprocessor,
    the maximum number of registers per multiprocessor,
    the maximum amount of shared memory per multiprocessor (which is the same as the shared memory per block),
    and others…
  3. If, for example, 4 blocks each use 1/4 of those per-multiprocessor resources, they are executed concurrently; the SM “switches” between them during computation… (a small worked example follows at the end of this post).

So I think it’s efficient to use 8 * num_of_SM blocks if they can all be run at once…
I don’t know whether making even more blocks would be efficient…

The per-multiprocessor restrictions are in the Programming Guide v3.0, Appendix G…
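
As a small worked example of how those limits interact, using the compute capability 1.3 figures (1024 threads, 16384 registers, and 16 KB shared memory per SM, at most 8 blocks per SM); the per-block numbers are assumptions and allocation granularity is ignored:

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const int threadsPerBlock = 300;    // as in the earlier post
        const int regsPerThread   = 16;     // assumed
        const int smemPerBlock    = 4096;   // bytes, assumed

        int byThreads = 1024  / threadsPerBlock;                    // -> 3
        int bySmem    = 16384 / smemPerBlock;                       // -> 4
        int byRegs    = 16384 / (threadsPerBlock * regsPerThread);  // -> 3
        int blocksPerSM = std::min(std::min(byThreads, bySmem),
                                   std::min(byRegs, 8));            // hard cap of 8
        printf("Blocks resident per SM: %d\n", blocksPerSM);        // prints 3
        return 0;
    }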
