Number of thread blocks per SM

If we don’t take the shared memory limit per thread block into account, is there any difference between 4 thread blocks per SM with 128 threads per block and 2 thread blocks per SM with 256 threads per block? I don’t understand why many tutorials tell me to keep at least 4-8 thread blocks active per SM.

There is a small difference in the time when SMs finish one thread block and start another (for a large grid size with many blocks to compute). With more thread blocks per SM, the switch to the next block is smoother; with only 2 or even only 1 thread block per SM, the old thread block has to be fully processed (all warps have to exit) before a new block can be started. There can be a short period (at the end of the processing of a block) where the number of still-resident warps, and thus the occupancy, really gets low. This effect is more extreme if the warps of a block have vastly differing execution times, e.g. if only one warp saves back the result or does some other post-processing.
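As a minimal sketch of that extreme case (my own illustration, assuming a block size of 256; the actual work is just a stand-in), a kernel where only one warp does the post-processing could look like this:

```
__global__ void kernelWithTail(const float* data, float* result)
{
    __shared__ float partial[256];          // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    partial[tid] = data[gid] * 2.0f;        // stand-in for the real per-thread work
    __syncthreads();

    // Only warp 0 (threads 0..31) does the post-processing. The other
    // warps exit here, but the block keeps its shared memory allocated
    // until warp 0 has also finished - occupancy drops during this tail.
    if (tid < 32) {
        float sum = 0.0f;
        for (int i = tid; i < 256; i += 32)
            sum += partial[i];
        atomicAdd(result, sum);
    }
}
```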

So an SM actually starts a whole thread block each time, instead of a single warp? Does that mean that at the end of a device function there is an invisible __syncthreads()?

So, if an SM can operate on 4 thread blocks at a time, is it recommended to allocate at least 8 thread blocks per SM?

The warps have to be started with thread block granularity, as the threads may interact with each other via __syncthreads().

(In theory there may be kernels which use neither __syncthreads() nor shared memory, but then it is the task of the programmer to reduce the block size and increase the number of blocks, as without cooperation between warps there is no reason to keep the number of threads per block high.)
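For illustration, here is a minimal sketch (my own example, assuming a block size of 256) of the kind of inter-warp cooperation that forces whole-block scheduling: the warps of a block exchange data through shared memory, separated by __syncthreads(), so all of them must be resident on the same SM at the same time.

```
__global__ void blockSum(const float* in, float* out)
{
    __shared__ float buf[256];           // assumes blockDim.x == 256

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // every warp of the block must reach this point

    // tree reduction across the warps within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```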

No, there is no invisible __syncthreads() at the end of a device function. Warps and threads that have finished simply exit. But the unfinished block still needs its resources, especially shared memory, and frees them only when all of its threads have finished.

An SM can operate on 4 warps (not blocks) at a time (one per each of its 4 SM partitions). Those warps can be from the same block or from different blocks. On the other hand, if we assume that in the extreme case only one warp is still running from each block, it would make sense to have at least 8 thread blocks active at the same time.

In my experience, the number of thread blocks is not critical most of the time. You can work perfectly well with 1 or 2 thread blocks per SM, if you need that many threads per block or have limited shared memory.

Increasing the number of blocks per SM can improve performance for all kernels by a very small amount.

There could be a few kernels (see my comment regarding post-processing by a single warp) where it has more of an effect.

Another advantage of smaller block sizes is that the kernel is more flexible across different CUDA architectures. E.g. some architectures have a limit of 1024 resident threads per SM and some of 1536. A kernel using 1024 threads per block would fit only once into 1536, whereas a block size of 512 fits twice into 1024 and three times into 1536, improving occupancy.
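You can check this for a concrete kernel with the occupancy API of the CUDA runtime. A small sketch (myKernel and its body are just placeholders):

```
#include <cstdio>

__global__ void myKernel(float* data)
{
    // placeholder body; register and shared memory usage determine occupancy
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main()
{
    int blocks512 = 0, blocks1024 = 0;
    // How many blocks of each size fit on one SM for this kernel
    // (0 bytes of dynamic shared memory assumed)?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks512,  myKernel, 512,  0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks1024, myKernel, 1024, 0);
    printf("512 threads/block:  %d blocks/SM = %d threads/SM\n", blocks512,  blocks512  * 512);
    printf("1024 threads/block: %d blocks/SM = %d threads/SM\n", blocks1024, blocks1024 * 1024);
    return 0;
}
```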

Further advantages of smaller blocks could be:

  • that your problem size fits better as a multiple of a smaller block size
  • that CUDA is more flexible in distributing work to different SMs when they become free
  • that the grid size (number of blocks) gets larger and your kernel becomes more independent of the number of SMs - the tail effect is weaker (see the sketch after this list)
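A common pattern along these lines is the grid-stride loop; a minimal sketch (kernel name and launch configuration are purely illustrative):

```
// Grid-stride loop: the grid size is decoupled from the problem size,
// so CUDA can spread many small blocks over however many SMs exist.
__global__ void scale(float* x, int n, float a)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

// Example launch with many small blocks (numSMs queried via
// cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0)):
// scale<<<8 * numSMs, 128>>>(d_x, n, 2.0f);
```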

Thanks very much! Can I think of it this way: the shared memory occupied by a thread block can only be freed when every warp in the thread block has finished its calculation, but registers can be freed as soon as an individual warp has finished and exited?

Hi half-0, I am sure that shared memory can only be freed when the last warp of a block finishes. For the registers, I believe it is as you have written.

Thanks!
