The choice of grid size and block size

Hi everyone, I have a question about choosing the grid and block size. I have a general understanding of the topic, but when it comes to my program, I am not sure which solution is better.
My program needs 4096 threads in total. Because of the shared memory size limit, I use 256 blocks with 16 threads each; the 16 threads need 2 KB of shared memory, which is below the limit. Alternatively, I could combine the work of two blocks into one, giving 128 blocks with 32 threads each. I could even merge 16 blocks together (i.e., launch 16 blocks with 256 threads each), which is still a little under the shared memory limit.
I have also read that threads are grouped into warps of 32 and executed on a multiprocessor. Is this also important when choosing the block size?
Which configuration is best, or is it not that important? Many thanks!

Hi Markus,
you should avoid block sizes that are not divisible by 32. It is possible to use smaller block sizes, but you will lose computational resources, as they are allocated for whole warps of 32 threads.
You should also consider that a GPU can (depending on architecture) place a maximum of 16 to 32 blocks onto an SM. The maximum thread count per SM is around 1024 to 2048, i.e. 32 to 64 warps. A good number for hiding latencies is at least 8 warps per SM. You can freely choose between smaller or larger blocks, as long as the thread count is divisible by 32 and you can put enough warps on each SM.

What GPU are you using? E.g. the RTX 3070 has 46 SMs with 100 KiB of shared memory each, which is 4600 KiB for the whole GPU. Blocks that execute serially (one after another on the same SM) can reuse that memory. So if your application allows it, use a large grid and block size.

Hi Curefab,
thanks a lot for your answer. I am currently using an RTX 2080 Ti. So if I use 32 threads per block instead of 16, but keep the same total number of threads, can that be worse? For example, I had 256 blocks with 16 threads each; with 32 threads each, only 128 blocks are needed, and the task previously executed by 2×16 threads, possibly on two SMs, now runs as a single warp on one SM. Or is it different from what I think?

And my problem is that the GPU limits my program only through shared memory, which is why I now use more blocks (shared memory per block is limited). Of the other compute resources, my program actually utilizes just a small part.

On Turing compute capability 7.5 (for the RTX 2080 Ti) the maximum shared memory per thread block is the same as the maximum shared memory per SM. It is 64 KiB for each of the 68 SMs on the RTX 2080 Ti.

You write that 16 threads need 2 KB of shared memory => 128 bytes per thread. So each SM can have 512 active threads. That can be 16 blocks (the maximum on 7.5) of 32 threads, or 1 block of 512 threads per SM, or anything in between.

Each block runs on a single SM. But you are talking about 128 or 256 blocks, so the blocks will be distributed anyway. If you need threads running on a single SM (e.g. for cooperating through shared memory), you have to put all of them into one large block. If you want them distributed, provide enough threads and blocks. If your algorithm allows it, create at least tens of thousands of threads (number of blocks times number of threads per block).

Thanks a lot for your reply, I have understood that. But one small point remains about the ~10k threads in total: if I run my program with three streams, doing the same work on 3 different input data sets, and each uses about 3000 threads, is that then 9000 threads used in total?

Hi Markus,
yes, that is true. With streams, you can execute several kernels with different input data sets at the same time. They will also each allocate their part of the shared memory. But according to the calculation above, your shared memory is enough for 34816 active threads.

Just mentioning:

  • If you strive for maximum performance, you should also consider that part of the time the streams are occupied with asynchronous memory copies of the input and output data sets between their computations. Therefore the maximum occupancy will be reached slightly beyond 3 streams running kernels with 3000 threads each.

Another thing to optimize is how the threads in each warp behave:

  • They are slowed down if the program flow diverges or the memory access patterns are not ideal.

  • For global memory, you try to get a multiple of 4 sectors (32-byte aligned memory portions) accessed over the whole warp; for shared memory, you try to avoid bank conflicts (each thread should access a different bank at the same time).

If execution time is not critical, you do not have to fully optimize all those parameters.

I have understood. Thanks a lot again for your help!