Calculating the optimal grid and block size?


Is there a general algorithm/rule for calculating the optimal grid/block size for 2D image processing?

I’m doing image processing with a lot of pixel-wise algorithms.
For most of my applications it makes sense to create a grid/block from the given width and height.
What I assume is that there might be an optimal algorithm of how to spread the threads and blocks
using the information I get from the CUDA device info.
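For reference, what I mean by creating the grid/block from the given width and height is roughly the following (the 16×16 block shape is just a size I picked, not something derived from the device info):

```cpp
#include <cassert>

// Ceiling division: how many blocks of size `block` cover `n` elements.
unsigned int divUp(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Grid dimensions covering a width x height image with a fixed 2D block.
struct Grid2D { unsigned int x, y; };

Grid2D makeGrid(unsigned int width, unsigned int height,
                unsigned int blockX, unsigned int blockY) {
    return { divUp(width, blockX), divUp(height, blockY) };
}
```

The kernel then guards the padded threads with something like `if (x < width && y < height)`, since the grid may cover slightly more pixels than the image contains.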

How do you solve such things?


Hello, I am calculating 2D images (phase holograms) with pixel-wise operations only.
I arrange the data, the thread blocks, and the threads in 1D; I see no advantage to a 2D arrangement in my case.
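A minimal sketch of that 1D arrangement (the kernel and parameter names here are my own illustration, not code from this project): each thread handles one pixel of the flattened image, with a bounds check in case the pixel count is not a multiple of the block size.

```cuda
// Pixel-wise operation on a flattened width*height image, indexed in 1D.
__global__ void pixelKernel(float *image, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)                 // guard the tail of the image
        image[i] = image[i] * 2.0f;    // placeholder pixel-wise operation
}

// Launch for a 1024x768 image with 768 threads per block -> 1024 blocks:
// pixelKernel<<<1024, 768>>>(d_image, 1024 * 768);
```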

When determining the number of threads per block, and thereby the number of blocks, you should be aware of the device limits:
Maximum number of resident blocks per multiprocessor
Maximum number of resident warps per multiprocessor: 48 in my case
Maximum number of resident threads per multiprocessor: 1536 in my case (=48*32)

I am calculating an image with 1024x768 pixels, and I am making thread blocks of 768 threads each, so I will have 1024 thread blocks.
Two thread blocks will be executed concurrently on each multiprocessor, 1536 threads overall, so I have full occupancy.
If the number of threads per block is higher than 768, for example 1024,
only one block will be executed concurrently on each multiprocessor, giving an occupancy of 1024/1536 ≈ 67%.
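The arithmetic above generalizes; here is a small sketch using my device's limits (1536 resident threads and 8 resident blocks per multiprocessor; those numbers are assumptions that differ on other devices, and this ignores register and shared-memory limits):

```cpp
#include <cassert>
#include <algorithm>

// Thread-count occupancy of one multiprocessor, ignoring register and
// shared-memory pressure. Default limits assume a compute 2.x device.
double occupancy(int threadsPerBlock,
                 int maxThreadsPerSM = 1536, int maxBlocksPerSM = 8)
{
    // How many whole blocks fit, capped by the resident-block limit.
    int blocksPerSM = std::min(maxBlocksPerSM,
                               maxThreadsPerSM / threadsPerBlock);
    return double(blocksPerSM * threadsPerBlock) / maxThreadsPerSM;
}
```

For 768 threads per block this gives 1.0 (two resident blocks); for 1024 it gives 1024/1536, matching the numbers above. Note that very small blocks also hurt: the 8-block cap limits, say, 128-thread blocks to 1024 resident threads.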

Maximizing the threads per block can pay off if you are loading constant global parameters into the shared memory of each block. More threads per block means fewer
thread blocks, and thereby fewer loads from global to shared memory.
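A sketch of that pattern (the parameter array and its size are my assumption, not the poster's actual code): each block cooperatively copies the shared parameters from global memory once, so a grid with fewer blocks repeats this copy fewer times.

```cuda
#define NUM_PARAMS 32   // assumed number of shared parameters

__global__ void pixelKernelWithParams(float *image, const float *params,
                                      int numPixels)
{
    __shared__ float sParams[NUM_PARAMS];

    // One copy of the parameters per block: the fewer blocks in the grid,
    // the fewer times this global-to-shared load is performed overall.
    if (threadIdx.x < NUM_PARAMS)
        sParams[threadIdx.x] = params[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        image[i] *= sParams[i % NUM_PARAMS];  // placeholder use of the parameters
}
```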