The division of work that CUDA imposes into blocks makes sense, since it reflects the hardware (a number of execution threads within a single execution unit, all in the same “block”).
However, as I’m looking at implementing image-processing algorithms, it’s not entirely clear to me why I should have 2D grids of blocks, each being a 2D grid of threads. Why wouldn’t 1D do? After all, the kernel usually just sees the image as a linear 1D array of pixels anyway, and has to compute its global index with the usual row * width + column arithmetic.
One guess I have is spatial locality. We usually compute something for a pixel based on the pixels around it, so a 2D grid of threads would ensure that adjacent pixels run within the same execution unit and can thus share shared memory, etc. Is this correct? Is there anything else I am missing? Maybe ease of programming somehow (although that’s hard to believe, since the code computes a 1D offset anyway)?
Thanks in advance