What's the best way to calculate the number of blocks for any input size?

Hi,

I am doing some image manipulation and trying to optimise the occupancy for my kernel. By default I use a 16x16x1 shape for the threads per block, and then for the total number of blocks per grid I map each thread to a pixel and calculate the number of blocks needed to cover the entire image. E.g.:

dim3 threadsPerBlock(16,16,1);
dim3 blocksPerGrid((image_width/threadsPerBlock.x) + 1, (image_height/threadsPerBlock.y) + 1, 1);

Is this enough to get good throughput for any image size, or am I missing something when it comes to occupancy?

Thanks,

Your approach is sensible and should not be performance-disadvantaged in any dramatic way (more than 2x).

If I were concerned about performance, I personally would never use (16,16,1) for a threadblock shape. I prefer to have the first number be 32, or some positive whole-number multiple of 32.
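Something like this, for example (the y dimension here is just an illustrative choice, and the ceiling division is the usual way to avoid launching an extra block when the image size happens to be an exact multiple of the block size):

dim3 threadsPerBlock(32, 8, 1);   // first dimension a multiple of 32; 256 threads total
dim3 blocksPerGrid((image_width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (image_height + threadsPerBlock.y - 1) / threadsPerBlock.y,
                   1);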

Beyond that, CUDA has occupancy calculator APIs that can answer this question.
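For example, cudaOccupancyMaxPotentialBlockSize will suggest a total threads-per-block count that maximizes occupancy for a particular kernel. A minimal sketch (my_kernel is just a placeholder name for your image kernel):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(unsigned char *img, int width, int height) { /* ... */ }

int main()
{
    int minGridSize = 0, blockSize = 0;
    // blockSize is the suggested total threads per block for maximum occupancy;
    // minGridSize is the minimum number of blocks needed to fully occupy the GPU
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);
    printf("suggested block size: %d, minimum grid size: %d\n", blockSize, minGridSize);
    return 0;
}

The returned block size is one-dimensional; you could reshape it into a 2D shape such as (32, blockSize/32, 1), assuming it comes back as a multiple of 32, which it typically does.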

Coupled with that, a grid-stride loop can be used to decouple grid sizing from problem size, once you have settled on a threadblock shape.
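A 2D grid-stride version of a per-pixel kernel might look something like this (hypothetical kernel; the pixel inversion is just a placeholder operation). You can then launch it with whatever block and grid sizes the occupancy API suggests, independent of the image dimensions:

__global__ void process_image(unsigned char *img, int width, int height)
{
    // grid-stride loops in both dimensions: each thread covers multiple pixels
    // whenever the grid is smaller than the image
    for (int y = blockIdx.y * blockDim.y + threadIdx.y; y < height; y += gridDim.y * blockDim.y)
        for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < width; x += gridDim.x * blockDim.x)
            img[y * width + x] = 255 - img[y * width + x];   // placeholder per-pixel op
}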

Hi, thanks for the reply. Can I ask why specifically 32 (or a positive multiple of it) for the first parameter? I know that you want more than 32 total threads so that the GPU can do latency hiding. But it was my understanding that as long as the total number of threads (16x16=256) is more than 32 and also a multiple of 32, it should be fine. Is there anything special about the first parameter being 32?

Thanks for the resources on the occupancy APIs and grid-stride loops. I hadn’t seen anything about grid-stride loops before.

This is to some degree a personal preference of mine. Probably not terribly important on modern GPUs, in most cases. It comes down to the organization of work in a 2D setting, when the structure of the block is overlaid on the data set (or the “work”).

Let’s consider a simple example, a 32x32 data set. To drive the point home, let’s consider a data set consisting of byte quantities. The 16x16 block structure would be like this:

b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4

We see that effectively this data set is broken into 4 quadrants, for a 16x16 block, where each block is responsible for one quadrant (this assumes canonical index generation, i.e. int idx = threadIdx.x+blockDim.x*blockIdx.x and int idy = threadIdx.y+blockDim.y*blockIdx.y).

Now suppose that the first step in my computation is for each thread to load its respective value from global memory. For block b1, the first warp will load the first half of the first row of data, and the first half of the second row of data. For our 1-byte element case, that is 16 bytes starting at address offset 0 and 16 bytes starting at address offset 32. Is that, by definition, a fully or perfectly coalesced load? It is not.

Is that important? Maybe, but probably not in many/most cases. If we posit 32-bit elements instead of 8-bit elements, and we also posit a “modern” GPU like Pascal or newer, whose L1 cache is broken into four 32-byte sectors per cacheline, then the load inefficiency probably goes away: each half-row segment the warp loads is now 16 x 4 = 64 bytes, which exactly fills two sectors, so nothing extra is fetched. It is still not a “perfectly coalesced” load in terms of what I consider canonical CUDA teaching (all threads in the warp loading adjacent elements, or elements that fall within the same L1 cacheline), but I think the advancement in GPU architecture makes the point moot; it still achieves 100% efficiency by the definition of bytes requested/bytes retrieved.
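To put numbers on that, here is a small host-side sketch (the names are mine; it just mimics the lane-to-pixel mapping of warp 0 in block b1 and the 32-byte sector granularity described above), comparing bytes requested against bytes retrieved for the 1-byte and 4-byte cases:

#include <cstdio>
#include <set>

// Bytes requested vs. bytes retrieved for warp 0 of block b1 (block (0,0)) of a
// 16x16 threadblock, assuming the canonical index mapping above, a row-major
// image 'width' elements wide, and 32-byte sector granularity.
void report(int elem_bytes, int width)
{
    std::set<int> sectors;                    // distinct 32-byte sectors touched
    int requested = 0;
    for (int lane = 0; lane < 32; ++lane) {
        int tx = lane % 16;                   // threadIdx.x (lanes are assigned x-fastest)
        int ty = lane / 16;                   // threadIdx.y (0 or 1 for warp 0)
        int base = (ty * width + tx) * elem_bytes;
        for (int b = 0; b < elem_bytes; ++b)
            sectors.insert((base + b) / 32);
        requested += elem_bytes;
    }
    int retrieved = 32 * (int)sectors.size();
    printf("%d-byte elements: %d bytes requested, %d retrieved (%.0f%% efficiency)\n",
           elem_bytes, requested, retrieved, 100.0 * requested / retrieved);
}

int main()
{
    report(1, 32);   // 32 requested, 64 retrieved  -> 50%
    report(4, 32);   // 128 requested, 128 retrieved -> 100%
    return 0;
}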

Another benefit for me personally is that I usually like to be able to study a piece of code statically (meaning, inspect it with my eyes) and quickly come to the conclusion as to whether or not it will “nicely coalesce” (I have a fairly simple method to do this, which I teach when I am teaching CUDA). In my opinion, for me personally, that is a bit easier to do if there is not this breakpoint in the middle of each warp.

Even in the 1-byte-per-element case, we could observe that the load by the first warp in b1 would pull in some elements needed by the first warp in b2. The GPU caches will tend to provide some benefit in this case, depending on the locality of block execution. Certainly the L2 cache can provide some benefit, even if b1 and b2 happen to be on different SMs, as long as they execute close enough to each other temporally. If b1 and b2 happen to be on the same SM, then they can potentially benefit from the L1 as well.

It’s not a first-level consideration, just something that appeals to the thought processes I formed back in the Fermi/Kepler days, which is why I said:

Your choice of a 16x16 block shape is a common one, and should give good results in many/most situations.

I appreciate the insight and the amount of detail associated with it. I am still very new to CUDA, and I picked 16x16 mainly because it looks nice and 256 is a positive multiple of 32. But I do see why 32 is more suitable as a general default, where you might be working with data types that don’t fill a cache line (or sector) with only 16 elements per row.