This is to some degree a personal preference of mine. Probably not terribly important on modern GPUs, in most cases. It comes down to the organization of work in a 2D setting, when the structure of the block is overlaid on the data set (or the “work”).
Let’s consider a simple example, a 32x32 data set. To drive the point home, let’s consider a data set consisting of byte quantities. Overlaying the 16x16 block structure on it would look like this:
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2 b2
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b3 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4 b4
We see that effectively this data set is broken into 4 quadrants, for a 16x16 block, where each block is responsible for one quadrant (this assumes canonical index generation, i.e. int idx = threadIdx.x+blockDim.x*blockIdx.x; and int idy = threadIdx.y+blockDim.y*blockIdx.y;).
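For concreteness, here is a minimal sketch of that indexing pattern; the kernel name, the copy operation, and the pointer names are my own illustration, not from any particular code:

```
// Hypothetical kernel: each thread handles one element of a 32x32 byte
// data set, using the canonical 2D index generation described above.
__global__ void process(const unsigned char *in, unsigned char *out,
                        int width, int height)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;  // column in the data set
    int idy = threadIdx.y + blockDim.y * blockIdx.y;  // row in the data set
    if (idx < width && idy < height)
        out[idy * width + idx] = in[idy * width + idx];  // load/store one element
}

// launch for the 32x32 example: dim3 block(16,16); dim3 grid(2,2);
// process<<<grid, block>>>(d_in, d_out, 32, 32);
```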
Now suppose that the first step in my computation is for each thread to load its respective value from global memory. For block b1, the first warp will load the first half of the first row of data and the first half of the second row of data. For our 1-byte element case, that is 16 bytes starting at address offset 0 and 16 bytes starting at address offset 32. Is that, by definition, a fully or perfectly coalesced load? It is not (a perfectly coalesced 32-thread load of 1-byte quantities would touch 32 contiguous bytes).
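To see where those two 16-byte segments come from, here is a small host-side sketch (a hypothetical check of my own, assuming the standard row-major lane numbering within a 16x16 block and a 32-byte row pitch) that enumerates the byte offset each lane of the first warp touches:

```
#include <cstdio>

int main()
{
    const int width = 32;      // row pitch in bytes (1-byte elements)
    const int bdimx = 16;      // blockDim.x for a 16x16 block
    for (int lane = 0; lane < 32; lane++) {
        int tx = lane % bdimx; // threadIdx.x for this lane
        int ty = lane / bdimx; // threadIdx.y for this lane
        printf("lane %2d -> byte offset %d\n", lane, ty * width + tx);
    }
    return 0;
}
// prints offsets 0..15 for lanes 0-15 and 32..47 for lanes 16-31:
// two separate 16-byte segments, not one contiguous 32-byte load
```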
Is that important? Maybe, but probably not in many/most cases. If we posit 32-bit elements instead of 8-bit elements, and we also posit a “modern” GPU like Pascal or newer, which has its L1 cache broken into four 32-byte sectors per cacheline, then the load inefficiency probably goes away. It is still not a “perfectly coalesced” load in terms of what I consider to be canonical CUDA teaching (all threads in the warp loading adjacent elements, or elements that fall within the same L1 cacheline), but I think the advancement in GPU architecture makes the point moot; the load still achieves 100% efficiency by the measure of bytes requested vs. bytes retrieved.
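The arithmetic behind that claim can be sketched the same way; this hypothetical snippet (again my own, not from the original discussion) counts the distinct 32-byte sectors the first warp touches when the elements are 4 bytes wide:

```
#include <cstdio>
#include <set>

int main()
{
    const int width = 32;        // elements per row
    const int esize = 4;         // bytes per element (32-bit case)
    const int bdimx = 16;        // blockDim.x for a 16x16 block
    std::set<int> sectors;
    for (int lane = 0; lane < 32; lane++) {
        int tx = lane % bdimx;
        int ty = lane / bdimx;
        int byte = (ty * width + tx) * esize;
        sectors.insert(byte / 32);  // which 32-byte sector this load lands in
    }
    // 32 threads x 4 bytes = 128 bytes requested;
    // 4 sectors x 32 bytes = 128 bytes retrieved -> 100% efficiency
    printf("sectors touched: %zu (%zu bytes retrieved)\n",
           sectors.size(), sectors.size() * 32);
    return 0;
}
```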
Another benefit, for me personally, is that I usually like to be able to study a piece of code statically (meaning, inspect it with my eyes) and quickly come to a conclusion as to whether or not it will “nicely coalesce” (I have a fairly simple method for this, which I teach when I am teaching CUDA). In my opinion, that is a bit easier to do when there is not this breakpoint in the middle of each warp.
Even in the 1-byte-per-element case, we could observe that the load of the first warp in b1 would load some elements needed by the first warp in b2. The GPU caches will tend to provide some benefit in this case, depending on locality of block execution. Certainly the L2 cache can provide some benefit, even if b1 and b2 happen to be on different SMs, as long as they execute close enough to each other temporally. If b1 and b2 happen to be on the same SM, then they can potentially benefit from the L1 as well.
It’s not a first-level consideration, just something that appeals to the thought processes I formed back in the Fermi/Kepler days. Which is why I said:
“Your choice of 16x16 block shape is a commonly used choice, and should give good results in many/most situations.”