Number of Threads vs Number of Blocks in GPU Kernel

I have tried various forums and searched everywhere but haven’t got all my answers clarified.

How to decide how many number of blocks vs threads in a block.

Now, I did read that limiting factors are number of registers, number of blocks per SM, etc

So, in order to avoid confusion let me create a simple example.

global
void add(int *a, int *b,int *c,int n)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}

So objective is to calculate for a 1024 bit vector so, last case is c[1023] = a[1023] + b[1023]

Now, let me define 2 scenarios:-

1. Case1: When there are 1024 threads in a block and hence there is only one block

2. Case2: When there are 32 blocks and 32 threads

Now, as we can see from the example that there is no register limitation or any kind of limitation of memory for which the blocks can be constrained.

Let’s make some assumptions (gtx 980)

1. Total SM = 16
2. Max Warps per SM = 64
3. Max threads blocks per SM = 32
4. Max thread block size = 1024
5. SPs per SM = 128
6. Assuming the registers and shared memory are quite higher so that there is no limitation

So my question is as follows:-

Question: Which one is going to better utilize the GPU Case1 or Case2 and be more time efficient. Can we estimate this without using the profiler i.e. mathematically or by some logic?

i.e. In case of case1 a single block will be scheduled to one out of 16 SM, which will be fully loaded.
Now, since there are 128 SP, we can’t run them all together atomically, the warps will run concurrently. But not all together in one go. (only 128 can run together in on go i.e. 4 warps)

In case 2, let’s assume the 32 blocks are spread across 16SM, assuming 2 blocks to each SM then will it be not better than case1 since more atomic level parallelism.

I am new to CUDA and hence have more questions to follow, but all the questions are mainly related to this fundamental qs.

Thanks
Any help is highly appreciated!

case 2 is better

however for real world codes that make efficient use of the GPU, you will generally need 10,000 threads or more, and you won’t be wondering about whether you should put all the threads in one block or not.

For cases where you have a small number of threads overall, you might want to stop and rethink your life, and analyze how you got to this point. If the only thing you wanted to do was a 1024-element vector add, you should immediately stop, forget CUDA and GPUs, and write some decent quality CPU code to get it done. If you really want to do it on the GPU, your first step is to look for algorithms that may increase the exposed parallelism, and/or move more of your workflow to the GPU.

There are some situations where this does naturally occur, for example as part of a larger overall workflow on the GPU where one particular operation does not lend itself to a large number of threads. In these situations where you have a small number of threads, the first strategy is usually to provide at least one warp per GPU SM. After that evenly distribute warps across the SMs (increase block size) until you reach saturation (2048 threads * # of SMs, as a first order approximation). After that, you have a thread complement now that can hope to saturate the GPU, and remaining optimization strategies may diverge, depending on what your code is doing, and whether it is memory or compute bound.

Hello txbob,

Thank you very much for the clarification.

1. Of course I would not use it for 1024 threads only.
2. Now, since I know that case2 will be better. Let’s talk with at least 10,000 threads as you suggsted.

The followup question is as follows:-

Consider the case where I have 3 kernels

2. Subtract Function
3. Multiplication function

The kernels are as follows:-

global
void add(int *a, int *b, int *c, int n)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}

2. Subtract
global
void sub(int *a, int *b, int *c, int n)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] - b[index];
}

3. Multiplication
global
void mul(int *a, int *b, int *c, int n)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] * b[index];
}

Now, let me call the kernels using different streams, so that I can utilize the HyperQ for the high concurrency of kernels

I would introduce 2 scenarios for 2 questions

Scenario 1:
Execution flow as per the number

2. sub<<<32,256,stream2>>>(d_a,d_b,d_c,8192)
3. mul<<<1,1024,stream3>>>(d_a,d_b,d_c,1024)

Here I have kept the kernels and memory in such a way that it does not create a limitation, so what we are giving is not further limited by the device.

Scenario 2:

2. sub<<<16,512,stream2>>>(d_a,d_b,d_c,8192)
3. mul<<<1,1024,stream3>>>(d_a,d_b,d_c,1024)

Now the questions are as follows:-

Q1. Now, since the kernels are being called concurrently, which scenario is better without profiling. I mean can we tell by any mathematical means.Keeping more blocks provides better chance of
kernel mixing or keeping more threads in a block does any benefit.

Can we comment something based on IPC where we take care of individual units of a SM i.e. SP, LD/ST & SFU. This way we might be able to take care of individual units also. This will give the max utilization of GPU SM. So, keeping more threads per block or more blocks per thread shall help.
And if we want to understand the IPC level for the code, how should be we do it as there is no more DECUDA, it’s obsolete now.

Q2. In every profiling app they say it’s per SM. Does it mean all the SM’s are loaded same way. Why I ask this is because for example, we could fill all the SMs evenly with these 2 kernels i.e. add & subtract, but there will be this multiplication case which needs to be in one of the SMs. Now this might cause unevenness.
Hence, profiler telling blocks per SM ( as each SM is same) is incomplete definition.

Sorry for keeping it long, but I couldn’t express my doubts otherwise.
Thanks

I would doubt there would be much difference between your two scenarios. Scenario 1 might be favored on very large GPUs, but I doubt it would matter much on a GTX980. A Volta V100 on the other hand has 80 SMs (*) so even your case of 32 blocks per kernel would not put at least 1 block on each SM in a V100. IPC is something you can measure. If you want to analyze it, you’d have to look at the SASS. And doing all that in the concurrent kernels scenario would be quite complex. Furthermore, CUDA doesn’t provide any guarantees about the scheduling order of blocks or the scheduling order of concurrent kernels (the blocks from concurrent kernels).

The SMs are not necessarily all loaded the same way, but they should have similar loading for a single kernel. Again, CUDA provides no guarantees for the scheduling order of blocks, or concurrent kernels. The profiler can certainly give unexpected results when your work is very small, eg. a kernel of a single block.

The profiler telling you stuff based on a per SM statement is certainly an incomplete statement. The profiler is not guaranteed to do exhaustive profiling. Some metrics are composed of spot or sample-based measurements (e.g. the measurements on a single SM) and then extrapolated as necessary. It does not guarantee precise measurement.

(*) I’m used to talking about 10K-20K threads being enough to keep a GPU busy. A GTX980 with 16 SMs has an instantaneous capacity of 32K threads. A V100 has an instantaneous capacity of about 160K threads, so I will need to gradually adjust my frames of reference.

Thanks txbob.

I found a wonderful article, this is the exact thing I wanted. It explains the scheduling algorithm of a Fermi Architecture. As a result based on such a thing, we can plan on the number of blocks vs threads.

https://users.ices.utexas.edu/~sreepai/fermi-tbs/
javascript:void();