I have tried various forums and searched everywhere, but I still haven't had all my questions answered.
How do I decide on the number of blocks versus the number of threads per block?
I have read that the limiting factors include the number of registers, the number of blocks per SM, etc.
So, in order to avoid confusion let me create a simple example.
__global__
void add(int *a, int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}
The objective is to add two 1024-element vectors, so the last element computed is c[1023] = a[1023] + b[1023].
Now, let me define 2 scenarios:

Case 1: 1024 threads in a block, and hence only one block
add<<<1,1024>>>(d_a,d_b,d_c,1024);
Case 2: 32 blocks of 32 threads each
add<<<32,32>>>(d_a,d_b,d_c,1024);
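As a rough sketch, the two cases could also be compared empirically with CUDA events instead of the profiler. The helpers `check`, `ceil_div` and `time_config`, the input values, and the repetition count are assumptions made for illustration; the kernel is the one from the question (with `const` added to the inputs).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void add(const int *a, const int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Hypothetical helper: abort with a message if a CUDA call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Number of blocks needed to cover n elements with `threads` threads per block.
static int ceil_div(int n, int threads) { return (n + threads - 1) / threads; }

// Hypothetical helper: average time per launch in milliseconds over `reps` launches.
static float time_config(int threads, const int *d_a, const int *d_b, int *d_c,
                         int n, int reps)
{
    const int blocks = ceil_div(n, threads);
    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate");
    check(cudaEventCreate(&stop), "cudaEventCreate");

    add<<<blocks, threads>>>(d_a, d_b, d_c, n);      // warm-up launch, not timed
    check(cudaGetLastError(), "warm-up launch");
    check(cudaDeviceSynchronize(), "warm-up sync");

    check(cudaEventRecord(start), "cudaEventRecord");
    for (int i = 0; i < reps; ++i)
        add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    check(cudaEventRecord(stop), "cudaEventRecord");
    check(cudaEventSynchronize(stop), "cudaEventSynchronize");

    float ms = 0.0f;
    check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    std::vector<int> h_a(n), h_b(n), h_c(n);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc a");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc b");
    check(cudaMalloc(&d_c, bytes), "cudaMalloc c");
    check(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    const int reps = 1000;                            // arbitrary repetition count
    float case1 = time_config(1024, d_a, d_b, d_c, n, reps);  // add<<<1,1024>>>
    float case2 = time_config(32,   d_a, d_b, d_c, n, reps);  // add<<<32,32>>>

    // Sanity check on the result of the last launch: c[i] should equal 3*i.
    check(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost), "copy c");
    for (int i = 0; i < n; ++i)
        if (h_c[i] != 3 * i) { std::fprintf(stderr, "mismatch at %d\n", i); return 1; }

    std::printf("Case 1 (1 x 1024): %f ms per launch\n", case1);
    std::printf("Case 2 (32 x 32) : %f ms per launch\n", case2);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

With only 1024 elements, the measured time is likely to be dominated by launch overhead, so any difference between the two cases may be small; a much larger n would probably be needed to see a clear effect.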
As we can see from the example, the kernel is simple enough that neither registers nor memory should limit how many blocks can be resident on an SM.
Let's make some assumptions (GTX 980):
 Total SMs = 16
 Max warps per SM = 64
 Max thread blocks per SM = 32
 Max thread block size = 1024
 SPs per SM = 128
 Registers and shared memory are assumed to be large enough that they are not a limiting factor
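Most of the limits assumed above can be checked on the actual card with `cudaGetDeviceProperties`. This is a minimal sketch that assumes the GPU of interest is device 0; the max warps per SM is derived here as max threads per SM divided by the warp size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device                : %s\n", prop.name);
    std::printf("Total SMs             : %d\n", prop.multiProcessorCount);
    std::printf("Warp size             : %d\n", prop.warpSize);
    std::printf("Max warps per SM      : %d\n",
                prop.maxThreadsPerMultiProcessor / prop.warpSize);
    std::printf("Max thread block size : %d\n", prop.maxThreadsPerBlock);
    std::printf("Registers per SM      : %d\n", prop.regsPerMultiprocessor);
    std::printf("Shared memory per SM  : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```

The number of SPs per SM is not a field of `cudaDeviceProp`, so that value still has to come from the architecture documentation.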
So my question is as follows:
Question: Which one will utilize the GPU better and be more time-efficient, Case 1 or Case 2? Can we estimate this without using the profiler, i.e. mathematically or by some reasoning?
For example, in Case 1 the single block will be scheduled on one of the 16 SMs, which will carry the entire load.
Since there are 128 SPs per SM, the block's 32 warps can't all execute simultaneously. They are resident concurrently, but they don't all run in one go (only 128 threads, i.e. 4 warps, can execute at once).
In Case 2, let's assume the 32 blocks are spread across the 16 SMs, with 2 blocks on each SM. Won't that be better than Case 1, since more threads can execute truly in parallel?
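The reasoning above can be put into numbers with a small host-only estimate. This is a sketch under the limits listed in the question, and it assumes that blocks are spread evenly across the SMs, which the hardware scheduler does not guarantee. The names `Limits`, `Estimate` and `estimate` are made up for illustration, and the code ignores registers, shared memory and memory latency entirely.

```cuda
#include <algorithm>
#include <cstdio>

// Assumed GTX 980 limits, copied from the list in the question.
struct Limits {
    int sms = 16;
    int max_warps_per_sm = 64;
    int max_blocks_per_sm = 32;
    int warp_size = 32;
};

struct Estimate {
    int warps_per_block;
    int busy_sms;              // SMs that receive at least one block
    int warps_on_busiest_sm;   // resident warps on the most loaded SM
    double occupancy;          // resident warps / max warps on that SM
};

// Assumes 1 <= threads_per_block <= 1024 and an even spread of blocks over SMs.
Estimate estimate(int blocks, int threads_per_block, const Limits &lim = Limits())
{
    Estimate e;
    e.warps_per_block = (threads_per_block + lim.warp_size - 1) / lim.warp_size;
    e.busy_sms = std::min(blocks, lim.sms);

    // Blocks that land on the most loaded SM under the even-spread assumption.
    int blocks_on_busiest = (blocks + lim.sms - 1) / lim.sms;

    // An SM cannot hold more blocks than its block limit or its warp limit allows.
    int max_resident_blocks = std::min(lim.max_blocks_per_sm,
                                       lim.max_warps_per_sm / e.warps_per_block);
    blocks_on_busiest = std::min(blocks_on_busiest, max_resident_blocks);

    e.warps_on_busiest_sm = blocks_on_busiest * e.warps_per_block;
    e.occupancy = static_cast<double>(e.warps_on_busiest_sm) / lim.max_warps_per_sm;
    return e;
}

int main()
{
    const Estimate c1 = estimate(1, 1024);   // Case 1: add<<<1,1024>>>
    const Estimate c2 = estimate(32, 32);    // Case 2: add<<<32,32>>>
    std::printf("Case 1: %d busy SM(s), %d warps on busiest SM, occupancy %.3f\n",
                c1.busy_sms, c1.warps_on_busiest_sm, c1.occupancy);
    std::printf("Case 2: %d busy SM(s), %d warps on busiest SM, occupancy %.3f\n",
                c2.busy_sms, c2.warps_on_busiest_sm, c2.occupancy);
    return 0;
}
```

Under these assumptions, Case 1 puts 32 warps on a single SM (half of its 64-warp limit) and leaves the other 15 SMs idle, while Case 2 uses all 16 SMs with only 2 warps each. Whether that translates into a shorter run time for such a small vector is a separate matter that this estimate does not settle.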
I am new to CUDA and will have more questions to follow, but they all relate to this fundamental question.
Thanks
Any help is highly appreciated!