# Doubt regarding grid size and block size

Hi!

I’ve built an app that takes the number of threads per block as input. The app launches a kernel in which every thread does some computation. Basically, the original MATLAB code is a function with a for loop from 1 to 1013, where each iteration does some computation. So when I converted it to CUDA, I conceived the grid size and block size this way:

numOfThreads = 256 (for instance)

The problem is that I only use 4 of the 16 SMs (I’m using a Fermi card). Given that I’m using 256 threads per block, how can I take advantage of all the SMs in this case (if possible)? Have I misconceived the whole problem when converting it to CUDA?

Some pseudo-code:

```
Nblock = 1014

% MATLAB
for tt = 1:Nblock-1
    % ... do something ...
end

// CUDA (Nblock defined as a constant or passed as a kernel argument)
__global__ void mykernel() {
    int tt = blockIdx.x * blockDim.x + threadIdx.x;
    if (tt < Nblock - 1) {
        // ... do something ...
    }
}
```

You can’t, because a block is never split between SMs. With 128 threads per block and 1013 total threads, however, you would get 8 blocks, and with 64 threads per block you would get 16 blocks, enough to cover 16 SMs.
I don’t know of any current Fermi-class card that actually offers 16 SMs, though, so (assuming all threads take roughly the same time to finish) you are probably just as well off with either 64 or 128 threads per block. Only launching multiple independent kernels concurrently would make optimal use of all SMs.

The strategy to use depends heavily on the `/* do something… */` part.
If the operation is trivial, the GPU will be wildly inefficient, since data transfer time will dominate the compute… you’re better off keeping it in MATLAB.

But if `/* do something… */` is meaty, you can start to break it down. It depends on what that operation is. For example, if the work is to invert a 32x32 matrix, then you might change your strategy… not giving each THREAD some work but instead giving each BLOCK (or warp) work that its threads can cooperate on. A warp is a nice size for working on a 32x32 matrix, for example.

Alternatively, you can increase the amount of work the GPU does simultaneously, especially if you have staged data to process… perhaps you process these 1013 work elements, but you also have other work to do. Then it’s fine to use only a few SMs for this job, since the other SMs can be busy with an independent, concurrently running kernel (one of the great features of Fermi!).

Thanks a lot! I’ll find some solutions!
