Grid dimension decisions: how to decide on the organization of a grid

Hi,

I want to start with CUDA, but I have a question:

  • what is the optimal way to declare a grid? I went through some code, and in one place the declaration was

    dim3 GridDim(4,2,1);

    and in other code it was

    dim3 GridDim(8,1,1);

So my question is:

- how can one decide whether it is more beneficial to organize a grid in one, two, or three dimensions?

Please help me.

Thanks,

Kundan

Depends on what you want to solve and HOW you want to solve it.
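For instance (just a sketch, with made-up kernels): one-dimensional data such as a vector maps naturally onto a 1D grid, while per-pixel work on an image maps naturally onto a 2D grid. Unspecified dim3 components default to 1, which is why you never need to pass 0.

    // 1D data (a vector of n floats): a 1D grid is the natural fit.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // 2D data (a width x height image): a 2D grid is the natural fit.
    __global__ void brighten(unsigned char *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] += 10;
    }

    // Matching launches:
    // vec_add<<< (n + 255) / 256, 256 >>>(a, b, c, n);
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // brighten<<< grid, block >>>(img, width, height);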

Hello,
I’m writing a CUDA kernel that will receive a two-dimensional array of pixels with dimensions Width and Height.

I’d like to be able to index the array using “array[idx.x][idx.y]”. How can I specify block_size and n_blocks so that I can call my CUDA kernel?

I’m just learning CUDA now, so I am only familiar with calling my kernel with

my_kernel <<< n_blocks, block_size >>> (my_parameters)

I appreciate any help you can provide.

Thank you.

David

Hello David,

The grid size and block size are hardware dependent, i.e. they depend on the device used. For best results, gridDim should be the total number of multiprocessors in the device, and blockDim should be 128, 256, or 512… 256 works fine for any type of device…

This is wrong information.

A minimum of 192 threads (256 is better) must be active per multiprocessor. Multiply this by the number of multiprocessors in the hardware; you need at least that many threads to cover your hardware latencies.

You can still have blockDim.x as 64 and a bigger gridDim.x, provided your CUDA occupancy allows a minimum of 3 active blocks (3 * 64 = 192).

Usually a larger number of blocks (10,000 or 20,000) gives higher performance; of course, the blocks should have something meaningful and sizeable to do.

Read the manual for more details…
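To make that bottom line concrete, here is a worked example with made-up numbers (not a recipe):

    // Hypothetical device with 16 multiprocessors (MPs).
    // Requirement from above: at least 192 active threads per MP.
    //
    //   blockDim.x = 64, occupancy allows 3 active blocks per MP
    //   => 3 * 64 = 192 active threads per MP   (bottom line met)
    //
    //   Device-wide: 16 MPs * 192 = 3072 active threads minimum.
    //
    // Launching far more blocks than can be active at once (say 10,000)
    // keeps every MP fed and scales to larger devices unchanged.
    dim3 block(64);
    dim3 grid(10000);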

Hi Sarnath,

I think what I am trying to make him understand is:

Efficiency does not come from using blocks of maximal size. If you use small blocks, the code will run just fine on a larger device (running several blocks per multiprocessor), so there is no reason for the runtime to automate this.

However, for certain algorithms, the more shared memory and the more registers you have per block, the better (a good example is matrix multiply). Also, gmem accesses may be slightly faster if there is one block per multiprocessor (and the accesses are perfectly optimized).

For your type of algorithm, I don’t think having a maximally large block or using a hundred registers has any advantage. A good configuration is: use blocks with 128 threads, 32 registers per thread, and up to 4KB shared memory per block. Launch at least a few thousand of these. This will let your code scale nicely across all devices, including future ones, and is a good balance of occupancy and resource usage on current ones.
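Applied to David’s Width x Height image, that configuration might look like the sketch below (the 16 x 8 block shape is just one way of getting 128 threads per block; my_kernel and my_parameters are his placeholders from above):

    // 128 threads per block, arranged 16 x 8 so the grid tiles the image.
    dim3 block(16, 8);
    dim3 grid((Width + block.x - 1) / block.x,    // round up so the grid
              (Height + block.y - 1) / block.y);  // covers the whole image
    my_kernel<<< grid, block >>>(my_parameters);

    // Inside my_kernel, recover the pixel coordinates and index the
    // image as a flattened 1D array:
    //   int x = blockIdx.x * blockDim.x + threadIdx.x;
    //   int y = blockIdx.y * blockDim.y + threadIdx.y;
    //   if (x < Width && y < Height)
    //       /* work on array[y * Width + x] */;

The bounds check matters because the rounded-up grid usually launches a few threads past the image edge.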

Thanks

Kundan

You had mentioned:

"

"

This is wrong. Grid size and block size are chosen by programmers. It all depends on how the parallel programmer chooses to write his kernel and how he decomposes the data among blocks. (the hardware factor has a role… read on…)

Also note that your statement allows 128 threads per block and a multiprocessor-count number of blocks. In that configuration you will suffer from register latencies, and probably also from global memory latencies. That’s why I had to correct it.

The hardware factor does have a role to play. I am not ruling that out.

One needs to work out the “CUDA occupancy” of the kernel and make sure that the following bottom line is met:

  1. At run time, each multiprocessor should have at least 192 ACTIVE threads (256 is better) to hide latencies.

One just needs to work against this bottom line, adjusting block and grid size, to get optimal performance on any given hardware.

This does NOT impose any condition on your block size. People have reported best performance even with 64 threads per block. The number of “active” threads is what matters…

Active threads are the threads that a multiprocessor executes simultaneously at any given point in time.

Active threads = Active blocks * blockDim

Active blocks are determined by the program’s resource usage (use the CUDA occupancy calculator to figure this out).
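Newer CUDA toolkits also expose this calculation as a runtime call (a sketch, assuming the my_kernel from above; older toolkits ship the occupancy calculator spreadsheet instead):

    // Ask the runtime how many blocks of a given size can be active per
    // multiprocessor, then apply: active threads = active blocks * blockDim.
    int blockSize = 128;
    int activeBlocksPerMP = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocksPerMP, my_kernel, blockSize, 0 /* dynamic smem */);

    int activeThreadsPerMP = activeBlocksPerMP * blockSize;
    // Bottom line from above: activeThreadsPerMP should be at least 192
    // (256 is better) to hide the latencies.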