The grid size and block size are hardware dependent, i.e. they depend on the device used. As a rule of thumb, gridDim should be the total number of multiprocessors in the device, and blockDim should be 128, 256, or 512; 256 works fine on any type of device.
A minimum of 192 threads (256 is better) must be active per multiprocessor. Multiply this by the number of multiprocessors in the hardware; you need at least that many threads to cover the hardware's latencies.
You can still use a blockDim.x of 64 with a bigger gridDim.x, as long as your CUDA occupancy allows a minimum of 3 active blocks (3 * 64 = 192).
Usually a larger number of blocks (10,000 or 20,000) gives higher performance; of course, each block should have something meaningful and sizeable to do.
Not exactly. Efficiency does not come from using blocks of maximal size. If you use small blocks, they will run just fine on a larger device (which runs several blocks per multiprocessor), so there is no reason for the runtime to automate this.
However, for certain algorithms, the more shared memory and the more registers you have per block, the better (matrix multiply is a good example). Also, global memory accesses may be slightly faster when there is one block per multiprocessor (and the accesses are perfectly optimized).
For your type of algorithm, I don’t think a maximally large block or a hundred registers gives any advantage. A good configuration is: blocks of 128 threads, 32 registers per thread, and up to 4 KB of shared memory per block. Launch at least a few thousand such blocks. This lets your code scale nicely across all devices, including future ones, and is a good balance of occupancy and resource usage on current ones.
This is wrong. Grid size and block size are chosen by the programmer. It all depends on how the parallel programmer writes the kernel and decomposes the data among blocks. (The hardware factor does play a role; read on.)
Also note that your statement allows 128 threads per block with one block per multiprocessor. In that configuration you would suffer from register latencies and probably global memory latencies as well. That is why I had to correct it.
The hardware factor does have a role to play. I am not ruling that out.
One needs to work out the kernel’s “CUDA occupancy” and make sure that the following bottom line is met:
At run time, each multiprocessor should have at least 192 active threads (256 is best) to hide latencies.
One just needs to work against this bottom line, adjusting block and grid size, to get optimal performance on any given hardware.
This does NOT impose any condition on your block size. People have reported best performance even with 64 threads per block. “Active” threads are what matter…
Active threads are the threads that the multiprocessor executes simultaneously at any given point in time.
Active threads = Active blocks * blockDim
The number of active blocks is determined by the program’s resource usage (use the CUDA occupancy calculator to figure it out).