blockdim value added

I’m a newbie to CUDA and I’m trying to figure out the value added by specifying block dimensions, compared to just using the value 1. Let’s say the code below runs on a given GPU. It works fine, no complaints. If I add block dimensions, will it run faster? If so, how do I know I have optimized and found the fastest possible block dimensions for the given application?

dim3 mygriddim(HEIGHT,WIDTH);

hook<<<mygriddim,1>>>(myarray);

in hook:

int idx = (blockIdx.y*gridDim.x)+blockIdx.x;

You need to read up on warps. CUDA runs threads in groups of 32, even if you specify just 1 (in that case, 31 of the 32 lanes in the warp are just wasting time). On a Fermi compute capability 2.0 GPU that means you are wasting 31 out of every 32 cores (compute 2.1 is a bit more complex to explain, but you would be doing even worse). 64 threads per block is the real minimum if you know exactly what you are doing; a general rule of thumb is around 256 threads for compute 1.x and 384 for compute 2.x.
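As a rough sketch of what that means for your launch (this is untested and assumes myarray is a flat array of HEIGHT*WIDTH floats and that hook does some per-element work; the kernel body here is just a placeholder for whatever yours does), you would move the parallelism from the grid into the block and compute one global index per thread:

```
// Hypothetical rework: 256 threads per block instead of 1.
#define HEIGHT 512
#define WIDTH  512
#define THREADS_PER_BLOCK 256

__global__ void hook(float *myarray)
{
    // One global index per thread; each block now covers 256 elements.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < HEIGHT * WIDTH)       // guard: the last block may be partly empty
        myarray[idx] *= 2.0f;       // placeholder for your per-element work
}

// Launch: enough blocks to cover all HEIGHT*WIDTH elements.
int n = HEIGHT * WIDTH;
dim3 mygriddim((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
hook<<<mygriddim, THREADS_PER_BLOCK>>>(myarray);
```

With 1 thread per block you launch HEIGHT*WIDTH blocks that each occupy a whole 32-lane warp; here the same work fits in far fewer blocks with every warp lane doing useful work.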

Play around with the size of the block and see for yourself.