optimization questions

1.) Does an application with more threads necessarily run faster? For example, let’s say I have a blockDim of 16x16 = 256 threads and a blockDim of 16x32 = 512 threads. Will the latter blockDim run faster since there are twice as many threads?

2.) Should I worry about the multiplication and addition the GPU has to do to calculate each and every idx value? I look at computations like this for example…

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

I’m thinking it involves two multiplication operations and two addition operations. Should I put work into ensuring that as few computations as possible are needed to generate the index I use for a given algorithm?

1.) Not necessarily. In some cases it is faster to have more, smaller blocks resident per SM, since how many blocks fit on an SM depends on the registers and shared memory each block uses.

2.) Each of these computations compiles to a single integer multiply-add machine instruction (IMAD), so there is little point in trying to optimize them.
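To make both points a bit more concrete, here is a rough sketch rather than a definitive recipe. The kernel `addOne` and the candidate block sizes are made up for illustration. It uses the runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` call (available in newer CUDA toolkits) to ask how many blocks of each size can be resident on one SM. The numbers depend on your GPU and on the kernel's register and shared memory use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: adds 1 to every element of a 2D array.
__global__ void addOne(float* data, int width, int height)
{
    // Each index is one multiply plus one add, which the compiler
    // typically fuses into a single integer multiply-add.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        data[y * width + x] += 1.0f;
    }
}

int main()
{
    // Candidate threads-per-block values (assumed, not from the thread).
    const int blockSizes[] = {128, 256, 512};

    for (int threads : blockSizes) {
        int blocksPerSM = 0;
        // Ask the runtime how many blocks of this size fit on one SM,
        // assuming no dynamic shared memory.
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, addOne, threads, 0);
        if (err != cudaSuccess) {
            printf("%d threads/block: %s\n", threads, cudaGetErrorString(err));
            continue;
        }
        printf("%d threads/block: %d resident blocks/SM (%d threads/SM)\n",
               threads, blocksPerSM, blocksPerSM * threads);
    }
    return 0;
}
```

If the smaller block size gives the same or a higher number of resident threads per SM, the larger block has no built-in advantage, and only a timing run will tell which one is faster for your kernel.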


I have code which runs best with 128 threads per block. It depends on the problem. I recommend defining the block dimensions in the code in such a way that you can change them by editing just one or two values; that way you can time the kernel with different block sizes.
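Something along these lines is what I mean. It is only a sketch: the kernel `scale2d`, the helper `timeLaunch`, the array size and the list of block shapes are all placeholders, and you would swap in your own kernel. Timing is done with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel used only for timing: scales every element of a 2D array.
__global__ void scale2d(float* data, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        data[y * width + x] *= factor;
    }
}

// Launches scale2d with the given block shape and returns the elapsed
// time in milliseconds, or a negative value if the launch failed.
float timeLaunch(float* d_data, int width, int height, dim3 block)
{
    // Round the grid up so the whole array is covered.
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not measured.
    scale2d<<<grid, block>>>(d_data, width, height, 1.0f);

    cudaEventRecord(start);
    scale2d<<<grid, block>>>(d_data, width, height, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int width = 4096, height = 4096;  // assumed problem size
    float* d_data = nullptr;
    cudaMalloc(&d_data, sizeof(float) * width * height);
    cudaMemset(d_data, 0, sizeof(float) * width * height);

    // The only place the block dimensions need to be edited.
    const dim3 candidates[] = { dim3(8, 8), dim3(16, 8), dim3(16, 16), dim3(16, 32) };

    for (const dim3& block : candidates) {
        float ms = timeLaunch(d_data, width, height, block);
        printf("block %ux%u (%u threads): %.3f ms\n",
               block.x, block.y, block.x * block.y, ms);
    }

    cudaFree(d_data);
    return 0;
}
```

A single timed launch is noisy, so in practice you would average over several runs before picking a block size.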