1.) Does an application with more threads per block necessarily run faster? For example, let’s say I have one blockDim of 16x16 = 256 threads and another of 16x32 = 512 threads. Will the latter run faster since there are twice as many threads in each block?
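
To make the comparison concrete, here is a minimal sketch of how the two block shapes could be timed against each other. The kernel `scale`, the helper `time_scale`, and the image size are placeholders made up for illustration, not code from a real application. Note that for a fixed image size the grid shrinks as the block grows, so the total number of threads launched stays roughly the same; only the grouping into blocks changes.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: scales every element of a width x height image.
__global__ void scale(float *data, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)           // guard partial blocks at the edges
        data[y * width + x] *= factor;
}

// Launches `scale` with the given block shape and returns the elapsed time
// of one launch in milliseconds, or a negative value if a CUDA call fails.
float time_scale(dim3 block, int width, int height)
{
    float *d_data = nullptr;
    size_t bytes = (size_t)width * height * sizeof(float);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, bytes);

    // Round up so the grid covers the whole image for any block shape.
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scale<<<grid, block>>>(d_data, width, height, 2.0f);   // warm-up launch
    cudaEventRecord(start);
    scale<<<grid, block>>>(d_data, width, height, 2.0f);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ok ? ms : -1.0f;
}
```

Calling `time_scale(dim3(16, 16), 4096, 4096)` and `time_scale(dim3(16, 32), 4096, 4096)` from `main` and printing both results would give one data point for each configuration. A single timed launch is noisy, so averaging several runs would be more trustworthy.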

2.) Should I worry about the multiplications and additions the GPU has to perform to calculate each and every index value? Consider computations like this, for example:

int x = blockIdx.x * blockDim.x + threadIdx.x;

int y = blockIdx.y * blockDim.y + threadIdx.y;

As I see it, these two lines involve two multiplications and two additions per thread. Should I put work into minimizing the number of computations required to generate the index I use for a given algorithm?
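
For reference, here is a minimal sketch of the kind of kernel I have in mind, so the operation count can be seen in context. The names `write_index` and `check_write_index` are hypothetical, and the 16x16 block shape is just an assumption carried over from question 1. Each thread computes its coordinates once, folds them into a row-major linear index, and performs a single global-memory store.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes its 2D coordinates once and writes its own
// row-major linear index to the output array.
__global__ void write_index(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // one multiply, one add
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // one multiply, one add
    if (x < width && y < height) {
        int idx = y * width + x;   // one more multiply and add for the flat index
        out[idx] = idx;            // single global-memory store per thread
    }
}

// Runs the kernel on a width x height array and returns true if every
// element ends up holding its own linear index.
bool check_write_index(int width, int height)
{
    int n = width * height;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, (size_t)n * sizeof(int)) != cudaSuccess) return false;

    dim3 block(16, 16);
    // Round up so sizes that are not multiples of 16 are still covered.
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    write_index<<<grid, block>>>(d_out, width, height);

    std::vector<int> h_out(n, -1);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, (size_t)n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != i) return false;
    return true;
}
```

Counting the flat index, that is three multiplications and three additions of integer arithmetic per thread before the one memory access, which is the overhead I am asking about.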