Choosing the best grid/block dimensions

Hi guys! I know it's a difficult question because it depends on how the kernel is implemented, but I wanted to know: which line of reasoning should I follow?
For example I’ve tested a simple kernel like this one:

// Copies the overlapping part of linearized square matrix B into A.
// A_size and B_size are the leading dimensions of A and B.
__global__ void copy(float* A, int A_size, float* B, int B_size)    // dim B != dim A
{
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        int bd  = blockDim.x;
        int gd  = gridDim.x;

        if (B_size < A_size) {
                while (bid < B_size) {           // grid-stride loop over rows
                        while (tid < B_size) {   // block-stride loop over columns
                                A[tid + bid * A_size] = B[tid + bid * B_size];
                                tid += bd;
                        }
                        tid = threadIdx.x;
                        bid += gd;
                }
        }
        else {
                while (bid < A_size) {
                        while (tid < A_size) {
                                A[tid + bid * A_size] = B[tid + bid * B_size];
                                tid += bd;
                        }
                        tid = threadIdx.x;
                        bid += gd;
                }
        }
}
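For reference, a minimal host-side driver for this kernel might look like the sketch below. The sizes and launch configuration are just the ones mentioned in this thread, not a recommendation; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Same kernel as above, reproduced so this sketch is self-contained.
__global__ void copy(float* A, int A_size, float* B, int B_size)
{
        int size = (B_size < A_size) ? B_size : A_size;  // overlap region
        for (int row = blockIdx.x; row < size; row += gridDim.x)
                for (int col = threadIdx.x; col < size; col += blockDim.x)
                        A[col + row * A_size] = B[col + row * B_size];
}

int main()
{
        const int A_size = 100, B_size = 100;  // 100x100 matrices, as in the post
        float *A, *B;
        cudaMalloc(&A, A_size * A_size * sizeof(float));
        cudaMalloc(&B, B_size * B_size * sizeof(float));

        // 313 blocks of 32 threads: the faster configuration tested above.
        copy<<<313, 32>>>(A, A_size, B, B_size);
        cudaDeviceSynchronize();

        cudaFree(A);
        cudaFree(B);
        return 0;
}
```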

It simply copies one (linearized) matrix into another one of a different dimension. This kernel is launched many times, on the order of 10^6, with 100×100 matrices (10^4 elements), and if I launch it with 313 blocks and 1024 threads I get

whereas if I launch it with 313 blocks and 32 threads it is much faster


Sorry, the images didn't load: in the first case it took about 35 s, in the second one about 20 s.
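Rather than guessing at block sizes, you can ask the runtime: `cudaOccupancyMaxPotentialBlockSize` suggests an occupancy-maximizing block size for a given kernel, and CUDA events give reliable kernel timings. A sketch (the kernel body here is a minimal stand-in with the same signature as the one in this thread, so the example is self-contained):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Minimal stand-in kernel with the same signature as copy() above.
__global__ void copy(float* A, int A_size, float* B, int B_size)
{
        int size = (B_size < A_size) ? B_size : A_size;
        for (int row = blockIdx.x; row < size; row += gridDim.x)
                for (int col = threadIdx.x; col < size; col += blockDim.x)
                        A[col + row * A_size] = B[col + row * B_size];
}

int main()
{
        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for an occupancy-maximizing block size for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, copy, 0, 0);
        printf("suggested block size: %d (min grid size: %d)\n", blockSize, minGridSize);

        const int A_size = 100, B_size = 100;
        float *A, *B;
        cudaMalloc(&A, A_size * A_size * sizeof(float));
        cudaMalloc(&B, B_size * B_size * sizeof(float));

        // Time one launch with CUDA events.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        copy<<<minGridSize, blockSize>>>(A, A_size, B, B_size);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(A);
        cudaFree(B);
        return 0;
}
```

Note that with rows of only 100 elements, a 1024-thread block leaves most of its threads with nothing to copy, which is consistent with the 32-thread configuration being faster here.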

  1. size = min(A_size, B_size) will shorten your code :)
  2. CUDA comes with a profiler, so look at its output
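Applying suggestion 1 to the kernel in this thread: taking the bound as `min(A_size, B_size)` makes the two branches identical, so they collapse into a single pair of loops with the same behavior as the original. A sketch:

```cuda
__global__ void copy(float* A, int A_size, float* B, int B_size)
{
        // The copied region is the overlap of the two matrices, so one
        // bound covers both the B_size < A_size and the B_size >= A_size case.
        int size = min(A_size, B_size);

        for (int row = blockIdx.x; row < size; row += gridDim.x)          // grid-stride over rows
                for (int col = threadIdx.x; col < size; col += blockDim.x) // block-stride over columns
                        A[col + row * A_size] = B[col + row * B_size];
}
```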

Thanks! Sorry for the late reply, but I didn't receive the notification.