Hi guys! I know this is a difficult question because it depends on how the kernel is implemented, but I wanted to know: which line of reasoning should I follow?
For example, I’ve tested a simple kernel like this one:
__global__ void copy(float* A, int A_size, float* B, int B_size) { // dim B != dim A
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bd  = blockDim.x;
    int gd  = gridDim.x;
    if (B_size < A_size) {
        // grid-stride loop over the rows of the smaller matrix B
        while (bid < B_size) {
            // block-stride loop over the elements of one row
            while (tid < B_size) {
                A[tid + bid * A_size] = B[tid + bid * B_size];
                tid += bd;
            }
            tid = threadIdx.x; // reset column index for the next row
            bid += gd;
        }
    }
    else {
        // same pattern, but iterating over the dimension of A
        while (bid < A_size) {
            while (tid < A_size) {
                A[tid + bid * A_size] = B[tid + bid * B_size];
                tid += bd;
            }
            tid = threadIdx.x;
            bid += gd;
        }
    }
}
It simply copies a linearized matrix into another one of a different dimension. This kernel is called many times, on the order of 10^6, with matrices of dimension 100 (10^4 elements). If I launch it with 313 blocks and 1024 threads per block, it is much slower than if I launch it with 313 blocks and 32 threads per block.