I have attempted to optimise my algorithm by changing my 'base transformation' from 1D thread granularity to 2D thread granularity (I am working on a matrix).
Doing this has changed the timing of my kernel from 3 ms to 16 ms (on a 2000x1000 matrix), and I can't understand why. My co-worker has told me it is due to thread overhead, but I have tried to inform him that thread overhead on a GPU is negligible (am I correct?).
My original kernel code (the 3 ms version) went like this:

for (eq = 0; eq < neq; eq++)
    if (eq != pivot && tid < (neq + ncons + 1))
        ((float*)((char*)matrix + eq*pitch))[tid] =
            ((float*)((char*)matrix + eq*pitch))[tid]
            - (rowFactorColumn[eq] / pivotFactor) * pr[tid];
where tid is my thread index, and matrix, rowFactorColumn and pr are in global memory.
The 2D equivalent that I wrote (which works correctly) takes 16 ms, and I don't know why:

int idx = bx*bdx + x;
int idy = by*bdy + y;
if (idx < (neq + ncons + 1) && idy < neq && idy != pivot)
    ((float*)((char*)matrix + idy*pitch))[idx] =
        ((float*)((char*)matrix + idy*pitch))[idx]
        - (rowFactorColumn[idy] / pivotFactor) * pr[idx];
Both versions do their operations correctly; the only difference is that the 2D version takes 13 ms extra.
Can anyone tell me why?