I have attempted to optimise my algorithm by changing my ‘base transformation’ from 1D thread granularity to 2D thread granularity (I am working on a matrix).

Doing this has changed the timing of my kernel from 3ms to 16ms (on a 2000x1000 matrix). I can't understand why. My co-worker has told me that it is due to the thread overhead, but I have tried to inform him that thread overhead is negligible (am I correct?).

My original function went like this (3ms version):

```
for (eq = 0; eq < neq; eq++)
    if (eq != pivot && tid < (neq + ncons + 1)) {
        float *row = (float *)((char *)matrix + eq * pitch);  // pitched row pointer
        row[tid] -= (rowFactorColumn[eq] / pivotFactor) * pr[tid];
    }
```

where tid is my thread index, and ‘matrix’, ‘rowFactorColumn’ and ‘pr’ are in global memory.

The 2D equivalent that I wrote (it works correctly) takes 16ms. I don't know why:

```
int idx = bx * bdx + x;  // column index
int idy = by * bdy + y;  // row (equation) index
if (idx < (neq + ncons + 1) && idy < neq && idy != pivot) {
    float *row = (float *)((char *)matrix + idy * pitch);
    row[idx] -= (rowFactorColumn[idy] / pivotFactor) * pr[idx];
}
```

Both versions produce the correct result; the only difference is that the 2D version takes 13ms longer.

Can anyone tell me why?