Need help with looping in a Gauss elimination implementation

I am implementing the Gauss elimination method in CUDA, and for that I need to execute three nested loops:
for (k = 0; k < n; k++)
{
    for (i = k + 1; i < n; i++)
    {
        pivot = coeff[i][k] / coeff[k][k];
        for (j = k; j < n + 1; j++)
            coeff[i][j] = coeff[i][j] - pivot * coeff[k][j];
    }
}
What I am able to do so far is make the first element of all the rows zero. After that, it becomes hard to start the loop again on the inner matrix: when I use a variable in the kernel, every thread increments it, so the variable becomes useless.
I would really appreciate some help.
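
For reference, one common way around the shared-counter problem is to keep the k loop on the host and pass k to the kernel as a plain argument, so that no thread ever has to increment anything. Below is a minimal sketch of that idea, not a tuned implementation. It assumes the augmented matrix is stored as a flat row-major float array of size n x (n+1), that every coeff[k][k] is nonzero (no row swapping), and it leaves out most error checking. The names computeFactors, eliminateStep, gaussEliminate and d_factor are made up for illustration. The multipliers are stored in a separate array first so that no thread reads coeff[i][k] while another thread is overwriting it.

```cuda
#include <cuda_runtime.h>

// Step 1 for a given k: one thread per row i > k stores the multiplier
// coeff[i][k] / coeff[k][k] in a separate array (hypothetical name: factor).
__global__ void computeFactors(const float* a, float* factor, int n, int k)
{
    int cols = n + 1;  // augmented matrix, row-major
    int i = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        factor[i] = a[i * cols + k] / a[k * cols + k];
}

// Step 2 for the same k: one thread per entry (i, j) with i > k and j >= k.
// k is a read-only kernel argument, so threads never modify a loop counter.
__global__ void eliminateStep(float* a, const float* factor, int n, int k)
{
    int cols = n + 1;
    int i = k + 1 + blockIdx.y * blockDim.y + threadIdx.y;  // row below the pivot row
    int j = k + blockIdx.x * blockDim.x + threadIdx.x;      // column from k to n
    if (i < n && j < cols)
        a[i * cols + j] -= factor[i] * a[k * cols + j];
}

// Hypothetical host wrapper: forward elimination in place on h_a.
// Most error checking is omitted; only the final copy's status is returned.
cudaError_t gaussEliminate(float* h_a, int n)
{
    int cols = n + 1;
    size_t bytes = (size_t)n * cols * sizeof(float);
    float* d_a = nullptr;
    float* d_factor = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_factor, n * sizeof(float));
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    // The k loop stays on the host. The last value k = n - 1 has no rows
    // below it, so the loop stops at n - 2.
    for (int k = 0; k < n - 1; ++k) {
        int rows = n - (k + 1);   // rows below the pivot row
        int width = cols - k;     // columns k..n
        computeFactors<<<(rows + 255) / 256, 256>>>(d_a, d_factor, n, k);
        dim3 grid((width + block.x - 1) / block.x,
                  (rows + block.y - 1) / block.y);
        // Launches on the same stream run in order, so the factors are
        // complete before this kernel starts.
        eliminateStep<<<grid, block>>>(d_a, d_factor, n, k);
    }

    cudaError_t err = cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_factor);
    return err;
}
```

Calling gaussEliminate(a, n) on the host array leaves the upper-triangular system in a, ready for back substitution. Launching two small kernels per k is simple but not the fastest option; it is only meant to show how the outer loop can be restarted for each inner submatrix without a counter inside the kernel.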