What is the proper way to convert this C loop to a GPU kernel?

CPU version:

for (i = 1; i <= imax; i++)
    for (j = 1; j <= jmax; j++)
        for (k = 1; k <= kmax; k++)
        {
            m = i*ns + j*nc + k;   /* flattened 3D index; ns and nc are the i and j strides */
            temp = (w[m] - w[m-ns])/dltx + (w[m+ns] - w[m])/dltx;
            w[m] = w[m] - delt*temp;
        }

Should I use shared memory? Or should I first copy w to w_old and read from w_old to compute the new w?
I tried both and neither worked…
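
For reference, here is a minimal sketch of the w_old variant in plain C, to show what that version computes. Several things are assumptions that the loop above does not state: w holds doubles, the array has one ghost cell on each side (so nc = kmax + 2 and ns = (jmax + 2) * nc), and the names grid_size and step_two_buffers are made up for illustration. Because every iteration reads only w_old and writes only its own w[m], the iterations are independent, so each (i, j, k) could in principle be handled by its own GPU thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout (not given in the post): one ghost cell on every side. */
static size_t grid_size(int imax, int jmax, int kmax)
{
    return (size_t)(imax + 2) * (size_t)(jmax + 2) * (size_t)(kmax + 2);
}

/* One update step that reads only from w_old and writes only to w.
   No iteration depends on a value written by another iteration. */
static void step_two_buffers(double *w, const double *w_old,
                             int imax, int jmax, int kmax,
                             double dltx, double delt)
{
    const int nc = kmax + 2;          /* assumed stride in j */
    const int ns = (jmax + 2) * nc;   /* assumed stride in i */

    assert(w != w_old);               /* the two buffers must be distinct */

    for (int i = 1; i <= imax; i++)
        for (int j = 1; j <= jmax; j++)
            for (int k = 1; k <= kmax; k++) {
                const int m = i * ns + j * nc + k;
                const double temp = (w_old[m] - w_old[m - ns]) / dltx
                                  + (w_old[m + ns] - w_old[m]) / dltx;
                w[m] = w_old[m] - delt * temp;
            }
}
```

Note that this does not give the same numbers as the in-place CPU loop. There, i ascends in the outermost loop, so w[m-ns] has already been updated when w[m] is computed, while w[m+ns] still holds the old value. A mismatch against the CPU output is therefore expected with w_old, and may be one reason the attempt looked wrong.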

Any help or comments are welcome. Thanks in advance.