What is the proper way to transform this C loop to run on the GPU?

CPU version:

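for (i = 1; i <= imax; i++) {
    for (j = 1; j <= jmax; j++) {
        for (k = 1; k <= kmax; k++) {
            m = i*ns + j*nc + k;        /* linear index of cell (i, j, k) */
            w[m] = w[m] - delt*temp;    /* in-place update of w */
        }
    }
}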

Should I use shared memory? Or should I first copy w to w_old and read from w_old to compute the new w?
I tried both and neither worked…
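
To make the question concrete, here is a minimal plain-C sketch of the per-cell update that a single GPU thread would perform, one thread per (i, j, k) cell. It rests on assumptions that are not confirmed by the loop above: the index is taken to be m = i*ns + j*nc + k with ns and nc as the i- and j-strides, temp is treated as a plain scalar that does not read other entries of w, and the names update_cell, update_forward and update_reverse are hypothetical. Under those assumptions each cell reads and writes only its own w[m], so the visiting order should not matter, which the reverse-order loop is meant to illustrate.

```c
#include <assert.h>
#include <stddef.h>

/* Work for one cell (i, j, k): what a single GPU thread would do.
   Assumption: ns and nc are the i- and j-strides, temp is a scalar. */
static void update_cell(double *w, int i, int j, int k,
                        int ns, int nc, double delt, double temp)
{
    int m = i*ns + j*nc + k;      /* linear index of cell (i, j, k) */
    w[m] = w[m] - delt*temp;      /* touches only w[m] */
}

/* Visit the cells in the same order as the original CPU loop. */
void update_forward(double *w, int imax, int jmax, int kmax,
                    int ns, int nc, double delt, double temp)
{
    assert(w != NULL);
    for (int i = 1; i <= imax; i++)
        for (int j = 1; j <= jmax; j++)
            for (int k = 1; k <= kmax; k++)
                update_cell(w, i, j, k, ns, nc, delt, temp);
}

/* Visit the same cells in reverse order, standing in for the
   unspecified order in which GPU threads may run. */
void update_reverse(double *w, int imax, int jmax, int kmax,
                    int ns, int nc, double delt, double temp)
{
    assert(w != NULL);
    for (int i = imax; i >= 1; i--)
        for (int j = jmax; j >= 1; j--)
            for (int k = kmax; k >= 1; k--)
                update_cell(w, i, j, k, ns, nc, delt, temp);
}
```

If both orders give the same w, the update has no dependence between cells, and in that case neither shared memory nor a w_old copy would seem to be required for correctness. If temp actually depends on neighbouring values of w, that conclusion no longer holds and reading from a separate w_old copy would be the safer choice.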

Any help or comments are welcome. Thanks in advance.