Hi guys, I have an issue with reordering data according to a given sequence (`id`). The code is as follows:

```
attributes(global) subroutine d_APPLYLINEARID_PTC(id,x,y,u,v,w,N)
  implicit none
  integer(kind=4),value :: N
  integer(kind=4),dimension (N,2) :: id
  real(kind=4),dimension (N,2) :: x,y,u,v,w
  integer(kind=4) :: i,j,r_id
  real(kind=4) :: r_x,r_y,r_u,r_v,r_w
  ! One thread per element: i is the row (1..N), j is the column (1..2)
  i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
  j = blockIdx%y
  ! Read phase: load the source element id(i,j) into registers
  if(i <= N .and. j <= 2)then
    r_id = id(i,j)
    r_x = x(r_id,j); r_y = y(r_id,j)
    r_u = u(r_id,j); r_v = v(r_id,j); r_w = w(r_id,j)
  end if
  ! Fence intended to separate the read phase from the write phase
  call threadfence_system()
  ! Write phase: store the register values back in place at (i,j)
  if(i <= N .and. j <= 2)then
    x(i,j) = r_x; y(i,j) = r_y
    u(i,j) = r_u; v(i,j) = r_v; w(i,j) = r_w
  end if
  return
end subroutine d_APPLYLINEARID_PTC
```

(Here the block size is (128, 1, 1) threads and the grid size is (ceiling(N/128), 2, 1) blocks; N is in the millions or larger.) When I verify the results of this kernel, I find that the reordering is not fully correct: most of the results are right, but some entries are wrong. I do use `threadfence_system` to wait until all the data has been loaded into registers, but the result is still incorrect. What crucial point am I missing?
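
To make the expected result explicit, here is a minimal serial sketch of what I want the kernel to compute for each array. The module name `apply_id_ref_mod`, the routine name `apply_id_reference`, and the temporary copy `tmp` are only for illustration and are not part of my actual code; I am assuming the intended result is `a_new(i,j) = a_old(id(i,j),j)`.

```fortran
module apply_id_ref_mod
  implicit none
contains
  ! Hypothetical host-side reference (illustration only):
  ! reorders a(:,j) so that a_new(i,j) = a_old(id(i,j),j).
  subroutine apply_id_reference(id, a, N)
    integer(kind=4), intent(in)    :: N
    integer(kind=4), intent(in)    :: id(N,2)
    real(kind=4),    intent(inout) :: a(N,2)
    real(kind=4), allocatable :: tmp(:,:)
    integer(kind=4) :: i, j

    allocate(tmp(N,2))
    tmp = a                      ! snapshot of the old values
    do j = 1, 2
      do i = 1, N
        a(i,j) = tmp(id(i,j), j) ! every read comes from the snapshot
      end do
    end do
    deallocate(tmp)
  end subroutine apply_id_reference
end module apply_id_ref_mod
```

Calling this once for each of `x`, `y`, `u`, `v`, and `w` on host copies gives the values I compare the kernel output against.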