Incorrect Results from an In-Place Data Swap in CUDA Fortran

Hi guys, I have an issue with reordering (swapping) data in place according to a given index sequence (id). The code is as follows:

attributes(global) subroutine d_APPLYLINEARID_PTC(id,x,y,u,v,w,N)
     implicit none
     integer(kind=4),value :: N
     integer(kind=4),dimension (N,2) :: id
     real(kind=4),dimension (N,2) :: x,y,u,v,w
     integer(kind=4) :: i,j,r_id
     real(kind=4) :: r_x,r_y,r_u,r_v,r_w
     ! One thread per element: i is the row, j (the block's y index) is the column
     i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
     j = blockIdx%y
     ! Step 1: load the source element selected by id(i,j) into registers
     if(i <= N .and. j <= 2)then
          r_id = id(i,j)
          r_x = x(r_id,j); r_y = y(r_id,j)
          r_u = u(r_id,j); r_v = v(r_id,j); r_w = w(r_id,j)
     end if
     ! Intended to wait until all loads above have completed
     call threadfence_system()
     ! Step 2: write the register values back into the same arrays
     if(i <= N .and. j <= 2)then
          x(i,j) = r_x; y(i,j) = r_y
          u(i,j) = r_u; v(i,j) = r_v; w(i,j) = r_w
     end if
end subroutine d_APPLYLINEARID_PTC

(The thread block size is (128, 1, 1) and the grid size is (ceiling(N/128), 2, 1); N is in the millions or larger.) When I verify the results of this kernel, I find that the swap is incomplete: most of the results are correct, but some entries are still wrong. I call threadfence_system to wait until all the data has been loaded into registers, but the result is still incorrect. What crucial point am I missing?
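
One likely explanation, offered as an assumption rather than a confirmed diagnosis: threadfence_system only orders the calling thread's own memory accesses as seen by other threads. It is not a barrier, so a thread does not wait until every other thread in the grid has finished reading. Blocks run in no guaranteed order, and with N in the millions they are not all resident at once, so one thread can overwrite x(i,j) before another thread has read it. Below is a minimal sketch of an out-of-place gather that avoids this by never writing to the array it reads from. The names gather_by_id, src, dst and xtmp_d are hypothetical, the reversal permutation is made-up test data, and only x is shown; y, u, v and w would be handled the same way.

```fortran
module gather_mod
  use cudafor
  implicit none
contains
  ! Out-of-place gather (hypothetical helper, not the original kernel):
  ! reads only from src and writes only to dst, so no thread can
  ! overwrite an element that another thread still needs to read.
  attributes(global) subroutine gather_by_id(id, src, dst, n)
    integer(kind=4), value :: n
    integer(kind=4), dimension(n,2) :: id
    real(kind=4), dimension(n,2) :: src, dst
    integer(kind=4) :: i, j
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    j = blockIdx%y
    if (i <= n .and. j <= 2) dst(i,j) = src(id(i,j), j)
  end subroutine gather_by_id
end module gather_mod

program test_gather
  use cudafor
  use gather_mod
  implicit none
  integer(kind=4), parameter :: n = 1000
  integer(kind=4) :: id(n,2), i, j, istat
  real(kind=4) :: x(n,2), x_new(n,2)
  integer(kind=4), device :: id_d(n,2)
  real(kind=4), device :: x_d(n,2), xtmp_d(n,2)
  type(dim3) :: grid, tblock

  ! Made-up test data: id reverses the order of each column
  do j = 1, 2
    do i = 1, n
      id(i,j) = n + 1 - i
      x(i,j) = real(i + (j - 1)*n)
    end do
  end do
  id_d = id
  x_d = x

  ! Same launch shape as in the question; integer arithmetic
  ! (n + 127)/128 rounds the block count up
  tblock = dim3(128, 1, 1)
  grid = dim3((n + 127)/128, 2, 1)
  call gather_by_id<<<grid, tblock>>>(id_d, x_d, xtmp_d, n)

  ! Copy the reordered data back only after the kernel has finished;
  ! the kernel boundary acts as the grid-wide synchronization point
  x_d = xtmp_d
  istat = cudaDeviceSynchronize()
  x_new = x_d

  ! Host-side check against the expected reversed columns
  if (all(x_new == x(n:1:-1, :))) then
    print *, 'PASS'
  else
    print *, 'FAIL'
  end if
end program test_gather
```

The cost of this approach is one temporary array per field. If that memory is a concern, swapping the roles of the two buffers between calls instead of copying back would avoid the extra device-to-device copy.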