# Understanding the data access

I have recently started learning about parallel computing. I created a sample case which solves heat conduction equation in 2D to get familiar with CUDA Fortran way of doing things. I also created a similar solver using same set of equations, initial and boundary conditions that runs on CPU.
I am observing different behaviors if I launch kernels as 1D or 2D.
For example, if I call my subroutine as

do i=1,num_steps
end do

where the subroutine is

attributes(global) subroutine solve_heat_conduction_equation(n,posd,phyd)
implicit none
integer,value :: n
real,device,intent(in) :: posd(:,:)
real,device,intent(inout) :: phyd(:,:)
integer :: i, j
real :: dx,dy,rpp, dTi,wij,dwij,Fij(dim), rij(dim),tmp, k_avg
real :: eta

eta = 0.01 * h_s * h_s
i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
! j = (blockIdx%y - 1) * blockDim%y + threadIdx%y

if(i .le. n) then
dTi=0
Fij=0
do j=1,n
if (j/=i) then
if (int(posd(dim+1,i))==2) then
phyd(1,i) = T_ref
else
dx=posd(1,i)-posd(1,j)
dy=posd(2,i)-posd(2,j)
rij(1)=dx ; rij(2)=dy
rpp=sqrt(dx**2+dy**2)
if (rpp .le. 2 * h_s) then
dwij=dw(dx,dy,h_s)
Fij(1)=dwij*dx/rpp ; Fij(2)=dwij*dy/rpp
tmp = (phyd(1,i) - phyd(1,j)) / phyd(2,i) / phyd(2,j) * particle_mass
k_avg=4*phyd(4,i)*phyd(4,j)/(phyd(4,i)+phyd(4,j))
dTi = dTi + (k_avg / phyd(3,i) * tmp * dot_product(rij, Fij) / (rpp**2 + eta))
end if
end if
end if
end do
phyd(1,i) = phyd(1,i) + dt * dTi
end if
end subroutine solve_heat_conduction_equation

The solver produces the right temperatures, whereas if I launch a 2D kernel and modify the subroutine

if(i .le. n .and. j .le. n .and. j/=i) then
dTi=0
Fij=0
if (int(posd(dim+1,i))==2) then
phyd(1,i) = T_ref
else
dx=posd(1,i)-posd(1,j)
dy=posd(2,i)-posd(2,j)
rij(1)=dx ; rij(2)=dy
rpp=sqrt(dx**2+dy**2)
if (rpp .le. 2 * h_s) then
dwij=dw(dx,dy,h_s)
Fij(1)=dwij*dx/rpp ; Fij(2)=dwij*dy/rpp
tmp = (phyd(1,i) - phyd(1,j)) / phyd(2,i) / phyd(2,j) * particle_mass
k_avg=4*phyd(4,i)*phyd(4,j)/(phyd(4,i)+phyd(4,j))
dTi = dTi + (k_avg / phyd(3,i) * tmp * dot_product(rij, Fij) / (rpp**2 + eta))
end if

end if
phyd(1,i) = phyd(1,i) + dt * dTi
end if

My CUDA solver outputs temperatures, but the results are different than that of CPU ( CPU results already validated against theory).
The idea is that for each index i we should be looping over all j indices. I am having trouble visualizing or understanding how different threads access different data.