Understanding the data access

I have recently started learning about parallel computing. I created a sample case that solves the heat conduction equation in 2D to get familiar with the CUDA Fortran way of doing things. I also created a similar solver, using the same set of equations and the same initial and boundary conditions, that runs on the CPU.
I am observing different behavior depending on whether I launch the kernel as 1D or 2D.
For example, if I call my subroutine as

 threads = dim3(block_size,1,1)
 blocks  = dim3(ceiling(real(numPar)/real(threads%x)),1,1)
 do i = 1, num_steps
    call solve_heat_conduction_equation<<<blocks, threads>>>(numPar,posType_d,pPhys_d)
    ierr = cudaDeviceSynchronize()
 end do
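
For reference, the loop above relies on the usual host-side declarations, roughly like this (the block_size value here is just a placeholder; the comments describe how the rows of the two device arrays are laid out in my code):

    use cudafor
    integer, parameter :: block_size = 256        ! placeholder value
    integer :: numPar, num_steps, i, ierr
    type(dim3) :: threads, blocks
    real, device, allocatable :: posType_d(:,:)   ! rows 1..dim: x,y position; row dim+1: particle-type flag
    real, device, allocatable :: pPhys_d(:,:)     ! row 1: temperature; remaining rows: other particle properties

numPar, num_steps, and the device arrays are set up from the host data earlier in the program.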

where the subroutine is

  attributes(global) subroutine solve_heat_conduction_equation(n,posd,phyd)
     implicit none
     integer, value :: n
     real, device, intent(in)    :: posd(:,:)
     real, device, intent(inout) :: phyd(:,:)
     integer :: i, j
     real :: dx, dy, rpp, dTi, wij, dwij, Fij(dim), rij(dim), tmp, k_avg
     real :: eta

     eta = 0.01 * h_s * h_s

     ! global index of the particle handled by this thread
     i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
     ! j = (blockIdx%y - 1) * blockDim%y + threadIdx%y

     if (i .le. n) then
        dTi = 0
        Fij = 0
        ! each thread i loops over all other particles j
        do j = 1, n
           if (j /= i) then
              if (int(posd(dim+1,i)) == 2) then
                 ! type-2 particle: hold its temperature at T_ref
                 phyd(1,i) = T_ref
              else
                 dx = posd(1,i) - posd(1,j)
                 dy = posd(2,i) - posd(2,j)
                 rij(1) = dx ; rij(2) = dy
                 rpp = sqrt(dx**2 + dy**2)
                 if (rpp .le. 2 * h_s) then
                    dwij = dw(dx,dy,h_s)
                    Fij(1) = dwij*dx/rpp ; Fij(2) = dwij*dy/rpp
                    tmp   = (phyd(1,i) - phyd(1,j)) / phyd(2,i) / phyd(2,j) * particle_mass
                    k_avg = 4*phyd(4,i)*phyd(4,j)/(phyd(4,i)+phyd(4,j))
                    dTi   = dTi + (k_avg / phyd(3,i) * tmp * dot_product(rij, Fij) / (rpp**2 + eta))
                 end if
              end if
           end if
        end do
        ! explicit Euler update of the temperature of particle i
        phyd(1,i) = phyd(1,i) + dt * dTi
     end if
  end subroutine solve_heat_conduction_equation
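
The kernel uses a few module-level quantities (dim, h_s, dt, T_ref, particle_mass) and a device function dw for the derivative of the smoothing kernel. Their exact values and the form of dw are not important for the question, but roughly they live in a module the kernel has access to; the numbers and the dw body below are only placeholders, not my actual settings:

    module sph_params
       implicit none
       integer, parameter :: dim = 2               ! spatial dimension
       real,    parameter :: h_s = 0.01            ! smoothing length (placeholder)
       real,    parameter :: dt  = 1.0e-4          ! time step (placeholder)
       real,    parameter :: T_ref = 1.0           ! fixed temperature for type-2 particles (placeholder)
       real,    parameter :: particle_mass = 1.0   ! per-particle mass (placeholder)
    contains
       attributes(device) function dw(dx, dy, h) result(val)
          ! derivative of the smoothing kernel; the Gaussian form below is
          ! only a stand-in for the kernel I actually use
          real, intent(in) :: dx, dy, h
          real :: val, q
          q   = sqrt(dx*dx + dy*dy) / h
          val = -2.0 * q * exp(-q*q) / (h*h)
       end function dw
    end module sph_params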

This solver produces the right temperatures. However, if I launch a 2D kernel and modify the subroutine accordingly,

 threads=dim3(block_size,block_size,1)
 blocks = dim3(ceiling(real(numPar)/real(threads%x)),ceiling(real(numPar)/real(threads%x)),1)
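    ! inside the kernel, the j line that was commented out in the 1D version is
    ! now enabled, so both particle indices come from the grid:
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    j = (blockIdx%y - 1) * blockDim%y + threadIdx%y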
    if (i .le. n .and. j .le. n .and. j /= i) then
       dTi = 0
       Fij = 0
       if (int(posd(dim+1,i)) == 2) then
          phyd(1,i) = T_ref
       else
          dx = posd(1,i) - posd(1,j)
          dy = posd(2,i) - posd(2,j)
          rij(1) = dx ; rij(2) = dy
          rpp = sqrt(dx**2 + dy**2)
          if (rpp .le. 2 * h_s) then
             dwij = dw(dx,dy,h_s)
             Fij(1) = dwij*dx/rpp ; Fij(2) = dwij*dy/rpp
             tmp   = (phyd(1,i) - phyd(1,j)) / phyd(2,i) / phyd(2,j) * particle_mass
             k_avg = 4*phyd(4,i)*phyd(4,j)/(phyd(4,i)+phyd(4,j))
             dTi   = dTi + (k_avg / phyd(3,i) * tmp * dot_product(rij, Fij) / (rpp**2 + eta))
          end if
       end if
       phyd(1,i) = phyd(1,i) + dt * dTi
    end if

My CUDA solver still outputs temperatures, but the results are different from those of the CPU solver (the CPU results have already been validated against theory).
The idea is that for each index i we should be looping over all indices j. I am having trouble visualizing or understanding how the different threads access different data.
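
To check my mental model of the mapping, I put together a small toy program (independent of the solver; all names in it are made up) that tags every (i,j) pair touched by a thread of a 2D launch:

    module map_check
       use cudafor
    contains
       attributes(global) subroutine tag_pairs(n, visits)
          implicit none
          integer, value :: n
          integer, device, intent(inout) :: visits(:,:)
          integer :: i, j
          i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
          j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
          ! in a 2D launch every in-range (i,j) pair is owned by exactly one thread
          if (i <= n .and. j <= n) visits(i,j) = visits(i,j) + 1
       end subroutine tag_pairs
    end module map_check

    program check_mapping
       use cudafor
       use map_check
       implicit none
       integer, parameter :: n = 8, bs = 4
       integer, device :: visits_d(n,n)
       integer :: visits(n,n), ierr
       type(dim3) :: threads, blocks

       visits_d = 0
       threads  = dim3(bs, bs, 1)
       blocks   = dim3(ceiling(real(n)/real(bs)), ceiling(real(n)/real(bs)), 1)
       call tag_pairs<<<blocks, threads>>>(n, visits_d)
       ierr   = cudaDeviceSynchronize()
       visits = visits_d
       print *, 'every (i,j) pair visited exactly once: ', all(visits == 1)
    end program check_mapping

It reports that each (i,j) pair is handled by exactly one thread, whereas in the 1D version a single thread i performs the whole j loop by itself. Is my understanding of that mapping correct, and could it explain why the two versions give different temperatures?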