Different results obtained in varying the block size

I am getting different results when I am varying the blocksize in the kernel launch parameter. When I am using lower blocksize like 64 or 128, the results are much closer to the CPU result (No CUDA), but when I am using higher blocksize such as 512, 1024, the results are much different (diverge) from the CPU result.

I am doing sum and max reductions in the kernel and have removed race condition (checked by cudamemcheck tool racecheck). Here is one such kernel,

     attributes(global) subroutine foo()
     implicit none
     integer :: i, tid, stride
     real (dp) :: tmp
     real (dp), shared :: sdata(blockDim%x)

     i  = (blockIdx%x-1)*blockDim%x + threadIdx%x
     tid= threadIdx%x

     if (i<=nn_d) then
     
         Xn_d(i) = Xo_d(i+nxy_d) + alpha_d*Po_d(i+nxy_d) + omega_d*So_d(i+nxy_d)
         Rn_d(i) = So_d(i+nxy_d) - omega_d*ASo_d(i)
     end if

     sdata(tid) = 0.
     if (i<=nn_d) then
         tmp = Rn_d(i)*Rso_d(i)
         sdata(tid) = tmp
     end if

     stride = blockDim%x/2
     do while (stride>=1)
        call syncthreads()
        if (tid<=stride)  sdata(tid) = sdata(tid) + sdata(tid + stride)
        
        stride = stride/2
     end do
     call syncthreads()

     if (tid==1)    tmp3_d(blockIdx%x) = sdata(1)

     sdata(tid) = 0.
     if (i<=nn_d) then
         tmp = Ro_d(i)*Rso_d(i)
         sdata(tid) = tmp
     end if

     stride = blockDim%x/2
     do while (stride>=1)
        call syncthreads()
        if (tid<=stride)  sdata(tid) = sdata(tid) + sdata(tid + stride)
        
        stride = stride/2
     end do
     call syncthreads()

     if (tid==1)    tmp3_d(gridDim%x + blockIdx%x) = sdata(1)

     sdata(tid) = 0.
     if (i<=nn_d) then
         tmp = abs(Xn_d(i)-Xo_d(i+nxy_d))
         sdata(tid) = tmp
     end if

     stride = blockDim%x/2
     do while (stride>=1)
        call syncthreads()
        if (tid<=stride)  sdata(tid) = max(sdata(tid), sdata(tid + stride))
        
        stride = stride/2
     end do

     if (tid==1)    tmp3_d(2*gridDim%x + blockIdx%x) = sdata(1)
     end subroutine foo

The CPU code where the kernel foo being is called,

      ...
      alpha_d = alpha
      ...
      omega_d=omega
      call foo<<<blocks, threads, threads*8>>>()
      tmp3= tmp3_d

      betaN = 0.
      do j=1,blocks
         betaN = betaN + tmp3(j)
      end do

      betaD = 0.
      do j=blocks+1,blocks*2
         betaD = betaD + tmp3(j)
      end do

      TErr = 0.
      do j=blocks*2+1, blocks*3
         TErr = max(TErr, tmp3(j))
      end do
      ...

The code is in fortran. Any advice would be helpful. Thank you.