Thanks for the response. Here’s some background on what I meant by unrealistic values. I have three code versions:

- The first is the legacy code; it runs entirely on the CPU and generates outputs over a 10-day simulation that I use as the benchmark for the other two versions.
- The second version is ported to the GPU using global device variables and only kernel loop directives; it produces results identical to the CPU-only code over the 10-day simulation, with a 15x reduction in runtime, which is great. However, I am greedy and want more speed-up, hence the next version.
- The third version is my attempt to turn the loops previously ported with kernel loop directives into kernel subroutines. For the initial attempt, I used only the global device variables and replicated the operations of just one of the kernel loop directives. This version runs fine for about 2.5 days, with results identical to the other two versions, but then generates divergent values that eventually lead to NaNs.

Here’s the kernel loop directive from code version 2 that I converted into a kernel subroutine in code version 3:

```
!$cuf kernel do(2) <<< *, * >>>
DO J = 1, JM
  DO I = 1, IM
    UA_d(I,J,nC_h) = UA_d(I,J,nC_h) + .5*NU*(UA_d(I,J,nB_h) - 2.*UA_d(I,J,nC_h) + UA_d(I,J,nF_h))
    VA_d(I,J,nC_h) = VA_d(I,J,nC_h) + .5*NU*(VA_d(I,J,nB_h) - 2.*VA_d(I,J,nC_h) + VA_d(I,J,nF_h))
    EL_d(I,J,nC_h) = EL_d(I,J,nC_h) + .5*NU*(EL_d(I,J,nB_h) - 2.*EL_d(I,J,nC_h) + EL_d(I,J,nF_h))
    DUMWAD_d(I,J) = DUM_d(I,J)
    DVMWAD_d(I,J) = DVM_d(I,J)
    WFSM_d(I,J) = INT(FSM_d(I,J))
    D_d(I,J,nF_h) = H_d(I,J) + EL_d(I,J,nF_h)
  ENDDO
ENDDO
```

Below are the kernel call and kernel code from code version 3 (module GPUVars contains all the device variables operated on in the kernel subroutine):

```
CALL KRNL_ECOM3D_HYD2D<<< dim3(2,65,1), dim3(32,4,1) >>>(nC_h, nB_h, nF_h, IM, JM)
```

```
MODULE KERNELS_2D_HYDRO
  USE CUDAFOR
  USE GPUVars
CONTAINS
  ATTRIBUTES(GLOBAL) SUBROUTINE KRNL_ECOM3D_HYD2D(nC, nB, nF, IM, JM)
    INTEGER, VALUE :: nC, nB, nF, IM, JM
    INTEGER :: t_i, t_j
    ! Global 1-based grid indices for this thread
    t_i = threadIdx%x + (blockIdx%x - 1)*blockDim%x
    t_j = threadIdx%y + (blockIdx%y - 1)*blockDim%y
    ! Guard threads that fall outside the IM x JM domain
    IF (t_i <= IM .AND. t_j <= JM) THEN
      UA_d(t_i,t_j,nC) = UA_d(t_i,t_j,nC) + .5*NU*(UA_d(t_i,t_j,nB) - 2.*UA_d(t_i,t_j,nC) + UA_d(t_i,t_j,nF))
      VA_d(t_i,t_j,nC) = VA_d(t_i,t_j,nC) + .5*NU*(VA_d(t_i,t_j,nB) - 2.*VA_d(t_i,t_j,nC) + VA_d(t_i,t_j,nF))
      EL_d(t_i,t_j,nC) = EL_d(t_i,t_j,nC) + .5*NU*(EL_d(t_i,t_j,nB) - 2.*EL_d(t_i,t_j,nC) + EL_d(t_i,t_j,nF))
      DUMWAD_d(t_i,t_j) = DUM_d(t_i,t_j)
      DVMWAD_d(t_i,t_j) = DVM_d(t_i,t_j)
      WFSM_d(t_i,t_j) = INT(FSM_d(t_i,t_j))
      D_d(t_i,t_j,nF) = H_d(t_i,t_j) + EL_d(t_i,t_j,nF)
    ENDIF
  END SUBROUTINE KRNL_ECOM3D_HYD2D
END MODULE KERNELS_2D_HYDRO
```
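For completeness, here is how I could derive the launch configuration from IM and JM instead of hardcoding dim3(2,65,1), and surface any asynchronous errors right after the call. This is an untested sketch, not what I currently run; the names `grid`, `tBlock`, and `istat` are local names I made up for illustration:

```
! Untested sketch: compute a grid that covers the IM x JM domain
! for a fixed 32x4 thread block, then check for launch/runtime errors.
TYPE(dim3) :: grid, tBlock
INTEGER    :: istat

tBlock = dim3(32, 4, 1)
! Ceiling division so partial blocks at the domain edge are included
grid = dim3((IM + tBlock%x - 1)/tBlock%x, (JM + tBlock%y - 1)/tBlock%y, 1)

CALL KRNL_ECOM3D_HYD2D<<<grid, tBlock>>>(nC_h, nB_h, nF_h, IM, JM)

! Kernel launches are asynchronous; check for errors before the
! host uses any results.
istat = cudaGetLastError()
IF (istat /= cudaSuccess) PRINT *, 'launch: ', cudaGetErrorString(istat)
istat = cudaDeviceSynchronize()
IF (istat /= cudaSuccess) PRINT *, 'exec:   ', cudaGetErrorString(istat)
```

The in-kernel bounds check already guards extra threads, so computing the grid this way only changes coverage, not the per-point arithmetic.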

I will look into the compute-sanitizer tool; I was not aware of this one. In the meantime, if you see something above that catches your eye, please let me know. I appreciate your help!