Code computes correctly with kernel loop directives but calculates NaNs with kernel subroutines

Hi,

I am having a strange problem. I’m in the midst of porting a legacy Fortran numerical model to run on a GPU. I’ve successfully ported the subset of key computationally expensive loops using kernel loop directives and global device variables. I would now like to turn the code blocks inside those kernel loop directives into kernel subroutines, so I can further optimize memory management on the device by using shared variables instead of global device variables. I’ve converted one such loop to a kernel subroutine, and the model works fine for a while but then starts computing unrealistic values. Adding syncthreads, cudaThreadSynchronize, or cudaDeviceSynchronize calls does not help. I am at a loss for how to go about debugging this and would appreciate any suggestions.

Thanks!

What kind of “unrealistic values”? Are they numerically divergent from the expected values, gradually and increasingly so as the simulation continues? Or are they completely outside the realm of possible values, possibly NaNs, that pop up more or less randomly and suddenly?

The first scenario could be due to a numerical issue. This could be a plain implementation bug, e.g. a mix-up in variable names, or a latent numerical stability issue simply exposed through the porting process. Carefully check the data types of all variables and literal constants.
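 
To make the point about literal constants concrete, here is a tiny stand-alone Fortran illustration (not taken from your code) of how a default-real constant silently carries only single precision:

   program literal_demo
   implicit none
   real(8) :: a, b
   a = 0.1        ! default (single-precision) literal, only widened after the fact
   b = 0.1d0      ! genuine double-precision literal
   print *, a     ! prints roughly 0.10000000149011612
   print *, b     ! prints roughly 0.10000000000000001
   end program literal_demo

If any arrays or constants ended up with a different kind in the GPU version than in the legacy code, this sort of mismatch can produce exactly the slow drift described above.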

In the second scenario there might be out-of-bounds access, maybe due to incorrectly computed array indexes as a consequence of thread-to-data mapping.
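 
For reference, the usual CUDA Fortran pattern (the names here are generic placeholders) computes the global indices from the thread and block coordinates and guards them against the array extents:

   i = threadIdx%x + (blockIdx%x - 1) * blockDim%x   ! global 1-based index
   j = threadIdx%y + (blockIdx%y - 1) * blockDim%y
   if (i <= ni .and. j <= nj) then
      ! only touch a(i,j) inside the guard
   end if

Anything that deviates from this (a missing guard, a swapped dimension, an off-by-one in the block offset) typically shows up as corrupted values or NaNs.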

If your code does not trigger any reports from compute-sanitizer, a standard method of debugging is to work backwards from any piece of data clearly identified as incorrect. That data is the result of some computation applied to some input data; examine both in detail.
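 
In case it helps, compute-sanitizer ships with the CUDA toolkit and is run against the unmodified executable; a typical session (the executable name ./model is just a placeholder) looks like:

   compute-sanitizer ./model                    # default tool: memcheck (out-of-bounds / misaligned accesses)
   compute-sanitizer --tool racecheck ./model   # shared-memory data races
   compute-sanitizer --tool initcheck ./model   # reads of uninitialized device memory
   compute-sanitizer --tool synccheck ./model   # invalid synchronization usage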

Personally, I like to instrument the code by inserting simple printf() calls and collecting the output in a log. Then I scrutinize the log for anything that looks wrong or suspicious, add more instrumentation related to the suspicious items, and remove instrumentation for data that seems clearly OK. I usually manage to track down the root cause within a few hours. In the worst case (working on a large, completely unfamiliar code base) it has taken me two days.
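 
In CUDA Fortran, recent nvfortran releases also support list-directed print from device code, so the same approach works inside a kernel; a minimal sketch (the names and indices are arbitrary picks for illustration) would be:

   ! print one suspect element from exactly one thread to keep the log readable
   if (i == 5 .and. j == 7) print *, 'a(5,7) = ', a_d(5,7)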

The key to successful debugging is to work thoroughly and methodically, following incorrect data back step by step to its origin. Avoid guesswork: peppering the code with synchronization primitives without prior analysis of whether they are actually needed falls under this heading. It may make sense to temporarily remove portions of the code (comment out calls, comment out code blocks) to simplify the search, but this can also mask the root cause, so I usually try to minimize the amount of instrumentation while working on the original, unmodified code.

Thanks for the response. Here’s some background on what I meant by unrealistic values - I have 3 code versions:

  1. The first one is the legacy code, runs entirely on CPU and generates outputs over a 10-day simulation that I use to benchmark against the results of the other two code versions.
  2. The second code version is one ported to run on the GPU using global device variables and only kernel loop directives; this code generates results identical to the CPU-only code over the 10-day simulation, with a 15x reduction in runtime, which is great. However, I am greedy and want more speed-up, hence the next code version.
  3. The third code version is my attempt to make kernel subroutines of the loops previously ported using kernel loop directives. For the initial attempt, I only used the global device variables and replicated the operations of only one of the kernel loop directives. However, this version runs fine for about 2.5 days, with results identical to the other two code versions, then generates divergent values that eventually lead to NaNs.

Here’s the kernel loop directive from code version 2 that I put in a kernel subroutine in code version 3:

!$cuf kernel do(2) <<< *, * >>>
      DO J=1,JM
         DO I=1,IM
            UA_d(I,J,nC_h)=UA_d(I,J,nC_h)+.5*NU*(UA_d(I,J,nB_h)-2.*UA_d(I,J,nC_h)+UA_d(I,J,nF_h))
            VA_d(I,J,nC_h)=VA_d(I,J,nC_h)+.5*NU*(VA_d(I,J,nB_h)-2.*VA_d(I,J,nC_h)+VA_d(I,J,nF_h))
            EL_d(I,J,nC_h)=EL_d(I,J,nC_h)+.5*NU*(EL_d(I,J,nB_h)-2.*EL_d(I,J,nC_h)+EL_d(I,J,nF_h))
            DUMWAD_d(I,J)=DUM_d(I,J)
            DVMWAD_d(I,J)=DVM_d(I,J)
            WFSM_d(I,J)=INT(FSM_d(I,J))
            D_d(I,J,nF_h)=H_d(I,J)+EL_d(I,J,nF_h)
         ENDDO
      ENDDO

And below is my kernel call and kernel code from code version 3 (the module GPUVars contains all the device variables operated on in the kernel subroutine):

CALL KRNL_ECOM3D_HYD2D<<< dim3(2,65,1),dim3(32,4,1) >>>(nC_h, nB_h, nF_h, IM, JM)

  MODULE KERNELS_2D_HYDRO
  USE CUDAFOR
  USE GPUVars      
  CONTAINS

  ATTRIBUTES(GLOBAL) SUBROUTINE KRNL_ECOM3D_HYD2D(nC,nB,nF,IM,JM)
  integer, value :: nC, nB, nF, IM, JM
  integer :: t_i, t_j

  ! global (1-based) thread indices: x maps to I, y maps to J
  t_i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  t_j = threadIdx%y + (blockIdx%y - 1) * blockDim%y

  ! calculate, guarding threads that fall outside the IM x JM domain
  if (t_i.LE.IM .AND. t_j.LE.JM) then
     UA_d(t_i,t_j,nC)=UA_d(t_i,t_j,nC)+.5*NU*(UA_d(t_i,t_j,nB)-2.*UA_d(t_i,t_j,nC)+UA_d(t_i,t_j,nF)) 
     VA_d(t_i,t_j,nC)=VA_d(t_i,t_j,nC)+.5*NU*(VA_d(t_i,t_j,nB)-2.*VA_d(t_i,t_j,nC)+VA_d(t_i,t_j,nF)) 
     EL_d(t_i,t_j,nC)=EL_d(t_i,t_j,nC)+.5*NU*(EL_d(t_i,t_j,nB)-2.*EL_d(t_i,t_j,nC)+EL_d(t_i,t_j,nF)) 
     DUMWAD_d(t_i,t_j)=DUM_d(t_i,t_j)
     DVMWAD_d(t_i,t_j)=DVM_d(t_i,t_j)
     WFSM_d(t_i,t_j)=INT(FSM_d(t_i,t_j))
     D_d(t_i,t_j,nF)=H_d(t_i,t_j)+EL_d(t_i,t_j,nF)
  endif

  END SUBROUTINE KRNL_ECOM3D_HYD2D     
  END MODULE KERNELS_2D_HYDRO

I will look into the compute-sanitizer tool; I was not aware of this one. In the meantime, if you see something above that catches your eye, please let me know. I appreciate your help!

I notice a pattern that suggests to me that you may be using _d and _h to indicate device and host copies of variables. Is that correct? Or does it mean something else?

Also, while you are free to post here, I would normally suggest that people asking questions about CUDA code that requires nvc, nvc++, or nvfortran post in the forum section for those tools.

Yes, that is correct - _d and _h indicate device and host variables, respectively.

I will post future Fortran-specific queries in those forums. Thanks!

It seems odd to me that you would be passing host variables (the _h arguments) to a device kernel, but I acknowledge there is not enough information here to definitively say that is a problem.

What is strange is that the kernel works fine and computes correctly for 2.5 days (43,000 timesteps) before going haywire.

Can you store the intermediate results at each iteration, or at regular intervals, to pinpoint exactly where it breaks down (see the sketch below)?
Is the breakdown reproducible at that step, with exactly the same result numbers?
Can you trace it to a specific arithmetic operation?
Do your PC and GPU use ECC memory?
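 
For the first point, something along these lines in the host time loop could dump a field for offline comparison (EL_h, istep, and ncheck are hypothetical names for a host mirror array, the step counter, and the checkpoint interval; in CUDA Fortran the array assignment performs the device-to-host copy):

   if (mod(istep, ncheck) == 0) then
      EL_h = EL_d                          ! device-to-host copy via assignment
      write(99,*) istep, EL_h(:,:,nC_h)    ! dump one field for later comparison
   end if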

What’s the values for IM and JM? Do they change during the run?

In the launch configuration, you have it set for 2x65 blocks and 32x4 threads so the max IM would be 64 and max JM 260. If these are larger, then you’d be missing computation on some elements.

If this is the case, consider adding two DO loops: start the outer loop at “t_i” and stride by blockDim%x*gridDim%x, and start the inner loop at “t_j” and stride by blockDim%y*gridDim%y.

Something like:

! calculate with grid-stride loops so every (i,j) element is covered exactly once
    do i = t_i, IM, blockDim%x*gridDim%x
       do j = t_j, JM, blockDim%y*gridDim%y
          UA_d(i,j,nC)=UA_d(i,j,nC)+.5*NU*(UA_d(i,j,nB)-2.*UA_d(i,j,nC)+UA_d(i,j,nF))
          VA_d(i,j,nC)=VA_d(i,j,nC)+.5*NU*(VA_d(i,j,nB)-2.*VA_d(i,j,nC)+VA_d(i,j,nF))
...
       enddo
    enddo

It might be a good idea to do this anyway since it will future-proof the code if you change the sizes later. It’s also the pattern the compiler uses when generating the CUF kernel.

Thanks for the response. IM and JM are constants - set to 33 and 260 in the run. So the grid/block configuration seems fine.

After playing around a bit, the problem code seems to be:
EL_d(t_i,t_j,nC)=EL_d(t_i,t_j,nC)+.5*NU*(EL_d(t_i,t_j,nB)-2.*EL_d(t_i,t_j,nC)+EL_d(t_i,t_j,nF))

If I pull this line out of the kernel subroutine, and execute it separately with a kernel loop directive, the calculation goes through successfully. I am at a loss to explain why this should be the case.

IM and JM are constants - set to 33 and 260 in the run

Ok

I’m not seeing anything obvious, which means there’s either something going on higher up in the code (like a missing synchronization) or there’s a compiler issue.
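 
If you want to rule out a launch problem quickly, you could add explicit error checks around the call; a sketch, assuming "use cudafor" is in scope and "istat" is an integer:

   CALL KRNL_ECOM3D_HYD2D<<< dim3(2,65,1), dim3(32,4,1) >>>(nC_h, nB_h, nF_h, IM, JM)
   istat = cudaGetLastError()              ! reports launch/configuration errors
   if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
   istat = cudaDeviceSynchronize()         ! reports errors raised while the kernel executes
   if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))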

Are you able to share the full source or pull together a minimal reproducing example?

Yes, I can share the full source and inputs to run. Please let me know how I may transmit the files to you. And thank you for offering to look into this!

If it’s code that can be shared publicly, you can upload the code as part of a post (look for the icon with the computer with an up arrow above it).

Otherwise, click on my profile, select the “message” button, and upload the package using the upload button as part of the message.

If it’s too big, then I can create a shared Google Drive which you can use to upload the files.