Cannot find the reason for this error: call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

Hello,

I have been getting this error seemingly at random from one specific part of my code:

call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

It is a CFD code, and this error usually happens after thousands of iterations/timesteps, never consistently at the same iteration/timestep. I don’t think there is anything wrong with the solution itself, because I am able to resume the simulation after the error occurs. I am just not sure where it is coming from. I have tried searching this forum and others for an answer, but none of the solutions have worked. I would really appreciate any help.

The code is rather large, so I would be happy to send it if need be.

Hi Natan,

This is a generic error meaning that a bad pointer is getting dereferenced, though there could be multiple causes. Given it’s non-deterministic and only occurs well into the execution of the program, my best guess is that the code is writing out of bounds to some array. Whether the write is problematic depends on what data is stored at that location: if it’s padding, then it wouldn’t have any effect, but if it’s some other variable, then that memory gets stomped on.

Other possibilities are stack or heap overflows.

What I’d suggest is to run your code through compute-sanitizer and see if it picks up any errors.
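
For reference, each tool is selected with the “--tool” option, e.g. (with “./solver” standing in for your executable):

    compute-sanitizer --tool memcheck  ./solver
    compute-sanitizer --tool racecheck ./solver
    compute-sanitizer --tool initcheck ./solver
    compute-sanitizer --tool synccheck ./solver

memcheck (the default) catches out-of-bounds and misaligned accesses, racecheck looks for shared-memory data races, initcheck for reads of uninitialized device memory, and synccheck for invalid use of synchronization primitives.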

-Mat

Thanks Mat. I am debugging with compute-sanitizer now and have tried all four tools. The only one pointing to issues seems to be racecheck: I am consistently getting race conditions reported in kernels where I use reduction operations (with or without specifying the reduction clause in the directive). For example:

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ! Compute inviscid timestep
    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    !$acc parallel loop gang vector collapse(4) default(present)
    do b = 1,nblocks
       do k = 1,nkmax
          do j = 1,njmax
             do i = 1,nimax
                if ((i .le. ni(b)) .and. (j .le. nj(b)) .and. (k .le. nk(b))) then

                   u_contra = prim_c(i,j,k,b,2)*idx_c(i,j,k,b) + &
                        &     prim_c(i,j,k,b,3)*idy_c(i,j,k,b) + &
                        &     prim_c(i,j,k,b,4)*idz_c(i,j,k,b)
                   v_contra = prim_c(i,j,k,b,2)*jdx_c(i,j,k,b) + &
                        &     prim_c(i,j,k,b,3)*jdy_c(i,j,k,b) + &
                        &     prim_c(i,j,k,b,4)*jdz_c(i,j,k,b)
                   w_contra = prim_c(i,j,k,b,2)*kdx_c(i,j,k,b) + &
                        &     prim_c(i,j,k,b,3)*kdy_c(i,j,k,b) + &
                        &     prim_c(i,j,k,b,4)*kdz_c(i,j,k,b)

                   sound_x = sqrt(gamma_ideal*prim_c(i,j,k,b,5)/prim_c(i,j,k,b,1)) * &
                        &    sqrt(idx_c(i,j,k,b)**2.0d0+idy_c(i,j,k,b)**2.0d0+idz_c(i,j,k,b)**2.0d0)
                   sound_y = sqrt(gamma_ideal*prim_c(i,j,k,b,5)/prim_c(i,j,k,b,1)) * &
                        &    sqrt(jdx_c(i,j,k,b)**2.0d0+jdy_c(i,j,k,b)**2.0d0+jdz_c(i,j,k,b)**2.0d0)
                   sound_z = sqrt(gamma_ideal*prim_c(i,j,k,b,5)/prim_c(i,j,k,b,1)) * &
                        &    sqrt(kdx_c(i,j,k,b)**2.0d0+kdy_c(i,j,k,b)**2.0d0+kdz_c(i,j,k,b)**2.0d0)

                   x_velocity = abs(u_contra)+sound_x
                   y_velocity = abs(v_contra)+sound_y
                   z_velocity = abs(w_contra)+sound_z

                   dt_i = min(dt_i,1.0d0/x_velocity,1.0d0/y_velocity,1.0d0/z_velocity)
                   
                end if
             end do
          end do
       end do
    end do

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ! Compute viscous timestep
    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    if (i_eqn .eq. 2) then

       if (nondim .eq. 1) then

          !$acc parallel loop gang vector collapse(4) reduction(min:dt_v) &
          !$acc& default(present)
          do b = 1,nblocks
             do k = 1,nkmax
                do j = 1,njmax
                   do i = 1,nimax
                      if ((i .le. ni(b)) .and. (j .le. nj(b)) .and. (k .le. nk(b))) then

                         norm1 = idx_c(i,j,k,b)**2.0d0+idy_c(i,j,k,b)**2.0d0+idz_c(i,j,k,b)**2.0d0
                         norm2 = jdx_c(i,j,k,b)**2.0d0+jdy_c(i,j,k,b)**2.0d0+jdz_c(i,j,k,b)**2.0d0
                         norm3 = kdx_c(i,j,k,b)**2.0d0+kdy_c(i,j,k,b)**2.0d0+kdz_c(i,j,k,b)**2.0d0
                         
                         dt_v = min(dt_v,0.25d0*(Re/Ma)*(prim_c(i,j,k,b,1)/dyn_visc_c(i,j,k,b))*min(1.0d0/norm1,1.0d0/norm2,1.0d0/norm3))
                         
                      end if
                   end do
                end do
             end do
          end do

       else if (nondim .eq. 2) then
          
          !$acc parallel loop gang vector collapse(4) reduction(min:dt_v) &
          !$acc& default(present)
          do b = 1,nblocks
             do k = 1,nkmax
                do j = 1,njmax
                   do i = 1,nimax
                      if ((i .le. ni(b)) .and. (j .le. nj(b)) .and. (k .le. nk(b))) then

                         norm1 = idx_c(i,j,k,b)**2.0d0+idy_c(i,j,k,b)**2.0d0+idz_c(i,j,k,b)**2.0d0
                         norm2 = jdx_c(i,j,k,b)**2.0d0+jdy_c(i,j,k,b)**2.0d0+jdz_c(i,j,k,b)**2.0d0
                         norm3 = kdx_c(i,j,k,b)**2.0d0+kdy_c(i,j,k,b)**2.0d0+kdz_c(i,j,k,b)**2.0d0
                         
                         dt_v = min(dt_v,0.25d0*Re*(prim_c(i,j,k,b,1)/dyn_visc_c(i,j,k,b))*min(1.0d0/norm1,1.0d0/norm2,1.0d0/norm3))
                         
                      end if
                   end do
                end do
             end do
          end do
       end if
       
    else if (nondim .eq. 3) then

       !$acc parallel loop gang vector collapse(4) reduction(min:dt_v) &
       !$acc& default(present)
       do b = 1,nblocks
          do k = 1,nkmax
             do j = 1,njmax
                do i = 1,nimax
                   if ((i .le. ni(b)) .and. (j .le. nj(b)) .and. (k .le. nk(b))) then

                      norm1 = idx_c(i,j,k,b)**2.0d0+idy_c(i,j,k,b)**2.0d0+idz_c(i,j,k,b)**2.0d0
                      norm2 = jdx_c(i,j,k,b)**2.0d0+jdy_c(i,j,k,b)**2.0d0+jdz_c(i,j,k,b)**2.0d0
                      norm3 = kdx_c(i,j,k,b)**2.0d0+kdy_c(i,j,k,b)**2.0d0+kdz_c(i,j,k,b)**2.0d0

                      dt_v = min(dt_v,0.25d0*Re*(prim_c(i,j,k,b,1)/dyn_visc_c(i,j,k,b))*min(1.0d0/norm1,1.0d0/norm2,1.0d0/norm3))

                   end if
                end do
             end do
          end do
       end do
       
    else
       
       dt_v = 1.0d20
       
    end if

Anyways, these don’t seem to be causing issues.

In another module of my code, I am getting a memcheck error. The issue is that the line compute-sanitizer points to is not very helpful: it points to the line that closes the subroutine, which I can only assume means the memcheck error is occurring somewhere inside that subroutine.

Is there a way I can change this behavior? I tried compiling with debug flags as well to try and catch this but it does not seem to help.

Appreciate any advice.

What debug flags are you using? Just -g?

You might try adding “-gpu=lineinfo” to see if that helps. It’s primarily there for the profilers so I’m not sure it will help here, but easy to try.
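
For example, something along these lines (the source and output names are just placeholders):

    nvfortran -acc -g -gpu=lineinfo -o solver solver.f90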

compute-sanitizer is best at helping identify approximately where potential issues occur, but it won’t necessarily tell you why they’re happening. For that, I’d move to cuda-gdb and set a breakpoint in the kernel that compute-sanitizer identifies.
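
A minimal session might look like the following (the kernel name is purely illustrative; OpenACC kernels get mangled names, which you can list with “info cuda kernels” once one has launched):

    cuda-gdb ./solver
    (cuda-gdb) set cuda break_on_launch application
    (cuda-gdb) run
    (cuda-gdb) info cuda kernels
    (cuda-gdb) break compute_timestep_123_gpu
    (cuda-gdb) continue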

Of course since the problem doesn’t occur until well into the simulation, you may need to resort to “print” debugging to help narrow it down.

Hi Mat,

I think I found the source of the memcheck error.
The loop is structured like so:

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ! EWM main iteration
    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    !$acc parallel loop gang vector collapse(2) &
    !$acc& private(a_m_local,b_m_local,c_m_local,rhs_m_local,vel_ewm_c_local) &
    !$acc& private(a_t_local,b_t_local,c_t_local,rhs_t_local,t_ewm_c_local) &
    !$acc& default(present)
    do k = 1,nk(b1)
       do i = 1,ni(b1)

          ewm_iterations(i,k,b1) = 1

          !$acc loop seq
          do l = 1,10000000

Inside of this, a few different calculations are done. Everything seems to be fine except for a call to a different subroutine:

             call tdma(n_ewm, &
                  &    a_m_local      (1:n_ewm), &
                  &    b_m_local      (1:n_ewm), &
                  &    c_m_local      (1:n_ewm), &
                  &    rhs_m_local    (1:n_ewm), &
                  &    vel_ewm_c_local(1:n_ewm))

This call seems to trigger the memcheck error. This is what the subroutine looks like:

  !================================================================================
  ! subroutine tdma
  ! Solves a tri-diagonal matrix system
  ! n is the number of columns/rows
  ! a is the lower diagonal
  ! b is the main diagonal
  ! c is the upper diagonal
  ! d is the right hand side
  ! x is the solution vector
  subroutine tdma(n,a,b,c,d,x)
    !$acc routine
    implicit none
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: a(n),c(n),b(n),d(n)
    real(dp), intent(out) :: x(n)

    integer  :: i
    real(dp) :: m,cp(n),dp(n)

    cp(1) = c(1)/b(1)
    dp(1) = d(1)/b(1)

    ! Forward elimination
    !$acc loop seq
    do i = 2,n
       m = b(i)-cp(i-1)*a(i)
       cp(i) = c(i)/m
       dp(i) = (d(i)-dp(i-1)*a(i))/m
    enddo

    ! Back substitution
    x(n) = dp(n)

    !$acc loop seq
    do i = n-1,1,-1
       x(i) = dp(i)-cp(i)*x(i+1)
    end do

  end subroutine tdma

Does anything seem off?

It seems I was able to get around this by inlining the operations done in the tdma subroutine straight into the loop it was being called from. I no longer get the memcheck errors. However, I do keep getting racecheck errors wherever reductions are being done. I am not sure if these are false positives.

For instance, I get a racecheck from this loop:

    u_tau_ewm        = 0.0d0
    rho_ewm_wall_avg = 0.0d0
    mu_ewm_wall_avg  = 0.0d0
    !$acc parallel loop gang vector collapse(2) &
    !$acc& reduction(+:u_tau_ewm,rho_ewm_wall_avg,mu_ewm_wall_avg) &
    !$acc& default(present)
    do k = 1,nk(b1)
       do i = 1,ni(b1)
          u_tau_ewm        = u_tau_ewm        + utau_ewm(i,k,b1)       /dble(ni(b1)*nk(b1))
          rho_ewm_wall_avg = rho_ewm_wall_avg + rho_ewm_wall_array(i,k)/dble(ni(b1)*nk(b1))
          mu_ewm_wall_avg  = mu_ewm_wall_avg  + mu_ewm_wall_array (i,k)/dble(ni(b1)*nk(b1))
       end do
    end do

You have a couple of automatic arrays:

real(dp) :: m,cp(n),dp(n)

Automatic arrays are implicitly allocated on the device, which is problematic. It can cause a performance slowdown, but also errors due to heap overflows.

I don’t know if you are encountering a heap overflow, but inlining the routine would fix this, since you can then hoist the temp arrays so they are declared on the host and make them private on the device. Private arrays are allocated before the kernel launch.
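
In sketch form, the pattern looks something like this (the bound “NMAX” and the work-array names are illustrative, not taken from your code):

    ! work arrays hoisted out of the routine, declared on the host with a fixed bound
    real(dp) :: cp_work(NMAX), dp_work(NMAX)

    !$acc parallel loop gang vector collapse(2) &
    !$acc& private(cp_work,dp_work) &
    !$acc& default(present)
    do k = 1,nk(b1)
       do i = 1,ni(b1)
          ! ... inlined tdma forward elimination / back substitution,
          !     using cp_work/dp_work in place of the automatics ...
       end do
    end do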

You can also try increasing the heap size by setting the environment variable “NV_ACC_CUDA_HEAPSIZE” to see if that fixes it. However, even if it does, it’s better to avoid using automatics in device routines due to the performance issues.
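
For example (assuming the value is given as a plain byte count; the size itself is arbitrary):

    export NV_ACC_CUDA_HEAPSIZE=67108864   # 64 MB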

As for the racecheck, I’m not sure either. The source looks fine to me. Maybe it’s picking up the partial reductions as a race condition?

Though I’d replace “dble(ni(b1)*nk(b1))” with a scalar computed before the loop, given it’s constant. You can then also take the reciprocal and change the divides to multiplies. Divides have gotten better performance-wise, but they’re still relatively slow.
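
Something along these lines (“inv_npts” is just a hypothetical name for the precomputed scalar, declared as a real(dp)):

    inv_npts = 1.0d0/dble(ni(b1)*nk(b1))   ! precompute the constant reciprocal once

    !$acc parallel loop gang vector collapse(2) &
    !$acc& reduction(+:u_tau_ewm,rho_ewm_wall_avg,mu_ewm_wall_avg) &
    !$acc& default(present)
    do k = 1,nk(b1)
       do i = 1,ni(b1)
          u_tau_ewm        = u_tau_ewm        + utau_ewm(i,k,b1)       *inv_npts
          rho_ewm_wall_avg = rho_ewm_wall_avg + rho_ewm_wall_array(i,k)*inv_npts
          mu_ewm_wall_avg  = mu_ewm_wall_avg  + mu_ewm_wall_array (i,k)*inv_npts
       end do
    end do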
