OpenACC Optimization Help Needed: Execution Time Increase

Hello everyone,

I’m seeking help with optimizing a Fortran subroutine using OpenACC directives. The subroutine in question is euler_LLF_x, which is called within a larger loop. Despite the added directives, execution is slower than the version without them, and my goal is to reduce the execution time.
Here’s a summary of what I’ve done:
Before the main loop that calls this subroutine, I’ve added:
!$acc data copyin(T, x, y, z, dx_i, adi, ha, v, W_i, cp, thd_part)
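For context, this is roughly how the data region wraps the call; the loop variable "it" and bound "nt" below are placeholder names, not from my actual code:

!$acc data copyin(T, x, y, z, dx_i, adi, ha, v, W_i, cp, thd_part)
do it = 1 , nt ! hypothetical main time loop
   call euler_LLF_x ( thd , adi , dx_i , x , y , z , T , W_i , cp , ha , v , fl )
end do
!$acc end data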
The subroutine is:

subroutine euler_LLF_x ( thd , adi , dx_i , x , y , z , T , W_i , cp , ha , v , fl )

   type (part_thd_type)                                           :: thd_part !< particle derived type
   type (thd_type) , intent (in)                                  :: thd  !< thermodynamic derived type
   type (adi_type) , intent (in)                                  :: adi  !< non-dimensional derived type
   real (dp) , allocatable , dimension (:) , intent (in)          :: dx_i !< inverted dx array
   real (dp) , allocatable , dimension (:) , intent (in)          :: x    !< x-coordinate array
   real (dp) , allocatable , dimension (:) , intent (in)          :: y    !< y-coordinate array
   real (dp) , allocatable , dimension (:) , intent (in)          :: z    !< z-coordinate array
   real (dp) , allocatable , dimension (:,:,:) , intent (in)      :: T    !< temperature
   real (dp) , allocatable , dimension (:,:,:) , intent (in)      :: W_i  !< inverted molar mass
   real (dp) , allocatable , dimension (:,:,:) , intent (in)      :: cp   !< heat capacity
   real (dp) , allocatable , dimension (:,:,:,:) , intent (in)    :: ha   !< species enthalpy
   real (dp) , allocatable , dimension (:,:,:,:) , intent (in)    :: v    !< conserved variables array
   real (dp) , allocatable , dimension (:,:,:,:) , intent (inout) :: fl   !< x-component flux


   integer (ip) , parameter                  :: stencil_m1 = ng+ng

   integer (ip)                              :: ok

   integer (ip)                              :: st , i_s , i_c , i_l , i_r , j , k , l , m , mm , s1 , s2

   real (dp)                                 :: ux_c , vy_c , wz_c , T_c , ht_c , et_c , W_i_c , gam_c , cs_c , cp_c

   real (dp) , dimension (nrv)               :: psi_c , ha_c

   real (dp) , dimension (nrv+npv+nvv)       :: Y_c

   real (dp) , dimension (nv)                :: ghat

   real (dp)                                 :: drho , dpres , eigenmax , df , gl , gr

   real (dp)                                 :: r , r1 , cs_c_i , cs_c_2 , b1 , b2 , q , xi , fl_W , fl_E

   real (dp)                                 :: wrk

   real (dp) , dimension (nv,nv)             :: er , el

   logical                                   :: shk , neg_pres

   integer (ip) , dimension (ndimmax)        :: neg_pres_coords

   real (dp) , dimension (stencil_m1,nv)     :: f_s , L_s

   real (dp) , dimension (stencil_m1)        :: gplus , gminus , wc , gc

   real (dp) , dimension (:,:) , allocatable :: fhat , Ya

   real (dp) , dimension (:) , allocatable   :: rho_i , gam , ux , vy , wz , cs , P , ht , et
   integer :: dummy

   
   allocate ( fhat  (sx-1:ex,nv)            , &
              vy    (sx-1:ex+1)             , &
              wz    (sx-1:ex+1)             , &
              ht    (sx-1:ex+1)             , &
              et    (sx-1:ex+1)             , &
              Ya    (sx-1:ex+1,nrv+npv+nvv) , &
              ux    (sx-ng:ex+ng)           , &
              cs    (sx-ng:ex+ng)           , &
              P     (sx-ng:ex+ng)           , &
              gam   (sx-ng:ex+ng)           , &
              rho_i (sx-ng:ex+ng)           , &
              stat = ok )
   if ( ok > 0 ) call abort_mpi ('error allocate euler_LLF_x')


   fl_W = 1.0_dp
   if ( bc (W) == noreflection ) fl_W = 0.0_dp
   fl_E = 1.0_dp
   if ( bc (E) == noreflection ) fl_E = 0.0_dp

   neg_pres = .false.
   ! Initialize neg_pres_coords
   neg_pres_coords = 0

  call part_thd_init(thd, thd_part)
  
  !$acc data  copyin(fl, neg_pres_coords, neg_pres, thd_part)
  !$acc parallel default(present) &
  !$acc create(fhat, P, Ya, ht, et, ux, vy, wz, rho_i) &
  !$acc private  (df, el, er, ghat, shk, gam, cs,i_s, ha_c,  L_s, f_s, neg_pres, neg_pres_coords)  
  !$acc cache(df, el, er, ghat, shk, gam, cs,i_s, ha_c,  L_s, f_s)
  !$acc loop gang collapse(2) 
  do k = sz , ez ! loop in the z-direction
     do j = sy , ey ! loop in the y-direction


        !$acc loop vector 
        do i_c = sx-ng , ex+ng

            rho_i (i_c) = 1.0_dp / v (i_c,j,k,1)
            ux (i_c)    = v (i_c,j,k,2) * rho_i (i_c)
            P (i_c)     = v (i_c,j,k,1) * T (i_c,j,k) * W_i (i_c,j,k)
            if ( P (i_c) <= 0.0_dp .and. .not. neg_pres ) then
              !$acc atomic write
               neg_pres = .true.
              !$acc atomic write
               neg_pres_coords (1) = i_c
              !$acc atomic write
               neg_pres_coords (2) = j
              !$acc atomic write
               neg_pres_coords (3) = k
            end if
            gam (i_c) = thd_part % GaM2 * W_i (i_c,j,k) / cp (i_c,j,k)
            gam (i_c) = 1.0_dp / ( 1.0_dp - gam (i_c) )
            cs (i_c)  = sqrt ( gam (i_c) * P (i_c) * rho_i (i_c) )
       
        end do
           

                 
        !$acc loop vector 
        do i_c = sx-1 , ex+1

            vy (i_c) = v (i_c,j,k,3) * rho_i (i_c)
            wz (i_c) = v (i_c,j,k,4) * rho_i (i_c)
            do l = 1 , nrv+npv+nvv
               Ya (i_c,l) = v (i_c,j,k,niv+l) * rho_i (i_c)
            end do
            ht (i_c) = 0.0_dp
            do l = 1 , nrv
               ht (i_c) = ht (i_c) + ha (i_c,j,k,l) * Ya (i_c,l)
            end do
            ht (i_c) = ht (i_c) + 0.5_dp * ( ux(i_c)*ux(i_c) + vy(i_c)*vy(i_c) + wz(i_c)*wz(i_c) )
            et (i_c) = v (i_c,j,k,5) * rho_i (i_c)
        end do


        do i_l = sx-1 , ex ! loop on the cell faces


            shk = .false.
            i_r = i_l + 1
           !$acc loop vector 
           do st = 1 , stencil_m1

               i_s = i_l + st - ng

               f_s (st,1) = v (i_s,j,k,2)
               f_s (st,2) = v (i_s,j,k,2) * ux (i_s) + P (i_s)
               f_s (st,3) = v (i_s,j,k,3) * ux (i_s)
               f_s (st,4) = v (i_s,j,k,4) * ux (i_s)
               f_s (st,5) = ux (i_s) * ( v (i_s,j,k,5) + P (i_s) )

               do l = niv+1 , nv
                  f_s (st,l) = v (i_s,j,k,l) * ux (i_s)
               end do


               L_s (st,1) = abs ( ux (i_s) - cs (i_s) )
               L_s (st,2) = abs ( ux (i_s) )
               L_s (st,3) = abs ( ux (i_s) + cs (i_s) )
               L_s (st,4) = L_s (st,2)
               L_s (st,5) = L_s (st,2)
               do l = niv+1 , nv
                  L_s (st,l) = L_s (st,2)
               end do
            ! density criteria
             drho = abs ( v (min(i_s+1,ex),j,k,1) - v (i_s,j,k,1) ) * &
                    min ( rho_i (i_s) , rho_i (min(i_s+1,ex)) )

            ! pressure criteria
             dpres = abs ( P (min(i_s+1,ex)) - P (i_s) ) / &
                     min ( P (i_s) , P (min(i_s+1,ex)) )

             if ( drho > max_rel_weight .and. dpres > max_rel_weight ) shk = .true.
           end do

           ! activate 2D WENO at the boundaries (not 3D because of periodicity)
           if ( i_l < 1+ng )   shk = .true.
           if ( i_l > ntx-ng ) shk = .true.
           if ( j   < 1+ng )   shk = .true.
           if ( j   > nty-ng ) shk = .true.

            ! Roe's average state (by definition: velocities, mass fractions and total enthalpy)
            r     = sqrt ( v (i_r,j,k,1) * rho_i (i_l) )
            r1    = 1.0_dp / ( r + 1.0_dp )
            ux_c  = ( r * ux (i_r)  + ux (i_l)  ) * r1
            vy_c  = ( r * vy (i_r)  + vy (i_l)  ) * r1
            wz_c  = ( r * wz (i_r)  + wz (i_l)  ) * r1
            do l = 1 , nrv+npv+nvv
               Y_c (l) = ( r * ( Ya (i_r,l) ) + Ya (i_l,l) ) * r1
            end do
            ht_c  = ( r * ht (i_r)  + ht (i_l)  ) * r1
            et_c  = ( r * et (i_r)  + et (i_l)  ) * r1 ! but also total energy (Dieterding)
           !average state (Roe approx. compatible with EOS)
           call Wmix_i_scalar_2 (thd_part , Y_c , W_i_c )

           T_c = ( ht_c - et_c ) / W_i_c ! averaged temperature compatible with EOS and previous Roe's averages
           call cp_scalar_2 ( thd_part , T_c , Y_c , cp_c ) ! cp compatible with EOS and previous Roe's averages
           gam_c = thd_part % GaM2 * W_i_c / cp_c
           gam_c = 1.0_dp / ( 1.0_dp - gam_c ) ! gamma compatible with EOS
           cs_c  = sqrt ( gam_c * T_c * W_i_c ) ! sound speed compatible with EOS

           call ha_scalar_2 ( thd_part , T_c , ha_c )  ! species enthalpies compatible with EOS
           wrk = ( gam_c / ( gam_c - 1.0_dp ) ) * T_c
           do l = 1 , nrv
              psi_c (l) = - ( r * ha_c (l) + ha_c (l) ) * r1
              psi_c (l) = psi_c (l) + wrk * thd_part % Wc_i (l)
           end do

            ! auxiliary variables to fill the matrices
            cs_c_i = 1.0_dp / cs_c
            cs_c_2 = cs_c * cs_c
            q      = 0.5_dp * ( ux_c*ux_c + vy_c*vy_c + wz_c*wz_c )
            b2     = ( gam_c - 1.0_dp ) / cs_c_2
            b1     = b2 * q
            xi     = b1 - b2 * ht_c

            ! 1 : FIRST
            el(1,1)   =   0.5_dp * ( b1        + ux_c * cs_c_i )
            el(1,2)   = - 0.5_dp * ( b2 * ux_c +        cs_c_i )
            el(1,3)   = - 0.5_dp * ( b2 * vy_c                 )
            el(1,4)   = - 0.5_dp * ( b2 * wz_c                 )
            el(1,5)   =   0.5_dp * b2

            el(2,1)   =   1.0_dp - b1
            el(2,2)   =   b2 * ux_c
            el(2,3)   =   b2 * vy_c
            el(2,4)   =   b2 * wz_c
            el(2,5)   = - b2

            el(3,1)   =   0.5_dp * ( b1        - ux_c * cs_c_i )
            el(3,2)   = - 0.5_dp * ( b2 * ux_c -        cs_c_i )
            el(3,3)   = - 0.5_dp * ( b2 * vy_c                 )
            el(3,4)   = - 0.5_dp * ( b2 * wz_c                 )
            el(3,5)   =   0.5_dp * b2

            el(4,1)   = - vy_c
            el(4,2)   =   0.0_dp
            el(4,3)   =   1.0_dp
            el(4,4)   =   0.0_dp
            el(4,5)   =   0.0_dp

            el(5,1)   = - wz_c
            el(5,2)   =   0.0_dp
            el(5,3)   =   0.0_dp
            el(5,4)   =   1.0_dp
            el(5,5)   =   0.0_dp

            ! 2 : SECOND
            wrk = 0.5_dp * b2
            do l = niv+1 , nv-npv-nvv
               el ( 1 , l ) = wrk * psi_c ( l-niv )
            end do
            el ( 2   , niv+1:nv-npv-nvv ) = - el ( 1 , niv+1:nv-npv-nvv ) - el ( 1 , niv+1:nv-npv-nvv )
            el ( 3   , niv+1:nv-npv-nvv ) =   el ( 1 , niv+1:nv-npv-nvv )
            el ( 4:5 , niv+1:nv-npv-nvv ) =   0.0_dp
            if ( npv > 0 ) then
               el ( 1:5 , nv-npv-nvv+1:nv ) = 0.0_dp
            end if

            ! 3 : THIRD
            el(6,1)   =   ( 1 + xi ) * q
            el(6,2)   = - ( 1 + xi ) * ux_c
            el(6,3)   = - ( 1 + xi ) * vy_c
            el(6,4)   = - ( 1 + xi ) * wz_c
            el(6,5)   =   ( 1 + xi )
            do l = niv+1 , nv-npv-nvv
               el ( niv+1 , l ) = xi * psi_c ( l-niv )
            end do
            if ( npv > 0 ) then
               el ( niv+1 , nv-npv-nvv+1:nv ) = 0.0_dp
            end if

            if ( nv-npv-nvv >= 7 ) then
               ! 4 : FOURTH
               do l = 7 , nv-npv-nvv
                  el(l,1) = - b1        * Y_c (l-6)
                  el(l,2) =   b2 * ux_c * Y_c (l-6)
                  el(l,3) =   b2 * vy_c * Y_c (l-6)
                  el(l,4) =   b2 * wz_c * Y_c (l-6)
                  el(l,5) = - b2        * Y_c (l-6)
               end do
               if ( npv > 0 ) then
                  do l = nv-npv-nvv+1 , nv
                     el(l,1) = - b1        * Y_c (l-5)
                     el(l,2) =   b2 * ux_c * Y_c (l-5)
                     el(l,3) =   b2 * vy_c * Y_c (l-5)
                     el(l,4) =   b2 * wz_c * Y_c (l-5)
                     el(l,5) = - b2        * Y_c (l-5)
                  end do
               end if
               ! 5 : FIFTH
               do s1 = 1 , nv-npv-nvv-niv-1
                  do s2 = 1 , nv-npv-nvv-niv
                     el ( s1+6 , s2+5 ) = - b2 * Y_c (s1) * psi_c (s2)
                  end do
               end do
               do l = 1 , nv-npv-nvv-niv-1
                  el ( l+6 , l+5 ) = el ( l+6 , l+5 ) + 1.0_dp
               end do
               if ( npv > 0 ) then
                  el (7:nv-npv-nvv,nv-npv-nvv+1:nv) = 0.0_dp
                  do s1 = 1 , npv+nvv
                     do s2 = 1 , nv-npv-nvv-niv
                        el ( s1+nv-npv-nvv , s2+5 ) = - b2 * Y_c (s1+nv-npv-nvv-niv) * psi_c (s2)
                     end do
                  end do
                  do s1 = 1 , npv+nvv
                     do s2 = 1 , npv+nvv
                        el ( s1+nv-npv-nvv , s2+nv-npv-nvv ) = 0.0_dp
                     end do
                  end do
                  do l = 1 , npv+nvv
                     el ( l+nv-npv-nvv , l+nv-npv-nvv ) = 1.0_dp
                  end do
               end if
            end if

            ! 1 : FIRST
            er(1,1)   =  1.0_dp
            er(1,2)   =  1.0_dp
            er(1,3)   =  1.0_dp
            er(1,4)   =  0.0_dp
            er(1,5)   =  0.0_dp

            er(2,1)   =  ux_c - cs_c
            er(2,2)   =  ux_c
            er(2,3)   =  ux_c + cs_c
            er(2,4)   =  0.0_dp
            er(2,5)   =  0.0_dp

            er(3,1)   =  vy_c
            er(3,2)   =  vy_c
            er(3,3)   =  vy_c
            er(3,4)   =  1.0_dp
            er(3,5)   =  0.0_dp

            er(4,1)   =  wz_c
            er(4,2)   =  wz_c
            er(4,3)   =  wz_c
            er(4,4)   =  0.0_dp
            er(4,5)   =  1.0_dp

            er(5,1)   =  ht_c - ux_c * cs_c
            er(5,2)   =  q
            er(5,3)   =  ht_c + ux_c * cs_c
            er(5,4)   =  vy_c
            er(5,5)   =  wz_c

            ! 2 : SECOND
            er(1:4,niv+1:nv) = 0.0_dp

            ! 3 : THIRD
            do l = niv+1 , nv
               er(l,1) =  Y_c (l-5)
               er(l,2) =  0.0_dp
               er(l,3) =  Y_c (l-5)
               er(l,4) =  0.0_dp
               er(l,5) =  0.0_dp
            end do

            ! 4 : FOURTH
            do s1 = niv , nv-npv-nvv
               do s2 = niv+1 , nv
                  er(s1,s2) = 0.0_dp
               end do
            end do
            do l = niv+1 , nv-npv-nvv
               er(l-1,l) = 1.0_dp
            end do

            ! 5 : FIFTH
            er ( nv-npv-nvv , 6 ) = - 1.0_dp / psi_c (nv-npv-nvv-niv)
            if ( nv-npv-nvv >= 7 ) then
               wrk = - 1.0_dp / psi_c (nv-npv-nvv-niv)
               do l = 1 , nv-npv-nvv-niv-1
                  er (nv-npv-nvv,l+niv+1) = wrk * psi_c (l)
               end do
               if ( npv > 0 ) then
                  er (nv-npv-nvv,nv-npv-nvv+1:nv) = 0.0_dp
                  do s1 = nv-npv-nvv+1 , nv
                     do s2 = niv+1 , nv
                        er(s1,s2) = 0.0_dp
                     end do
                  end do
                  do l = 1 , npv+nvv
                     er(l+nv-npv-nvv,l+nv-npv-nvv) = 1.0_dp
                  end do
               end if
            end if

           
           !$acc loop vector private(eigenmax,  gplus , gminus , wc , gc , gl , gr) 
            do m = 1 , nv ! loop on the five char. fields

               eigenmax = -1.0_dp
               !$acc loop reduction(max:eigenmax)
               do st = 1 , stencil_m1 ! LLF
                  eigenmax = max ( L_s (st,m) , eigenmax )
               end do
              
               do st = 1 , stencil_m1 ! loop over the stencil centered at face i
                  wc(st) = 0.0_dp
                  gc(st) = 0.0_dp
                  i_s = i_l + st - ng
                  do mm = 1 , nv
                     wc(st) = wc(st) + el(m,mm) * v (i_s,j,k,mm)
                     gc(st) = gc(st) + el(m,mm) * f_s (st,mm)
                  end do
                  gplus (st) = 0.5_dp * ( gc(st) + eigenmax * wc(st) )
                  gminus(st) = gc(st) - gplus (st)
               end do

               ! Reconstruction of the '+' and '-' fluxes (page 32)
               if ( shk .and. weno_avg ) then

                  call wenorec  ( gplus , gminus , gl , gr )
               else
                  call wenorec_nw ( gplus , gminus , gl , gr )
               end if

               ghat(m) = gl + gr ! char. flux

            end do

            ! Evaluation of fhat: the aim of this loop
   
           !$acc loop vector
            do m = 1 , nv
               fhat (i_l,m) = 0.0_dp
               do mm = 1 , nv
                  fhat (i_l,m) = fhat(i_l,m) + er(m,mm) * ghat(mm)
               end do
            end do

        end do ! end of loop on the cell faces

        !$acc loop vector collapse(2)
         do m = 1 , nv
            do i_c = sx , sx
               i_l = i_c - 1
               df = ( fhat(i_c,m) - fhat(i_l,m) ) * dx_i (i_c)
               fl (i_c,j,k,m) = df * fl_W ! instead of fl = fl + df
            end do
         end do

        !$acc loop vector collapse(2)  
         do m = 1 , nv
            do i_c = sx+1 , ex-1 ! loop on the inner nodes
               i_l = i_c - 1
               df = ( fhat(i_c,m) - fhat(i_l,m) ) * dx_i (i_c)
               fl (i_c,j,k,m) = df ! instead of fl = fl + df
            end do
         end do
        !$acc loop vector collapse(2) 
         do m = 1 , nv
            do i_c = ex , ex
               i_l = i_c - 1
               df = ( fhat(i_c,m) - fhat(i_l,m) ) * dx_i (i_c)
               fl (i_c,j,k,m) = df * fl_E ! instead of fl = fl + df
            end do
         end do

         end do ! end of j-loop
      end do ! end of k-loop
!$acc end parallel
!$acc update self(fl, neg_pres_coords, neg_pres)
!$acc end data

! Report negative pressure (if any) and abort
   if (neg_pres) then
      write (*,'(1X,A,I9,5(1X,I10))') 'WARNING: negative pressure at X-dir (rank,i,j,k)   =' , &
                   rank, neg_pres_coords(1) , neg_pres_coords(2) , neg_pres_coords(3)
      write (*,'(42X,A,9X,5(1X,1PE10.3))') '(x,y,z)   =' , &
                                     x (neg_pres_coords(1)) * adi % L_ref , &
                                     y (neg_pres_coords(2)) * adi % L_ref , &
                                     z (neg_pres_coords(3)) * adi % L_ref

     call mpi_abort ( MPI_COMM_WORLD , dummy , mpicode )

   end if

   deallocate ( fhat , vy , wz , ht , Ya , et , &
                ux , cs , P , gam , rho_i ) 
 end subroutine euler_LLF_x

Despite all of this, the execution time has increased, and I’m unsure where the bottleneck is or how to optimize this code further.
Thank you!

Hi khawlaadjane,

What I’d suggest is to profile your code with Nsight Systems to see whether it’s a kernel-execution issue or a data-movement issue. You can also use Nsight Compute to do a hardware-level profile of the kernel to see where the bottlenecks are.
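If it helps with reading the timeline, you can also wrap the call site in an NVTX range so the subroutine shows up as a named span in Nsight Systems. A minimal sketch, assuming your NVHPC release provides the "nvtx" Fortran module (link with -lnvToolsExt):

use nvtx
! ...
call nvtxStartRange ("euler_LLF_x") ! named range around the suspect call
call euler_LLF_x ( thd , adi , dx_i , x , y , z , T , W_i , cp , ha , v , fl )
call nvtxEndRange

That makes it easy to see whether the time inside the range is spent in kernels or in implicit data transfers.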

Have you checked your code for correctness? I see that you have several shared arrays (like “rho_i”, “ux”, “P”, etc.) that may have collisions. The inner “i_c” loop covers the same index range on every iteration of the outer k and j loops, so the gangs will be writing over each other. Should these arrays be declared private?
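For example, a minimal sketch of what I mean, assuming the allocatable scratch arrays can go directly into a private clause (worth verifying against the compiler feedback), would be:

! Give each gang its own copy of the per-(j,k) work arrays instead of
! letting every gang write into one shared device array.
!$acc parallel loop gang collapse(2) default(present) &
!$acc private(fhat, P, Ya, ht, et, ux, vy, wz, rho_i, gam, cs)
do k = sz , ez
   do j = sy , ey
      ! ... body unchanged ...
   end do
end do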

Also, what are the loop trip counts? Are they in the 10s, 100s, 1000s?

What is the compiler feedback messages from the “-Minfo=accel” flag?

-Mat

Thank you so much for your reply. This is the output from the -Minfo=accel compiler flag:

wenopar:
    288, Generating update device(zu,zt,zs,zr,zq,zp,zo,zn,zm,zl,zi,zh,zg,zf,ze,zd,zc,zb,za,dweno(:),cweno(:,:),dtweno(:))
wenorec_nw:
    298, Generating acc routine seq
         Generating NVIDIA GPU code
wenorec:
    322, Generating acc routine seq
         Generating NVIDIA GPU code
euler_llf_x:
    562, Generating copyin(neg_pres,fl(:,:,:,:),neg_pres_coords(:),thd_part) [if not already present]
    563, Generating create(et(:),ht(:),ya(:,:),wz(:),fhat(:,:),rho_i(:),ux(:),vy(:),p(:)) [if not already present]
         Generating implicit firstprivate(ey,sy,ez,sz)
         Generating NVIDIA GPU code
        568, !$acc loop gang collapse(2) ! blockidx%x
        569,   ! blockidx%x collapsed
        573, !$acc loop vector(128) ! threadidx%x
        597, !$acc loop vector(128) ! threadidx%x
        601, !$acc loop seq
        605, !$acc loop seq
        613, !$acc loop seq
        619, !$acc loop vector(128) ! threadidx%x
        629, !$acc loop seq
        639, !$acc loop seq
        665, !$acc loop vector(128) ! threadidx%x
        682, !$acc loop vector(128) ! threadidx%x
        728, !$acc loop vector(128) ! threadidx%x
        731, !$acc loop vector(128) ! threadidx%x
        733, !$acc loop seq
             !$acc loop vector(128) ! threadidx%x
        735, !$acc loop seq
             !$acc loop vector(128) ! threadidx%x
        744, !$acc loop vector(128) ! threadidx%x
        748, !$acc loop vector(128) ! threadidx%x
        753, !$acc loop vector(128) ! threadidx%x
        761, !$acc loop vector(128) ! threadidx%x
        770, !$acc loop seq
        771, !$acc loop vector(128) ! threadidx%x
        775, !$acc loop vector(128) ! threadidx%x
        779, !$acc loop seq
             !$acc loop vector(128) ! threadidx%x
        780, !$acc loop seq
        781, !$acc loop vector(128) ! threadidx%x
        785, !$acc loop seq
        786, !$acc loop vector(128) ! threadidx%x
        790, !$acc loop vector(128) ! threadidx%x
        828, !$acc loop seq
             !$acc loop vector(128) ! threadidx%x
        831, !$acc loop vector(128) ! threadidx%x
        840, !$acc loop seq
        841, !$acc loop vector(128) ! threadidx%x
        845, !$acc loop vector(128) ! threadidx%x
        853, !$acc loop vector(128) ! threadidx%x
        857, !$acc loop vector(128) ! threadidx%x
        858, !$acc loop seq
        859, !$acc loop vector(128) ! threadidx%x
        863, !$acc loop vector(128) ! threadidx%x
        871, !$acc loop vector(128) ! threadidx%x
        875, !$acc loop seq
             Generating reduction(max:eigenmax)
        879, !$acc loop seq
        883, !$acc loop seq
        906, !$acc loop vector(128) ! threadidx%x
        908, !$acc loop seq
        916, !$acc loop vector(128) collapse(2) ! threadidx%x
        917,   ! threadidx%x collapsed
        925, !$acc loop vector(128) collapse(2) ! threadidx%x
        926,   ! threadidx%x collapsed
        933, !$acc loop vector(128) collapse(2) ! threadidx%x
        934,   ! threadidx%x collapsed
    563, CUDA shared memory used for cs,el
         Local memory used for gc
         CUDA shared memory used for f_s
         Local memory used for gplus
         CUDA shared memory used for ha_c,er,l_s
         Local memory used for gminus
         CUDA shared memory used for ghat,neg_pres_coords
         Local memory used for wc
         CUDA shared memory used for gam
         Generating default present(cp(:,:,:),y_c(:),w_i(:,:,:),ha(:,:,:,:),dx_i(:),psi_c(:),v(:,:,:,:),t(:,:,:))
    569, Generating implicit firstprivate(sx,nv,ex,i_l)
    573, Loop is parallelizable
    597, Loop is parallelizable
         Generating implicit firstprivate(nvv,npv,l)
    601, Loop is parallelizable
    605, Complex loop carried dependence of ht prevents parallelization
         Loop carried reuse of ht prevents parallelization
    613, Loop carried dependence of el prevents parallelization
         Loop carried backward dependence of el prevents vectorization
         Loop carried dependence of er prevents parallelization
         Loop carried backward dependence of er prevents vectorization
         Loop carried dependence of psi_c prevents parallelization
         Loop carried backward dependence of psi_c prevents vectorization
         Loop carried dependence of f_s prevents parallelization
         Loop carried backward dependence of f_s prevents vectorization
         Loop carried dependence of l_s prevents parallelization
         Loop carried backward dependence of l_s prevents vectorization
         Loop carried dependence of y_c prevents parallelization
         Loop carried backward dependence of y_c prevents vectorization
         Loop carried dependence of ghat prevents parallelization
         Loop carried backward dependence of ghat prevents vectorization
         Generating implicit firstprivate(ht_c,i_r,ntx,q,r,t_c,ux_c,wrk,gam_c,cs_c_i,vy_c,s1,b2,et_c,cp_c,w_i_c,cs_c_2,cs_c,nty,b1,xi,wz_c,r1)
    619, Scalar last value needed after loop for shk at line 892
         Generating implicit firstprivate(dpres,max_rel_weight,drho)
    629, Loop is parallelizable
    639, Loop is parallelizable
    665, Loop is parallelizable
    672, Reference argument passing prevents parallelization: w_i_c
    675, Reference argument passing prevents parallelization: cp_c
         Reference argument passing prevents parallelization: t_c
    680, Reference argument passing prevents parallelization: t_c
    682, Loop is parallelizable
    728, Loop is parallelizable
    731, Loop is parallelizable
    733, Loop is parallelizable
    735, Loop is parallelizable
    744, Loop is parallelizable
    748, Loop is parallelizable
    753, Loop is parallelizable
    761, Loop is parallelizable
    770, Loop is parallelizable
         Generating implicit firstprivate(s2)
    771, Loop is parallelizable
    775, Loop is parallelizable
    779, Loop is parallelizable
    780, Loop is parallelizable
    781, Loop is parallelizable
    785, Loop is parallelizable
    786, Loop is parallelizable
    790, Loop is parallelizable
    828, Loop is parallelizable
    831, Loop is parallelizable
    840, Loop is parallelizable
    841, Loop is parallelizable
    845, Loop is parallelizable
    853, Loop is parallelizable
    857, Loop is parallelizable
    858, Loop is parallelizable
    859, Loop is parallelizable
    863, Loop is parallelizable
    871, Loop is parallelizable
         Generating implicit firstprivate(weno_avg)
    875, Loop is parallelizable
    879, Loop is parallelizable
         Generating implicit firstprivate(mm)
    883, Complex loop carried dependence of wc,gc prevents parallelization
         Loop carried reuse of wc,gc prevents parallelization
    906, Loop is parallelizable
    908, Complex loop carried dependence of fhat prevents parallelization
         Loop carried reuse of fhat prevents parallelization
    916, Loop is parallelizable
    917, Loop is parallelizable
         Generating implicit firstprivate(fl_w)
    925, Loop is parallelizable
    926, Loop is parallelizable
    933, Loop is parallelizable
    934, Loop is parallelizable
         Generating implicit firstprivate(fl_e)
    944, Generating update self(fl(:,:,:,:),neg_pres_coords(:),neg_pres)
mpif90  -L/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/compilers/lib -lacchost -laccdevice -lacccuda -laccdevaux -laccdevaux10 -laccdevaux110 -laccdevaux113 -L/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/cuda/12.3/targets/x86_64-linux/lib -lcudart -lcudadevrt -lnvToolsExt -cuda nrtype.o  nrutil.o  module_nr_subroutines.o  random.o module_parameters.o profiling.o module_parallel.o module_adim.o module_input.o param_thd.o  type_thd.o part_type_thd.o common_file.o  file_tools.o  thd_tools_perfectgas.o module_thermodynamics.o module_Rankine_Hugoniot.o module_weno.o module_deriv.o module_tools.o module_SGS_models.o module_eg_lib.o module_viscflux.o module_BCs.o module_ICs.o module_reaction.o module_solver.o ckinterp36.o  tranfit.o ckinterp39.o  xerror.o  tran.o initchemkin.o  vode.o  psr_chemkin.o  psr_simp_chemkin.o  reaction_v.o egfrmc.o  EGSlib.o  EGini.o izem.o      -o  build/izem
I've also profiled the code using NVIDIA Nsight Systems and identified that it's a kernel execution issue, but I don't know how to solve it:

Time: 97.5%
Total time: 66.443 s
Instances: 2631
Category: CUDA_KERNEL
Operation: weno_euler_llf_x_569_gpu