Need advice for optimizing the heart of a CFD code

Hello,

I have a 3D CFD application (finite-volume scheme) written in Fortran that already has two levels of parallelism: MPI and OpenMP. I am trying to add an alternative second level with OpenACC directives on a Tesla K40 GPU, for comparison with OpenMP.
The heart of the application (the largest CPU-time-consuming part, about 60%) is the computation of the fluxes in the 3 directions.
For each direction, we have two nested loops (over the other two directions), with a vector extraction and the computation of the fluxes for all points of this vector. So this construction appears three times.
The computation of the fluxes is done by another subroutine, called inside these two nested loops. This subroutine uses temporary variables, scalars and vectors, that should be private for each call.
Simply, for the first direction, we have

   DO k = 1, Nz
      DO j = 1, Ny
!
         DO L = 1, 5    ! Five unknowns per mesh cell
            DO i = 1, Nx
               Q1(i,L) = Q(i,L,j,k)
            END DO
         END DO
!
         CALL compute_flux (Nx,Q1,F)
!
         DO L = 1, 5
            DO i = 1, Nx
               RQ(i,L,j,k) = linear function of (F(i,L) )
            END DO
         END DO
!
      END DO
   END DO

In the subroutine compute_flux, we have temporary arrays used in several loops that can be vectorized by the compiler. Each array should be private to the thread that makes the call.
We have (N is the dimension received as an argument):
11 local arrays of dimension (N)
2 local arrays of dimension (N,5)
2 local arrays of dimension (N,6,5)

In the calling subroutine, I chose to parallelize the two nested loops:

   DO k = 1, Nz
      DO j = 1, Ny

In this case, I think I can use the following OpenACC directives:

!$ACC ROUTINE(compute_flux) VECTOR
!$ACC DATA PRESENT(Q,RQ,...)
!$ACC PARALLEL PRIVATE(Q1,F,...)
!$ACC LOOP COLLAPSE(2) PRIVATE(i,L)
   DO k = 1, Nz
      DO j = 1, Ny
!
         DO L = 1, 5
            DO i = 1, Nx
               Q1(i,L) = Q(i,L,j,k)
            END DO
         END DO
!
         CALL compute_flux (Nx,Q1,F)
!
         DO L = 1, 5
            DO i = 1, Nx
               RQ(i,L,j,k) = linear function of (F(i,L) )
            END DO
         END DO
!
      END DO
   END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) PRIVATE(j,L)
   DO k = 1, Nz
      DO i = 1, Nx

      END DO
   END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) PRIVATE(k,L)
   DO j = 1, Ny
      DO i = 1, Nx

      END DO
   END DO
!$ACC END LOOP
!$ACC END PARALLEL
!$ACC END DATA

In the subroutine compute_flux, one has:

!$ACC ROUTINE VECTOR

and directives like

!$ACC LOOP VECTOR

for the different loops in it.

In compute_flux, there are several loops. Local arrays computed in a loop are reused in following loops, so they have to be preserved during the execution of this routine.
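Schematically, the structure of compute_flux is something like this (a simplified sketch with invented names, and REAL(8) in place of our working precision kind):

      SUBROUTINE compute_flux (N, Q1, F)
!$ACC ROUTINE VECTOR
      INTEGER,  INTENT(IN)  :: N
      REAL(8),  INTENT(IN)  :: Q1(N,5)
      REAL(8),  INTENT(OUT) :: F (N,5)
      REAL(8) :: W(N)                      ! one of the local (automatic) work arrays
      INTEGER :: i, L
!
!$ACC LOOP VECTOR
      DO i = 1, N
         W(i) = Q1(i,1) + Q1(i,5)          ! W is computed in this loop ...
      END DO
!
!$ACC LOOP VECTOR
      DO i = 1, N
         DO L = 1, 5
            F(i,L) = W(i) * Q1(i,L)        ! ... and reused in the following one
         END DO
      END DO
!
      END SUBROUTINE compute_flux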

My problem is that I do not know how to make the local arrays of this subroutine private (to the thread that calls compute_flux) while remaining efficient.

I tried adding clauses specifying the number of gangs and the number of workers.
I can go up to 64 gangs / 1 worker; beyond that I get an internal CUDA error. In the best case, with 64 gangs, the code runs twice as slowly as 1 MPI process alone.
If I use more than one worker, I get an internal CUDA error. I think it comes from local arrays that are shared.

I want to increase my mesh size, from (100x150x100) to (1000x300x26) and more after that. With the larger mesh I can’t go beyond 4 gangs / 1 worker (CUDA internal error), and the computing time is very poor compared to the one obtained using the 20 CPU cores of the node.

Could you please give me some advice?

Regards,
Guy.

Hi Guy,

I can’t speak to the CUDA error you’re getting without more details, but I do think we can improve how you’re using the directives. Hopefully this will resolve both the errors and the performance issues.

First, I wouldn’t use “parallel” here, or at least not make one large parallel region that spans multiple loop nests. “parallel” creates a single CUDA kernel, which will be slower than having a separate kernel for each of the nested loops.

Second, I would privatize “Q1” and “F” on each loop nest. The reason to put them in a “private” clause at the “parallel” level is if each gang’s private data needs to be propagated from one loop nest to the next, but I don’t think that’s the case here. Also, there’s no need to privatize “i” and “L”, since in OpenACC scalars are private by default.

Third, I’d interchange the “L” and “i” loops, then make the “i” loop a “vector” loop.

Try something like the following:

!$ACC ROUTINE(compute_flux) VECTOR 
!$ACC DATA PRESENT(Q,RQ,...) 
!$ACC KERNELS 
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,...) 
   DO k = 1, Nz 
      DO j = 1, Ny 
! 
!$ACC LOOP VECTOR   
         DO i = 1, Nx 
            DO L = 1, 5
                Q1(i,L) = Q(i,L,j,k) 
            END DO 
         END DO 
! 
         CALL compute_flux (Nx,Q1,F) 
! 
!$ACC LOOP VECTOR 
        DO i = 1, Nx
            DO L = 1, 5  
               RQ(i,L,j,k) = linear function of (F(i,L) ) 
            END DO 
         END DO 
! 
      END DO 
   END DO 
... 
... 
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,...) 
   DO k = 1, Nz 
      DO i = 1, Nx 
... inner vector loops 
      END DO 
   END DO 
!$ACC END LOOP 
... 
... 
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,...) 
   DO j = 1, Ny 
      DO i = 1, Nx 
... inner vector loops
      END DO 
   END DO 
!$ACC END KERNELS
!$ACC END DATA

Be sure to compare the compiler feedback messages (-Minfo=accel) with your original and this version of the code. The messages are often very informative as to what’s happening. Also, setting “PGI_ACC_TIME=1” in your environment will give you some quick profiling information.

Hope this helps,
Mat

Hello Mat,

thank you for your message.
I’ll try what you propose, even though it seems strange to me to interchange the “L” and “i” loops.
I’ll be back ASAP.

Regards,

Hello,

I changed my source file following your advice.
At first, I got lots of complaints from pgf90 about the parallelization, for example:

     68, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization
     69, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization
         68, !$acc loop seq collapse(2)
         69,   collapsed
         72, !$acc loop seq
         74, !$acc loop vector(32) ! threadidx%x
         95, !$acc loop seq
         97, !$acc loop vector(32) ! threadidx%x
        106, !$acc loop seq
        108, !$acc loop vector(32) ! threadidx%x
        118, !$acc loop seq
        120, !$acc loop vector(32) ! threadidx%x
     72, Parallelization would require privatization of array q1(:,:)

I put INDEPENDENT clauses everywhere I could, and now it seems better: all loops are parallelized.

I ran the code but it fails immediately in the first large parallel loop nest of the routine:

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

If I set the number of gangs, for example to only 1, the code behaves better, but it crashes the second time it enters this routine, still in the first large parallel loop nest.

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

If I try other values like 2 or 4, it crashes the first time.

Here is the first part, the X direction, of the routine:

 42 
 43 !$ACC DATA PRESENT (Q, QP, RQ, Xa, Yb, Zc, TV_i, sensor)
 44 
 45 !
 46       IF (iInvF == 4) THEN
 47          CALL Shock_Sensor
 48       END IF
 49 !
 50 
 51 !$ACC KERNELS
 52 
 53 
 54 !*******************************************
 55 !****  Computation of inviscid X-terms  ****
 56 !*******************************************
 57 !$ACC LOOP VECTOR INDEPENDENT
 58       DO i = Imin, Imax
 59          TV_i(i) = one / Xa(i)
 60       END DO
 61 !$ACC END LOOP
 62 !
 63       sq = one / Da
 64 !
 65       iX = 1
 66       iY = 0
 67       iZ = 0
 68 !
 69 !$ACC LOOP COLLAPSE(2) GANG(4) PRIVATE(F,Fc,Fw,Q1) INDEPENDENT
 70       DO k = Kmin, Kmax
 71          DO j = Jmin, Jmax
 72 !
 73 !$ACC LOOP VECTOR INDEPENDENT
 74             DO i = Imin-3, Imax+3
 75 !$ACC LOOP INDEPENDENT
 76                DO L = 1, 5
 77                   Q1(i,L) = Q(i,L,j,k)
 78                END DO
 79             END DO
 80 !$ACC END LOOP
 81 
 82 !C** Computation of ENO fluxes
 83             SELECT CASE (iInvF)
 84             CASE (1)
 85                CALL Flux         (iX, iY, iZ, Imin, Imax, F, Fc, Q1)
 86 !
 87             CASE (2)
 88                CALL Flux_split   (iX, iY, iZ, Imin, Imax, F, Q1)
 89 !
 90             CASE (3)
 91                CALL Flux_split_6 (iX, iY, iZ, Imin, Imax, F, Q1)
 92 !
 93             CASE (4)
 94                CALL Flux         (iX, iY, iZ, Imin, Imax, F, Fc, Q1)
 95 !
 96 !$ACC LOOP VECTOR INDEPENDENT
 97                DO i = Imin-1, Imax
 98 !$ACC LOOP INDEPENDENT
 99                   DO L = 1, 5
100                      Fw(i,L) = F(i,L)
101                   END DO
102                END DO
103 !$ACC END LOOP
104 !
105                CALL Flux_split_6 (iX, iY, iZ, Imin, Imax, F, Q1)
106 !
107 !$ACC LOOP VECTOR INDEPENDENT
108                DO i = Imin-1, Imax
109 !$ACC LOOP INDEPENDENT
110                   DO L = 1, 5
111                      F(i,L) = Fw(i,L) * REAL(sensor(i,j,k),rp) + F(i,L) * REAL(1-sensor(i,j,k),rp)
112                   END DO
113                END DO
114 !$ACC END LOOP
115 !
116             END SELECT
117 
118 !C** Computation of R.H.S.
119 !$ACC LOOP VECTOR INDEPENDENT
120             DO i = Imin, Imax
121 !$ACC LOOP INDEPENDENT
122                DO L = 1, 5
123                   RQ(i,L,j,k) = RQ(i,L,j,k) + (F(i-1,L) - F(i,L) ) * sq * TV_i(i)
124                END DO
125             END DO
126 !$ACC END LOOP
127 !
128          END DO
129       END DO
130 !$ACC END LOOP

The corresponding output of the compiler is:

pgf90  -O3 -Mvect -Munroll -Mipa=inline -Mextend -mcmodel=medium -g -traceback -Mpreprocess -acc -Minfo -Minfo=accel -c ./SRC/invrhs.f
     43, Generating present(q(:,:,:,:),qp(:,:,:,:),rq(:,:,:,:),xa(:),yb(:),zc(:),tv_i(:),sensor(:,:,:))
     51, Generating copy(iz,iy,ix)
     58, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         58, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     58, Generated 4 alternate versions of the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     70, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         70, !$acc loop gang(4) collapse(2) ! blockidx%x
         71,   ! blockidx%x collapsed
         74, !$acc loop vector(128) ! threadidx%x
         76, !$acc loop seq
         97, !$acc loop vector(128) ! threadidx%x
         99, !$acc loop seq
        108, !$acc loop vector(128) ! threadidx%x
        110, !$acc loop seq
        120, !$acc loop vector(128) ! threadidx%x
        122, !$acc loop seq
     71, Loop not vectorized/parallelized: contains call
     74, Loop is parallelizable
         Loop interchange produces reordered loop nest: 76,74
         Generated 4 alternate versions of the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     76, Loop is parallelizable
         Outer loop unrolled 5 times (completely unrolled)
     97, Loop is parallelizable
         Loop interchange produces reordered loop nest: 99,97
         Generated 4 alternate versions of the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     99, Loop is parallelizable
         Outer loop unrolled 5 times (completely unrolled)
    108, Loop is parallelizable
         Loop interchange produces reordered loop nest: 110,108
         Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
         Generated 4 prefetch instructions for the loop
    110, Loop is parallelizable
         Outer loop unrolled 5 times (completely unrolled)
    120, Loop is parallelizable
         Loop interchange produces reordered loop nest: 122,120
         Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
         Generated 3 prefetch instructions for the loop
    122, Loop is parallelizable
         Outer loop unrolled 5 times (completely unrolled)

The private arrays are declared as local, automatic arrays, like this:

 17       REAL(rp),  DIMENSION( 0:((Nmax-0+1)/8+1)*8-1,5)       ::  F
 18       REAL(rp),  DIMENSION( 0:((Nmax-0+1)/8+1)*8-1,5)       ::  Fw
 19       REAL(rp),  DIMENSION(-2:((Nmax+3-(-2)+1)/8+1)*8-3,5)  ::  Fc
 20       REAL(rp),  DIMENSION(-2:((Nmax+3-(-2)+1)/8+1)*8-3,5)  ::  Q1

Nmax= MAX(Nx,Ny,Nz)

My test case sets Nx = 100 ; Ny = 150 ; Nz = 100 and iInvF = 1.

Is there something I’ve missed?

Guy.

Hi Guy,

51, Generating copy(iz,iy,ix)

This is a potential issue. One of the few cases where the compiler doesn’t automatically privatize scalars is when they are passed by reference to a device routine, as is the case here. Hence, these variables are global and shared across all threads.

Do any of your flux routines modify these variables? If so, you may have a dependency, given that you initialize these variables before the k loop: the values of iz, iy, and ix would then depend upon what they were set to in previous iterations of the k and j loops.

If no dependency exists, then you can put these variables into a “firstprivate” clause so that each thread gets its own copy.
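For example, here is a minimal, self-contained sketch (made-up names). Note that “firstprivate” is accepted on the “parallel” (and “serial”) compute constructs, but not on “kernels” or on “loop”:

      program fp_sketch
      implicit none
      integer :: ix, iy, iz, j, k
      real(8) :: rhs(64,64)
      ix = 1 ; iy = 0 ; iz = 0
      rhs = 0.0d0
! each gang gets its own copy of ix, iy, iz, initialized to the host values above
!$acc parallel loop collapse(2) firstprivate(ix,iy,iz) copy(rhs)
      do k = 1, 64
         do j = 1, 64
            rhs(j,k) = real(ix + iy + iz, 8)
         end do
      end do
      print *, rhs(1,1), rhs(64,64)   ! expect 1.0 1.0
      end program fp_sketch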


68, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization

Is RQ a pointer or an assumed-size array? If not, then I’m not sure why this would occur.

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

The error is similar to a seg fault on the CPU where a bad or protected address is being referenced. Exactly why it’s happening, I’m not sure.

It could be caused by ix, iy, and iz being shared, especially if they are being used as indices into arrays.

I have seen this when the total size of a private array (i.e. the size of each array multiplied by the number of copies, one for each gang in this case) grows above 2GB. However, your arrays aren’t that big and you have -mcmodel=medium enabled so 2GB arrays shouldn’t be a problem.

Does Flux use automatic arrays? Automatics are dynamically allocated upon entry into a routine, and by default the heap size on the device is only 8MB, so it can easily run out.

Does the error go away if you comment out the call to Flux?


Are you able to share this code? If so, please send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. It might be easier than trying to debug this via the user forum.

  • Mat

Hello Mat,

thank you for all your explanations.
Here are some elements.

51, Generating copy(iz,iy,ix)

iz, iy, ix are three scalars whose values are 0 or 1 depending on the direction:
X direction, (/ix,iy,iz/) = (/1,0,0/)
Y direction, (/ix,iy,iz/) = (/0,1,0/)
Z direction, (/ix,iy,iz/) = (/0,0,1/)
They are constant during each of these three steps.
Where should I put the FIRSTPRIVATE clause? I can’t put it on a LOOP or a KERNELS directive.


68, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization

RQ is a global shared work array, declared in a module, allocated on the GPU and listed in a PRESENT clause everywhere it is used.
It is used to accumulate the contributions of the fluxes. I do not understand what is wrong with it here.


Does Flux use automatic arrays? Automatics are dynamically allocated upon entry into a routine, and by default the heap size on the device is only 8MB, so it can easily run out.

Yes, Flux uses automatic arrays! In the subroutine compute_flux, we have temporary arrays used in several loops that can be vectorized by the compiler. Each array should be private to the thread that makes the call.
We have (N is the dimension received as an argument):
11 local arrays of dimension (N)
2 local arrays of dimension (N,5)
2 local arrays of dimension (N,6,5)

N is equal to Nx or Ny or Nz.


call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Does the error go away if you comment out the call to Flux?

=> YES
Is it possible to increase the size of the heap on the device?


Sadly, I’m not able to share this code as it is. I’ll try to make a smaller set of routines to send to you.

Regards,
Guy.

Hi Guy,

iz, iy, ix are three scalars whose values are 0 or 1 depending on the direction

Since the values don’t change, you don’t need to worry about them being global. Though you might want to update “Flux” to use the F2003 “value” attribute so the variables are passed by value.
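For example, a sketch of what that might look like (I’m guessing at your Flux argument declarations; also note that the “value” attribute requires an explicit interface, e.g. by making Flux a module procedure):

      MODULE flux_mod
      CONTAINS
      SUBROUTINE Flux (iX, iY, iZ, Imin, Imax, F, Fc, Q1)
!$ACC ROUTINE VECTOR
      INTEGER, VALUE :: iX, iY, iZ, Imin, Imax       ! passed by value, no shared host copies
      REAL(8) :: F(Imin-1:Imax,5), Fc(Imin-3:Imax+3,5), Q1(Imin-3:Imax+3,5)
!     ... flux computation as before ...
      END SUBROUTINE Flux
      END MODULE flux_mod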

RQ is a global shared work array, declared in a module, allocated on the GPU and listed in a PRESENT clause everywhere it is used.

OK, the problem here is that since RQ is a module variable, the compiler must assume RQ may be used within Flux. Even if RQ isn’t actually used within Flux, it could be, so the compiler has to assume that it is. This creates a potential dependency, which is what the compiler is complaining about. The easy fix is to add the “independent” clause.

Is it possible to increase the size of the heap on the device ?

Yes, by calling cudaDeviceSetLimit. I wrote an example to show you how it works. Note that, if possible, you should try to change your automatic arrays to fixed-size arrays: automatics implicitly allocate memory on the device, which can be quite slow.

Here’s an example of setting the heap size when using automatic arrays.

% cat setHeap.F90
module mod_data
   integer, parameter :: dp = selected_real_kind(15, 307)
   real(kind=dp), dimension(:,:), allocatable :: A, B
!$acc declare create(A,B)
end module mod_data

module use_data
   use mod_data
contains
   subroutine fillData (j,val,size)
!$acc routine vector
      integer(8), value :: j, size
      real(kind=dp), value :: val
      integer(8) :: i
      real(kind=dp) :: tmp(size)
!$acc loop vector
       do i=1,size
          tmp(i) = B(j,i) + ((j-1)*size)+(i-1)
       end do

!$acc loop vector
       do i=1,size
          A(j,i) = tmp(i)
       end do
   end subroutine fillData
end module use_data

program example
    use use_data
    use cudafor

    integer(8) :: N1, M1
    integer(8) :: i,j,heapsize
    integer    :: setheap,istat
    N1=32
    print *, "Input the number of elements to create: "
    read(*,*), M1
    print *, "Set the device heap? 1 - yes, 0 - no"
    read(*,*), setheap
    print *, "Heap size needed for automatic array: ", real(M1*N1*dp)/real(1024*1024), "(MB)"
    heapsize = 2*(M1*N1*dp)
    if (setheap > 0) then
      print *, "Setting heapsize=",real(heapsize)/real(1024*1024), "(MB)"
      istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize,heapsize);
    endif
    allocate(A(N1,M1),B(N1,M1))
!$acc kernels
    B=2.5_dp
!$acc end kernels

!$acc parallel loop gang num_gangs(N1)
    do j=1,N1
        call fillData(j,2.5_dp,M1)
    end do
!$acc update self(A)
    print *, A(1,1), A(1,M1)
    deallocate(A,B)

end program example

% pgfortran -fast -Minfo=accel -acc -ta=tesla -Mcuda setHeap.F90 -V16.5 -o setHeap.out
filldata:
     10, Generating Tesla code
         17, !$acc loop vector ! threadidx%x
         22, !$acc loop vector ! threadidx%x
     10, Generating acc routine vector
     17, Loop is parallelizable
     22, Loop is parallelizable
example:
     48, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         48, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             !$acc loop gang, vector(4) ! blockidx%y threadidx%y
     51, Accelerator kernel generated
         Generating Tesla code
         52, !$acc loop gang(32) ! blockidx%x
     55, Generating update self(a(:,:))

Let’s run the program that uses about 6MB for the automatic array. It should run fine given the default is 8MB and we need to leave some room on the heap for other things.

% setHeap.out
 Input the number of elements to create:
25000
 Set the device heap? 1 - yes, 0 - no
0
 Heap size needed for automatic array:     6.103516     (MB)
    2.500000000000000         25001.50000000000

Bumping up the size to use over 6.5MB causes the program to crash.

% setHeap.out
 Input the number of elements to create:
27000
 Set the device heap? 1 - yes, 0 - no
0
 Heap size needed for automatic array:     6.591797     (MB)
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Now let’s set a heap value. Note that I just doubled the minimum amount needed but you’ll want to play with what values work best for you.

% setHeap.out
 Input the number of elements to create:
27000
 Set the device heap? 1 - yes, 0 - no
1
 Heap size needed for automatic array:     6.591797     (MB)
 Setting heapsize=    13.18359     (MB)
    2.500000000000000         27001.50000000000

How big can we go? I’m not entirely sure what the upper limit is, but ~4GB of heap is the largest I could go on my K80 with 12GB of memory.

% setHeap.out
 Input the number of elements to create:
10000000
 Set the device heap? 1 - yes, 0 - no
1
 Heap size needed for automatic array:     2441.406     (MB)
 Setting heapsize=    4882.813     (MB)
    2.500000000000000         9999999.002604166

Give this a try. Hopefully it fixes the issue.

  • Mat

Hello Mat,

Thank you very much for your example and the explanations. It’s very useful for me (at least).

There remain a few things I would like to submit to you.

1.
I used your example and now I can increase the size of the mesh until I fill the memory of the GPU. I checked with a call to cudaDeviceGetLimit that the heap size is really bigger.
As it is an MPI application, I want to use the two Tesla cards of the server my code is running on, i.e. two MPI processes, one per GPU.
Quite often, I encounter the following error message:

call to cuStreamSynchronize returned error 702: Launch timeout
call to cuMemFreeHost returned error 702: Launch timeout

What is this?


2.
The second point is about the optimization of the main part of the code. You suggested putting KERNELS instead of PARALLEL. I tried it but it does not work. I get the same error as before, even with the “small” mesh, even with an 8GB heap, even with only one MPI process on one card.

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
3.

Note that if possible, try to change your automatic arrays to fixed size arrays. Automatics implicitly allocate memory on the device which can be quite slow.

The size of the automatic arrays is known: Nmax = MAX(Nx,Ny,Nz). So how should I declare these “no longer automatic” arrays as private work arrays for the compute_flux routine? I am thinking of declaring them in the calling routine and putting them in a PRIVATE clause, like the others (F, Fc, Fw, Q1). Is that right?

Regards,
Guy.

For #1, are you using Windows? The only time I’ve seen launch timeouts is on Windows when using the WDDM driver. WDDM has a watchdog timer meant to prevent your monitor from freezing. Since you’re using Tesla cards, make sure that you have the TCC driver installed instead of WDDM (see: http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/#driver-model).

If you’re on Linux, then I’m not sure.

For #2, Sorry, I’m not sure. Do you see any differences in the compiler feedback messages (-Minfo=accel) between the two?

For #3, is Nmax a parameter or is it being set at runtime? If Nmax’s value is constant and known at compile time (i.e. a parameter), then the arrays are fixed-size, not automatic.
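For example, a sketch (with made-up names) assuming the maximum dimension can be fixed at build time; the work arrays then have compile-time bounds, so no device-side allocation happens on entry:

      module mesh_limits
         integer, parameter :: NMAXF = 1024          ! known at compile time
      end module mesh_limits

      subroutine compute_flux_sketch (N, Q1, F)
!$acc routine vector
      use mesh_limits
      integer, value :: N
      real(8) :: Q1(NMAXF,5), F(NMAXF,5)
      real(8) :: work1(NMAXF)                        ! fixed size: no heap allocation
      real(8) :: work2(NMAXF,5), work3(NMAXF,6,5)
!     ... loops run over 1..N as before ...
      end subroutine compute_flux_sketch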

  • Mat

Hi Mat,

#1.
I’m using Linux

uname -a
Linux igloo-gpu3 3.0.101-0.47.71-default #1 SMP Thu Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux

#2
With KERNELS, I get the following feedback:

     43, Generating present(q(:,:,:,:),qp(:,:,:,:),rq(:,:,:,:),xa(:),yb(:),zc(:),tv_i(:),sensor(:,:,:))
     51, Generating copy(iz,iy,ix)
     58, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         58, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     70, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         70, !$acc loop gang collapse(2) ! blockidx%x
         71,   ! blockidx%x collapsed
         74, !$acc loop vector(128) collapse(2) ! threadidx%x
         75,   ! threadidx%x collapsed
         96, !$acc loop vector(128) collapse(2) ! threadidx%x
         97,   ! threadidx%x collapsed
        106, !$acc loop vector(128) collapse(2) ! threadidx%x
        107,   ! threadidx%x collapsed
        117, !$acc loop vector(128) collapse(2) ! threadidx%x
        118,   ! threadidx%x collapsed
     74, Loop is parallelizable
     75, Loop is parallelizable
     96, Loop is parallelizable
     97, Loop is parallelizable
    106, Loop is parallelizable
    107, Loop is parallelizable
    117, Loop is parallelizable
    118, Loop is parallelizable

With PARALLEL, it’s:

    59, Generating present(q(:,:,:,:),qp(:,:,:,:),rq(:,:,:,:),xa(:),yb(:),zc(:),tv_i(:),sensor(:,:,:))
     68, Accelerator kernel generated
         Generating Tesla code
         82, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
        111, !$acc loop gang collapse(2) ! blockidx%x
        112,   ! blockidx%x collapsed
        117, !$acc loop vector(32) collapse(2) ! threadidx%x
        118,   ! threadidx%x collapsed
         143, !$acc loop vector(32) collapse(2) ! threadidx%x
        144,   ! threadidx%x collapsed
        157, !$acc loop vector(32) collapse(2) ! threadidx%x
        158,   ! threadidx%x collapsed
        179, !$acc loop vector(32) collapse(2) ! threadidx%x
        180,   ! threadidx%x collapsed
    117, Loop is parallelizable
    118, Loop is parallelizable
    143, Loop is parallelizable
    144, Loop is parallelizable
    157, Loop is parallelizable
    158, Loop is parallelizable
    179, Loop is parallelizable
    180, Loop is parallelizable

The line numbers do not correspond between the two versions, and one version has more messages than the other, but the same loops appear in both.


Regards,

Guy.

For the parallel version the vector length is 32, but it is 128 for the kernels version. One possibility is that the Flux routine uses 32, and it’s this discrepancy that’s causing the issue. Can you try explicitly setting the vector length, “vector(32)”, on the kernels version to see if this works around the issue?
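Here’s a small self-contained sketch (made-up names) of what I mean: force “vector(32)” on the caller’s inner loops inside a kernels region so they match the vector length used by a “routine vector” callee:

      module callee
      contains
         subroutine scale_row (n, x)
!$acc routine vector
         integer, value :: n
         real(8) :: x(n)
         integer :: i
!$acc loop vector
         do i = 1, n
            x(i) = 2.0d0 * x(i)
         end do
         end subroutine scale_row
      end module callee

      program match_vlen
      use callee
      implicit none
      integer, parameter :: nj = 64, ni = 128
      real(8) :: a(ni,nj)
      integer :: i, j
      a = 1.0d0
!$acc kernels copy(a)
!$acc loop gang independent
      do j = 1, nj
!$acc loop vector(32) independent
         do i = 1, ni                  ! explicit 32 to match the callee's vector width
            a(i,j) = a(i,j) + 1.0d0
         end do
         call scale_row (ni, a(:,j))   ! "routine vector" call made from the gang loop
      end do
!$acc end kernels
      print *, a(1,1), a(ni,nj)        ! expect 4.0 everywhere
      end program match_vlen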

Hello Mat

Yes, it helps with this one…
Thanks.