Need advice for optimizing the heart of a CFD code

Hello,

I have a 3D CFD application (finite-volume scheme) written in Fortran, which already has two levels of parallelism: MPI and OpenMP. I am trying to add another level with OpenACC directives on a Tesla K40 GPU, for comparison with OpenMP.
The heart of the application (the largest CPU-time consumer, nearly 60%) is the computation of the fluxes in the 3 directions.
For each direction, we have two nested loops (over the two other directions), with a vector extraction and the computation of the fluxes for all points of this vector. So this construction appears three times.
The computation of the fluxes is done by another subroutine, called inside these two nested loops. This subroutine uses temporary variables, scalars and vectors, that should be private for each call.
Simply, for the first direction, we have

``````
DO k = 1, Nz
   DO j = 1, Ny
!
      DO L = 1, 5    ! Five unknowns per mesh cell
         DO i = 1, Nx
            Q1(i,L) = Q(i,L,j,k)
         END DO
      END DO
!
      CALL compute_flux (Nx, Q1, F)
!
      DO L = 1, 5
         DO i = 1, Nx
            RQ(i,L,j,k) = linear function of ( F(i,L) )
         END DO
      END DO
!
   END DO
END DO
``````

In the subroutine compute_flux, we have temporary arrays used in several loops that can be vectorized by the compiler. Each array should be private to the thread that makes the call.
We have (N is the dimension received as an argument):
11 local arrays of dimension (N)
2 local arrays of dimension (N,5)
2 local arrays of dimension (N,6,5)

In the calling subroutine, I chose to parallelize the two nested loops:

``````
DO k = 1, Nz
   DO j = 1, Ny
``````

In this case, I think I can use the following OpenACC directives:

``````
!$ACC ROUTINE(compute_flux) VECTOR
!$ACC DATA PRESENT(Q,RQ,...)
!$ACC PARALLEL PRIVATE(Q1,F,...)
!$ACC LOOP COLLAPSE(2) PRIVATE(i,L)
DO k = 1, Nz
   DO j = 1, Ny
!
      DO L = 1, 5
         DO i = 1, Nx
            Q1(i,L) = Q(i,L,j,k)
         END DO
      END DO
!
      CALL compute_flux (Nx, Q1, F)
!
      DO L = 1, 5
         DO i = 1, Nx
            RQ(i,L,j,k) = linear function of ( F(i,L) )
         END DO
      END DO
!
   END DO
END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) PRIVATE(j,L)
DO k = 1, Nz
   DO i = 1, Nx

   END DO
END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) PRIVATE(k,L)
DO j = 1, Ny
   DO i = 1, Nx

   END DO
END DO
!$ACC END LOOP
!$ACC END PARALLEL
!$ACC END DATA
``````

In the subroutine compute_flux, one has :

``````
!$ACC ROUTINE VECTOR
``````

and directives like

``````
!$ACC LOOP VECTOR
``````

for the different loops in it.

In compute_flux, there are several loops. Local arrays computed in one loop are reused in the following loops, so they have to be preserved during the execution of this routine.
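Schematically (T1 and T2 below are placeholder names, not the real ones), compute_flux looks like:

```fortran
      SUBROUTINE compute_flux (N, Q1, F)
!$ACC ROUTINE VECTOR
      INTEGER  :: N, i
      REAL(rp) :: Q1(N,5), F(N,5)
      REAL(rp) :: T1(N), T2(N)   ! placeholders for the local automatic arrays
!
!     A first vectorizable loop fills a temporary array...
      DO i = 1, N
         T1(i) = ...
      END DO
!
!     ...and later loops reuse it, so T1 must keep its values
!     for the whole duration of the call.
      DO i = 1, N
         F(i,1) = ... T1(i) ...
      END DO
      END SUBROUTINE compute_flux
```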

My problem is that I do not know how to make the local arrays of this subroutine private (to the thread making the call to compute_flux) while staying efficient.

I tried adding clauses for the number of gangs and the number of workers.
I can go up to 64 gangs / 1 worker; beyond that I get an internal error from CUDA. In the best case, with 64 gangs, the code runs twice as slow as 1 MPI process alone.
If I use more than one worker, I get an internal CUDA error. I think it comes from the local arrays being shared.

I want to increase my mesh size from (100x150x100) to (1000x300x26) and beyond. At that size I can’t go further than 4 gangs / 1 worker (CUDA internal error), and the computing time is very bad compared with the one obtained using the 20 CPU cores of the running node.

Regards,
Guy.

Hi Guy,

I can’t speak to the CUDA error you’re getting without more details, but I do think we can improve how you’re using the directives. Hopefully this will resolve the errors and the performance issues.

First, I wouldn’t use “parallel” here, or at least I wouldn’t make one large parallel region that spans multiple loops. “parallel” creates a single CUDA kernel, which will be slower than having a kernel for each of the nested loops.

Second, I would privatize “Q1” and “F” for each nested loop. The reason to put them in a “private” clause at the “parallel” level would be if each gang’s private data needed to be propagated from one loop nest to the next, but I don’t think that’s the case here. Also, there’s no need to privatize “i” and “L”, since in OpenACC scalars are private by default.

Third, I’d interchange the “L” and “i” loops, then make the “i” loop a “vector”.

Try something like the following:

``````
!$ACC ROUTINE(compute_flux) VECTOR
!$ACC DATA PRESENT(Q,RQ,...)
!$ACC KERNELS
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,...)
DO k = 1, Nz
   DO j = 1, Ny
!
!$ACC LOOP VECTOR
      DO i = 1, Nx
         DO L = 1, 5
            Q1(i,L) = Q(i,L,j,k)
         END DO
      END DO
!
      CALL compute_flux (Nx, Q1, F)
!
!$ACC LOOP VECTOR
      DO i = 1, Nx
         DO L = 1, 5
            RQ(i,L,j,k) = linear function of ( F(i,L) )
         END DO
      END DO
!
   END DO
END DO
...
...
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,...)
DO k = 1, Nz
   DO i = 1, Nx
      ... inner vector loops
   END DO
END DO
!$ACC END LOOP
...
...
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,...)
DO j = 1, Ny
   DO i = 1, Nx
      ... inner vector loops
   END DO
END DO
!$ACC END KERNELS
!$ACC END DATA
``````

Be sure to compare the compiler feedback messages (-Minfo=accel) between your original and this version of the code. The messages are often very informative as to what’s happening. Also, setting “PGI_ACC_TIME=1” in your environment will give you some quick profiling information.

Hope this helps,
Mat

Hello Mat,

I’ll try what you propose, even if interchanging the “L” and “i” loops sounds strange to me.
I’ll be back ASAP.

Regards,

Hello,

At first, the compiler reported:

``````
68, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization
69, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization
68, !$acc loop seq collapse(2)
69,   collapsed
72, !$acc loop seq
74, !$acc loop vector(32) ! threadidx%x
95, !$acc loop seq
97, !$acc loop vector(32) ! threadidx%x
106, !$acc loop seq
108, !$acc loop vector(32) ! threadidx%x
118, !$acc loop seq
120, !$acc loop vector(32) ! threadidx%x
72, Parallelization would require privatization of array q1(:,:)
``````

I put INDEPENDENT clauses everywhere I could, and now it looks better: all loops are parallelized.

I ran the code, but it fails immediately in the first large parallel loop of the routine:

``````call to cuMemFreeHost returned error 700: Illegal address during kernel execution
``````

If I set the number of gangs, for example to only 1, the code behaves better, but it crashes the second time it enters this routine, still in the first large parallel loop.

``````call to cuMemFreeHost returned error 700: Illegal address during kernel execution
``````

If I try other values like 2 or 4, it crashes the first time.

Here is the first part (the X direction) of the routine:

``````
 42
 43 !$ACC DATA PRESENT (Q, QP, RQ, Xa, Yb, Zc, TV_i, sensor)
 44
 45 !
 46       IF (iInvF == 4) THEN
 47          CALL Shock_Sensor
 48       END IF
 49 !
 50
 51 !$ACC KERNELS
 52
 53
 54 !*******************************************
 55 !****  Computation of inviscid X-terms  ****
 56 !*******************************************
 57 !$ACC LOOP VECTOR INDEPENDENT
 58       DO i = Imin, Imax
 59          TV_i(i) = one / Xa(i)
 60       END DO
 61 !$ACC END LOOP
 62 !
 63       sq = one / Da
 64 !
 65       iX = 1
 66       iY = 0
 67       iZ = 0
 68 !
 69 !$ACC LOOP COLLAPSE(2) GANG(4) PRIVATE(F,Fc,Fw,Q1) INDEPENDENT
 70       DO k = Kmin, Kmax
 71          DO j = Jmin, Jmax
 72 !
 73 !$ACC LOOP VECTOR INDEPENDENT
 74             DO i = Imin-3, Imax+3
 75 !$ACC LOOP INDEPENDENT
 76                DO L = 1, 5
 77                   Q1(i,L) = Q(i,L,j,k)
 78                END DO
 79             END DO
 80 !$ACC END LOOP
 81
 82 !C** Computation of ENO fluxes
 83             SELECT CASE (iInvF)
 84             CASE (1)
 85                CALL Flux         (iX, iY, iZ, Imin, Imax, F, Fc, Q1)
 86 !
 87             CASE (2)
 88                CALL Flux_split   (iX, iY, iZ, Imin, Imax, F, Q1)
 89 !
 90             CASE (3)
 91                CALL Flux_split_6 (iX, iY, iZ, Imin, Imax, F, Q1)
 92 !
 93             CASE (4)
 94                CALL Flux         (iX, iY, iZ, Imin, Imax, F, Fc, Q1)
 95 !
 96 !$ACC LOOP VECTOR INDEPENDENT
 97                DO i = Imin-1, Imax
 98 !$ACC LOOP INDEPENDENT
 99                   DO L = 1, 5
100                      Fw(i,L) = F(i,L)
101                   END DO
102                END DO
103 !$ACC END LOOP
104 !
105                CALL Flux_split_6 (iX, iY, iZ, Imin, Imax, F, Q1)
106 !
107 !$ACC LOOP VECTOR INDEPENDENT
108                DO i = Imin-1, Imax
109 !$ACC LOOP INDEPENDENT
110                   DO L = 1, 5
111                      F(i,L) = Fw(i,L) * REAL(sensor(i,j,k),rp) + F(i,L) * REAL(1-sensor(i,j,k),rp)
112                   END DO
113                END DO
114 !$ACC END LOOP
115 !
116             END SELECT
117
118 !C** Computation of R.H.S.
119 !$ACC LOOP VECTOR INDEPENDENT
120             DO i = Imin, Imax
121 !$ACC LOOP INDEPENDENT
122                DO L = 1, 5
123                   RQ(i,L,j,k) = RQ(i,L,j,k) + (F(i-1,L) - F(i,L) ) * sq * TV_i(i)
124                END DO
125             END DO
126 !$ACC END LOOP
127 !
128          END DO
129       END DO
130 !$ACC END LOOP
``````

The corresponding output of the compiler is :

``````pgf90  -O3 -Mvect -Munroll -Mipa=inline -Mextend -mcmodel=medium -g -traceback -Mpreprocess -acc -Minfo -Minfo=accel -c ./SRC/invrhs.f
43, Generating present(q(:,:,:,:),qp(:,:,:,:),rq(:,:,:,:),xa(:),yb(:),zc(:),tv_i(:),sensor(:,:,:))
51, Generating copy(iz,iy,ix)
58, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
58, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
58, Generated 4 alternate versions of the loop
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
70, Loop is parallelizable
71, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
70, !$acc loop gang(4) collapse(2) ! blockidx%x
71,   ! blockidx%x collapsed
74, !$acc loop vector(128) ! threadidx%x
76, !$acc loop seq
97, !$acc loop vector(128) ! threadidx%x
99, !$acc loop seq
108, !$acc loop vector(128) ! threadidx%x
110, !$acc loop seq
120, !$acc loop vector(128) ! threadidx%x
122, !$acc loop seq
71, Loop not vectorized/parallelized: contains call
74, Loop is parallelizable
Loop interchange produces reordered loop nest: 76,74
Generated 4 alternate versions of the loop
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
76, Loop is parallelizable
Outer loop unrolled 5 times (completely unrolled)
97, Loop is parallelizable
Loop interchange produces reordered loop nest: 99,97
Generated 4 alternate versions of the loop
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
99, Loop is parallelizable
Outer loop unrolled 5 times (completely unrolled)
108, Loop is parallelizable
Loop interchange produces reordered loop nest: 110,108
Generated 3 alternate versions of the loop
Generated vector sse code for the loop
Generated 4 prefetch instructions for the loop
110, Loop is parallelizable
Outer loop unrolled 5 times (completely unrolled)
120, Loop is parallelizable
Loop interchange produces reordered loop nest: 122,120
Generated 3 alternate versions of the loop
Generated vector sse code for the loop
Generated 3 prefetch instructions for the loop
122, Loop is parallelizable
Outer loop unrolled 5 times (completely unrolled)
``````

The private arrays are declared as local, automatic arrays, like:

`````` 17       REAL(rp),  DIMENSION( 0:((Nmax-0+1)/8+1)*8-1,5)       ::  F
18       REAL(rp),  DIMENSION( 0:((Nmax-0+1)/8+1)*8-1,5)       ::  Fw
19       REAL(rp),  DIMENSION(-2:((Nmax+3-(-2)+1)/8+1)*8-3,5)  ::  Fc
20       REAL(rp),  DIMENSION(-2:((Nmax+3-(-2)+1)/8+1)*8-3,5)  ::  Q1
``````

Nmax = MAX(Nx,Ny,Nz)

My test case sets Nx = 100 ; Ny = 150 ; Nz = 100 and iInvF = 1.

Is there something I’ve missed?

Guy.

Hi Guy,

51, Generating copy(iz,iy,ix)

This is a potential issue. One of the few cases where the compiler doesn’t automatically privatize scalars is when they are passed by reference to a device routine, as is the case here. Hence, these variables are global and shared across all threads.

Do any of your flux routines modify these variables? If so, you may have a dependency, given that you initialize these variables before the k loop: the values of iz, iy, and ix would depend on what they were set to in previous iterations of the k and j loops.

If no dependency exists, then you can put these variables into a “firstprivate” clause so that each thread has its own copy.
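For example, with the “parallel” construct it would look something like this (a sketch; “firstprivate” is accepted on “parallel” but not on “kernels” or “loop”):

```fortran
!$ACC PARALLEL FIRSTPRIVATE(ix,iy,iz) PRESENT(Q,RQ)
!$ACC LOOP COLLAPSE(2) GANG
DO k = Kmin, Kmax
   DO j = Jmin, Jmax
      ...        ! each gang now has its own initialized copy of ix, iy, iz
   END DO
END DO
!$ACC END PARALLEL
```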

68, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization

Is RQ a pointer or an assumed-size array? If not, then I’m not sure why this would occur.

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

The error is similar to a seg fault on the CPU where a bad or protected address is being referenced. Exactly why it’s happening, I’m not sure.

It could be caused by ix, iy, and iz being shared, especially if they are being used as indices into arrays.

I have seen this when the total size of a private array (i.e. the size of each array multiplied by the number of copies, one for each gang in this case) grows above 2GB. However, your arrays aren’t that big and you have -mcmodel=medium enabled so 2GB arrays shouldn’t be a problem.

Does Flux use automatic arrays? Automatics are dynamically allocated upon entry into a routine, and by default the heap size on the device is only 8MB, so it can easily run out.

Does the error go away if you comment out the call to Flux?

Are you able to share this code? If so, please send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. It might be easier than trying to debug this via the user forum.

• Mat

Hello Mat,

thank you for all your explanations.
Here are some elements.

51, Generating copy(iz,iy,ix)

iz, iy, ix are three scalars whose values are 0 or 1 depending on the direction:
X direction, (/ix,iy,iz/) = (/1,0,0/)
Y direction, (/ix,iy,iz/) = (/0,1,0/)
Z direction, (/ix,iy,iz/) = (/0,0,1/)
They are constant within each of these three steps.
Where should I put the FIRSTPRIVATE clause? I can’t put it on a LOOP or a KERNELS directive.

68, Loop carried dependence due to exposed use of rq(:,1:5,:,:) prevents parallelization

RQ is a global shared work array, declared in a module, allocated on the GPU, and put in a PRESENT clause everywhere it is used.
It is used to sum the contributions of the fluxes. I do not understand what is wrong with it here.

Does Flux use automatic arrays? Automatics are dynamically allocated upon entry into a routine. However by default, the heap size on the device is only 8MB so can easily run out.

Yes, Flux uses automatic arrays ! In the subroutine compute_flux, we have temporary arrays used in several loops that can be vectorized by the compiler. Each array should be private for the thread that makes the call.
We have (N is the dimension received by argument)
11 local arrays of dimension (N)
2 local arrays of dimension (N,5)
2 local arrays of dimension (N,6,5)

N is equal to Nx or Ny or Nz.

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Does the error go away if you comment out the call to Flux?

=> YES
Is it possible to increase the size of the heap on the device?

Sadly, I’m not able to share this code as it is. I’ll try to make a smaller set of routines to send to you.

Regards,
Guy.

Hi Guy,

iz, iy,ix are three scalar whose values are 0 or 1 depending of the direction

Since the values don’t change, you don’t need to worry about them being global. Though you might want to update “Flux” to use the F2003 “value” attribute so the variables are passed by value.
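For example, a sketch of the Flux interface (assuming the direction flags are its first dummy arguments):

```fortran
      SUBROUTINE Flux (iX, iY, iZ, Imin, Imax, F, Fc, Q1)
!$ACC ROUTINE VECTOR
!     With the VALUE attribute the scalars are passed by value, so the
!     compiler no longer has to treat them as shared globals.
      INTEGER, VALUE :: iX, iY, iZ
      INTEGER        :: Imin, Imax
      ...
      END SUBROUTINE Flux
```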

RQ is a global shared work array, declared in a module, allocated on the GPU and in PRESENT clause everywhere it is used.

Ok, the problem here is that since RQ is a module variable, the compiler must assume RQ may be used within Flux. Even if RQ isn’t actually used within Flux, it’s possible, so the compiler must assume that it is. This creates a potential dependency, which is what the compiler is complaining about. The easy fix is to add the “independent” clause.

Is it possible to increase the size of the heap on the device ?

Yes, by calling cudaDeviceSetLimit. I wrote an example to show you how it works. Note that, if possible, you should try to change your automatic arrays to fixed-size arrays. Automatics implicitly allocate memory on the device, which can be quite slow.

Here’s an example of setting the heap size when using automatic arrays.

``````
% cat setHeap.F90
module mod_data
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(kind=dp), dimension(:,:), allocatable :: A, B
!$acc declare create(A,B)
end module mod_data

module use_data
  use mod_data
contains
  subroutine fillData (j,val,size)
!$acc routine vector
    integer(8), value :: j, size
    real(kind=dp), value :: val
    integer(8) :: i
    real(kind=dp) :: tmp(size)   ! automatic array, allocated from the device heap
!$acc loop vector
    do i=1,size
      tmp(i) = B(j,i) + ((j-1)*size)+(i-1)
    end do

!$acc loop vector
    do i=1,size
      A(j,i) = tmp(i)
    end do
  end subroutine fillData
end module use_data

program example
  use use_data
  use cudafor

  integer(8) :: N1, M1
  integer(8) :: i,j,heapsize
  integer    :: setheap,istat
  N1=32
  print *, "Input the number of elements to create: "
  read (*,*) M1
  print *, "Set the device heap? 1 - yes, 0 - no"
  read (*,*) setheap
  print *, "Heap size needed for automatic array: ", real(M1*N1*dp)/real(1024*1024), "(MB)"
  heapsize = 2*(M1*N1*dp)
  if (setheap > 0) then
    print *, "Setting heapsize=",real(heapsize)/real(1024*1024), "(MB)"
    istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize)
  endif
  allocate(A(N1,M1),B(N1,M1))
!$acc kernels
  B=2.5_dp
!$acc end kernels

!$acc parallel loop gang num_gangs(N1)
  do j=1,N1
    call fillData(j,2.5_dp,M1)
  end do
!$acc update self(A)
  print *, A(1,1), A(1,M1)
  deallocate(A,B)

end program example

% pgfortran -fast -Minfo=accel -acc -ta=tesla -Mcuda setHeap.F90 -V16.5 -o setHeap.out
filldata:
10, Generating Tesla code
17, !$acc loop vector ! threadidx%x
22, !$acc loop vector ! threadidx%x
10, Generating acc routine vector
17, Loop is parallelizable
22, Loop is parallelizable
example:
48, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
48, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
!$acc loop gang, vector(4) ! blockidx%y threadidx%y
51, Accelerator kernel generated
Generating Tesla code
52, !$acc loop gang(32) ! blockidx%x
55, Generating update self(a(:,:))
``````

Let’s run the program with a size that uses about 6MB for the automatic array. It should run fine, given that the default heap is 8MB and we need to leave some room on it for other things.

``````% setHeap.out
Input the number of elements to create:
25000
Set the device heap? 1 - yes, 0 - no
0
Heap size needed for automatic array:     6.103516     (MB)
2.500000000000000         25001.50000000000
``````

Bumping the size up to over 6.5MB causes the program to crash.

``````% setHeap.out
Input the number of elements to create:
27000
Set the device heap? 1 - yes, 0 - no
0
Heap size needed for automatic array:     6.591797     (MB)
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
``````

Now let’s set a heap value. Note that I just doubled the minimum amount needed, but you’ll want to experiment with which values work best for you.

``````% setHeap.out
Input the number of elements to create:
27000
Set the device heap? 1 - yes, 0 - no
1
Heap size needed for automatic array:     6.591797     (MB)
Setting heapsize=    13.18359     (MB)
2.500000000000000         27001.50000000000
``````

How big can we go? I’m not entirely sure what the upper limit is, but ~4GB of heap is the largest I could set on my K80 with 12GB of memory.

``````% setHeap.out
Input the number of elements to create:
10000000
Set the device heap? 1 - yes, 0 - no
1
Heap size needed for automatic array:     2441.406     (MB)
Setting heapsize=    4882.813     (MB)
2.500000000000000         9999999.002604166
``````

Give this a try. Hopefully it fixes the issue.

• Mat

Hello Mat,

Thank you very much for your example and the explanations. It’s very useful for me (at least).

There remain a few things I wish to submit to you.

I used your example, and now I can increase the size of the mesh and fill the memory of the GPU. I checked with a call to cudaDeviceGetLimit that the heap size is really bigger.
As it is an MPI application, I want to use the two Tesla cards of the server my code runs on, i.e. two MPI processes, one per GPU.
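For reference, I bind each MPI process to one GPU at start-up with something like this sketch (the subroutine name is mine; I assume 0-based OpenACC device numbering):

```fortran
      SUBROUTINE bind_rank_to_gpu
      USE mpi
      USE openacc
      INTEGER :: rank, ierr, ngpus
!     Map each MPI rank to one of the GPUs of the node, round-robin.
      CALL MPI_Comm_rank (MPI_COMM_WORLD, rank, ierr)
      ngpus = acc_get_num_devices (acc_device_nvidia)
      CALL acc_set_device_num (MOD(rank, ngpus), acc_device_nvidia)
      END SUBROUTINE bind_rank_to_gpu
```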
Quite often, I encounter the following error message:

``````call to cuStreamSynchronize returned error 702: Launch timeout
call to cuMemFreeHost returned error 702: Launch timeout
``````

What is this?

2.
The second thing is about the optimization of the main part of the code. You suggested putting KERNELS instead of PARALLEL. I tried it, but it does not work. I get the same error as before, even with the “small” mesh, even with an 8GB heap, even with only one MPI process on one card.

``````call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
``````
3.

Note that if possible, try to change your automatic arrays to fixed size arrays. Automatics implicitly allocate memory on the device which can be quite slow.

The size of the automatic arrays is known: Nmax = MAX(Nx,Ny,Nz). So how should I declare these no-longer-automatic arrays as private work arrays for the compute_flux routine? I am thinking of declaring these arrays in the calling routine and putting them in a PRIVATE clause, like the others (F,Fc,Fw,Q1). Right?
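To make the question concrete, here is what I have in mind (W1 and W2 are hypothetical names standing for the former local arrays, which compute_flux would now receive as arguments):

```fortran
!     Work arrays declared in the caller, sized with Nmax, and
!     privatized per gang instead of being automatic in compute_flux.
      REAL(rp), DIMENSION(0:Nmax,5) :: F, Fw
      REAL(rp), DIMENSION(0:Nmax)   :: W1, W2
!$ACC LOOP COLLAPSE(2) GANG PRIVATE(Q1,F,Fw,W1,W2) INDEPENDENT
      DO k = Kmin, Kmax
         DO j = Jmin, Jmax
            ...
            CALL compute_flux (Nx, Q1, F, W1, W2)
            ...
         END DO
      END DO
```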

Regards,
Guy.

For #1, are you using Windows? The only time I’ve seen launch timeouts is on Windows with the WDDM driver. WDDM has a watchdog timer meant to prevent your monitor from freezing. Since you’re using Tesla cards, make sure that you have the TCC driver installed instead of WDDM (see: http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/#driver-model).

If you’re on Linux, then I’m not sure.

For #2, sorry, I’m not sure. Do you see any differences in the compiler feedback messages (-Minfo=accel) between the two versions?

For #3, is Nmax a parameter, or is it set at runtime? If Nmax’s value is constant and known at compile time (i.e. a parameter), then the arrays are fixed-size, not automatic.

• Mat

Hi Mat,

#1.
I’m using Linux

``````uname -a
Linux igloo-gpu3 3.0.101-0.47.71-default #1 SMP Thu Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux
``````

#2
With KERNELS, I get the following feedback :

``````     43, Generating present(q(:,:,:,:),qp(:,:,:,:),rq(:,:,:,:),xa(:),yb(:),zc(:),tv_i(:),sensor(:,:,:))
51, Generating copy(iz,iy,ix)
58, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
58, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
70, Loop is parallelizable
71, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
70, !$acc loop gang collapse(2) ! blockidx%x
71,   ! blockidx%x collapsed
74, !$acc loop vector(128) collapse(2) ! threadidx%x
96, !$acc loop vector(128) collapse(2) ! threadidx%x
106, !$acc loop vector(128) collapse(2) ! threadidx%x
117, !$acc loop vector(128) collapse(2) ! threadidx%x
74, Loop is parallelizable
75, Loop is parallelizable
96, Loop is parallelizable
97, Loop is parallelizable
106, Loop is parallelizable
107, Loop is parallelizable
117, Loop is parallelizable
118, Loop is parallelizable
``````

With PARALLEL, it’s

``````    59, Generating present(q(:,:,:,:),qp(:,:,:,:),rq(:,:,:,:),xa(:),yb(:),zc(:),tv_i(:),sensor(:,:,:))
68, Accelerator kernel generated
Generating Tesla code
82, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
111, !$acc loop gang collapse(2) ! blockidx%x
112,   ! blockidx%x collapsed
117, !$acc loop vector(32) collapse(2) ! threadidx%x
143, !$acc loop vector(32) collapse(2) ! threadidx%x
157, !$acc loop vector(32) collapse(2) ! threadidx%x
179, !$acc loop vector(32) collapse(2) ! threadidx%x
117, Loop is parallelizable
118, Loop is parallelizable
143, Loop is parallelizable
144, Loop is parallelizable
157, Loop is parallelizable
158, Loop is parallelizable
179, Loop is parallelizable
180, Loop is parallelizable
``````

The line numbers do not correspond, and one version has more feedback messages than the other, but the same loops appear in both versions.

Regards,

Guy.

For the parallel version the vector length is 32, but it is 128 for the kernels version. One possibility is that the Flux routine is compiled with a vector length of 32, and it’s this discrepancy that’s causing the issue. Can you try explicitly setting the vector length, “vector(32)”, for the kernels version to see if that works around the issue?

Hello Mat

Yes, it helps with this one … thanks.