# OpenACC directives and GPU

Hi all,
how can I make this portion of the code run on GPU using OpenACC?
I also can’t figure out when to make variables (or matrices) private. What’s a good resource to read up on that topic? I have a large Fortran code that I need to port to GPU.

``````
do j=2,nym1
jm1=j-1
jp1=j+1
do i=2,nxm1
im1=i-1
ip1=i+1
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
akd=akd+dabs(aj(im1)*p(im1,j)+bj(im1)*p(i,j) + cj(im1)*p(ip1,j)-dj(im1))
akn=akn+dabs(bj(im1)*p(i,j))
enddo
enddo
``````

Hi Dolf,

Why are you using arrays to hold intermediate values in the inner “i” loop? While I may be missing something, it seems like aj, bj, cj, and dj could be scalars, which would let you parallelize both the “j” and “i” loops.

If these variables really do need to be arrays, then you will need to either privatize the arrays or interchange the “i” and “j” loops, parallelizing “i” and running “j” sequentially.

Option 1:

``````
!$acc parallel loop
do j=2,nym1
jm1=j-1
jp1=j+1
!$acc loop vector private(aj,bj,cj,dj)
do i=2,nxm1
im1=i-1
ip1=i+1
``````

Option 2:

``````
!$acc parallel loop gang vector
do i=2,nxm1
im1=i-1
ip1=i+1
!$acc loop seq
do j=2,nym1
jm1=j-1
jp1=j+1
``````

By default in OpenACC, scalars are private and arrays are shared. In cases such as your code, where arrays are reused, having them shared across all threads would cause a race condition. Hence the user must privatize the array. The array is private to the loop, so a “private” clause on a “vector” loop means every thread gets its own copy of the array. A “private” on a “worker” loop means the array is private to each worker but shared between the vectors within that worker. A “private” on a “gang” loop means each gang gets its own copy, which is shared amongst all the vectors within that gang.

Since privatized arrays are placed in global device memory and may lead to poor memory access patterns, it’s advisable to avoid using “private” if the algorithm can be rewritten.

• Mat

hi Mat,

I used both methods, but I keep getting akd/akn equal to NaN. Any ideas why this is happening?
btw akd is accumulated over the loops (summation). how can we do it in parallel?
also, when do I need to make variables/matrices private on a parallel region?

Regards,
Dolf

``````
!$acc update device(akd,akn)
!$acc parallel loop
do i=2,nxm1
im1=i-1
ip1=i+1
!$acc loop vector
do j=2,nym1
jm1=j-1
jp1=j+1

aj1=aw(im1,jm1)
bj1=ap(im1,jm1)
cj1=ae(im1,jm1)
dj1=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
if(i.eq.2) then
dj1=dj1-aw(im1,jm1)*p(im1,j)
aj1=0.d0
else if (i.eq.nxm1) then
dj1=dj1-ae(im1,jm1)*p(ip1,j)
cj1=0.d0
endif
akd=akd+dabs(aj1*p(im1,j)+bj1*p(i,j) + cj1*p(ip1,j)-dj1)
akn=akn+dabs(bj1*p(i,j))
enddo
enddo
``````

btw akd is accumulated over the loops (summation). how can we do it in parallel?

Put “akd” and “akn” into a reduction clause:

``````
!$acc parallel loop reduction(+:akd,akn)
do i=2,nxm1
im1=i-1
ip1=i+1
!$acc loop vector reduction(+:akd,akn)
do j=2,nym1
jm1=j-1
jp1=j+1
``````

I used both methods, but I keep getting akd/akn equal to NaN. Any ideas why this is happening?

I’m not sure. Maybe it’s because of the lack of the reduction clauses, but it could be something else as well. I’d need a reproducing example to know for sure.

when do I need to make variables/matrices private on a parallel region?

Again, scalars are private by default, so no need to add them to a private clause. There are a few exceptions, such as when passing a scalar by reference to a device routine, in which case “private” is needed, but the compiler will give a feedback message (-Minfo=accel) when this occurs.

Arrays are shared by default. However, you only need to privatize an array when an element is reused by multiple iterations of a parallel loop.

• Mat

Hi Mat,

so I did correct one line (the first one, update host(akd,akn) instead of update device). It gave me a small error ak (=akd/akn) of 0.563e-4, which is not too bad, except the original loop (CPU version, no acc directives) has an error ak=0.1233e-5 (code 1). How can I bring the error down?

If I change it to have a reduction (code 2), I get NaN, which does not make any sense. Any clues?

cheers!
Dolf

Code 1:

``````
!$acc update host(akd,akn)
!$acc parallel loop
do i=2,nxm1
im1=i-1
ip1=i+1
!$acc loop vector
do j=2,nym1
jm1=j-1
jp1=j+1
aj1=aw(im1,jm1)
bj1=ap(im1,jm1)
cj1=ae(im1,jm1)
dj1=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
if(i.eq.2) then
dj1=dj1-aw(im1,jm1)*p(im1,j)
aj1=0.d0
else if (i.eq.nxm1) then
dj1=dj1-ae(im1,jm1)*p(ip1,j)
cj1=0.d0
endif
akd=akd+dabs(aj1*p(im1,j)+bj1*p(i,j)+cj1*p(ip1,j)-dj1)
akn=akn+dabs(bj1*p(i,j))
enddo
enddo
``````

Code 2:

``````
!$acc update host(akd,akn)
!$acc parallel loop reduction(+:akd,akn)
do i=2,nxm1
im1=i-1
ip1=i+1
!$acc loop vector reduction(+:akd,akn)
do j=2,nym1
jm1=j-1
jp1=j+1
aj1=aw(im1,jm1)
bj1=ap(im1,jm1)
cj1=ae(im1,jm1)
dj1=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
if(i.eq.2) then
dj1=dj1-aw(im1,jm1)*p(im1,j)
aj1=0.d0
else if (i.eq.nxm1) then
dj1=dj1-ae(im1,jm1)*p(ip1,j)
cj1=0.d0
endif
akd=akd+dabs(aj1*p(im1,j)+bj1*p(i,j)+cj1*p(ip1,j)-dj1)
akn=akn+dabs(bj1*p(i,j))
enddo
enddo
``````

I’m guessing in code example 2 that you’re only updating the host values before the loop, which most likely are garbage values. When you don’t add the reduction clauses, the compiler’s automatic reduction is used and the compiler handles the data movement. When you add the reduction clauses, then you’re assumed to be managing the data movement when the variables are in a data region. So by not putting in an update host directive at the end of the loop, the values are not updated on the host.

You should either not include “akd” and “akn” in a data region and have the compiler manage the data (i.e. remove them from any create/copy data clause and remove the update directive), or put the update directive after the parallel region.

• Mat

I removed update (akd,akn) and I get the NaN output. If I put it at the end of the parallel region, I get the same error ak=0.5e-4, and the error never gets smaller (does not converge).

any other clues?

Best,
Dolf

Sorry, no. I’ll need a reproducing example or at least a fuller example and the compiler feedback messages (-Minfo=accel).

• Mat

Hi Mat,
Ok. I will send you the whole code to look at via e-mail; I think that way we save time. It’s a code that you are familiar with, since we worked on it a while back. We did move one of the regions to GPU and I am back at moving more regions.

Also, is there a good resource that I can use to help me figure out how to move regions to GPU safely and correctly using OpenACC directives? I am pretty excited to move more of my code to run faster.

FYI: code sent to trs@pgroup.com and I asked them to forward it to you.

many thanks,
Dolf

Thanks, I’ll look for the code.

Also, is there a good resource that I can use to help me figure out how to move regions to GPU safely and correctly using OpenACC directives? I am pretty excited to move more of my code to run faster.

There are lots of resources out there, but I’m not positive it’s exactly what you’re looking for. Plus they are a bit scattered.

To start, OpenACC.org has lots of tutorials: http://www.openacc.org/ and we have several as well: http://www.pgroup.com/resources/accel.htm
Jeff Larkin did an online course which, although it has ended, you can still view: https://developer.nvidia.com/openacc-courses

There’s also an OpenACC book in the works, to which I contributed a chapter, but it’s not expected to be available until late this year.

• Mat

Hi Mat,
the code is working great now after I applied your directions.
I moved to the next parallel region and I encountered a few errors. I am not surprised, since these nested loops are a little bit tricky!
here is the original code:

``````
do 500 j=2,nym1
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0

do 450 i=2,nxm1
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
!
450         continue
call tridag(nxm2,aj,bj,cj,dj)

do 475 i=1,nxm2
p(i+1,j)=dj(i)
475       continue

500       continue
``````

here is my edit:

``````
!$acc private(aj,bj,cj,dj)
do 500 j=2,nym1
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0
!$acc loop vector
do 450 i=2,nxm1
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
!
450         continue
call tridag(nxm2,aj,bj,cj,dj)

p(1:nxm2,j)=dj(1:nxm2)

500       continue
``````

compiler output:

``````
reyneq:
86, Generating update device(iqpo,icoe,nter,qndat(:),zdat(:),d0,pit,pir,gama,t4,t3,t2,t1)
88, Accelerator kernel generated
Generating Tesla code
91, !$acc loop gang, vector(128) collapse(2) ! blockidx%
88, Generating copy(ap(:,:),an(:,:),as(:,:),ae(:,:),aw(:,:))
Generating copyin(yref(:),xref(:),hydnew(:,:),hxlnew(:,:),hyunew(:,:),hxrnew(:,:),zta(:,:),eta(:,:),p(:,:),bearx(:,:),beary(:,:),pold(:,:),h(:,:))
Generating copy(ar(:,:))
Generating copyin(hnew(:,:))
345, Accelerator kernel generated
Generating Tesla code
345, Generating reduction(+:akn,akd)
346, !$acc loop gang ! blockidx%x
350, !$acc loop vector(128) ! threadidx%x
Generating reduction(+:akn,akd)
345, Generating copyin(p(1:nxm1+1,1:nym1+1),aw(:nxm1-1,:nym1-1),ap(:nxm1-1,:nym1-1),ae(:nxm1-1,:nym1-1),ar(:nxm1-1,:nym1-1),as(:nxm1-1,:nym1-1),an(:nxm1-1,:nym1-1))
350, Loop is parallelizable
396, Accelerator kernel generated
Generating Tesla code
398, !$acc loop gang ! blockidx%x
403, !$acc loop vector(128) ! threadidx%x
426, !$acc loop vector(128) ! threadidx%x
396, Generating copy(nxm2)
Generating copyin(an(:,:),as(:,:))
Generating copy(p(:,:))
Generating copyin(ar(:,:),ae(:,:),ap(:,:),aw(:,:))
403, Loop is parallelizable
426, Loop is parallelizable
flow:
1164, Generating acc routine seq
1219, !$acc do seq
1245, !$acc do seq
1245, Scalar last value needed after loop for ff0 at line 1260
Scalar last value needed after loop for f1 at line 1261
Scalar last value needed after loop for f2 at line 1262
Roughness.F90
Stiff.F90
String.f90
Suspension.F90
UserDefGeom.f90
Util.F90
tridag:
229, Generating acc routine seq
242, !$acc do seq
248, !$acc do seq

Dyn5d build succeeded.

Build log was saved at "file://C:\D Drive\CML codes\Dyn5d\x64\Release\BuildLog.htm"

========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========
``````

errors:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

what did I do wrong?

regards,
Dolf

Hi Dolf,

I looked at the code for “tridag”. There are two automatic arrays in it with the size “nxmax”, where “nxmax” is a module variable. I’m assuming that you put “nxmax” into a “declare create” directive, since the compiler would complain if you didn’t. While “declare create” creates the variable over on the device, my guess is that you forgot to update its value when you initialize it in Init.F90. Hence when the automatic arrays are allocated, the size is a garbage value.

Note that using automatic arrays in device routines can severely hurt performance since each thread needs to dynamically allocate memory on the device which can be quite slow. If possible, try changing these arrays to be fixed size.

• Mat

Hi Mat, I did (update device(nxmax,nymax)) (code 1). I also got rid of the memory allocation error, but now I have ak=NaN again, even though code 1 has no relation to calculating ak!
Here is what I did:

Code 1:

``````
!$acc update device(nxmax, nymax, nxmax1, nxmax2, nxmax3, nxmax4, &
!$acc nymax1, nymax2, nymax3, nymax4)
!$acc parallel loop &
!$acc private(aj,bj,cj,dj)
do 500 j=2,nym1
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0
!$acc loop vector
do 450 i=2,nxm1
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
!
450         continue
call tridag(nxm2,aj,bj,cj,dj)

!do 475 i=1,nxm2
!p(i+1,j)=dj(i)
p(1:nxm2,j)=dj(1:nxm2)
!475         continue

500       continue
``````

If you look at those lines in code 2, is that the correct way of getting rid of the third do loop (do 475 …)?
code 2:

``````
!do 475 i=1,nxm2
!p(i+1,j)=dj(i)
p(1:nxm2,j)=dj(1:nxm2)
!475         continue
``````

any thoughts?
many thanks,
Dolf

There’s still a loop, it’s just implicit when using array syntax.

Here’s what I did:

``````
!$acc kernels loop gang vector independent private(aj,bj,cj,dj)
do 500 j=2,nym1
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0
!acc loop vector independent
do 450 i=2,nxm1
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
!
450         continue
call tridag(nxm2,aj,bj,cj,dj)
!acc loop vector independent
do 475 i=1,nxm2
p(i+1,j)=dj(i)
475         continue

500       continue
! reyneq_p3 end
``````

And my results:

``````
 *****************************************
* CML GPU Dynamic Simulator Version 5.0 *
*****************************************

GPU installed: GeForce GTX 690

*****************************************
Dynamics.def file version:
4.039

parameter header does not match parameter values

Running with non-zero altitude
Air parameters will be set according to specified altitude
*****************************************
CML Version 7.0  RAIL.DAT
7.0
*****************************************
Actual Grid Size: nx= 402  ny= 402
Initial Disk Velocity(m/s) =    8.21991
Initial IDEMA Skew Angle(degre -14.76900
Intermolecular forces are: OFF

SIMULATOR SOLVES SLIDER RESPONSE
TO THE FOLLOWING DYNAMIC INPUTS:

=> Initial Impulse

=> Three DOF Stiffness and Damping Coef. Suspension Model

=> Asperity Contact
( GW model )

PRE-PROCESSING...

dense meshing ...

T(s)= 0.00000    ,Iteration=  5,Residual=0.138019E-08
T(s)= 0.00000    ,Iteration=  5,Residual=0.148343E-09

START OF TRANSIENT ANALYSIS ...

t(s)=0.500000E-07, iteration number=  2, residual=0.130576E-10
t(s)=0.100000E-06, iteration number=  2, residual=0.771753E-11
t(s)=0.150000E-06, iteration number=  2, residual=0.134445E-10
t(s)=0.200000E-06, iteration number=  2, residual=0.192544E-10
t(s)=0.250000E-06, iteration number=  2, residual=0.250092E-10
t(s)=0.300000E-06, iteration number=  2, residual=0.220895E-10
t(s)=0.350000E-06, iteration number=  2, residual=0.258807E-10
t(s)=0.400000E-06, iteration number=  2, residual=0.298706E-10
t(s)=0.450000E-06, iteration number=  2, residual=0.336706E-10
t(s)=0.500000E-06, iteration number=  2, residual=0.373555E-10
t(s)=0.550000E-06, iteration number=  2, residual=0.408913E-10
t(s)=0.600000E-06, iteration number=  2, residual=0.442343E-10
t(s)=0.650000E-06, iteration number=  2, residual=0.473254E-10
t(s)=0.700000E-06, iteration number=  2, residual=0.500976E-10
t(s)=0.750000E-06, iteration number=  2, residual=0.524710E-10
t(s)=0.800000E-06, iteration number=  2, residual=0.543585E-10
t(s)=0.850000E-06, iteration number=  2, residual=0.556587E-10
t(s)=0.900000E-06, iteration number=  2, residual=0.553717E-10
t(s)=0.950000E-06, iteration number=  2, residual=0.551747E-10
t(s)=0.100000E-05, iteration number=  2, residual=0.540844E-10
``````

Hi Mat,

thanks for the reply. That’s an interesting result; it should be that way actually. Unfortunately, this is not what I get here. I did copy the snippet of your code from the last post, but here is what I get:

``````
warning: negative pressure occurred
warning: negative pressure occurred
warning: negative pressure occurred
warning: negative pressure occurred
warning: negative pressure occurred
warning: negative pressure occurred
warning: negative pressure occurred
T(s)= 0.00000    ,Iteration=  5,Residual=         NaN
T(s)= 0.00000    ,Iteration= 10,Residual=         NaN
T(s)= 0.00000    ,Iteration= 15,Residual=         NaN
T(s)= 0.00000    ,Iteration= 20,Residual=         NaN
T(s)= 0.00000    ,Iteration= 25,Residual=         NaN
T(s)= 0.00000    ,Iteration= 30,Residual=         NaN
T(s)= 0.00000    ,Iteration= 35,Residual=         NaN
T(s)= 0.00000    ,Iteration= 40,Residual=         NaN
T(s)= 0.00000    ,Iteration= 45,Residual=         NaN
T(s)= 0.00000    ,Iteration= 50,Residual=         NaN
T(s)= 0.00000    ,Iteration= 55,Residual=         NaN
T(s)= 0.00000    ,Iteration= 60,Residual=         NaN
T(s)= 0.00000    ,Iteration= 65,Residual=         NaN
T(s)= 0.00000    ,Iteration= 70,Residual=         NaN
T(s)= 0.00000    ,Iteration= 75,Residual=         NaN
^C
C:\Users\Dolf\Desktop\Dyn5d>
``````

Did you change other parts of the code? I made tridag a sequential sub as you suggested. Shall I put it back as a regular sub?

regards,

Dolf

I just applied this change plus the addition of updating the device copy of “nmax” to the source you sent me on June 30th. I can send you what I have so you can diff your version with mine.

• Mat

Thanks Mat, I got the code you sent me. I appreciate it. I have updated what you sent me to make it work (when I am updating the p matrix). Now it works perfectly.
Now I have moved to the next parallel region, as shown in code 1 (original). My edit is shown in code 2. I don’t know why I am getting a NaN, any ideas? I tried to do what you already did in the previous parallel region, but with no luck.

Code 1:

``````
! reyneq_p4
do i=2,nxm1
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=0.d0
bj(im1)=0.d0
cj(im1)=0.d0
dj(im1)=0.d0
!
do j=2,nym1
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0
!
aj(im1)=aj(im1)+aw(im1,jm1)
bj(im1)=bj(im1)+ap(im1,jm1)
if(j.ne.2) bj(im1)=bj(im1)+as(im1,jm1)
if(j.ne.nym1) bj(im1)=bj(im1)+an(im1,jm1)
cj(im1)=cj(im1)+ae(im1,jm1)
dj(im1)=dj(im1)+ar(im1,jm1)-aw(im1,jm1)*p(im1,j)    &
-ae(im1,jm1)*p(ip1,j)-as(im1,jm1)*p(i,jm1)  &
-an(im1,jm1)*p(i,jp1)-ap(im1,jm1)*p(i  ,j)
!
enddo
enddo
! end reyneq_p4
``````

Code 2:

``````
!$acc kernels loop independent reduction(+:aj1,bj1,cj1,dj1) private(aj,bj,cj,dj)
do i=2,nxm1
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=0.d0
bj(im1)=0.d0
cj(im1)=0.d0
dj(im1)=0.d0
aj1 = 0.d0
bj1 = 0.d0
cj1 = 0.d0
dj1 = 0.d0

!$acc loop vector independent reduction(+:aj1,bj1,cj1,dj1)
do j=2,nym1
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0
!
aj1=aj1+aw(im1,jm1)
bj1=bj1+ap(im1,jm1)
if(j.ne.2) bj1=bj1+as(im1,jm1)
if(j.ne.nym1) bj1=bj1+an(im1,jm1)
cj1=cj1+ae(im1,jm1)
dj1=dj1+ar(im1,jm1)-aw(im1,jm1)*p(im1,j)    &
-ae(im1,jm1)*p(ip1,j)-as(im1,jm1)*p(i,jm1)  &
-an(im1,jm1)*p(i,jp1)-ap(im1,jm1)*p(i  ,j)
enddo
aj(im1)= aj1
bj(im1)= bj1
cj(im1)= cj1
dj(im1)= dj1
enddo
``````

Hi Dolf,

You don’t want to privatize the aj, bj, cj, dj arrays here, given that you use the results after the compute region. Instead, you want to put these in a “copyout” clause.

Also, there is no need to have the reduction clause on the outer i loop, given that you’re not using these values after the compute region.

• Mat

Thanks Mat, I have corrected the problem and now it’s working flawlessly.
I have this subroutine, and I want to know which way I should make it run on GPU.

Question 1: since it has multiple nested do loops. what shall I do?

1. make it run on GPU by adding !$acc routine seq at the start of the subroutine?
2. go through each do loop and use !$acc parallel loop directives?

Question 2: what does !$acc routine seq do exactly?

Question 3: if I have a data region (example code 1 below) and a GPU subroutine is called within that region, which method should be used to run the subroutine on GPU? Option 1 or 2 from Question 1, or does it not matter?

Do we treat !$acc data like any other !$acc directive? In that case I have to use !$acc routine seq inside the subroutine (Q1.1).

thanks!
Dolf

here is the subroutine:

``````
!================================================================
subroutine acmdn(aw1,ae1,as1,an1,ap1,ar1,p1,aw2,ae2,as2,    &
an2,ap2,ar2,dp2,nx1,ny1,nx2,ny2,nx1max,ny1max,nx2max,     &
ny2max,nlimit)
!================================================================
use Dyn_sizes_module
implicit none

real(8) :: AR1, AR2, DJ, CJ, AP2, AP1, BJ, AKN, AN2, RATIO, AKPRE, AKD, AS2, AS1, P1
real(8) :: AK, AJ, AW2, AW1, AE1, AE2, DP2, DABS, AN1
integer :: K1, NX2M2, NX2M1, NX1, NX2, I1, I2, KINT, NY1MAX, K, L, NY2M1, NY2M2
integer :: L1, NX1MAX, NLIMIT, NY2, NY1, J1, J2, NY2MAX, NX2MAX

dimension aw1(nx1max,ny1max),ae1(nx1max,ny1max), &
as1(nx1max,ny1max),an1(nx1max,ny1max), &
ap1(nx1max,ny1max),ar1(nx1max,ny1max), &
aw2(nx2max,ny2max),ae2(nx2max,ny2max), &
as2(nx2max,ny2max),an2(nx2max,ny2max), &
ap2(nx2max,ny2max),ar2(nx2max,ny2max), &
p1(nx1max,ny1max), dp2(nx2max,ny2max)
dimension aj(nxmax),bj(nxmax),cj(nxmax),dj(nxmax)

nx2m1=nx2-1
ny2m1=ny2-1
nx2m2=nx2-2
ny2m2=ny2-2

do k=2,nx2m1
k1=k-1
do l=2,ny2m1
l1=l-1
i1=2*(k-1)-1
i2=2*(k-1)+1-1
j1=2*(l-1)-1
j2=2*(l-1)+1-1
ap2(k1,l1)=ap1(i1,j1)+ap1(i2,j1)+ap1(i1,j2)+ap1(i2,j2)+ &
ae1(i1,j1)+ae1(i1,j2)+aw1(i2,j1)+aw1(i2,j2)+ &
as1(i1,j2)+as1(i2,j2)+an1(i1,j1)+an1(i2,j1)
aw2(k1,l1)=aw1(i1,j1)+aw1(i1,j2)
ae2(k1,l1)=ae1(i2,j1)+ae1(i2,j2)
as2(k1,l1)=as1(i1,j1)+as1(i2,j1)
an2(k1,l1)=an1(i1,j2)+an1(i2,j2)
ar2(k1,l1)=ar1(i1,j1)+ar1(i2,j1)+ar1(i1,j2)+ar1(i2,j2)- &
ap1(i1,j1)*p1(i1+1,j1+1)-ap1(i2,j1)*p1(i2+1,j1+1)-   &
ap1(i1,j2)*p1(i1+1,j2+1)-ap1(i2,j2)*p1(i2+1,j2+1)-   &
aw1(i1,j1)*p1(i1  ,j1+1)-aw1(i1,j2)*p1(i1  ,j2+1)-   &
ae1(i1,j1)*p1(i2+1,j1+1)-ae1(i1,j2)*p1(i2+1,j2+1)-   &
as1(i1,j1)*p1(i1+1,j1  )-as1(i1,j2)*p1(i1+1,j1+1)-   &
an1(i1,j1)*p1(i1+1,j2+1)-an1(i1,j2)*p1(i1+1,j2+2)-   &
aw1(i2,j1)*p1(i1+1,j1+1)-aw1(i2,j2)*p1(i1+1,j2+1)-   &
ae1(i2,j1)*p1(i2+2,j1+1)-ae1(i2,j2)*p1(i2+2,j2+1)-   &
as1(i2,j1)*p1(i2+1,j1  )-as1(i2,j2)*p1(i2+1,j1+1)-   &
an1(i2,j1)*p1(i2+1,j2+1)-an1(i2,j2)*p1(i2+1,j2+2)
end do
end do
!
do k=1,nx2
do l=1,ny2
dp2(k,l)=0.d0
end do
end do
!
kint=0
50      kint=kint+1

do 200 l=2,ny2m1
l1=l-1
do 100 k=2,nx2m1
k1=k-1
aj(k1)=aw2(k1,l1)
bj(k1)=ap2(k1,l1)
cj(k1)=ae2(k1,l1)
dj(k1)=-as2(k1,l1)*dp2(k,l1)-an2(k1,l1)*dp2(k,l+1) + ar2(k1,l1)
100       continue
aj(1)=0.d0
cj(nx2m2)=0.d0
call tridag(nx2m2,aj,bj,cj,dj)
do 150 k=1,nx2m2
dp2(k+1,l)=dj(k)
150       continue
200     continue

do k=2,nx2m1
k1=k-1
!
aj(k1)=0.d0
bj(k1)=0.d0
cj(k1)=0.d0
dj(k1)=0.d0
!
do l=2,ny2m1
l1=l-1
aj(k1)=aj(k1)+aw2(k1,l1)
bj(k1)=bj(k1)+ap2(k1,l1)
if(l.ne.2) bj(k1)=bj(k1)+as2(k1,l1)
if(l.ne.ny2m1) bj(k1)=bj(k1)+an2(k1,l1)
cj(k1)=cj(k1)+ae2(k1,l1)
dj(k1)=dj(k1)+ar2(k1,l1)-aw2(k1,l1)*dp2(k1,l)       &
-ae2(k1,l1)*dp2(k+1,l)-as2(k1,l1)*dp2(k,l1)  &
-an2(k1,l1)*dp2(k,l+1)-ap2(k1,l1)*dp2(k,l)
enddo
enddo
aj(1)=0.d0
cj(nx2m2)=0.d0
call tridag(nx2m2,aj,bj,cj,dj)
do k=1,nx2m2
do l=1,ny2m2
dp2(k+1,l+1)=dp2(k+1,l+1)+dj(k)
enddo
enddo

do 400 k=2,nx2m1
k1=k-1
do 300 l=2,ny2m1
l1=l-1
aj(l1)=as2(k1,l1)
bj(l1)=ap2(k1,l1)
cj(l1)=an2(k1,l1)
dj(l1)=-aw2(k1,l1)*dp2(k1,l)-ae2(k1,l1)*dp2(k+1,l) + ar2(k1,l1)
300       continue
aj(1)=0.d0
cj(ny2m2)=0.d0
call tridag(ny2m2,aj,bj,cj,dj)
do 350 l=1,ny2m2
dp2(k,l+1)=dj(l)
350       continue
400     continue

do l=2,ny2m1
l1=l-1
!
aj(l1)=0.d0
bj(l1)=0.d0
cj(l1)=0.d0
dj(l1)=0.d0
!
do k=2,nx2m1
k1=k-1
aj(l1)=aj(l1)+as2(k1,l1)
bj(l1)=bj(l1)+ap2(k1,l1)
if(k.ne.2) bj(l1)=bj(l1)+aw2(k1,l1)
if(k.ne.nx2m1) bj(l1)=bj(l1)+ae2(k1,l1)
cj(l1)=cj(l1)+an2(k1,l1)
dj(l1)=dj(l1)+ar2(k1,l1)-aw2(k1,l1)*dp2(k1,l)       &
-ae2(k1,l1)*dp2(k+1,l)-as2(k1,l1)*dp2(k,l1)  &
-an2(k1,l1)*dp2(k,l+1)-ap2(k1,l1)*dp2(k,l)
enddo
enddo
aj(1)=0.d0
cj(ny2m2)=0.d0
call tridag(ny2m2,aj,bj,cj,dj)
do l=1,ny2m2
do k=1,nx2m2
dp2(k+1,l+1)=dp2(k+1,l+1)+dj(l)
enddo
enddo
!
akd=0.d0
akn=0.d0
do k=2,nx2m1
k1=k-1
do l=2,ny2m1
l1=l-1
aj(l1)=as2(k1,l1)
bj(l1)=ap2(k1,l1)
cj(l1)=an2(k1,l1)
dj(l1)=-aw2(k1,l1)*dp2(k1,l)-ae2(k1,l1)*dp2(k+1,l) + ar2(k1,l1)
akd=akd+dabs(aj(l1)*dp2(k,l1)+bj(l1)*dp2(k,l) + cj(l1)*dp2(k,l+1)-dj(l1))
akn=akn+dabs(bj(l1)*dp2(k,l))
enddo
enddo
!
ak=akd/akn
if(kint.eq.1) then
akpre=ak
goto 50
end if

ratio=ak/akpre
akpre=ak

return
end
``````

``````
!=======================================================
subroutine tridag(n,a,b,c,d)
!=======================================================
!....invert a tri-diagonal matrix

use Dyn_sizes_module
implicit real*8(a-h,o-z)
!$ACC routine seq
dimension a(1),b(1),c(1),d(1)
dimension beta(402),gamma(402)

beta(1)=b(1)
gamma(1)=d(1)/beta(1)
do 10 i=2,n
im1=i-1
beta(i)=b(i)-a(i)*c(im1)/beta(im1)
gamma(i)=(d(i)-a(i)*gamma(im1))/beta(i)
10  continue
d(n)=gamma(n)
do 20 i=n-1,1,-1
d(i)=gamma(i)-c(i)*d(i+1)/beta(i)
20  continue

return
end
``````

``````
!$acc data copyin(p,aw,ae......) &
!$acc copyout(p)
call acmdn(........)
!$acc end data
``````

Question 2: what does !$acc routine seq do exactly?

It tells the compiler to generate device code for this routine that is callable from other device code. In other words, it allows you to call functions and subroutines from “parallel” and “kernels” regions.

Question 1: since it has multiple nested do loops. what shall I do?

1. make it run on GPU by adding !$acc routine seq at the start of the subroutine?
2. go through each do loop and use !$acc parallel loop directives?

I presume you’re talking about adding routine to “tridag”, not “acmdn”. In this case you have a choice of using “seq” or “vector”. “seq” says to run the code within the routine sequentially, and it will be executed by all vectors if called from a vector loop. If you use “routine vector”, then the loops within “tridag” can be parallelized across vectors.

The choice mostly depends on how it’s being called. So if in acmdn you schedule the 200 loop using “gang”, then you most likely want “tridag” to be a “vector” routine. If the 200 loop uses “gang vector”, then you’ll need to make it a “seq” routine.

If you use “seq”, no “loop” directives should be added. For “vector”, you’ll need to specify which loop(s) to schedule. Don’t use “parallel loop”, since that would define a new parallel region (nested parallelism), which isn’t what you want.

For example:

``````
!=======================================================
subroutine tridag(n,a,b,c,d)
!=======================================================
!....invert a tri-diagonal matrix

use Dyn_sizes_module
implicit real*8(a-h,o-z)
!$ACC routine vector
dimension a(1),b(1),c(1),d(1)
dimension beta(402),gamma(402)

beta(1)=b(1)
gamma(1)=d(1)/beta(1)
!$acc loop vector
do 10 i=2,n
im1=i-1
beta(i)=b(i)-a(i)*c(im1)/beta(im1)
gamma(i)=(d(i)-a(i)*gamma(im1))/beta(i)
10  continue
d(n)=gamma(n)
!$acc loop vector
do 20 i=n-1,1,-1
d(i)=gamma(i)-c(i)*d(i+1)/beta(i)
20  continue

return
end
``````

do we treat !\$acc data like any other !\$acc directive? in that case I have to use !\$acc routine seq inside the subroutine (Q1.1).

Data regions can’t be used inside a device routine since data synchronization can only be initiated from the host. Data is implied to be present within a device routine.

Note that static global data must be placed in a “declare create” directive. Hence if you are using any data from the “Dyn_sizes_module” module, then this module data must be put within a “declare create” to create the device reference to the data, and then synchronized at runtime via an “update” directive. Though I believe we did this already in one of your earlier posts, so it may be already ported.

• Mat