OpenACC directives and GPU

Dolfie · June 22, 2016, 7:39pm

Hi all,
how can I make this portion of the code run on GPU using OpenACC?
I also can’t figure out when to make variables(or matrices) private. Where or what’s a good resource to read up on that topic? I have a large fortran code that I need to port into GPU.

do j=2,nym1
          jm1=j-1
          jp1=j+1
          do i=2,nxm1
            im1=i-1
            ip1=i+1
            aj(im1)=aw(im1,jm1)
            bj(im1)=ap(im1,jm1)
            cj(im1)=ae(im1,jm1)
            dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
            if(i.eq.2) then
              dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
              aj(im1)=0.d0
            else if (i.eq.nxm1) then
              dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
              cj(im1)=0.d0
            endif
            akd=akd+dabs(aj(im1)*p(im1,j)+bj(im1)*p(i,j) + cj(im1)*p(ip1,j)-dj(im1))
            akn=akn+dabs(bj(im1)*p(i,j))
          enddo
        enddo

please advice.

MatColgrove · June 22, 2016, 11:14pm

Hi Dolf,

Why are you using arrays to hold intermediary values in the inner “i” loop? While I may be missing something, it seems like aj, bj, cj, and dj could be scalars and allow you to then parallelize both the “j” and “I” loops.

If these variables really do need to be arrays, then you will need to either privatize the arrays or inter-change the “I” and “j” loops, parallelizing “I” and running “j” sequentially.

Option 1:

!$acc parallel loop
    do j=2,nym1 
          jm1=j-1 
          jp1=j+1 
!$acc loop vector private(aj,bj,cj,dj)
          do i=2,nxm1 
            im1=i-1 
            ip1=i+1

!$acc parallel loop gang vector
      do i=2,nxm1 
            im1=i-1 
            ip1=i+1 
!$acc loop seq
    do j=2,nym1 
          jm1=j-1 
          jp1=j+1

By default in OpenACC, scalars are private and arrays are shared. In cases such as your code when arrays are reused, having them shared across all threads would cause a race condition. Hence, the user must privatize the array. The array is private to the loop so a “private” clause on a “vector” loop means every thread gets it’s own copy of the array. A “private” on a “worker” loop means that it will be private to each worker but shared between the vectors within that worker. A “private” on a “gang” loop means each gang will get it’s own copy which are shared amongst all the vectors within that gang.

Since privatized arrays are placed in global device memory and may lead to poor memory access patterns, it’s advisable to avoid using “private” if the algorithm can be rewritten.

Mat

Dolfie · June 27, 2016, 7:49pm

hi Mat,

I used both the methods but I keep getting akd/akn equal to NaN. any ideas why this is happening?
btw akd is accumulated over the loops (summation). how can we do it in parallel?
also, when do I need to make variables/matrices private on a parallel region?

Regards,
Dolf

!$acc update device(akd,akn)
!$acc parallel loop       
    do i=2,nxm1
	  im1=i-1
      ip1=i+1
!$acc loop vector
	do j=2,nym1
		jm1=j-1
        jp1=j+1
            
            aj1=aw(im1,jm1)
            bj1=ap(im1,jm1)
            cj1=ae(im1,jm1)
            dj1=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
            if(i.eq.2) then
              dj1=dj1-aw(im1,jm1)*p(im1,j)
              aj1=0.d0
            else if (i.eq.nxm1) then
              dj1=dj1-ae(im1,jm1)*p(ip1,j)
              cj1=0.d0
            endif
            akd=akd+dabs(aj1*p(im1,j)+bj1*p(i,j) + cj1*p(ip1,j)-dj1)
            akn=akn+dabs(bj1*p(i,j))
          enddo
        enddo

MatColgrove · June 27, 2016, 11:09pm

btw akd is accumulated over the loops (summation). how can we do it in parallel?

Put “akd” and “akn” into a reduction clause:

!$acc parallel loop reduction(+:akd,akn)       
    do i=2,nxm1 
     im1=i-1 
      ip1=i+1 
!$acc loop vector reduction(+:akd,akn) 
      jm1=j-1 
        jp1=j+1

I used both the methods but I keep getting akd/akn equal to NaN. any ideas why this is happening?

I’m not sure. Maybe it’s because of the lack of the reduction clauses but I could be something else as well. I’d need a reproducing example to know for sure.

when do I need to make variables/matrices private on a parallel region?

Again scalars are private by default, so no need to add them to a private clause. There are a few exceptions, such when passing a scalar by reference to a device routine when “private” is needed, but the compiler will give a feedback message (-Minfo=accel) when this occurs.

Arrays are shared by default. However, you only need to privatize an array when an element is reused by multiple iterations of a parallel loop.

Mat

Dolfie · June 29, 2016, 12:28am

Hi Mat,

so I did correct one line (first one, update host(akd,akn) instead of update device). It gave me small error ak (=akd/akn) of 0.563e-4 which is not too bad except the original loop (cpu version, no acc directives) has an error ak=0.1233e-5. (code 1). How can I bring the error down?

If I change it have reduction in (code 2), I get NaN. which does not make any sense. Any clues?

cheers!
Dolf

Code 1:

!$acc update host(akd,akn)
!$acc parallel loop 
    do i=2,nxm1
	  im1=i-1
      ip1=i+1
!$acc loop vector 
    do j=2,nym1
		jm1=j-1
        jp1=j+1
            aj1=aw(im1,jm1)
            bj1=ap(im1,jm1)
            cj1=ae(im1,jm1)
            dj1=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
            if(i.eq.2) then
              dj1=dj1-aw(im1,jm1)*p(im1,j)
              aj1=0.d0
            else if (i.eq.nxm1) then
              dj1=dj1-ae(im1,jm1)*p(ip1,j)
              cj1=0.d0
            endif
            akd=akd+dabs(aj1*p(im1,j)+bj1*p(i,j)+cj1*p(ip1,j)-dj1)
            akn=akn+dabs(bj1*p(i,j))
          enddo
        enddo

Code 2:

!$acc update host(akd,akn)
!$acc parallel loop reduction(+:akd,akn) 
    do i=2,nxm1
	  im1=i-1
      ip1=i+1
!$acc loop vector reduction(+:akd,akn)
    do j=2,nym1
		jm1=j-1
        jp1=j+1
            aj1=aw(im1,jm1)
            bj1=ap(im1,jm1)
            cj1=ae(im1,jm1)
            dj1=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
            if(i.eq.2) then
              dj1=dj1-aw(im1,jm1)*p(im1,j)
              aj1=0.d0
            else if (i.eq.nxm1) then
              dj1=dj1-ae(im1,jm1)*p(ip1,j)
              cj1=0.d0
            endif
            akd=akd+dabs(aj1*p(im1,j)+bj1*p(i,j)+cj1*p(ip1,j)-dj1)
            akn=akn+dabs(bj1*p(i,j))
          enddo
        enddo

MatColgrove · June 29, 2016, 4:30pm

I’m guessing in code example 2 that you’re only updating the host values before the loop which most likely are garbage values. When you don’t add the reductions clauses, the compiler’s automatic reduction is used and the compiler handles the data movement. When you add the reduction clauses, then you’re assumed to be managing the data movement when the variables are in a data region. So by not putting in a update host directive at the end of the loop, the values are not updated on the host.

You should either not include “akd” and “akn” in a data region and have the compiler manage the data (i.e. remove them from any create/copy data clause and remove the update directive), or put the update directive after the parallel region.

Mat

Dolfie · June 29, 2016, 8:30pm

I removed update (akd,akn), I get the NaN output. If I put it at the end of the parallel region, I get the same error ak=0.5e-4. and the error never get smaller (does not converge).

any other clues?

Best,
Dolf

MatColgrove · June 29, 2016, 8:52pm

Sorry, no. I’ll need a reproducing example or at least a fuller example and the compiler feedback messages (-Minfo=accel).

Mat

Dolfie · June 29, 2016, 10:16pm

Hi Mat,
Ok. I will send you the whole code to look at via e-mail. I think this way we save our time. It’s a code that you are familiar with since we worked on it a while back. We did move one of the regions to GPU and I am back at moving more regions.

Also, is there a good resource that I can use to help me figure out how to move regions to GPU safely and correctly using OpenAcc directives? I am pretty excited to move more of my code to run faster.

FYI: code sent to trs@pgroup.com and I asked them to forward it to you.

many thanks,
Dolf

MatColgrove · June 30, 2016, 5:11pm

Thanks, I’ll look for the code.

Also, is there a good resource that I can use to help me figure out how to move regions to GPU safely and correctly using OpenAcc directives? I am pretty excited to move more of my code to run faster.

There’s lot of resources out there but I’m not positive it exactly what you’re looking for. Plus they are a bit scattered.

To start, OpenACC.org has lots of tutorials: http://www.openacc.org/An we have several as well: PGI Compilers with OpenACC | PGI
Jeff Larkin did an online course which while complete, you can still view: https://developer.nvidia.com/openacc-courses

There’s also an OpenACC book in the works, which I contributed a chapter, but it’s not expect to be available until late this year.

Mat

Dolfie · July 6, 2016, 8:51pm

Hi Mat,
the code is working great now after I applied your directions.
I moved to the next parallel region and I encountered few errors. I am not surprised since these nested loops are little bit tricky!
here is the original code:

   do 500 j=2,nym1
          jm1=j-1
          jp1=j+1
          ymyp=(yref(jp1)-yref(jm1))/2.d0

          do 450 i=2,nxm1
	      im1=i-1
              ip1=i+1
              xmxp=(xref(ip1)-xref(im1))/2.d0
!
              aj(im1)=aw(im1,jm1)
              bj(im1)=ap(im1,jm1)
              cj(im1)=ae(im1,jm1)
              dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
              if(i.eq.2) then
                dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
                aj(im1)=0.d0
              else if (i.eq.nxm1) then
                dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
                cj(im1)=0.d0
              endif
!
450         continue
            call tridag(nxm2,aj,bj,cj,dj)

    do 475 i=1,nxm2
            p(i+1,j)=dj(i)			
475       continue

500       continue

here is my edit:

!$acc private(aj,bj,cj,dj)
          do 500 j=2,nym1
          jm1=j-1
          jp1=j+1
          ymyp=(yref(jp1)-yref(jm1))/2.d0
!$acc loop vector 
              do 450 i=2,nxm1
			  im1=i-1
              ip1=i+1
              xmxp=(xref(ip1)-xref(im1))/2.d0
!
              aj(im1)=aw(im1,jm1)
              bj(im1)=ap(im1,jm1)
              cj(im1)=ae(im1,jm1)
              dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
              if(i.eq.2) then
                dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
                aj(im1)=0.d0
              else if (i.eq.nxm1) then
                dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
                cj(im1)=0.d0
              endif
!
450         continue
            call tridag(nxm2,aj,bj,cj,dj)

 	    p(1:nxm2,j)=dj(1:nxm2)

500       continue

compiler output:

reyneq:
     86, Generating update device(iqpo,icoe,nter,qndat(:),zdat(:),d0,pit,pir,gama,t4,t3,t2,t1)
     88, Accelerator kernel generated
         Generating Tesla code
         91, !$acc loop gang, vector(128) collapse(2) ! blockidx%
         92,   ! blockidx%x threadidx%x collapsed
     88, Generating copy(ap(:,:),an(:,:),as(:,:),ae(:,:),aw(:,:))
         Generating copyin(yref(:),xref(:),hydnew(:,:),hxlnew(:,:),hyunew(:,:),hxrnew(:,:),zta(:,:),eta(:,:),p(:,:),bearx(:,:),beary(:,:),pold(:,:),h(:,:))
         Generating copy(ar(:,:))
         Generating copyin(hnew(:,:))
        345, Accelerator kernel generated
         Generating Tesla code
        345, Generating reduction(+:akn,akd)
        346, !$acc loop gang ! blockidx%x
        350, !$acc loop vector(128) ! threadidx%x
             Generating reduction(+:akn,akd)
        345, Generating copyin(p(1:nxm1+1,1:nym1+1),aw(:nxm1-1,:nym1-1),ap(:nxm1-1,:nym1-1),ae(:nxm1-1,:nym1-1),ar(:nxm1-1,:nym1-1),as(:nxm1-1,:nym1-1),an(:nxm1-1,:nym1-1))
    350, Loop is parallelizable
    396, Accelerator kernel generated
         Generating Tesla code
        398, !$acc loop gang ! blockidx%x
        403, !$acc loop vector(128) ! threadidx%x
        426, !$acc loop vector(128) ! threadidx%x
        396, Generating copy(nxm2)
         Generating copyin(an(:,:),as(:,:))
         Generating copy(p(:,:))
         Generating copyin(ar(:,:),ae(:,:),ap(:,:),aw(:,:))
    403, Loop is parallelizable
    426, Loop is parallelizable
flow:
   1164, Generating acc routine seq
       1219, !$acc do seq
       1245, !$acc do seq
   1245, Scalar last value needed after loop for ff0 at line 1260
         Scalar last value needed after loop for f1 at line 1261
         Scalar last value needed after loop for f2 at line 1262
Roughness.F90
Stiff.F90
String.f90
Suspension.F90
UserDefGeom.f90
Util.F90
tridag:
    229, Generating acc routine seq
        242, !$acc do seq
        248, !$acc do seq
Linking...

Dyn5d build succeeded.

Build log was saved at "file://C:\D Drive\CML codes\Dyn5d\x64\Release\BuildLog.htm"

========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========

errors:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

what did I do wrong?

regards,
Dolf

MatColgrove · July 6, 2016, 10:04pm

Hi Dolf,

I looked at the code for “tridag”. There’s two automatic arrays in it with the size “nxmax” where “nxmax” is a module variable. I’m assuming that you put “nxmax” into a “declare create” directive since the compiler would complain if you didn’t. While “declare create” creates the variable over on the device, my guess is that you forgot to update it’s value when you initialize it in Init.F90. Hence when the automatic arrays are allocated, the size is a garbage value.

Note that using automatic arrays in device routines can severely hurt performance since each thread needs to dynamically allocate memory on the device which can be quite slow. If possible, try changing these arrays to be fixed size.

Mat

Dolfie · July 8, 2016, 11:22pm

Hi Mat, I did (update device (nxmax,nymax)) (code 1). Also I got rid of the memory allocation error but now I have the ak=NaN again. Code 1 have no relation to calculating ak even!.
Here is what I did:

Code 1:

!$acc update device(nxmax, nymax, nxmax1, nxmax2, nxmax3, nxmax4, & 
!$acc nymax1, nymax2, nymax3, nymax4)
!$acc parallel loop &
!$acc private(aj,bj,cj,dj)
          do 500 j=2,nym1
		  jm1=j-1
          jp1=j+1
          ymyp=(yref(jp1)-yref(jm1))/2.d0
!$acc loop vector 
              do 450 i=2,nxm1
			  im1=i-1
              ip1=i+1
              xmxp=(xref(ip1)-xref(im1))/2.d0
!
              aj(im1)=aw(im1,jm1)
              bj(im1)=ap(im1,jm1)
              cj(im1)=ae(im1,jm1)
              dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
              if(i.eq.2) then
                dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
                aj(im1)=0.d0
              else if (i.eq.nxm1) then
                dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
                cj(im1)=0.d0
              endif
!
450         continue
            call tridag(nxm2,aj,bj,cj,dj)

            !do 475 i=1,nxm2
             !p(i+1,j)=dj(i)
			 p(1:nxm2,j)=dj(1:nxm2)
!475         continue

500       continue

if you pay attention to those lines in code 2, is that the correct way of getting rid of the third do loop? ( do 475 …)
code 2:

!do 475 i=1,nxm2
             !p(i+1,j)=dj(i)
			 p(1:nxm2,j)=dj(1:nxm2)
!475         continue

any thoughts?
many thanks,
Dolf

MatColgrove · July 11, 2016, 10:05pm

There’s still a loop, it’s just implicit when using array syntax.

Here’s what I did:

!$acc kernels loop gang vector independent private(aj,bj,cj,dj)
          do 500 j=2,nym1
            jm1=j-1
            jp1=j+1
            ymyp=(yref(jp1)-yref(jm1))/2.d0
!acc loop vector independent
            do 450 i=2,nxm1
              im1=i-1
              ip1=i+1
              xmxp=(xref(ip1)-xref(im1))/2.d0
!
              aj(im1)=aw(im1,jm1)
              bj(im1)=ap(im1,jm1)
              cj(im1)=ae(im1,jm1)
              dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
              if(i.eq.2) then
                dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
                aj(im1)=0.d0
              else if (i.eq.nxm1) then
                dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
                cj(im1)=0.d0
              endif
!
450         continue
            call tridag(nxm2,aj,bj,cj,dj)
!acc loop vector independent
            do 475 i=1,nxm2
              p(i+1,j)=dj(i)
475         continue

500       continue
! reyneq_p3 end

Any my results:

 *****************************************
 * CML GPU Dynamic Simulator Version 5.0 *
 *****************************************

 GPU installed: GeForce GTX 690

 *****************************************
 Dynamics.def file version:
 4.039

 error encountered when reading parameters:
 parameter header does not match parameter values

 Running with non-zero altitude
 Air parameters will be set according to specified altitude
 *****************************************
 CML Version 7.0  RAIL.DAT
 7.0
 *****************************************
Actual Grid Size: nx= 402  ny= 402
Suspension Load(grams) =  2.500
Initial Disk Velocity(m/s) =    8.21991
Initial IDEMA Skew Angle(degre -14.76900
Initial Radial Position(mm) =   14.53600
Intermolecular forces are: OFF

 SIMULATOR SOLVES SLIDER RESPONSE
 TO THE FOLLOWING DYNAMIC INPUTS:

  => Initial Impulse

   => Three DOF Stiffness and Damping Coef. Suspension Model

  => Asperity Contact
     ( GW model )

 PRE-PROCESSING...

  dense meshing ...

  adaptive mesh discretization ...
  T(s)= 0.00000    ,Iteration=  5,Residual=0.138019E-08
  T(s)= 0.00000    ,Iteration=  5,Residual=0.148343E-09


 START OF TRANSIENT ANALYSIS ...

    t(s)=0.500000E-07, iteration number=  2, residual=0.130576E-10
    t(s)=0.100000E-06, iteration number=  2, residual=0.771753E-11
    t(s)=0.150000E-06, iteration number=  2, residual=0.134445E-10
    t(s)=0.200000E-06, iteration number=  2, residual=0.192544E-10
    t(s)=0.250000E-06, iteration number=  2, residual=0.250092E-10
    t(s)=0.300000E-06, iteration number=  2, residual=0.220895E-10
    t(s)=0.350000E-06, iteration number=  2, residual=0.258807E-10
    t(s)=0.400000E-06, iteration number=  2, residual=0.298706E-10
    t(s)=0.450000E-06, iteration number=  2, residual=0.336706E-10
    t(s)=0.500000E-06, iteration number=  2, residual=0.373555E-10
    t(s)=0.550000E-06, iteration number=  2, residual=0.408913E-10
    t(s)=0.600000E-06, iteration number=  2, residual=0.442343E-10
    t(s)=0.650000E-06, iteration number=  2, residual=0.473254E-10
    t(s)=0.700000E-06, iteration number=  2, residual=0.500976E-10
    t(s)=0.750000E-06, iteration number=  2, residual=0.524710E-10
    t(s)=0.800000E-06, iteration number=  2, residual=0.543585E-10
    t(s)=0.850000E-06, iteration number=  2, residual=0.556587E-10
    t(s)=0.900000E-06, iteration number=  2, residual=0.553717E-10
    t(s)=0.950000E-06, iteration number=  2, residual=0.551747E-10
    t(s)=0.100000E-05, iteration number=  2, residual=0.540844E-10

Dolfie · July 11, 2016, 11:12pm

Hi Mat,

thanks for the reply. Thats an interesting results, it should be that way actually. Unfortunately, this is not what I get here. I did copy the snippet of your code in the last post, but here is what I get:

warning: negative pressure occurred
 warning: negative pressure occurred
 warning: negative pressure occurred
 warning: negative pressure occurred
 warning: negative pressure occurred
 warning: negative pressure occurred
 warning: negative pressure occurred
  T(s)= 0.00000    ,Iteration=  5,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 10,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 15,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 20,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 25,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 30,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 35,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 40,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 45,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 50,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 55,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 60,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 65,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 70,Residual=         NaN
  T(s)= 0.00000    ,Iteration= 75,Residual=         NaN
^C
C:\Users\Dolf\Desktop\Dyn5d>

Did you change other parts of the code? I made tridag as sequential sub as you suggested. Shall I put it back as regular sub?

regards,

Dolf

MatColgrove · July 12, 2016, 2:04pm

I just applied this change plus the addition of updating the device copy of “nmax” to the source you sent me on June 30th. I can send you what I have so you can diff your version with mine.

Mat

Dolfie · July 18, 2016, 9:51pm

Thanks Mat, I got the code you sent me. I appreciate it. I have updated what you send me to make it work (when I am updating p matrix). Now it works perfect.
Now I have moved to the next parallel region as shown in code 1 (original). My edit is shown in code 2. I don’t know why I am getting a NaN, any ideas why? I tried to do what you already done in previous parallel region but with no luck.

Code 1:

! reyen_p4
        do i=2,nxm1
          im1=i-1
          ip1=i+1
          xmxp=(xref(ip1)-xref(im1))/2.d0
!
          aj(im1)=0.d0
          bj(im1)=0.d0
          cj(im1)=0.d0
          dj(im1)=0.d0
!
          do j=2,nym1
            jm1=j-1
            jp1=j+1
            ymyp=(yref(jp1)-yref(jm1))/2.d0
!
            aj(im1)=aj(im1)+aw(im1,jm1)
            bj(im1)=bj(im1)+ap(im1,jm1)
            if(j.ne.2) bj(im1)=bj(im1)+as(im1,jm1)
            if(j.ne.nym1) bj(im1)=bj(im1)+an(im1,jm1)
            cj(im1)=cj(im1)+ae(im1,jm1)
            dj(im1)=dj(im1)+ar(im1,jm1)-aw(im1,jm1)*p(im1,j)    &
                    -ae(im1,jm1)*p(ip1,j)-as(im1,jm1)*p(i,jm1)  &
                    -an(im1,jm1)*p(i,jp1)-ap(im1,jm1)*p(i  ,j)
!
          enddo
        enddo
! end reyneq_p4

Code 2:

!$acc kernels loop independent reduction(+:aj1,bj1,cj1,dj1) private(aj,bj,cj,dj)
        do i=2,nxm1
          im1=i-1
          ip1=i+1
          xmxp=(xref(ip1)-xref(im1))/2.d0
!
          aj(im1)=0.d0
          bj(im1)=0.d0
          cj(im1)=0.d0
          dj(im1)=0.d0
          aj1 = 0.d0
	  bj1 = 0.d0
	  cj1 = 0.d0
	  dj1 = 0.d0

!$acc loop vector independent reduction(+:aj1,bj1,cj1,dj1)
          do j=2,nym1
            jm1=j-1
            jp1=j+1
            ymyp=(yref(jp1)-yref(jm1))/2.d0
!
            aj1=aj1+aw(im1,jm1)
	    bj1=bj1+ap(im1,jm1)
            if(j.ne.2) bj1=bj1+as(im1,jm1)
            if(j.ne.nym1) bj1=bj1+an(im1,jm1)
	    cj1=cj1+ae(im1,jm1)
	    dj1=dj1+ar(im1,jm1)-aw(im1,jm1)*p(im1,j)    &
                    -ae(im1,jm1)*p(ip1,j)-as(im1,jm1)*p(i,jm1)  &
                    -an(im1,jm1)*p(i,jp1)-ap(im1,jm1)*p(i  ,j)
          enddo
          aj(im1)= aj1
          bj(im1)= bj1
          cj(im1)= cj1
          dj(im1)= dj1
        enddo

MatColgrove · July 19, 2016, 5:45pm

Hi Dolf,

You don’t want to privatize the aj, bj, cj, dj arrays here given you use the results after the compute region. Instead, you want to put these in a “copyout” clause.

Also, no need to have the reduction clause on the outer i loop given you’re not using these values after the compute region.

Mat

Dolfie · July 20, 2016, 11:00pm

Thanks Mat, I have corrected the problem and now it’s working flawlessly.
I have this subroutine, and I want to know which way I should make it run on GPU.

Question 1: since it has multiple nested do loops. what shall I do?

make it run on GPU by adding !$acc routine seq at the start of the subroutine?
go through each do loop and use !$acc parallel loop directives?

Question 2: what does !$acc routine seq do exactly?

Question 3: if I have data region (example code 1 below) and a GPU subroutine was called within that region, what should be the method used to run subroutine on GPU? Question 1.1 or 1.2? or it does not matter?

do we treat !$acc data like any other !$acc directive? in that case I have to use !$acc routine seq inside the subroutine (Q1.1).

Please advice,
thanks!
Dolf

here is the subroutine:

!================================================================
        subroutine acmdn(aw1,ae1,as1,an1,ap1,ar1,p1,aw2,ae2,as2,    &
          an2,ap2,ar2,dp2,nx1,ny1,nx2,ny2,nx1max,ny1max,nx2max,     &
          ny2max,nlimit)
!================================================================
        use Dyn_sizes_module
        implicit none

	real(8) :: AR1, AR2, DJ, CJ, AP2, AP1, BJ, AKN, AN2, RATIO, AKPRE, AKD, AS2, AS1, P1
	real(8) :: AK, AJ, AW2, AW1, AE1, AE2, DP2, DABS, AN1
	integer :: K1, NX2M2, NX2M1, NX1, NX2, I1, I2, KINT, NY1MAX, K, L, NY2M1, NY2M2 
	integer :: L1, NX1MAX, NLIMIT, NY2, NY1, J1, J2, NY2MAX, NX2MAX

        dimension aw1(nx1max,ny1max),ae1(nx1max,ny1max), &
                  as1(nx1max,ny1max),an1(nx1max,ny1max), &
                  ap1(nx1max,ny1max),ar1(nx1max,ny1max), &
                  aw2(nx2max,ny2max),ae2(nx2max,ny2max), &
                  as2(nx2max,ny2max),an2(nx2max,ny2max), &
                  ap2(nx2max,ny2max),ar2(nx2max,ny2max), &
                  p1(nx1max,ny1max), dp2(nx2max,ny2max)
        dimension aj(nxmax),bj(nxmax),cj(nxmax),dj(nxmax)

        nx2m1=nx2-1
        ny2m1=ny2-1
        nx2m2=nx2-2
        ny2m2=ny2-2

        do k=2,nx2m1
          k1=k-1
          do l=2,ny2m1
            l1=l-1
            i1=2*(k-1)-1
            i2=2*(k-1)+1-1
            j1=2*(l-1)-1
            j2=2*(l-1)+1-1
            ap2(k1,l1)=ap1(i1,j1)+ap1(i2,j1)+ap1(i1,j2)+ap1(i2,j2)+ &
                       ae1(i1,j1)+ae1(i1,j2)+aw1(i2,j1)+aw1(i2,j2)+ &
                       as1(i1,j2)+as1(i2,j2)+an1(i1,j1)+an1(i2,j1)
            aw2(k1,l1)=aw1(i1,j1)+aw1(i1,j2)
            ae2(k1,l1)=ae1(i2,j1)+ae1(i2,j2)
            as2(k1,l1)=as1(i1,j1)+as1(i2,j1)
            an2(k1,l1)=an1(i1,j2)+an1(i2,j2)
            ar2(k1,l1)=ar1(i1,j1)+ar1(i2,j1)+ar1(i1,j2)+ar1(i2,j2)- &
               ap1(i1,j1)*p1(i1+1,j1+1)-ap1(i2,j1)*p1(i2+1,j1+1)-   &
               ap1(i1,j2)*p1(i1+1,j2+1)-ap1(i2,j2)*p1(i2+1,j2+1)-   &
               aw1(i1,j1)*p1(i1  ,j1+1)-aw1(i1,j2)*p1(i1  ,j2+1)-   &
               ae1(i1,j1)*p1(i2+1,j1+1)-ae1(i1,j2)*p1(i2+1,j2+1)-   &
               as1(i1,j1)*p1(i1+1,j1  )-as1(i1,j2)*p1(i1+1,j1+1)-   &
               an1(i1,j1)*p1(i1+1,j2+1)-an1(i1,j2)*p1(i1+1,j2+2)-   &
               aw1(i2,j1)*p1(i1+1,j1+1)-aw1(i2,j2)*p1(i1+1,j2+1)-   &
               ae1(i2,j1)*p1(i2+2,j1+1)-ae1(i2,j2)*p1(i2+2,j2+1)-   &
               as1(i2,j1)*p1(i2+1,j1  )-as1(i2,j2)*p1(i2+1,j1+1)-   &
               an1(i2,j1)*p1(i2+1,j2+1)-an1(i2,j2)*p1(i2+1,j2+2)
          end do
        end do
!
        do k=1,nx2
          do l=1,ny2
            dp2(k,l)=0.d0
          end do
        end do
!
        kint=0
50      kint=kint+1

        do 200 l=2,ny2m1
          l1=l-1
          do 100 k=2,nx2m1
            k1=k-1
            aj(k1)=aw2(k1,l1)
            bj(k1)=ap2(k1,l1)
            cj(k1)=ae2(k1,l1)
            dj(k1)=-as2(k1,l1)*dp2(k,l1)-an2(k1,l1)*dp2(k,l+1) + ar2(k1,l1)
100       continue
          aj(1)=0.d0
          cj(nx2m2)=0.d0
          call tridag(nx2m2,aj,bj,cj,dj)
          do 150 k=1,nx2m2
            dp2(k+1,l)=dj(k)
150       continue
200     continue

        do k=2,nx2m1
          k1=k-1
!
	  aj(k1)=0.d0
	  bj(k1)=0.d0
	  cj(k1)=0.d0
	  dj(k1)=0.d0
!
          do l=2,ny2m1
            l1=l-1
            aj(k1)=aj(k1)+aw2(k1,l1)
            bj(k1)=bj(k1)+ap2(k1,l1)
	    if(l.ne.2) bj(k1)=bj(k1)+as2(k1,l1)
	    if(l.ne.ny2m1) bj(k1)=bj(k1)+an2(k1,l1)
            cj(k1)=cj(k1)+ae2(k1,l1)
            dj(k1)=dj(k1)+ar2(k1,l1)-aw2(k1,l1)*dp2(k1,l)       &
                   -ae2(k1,l1)*dp2(k+1,l)-as2(k1,l1)*dp2(k,l1)  &
                   -an2(k1,l1)*dp2(k,l+1)-ap2(k1,l1)*dp2(k,l)
	  enddo
        enddo
        aj(1)=0.d0
        cj(nx2m2)=0.d0
        call tridag(nx2m2,aj,bj,cj,dj)
        do k=1,nx2m2
          do l=1,ny2m2
            dp2(k+1,l+1)=dp2(k+1,l+1)+dj(k)
	  enddo
        enddo

        do 400 k=2,nx2m1
          k1=k-1
          do 300 l=2,ny2m1
            l1=l-1
            aj(l1)=as2(k1,l1)
            bj(l1)=ap2(k1,l1)
            cj(l1)=an2(k1,l1)
            dj(l1)=-aw2(k1,l1)*dp2(k1,l)-ae2(k1,l1)*dp2(k+1,l) + ar2(k1,l1)
300       continue
          aj(1)=0.d0
          cj(ny2m2)=0.d0
          call tridag(ny2m2,aj,bj,cj,dj)
          do 350 l=1,ny2m2
            dp2(k,l+1)=dj(l)
350       continue
400     continue

        do l=2,ny2m1
          l1=l-1
!
	  aj(l1)=0.d0
	  bj(l1)=0.d0
	  cj(l1)=0.d0
	  dj(l1)=0.d0
!
          do k=2,nx2m1
            k1=k-1
            aj(l1)=aj(l1)+as2(k1,l1)
            bj(l1)=bj(l1)+ap2(k1,l1)
	    if(k.ne.2) bj(l1)=bj(l1)+aw2(k1,l1)
	    if(k.ne.nx2m1) bj(l1)=bj(l1)+ae2(k1,l1)
            cj(l1)=cj(l1)+an2(k1,l1)
            dj(l1)=dj(l1)+ar2(k1,l1)-aw2(k1,l1)*dp2(k1,l)       &
                   -ae2(k1,l1)*dp2(k+1,l)-as2(k1,l1)*dp2(k,l1)  &
                   -an2(k1,l1)*dp2(k,l+1)-ap2(k1,l1)*dp2(k,l)
	  enddo
        enddo
        aj(1)=0.d0
        cj(ny2m2)=0.d0
        call tridag(ny2m2,aj,bj,cj,dj)
        do l=1,ny2m2
	  do k=1,nx2m2
            dp2(k+1,l+1)=dp2(k+1,l+1)+dj(l)
          enddo
        enddo
!
        akd=0.d0
        akn=0.d0
        do k=2,nx2m1
          k1=k-1
          do l=2,ny2m1
            l1=l-1
            aj(l1)=as2(k1,l1)
            bj(l1)=ap2(k1,l1)
            cj(l1)=an2(k1,l1)
            dj(l1)=-aw2(k1,l1)*dp2(k1,l)-ae2(k1,l1)*dp2(k+1,l) + ar2(k1,l1)
            akd=akd+dabs(aj(l1)*dp2(k,l1)+bj(l1)*dp2(k,l) + cj(l1)*dp2(k,l+1)-dj(l1))
            akn=akn+dabs(bj(l1)*dp2(k,l))
          enddo
        enddo
!
        ak=akd/akn
        if(kint.eq.1) then
          akpre=ak
          goto 50
        end if

        ratio=ak/akpre
        akpre=ak

        return
        end

!=======================================================
    subroutine tridag(n,a,b,c,d)
!=======================================================                                                                              
!....invert a tri-diagonal matrix                             
 
    use Dyn_sizes_module                                                                     
    implicit real*8(a-h,o-z)
!$ACC routine seq
    dimension a(1),b(1),c(1),d(1)
    dimension beta(402),gamma(402)
    
    beta(1)=b(1)
    gamma(1)=d(1)/beta(1)
    do 10 i=2,n
            im1=i-1
            beta(i)=b(i)-a(i)*c(im1)/beta(im1)
            gamma(i)=(d(i)-a(i)*gamma(im1))/beta(i)
10  continue
    d(n)=gamma(n)
    do 20 i=n-1,1,-1
        d(i)=gamma(i)-c(i)*d(i+1)/beta(i)
20  continue
                                                                          
    return
    end

!$acc data copyin(p,aw,ae......) &
!$acc copyout(p)
call acmdn(........)
!$acc end data

MatColgrove · July 21, 2016, 9:07pm

Question 2: what does !$acc routine seq do exactly?

It tells the compiler to generate device code for this routine that is callable from other device code. In other words, it allows you to call functions and subroutines from “parallel” and “kernel” regions. These articles may help:

Question 1: since it has multiple nested do loops. what shall I do?

make it run on GPU by adding !$acc routine seq at the start of the subroutine?

go through each do loop and use !$acc parallel loop directives?

I presume you’re talking about adding routine to “tridag” not “acmdn”. In this case you have a choice of using “seq” or “vector”. “seq” says to run the code within routine sequentially and will be executed by all vectors if called from a vector loop. If you use “routine vector”, then the loops within “tridag” can be parallelized across vectors.

The choice mostly depends on when it’s being called. So if in acmdn you schedule the 200 loop using “gang”, then you most likely want “tridag” to be a “vector” routine. If the 200 loop uses “gang vector”, then you’ll need to make it a “seq” routine.

If you use “seq”, no “loop” directives should be added. For “vector”, you’ll need to specific which loop(s) to schedule. Don’t use “parallel loop” since this will define a new parallel region (nested parallellism) which isn’t what you want.

For example:

  !======================================================= 
    subroutine tridag(n,a,b,c,d) 
!=======================================================                                                                              
!....invert a tri-diagonal matrix                              
  
    use Dyn_sizes_module                                                                      
    implicit real*8(a-h,o-z) 
!$ACC routine vector
    dimension a(1),b(1),c(1),d(1) 
    dimension beta(402),gamma(402) 
    
    beta(1)=b(1) 
    gamma(1)=d(1)/beta(1) 
!$acc loop vector
    do 10 i=2,n 
            im1=i-1 
            beta(i)=b(i)-a(i)*c(im1)/beta(im1) 
            gamma(i)=(d(i)-a(i)*gamma(im1))/beta(i) 
10  continue 
    d(n)=gamma(n) 
!$acc loop vector
    do 20 i=n-1,1,-1 
        d(i)=gamma(i)-c(i)*d(i+1)/beta(i) 
20  continue 
                                                                          
    return 
    end

do we treat !$acc data like any other !$acc directive? in that case I have to use !$acc routine seq inside the subroutine (Q1.1).

Data regions can’t be used inside a device routine since data synchronization can only be initiated from the host. Data is implied to be present within a device routine.

Note that static global data must be placed in a “declare create” directive. Hence if you are using any data from the “Dyn_sizes_module” module, then this module data must be put within a “declare create” to create the device reference to the data and the synchronized at runtime via an “update” directive. Though I believe we did this already in one of your earlier posts so it may be already ported.

Mat

Topic		Replies	Views
Using OpenACC with GeForce 680 Legacy PGI Compilers	23	15187	January 28, 2016
Questions on incorrect results with openacc in GPU nvc, nvc++ and nvfortran	33	2482	December 4, 2023
Call in OpenACC region to procedure 'pgf90_copy_f90_argl' Legacy PGI Compilers	10	11404	July 5, 2017
Clarification on using OpenACC in a shared library Legacy PGI Compilers	27	4540	December 9, 2020
What is the issue of different values between running the code in serial and run it using OpenACC? Legacy PGI Compilers	15	1523	December 4, 2020
Need advices for optimizing heart of CFD code Legacy PGI Compilers	11	7067	July 13, 2016
OpenACC: Best way to parallelize nested DO loops (continued) nvc, nvc++ and nvfortran	22	1701	March 28, 2023
matrix reduction using cuda fortran and GPU Legacy PGI Compilers	33	13513	December 21, 2012
Dealing with allocatable arrays with OpenACC Legacy PGI Compilers	8	1891	November 30, 2020
Error and huge slowdown from !$acc region Legacy PGI Compilers	4	2954	March 26, 2012

OpenACC directives and GPU

Related topics