Fortran code not compiling for GPU

I have recently installed PGI Community Edition in hopes of running a fairly simple explicit finite difference scheme for a type of heat equation written in Fortran. I’ve compiled it using

pgfortran -acc -fast filename.f90

and

pgfortran -acc -Minfo filename.f90

which gave me the impression that it was automatically compiled for the GPU, but that doesn’t seems to be the case - when I tried running the .exe file on PGI Profiler there was no trace of GPU activity. Likewise, running nvidia-smi during execution showed the GPU to be idling, so I assume it’s because my source code didn’t include any calls for CUDA. This may be obvious to the more experienced user but is a bit baffling to me as I couldn’t find much info on how to make the appropriate calls in the source code. I can post a stripped-down version of it if it’s of any help.

Here is the skeleton of the code. The current !$ACC statement I think was taken from one of the examples in the documentation but it doesn’t work here as I get the warning

Parallelization would require privatization of array cz



PROGRAM DIFFUSION
      
IMPLICIT NONE


!**********************************************************************
INTEGER ::  I,J,K,NT
INTEGER, PARAMETER :: NX = 500, NY = 100
DOUBLE PRECISION :: INIC,BC,D,T
DOUBLE PRECISION :: DX,DT
DOUBLE PRECISION :: TMAX,XMAX,YMAX
DOUBLE PRECISION :: C(NX,NY)
DOUBLE PRECISION :: CZ(NX,NY) !PLACEHOLDER FOR PARALLELISATION


D = 1.D-5
DX = 1.D-4
INIC = 300 ! INITIAL VALUE
XMAX = NX*DX
YMAX = NY*DX
TMAX = 5E5 ! TIME
BC = 600
T = 300

!**********************************************************************
!     DEFINE DT
DT = 0.5*DX*DX/D
NT = TMAX/DT

C = INIC
CZ = 0
! ***************************************************************
!     TIME LOOP
!$ACC KERNELS LOOP COPYIN(D,DT,DX,C(1:NX,1:NY)) COPYOUT(CZ(1:NX,1:NY))

DO K=1, NT

    CZ(2:NX-1,2:NY-1) = C(2:NX-1,2:NY-1)+0.5*((D*DT)/(DX*DX))* &
             ((C(1:NX-2,2:NY-1)-2.D0*C(2:NX-1,2:NY-1)+C(3:NX,2:NY-1))+ &
             (C(2:NX-1,1:NY-2)-2.D0*C(2:NX-1,2:NY-1)+C(2:NX-1,3:NY)))
    
    C(2:NX-1,2:NY-1) = CZ(2:NX-1,2:NY-1)    
ENDDO

END

Hi aturk,

The compiler is correct in that the outer “K” loop can’t be parallelized. Each iteration computation depends on the results from the previous iteration.

The compiler is correctly parallelizing the inner implicit loops from the array syntax and does run to completion, though takes awhile given the size of NT (1,000,000,000).

% pgfortran -acc -Minfo=accel -ta=tesla:cc60 test.f90 -V17.4
diffusion:
     36, Generating copyout(cz(:,:))
         Generating copyin(d,dx,dt,c(:,:))
     38, Loop carried dependence due to exposed use of c(2:499,:),c(:498,2:99),c(3:,2:99) prevents parallelization
         Parallelization would require privatization of array cz(2:499,i2+2)
         Sequential loop scheduled on host
     40, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         40, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     44, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         44, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x

While your example runs correctly, it will give incorrect answers since your only copying “C” to the device. You should change “copyin” to “copy” so “C” is both copied to and copied from the device.

Since “CZ” is only used on the device, you can use a “create” clause so the data is not copied. If you do need CZ’s values back on the host keep in mind that since only the interior of the array is set, copying back the entire array will give you garbage values in the halo. In this case I’d suggest to also putting CZ in a “copy” clause so the halo values are initialized on the device and thus contain valid values when copied back to the host.

Note for convenience, I set NT=10. I also modified the directives so that the kernels directives are around the inner implicit loops and used a data directive to manage the data. Note that you do not need to copyin the scalars in this case since scalars are firstprivate by default.

% cat test.f90
PROGRAM DIFFUSION

 IMPLICIT NONE


 !**********************************************************************
 INTEGER ::  I,J,K,NT
 INTEGER, PARAMETER :: NX = 500, NY = 100
 DOUBLE PRECISION :: INIC,BC,D,T
 DOUBLE PRECISION :: DX,DT
 DOUBLE PRECISION :: TMAX,XMAX,YMAX
 DOUBLE PRECISION :: C(NX,NY)
 DOUBLE PRECISION :: CZ(NX,NY) !PLACEHOLDER FOR PARALLELISATION


 D = 1.D-5
 DX = 1.D-4
 INIC = 300 ! INITIAL VALUE
 XMAX = NX*DX
 YMAX = NY*DX
 TMAX = 5E5 ! TIME
 BC = 600
 T = 300

 !**********************************************************************
 !     DEFINE DT
 DT = 0.5*DX*DX/D
 NT = TMAX/DT

 C = INIC
 CZ = 0
 print *, NT
  NT=10
 ! ***************************************************************
 !     TIME LOOP
 !$ACC DATA COPY(C(1:NX,1:NY)) CREATE(CZ(1:NX,1:NY))
 DO K=1, NT
!$ACC KERNELS
     CZ(2:NX-1,2:NY-1) = C(2:NX-1,2:NY-1)+0.5*((D*DT)/(DX*DX))* &
        ((C(1:NX-2,2:NY-1)-2.D0*C(2:NX-1,2:NY-1)+C(3:NX,2:NY-1))+ &
        (C(2:NX-1,1:NY-2)-2.D0*C(2:NX-1,2:NY-1)+C(2:NX-1,3:NY)))
     C(2:NX-1,2:NY-1) = CZ(2:NX-1,2:NY-1)
!$ACC END KERNELS
 ENDDO
 !$ACC END DATA

 print *, C(2:5,2:5)
 END
% export PGI_ACC_TIME=1
% pgfortran -acc -Minfo=accel -ta=tesla:cc60 test.f90 -V17.4; a.out
diffusion:
     36, Generating copy(c(:,:))
         Generating create(cz(:,:))
     39, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         39, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     42, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         42, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
   1000000000

    300.0000000000000         300.0000000000000         300.0000000000000
    300.0000000000000         300.0000000000000         300.0000000000000
    300.0000000000000         300.0000000000000         300.0000000000000
    300.0000000000000         300.0000000000000         300.0000000000000
    300.0000000000000         300.0000000000000         300.0000000000000
    300.0000000000000

Accelerator Kernel Timing data
test.f90
  diffusion  NVIDIA  devicenum=0
    time(us): 249
    36: data region reached 2 times
        36: data copyin transfers: 2
             device time(us): total=31 max=18 min=13 avg=15
        45: data copyout transfers: 1
             device time(us): total=218 max=218 min=218 avg=218
    38: compute region reached 10 times
        39: kernel launched 10 times
            grid: [16x25]  block: [32x4]
            elapsed time(us): total=2,929 max=317 min=290 avg=292
        42: kernel launched 10 times
            grid: [16x25]  block: [32x4]
            elapsed time(us): total=2,924 max=305 min=290 avg=292

Hope this helps,
Mat

Thanks for the quick help, much appreciated. Shortly after I’d posted I actually found a pdf where directives were thoroughly explained and together with your post things seem much clearer now. I was a bit confused yesterday because I wasn’t using them properly and I tried putting !$ACC KERNELS LOOP … in front of the vectorised expression for CZ so that was obviously throwing errors. I do have a few questions though:

  • Does compiling with pgfort -acc produce a GPU-ready .exe even without using !$ACC directives? I didn’t see any GPU activity in the profiler but I’m not sure if it was rather due to some driver issues that I only discovered a few hours ago and have since been fixed.

I don’t really need the variable CZ, it was just meant to be a placeholder for the new values of C in each time loop iteration, although this may be redundant. I only really need to save the matrix C into a file every n-th iteration (currently doing this with an if clause within the loop). Are there any rules of thumb what’s more efficient, assignments such b = 2a, a = b or simply a = 2a?

Are WHERE statements computationally intensive? There are regions in the domain where special rules apply so I’m using masks to treat them separately.

In the full source file I have another while loop following the one posted above - I mistakenly thought that since CUDA didn’t allow while loops that the time loop should also be made into a for even though it’s not parallelisable. Since while loops in time are OK I’m just curious if there is any performance difference between using an infinite DO loop containing a breaking condition (IF (…) THEN EXIT) and a regular WHILE loop.

As a follow-up to my previous post: the answer to the first question is a no (I can see that now that the profiler is working). Speaking of the profiler, I noticed that profiling reveals a poor use of resources, so perhaps there is room for improvement, but I don’t know how to go about it. I’ve currently got a GeForce GTX 970M, is that not good enough to benefit significantly from parallelism on grids of 10^5 - 10^7 points? I also tried to replace all variables with single precision ones, but only managed to halve the execution time compared to double. The GPU version is currently only about 3 times faster than the CPU one (or double that if running on single precision). The processor is an i7-4710HQ if that’s of any help.

Does compiling with pgfort -acc produce a GPU-ready .exe even without using !$ACC directives? I didn’t see any GPU activity in the profiler but I’m not sure if it was rather due to some driver issues that I only discovered a few hours ago and have since been fixed.

No. “-acc” simply enables OpenACC directives. The compiler will not attempt to automatically offload code to the GPU. Note that the target accelerator flag (-ta), implies “-acc” so using “-ta=tesla” is the same as “-acc -ta=tesla”. Also using “-acc” by itself, implies “-ta=tesla,host”.

I don’t really need the variable CZ, it was just meant to be a placeholder for the new values of C in each time loop iteration, although this may be redundant. I only really need to save the matrix C into a file every n-th iteration (currently doing this with an if clause within the loop). Are there any rules of thumb what’s more efficient, assignments such b = 2a, a = b or simply a = 2a?

Then you need only create “CZ” on the device. No need to copy it.

a=2*a is preferred since it will save the extra memory of b and not require the extra copy. But this assumes that you are using the same element of a on both sides of the equation. In the case above you will need to use CZ since you are using multiple elements of C thus causing a dependency.

What I’d suggest when outputting to the file would be to use the OpenACC update clause to synchronize the host and device data. Something like:

 !$ACC DATA COPY(C(1:NX,1:NY)) CREATE(CZ(1:NX,1:NY)) 
 DO K=1, NT 
!$ACC KERNELS 
     CZ(2:NX-1,2:NY-1) = C(2:NX-1,2:NY-1)+0.5*((D*DT)/(DX*DX))* & 
        ((C(1:NX-2,2:NY-1)-2.D0*C(2:NX-1,2:NY-1)+C(3:NX,2:NY-1))+ & 
        (C(2:NX-1,1:NY-2)-2.D0*C(2:NX-1,2:NY-1)+C(2:NX-1,3:NY))) 
     C(2:NX-1,2:NY-1) = CZ(2:NX-1,2:NY-1) 
!$ACC END KERNELS 
    IF ( ... conditional to output to file is true ...
!$ACC UPDATE SELF (C(1:NX,1:NY)
      ... code to output to file ...
    THEN
 ENDDO 
 !$ACC END DATA



Are WHERE statements computationally intensive? There are regions in the domain where special rules apply so I’m using masks to treat them separately.

That would depend on what operation you are doing in the WHERE statement. WHERE’s just become DO loop with a conditional and can be parallelized assuming no dependencies.

In the full source file I have another while loop following the one posted above - I mistakenly thought that since CUDA didn’t allow while loops that the time loop should also be made into a for even though it’s not parallelisable. Since while loops in time are OK I’m just curious if there is any performance difference between using an infinite DO loop containing a breaking condition (IF (…) THEN EXIT) and a regular WHILE loop.

Loops can’t be parallelized if they branch out of the loop. You can offload the code to the GPU, but it will be run sequentially. The decision becomes if it is more expensive to copy the data between the host and device, and do the computation on the host, or more expensive to run the computation sequentially on the device.

Hope this helps,
Mat

Could you provide some insight on this? I can post profiler results if needed.

Loops can’t be parallelized if they branch out of the loop. You can offload the code to the GPU, but it will be run sequentially. The decision becomes if it is more expensive to copy the data between the host and device, and do the computation on the host, or more expensive to run the computation sequentially on the device.

Does that mean either is undesirable since the compiler doesn’t know how many iterations the loop will have? The reason I’m using a WHILE loop is because I’m adjusting the timestep in every iteration and I have to reach a certain maximum time. I could also precalculate the number of steps needed to reach the final time and turn the whole thing into a DO loop if that’s more efficient.

That would depend on what operation you are doing in the WHERE statement. WHERE’s just become DO loop with a conditional and can be parallelized assuming no dependencies.

It should be, it looks something like

WHERE(MASK.EQ.1)
 FLUX = -(D(1:NX-1,1:NY) + D(2:NX,1:NY))/2.0*( &
		       CL(1:NX-1,1:NY)*K - CL(2:NX,1:NY))
...
END WHERE

Also, I’m using other arrays (fluxes and some parameters such as D) in the loop based on which I then calculate the new values of C. I suppose they should be treated the same as the array CZ because they are internal to the loop and don’t need to be copied back to the host? I assume I should be using something like !$ACC DATA COPY© CREATE(CZ,D,FLUX …) to tag all variables that only need to reside on the device. Speaking of handling arrays - I noticed that wrapping every array assignment in its own !$ACC KERNELS clause seems to give a slight performance boost compared to using a single clause spanning several assignments.

Could you provide some insight on this? I can post profiler results if needed.

Not sure posting the profile would help here. What are resources that are poorly utilized? Memory? Compute units?

Correct, there are fewer double precision floating point units on the GeForce products than Tesla so the 2x speed-up using single precision is about what I’d expect.

Does that mean either is undesirable since the compiler doesn’t know how many iterations the loop will have?

Correct. In order to parallelize the loop, it needs to be countable, i.e. the number of iterations is known when the GPU kernel is launched.

I could also precalculate the number of steps needed to reach the final time and turn the whole thing into a DO loop if that’s more efficient.

That should allow you to parallelize the loop.

It should be, it looks something like

That should be fine, though I know we have a few open bug reports from users that are using WHERE with mask arrays. If it doesn’t work as expected, post or send a reproducing example to PGI Customer Service (trs@pgroup.com). The compiler will need to implicitly copy the mask array to the GPU in this case, but you could also manage it via the data clauses, and even build the mask on the GPU so it doesn’t need to be copied.

Also, I’m using other arrays (fluxes and some parameters such as D) in the loop based on which I then calculate the new values of C. I suppose they should be treated the same as the array CZ because they are internal to the loop and don’t need to be copied back to the host?

Yes, if the arrays are only used on the device, then put them into a create clause.

-Mat

Quite a few things according to the profiler:

  • Low memcpy/compute overlap
  • Low memcpy throughput
  • Low memcpy overlap
  • Low kernel concurrency
  • Low compute utilisation

Looks like a lot of cores are idling and/or not running in parallel. Is there something seriously wrong with the way the code is written or could it be some other, possibly driver-related issue? At first I thought it might be because of the boundary conditions and WHERE statements but even the most basic accelerated loop with two simple assignments (like the one you posted in one of your first replies) had the same problems.


But I don’t think the outer loop, the time loop, is parallelisable as the values of C depend on those calculated in the previous iteration. The question is then whether it makes any difference if this loop is a WHILE or a DO loop.

Another problem that has cropped up is that the full code that includes heat fluxes and special values thereof assigned in parts of the domain either does not update at all or produces NaN values when compiled for the GPU. It works just fine when compiled for the CPU though, which makes me think it might be some sort of data transfer issue.

The way it is written is

!$ACC DATA COPY(CL) &
!$ACC CREATE(CZ,FLUXX,FLUXY,D) &
!$ACC CREATE(MASKFLUXX,DX,DT, ...) ! includes masks and parameters


DO K=1, NT

! calculate fluxes in x and y, apply special values:
    !$ACC KERNELS
    FLUXX(1:NX-1,2:NY-1) =  -D*(CL(1:NX-1,2:NY-1) - CL(2:NX,2:NY-1))/DX
    FLUXY(2:NX-1,1:NY-1) =  -D*(CL(2:NX-1,1:NY-1) - CL(2:NX-1,2:NY))/DX
    !$ACC END KERNELS

    !$ACC KERNELS
    WHERE(MASKFLUXX .EQ. 1) 
       FLUXX = -D*(CL(1:NX-1,2:NY-1) - CL(2:NX,2:NY-1))/DX*... 
    END WHERE
    !$ACC END KERNELS

    !$ACC KERNELS
    WHERE(MASKFLUXY .EQ. 1)
       FLUXY(2:NX-1,1:NY-1) =  -D*(CL(2:NX-1,1:NY-1) - CL(2:NX-1,2:NY))/DX* ...
    END WHERE
    !$ACC END KERNELS


! calculate CZ values, copy them into CL which can then be copied back from the GPU
    !$ACC KERNELS
    CZ(2:NX-1,2:NY-1) = CL(2:NX-1,2:NY-1) + 0.5*(DT/DX)* &
                       (FLUXX(2:NX-1,2:NY-1) - FLUXX(1:NX-2,2:NY-1) + &
                        FLUXY(2:NX-1,2:NY-1) - FLUXY(2:NX-1,1:NY-2))
    !$ACC END KERNELS

    !$ACC KERNELS
    CL(2:NX-1,2:NY-1) = CZ(2:NX-1,2:NY-1)
    !$ACC END KERNELS

! export data
    IF (<condition>) THEN
    !$ACC UPDATE SELF(CL(1:NX,1:NY))
        WRITE(*,*) &
        WRITE(50,<format>) CL(1:NX,1:NY)
     END IF
END DO
!$ACC END DATA

The first thing that I see is that you’re using “create” for several of the arrays and scalars. Create only creates memory for the data on the device but the data itself is uninitialized. You either need to change these to “copy”, “copyout” or “copyin”, or add an “update” directive to synchronize the data.

-Mat

Thanks, that solved it. I should’ve read the documentation regarding data handling more carefully. Any thoughts on why the profiler says the GPU resources are poorly utilised?

Quite a few things according to the profiler:

  • Low memcpy/compute overlap
  • Low memcpy throughput
  • Low memcpy overlap
  • Low kernel concurrency
  • Low compute utilisation

Looks like a lot of cores are idling and/or not running in parallel. Is there something seriously wrong with the way the code is written or could it be some other, possibly driver-related issue? At first I thought it might be because of the boundary conditions and WHERE statements but even the most basic accelerated loop with two simple assignments (like the one you posted in one of your first replies) had the same problems.