openacc routine function efficiency

I am working on a legacy Fortran code where the loops over the (i,j) dimensions can be parallelized across gangs, and the work in the k direction is done inside a function. If I label these functions as “acc routine vector” and parallelize the (i,j) loops, the code does not seem to be efficient. As a result, I find myself constantly moving the outer loops into the function and modifying the indices to work directly with 3D arrays. This is very cumbersome, because I have to replace the one-dimensional indexing in the column function with 3D indexing, and it also makes the OpenACC code hard to upgrade when new versions are released. I would like to use “acc routine vector” if there were no performance penalty.

I demonstrate this with an example below.

#define NX 201
#define NY 201
#define NZ 21

module mymod

contains

   subroutine set(I,J,U)
!$acc routine vector
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
        integer :: I, J, K
        real :: x(NZ)

        DO K = 1, 21 
            x(K) = 3.0 * K * I + J;
        ENDDO

        DO K = 1, 21 
            U(I,K,J)=x(K) * x(K)
        ENDDO
   end subroutine

   subroutine test1(U)
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
!$acc declare present(U)
        integer :: I, J, K
        real :: x(NZ)

!$acc parallel loop collapse(2) private(x)
        DO J = 1,NY 
            DO I = 1,NX

                DO K = 1, 21
                    x(K) = 3.0 * K * I + J;
                ENDDO

                DO K = 1, 21
                    U(I,K,J) = x(K) * x(K)
                ENDDO

            ENDDO
        ENDDO

   end subroutine

   subroutine test2(U)
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
!$acc declare present(U)
        integer :: I, J, K
!$acc parallel loop collapse(2)
        DO J = 1,NY
            DO I = 1,NX
                call set(I,J,U)
            ENDDO
        ENDDO
   end subroutine

end module 

program main
   use mymod 
   real, dimension(1:NX,1:NZ,1:NY):: U

!$acc declare create(U)
   call test1(U)
!$acc update host(U)
   print*, U(1,1,1)

   call test2(U)
!$acc update host(U)
   print*, U(1,1,1)
end program

Compilation:

[04:47 dabdi@hsw213 exp] > pgf90 -acc -Minfo=accel -ta=host,tesla,cc60,cuda9.0 ss.F90
set:
      9, Generating Tesla code
         15, !$acc loop vector ! threadidx%x
         19, !$acc loop vector ! threadidx%x
     15, Loop is parallelizable
     19, Loop is parallelizable
test1:
     26, Generating present(u(:,:,:))
     30, Accelerator kernel generated
         Generating Tesla code
         31, !$acc loop gang collapse(2) ! blockidx%x
         32,   ! blockidx%x collapsed
         34, !$acc loop vector(128) ! threadidx%x
         38, !$acc loop vector(128) ! threadidx%x
     30, CUDA shared memory used for x
     34, Loop is parallelizable
     38, Loop is parallelizable
test2:
     49, Generating present(u(:,:,:))
     51, Accelerator kernel generated
         Generating Tesla code
         52, !$acc loop gang collapse(2) ! blockidx%x
         53,   ! blockidx%x collapsed
main:
     64, Generating create(u(:,:,:))
     66, Generating update self(u(:,:,:))
     70, Generating update self(u(:,:,:))

Profile output:

[04:47 dabdi@hsw213 exp] > nvprof ./a.out
==62007== NVPROF is profiling process 62007, command: ./a.out
    16.00000    
    16.00000    
==62007== Profiling application: ./a.out
==62007== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   94.70%  10.883ms         1  10.883ms  10.883ms  10.883ms  test2_51_gpu
                    4.50%  517.40us         2  258.70us  257.53us  259.87us  [CUDA memcpy DtoH]
                    0.80%  91.871us         1  91.871us  91.871us  91.871us  test1_30_gpu

test1() and test2() are identical except that test2 uses the “acc routine vector” approach to parallelize. The routine-based version consumes about 95% of the GPU time, while test1(), which manually inlines the column function (set), consumes only about 1%. Why? Is the compiler not able to parallelize as efficiently with “acc routine vector”?

Edit: I noticed that if I remove the private array X in both tests and directly set U=2.0, they consume more or less the same amount of time (about 11% each in the profile). Here is the profile after removing X:

[04:53 dabdi@hsw213 exp] > nvprof ./a.out
==62545== NVPROF is profiling process 62545, command: ./a.out
    2.000000    
    2.000000    
==62545== Profiling application: ./a.out
==62545== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   77.50%  527.48us         2  263.74us  258.01us  269.47us  [CUDA memcpy DtoH]
                   11.26%  76.607us         1  76.607us  76.607us  76.607us  test2_51_gpu
                   11.24%  76.511us         1  76.511us  76.511us  76.511us  test1_30_gpu
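
For reference, the simplified kernels I used for this second profile look roughly like this (the directives and loop structure are otherwise unchanged; only x is removed and U is set directly):

   subroutine set(I,J,U)
!$acc routine vector
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
        integer :: I, J, K

        DO K = 1, 21
            U(I,K,J) = 2.0
        ENDDO
   end subroutine

and the corresponding loop nest in test1:

!$acc parallel loop collapse(2)
        DO J = 1,NY
            DO I = 1,NX
                DO K = 1, 21
                    U(I,K,J) = 2.0
                ENDDO
            ENDDO
        ENDDO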

Daniel

Hi Daniel,

One of the main issues here is that “X” needs to be malloc’d, which is very slow on the device. This is why the performance gets better when you don’t use it. Another workaround is to inline the routine (i.e., add the -Minline flag), in which case the malloc can be optimized away.
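
For example, your original compile line with inlining enabled would be something like this:

% pgf90 -acc -Minline -Minfo=accel -ta=host,tesla,cc60,cuda9.0 ss.F90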

Though, given that the loop trip count of the “K” loops is only 21, you’re severely underutilizing the GPU. With a vector length of 128, 107 threads are doing no work. For this case, I’d recommend making the “K” loops and the “set” routine sequential, and scheduling the outer loops “gang vector”.

Here are the original timings on a V100 device:

  GPU activities:   94.31%  11.096ms         1  11.096ms  11.096ms  11.096ms  test2_51_gpu
                    5.08%  597.66us         2  298.83us  293.37us  304.29us  [CUDA memcpy DtoH]
                    0.61%  71.904us         1  71.904us  71.904us  71.904us  test1_30_gpu

Original code with “-Minline”:

 GPU activities:   78.40%  517.34us         2  258.67us  258.37us  258.97us  [CUDA memcpy DtoH]
                   10.82%  71.391us         1  71.391us  71.391us  71.391us  main_66_gpu
                   10.78%  71.136us         1  71.136us  71.136us  71.136us  main_70_gpu

Here’s a diff between your original version and my modified version:

% diff test7218.F90 test7218.a.F90
10c10
< !$acc routine vector
---
> !$acc routine seq
30c30
< !$acc parallel loop collapse(2) private(x)
---
> !$acc parallel loop gang vector collapse(2) private(x)
51c51
< !$acc parallel loop collapse(2)
---
> !$acc parallel loop gang vector collapse(2)

And this new version is over 10 times as fast as your version:

 GPU activities:   97.75%  541.63us         2  270.81us  258.30us  283.33us  [CUDA memcpy DtoH]
                    1.20%  6.6240us         1  6.6240us  6.6240us  6.6240us  test1_30_gpu
                    1.06%  5.8560us         1  5.8560us  5.8560us  5.8560us  test2_51_gpu
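
For reference, with that diff applied, the routine-based path looks roughly like this (test1 only gets “gang vector” added to its parallel loop directive):

   subroutine set(I,J,U)
!$acc routine seq
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
        integer :: I, J, K
        real :: x(NZ)

        DO K = 1, 21
            x(K) = 3.0 * K * I + J
        ENDDO

        DO K = 1, 21
            U(I,K,J) = x(K) * x(K)
        ENDDO
   end subroutine

   subroutine test2(U)
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
!$acc declare present(U)
        integer :: I, J, K
!$acc parallel loop gang vector collapse(2)
        DO J = 1,NY
            DO I = 1,NX
                call set(I,J,U)
            ENDDO
        ENDDO
   end subroutine

With “routine seq”, each thread of the collapsed gang-vector loop handles one (i,j) column, so the K loops inside set simply run sequentially within that thread.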

Hope this helps,
Mat