openacc routine function efficiency

I am working on a legacy Fortran code where the loops over the (i,j) dimensions can be parallelized across gangs, and the work in the k direction is done inside a function. If I label these functions as “acc routine vector” and parallelize the (i,j) loops, the code does not seem to be efficient. As a result, I find myself constantly moving the outer loops into the function and modifying the indices to work directly with 3D arrays. This is very cumbersome, because I have to replace the one-dimensional indexing in the column function with 3D indexing, and it also makes the OpenACC code hard to upgrade when new versions are released. I would like to use “acc routine vector” if there were no performance penalty.

I demonstrate this with an example below.

#define NX 201
#define NY 201
#define NZ 21

module mymod

contains

   subroutine set(I,J,U)
!$acc routine vector
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
        integer :: I, J, K
        real :: x(NZ)

        DO K = 1, 21 
            x(K) = 3.0 * K * I + J;
        ENDDO

        DO K = 1, 21 
            U(I,K,J)=x(K) * x(K)
        ENDDO
   end subroutine

   subroutine test1(U)
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
!$acc declare present(U)
        integer :: I, J, K
        real :: x(NZ)

!$acc parallel loop collapse(2) private(x)
        DO J = 1,NY 
            DO I = 1,NX

                DO K = 1, 21
                    x(K) = 3.0 * K * I + J;
                ENDDO

                DO K = 1, 21
                    U(I,K,J) = x(K) * x(K)
                ENDDO

            ENDDO
        ENDDO

   end subroutine

   subroutine test2(U)
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
!$acc declare present(U)
        integer :: I, J, K
!$acc parallel loop collapse(2)
        DO J = 1,NY
            DO I = 1,NX
                call set(I,J,U)
            ENDDO
        ENDDO
   end subroutine

end module 

program main
   use mymod 
   real, dimension(1:NX,1:NZ,1:NY):: U

!$acc declare create(U)
   call test1(U)
!$acc update host(U)
   print*, U(1,1,1)

   call test2(U)
!$acc update host(U)
   print*, U(1,1,1)
end program

Compilation:

[04:47 dabdi@hsw213 exp] > pgf90 -acc -Minfo=accel -ta=host,tesla,cc60,cuda9.0 ss.F90
set:
      9, Generating Tesla code
         15, !$acc loop vector ! threadidx%x
         19, !$acc loop vector ! threadidx%x
     15, Loop is parallelizable
     19, Loop is parallelizable
test1:
     26, Generating present(u(:,:,:))
     30, Accelerator kernel generated
         Generating Tesla code
         31, !$acc loop gang collapse(2) ! blockidx%x
         32,   ! blockidx%x collapsed
         34, !$acc loop vector(128) ! threadidx%x
         38, !$acc loop vector(128) ! threadidx%x
     30, CUDA shared memory used for x
     34, Loop is parallelizable
     38, Loop is parallelizable
test2:
     49, Generating present(u(:,:,:))
     51, Accelerator kernel generated
         Generating Tesla code
         52, !$acc loop gang collapse(2) ! blockidx%x
         53,   ! blockidx%x collapsed
main:
     64, Generating create(u(:,:,:))
     66, Generating update self(u(:,:,:))
     70, Generating update self(u(:,:,:))

Profile output:

[04:47 dabdi@hsw213 exp] > nvprof ./a.out
==62007== NVPROF is profiling process 62007, command: ./a.out
    16.00000    
    16.00000    
==62007== Profiling application: ./a.out
==62007== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   94.70%  10.883ms         1  10.883ms  10.883ms  10.883ms  test2_51_gpu
                    4.50%  517.40us         2  258.70us  257.53us  259.87us  [CUDA memcpy DtoH]
                    0.80%  91.871us         1  91.871us  91.871us  91.871us  test1_30_gpu

test1() and test2() are identical except that test2 uses the “acc routine vector” approach to parallelize. The routine-based version consumes about 95% of the GPU time, while test1(), which manually inlines the column function (set), consumes only about 1%. Why? Is the compiler not able to parallelize as efficiently with “acc routine vector”?

Edit: I noticed that if I remove the private array X in both tests and directly set U=2.0, they consume more or less the same amount of time (about 11% each in the profile). Here is the profile after removing X:

[04:53 dabdi@hsw213 exp] > nvprof ./a.out
==62545== NVPROF is profiling process 62545, command: ./a.out
    2.000000    
    2.000000    
==62545== Profiling application: ./a.out
==62545== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   77.50%  527.48us         2  263.74us  258.01us  269.47us  [CUDA memcpy DtoH]
                   11.26%  76.607us         1  76.607us  76.607us  76.607us  test2_51_gpu
                   11.24%  76.511us         1  76.511us  76.511us  76.511us  test1_30_gpu
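
For reference, the simplified kernels I used for this second profile look roughly like this (the directives and loop structure are otherwise unchanged; only x is removed and U is set directly):

   subroutine set(I,J,U)
!$acc routine vector
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
        integer :: I, J, K

        DO K = 1, 21
            U(I,K,J) = 2.0
        ENDDO
   end subroutine

and the corresponding loop nest in test1:

!$acc parallel loop collapse(2)
        DO J = 1,NY
            DO I = 1,NX
                DO K = 1, 21
                    U(I,K,J) = 2.0
                ENDDO
            ENDDO
        ENDDO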

Daniel

Hi Daniel,

One of the main issues here is that “X” needs to be malloc’d, which is very slow on the device. This is why the performance gets better when you don’t use it. Another workaround is to inline the routine (i.e., add the -Minline flag), in which case the malloc can be optimized away.
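
For example, your original compile line with inlining enabled would be something like this:

% pgf90 -acc -Minline -Minfo=accel -ta=host,tesla,cc60,cuda9.0 ss.F90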

Though, given that the loop trip count of the “K” loops is only 21, you’re severely underutilizing the GPU. With a vector length of 128, 107 threads are doing no work. For this case, I’d recommend making the “K” loops and the “set” routine sequential, and scheduling the outer loops “gang vector”.

Here are the original timings on a V100 device:

  GPU activities:   94.31%  11.096ms         1  11.096ms  11.096ms  11.096ms  test2_51_gpu
                    5.08%  597.66us         2  298.83us  293.37us  304.29us  [CUDA memcpy DtoH]
                    0.61%  71.904us         1  71.904us  71.904us  71.904us  test1_30_gpu

Original code with “-Minline”:

 GPU activities:   78.40%  517.34us         2  258.67us  258.37us  258.97us  [CUDA memcpy DtoH]
                   10.82%  71.391us         1  71.391us  71.391us  71.391us  main_66_gpu
                   10.78%  71.136us         1  71.136us  71.136us  71.136us  main_70_gpu

Here’s a diff between your original version and my modified version:

% diff test7218.F90 test7218.a.F90
10c10
< !$acc routine vector
---
> !$acc routine seq
30c30
< !$acc parallel loop collapse(2) private(x)
---
> !$acc parallel loop gang vector collapse(2) private(x)
51c51
< !$acc parallel loop collapse(2)
---
> !$acc parallel loop gang vector collapse(2)

And this new version is over 10 times as fast as your version:

 GPU activities:   97.75%  541.63us         2  270.81us  258.30us  283.33us  [CUDA memcpy DtoH]
                    1.20%  6.6240us         1  6.6240us  6.6240us  6.6240us  test1_30_gpu
                    1.06%  5.8560us         1  5.8560us  5.8560us  5.8560us  test2_51_gpu
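
For reference, with that diff applied, the routine-based path looks roughly like this (test1 only gets “gang vector” added to its parallel loop directive):

   subroutine set(I,J,U)
!$acc routine seq
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
        integer :: I, J, K
        real :: x(NZ)

        DO K = 1, 21
            x(K) = 3.0 * K * I + J
        ENDDO

        DO K = 1, 21
            U(I,K,J) = x(K) * x(K)
        ENDDO
   end subroutine

   subroutine test2(U)
        real, dimension(1:NX,1:NZ,1:NY), intent(out):: U
!$acc declare present(U)
        integer :: I, J, K
!$acc parallel loop gang vector collapse(2)
        DO J = 1,NY
            DO I = 1,NX
                call set(I,J,U)
            ENDDO
        ENDDO
   end subroutine

With “routine seq”, each thread of the collapsed gang-vector loop handles one (i,j) column, so the K loops inside set simply run sequentially within that thread.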

Hope this helps,
Mat