I am working on a legacy Fortran code where the loops over the (i,j) dimensions can be parallelized across gangs, and the work in the k direction is done in a function. If I label these functions with "acc routine vector" and parallelize the (i,j) loops, the code does not seem efficient. So I find myself constantly moving the outer loops into the function and modifying the indices to work directly with 3D arrays. This is very cumbersome because I have to change the one-dimensional indexing in the column function to 3D indexing, and it also makes the OpenACC code hard to upgrade when new versions are released. I would gladly use "acc routine vector" if there were no performance penalty.
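For illustration, the manual workaround I end up with looks roughly like this (a sketch, not the real code: `set_col_3d` is a made-up name, and the column computation is folded in directly; NX/NY/NZ are the same #defines as in the example below):

```
! Sketch of the manual workaround: the (i,j) gang loops move inside
! the column routine, and the column work indexes the 3D array
! directly instead of calling an "acc routine vector" per column.
subroutine set_col_3d(U)
  real, dimension(1:NX,1:NZ,1:NY), intent(out) :: U
  !$acc declare present(U)
  integer :: I, J, K
  !$acc parallel loop collapse(2)
  DO J = 1, NY
    DO I = 1, NX
      DO K = 1, NZ
        ! former 1D column code x(K) = ..., now written to U(I,K,J)
        U(I,K,J) = (3.0 * K * I + J)**2
      ENDDO
    ENDDO
  ENDDO
end subroutine
```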

I demonstrate this with an example below.

```
#define NX 201
#define NY 201
#define NZ 21
module mymod
contains
  subroutine set(I, J, U)
    !$acc routine vector
    real, dimension(1:NX,1:NZ,1:NY), intent(out) :: U
    integer :: I, J, K
    real :: x(NZ)
    DO K = 1, NZ
      x(K) = 3.0 * K * I + J
    ENDDO
    DO K = 1, NZ
      U(I,K,J) = x(K) * x(K)
    ENDDO
  end subroutine
  subroutine test1(U)
    real, dimension(1:NX,1:NZ,1:NY), intent(out) :: U
    !$acc declare present(U)
    integer :: I, J, K
    real :: x(NZ)
    !$acc parallel loop collapse(2) private(x)
    DO J = 1, NY
      DO I = 1, NX
        DO K = 1, NZ
          x(K) = 3.0 * K * I + J
        ENDDO
        DO K = 1, NZ
          U(I,K,J) = x(K) * x(K)
        ENDDO
      ENDDO
    ENDDO
  end subroutine
  subroutine test2(U)
    real, dimension(1:NX,1:NZ,1:NY), intent(out) :: U
    !$acc declare present(U)
    integer :: I, J, K
    !$acc parallel loop collapse(2)
    DO J = 1, NY
      DO I = 1, NX
        call set(I, J, U)
      ENDDO
    ENDDO
  end subroutine
end module
program main
  use mymod
  real, dimension(1:NX,1:NZ,1:NY) :: U
  !$acc declare create(U)
  call test1(U)
  !$acc update host(U)
  print *, U(1,1,1)
  call test2(U)
  !$acc update host(U)
  print *, U(1,1,1)
end program
```

Compilation:

```
[04:47 dabdi@hsw213 exp] > pgf90 -acc -Minfo=accel -ta=host,tesla,cc60,cuda9.0 ss.F90
set:
9, Generating Tesla code
15, !$acc loop vector ! threadidx%x
19, !$acc loop vector ! threadidx%x
15, Loop is parallelizable
19, Loop is parallelizable
test1:
26, Generating present(u(:,:,:))
30, Accelerator kernel generated
Generating Tesla code
31, !$acc loop gang collapse(2) ! blockidx%x
32, ! blockidx%x collapsed
34, !$acc loop vector(128) ! threadidx%x
38, !$acc loop vector(128) ! threadidx%x
30, CUDA shared memory used for x
34, Loop is parallelizable
38, Loop is parallelizable
test2:
49, Generating present(u(:,:,:))
51, Accelerator kernel generated
Generating Tesla code
52, !$acc loop gang collapse(2) ! blockidx%x
53, ! blockidx%x collapsed
main:
64, Generating create(u(:,:,:))
66, Generating update self(u(:,:,:))
70, Generating update self(u(:,:,:))
```

Profile output:

```
[04:47 dabdi@hsw213 exp] > nvprof ./a.out
==62007== NVPROF is profiling process 62007, command: ./a.out
16.00000
16.00000
==62007== Profiling application: ./a.out
==62007== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 94.70% 10.883ms 1 10.883ms 10.883ms 10.883ms test2_51_gpu
4.50% 517.40us 2 258.70us 257.53us 259.87us [CUDA memcpy DtoH]
0.80% 91.871us 1 91.871us 91.871us 91.871us test1_30_gpu
```

test1() and test2() are identical except that test2() uses the "acc routine vector" approach to parallelize. The routine-based version consumes 95% of the GPU time (10.9 ms), while the inlined version of the column function (set) in test1() takes under 1% (92 µs). Why? Is the compiler unable to parallelize as efficiently with "acc routine vector"?

Edit: I noticed that if I remove the private array x in both tests and directly set U = 2.0, the two kernels take more or less the same amount of time (about 11% each in the profile). Here is the profile after removing x:

```
[04:53 dabdi@hsw213 exp] > nvprof ./a.out
==62545== NVPROF is profiling process 62545, command: ./a.out
2.000000
2.000000
==62545== Profiling application: ./a.out
==62545== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 77.50% 527.48us 2 263.74us 258.01us 269.47us [CUDA memcpy DtoH]
11.26% 76.607us 1 76.607us 76.607us 76.607us test2_51_gpu
11.24% 76.511us 1 76.511us 76.511us 76.511us test1_30_gpu
```
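For reference, the simplified kernel body used in both tests for this second profile looks roughly like this (a sketch of the change, not the exact code; everything outside the shown loop is unchanged):

```
! private array x removed entirely; U is set directly
DO K = 1, NZ
  U(I,K,J) = 2.0
ENDDO
```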

Daniel