How to cache an entire array in an OpenACC gang loop?

Hi, I am unable to cache an entire array in an OpenACC gang loop (i.e. copy the array to shared memory).
For example, consider the following loop:

INTEGER(KIND=4)::A(10),B(10)
INTEGER(KIND=4)::I,J
!$acc parallel loop gang num_gangs(1000) vector vector_length(16) private(A)
DO I=1,1000
   !$acc cache(A)

   !$acc loop seq
   DO J=1,10
      !Some calculations
   END DO

END DO

When compiling, I sometimes get the following warning:

NVFORTRAN-W-0155-Cached array section must have fixed size: A

Can someone please let me know how to put an entire array in cache/shared memory in an OpenACC gang loop?

Hi Nmnethaji8,

Do you have a minimal reproducing example? Also which compiler version and platform are you using?

The warning typically occurs when the size of “A” is not known, so you’d need to add the bounds information. However, “A” can’t be both vector private and shared, so you’d also need to remove the “private(A)”. Be aware that doing so may introduce race conditions on the shared array.

Note that if an array is gang private, the compiler does attempt to implicitly use shared memory. For example:

% cat test.F90

module foo

contains

subroutine bar (Arr)

INTEGER, dimension(:) :: Arr
INTEGER(KIND=4)::A(10),B(10)
INTEGER :: i,j,sumA
B=1

!$acc parallel loop gang copy(Arr,B) private(A)
DO i=1,1000
!acc cache(A)
     !$acc loop vector
     DO j=1,10
        !Some calculations
        A(j) = B(j)
     END DO
     sumA=0
     !$acc loop vector reduction(+:sumA)
     DO j=1,10
        sumA = sumA+A(j)
     END DO
     Arr(i)=sumA
END DO

end subroutine bar
end module foo
% nvfortran -c test.F90 -Minfo=accel -acc -V22.5
bar:
     13, Generating copy(b(:),arr(:)) [if not already present]
         Generating NVIDIA GPU code
         14, !$acc loop gang ! blockidx%x
         17, !$acc loop vector(32) ! threadidx%x
         23, !$acc loop vector(32) ! threadidx%x
             Generating reduction(+:suma)
     13, CUDA shared memory used for a
     17, Loop is parallelizable
     23, Loop is parallelizable

-Mat

Please consider the following example,

Program name: shared.f95

MODULE TESTING
CONTAINS
SUBROUTINE TEST(N)
   IMPLICIT NONE 
   INTEGER(KIND=4),INTENT(IN)::N
   
   INTEGER(KIND=4)::A(N),B(10),I,J,ARR(N)

   !$acc parallel loop gang num_gangs(N) copy(arr) private(A,B) vector_length(10)
   DO I=1,N
      !$acc cache(A,B)
      ARR(I)=0

      !$acc loop vector
      DO J=1,10
         A(J)=J
         B(J)=J
      ENDDO
      ARR(I)=SUM(B)
   ENDDO

   PRINT*,ARR(:)
END SUBROUTINE TEST

END MODULE TESTING

PROGRAM MAIN
USE TESTING
IMPLICIT NONE 
INTEGER(KIND=4)::N=10

CALL TEST(N)
END PROGRAM MAIN

In this program, A and B are private to the gang loop, and I specified vector_length(10).

compiling:
nvfortran -Mcuda -acc -Minfo=accel shared.f95 -o shared
NVFORTRAN-W-0155-Cached array section must have fixed size: a (shared.f95: 12)
test:
     10, Generating copy(arr(:)) [if not already present]
         Generating NVIDIA GPU code
         11, !$acc loop gang(n) ! blockidx%x
         16, !$acc loop vector(10) ! threadidx%x
         20, !$acc loop vector(10) ! threadidx%x
             Generating implicit reduction(+:b$r)
     10, CUDA shared memory used for b
     16, Loop is parallelizable
     20, Loop is parallelizable
  0 inform,   1 warnings,   0 severes, 0 fatal for test

You can see that only B is placed in shared memory.

compiler used:
nvfortran --version
nvfortran 22.2-0 64-bit target on x86-64 Linux -tp skylake-avx512 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

“A” is an automatic array (i.e. a variable-length array whose size is set via an argument). Because shared memory has a fixed size, the compiler won’t use shared memory when the array size is unknown at compile time; doing so could cause odd runtime failures if too much were used. Hence local memory is used instead.

To fix this, you need to use a fixed-size range in either the “private” clause or the “cache” directive, e.g. “A(1:10)”.

Though again, you should use either “private” on the gang loop or “cache”, not both. It doesn’t hurt to use both, but they do the same thing and “cache” will be ignored.
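As a rough sketch of the suggestion above, the shared.f95 example could be adjusted as follows: the “cache” directive is dropped and the fixed-size section “A(1:10)” is given in the “private” clause. The module name TESTING_FIXED is made up for this illustration, and the code assumes N >= 10 so that the section stays in bounds. Whether the compiler actually places “A” in shared memory should be checked in the -Minfo=accel output rather than assumed.

```fortran
MODULE TESTING_FIXED   ! hypothetical name, used only for this sketch
CONTAINS
SUBROUTINE TEST(N)
   IMPLICIT NONE
   INTEGER(KIND=4),INTENT(IN)::N

   ! A is still an automatic array; assumption: N >= 10
   INTEGER(KIND=4)::A(N),B(10),I,J,ARR(N)

   ! Fixed-size section A(1:10) in the private clause; no cache directive
   !$acc parallel loop gang num_gangs(N) copy(ARR) private(A(1:10),B) vector_length(10)
   DO I=1,N
      ARR(I)=0

      !$acc loop vector
      DO J=1,10
         A(J)=J
         B(J)=J
      ENDDO
      ARR(I)=SUM(B)
   ENDDO

   PRINT*,ARR(:)
END SUBROUTINE TEST

END MODULE TESTING_FIXED

PROGRAM MAIN
USE TESTING_FIXED
IMPLICIT NONE
INTEGER(KIND=4)::N=10

CALL TEST(N)
END PROGRAM MAIN
```

If the change takes effect, the -Minfo=accel output would be expected to list a “CUDA shared memory used for” line covering “a” as well as “b”; if it does not, the array is still being placed elsewhere.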

