Openacc fortran array syntax not translated correctly

Hi,

OpenACC array syntax such as A = B + C as shown below does not give correct results, instead I have to use
A(:,:) = B(:,:) + C(:,:) to get correct results. Why is that ?
I also have other issues with fortran arrays in which a seeminly innocent operation to split a block of matrix operations into two kernels gives different results …

program test

IMPLICIT NONE
INTEGER, PARAMETER :: fp = SELECTED_REAL_KIND(6)
INTEGER, PARAMETER :: MAX_N_ANGLES = 128 
REAL(fp), ALLOCATABLE, DIMENSION(:,:) :: A, B, C
INTEGER :: k

ALLOCATE(A(MAX_N_ANGLES,MAX_N_ANGLES), &
     B(MAX_N_ANGLES,MAX_N_ANGLES), &
     C(MAX_N_ANGLES,MAX_N_ANGLES))

CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)
CALL RANDOM_NUMBER(C)

!$acc enter data copyin(A,B,C)

!$acc kernels present(A,B,C)
CALL myMATMUL(A,B,C)
A = B + C 
!A(:,:) = B(:,:) + C(:,:)
!$acc end kernels

!$acc update self(A,B,C)
print*, A(1:12,1)
print*, B(1:12,1)
print*, C(1:12,1)

CONTAINS

  SUBROUTINE  myMATMUL(A, B, C)
!$acc routine gang
REAL(fp), INTENT(IN), DIMENSION(1:MAX_N_ANGLES,1:MAX_N_ANGLES) :: A, B
REAL(fp), INTENT(OUT), DIMENSION(1:MAX_N_ANGLES,1:MAX_N_ANGLES) :: C
REAL(fp) :: acc 
INTEGER :: I, J, K

!$acc loop collapse(2)
DO I = 1, MAX_N_ANGLES
  DO J = 1, MAX_N_ANGLES
    acc = 0 
!$acc loop seq
    DO K = 1, MAX_N_ANGLES
        acc = acc + A(I,K) * B(K,J)
    END DO
    C(I,J) = acc 
  END DO
END DO

  END SUBROUTINE myMATMUL

end program test

Hi danxpy,

I don’t think this is a problem with the array syntax but something else is going on.

Looking at the compiler feedback messages, you’ll see that “myMATMUL” is using a serial kernel. Contained subroutines have a hidden argument where a pointer to the parent’s stack is passed in. Since this is a pointer to the host stack, it will cause errors if accessed on the device.

My suggestion to fix these issues, is to move myMatmul into a module and push the compute region inside the subroutine.

% cat test.F90
module matmul

INTEGER, PARAMETER :: fp = SELECTED_REAL_KIND(6)
INTEGER, PARAMETER :: MAX_N_ANGLES = 128

contains

  SUBROUTINE  myMATMUL(A, B, C)
REAL(fp), INTENT(IN), DIMENSION(1:MAX_N_ANGLES,1:MAX_N_ANGLES) :: A, B
REAL(fp), INTENT(OUT), DIMENSION(1:MAX_N_ANGLES,1:MAX_N_ANGLES) :: C
REAL(fp) :: acc
INTEGER :: I, J, K

!$acc kernels loop collapse(2) present(A,B,C)
DO I = 1, MAX_N_ANGLES
  DO J = 1, MAX_N_ANGLES
    acc = 0
!$acc loop seq
    DO K = 1, MAX_N_ANGLES
        acc = acc + A(I,K) * B(K,J)
    END DO
    C(I,J) = acc
  END DO
END DO

  END SUBROUTINE myMATMUL

end module matmul

program test
use matmul
IMPLICIT NONE
REAL(fp), ALLOCATABLE, DIMENSION(:,:) :: A, B, C
INTEGER :: k

ALLOCATE(A(MAX_N_ANGLES,MAX_N_ANGLES), &
     B(MAX_N_ANGLES,MAX_N_ANGLES), &
     C(MAX_N_ANGLES,MAX_N_ANGLES))

CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)
CALL RANDOM_NUMBER(C)

!$acc enter data copyin(A,B,C)
CALL myMATMUL(A,B,C)

!$acc kernels
A = B + C
!$acc end kernels

!$acc update self(A,B,C)
print*, A(1:12,1)
print*, B(1:12,1)
print*, C(1:12,1)
!$acc exit data delete(A,B,C)

end program test

% nvfortran -acc -Minfo=accel test.F90; a.out
mymatmul:
     14, Generating present(a(:,:),c(:,:),b(:,:))
     15, Loop is parallelizable
     16, Loop is parallelizable
         Generating Tesla code
         15, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x collapsed-innermost
         16,   ! blockidx%x threadidx%x collapsed
         19, !$acc loop seq
test:
     44, Generating enter data copyin(b(:,:),c(:,:),a(:,:))
     47, Generating implicit copyin(c(1:128,1:128),b(1:128,1:128)) [if not already present]
         Generating implicit copyout(a(1:128,1:128)) [if not already present]
     48, Loop is parallelizable
         Generating Tesla code
         48,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
     51, Generating update self(c(:,:),b(:,:),a(:,:))
     55, Generating exit data delete(c(:,:),b(:,:),a(:,:))
    31.04105        28.16795        29.00390        29.33393
    29.29962        26.79399        34.51864        28.09657
    32.60807        30.06055        27.82124        32.19381
   0.6832442       0.7977929       0.2731100       0.8713030
   0.5932503       0.1152713       0.4520081       3.2375805E-02
   0.3400532       0.2701175       0.4576416       0.8395543
    30.35780        27.37016        28.73079        28.46263
    28.70637        26.67872        34.06664        28.06420
    32.26801        29.79043        27.36360        31.35425

Hi Mat,

Ok I made the contained subroutine mistake again, but I still think there is a problem when using routine functions.
I noticed making the matmul function a stand alone kernel like you did worked for me too but it doesn’t work with routine.

MODULE kernels

INTEGER, PARAMETER :: fp = SELECTED_REAL_KIND(6)
INTEGER, PARAMETER :: MAX_N_ANGLES = 128

CONTAINS

  SUBROUTINE  myMATMUL(A, B, C)
!$acc routine gang
    REAL(fp), INTENT(IN), DIMENSION(1:MAX_N_ANGLES,1:MAX_N_ANGLES) :: A, B
    REAL(fp), INTENT(OUT), DIMENSION(1:MAX_N_ANGLES,1:MAX_N_ANGLES) :: C
    REAL(fp) :: acc
    INTEGER :: I, J, K

!$acc loop collapse(2) private(acc)
    DO I = 1, MAX_N_ANGLES
      DO J = 1, MAX_N_ANGLES
        acc = 0
!$acc loop seq
        DO K = 1, MAX_N_ANGLES
            acc = acc + A(I,K) * B(K,J)
        END DO
        C(I,J) = acc
      END DO  
    END DO  
  END SUBROUTINE myMATMUL

END MODULE kernels 

program test

USE kernels 

IMPLICIT NONE
REAL(fp), ALLOCATABLE, DIMENSION(:,:) :: A, B, C 
INTEGER :: k

ALLOCATE(A(MAX_N_ANGLES,MAX_N_ANGLES), &
         B(MAX_N_ANGLES,MAX_N_ANGLES), &
         C(MAX_N_ANGLES,MAX_N_ANGLES))

CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)
CALL RANDOM_NUMBER(C)

print*, A(1:12,1)
print*, B(1:12,1)
print*, C(1:12,1)
print*, "======================="

!$acc enter data copyin(A,B,C)

!$acc kernels present(A,B,C)
CALL myMATMUL(A,B,C)
A = B + C
!A(:,:) = B(:,:) + C(:,:)
!$acc end kernels

!$acc update self(A,B,C)
print*, A(1:12,1)
print*, B(1:12,1)
print*, C(1:12,1)

end program test

In another program, I had to replace simple operations like add and transpose of matrices, with explicitly written code to get it working. Also using “acc routine gang” often tend to produce incorrect results so I have to use “acc routine worker/vector” for MATMUL/TRANSPOSE/ADD etc. I do not feel confident that my code won’t break with some reshuffling of code even when it seems to be “working”…

> pgf95 -O3 -mp -acc -Minfo=accel -ta=tesla:cc60,cuda10.1 matmul.f90 -o gpu
mymatmul:
      9, Generating Tesla code
         17, !$acc loop gang, vector collapse(2) ! blockidx%x threadidx%x
         18,   ! blockidx%x threadidx%x collapsed
         21, !$acc loop seq
test:
      0, Generating Tesla code
         56, !$acc loop seq
     52, Generating enter data copyin(c(:,:),b(:,:),a(:,:))
     54, Generating present(a(:,:),c(:,:),b(:,:))
         Accelerator serial kernel generated
     56, Loop is parallelizable
     60, Generating update self(c(:,:),b(:,:),a(:,:))

The reason why I can’t push the kernels loop to inside MATMUL is because these matrix multiplies are very small, maximum 16x16 matrices. But I do have an outer loop that is big so the matrix multiplies need to be routines.

regards,
Daniel

I noticed making the matmul function a stand alone kernel like you did worked for me too but it doesn’t work with routine.

Try putting the call in it’s own compute region:

!$acc kernels present(A,B,C)
CALL myMATMUL(A,B,C)
!$acc end kernels
!$acc kernels present(A,B,C)
A = B + C
!$acc end kernels

Have you seen our cuTensor support in OpenACC using standard Fortran intrinsics including matmul and transpose? Might be useful here: Bringing Tensor Cores to Standard Fortran | NVIDIA Developer Blog

Also using “acc routine gang” often tend to produce incorrect results

Not to say that routine gang can’t be used, but each time I’ve seen someone use it, it was almost always better to move the gang loop inside the routine itself.

The reason why I can’t push the kernels loop to inside MATMUL is because these matrix multiplies are very small, maximum 16x16 matrices. But I do have an outer loop that is big so the matrix multiplies need to be routines.

Using a gang routine doesn’t help here since a gang will launch a CUDA kernel. Ideally, you’d want make the outer loop the gang loop and make myMatmul a vector routine.

-Mat