Fortran MATMUL function fails in OpenACC

Hi,

I am using PGI/18.5.

Here is a unit test program:
PROGRAM TEST

  USE openacc

  IMPLICIT NONE

  REAL(8),dimension(9,9,1000)::A,B,C
  Integer::n

  A=1d0
  B=1d0
  C=0d0
 !$acc parallel loop
  Do n=1,1000
    C(:,:,n)=MATMUL(A(:,:,n),B(:,:,n))
  enddo

  Print*,C(:,:,1)

END PROGRAM TEST

Without the OpenACC directive it works fine, but when I add the directive, compilation fails:

$ pgfortran -acc -Minfo=accel main.f90
PGF90-S-0155-Call to PGI runtime function not supported - pgf90_mmul_real8 (main.f90: 15)
PGF90-S-0155-Accelerator region ignored; see -Minfo messages (main.f90: 13)
test:
13, Accelerator region ignored
14, Accelerator restriction: invalid loop
0 inform, 0 warnings, 2 severes, 0 fatal for test

Would you please suggest how to solve that issue?

Hi Wending,

Sorry, but not all Fortran intrinsics are supported on the device, and matmul is one of them. You’ll need to write the matmul operation manually.

If you can upgrade to the NVHPC SDK 20.9 compiler (https://developer.nvidia.com/hpc-sdk), you can try using Tensor Cores through MATMUL calls, which the compiler implicitly translates into calls to cuTensorEx. See: https://developer.nvidia.com/blog/bringing-tensor-cores-to-standard-fortran/
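
If you do try that path, a minimal sketch of the idea would be the following. The cutensorex module name and the CUDA Fortran device-array usage are my reading of that blog post, not something tested here; compile with something like nvfortran -cuda -cudalib=cutensor.

PROGRAM TEST_TC
  USE cutensorex             ! overloads MATMUL so the assignment below runs through cuTENSOR
  IMPLICIT NONE
  REAL(8), dimension(9,9), device :: dA, dB, dC
  REAL(8), dimension(9,9) :: C

  dA = 1d0                   ! scalar-to-device-array assignment (CUDA Fortran)
  dB = 1d0
  dC = MATMUL(dA, dB)        ! executed on the GPU via the cuTensorEx overload
  C  = dC                    ! device-to-host copy
  Print*, C
END PROGRAM TEST_TC

Note this maps one whole-array MATMUL at a time, so for your 1000-slice case you would still need the loop itself, or a batched library call.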

-Mat

Hi Mat,

Thank you for your reply.
Yes, I have tested the NVHPC SDK on my own laptop and it works fine. But the supercomputer only has the older compiler, and I do not have administrator access to upgrade it.
Does a manually written matmul hurt computational efficiency? Is there a high-efficiency approach you would recommend?

Thanks,

Wending

The most efficient thing to do is to use batched cuBLAS. Ron Rahaman has several examples: https://github.com/RonRahaman/cublas-demos/tree/master/src

He also has some pure OpenACC examples (i.e., no cuBLAS).
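
For this 9x9x1000 case, a strided-batched DGEMM version might look roughly like the sketch below. Treat it as a sketch only: it assumes the cublas interface module that ships with the PGI/NVHPC compilers, its cublasDgemmStridedBatched wrapper, and 8-byte integer stride arguments; check the interfaces against your compiler version and compile with something like nvfortran -acc -cudalib=cublas.

PROGRAM TEST_CUBLAS
  USE openacc
  USE cublas                 ! PGI/NVHPC cuBLAS interface module (assumed available)
  IMPLICIT NONE

  REAL(8), dimension(9,9,1000) :: A, B, C
  TYPE(cublasHandle) :: h
  INTEGER :: istat

  A = 1d0
  B = 1d0
  C = 0d0

  istat = cublasCreate(h)

  !$acc data copyin(A,B) copyout(C)
  !$acc host_data use_device(A,B,C)
  ! One strided-batched DGEMM replaces the loop of 1000 small MATMULs:
  ! each batch member is a 9x9 GEMM, and consecutive slices sit 9*9 = 81
  ! elements apart in memory.
  istat = cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, &
            9, 9, 9, 1d0, A, 9, 81_8, B, 9, 81_8, 0d0, C, 9, 81_8, 1000)
  !$acc end host_data
  !$acc end data

  istat = cublasDestroy(h)

  Print*, C(:,:,1)
END PROGRAM TEST_CUBLAS

The cuBLAS call is asynchronous; if the copyout at "end data" ever races with it, bind the handle to the OpenACC default queue (cublasSetStream with the stream from acc_get_cuda_stream) before the call.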

For your code, the pure OpenACC version would look something like:

% cat testmm.F90
PROGRAM TEST
#ifdef _OPENACC
  USE openacc
#endif

  IMPLICIT NONE

  REAL(8),dimension(9,9,1000)::A,B,C
  REAL(8) :: tmp            ! match the precision of A, B, and C
  Integer::n,i,j,k

  A=1d0
  B=1d0
  C=0d0
 !$acc parallel loop vector_length(32) copyin(A,B) copyout(C)
  Do n=1,1000
#ifndef _OPENACC
    C(:,:,n)=MATMUL(A(:,:,n),B(:,:,n))
#else
!$acc loop vector collapse(2)
    do j=1,9
       do i=1,9
          tmp = 0d0         ! accumulate the dot product in double precision
          do k=1,9
             tmp = tmp + a(i,k,n) * b(k,j,n)
          enddo
          c(i,j,n) = tmp
       enddo
    enddo
#endif
  enddo

  Print*,C(:,:,1)

END PROGRAM TEST
% nvfortran testmm.F90 -acc -Minfo=accel -fast
test:
     15, Generating copyin(a(:,:,:)) [if not already present]
         Generating copyout(c(:,:,:)) [if not already present]
         Generating copyin(b(:,:,:)) [if not already present]
         Generating Tesla code
         16, !$acc loop gang ! blockidx%x
         21, !$acc loop vector(32) collapse(2) ! threadidx%x
             Interchanging generated strip mine loop outwards
         22,   ! threadidx%x collapsed
             Interchanging generated vector loop outwards
             Interchanging generated strip mine loop outwards
         24, !$acc loop seq
     21, Loop is parallelizable
     22, Loop is parallelizable
     24, Loop carried scalar dependence for tmp at line 25
         Scalar last value needed after loop for tmp at line 27
% a.out
    9.000000000000000         9.000000000000000         9.000000000000000
    9.000000000000000         9.000000000000000         9.000000000000000
    9.000000000000000         9.000000000000000         9.000000000000000
    ... (81 values in total, all 9.000000000000000, as expected)

Got it. It works. Thanks, Mat.