Fortran MATMUL function fails in OpenACC

Hi,

I am using PGI/18.5.

Here is a unit test program:
PROGRAM TEST

  USE openacc

  IMPLICIT NONE

  REAL(8),dimension(9,9,1000)::A,B,C
  Integer::n

  A=1d0
  B=1d0
  C=0d0
 !$acc parallel loop
  Do n=1,1000
    C(:,:,n)=MATMUL(A(:,:,n),B(:,:,n))
  enddo

  Print*,C(:,:,1)

END PROGRAM TEST

Without the OpenACC directive it works fine, but when I add the directive, compilation fails:

$ pgfortran -acc -Minfo=accel main.f90
PGF90-S-0155-Call to PGI runtime function not supported - pgf90_mmul_real8 (main.f90: 15)
PGF90-S-0155-Accelerator region ignored; see -Minfo messages (main.f90: 13)
test:
13, Accelerator region ignored
14, Accelerator restriction: invalid loop
0 inform, 0 warnings, 2 severes, 0 fatal for test

Would you please suggest how to solve that issue?

Hi Wending,

Sorry, but not all Fortran intrinsics are supported on the device, and matmul is one of them. You’ll need to write the matmul operation manually.

If you can upgrade to the NVHPC SDK 20.9 compiler (https://developer.nvidia.com/hpc-sdk), you can try using Tensor Cores through MATMUL calls, which the compiler implicitly translates into calls to cuTensorEx. See: https://developer.nvidia.com/blog/bringing-tensor-cores-to-standard-fortran/
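
If you do try that path, a minimal sketch of the idea would be the following. The cutensorex module name and the CUDA Fortran device-array usage are my reading of that blog post, not something tested here; compile with something like nvfortran -cuda -cudalib=cutensor.

PROGRAM TEST_TC
  USE cutensorex             ! overloads MATMUL so the assignment below runs through cuTENSOR
  IMPLICIT NONE
  REAL(8), dimension(9,9), device :: dA, dB, dC
  REAL(8), dimension(9,9) :: C

  dA = 1d0                   ! scalar-to-device-array assignment (CUDA Fortran)
  dB = 1d0
  dC = MATMUL(dA, dB)        ! executed on the GPU via the cuTensorEx overload
  C  = dC                    ! device-to-host copy
  Print*, C
END PROGRAM TEST_TC

Note this maps one whole-array MATMUL at a time, so for your 1000-slice case you would still need the loop itself, or a batched library call.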

-Mat

Hi Mat,

Thank you for your reply.
Yes, I have tested the NVHPC SDK on my own laptop and it works fine. But the supercomputer only has the older compiler, and I do not have administrator access to upgrade it.
Does a manually written matmul hurt computational efficiency? Is there a high-efficiency approach you would recommend?

Thanks,

Wending

The most efficient thing to do is to use batched cuBLAS. Ron Rahaman has several examples: https://github.com/RonRahaman/cublas-demos/tree/master/src

He also has some pure OpenACC examples (i.e., no cuBLAS).
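
For this 9x9x1000 case, a strided-batched DGEMM version might look roughly like the sketch below. Treat it as a sketch only: it assumes the cublas interface module that ships with the PGI/NVHPC compilers, its cublasDgemmStridedBatched wrapper, and 8-byte integer stride arguments; check the interfaces against your compiler version and compile with something like nvfortran -acc -cudalib=cublas.

PROGRAM TEST_CUBLAS
  USE openacc
  USE cublas                 ! PGI/NVHPC cuBLAS interface module (assumed available)
  IMPLICIT NONE

  REAL(8), dimension(9,9,1000) :: A, B, C
  TYPE(cublasHandle) :: h
  INTEGER :: istat

  A = 1d0
  B = 1d0
  C = 0d0

  istat = cublasCreate(h)

  !$acc data copyin(A,B) copyout(C)
  !$acc host_data use_device(A,B,C)
  ! One strided-batched DGEMM replaces the loop of 1000 small MATMULs:
  ! each batch member is a 9x9 GEMM, and consecutive slices sit 9*9 = 81
  ! elements apart in memory.
  istat = cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, &
            9, 9, 9, 1d0, A, 9, 81_8, B, 9, 81_8, 0d0, C, 9, 81_8, 1000)
  !$acc end host_data
  !$acc end data

  istat = cublasDestroy(h)

  Print*, C(:,:,1)
END PROGRAM TEST_CUBLAS

The cuBLAS call is asynchronous; if the copyout at "end data" ever races with it, bind the handle to the OpenACC default queue (cublasSetStream with the stream from acc_get_cuda_stream) before the call.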

For your code, the pure OpenACC version would look something like:

% cat testmm.F90
PROGRAM TEST
#ifdef _OPENACC
  USE openacc
#endif

  IMPLICIT NONE

  REAL(8),dimension(9,9,1000)::A,B,C
  REAL(8) :: tmp            ! match the precision of A, B, and C
  Integer::n,i,j,k

  A=1d0
  B=1d0
  C=0d0
 !$acc parallel loop vector_length(32) copyin(A,B) copyout(C)
  Do n=1,1000
#ifndef _OPENACC
    C(:,:,n)=MATMUL(A(:,:,n),B(:,:,n))
#else
!$acc loop vector collapse(2)
    do j=1,9
       do i=1,9
          tmp = 0d0         ! accumulate the dot product in double precision
          do k=1,9
             tmp = tmp + a(i,k,n) * b(k,j,n)
          enddo
          c(i,j,n) = tmp
       enddo
    enddo
#endif
  enddo

  Print*,C(:,:,1)

END PROGRAM TEST
% nvfortran testmm.F90 -acc -Minfo=accel -fast
test:
     15, Generating copyin(a(:,:,:)) [if not already present]
         Generating copyout(c(:,:,:)) [if not already present]
         Generating copyin(b(:,:,:)) [if not already present]
         Generating Tesla code
         16, !$acc loop gang ! blockidx%x
         21, !$acc loop vector(32) collapse(2) ! threadidx%x
             Interchanging generated strip mine loop outwards
         22,   ! threadidx%x collapsed
             Interchanging generated vector loop outwards
             Interchanging generated strip mine loop outwards
         24, !$acc loop seq
     21, Loop is parallelizable
     22, Loop is parallelizable
     24, Loop carried scalar dependence for tmp at line 25
         Scalar last value needed after loop for tmp at line 27
% a.out
    9.000000000000000         9.000000000000000         9.000000000000000
    9.000000000000000         9.000000000000000         9.000000000000000
    9.000000000000000         9.000000000000000         9.000000000000000
    ... (81 values in total, all 9.000000000000000, as expected)

Got it. It works. Thanks, Mat.