Optimizing memory access for real*2 datatype in OpenACC

Hi,

I have a code that uses a batched gemv operation in mixed precision. I ported it to OpenACC and it runs fine, but I wonder whether there is a way to further optimize the memory access for the real*2 matrices in the code.
I read through the section “Vectorized Memory Access” in https://www.pgroup.com/blogs/posts/cuda-fortran-tensor-cores.html but I’m not able to implement it in OpenACC.

I created a small example. In this program the matrix m_h is in half precision, whereas the vectors r and s are in double precision. When I compare this version against one where m_h is in double precision, the half-precision version is not much faster.
Is there a way to optimize the m_h memory access in half precision?

Thank you for your help

main.f90

PROGRAM test
       implicit none
       real*8,allocatable :: s(:)
       real*8,allocatable :: r(:)
       real*2,allocatable :: m_h(:)
       integer n,k,bSize,tSize,tMatrices,sIndex,mSize
       integer i,j
       n=256*256*256
       bSize=16
       mSize=bSize*bSize
       tMatrices=n/bSize
       allocate(s(n))
       allocate(r(n))
       allocate(m_h(n/bSize*mSize))
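       !$acc enter data create(s,r,m_h)

       ! Give r and m_h defined values on the device before they are read
       !$acc parallel loop present(r)
       do i=1,n
         r(i) = 1.0d0
       enddo
       !$acc parallel loop present(m_h)
       do i=1,n/bSize*mSize
         m_h(i) = 0.5
       enddo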

       !$acc parallel loop present(s,r,m_h)
       do k=1,tMatrices
         sIndex = bSize*(k-1)
         do i=1,bSize
           s(sIndex+i) = 0.0d0
         enddo
         do i=1,bSize
           do j=1,bsize
              s(sIndex+j) = s(sIndex+j) + r(sIndex+i) *&
                            m_h(mSize*(k-1)+bSize*(i-1)+j)
           enddo
         enddo
       enddo
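       ! Release the device copies and the host arrays
       !$acc exit data delete(s,r,m_h)
       deallocate(s,r,m_h)
       write(*,*)"Programend"
       end PROGRAM test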
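
For comparison, here is one loop arrangement that may be worth timing against the version above. It is only a sketch and has not been benchmarked: it assumes the same block layout as main.f90, fills r and m_h with placeholder values, and uses a hypothetical scalar named dot as a per-thread accumulator. The k and j loops are collapsed so that each thread produces one element of s and neighbouring threads read neighbouring real*2 elements of m_h. Whether this helps depends on how the compiler schedules the original loop, so check the -Minfo=acc output for both versions.

```fortran
program gemv_collapsed
  implicit none
  integer, parameter :: bSize = 16, mSize = bSize*bSize
  integer, parameter :: n = 256*256*256, tMatrices = n/bSize
  real*8, allocatable :: s(:), r(:)
  real*2, allocatable :: m_h(:)
  real*8 :: dot
  integer :: i, j, k

  allocate(s(n), r(n), m_h(tMatrices*mSize))
  !$acc enter data create(s, r, m_h)

  ! Placeholder inputs, set on the device so the kernel reads defined values.
  !$acc parallel loop present(r)
  do i = 1, n
    r(i) = 1.0d0
  enddo
  !$acc parallel loop present(m_h)
  do i = 1, tMatrices*mSize
    m_h(i) = 0.5
  enddo

  ! One thread per output element (k,j). j is the fastest-varying index,
  ! so adjacent threads touch adjacent half-precision elements of m_h.
  !$acc parallel loop gang vector collapse(2) present(s, r, m_h)
  do k = 1, tMatrices
    do j = 1, bSize
      dot = 0.0d0
      !$acc loop seq
      do i = 1, bSize
        dot = dot + r(bSize*(k-1)+i) * m_h(mSize*(k-1)+bSize*(i-1)+j)
      enddo
      s(bSize*(k-1)+j) = dot
    enddo
  enddo

  ! Copy back the first block only, as a quick sanity check.
  !$acc update self(s(1:bSize))
  write(*,*) "s(1) =", s(1)

  !$acc exit data delete(s, r, m_h)
  deallocate(s, r, m_h)
end program gemv_collapsed
```

Accumulating into a private scalar also avoids the repeated read-modify-write of s in global memory that the original inner loops perform.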

makefile

CC=pgfortran
OBJS=main.o
OPTS=-mp -tp=skylake -fast -mcmodel=medium -m64 -cpp -acc -Minfo=acc -ta=tesla:cc70

%.o: %.f90
        ${CC} ${OPTS} -c $<

all: myProgram
myProgram: ${OBJS}
        ${CC} ${OPTS} -o myProgram ${OBJS}

clean:
        rm -f *.o

Hi Peter,

“Is there a way to optimize the m_h memory access in half precision?”

No, there isn’t an equivalent to the CUDA Fortran wmmaLoadMatrix, which I think is what you’re asking about. All the cuTensor WMMA operations act on a warp, which isn’t exposed in OpenACC.

We are adding support so that several Fortran intrinsics (like matmul) use the tensor cores when they are called with device data. For example: https://www.pgroup.com/resources/docs/20.4/x86/pgi-cuda-interfaces/index.htm#cflib-tensor-oacc-host

Though you would still not have direct control over the memory access.

-Mat
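
A rough sketch of what the matmul route looks like, in case it helps. It assumes the cutensorex module described in the linked documentation and uses real(8) arrays with made-up sizes; whether real*2 operands are accepted in a given release should be checked in the documentation. The program also has to be linked against cuTENSOR, and the exact flags depend on the compiler release.

```fortran
program matmul_device
  use cutensorex   ! assumed: provides matmul for device data, per the linked docs
  implicit none
  integer, parameter :: ni = 1024, nj = 1024, nk = 1024   ! illustrative sizes
  real(8), allocatable :: a(:,:), b(:,:), d(:,:)

  allocate(a(ni,nk), b(nk,nj), d(ni,nj))
  call random_number(a)
  call random_number(b)
  d = 0.0d0

  ! OpenACC owns the device copies; host_data hands their device
  ! addresses to the matmul call inside the region.
  !$acc data copyin(a, b) copy(d)
  !$acc host_data use_device(a, b, d)
  d = d + matmul(a, b)
  !$acc end host_data
  !$acc end data

  write(*,*) "d(1,1) =", d(1,1)
  deallocate(a, b, d)
end program matmul_device
```

The data movement stays in OpenACC directives, but how the library loads the operands is still out of your hands.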

Hi Mat,

Thank you for your answer. What I was looking for was something like the “Vectorized memory access with Cray pointers” approach described in the article. But I assume this is not available in OpenACC?

Since I haven’t tried it, I don’t want to say it’s impossible to mimic something similar, but I don’t know of a way to do it. In tensorCoreVectorCP.CUF they use the Cray pointers to explicitly load data into shared memory and then use that data in a WMMA call. OpenACC is generalized so that it can be portable across different types of devices, so it doesn’t expose this level of detail for a specific type of device.

If you do want to use these features, I would suggest using CUDA Fortran instead (at least for this section). OpenACC and CUDA Fortran are interoperable, so you can mix them in the same program.

-Mat
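
As a minimal sketch of that kind of mixing: the data below is managed with OpenACC directives, and a CUDA Fortran kernel is called on the device copy through host_data. The names scale_mod, scale_kernel and scale are hypothetical, the kernel is a trivial placeholder rather than a WMMA or vectorized-load kernel, and the file needs to be built with both CUDA Fortran and OpenACC enabled (for example -Mcuda together with -acc).

```fortran
module scale_mod
  use cudafor
  implicit none
contains
  ! Placeholder CUDA Fortran kernel: one thread per element.
  attributes(global) subroutine scale_kernel(n, a, x)
    integer, value :: n
    real(8), value :: a
    real(8) :: x(n)
    integer :: i
    i = threadIdx%x + (blockIdx%x - 1)*blockDim%x
    if (i <= n) x(i) = a*x(i)
  end subroutine scale_kernel

  ! Host wrapper; the device attribute says x is already in GPU memory.
  subroutine scale(n, a, x)
    integer :: n
    real(8) :: a
    real(8), device :: x(n)
    integer :: istat
    call scale_kernel<<<(n + 255)/256, 256>>>(n, a, x)
    istat = cudaDeviceSynchronize()   ! wait before OpenACC touches x again
  end subroutine scale
end module scale_mod

program interop
  use scale_mod
  implicit none
  integer, parameter :: n = 1024
  real(8) :: x(n)
  integer :: i

  !$acc data copyout(x)
  ! Initialize on the device with OpenACC.
  !$acc parallel loop
  do i = 1, n
    x(i) = real(i, 8)
  enddo

  ! Pass the OpenACC device copy of x to the CUDA Fortran wrapper.
  !$acc host_data use_device(x)
  call scale(n, 2.0d0, x)
  !$acc end host_data
  !$acc end data

  write(*,*) "x(n) =", x(n)
end program interop
```

The same pattern would let the gemv section live in a CUDA Fortran kernel while the rest of the code stays in OpenACC.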