Hi,
I have a code that uses a batched gemv operation in mixed precision. I ported it to OpenACC and it runs fine, but I wonder whether there is a way to further optimize the memory accesses for the real*2 matrices.
I read through the section “Vectorized Memory Access” in https://www.pgroup.com/blogs/posts/cuda-fortran-tensor-cores.html but I’m not able to implement it in OpenACC.
I created a small example. In this program the matrix m_h is in half precision, whereas the vectors r and s are in double precision. When I change m_h to double precision and compare the performance to the half-precision version, I don't observe a big improvement.
Is there a way to optimize the m_h memory accesses in half precision?
Thank you for your help
main.f90
PROGRAM test
  implicit none
  real*8, allocatable :: s(:)
  real*8, allocatable :: r(:)
  real*2, allocatable :: m_h(:)   ! half precision (pgfortran extension)
  integer :: n, k, bSize, tMatrices, sIndex, mSize
  integer :: i, j

  n = 256*256*256
  bSize = 16
  mSize = bSize*bSize
  tMatrices = n/bSize

  allocate(s(n))
  allocate(r(n))
  allocate(m_h(n/bSize*mSize))

  !$acc enter data create(s,r,m_h)

  ! Initialize the inputs on the device so the gemv reads defined data
  !$acc parallel loop present(r)
  do i = 1, n
     r(i) = 1.0d0
  enddo
  !$acc parallel loop present(m_h)
  do i = 1, n/bSize*mSize
     m_h(i) = 1.0
  enddo

  ! Batched gemv: one bSize x bSize half-precision matrix per batch entry k
  !$acc parallel loop present(s,r,m_h)
  do k = 1, tMatrices
     sIndex = bSize*(k-1)
     do i = 1, bSize
        s(sIndex+i) = 0.0d0
     enddo
     do i = 1, bSize
        do j = 1, bSize
           s(sIndex+j) = s(sIndex+j) + r(sIndex+i) * &
                         m_h(mSize*(k-1) + bSize*(i-1) + j)
        enddo
     enddo
  enddo

  !$acc exit data copyout(s) delete(r,m_h)
  write(*,*) "Program end"
END PROGRAM test
makefile
CC=pgfortran
OBJS=main.o
OPTS=-mp -tp=skylake -fast -mcmodel=medium -m64 -cpp -acc -Minfo=acc -ta=tesla:cc70

all: myProgram

myProgram: ${OBJS}
	${CC} ${OPTS} -o myProgram ${OBJS}

%.o: %.f90
	${CC} ${OPTS} -c $<

clean:
	rm -f *.o myProgram