Hi,
This question is related to two older questions.
The matrix-matrix multiplication looks like:
do e=1,nel
  do k=1,n
    do j=1,n
      do i=1,n
        tmp = 0.
        do l=1,n
          tmp = tmp + D(i,l)*u(l,j,k,e) + D(j,l)*v(i,l,k,e) + D(k,l)*w(i,j,l,e)
        enddo
        m(i,j,k,e) = tmp
      enddo
    enddo
  enddo
enddo
where typically n=4-16 and nel=10-1000. We can only obtain a maximum of around 20 GFLOPS with the PGI compiler on a Tesla K20x for n=16 and nel=400.
!$acc kernels
!$acc loop gang
671:  do e=1,nel
!$acc loop gang vector
673:    do k=1,n
!$acc loop vector
675:      do j=1,n
677:        do i=1,n
              tmp = 0.
679:          do l=1,n
                tmp = tmp + D(i,l)*u(l,j,k,e) + D(j,l)*v(i,l,k,e) + D(k,l)*w(i,j,l,e)
              enddo
              m(i,j,k,e) = tmp
            enddo
          enddo
        enddo
      enddo
!$acc end kernels
1) In that case, should a "reduction" clause be added on the inner loop?
!$ACC LOOP REDUCTION(+:tmp)
do l=1,n ! serial loop, no reduction
Even when the reduction clause is added, the compiler still parallelizes the "do l=1,n" loop:
…
671, Loop is parallelizable
673, Loop is parallelizable
675, Loop is parallelizable
677, Loop is parallelizable
679, Loop is parallelizable
…
(but performance dropped from 20 GFLOPS to 13 GFLOPS)
2) The same kernel reached around 23 GFLOPS on a Fermi GPU. Is there any way to improve the performance on the K20x? I have used the compiler flag (-ta=nvidia,5.0), and other OpenACC schedules have been tested, such as
!$acc kernels
do e=1,nel
  do k=1,n
    do j=1,n
      ...
and

!$acc parallel loop collapse(4)
do e=1,nel
  do k=1,n
    ...
but none of them obtained better performance.
Thanks for your assistance.
/Jing