Performance difference with 13.xx versions

Hello, I have developed an iterative solver for a large block linear system in Fortran using OpenACC. When built with the latest versions of the pgf90 compiler, it is significantly slower than when built with, e.g., 12.7.

The main part of the code that runs on the GPU is:

!$acc loop independent vector(32)
do k=1,n
   k1=k*j
   k2=(k+1)*j
   !$acc loop independent vector(32)
   do i=1,ik
      t(k1+i)=t(k1+i)+x(k2+i)*a3(3,i)
   enddo
enddo

The compilation options are -mp -acc -ta=nvidia,cc20 -Minfo -O4 -tp=nehalem-64, for an HP SL390 machine equipped with Tesla M2070 GPUs and running the Oracle Linux 6.2 operating system.

Hi emath,

In 13.x, our engineers moved to using pinned memory by default in order to better support asynchronous data movement. While it improved many cases, we later found others where codes slowed down. The problem is that when pinned memory is deallocated, the device driver needs to synchronize the device and host to ensure all pending memory movement is complete. This can cause a slowdown. Our engineers are revamping this behavior and hope to have an improved method soon. In the meantime, you can try setting the environment variable “PGI_ACC_SYNCHRONOUS=1” to partially revert to the old behavior.
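For example, the variable can be set for a single run without touching the rest of the shell session. This is only a minimal sketch: run_sync is a hypothetical helper name, and ./solver in the usage note stands in for your own executable.

```shell
# Run a command with PGI_ACC_SYNCHRONOUS=1 set only for that command.
# run_sync is a hypothetical helper name, not part of the PGI tools.
run_sync() {
    env PGI_ACC_SYNCHRONOUS=1 "$@"
}
```

Calling `run_sync ./solver` then runs the program with the workaround enabled, while a plain `./solver` keeps the 13.x default, which makes it easy to time the two back to back.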

Note that the way to tell whether this is indeed the problem with your program is to compare the device profile information between 12.7 and 13.6. Since the freeing of the pinned memory doesn’t show up in the profile, if the profiles are about the same while the overall run time differs, then this is the problem.
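One way to make that comparison is sketched below. It assumes the runtime in both releases honors the PGI_ACC_TIME environment variable and prints its profile summary when the program exits. If the older release does not, the profile would have to be collected another way. profile_run is a hypothetical helper name, and the binary names in the usage note are placeholders.

```shell
# Run a command with profiling requested, save everything it prints
# (program output plus any profile summary) to <label>.log, and report
# the wall-clock time of the run.
# profile_run is a hypothetical helper; PGI_ACC_TIME=1 is assumed to be
# honored by the OpenACC runtime the binary was built with.
profile_run() {
    label=$1
    shift
    start=$(date +%s)
    env PGI_ACC_TIME=1 "$@" > "$label.log" 2>&1
    status=$?
    end=$(date +%s)
    echo "$label: wall-clock $((end - start)) s"
    return $status
}
```

With one binary built by each compiler, `profile_run v127 ./solver_127` followed by `profile_run v136 ./solver_136` leaves two logs to compare. Similar device times in the logs together with clearly different wall-clock times would point to the pinned memory deallocation.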

- Mat

Hi Mat,
You are right, the deallocation of pinned memory causes the problem in our implementation. With this environment variable set to 1, the performance of the 12.xx and 13.xx PGI versions is about the same.
Thank you.
Manolis