I noticed a performance drop after moving from my PC (PGI 14.2, Tesla C2070) to a cluster (PGI 14.4, K40). The reason appears to be that PGI generates code that cannot fully load the K40. For example, my code (a test example) has the following structure:
    !$acc kernels
    !$acc loop independent collapse(2) gang vector(16)
    do i=its,ite ! i loop (east-west)
      do j=jts,jte ! j loop (north-south)
On my system PGI launches the kernel with GS(5 4 1) BS(16 16 1), while with PGI 14.4 the profiler reports GS(129 1 1) BS(32 1 1). For the real data I see <<<(3872,1,1),(32,1,1),0>>>.
By changing the code as follows, I get the same GS and BS on both systems:
    !$acc kernels
    !$acc loop independent gang vector(16)
    do i=its,ite ! i loop (east-west)
      !$acc loop independent gang vector(16)
      do j=jts,jte ! j loop (north-south)
Is it correct behavior for the collapse clause to join the two nested loops into one?
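For anyone who wants to compare the two schedules, here is a minimal stand-alone sketch of both variants. It is not my real code: the program name, the arrays a and b, the loop bounds (80 x 64) and the loop body are placeholders I made up for illustration. Only the directives are the ones shown above. Building it with something like `pgfortran -acc -Minfo=accel` should make the compiler print the schedule it chose for each loop nest.

```fortran
! Hypothetical reproducer: names, bounds and loop body are placeholders,
! only the OpenACC directives match the snippets in this post.
program collapse_test
  implicit none
  integer, parameter :: its = 1, ite = 80   ! assumed east-west bounds
  integer, parameter :: jts = 1, jte = 64   ! assumed north-south bounds
  real :: a(its:ite, jts:jte), b(its:ite, jts:jte)
  integer :: i, j

  a = 0.0
  b = 0.0

  ! Variant 1: a single loop directive with collapse(2).
  !$acc kernels
  !$acc loop independent collapse(2) gang vector(16)
  do i = its, ite          ! i loop (east-west)
    do j = jts, jte        ! j loop (north-south)
      a(i, j) = real(i) + 0.5 * real(j)
    end do
  end do
  !$acc end kernels

  ! Variant 2: a separate loop directive on each loop, no collapse.
  !$acc kernels
  !$acc loop independent gang vector(16)
  do i = its, ite          ! i loop (east-west)
    !$acc loop independent gang vector(16)
    do j = jts, jte        ! j loop (north-south)
      b(i, j) = real(i) + 0.5 * real(j)
    end do
  end do
  !$acc end kernels

  ! Both variants perform the same arithmetic, so the results are compared
  ! element by element as a basic sanity check.
  if (all(a == b)) then
    print *, 'variants agree'
  else
    print *, 'variants differ'
  end if
end program collapse_test
```

Without the -acc flag the directives are treated as plain comments, so the same file also builds as an ordinary serial program, which is handy for checking the loop body itself.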
The second question is related to the first one…
The approach above works fine for three of the four kernels. For the remaining kernel the compiler reports that code was generated, but at run time I get the error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution