How does PGI manage the collapse clause?

Hi!

I’ve noticed a performance downgrade when I moved from my PC (PGI 14.2, Tesla C2070) to a cluster (PGI 14.4, K40). The reason is that PGI generates code which is unable to fully load the K40. For example, in my code (a test example) I have the following structure:

!$acc kernels
!$acc loop independent collapse(2) gang vector(16)
   do i=its,ite      ! i loop (east-west)
   do j=jts,jte      ! j loop (north-south)

On my system PGI manages to launch the kernel with GS(5 4 1) BS(16 16 1), while with PGI 14.4 the profiler reports GS(129 1 1) BS(32 1 1). For the real data I see <<<(3872,1,1),(32,1,1),0>>>.

By changing the code I get the same GS and BS on both systems:

!$acc kernels
!$acc loop independent gang vector(16)
   do i=its,ite      ! i loop (east-west)
!$acc loop independent gang vector(16)
   do j=jts,jte      ! j loop (north-south)

Is it correct behavior for the collapse clause to join two nested loops into one?

The second question is related to the first one…

The above-mentioned approach works fine for three of the four kernels. For one kernel the compiler reported that code was generated, but at run time I see the error:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Alexey

Hi Alexey,

In 14.4 we did update the behavior of “collapse” to better match the OpenACC standard and be more consistent with the behavior of other compilers. Collapse joins multiple loops and then applies the given schedule. Previously, we would have ignored the collapse clause in this case since using multi-dimensional blocks is the better schedule here.
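
In other words, with collapse(2) the two loops are now treated as one loop over all (i,j) pairs, and the “gang vector(16)” schedule is applied to that single loop, which is why you get a 1-D launch configuration. Here is a rough hand-written sketch of what the collapsed loop corresponds to (the bounds and the array “a” are made-up placeholders, not taken from your code):

program collapse_sketch
   implicit none
   integer, parameter :: its = 1, ite = 73, jts = 1, jte = 60   ! made-up bounds
   integer :: i, j, ij, nj
   real :: a(its:ite, jts:jte)                                  ! placeholder array
   nj = jte - jts + 1
!$acc kernels
!$acc loop independent gang vector(16) private(i,j)
   do ij = 0, (ite-its+1)*nj - 1       ! single linearized loop over all (i,j)
      i = its + ij / nj                ! recovered i (east-west) index
      j = jts + mod(ij, nj)            ! recovered j (north-south) index
      a(i,j) = real(i + j)             ! placeholder body
   end do
!$acc end kernels
   print *, a(its,jts), a(ite,jte)
end program collapse_sketch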

You can try using “worker” to create a 2-D thread block:

!$acc kernels 
!$acc loop independent collapse(2) gang worker(16) vector(16) 
   do i=its,ite      ! i loop (east-west) 
   do j=jts,jte      ! j loop (north-south)
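
With that schedule, vector(16) should map to the x-dimension of the thread block and worker(16) to the y-dimension, so you should get a 16x16 block again, similar to the BS(16 16 1) you were seeing with 14.2 (I haven’t verified the exact grid dimensions the compiler picks for your loop bounds).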

For the illegal address error, I would try running the program under cuda-memcheck and/or the DDT debugger to see if they shed light on the issue.

  • Mat

Hi Mat,

changing the acc directive resulted in an error. See pgroup.com/track #443

Sometimes it’s impossible to use cuda-memcheck. Whether it is a problem in cuda-memcheck itself, or PGI inserts something that prevents cuda-memcheck from inspecting the code, is an open question.
I see a lot of messages like:

========= Program hit error 1 on CUDA API call to cuPointerGetAttribute
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so (cuPointerGetAttribute + 0x1b5) [0x13a9e5]
=========     Host Frame:.....

Referring to the kernel itself, I really do see errors like:

========= Invalid __global__ read of size 4
=========     at 0x00003e78 in morr_two_moment_micro_1922_gpu
=========     by thread (3,12,0) in block (4,2,0)
=========     Address 0x23038fffe8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:

I understand that if we write explicit gang/worker/vector parameters + collapse, these explicit settings would apply to the aggregated loop.
If we don’t write explicit gang/worker/vector parameters, though, it looks like you are taking what would have been the default parameters for the outer loop in isolation, and then applying them to the aggregated loop.
Is this the intended behavior?
I would expect the compiler to measure the size of the aggregated loop first and then generate the default parameters.