About gang and worker

The code is as follows:
!$acc kernels
DO j=jtf,jtg
  DO k=kts,kte
    DO i=its,ite
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels


The compiler feedback is as follows:
3420, Loop is parallelizable
3421, Loop is parallelizable
3422, Loop is parallelizable
Accelerator kernel generated
3420, Cached references to size [(x+1)x(y)] block of ‘u0’
3421, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
3422, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 1.0 : 26 registers; 112 shared, 8 constant, 0 local memory bytes
CC 2.0 : 22 registers; 0 shared, 124 constant, 0 local memory bytes


My questions are:
1. Is the parallelization of the J loop at the gang level? Why did the compiler not allocate worker-level parallelism?

2. From line 3420, what does “Cached references to size [(x+1)x(y)] block of ‘u0’” mean? Is this the result of the !$acc cache(u0) directive?

Hi Telsalady,

From the compiler feedback messages, it appears to me that the compiler has built the schedule out of the k and i loops and left the j loop to run sequentially inside the kernel. The reasoning is that this loop nest doesn’t have much work, so by serializing j you increase the amount of work each kernel instance does. Especially since the code utilizes cached memory, this helps increase the computational intensity.

While not guaranteed, the compiler usually does a good job of finding an optimal schedule. However, you can use loop directives to override the compiler’s default schedule if you want to try other ones.
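For example, here is a minimal sketch (reusing the loop nest above) of specifying a schedule by hand inside the kernels region; the vector lengths here are illustrative choices, not recommendations:

!$acc kernels
DO j=jtf,jtg
  !$acc loop gang vector(4)
  DO k=kts,kte
    !$acc loop gang vector(128)
    DO i=its,ite
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels

The compiler feedback will report whichever schedule it actually generates, so it is worth re-checking the messages after changing the directives.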

Why did the compiler not allocate worker-level parallelism?

The “worker” construct on an NVIDIA device corresponds to the warp. The warp size is fixed at 32, so it can’t be changed.
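If you do want worker-level parallelism to appear in the schedule, you can request it explicitly with loop clauses. A minimal sketch, where the worker and vector sizes are only illustrative and the final mapping to CUDA threads is still up to the compiler:

!$acc kernels
!$acc loop gang
DO j=jtf,jtg
  !$acc loop worker(4)
  DO k=kts,kte
    !$acc loop vector(32)
    DO i=its,ite
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels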

2. From line 3420, what does “Cached references to size [(x+1)x(y)] block of ‘u0’” mean? Is this the result of the !$acc cache(u0) directive?

This is the compiler auto-detecting where to apply caching. You could use the cache directive, but the PGI compiler will automatically find these opportunities.
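If you want to place it yourself, the cache directive goes at the top of the loop body whose references you want cached. A hedged sketch for the loop above, where the subarray bounds are chosen to cover both the i and i-1 references:

!$acc kernels
DO j=jtf,jtg
  DO k=kts,kte
    DO i=its,ite
      !$acc cache(u0(its-1:ite,k,j))
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels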

  • Mat

Thanks Mat!

I am a bit confused about the copy clause. Does the copy clause copy data from the host to the GPU’s global memory?

Thanks again

The “copy” clause allocates memory on the device and then copies the data to the device’s global memory at the start of the region (data or compute). At the end of the region, the data is copied back to the host and the device memory is deallocated.

The “copyin” clause copies the data to the device but does not copy it back to the host, while the “copyout” clause only copies the data back to the host and does not copy it to the device.

The “create” clause only allocates and deallocates the device memory; it does not perform any copies.

The “update” directive can be used within a data region to copy data to or from the device at specific points in your program.
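Here is a minimal, self-contained sketch pulling these clauses together; the array names and sizes are made up for illustration, not taken from the code above:

PROGRAM data_clauses
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1024
  REAL :: a(n), b(n), c(n), tmp(n)
  INTEGER :: i

  a = 1.0
  b = 0.0

  ! copyin: host -> device only; copy: both directions; copyout: device -> host
  ! only; create: device allocation with no copies in either direction.
  !$acc data copyin(a) copy(b) copyout(c) create(tmp)
    !$acc kernels
    DO i = 1, n
      tmp(i) = 2.0*a(i)        ! tmp exists only on the device (create)
      b(i)   = b(i) + tmp(i)   ! b was copied in and is copied back (copy)
      c(i)   = tmp(i)          ! c is only copied back at region exit (copyout)
    ENDDO
    !$acc end kernels
    ! update host(b) refreshes the host copy mid-region without ending it.
    !$acc update host(b)
    PRINT *, 'b(1) after update = ', b(1)
  !$acc end data
END PROGRAM data_clauses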

  • Mat