OpenACC directive "update" does not work properly

Hi,

I am stuck on some strange compiler behavior. I have Fortran code that looks like the following:

!$acc data … create(my_array)
!$acc update device(my_array)
do ibin=1,nbins
!$acc parallel loop gang
do j=j1,j2
!$acc loop worker
do k=k1,k2
call myroutine(my_array(:,j,k,ibin),…)

end do
end do
!$acc end parallel
!$acc update self(my_array(:,:,:,ibin)) async(ibin)
end do
!$acc wait
!$acc end data

My problem appears when I force the compiler to inline myroutine. In that case, the compilation log shows that the update to the host covers the whole of my_array, not just the slice I am asking for. The log is:
470, Generating update self(my_array(:,:,:,:))
The profiler also shows that the device-to-host data copy is what I would expect if I had asked for the whole array.

However, if myroutine is not inlined, the log is:
470, Generating update self(my_array(:,:,:,ibin))
and the execution then goes fine.

Is there some way to force the compiler to generate the update exactly as I am asking for? Or do you have any other suggestions?
By the way, I am using PGI 19.9.

Thank you,
Natalia.

Hi Natalia,

Would it be possible to get a reproducing example?

This doesn’t quite make sense to me as to what would cause this (especially since the update directive is outside of the loop where the routine is inlined), and there’s not enough information here to determine the cause.

Thanks,
Mat

Sure thing, Mat.
I have been working on this dummy example to reproduce the problem.

MODULE ADS
integer,public :: ips,ipe,jps,jpe,kps,kpe
public:: myroutine

CONTAINS 
subroutine myroutine(output,input,my_pe,my_ps,ix,iy)
  !$acc routine vector
  implicit none
  real,intent(OUT) :: output(my_ps:my_pe)
  real,intent(IN) :: input(my_ps:my_pe)
  integer,intent(IN)::my_ps
  integer,intent(IN)::my_pe
  integer,intent(IN)::ix
  integer,intent(IN)::iy
  integer     :: i

  !$acc loop vector
  do i=my_ps+2,my_pe-2
    output(i) = input(i) + 3*ix+iy
  end do


end subroutine myroutine
END MODULE ADS

program Test
  use ADS
  implicit none

  integer       :: nbins, ibin,i,j,k
  real,allocatable:: image(:,:,:,:)
  real,allocatable:: aux(:)
  real :: total
  
  nbins = 6
  
  ips=10
  ipe=120
  jps=ips
  jpe=ipe
  kps=40
  kpe=240
  total=0
  
  allocate(image(ips:ipe,jps:jpe,kps:kpe,1:nbins))
  allocate(aux(ips:ipe))

  !$acc data create(image) copy(total)
  do ibin=1,nbins
    
    !$acc parallel loop gang reduction(+:total)
    do k=kps+2,kpe-2
      !$acc loop worker private(aux)
      do j=jps+2,jpe-2

        call myroutine(aux,image(:,j,k,ibin),ips,ipe,j,k)
        !$acc loop vector
        do i=ips+2,ipe-2
          image(i,j,k,ibin) = image(i,j,k,ibin) + aux(i)
        end do
      end do
      total=total+image(ips+20,ips+20,k,ibin)
    end do
    !$acc update self(image(:,:,:,ibin)) async(ibin)
  end do
  !$acc wait
  !$acc end data

  deallocate(image)
  deallocate(aux)
end program Test

The log when myroutine is not inlined is:

pgf90 -ta=tesla:cc70,cuda10.0  -fast  -Minfo=accel,inline -o test.bin test.f90
myroutine:
      6, Generating Tesla code
         18, !$acc loop vector ! threadidx%x
     18, Loop is parallelizable
test:
     48, Generating copy(total) [if not already present]
         Generating create(image(:,:,:,:)) [if not already present]
     51, Generating Tesla code
         51, Generating reduction(+:total)
         52, !$acc loop gang ! blockidx%x
         54, !$acc loop worker(4) ! threadidx%y
         58, !$acc loop vector(32) ! threadidx%x
     54, Loop is parallelizable
     58, Loop is parallelizable
     64, Generating update self(image(:,:,:,ibin))

However, when inlining is forced, we get:

pgf90 -ta=tesla:cc70,cuda10.0  -fast  -Minfo=accel,inline -Minline:reshape,name:myroutine -o test.bin test.f90
myroutine:
      6, Generating Tesla code
         18, !$acc loop vector ! threadidx%x
     18, Loop is parallelizable
test:
     48, Generating create(image(:,:,:,:)) [if not already present]
         Generating copy(total) [if not already present]
     51, Generating Tesla code
         51, Generating reduction(+:total)
         52, !$acc loop gang ! blockidx%x
         54, !$acc loop worker(4) ! threadidx%y
             Vector barrier inserted to share data across vector lanes
         56, !$acc loop vector(32) ! threadidx%x
             Vector barrier inserted due to potential dependence into a vector loop
             Vector barrier inserted due to potential dependence out of a vector loop
         58, !$acc loop vector(32) ! threadidx%x
             Vector barrier inserted due to potential dependence out of a vector loop
     54, Loop is parallelizable
     56, myroutine inlined, size=10, file test.f90 (6)
          56, Loop is parallelizable
     58, Loop is parallelizable
     64, Generating update self(image(:,:,:,:))

Cheers,
Natalia

Thanks Natalia!

While I needed to fix a couple of issues in your example, I was able to reproduce the error. I’ve not seen this before, and it’s unclear what’s going on. It’s most likely a compiler error, so I’ve added a problem report (TPR #28435) and sent it to our engineers for investigation. Unfortunately, besides not inlining, I don’t have a workaround for you.

Note that in the example, you’ve accidentally passed “ips” and “ipe” in the wrong order when calling myroutine (the dummy arguments are declared as my_pe, my_ps).

After fixing this, I then get an out-of-memory error due to the privatization of “aux”. To work around that, I modified the code to remove the worker loop and instead collapse the two outer loops, making “aux” private at the gang level. Even without the memory error, I would generally recommend this change, since gang-private arrays are likely to be stored in shared memory rather than main memory.

Finally, I’d recommend using only a few async queues, rather than giving each iteration of the “bin” loop its own queue. There’s overhead in creating queues, which can outweigh the benefit of interleaving the data movement, especially if each queue is only used once.

Here’s my version:

MODULE ADS
integer,public :: ips,ipe,jps,jpe,kps,kpe
public:: myroutine

CONTAINS
subroutine myroutine(output,input,my_pe,my_ps,ix,iy)
  !$acc routine vector
  implicit none
  real,intent(OUT) :: output(my_ps:my_pe)
  real,intent(IN) :: input(my_ps:my_pe)
  integer,intent(IN)::my_ps
  integer,intent(IN)::my_pe
  integer,intent(IN)::ix
  integer,intent(IN)::iy
  integer     :: i

  !$acc loop vector
  do i=my_ps+2,my_pe-2
    output(i) = input(i) + 3*ix+iy
  end do


end subroutine myroutine
END MODULE ADS

program Test
  use ADS
  implicit none

  integer       :: nbins, ibin,i,j,k,queue
  real,allocatable:: image(:,:,:,:)
  real,allocatable:: aux(:)
  real :: total

  nbins = 6

  ips=10
  ipe=120
  jps=ips
  jpe=ipe
  kps=40
  kpe=240
  total=0

  allocate(image(ips:ipe,jps:jpe,kps:kpe,1:nbins))
  allocate(aux(ips:ipe))

  !$acc data create(image) copy(total)
  do ibin=1,nbins
    queue = mod(ibin,2)
    !$acc parallel loop gang collapse(2) private(aux) reduction(+:total) async(queue)
    do k=kps+2,kpe-2
      do j=jps+2,jpe-2

        call myroutine(aux,image(:,j,k,ibin),ipe,ips,j,k)
        !$acc loop vector
        do i=ips+2,ipe-2
          image(i,j,k,ibin) = image(i,j,k,ibin) + aux(i)
        end do
      end do
      total=total+image(ips+20,ips+20,k,ibin)
    end do
    !$acc update self(image(:,:,:,ibin)) async(queue)
  end do
  !$acc wait
  !$acc end data

  deallocate(image)
  deallocate(aux)
end program Test

-Mat

Hi Mat,

thank you very much for trying. I suspected it was a compiler problem, but you can never know for sure. And sorry for the bugs; I only made sure the example compiled.

About collapsing the loops: in the original code I had other problems with that approach, namely numerical errors. My workaround was the worker loop, since I had previously found that using workers also improved performance. What is strange is that I do not get an out-of-memory error, even though I am using arrays of similar size.

On the other hand, I am already using only a few async queues in the original code. Anyhow, thank you for the advice.

Cheers.
Natalia.

Hi Mat,

something came up while using asynchronous queues, so I have a further question for you. With a larger array to be updated (“image” in the code example), it seems that the update is done in several data-chunk transfers, and the computation overlaps with only the last of them.

You can find a profiler snapshot in:
https://drive.google.com/open?id=1i5ItWjYEOzwBUWO3IrzyX3TvBS9wSm06

How can I fix this issue?

Cheers.
Natalia.

Hi Natalia,

While in theory the “update self” call should be completely asynchronous, implementing it that way proved too much of a challenge. There wasn’t a way to have the OS call back the runtime to “know” when the data transfer was done, so all but the last buffer transfer is blocking. (By default, data transfers are performed via a double-buffering system; the “holes” between the copies are the pinned-buffer-to-virtual-memory copies.) Asynchronously copying to the device is simpler, since we can take advantage of CUDA streams.

Things to try:

  1. Compile with “-ta=tesla:pinned” so memory is allocated in pinned (physical) memory and the double buffers aren’t needed.
  2. Adjust the size of the transfer buffers via the environment variable PGI_ACC_BUFFERSIZE so they are big enough to fit the size of the image slice.
  3. Delay copying back image until the end, as one large block (see the sketch after this list).
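
For item 3, here is a minimal sketch of what that could look like, reusing the loop structure from the version above: the per-bin “update self” is dropped and a single update of the whole array is issued after the wait. (For item 2, note that each image(:,:,:,ibin) slice in this example is 111 x 111 x 201 reals, i.e. about 9.9 MB, so the buffer would need to be at least that large.)

  !$acc data create(image) copy(total)
  do ibin=1,nbins
    queue = mod(ibin,2)
    ! Kernels still run asynchronously on a couple of queues
    !$acc parallel loop gang collapse(2) private(aux) reduction(+:total) async(queue)
    do k=kps+2,kpe-2
      do j=jps+2,jpe-2
        call myroutine(aux,image(:,j,k,ibin),ipe,ips,j,k)
        !$acc loop vector
        do i=ips+2,ipe-2
          image(i,j,k,ibin) = image(i,j,k,ibin) + aux(i)
        end do
      end do
      total=total+image(ips+20,ips+20,k,ibin)
    end do
    ! No per-bin "update self" here
  end do
  !$acc wait
  ! One large device-to-host transfer instead of nbins slice transfers
  !$acc update self(image)
  !$acc end data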

-Mat

Hi Mat,

I see. Actually, dealing with pinned memory was my next step. About that, I read somewhere that in order to limit the use of pinned memory to just some arrays (with the CUDA Fortran “pinned” attribute) we need to use the flag when linking the executable. Is that right?
As far as I understand, this solution would mitigate the “holes”, but not the transfer buffering, so I could combine it with tuning PGI_ACC_BUFFERSIZE. Is there any drawback to that?
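
Just so we are on the same page, this is the kind of declaration I have in mind (only a sketch, I have not tried it yet):

  ! Pin only the array that is copied back to the host, via the CUDA Fortran
  ! "pinned" attribute, instead of pinning everything with -ta=tesla:pinned.
  ! (I assume this needs CUDA Fortran enabled when building, e.g. -Mcuda.)
  real,allocatable,pinned :: image(:,:,:,:)
  real,allocatable        :: aux(:)   ! stays in ordinary pageable memory
  …
  allocate(image(ips:ipe,jps:jpe,kps:kpe,1:nbins))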

Cheers.