Async memory transfer and results problem

Dear Nvidia users, I’m trying to overlap computation and memory transfer of the following code:

      subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)
      real, value :: c1
      integer, value :: n
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end

!$OMP TARGET DATA MAP(to:xx) MAP(from:b,bbar)
!$OMP TARGET UPDATE TO(bb) depend(out:xbar) nowait      
  do k = 2,m
     call add2s2_omp(xbar,xx(:,k),alpha(k),n)
  end do

!$OMP TARGET UPDATE FROM(xbar) nowait
  do k = 2,m
     call add2s2_omp(bbar,bb(:,k),alpha(k),n)
  end do

!$OMP TARGET UPDATE FROM(xbar,xx) depend(in:xbar) nowait
  do k = 2,m
     call add2s2_omp(b,bb(:,k),-alpha(k),n)
  end do

!$OMP END TARGET DATA

print *, xbar(1),bbar(1),b(1)

The problem is that the results are bad. This should be the correct result:

7.399999 5.980000 -0.4800002

Instead I get:

7.399999 5.980000 -3.780000

So the results for "b" are wrong. I don't understand the reason. Maybe I have to enable pinned memory? How can I enable pinned memory using OpenMP offload? The code example is attached. Thanks.

add2s2_omp.f (2.1 KB)

The wrong answers are due to your algorithm needing the values of "a" to be initialized, but the "from" clause does not initialize device data. To fix this, you want to use "tofrom".
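A minimal C sketch of the same accumulation pattern (hypothetical values and names; the routine in the thread is Fortran) illustrating why "tofrom" is needed: the kernel reads the previous contents of "a", so "a" must be copied to the device before the kernel runs, which map(from:) does not do.

```c
#include <stdio.h>

/* add2s2: a(i) = a(i) + c1*b(i), the accumulation from the post.
   Because the kernel reads a's previous contents, map(tofrom: a)
   is required; map(from: a) would leave the device copy of "a"
   uninitialized before the kernel runs. */
void add2s2(double *a, const double *b, double c1, int n)
{
    #pragma omp target teams distribute parallel for \
        map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; i++)
        a[i] = a[i] + c1 * b[i];
}
```

For example, with a = {1, 1}, b = {2, 4}, and c1 = 0.5, add2s2 leaves a = {2, 4} only if the initial contents of "a" actually reached the device.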

Though, to interleave data movement with compute on the device, I would suggest something like the following. In this case you won't see much improvement, given that the compute time of the kernel is so small there is little to overlap.

!$OMP TARGET DATA MAP(alloc:xx,bb,b,bbar,xbar)
!$OMP TARGET UPDATE TO(xx,xbar)
!$OMP TARGET UPDATE TO(bb,b,bbar) nowait
      do k = 2,m
         call add2s2_omp(xbar,xx(:,k),alpha(k),n)
      end do
!$OMP TARGET UPDATE FROM(xbar) nowait
      do k = 2,m
         call add2s2_omp(bbar,bb(:,k),alpha(k),n)
      end do

!$OMP TARGET UPDATE FROM(bbar) nowait
      do k = 2,m
         call add2s2_omp(b,bb(:,k),-alpha(k),n)
      end do
!$OMP TARGET UPDATE FROM(b)
!$OMP END TARGET DATA
% nvfortran -mp=gpu add2s2_omp.f -fast -Minfo=mp -V21.9
add2s2_omp:
     10, !$omp target teams loop
         10, Generating "nvkernel_foo_add2s2_omp__F1L10_1" GPU kernel
             Generating Tesla code
           11, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
         10, Generating Multicore code
           11, Loop parallelized across threads
add2s2_omp:
     54, Generating map(alloc:xx(:,:),xbar(:),b(:),bb(:,:),bbar(:))
     55, Generating update to(xbar(:),xx(:,:))
     56, Generating update to(bbar(:),bb(:,:),b(:))
     60, Generating update from(xbar(:))
     65, Generating update from(bbar(:))
     69, Generating update from(b(:))
% a.out
    7.399999        5.980000      -0.4800002

Hi Mat, thanks for the reply. Your solution is not totally clear to me (though it works well).

Take this line:

!$OMP TARGET UPDATE TO(bb,b,bbar) nowait

How can it be guaranteed that bbar is updated on the GPU before the second kernel is launched? The same question applies to "b" and the last kernel. I usually use a depend clause, but none are present in your solution.
Thanks.

How can it be guaranteed that bbar is updated on the GPU before the second kernel is launched?

The depend clauses would go on the target teams region. Though, while we accept the syntax, we don't yet support depend for the GPU. That support is being added and will be available soon.
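As a sketch of where those depend clauses would go once supported (illustrative names and sizes, not the poster's actual arrays): the producing update carries depend(out:...), and the consuming target region carries depend(in:...), so the runtime can order them even though both are nowait.

```c
#include <stdio.h>

#define N 8
double x[N], y[N];

/* Hypothetical sketch: once depend is supported on target constructs,
   the asynchronous upload of x signals completion through depend(out:x)
   and the kernel that reads x waits on it via depend(in:x), despite
   both being nowait. Without an OpenMP compiler the pragmas are
   ignored and this runs serially on the host. */
void pipeline(void)
{
    #pragma omp target data map(alloc: x, y)
    {
        /* asynchronous upload of x; completion signalled through x */
        #pragma omp target update to(x) depend(out: x) nowait

        /* kernel ordered after the upload before reading x */
        #pragma omp target teams distribute parallel for \
            depend(in: x) depend(out: y) nowait
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i];

        /* asynchronous download of y, ordered after the kernel */
        #pragma omp target update from(y) depend(in: y) nowait

        #pragma omp taskwait  /* drain all outstanding device tasks */
    }
}
```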

Hi Mat,

I'm a bit confused. In what sense is depend not supported for GPUs? Can it only be used with tasks? The following is an example taken from the NVIDIA lesson "OpenMP on GPUs, First Experiences and Best Practices":

#pragma omp target data map(alloc:image[0:WIDTH*HEIGHT])
for(block = 0; block < num_blocks; block++ ) {
    int start = block * (HEIGHT/num_blocks);
    int end = start + (HEIGHT/num_blocks);

#pragma omp target teams distribute \
    parallel for simd collapse(2) \
    depend(inout:image[block*block_size]) nowait
    for(int y=start;y<end;y++) {
        for(int x=0;x<WIDTH;x++) {
            image[y*WIDTH+x]=mandelbrot(x,y);
        }
    }

#pragma omp target update from(image[block*block_size:block_size]) \
    depend(inout:image[block*block_size]) nowait
}
#pragma omp taskwait

Here I see target regions, updates, and depend clauses used together. The following is what I would like to do, but you said it is not supported at the moment, if I understand correctly:

int a;
#pragma omp target data map(to: A[:S], B[:S]) map(from: C[:S])
{
#pragma omp target update to(C[:S]) depend(out: a) nowait
#pragma omp target teams distribute parallel for
for (...) {
    ... // Parallel work involving A and B
}
#pragma omp target update from(A[:S]) nowait
#pragma omp target teams distribute parallel for depend(in: a)
for (...) {
    ... // Parallel work involving B and C
}
}

Although, I still don't understand how your solution works: how are the asynchronous transfers guaranteed to be finished before the buffers involved are used? Is it the runtime system that guarantees the consistency? Thanks.

It's not that depend can't support GPUs; rather, we're still in the process of adding depend support for GPUs. It's nearing completion, but not yet available in a released compiler.

Although, I still don't understand how your solution works: how are the asynchronous transfers guaranteed to be finished before the buffers involved are used? Is it the runtime system that guarantees the consistency?

In this case, it's luck and timing. Once depend is fully supported and the CUDA stream dependencies are created, then there can be a guarantee. While I don't claim to know the details, my understanding is that since "depend" is associated with a task, it's difficult to leverage it to use CUDA streams. Basically, each task will need its own stream, leading to higher overhead due to the extra stream-creation cost. Though, I'll reserve judgment until we get the support in place. OpenACC does a better job here since there's a one-to-one mapping between an async queue and a CUDA stream.
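For comparison, a sketch (illustrative names, not from the thread) of the OpenACC approach mentioned above: each async queue maps one-to-one onto a CUDA stream, and operations placed on the same queue execute in order, which is exactly the ordering guarantee the OpenMP nowait version currently lacks.

```c
#include <stdio.h>

#define N 8
double u[N], v[N];

/* Illustrative OpenACC analogue: async(1) names one queue, which maps
   one-to-one onto a CUDA stream. Operations on the same queue run in
   order, so the upload, kernel, and download are serialized with
   respect to each other while remaining asynchronous to the host.
   Without an OpenACC compiler the pragmas are ignored and this runs
   serially on the host. */
void acc_pipeline(void)
{
    #pragma acc data create(u, v)
    {
        #pragma acc update device(u) async(1)
        #pragma acc parallel loop present(u, v) async(1)
        for (int i = 0; i < N; i++)
            v[i] = 2.0 * u[i];
        #pragma acc update self(v) async(1)
        #pragma acc wait(1)   /* host blocks until queue 1 drains */
    }
}
```

Putting independent work on different queue numbers is what produces overlap; keeping a transfer and its consumer on the same queue is what produces ordering.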