Questions about omp offload and memory transfer

Dear NVIDIA users, I have some doubts about memory transfer from/to the GPU when using OpenMP offload.

Suppose I have a matrix A on the host and it is passed to a target region. If I understand correctly, the TOFROM clause is implicit if I don't specify anything, so A is transferred automatically to the GPU. If another target region encounters the same matrix, it is not transferred again because it is already present on the GPU. Now:

  1. When is the matrix copied FROM the GPU? When it is modified on the GPU itself, or each time the target region ends?
  2. What happens if A is modified on the host side after the target region is done? Is it updated on the GPU at the next target region?
  3. What happens when the GPU memory is full or has little free space? Does TOFROM still work? Is old data cleaned up?

I ask because I have a very large, memory-bound code on the GPU, and it uses the implicit TOFROM everywhere. Maybe I can define the memory usage better.

Thanks.

No, it would only not be transferred again if A was in a target data region. Implicit data transfer on a target compute region would occur for each compute region.

Target compute regions have an implicit target data region, so my answers below are the same whether you use an explicit or implicit target data region.

  1. When is the matrix copied FROM the GPU? When it is modified on the GPU itself, or each time the target region ends?

Upon entry into a data region, the space for the data is allocated on the device. For data mapped with “TO”, the data is then copied to the device at the start of the data region. For data mapped with “FROM”, data is copied back at the end of the region. “TOFROM” is copied both at the beginning and end of the region. At the end of the region, the device data is then deallocated.
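
For illustration, here is a minimal sketch of these semantics (A, B, and C are just placeholder arrays of size n):

!$omp target data map(to:A) map(from:B) map(tofrom:C)
   ! On entry: device space is allocated for A, B, and C;
   ! A and C are copied host->device, B is not.
!$omp target teams loop
   do i = 1, n
      B(i) = A(i) + C(i)
      C(i) = 2.0*C(i)
   enddo
!$omp end target data
   ! On exit: B and C are copied device->host, A is not,
   ! and all three are then deallocated on the device.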

  2. What happens if A is modified on the host side after the target region is done? Is it updated on the GPU at the next target region?

It is not implicitly updated so you can have a scenario where the host and device copies of the data are not in sync. To synchronize data between the host and device within an explicit data region, you’d use an “update” construct.
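
For example, a minimal sketch (A is a placeholder array mapped by the region):

!$omp target data map(tofrom:A)
   ! ... compute regions using A on the device ...
   A(1) = 0.0                  ! A modified on the host inside the data region
!$omp target update to(A)      ! push the host copy to the device
   ! ... more compute regions ...
!$omp target update from(A)    ! pull the device copy back to the host
!$omp end target data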

  3. What happens when the GPU memory is full or has little free space? Does TOFROM still work? Is old data cleaned up?

If you’re device is out of memory, the program will fail when allocating new data on the device. No, old data is not implicitly “cleaned”.

While not explicitly part of OpenMP, our compiler does allow for the use of CUDA Unified Memory, enabled via the flag "-gpu=managed". The flag will cause all dynamically allocated memory to be placed in UM, where the CUDA driver manages the data movement for you. (Note that static memory still needs to be explicitly managed.) Data is copied at page granularity on demand from either the host or device side and is kept in sync automatically. It also allows you to oversubscribe the device memory, so you can use more memory than is available on the device. The caveat is that performance can degrade if the program "ping-pongs" data back and forth between the device and host. The ideal case is for the program to copy its data once to the device, do all the computation, then copy it back once at the end.
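
For illustration, a minimal sketch of what this looks like (the array and size n are placeholders; the compile line in the comment assumes nvfortran with OpenMP offload enabled via -mp=gpu):

  ! Compile with something like: nvfortran -mp=gpu -gpu=managed prog.f90
  real, allocatable :: a(:)
  allocate(a(n))          ! dynamic allocation lands in Unified Memory
  a = 1.0
!$omp target teams loop
  do i = 1, n
     a(i) = 2.0*a(i)      ! pages migrate to the device on demand
  enddo
  print *, a(1)           ! host access; pages migrate back as needed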

-Mat

Hi Mat,

thanks for your reply. So in a code like this:

  do k = 2,m
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

And the target region is:

  subroutine add2s2_omp(a,b,c1,n)
  real a(n),b(n)
!$OMP TARGET TEAMS LOOP
  do i=1,n
      a(i)=a(i)+c1*b(i)
  enddo
  return
  end

It means that each time add2s2_omp is called, xbar, bbar, b, alpha, xx, and bb are allocated, copied to the GPU, and deallocated? If so, that is very inefficient. Is it possible to keep xx and bb in GPU memory and deallocate them only at the end of the loop?

It is not implicitly updated so you can have a scenario where the host and device copies of the data are not in sync. To synchronize data between the host and device within an explicit data region, you’d use an “update” construct.

But if the matrices are deallocated at the end of the target region, how does such a construct work? Maybe matrices named in an "update" are not automatically deallocated?

Thanks.

It means that each time add2s2_omp is called, xbar, bbar, b, alpha, xx, and bb are allocated, copied to the GPU, and deallocated?

Correct.

If so, that is very inefficient. Is it possible to keep xx and bb in GPU memory and deallocate them only at the end of the loop?

Yes, it is inefficient. Hence you'd want to use a target data region. Something like:

!$omp target data map(tofrom:xbar,xx,bbar,bb,b)
  do k = 2,m
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo
!$omp end target data

But if the matrices are deallocated at the end of the target region, how does such a construct work?

In a target compute region, it wouldn't. The target update directive works within a target data region to synchronize data between the start and end of the region.

!$omp target data map(tofrom:xbar,xx,bbar,bb) map(to:b)
  do k = 2,m
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo
!$omp target update from(b)
print *, b(1)
!$omp end target data

Hope this helps,
Mat

Hi Mat,

yes, it helps very much.

The ideal case is for the program to copy its data once to the device, do all the computation, then copy it back once at the end.

I agree, but having many MPI calls I have to copy some data back, make the MPI calls, and return to the GPU. In fact, the -gpu=managed flag degrades performance by 10%.

You should investigate using CUDA Aware MPI, in which case the data is transferred directly between the devices. The OpenMPI we ship with the compilers has this enabled by default.

To use it, you just need to pass the device data to your MPI send and receive calls. For OpenACC, the easiest way to do this is to add a "host_data" region around the calls. "host_data" basically has the runtime use the device pointer on the host within the region. Something like:

!$acc host_data use_device(sendbuf)
    call MPI_send(..., sendbuf,...)
!$acc end host_data

I haven't used this feature with OpenMP yet, but I believe the syntax would be:

!$omp target data use_device_ptr(sendbuf)
   call MPI_send(..., sendbuf, ...)
!$omp end target data

In fact, the -gpu=managed flag degrades performance by 10%.

Not unexpected. UM is a convenience feature and makes the initial port much easier. In most cases using UM is performance neutral, but in some cases it can cause a degradation in performance, especially when access to the data occurs often on both the host and device. This leads to the data being "ping-ponged" back and forth.

Note that CUDA Aware MPI does not work with UM, hence you will need to manually manage the data movement (via data directives) in order to take advantage of it.

Thanks Mat.

I'm trying CUDA Aware MPI, but my code crashes. I'm using NVHPC 21.9:

!$omp target data use_device_ptr(alpha)
call gop(alpha,work,'+ ',m)
!$omp end target data

where gop is:

call mpi_allreduce(x,w,n,MPI_DOUBLE_PRECISION,mpi_sum,nekcomm,ie)
call copy(x,w,n)

and copy is simply:

  subroutine copy(a,b,n)
  real a(1),b(1)

  do i=1,n
     a(i)=b(i)
  enddo

  return
  end

The error is:

FATAL ERROR: data in use_device clause was not found on device 1: host:0x7fff7a13ab30

That's very strange; alpha is used on the GPU by some kernels before this point, so it should already be on the device (this routine is inside an ENTER DATA region):

     subroutine project1_a_omp
     .....
     real work(mxprev),alpha(mxprev)
     .....

      do k = 2,m
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
      enddo

      !Second round of CGS
      do k = 1, m
         if(ifwt) then
            alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
         else
            alpha(k) = vlsc2_omp(xx(1,k),b,n)
         endif
      enddo

!$omp target data use_device_ptr(alpha)
      call gop(alpha,work,'+  ',m)
!$omp end target data

Without my changes the code works well. Do you know the reason?

I also tried:

!$OMP TARGET DATA MAP(TOFROM:alpha)
      do k = 2,m
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
      enddo

      !Second round of CGS
      do k = 1, m
         if(ifwt) then
            alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
         else
            alpha(k) = vlsc2_omp(xx(1,k),b,n)
         endif
      enddo

!$omp target data use_device_ptr(alpha)
      call my_gop(alpha,work,'+  ',m)
!$omp end target data

and I get: NVFORTRAN-S-0104-Illegal control structure - unterminated TARGETDATA

In the first example, “alpha” isn’t in a target data region within the scope of the “use_device_ptr” so the error is correct.

The second example is closer to what you want, but you're missing a matching "end target data". You have two "target data" directives but only one "end target data". Both structured data regions need a start and an end.
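
A minimal sketch of the structure with both regions closed (using your names) would be:

!$OMP TARGET DATA MAP(TOFROM:alpha)
      ! ... the add2s2_omp / vlsc*_omp loops ...
!$omp target data use_device_ptr(alpha)
      call my_gop(alpha,work,'+  ',m)
!$omp end target data
!$OMP END TARGET DATA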

Thanks Mat. I have two questions about my code:

My goal is to run a piece of code entirely on the GPU, avoiding memory transfers to the CPU. My code currently is:

!$OMP TARGET DATA MAP(TOFROM:alpha) MAP(TO:work,m)
  do k = 2,m
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

  !Second round of CGS
  do k = 1, m
     if(ifwt) then
        alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
     else
        alpha(k) = vlsc2_omp(xx(1,k),b,n)
     endif
  enddo

!$omp use_device_ptr(alpha)
  call my_gop(alpha,work,'+  ',m)

  do k = 1,m
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

!$omp end target data

the routine my_gop is

subroutine my_gop( x, w, op, n)
...
   call mpi_allreduce(x,w,n,MPI_DOUBLE_PRECISION,mpi_sum,nekcomm,ie)
   call copy_omp(x,w,n)

where copy_omp is simply:

  subroutine copy_omp(a,b,n)
  real a(n),b(n)

!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO
  do i=1,n
     a(i)=b(i)
  enddo
  return
  end

Now, if copy_omp is executed on the GPU, the code crashes. If it is executed on the CPU, it works well. I don't know the reason; all parameters should be on the GPU. Am I doing something wrong?

  2. Calling a function inside the target data region like:

alpha(k) = vlsc3_omp(xx(1,k),w,b,n)

where vlsc3_omp is:

function vlsc3_omp(x,y,b,n)
 dimension x(n),y(n),b(n)
 real dt
 dt = 0.0
!$OMP TARGET TEAMS LOOP REDUCTION(+:dt)
  do i=1,n
     dt = dt+x(i)*y(i)*b(i)
  enddo
 vlsc3_omp = dt
 return
end

Since alpha is on the device, is it updated on the device or transferred to the CPU? Can my piece of code run entirely on the GPU? Thanks.

'use_device_ptr' is a clause for a "target data" directive and is meaningless by itself. Plus, you would want to place it around the MPI call, not around "my_gop", and include both the send and receive buffers. Something like:

subroutine my_gop( x, w, op, n)
...
!$omp target data use_device_ptr(x,w)
   call mpi_allreduce(x,w,n,MPI_DOUBLE_PRECISION,mpi_sum,nekcomm,ie)
!$omp end target data 
   call copy_omp(x,w,n)

Now, if copy_omp is executed on the GPU, the code crashes. If it is executed on the CPU, it works well.

I'm not seeing anything obvious, but in an earlier post you allocated 'alpha' to be smaller than the loop bounds, causing an out-of-bounds error. The CPU won't crash in this circumstance until the access crosses a page boundary, while the GPU is more sensitive to out-of-bounds accesses. Did you fix this issue? If so, then I'll need a reproducing example to investigate.

Since alpha is on the device, is it updated on the device or transferred to the CPU?

Neither. If the data is already on the device (i.e. included in an outer target data region), the data is not implicitly copied. If you do need to synchronize data between the host and device within a data region, you'd use a "target update" directive.
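
For example, a sketch assuming the reduction results are assigned to alpha in host code inside the data region:

      do k = 1, m
         alpha(k) = vlsc3_omp(xx(1,k),w,b,n)   ! scalar result assigned on the host
      enddo
!$omp target update to(alpha)                  ! sync the host values of alpha to the device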

Can my piece of code run entirely on the GPU?

Not entirely, since the GPU is not self-hosted, but you should be able to offload all the compute-intensive portions.

-Mat

Hi Mat,

the real code does not have a bug in the alpha array. This is my code now:

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha,work)
  do k = 2,m
     alpha_d = alpha(k)
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

  !Second round of CGS
  do k = 1, m
     if(ifwt) then
        alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
     else
        alpha(k) = vlsc2_omp(xx(1,k),b,n)
     endif
  enddo

   call my_gop(alpha,work,'+  ',m)

  do k = 1,m
     call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo
 !$omp end target data

where my_gop now is:

subroutine my_gop( x, w, op, n)
...
!$omp target data use_device_ptr(x,w)
 call mpi_allreduce(x,w,n,MPI_DOUBLE_PRECISION,mpi_sum,nekcomm,ie)
!$omp end target data 

 call copy_omp(x,w,n)

And the code crashes with numerical errors, but from another problem that does not depend on my modifications.

Thanks for now.

Ok, but again there's nothing obvious in what you posted, so I'd need a reproducing example to investigate. Though numerical errors could mean that you're missing an update and the data is not getting synchronized between the host and device.

In another post you mentioned that at least some of these arrays were in a higher-level data region. This means the "MAP" clauses would essentially be ignored here. Maybe these arrays get assigned on the host someplace and you need to add a "target update" directive?

Hi Mat, sincerely I don't know. The code is too big and it is very difficult to understand which data needs to be updated (if any). Is it possible, just for debugging, to have everything updated automatically?

Another problem I have: from the NVHPC timeline (which is very large), I don't know how to isolate the portion of the code I'm working on in order to see the behaviour before/after my changes. I only have a global view, and I can't tell which piece of the timeline corresponds to which part of the code.

Well, you could remove all the data regions and have the compiler implicitly map the data at each compute region, but this would be painfully slow. Instead, you can try using CUDA Unified Memory (enabled via the flag "-gpu=managed"), where all allocated data will be managed by the CUDA driver for you. (You can leave the data regions in; they'll just essentially be ignored.) Note that static data such as fixed-size arrays still needs to be managed via data regions. Also, CUDA Aware MPI sees UM as a host pointer, so it doesn't do GPU-direct transfers. Though for debugging this should be ok.

Typically my strategy is to take a top-down approach to data management and a bottom-up approach to compute regions. Basically, I add unstructured data regions (target enter data) just after allocating the data, or put structured data regions as early as possible in the code. Next, I add a compute region, putting update directives before and after it. If that works, I add the next compute region, but now move the updates out wider. Eventually, as more and more compute is offloaded, you can remove the unneeded updates.
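
As a rough sketch of that pattern (the array name and loop are placeholders):

      real, allocatable :: a(:)
      allocate(a(n))
!$omp target enter data map(alloc:a)     ! unstructured data region right after allocation

      ! ... host code initializes a ...
!$omp target update to(a)                ! sync host -> device before the offloaded loop
!$omp target teams loop
      do i = 1, n
         a(i) = 2.0*a(i)
      enddo
!$omp target update from(a)              ! sync device -> host after it; widen or remove
                                         ! these updates as more compute moves to the GPU

!$omp target exit data map(delete:a)     ! free the device copy before deallocation
      deallocate(a)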

-Mat