Is_device_pointer and target region question

unrue · October 22, 2021, 7:19am

Dear Nvidia users,

I’m using OpenMP offload with NVHPC 21.9. I would like to call cuBLAS routines and a routine containing a TARGET region from inside a TARGET DATA region, using use_device_ptr and is_device_ptr:

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b,alpha) MAP(TO:xx,bb,w) 
!$OMP& use_device_ptr(xbar,xx,bb,bbar,b,w)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasDaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
     call cublasDaxpy(n, alpha_d, bb(1,k), 1, bbar, 1)
     call cublasDaxpy(n, -alpha_d, bb(1,k), 1, b, 1)
  enddo
 
  do k = 1, m
      alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
  enddo

!$OMP END TARGET DATA

Where vlsc3_omp is:

  dimension x(n),y(n),b(n)
  real dt

  dt = 0.0

!$OMP TARGET TEAMS LOOP REDUCTION(+:dt) is_device_ptr(x,y,b)
!$OMP& MAP(TOFROM:dt)
  do i=1,n
     dt = dt+x(i)*y(i)*b(i)
  enddo
  vlsc3_omp = dt

  return
  end

The code crashes inside the above subroutine, and I don’t understand why. All the arrays are on the GPU since they are listed in use_device_ptr, and the subroutine vlsc3_omp runs on the GPU. What am I doing wrong? Thanks

MatColgrove · October 22, 2021, 5:27pm

Hi unrue,

Why do you want to pass device arrays to a host routine? It makes sense to do this for the cuBLAS routines since those are wrappers to the CUDA C library, but “vlsc3_omp” has a host interface. Granted, you don’t show the signature, but unless you’re using the CUDA Fortran device attribute on the arguments, the subroutine expects them to be host arrays.

Though I just had a thought. Maybe you didn’t realize that you can nest data regions? In other words, add another data region around the calls to cuBLAS so the device pointers are passed in there, while the remainder of the outer region uses the host pointers on the host. Something like:

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b,alpha) MAP(TO:xx,bb,w)
!$OMP TARGET DATA use_device_ptr(xbar,xx,bb,bbar,b,w)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasDaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
     call cublasDaxpy(n, alpha_d, bb(1,k), 1, bbar, 1)
     call cublasDaxpy(n, -alpha_d, bb(1,k), 1, b, 1)
  enddo
!$OMP END TARGET DATA
  do k = 1, m
      alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
  enddo
!$OMP END TARGET DATA

And then take off the “is_device_ptr” clause from the target teams region in vlsc3_omp.
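
For illustration, here is a minimal sketch of what vlsc3_omp might look like with the clause removed. The function signature, the argument types, and the small driver program are assumptions, since the original post does not show them. The arrays are referenced by their host names; because the caller has already mapped them in a TARGET DATA region, the runtime should find the existing device copies rather than transfer them again.

```fortran
      real function vlsc3_omp(x, y, b, n)
      implicit none
!     Assumed signature: three real arrays of length n.
      integer n, i
      real x(n), y(n), b(n)
      real dt

      dt = 0.0

!     No is_device_ptr: x, y and b are ordinary host arrays here.
!     They are mapped implicitly; data already present on the
!     device (from the caller's TARGET DATA) is not copied again.
!$OMP TARGET TEAMS LOOP REDUCTION(+:dt) MAP(TOFROM:dt)
      do i = 1, n
         dt = dt + x(i)*y(i)*b(i)
      enddo
      vlsc3_omp = dt

      return
      end

!     Hypothetical driver, only to show the calling pattern.
      program demo
      implicit none
      integer, parameter :: n = 4
      real x(n), y(n), b(n), r
      real vlsc3_omp
      external vlsc3_omp

      x = 1.0
      y = 2.0
      b = 3.0

!     Outer data region maps the arrays once; the call below
!     passes host pointers, not device pointers.
!$OMP TARGET DATA MAP(TO:x,y,b)
      r = vlsc3_omp(x, y, b, n)
!$OMP END TARGET DATA

!     The sum is 4 * (1*2*3) = 24
      print *, r
      end
```

With NVHPC this would typically be built with something like nvfortran -mp=gpu. The same function can also be called outside any data region, in which case the arrays are copied to the device for the duration of the target region.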

Now if it is really your intent to pass in device arrays to vlsc3_omp, we probably can make it work by adding in CUDA Fortran, but I’d need a full example before I can recommend what needs to be changed.

-Mat