Cublas saxpy error

Dear Nvidia user,

I’m experiencing a problem using cublas_Saxpy. I would like to optimize a small kernel that is part of a very large code and takes about 10% of the total time:

      subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)
      real,value:: c1
      integer,value:: n
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end

This code is a saxpy. When I try to implement it with cuBLAS, I get:

call to cuMemAlloc returned error 700: Illegal address during kernel execution

I’m using the NVHPC/21.5-GCC-10.3.0 module. Maybe I allocated something incorrectly or passed the wrong parameters to cublas_Saxpy. Could you help me solve the problem? I have attached the original code (add2s2_omp.f) and the cuBLAS implementation (add2s2_omp_cublas.f). Thanks.

add2s2_omp.f (1.5 KB)
add2s2_omp_cublas.f (2.1 KB)

Hi unrue,

I’m a bit confused, since there’s no “cublas_Saxpy” routine, just “cublasSaxpy” (no underscore), nor do the other “cublas” calls include an underscore. Are you using your own wrappers to call cuBLAS? Perhaps using a C interface? If so, then this may just be a Fortran-to-C interoperability issue, in that you can’t pass Fortran allocatable arrays to C routines and must instead pass only the C pointer.

Though, I’d suggest instead using the Fortran interface we provide. You already have ‘use cublas’ so you just need to use the proper interface calls which we document at: NVIDIA Fortran CUDA Library Interfaces Version 21.9 for ARM, OpenPower, x86

-Mat

Hi Mat,

thanks for your reply. I’m using a simpler approach:

use cublas
...
!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx(:,k),bb(:,k)) 
!$OMP& use_device_ptr(xbar, xx)
      do k = 2,m
         call cublasSaxpy(n, alpha(k), xx(:,k), 1, xbar, 1)
         call add2s2_omp(bbar,bb(:,k),alpha(k),n)
         call add2s2_omp(b,bb(:,k),-alpha(k),n)
      enddo

!$OMP END TARGET DATA



and I get:

FATAL ERROR: data in use_device clause was not found on device 1: host:0x153608a3c650

You’ll want to update to NVHPC release 21.7 or later. OpenMP target directives are being added incrementally with support for “use_device_ptr” being added in 21.7.

Hi Mat,

now I have NVHPC 21.9 available. Using cuBLAS as you suggested, I get a segfault and I don’t understand why. I have attached my code. Thanks.

add2s2_omp_cublas.f (1.6 KB)

Hi Unrue,

The problem is with “alpha(k)”. Since you put “alpha” in the use_device_ptr clause, the code is dereferencing the device pointer on the host, hence the segv. Also, “alpha” is accessed out of bounds, since its size is 3 but it is indexed up to “m” (10).

% cat add2s2_omp_cublas.f
c pgf90 -fopenmp -mp=gpu -Mcuda=cc80  -Minfo=all add2s2_omp.f -O2 -o add2s2_omp
      module foo
      contains
      subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)
      real,value:: c1
      integer,value:: n
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end

      end module foo

      program add2s2_omp
        use foo
        use cublas
        implicit none
        integer k, m, n, incy, incx
        real, dimension(:), allocatable :: xbar,bbar,b,w
        real, dimension(:,:), allocatable :: xx, bb
        real, dimension(:), allocatable:: alpha
        real :: alpha_d
        logical ifwt

        m = 10
        n = 4669440

        allocate(xbar(n))
        allocate(bbar(n))
        allocate(b(n))
        allocate(xx(n,m))
        allocate(bb(n,m))
        allocate(w(n))
        allocate(alpha(m))

        xbar = 1.1
        bbar = 2.2
        b = 3.3
        alpha = 0.1
        alpha(1) = 0.2
        alpha(2) = 0.3
        alpha(3) = 0.4
        xx = 3.5
        bb = 2.1
        ifwt = .true.
        w = 7.3
        incx = 1
        incy = 1

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha)
!$OMP& use_device_ptr(xbar,xx)
      do k = 2,m
!         call add2s2_omp(xbar,xx(:,k),alpha(k),n)
!         call cublasSaxpy(n, alpha(k), xx(:,k), 1, xbar, 1)
         alpha_d = alpha(k)
         call cublasSaxpy(n, alpha_d, xx(:,k), incx, xbar, incy)
         call add2s2_omp(bbar,bb(:,k),alpha(k),n)
         call add2s2_omp(b,bb(:,k),-alpha(k),n)
      enddo

!$OMP END TARGET DATA

       print *, xbar(1),bbar(1),b(1)

       deallocate(xbar)
       deallocate(bbar)
       deallocate(b)
       deallocate(xx)
       deallocate(bb)
       deallocate(w)
       deallocate(alpha)

      end program
% nvfortran -mp=gpu add2s2_omp_cublas.f -cudalib=cublas ; a.out
    6.000000        5.140000       0.3599997

-Mat

Thanks Mat,

now it works well, although the performance is more or less the same (the code is memory bound). What I don’t understand is that when I use cublasSaxpy in the real, large code, I get:

NVFORTRAN-S-0155-Could not resolve generic procedure cublassaxpy

From here:

https://forums.developer.nvidia.com/t/pgf90-s-0155-could-not-resolve-generic-procedure-cublasdgemm/135904/2

I read that it could be a parameter mismatch, but the parameters are right in my code. Are there any other possible causes of this error? Thanks:

  subroutine project1_a_omp(xbar,bbar,b,xx,bb,n,m,w,ifwt,ifvec)

  use cublas
....

!$OMP TARGET DATA MAP(TOFROM:alpha)
      do k = 2,m
        alpha_d = alpha(k)
        call cublasSaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
        call cublasSaxpy(n, alpha_d, bb(:,k), 1, bbar, 1)
        call cublasSaxpy(n, -alpha_d, bb(:,k), 1, b, 1)
      enddo
....

!$omp end target data

cublasSaxpy expects the third and fifth arguments to have the “device” attribute. Using the “use_device_ptr” clause and copying the arrays via “TARGET DATA”, as in the previous example you posted, will pass the device copies to these calls.

Hi Mat

that’s strange, because some of these arrays are already in a MAP clause in another part of the code that includes this routine, so I assumed it was not necessary to redefine the MAP. Now, in the big code, I have rewritten it like my small example:

 subroutine project1_a_omp(xbar,bbar,b,xx,bb,n,m,w,ifwt,ifvec)

  use cublas

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha) 
!$OMP& use_device_ptr(xbar,xx,bb,b)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasSaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
     call add2s2_omp(bbar,bb(:,k),alpha(k),n)
     call add2s2_omp(b,bb(:,k),-alpha(k),n)
  enddo
....


!$omp end target data

And I still get the same error. Maybe “use cublas” should be moved to another part of the code? Is it right to use it inside a subroutine? The same code works in the small example (attached). Thanks.

add2s2_omp.f (1.7 KB)

“MAP” defines when the data is created and copied to/from the device. “use_device_ptr” defines the section of host code in which the device pointer is used. So while you don’t need to put the variables in another MAP clause if they are already there, you would want to define a “use_device_ptr” data region so the device pointer is passed to cublasSaxpy.
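To illustrate the split between the two clauses, here is a minimal fixed-form sketch in which the caller owns the MAP and the subroutine only opens a “use_device_ptr” region around the cuBLAS call. The names device_axpy, axpy_present, and caller are hypothetical, and the sketch assumes the compiler accepts a TARGET DATA region that carries only a use_device_ptr clause for dummy arrays; treat it as something to try, not as a verified fix.

```fortran
      module device_axpy
      contains
      subroutine axpy_present(n, alpha_d, x, y)
      use cublas
      implicit none
      integer :: n
      real(4) :: alpha_d, x(n), y(n)
! No MAP clause here: x and y are assumed to be on the device
! already, mapped by an enclosing TARGET DATA region in the caller.
!$OMP TARGET DATA use_device_ptr(x,y)
      call cublasSaxpy(n, alpha_d, x, 1, y, 1)
!$OMP END TARGET DATA
      end subroutine
      end module device_axpy

      program caller
      use device_axpy
      implicit none
      integer :: n
      real(4) :: alpha_d
      real(4), allocatable :: x(:), y(:)
      n = 1000
      alpha_d = 0.5
      allocate(x(n), y(n))
      x = 1.0
      y = 2.0
! The caller decides when data is created and copied.
!$OMP TARGET DATA MAP(TO:x) MAP(TOFROM:y)
      call axpy_present(n, alpha_d, x, y)
!$OMP END TARGET DATA
      print *, y(1)
      deallocate(x, y)
      end program
```

If the arrays are not actually present on the device when the inner region is entered, the expected failure is the “data in use_device clause was not found on device” error shown earlier in this thread. The compile line from the earlier example (nvfortran -mp=gpu ... -cudalib=cublas) should apply here as well.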

As to why the interface is not recognized, I’m not sure. From what you show, there’s no real difference between this and your small example. Granted you’d probably get a segfault during execution since you’re passing “bb” and “b” as device pointers to a host subroutine, but that’s a different problem.

If you can post a reproducer, that might help understand what’s wrong.

Unfortunately, the code is quite big and lives in a non-public repository. The small example from which this piece of code was extracted does not show the problem, and I don’t know how to reproduce it with another small example.

The trouble with this error is that it does not say which passed parameter is wrong, even though I’m sure they are right.

What about “use cublas”? Is it right to put it inside a subroutine? Could it just be a problem with that definition?

Just as a test, I rewrote the code as follows (with no allocation or initialization; it doesn’t matter, this is just to test the compilation):

real xx_d, xbar_d
real n_d

!$OMP TARGET DATA MAP(TOFROM:xbar_d,bbar,b) MAP(TO:xx_d,bb,alpha) 
!$OMP& use_device_ptr(xbar_d,xx_d)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasSaxpy(n_d, alpha_d, xx_d, 1, xbar_d, 1)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

!$OMP END TARGET DATA

And the problem is still present:

NVFORTRAN-S-0155-Could not resolve generic procedure cublassaxpy

So I think it is not an interface problem.

Yes, it’s correct to use the cublas module in the subroutine.

Could it just be a problem with that definition?

Not sure what you mean. A problem with the definition of your subroutine? The cublasSaxpy interface we provide has been widely tested and available for over a decade, so I highly doubt that it’s the problem. Plus, it works fine with the smaller example.

I’m not asking you to share the full code, but without having a reproducing example that shows the differences between your full code and the small example, it’s very difficult for me to help.

Just as a test, I rewrote the code as follows (with no allocation or initialization; it doesn’t matter, this is just to test the compilation):

It actually does matter. Here you’ve made xx_d and xbar_d scalars, so the interface definitely won’t match. I assume in the real code these are still arrays?

My bad. I made xx_d and xbar_d arrays and still get the error. I really appreciate your help; unfortunately, I’m not a maintainer of the repository, so I cannot give you access.

But maybe I can upload the code (it is open source) to one of your NVIDIA test machines? I work at CINECA, so maybe we already have a collaboration with access to some machines. Would that be possible?

Hi Unrue,

I just sent a note to the email you registered with including a link to a secure FTP server where you can upload the code.

-Mat

Thanks Mat, but the credentials do not work. Do I have to wait some time?

I just tried logging into it and see the same error. I haven’t used this in a few years, so it’s possible it’s no longer working as expected.

I’m not sure what the file size limit is, but you can try direct messaging me and attaching the package to the message. Select my user name to open my profile, then select the “Message” box. In the message select the seventh icon over (the one with the up arrow), to upload the package.

If that doesn’t work, we may need to use Google Drive or some other method.

FYI, the error was that unrue had the “-r8” flag on the compilation, which promotes the default kind of reals to 8. Hence the code was trying to pass double precision arrays to Saxpy, which expects single precision.
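A quick way to see what a flag such as “-r8” does to default reals is a tiny check program like the sketch below (the program name check_real_width is hypothetical). It uses the standard storage_size intrinsic to report the width of a default REAL, which tells you whether the single or the double precision cuBLAS routine matches the arrays you declared with plain “real”.

```fortran
      program check_real_width
      implicit none
      real :: x
      x = 0.0
! storage_size returns the size of its argument in bits.
      print *, 'default REAL bits:', storage_size(x)
      if (storage_size(x) == 64) then
        print *, 'default REAL is 64-bit: cublasDaxpy matches'
      else
        print *, 'default REAL is not 64-bit: check cublasSaxpy'
      end if
      end program
```

Building it once with the same flags as the real application and once without them shows whether the default kind has been promoted, which is the kind of mismatch the S-0155 “Could not resolve generic procedure” message was hiding here.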

Yesssss, it works!

Thanks very much for your help!

Hi Mat,

sorry to write again, but the application (the real code) now compiles using cublasDaxpy but crashes at runtime:

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha) 
!$OMP& use_device_ptr(xbar,xx,bb,bbar)
   do k = 2,m
 !   call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     alpha_d = alpha(k)
     call cublasDaxpy(n, alpha_d, xx(1,k), 1, xbar, 1)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

!$OMP END TARGET DATA

I don’t understand what I’m doing wrong.

See my note above:

Granted you’d probably get a segfault during execution since you’re passing “bb” and “b” as device pointers to a host subroutine, but that’s a different problem.
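A hedged sketch of what that note implies, modeled on the working single precision example earlier in the thread: only the arrays handed to cuBLAS (xbar and xx) go in “use_device_ptr”, while bbar, b, and bb stay out of it because add2s2_omp is a host subroutine that launches its own target region. The module name axpy_kernels, the program name, the array sizes, and the initial values are assumptions for illustration, and this double precision variant has not been run against the real application.

```fortran
      module axpy_kernels
      contains
      subroutine add2s2_omp(a,b,c1,n)
      implicit none
      integer, value :: n
      real(8) :: a(n), b(n)
      real(8), value :: c1
      integer :: i
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      end subroutine
      end module axpy_kernels

      program daxpy_sketch
      use axpy_kernels
      use cublas
      implicit none
      integer :: k, m, n
      real(8), allocatable :: xbar(:), bbar(:), b(:), alpha(:)
      real(8), allocatable :: xx(:,:), bb(:,:)
      real(8) :: alpha_d

      m = 10
      n = 100000
      allocate(xbar(n), bbar(n), b(n), alpha(m))
      allocate(xx(n,m), bb(n,m))
      xbar = 1.1d0
      bbar = 2.2d0
      b = 3.3d0
      alpha = 0.1d0
      xx = 3.5d0
      bb = 2.1d0

! alpha is only read on the host, so it is not mapped here.
! Only xbar and xx are passed to cuBLAS, so only they are listed
! in use_device_ptr; bbar, b and bb keep their host addresses.
!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb)
!$OMP& use_device_ptr(xbar,xx)
      do k = 2,m
         alpha_d = alpha(k)
         call cublasDaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
         call add2s2_omp(bbar,bb(:,k),alpha(k),n)
         call add2s2_omp(b,bb(:,k),-alpha(k),n)
      enddo
!$OMP END TARGET DATA

      print *, xbar(1), bbar(1), b(1)
      deallocate(xbar, bbar, b, alpha, xx, bb)
      end program
```

The sketch passes array sections (xx(:,k)) as in the earlier working example instead of single elements (xx(1,k)). It should build the same way as that example, with nvfortran -mp=gpu and -cudalib=cublas.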