Cublas saxpy error

unrue · September 21, 2021, 7:51am

Dear Nvidia user,

I’m experiencing a problem using cublas_Saxpy. I would like to optimize my little kernel, which is part of a very large code and takes about 10% of the total time:

      subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)
      real,value:: c1
      integer,value:: n
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end

So this code is a saxpy. When I try to implement it using cuBLAS, I get:

call to cuMemAlloc returned error 700: Illegal address during kernel execution

I’m using the NVHPC/21.5-GCC-10.3.0 module. Maybe I did a wrong allocation or passed wrong parameters to cublas_Saxpy. Could you help me solve the problem? I have attached the original code (add2s2_omp.f) and the cuBLAS implementation (add2s2_omp_cublas.f). Thanks.

add2s2_omp.f (1.5 KB)
add2s2_omp_cublas.f (2.1 KB)

MatColgrove · September 22, 2021, 5:01pm

Hi unrue,

I’m a bit confused since there’s no “cublas_Saxpy” routine, just “cublasSaxpy” (no underscore), nor do the other “cublas” calls include an underscore. Are you using your own wrappers to call cuBLAS? Perhaps using a C interface? If so, then this may just be a Fortran to C interoperability issue, in that you can’t pass Fortran allocatable arrays to C routines and can instead only pass the C pointer.

Though, I’d suggest instead using the Fortran interface we provide. You already have ‘use cublas’ so you just need to use the proper interface calls which we document at: NVIDIA Fortran CUDA Library Interfaces Version 22.7 for ARM, OpenPower, x86

-Mat

unrue · September 24, 2021, 8:24am

Hi Mat,

thanks for your reply. I’m using a simpler approach:

use cublas
...
!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx(:,k),bb(:,k)) 
!$OMP& use_device_ptr(xbar, xx)
      do k = 2,m
         call cublasSaxpy(n, alpha(k), xx(:,k), 1, xbar, 1)
         call add2s2_omp(bbar,bb(:,k),alpha(k),n)
         call add2s2_omp(b,bb(:,k),-alpha(k),n)
         enddo

!$OMP END TARGET DATA

And I get:

FATAL ERROR: data in use_device clause was not found on device 1: host:0x153608a3c650

MatColgrove · September 24, 2021, 3:57pm

You’ll want to update to NVHPC release 21.7 or later. OpenMP target directives are being added incrementally with support for “use_device_ptr” being added in 21.7.

unrue · October 11, 2021, 8:04am

Hi Mat,

now I have NVHPC 21.9 available. Using cuBLAS as you suggested, I get a segfault and I don’t understand the reason. My code is attached. Thanks.

add2s2_omp_cublas.f (1.6 KB)

MatColgrove · October 11, 2021, 4:57pm

Hi Unrue,

The problem is with “alpha(k)”. Since you put “alpha” in the use_device_ptr clause, the code is dereferencing the device pointer on the host, hence the segv. Also, “alpha” is getting accessed out of bounds since its size is “3” but it is indexed up to “m” (10).

% cat add2s2_omp_cublas.f
c pgf90 -fopenmp -mp=gpu -Mcuda=cc80  -Minfo=all add2s2_omp.f -O2 -o add2s2_omp
      module foo
      contains
      subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)
      real,value:: c1
      integer,value:: n
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end

      end module foo

      program add2s2_omp
        use foo
        use cublas
        implicit none
        integer k, m, n, incy, incx
        real, dimension(:), allocatable :: xbar,bbar,b,w
        real, dimension(:,:), allocatable :: xx, bb
        real, dimension(:), allocatable:: alpha
        real :: alpha_d
        logical ifwt

        m = 10
        n = 4669440

        allocate(xbar(n))
        allocate(bbar(n))
        allocate(b(n))
        allocate(xx(n,m))
        allocate(bb(n,m))
        allocate(w(n))
        allocate(alpha(m))

        xbar = 1.1
        bbar = 2.2
        b = 3.3
        alpha = 0.1
        alpha(1) = 0.2
        alpha(2) = 0.3
        alpha(3) = 0.4
        xx = 3.5
        bb = 2.1
        ifwt = .true.
        w = 7.3
        incx = 1
        incy = 1

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha)
!$OMP& use_device_ptr(xbar,xx)
      do k = 2,m
!         call add2s2_omp(xbar,xx(:,k),alpha(k),n)
!         call cublasSaxpy(n, alpha(k), xx(:,k), 1, xbar, 1)
         alpha_d = alpha(k)
         call cublasSaxpy(n, alpha_d, xx(:,k), incx, xbar, incy)
         call add2s2_omp(bbar,bb(:,k),alpha(k),n)
         call add2s2_omp(b,bb(:,k),-alpha(k),n)
         enddo

!$OMP END TARGET DATA

       print *, xbar(1),bbar(1),b(1)

       deallocate(xbar)
       deallocate(bbar)
       deallocate(b)
       deallocate(xx)
       deallocate(bb)
       deallocate(w)
       deallocate(alpha)

      end program
% nvfortran -mp=gpu add2s2_omp_cublas.f -cudalib=cublas ; a.out
    6.000000        5.140000       0.3599997

-Mat

unrue · October 13, 2021, 9:13am

Thanks Mat,

now it works well, even though performance is more or less the same (the code is memory bound). What I don’t understand is that when I use cublasSaxpy in a real, big code, I get:

NVFORTRAN-S-0155-Could not resolve generic procedure cublassaxpy

From here:

https://forums.developer.nvidia.com/t/pgf90-s-0155-could-not-resolve-generic-procedure-cublasdgemm/135904/2

I read that it could be a parameter mismatch, but the parameters are right in my code. Are there any other possible causes of this error? Thanks:

  subroutine project1_a_omp(xbar,bbar,b,xx,bb,n,m,w,ifwt,ifvec)

  use cublas
....

!$OMP TARGET DATA MAP(TOFROM:alpha)
      do k = 2,m
        alpha_d = alpha(k)
        call cublasSaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
        call cublasSaxpy(n, alpha_d, bb(:,k), 1, bbar, 1)
        call cublasSaxpy(n, -alpha_d, bb(:,k), 1, b, 1)
  enddo
....

!$omp end target data

MatColgrove · October 13, 2021, 4:36pm

cublasSaxpy is expecting the third and fifth arguments to have the “device” attribute. Using the “use_device_ptr” clause and copying the arrays via “TARGET DATA”, like the previous example you posted, will pass in the device copy to these calls.

unrue · October 14, 2021, 7:09am

Hi Mat

that’s strange, because some of these arrays are already declared in a MAP clause in another part of the code that calls this routine, so I assumed it was not necessary to redefine the MAP. Now, in the big code, I have rewritten it like my little example:

 subroutine project1_a_omp(xbar,bbar,b,xx,bb,n,m,w,ifwt,ifvec)

  use cublas

…

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha) 
!$OMP& use_device_ptr(xbar,xx,bb,b)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasSaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
     call add2s2_omp(bbar,bb(:,k),alpha(k),n)
     call add2s2_omp(b,bb(:,k),-alpha(k),n)
     enddo

....


!$omp end target data

And I still get the same error. Maybe “use cublas” should be moved to another part of the code? Is it right to use it inside a subroutine? The same code works in the little example (attached). Thanks.

add2s2_omp.f (1.7 KB)

MatColgrove · October 14, 2021, 4:56pm

“MAP” defines when the data is created and copied to/from the device. “use_device_ptr” defines the section of host code in which the device pointer is used. So while you don’t need to put the variables in another MAP clause if they are already there, you would want to define a “use_device_ptr” data region so the device pointer is passed to cublasSaxpy.
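
The split between the two clauses can be sketched as follows. This is a minimal, untested sketch, assuming NVHPC 21.7 or later and a build line such as `nvfortran -mp=gpu -cudalib=cublas sketch.f90`; the module, subroutine, and variable names are hypothetical and not taken from the real code. The caller maps the arrays once, and the subroutine only opens a use_device_ptr region around the cuBLAS call.

```fortran
! Sketch only: saxpy_on_device and map_sketch are hypothetical names.
module map_sketch
contains
  subroutine saxpy_on_device(n, a, x, y)
    use cublas
    implicit none
    integer :: n
    real(4) :: a, x(n), y(n)
    ! x and y are assumed to be mapped already by the caller, so no MAP
    ! clause is repeated here; the region only makes the host code pass
    ! the device addresses to cuBLAS.
    !$omp target data use_device_ptr(x, y)
    call cublasSaxpy(n, a, x, 1, y, 1)
    !$omp end target data
  end subroutine saxpy_on_device
end module map_sketch

program map_then_use
  use map_sketch
  implicit none
  integer, parameter :: n = 1024
  real(4), allocatable :: x(:), y(:)
  real(4) :: a
  allocate(x(n), y(n))
  a = 0.5
  x = 1.0
  y = 2.0
  ! MAP: create the device copies and transfer the data once.
  !$omp target enter data map(to: x, y)
  call saxpy_on_device(n, a, x, y)
  ! Copy the result back and release the device copies.
  !$omp target exit data map(from: y) map(delete: x)
  print *, y(1)
  deallocate(x, y)
end program map_then_use
```

The scalar `a` stays on the host, which matches the legacy cublasSaxpy interface used in this thread.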

As to why the interface is not recognized, I’m not sure. From what you show, there’s no real difference between this and your small example. Granted you’d probably get a segfault during execution since you’re passing “bb” and “b” as device pointers to a host subroutine, but that’s a different problem.

If you can post a reproducer, that might help understand what’s wrong.

unrue · October 15, 2021, 7:08am

The code is quite big and in a non-public repository, unfortunately. The little example from which this piece of code was extracted does not show the problem, and I don’t know how to reproduce it with another little example.

The trouble with this error is that it does not say which passed parameter is wrong, even though I’m sure they are right.

What about “use cublas”? Is it right to put it in a subroutine? Could it be just a problem with such definition?

Just as a test, I rewrote the code as follows (with no allocation or initialization; it doesn’t matter, it is just to test the compilation):

real xx_d, xbar_d
real n_d

!$OMP TARGET DATA MAP(TOFROM:xbar_d,bbar,b) MAP(TO:xx_d,bb,alpha) 
!$OMP& use_device_ptr(xbar_d,xx_d)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasSaxpy(n_d, alpha_d, xx_d, 1, xbar_d, 1)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

!$OMP END TARGET DATA

And the problem is still present:

NVFORTRAN-S-0155-Could not resolve generic procedure cublassaxpy

So, I think it is not an interface problem.

MatColgrove · October 15, 2021, 3:46pm

Yes, it’s correct to use the cublas module in the subroutine.

Could be just a problem with such definition?

Not sure what you mean. A problem with the definition of your subroutine? The cublasSaxpy interface we provide is widely tested and has been available for over a decade, so I highly doubt that it’s the problem. Plus it works fine with the smaller example.

I’m not asking you to share the full code, but without having a reproducing example that shows the differences between your full code and the small example, it’s very difficult for me to help.

Just to make a test, I rewrite the code as (with no allocation or initialization, It does’t matter, just to test the compilation):

It actually does matter. Here you’ve made xx_d and xbar_d scalars, so the interface definitely won’t match. I assume in the real code these are still arrays?

unrue · October 18, 2021, 6:53am

It actually does matter. Here you’ve made xx_d and xbar_d scalars, so the interface definitely won’t match. I assume in the real code these are still arrays?

My bad. I made xx_d and xbar_d arrays and I still get the error. I really appreciate your help; unfortunately I’m not a maintainer of the repository and I cannot give you access.

But maybe I can upload the code (it is open source) to one of your Nvidia testing machines? I work at CINECA, so maybe we already have a collaboration with access to some machines. Would that be possible?

MatColgrove · October 18, 2021, 4:13pm

Hi Unrue,

I just sent a note to the email you registered with including a link to a secure FTP server where you can upload the code.

-Mat

unrue · October 18, 2021, 4:51pm

Thanks Mat, but the credentials do not work. Do I have to wait some time?

MatColgrove · October 18, 2021, 8:59pm

I just tried logging into it and see the same error. I haven’t used this in a few years, so it’s possible it’s no longer working as expected.

I’m not sure what the file size limit is, but you can try direct messaging me and attaching the package to the message. Select my user name to open my profile, then select the “Message” box. In the message select the seventh icon over (the one with the up arrow), to upload the package.

If that doesn’t work, we may need to go to google drive or some other method.

MatColgrove · October 19, 2021, 3:29pm

FYI, the error was that unrue had the “-r8” flag on the compilation, which promotes the default kind of reals to 8. Hence the code was trying to pass double precision arrays to Saxpy, which expects single precision.
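
The effect of the flag on generic resolution can be illustrated without a GPU. The sketch below is only an illustration: `my_axpy` and its two specifics are hypothetical stand-ins for a kind-specific library interface and are not part of cuBLAS. The variables are declared with plain `real`, so their kind follows the compiler's default real kind, and the generic call resolves to whichever specific matches.

```fortran
! Sketch only: my_axpy, axpy_r32 and axpy_r64 are hypothetical names.
module axpy_generic
  use iso_fortran_env, only: real32, real64
  implicit none
  interface my_axpy
    module procedure axpy_r32, axpy_r64
  end interface my_axpy
contains
  subroutine axpy_r32(n, a, x, y)
    integer, intent(in) :: n
    real(real32), intent(in) :: a, x(n)
    real(real32), intent(inout) :: y(n)
    y = y + a*x
    print *, 'resolved to the single precision specific'
  end subroutine axpy_r32

  subroutine axpy_r64(n, a, x, y)
    integer, intent(in) :: n
    real(real64), intent(in) :: a, x(n)
    real(real64), intent(inout) :: y(n)
    y = y + a*x
    print *, 'resolved to the double precision specific'
  end subroutine axpy_r64
end module axpy_generic

program kind_check
  use axpy_generic
  implicit none
  integer, parameter :: n = 4
  ! Default real: its kind is what a flag such as -r8 changes.
  real :: a, x(n), y(n)
  a = 0.5
  x = 1.0
  y = 2.0
  print *, 'storage size of default real in bits:', storage_size(a)
  call my_axpy(n, a, x, y)
end program kind_check
```

Compiling this with and without “-r8” should select a different specific. If the double precision specific were removed from the interface, the “-r8” build would have no matching specific for the call, which is the same class of failure as the “Could not resolve generic procedure” message above. Declaring arrays with an explicit kind, or calling cublasDaxpy when the data really is double precision, avoids the mismatch.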

unrue · October 19, 2021, 4:29pm

Yesssss, it works!

Thanks very much for your help!

unrue · October 21, 2021, 6:46am

Hi Mat,

sorry, but I’m writing again because the application (the real code) now compiles using cublasDaxpy, but crashes at runtime:

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha) 
!$OMP& use_device_ptr(xbar,xx,bb,bbar)
   do k = 2,m
 !   call add2s2_omp(xbar,xx(1,k),alpha(k),n)
     alpha_d = alpha(k)
     call cublasDaxpy(n, alpha_d, xx(1,k), 1, xbar, 1)
     call add2s2_omp(bbar,bb(1,k),alpha(k),n)
     call add2s2_omp(b,bb(1,k),-alpha(k),n)
  enddo

!$OMP END TARGET DATA

I don’t understand what I’m doing wrong.

MatColgrove · October 21, 2021, 2:58pm

See my note above:

Granted you’d probably get a segfault during execution since you’re passing “bb” and “b” as device pointers to a host subroutine, but that’s a different problem.
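
One way to act on that note, mirroring the structure of the small working example earlier in the thread, is to list in use_device_ptr only the arrays that go to cuBLAS (xbar and xx) and leave bb, bbar and b out of it, so that add2s2_omp receives ordinary host references to mapped data. The following is an untested sketch, assuming NVHPC 21.7 or later and `nvfortran -mp=gpu -cudalib=cublas sketch.f90`; the sizes, initial values, and module name are placeholders, and real(8) is spelled out so the code does not depend on “-r8”.

```fortran
! Sketch only: sizes, values and the module name are placeholders.
module kernels
contains
  subroutine add2s2_omp(a, b, c1, n)
    implicit none
    integer, value :: n
    real(8) :: a(n), b(n)
    real(8), value :: c1
    integer :: i
    ! Host-callable routine: it launches its own target region and
    ! expects host references to arrays that are present on the device.
    !$omp target teams loop
    do i = 1, n
      a(i) = a(i) + c1*b(i)
    end do
  end subroutine add2s2_omp
end module kernels

program daxpy_sketch
  use kernels
  use cublas
  implicit none
  integer, parameter :: n = 1000, m = 4
  real(8), allocatable :: xbar(:), bbar(:), b(:), xx(:,:), bb(:,:), alpha(:)
  real(8) :: alpha_d
  integer :: k

  allocate(xbar(n), bbar(n), b(n), xx(n,m), bb(n,m), alpha(m))
  xbar = 1.1d0
  bbar = 2.2d0
  b = 3.3d0
  xx = 3.5d0
  bb = 2.1d0
  alpha = 0.1d0

  ! Only the cuBLAS arguments appear in use_device_ptr.
  !$omp target data map(tofrom: xbar, bbar, b) map(to: xx, bb) use_device_ptr(xbar, xx)
  do k = 2, m
    alpha_d = alpha(k)   ! alpha stays on the host and is passed as a scalar
    call cublasDaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
    call add2s2_omp(bbar, bb(:,k), alpha_d, n)
    call add2s2_omp(b, bb(:,k), -alpha_d, n)
  end do
  !$omp end target data

  print *, xbar(1), bbar(1), b(1)
  deallocate(xbar, bbar, b, xx, bb, alpha)
end program daxpy_sketch
```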
