Calling cuBlas from a Fortran program

Hi,

I’m trying to call cuBLAS from a Fortran program, but the code does not compile.
The error message is:

PGF90-S-0084-Illegal use of symbol cublasdcopy - attempt to use a SUBROUTINE as a FUNCTION (main.f90: 14)

What is wrong with this code?

Thank you for your help

The code is:

 PROGRAM test
       use cublas
       implicit none
       integer n,i,ierr
       type(cublasHandle) :: h
       real*8,device,allocatable :: x(:)
       real*8,device,allocatable :: y(:)
       real*8,device,allocatable :: z(:)
       n=10e6
       allocate(x(n))
       allocate(y(n))
       allocate(z(n))
       h = cublasGetHandle()
       ierr = cublasDcopy(h,10,x,1,y,1)
      end PROGRAM test

makefile

CC=pgfortran

OBJS=main.o
OPTS=-mp -tp=skylake -fast -mcmodel=medium -m64 -cpp -acc -Minfo=acc -ta=tesla:cc70 -Mcuda -Mcudalib=cublas

%.o: %.f90
        ${CC} ${OPTS} -c $<

all: myProgram
myProgram: main.o
        ${CC} ${OPTS} -o myProgram main.o
myProg:main.o
        ${CC} ${OPTS} -c $<

Hi Peter85,

Yes, this is a bit confusing.

cuBLAS changed its interfaces a while back. When you use the “cublas” module, you get the v1 (legacy) interface, where “cublasdcopy” is a subroutine that does not take a handle as its first argument. If you call “cublasdcopy_v2” instead, you get the v2 interface, where it is a function that takes a handle. Alternatively, you can use the “cublas_v2” module instead of “cublas”, in which case “cublasdcopy” resolves to the v2 interface.
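As a minimal sketch of the “cublas_v2” option, the program below calls “cublasDcopy” as a function with a handle. It assumes a CUDA Fortran compiler and the same kind of flags as in the makefile above (-Mcuda -Mcudalib=cublas). The program name, the host array y_host, and the use of cublasCreate/cublasDestroy to manage the handle are choices made for this illustration, not part of the original code.

```fortran
program dcopy_v2_sketch
  use cublas_v2                 ! v2 interface: routines are functions taking a handle
  implicit none
  integer, parameter :: n = 10
  integer :: ierr
  type(cublasHandle) :: h
  real(8), device, allocatable :: x(:), y(:)
  real(8) :: y_host(n)          ! illustrative host buffer to inspect the result

  allocate(x(n), y(n))
  x = 1.0d0                     ! scalar assignment initializes the device array
  y = 0.0d0

  ierr = cublasCreate(h)        ! create a handle for the v2 calls
  if (ierr /= 0) stop 'cublasCreate failed'

  ! With "use cublas_v2" this is a function: handle first, status returned.
  ierr = cublasDcopy(h, n, x, 1, y, 1)
  if (ierr /= 0) stop 'cublasDcopy failed'

  y_host = y                    ! device-to-host copy
  write(*,*) 'y(1) =', y_host(1)

  ierr = cublasDestroy(h)
  deallocate(x, y)
end program dcopy_v2_sketch
```

With “use cublas” instead, the equivalent v1 call would be a subroutine call without the handle, which is why the function-style call in the original program is rejected.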

The complete interfaces can be found in our CUDA Fortran Library Interfaces Guide (https://www.pgroup.com/resources/docs/18.3/pdf/pgi18cudaint.pdf). In particular, see pages 36 and 106.

Hope this helps,
Mat

Thanks for the info. I will try it! Yes, it is confusing. Is it recommended to use the v2 interface?

Is it recommended to use the v2 interface?

Yes.

I’m now trying to use cublasDgemmStridedBatched_v2, but I get the same error message (only cublasDcopy_v2 works).

 PROGRAM test
       use cublas_v2
       implicit none
       integer n,i,ierr
       type(cublasHandle) :: h
       real*8,device,allocatable :: x(:)
       real*8,device,allocatable :: y(:)
       real*8,device,allocatable :: z(:)
       real*8 a,b
       n=10
       a=1.0d0
       b=1.0d0
       allocate(x(n))
       allocate(y(n))
       allocate(z(n))
       h = cublasGetHandle()
       ierr = cublasDcopy_v2(h,10,x,1,y,1)
       ierr = cublasDgemmStridedBatched_v2(h,CUBLAS_OP_N,CUBLAS_OP_N,&
               1,1,1,a, x,1,1,y,1,1,0,b ,z,1,1,1)
       write(*,*)"Programend"
      end PROGRAM test

Hi Peter,

You have an extra argument in the call, which is why the generic procedure can’t be resolved. To fix it, remove the “0” in “1,1,0,b ,z”.
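For reference, here is a sketch of the same test with the extra “0” removed and each argument group labeled. The argument names in the comments (m, n, k, alpha, lda, strideA, and so on) follow the usual cuBLAS naming and are annotations added here; check them against the interfaces guide for your compiler version. Creating the handle with cublasCreate is also a choice made for this sketch.

```fortran
program gemm_strided_batched_sketch
  use cublas_v2
  implicit none
  integer, parameter :: n = 10
  integer :: ierr
  type(cublasHandle) :: h
  real(8), device, allocatable :: x(:), y(:), z(:)
  real(8) :: a, b

  a = 1.0d0
  b = 1.0d0
  allocate(x(n), y(n), z(n))
  x = 1.0d0
  y = 1.0d0
  z = 0.0d0

  ierr = cublasCreate(h)
  if (ierr /= 0) stop 'cublasCreate failed'

  ! 18 arguments in total; the original call had a stray "0" before beta.
  ierr = cublasDgemmStridedBatched_v2(h, CUBLAS_OP_N, CUBLAS_OP_N, &
           1, 1, 1,  &   ! m, n, k
           a,        &   ! alpha
           x, 1, 1,  &   ! A, lda, strideA
           y, 1, 1,  &   ! B, ldb, strideB
           b,        &   ! beta
           z, 1, 1,  &   ! C, ldc, strideC
           1)            ! batchCount
  if (ierr /= 0) stop 'cublasDgemmStridedBatched_v2 failed'

  write(*,*) "Programend"
  ierr = cublasDestroy(h)
  deallocate(x, y, z)
end program gemm_strided_batched_sketch
```

Counting arguments against the documented interface is usually the quickest check when a generic procedure cannot be resolved.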

-Mat

Thank you very much! It worked! I overlooked this extra parameter.

I have another question regarding mixing cuBLAS and OpenACC.
Do I have to call cudaDeviceSynchronize() after calling a cuBLAS function if
OpenACC kernels follow the cuBLAS call? Do cuBLAS and OpenACC both use the same stream?

Thank you for your help!

  !$acc host_data use_device(dBlocks_gpu,r,s)
  ierr = cublasDgemmStridedBatched_v2(h,CUBLAS_OP_N,CUBLAS_OP_N,&
                                   bSize,1,bSize,&
                                   1.0d0,dBlocks_gpu,bSize,mSize,&
                                   r,bSize,bSize, 0.0d0, s,bSize,bSize, n/bSize)
  ierr = cudaDeviceSynchronize()
  !$acc end host_data
   
  ! More OpenACC loops

The cuBlas call will block waiting for the return code. So while it doesn’t hurt, adding the cudaDeviceSynchronize isn’t needed.

-Mat

Mat, this isn’t necessarily true. To be absolutely safe, you can run cuBLAS and your OpenACC kernels on the same stream. If you use an OpenACC async queue number of 5, for instance, you can do this:
ierr = cublasSetStream(h, acc_get_cuda_stream(5))
If you use the default stream everywhere, you will be fine. Or add cudaDeviceSynchronize as you said.
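To make the same-stream approach concrete, here is a rough sketch that puts a cuBLAS call and a following OpenACC loop on async queue 5. The cublasSetStream line is taken from the reply above; everything else (the program name, the arrays, the loop, and the use of cublasDcopy as the cuBLAS call) is made up for illustration. It assumes the code is built with both OpenACC and CUDA Fortran support, as in the makefile flags earlier in the thread.

```fortran
program shared_stream_sketch
  use openacc
  use cublas_v2
  implicit none
  integer, parameter :: n = 1000
  integer, parameter :: q = 5          ! OpenACC async queue number (arbitrary choice)
  integer :: i, ierr
  type(cublasHandle) :: h
  real(8) :: x(n), y(n)

  x = 1.0d0
  y = 0.0d0

  ierr = cublasCreate(h)
  if (ierr /= 0) stop 'cublasCreate failed'

  ! Make cuBLAS issue its work on the CUDA stream behind OpenACC queue q.
  ierr = cublasSetStream(h, acc_get_cuda_stream(q))

  !$acc data copyin(x) copy(y)

  ! Pass the device addresses of x and y to cuBLAS.
  !$acc host_data use_device(x, y)
  ierr = cublasDcopy(h, n, x, 1, y, 1)
  !$acc end host_data

  ! Same queue, so this loop is ordered after the cuBLAS copy.
  !$acc parallel loop async(q)
  do i = 1, n
     y(i) = 2.0d0 * y(i)
  end do

  ! Wait for queue q before the data region copies y back to the host.
  !$acc wait(q)
  !$acc end data

  ierr = cublasDestroy(h)
  write(*,*) 'y(1) =', y(1)
end program shared_stream_sketch
```

Because both pieces of work are queued on the same stream, they run in order without a device-wide cudaDeviceSynchronize between them; only the final wait on queue q is needed before the host reads the result.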