PGF90-S-0155-Could not resolve generic procedure cublasdgemm

gongjing · November 8, 2018, 8:32am

Hi,

It works fine to call “cublasDgemm” from OpenACC host in a Fortran code. but when tried to use the batched cublas library cublasDgemmBatched from OpenACC host code likes

!$acc host_data use_device(A,B,C) stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
         A,dim,B,dim,beta,C,dim,batch_count)
!$acc end host_data

I got compiling error

PGF90-S-0155-Could not resolve generic procedure cublasdgemmbatched

Should we have to use a wrapper with ISO_C_BINDING interface to call
cublasdgemmbatched in Fortran codes?

Thanks.

Regards, Jing

MatColgrove · November 8, 2018, 4:30pm

Hi Jing,

Should we have to use a wrapper with ISO_C_BINDING interface to call
cublasdgemmbatched in Fortran codes?

No, you shouldn’t need a wrapper. What this error is indicating is that you do have a generic interface for this routine but there’s a data type mismatch so it can’t determine which actual routine to resolve it to.

I’m not 100% clear if you’re using the interface module shown in the web page link, or using the PGI supplied cublas modules (as defined in Fortran CUDA Interfaces :: PGI version 18.7 Documentation for x86 and NVIDIA Processors). Though in either case, the interfaces don’t take a Fortran array as an argument, but rather a C device pointer.

Unfortunately off-hand, I don’t have an example of calling cublasDgemmBatched from an OpenACC code. But think you should be able to make this work by using either “C_LOC” or “C_DEVLOC” on A, B, and C.

Something like:

!$acc host_data use_device(A,B,C) stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,& 
         C_LOC(A),dim,C_LOC(B),dim,beta,C_LOC(C),dim,batch_count) 
!$acc end host_data

or

stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,& 
         C_DEVLOC(A),dim,C_DEVLOC(B),dim,beta,C_DEVLOC(C),dim,batch_count)

If you still have issues, please post a small reproducing example and we can see what we can do to get it working.

Hope this helps,
Mat

gongjing · November 8, 2018, 9:11pm

Hi Mat,

It seems to me that it does not work using “C_LOC” or “C_DEVLOC” on A, B, and C. I still got same compiling error.

I just modified the code cublas-batch.f90 without ISO_C_BINDING, which is provided in the following link

http://vegtingraros.tk/userforum/viewtopic.php?t=5460&highlight=

cublas-batch.f90

program main

    use cublas
    implicit none
 
    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    real(8) :: alpha, beta, index, sum
    type(cublasHandle) :: handle
 
    integer :: sizeof_double
    parameter (sizeof_double=8)
 
    !Linear dimension of matrices
    dim = 100
 
    ! Number of A,B,C matrix sets
    batch_count = 1000
 
    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
   
    !$acc enter data create(A,B,C)
 
 
    ! Fill A,Bd diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

!$acc end kernels
 
    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
 
 
    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0
 
    ! batched DGEMM: C = alpha*A*B + beta*C

!$acc host_data use_device(A,B,C)
    stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
         C_LOC(A),dim,C_LOC(B),dim,beta,C_LOC(C),dim,batch_count)
!$acc end host_data

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ",dim*(batch_count)*(batch_count+1)/2

!$acc end data
 
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)

!    call cublasDestroy(handle)
 
end program

$ pgfortran -o cublas-batch.out cublas_batch.f90 -acc -ta=nvidia -Minfo -Mcuda
PGF90-S-0155-Could not resolve generic procedure cublasdgemmbatched (cublas_batch.f90: 68)
0 inform, 0 warnings, 1 severes, 0 fatal for main

Thanks.

/Jing

MatColgrove · November 9, 2018, 6:32pm

Hi Jing,

Having not used cublasDgemmBatched before myself, it was nice to learn something new!

cublasDgemmBatched doesn’t take matrices as arguments, but rather it takes arrays of pointers to matrices as arguments. So you can’t pass A, B, and directly, but rather need to create an array of device pointers to each of the 1000 arrays. I’ve have updated your code (See below) with this change.

However since your arrays are evenly strided, you can instead use cublasDgemmStridedBatched and then pass A, B, and C directly to routine (within a host_data region). I’ve included an example of this below as well. Also, the strided version should perform better as well.

Note for my tests, I’m using a Skylake CPU on Linux with a Tesla V100 and PGI 18.10. Though earlier versions of PGI work as well.

Hope this helps,
Mat

% cat  batchedDgemm.f90
program main

    use cublas
    implicit none

    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    type(c_devptr), device, dimension(:), allocatable :: dA, dB, dC
    real(8) :: alpha, beta
    real(8) :: index, sum
    type(cublasHandle) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
    allocate(dA(batch_count))
    allocate(dB(batch_count))
    allocate(dC(batch_count))

    !$acc enter data create(A,B,C)

    ! Fill A,Bd diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k
!$acc end kernels

    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! batched DGEMM: C = alpha*A*B + beta*C
    do i=1,1000
      dA(i) = C_DEVLOC(A(:,:,i))
      dB(i) = C_DEVLOC(B(:,:,i))
      dC(i) = C_DEVLOC(C(:,:,i))
    enddo
    stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
        dA,dim,dB,dim,beta,dC,dim,batch_count)

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ",dim*(batch_count)*(batch_count+1)/2

!$acc end data
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)
    deallocate(dA)
    deallocate(dB)
    deallocate(dC)
    stat = cublasDestroy(handle)

end program
% pgf90 -ta=tesla:cc70 -Mcudalib=cublas batchedDgemm.f90 -Mcuda -V18.10
% a.out
 Sum is:    50050000.00000000      should be:      50050000

Example of using strided:

% cat batchedStridedDgemm.f90
program main

    use cublas
    implicit none

    integer :: dim, stat, i, j, k, stride, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    real(8) :: alpha, beta
    real(8) :: index, sum
    type(cublasHandle) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100
    stride = dim*dim

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
    allocate(dA(batch_count))
    allocate(dB(batch_count))
    allocate(dC(batch_count))

    !$acc enter data create(A,B,C)

    ! Fill A,Bd diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k
!$acc end kernels

    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! batched DGEMM: C = alpha*A*B + beta*C
!$acc host_data use_device(A,B,C)
    stat=cublasDgemmStridedBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
        A,dim,stride,B,dim,stride,beta,C,dim,stride,batch_count)
!$acc end host_data

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ",dim*(batch_count)*(batch_count+1)/2

!$acc end data
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)
    deallocate(dA)
    deallocate(dB)
    deallocate(dC)
    stat = cublasDestroy(handle)

end program
% pgf90 -ta=tesla:cc70 -Mcudalib=cublas batchedStridedDgemm.f90 -Mcuda -V18.10
% a.out
 Sum is:    50050000.00000000      should be:      50050000

gongjing · November 10, 2018, 9:44am

Hi Mat,

Both methods work fine with PGI v18.7.

Thanks a lot for your help. /Jing

Topic		Replies	Views
openacc with cublas batched routine in fortran Legacy PGI Compilers	7	8212	January 27, 2017
0: copyout Memcpy... FAILED: 4(unspecified launch failure) Legacy PGI Compilers	3	5239	May 13, 2013
cuBLAS Dgemm "Could not Resolve Generic Procedure nvc, nvc++ and nvfortran cuda , hpc	7	1439	December 17, 2021
Using CUDA Libraries from CUDA Fortran Device Code Legacy PGI Compilers	6	7434	July 19, 2017
Interfacing OpenACC with cublasDgetrfBatched/cusparseDgtsvStridedBatch in Fortran GPU-Accelerated Libraries	2	1158	April 11, 2014
Segmentation fault : cublasDgetrfBatched and cublasDgetriBatched Legacy PGI Compilers	5	1760	July 10, 2019
New cuBlas function "getrfBatched" Legacy PGI Compilers	5	8512	January 21, 2014
cuBLAS call from kernel in CUDA 10.0 GPU-Accelerated Libraries	9	4758	April 7, 2021
Using Fortran derived types and cuBLAS Legacy PGI Compilers	19	12047	June 24, 2016
Undefined reference to cublas_ -- but library is there! Legacy PGI Compilers	5	12444	September 1, 2010

PGF90-S-0155-Could not resolve generic procedure cublasdgemm

Related topics