PGF90-S-0155-Could not resolve generic procedure cublasdgemm

Hi,

It works fine to call “cublasDgemm” from OpenACC host in a Fortran code. but when tried to use the batched cublas library cublasDgemmBatched from OpenACC host code likes

!$acc host_data use_device(A,B,C) stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
         A,dim,B,dim,beta,C,dim,batch_count)
!$acc end host_data

I got compiling error

PGF90-S-0155-Could not resolve generic procedure cublasdgemmbatched

Should we have to use a wrapper with ISO_C_BINDING interface to call
cublasdgemmbatched in Fortran codes?

https://www.olcf.ornl.gov/tutorials/concurrent-kernels-ii-batched-library-calls/#fortran

Thanks.

Regards, Jing

Hi Jing,

Should we have to use a wrapper with ISO_C_BINDING interface to call
cublasdgemmbatched in Fortran codes?

No, you shouldn’t need a wrapper. What this error is indicating is that you do have a generic interface for this routine but there’s a data type mismatch so it can’t determine which actual routine to resolve it to.

I’m not 100% clear if you’re using the interface module shown in the web page link, or using the PGI supplied cublas modules (as defined in https://www.pgroup.com/resources/docs/18.7/x86/pgi-cuda-interfaces/index.htm#dp-cublasdgemmbatched). Though in either case, the interfaces don’t take a Fortran array as an argument, but rather a C device pointer.

Unfortunately off-hand, I don’t have an example of calling cublasDgemmBatched from an OpenACC code. But think you should be able to make this work by using either “C_LOC” or “C_DEVLOC” on A, B, and C.

Something like:

!$acc host_data use_device(A,B,C) stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,& 
         C_LOC(A),dim,C_LOC(B),dim,beta,C_LOC(C),dim,batch_count) 
!$acc end host_data

or

stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,& 
         C_DEVLOC(A),dim,C_DEVLOC(B),dim,beta,C_DEVLOC(C),dim,batch_count)

If you still have issues, please post a small reproducing example and we can see what we can do to get it working.

Hope this helps,
Mat

Hi Mat,

It seems to me that it does not work using “C_LOC” or “C_DEVLOC” on A, B, and C. I still got same compiling error.

I just modified the code cublas-batch.f90 without ISO_C_BINDING, which is provided in the following link

http://vegtingraros.tk/userforum/viewtopic.php?t=5460&highlight=

cublas-batch.f90

program main

    use cublas
    implicit none
 
    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    real(8) :: alpha, beta, index, sum
    type(cublasHandle) :: handle
 
    integer :: sizeof_double
    parameter (sizeof_double=8)
 
    !Linear dimension of matrices
    dim = 100
 
    ! Number of A,B,C matrix sets
    batch_count = 1000
 
    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
   
    !$acc enter data create(A,B,C)
 
 
    ! Fill A,Bd diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

!$acc end kernels
 
    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
 
 
    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0
 
    ! batched DGEMM: C = alpha*A*B + beta*C

!$acc host_data use_device(A,B,C)
    stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
         C_LOC(A),dim,C_LOC(B),dim,beta,C_LOC(C),dim,batch_count)
!$acc end host_data

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ",dim*(batch_count)*(batch_count+1)/2

!$acc end data
 
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)

!    call cublasDestroy(handle)
 
end program



$ pgfortran -o cublas-batch.out cublas_batch.f90 -acc -ta=nvidia -Minfo -Mcuda
PGF90-S-0155-Could not resolve generic procedure cublasdgemmbatched (cublas_batch.f90: 68)
0 inform, 0 warnings, 1 severes, 0 fatal for main

Thanks.

/Jing

Hi Jing,

Having not used cublasDgemmBatched before myself, it was nice to learn something new!

cublasDgemmBatched doesn’t take matrices as arguments, but rather it takes arrays of pointers to matrices as arguments. So you can’t pass A, B, and directly, but rather need to create an array of device pointers to each of the 1000 arrays. I’ve have updated your code (See below) with this change.

However since your arrays are evenly strided, you can instead use cublasDgemmStridedBatched and then pass A, B, and C directly to routine (within a host_data region). I’ve included an example of this below as well. Also, the strided version should perform better as well.

Note for my tests, I’m using a Skylake CPU on Linux with a Tesla V100 and PGI 18.10. Though earlier versions of PGI work as well.

Hope this helps,
Mat

% cat  batchedDgemm.f90
program main

    use cublas
    implicit none

    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    type(c_devptr), device, dimension(:), allocatable :: dA, dB, dC
    real(8) :: alpha, beta
    real(8) :: index, sum
    type(cublasHandle) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
    allocate(dA(batch_count))
    allocate(dB(batch_count))
    allocate(dC(batch_count))

    !$acc enter data create(A,B,C)

    ! Fill A,Bd diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k
!$acc end kernels

    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! batched DGEMM: C = alpha*A*B + beta*C
    do i=1,1000
      dA(i) = C_DEVLOC(A(:,:,i))
      dB(i) = C_DEVLOC(B(:,:,i))
      dC(i) = C_DEVLOC(C(:,:,i))
    enddo
    stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
        dA,dim,dB,dim,beta,dC,dim,batch_count)

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ",dim*(batch_count)*(batch_count+1)/2

!$acc end data
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)
    deallocate(dA)
    deallocate(dB)
    deallocate(dC)
    stat = cublasDestroy(handle)

end program
% pgf90 -ta=tesla:cc70 -Mcudalib=cublas batchedDgemm.f90 -Mcuda -V18.10
% a.out
 Sum is:    50050000.00000000      should be:      50050000

Example of using strided:

% cat batchedStridedDgemm.f90
program main

    use cublas
    implicit none

    integer :: dim, stat, i, j, k, stride, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    real(8) :: alpha, beta
    real(8) :: index, sum
    type(cublasHandle) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100
    stride = dim*dim

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
    allocate(dA(batch_count))
    allocate(dB(batch_count))
    allocate(dC(batch_count))

    !$acc enter data create(A,B,C)

    ! Fill A,Bd diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k
!$acc end kernels

    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! batched DGEMM: C = alpha*A*B + beta*C
!$acc host_data use_device(A,B,C)
    stat=cublasDgemmStridedBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
        A,dim,stride,B,dim,stride,beta,C,dim,stride,batch_count)
!$acc end host_data

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ",dim*(batch_count)*(batch_count+1)/2

!$acc end data
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)
    deallocate(dA)
    deallocate(dB)
    deallocate(dC)
    stat = cublasDestroy(handle)

end program
% pgf90 -ta=tesla:cc70 -Mcudalib=cublas batchedStridedDgemm.f90 -Mcuda -V18.10
% a.out
 Sum is:    50050000.00000000      should be:      50050000

Hi Mat,

Both methods work fine with PGI v18.7.

Thanks a lot for your help. /Jing