openacc with cublas batched routine in fortran

Hi, all.

I used a simple code from internet and modified it to check interfacing openacc with cublas batche routine in fortran. The code could run, but the result was not correct. Here are the codes. I am appreciated for any suggestion.
#include <stdio.h>
#include "cublas_v2.h"

extern "C" int f_cublasCreate(cublasHandle_t **handle)
    *handle = (cublasHandle_t*)malloc(sizeof(cublasHandle_t));
    return cublasCreate(*handle);

extern "C" void f_cublasDestroy(cublasHandle_t *handle)


 Module cublas_f
    enum, BIND(C)
        enumerator :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C
    end enum
    enum, BIND(C)
    end enum

    integer(C_INT) function cublasCreate(handle_ptr) BIND(C, NAME='f_cublasCreate')
        use ISO_C_BINDING
        type(C_PTR) :: handle_ptr
    end function
    subroutine  cublasDestroy(handle_ptr) BIND(C, NAME='f_cublasDestroy')
        use ISO_C_BINDING
        type(C_PTR), value :: handle_ptr
    end subroutine
    integer(C_INT) function cublasDgemmBatched(handle, transa, transb, m, n, k, alpha, &
                                        A, lda, B, ldb, beta, C, ldc, batch_count) &
                                        BIND(C, NAME='cublasDgemmBatched')
        use ISO_C_BINDING
        type(C_PTR), value    :: handle
        integer(C_INT), value :: transa
        integer(C_INT), value :: transb
        integer(C_INT), value :: m
        integer(C_INT), value :: n
        integer(C_INT), value :: k
        real(C_DOUBLE)        :: alpha
        REAL(C_DOUBLE) :: A(lda,k,batch_count)
        integer(C_INT), value :: lda
        real(C_DOUBLE)    :: B(ldb,n,batch_count)
        integer(C_INT), value :: ldb
        real(C_DOUBLE)        :: beta
        real(C_DOUBLE)  :: C(ldc,n,batch_count)
        integer(C_INT), value :: ldc
        integer(C_INT), value :: batch_count
    end function
end module cublas_f

program main
    use cublas_f
    implicit none
    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR) :: handle
    integer :: sizeof_double
    parameter (sizeof_double=8)
    !Linear dimension of matrices
    dim = 100
    ! Number of A,B,C matrix sets
    batch_count = 1000
    ! Allocate host storage for A,B,C square matrices
    !$acc enter data create(A,B,C)
    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
            enddo ! i
        enddo ! j
    enddo ! k

!$acc end kernels

    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0
    ! batched DGEMM: C = alpha*A*B + beta*C

!$acc host_data use_device(A,B,C)
!$acc end host_data

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
!$acc end kernels
    print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2

!$acc end data
!$acc exit data delete(A,B,C)

    call cublasDestroy(handle)
end program

I compiled the codes as follows.
nvcc -c
pgfortran -c cublas_for_iso.f90
pgfortran -o cublas-batch.out cublas-batch.f90 cublas_for_iso.o -acc -ta=nvidia -Minfo -Mcuda

This is the running result.
cublasDgemm failed: 1
Sum is: 24907700.05277213 should be: 50050000

Thanks a lot.

PGI has modules for interfacing Fortran to the CUDA libraries such as cublas.
After you install, look into the PGI installation under 2016/examples/CUDA-Libraries, and also this document:

Thanks, brentl.

Yes, I have written the code with PGI CUDA Fortran features and run successfully as you pointed out.

I noticed that this implementation depends on some additional features like type definition attributes with ‘device’, ‘c_devptr’, and a function of ‘c_devloc’.

I am wondering if it is exist another way to interface openacc with cublas library without PGI defined above-mentioned features. It is much like the example given by the link

Thank all of you.

You have to invent some way to create an array of addresses. This is not a normal Fortran programming paradigm. You can probably use c_loc and c_ptr from the iso_c_binding module, to do basically what we’ve done with c_devloc and c_devptr.

Thank you so much. I’ll have a try.

I am calling cuSparse from Fortran using my own connection (not the provided bindings).

I do not know if this will help in your case but here is the way I do it:

In my C code (after global declarations):

void load_prec_inv_cusparse(double* CSR_LU, int* CSR_I, int* CSR_J, int N, int M)
  // Setup cusparse.

  // Set up L
  cusparseCreateMatDescr  (&L_described);
  cusparseSetMatType      (L_described,CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase (L_described,CUSPARSE_INDEX_BASE_ONE);
  cusparseSetMatFillMode  (L_described,CUSPARSE_FILL_MODE_LOWER);
  cusparseSetMatDiagType  (L_described,CUSPARSE_DIAG_TYPE_UNIT);

  // Set up U
  cusparseCreateMatDescr  (&U_described);
  cusparseSetMatType      (U_described,CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase (U_described,CUSPARSE_INDEX_BASE_ONE);
  cusparseSetMatFillMode  (U_described,CUSPARSE_FILL_MODE_UPPER);
  cusparseSetMatDiagType  (U_described,CUSPARSE_DIAG_TYPE_NON_UNIT);

  // Analyze L
  cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          N, M, L_described, CSR_LU, CSR_I, CSR_J, L_analyzed);

  // Analyze U
  cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          N, M, U_described, CSR_LU, CSR_I, CSR_J, U_analyzed);


void apply_prec_inv_cusparse(double* x, double* CSR_LU, int* CSR_I,
                             int* CSR_J, int N, int M)
  // Forward solve (Lx=y)
          CUSPARSE_OPERATION_NON_TRANSPOSE, N, &one, L_described,
                      CSR_LU, CSR_I, CSR_J, L_analyzed, x, x);

  // Backward solve (Uy=x)
          CUSPARSE_OPERATION_NON_TRANSPOSE, N, &one, U_described,
                      CSR_LU, CSR_I, CSR_J, U_analyzed, x, x);


void unload_prec_inv_cusparse()
  // Free up cusparse.

Then, in the FORTRAN code:

      module cusparse_interface
        subroutine load_prec_inv_cusparse(CSR_A,CSR_AI,CSR_AJ,N,M)
     &    BIND(C, name="load_prec_inv_cusparse")
          use, intrinsic :: iso_c_binding
          implicit none
          integer(C_INT), value :: N,M
          type(C_PTR), value :: CSR_A,CSR_AI,CSR_AJ
        end subroutine load_prec_inv_cusparse
        subroutine apply_prec_inv_cusparse(x,CSR_LU,CSR_AI,CSR_AJ,N,M)
     &    BIND(C, name="apply_prec_inv_cusparse")
          use, intrinsic :: iso_c_binding
          implicit none
          integer(C_INT), value :: N
          type(C_PTR), value :: x,CSR_LU,CSR_AI,CSR_AJ
        end subroutine apply_prec_inv_cusparse
        subroutine unload_prec_inv_cusparse()
     &    BIND(C, name="unload_prec_inv_cusparse")
        end subroutine unload_prec_inv_cusparse
       end module

Then, to call the routine (for example):

      integer(c_int) :: cN,cM
      integer :: N,M

!$acc host_data use_device(x,a_csr,a_csr_ja,a_csr_ia)
          call apply_prec_inv_cusparse(C_LOC(x(1)),C_LOC(a_csr(1)),
     &                                 C_LOC(a_csr_ia(1)),
     &                                 C_LOC(a_csr_ja(1)), cN, cM)
!$acc end host_data

In the above, I already had moved the data to the device with a data enter.

When I compile I use:

nvcc -c prec_inv.c
mpif90 prec_inv.o mycode.f

(with all the libs, flags, includes, etc needed)

Semseq, thank you for your information.

I noticed that you use C_LOC function to return a pointer. I am wondering if the input variable of C_LOC function in your codes is on the host.

I also tried to use c_loc like your method, but the compiler reported errors.

Here are what I have changed on my codes.

       TYPE(C_PTR),VALUE :: A
       TYPE(C_PTR),VALUE   :: B
       TYPE(C_PTR), VALUE  :: C

   stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&        C_LOC(A(1,1,1)),dim,C_LOC(B(1,1,1)),dim,beta,C_LOC(C(1,1,1)),dim,batch_count)

The compilers emits error messages like this.
PGF90-W-0189-Argument number 8 to cublasdgemmbatched: association of scalar actual argument to array dummy argument (cublas-batch.f90: 95)
PGF90-S-0188-Argument number 8 to cublasdgemmbatched: type mismatch (cublas-acc.f90: 95)
PGF90-W-0189-Argument number 10 to cublasdgemmbatched: association of scalar actual argument to array dummy argument (cublas-batch.f90: 95)
PGF90-S-0188-Argument number 10 to cublasdgemmbatched: type mismatch (cublas-acc.f90: 95)
PGF90-W-0189-Argument number 13 to cublasdgemm: association of scalar actual argument to array dummy argument (cublas-batch.f90: 95)
PGF90-S-0188-Argument number 13 to cublasdgemmbatched: type mismatch (cublas-batch.f90: 95)
0 inform, 3 warnings, 3 severes, 0 fatal for main
make: *** [cublas-acc.o] Error 2

Any advice will be greatly appreciated.


The C_LOC() in my code is being used on device pointers through the use of the OpenACC “host_data use_device()”

It looks like in your code you are using host pointers?

I do not use any PGI CUDA-Fortran code and instead use straight OpenACC to move the data to the device using an unstructured “acc data enter”. Then when using “use_device” the C_LOC pointer is a device pointer that can be used with cusparse/cublas.