openacc with cublas batched routine in fortran

Hi, all.

I used a simple code from internet and modified it to check interfacing openacc with cublas batche routine in fortran. The code could run, but the result was not correct. Here are the codes. I am appreciated for any suggestion.



cublas_for.cu
#include <stdio.h>
#include "cublas_v2.h"

extern "C" int f_cublasCreate(cublasHandle_t **handle)
{
    *handle = (cublasHandle_t*)malloc(sizeof(cublasHandle_t));
    return cublasCreate(*handle);
}


extern "C" void f_cublasDestroy(cublasHandle_t *handle)
{
    cublasDestroy(*handle);
    free(handle);
}



cublas_for_iso.f90

 Module cublas_f
  use ISO_C_BINDING
 
    enum, BIND(C)
        enumerator :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C
    end enum
    
    enum, BIND(C)
        enumerator :: CUBLAS_STATUS_SUCCESS,CUBLAS_STATUS_NOT_INITIALIZED,CUBLAS_STATUS_INVALID_VALUE,&
        CUBLAS_STATUS_ARCH_MISMATCH,CUBLAS_STATUS_EXECUTION_FAILED
    end enum

 
  INTERFACE
 
    integer(C_INT) function cublasCreate(handle_ptr) BIND(C, NAME='f_cublasCreate')
        use ISO_C_BINDING
        type(C_PTR) :: handle_ptr
    end function
 
    subroutine  cublasDestroy(handle_ptr) BIND(C, NAME='f_cublasDestroy')
        use ISO_C_BINDING
        type(C_PTR), value :: handle_ptr
    end subroutine
 
    integer(C_INT) function cublasDgemmBatched(handle, transa, transb, m, n, k, alpha, &
                                        A, lda, B, ldb, beta, C, ldc, batch_count) &
                                        BIND(C, NAME='cublasDgemmBatched')
        use ISO_C_BINDING
        type(C_PTR), value    :: handle
        integer(C_INT), value :: transa
        integer(C_INT), value :: transb
        integer(C_INT), value :: m
        integer(C_INT), value :: n
        integer(C_INT), value :: k
        real(C_DOUBLE)        :: alpha
        REAL(C_DOUBLE) :: A(lda,k,batch_count)
        integer(C_INT), value :: lda
        real(C_DOUBLE)    :: B(ldb,n,batch_count)
        integer(C_INT), value :: ldb
        real(C_DOUBLE)        :: beta
        real(C_DOUBLE)  :: C(ldc,n,batch_count)
        integer(C_INT), value :: ldc
        integer(C_INT), value :: batch_count
    end function
 
 
  END INTERFACE
 
end module cublas_f



cublas-batch.f90
program main
    use ISO_C_BINDING
    use cublas_f
    implicit none
 
    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), allocatable :: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR) :: handle
 
    integer :: sizeof_double
    parameter (sizeof_double=8)
 
    !Linear dimension of matrices
    dim = 100
 
    ! Number of A,B,C matrix sets
    batch_count = 1000
 
    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))
    
    !$acc enter data create(A,B,C)
 
 
    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major

!$acc data present(A,B,C)
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
!$acc loop
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    a(i,j,k) = k*sin(index)
                    b(i,j,k) = sin(index)
                    c(i,j,k) = k*cos(index) * cos(index)
                else
                    a(i,j,k) = 0.0
                    b(i,j,k) = 0.0
                    c(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

!$acc end kernels
 

    ! Create cublas instance
    stat = cublasCreate(handle)
    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
 
 
    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0
 
    ! batched DGEMM: C = alpha*A*B + beta*C

!$acc host_data use_device(A,B,C)
    stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&
         A,dim,B,dim,beta,C,dim,batch_count)
!$acc end host_data

    if(stat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDgemm failed:',stat
    ! Simple sanity test, sum up all elements
    sum = 0.0
!$acc kernels
!$acc loop
    do k=1,batch_count
!$acc loop
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
!$acc end kernels
    print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2

!$acc end data
 
!$acc exit data delete(A,B,C)
    deallocate(A)
    deallocate(B)
    deallocate(C)

    call cublasDestroy(handle)
 
end program

I compiled the codes as follows.
nvcc -c cublas_for.cu
pgfortran -c cublas_for_iso.f90
pgfortran -o cublas-batch.out cublas-batch.f90 cublas_for_iso.o cublas_for.cu -acc -ta=nvidia -Minfo -Mcuda

This is the running result.
cublasDgemm failed: 1
Sum is: 24907700.05277213 should be: 50050000

Thanks a lot.

PGI has modules for interfacing Fortran to the CUDA libraries such as cublas.
After you install, look into the PGI installation under 2016/examples/CUDA-Libraries, and also this document:
http://www.pgroup.com/doc/pgicudaint.pdf

Thanks, brentl.

Yes, I have written the code with PGI CUDA Fortran features and run successfully as you pointed out.

I noticed that this implementation depends on some additional features like type definition attributes with ‘device’, ‘c_devptr’, and a function of ‘c_devloc’.

I am wondering if it is exist another way to interface openacc with cublas library without PGI defined above-mentioned features. It is much like the example given by the link https://devblogs.nvidia.com/parallelforall/3-versatile-openacc-interoperability-techniques/.

Thank all of you.

You have to invent some way to create an array of addresses. This is not a normal Fortran programming paradigm. You can probably use c_loc and c_ptr from the iso_c_binding module, to do basically what we’ve done with c_devloc and c_devptr.

Thank you so much. I’ll have a try.

I am calling cuSparse from Fortran using my own connection (not the provided bindings).

I do not know if this will help in your case but here is the way I do it:

In my C code (after global declarations):

void load_prec_inv_cusparse(double* CSR_LU, int* CSR_I, int* CSR_J, int N, int M)
{
  // Setup cusparse.
  cusparseCreate(&cusparseHandle);

  // Set up L
  cusparseCreateMatDescr  (&L_described);
  cusparseSetMatType      (L_described,CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase (L_described,CUSPARSE_INDEX_BASE_ONE);
  cusparseSetMatFillMode  (L_described,CUSPARSE_FILL_MODE_LOWER);
  cusparseSetMatDiagType  (L_described,CUSPARSE_DIAG_TYPE_UNIT);
  cusparseCreateSolveAnalysisInfo(&L_analyzed);

  // Set up U
  cusparseCreateMatDescr  (&U_described);
  cusparseSetMatType      (U_described,CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase (U_described,CUSPARSE_INDEX_BASE_ONE);
  cusparseSetMatFillMode  (U_described,CUSPARSE_FILL_MODE_UPPER);
  cusparseSetMatDiagType  (U_described,CUSPARSE_DIAG_TYPE_NON_UNIT);
  cusparseCreateSolveAnalysisInfo(&U_analyzed);

  // Analyze L
  cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          N, M, L_described, CSR_LU, CSR_I, CSR_J, L_analyzed);

  // Analyze U
  cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          N, M, U_described, CSR_LU, CSR_I, CSR_J, U_analyzed);

  cudaDeviceSynchronize();
}

void apply_prec_inv_cusparse(double* x, double* CSR_LU, int* CSR_I,
                             int* CSR_J, int N, int M)
{
  // Forward solve (Lx=y)
  cusparseDcsrsv_solve(cusparseHandle,
          CUSPARSE_OPERATION_NON_TRANSPOSE, N, &one, L_described,
                      CSR_LU, CSR_I, CSR_J, L_analyzed, x, x);

  // Backward solve (Uy=x)
  cusparseDcsrsv_solve(cusparseHandle,
          CUSPARSE_OPERATION_NON_TRANSPOSE, N, &one, U_described,
                      CSR_LU, CSR_I, CSR_J, U_analyzed, x, x);

  cudaDeviceSynchronize();
}

void unload_prec_inv_cusparse()
{
  // Free up cusparse.
  cusparseDestroySolveAnalysisInfo(L_analyzed);
  cusparseDestroySolveAnalysisInfo(U_analyzed);
  cusparseDestroy(cusparseHandle);
  cudaDeviceSynchronize();
}

Then, in the FORTRAN code:

      module cusparse_interface
      interface
        subroutine load_prec_inv_cusparse(CSR_A,CSR_AI,CSR_AJ,N,M)
     &    BIND(C, name="load_prec_inv_cusparse")
          use, intrinsic :: iso_c_binding
          implicit none
          integer(C_INT), value :: N,M
          type(C_PTR), value :: CSR_A,CSR_AI,CSR_AJ
        end subroutine load_prec_inv_cusparse
c
        subroutine apply_prec_inv_cusparse(x,CSR_LU,CSR_AI,CSR_AJ,N,M)
     &    BIND(C, name="apply_prec_inv_cusparse")
          use, intrinsic :: iso_c_binding
          implicit none
          integer(C_INT), value :: N
          type(C_PTR), value :: x,CSR_LU,CSR_AI,CSR_AJ
        end subroutine apply_prec_inv_cusparse
c
        subroutine unload_prec_inv_cusparse()
     &    BIND(C, name="unload_prec_inv_cusparse")
        end subroutine unload_prec_inv_cusparse
       end module

Then, to call the routine (for example):

      integer(c_int) :: cN,cM
      integer :: N,M

       cN=N
       cM=M
!$acc host_data use_device(x,a_csr,a_csr_ja,a_csr_ia)
          call apply_prec_inv_cusparse(C_LOC(x(1)),C_LOC(a_csr(1)),
     &                                 C_LOC(a_csr_ia(1)),
     &                                 C_LOC(a_csr_ja(1)), cN, cM)
!$acc end host_data

In the above, I already had moved the data to the device with a data enter.

When I compile I use:

nvcc -c prec_inv.c
mpif90 prec_inv.o mycode.f

(with all the libs, flags, includes, etc needed)

Semseq, thank you for your information.

I noticed that you use C_LOC function to return a pointer. I am wondering if the input variable of C_LOC function in your codes is on the host.

I also tried to use c_loc like your method, but the compiler reported errors.

Here are what I have changed on my codes.


 cublas_for_iso.f90 
       TYPE(C_PTR),VALUE :: A
       TYPE(C_PTR),VALUE   :: B
       TYPE(C_PTR), VALUE  :: C



 cublas-batch.f90 
   stat=cublasDgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,dim,dim,dim,alpha,&        C_LOC(A(1,1,1)),dim,C_LOC(B(1,1,1)),dim,beta,C_LOC(C(1,1,1)),dim,batch_count)

The compilers emits error messages like this.
PGF90-W-0189-Argument number 8 to cublasdgemmbatched: association of scalar actual argument to array dummy argument (cublas-batch.f90: 95)
PGF90-S-0188-Argument number 8 to cublasdgemmbatched: type mismatch (cublas-acc.f90: 95)
PGF90-W-0189-Argument number 10 to cublasdgemmbatched: association of scalar actual argument to array dummy argument (cublas-batch.f90: 95)
PGF90-S-0188-Argument number 10 to cublasdgemmbatched: type mismatch (cublas-acc.f90: 95)
PGF90-W-0189-Argument number 13 to cublasdgemm: association of scalar actual argument to array dummy argument (cublas-batch.f90: 95)
PGF90-S-0188-Argument number 13 to cublasdgemmbatched: type mismatch (cublas-batch.f90: 95)
0 inform, 3 warnings, 3 severes, 0 fatal for main
make: *** [cublas-acc.o] Error 2

Any advice will be greatly appreciated.

Hi,

The C_LOC() in my code is being used on device pointers through the use of the OpenACC “host_data use_device()”

It looks like in your code you are using host pointers?

I do not use any PGI CUDA-Fortran code and instead use straight OpenACC to move the data to the device using an unstructured “acc data enter”. Then when using “use_device” the C_LOC pointer is a device pointer that can be used with cusparse/cublas.