Using streams and cuBLAS with CUDA Fortran

I want to run 100 matrix-vector multiplications in parallel, but I am running into some problems.

I followed the example from NVIDIA:

use cudafor
use cublas
...
DO I=1,NX
  ISTAT = CUBLASSETSTREAM(HANDLE,STREAM(I))
  ISTAT = CUBLASZGEMV(HANDLE,'N',.....)
END DO

But the compiler complains:

PGF90-S-0155-Could not resolve generic procedure cublaszgemv (zgemv_batch_gpu.f90: 51)
0 inform, 0 warnings, 1 severes, 0 fatal for zgemv_batch

First, I know there is no problem when I use zgemv or zgemm, but I need to run the calls in parallel.
Therefore, do I have to write the interface for cuBLAS myself?
The same thing happened when I tried to use zgemm_batch.

Another question: if I want to use cuSolverDn, do I have to write the interface myself?

Therefore, do I have to write the interface for cuBLAS myself?

The cuBLAS module does include a generic interface to “ZGEMV”. If you pass it a “device” array, the cuBLAS version is called.
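As a minimal sketch of the generic-interface route: the program below passes device arrays to the generic zgemv name, which should then resolve to the cuBLAS version. The program name, array names, and the size n are made up for illustration, and the program needs to be linked against cuBLAS.

```fortran
program zgemv_generic
  use cudafor
  use cublas
  implicit none
  integer, parameter :: n = 4                      ! illustrative size (assumption)
  complex(8) :: a(n,n), x(n), y(n)                 ! host copies
  complex(8), device :: a_d(n,n), x_d(n), y_d(n)   ! device copies
  complex(8) :: alpha, beta

  a = (1.0d0, 0.0d0)
  x = (1.0d0, 0.0d0)
  y = (0.0d0, 0.0d0)
  alpha = (1.0d0, 0.0d0)
  beta  = (0.0d0, 0.0d0)

  ! Host-to-device copies by assignment
  a_d = a
  x_d = x
  y_d = y

  ! Because a_d, x_d and y_d are device arrays, the generic name
  ! is expected to resolve to the cuBLAS implementation.
  call zgemv('N', n, n, alpha, a_d, n, x_d, 1, beta, y_d, 1)

  ! Device-to-host copy of the result
  y = y_d
  print *, y
end program zgemv_generic
```

The call looks the same as a host BLAS call, which is why this route is convenient and portable; only the declarations of the arrays change.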

If you want to call CUBLASZGEMV directly, then you can write your own interface or use the cuBLAS Fortran bindings: http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings

For batching, see: http://www.pgroup.com/lit/articles/insider/v6n1a4.htm

Another question: if I want to use cuSolverDn, do I have to write the interface myself?

We don’t ship a cuSolver interface module, just cuFFT, cuBLAS, cuRand, and cuSparse.

  • Mat

But… I still have some problems with that.

This is the interface for cublas_zgemm.

MODULE CUBLAS_F
  USE ISO_C_BINDING

  ENUM, BIND(C)
    ENUMERATOR :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C
  END ENUM

  INTERFACE
    INTEGER(C_INT) FUNCTION CUBLAS_ZGEMM(HANDLE,TRANSA,TRANSB, M, N, K, ALPHA, &
                                        A, LDA, B, LDB, BETA, C, LDC)&
                                        BIND(C,NAME='cublasZgemm')
      USE ISO_C_BINDING
      USE CUBLAS, ONLY: CUBLASHANDLE
      TYPE(CUBLASHANDLE), VALUE :: HANDLE
      INTEGER(C_INT), VALUE :: TRANSA
      INTEGER(C_INT), VALUE :: TRANSB
      INTEGER(C_INT), VALUE :: M
      INTEGER(C_INT), VALUE :: N
      INTEGER(C_INT), VALUE :: K
      COMPLEX(C_DOUBLE_COMPLEX) :: ALPHA
      COMPLEX(C_DOUBLE_COMPLEX), DEVICE :: A(*)
      INTEGER(C_INT), VALUE :: LDA
      COMPLEX(C_DOUBLE_COMPLEX), DEVICE :: B(*)
      INTEGER(C_INT), VALUE :: LDB
      COMPLEX(C_DOUBLE_COMPLEX) :: BETA
      COMPLEX(C_DOUBLE_COMPLEX), DEVICE :: C(*)
      INTEGER(C_INT), VALUE :: LDC
    END FUNCTION
  END INTERFACE
END MODULE

I use my own version of cublas_f.
It compiles successfully, but I get this error message when I try to run it:

** On entry to ZGEMM parameter number 1 had an illegal value

Does cublasZgemm take a cublasHandle argument? If not, what should I do now?

PGI doesn’t ship a cuSolver interface. So, if I want to use it, I have to write the interface. Am I right?

Hi afai,

Here are the ZGEMM interfaces we use in our cuBLAS module. Only the “v2” interface, from the newer version of the cuBLAS API, takes the handle. My apologies, I forgot that we had these defined in the module. I typically just use the generic zgemm interface since it’s more convenient and portable. You should be able to use one of these instead of writing your own.

subroutine cublasZgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable

integer(4) function cublasZgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
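Applied to the original matrix-vector loop, this suggests the "Could not resolve generic procedure" error likely came from mixing the two conventions: the call passed a handle together with a character 'N', while the legacy routine takes no handle and the v2 routine takes an integer operation code. The sketch below assumes the module also provides cublasZgemv_v2 and the CUBLAS_OP_N constant, following the same pattern as cublasZgemm_v2 above; the program name, sizes, and array names are made up for illustration.

```fortran
program zgemv_streams
  use cudafor
  use cublas
  implicit none
  integer, parameter :: n = 64, nx = 100           ! illustrative sizes (assumption)
  type(cublasHandle) :: handle
  integer(kind=cuda_stream_kind) :: stream(nx)
  complex(8), allocatable :: a(:,:,:), x(:,:), y(:,:)
  complex(8), device, allocatable :: a_d(:,:,:), x_d(:,:), y_d(:,:)
  complex(8) :: alpha, beta
  integer :: i, istat

  allocate(a(n,n,nx), x(n,nx), y(n,nx))
  allocate(a_d(n,n,nx), x_d(n,nx), y_d(n,nx))
  a = (1.0d0, 0.0d0)
  x = (1.0d0, 0.0d0)
  y = (0.0d0, 0.0d0)
  alpha = (1.0d0, 0.0d0)
  beta  = (0.0d0, 0.0d0)
  a_d = a
  x_d = x
  y_d = y

  istat = cublasCreate(handle)
  do i = 1, nx
    istat = cudaStreamCreate(stream(i))
  end do

  ! One matrix-vector product per stream. The v2 call takes the handle
  ! and an integer operation code (CUBLAS_OP_N), not the character 'N'.
  do i = 1, nx
    istat = cublasSetStream(handle, stream(i))
    istat = cublasZgemv_v2(handle, CUBLAS_OP_N, n, n, alpha, &
                           a_d(:,:,i), n, x_d(:,i), 1, beta, y_d(:,i), 1)
  end do

  ! Wait for all streams to finish before copying results back
  istat = cudaDeviceSynchronize()
  y = y_d
  print *, y(1,1), y(n,nx)

  do i = 1, nx
    istat = cudaStreamDestroy(stream(i))
  end do
  istat = cublasDestroy(handle)
end program zgemv_streams
```

Whether the products actually overlap on the GPU depends on the problem size and the hardware, so it is worth checking with a profiler before relying on it.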



So, if I want to use it, I have to write the interface. Am I right

Correct. Though the way cuSolver lays out data (row major) isn’t great for use with Fortran, so it often doesn’t perform well. It may be better to write your own solver instead.

From Greg Ruetsch, who wrote “CUDA Fortran for Scientists and Engineers”:

The problem is that many routines (e.g. triangular solves) map a warp of threads to a row in the matrix, and if the matrix arises from a low-order method that has only a few elements per row (like finite difference and finite volume approximations), most of the threads sit idle with the resulting (lack of) performance. You have to jump through some hoops of matrix reordering (not using cusparse’s reordering routines) and decomposition to do better, and even then it isn’t enough. In short, to get anything that performs well in such cases, you really need to roll your own

That really helps.

When you talk about the data ordering problem, does that explain why I don’t get a speedup? I compared zgemm from Intel MKL with the cuBLAS version, and I don’t get as much speedup as I want (to be honest, it depends on the matrix dimensions). Does the data ordering affect the performance of cuBLAS in CUDA Fortran?

Greg was specifically talking about cuSolver. cuBLAS performs quite well.

My guess is that the performance difference is due to data movement, a small problem size, or both. As problem sizes grow, cuBLAS’s performance relative to MKL typically improves.
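One way to check this is to time the transfers and the multiplication separately with CUDA events. The following is only a sketch under stated assumptions: the program name, the size n, and the variable names are made up for illustration, and it relies on the generic zgemm running on the default stream when given device arrays.

```fortran
program time_zgemm
  use cudafor
  use cublas
  implicit none
  integer, parameter :: n = 1024                   ! illustrative size (assumption)
  complex(8), allocatable :: a(:,:), b(:,:), c(:,:)
  complex(8), device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)
  complex(8) :: alpha, beta
  type(cudaEvent) :: t0, t1, t2, t3
  real :: ms_in, ms_gemm, ms_out
  integer :: istat

  allocate(a(n,n), b(n,n), c(n,n))
  allocate(a_d(n,n), b_d(n,n), c_d(n,n))
  a = (1.0d0, 0.0d0)
  b = (1.0d0, 0.0d0)
  c = (0.0d0, 0.0d0)
  alpha = (1.0d0, 0.0d0)
  beta  = (0.0d0, 0.0d0)

  istat = cudaEventCreate(t0)
  istat = cudaEventCreate(t1)
  istat = cudaEventCreate(t2)
  istat = cudaEventCreate(t3)

  ! Warm-up call so one-time library initialization is not timed
  a_d = a
  b_d = b
  c_d = c
  call zgemm('N', 'N', n, n, n, alpha, a_d, n, b_d, n, beta, c_d, n)
  istat = cudaDeviceSynchronize()

  istat = cudaEventRecord(t0, 0)
  a_d = a                                          ! host-to-device transfers
  b_d = b
  istat = cudaEventRecord(t1, 0)
  call zgemm('N', 'N', n, n, n, alpha, a_d, n, b_d, n, beta, c_d, n)
  istat = cudaEventRecord(t2, 0)
  c = c_d                                          ! device-to-host transfer
  istat = cudaEventRecord(t3, 0)
  istat = cudaEventSynchronize(t3)

  istat = cudaEventElapsedTime(ms_in,   t0, t1)
  istat = cudaEventElapsedTime(ms_gemm, t1, t2)
  istat = cudaEventElapsedTime(ms_out,  t2, t3)
  print *, 'copy in (ms): ', ms_in
  print *, 'zgemm (ms):   ', ms_gemm
  print *, 'copy out (ms):', ms_out
end program time_zgemm
```

If the copy times dominate for the sizes you care about, keeping the data resident on the device across several calls should matter more than the speed of the multiplication itself.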

Hi Mat, I do hope that PGI has cuSolverDn on its near-term development schedule!

Malcolm

Sorry Malcolm, but a cuSolverDN interface module isn’t something we’re planning on adding. At least not in the near-term.

  • Mat