CUBLAS Library and Fortran Bindings Fortran 2003 provides interoperability standards

In the document CUBLAS Library, PG-00000-002_V1.1, September, 2007,

there is a discussion on p. 75 regarding Fortran Bindings:

“…CUBLAS uses 1‐based indexing and Fortran‐style column‐major

storage for multidimensional data to simplify interfacing to Fortran

applications. Unfortunately, Fortran‐to‐C calling conventions are not

standardized and differ by platform and toolchain.”…

This is no longer a correct statement. Fortran 2003 provides the concept “Interoperability with C.” Many Fortran compilers have implemented the new standard, which allow a programmer to write portable code that interfaces to C. This is done with Use Association through an intrinsic module ISO_C_BINDING. Several vendors have provided partial support for this interoperability module:

Intel Fortran compiler 10.1, the IBM XL Fortran Enterprise Edition for AIX, V11.1 (5724-S72) Version 11.01.0000.0001D, (September 19, 2007), Sun Fortran 95 8.3 (July 18, 2007), g95 version 0.92 (March 14, 2009), and gfortran version 4.4.0 (February 19, 2009) have this intrinsic module available.

Using an efficient version of the routine SGEMM() is a crucial factor for achieving high performance in numerical linear algebra. The CUBLAS library lists a C version of matrix multiply, cublasSgemm(), that provides the functionality of the Level 3 BLAS routine SGEMM(). We include a standard and portable interface to that CUBLAS library code using a wrapper subprogram. This not been tested with the genuine CUBLAS library but it has been emulated using a stub C function that has the same calling protocols. Note the use of the attribute VALUE for passing some of the arguments, and the use of a copy from Fortran CHARACTER data to C char data.

Some forum posters have commented that the CUBLAS library is best used for larger problem sizes. The listed code could obviously be modified to call the CUBLAS code for one range of matrix dimensions and one or other routines for smaller sizes.

SUBROUTINE SGEMM(TRANSA,TRANSB,M,N,K,ALPHA,&

		A,LDA,B,LDB,BETA,C,LDC)

	  

	  USE ISO_C_BINDING

	  IMPLICIT NONE

	  

!	 .. Scalar Arguments ..

	  REAL ALPHA,BETA

	  INTEGER K,LDA,LDB,LDC,M,N

	  CHARACTER(LEN=*) TRANSA,TRANSB

!	 ..

!	 .. Array Arguments ..

	  REAL A(LDA,*),B(LDB,*),C(LDC,*)

! Define the INTERFACE to the NVIDIA C code cublasSgemm.

! This version of SGEMM is used in a user application

! that calls LAPACK single precision routines, or makes

! other uses of that code.

	  character(1,c_char) cta, ctb

	  INTERFACE

! This is what the NVIDIA code expects for its inputs:	  

!	  void cublasSgemm (char transa, char transb, int m, int n,

!		 int k, float alpha, const float *A, int lda,

!		 const float *B, int ldb, float beta,

!		 float *C, int ldc)

	   subroutine c_sgemm(cta, ctb, m, n, k,&

		 alpha, A, lda, B, ldb, beta, c, ldc)bind(C,name='cublasSgemm')

			 USE ISO_C_BINDING

			 

			 character(1,c_char),value :: cta, ctb

			 integer(c_int),value :: m,n,k,lda,ldb,ldc

			 real(c_float),value :: alpha,beta

			 real(c_float) :: A(lda,*),B(ldb,*),C(ldc,*)

	   end subroutine c_sgemm

	  END INTERFACE 

! The calculation, excepting initialization and finalization,

! is done with the NVIDIA C routine 'cublasSgemm.'

! A local name c_sgemm is used in Fortran.

! The name c_sgemm could be replaced by NVIDIA's name if one chose. 

	  cta=transa(1:1); ctb=transb(1:1) ! Pass only first character.	 

	  call c_sgemm(cta, ctb,&

	   m, n, k, alpha, A, lda, B, ldb, beta, c, ldc)

	  return	 

	  end

Your code will not work on real hardware, you are passing pointers to data resident in CPU memory to the cublasSGEMM function.

CUBLAS is expecting the data to be resident on the GPU (unless you are using the thunking interface).

The new iso C bindings are powerful but at the same time not very popular and known (somewhere on CUDA Zone there are some examples on how to use them to allocate pinned memory from Fortran), so sometime using the old ( compiler dependent) way to interface Fortran and C may still have some value.

The following code will give you the correct results on GPUs (performance of this code could be improved allocating a buffer on the GPU once to avoid all the multiple cublasAlloc/cublasFree):

subroutine sgemm(transa,transb,m,n,k,alpha,&

		a,lda,b,ldb,beta,c,ldc)

	  

	  use iso_c_binding

	  implicit none

	  

!	 .. Scalar Arguments ..

	  real alpha,beta

	  integer k,lda,ldb,ldc,m,n

	  character(len=*) transa,transb

	  type(c_ptr):: devA, devB,devC

	  integer retVal,retValA,retValB,retValC

!	 .. Array Arguments ..

	  real a(lda,*),b(ldb,*),c(ldc,*)

! Define the INTERFACE to the NVIDIA C code cublasSgemm and to other helper 

! functions (cublasAlloc,cublasFree, cublasSetMatrix,cublasGetMatrix)

	  character(1,c_char) cta, ctb

	  interface

!

!	  void cublasSgemm (char transa, char transb, int m, int n,

!						int k, float alpha, const float *A, int lda,

!				  const float *B, int ldb, float beta, float *C, int ldc)

!

	   subroutine c_sgemm(cta, ctb, m, n, k,&

		 alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasSgemm')

			 use iso_c_binding

			 character(1,c_char),value :: cta, ctb

			 integer(c_int),value :: m,n,k,lda,ldb,ldc

			 real(c_float),value :: alpha,beta

			 type(c_ptr),value :: A,B,C

	   end subroutine c_sgemm

!

!	 cublasStatus CUBLASAPI cublasAlloc (int n, int elemSize, void **devicePtr);

!

	   integer  (c_int) function cublasAlloc(n,size,ptr) bind(C,name='cublasAlloc')

		 use iso_c_binding

		 integer(c_int), value:: n, size

		 integer(c_int) :: retVal

		 type(c_ptr) :: ptr

	   end function cublasAlloc

!

!	 cublasStatus CUBLASAPI cublasFree (const void *devicePtr);

!

	   integer  (c_int) function cublasFree(ptr) bind(C,name='cublasFree')

		 use iso_c_binding

		 integer(c_int) :: retVal

		 type(c_ptr),value :: ptr

	   end function cublasFree

		 

!

!	cublasStatus CUBLASAPI cublasSetMatrix (int rows, int cols, int elemSize,

!											const void *A, int lda, void *B,

!											int ldb);

	   integer  (c_int) function cublasSetMatrix(rows,cols,elemSize,A,lda,devA,lda2) bind(C,name='cublasSetMatrix')

		 use iso_c_binding

		 integer(c_int), value :: rows,cols, elemSize,lda,lda2

		 integer(c_int) :: retVal

		 real(c_float) :: A(lda,*)

		 type(c_ptr), value :: devA

	   end function cublasSetMatrix

!

!cublasStatus CUBLASAPI cublasGetMatrix (int rows, int cols, int elemSize,

!										const void *A, int lda, void *B,

!										int ldb);

!

	   integer  (c_int) function cublasGetMatrix(rows,cols,elemSize,devA,lda,B,ldb) bind(C,name='cublasGetMatrix')

		 use iso_c_binding

		 integer(c_int), value:: rows,cols, elemSize,lda,ldb

		 integer(c_int) :: retVal

		 type(c_ptr), value:: devA

		 real(c_float):: B(ldb,*)

	   end function cublasGetMatrix

	  end interface 

!

! The calculation, including memory initialization and finalization,

	  cta=transa(1:1); ctb=transb(1:1) ! Pass only first character.	 

!

!   Allocate the memory on the GPU

!

	  retValA=cublasAlloc(m*k,4,devA)

	  retValB=cublasAlloc(k*n,4,devB)

	  retValC=cublasAlloc(m*n,4,devC)

	  if ( (retValA .ne. 0) .or. (retValB .ne. 0) .or. (retValC .ne. 0) ) print *,"Error in memory allocation"

!

!   Move the data from CPU memory to GPU memory

!

	  retValA=cublasSetMatrix(m,k,4,A,lda,devA,m)

	  retValB=cublasSetMatrix(k,n,4,B,ldb,devB,k)

	  retValC=cublasSetMatrix(m,n,4,C,ldc,devC,m)

	  if ( (retValA .ne. 0) .or. (retValB .ne. 0) .or. (retValC .ne. 0) ) print *,"Error in memory copy from CPU to GPU"

	  call c_sgemm(cta, ctb, m, n, k, alpha, devA, m, devB, k, beta, devC, m)

!

!   Move the result from GPU memory to CPU memory

!

	  retVal=cublasGetMatrix(m,n,4,devC,m,C,ldc)

	  if ( (retVal .ne. 0) ) print *,"Error in memory copy from GPU to CPU"

!

!   Free the memory on the GPU

!

	  retVal=cublasFree(devA)

	  retVal=cublasFree(devB)

	  retVal=cublasFree(devC)

	  return	 

	  end

Thank you very much. This will save much time for me. (I have a place to test my experimental codes.) The packaging of the allocate/deallocate steps will probably be done with a separate pair of Fortran routines, packaged in a module. This seems like the right approach if other BLAS are to be provided this way.

:)

BTW Bo Einarsson and myself have tested this list of compilers for interoperability of Fortran 2003 and C. A number of exercises have been tried and will be published later. We hope to add this example of using NVIDIA BLAS from Fortran. This list shows that the interoperability part of Fortran 2003 has wide coverage among the vendors.

  1. Intel Fortran compiler 10.1,

  2. IBM XL Fortran Enterprise Edition for AIX, V11.1 (5724-S72) Version 11.01.0000.0001D,

(September 19, 2007),

  1. Sun Fortran 95 8.3 (July 18, 2007),

  2. NAGWare Fortran 95 compiler Release 5.1 (557),

  3. g95 version 0.92 (March 14, 2009),

  4. gfortran version 4.4.0 (February 19, 2009), and

  5. The Portland Group, pgf90 8.0-4 64-bit target on x86-64 Linux.

My experience with interoperability is different. I have a full cudart interface that works just fine on g95 and crashes on ifort. Features like enum are not supported in all the compilers.

The other thing you can add to the interface is a split of the workload between CPU and GPU for functions like S/DGEMM and S/DTRSM.
Just add another interface to your CPU library( I assume Visual Numerics includes an optimized blas library for the host) and then split the work.
Contact me if you want an example ( I did something similar to accelerate Linpack)

We are working on the linpack, now the dtrsm seens a bootleneck to us, how to fully parallel it on GPUs? Can you give us some advices or some examples?

thank you very much. And my email address is golden.zhen@gmail.com.