Using cuSolverDN in FORTRAN code

Robert_Crovella · October 24, 2015, 6:26pm

Disclaimer: I am not a Fortran expert. In fact I suck at Fortran. So if I describe something incorrectly, please excuse that. Also anything I have written may have bugs in it.

However I was able to fix a few things in the code you have posted so far. I’d rather not try and list each and every change. If you have specific questions, please ask.

In general the changes I made fall into these categories:

The parameter types in your interfaces were sometimes not correct. There were size errors (1 byte vs. 4 byte, 4 byte vs. 8 byte, etc.) and also a few errors in the usage of pass-by-value vs. pass-by-reference.
I elected to not drag around the FortCUDA stuff. So I removed all that and just included the interfaces for the few functions that you need so far (cudaMalloc, cudaMemcpy, cudaFree, cudaMemGetInfo)
I made various variable type and handling changes in the code to reflect the interface changes I made.
I added explicit error checking.

I don’t think the Lwork size of 27 is wrong. This is an extremely small problem, and so far, the modified code runs without errors. I’m not suggesting it is defect-free, but it should allow you to proceed towards a completely worked demo. If you continue to have trouble, I’ll try to help if I can. If you’re able to complete this demo problem, but think your results are not numerically correct, I can try to help there as well (although I’m pretty sure you are much smarter at linear systems solutions than I am).

Anyway, this modified code runs without errors that I can obviously see (I have not attempted to verify anything numerically):

!==================================================================
!Interface to cusolverDn and CUDA C functions
!==================================================================
    
  module cuda_cusolver

  interface

     ! cudaMalloc
     integer (c_int) function cudaMalloc ( buffer, size ) bind (C, name="cudaMalloc" ) 
       use iso_c_binding
       implicit none
       type (c_ptr)  :: buffer
       integer (c_size_t), value :: size
     end function cudaMalloc

     ! cudaMemcpy 
     integer (c_int) function cudaMemcpy ( dst, src, count, kind ) bind (C, name="cudaMemcpy" )
       ! note: cudaMemcpyHostToDevice = 1
       ! note: cudaMemcpyDeviceToHost = 2
       use iso_c_binding
       type (C_PTR), value :: dst, src
       integer (c_size_t), value :: count, kind
     end function cudaMemcpy

     ! cudaFree
     integer (c_int) function cudaFree(buffer)  bind(C, name="cudaFree")
       use iso_c_binding
       implicit none
       type (C_PTR), value :: buffer
     end function cudaFree

     integer (c_int) function cudaMemGetInfo(fre, tot) bind(C, name="cudaMemGetInfo")
       use iso_c_binding
       implicit none
       type(c_ptr),value :: fre
       type(c_ptr),value :: tot
     end function cudaMemGetInfo

integer(c_int) function cusolverDnCreate(cusolver_Hndl) bind(C,name="cusolverDnCreate")
     
     use iso_c_binding
     implicit none
     
     type(c_ptr)::cusolver_Hndl
     
     end function
     
     integer(c_int) function cusolverDnDestroy(cusolver_Hndl) bind(C,name="cusolverDnDestroy")
     
     use iso_c_binding
     implicit none
     
     type(c_ptr),value::cusolver_Hndl
     
     end function

    integer(c_int) function cusolverDnSgetrf_bufferSize(cusolver_Hndl,m,n,d_A,lda,Lwork) bind(C,name="cusolverDnSgetrf_bufferSize") 
      
      use iso_c_binding
      implicit none
    
      type(c_ptr),value::cusolver_Hndl
      integer(c_int),value::m
      integer(c_int),value::n
      type(c_ptr),value::d_A
      integer(c_int),value::lda
      type(c_ptr),value::Lwork
    end function

     integer(c_int) function cusolverDnSgetrf(cusolver_Hndl,m,n,d_A,lda,d_WS,d_Ipiv,d_devInfo) bind(C, name="cusolverDnSgetrf")

       use iso_c_binding
       implicit none
       type(c_ptr),value::cusolver_Hndl
       integer(c_int),value::m
       integer(c_int),value::n
       type(c_ptr),value::d_A
       integer(c_int),value::lda
       type(c_ptr),value::d_WS
       type(c_ptr),value::d_Ipiv
       type(c_ptr),value::d_devInfo
       
     end function 

     integer (c_int) function cusolverDnSgetrs(cusolver_Hndl,trans,n,nrhs,d_A,lda,d_Ipiv,d_B,ldb,d_devInfo) bind(C, name="cusolverDnSgetrs")

       use iso_c_binding
       implicit none
       type(c_ptr),value::cusolver_Hndl
       integer(c_int), value::trans
       integer(c_int), value::n
       integer(c_int), value::nrhs
       type(c_ptr),value::d_A
       integer(c_int), value::lda    
       type(c_ptr),value::d_Ipiv
       type(c_ptr),value::d_B
       integer(c_int),value::ldb
       type(c_ptr),value::d_devInfo
       
     end function
   
  end interface  
  
end module 

!=========================================
program main
!=========================================

! The purpose of this routine is to provide for GPU based solution of 
! large dense systems of equations in legacy FORTRAN Applications.

! This is the development version of the routine which does not yet work ...

  use iso_c_binding
  use cuda_cusolver

! ------------- matrix definition & host CPU storage variables 
  
  integer(c_int) m,n,lda,ldb
  
  real, allocatable :: A(:,:),ATest(:,:)
  real, allocatable :: B(:,:)
  real, allocatable :: X(:,:)
  integer, allocatable::Ipiv(:)
  integer, target::Lwork

!-------------------- CPU equivalents of device variables
  integer devInfo
  integer(c_int) nrhs
  
  integer X_size
  integer*8 Ipiv_size,devInfo_size,Lwork_size
  integer*8 A_size,B_size,Workspace
  
! -------------------- handle to device
  
  type(c_ptr) :: cusolver_Hndl

! -------------------- pointers to device memory
  
  type(c_ptr)::d_A
  type(c_ptr)::d_B
  type(c_ptr)::d_Lwork
  type(c_ptr)::d_WS
  type(c_ptr)::d_Ipiv
  type(c_ptr)::d_devInfo

! --------------------- function result variables
  
  integer cusolver_stat
  integer A_mem_stat
  integer B_mem_stat
  integer X_mem_stat
  integer Ipiv_mem_stat
  integer WS_mem_stat
  integer devInfo_mem_stat
  integer Lwork_mem_stat
  
! -------------------- pointers to host CPU memory  
  
  type(c_ptr)::CPU_A_ptr
  type(c_ptr)::CPU_ATest_ptr
  type(c_ptr)::CPU_B_ptr
  type(c_ptr)::CPU_X_ptr
  type(c_ptr)::CPU_Lwork_ptr
    
  target A,ATest,B,X,Ipiv
  
  type(c_ptr)::cpfre,cptot
  integer*8,target::free,total
  integer res
  integer*8 cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice
  integer*4 CUBLAS_OP_N, CUBLAS_OP_T
  parameter (cudaMemcpyHostToDevice=1)
  parameter (cudaMemcpyDeviceToHost=2)
  parameter (CUBLAS_OP_N=0)
  parameter (CUBLAS_OP_T=1)
  
  ! ================================================
  free = 0
  total = 0
  res = 1
  cpfre = c_loc(free)
  cptot = c_loc(total)
  
  res = cudaMemGetInfo(cpfre,cptot)
  if (res .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemGetInfo error: ", res
      write (*,*)
      stop
  end if
  write (*, '(A, I12)') "  free mem: ", free
  write (*, '(A, I12)') " total mem: ", total
  
  lda = 3
  ldb = 3
  m=lda
  n=lda
  nrhs = 1
  
  allocate(A(lda,n))
  allocate(ATest(lda,n))
  allocate(B(n,1))
  allocate(X(n,1))
  allocate(Ipiv(lda))
!  allocate(Lwork)
  
  A_size = SIZEOF(A)
  B_size = SIZEOF(B)
  X_size = SIZEOF(X)
  Ipiv_size = SIZEOF(Ipiv)
  devInfo_size = SIZEOF(devInfo)
  Lwork_size = SIZEOF(Lwork)
  
  !    EXAMPLE FROM CUSOLVER_Library.pdf  
  !  "QR FACTORIZATION DENSE LINEAR SOLVER" 
  
  ! -------------- DEFINE [A] AND [B] --------------------------
  
  !           [A]         [x]  =    [B]
  !    | 1.0  2.0  3.0 | |1.0|    | 6.0|
  !    | 4.0  5.0  6.0 | |1.0| =  |15.0|
  !    | 2.0  1.0  1.0 | |1.0|    | 4.0|
  
  A(1,1) = 1.0;   A(1,2) = 2.0;  A(1,3) = 3.0
  A(2,1) = 4.0;   A(2,2) = 5.0;  A(2,3) = 6.0
  A(3,1) = 2.0;   A(3,2) = 1.0;  A(3,3) = 1.0
  
  B(1,1) = 6.0
  B(2,1) = 15.0
  B(3,1) = 4.0
  
  CPU_A_ptr = C_LOC(A)
  CPU_ATest_ptr = C_LOC(ATest)
  CPU_B_ptr = C_LOC(B)
  CPU_X_ptr = C_LOC(X)
  CPU_Lwork_ptr = C_Loc(Lwork)
  
! Step 1: Create cudense handle ---------------
  cusolver_stat = cusolverDnCreate(cusolver_Hndl)  
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cusolverDnCreate error: ", cusolver_stat
      write (*,*)
      stop
  end if
      
! Step 2: copy A and B to Device
  
  A_mem_stat    = cudaMalloc(d_A,A_size)
  if (A_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 1 error: ", A_mem_stat
      write (*,*)
      stop
  end if
  B_mem_stat    = cudaMalloc(d_B,B_size)  
  if (B_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 2 error: ", B_mem_stat
      write (*,*)
      stop
  end if
  
! ---------- also allocate space for other device based variables  
  
  Ipiv_mem_stat = cudaMalloc(d_Ipiv,Ipiv_size)
  if (Ipiv_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 3 error: ", Ipiv_mem_stat
      write (*,*)
      stop
  end if
  devInfo_mem_stat = cudaMalloc(d_devInfo,devInfo_size)
  if (devInfo_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 4 error: ", devInfo_mem_stat
      write (*,*)
      stop
  end if
  Lwork_mem_stat   = cudaMalloc(d_Lwork,Lwork_size)
  if (Lwork_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 5 error: ", Lwork_mem_stat
      write (*,*)
      stop
  end if
  
! ---------- copy A and B to Device
  
  A_mem_stat = cudaMemcpy(d_A,CPU_A_ptr,A_size,cudaMemcpyHostToDevice)
  if (A_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 1 error: ", A_mem_stat
      write (*,*)
      stop
  end if
  B_mem_stat = cudaMemcpy(d_B,CPU_B_ptr,B_size,cudaMemcpyHostToDevice)
  if (B_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 2 error: ", B_mem_stat
      write (*,*)
      stop
  end if
  
! ---------- check transfer of A matrix by copy back to [ATest]
  
  A_mem_stat = cudaMemcpy(CPU_ATest_ptr,d_A,A_size,cudaMemcpyDeviceToHost)  ! Test A matrix memory copy
  if (A_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 3 error: ", A_mem_stat
      write (*,*)
      stop
  end if
  
! Step 3: query working space of Sgetrf (and allocate memory on device)
  
  Lwork = 5
  cusolver_stat =  cusolverDnSgetrf_bufferSize(cusolver_Hndl,m,n,d_A,lda,CPU_Lwork_ptr) 
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " DnSgetrf_bufferSize error: ", cusolver_stat
      write (*,*)
      stop
  end if
 
      write (*,*)
      write (*, '(A, I12)') " Lwork: ", Lwork
      write (*,*)

Workspace = 4*Lwork
  WS_mem_stat = cudaMalloc(d_WS,Workspace)
  if (WS_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 6 error: ", WS_mem_stat
      write (*,*)
      stop
  end if
  
! Step 4: compute LU factorization of [A] 
    
  cusolver_stat = cusolverDnSgetrf(cusolver_Hndl,m,n,d_A,lda,d_WS,d_Ipiv,d_devInfo) 
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cusolverDnSgetrf error: ", WS_mem_stat
      write (*,*)
      stop
  end if

! Step 5: compute solution vector [X] for Right hand side [B]
  
  cusolver_stat = cusolverDnSgetrs(cusolver_Hndl,CUBLAS_OP_N,n,nrhs,d_A,lda,d_Ipiv,d_B,ldb,d_devInfo)  
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cusolverDnSgetrs error: ", WS_mem_stat
      write (*,*)
      stop
  end if
  
! Step 6: copy solution vector stored in [B] on device into [X] vector on host
  
  X_mem_stat = cudaMemcpy(CPU_X_ptr,d_B,B_size,cudaMemcpyDeviceToHost)  
  if (X_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 4 error: ", WS_mem_stat
      write (*,*)
      stop
  end if
  
! step 7: free memory on device and release CPU-side resources
  
  A_mem_Stat    = cudafree(d_A)
  B_mem_Stat    = cudafree(d_B)
  Ipiv_mem_stat = cudafree(d_Ipiv)
  WS_mem_stat   = cudafree(d_WS)
  Lwork_mem_stat = cudafree(d_Lwork)
  
  cusolver_stat = cusolverDnDestroy(cusolver_Hndl)
  
  ! Step 8: deallocate memory on host before exit
  
  deallocate(A)
  deallocate(ATest)
  deallocate(B)
  deallocate(X)
  deallocate(Ipiv)

end program

I don’t have Intel fortran and I was working in a linux environment to try to sort this out, so the compile command I used is:

gfortran -ffree-line-length-0 -O3 -L/usr/local/cuda/lib64 -lcudart -lcusolver -o t1 t1.f90

(I’m not trying to impugn the FortCUDA stuff. If you want to use it, fine. I was able to use it also, but I was a little nervous because I noticed it seemed to be pretty old – mentioned CUDA 2.3!! – and some parameter sizes have changed from 4 bytes to 8 bytes between CUDA 2.3 and now. I didn’t want to try and prove out stuff there, so I elected to drop it. If it’s not too much trouble, while you’re working on this demo, I would suggest avoiding it. Later, when your demo is working, if you want to reintroduce it, then you’ll have a better idea if it introduced any problems. Just my opinion. )