Using cuSolverDN in FORTRAN code

I am interested in using cuSolverDn to speed up large dense matrix solutions in a legacy FORTRAN code. I’m currently using Intel FORTRAN 15.0.2.179 with calls to the MKL LAPACK routines SGETRF, and SGETRS. Is there a simple FORTRAN example somewhere which shows how to replace those calls with the equivalent cuSolverDN calls and what include files etc. are necessary?

(Win7(64) OS with Visual Studio V10.0 Shell)

Thanks!

There are published examples of Fortran usage for both cublas and cusparse, but I’m not aware of any (yet) for cusolver. The fortran code for cublas and cusparse can be found in the cuda install directory on your machine (just search for cusparse_fortran.c, for example, in the windows file manager).

In general, two approaches I can think of:

  1. Write the functionality in a C/C++ module, calling the cusolver routines as you normally would from that module. Write a “wrapper” function (in the C/C++ module) that you use to interface this to fortran, then just call those wrapper functions from fortran and link the modules together. Effectively this is what has been done with the cusparse and cublas functionality provided in the CUDA toolkit that I mentioned above. This is really not CUDA specific and there should be many examples of how to do this elsewhere.

  2. Create lower level function bindings. You can create your own functionality like this on a function-by-function basis, if you wish, so that a C library function (ie. cusolver function) is callable directly from a fortran module. The ISO_C_BINDING fortran module will probably be of assistance here. Here’s a fairly simple worked example (not cusolver, but from CUDA runtime library):

http://stackoverflow.com/questions/27507169/find-available-graphics-card-memory-using-fortran

I don’t have a C/C++ compiler, but will give #2 a try.

Thanks!

I’ve made a bit of progress but I am stuck… I’ve attached a copy of a developmental Fortran code which also includes my attempt at the relevant interface definitions for the required cusolverDn routines. The code also makes use of some standard cuda inferfaces which I found here:

Note: These are dated,but I thought would be a good example and start…My code makes use of the interfaces defined for cudaMalloc, cudaMemcpy, and for a test, cudaMemGetInfo. I did need to modify the interface to cudaMemGetInfo to define the arguments as c_long_long.

I had been using CULA with an older gtx 470 card. I’ve upgraded the card to a gtx 980 and was looking forward to the performance improvement with all those additional cores ! The attached code appears to work fine down to the cusolverDnSgetrf call, but the call to cusolverDnSgetrf_buffersize returns a size of only 27 so I question if that is correct. I’ve been running in debug mode and checking C function results within the debugger so there isn’t any error checking in the code.

Unfortunately, many legacy scientists (like me) have virtually no background in C programming so a bit of help in sorting out the attached code would be very much appreciated.

Thanks!

!==================================================================
!Interface to cusolverDn C functions
!==================================================================
    
  module cu_solver

  use iso_c_binding
  
  interface
     
     integer (c_int) function cusolverDnCreate(cusolver_Hndl) bind(C,name="cusolverDnCreate")
     
     use iso_c_binding
     implicit none
     
     type(c_ptr) cusolver_Hndl
     
     end function
  end interface  
  
  interface
     
     integer (c_int) function cusolverDnDestroy(cusolver_Hndl) bind(C,name="cusolverDnDestroy")
     
     use iso_c_binding
     implicit none
     
     type(c_ptr) cusolver_Hndl
     
     end function
  end interface 
  
  interface
    integer(c_int) function cusolverDnSgetrf_bufferSize(cusolver_Hndl,m,n,d_A,lda,Lwork) bind(C,name="cusolverDnSgetrf_bufferSize") 
      
      use iso_c_binding
      implicit none
    
      type(c_ptr) cusolver_Hndl
      integer(c_int),value::m
      integer(c_int),value::n
      type(c_ptr)  d_A
      integer(c_int),value::lda
      integer(c_int)::Lwork
    end function
  end interface 
                                
  interface
     
     integer (c_int) function cusolverDnSgetrf(cusolver_Hndl,m,n,d_A,lda,d_WS,d_Ipiv,d_devInfo) bind(C, name="cusolverDnSgetrf")
       use iso_c_binding
       implicit none
       type(c_ptr) cusolver_Hndl
       integer(c_int),value:: m
       integer(c_int),value::n
       type(c_ptr) d_A
       integer(c_int),value::lda
       type(c_ptr) d_WS
       type(c_ptr) d_Ipiv
       type(c_ptr) d_devInfo
       
     end function 
    
  end interface
  
  interface
     
     integer (c_int) function cusolverDnSgetrs(cusolver_Hndl,trans,n,nrhs,d_A,lda,d_Ipiv,d_B,ldb,d_devInfo) bind(C, name="cusolverDnSgetrs")
       use iso_c_binding
       implicit none
       type(c_ptr) cusolver_Hndl
       character(kind=c_char) trans
       integer(c_int), value::n
       integer(c_int), value::nrhs
       type(c_ptr) d_A
       integer(c_int), value::lda    
       type(c_ptr) d_Ipiv
       type(c_ptr) d_B
       integer(c_int), value::ldb
       type(c_ptr) d_devInfo
       
     end function cusolverDnSgetrs
   
  end interface  
  
end module 

!=========================================
program main
!=========================================

! The purpose of this routine is to provide for GPU based solution of 
! large dense systems of equations in legacy FORTRAN Applications.

! This is the development version of the routine which does not yet work ...

  use iso_c_binding
  use cuda_h
  use cuda_runtime_h
  use cu_solver
  
  
  ! ------------- matrix definition & host CPU storage variables 
  
  integer(c_int) m,n,lda,ldb
  
  real, allocatable :: A(:,:),ATest(:,:)
  real, allocatable :: B(:,:)
  real, allocatable :: X(:,:)
  integer, allocatable::Ipiv(:)
  integer, allocatable::Lwork
  
!-------------------- CPU equivalents of device variables
  integer workspace
  integer devInfo
  character(c_char) trans
  integer(c_int) nrhs
  
  integer A_size,B_size,X_size,Ipiv_size,devInfo_size,Lwork_size
  
! -------------------- handle to device
  
  type(c_ptr) :: cusolver_Hndl

  
! -------------------- pointers to device memory
  
  type(c_ptr)::d_A
  type(c_ptr)::d_B
  type(c_ptr)::d_Lwork
  type(c_ptr)::d_WS
  type(c_ptr)::d_Ipiv
  type(c_ptr)::d_devInfo
  
! --------------------- function result variables
  
  integer cusolver_stat
  integer A_mem_stat
  integer B_mem_stat
  integer X_mem_stat
  integer Ipiv_mem_stat
  integer WS_mem_stat
  integer devInfo_mem_stat
  integer Lwork_mem_stat
  
! -------------------- pointers to host CPU memory  
  
  type(c_ptr)::CPU_A_ptr
  type(c_ptr)::CPU_ATest_ptr
  type(c_ptr)::CPU_B_ptr
  type(c_ptr)::CPU_X_ptr
  type(c_ptr)::CPU_Lwork_ptr
    
  target A,ATest,B,X,Ipiv,Lwork
  
  integer*8 free,total
  integer res
  
  ! ================================================
  free = 0
  total = 0
  res = 1
  
  res = cudaMemGetInfo(free,total)
  
  lda = 3
  ldb = 3
  m=lda
  n=lda
  nrhs = 1
  
  allocate(A(lda,n))
  allocate(ATest(lda,n))
  allocate(B(n,1))
  allocate(X(n,1))
  allocate(Ipiv(lda))
  allocate(Lwork)
  
  A_size = SIZEOF(A)
  B_size = SIZEOF(B)
  X_size = SIZEOF(X)
  Ipiv_size = SIZEOF(Ipiv)
  devInfo_size = SIZEOF(devInfo)
  Lwork_size = SIZEOF(Lwork)
  
  !    EXAMPLE FROM CUSOLVER_Library.pdf  
  !  "QR FACTORIZATION DENSE LINEAR SOLVER" 
  
  ! -------------- DEFINE [A] AND [B] --------------------------
  
  !           [A]         [x]  =    [B]
  !    | 1.0  2.0  3.0 | |1.0|    | 6.0|
  !    | 4.0  5.0  6.0 | |1.0| =  |15.0|
  !    | 2.0  1.0  1.0 | |1.0|    | 4.0|
  
  A(1,1) = 1.0;   A(1,2) = 2.0;  A(1,3) = 3.0
  A(2,1) = 4.0;   A(2,2) = 5.0;  A(2,3) = 6.0
  A(3,1) = 2.0;   A(3,2) = 1.0;  A(3,3) = 1.0
  
  B(1,1) = 6.0
  B(2,1) = 15.0
  B(3,1) = 4.0
  
  CPU_A_ptr = C_LOC(A)
  CPU_ATest_ptr = C_LOC(ATest)
  CPU_B_ptr = C_LOC(B)
  CPU_X_ptr = C_LOC(X)
  CPU_Lwork_ptr = C_Loc(Lwork)
  
! Step 1: Create cudense handle ---------------
  
  cusolver_stat = cusolverDnCreate(cusolver_Hndl)  
      
! Step 2: copy A and B to Device
  
  A_mem_stat    = cudaMalloc(d_A,A_size)
  B_mem_stat    = cudaMalloc(d_B,B_size)  
  
! ---------- also allocate space for other device based variables  
  
  Ipiv_mem_stat = cudaMalloc(d_Ipiv,Ipiv_size)
  devInfo_mem_stat = cudaMalloc(d_devInfo,devInfo_size)
  Lwork_mem_stat   = cudaMalloc(d_Lwork,Lwork_size)
  
! ---------- copy A and B to Device
  
  A_mem_stat = cudaMemcpy(d_A,CPU_A_ptr,A_size,cudaMemcpyHostToDevice)
  B_mem_stat = cudaMemcpy(d_B,CPU_B_ptr,B_size,cudaMemcpyHostToDevice)
  
! ---------- check transfer of A matrix by copy back to [ATest]
  
  A_mem_stat = cudaMemcpy(CPU_ATest_ptr,d_A,A_size,cudaMemcpyDeviceToHost)  ! Test A matrix memory copy
  
! Step 3: query working space of Sgetrf (and allocate memory on device)
  
  Lwork = 0  
  cusolver_stat = cusolverDnSgetrf_bufferSize(cusolver_Hndl,m,n,d_A,lda,Lwork) 
  
  Workspace = 4*Lwork
  WS_mem_stat = cudaMalloc(d_WS,Workspace)
  
! Step 4: compute LU factorization of [A] 
! A Memory acess error occurs here...
! Access violation reading location 0xffffffffffffffff.
    
  cusolver_stat = cusolverDnSgetrf(cusolver_Hndl,m,n,d_A,lda,d_WS,d_Ipiv,d_devInfo) 
! Step 5: compute solution vector [X] for Right hand side [B]
  
  trans = 'N'
  cusolver_stat = cusolverDnSgetrs(cusolver_Hndl,trans,n,nrhs,d_A,lda,d_Ipiv,d_B,ldb,d_devInfo)  
  
! Step 6: copy solution vector stored in [B] on device into [X] vector on host
  
  X_mem_stat = cudaMemcpy(CPU_X_ptr,d_B,B_size,cudaMemcpyDeviceToHost)  
  
! step 7: free memory on device and release CPU-side resources
  
  A_mem_Stat    = cudafree(d_A)
  B_mem_Stat    = cudafree(d_B)
  Ipiv_mem_stat = cudafree(d_Ipiv)
  WS_mem_stat   = cudafree(d_WS)
  Lwork_mem_stat = cudafree(d_Lwork)
  
  cusolver_stat = cusolverDnDestroy(cusolver_Hndl)
  
  ! Step 8: deallocate memory on host before exit
  
  deallocate(A)
  deallocate(ATest)
  deallocate(B)
  deallocate(X)
  deallocate(Ipiv)
  deallocate(Lwork)

end program

Fortran_calling_cusolverDn.zip (2.04 KB)

Disclaimer: I am not a Fortran expert. In fact I suck at Fortran. So if I describe something incorrectly, please excuse that. Also anything I have written may have bugs in it.

However I was able to fix a few things in the code you have posted so far. I’d rather not try and list each and every change. If you have specific questions, please ask.

In general the changes I made fall into these categories:

  1. The parameter types in your interfaces were sometimes not correct. There were size errors (1 byte vs. 4 byte, 4 byte vs. 8 byte, etc.) and also a few errors in the usage of pass-by-value vs. pass-by-reference.
  2. I elected to not drag around the FortCUDA stuff. So I removed all that and just included the interfaces for the few functions that you need so far (cudaMalloc, cudaMemcpy, cudaFree, cudaMemGetInfo)
  3. I made various variable type and handling changes in the code to reflect the interface changes I made.
  4. I added explicit error checking.

I don’t think the Lwork size of 27 is wrong. This is an extremely small problem, and so far, the modified code runs without errors. I’m not suggesting it is defect-free, but it should allow you to proceed towards a completely worked demo. If you continue to have trouble, I’ll try to help if I can. If you’re able to complete this demo problem, but think your results are not numerically correct, I can try to help there as well (although I’m pretty sure you are much smarter at linear systems solutions than I am).

Anyway, this modified code runs without errors that I can obviously see (I have not attempted to verify anything numerically):

!==================================================================
!Interface to cusolverDn and CUDA C functions
!==================================================================
    
  module cuda_cusolver

  interface

     ! cudaMalloc
     integer (c_int) function cudaMalloc ( buffer, size ) bind (C, name="cudaMalloc" ) 
       use iso_c_binding
       implicit none
       type (c_ptr)  :: buffer
       integer (c_size_t), value :: size
     end function cudaMalloc

     ! cudaMemcpy 
     integer (c_int) function cudaMemcpy ( dst, src, count, kind ) bind (C, name="cudaMemcpy" )
       ! note: cudaMemcpyHostToDevice = 1
       ! note: cudaMemcpyDeviceToHost = 2
       use iso_c_binding
       type (C_PTR), value :: dst, src
       integer (c_size_t), value :: count, kind
     end function cudaMemcpy

     ! cudaFree
     integer (c_int) function cudaFree(buffer)  bind(C, name="cudaFree")
       use iso_c_binding
       implicit none
       type (C_PTR), value :: buffer
     end function cudaFree

     integer (c_int) function cudaMemGetInfo(fre, tot) bind(C, name="cudaMemGetInfo")
       use iso_c_binding
       implicit none
       type(c_ptr),value :: fre
       type(c_ptr),value :: tot
     end function cudaMemGetInfo

integer(c_int) function cusolverDnCreate(cusolver_Hndl) bind(C,name="cusolverDnCreate")
     
     use iso_c_binding
     implicit none
     
     type(c_ptr)::cusolver_Hndl
     
     end function
     
     integer(c_int) function cusolverDnDestroy(cusolver_Hndl) bind(C,name="cusolverDnDestroy")
     
     use iso_c_binding
     implicit none
     
     type(c_ptr),value::cusolver_Hndl
     
     end function

    integer(c_int) function cusolverDnSgetrf_bufferSize(cusolver_Hndl,m,n,d_A,lda,Lwork) bind(C,name="cusolverDnSgetrf_bufferSize") 
      
      use iso_c_binding
      implicit none
    
      type(c_ptr),value::cusolver_Hndl
      integer(c_int),value::m
      integer(c_int),value::n
      type(c_ptr),value::d_A
      integer(c_int),value::lda
      type(c_ptr),value::Lwork
    end function

     integer(c_int) function cusolverDnSgetrf(cusolver_Hndl,m,n,d_A,lda,d_WS,d_Ipiv,d_devInfo) bind(C, name="cusolverDnSgetrf")

       use iso_c_binding
       implicit none
       type(c_ptr),value::cusolver_Hndl
       integer(c_int),value::m
       integer(c_int),value::n
       type(c_ptr),value::d_A
       integer(c_int),value::lda
       type(c_ptr),value::d_WS
       type(c_ptr),value::d_Ipiv
       type(c_ptr),value::d_devInfo
       
     end function 

     integer (c_int) function cusolverDnSgetrs(cusolver_Hndl,trans,n,nrhs,d_A,lda,d_Ipiv,d_B,ldb,d_devInfo) bind(C, name="cusolverDnSgetrs")

       use iso_c_binding
       implicit none
       type(c_ptr),value::cusolver_Hndl
       integer(c_int), value::trans
       integer(c_int), value::n
       integer(c_int), value::nrhs
       type(c_ptr),value::d_A
       integer(c_int), value::lda    
       type(c_ptr),value::d_Ipiv
       type(c_ptr),value::d_B
       integer(c_int),value::ldb
       type(c_ptr),value::d_devInfo
       
     end function
   
  end interface  
  
end module 

!=========================================
program main
!=========================================

! The purpose of this routine is to provide for GPU based solution of 
! large dense systems of equations in legacy FORTRAN Applications.

! This is the development version of the routine which does not yet work ...

  use iso_c_binding
  use cuda_cusolver

! ------------- matrix definition & host CPU storage variables 
  
  integer(c_int) m,n,lda,ldb
  
  real, allocatable :: A(:,:),ATest(:,:)
  real, allocatable :: B(:,:)
  real, allocatable :: X(:,:)
  integer, allocatable::Ipiv(:)
  integer, target::Lwork

!-------------------- CPU equivalents of device variables
  integer devInfo
  integer(c_int) nrhs
  
  integer X_size
  integer*8 Ipiv_size,devInfo_size,Lwork_size
  integer*8 A_size,B_size,Workspace
  
! -------------------- handle to device
  
  type(c_ptr) :: cusolver_Hndl

! -------------------- pointers to device memory
  
  type(c_ptr)::d_A
  type(c_ptr)::d_B
  type(c_ptr)::d_Lwork
  type(c_ptr)::d_WS
  type(c_ptr)::d_Ipiv
  type(c_ptr)::d_devInfo

! --------------------- function result variables
  
  integer cusolver_stat
  integer A_mem_stat
  integer B_mem_stat
  integer X_mem_stat
  integer Ipiv_mem_stat
  integer WS_mem_stat
  integer devInfo_mem_stat
  integer Lwork_mem_stat
  
! -------------------- pointers to host CPU memory  
  
  type(c_ptr)::CPU_A_ptr
  type(c_ptr)::CPU_ATest_ptr
  type(c_ptr)::CPU_B_ptr
  type(c_ptr)::CPU_X_ptr
  type(c_ptr)::CPU_Lwork_ptr
    
  target A,ATest,B,X,Ipiv
  
  type(c_ptr)::cpfre,cptot
  integer*8,target::free,total
  integer res
  integer*8 cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice
  integer*4 CUBLAS_OP_N, CUBLAS_OP_T
  parameter (cudaMemcpyHostToDevice=1)
  parameter (cudaMemcpyDeviceToHost=2)
  parameter (CUBLAS_OP_N=0)
  parameter (CUBLAS_OP_T=1)
  
  ! ================================================
  free = 0
  total = 0
  res = 1
  cpfre = c_loc(free)
  cptot = c_loc(total)
  
  res = cudaMemGetInfo(cpfre,cptot)
  if (res .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemGetInfo error: ", res
      write (*,*)
      stop
  end if
  write (*, '(A, I12)') "  free mem: ", free
  write (*, '(A, I12)') " total mem: ", total
  
  lda = 3
  ldb = 3
  m=lda
  n=lda
  nrhs = 1
  
  allocate(A(lda,n))
  allocate(ATest(lda,n))
  allocate(B(n,1))
  allocate(X(n,1))
  allocate(Ipiv(lda))
!  allocate(Lwork)
  
  A_size = SIZEOF(A)
  B_size = SIZEOF(B)
  X_size = SIZEOF(X)
  Ipiv_size = SIZEOF(Ipiv)
  devInfo_size = SIZEOF(devInfo)
  Lwork_size = SIZEOF(Lwork)
  
  !    EXAMPLE FROM CUSOLVER_Library.pdf  
  !  "QR FACTORIZATION DENSE LINEAR SOLVER" 
  
  ! -------------- DEFINE [A] AND [B] --------------------------
  
  !           [A]         [x]  =    [B]
  !    | 1.0  2.0  3.0 | |1.0|    | 6.0|
  !    | 4.0  5.0  6.0 | |1.0| =  |15.0|
  !    | 2.0  1.0  1.0 | |1.0|    | 4.0|
  
  A(1,1) = 1.0;   A(1,2) = 2.0;  A(1,3) = 3.0
  A(2,1) = 4.0;   A(2,2) = 5.0;  A(2,3) = 6.0
  A(3,1) = 2.0;   A(3,2) = 1.0;  A(3,3) = 1.0
  
  B(1,1) = 6.0
  B(2,1) = 15.0
  B(3,1) = 4.0
  
  CPU_A_ptr = C_LOC(A)
  CPU_ATest_ptr = C_LOC(ATest)
  CPU_B_ptr = C_LOC(B)
  CPU_X_ptr = C_LOC(X)
  CPU_Lwork_ptr = C_Loc(Lwork)
  
! Step 1: Create cudense handle ---------------
  cusolver_stat = cusolverDnCreate(cusolver_Hndl)  
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cusolverDnCreate error: ", cusolver_stat
      write (*,*)
      stop
  end if
      
! Step 2: copy A and B to Device
  
  A_mem_stat    = cudaMalloc(d_A,A_size)
  if (A_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 1 error: ", A_mem_stat
      write (*,*)
      stop
  end if
  B_mem_stat    = cudaMalloc(d_B,B_size)  
  if (B_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 2 error: ", B_mem_stat
      write (*,*)
      stop
  end if
  
! ---------- also allocate space for other device based variables  
  
  Ipiv_mem_stat = cudaMalloc(d_Ipiv,Ipiv_size)
  if (Ipiv_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 3 error: ", Ipiv_mem_stat
      write (*,*)
      stop
  end if
  devInfo_mem_stat = cudaMalloc(d_devInfo,devInfo_size)
  if (devInfo_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 4 error: ", devInfo_mem_stat
      write (*,*)
      stop
  end if
  Lwork_mem_stat   = cudaMalloc(d_Lwork,Lwork_size)
  if (Lwork_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 5 error: ", Lwork_mem_stat
      write (*,*)
      stop
  end if
  
! ---------- copy A and B to Device
  
  A_mem_stat = cudaMemcpy(d_A,CPU_A_ptr,A_size,cudaMemcpyHostToDevice)
  if (A_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 1 error: ", A_mem_stat
      write (*,*)
      stop
  end if
  B_mem_stat = cudaMemcpy(d_B,CPU_B_ptr,B_size,cudaMemcpyHostToDevice)
  if (B_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 2 error: ", B_mem_stat
      write (*,*)
      stop
  end if
  
! ---------- check transfer of A matrix by copy back to [ATest]
  
  A_mem_stat = cudaMemcpy(CPU_ATest_ptr,d_A,A_size,cudaMemcpyDeviceToHost)  ! Test A matrix memory copy
  if (A_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 3 error: ", A_mem_stat
      write (*,*)
      stop
  end if
  
! Step 3: query working space of Sgetrf (and allocate memory on device)
  
  Lwork = 5
  cusolver_stat =  cusolverDnSgetrf_bufferSize(cusolver_Hndl,m,n,d_A,lda,CPU_Lwork_ptr) 
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " DnSgetrf_bufferSize error: ", cusolver_stat
      write (*,*)
      stop
  end if
 
      write (*,*)
      write (*, '(A, I12)') " Lwork: ", Lwork
      write (*,*)

Workspace = 4*Lwork
  WS_mem_stat = cudaMalloc(d_WS,Workspace)
  if (WS_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMalloc 6 error: ", WS_mem_stat
      write (*,*)
      stop
  end if
  
! Step 4: compute LU factorization of [A] 
    
  cusolver_stat = cusolverDnSgetrf(cusolver_Hndl,m,n,d_A,lda,d_WS,d_Ipiv,d_devInfo) 
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cusolverDnSgetrf error: ", WS_mem_stat
      write (*,*)
      stop
  end if

! Step 5: compute solution vector [X] for Right hand side [B]
  
  cusolver_stat = cusolverDnSgetrs(cusolver_Hndl,CUBLAS_OP_N,n,nrhs,d_A,lda,d_Ipiv,d_B,ldb,d_devInfo)  
  if (cusolver_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cusolverDnSgetrs error: ", WS_mem_stat
      write (*,*)
      stop
  end if
  
! Step 6: copy solution vector stored in [B] on device into [X] vector on host
  
  X_mem_stat = cudaMemcpy(CPU_X_ptr,d_B,B_size,cudaMemcpyDeviceToHost)  
  if (X_mem_stat .ne. 0 ) then
      write (*,*)
      write (*, '(A, I2)') " cudaMemcpy 4 error: ", WS_mem_stat
      write (*,*)
      stop
  end if
  
! step 7: free memory on device and release CPU-side resources
  
  A_mem_Stat    = cudafree(d_A)
  B_mem_Stat    = cudafree(d_B)
  Ipiv_mem_stat = cudafree(d_Ipiv)
  WS_mem_stat   = cudafree(d_WS)
  Lwork_mem_stat = cudafree(d_Lwork)
  
  cusolver_stat = cusolverDnDestroy(cusolver_Hndl)
  
  ! Step 8: deallocate memory on host before exit
  
  deallocate(A)
  deallocate(ATest)
  deallocate(B)
  deallocate(X)
  deallocate(Ipiv)

end program

I don’t have Intel fortran and I was working in a linux environment to try to sort this out, so the compile command I used is:

gfortran -ffree-line-length-0 -O3 -L/usr/local/cuda/lib64 -lcudart -lcusolver -o t1 t1.f90

(I’m not trying to impugn the FortCUDA stuff. If you want to use it, fine. I was able to use it also, but I was a little nervous because I noticed it seemed to be pretty old – mentioned CUDA 2.3!! – and some parameter sizes have changed from 4 bytes to 8 bytes between CUDA 2.3 and now. I didn’t want to try and prove out stuff there, so I elected to drop it. If it’s not too much trouble, while you’re working on this demo, I would suggest avoiding it. Later, when your demo is working, if you want to reintroduce it, then you’ll have a better idea if it introduced any problems. Just my opinion. )

txbob - THANK YOU !! :)

It also seems to work like a champ under Win7(64)/Visual Studio/Intel Fort. I haven’t had a chance to go through and learn what all my mistakes were yet but will take a look later tonight. I was tempted to dump the FortCuda stuff as well, but I thought it might be a start on other interfaces (even if not correct/current). What you’ve provided in combination with the CUDA_Runtime_API.pdf documentation etc. will be a big help in sorting out other inferfaces in the future. I suspect if a complete set of Fortran interfaces were part of the CUDA releases, there are plenty of Fortran programmers/codes out there that would benefit from simple access to the CUDA GPU fuctions.

I’ll be interested in the speed up vs. host CPU’s when I incorporate it into the actual code…

(Out of curiosity I tried a 10K x 10K matrix and the work space is still only 27)

Again, your help is VERY much appreciated !!

I’ve taken the above code and turned it into a subroutine so I can pass in the matrix and the RHS I need solved. I tested the subroutine in a stand alone code on a 10K x 10K matrix where the diagonal elements equal the row #, and the RHS = 1.0, so the solution =1.0/row ie. 1.0, 0.5, 0.3333, 0.25, 0.2… It worked great so I incorporated the subroutine into a real code. The first time I passed it an actual matrix to solve in the real code, it caused a power glitch that crashed my computer. This same matrix solves fine using either the intel MKL LAPACK functions or the IMSL libraries. The condition number of the matrix is 7.26E-03. If I copy the matrix and rhs to a double precision matrix and rhs, and solve using a double precision version of the subroutine it all runs fine and generates the same answer as an MKL solve or IMSL solve. The matrix is about 8500x8500.

Unfortunately by the time I allocate a double precision matrix and rhs, copy the elements, solve and copy back, its not much faster than my CPU using MKL.

Attempting to use the single precision version consistently causes what appears to be a power glitch and crashes the computer. I’m using an Evga GTX 980. Does anyone know why a single precision version of the matrix would cause a hard crash like that (and a double precision version work fine) ?

A GTX 980 has relatively low double-precision throughput.

GTX 980 on a single precision code should be pretty fast. If you have a crash on SP but not DP, it suggests to me a power delivery problem. You may need a beefier power supply in your PC.

The computer has a 1100-1250W power supply, with 4 circuits unfortunately only rated to 18A ea. Instead of pulling power from one circuit, I pulled 1/2 from the hard drive power supply circuit and 1/2 from a fan supply circuit. This is still only rated a 432W, but it seems to be working fine. Thanks again for your help !!