Unable to allocate linked list pointer in device code in Cuda Fortran

Thank you, Mat

Yes, cuSolverSP is what I’m looking for, but in HPC SDK Fortran CUDA Interfaces Documention, it mentions that “Currently we provide Fortran interfaces to the cuSolverDN, the dense LAPACK functions.” That is to say, cuSolverSP is not available in CUDA Fortran currently. So is there any other way to solve the sparse matrix linear equations in CUDA Fortran?

In order to use cuSolverSP, can I call CUDA C cuSolverSP code from CUDA Fortran? Are there any relevant examples (call CUDA C library function from CUDA Fortran)? If it’s okay, here I try to change the previous example and there are some errors, could you please help me see it? Thank you.

-Yu

cuSolverSP.f90 (3.9 KB)

I talked to Brent who maintains the CUDA library Interface modules. Basically, when he did cuSolverDN, cuSolverSP was still in Beta so he didn’t want to add it until it was complete. Though you’re the second person we’ve had wanting to use it, so he’ll look at adding an interface module in the near future.

I’ve updated your example so that it will now compile and link, but the code seg faults during execution. I might need Brent to help me on this one. Though, I’m out of time for the day, but will look again tomorrow. Here’s what I have so far:

      Module Cusolver_common
        use iso_c_binding
        type CusolverSpHandle
           type(c_ptr) :: handle
        end type

      enum, bind(c)
        enumerator :: CUSOLVER_STATUS_SUCCESS                   = 0
        enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED           = 1
        enumerator :: CUSOLVER_STATUS_ALLOC_FAILED              = 2
        enumerator :: CUSOLVER_STATUS_INVALID_VALUE             = 3
        enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH             = 4
        enumerator :: CUSOLVER_STATUS_MAPPING_ERROR             = 5
        enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED          = 6
        enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR            = 7
        enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED = 8
        enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED             = 9
        enumerator :: CUSOLVER_STATUS_ZERO_PIVOT                = 10
        enumerator :: CUSOLVER_STATUS_INVALID_LICENSE           = 11
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED= 12
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID        = 13
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_PREC   = 14
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_REFINE = 15
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_MAXITER= 16
        enumerator :: CUSOLVER_STATUS_IRS_INTERNAL_ERROR        = 20
        enumerator :: CUSOLVER_STATUS_IRS_NOT_SUPPORTED         = 21
        enumerator :: CUSOLVER_STATUS_IRS_OUT_OF_RANGE          = 22
        enumerator :: CUSOLVER_STATUS_IRS_NRHS_NOT_SUPPORTED_FOR_REFINE_GMRES=23
        enumerator :: CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED = 25
        enumerator :: CUSOLVER_STATUS_IRS_INFOS_NOT_DESTROYED   = 26
        enumerator :: CUSOLVER_STATUS_IRS_MATRIX_SINGULAR       = 30
        enumerator :: CUSOLVER_STATUS_INVALID_WORKSPACE         = 31
      end enum



      end module Cusolver_common


      Module Cusolver_m

      Use Cudafor
      Use Cusparse

      Interface CusolverSpDcsrsv
         integer(c_int) function CusolverSpDcsrsv(Handle, N, NNZA, DescrA, CsrValA, CsrRowPtrA, CsrColIndA, B, TOL,   &
                                     Reorder, X, Singularity) Bind(C, Name='cusolverSpDcsrlsvluHost')
         Use Iso_C_Binding
         Use Cusparse
         use Cusolver_common
         Type(CusolverSpHandle) :: Handle
         Integer(C_Int), value  :: N, NNZA, Reorder
         integer(C_int) :: Singularity
         Type(CusparseMatDescr) :: DescrA
         Real(C_Double), Device   :: CsrValA(NNZA), B(N), X(N)
         Integer(C_Int), Device   :: CsrRowPtrA(N+1), CsrColIndA(NNZA)
         Real(C_Double), value    :: TOL
         End function
      End Interface

      Interface CusolverSpCreate
         integer(c_int) function CusolverSpCreate(Handle) Bind(C, Name='cusolverSpCreate')
         Use Iso_C_Binding
         use Cusolver_common
         Type(CusolverSpHandle) :: Handle
         End function
      End Interface

      Interface CusolverSpDestroy
         integer(c_int) function CusolverSpDestroy(Handle) Bind(C, Name='cusolverSpDestroy')
         Use Iso_C_Binding
         use Cusolver_common
         Type(CusolverSpHandle) :: Handle
         End function
      End Interface

      End Module


          Program Sparse_Solver_GPU

      Use Cudafor
      Use Cusparse
      use Cusolver_common
      Use Cusolver_m

          Implicit None


      Type(CusolverSpHandle)          :: Handle
      Type(CusparseMatDescr)          :: DescrA

      Integer                         :: Istat,Reorder,Singularity
      Double Precision                :: TOL

      Integer, Device                 :: CSR_Row_Index_GPU(5)
      Integer, Device                 :: CSR_Matrix_Column_GPU(9)
      Double Precision, Device        :: CSR_Matrix_Value_GPU(9)
      Double Precision, Device        :: Right_Vector_GPU(4)
      Double Precision, Device        :: Computed_Value_GPU(4)

      Integer                         :: CSR_Row_Index(5)
      Integer                         :: CSR_Matrix_Column(9)
      Double Precision                :: CSR_Matrix_Value(9)
      Double Precision                :: Right_Vector(4)
          Double Precision                :: Computed_Value(4)
      Double Precision                :: Alpha


      Print *,'Start LU Decomposing With Cusolver In GPU Platform'

      Alpha=1.0D0
      CSR_Row_Index    =(/1,4,5,8,10/)
      CSR_Matrix_Column=(/1,3,4,2,1,3,4,2,4/)
      CSR_Matrix_Value =(/1.0D0,2.0D0,3.0D0,4.0D0,5.0D0,6.0D0,7.0D0,8.0D0,9.0D0/)
      Right_Vector     =(/19.0D0,8.0D0,51.0D0,52.0D0/)
      Computed_Value   =0.0D0

      CSR_Row_Index_GPU    =CSR_Row_Index
      CSR_Matrix_Column_GPU=CSR_Matrix_Column
      CSR_Matrix_Value_GPU =CSR_Matrix_Value
      Right_Vector_GPU     =Right_Vector
      Computed_Value_GPU   =Computed_Value


      Istat=CusolverSpCreate(Handle)
      If (Istat /= CUSOLVER_STATUS_SUCCESS) Print *, 'CusolverSpCreate Error: ', Istat
      Istat=CusparseCreateMatDescr(DescrA)
      Istat=CusparseSetMatType(DescrA,CUSPARSE_MATRIX_TYPE_GENERAL)
      Istat=CusparseSetMatIndexBase(DescrA,CUSPARSE_INDEX_BASE_ONE)

!.... ==================================================================================================================================

      Istat=CusolverSpDcsrsv(Handle, 4, 9, DescrA, CSR_Matrix_Value_GPU, CSR_Row_Index_GPU, CSR_Matrix_Column_GPU,   &
                             Right_Vector_GPU, TOL, Reorder, Computed_Value_GPU, Singularity)

!.... ==================================================================================================================================


      Computed_Value=Computed_Value_GPU                                                                    ! U▒▒GPU▒▒▒▒CPU

      Print *,Computed_Value

!.... Destroy and Deallocate
      Istat=CusolverSpDestroy(Handle)



      Stop
      End Program

Thank you, Mat and Brent.

So if I want to solve sparse matrix linear equations currently, the only way is to call CUDA C, is that so? Compared with calling Fortran library functions directly, I’m worried about whether calling CUDA C API will affect the efficiency of solution. I have also tried the example you updated above, and the error is same as yours. I’m very sorry for taking up your valuable time many times. Thank you again.

-Yu

No, not necessarily. It just means we don’t have a convenient Fortran interface module for you to use so you’d need to write it yourself. Though it’s not difficult to interoperate with CUDA C, so if you wanted to go this route, you can. It’s no different than Fortran calling any other C program, except you may need to use “C_DEVPTR” instead of “C_PTR” when passing device pointers.

I have meetings this morning, but will try to get back to getting your program working this afternoon.

-Mat

Thank you, Mat.

I also try to get my program working in different ways but with no progress. Thank you for helping me see it, looking forward to your good news.

-Yu

Ok, I got it working.

  1. cusolverSPDcsrlsvlu is a host only routine (it will then move the data to the device), so I needed to remove “device”
  2. Set the handle create/destroy to use a c_ptr
  3. Correct use of “value” attribute
  4. Use assumed-size arrays so only the pointer is passed in
% cat cuSolverSP.cuf
      Module Cusolver_common
        use iso_c_binding

      enum, bind(c)
        enumerator :: CUSOLVER_STATUS_SUCCESS                   = 0
        enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED           = 1
        enumerator :: CUSOLVER_STATUS_ALLOC_FAILED              = 2
        enumerator :: CUSOLVER_STATUS_INVALID_VALUE             = 3
        enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH             = 4
        enumerator :: CUSOLVER_STATUS_MAPPING_ERROR             = 5
        enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED          = 6
        enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR            = 7
        enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED = 8
        enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED             = 9
        enumerator :: CUSOLVER_STATUS_ZERO_PIVOT                = 10
        enumerator :: CUSOLVER_STATUS_INVALID_LICENSE           = 11
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED= 12
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID        = 13
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_PREC   = 14
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_REFINE = 15
        enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_MAXITER= 16
        enumerator :: CUSOLVER_STATUS_IRS_INTERNAL_ERROR        = 20
        enumerator :: CUSOLVER_STATUS_IRS_NOT_SUPPORTED         = 21
        enumerator :: CUSOLVER_STATUS_IRS_OUT_OF_RANGE          = 22
        enumerator :: CUSOLVER_STATUS_IRS_NRHS_NOT_SUPPORTED_FOR_REFINE_GMRES=23
        enumerator :: CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED = 25
        enumerator :: CUSOLVER_STATUS_IRS_INFOS_NOT_DESTROYED   = 26
        enumerator :: CUSOLVER_STATUS_IRS_MATRIX_SINGULAR       = 30
        enumerator :: CUSOLVER_STATUS_INVALID_WORKSPACE         = 31
      end enum



      end module Cusolver_common


      Module Cusolver_m

      Use Cudafor
      Use Cusparse

      Interface

         integer(c_int) function CusolverSpDcsrsv(Handle, N, NNZA, DescrA, CsrValA, CsrRowPtrA, CsrColIndA, B, TOL,   &
                                     Reorder, X, Singularity) Bind(C, Name='cusolverSpDcsrlsvluHost')
         Use Iso_C_Binding
         Use Cusparse
         use Cusolver_common
         Type(c_ptr), value :: Handle
         Integer(C_Int), value  :: N, NNZA, Reorder
         integer(C_int) :: Singularity
         Type(CusparseMatDescr), value :: DescrA
         Real(C_Double)   :: CsrValA(*), B(*), X(*)
         Integer(C_Int)   :: CsrRowPtrA(*), CsrColIndA(*)
         Real(C_Double), value    :: TOL
         End function

         integer(c_int) function CusolverSpCreate(Handle) Bind(C, Name='cusolverSpCreate')
         Use Iso_C_Binding
         use Cusolver_common
         type(c_ptr) :: Handle
         End function

         integer(c_int) function CusolverSpDestroy(Handle) Bind(C, Name='cusolverSpDestroy')
         Use Iso_C_Binding
         use Cusolver_common
         type(c_ptr), value :: Handle
         End function
      End Interface

      End Module


          Program Sparse_Solver_GPU

      Use Cudafor
      Use Cusparse
      use Cusolver_common
      Use Cusolver_m

          Implicit None


      Type(c_ptr)          :: Handle
      Type(CusparseMatDescr)          :: DescrA

      Integer                         :: Istat,Reorder,Singularity
      Double Precision                :: TOL
      Integer                         :: CSR_Row_Index(5)
      Integer                         :: CSR_Matrix_Column(9)
      Double Precision                :: CSR_Matrix_Value(9)
      Double Precision                :: Right_Vector(4)
      Double Precision                :: Computed_Value(4)
      Double Precision                :: Alpha


      Print *,'Start LU Decomposing With Cusolver In GPU Platform'

      Alpha=1.0D0
      CSR_Row_Index    =(/1,4,5,8,10/)
      CSR_Matrix_Column=(/1,3,4,2,1,3,4,2,4/)
      CSR_Matrix_Value =(/1.0D0,2.0D0,3.0D0,4.0D0,5.0D0,6.0D0,7.0D0,8.0D0,9.0D0/)
      Right_Vector     =(/19.0D0,8.0D0,51.0D0,52.0D0/)
      Computed_Value   =0.0D0
      Singularity = 0
      TOL = 0.000001
      Reorder = 0

      Istat=CusolverSpCreate(Handle)
      If (Istat /= CUSOLVER_STATUS_SUCCESS) Print *, 'CusolverSpCreate Error: ', Istat
      Istat=CusparseCreateMatDescr(DescrA)
      If (Istat /= CUSOLVER_STATUS_SUCCESS) Print *, 'CusparseCreateMatDescr Error: ', Istat
      Istat=CusparseSetMatType(DescrA,CUSPARSE_MATRIX_TYPE_GENERAL)
      If (Istat /= CUSOLVER_STATUS_SUCCESS) Print *, 'CusparseSetMatType', Istat
      Istat=CusparseSetMatIndexBase(DescrA,CUSPARSE_INDEX_BASE_ONE)
      If (Istat /= CUSOLVER_STATUS_SUCCESS) Print *, 'CusparseSetMatIndexBase: ', Istat

!.... ==================================================================================================================================

      Istat=CusolverSpDcsrsv(Handle, 4, 9, DescrA, CSR_Matrix_Value, CSR_Row_Index, CSR_Matrix_Column,   &
                             Right_Vector, TOL, Reorder, Computed_Value, Singularity)

!.... ==================================================================================================================================

      Print *, 'CusolverSpDcsrsv exit status: ', Istat
      Istat=CusparseCreateMatDescr(DescrA)
      Print *,Computed_Value

!.... Destroy and Deallocate
      Istat=CusolverSpDestroy(Handle)

      Stop
      End Program
% nvfortran -cuda -cudalib=cusolver,cusparse cuSolverSP.cuf; a.out
 Start LU Decomposing With Cusolver In GPU Platform
 CusolverSpDcsrsv exit status:             0
    1.000000000000000         2.000000000000000         3.000000000000000
    4.000000000000000
Warning: ieee_inexact is signaling
FORTRAN STOP
2 Likes

Thank you very much, Mat

According to your very useful answer, I also get my example working, and I use cusolverSp in my own finite element method program successfully. Thank you again and best wishes for you.

-Yu

Hi, Mat

I thought the problem was solved until I tried to solve some large arrays and found some problems. I found cusolverSPDcsrlsvlu could move the data to the device but the GPU occupancy was only a few percent when solving the equations, not calling most GPU cores. Especially in solving some large arrays, the program would be stuck in cusolverSPDcsrlsvlu for a long time, maybe a few minutes, maybe a dozen minutes, the GPU occupancy was less than five percent and GPU memory was less than 500MB (my device is RTX2080 with 8G memory), no matter how large the array was, although my arrays may be larger than 500MB. So this is not normal. Are there any settings or memory management that need to be changed when solving a large system of equations using this function?

The other problem encountered at the same time was the linked list which I used:

In small cases, the linked list worked very well and the result was correct, so I think the linked list and other codes should be correct, while some random nodes in the linked list suddenly reported an error when the cases were large. Large cases mean large amounts of data and large arrays. According to the size of the case, the error information mainly includes the following categories:

(1) FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
(2) 0: ALLOCATE: 256000 bytes requested; status = 700(an illegal memory access was encountered)

It looks like there is not enough memory in GPU, however, some cases that are slightly larger than small cases will also report these errors. Meanwhile, It’s the same problem when I change another device. So is it similar to the above problem that some environment variables or settings or memory management need to be changed?

Thank you very much.

-Yu

Using cuSolver is out of my area of expertise, so I’d recommend asking this question on the cuSolver Forum: Topics tagged cusolver

the linked list suddenly reported an error when the cases were large. Large cases mean large amounts of data and large arrays.

How large is “large”? Over 2GB? If so, you may need to add either the “-mcmodel=medium” or “-Mlarge_arrays” compiler flag which will allow for 64-bit offsets which is needed when objects are larger than 2GB.

-Mat

Thank you, Mat

Actually, I have added the “-Mlarge_arrays” compiler flag in compiling process. I estimate that the cases that need GPU memory below 1GB can be successfully calculated. The cases that need GPU memory larger than 1GB would be wrong.

-Yu

When I see illegal memory address errors for larger datasets, that work with smaller datasets, it’s most often either the 64-bit offset issue, a heap overflow, or a stack overflow. Given the arrays are only 1GB, and you do device side allocation, let’s check the heap.

Try setting the environment flag “NV_ACC_CUDA_HEAPSIZE” to a large value based on what you think the memory usage will be plus some extra. The max heap size would be dependent on the total memory of the device.

If the code uses fixed size arrays as local variables in device subroutines, then it might be a stack overflow. In this case use “NV_ACC_CUDA_STACKSIZE”. However, the max stack size is limited so you can get a kernel launch error if it’s too big. The actual max is dependent on the device in use.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.