Problems in the use of cusparseSpGEMM in CUDA Fortran

I am trying to solve a problem that requires a sparse matrix-sparse matrix product in CUDA Fortran code.
I am trying to use cusparseSpGEMM from the cuSPARSE library, following the sample code on GitHub (https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSE/spgemm/spgemm_example.c), but I have run into a problem.

The first problem is that the first call to cusparseSpGEMM_workEstimation returns status 7 (CUSPARSE_STATUS_INTERNAL_ERROR).

Here is my code.
The computational environment is an A100 80GB with CUDA 11.0.
I would appreciate it if you could point out any problems.
Thanks.

=========================================================================

program SpGEMM

use cudafor
use cusparse

Implicit none

  !!Define Matrix----------------------------------
  Integer,parameter :: A_rows=4
  Integer,parameter :: A_cols=4
  Integer,parameter :: A_nnz=9
  Integer           :: Arow(A_rows+1)
  Integer           :: Acol(A_nnz)
  Real(8)           :: Aval(A_nnz)
  Integer,device    :: Arow_d(A_rows+1)
  Integer,device    :: Acol_d(A_nnz)
  Real(8),device    :: Aval_d(A_nnz)

  Integer,parameter :: B_rows=4
  Integer,parameter :: B_cols=4
  Integer,parameter :: B_nnz=8
  Integer           :: Brow(B_rows+1)
  Integer           :: Bcol(B_nnz)
  Real(8)           :: Bval(B_nnz)
  Integer,device    :: Brow_d(B_rows+1)
  Integer,device    :: Bcol_d(B_nnz)
  Real(8),device    :: Bval_d(B_nnz)

  Integer,allocatable :: Crow(:)
  Integer,allocatable :: Ccol(:)
  Integer,allocatable,device  :: Crow_d(:)
  Integer,allocatable,device  :: Ccol_d(:)
  !!Define Matrix----------------------------------

  Real(8) :: alpha=1d0,beta=0d0

  Integer :: status
  type(cusparseHandle) :: handle
  type(cusparseSpMatDescr) :: matA,matB,matC
  type(cusparseSpGEMMDescr) :: SpGEMMDesc

  Integer(8) :: bufferSize1
!  Integer(1),pointer,device :: buffer1(:)
  Integer(1),device,allocatable :: buffer1(:)

  !!Define Matrix----------------------------------
  Arow=(/1,4,5,8,10/)
  Acol=(/1,3,4,2,1,3,4,2,4/)
  Aval=(/1d0,2d0,3d0,4d0,5d0,6d0,7d0,8d0,9d0/)

  Brow=(/1,3,5,8,9/)
  Bcol=(/1,4,2,4,1,2,3,2/)
  Bval=(/1d0,2d0,3d0,4d0,5d0,6d0,7d0,8d0/)

  status=cudaDeviceSynchronize
  Arow_d=Arow
  Acol_d=Acol
  Aval_d=Aval
  Brow_d=Brow
  Bcol_d=Bcol
  Bval_d=Bval
  status=cudaDeviceSynchronize
  !!Define Matrix----------------------------------

  allocate(Crow_d(A_rows+1))

  ! initialize cuSPARSE and the matrix descriptors
  status=cusparseCreate(handle)
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreate error: ', status

  status=cusparseCreateCsr(matA,A_rows,A_cols,A_nnz, &
                           ARow_d,ACol_d,Aval_d,  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cusparseCreateCsr(matB,B_rows,B_cols,B_nnz, &
                           BRow_d,BCol_d,Bval_d,  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cusparseCreateCsr(matC,A_rows,B_cols,0, &
                           null(),null(),null(),  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cudaDeviceSynchronize

  !!----------------------------------------------------------------------------------------------------

  !!SpGEMM computation
  status=cusparseSpGEMM_createDescr(SpGEMMDesc)
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_CreateDescr error: ', status

  !! ask bufferSize1 bytes for external memory
  status=cusparseSpGEMM_workEstimation(handle,&
                                       CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                                       alpha,matA,matB,beta,matC,&
                                       CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                                       SpGEMMDesc,bufferSize1,null())
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status

  if(allocated(buffer1)) deallocate(buffer1)
  if(bufferSize1 /= 0) allocate(buffer1(bufferSize1))

  !! inspect A and B to understand the memory requirement for the next step
  status=cusparseSpGEMM_workEstimation(handle,&
                                       CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                                       alpha,matA,matB,beta,matC,&
                                       CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                                       SpGEMMDesc,bufferSize1,buffer1)
if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status

  deallocate(buffer1)

  status=cusparseSpGEMM_destroyDescr(SpGEMMDesc)
  status=cusparseDestroySpMat(matA)
  status=cusparseDestroySpMat(matB)
  status=cusparseDestroySpMat(matC)
  status=cusparseDestroy(handle)

return
end program SpGEMM

Hi @H-POTATO. You’re using quite an old version. I’d suggest moving to a newer (ideally the latest) version, as many issues in old versions have been fixed.

Thank you, @qanhpham!
By “version”, do you mean the CUDA version?
Is my code otherwise OK as written?

Hi, @qanhpham !

I tried again with CUDA 12.0.
The result is the same: the status returned by the first cusparseSpGEMM_workEstimation call is still 7, and bufferSize1 comes back as a huge value that cannot be allocated, so the program terminates abnormally.

What should I do?
Thanks.

Hi @H-POTATO,

I checked your program using our C code sample and it worked well. Since the C code runs, maybe the issue comes from the Fortran side?
You said the error happened in the first cusparseSpGEMM_workEstimation call, but that shouldn’t be the case. Can you check it again?

Thank you for the reply, @qanhpham.

The reason it didn’t work was that I had declared the buffer as an allocatable device array instead of a device pointer.
I solved the problem by passing buffer1, nullified beforehand, instead of passing null().

With the code shown below, the sample now computes correctly in Fortran!

However, if I try a larger matrix product
((4394x58621) with 58621 nonzeros times (58621x4394) with 691216 nonzeros),
I get
cusparseSpGEMM_workEstimation error: 11
and the result has C_nnz=0.
Is this error “CUSPARSE_STATUS_INSUFFICIENT_RESOURCES”?
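
For reference, the numeric values printed by these checks follow the cusparseStatus_t enumeration in cusparse.h, where 7 is CUSPARSE_STATUS_INTERNAL_ERROR and 11 is CUSPARSE_STATUS_INSUFFICIENT_RESOURCES. A minimal sketch of a lookup helper in plain Fortran follows. The module name cusparse_status_names and the function status_name are hypothetical (they are not part of the cusparse module), and the table assumes the enumerator values of the CUDA 11/12 headers.

```fortran
module cusparse_status_names
  implicit none
contains
  ! Map a numeric cusparseStatus_t value to its enumerator name.
  ! Hypothetical helper; values assumed from cusparse.h (CUDA 11/12).
  function status_name(code) result(name)
    integer, intent(in) :: code
    character(len=:), allocatable :: name
    select case (code)
    case (0);  name = 'CUSPARSE_STATUS_SUCCESS'
    case (1);  name = 'CUSPARSE_STATUS_NOT_INITIALIZED'
    case (2);  name = 'CUSPARSE_STATUS_ALLOC_FAILED'
    case (3);  name = 'CUSPARSE_STATUS_INVALID_VALUE'
    case (4);  name = 'CUSPARSE_STATUS_ARCH_MISMATCH'
    case (5);  name = 'CUSPARSE_STATUS_MAPPING_ERROR'
    case (6);  name = 'CUSPARSE_STATUS_EXECUTION_FAILED'
    case (7);  name = 'CUSPARSE_STATUS_INTERNAL_ERROR'
    case (8);  name = 'CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED'
    case (9);  name = 'CUSPARSE_STATUS_ZERO_PIVOT'
    case (10); name = 'CUSPARSE_STATUS_NOT_SUPPORTED'
    case (11); name = 'CUSPARSE_STATUS_INSUFFICIENT_RESOURCES'
    case default
      name = 'unknown cuSPARSE status'
    end select
  end function status_name
end module cusparse_status_names
```

With this module in scope, an error check could print status_name(status) next to the raw number, which makes a log line such as "error: 11" easier to read.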

After several attempts, the calculation goes through without any errors.

Also, if I have a larger calculation (a 500,000 x 500,000 matrix, for example), can I still use CUSPARSE_SPGEMM_ALG1?

Or should I try CUSPARSE_SPGEMM_ALG2 or CUSPARSE_SPGEMM_ALG3?

Thanks.

program SpGEMM

use cudafor
use cusparse

Implicit none

  !!Define Matrix----------------------------------
  Integer,parameter :: A_rows=4
  Integer,parameter :: A_cols=4
  Integer,parameter :: A_nnz=9
  Integer           :: Arow(A_rows+1)
  Integer           :: Acol(A_nnz)
  Real(8)           :: Aval(A_nnz)
  Integer,device    :: Arow_d(A_rows+1)
  Integer,device    :: Acol_d(A_nnz)
  Real(8),device    :: Aval_d(A_nnz)

  Integer,parameter :: B_rows=4
  Integer,parameter :: B_cols=4
  Integer,parameter :: B_nnz=8
  Integer           :: Brow(B_rows+1)
  Integer           :: Bcol(B_nnz)
  Real(8)           :: Bval(B_nnz)
  Integer,device    :: Brow_d(B_rows+1)
  Integer,device    :: Bcol_d(B_nnz)
  Real(8),device    :: Bval_d(B_nnz)

  Integer   :: C_rows
  Integer   :: C_cols
  Integer   :: C_nnz
  Integer,allocatable :: Crow(:)
  Integer,allocatable :: Ccol(:)
  Real(8),allocatable :: Cval(:)
  Integer,allocatable,device  :: Crow_d(:)
  Integer,allocatable,device  :: Ccol_d(:)
  Real(8),allocatable,device  :: Cval_d(:)

  Integer,parameter   :: C_rows_true=4
  Integer,parameter   :: C_cols_true=4
  Integer,parameter   :: C_nnz_true=12
  Integer :: Crow_true(C_rows_true+1)
  Integer :: Ccol_true(C_nnz_true)
  Real(8) :: Cval_true(C_nnz_true)
  !!Define Matrix----------------------------------

  Real(8) :: alpha=1d0,beta=0d0

  Integer(8)   :: C_rows_dbl
  Integer(8)   :: C_cols_dbl
  Integer(8)   :: C_nnz_dbl

  Integer :: istat,status
  type(cusparseHandle) :: handle
  type(cusparseSpMatDescr) :: matA,matB,matC
  type(cusparseSpGEMMDescr) :: SpGEMMDesc

  Integer(8) :: bufferSize1
  Integer(1),pointer,device :: buffer1(:)
  !!Integer(1),device,allocatable :: buffer1(:)

  Integer(8) :: bufferSize2
  Integer(1),pointer,device :: buffer2(:)
  !!Integer(1),device,allocatable :: buffer2(:)

  !!Define Matrix----------------------------------
  Arow=(/1,4,5,8,10/)
  Acol=(/1,3,4,2,1,3,4,2,4/)
  Aval=(/1d0,2d0,3d0,4d0,5d0,6d0,7d0,8d0,9d0/)

  Brow=(/1,3,5,8,9/)
  Bcol=(/1,4,2,4,1,2,3,2/)
  Bval=(/1d0,2d0,3d0,4d0,5d0,6d0,7d0,8d0/)

  istat=cudaDeviceSynchronize
  Arow_d=Arow
  Acol_d=Acol
  Aval_d=Aval
  Brow_d=Brow
  Bcol_d=Bcol
  Bval_d=Bval
  istat=cudaDeviceSynchronize
  !!Define Matrix----------------------------------

  allocate(Crow_d(A_rows+1))


  ! initialize cuSPARSE and the matrix descriptors
  status=cusparseCreate(handle)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreate error: ', status

  status=cusparseCreateCsr(matA,A_rows,A_cols,A_nnz, &
                           ARow_d,ACol_d,Aval_d,  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cusparseCreateCsr(matB,B_rows,B_cols,B_nnz, &
                          BRow_d,BCol_d,Bval_d,  &
                          CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                          CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cusparseCreateCsr(matC,A_rows,B_cols,0, &
                           null(),null(),null(),  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cudaDeviceSynchronize

  !!----------------------------------------------------------------------------------------------------

  !!SpGEMM computation
  status=cusparseSpGEMM_createDescr(SpGEMMDesc)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_CreateDescr error: ', status


  !! ask bufferSize1 bytes for external memory
  nullify(buffer1)
  status=cusparseSpGEMM_workEstimation(handle,&
                                      CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                                      alpha,matA,matB,beta,matC,&
                                      CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                                      SpGEMMDesc,bufferSize1,buffer1)
                                      !!SpGEMMDesc,bufferSize1,null())
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status

  istat=cudaDeviceSynchronize
  print *, "bufferSize1=",bufferSize1

  !if(allocated(buffer1)) deallocate(buffer1)
  if(bufferSize1 /= 0) allocate(buffer1(bufferSize1))

  !! inspect the A and B to understand the memory requirement for the next step
  status=cusparseSpGEMM_workEstimation(handle,&
                                      CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                                      alpha,matA,matB,beta,matC,&
                                      CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                                      SpGEMMDesc,bufferSize1,buffer1)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status


  !! ask bufferSize2 bytes for external memory
  nullify(buffer2)
  status=cusparseSpGEMM_compute(handle,&
                               CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                               alpha,matA,matB,beta,matC,&
                               CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                               SpGEMMDesc,bufferSize2,buffer2)
                               !!SpGEMMDesc,bufferSize2,null())
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_compute (buffer size query) error: ', status

  istat=cudaDeviceSynchronize
  print *, "bufferSize2=",bufferSize2
  print *, istat


  !! buffer2 is a pointer, so test it with associated() rather than allocated()
  if(associated(buffer2)) deallocate(buffer2)
  if(bufferSize2 /= 0) allocate(buffer2(bufferSize2))
  !! compute the intermediate product of A * B
  status=cusparseSpGEMM_compute(handle,&
                               CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                               alpha,matA,matB,beta,matC,&
                               CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                               SpGEMMDesc,bufferSize2,buffer2)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_compute error: ', status

  !! get matrix C nnz
  status=cusparseSpMatGetSize(matC,C_rows_dbl,C_cols_dbl,C_nnz_dbl)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpMatGetSize error: ', status

  istat=cudaDeviceSynchronize
  C_rows=C_rows_dbl
  C_cols=C_cols_dbl
  C_nnz=C_nnz_dbl
  istat=cudaDeviceSynchronize

  write(*,*) "A_rows",A_rows,"A_cols",A_cols,"A_nnz",A_nnz
  write(*,*) "B_rows",B_rows,"B_cols",B_cols,"B_nnz",B_nnz
  write(*,*) "C_rows",C_rows,"C_cols",C_cols,"C_nnz",C_nnz


  !! allocate matrix C
  if(allocated(Ccol_d)) deallocate(Ccol_d)
  if(allocated(Cval_d)) deallocate(Cval_d)
  allocate(Ccol_d(C_nnz))
  allocate(Cval_d(C_nnz))
  istat=cudaDeviceSynchronize

  !! update matC with the new pointers
  status=cusparseCsrSetPointers(matC,Crow_d,Ccol_d,Cval_d)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCsrSetPointers error: ', status

  !! copy the final products to the matrix C
  status=cusparseSpGEMM_copy(handle,&
                             CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                             alpha,matA,matB,beta,matC,&
                             CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,SpGEMMDesc)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_copy error: ', status


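  !! the buffers are only allocated when the reported size is non-zero
  if(associated(buffer1)) deallocate(buffer1)
  if(associated(buffer2)) deallocate(buffer2)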

  status=cusparseSpGEMM_destroyDescr(SpGEMMDesc)
  status=cusparseDestroySpMat(matA)
  status=cusparseDestroySpMat(matB)
  status=cusparseDestroySpMat(matC)
  status=cusparseDestroy(handle)


!======================================================

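  istat=cudaDeviceSynchronize
  !! size the host arrays explicitly before copying the result back
  allocate(Crow(C_rows+1),Ccol(C_nnz),Cval(C_nnz))
  Crow=Crow_d
  Ccol=Ccol_d
  Cval=Cval_d
  istat=cudaDeviceSynchronize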

  print *, Crow
  print *, " "
  print *, Ccol
  print *, " "
  print *, Cval


return
end program SpGEMM
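
The program above declares Crow_true, Ccol_true and Cval_true but never fills them. One way to obtain reference values is to compute the product on the host. The sketch below is plain Fortran using a dense accumulator per row. The names csr_reference and csr_spgemm_host are hypothetical, the routine assumes one-based CSR input as in the example, and it is intended only for small test matrices. For the A and B defined above it yields 12 nonzeros, which matches C_nnz_true.

```fortran
module csr_reference
  implicit none
contains
  ! Host-side C = A*B for one-based CSR matrices (hypothetical helper).
  ! m = rows of A, n = columns of B. Columns of C come out sorted per row.
  subroutine csr_spgemm_host(m, n, Arow, Acol, Aval, Brow, Bcol, Bval, &
                             Crow, Ccol, Cval)
    integer, intent(in) :: m, n
    integer, intent(in) :: Arow(:), Acol(:), Brow(:), Bcol(:)
    real(8), intent(in) :: Aval(:), Bval(:)
    integer, allocatable, intent(out) :: Crow(:), Ccol(:)
    real(8), allocatable, intent(out) :: Cval(:)
    real(8) :: acc(n)      ! dense accumulator for one row of C
    logical :: hit(n)      ! marks columns that received a contribution
    integer :: i, j, ka, kb, pass, nnz

    allocate(Crow(m+1))
    do pass = 1, 2         ! pass 1 counts nonzeros, pass 2 fills Ccol/Cval
      nnz = 0
      do i = 1, m
        acc = 0d0
        hit = .false.
        do ka = Arow(i), Arow(i+1)-1
          do kb = Brow(Acol(ka)), Brow(Acol(ka)+1)-1
            acc(Bcol(kb)) = acc(Bcol(kb)) + Aval(ka)*Bval(kb)
            hit(Bcol(kb)) = .true.
          end do
        end do
        Crow(i) = nnz + 1
        do j = 1, n
          if (hit(j)) then
            nnz = nnz + 1
            if (pass == 2) then
              Ccol(nnz) = j
              Cval(nnz) = acc(j)
            end if
          end if
        end do
      end do
      Crow(m+1) = nnz + 1
      if (pass == 1) allocate(Ccol(nnz), Cval(nnz))
    end do
  end subroutine csr_spgemm_host
end module csr_reference
```

A call such as csr_spgemm_host(A_rows,B_cols,Arow,Acol,Aval,Brow,Bcol,Bval,Crow_ref,Ccol_ref,Cval_ref), with hypothetical allocatable arrays Crow_ref, Ccol_ref and Cval_ref, gives values to compare against the Crow, Ccol and Cval copied back from the device. A direct element-by-element comparison assumes that both results list the columns of each row in the same order.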

Hi @H-POTATO,

Yes, when the DEFAULT (ALG1) fails, you can switch to ALG2 or ALG3 which can run with larger matrices.

Thanks for your continued replies, @qanhpham !

I followed the sample code (https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSE/spgemm_mem/spgemm_mem_example.c) and tried ALG3, but when compiling the first cusparseSpGEMM_workEstimation call I get the following errors:

NVFORTRAN-S-0155-Could not resolve generic procedure cusparsespgemm_workestimation
NVFORTRAN-S-0038-Symbol, cusparse_spgemm_alg3, has not been explicitly declared

I just changed CUSPARSE_SPGEMM_DEFAULT to CUSPARSE_SPGEMM_ALG3.
There is no compile problem with CUSPARSE_SPGEMM_DEFAULT as it is.
What should I do?
Thanks.

This is my code with ALG3.

subroutine SpGEMM_ALG3

use cudafor
use cusparse

Implicit none

  !!Define Matrix----------------------------------
  Integer,parameter :: A_rows=4
  Integer,parameter :: A_cols=4
  Integer,parameter :: A_nnz=9
  Integer           :: Arow(A_rows+1)
  Integer           :: Acol(A_nnz)
  Real(8)           :: Aval(A_nnz)
  Integer,device    :: Arow_d(A_rows+1)
  Integer,device    :: Acol_d(A_nnz)
  Real(8),device    :: Aval_d(A_nnz)

  Integer,parameter :: B_rows=4
  Integer,parameter :: B_cols=4
  Integer,parameter :: B_nnz=8
  Integer           :: Brow(B_rows+1)
  Integer           :: Bcol(B_nnz)
  Real(8)           :: Bval(B_nnz)
  Integer,device    :: Brow_d(B_rows+1)
  Integer,device    :: Bcol_d(B_nnz)
  Real(8),device    :: Bval_d(B_nnz)

  Integer   :: C_rows
  Integer   :: C_cols
  Integer   :: C_nnz
  Integer,allocatable :: Crow(:)
  Integer,allocatable :: Ccol(:)
  Real(8),allocatable :: Cval(:)
  Integer,allocatable,device  :: Crow_d(:)
  Integer,allocatable,device  :: Ccol_d(:)
  Real(8),allocatable,device  :: Cval_d(:)

  Integer,parameter   :: C_rows_true=4
  Integer,parameter   :: C_cols_true=4
  Integer,parameter   :: C_nnz_true=12
  Integer :: Crow_true(C_rows_true+1)
  Integer :: Ccol_true(C_nnz_true)
  Real(8) :: Cval_true(C_nnz_true)
  !!Define Matrix----------------------------------

  Real(8) :: alpha=1d0,beta=0d0

  Integer(8)   :: C_rows_dbl
  Integer(8)   :: C_cols_dbl
  Integer(8)   :: C_nnz_dbl

  Integer :: istat,status
  type(cusparseHandle) :: handle
  type(cusparseSpMatDescr) :: matA,matB,matC
  type(cusparseSpGEMMDescr) :: SpGEMMDesc
  !type(cusparseSpGEMMALG) :: CUSPARSE_SPGEMM_ALG3
  !type(cusparseSpGEMMALG) :: CUSPARSE_SPGEMM_DEFAULT

  Integer(8) :: bufferSize1
  Integer(1),pointer,device :: buffer1(:)

  Integer(8) :: bufferSize2
  Integer(1),pointer,device :: buffer2(:)

  !! ALG3
  Integer(8) :: bufferSize3
  Integer(1),pointer,device :: buffer3(:)

  Integer(8) :: num_prods
  Real(8) :: chunk_fraction=0.2d0
  !! ALG3

  !!Define Matrix----------------------------------
  Arow=(/1,4,5,8,10/)
  Acol=(/1,3,4,2,1,3,4,2,4/)
  Aval=(/1d0,2d0,3d0,4d0,5d0,6d0,7d0,8d0,9d0/)

  Brow=(/1,3,5,8,9/)
  Bcol=(/1,4,2,4,1,2,3,2/)
  Bval=(/1d0,2d0,3d0,4d0,5d0,6d0,7d0,8d0/)

  istat=cudaDeviceSynchronize
  Arow_d=Arow
  Acol_d=Acol
  Aval_d=Aval
  Brow_d=Brow
  Bcol_d=Bcol
  Bval_d=Bval
  istat=cudaDeviceSynchronize
  !!Define Matrix----------------------------------

  allocate(Crow_d(A_rows+1))


  ! initialize cuSPARSE and the matrix descriptors
  status=cusparseCreate(handle)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreate error: ', status

  status=cusparseCreateCsr(matA,A_rows,A_cols,A_nnz, &
                           ARow_d,ACol_d,Aval_d,  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cusparseCreateCsr(matB,B_rows,B_cols,B_nnz, &
                          BRow_d,BCol_d,Bval_d,  &
                          CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                          CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cusparseCreateCsr(matC,A_rows,B_cols,0, &
                           null(),null(),null(),  &
                           CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I, &
                           CUSPARSE_INDEX_BASE_ONE,CUDA_R_64F)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCreateCsr error: ', status

  status=cudaDeviceSynchronize

  !!----------------------------------------------------------------------------------------------------

  !!SpGEMM computation
  status=cusparseSpGEMM_createDescr(SpGEMMDesc)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_CreateDescr error: ', status


  !! ask bufferSize1 bytes for external memory
  nullify(buffer1)
  status=cusparseSpGEMM_workEstimation(handle,&
                                       CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
                                       alpha,matA,matB,beta,matC,&
                                       CUDA_R_64F,CUSPARSE_SPGEMM_ALG3,&
                                       !!CUDA_R_64F,CUSPARSE_SPGEMM_DEFAULT,&
                                       SpGEMMDesc,bufferSize1,buffer1)
  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status

  istat=cudaDeviceSynchronize
  print *, "bufferSize1=",bufferSize1

  if(bufferSize1 /= 0) allocate(buffer1(bufferSize1))

!!  !! inspect the A and B to understand the memory requirement for
!!  !! the next step
!!  status=cusparseSpGEMM_workEstimation(handle,&
!!                                       CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
!!                                       alpha,matA,matB,beta,matC,&
!!                                       CUDA_R_64F,CUSPARSE_SPGEMM_ALG3,&
!!                                       SpGEMMDesc,bufferSize1,buffer1)
!!  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status
!!
!!  !!ALG3--------------------------------------
!!
!!  status=cusparseSpGEMM_getNumProducts(SpGEMMDesc,num_prods)
!!
!!  !! ask bufferSize3 bytes for external memory
!!  nullify(buffer3)
!!  status=cusparseSpGEMM_estimateMemory(handle,&
!!                                       CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
!!                                       alpha,matA,matB,beta,matC,&
!!                                       CUDA_R_64F,CUSPARSE_SPGEMM_ALG3,&
!!                                       SpGEMMDesc,chunk_fraction,&
!!                                       bufferSize3,buffer3,&
!!                                       bufferSize2)
!!  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_workEstimation error: ', status
!!
!!  istat=cudaDeviceSynchronize
!!  print *, "bufferSize2=",bufferSize2
!!  print *, istat
!!
!!  if(bufferSize2 /= 0) allocate(buffer2(bufferSize2))
!!
!!  !! buffer3 can be safely freed to save more memory
!!  deallocate(buffer3)
!!
!!  !!ALG3--------------------------------------
!!
!!
!!
!!
!!  !! compute the intermediate product of A * B
!!  status=cusparseSpGEMM_compute(handle,&
!!                               CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
!!                               alpha,matA,matB,beta,matC,&
!!                               CUDA_R_64F,CUSPARSE_SPGEMM_ALG3,&
!!                               SpGEMMDesc,bufferSize2,buffer2)
!!  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_compute error: ', status
!!
!!
!!  !! get matrix C non-zero entires C_nnz1
!!  status=cusparseSpMatGetSize(matC,C_rows_dbl,C_cols_dbl,C_nnz_dbl)
!!  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpMatGetSize error: ', status
!!
!!  istat=cudaDeviceSynchronize
!!  C_rows=C_rows_dbl
!!  C_cols=C_cols_dbl
!!  C_nnz=C_nnz_dbl
!!  istat=cudaDeviceSynchronize
!!
!!  write(*,*) "A_rows",A_rows,"A_cols",A_cols,"A_nnz",A_nnz
!!  write(*,*) "B_rows",B_rows,"B_cols",B_cols,"B_nnz",B_nnz
!!  write(*,*) "C_rows",C_rows,"C_cols",C_cols,"C_nnz",C_nnz
!!
!!
!!  !! allocate matrix C
!!  if(allocated(Ccol_d)) deallocate(Ccol_d)
!!  if(allocated(Cval_d)) deallocate(Cval_d)
!!  allocate(Ccol_d(C_nnz))
!!  allocate(Cval_d(C_nnz))
!!  istat=cudaDeviceSynchronize
!!
!!  !! update matC with the new pointers
!!  status=cusparseCsrSetPointers(matC,Crow_d,Ccol_d,Cval_d)
!!  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseCsrSetPointers error: ', status
!!
!!  !! copy the final products to the matrix C
!!  status=cusparseSpGEMM_copy(handle,&
!!                             CUSPARSE_OPERATION_NON_TRANSPOSE,CUSPARSE_OPERATION_NON_TRANSPOSE,&
!!                             alpha,matA,matB,beta,matC,&
!!                             CUDA_R_64F,CUSPARSE_SPGEMM_ALG3,SpGEMMDesc)
!!  if(status/=CUSPARSE_STATUS_SUCCESS) print *, 'cusparseSpGEMM_copy error: ', status
!!
!!
!!  deallocate(buffer1)
!!  deallocate(buffer2)
!!
!!  status=cusparseSpGEMM_destroyDescr(SpGEMMDesc)
!!  status=cusparseDestroySpMat(matA)
!!  status=cusparseDestroySpMat(matB)
!!  status=cusparseDestroySpMat(matC)
!!  status=cusparseDestroy(handle)
!!
!!
!!!======================================================
!!
!!  istat=cudaDeviceSynchronize
!!  Crow=Crow_d
!!  Ccol=Ccol_d
!!  Cval=Cval_d
!!  istat=cudaDeviceSynchronize
!!
!!  print *, Crow
!!  print *, " "
!!  print *, Ccol
!!  print *, " "
!!  print *, Cval


return
end subroutine SpGEMM_ALG3


Are you compiling using CUDA 12.0+? ALG2 and ALG3 are only available since CUDA 12.0.

Yes, I am using CUDA 12.0.
ALG1 fails to compile as well, just like ALG2 and ALG3.
Only CUSPARSE_SPGEMM_DEFAULT compiles.

Looks like you’re still using CUDA 11.x or its header file cusparse.h while compiling. Can you check that all the paths and compilation parameters are correct?

Hi @qanhpham,
Sorry for the late reply.

I have re-installed CUDA 12.2 on my GPU machine and compiled again, but I still get the same error.

I also contacted the Information Technology Center at the university. They told me that they had tried compiling the CUDA Fortran program but were unable to do so either, and that they would contact NVIDIA.

Thanks.

Hi @H-POTATO.
If it can’t find the new symbol ALG3, it must be using the old toolkit (< 12.0). Can you check your compile command to see if it’s pointing to the right CUDA version?

CUDA Fortran is most commonly available via the installation of the HPC SDK. The HPC SDK generally uses its own installation of CUDA libraries. Simply installing CUDA 12.2 “somewhere else” will not cause CUDA Fortran to make use of it.

If you want access to the latest version of the CUDA libraries (such as cusparse) via CUDA Fortran, the best thing to do is probably to do a proper install of the latest version of the HPC SDK.

Hi, @Robert_Crovella .

Of course, I downloaded the HPC SDK 23.9 bundled with CUDA 12.2.
I checked the cuSPARSE sources and found that cusparseSpGEMM_estimateMemory and cusparseSpGEMM_getNumProducts, which are used with CUSPARSE_SPGEMM_ALG3, are declared in cusparse.h but not in cusparse.f90.
I suspect this is what makes it impossible to compile with CUDA Fortran.

The location of each file in my environment is as follows.
I hope this is useful as a reference.

Thanks.

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/12.2/targets/x86_64-linux/include/cusparse.h

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/src/cusparse.f90

That seems reasonable/plausible and I have not checked that.

In my opinion, this question should be addressed on the HPC compilers forum. Most CUDA Fortran questions can be found there. If you wish to post a question there, you can link easily to the one here, or I can move this question for you.

I would rephrase the question if posting there to focus on this specific observation you have made, about the lack of prototype in the Fortran module.

Hi, @Robert_Crovella !
Thanks for your advice.
I posted this question, with a link to this thread, in the HPC compilers forum.
I will wait for the response there.

Thanks.