cusparseSpMM (FP32) is slower than cublasSgemm


I ran the cuSPARSE and cuBLAS samples from CUDALibrarySamples.
The environment is as follows:

  • CUDA 11.7
  • RTX 3080
  • m, n, k are all 1024
  • Sparsity of matrix A is 50% (half of its entries are zero)


// execute SpMM
CHECK_CUSPARSE( cusparseSpMM(handle,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             &alpha, matA, matB, &beta, matC, CUDA_R_32F,
                             CUSPARSE_SPMM_ALG_DEFAULT, dBuffer) )

// execute GEMM
CUBLAS_CHECK(cublasSgemm(cublasH, transa, transb, m, n, k,
                         &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc));
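To make the comparison concrete, here is a minimal CPU reference of what both calls compute, C = alpha * A * B + beta * C. It is a sketch, not the cuSPARSE or cuBLAS implementation, and it assumes no transposes, column-major dense matrices (as cuBLAS uses) and A stored in CSR for the sparse path. The names `Csr`, `dense_gemm`, `dense_to_csr` and `csr_spmm` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the library implementation.
// Dense matrices are column-major; no transposes are applied.

// C (m x n) = alpha * A (m x k) * B (k x n) + beta * C, all dense.
void dense_gemm(int m, int n, int k, float alpha, const std::vector<float>& A,
                const std::vector<float>& B, float beta, std::vector<float>& C) {
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) acc += A[i + p * m] * B[p + j * k];
      C[i + j * m] = alpha * acc + beta * C[i + j * m];
    }
}

// CSR storage: row_ptr has rows + 1 entries; col_idx and vals have nnz entries.
struct Csr {
  int rows, cols;
  std::vector<int> row_ptr, col_idx;
  std::vector<float> vals;
};

// Build CSR from a dense column-major m x k matrix, keeping only nonzeros.
Csr dense_to_csr(int m, int k, const std::vector<float>& A) {
  Csr s{m, k, {0}, {}, {}};
  for (int i = 0; i < m; ++i) {
    for (int p = 0; p < k; ++p) {
      float v = A[i + p * m];
      if (v != 0.0f) { s.col_idx.push_back(p); s.vals.push_back(v); }
    }
    s.row_ptr.push_back(static_cast<int>(s.vals.size()));
  }
  return s;
}

// Same product with A sparse: only stored nonzeros are multiplied, but each
// one needs an indirect lookup through col_idx into B, an irregular access
// pattern that a dense GEMM avoids.
void csr_spmm(const Csr& A, int n, float alpha, const std::vector<float>& B,
              float beta, std::vector<float>& C) {
  const int m = A.rows, k = A.cols;
  assert(B.size() >= static_cast<std::size_t>(k) * n);
  assert(C.size() >= static_cast<std::size_t>(m) * n);
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int q = A.row_ptr[i]; q < A.row_ptr[i + 1]; ++q)
        acc += A.vals[q] * B[A.col_idx[q] + j * k];
      C[i + j * m] = alpha * acc + beta * C[i + j * m];
    }
}
```

For example, a 3×4 A with half of its entries zero yields 6 stored nonzeros, and `dense_gemm` and `csr_spmm` produce the same C. The sparse path skips the zero multiplies but pays for the extra index array and the scattered reads of B.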

The kernel timings show SpMM taking about 3.18 ms versus 1.02 ms for Sgemm, roughly 3.1× slower:

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     75.7        3,181,935          1  3,181,935.0  3,181,935.0  3,181,935  3,181,935          0.0  void cusparse::cusparseCooMMSmallKernel<(unsigned int)128, (cusparseOperation_t)1, (cusparseOperati…
     24.3        1,020,240          1  1,020,240.0  1,020,240.0  1,020,240  1,020,240          0.0  void gemm_kernel2x1_core<float, (bool)1, (bool)0, (bool)0, (bool)0, (bool)0>(T1 *, const T1 *, cons…

Another question: where can I find a sample or documentation for testing 2:4 structured sparsity?

Thanks in advance~

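cuSPARSE is usually worth considering only when the matrix is very sparse, typically when 1% or fewer of its entries are nonzero. At 50% sparsity nearly all the data has to be accessed anyway, and the sparse algorithms cuSPARSE uses, with their index lookups and irregular memory access, are not efficient compared to cuBLAS dense GEMM.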
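A back-of-the-envelope check of that point for the 1024×1024 case in the question. This is a sketch that assumes FP32 values and 32-bit indices and counts only storage and multiply-add work, ignoring caching and kernel details; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Storage footprint in bytes, assuming 4-byte floats and 4-byte indices.
constexpr std::int64_t dense_bytes(std::int64_t m, std::int64_t k) {
  return m * k * 4;  // every entry stored as a 4-byte float
}
constexpr std::int64_t csr_bytes(std::int64_t m, std::int64_t nnz) {
  return nnz * 4         // values
         + nnz * 4       // column indices
         + (m + 1) * 4;  // row offsets
}
constexpr std::int64_t coo_bytes(std::int64_t nnz) {
  return nnz * (4 + 4 + 4);  // value + row index + column index per nonzero
}

// Flop counts: dense GEMM multiplies every entry of A, SpMM only the nonzeros.
constexpr std::int64_t gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
  return 2 * m * n * k;
}
constexpr std::int64_t spmm_flops(std::int64_t nnz, std::int64_t n) {
  return 2 * nnz * n;
}

// At 50% sparsity (the case in the question) CSR is no smaller than dense.
static_assert(csr_bytes(1024, 1024 * 1024 / 2) > dense_bytes(1024, 1024),
              "50% sparse CSR is not smaller than dense FP32");
```

Under these assumptions, at 50% sparsity CSR needs 4,198,404 bytes against 4,194,304 for dense, COO (the format named in the profiled kernel) needs 6,291,456, and SpMM does exactly half of GEMM's 2·1024³ flops. Saving half the arithmetic while adding index traffic is consistent with the tuned dense kernel winning; at about 1% nonzeros CSR shrinks to roughly 88 KB.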

I don’t think there is sample code that tests every imaginable parameter combination in the cuBLAS and cuSPARSE APIs. You can find sample code for these libraries in the CUDALibrarySamples repository on GitHub.
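On the 2:4 question: 2:4 structured sparsity means that in every group of four consecutive elements, at most two are nonzero. A fully pruned 2:4 matrix is therefore exactly 50% sparse, but an arbitrary 50% pattern like the one above does not qualify. NVIDIA exposes hardware support for this pattern on Ampere GPUs through the separate cuSPARSELt library, which provides its own prune and compress steps, rather than through cusparseSpMM. The host-side sketch below does not call cuSPARSELt; it only illustrates the pattern, assuming a row-major matrix with groups taken along each row and k divisible by 4. The names `prune_2_4` and `is_2_4_sparse` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Prune a row-major m x k matrix in place so that each group of 4 consecutive
// entries in a row keeps only its 2 largest-magnitude values.
void prune_2_4(std::vector<float>& A, int m, int k) {
  assert(k % 4 == 0);
  for (int i = 0; i < m; ++i)
    for (int g = 0; g < k; g += 4) {
      float* v = &A[static_cast<std::size_t>(i) * k + g];
      // Index of the largest |v| in the group.
      int first = 0;
      for (int t = 1; t < 4; ++t)
        if (std::fabs(v[t]) > std::fabs(v[first])) first = t;
      // Index of the second largest |v|.
      int second = (first == 0) ? 1 : 0;
      for (int t = 0; t < 4; ++t)
        if (t != first && std::fabs(v[t]) > std::fabs(v[second])) second = t;
      // Zero the remaining two entries.
      for (int t = 0; t < 4; ++t)
        if (t != first && t != second) v[t] = 0.0f;
    }
}

// True if every group of 4 consecutive entries in each row has at most 2 nonzeros.
bool is_2_4_sparse(const std::vector<float>& A, int m, int k) {
  if (k % 4 != 0) return false;
  for (int i = 0; i < m; ++i)
    for (int g = 0; g < k; g += 4) {
      int nonzeros = 0;
      for (int t = 0; t < 4; ++t)
        if (A[static_cast<std::size_t>(i) * k + g + t] != 0.0f) ++nonzeros;
      if (nonzeros > 2) return false;
    }
  return true;
}
```

For example, the row {0.1, -3, 2, 0.5, 4, 1, -1.5, 0} fails the check, and after pruning it becomes {0, -3, 2, 0, 4, 0, -1.5, 0}, which passes.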

Thanks for your reply!

