cuSparse: cusparseScsrgemm2 much slower than SpGEMM

PSPACEhard · June 23, 2021, 7:57am

According to this comment, the current SpGEMM implementation may return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES for some inputs. I therefore tried cusparseScsrgemm2 instead, but found it quite slow. For example, given two 600,000 x 600,000 matrices A and B, where A contains 40,000,000 entries and B is a diagonal matrix, cusparseScsrgemm2 took several seconds to compute the product of A and B, whereas SpGEMM took only tens of milliseconds. More details can be found here. Why is cusparseScsrgemm2 so slow? Thanks.
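As a side note on the example matrices: when B is diagonal, the product A·B only scales column j of A by the j-th diagonal entry of B, so the result has the same sparsity pattern as A. The following is a minimal CPU-side sketch of that observation, not a cuSPARSE call; the `Csr` struct and the `scaleColumns` function are hypothetical names introduced here for illustration, and the diagonal of B is assumed to be available as a dense vector `d`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container for illustration (not a cuSPARSE type).
struct Csr {
    int rows = 0;
    int cols = 0;
    std::vector<int> rowPtr;   // size rows + 1
    std::vector<int> colIdx;   // size nnz
    std::vector<float> vals;   // size nnz
};

// Computes C = A * diag(d). The sparsity pattern of C equals that of A;
// every stored value is multiplied by the diagonal entry of its column.
Csr scaleColumns(const Csr& A, const std::vector<float>& d) {
    Csr C = A;  // copy the pattern (rowPtr, colIdx) and the values
    for (std::size_t k = 0; k < C.vals.size(); ++k) {
        C.vals[k] *= d[C.colIdx[k]];
    }
    return C;
}
```

Each nonzero is touched exactly once, so the cost is linear in the number of nonzeros of A. Whether this special case applies depends on B really being diagonal; for a general B a full SpGEMM is still needed.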

mnicely · June 23, 2021, 12:30pm

SpGEMM is implemented using a different, faster routine. It requires an additional workspace, whereas cusparseScsrgemm2 doesn’t.

https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spgemm

MEMORY REQUIREMENT: the first invocation of cusparseSpGEMM_compute provides an upper bound of the memory required for the computation that is generally several times larger than the actual memory used. The user can provide an arbitrary buffer size bufferSize2 in the second invocation. If it is not sufficient, the routine will return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.

PSPACEhard · June 24, 2021, 10:01am

Thanks. I understand that SpGEMM uses a faster algorithm. What I want to ask is why cusparseScsrgemm2 is so slow. For the two matrices mentioned in the question, multiplying them with cusparseScsrgemm2 took several seconds, whereas a CPU algorithm running on 32 cores needs only about 200 milliseconds. Is this because cusparseScsrgemm2 cannot exploit the V100 architecture?

fbusato · June 24, 2021, 1:12pm

cusparseScsrgemm2 uses an entirely different algorithm that relies on a “static” workload partition. This method uses fewer resources, but it suffers from poor load balancing. If the matrix sparsity is not uniform, you can get bad performance.
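To illustrate why a static partition can balance poorly, here is a toy C++ model. It is not the actual cusparseScsrgemm2 partitioning, which is not described in this thread. It assumes that the work per row is known (for example, proportional to the row's nonzero count) and that rows are split into equally sized contiguous chunks, one per worker; `maxLoadStaticRows` and `idealLoad` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Largest amount of work given to any single worker when the rows are split
// into equally sized contiguous chunks, ignoring how much work each row holds.
long maxLoadStaticRows(const std::vector<long>& rowWork, int workers) {
    const std::size_t n = rowWork.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // rows per worker
    long worst = 0;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        long load = 0;
        for (std::size_t r = begin; r < end; ++r) {
            load += rowWork[r];
        }
        worst = std::max(worst, load);
    }
    return worst;
}

// Per-worker load if the total work could be divided perfectly evenly
// (rounded up). This is a lower bound on the busiest worker's load.
long idealLoad(const std::vector<long>& rowWork, int workers) {
    long total = 0;
    for (long w : rowWork) {
        total += w;
    }
    return (total + workers - 1) / workers;
}
```

With uniform rows the two numbers coincide. With one very heavy row, the worker that owns it determines the total run time while the others sit idle, which is the kind of imbalance a non-uniform sparsity pattern can cause under a static split.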
