cuSPARSE: cusparseScsrgemm2 much slower than SpGEMM

According to this comment, the current SpGEMM implementation may return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES for some specific inputs. Hence, I tried cusparseScsrgemm2 instead. However, I find that cusparseScsrgemm2 is quite slow. For example, for two 600,000 x 600,000 matrices A and B, where A contains 40,000,000 entries and B is a diagonal matrix, cusparseScsrgemm2 took several seconds to compute the product of A and B. That is much slower than SpGEMM, which took only tens of milliseconds. More details can be found here. I wonder why cusparseScsrgemm2 is so slow. Thanks.

SpGEMM is implemented using a different, faster routine. It requires an additional workspace, whereas cusparseScsrgemm2 doesn’t.

https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spgemm

MEMORY REQUIREMENT: the first invocation of cusparseSpGEMM_compute provides an upper bound of the memory required for the computation that is generally several times larger than the actual memory used. The user can provide an arbitrary buffer size bufferSize2 in the second invocation. If it is not sufficient, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.

Thanks. I understand that SpGEMM uses a faster algorithm. But what I want to ask is why cusparseScsrgemm2 is so slow. For the two matrices mentioned in the question, it took several seconds to multiply them using cusparseScsrgemm2. If I instead use a CPU algorithm with 32 cores, it needs only about 200 milliseconds. Is it because cusparseScsrgemm2 cannot exploit the architecture of the V100?

cusparseScsrgemm2 uses an entirely different algorithm that relies on a “static” workload partition. This method uses fewer resources, but it suffers from poor load balancing. If the sparsity of the matrix is not uniform, a few partitions end up with most of the work and performance can be bad.