cuSparse: cusparseScsrgemm2 much slower than SpGEMM

According to this comment, the current SpGEMM implementation may issue CUSPARSE_STATUS_INSUFFICIENT_RESOURCES for some specific inputs. Hence, I tried the cusparseScsrgemm2 routine instead. However, I find that cusparseScsrgemm2 is quite slow. For example, for two 600,000 x 600,000 matrices A and B, where A contains 40,000,000 entries and B is a diagonal matrix, cusparseScsrgemm2 took several seconds to compute the product of A and B, much slower than SpGEMM, which took only tens of milliseconds. More details can be found here. I wonder why cusparseScsrgemm2 is so slow. Thanks.

SpGEMM is implemented using a different, faster routine. It requires an additional workspace, whereas cusparseScsrgemm2 does not.

MEMORY REQUIREMENT: the first invocation of cusparseSpGEMM_compute provides an upper bound on the memory required for the computation, which is generally several times larger than the memory actually used. The user can provide an arbitrary buffer size bufferSize2 in the second invocation; if it is not sufficient, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.
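For reference, the two-invocation pattern described above looks roughly like the following. This is only a sketch of the documented generic SpGEMM workflow: error checking, matrix descriptor setup (cusparseCreateCsr), and the final nnz query / cusparseSpGEMM_copy step are elided, and the spgemm_sketch function name is my own.

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

// Sketch: C = A * B with the generic cuSPARSE SpGEMM API (CUDA 11+).
// matA, matB, matC are CSR descriptors created elsewhere with cusparseCreateCsr.
void spgemm_sketch(cusparseHandle_t handle,
                   cusparseSpMatDescr_t matA,
                   cusparseSpMatDescr_t matB,
                   cusparseSpMatDescr_t matC)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;

    cusparseSpGEMMDescr_t spgemmDesc;
    cusparseSpGEMM_createDescr(&spgemmDesc);

    // 1) Work estimation: first call queries bufferSize1, second does the work.
    size_t bufferSize1 = 0;
    void  *dBuffer1 = NULL;
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta,
                                  matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                                  spgemmDesc, &bufferSize1, NULL);
    cudaMalloc(&dBuffer1, bufferSize1);
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta,
                                  matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                                  spgemmDesc, &bufferSize1, dBuffer1);

    // 2) Compute: the first call returns an upper bound in bufferSize2 that is
    //    often several times the memory actually needed. A smaller user-chosen
    //    size may be passed instead; the second call then returns
    //    CUSPARSE_STATUS_INSUFFICIENT_RESOURCES if it turns out too small.
    size_t bufferSize2 = 0;
    void  *dBuffer2 = NULL;
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta, matC,
                           CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                           &bufferSize2, NULL);
    cudaMalloc(&dBuffer2, bufferSize2);  // or a smaller buffer, at the risk above
    cusparseStatus_t status =
        cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta, matC,
                               CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                               &bufferSize2, dBuffer2);
    if (status == CUSPARSE_STATUS_INSUFFICIENT_RESOURCES) {
        // retry with a larger buffer
    }

    // 3) Query nnz of C with cusparseSpMatGetSize, allocate C's CSR arrays,
    //    set them with cusparseCsrSetPointers, then call cusparseSpGEMM_copy.

    cusparseSpGEMM_destroyDescr(spgemmDesc);
    cudaFree(dBuffer1);
    cudaFree(dBuffer2);
}
```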

Thanks. I understand that SpGEMM uses a faster algorithm, but what I want to ask is why cusparseScsrgemm2 is so slow. For the two matrices mentioned in the question, it took several seconds to multiply them using cusparseScsrgemm2, while a CPU algorithm on 32 cores needs only about 200 milliseconds. Is it because cusparseScsrgemm2 cannot exploit the architecture of the V100?

cusparseScsrgemm2 uses an entirely different algorithm that relies on a “static” workload partition. This method uses fewer resources, but its load balancing is poor: if the matrix sparsity pattern is not uniform, you can get bad performance.
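To see why a static partition balances badly, consider a toy kernel (not cuSPARSE's actual implementation, just an illustration I am sketching) that assigns one thread per CSR row to compute an nnz upper bound for each row of C = A*B. Each thread's cost is proportional to its row's work, so one dense row stalls the whole warp while the other threads sit idle:

```cuda
// Toy illustration of a static one-thread-per-row partition.
// rowNnzC[row] gets an upper bound on nnz of row `row` of C = A*B:
// the sum, over A's entries in that row, of the matching B row lengths.
__global__ void row_per_thread_nnz_bound(int m,
                                         const int *rowPtrA,
                                         const int *colIndA,
                                         const int *rowPtrB,
                                         int *rowNnzC)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    int bound = 0;
    for (int j = rowPtrA[row]; j < rowPtrA[row + 1]; ++j)
        bound += rowPtrB[colIndA[j] + 1] - rowPtrB[colIndA[j]];
    rowNnzC[row] = bound;

    // The loop length varies per thread with the row's nnz, so a skewed
    // sparsity pattern (a few heavy rows) serializes each warp on its
    // heaviest row: static partitioning, poor load balance.
}
```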