Confused with routine cusparseSpGEMM_workEstimation

I am confused with the purpose of the routine “cusparseSpGEMM_workEstimation”. Both routines “cusparseSpGEMM_workEstimation” and “cusparseSpGEMM_compute” return an estimated buffer size on their first execution, and the sample “CSR SpGEMM - Sparse Matrix-Sparse Matrix Multiplication” validates this understanding. However, it seems that the buffer size and the buffer created from “cusparseSpGEMM_workEstimation” are never used afterwards in the code. I am wondering what the purpose of the routine “cusparseSpGEMM_workEstimation” is.


I would like to join Ziqi’s statement.
The documentation doesn’t give much information, and just following the GitHub tutorial is not very satisfying. At this point my implementation of a sparse matrix-matrix multiplication works, but I have no clue what is going on there.
In addition to the questions Ziqi already asked, I would like to know which routine ultimately performs the multiplication. Also, how come that even though we already have a result, we still have to allocate memory for the components of the result and hand the pointers to the descriptor (leading to the use of cusparseCsrSetPointers)? Or is the final operation performed by cusparseSpGEMM_copy?
Is the GitHub tutorial really a cooking recipe, such that these steps have to be performed in every case?
I am very much looking forward to some explanations.
Kind Regards

Let me reply to @Ziqi first.
The Generic APIs for SpGEMM work in a similar way to the CUB library.
The first call of each function only returns the required buffer size, e.g. bufferSize1 for cusparseSpGEMM_workEstimation, and doesn’t perform any computation. The user then allocates the device memory and passes the pointer to the same routine, which performs the actual work.
The SpGEMM routine is complex and we cannot expose all of its details, but I’ll try to give a very high-level idea. At the beginning of the computation, we don’t know anything about the number of intermediate products or the structure of the output matrix.

  • cusparseSpGEMM_workEstimation is used to compute an upper bound for the intermediate products.
  • cusparseSpGEMM_compute computes the structure of the output matrix and its values. It stores the matrix in temporary buffers.
  • The user retrieves the size of the output matrix, allocates the memory, and sets up the descriptor with cusparseCsrSetPointers.
  • cusparseSpGEMM_copy copies the offsets, column indices, and values from the temporary buffers to the output matrix.
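To make these steps concrete, here is a condensed sketch of the call sequence along the lines of the GitHub sample (error checking omitted; it assumes float values, 32-bit indices, the default algorithm, and CSR descriptors that have already been created, with matC still holding null data pointers):

```c
#include <cuda_runtime.h>
#include <cusparse.h>

void spgemm_sketch(cusparseHandle_t handle,
                   cusparseSpMatDescr_t matA,
                   cusparseSpMatDescr_t matB,
                   cusparseSpMatDescr_t matC)   // C = A * B
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cusparseSpGEMMDescr_t spgemmDesc;
    cusparseSpGEMM_createDescr(&spgemmDesc);

    // 1) upper bound on the intermediate products:
    //    first call returns the size, second call (same routine, real pointer) does the work
    size_t bufferSize1 = 0;
    void*  dBuffer1    = NULL;
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta, matC,
                                  CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                                  &bufferSize1, NULL);
    cudaMalloc(&dBuffer1, bufferSize1);
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta, matC,
                                  CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                                  &bufferSize1, dBuffer1);

    // 2) actual multiplication: structure and values of C are computed and kept
    //    in temporary storage (spgemmDesc / dBuffer2)
    size_t bufferSize2 = 0;
    void*  dBuffer2    = NULL;
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta, matC,
                           CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                           &bufferSize2, NULL);
    cudaMalloc(&dBuffer2, bufferSize2);
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta, matC,
                           CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                           &bufferSize2, dBuffer2);

    // 3) query the size of C, allocate its arrays, attach them to the descriptor
    int64_t rowsC, colsC, nnzC;
    cusparseSpMatGetSize(matC, &rowsC, &colsC, &nnzC);
    int   *dOffsetsC, *dColumnsC;
    float *dValuesC;
    cudaMalloc((void**)&dOffsetsC, (rowsC + 1) * sizeof(int));
    cudaMalloc((void**)&dColumnsC, nnzC * sizeof(int));
    cudaMalloc((void**)&dValuesC,  nnzC * sizeof(float));
    cusparseCsrSetPointers(matC, dOffsetsC, dColumnsC, dValuesC);

    // 4) copy offsets/columns/values from the temporary storage into C
    cusparseSpGEMM_copy(handle, op, op, &alpha, matA, matB, &beta, matC,
                        CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc);

    cusparseSpGEMM_destroyDescr(spgemmDesc);
    cudaFree(dBuffer1);   // no longer needed once compute() has run
    cudaFree(dBuffer2);
}
```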

For @Jan_W’s questions:

  • cusparseSpGEMM_compute is the function that actually computes the multiplication.
  • cusparseSpGEMM_compute cannot build the output matrix because its memory allocations are not ready yet. It stores the matrix in temporary buffers.
  • The GitHub example provides the ‘mandatory’ steps that the user needs to perform for the matrix multiplication. There are no optional steps.

Thank you for the nice explanation @fbusato. That made it clearer.

@fbusato, sorry, but you didn’t really answer about buffer1. It’s not used in the compute call at all, so is it only some temporary storage? Can it be deleted immediately after the workEstimation call, or is it considered part of the persistent state required for the actual compute function?

As reported in the SpGEMM documentation, externalBuffer1 is required by cusparseSpGEMM_workEstimation() and cusparseSpGEMM_compute(). You can safely deallocate it after cusparseSpGEMM_compute().
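In terms of the variable names used in the sketch above, this means the frees can move up to these points (the steps between compute and copy are unchanged):

```c
// second ("real") calls, exactly as in the sample
cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta, matC,
                              CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                              &bufferSize1, dBuffer1);
cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta, matC,
                       CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                       &bufferSize2, dBuffer2);   // still reads dBuffer1 internally
cudaFree(dBuffer1);   // earliest safe point for externalBuffer1
// ... cusparseSpMatGetSize / cudaMalloc / cusparseCsrSetPointers as before ...
cusparseSpGEMM_copy(handle, op, op, &alpha, matA, matB, &beta, matC,
                    CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc);
cudaFree(dBuffer2);   // the sample keeps externalBuffer2 alive until after the copy
```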

So even though externalBuffer1 is not an input argument to cusparseSpGEMM_compute(), it is still needed by cusparseSpGEMM_compute()? If that is the case, it should probably be an input argument, to make the dependence clear.

The reason I need specifics is for TensorFlow’s batch sparse GEMM. With CUDA 12, the old csrgemm2 routines have been removed, leaving us at TensorFlow scrambling a bit. Ideally, when multiplying a batch of sparse matrices, we would query for the maximum buffer size(s) required for the external workspace (and all output matrix sizes), allocate once, then go through and do the actual multiplications to populate the results. With the new SpGEMM routine, it looks like this is no longer possible.

If compute() needs to use externalBuffer1, and if its contents must be in place before querying for bufferSize2, then it looks like I’ll need to call workEstimation() up to 3 times per matrix to determine the maximum bufferSize2 (twice to determine the maximum bufferSize1, then once more before calling compute() to query for the maximum bufferSize2). If compute() doesn’t actually need externalBuffer1, I could skip the third call.

It also looks like I can’t determine NNZ without first actually doing the computation by calling compute() (twice: once for its buffer size, once for real)? This likely means I’ll need to perform each sparse matmul twice: once for all matrices to determine the final output sizes so I can allocate the combined result of the batch matmul, and once to actually copy the values over.
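To make that concrete, here is roughly the two-pass flow I’m picturing. Everything in it is my own scaffolding (the helper, the single combined allocation), not an existing cuSPARSE batched API; it assumes float values, 32-bit indices, and that every output matrix has the same number of rows:

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cusparse.h>

// One structure computation for a single pair: workEstimation + compute,
// reusing caller-owned scratch buffers that only grow when needed.
static void spgemm_structure(cusparseHandle_t h, cusparseSpMatDescr_t A,
                             cusparseSpMatDescr_t B, cusparseSpMatDescr_t C,
                             cusparseSpGEMMDescr_t d,
                             void** scratch1, size_t* cap1,
                             void** scratch2, size_t* cap2)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    size_t need1 = 0, need2 = 0;
    cusparseSpGEMM_workEstimation(h, op, op, &alpha, A, B, &beta, C, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_DEFAULT, d, &need1, NULL);
    if (need1 > *cap1) { cudaFree(*scratch1); cudaMalloc(scratch1, need1); *cap1 = need1; }
    cusparseSpGEMM_workEstimation(h, op, op, &alpha, A, B, &beta, C, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_DEFAULT, d, cap1, *scratch1);
    cusparseSpGEMM_compute(h, op, op, &alpha, A, B, &beta, C, CUDA_R_32F,
                           CUSPARSE_SPGEMM_DEFAULT, d, &need2, NULL);
    if (need2 > *cap2) { cudaFree(*scratch2); cudaMalloc(scratch2, need2); *cap2 = need2; }
    cusparseSpGEMM_compute(h, op, op, &alpha, A, B, &beta, C, CUDA_R_32F,
                           CUSPARSE_SPGEMM_DEFAULT, d, cap2, *scratch2);
}

void batched_spgemm_sketch(cusparseHandle_t h, int batch, int64_t numRows,
                           cusparseSpMatDescr_t* A, cusparseSpMatDescr_t* B,
                           cusparseSpMatDescr_t* C)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    void *scratch1 = NULL, *scratch2 = NULL;
    size_t cap1 = 0, cap2 = 0;
    int64_t* nnzC = (int64_t*)malloc(batch * sizeof(int64_t));
    int64_t totalNnz = 0;

    // Pass 1: sizing only. The temporary results die with the SpGEMM descriptor,
    // which is why pass 2 has to redo the computation.
    for (int b = 0; b < batch; ++b) {
        cusparseSpGEMMDescr_t d;
        cusparseSpGEMM_createDescr(&d);
        spgemm_structure(h, A[b], B[b], C[b], d, &scratch1, &cap1, &scratch2, &cap2);
        int64_t rows, cols;
        cusparseSpMatGetSize(C[b], &rows, &cols, &nnzC[b]);
        totalNnz += nnzC[b];
        cusparseSpGEMM_destroyDescr(d);
    }

    // One combined allocation for the whole batch of outputs.
    int *dOffsets, *dColumns;
    float *dValues;
    cudaMalloc((void**)&dOffsets, (size_t)batch * (numRows + 1) * sizeof(int));
    cudaMalloc((void**)&dColumns, totalNnz * sizeof(int));
    cudaMalloc((void**)&dValues,  totalNnz * sizeof(float));

    // Pass 2: recompute and copy each result into its slice of the batch output.
    int64_t nnzOffset = 0;
    for (int b = 0; b < batch; ++b) {
        cusparseSpGEMMDescr_t d;
        cusparseSpGEMM_createDescr(&d);
        spgemm_structure(h, A[b], B[b], C[b], d, &scratch1, &cap1, &scratch2, &cap2);
        cusparseCsrSetPointers(C[b], dOffsets + b * (numRows + 1),
                               dColumns + nnzOffset, dValues + nnzOffset);
        cusparseSpGEMM_copy(h, op, op, &alpha, A[b], B[b], &beta, C[b],
                            CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, d);
        nnzOffset += nnzC[b];
        cusparseSpGEMM_destroyDescr(d);
    }
    free(nnzC);
    cudaFree(scratch1);
    cudaFree(scratch2);
}
```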

Do you have a better recommendation for batch sparse matmul?

Now I understand the issue. Unfortunately, Batched SpGEMM is not supported at the moment. If you cannot save externalBuffer1 after the second call to workEstimation(), you need to call the routine again. Even though it is not part of the routine parameters, externalBuffer1 is needed by compute().
The solution that you proposed is correct, and I don’t see alternatives right now.
Just to understand: what are the batch characteristics of the use case? All different matrices (size/sparsity)?

In general, they will have the same dense size, but different sparsity patterns.

If you cannot save externalBuffer1 after the second call to workEstimation(), you need to call the routine again

I haven’t played around with it yet, but it depends on how big these buffers get. The bigger issue may actually be the inability to query nnz without first doing the matmul, forcing us either to repeat the matrix computations several times or to have many more allocations (one per matrix in the batch, plus one for the combined result).

Would I be better off using cusparseSpGEMMreuse_* since it does have a _nnz method that allows me to query sizes?

cusparseSpGEMMreuse_* is helpful if you need to compute matrices with the same sparsity pattern but different values. On the other hand, it works in the same way as the standard cusparseSpGEMM regarding the buffers, and it also requires an extra step.