Originally published at: Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Developer Blog

Sparse-matrix dense-matrix multiplication (SpMM) is a fundamental linear algebra operation and a building block for more complex algorithms such as finding the solutions of linear systems, computing eigenvalues through the preconditioned conjugate gradient, and multiple right-hand sides Krylov subspace iterative solvers. SpMM is also an important kernel used in many domains such as fluid dynamics, …

Hi, I am trying to use this new Blocked-ELL SpMM, but I am having trouble understanding how Blocked-ELL is constructed. Does cuSPARSE or any other library provide a dense-matrix-to-Blocked-ELL conversion (much like CSR or the other sparse formats in cuSPARSE)? It is really intuitive to understand what CSR and COO are and do, but the construction of ELL seems implementation-dependent, and it is not necessarily clear to me how that is done from a single paragraph and a picture.

Thanks!

Hi, my first suggestion is to take a look at the figure in the documentation (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-spmat-create-blockedell) that highlights the memory layout. You can also find a usage example on the GitHub samples page: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spmm_blockedell. Finally, in the upcoming release we are going to provide a conversion routine from dense to Blocked-ELL through the DenseToSparse routine: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-dense2sparse

Thank you! I have seen those resources. I guess what I am struggling with is how the `column-idx` array for Blocked-ELL is created; the memory layout figure and the very small example don't seem sufficient to understand how I can take an existing dense matrix, write its values in this blocked format to a `values` array, and somehow generate the column indices for those.
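For concreteness, here is a rough host-side sketch of what I imagine the conversion looks like. This is my own guess, not cuSPARSE code: it assumes the matrix dimensions are multiples of the block size, treats a block as nonzero if any of its entries is nonzero, and pads missing blocks with column index -1 and zero values, as the docs describe.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical host-side dense -> Blocked-ELL packing (not a cuSPARSE API).
// Assumes rows and cols are multiples of blockSize.
struct BlockedEll {
    int blockSize;
    int ellCols;                 // width of the packed value array, in elements
    std::vector<int>   colInd;   // (rows/blockSize) x (ellCols/blockSize), -1 = padding
    std::vector<float> values;   // rows x ellCols, row-major
};

BlockedEll denseToBlockedEll(const std::vector<float>& A,
                             int rows, int cols, int blockSize) {
    int brows = rows / blockSize, bcols = cols / blockSize;

    // 1. Find the nonzero block-columns of every block-row.
    std::vector<std::vector<int>> blocksPerRow(brows);
    for (int bi = 0; bi < brows; ++bi)
        for (int bj = 0; bj < bcols; ++bj) {
            bool nz = false;
            for (int i = 0; i < blockSize && !nz; ++i)
                for (int j = 0; j < blockSize && !nz; ++j)
                    nz = A[(bi * blockSize + i) * cols + bj * blockSize + j] != 0.0f;
            if (nz) blocksPerRow[bi].push_back(bj);
        }

    // 2. ellCols = max number of nonzero blocks per block-row, times blockSize.
    int maxBlocks = 0;
    for (auto& r : blocksPerRow) maxBlocks = std::max(maxBlocks, (int)r.size());

    BlockedEll out;
    out.blockSize = blockSize;
    out.ellCols   = maxBlocks * blockSize;
    out.colInd.assign(brows * maxBlocks, -1);
    out.values.assign(rows * out.ellCols, 0.0f);

    // 3. Copy each kept block into the packed value array.
    for (int bi = 0; bi < brows; ++bi)
        for (int k = 0; k < (int)blocksPerRow[bi].size(); ++k) {
            int bj = blocksPerRow[bi][k];
            out.colInd[bi * maxBlocks + k] = bj;
            for (int i = 0; i < blockSize; ++i)
                for (int j = 0; j < blockSize; ++j)
                    out.values[(bi * blockSize + i) * out.ellCols + k * blockSize + j] =
                        A[(bi * blockSize + i) * cols + bj * blockSize + j];
        }
    return out;
}
```

If that matches the intended layout, then the only implementation-defined part is really step 1, i.e., deciding which blocks count as nonzero.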

The most useful thing I found was the figure in this blog, any more context or open-source code that you have available that I can read/look through?

Also looking forward to `DenseToSparse` support for ELL! Really cool stuff, thank you so much!

Just to further add to that: even a larger hand-coded example with a visualization to go along with it (like the one in the blog) would be great!

I also had some other questions:

- It seems like batched support for the ELL-SpMM version is missing. Are there any plans for that? Would it be done through CUDA streams or as separate kernel launches?
- Are there any samples available that show batched SpMM for CSR/CSC/COO? The docs say that it is supported, but I can't seem to find how to actually use it.

Again, thanks!

Thanks for the feedback. Batched ELL-SpMM is currently not supported; we will consider this feature in the future. As for batched SpMM for the standard formats, it is supported, but the sample is not available yet. I will add it in the next few days.

Hi. No, it is not supported. You can set the op(B) operation and change the layout of the B and C matrices, but op(A) must be non-transposed.

I see, but in back-propagation we need to transpose A, and the transpose is not necessarily a Blocked-ELL matrix.

Transpose of A is supported by all the other formats for SpMM:

https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm

Yeah, I understand, but they do not support acceleration with Tensor Cores, is that correct?

Yes, Tensor-Core-accelerated SpMM with op(A) == T is not available. Maybe you can take a look at cuSPARSELt (cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication, NVIDIA cuSPARSELt 0.1.0 documentation). Otherwise, the straightforward solution is to transpose A in a preprocessing step.
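For the preprocessing route, transposing a CSR matrix on the host is a standard CSR-to-CSC conversion: the CSC arrays of A are exactly the CSR arrays of A^T. A minimal sketch with a hypothetical helper, not a cuSPARSE routine:

```cpp
#include <vector>

// Plain CSR container used by the sketch below (not a cuSPARSE type).
struct Csr {
    int rows, cols;
    std::vector<int>   rowPtr, colInd;
    std::vector<float> vals;
};

// Transpose a CSR matrix via a counting pass followed by a scatter
// (the classic CSR -> CSC conversion).
Csr transposeCsr(const Csr& a) {
    Csr t{a.cols, a.rows,
          std::vector<int>(a.cols + 1, 0),
          std::vector<int>(a.colInd.size()),
          std::vector<float>(a.vals.size())};

    // Count entries per column of A (= per row of A^T), then prefix-sum.
    for (int c : a.colInd) ++t.rowPtr[c + 1];
    for (int c = 0; c < a.cols; ++c) t.rowPtr[c + 1] += t.rowPtr[c];

    // Scatter entries into the transposed layout.
    std::vector<int> next(t.rowPtr.begin(), t.rowPtr.end() - 1);
    for (int r = 0; r < a.rows; ++r)
        for (int k = a.rowPtr[r]; k < a.rowPtr[r + 1]; ++k) {
            int dst = next[a.colInd[k]]++;
            t.colInd[dst] = r;
            t.vals[dst]   = a.vals[k];
        }
    return t;
}
```

On the device, the cusparseCsr2cscEx2 routine performs the same conversion.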

Thanks @fbusato. I think cuSPARSELt is a toolkit designed for leveraging the Ampere architecture's sparse Tensor Cores (which require a 2:4 sparsity pattern) to accelerate general GEMM workloads (by pruning A to be 2:4 sparse).

Transposing A would break the requirement that each row has an equal number of blocks, but A^T is still a BSR matrix.

By the way, do you have any plans to accelerate BSR SpMM with Tensor Cores? I think it's feasible, but I don't know how well it would perform compared to Blocked-ELL (which looks better load-balanced).

Yes, we have a plan to accelerate BSR SpMM with Tensor Cores, probably in one of the next releases.

Awesome, looking forward to seeing that.