Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores

Originally published at: Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Developer Blog

Sparse-matrix dense-matrix multiplication (SpMM) is a fundamental linear algebra operation and a building block for more complex algorithms such as finding the solutions of linear systems, computing eigenvalues through the preconditioned conjugate gradient, and multiple right-hand sides Krylov subspace iterative solvers. SpMM is also an important kernel used in many domains such as fluid dynamics,…

Hi, I am trying to use this new blocked-ELL SpMM, having issues understanding how blocked-ELL is constructed. Does cusparse or any other library provide a dense matrix to blocked-ELL conversion (much like CSR or other sparse-formats in cusparse). It is really intuitive to understand what CSR and COO are and do, but a construction of ELL seems implementation based, and it is not necessarily clear to me how that is done with a single paragraph and a picture.

Thanks!

Hi, my first suggestion is to take a look at the figure in the documentation https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-spmat-create-blockedell that highlights the memory layout. You can also find an example of usage in the Github samples page https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spmm_blockedell. Finally, we are going to provide in the upcoming release a conversion routine from dense to BlockedEll by using the routine DenseToSparse https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-dense2sparse

Thank you! I have seen those resources, I guess what I am struggling with is how the column-idx array for Blocked-ELL is created, and the memory layout or the very small example doesn’t seem super sufficient to understand how I can take an existing dense matrix, write values in this blocked-format to a values array, and somehow generate the column indices for those.

The most useful thing I found was the figure in this blog, any more context or open-source code that you have available that I can read/look through?

Also looking forward to DenseToSparse support for ELL! Really cool stuff, thank you so much!

Just to further add to that, even if you had a larger example hand-coded with a visualization to go along with it (like the one in the blog) that will be great!

I also had some other questions:

  1. It seems like batched support for ELL-SpMM version is missing. Are their any plans for that? Is that done through cuda streams or as complete kernel launches?
  2. Are their any samples available that show batched SpMM for CSR/CSC/COO? The docs say that it is supported but I can’t seem to find how to actually use it.

Again, thanks!

thanks for the feedback. Batched Ell-SpMM is currently not supported. We will consider this feature in the future. While about bached SpMM for standard format, it is supported but the sample is not available. I will add it in the next few days.

1 Like

Hi @fbusato , I wonder do we support SpMM on transposed blocked-ell format as well?

Hi. No, it is not supported. You can set the op(B) operation and change the layout of B and C matrices, but op(A) must be non-transposed

I see, but in back-propagation, we need to transpose A, which is not necessarily a blocked-ellpack.

transpose A is supported by all other formats for SpMM
https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm

Yeah I understand, but they do not support acceleration w/ TensorCore, is that correct?

yes, TensorCore-Accelerated SpMM with op(A) == T is not available. Maybe you can take a look at cuSPARSELt cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication — NVIDIA cuSPARSELt 0.1.0 documentation. Otherwise, the straightforward solution is to transpose A in a preprocessing step

1 Like

Thanks @fbusato , I think cuSPARSELt is a toolkit designed for leveraging Ampere architecture’s Sparse TensorCore (which requires a 2:4 sparse pattern) to accelerate general GeMM workloads (by pruning A to be 2:4 sparse).

Transpose A would break the requirement that each row needs to have an equal number of blocks, but A^T is still a BSR matrix.

btw, do you have any plans to accelerate SpmmBsr with TensorCore? I think it’s feasible but I don’t know how good it is compared to block-ellpack (which looks more load-balancing).

Yes, we have a plan to accelerate BSR SpMM with Tensor Cores, probably in one of the next releases.

awsome, looking forward to seeing that.