Originally published at: Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog
Sparse-matrix dense-matrix multiplication (SpMM) is a fundamental linear algebra operation and a building block for more complex algorithms such as finding the solutions of linear systems, computing eigenvalues through the preconditioned conjugate gradient, and multiple right-hand sides Krylov subspace iterative solvers. SpMM is also an important kernel used in many domains such as fluid dynamics, …
Hi, I am trying to use this new blocked-ELL SpMM and am having trouble understanding how blocked-ELL is constructed. Does cuSPARSE or any other library provide a dense-matrix to blocked-ELL conversion (much like CSR or the other sparse formats in cuSPARSE)? It is really intuitive to understand what CSR and COO are and do, but the construction of blocked-ELL seems implementation-specific, and it is not entirely clear to me how that works from a single paragraph and a picture.
Thanks!
Hi, my first suggestion is to take a look at the figure in the documentation https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-spmat-create-blockedell that highlights the memory layout. You can also find a usage example on the GitHub samples page https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spmm_blockedell. Finally, in the upcoming release we are going to provide a conversion from dense to Blocked-ELL through the DenseToSparse routine https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-dense2sparse
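In the meantime, here is a rough host-side sketch of how such a conversion can be written by hand. This is illustrative code only, not a cuSPARSE API; in particular, the padding convention for rows with fewer blocks (here a column index of -1) and the exact value layout should be double-checked against the layout figure in the documentation:

```cpp
#include <algorithm>
#include <vector>

// Illustrative dense -> Blocked-ELL conversion (host side, row-major input).
// Assumes rows and cols are multiples of blockSize for simplicity.
void denseToBlockedEll(const std::vector<float>& A, int rows, int cols,
                       int blockSize, std::vector<int>& ellColInd,
                       std::vector<float>& ellValue, int& ellCols) {
    int blockRows = rows / blockSize;
    int blockCols = cols / blockSize;

    // 1) record the block-column index of every non-zero block, per block row
    std::vector<std::vector<int>> blocks(blockRows);
    for (int br = 0; br < blockRows; ++br)
        for (int bc = 0; bc < blockCols; ++bc) {
            bool nonZero = false;
            for (int i = 0; i < blockSize && !nonZero; ++i)
                for (int j = 0; j < blockSize && !nonZero; ++j)
                    nonZero = A[size_t(br * blockSize + i) * cols
                                + bc * blockSize + j] != 0.0f;
            if (nonZero)
                blocks[br].push_back(bc);
        }

    // 2) the compressed width is the maximum number of blocks in any block row
    size_t maxBlocks = 0;
    for (const auto& b : blocks)
        maxBlocks = std::max(maxBlocks, b.size());
    ellCols = int(maxBlocks) * blockSize;

    // 3) ellColInd holds one block-column index per stored block (block row
    //    after block row); ellValue holds the blocks compacted to the left,
    //    here as a dense rows x ellCols matrix in row-major order
    ellColInd.assign(size_t(blockRows) * maxBlocks, -1);  // -1 = padding block
    ellValue.assign(size_t(rows) * ellCols, 0.0f);
    for (int br = 0; br < blockRows; ++br)
        for (size_t s = 0; s < blocks[br].size(); ++s) {
            int bc = blocks[br][s];
            ellColInd[br * maxBlocks + s] = bc;
            for (int i = 0; i < blockSize; ++i)
                for (int j = 0; j < blockSize; ++j)
                    ellValue[size_t(br * blockSize + i) * ellCols
                             + s * blockSize + j] =
                        A[size_t(br * blockSize + i) * cols
                          + bc * blockSize + j];
        }
}
```

The two resulting arrays are what you would then copy to the device and pass to cusparseCreateBlockedEll() as ellColInd and ellValue, with ellBlockSize = blockSize and ellCols as computed above.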
Thank you! I have seen those resources. I guess what I am struggling with is how the column-idx array for Blocked-ELL is created; the memory-layout figure and the very small example don't seem sufficient to understand how I can take an existing dense matrix, write its values into a values array in this blocked format, and generate the column indices for those blocks.
The most useful thing I found was the figure in this blog. Is there any more context or open-source code available that I can read or look through?
Also looking forward to DenseToSparse support for Blocked-ELL! Really cool stuff, thank you so much!
Just to further add to that: even a larger hand-coded example with a visualization to go along with it (like the one in the blog) would be great!
I also had some other questions:
- It seems like batched support is missing for the Blocked-ELL SpMM version. Are there any plans for that? Would it be done through CUDA streams or as separate kernel launches?
- Are there any samples available that show batched SpMM for CSR/CSC/COO? The docs say it is supported, but I can't seem to find how to actually use it.
Again, thanks!
Thanks for the feedback. Batched Blocked-ELL SpMM is currently not supported; we will consider this feature in the future. As for batched SpMM with the standard formats, it is supported, but the sample is not available yet. I will add it in the next few days.
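Until the sample lands, this is roughly what the batched CSR usage looks like. This is a minimal sketch of my understanding of the intended usage, so please double-check the strides against the documentation once the sample is published; it assumes a uniform batch where every item has the same dimensions and the same nnz, with the column/value arrays holding batchCount copies back to back:

```c
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: strided-batch CSR SpMM, Ci = Ai * Bi for i = 0..batchCount-1.
void batchedCsrSpMM(cusparseHandle_t handle, int64_t m, int64_t n, int64_t k,
                    int64_t nnz, int batchCount, int* dRowPtr, int* dColInd,
                    float* dValues, float* dB, float* dC) {
    float alpha = 1.0f, beta = 0.0f;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;
    cusparseCreateCsr(&matA, m, k, nnz, dRowPtr, dColInd, dValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, k, n, k, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, dC, CUDA_R_32F, CUSPARSE_ORDER_COL);

    // shared row offsets (stride 0); columns/values advance by nnz per item;
    // B and C advance by one full matrix per item
    cusparseCsrSetStridedBatch(matA, batchCount, 0, nnz);
    cusparseDnMatSetStridedBatch(matB, batchCount, k * n);
    cusparseDnMatSetStridedBatch(matC, batchCount, m * n);

    size_t bufferSize;
    void*  dBuffer;
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                            matB, &beta, matC, CUDA_R_32F,
                            CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    // a single call processes all batchCount products
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB, &beta,
                 matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);
    cudaFree(dBuffer);
}
```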
Hi @fbusato, I wonder: do we support SpMM on the transposed blocked-ELL format as well?
Hi. No, it is not supported. You can set the op(B) operation and change the layout of the B and C matrices, but op(A) must be non-transposed.
I see, but in back-propagation we need to transpose A, and the transposed matrix is not necessarily a blocked-ELLPACK.
Transposed A is supported by all the other formats for SpMM.
Yeah, I understand, but those do not support acceleration with Tensor Cores, is that correct?
Yes, Tensor Core-accelerated SpMM with op(A) == T is not available. Maybe you can take a look at cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication – NVIDIA cuSPARSELt 0.3.0 documentation. Otherwise, the straightforward solution is to transpose A in a preprocessing step.
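If A is kept in CSR, that preprocessing transpose can be materialized once with cusparseCsr2cscEx2(), since the CSC arrays of A are exactly the CSR arrays of A^T; for Blocked-ELL you would instead rebuild the block arrays of A^T on the host. A minimal sketch (the pointer names are placeholders):

```c
#include <cuda_runtime.h>
#include <cusparse.h>

// One-time preprocessing: materialize A^T from a CSR matrix A (m x n, nnz
// non-zeros). The CSC arrays produced here are, read directly as CSR, the
// arrays of the n x m matrix A^T: dValT, dRowPtrT (n+1 entries), dColIndT.
void transposeCsr(cusparseHandle_t handle, int m, int n, int nnz,
                  const float* dVal, const int* dRowPtr, const int* dColInd,
                  float* dValT, int* dRowPtrT, int* dColIndT) {
    size_t bufferSize;
    void*  dBuffer;
    cusparseCsr2cscEx2_bufferSize(handle, m, n, nnz, dVal, dRowPtr, dColInd,
                                  dValT, dRowPtrT, dColIndT, CUDA_R_32F,
                                  CUSPARSE_ACTION_NUMERIC,
                                  CUSPARSE_INDEX_BASE_ZERO,
                                  CUSPARSE_CSR2CSC_ALG1, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseCsr2cscEx2(handle, m, n, nnz, dVal, dRowPtr, dColInd, dValT,
                       dRowPtrT, dColIndT, CUDA_R_32F,
                       CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO,
                       CUSPARSE_CSR2CSC_ALG1, dBuffer);
    cudaFree(dBuffer);
}
```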
Thanks @fbusato, I think cuSPARSELt is a toolkit designed to leverage the Ampere architecture's Sparse Tensor Cores (which require a 2:4 sparsity pattern) to accelerate general GEMM workloads (by pruning A to be 2:4 sparse).
Transposing A would break the requirement that each row has an equal number of blocks, but A^T is still a BSR matrix.
By the way, do you have any plans to accelerate BSR SpMM with Tensor Cores? I think it's feasible, but I don't know how well it compares to blocked-ELLPACK (which looks better load-balanced).
Yes, we have a plan to accelerate BSR SpMM with Tensor Cores, probably in one of the next releases.
Awesome, looking forward to seeing that.
Hi @fbusato ,
I was evaluating the use of cusparseSpMM, coupled with the blocked ELLPACK format, to see if it could outperform its cuBLAS GEMM counterpart.
The results I've got in terms of performance are not what I expected. Let me present the problem setting first.
- matrix A (sparse), size 800 x 224, type half-precision float. The non-zero values of A are grouped into 4 separate 32x32 blocks (thus yielding ~98% sparsity), with each block perfectly aligned with the grid boundaries dictated by the blocked ELL format when using block size = 32.
- matrix B (dense), size 224 x 750000, type half-precision float.
- matrix C (dense), size 800 x 750000, type half-precision float.
- CUDA 11.5
- GPU Tesla V100
My goal is to compute AB + C.
Now, given A's structure, as well as the figures shown in Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog (Figure 3), I was expecting cusparseSpMM + blocked ELL to yield a noticeable speedup w.r.t. cuBLAS GEMM on the Tesla V100.
Consequently, what I did was compute the sparse multiplication with cusparseSpMM (with A compressed in the blocked ELL format), while the dense one was computed with cublasHgemm.
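For reference, here is essentially what the two timed calls look like (a sketch; descriptor creation, allocations, and error checking omitted; matA is the Blocked-ELL descriptor, matB/matC are column-major dense descriptors over dB/dC):

```c
float  onef = 1.0f;                // SpMM compute type is CUDA_R_32F
__half oneh = __float2half(1.0f);  // cublasHgemm takes __half scalars

// sparse path: C = 1*A*B + 1*C, with matA the Blocked-ELL descriptor
// (half data, fp32 accumulation)
cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
             CUSPARSE_OPERATION_NON_TRANSPOSE, &onef, matA, matB, &onef,
             matC, CUDA_R_32F, CUSPARSE_SPMM_BLOCKED_ELL_ALG1, dBuffer);

// dense path: the same AB + C in column-major, m=800, n=750000, k=224
cublasHgemm(blasHandle, CUBLAS_OP_N, CUBLAS_OP_N, 800, 750000, 224, &oneh,
            dA, 800, dB, 224, &oneh, dC, 800);
```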
Well, in the best-case scenario, i.e., cusparseSpMM() + blocked ELLPACK with block size = 32, I got the same execution time as cublasHgemm.
This is not what I expected, since the problem setting I've introduced above should heavily favor cusparseSpMM() + blocked ELLPACK (and thus show a noticeable speedup vs. cublasHgemm).
I have double-checked the results generated by both cusparseSpMM and cublasHgemm, and they are the same, so it doesn't appear I'm doing something wrong… it simply seems that cusparseSpMM() + blocked ELLPACK is not delivering the promised performance.
Am I missing something?
Thanks for any reply!
Hi, my feeling is that you are doing everything in the right way. There are two points to consider for the performance.
- cuBLAS provides a wide range of kernels and much better heuristics than Blocked-ELL SpMM.
- The matrices seem quite small and have 98% sparsity. I'm not sure the GPU is fully utilized, while cuBLAS could use split-K GEMM to optimize this specific case (see the back-of-the-envelope numbers below).
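To make the second point concrete with rough back-of-the-envelope numbers (my own estimates from your sizes, not measured data): the dense GEMM performs 2 x 800 x 224 x 750000 ≈ 269 GFLOP, while the block-sparse product only performs 2 x (4 x 32 x 32) x 750000 ≈ 6 GFLOP. However, both kernels still have to stream B (~0.3 GB in fp16) and read/update C (~2.4 GB), and at the V100's ~900 GB/s that traffic alone takes about 3 ms, which is comparable to what the dense math costs at Tensor Core rates. If the kernel is bandwidth-bound, removing 98% of the FLOPs does not remove 98% of the time.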
There is nothing wrong with these results. We didn't claim that Blocked-ELL SpMM is faster than cuBLAS in all cases. We will optimize this functionality in the future, but I don't expect it to replace cuBLAS. Users should always consider both alternatives depending on the given problem.
Hi All,
if you want to use NVIDIA Tensor Cores and the cuBLAS implementation for sparse-matrix times dense-matrix products, you may check out our compression algorithm, which builds a sparse block format.
It is transparent and basically works for any arbitrary sparse matrix in CSR format.
Paper: [2202.05868] Blocking Techniques for Sparse Matrix Multiplication on Tensor Accelerators
Code: GitHub - LACSFUB/SPARTA: SParse AcceleRation on Tensor Architecture
Please reach out to us if you experience any issues.
Hi. I am wondering whether there is any way to set a strided batch for the Blocked-ELL format, similar to the cusparseCsrSetStridedBatch or cusparseCooSetStridedBatch APIs.
Thank you!
Hi, unfortunately we don't provide batched computation for the Blocked-ELL format at the moment. Is there any particular application that you want to support?
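As a workaround, one option is to loop over the batch items and update the dense-matrix pointers between calls. A sketch (strideB/strideC are the per-item element strides of your batched B and C; alpha, beta, and the descriptors are set up as in the non-batched case):

```c
// Emulated batching: one Blocked-ELL SpMM call per batch item, reusing the
// same descriptors and external buffer, with only the data pointers updated.
for (int b = 0; b < batchCount; ++b) {
    cusparseDnMatSetValues(matB, dB + (size_t)b * strideB);
    cusparseDnMatSetValues(matC, dC + (size_t)b * strideC);
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                 &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_BLOCKED_ELL_ALG1,
                 dBuffer);
}
```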