Originally published at: Simplify Sparse Deep Learning with Universal Sparse Tensor in nvmath-python | NVIDIA Technical Blog
In a previous post, we introduced the Universal Sparse Tensor (UST), enabling developers to decouple a tensor’s sparsity from its memory layout for greater flexibility and performance. We’re excited to announce the integration of the UST into nvmath-python v0.9.0 to accelerate sparse scientific and deep learning applications. This post provides a walkthrough of key UST…
Thanks for the very nice post. I am a PhD fellow at KIT, Germany, and I am leading the development of pyGinkgo. The speedups in this post are impressive. Was this done on FP32 CUDA cores? And did you compare the speedups against the dense matvec product? Dense matrix-vector products can be done on tensor cores these days without much loss of precision, but one cannot really use tensor cores for unstructured sparsity. So, would SpMV still give a speedup compared to a dense matvec product on tensor cores? From quick roofline calculations, it shouldn’t be possible on A100 unless the sparsity is very high. So I am a little confused about how one might use this for sparse neural network applications, given that tensor cores make dense calculations extremely fast.
Thanks in advance for your response and thanks for this nice post.
Thanks, Keshvi! Indeed, in general, when comparing the performance of dense GEMM running on NVIDIA tensor cores with SpMM, one needs either (1) extremely high unstructured sparsity (say 99.99%, sometimes more) to outperform the dense tensor cores with plain CUDA, or (2) very structured sparsity (like block or 2:4), which allows running the sparse version on tensor cores as well. For the memory-bound SpMV, however, the delta encoding (using only a few bits for the indices) allows for an efficient SpMV with unstructured sparsity without specialized hardware, by “simply” trading a bit more computation to reconstruct the full indices for less memory traffic (with full credit to the authors Vladimír Macko and Vladimír Boža for their implementation). By the way, note that you will observe all improvements over CuPy and PyTorch shown in Figure 4 “out of the box”, but the improvements in Figure 5 require incorporating the MACKO back-end with the UST delta encoding. The example was included merely to demonstrate the ease of incorporating new kernel implementations as UST back-ends. The main purpose of this Beta release is to introduce the Universal Sparse Tensor as a better alternative to having a limited set of sparse storage formats. We plan to add many actual performance improvements in later releases!
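To make the delta-encoding idea concrete, here is a minimal NumPy sketch. It is not the MACKO kernel and not the nvmath-python UST API: the helpers `to_delta_csr` and `delta_spmv` are hypothetical names, and it assumes every gap between consecutive nonzeros in a row fits in 8 bits. It only illustrates the trade: each column index is stored as a small gap from the previous one, and the SpMV rebuilds the absolute indices with a running sum.

```python
import numpy as np

def to_delta_csr(dense):
    """CSR-like arrays that store column gaps instead of absolute columns.

    Illustrative assumption: every gap fits in one byte (uint8).
    """
    indptr, gap_rows, val_rows = [0], [], []
    for row in dense:
        cols = np.flatnonzero(row)                # sorted columns of the nonzeros
        gap_rows.append(np.diff(cols, prepend=0)) # first gap is measured from column 0
        val_rows.append(row[cols])
        indptr.append(indptr[-1] + cols.size)
    gaps = np.concatenate(gap_rows)
    if gaps.size and gaps.max() > 255:
        raise ValueError("a column gap does not fit in 8 bits")
    return np.array(indptr), gaps.astype(np.uint8), np.concatenate(val_rows)

def delta_spmv(indptr, deltas, values, x):
    """y = A @ x, reconstructing column indices on the fly."""
    y = np.zeros(len(indptr) - 1, dtype=np.result_type(values, x))
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        # Extra computation: a running sum turns the gaps back into columns.
        cols = np.cumsum(deltas[lo:hi], dtype=np.int64)
        y[i] = values[lo:hi] @ x[cols]
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 200)).astype(np.float32)
A[rng.random(A.shape) < 0.7] = 0.0                # roughly 70% unstructured sparsity
x = rng.standard_normal(200).astype(np.float32)

indptr, deltas, values = to_delta_csr(A)
y = delta_spmv(indptr, deltas, values, x)

index_bytes_plain = 4 * values.size               # one 32-bit column index per nonzero
index_bytes_delta = deltas.nbytes                 # one 8-bit gap per nonzero
print(np.allclose(y, A @ x, atol=1e-4), index_bytes_plain, index_bytes_delta)
```

In this toy setup the index traffic shrinks by a factor of four relative to 32-bit column indices, at the cost of one prefix sum per row. A real GPU kernel would pack the gaps into fewer bits and needs some way to handle gaps that do not fit, which this sketch deliberately leaves out.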
Thanks for your response @abik1