Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt

Originally published at: https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/

Deep neural networks achieve outstanding performance in a variety of fields, such as computer vision, speech recognition, and natural language processing. The computational power needed to process these neural networks is rapidly increasing, so efficient models and computation are crucial. Neural network pruning, removing unnecessary model parameters to yield a sparse network, is a useful…
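For context, Ampere's structured sparsity uses a 2:4 pattern: in every group of four consecutive values along a row, at most two are nonzero. A minimal host-side checker illustrating the pattern (this helper is illustrative only, not part of the cuSPARSELt API):

    #include <cstddef>

    // Returns true if every group of 4 consecutive elements of `row`
    // contains at most 2 nonzeros -- the Ampere 2:4 sparsity pattern.
    // `len` is assumed to be a multiple of 4.
    bool is_2_4_sparse(const float* row, std::size_t len) {
        for (std::size_t i = 0; i < len; i += 4) {
            int nonzeros = 0;
            for (int j = 0; j < 4; ++j)
                nonzeros += (row[i + j] != 0.0f) ? 1 : 0;
            if (nonzeros > 2) return false;
        }
        return true;
    }

In practice the library's cusparseLtSpMMAPrune and cusparseLtSpMMAPruneCheck routines enforce and verify this pattern for you; the snippet above only shows what the constraint means.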

Hi,

Does cuSPARSELt also support SM86 for the RTX 3090 GPU?
Currently, I only see that its support is limited to SM80.
How can I leverage cuSPARSELt on SM86?

Hi Daniel,

Yes, cuSPARSELt is currently limited to SM80. Support for SM86 will be added soon.
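For reference, you can query which SM version your GPU reports at runtime with the standard CUDA runtime API (a small sketch):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0
        // SM80 = compute capability 8.0 (e.g., A100);
        // SM86 = compute capability 8.6 (e.g., RTX 3090)
        printf("SM%d%d\n", prop.major, prop.minor);
        return 0;
    }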


OK, thanks.

I recently found the sparse Tensor Core GEMM example (15_ampere_sparse_tensorop_gemm) in CUTLASS.

However, it seems to support only INT4 input with INT32 output on SM86. When I change the input data type to float, half, or int8, it compiles successfully but always fails to launch during the initialization of the GEMM object.

Is there any reason for this, or a solution? Thanks!

Sorry, I’m not involved in the CUTLASS project. My suggestion is to use the official GitHub issue tracker for your question.

A similar question has been asked before. Please check the bottom half of Issue 103 in the CUTLASS GitHub repository (I don’t know why I am not allowed to post a link here).

If you still have questions, feel free to ask in the CUTLASS issues. We try to answer every question there in a timely manner.
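In the meantime, printing the status that example 15 returns usually reveals why initialization fails. A rough sketch (gemm_op, arguments, and workspace are as constructed in the example):

    // Report why the sparse GEMM cannot run, instead of asserting.
    cutlass::Status status = gemm_op.can_implement(arguments);
    if (status != cutlass::Status::kSuccess) {
        std::cerr << "can_implement: "
                  << cutlass::cutlassGetStatusString(status) << std::endl;
    }
    status = gemm_op.initialize(arguments, workspace.get());
    if (status != cutlass::Status::kSuccess) {
        std::cerr << "initialize: "
                  << cutlass::cutlassGetStatusString(status) << std::endl;
    }

If can_implement already fails, the chosen data-type/architecture combination most likely has no kernel, which would match what you are seeing.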

Hi guys. I am currently exploring the new sparse Tensor Core feature on an Ampere GPU (RTX 3090). I use the cudaEvent APIs to measure the execution time of cusparseLtMatmul and cublasLtMatmul, but the results for dense and sparse GEMM are very similar. The details of the experiment configuration and the results are shown below. Could anyone give me some advice?

Execution time of the dense and sparse kernels (in ms):

    m=n=k (FP16)      1024    2048    3072    4096    5120    6144    7168    8192    9216   10240
    cuSPARSELt       0.061   0.182   0.482   1.119   1.956   3.824   5.415   7.715  12.789  15.285
    cuBLASLt         0.040   0.181   0.485   1.053   1.999   3.425   5.343   7.810  14.050  15.292

Code:

Dense GEMM implemented with cuBLASLt
Sparse GEMM implemented with cuSPARSELt

How I record execution time:

    // Create CUDA events and record the start of the measured region
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    CHECK_CUDA(cudaEventRecord(start));

    // Sparse GEMM: D = alpha * op(A_compressed) * op(B) + beta * C
    CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                                     &beta, dC, dD, d_workspace, streams,
                                     num_streams) )

    // Record the stop event and wait for it before reading the elapsed time
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    printf("sparse elapsed time: %f ms\n", ms);

My Env:
GPU Type: RTX 3090
NVIDIA Driver Version: 460
CUDA Version: 11.3
cuDNN Version: 8.0.4
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6

Thanks very much!

Sorry for the delay; we are investigating the issue. Can you please provide the output of nvidia-smi -a? Also, did you change any parameters other than the matrix sizes in the examples, e.g., precisions or layouts?

The figures in this blog post no longer load.
Is it just me, or does anyone else see this?

@vinhn – Thanks for the heads-up! I fixed the broken images. Let me know if you see any other problems with our posts.

Hi guys, I am currently working on sparse matrix-vector multiplication (SpMV). Since Tensor Cores promise high performance for GEMM operations, I am curious whether there is a way to use Tensor Cores for SpMV. I know that cuSPARSELt currently works only for sparse GEMM. Is there a way to use Tensor Cores for SpMV, and if so, do the dimensions have to be multiples of 4?
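One possible workaround (not an official recommendation) is to batch the vector into a dense matrix so the operation becomes a sparse GEMM that cuSPARSELt accepts: zero-pad x into an n-column matrix B, run the usual cusparseLtMatmul pipeline, and read y from the first column of the output. A conceptual sketch follows; the padding width and row-major layout here are assumptions, so check the cuSPARSELt documentation for the actual dimension and alignment requirements of your data type:

    #include <cuda_fp16.h>
    #include <vector>

    // Hypothetical sketch: turn SpMV y = A*x into SpMM by zero-padding x
    // into an n_pad-column dense matrix B (row-major). n_pad must satisfy
    // cuSPARSELt's dimension requirements (assumed here, not verified).
    std::vector<__half> pad_vector(const std::vector<__half>& x, int n_pad) {
        const std::size_t k = x.size();
        std::vector<__half> B(k * n_pad, __half(0.0f));
        for (std::size_t i = 0; i < k; ++i)
            B[i * n_pad] = x[i];  // column 0 holds x
        return B;
    }
    // ... then run the same cusparseLtMatmul pipeline as in the GEMM
    // examples on A (2:4 sparse) and B; y is column 0 of the output D.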