Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt

Originally published at: https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/

Deep neural networks achieve outstanding performance in a variety of fields, such as computer vision, speech recognition, and natural language processing. The computational power needed to process these neural networks is rapidly increasing, so efficient models and computation are crucial. Neural network pruning, removing unnecessary model parameters to yield a sparse network, is a useful…
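For context, Ampere's structured sparsity uses a 2:4 pattern: in every group of four consecutive values along a row, at most two are nonzero. A minimal host-side checker illustrating the pattern (this helper is illustrative only, not part of the cuSPARSELt API):

    #include <cstddef>

    // Returns true if every group of 4 consecutive elements of `row`
    // contains at most 2 nonzeros -- the Ampere 2:4 sparsity pattern.
    // `len` is assumed to be a multiple of 4.
    bool is_2_4_sparse(const float* row, std::size_t len) {
        for (std::size_t i = 0; i < len; i += 4) {
            int nonzeros = 0;
            for (int j = 0; j < 4; ++j)
                nonzeros += (row[i + j] != 0.0f) ? 1 : 0;
            if (nonzeros > 2) return false;
        }
        return true;
    }

In practice the library's cusparseLtSpMMAPrune and cusparseLtSpMMAPruneCheck routines enforce and verify this pattern for you; the snippet above only shows what the constraint means.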

Hi,

Does cuSPARSELt also support SM86 for the RTX 3090 GPU?
Currently, I only see that its support is limited to SM80.
How can I leverage cuSPARSELt on SM86?

Hi Daniel,

Yes, cuSPARSELt is currently limited to SM80. Support for SM86 will be added soon.
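For reference, you can query which SM version your GPU reports at runtime with the standard CUDA runtime API (a small sketch):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0
        // SM80 = compute capability 8.0 (e.g., A100);
        // SM86 = compute capability 8.6 (e.g., RTX 3090)
        printf("SM%d%d\n", prop.major, prop.minor);
        return 0;
    }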


OK, thanks.

I recently found the sparse Tensor Core GEMM example (15_ampere_sparse_tensorop_gemm) in CUTLASS.

However, it seems to support only INT4 input with INT32 output on SM86. When I change the input data type to float, half, or int8, it compiles successfully but always fails to launch during the initialization of the GEMM object.

Is there any reason for this, or a solution? Thanks!

Sorry, I’m not involved in the CUTLASS project. My suggestion is to use the official GitHub issue tracker for your question.

A similar question has been asked before. Please check the bottom half of Issue 103 in the CUTLASS GitHub repository (I don’t know why I am not allowed to post a link here).

If you still have questions, feel free to ask in the CUTLASS issues. We try to answer every question there in a timely manner.
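In the meantime, printing the status that example 15 returns usually reveals why initialization fails. A rough sketch (gemm_op, arguments, and workspace are as constructed in the example):

    // Report why the sparse GEMM cannot run, instead of asserting.
    cutlass::Status status = gemm_op.can_implement(arguments);
    if (status != cutlass::Status::kSuccess) {
        std::cerr << "can_implement: "
                  << cutlass::cutlassGetStatusString(status) << std::endl;
    }
    status = gemm_op.initialize(arguments, workspace.get());
    if (status != cutlass::Status::kSuccess) {
        std::cerr << "initialize: "
                  << cutlass::cutlassGetStatusString(status) << std::endl;
    }

If can_implement already fails, the chosen data-type/architecture combination most likely has no kernel, which would match what you are seeing.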

Hi guys. I am currently exploring the new sparse Tensor Core feature on an Ampere GPU (RTX 3090). I use the cudaEvent APIs to measure the execution time of cusparseLtMatmul and cublasLtMatmul, but the results for dense and sparse GEMM are very similar. The details of the experiment configuration and the results are shown below. Could anyone give me some advice?

Execution time of the dense and sparse kernels (in ms):

    m=n=k (FP16)      1024    2048    3072    4096    5120    6144    7168    8192    9216   10240
    cuSPARSELt       0.061   0.182   0.482   1.119   1.956   3.824   5.415   7.715  12.789  15.285
    cuBLASLt         0.040   0.181   0.485   1.053   1.999   3.425   5.343   7.810  14.050  15.292

Code:

Dense GEMM implemented with cuBLASLt
Sparse GEMM implemented with cuSPARSELt

How I record execution time:

    // Create CUDA events and record the start of the measured region
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    CHECK_CUDA(cudaEventRecord(start));

    // Sparse GEMM: D = alpha * op(A_compressed) * op(B) + beta * C
    CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                                     &beta, dC, dD, d_workspace, streams,
                                     num_streams) )

    // Record the stop event and wait for it before reading the elapsed time
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    printf("sparse elapsed time: %f ms\n", ms);

My Env:
GPU Type: RTX 3090
NVIDIA Driver Version: 460
CUDA Version: 11.3
cuDNN Version: 8.0.4
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6

Thanks very much!

Sorry for the delay; we are investigating the issue. Can you please provide the output of nvidia-smi -a? Also, did you change any parameters other than the matrix sizes in the examples, e.g., precisions or layouts?

The figures in this blog post no longer load.
Is it just me, or does anyone else see this?

@vinhn – Thanks for the heads-up! I fixed the broken images. Let me know if you see any other problems with our posts.

Hi guys, I am currently working on sparse matrix-vector multiplication (SpMV). Since Tensor Cores promise high performance for GEMM operations, I am curious whether there is a way to use Tensor Cores for SpMV. I know that cuSPARSELt currently works only for sparse GEMM. Is there a way to use Tensor Cores for SpMV, and if so, do the dimensions have to be multiples of 4?
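One possible workaround (not an official recommendation) is to batch the vector into a dense matrix so the operation becomes a sparse GEMM that cuSPARSELt accepts: zero-pad x into an n-column matrix B, run the usual cusparseLtMatmul pipeline, and read y from the first column of the output. A conceptual sketch follows; the padding width and row-major layout here are assumptions, so check the cuSPARSELt documentation for the actual dimension and alignment requirements of your data type:

    #include <cuda_fp16.h>
    #include <vector>

    // Hypothetical sketch: turn SpMV y = A*x into SpMM by zero-padding x
    // into an n_pad-column dense matrix B (row-major). n_pad must satisfy
    // cuSPARSELt's dimension requirements (assumed here, not verified).
    std::vector<__half> pad_vector(const std::vector<__half>& x, int n_pad) {
        const std::size_t k = x.size();
        std::vector<__half> B(k * n_pad, __half(0.0f));
        for (std::size_t i = 0; i < k; ++i)
            B[i * n_pad] = x[i];  // column 0 holds x
        return B;
    }
    // ... then run the same cusparseLtMatmul pipeline as in the GEMM
    // examples on A (2:4 sparse) and B; y is column 0 of the output D.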