cuSPARSELt: Strict Output Layout Constraints for Optimal Performance in Sparse-Dense GEMM

Hello NVIDIA team and community,

I have been working with the cuSPARSELt library to accelerate 2:4 structured sparse matrix multiplications, specifically for a common inference scenario. During performance tuning, I’ve observed a very strong and specific set of constraints required to unlock the highest-performance kernels, which seems to create a performance bottleneck for a very standard deep learning operation.

I would like to share my findings and ask if this is expected behavior or if there might be a more optimal approach I am missing.

Environment

  • GPU: NVIDIA A100-SXM4-40GB
  • CUDA Toolkit: 13.0
  • cuSPARSELt Version: 0.8.1 (identified via library symbols)
  • OS: Ubuntu 22.04
  • Driver: 550.54.15

The Scenario: Standard Inference GEMM

The operation I am trying to optimize is a standard GEMM found in models like Transformers:
C(M,N) = Activation(M,K) * Weightᵀ(K,N)

Where:

  • Activation is a dense, row-major matrix of shape (M, K).
  • Weight is a structured sparse (2:4), row-major matrix of shape (N, K).
  • C is the dense, row-major output matrix of shape (M, N).

Experimental Findings

I used cusparseLtMatmulSearch to find the optimal algorithm for various configurations. My tests consistently show that to achieve the best performance (~600 TOPS, using Algorithm ID 6 out of 23 candidates), the memory layout of the output matrix (orderC) is strictly dependent on whether the sparse matrix is the left (opA) or right (opB) operand in the GEMM.

Here is a summary of the results for M=65536, N=13824, K=2560:

Sparse Operand Position Operation Setup Required orderC for Best Performance Performance (TOPS) Best Algorithm / Candidates
Sparse on Left (opA) C = W_sparse * A_denseᵀ Row-major ~600 6 / 23
" " Column-major ~500 2 / 5
Sparse on Right (opB) C = A_dense * W_sparseᵀ Column-major ~600 6 / 23
" " Row-major ~500 2 / 5

Conclusion from Data: The highest performance kernels are only available when:

  1. The sparse matrix is opA AND the output matrix C is described as row-major.
  2. The sparse matrix is opB AND the output matrix C is described as column-major.

Any other combination results in a much smaller algorithm search space (5 candidates instead of 23) and significantly lower performance.


The Core Problem

For my standard scenario, C(M,N) = Activation(M,K) * Weight_sparse(N,K)ᵀ, the sparse matrix (Weight) must be opB and must be transposed.

According to my findings, to get the best performance for this case, the output matrix C must be column-major.

However, the standard memory layout for tensors in frameworks like C++ and PyTorch is row-major. A column-major output of shape (M, N) is effectively a row-major tensor of shape (N, M). This means to get a standard (M, N) row-major output, I am forced to perform an additional, explicit transpose operation on the result, which introduces significant overhead.

It seems strange that for such a common operation in deep learning, the fastest path provided by cuSPARSELt does not directly produce the most commonly desired output format (a row-major (M, N) tensor).


Key Code Snippets

Here is how the descriptors were configured for the two high-performance cases.

Case 1: Sparse on Left, Row-major Output (High Performance)
This corresponds to C(N, M) = W_sparse(N, K) * A_dense(M, K)ᵀ. The output shape is (N, M).

// opA is the sparse weight W(N, K), opB is the dense activation A(M, K)
auto opA = CUSPARSE_OPERATION_NON_TRANSPOSE;
auto opB = CUSPARSE_OPERATION_TRANSPOSE;
int m_dim = N, n_dim = M, k_dim = K; // Effective dimensions for this GEMM

// Sparse Matrix A (Weight, opA)
cusparseLtStructuredDescriptorInit(&handle, &matA, m_dim, k_dim, k_dim, ...);

// Dense Matrix B (Activation, opB)
cusparseLtDenseDescriptorInit(&handle, &matB, n_dim, k_dim, k_dim, ...);

// Output Matrix C(N, M), described as Row-major
// ldc = n_dim = M
cusparseLtDenseDescriptorInit(&handle, &matC, m_dim, n_dim, n_dim, alignment, 
                              CUDA_R_32I, CUSPARSE_ORDER_ROW); 

Case 2: Sparse on Right, Column-major Output (High Performance)
This corresponds to our target operation C(M, N) = A_dense(M, K) * W_sparse(N, K)ᵀ.

// opA is the dense activation A(M, K), opB is the sparse weight W(N, K)
auto opA = CUSPARSE_OPERATION_NON_TRANSPOSE;
auto opB = CUSPARSE_OPERATION_TRANSPOSE;
int m_dim = M, n_dim = N, k_dim = K;

// Dense Matrix A (Activation, opA)
cusparseLtDenseDescriptorInit(&handle, &matA, m_dim, k_dim, k_dim, ...);

// Sparse Matrix B (Weight, opB)
cusparseLtStructuredDescriptorInit(&handle, &matB, n_dim, k_dim, k_dim, ...);

// Output Matrix C(M, N), described as Column-major
// ldc = m_dim = M
cusparseLtDenseDescriptorInit(&handle, &matC, m_dim, n_dim, m_dim, alignment, 
                              CUDA_R_32I, CUSPARSE_ORDER_COL);

Questions for the NVIDIA Team

  1. Is this observed behavior—the strict coupling between sparse operand position and required output layout for optimal performance—expected?
  2. Could you provide any insight into the technical reason for this? Is it related to the hardware architecture of Tensor Cores and their natural data output patterns for coalesced memory access?
  3. Is there a recommended, transpose-free method to achieve the highest performance for the standard C_rowmajor(M,N) = A_dense(M,K) * W_sparse(N,K)ᵀ operation that I may have overlooked?
  4. Are there any plans to introduce a high-performance kernel in future cuSPARSELt versions that directly supports this specific, common use case?

Thank you for your time and any insights you can provide. This library is critical for our work, and a deeper understanding would be greatly appreciated.

if the constrains are not achieved, the alg_id = 5 and sm80_xmma_sparse_gemm kernel will be chosen

in contrast, if the layout meets the constrains above, the alg_id = 6 and cutlass_gemmsparseUniversal_MMaMultistage kernel will be chosen

Here is also the benchmark of different approaches in MxNxK, the green ones represent the fastest cutlass_sparse kernel (structured matrix & dense matrix multiply), which is ~2x faster than the cuBLAS dense gemm kernel (the black one in last column). But the orange one is somewhat slower and only ~1.5x acceleration.

The .cu files used in my experiments are attached, and “L” and “R” represents the location of structured sparse weight matrix. I just refined from the github repos of NVIDIA-cusparselt. And I’m sorry that some of the notations are Chinese.

sparse_L_search.txt (30.8 KB)

sparse_R_search.txt (30.9 KB)

@17078434 Thanks for the details. The cuSPARSELt team will look into it and get back to you soon.

Hi sorry for the late reply.

This performance difference is expected. As you observed, for the layout in case-2 there are indeed many more kernels than in case-1. It has nothing do with with hardware architectures.

Unfortunately there’s no workaround that could help achieve the same performance as in case-1. As for plans for further optimization, we’ll discuss this issue internally and get back to you.